The Helicobacter pylori Genome Project: insights into H. pylori population structure from analysis of a worldwide collection of complete genomes

Thorell, Kaisa; Muñoz-Ramírez, Zilia Y.; Wang, Difei; Sandoval-Motta, Santiago; Boscolo Agostini, Rajiv; Ghirotto, Silvia; Torres, Roberto C.; Falush, Daniel; Camargo, M. Constanza; Rabkin, Charles S.

doi:10.1038/s41467-023-43562-y

Article
Open access
Published: 11 December 2023

Nature Communications volume 14, Article number: 8184 (2023)

Abstract

Helicobacter pylori, a dominant member of the gastric microbiota, shares co-evolutionary history with humans. This has led to the development of genetically distinct H. pylori subpopulations associated with the geographic origin of the host and with differential gastric disease risk. Here, we provide insights into H. pylori population structure as a part of the Helicobacter pylori Genome Project (HpGP), a multi-disciplinary initiative aimed at elucidating H. pylori pathogenesis and identifying new therapeutic targets. We collected 1011 well-characterized clinical strains from 50 countries and generated high-quality genome sequences. We analysed core genome diversity and population structure of the HpGP dataset and 255 worldwide reference genomes to outline the ancestral contribution to Eurasian, African, and American populations. We found evidence of substantial contribution of population hpNorthAsia and subpopulation hspUral in Northern European H. pylori. The genomes of H. pylori isolated from northern and southern Indigenous Americans differed in that bacteria isolated in northern Indigenous communities were more similar to North Asian H. pylori while the southern had higher relatedness to hpEastAsia. Notably, we also found a highly clonal yet geographically dispersed North American subpopulation, which is negative for the cag pathogenicity island, and present in 7% of sequenced US genomes. We expect the HpGP dataset and the corresponding strains to become a major asset for H. pylori genomics.

Introduction

Helicobacter pylori has co-existed with humans for more than 100,000 years. It is the primary etiologic agent associated with gastric diseases such as ulcers and gastric cancer. Still, while over half the world population is colonized with H. pylori, less than 2% will end up with gastric cancer^1,2. The intimate symbiotic relationship with humans, together with predominantly vertical transmission, has led H. pylori to evolve into multiple distinct geographic populations^3,4,5. The phylogeographic structure of H. pylori is classified into major populations (“hp”) and subpopulations (“hsp”) that correlate with ancient human migrations^3,4,6. However, most worldwide efforts in this regard have been based on the analysis of only a handful of genes rather than whole genomes^3,4. The risk of developing disease from H. pylori infection varies greatly by geography⁷ and genomic studies of both humans and H. pylori are required to identify the factors that modify this risk.

The Helicobacter pylori Genome Project (HpGP) is an international and multidisciplinary initiative to sequence and map H. pylori population structure by collecting strains worldwide. Here we analyze 1011 H. pylori genomes, sequenced with PacBio Single Molecule, Real-Time long-read technology, which made it possible to acquire complete assemblies. By relating the HpGP dataset to a reference set with known population assignments, we were able to quantify, at high resolution, the different inferred ancestral sources of H. pylori subpopulations and the recent and ongoing admixture among subpopulations.

Results

HpGP is a dataset of high quality and worldwide representation

The HpGP has assembled clinical strains from 50 countries, including 12 countries from which no H. pylori genome sequences have previously been published (Table 1). Out of the 1011 genomes, all but seven were completely circularized (Supplementary Data 1).

Table 1 Summary of the HpGP strain collection

To investigate the population structure of the HpGP dataset, we performed fineSTRUCTURE (FS), chromosome painting, and network analyses of shared core genome features as described⁸, and discriminant analysis of principal components (DAPC)⁹. To anchor the dataset, we used 255 H. pylori reference genomes with known Hp/hsp population assignments, representing 17 global subpopulations (Supplementary Data 2). In total, the core genome (set of homologous genes present in >95% of genomes) of the HpGP dataset, the HpGP-26695 reference genome, and 255 worldwide references, consisted of 1227 genes.
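The core genome definition above (homologous genes present in more than 95% of genomes) can be illustrated with a minimal Python sketch. It assumes a presence/absence mapping from genome ID to the set of gene families found in that genome; the genome IDs and the toy data are hypothetical, and this is not the pipeline used in the study.

```python
from typing import Dict, Set


def core_genes(presence: Dict[str, Set[str]], threshold: float = 0.95) -> Set[str]:
    """Return gene families present in more than `threshold` of the genomes.

    `presence` maps a genome ID to the set of gene-family IDs found in it.
    """
    n_genomes = len(presence)
    if n_genomes == 0:
        return set()
    # Count in how many genomes each gene family occurs.
    counts: Dict[str, int] = {}
    for genes in presence.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {gene for gene, count in counts.items() if count / n_genomes > threshold}


# Toy example with hypothetical genome IDs.
toy = {
    "g1": {"ureA", "ureB", "cagA"},
    "g2": {"ureA", "ureB"},
    "g3": {"ureA", "ureB", "vacA"},
}
print(sorted(core_genes(toy)))
```

In the toy data only the two genes found in all three genomes pass the threshold; genes carried by a single genome are treated as accessory.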

The fineSTRUCTURE global analysis revealed four main H. pylori population clusters: (i) Southwest Europe, including Latin America and Northeast Africa, (ii) Northern and Central Europe, Middle East, and Central Asia, (iii) Western and Southern Africa, including Africa2 and North, South and Central America, and (iv) North, Central and East Asia, and Indigenous populations in America. In total, these formed 17 main subpopulations (Fig. 1 and Supplementary Figs. 1 and 2). The network and DAPC analyses supported this structure but with six main clusters of differentiation (Fig. 2 and Supplementary Fig. 3).

**Fig. 1: World map of HpGP strain origins and population assignments.**

**Fig. 2: Distance network analyses of the core genome of the *H. pylori* strains studied.**

South Africa (DAPC group 4, hpAfrica2 in FS) and the reference genomes from Australia/New Guinea (DAPC group 6, hpSahul in FS) differentiated extensively from the others. A further DAPC analysis not considering these two groups showed a clear separation of two of the remaining clusters from the others: one composed of isolates of African and American origin (DAPC group 2, FS cluster III) and one that includes isolates from Central/East Asia and Indigenous Americans (DAPC group 5, FS cluster IV). The remaining (groups 1 and 3) were more similar and intertwined, representing Southern Europe/Northeast Africa and Eurasia/Central Asia and Americas, respectively (Supplementary Fig. 3). The population assignments according to the respective analyses are summarized in Supplementary Data 3.

The hpEurope subpopulations span from the Atlantic coast to South Asia

In the fineSTRUCTURE analysis, three main European/Eurasian subpopulations emerged (Supplementary Figs. 1 and 2), of which hspNEurope and hspSWEurope have previously been described^8,10. The hspEurasia population is proposed in this study, and includes the already reported hspCEurope/hspSEurope^8,10,11,12, and hspMiddleEast¹⁰. Previous studies had limited coverage of Eastern Europe and the Middle East. The HpGP strains from Lithuania, Latvia, Russia, Poland, Bulgaria, Türkiye, and Jordan allowed mapping of the Eurasian H. pylori relationships with unprecedented detail (Fig. 1). Two northern European populations showed an east-west differentiation. In hspNEurope, an east clade with genomes from Latvia, Lithuania, and Russia separated from a north-western clade with genomes from the UK, Sweden, Iceland, and Canada (Supplementary Fig. 2). Within hspEurasia, three main clades could be noted, of which two spanned from west to east (Supplementary Fig. 2). The first, Central-Eastern European hspEurasia1, dominates in Germany, Poland, Lithuania, Latvia, Türkiye, and Russia, while hspEurasia2 is more Southern, with representation from France in the west, via Italy and Greece, to Jordan and Iran in the Middle East. Thirdly, hspEurasia3 includes genomes from India and Bangladesh, but also Greece; these separated from the others but remained within the hspEurasia subpopulation.

The European subpopulations have different ancestry proportions

To further investigate the proposed subpopulations, we inferred ancestry by comparing genomes within our contemporary dataset in a directed chromosome painting, using only the proposed H. pylori ancestral populations hpAfrica2, hpNEAfrica, hspAfrica1WAfrica, hpAsia2, hspUral, hpNorthAsia, and hspEAsia as donors (i.e., contributors of genomic ancestry)^3,10. We confirmed a gradient in inferred ancestry along two axes. Along the north-south axis, Asian ancestry increased and African ancestry decreased in the hspEurasia1 and hspNEurope populations. Along the east-west axis, hspSWEurope had a higher proportion of hspAfrica1WAfrica ancestry, with a contribution of hpNEAfrica similar to that in the Eurasia2 population (Fig. 3 and Supplementary Fig. 3).

**Fig. 3: Inferred ancestral genomic contributions to the Eurasian HpGP genomes.**

The more central Asian hspEurasia3 on the other hand, showed markedly higher hpAsia2 ancestry than the other hpEurope populations, concordant with its geographical co-existence with hpAsia2. Interestingly, hspUral was a more pronounced Asian ancestor for all the hpEurope subpopulations than hpNorthAsia and hspEAsia, the latter two being very even contributors, except for in hspNEurope, where hpNorthAsian ancestry was slightly higher. This relationship was also supported by the network analysis (Fig. 2).
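As a rough illustration of how donor contributions from a directed chromosome painting can be summarized as ancestry proportions, the following Python sketch sums the genome length a recipient copies from each donor genome, groups the totals by donor population, and normalizes them. The input layout, the function name, and the toy values are assumptions made for illustration; they do not reproduce the output format of any painting software or the exact procedure used here.

```python
from collections import defaultdict
from typing import Dict


def ancestry_proportions(chunk_lengths: Dict[str, float],
                         donor_population: Dict[str, str]) -> Dict[str, float]:
    """Summarize one recipient's painting as per-population proportions.

    `chunk_lengths` maps a donor genome ID to the total length copied from it;
    `donor_population` maps each donor genome ID to its population label.
    """
    totals: Dict[str, float] = defaultdict(float)
    for donor, length in chunk_lengths.items():
        totals[donor_population[donor]] += length
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("recipient copied nothing from any donor")
    return {pop: length / grand_total for pop, length in totals.items()}


# Hypothetical recipient painted by three donor genomes from two populations.
lengths = {"d1": 300.0, "d2": 100.0, "d3": 600.0}
labels = {"d1": "hspUral", "d2": "hspUral", "d3": "hpNEAfrica"}
print(ancestry_proportions(lengths, labels))
```

Grouping by population before normalizing keeps the result independent of how many donor genomes represent each ancestral population in this simplified setting.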

Central Asia can be described with increased resolution but still has underrepresented regions

Apart from the relatively well-investigated hspEAsian subpopulation¹³, the fineSTRUCTURE analysis grouped the central Eurasian strains into three main clades: hpAsia2 and two clades preliminarily termed hpNorthAsia and hspUral, based on their association with reference strains previously described by Moodley et al.¹⁴.

HpAsia2 is one of the main ancestral populations of H. pylori but has been comparatively understudied. In the HpGP dataset, genomes belonging to hpAsia2 are mainly from India, Bangladesh, Myanmar, and Nepal, with the Nepalese forming a clade slightly separated from the others (Supplementary Fig. 2a, c). As seen in Fig. 1, hpAsia2 co-exists with the hpEurope hspEurasia3 population in all these countries, except for Myanmar, where only hpAsia2 is present. The DAPC analysis, on the other hand, did not distinguish hpAsia2 from hspNEurope and hspEurasia using k = 6, while the separation was evident and very consistent using k = 17 (Supplementary Fig. 4).

HpNorthAsia was previously established as one of the main Siberian populations using Multilocus Sequence Typing (MLST)¹⁴. In our reference panel, hpNorthAsia (including hspAltai) and its subpopulation hspSiberia1 were represented by genomes from central and eastern Siberia. In our analyses, these two populations did not segregate, and HpGP genomes from Kazakhstan and Kyrgyzstan were also associated with this cluster (Fig. 1 and Supplementary Fig. 1).

hspUral has been suggested as a southern central Asian subpopulation of hpAsia2. In our dataset, a cluster with a relatively wide geographical representation from Kazakhstan and Kyrgyzstan to Indonesia and Japan (Fig. 1 and Supplementary Figs. 1 and 2) is associated with the hspUral reference genomes. Our main chromosome painting analysis suggested the proposed hspUral population to contain two subclades with very different painting profiles (Supplementary Fig. 2), which was supported by the DAPC and network analysis, and the fineSTRUCTURE principal component analysis (PCA) (see https://hpgp.shinyapps.io/Interactive_figures, Fig. 4). The ancestral contributions to the central Asian genomes confirmed the HpGP “hspUral” clade not to have pronounced contribution by the hspUral references but relatively high hpAsia2, hpNorthAsia and hspEAsia painting proportions (Fig. 3). The variability of contributions was also high within the clade, suggesting this may not constitute one pure subpopulation but may consist of representatives of several HpAsia subpopulations (https://hpgp.shinyapps.io/Interactive_figures, Fig. 2). One hpAsia2 reference genome, L7, from Ladakh in northern India grouped with this cluster, especially close to two Nepalese genomes. Several “hspUral” genomes also showed an association with hpSahul in the chromosome painting (Supplementary Fig. 2), which may indicate a relationship between this group and the recently suggested hpRyukyu¹⁵.

**Fig. 4: In-depth analysis of clonal relationships in the global *H. pylori* dataset.**

African and African-descent genomes

The HpGP dataset includes African genomes from understudied countries such as Algeria, Democratic Republic of Congo (DRC), Ghana, and Nigeria, adding to previous knowledge from the Gambia and South Africa. The fineSTRUCTURE analysis confirms earlier observations of four African populations on this continent: hpAfrica2, represented in the HpGP dataset in South Africa; hspAfrica1SAfrica, which reaches as far north as DRC; hspAfrica1WAfrica, represented in the Gambia, as previously reported; and hpNEAfrica. However, the Ghanaian and Nigerian genomes grouped with the more admixed hspAfrica1NorthAmerica and hspAfrica1MiscAmericas populations, interspersed with, and by chromosome painting indistinguishable from, genomes from the US, Puerto Rico (US territory), Dominican Republic, Colombia, and Brazil, likely a result of the trans-Atlantic slave trade from West Africa into the Americas.

The East African reference genomes from Sudan and Ethiopia grouped within the hspSWEurope umbrella but were distinct from the European SWEurope clade, instead forming a cluster with North American genomes (Fig. 1 and Supplementary Fig. 1). In the fineSTRUCTURE PCA plots, especially pronounced in PC10 and PC11, the reference hpNEAfrican genomes and the US HpGP genomes clearly formed two segregated groups, except for one Malaysian and one Swiss genome that grouped with the references (https://hpgp.shinyapps.io/Interactive_figures, Fig. 5). Both the DAPC and network analysis supported the separation of the hpAfrica1 population from hpNEAfrica, the latter being intermingled with genomes from southern Europe and Iberia.

**Fig. 5: Summary of population classifications.**

The Algerian strains did not cluster with the other African strains, but within hspSWEurope, together with genomes from Israel and Colombia in a cluster we termed SWEurope2. Despite showing slightly higher West and Northeast African and lower Asian ancestry than SWEurope1 (https://hpgp.shinyapps.io/Interactive_figures, Fig. 2), our analysis confirmed that North African H. pylori more closely resemble Iberian and Middle Eastern bacteria than African bacteria.

North America hosts a geographically dispersed deep clone

The HpGP dataset contains 68 genomes with wide geographical representation across the continental US. This allowed us to identify a novel subpopulation of 15 US isolates, which showed high similarity and clustered together with genomes of hpNEAfrican ancestry in the fineSTRUCTURE analysis (Supplementary Figs. 1 and 2), none of which carried the cag pathogenicity island (cagPAI).

High levels of sequence homogeneity within H. pylori are unexpected, as unrelated strains differ in their DNA sequence at almost all genes. To further investigate the novel US subpopulation, we performed core genome (cg) MLST of the entire dataset (Fig. 4a). Within the HpGP, over 64% of strain pairs differ in sequence at all the 1040 genes. Even amongst strains sampled from the same country, 34% differ in all the genes. Only 0.15%, 798 pairs, shared similarity at >1% of genes. All but 213 of these pairs are between strains in the same country. Nearly a tenth (66) of these pairs are found within a group of 12 US strains, showing allele distances between 0.83 and 0.94 (17–6% identical alleles, respectively). Thus, this group represents older clonal relationships, a putative “deep clone”: a set of strains that share a recent common ancestor but have diverged via homologous recombination at a large fraction of their genome. Three strains are somewhat less related to these 12, sharing between 1% and 7% of genes, and were conservatively excluded from this clonal group. Other pairs involving more than two samples from the same population also showed deep clonal relationships (e.g., hspSWEuropeChile). However, the amount and pattern of alleles shared between these samples could be better explained by genetic drift, and further analysis within this population is needed to define the boundaries of a putative clone.
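The pairwise comparison behind these numbers can be sketched as follows. The Python example assumes each strain is represented by an allele profile, that is, one allele identifier per cgMLST locus in a fixed locus order, and defines the allele distance as the fraction of loci at which two profiles differ. Strain names, profiles, and the handling of missing loci (ignored here) are simplifying assumptions, not the study's implementation.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def allele_distance(a: List[int], b: List[int]) -> float:
    """Fraction of loci at which two allele profiles carry different alleles."""
    if len(a) != len(b) or not a:
        raise ValueError("profiles must be non-empty and of equal length")
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return differing / len(a)


def related_pairs(profiles: Dict[str, List[int]],
                  min_shared: float = 0.01) -> List[Tuple[str, str]]:
    """Return strain pairs sharing identical alleles at more than `min_shared` of loci."""
    hits = []
    for (name1, prof1), (name2, prof2) in combinations(sorted(profiles.items()), 2):
        shared = 1.0 - allele_distance(prof1, prof2)
        if shared > min_shared:
            hits.append((name1, name2))
    return hits


# Hypothetical four-locus profiles; real cgMLST schemes use on the order of a thousand loci.
profiles = {
    "A": [1, 2, 3, 4],
    "B": [1, 5, 6, 7],
    "C": [8, 9, 10, 11],
}
print(allele_distance(profiles["A"], profiles["B"]))
print(related_pairs(profiles))
```

An allele distance of 0.83 in this formulation corresponds to 17% identical alleles, matching the convention used in the text.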

The HpGP strains from the deep clonal group were sampled from California, Wisconsin, Tennessee, Arkansas, Georgia, and Texas and, in total, represented about 18% (12 of 68) of the HpGP US genomes. A k-mer-based clustering analysis showed an additional five public genomes from two other geographical sources, Ohio and Louisiana, associating closely with the proposed deep clonal group. We used ClonalFrameML to estimate the relationships between the genomes. Assuming a previously estimated 1.38 × 10^–5 mutation rate per site per year¹⁶, the common ancestor lived an estimated 175 years before the strains were collected (95% confidence interval, 107–227 years), while the majority of internal nodes are estimated to be less than 50 years old (Fig. 4b). Thus, the sampled strains are not epidemiologically associated with each other, and instead represent independent strains from a circulating population of clonally related bacteria, which we suggest calling Hp_Clone_US-1.
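The conversion from genetic divergence to calendar time can be illustrated with a strict-clock sketch: a branch length expressed in mutations per site is divided by the mutation rate per site per year. This assumes branch lengths reflect mutation only (with recombination already accounted for) and a constant rate; the dating procedure actually applied to the ClonalFrameML output may differ. The example branch length is back-calculated from the reported 175-year estimate for illustration and is not a value reported in the study.

```python
MUTATION_RATE = 1.38e-5  # mutations per site per year, as assumed in the text


def branch_length_to_years(subs_per_site: float, rate: float = MUTATION_RATE) -> float:
    """Convert a branch length (mutations per site) to years under a strict clock."""
    if rate <= 0:
        raise ValueError("mutation rate must be positive")
    return subs_per_site / rate


# Illustrative root-to-tip length, back-calculated as 175 * 1.38e-5.
example_length = 0.002415
print(round(branch_length_to_years(example_length)))
```

Because the relationship is linear, any uncertainty in the assumed mutation rate translates proportionally into uncertainty in the estimated ages.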

Latin American subpopulations are more admixed than others

A total of 238 strains from different regions of Latin America were included in the HpGP (Table 1). In the fineSTRUCTURE analysis, most Latin American strains clustered into two previously described populations, hspAfrica1MiscAmerica and hspSWEuropeLatinAmerica^8,11, and in hspSWEuropeChile (Supplementary Figs. 1 and 2). Around one-third of the Latin American genomes clustered in non-Latin American populations, the majority in hspAfrica1SAfrica, and hspSWEurope. However, there were also hspEAsia genomes in Argentina, Brazil, and Chile and two hspEurasia genomes from Brazil. Generally, the Latin American genomes were more admixed than their European and African counterparts, with a higher African proportion in hspSWEurope Latin American genomes and a higher European proportion in genomes grouping with hspAfrica1 (https://hpgp.shinyapps.io/Interactive_figures, Fig. 3).

Notably, most Chilean isolates clustered in a separate group, hspSWEuropeChile, (Supplementary Fig. 1), similar to Colombian isolates (hspSWEuropeColombia) previously described^8,11. This population is close to hspSWEuropeLatinAmerica and hspSWEurope, as can be seen in the fineSTRUCTURE PCA, particularly in components PC1 and PC7 (https://hpgp.shinyapps.io/Interactive_figures, Fig. 5). However, in the DAPC and network analyses, these strains are dispersed but still near hspSWEurope (Fig. 2 and Supplementary Fig. 3), which is supported by very high self-painting proportions in the chromosome painting analyses (Supplementary Fig. 2), and high pairwise similarities between the genomes of this subpopulation in the cgMLST analysis (Fig. 4a).

Indigenous American H. pylori have different ancestral contributions

The fineSTRUCTURE analysis confirmed the hspIndigenousAmerica group^8,14. This population is made up of isolates from urban areas of mixed human ancestry, as well as Indigenous communities. HspIndigenousAmerica can be subdivided into two groups called hspIndigenousNAmerica and hspIndigenousSAmerica (Supplementary Figs. 1 and 2). While hspIndigenousNAmerica is composed of strains from Indigenous communities in North America (Canada and US), the hspIndigenousSAmerica group mostly contains isolates from Latin American regions. In this dataset, we added observations of this subpopulation in Chile, Mexico, Peru, Spain, and the US.

According to the ancestral chromosome painting, and corroborated by the network results, hspIndigenousSAmerica shows a higher proximity to hspEAsia, while hspIndigenousNAmerica has a higher Indigenous-ancestral proportion and is closer to hpNorthAsia in the network analysis, even relatively distanced from hspIndigenousSAmerica (Fig. 2, https://hpgp.shinyapps.io/Interactive_figures, Fig. 3).

Discussion

The intimate association between humans and H. pylori started at the beginning of our species and represents a unique story of co-evolution between kingdoms that has fascinated researchers and the public and contributed to understanding human migration dynamics^14,17. However, the challenge is to understand the consequences of these thousands of years of co-evolution for human health. At the whole-genome level, bacterial population structure has mostly been studied in the setting of specific geographical areas^8,10,13,14,15,18,19,20. Ongoing analyses by the HpGP Research Network are comparing strains from patients with different gastric diseases in order to identify genetic and epigenetic bacterial features that determine human pathogenicity. The HpGP provides a publicly available worldwide collection of complete genomes and epigenomes with high-quality metadata for future investigations of H. pylori pathobiology.

Here we present a phylogeographic characterization of the HpGP genomes and outline the global population structure of this bacterium. We used three complementary comparative genomics approaches: fineSTRUCTURE/chromosome painting analysis, DAPC, and network analysis of pairwise distances, together with interactive visualization of the data, which allowed us to study different aspects of the genomic relationships. A summary of the classifications using the different methods, including their relation to previously reported populations, is presented in Fig. 5, with details in Supplementary Data 3. The higher dynamic range of the DAPC and network analysis clearly showed that hpAfrica2 and hpSahul were very distant from all other populations (Fig. 2, Supplementary Fig. 3b and https://hpgp.shinyapps.io/Interactive_figures, Fig. 1), and the DAPC presented another four main clusters of similarity: a South/West African cluster and a NEAfrica/SWEurope cluster, of which both also had a high presence in the Americas, a North-Central Eurasian cluster, and a North/East Asian cluster, which also included hspIndigenousAmerica. All analyses, however, additionally provided evidence for strong interactions between the hspEurasia and hspSWEurope genomes, and in the 3D plots of ancestry contribution, these populations form a continuum of different ancestry levels, rather than being discrete populations (https://hpgp.shinyapps.io/Interactive_figures, Figs. 2 and 4). Iterating the DAPC analysis to test the consistency of classifications showed, for example, that northeast European genomes from Latvia, Lithuania, Poland, and Russia were interchangeably classified into the clusters corresponding to hspNEurope and hspEurasia. Similarly, some Spanish and Latin American genomes jumped between clusters corresponding to different subpopulations of hspSWEurope (Supplementary Figs. 3d and 4d). However, genomes were infrequently reclassified across the main populations, which supported the relative stability of the categories. A few genomes, especially from Indonesia, showed chimeric chromosome painting patterns, for example, a hspUral/hspEAsia combination and a hspUral/hpNEAfrica combination, which constitute rare and exciting intersections between distant populations.

The finding of a highly homogeneous group of geographically dispersed genomes in the US motivated us to search for evidence of distant clonal relationships amongst all HpGP strains. The exceptional recombination rate of H. pylori means that strains with a common ancestor a few hundred years ago will have recombined most of their genomes, eliminating evidence of a shared clonal frame. Furthermore, an estimated 3.5 billion humans are infected with H. pylori²¹, meaning that the current bacterial population size is enormous. As a result, it has been rare to find evidence of clonal relationships between strains collected from distant geographic locations. However, the availability of complete genomes makes it possible to detect deep clones that have recombined in a large fraction of their genome but still share some signal of clonal descent. In addition, the probability of sampling clonally related strains increases quadratically with sample size, meaning that clones will become increasingly common as database sizes increase.
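The quadratic scaling follows from the number of distinct strain pairs in a collection of n genomes, n(n − 1)/2. A short Python sketch makes this concrete using the dataset sizes mentioned in the text; the function name is ours.

```python
def n_pairs(n: int) -> int:
    """Number of distinct unordered pairs among n strains."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


print(n_pairs(12))     # pairs within a group of 12 strains
print(n_pairs(1011))   # pairs within the full HpGP collection
print(n_pairs(2022) / n_pairs(1011))  # effect of doubling the sample size
```

Doubling the number of genomes roughly quadruples the number of pairs in which a clonal relationship could be observed, which is why deep clones are expected to surface more often as databases grow.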

The frequency of the deep clone Hp_Clone_US-1 in the US population is likely somewhere between 3% (proportion in non-HpGP US samples) and 18% (proportion in HpGP), while it has not yet been found outside the US. The US population in the year 1830 was less than 13 million individuals and has increased to over 330 million through natural population growth and immigration. Assuming the lineage was introduced into the US by a single individual around 1830 and infected 10-fold or more people in each human generation, it would be present in around 10 million individuals today, or about 3% of sampled individuals. These calculations ignore factors such as mixed infection and are subject to many uncertainties but demonstrate that a high level of non-vertical transmission and a significant fitness advantage over other H. pylori is necessary to explain the current frequency of Hp_Clone_US-1 in US individuals.
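This back-of-the-envelope reasoning can be reproduced with a few lines of Python. The sketch assumes a single founder, a constant fold increase in carriers per human generation, and a fixed generation time; the generation length of 27 years and the end year of 2020 are illustrative assumptions that are not stated in the text.

```python
def generations_between(start_year: int, end_year: int, generation_years: float) -> int:
    """Number of complete human generations between two years."""
    return int((end_year - start_year) // generation_years)


def carriers_after(generations: int, fold_per_generation: float = 10.0,
                   founders: int = 1) -> float:
    """Carriers after a number of generations of constant multiplicative growth."""
    return founders * fold_per_generation ** generations


US_POPULATION = 330e6  # approximate current US population, from the text

gens = generations_between(1830, 2020, generation_years=27)  # assumed generation time
carriers = carriers_after(gens)
print(gens, carriers, carriers / US_POPULATION)
```

The outcome is highly sensitive to the assumed generation time: one generation more or less changes the projected number of carriers by a factor of ten.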

The relative frequency of different transmission routes in the spread of H. pylori remains unclear, and while there is evidence of frequent vertical transmission in some populations, other evidence suggests the infection spreads more readily among children^22,23. Recent work has emphasized the role of transmission within communities, especially in locations without modern sanitary infrastructure. Our results imply that Hp_Clone_US-1 has been expanding continuously, with several pairs of strains isolated from patients in different states having estimated common ancestors within the last 70 years, which suggests the possibility of occasional mass transmission events in the 20th-century USA. Identification of further clones worldwide should provide additional information to understand when and how some lineages of H. pylori can spread fast through human populations. Interestingly, all members of the clone lack the cag pathogenicity island, suggesting that cag-negative strains can also be highly competitive under modern conditions.

We note that several geographical regions and human populations remain understudied. Acquiring a better coverage of H. pylori whole genomes from South and Central Asia, and a broader representation from the Russian Federation is pivotal. These additional samples would not only offer deeper insights into the hspUral subclades but might also illuminate the possibility of uncovering novel subpopulations stemming from the main ancestral group, HpAsia2. Also, the African continent is still poorly studied in terms of H. pylori genomics, which severely limits our understanding of not only population structure but important aspects of bacterial virulence and pathophysiology.

This HpGP manuscript was designed as a landmark paper, detailing Helicobacter pylori population structure in a global, high-quality dataset. Our intention is for the manuscript to serve as a launching point for individual researchers to deepen the exploration of the detailed data generated by our network. We hope the material (i.e., data and strains) generated by the HpGP, including shared resources, codes, and interactive visualizations, together with our main results, will be widely used and will facilitate secondary analyses with the ultimate goal of reducing the burden of the pathologies associated with this bacterial carcinogen.

Methods

Sample acquisition

The HpGP samples represent a convenience sample. Contributors of samples were identified through advertisements at international scientific meetings, direct invitations to known colleagues and investigators with published sets of H. pylori strains, as well as referrals. Only a limited number of H. pylori genomes were publicly available from Spain, one of the main countries responsible for colonial activities in the Americas. Thus, in collaboration with members of the Spanish Association of Gastroenterology, we oversampled this country to better understand the admixed genomes from individuals from Latin America and the Caribbean.

We obtained gastric tissues (fresh frozen with and without culture media; n = 351) and cultures (pooled or single colonies; n = 660) of H. pylori from patients with non-atrophic gastritis (n = 606), advanced intestinal metaplasia (n = 172, with extension to gastric corpus or incomplete type restricted to antrum), and gastric cancer (n = 233). Samples were collected between 1995 and 2020. Biospecimens were shipped to the Division of Gastroenterology, Hepatology, and Nutrition at Vanderbilt University for processing. Before shipment, clinical information and sample descriptions were submitted to the coordinating center at the US National Cancer Institute to confirm eligibility. Biospecimens from the 72 collaborating centers were shipped frozen on dry ice. All individuals provided informed consent, and local Institutional Review Boards approved sample collection. The HpGP was exempted from institutional review board evaluation by the National Institutes of Health Office of Human Subjects Research Protection. The summary statistics of 1011 included strains are presented in Table 1, and corresponding NCBI accession numbers and genome statistics are presented in Supplementary Data 1.

Isolation and expansion of H. pylori strains and DNA extraction

Gastric tissues (biopsies or fragments from resections) were homogenized under sterile conditions in 100 μL of sterile phosphate-buffered saline (PBS, pH 7.4) using a homogenizer (Kimble–Kontes, Vineland, NJ, US). Then, 300 μL of sterile PBS was added to each sample, mixed, and plated onto two selective Trypticase soy agar (TSA) plates with 5% sheep blood containing vancomycin (20 mg/L), bacitracin (200 mg/L), nalidixic acid (10 mg/L) and amphotericin B (2 mg/L) (Sigma, St Louis, MO, US). In addition, a 1:10 dilution was plated on a no-antibiotic TSA plate (BBL; LABSCO, Nashville, TN, US). Agar plates were incubated under microaerobic conditions (Campy Pak Plus envelope, BBL) at 37 °C for 4–6 days until small gray translucent colonies appeared. Gram stains and assays for oxidase and urease were performed. Colony morphology was consistent with the characteristic shape of H. pylori colonies. A pool and one single colony of H. pylori were expanded and frozen in 1 mL of freezing media (Brucella broth plus 15% glycerol). The single colony was also expanded and used for DNA extraction with the QIAamp DNA Mini Kit (Qiagen, Catalog number 51306), following the manufacturer's protocol and using the EB buffer to elute the DNA. Original cultures (pooled or single colonies) were processed using the same protocol.

PacBio whole-genome library preparation and sequencing

DNA samples were sequenced at the Cancer Genomics Research Laboratory at the US National Cancer Institute. Whole-genome libraries were constructed from microbial DNA using the SMRTbell Template Prep Kit according to the manufacturer’s protocol. Briefly, 1000 ng of genomic DNA, as determined by Quant-iT™ PicoGreen dsDNA Reagent (Thermo Fisher Scientific, Waltham, MA, US), was sheared using the g-TUBE (Covaris, Inc., Woburn, MA, US) to an average fragment size of 10 kb. Following fragmentation and purification, DNA damage repair, and end repair, hairpin adapters were ligated to the fragment ends to generate SMRTbell libraries. For sequencing on the PacBio RSII instrument, standard hairpin adapters were used. For sequencing on the PacBio Sequel and Sequel II instruments, barcoded hairpin adapters were used. For the Sequel and Sequel II, barcoded SMRTbell libraries were pooled (up to 8 for the Sequel and up to 48 for the Sequel II), and stringent purification was performed using AMPure PB beads to remove small fragments. Following purification, sequencing primer annealing and DNA polymerase binding of the pooled SMRTbell libraries were performed according to the manufacturer’s protocols. SMRT sequencing of the libraries proceeded on the PacBio instrument using 1 SMRT Cell per isolate (RSII) or pool of libraries (Sequel, Sequel II). Genome coverage ranged from 111 to 5678× (median, 949×).

Genome assembly

Since samples were sequenced on multiple generations of PacBio instruments (RSII, Sequel, Sequel II), raw data from the RSII in h5 format were converted to Sequel’s subreads XML format so that the same analytical pipeline could be applied to data from all three instruments. Thus, after initial assembly by HGAP3, RSII data were reanalyzed with the same analysis pipeline as Sequel and Sequel II data. The original assemblies (from HGAP3) and the new assemblies from the newer SMRTlink assembly tools (HGAP4 or Microbial Assembly) were compared and were highly consistent. Newly reassembled RSII data that generated a single contiguous chromosomal contig were kept.

Whole-genome assembly using raw subreads was performed using SMRTLink’s HGAP4/Microbial Assembly, as well as Hifiasm v0.13-r308²⁴ on the HiFi (circular consensus sequencing, CCS) reads. Prior to Hifiasm, raw subreads were converted to circular consensus reads, filtering for CCS reads with minimum predicted read quality higher than 0.99. Chromosomes and plasmids were assembled, which proved to be the most accurate and efficient to achieve complete assembly of all contigs in silico. Circularization with circlator v1.5.3²⁵ was performed on every contig in each strain. Bacterial chromosomal contig start points were all shifted to NusB gene with an additional 12 nt at the 3’-end. MUMmer v3.23²⁶ was used to perform a self-alignment to screen for assembly issues or artifactual contigs, and prokka v1.14.6²⁷ annotation of conserved H. pylori genes was run on both raw subread and HiFi read assemblies. The assemblies generated from raw subreads were generally used for methylation calling and downstream analysis unless they failed to be circularized, or prokka annotation suggested a pseudogene percentage higher than 5%, or they contained an unexpected number of tRNA/rRNAs. In those samples, the Hifiasm assembly was used. Candidate chromosomal contigs were also aligned to a published 26695 H. pylori strain (NC_000915.1) as a sanity check to verify the contig shared high homology with known H. pylori. The detailed analyses of HpGP plasmid sequences and their geographic and chromosomal contexts will be reported in full elsewhere.
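The rule for choosing between the raw-subread and Hifiasm assemblies can be written as a small decision function. The following is a minimal sketch, not the pipeline code used in the study: the `AssemblyStats` fields and the `expected_trna`/`expected_rrna` parameters are hypothetical, since the text does not state which tRNA/rRNA counts were treated as expected.

```python
from dataclasses import dataclass


@dataclass
class AssemblyStats:
    """Summary of one raw-subread assembly (hypothetical field names)."""
    circularized: bool
    pseudogene_fraction: float  # pseudogenes / total genes, in the range 0..1
    n_trna: int
    n_rrna: int


def choose_assembly(raw: AssemblyStats, expected_trna: int, expected_rrna: int,
                    max_pseudogene_fraction: float = 0.05) -> str:
    """Return which assembly to keep for downstream analysis.

    The raw-subread assembly is kept unless it failed circularization,
    exceeded the 5% pseudogene limit, or had an unexpected tRNA/rRNA count.
    The expected counts are caller-supplied assumptions.
    """
    fails = (
        not raw.circularized
        or raw.pseudogene_fraction > max_pseudogene_fraction
        or raw.n_trna != expected_trna
        or raw.n_rrna != expected_rrna
    )
    return "hifiasm" if fails else "raw_subreads"
```

Passing the expected counts as arguments keeps the unstated part of the rule visible at the call site rather than buried in the function.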

Assembly quality control

To assess assembly quality, we applied the 3Cs protocol suggested by PacBio (https://www.pacb.com/blog/beyond-contiguity/). First, we assessed sequence contiguity and determined that the HpGP de novo assemblies all have a contig N50 over 1 Mb. As expected, single chromosomal contigs range from ~1.5 to 1.7 Mb. Second, we measured the completeness of our assemblies using BUSCO (Benchmarking Universal Single-Copy Orthologs) scores²⁸ v5.1.3. BUSCO checks the presence or absence of highly conserved genes, and a score >95% is considered to indicate a good assembly. For the assemblies that did not achieve a BUSCO score above 95%, we either discarded that sequence, or a second attempt was carried out either in silico or in the laboratory. All 1011 HpGP assemblies have BUSCO scores above 95%. Third, to measure correctness, we checked the ratio of pseudogenes (frameshifted, incomplete, internal stop, ambiguous residues, and multiple problems) to the total number of genes. Any assembly in which pseudogenes made up more than 5% of all genes was discarded. Although the HpGP set includes assemblies from three different PacBio sequencing instruments (15 RSII, 832 Sequel, and 164 Sequel II), the measures of assembly quality (contiguity, BUSCO scores, genomic sequence length, number of total genes, and number of pseudogenes) were similar for the 1011 assemblies from these instruments.

Finally, a consolidated QC report was generated to summarize contig lengths, BUSCO score, and coverage depth (Supplementary Data 1). The minimal chromosomal contig average confidence QV score among most strains was as high as 90.
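The three checks in this subsection (contiguity, completeness, correctness) can be combined into one filter. The sketch below is illustrative only: the function names and input layout are assumptions, and the thresholds are the ones stated in the text (contig N50 over 1 Mb, BUSCO above 95%, pseudogenes at most 5% of genes).

```python
def n50(contig_lengths):
    """Length of the contig at which the cumulative sum of contig lengths,
    taken from longest to shortest, first reaches half the assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def passes_qc(contig_lengths, busco_percent, n_pseudogenes, n_genes):
    """Apply the three thresholds described in the text to one assembly."""
    return (
        n50(contig_lengths) > 1_000_000         # contiguity
        and busco_percent > 95.0                # completeness
        and n_pseudogenes / n_genes <= 0.05     # correctness
    )
```

For a typical single-chromosome assembly with one small plasmid, the N50 equals the chromosome length, so the contiguity check mainly flags fragmented chromosomes.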

National Center for Biotechnology Information (NCBI) annotation

The HpGP chromosomal sequences were submitted to NCBI, including annotation with the NCBI Prokaryotic Genome Annotation Pipeline, PGAP (https://www.ncbi.nlm.nih.gov/genome/annotation_prok/)²⁹,³⁰,³¹. For sequences that could not be circularized (n = 7), 100 Ns were added to mark breakpoint locations in the genomic sequences. The individual accession numbers and genome statistics are presented in Supplementary Data 1.

Representative genome dataset

To relate the HpGP dataset to previous knowledge about H. pylori population structure, we used a reference dataset representing the 17 global H. pylori subpopulations (n = 255 genomes, see Supplementary Data 2) described prior to January 2022. To acquire a balanced dataset, we selected 15 genomes per subpopulation, and for the subpopulations with more reported genomes, we selected representatives based on (1) consistency of population assignments in previous publications, (2) assembly quality (contig number, genome size 1.7 ± 0.2 Mbp), and (3) as wide a geographical representation within the subpopulation as possible, to try to encompass the full breadth of each subpopulation. For the African continent, hpAfrica2, hspAfrica1SAfrica, hspAfrica1WAfrica, and hpNEAfrica were represented, and from Europe hspNEurope, hspSEurope, and hspSWEurope. From Asia, we complemented hpAsia2 and hspEAsia with newly published genomes from hpNorthAsia and the proposed subpopulations hspSiberia and hspUral¹⁴. For the Americas, hspSWEuropeLatinAmerica, hspAfrica1MiscAmericas, hspAfrica1NorthAmerica, hspIndigenousAmericaN, and hspIndigenousAmericaS were represented⁸, and lastly, for Oceania we included hpSahul. The genomes were annotated using prokka v1.14.6²⁷ as previously described⁸,¹¹.

Core genome analysis

All population structure analyses (fineSTRUCTURE, Chromosome Painting, Network analysis, and DAPC) were based on the same core gene alignment. This was generated using the prokka-annotated 1011 HpGP genomes plus a resequenced ATCC reference strain 26695 (HpGP-26695) and the 255 representative genomes, a total of 1267 genomes. The analysis was performed using the panaroo pipeline v1.2.10³² with cut-offs of 90% protein sequence identity and 75% gene length coverage.

Population structure analysis

The genome-wide haplotype data was calculated as described previously³³: we conducted SNP calling for each alignment, and imputation for polymorphic sites with missing frequency <1% using BEAGLE v.3.3.2³⁴. This genome-wide haplotype contained 387,927 SNPs in 1227 genes and was used to define isolate populations and subpopulations based on the similarity of the haplotype copying profiles obtained by fineSTRUCTURE v4. Then, fineSTRUCTURE³⁵ analysis was performed with 200,000 iterations of both the burn-in and Markov chain Monte Carlo (MCMC) method to cluster individuals based on the coancestry matrix as described³⁶. The results were visualized as a heat map with each cell indicating the proportion of DNA “chunks” a recipient receives from each donor. Furthermore, the posterior distribution of the clusters was visualized using fineSTRUCTURE’s tree-building algorithm to define the populations and subpopulations produced. With the previously obtained coancestry matrix, multiple principal component analysis (PCA) was calculated to analyze the population structure in detail. Principal components (PCs) 1 to 11 were calculated and visualized using R.
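As an illustration of the final step, principal components can be computed from a coancestry matrix by centering it and taking a singular value decomposition. This is a minimal sketch in Python with NumPy, whereas the analysis described here was done in R; the function name and the treatment of each recipient's row as an observation are assumptions.

```python
import numpy as np


def pca_scores(coancestry, n_components=11):
    """Return PC scores and explained-variance fractions for a matrix
    whose rows are recipients and whose columns are donors."""
    x = np.asarray(coancestry, dtype=float)
    x = x - x.mean(axis=0)                      # center each donor column
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    k = min(n_components, s.size)
    scores = u[:, :k] * s[:k]                   # coordinates on PCs 1..k
    explained = s**2 / np.sum(s**2)             # variance fraction per PC
    return scores, explained[:k]
```

The default of 11 components mirrors the PCs 1 to 11 mentioned above; plotting pairs of columns of `scores` gives the usual PCA scatter plots.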

DAPC analysis

We employed discriminant analysis of principal components (DAPC) to further investigate the genetic structure of our data. DAPC describes clusters in genetic data by creating synthetic variables (discriminant functions) that maximize variance among groups while minimizing variance within groups⁹. DAPC is a multivariate approach, not model based; hence, it makes no assumptions about Hardy-Weinberg or linkage equilibrium on genetic loci. Before running the DAPC, we assessed the number of clusters most supported for our H. pylori dataset by employing the find.clusters function in the adegenet R package, comparing the results of 100 independent runs using a custom-made R script and selecting the optimal number of clusters according to the Bayesian information criterion (BIC). In order to assess the uncertainty of the group assignments of each individual, we visualized posterior group membership probabilities based on the DAPC analysis using the function compoplot.
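The idea of choosing the number of clusters by BIC can be sketched with k-means on a matrix of principal-component scores. This is not the adegenet implementation: the k-means routine is a plain Lloyd's algorithm, the BIC is computed with the assumed form n·ln(WSS/n) + k·ln(n), and taking the best of several seeded runs is a simplification of the 100-run comparison described above.

```python
import numpy as np


def kmeans_wss(x, k, n_iter=50, seed=0):
    """Run Lloyd's k-means and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):                  # keep old center if cluster empties
                centers[j] = members.mean(axis=0)
    d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()


def bic_by_k(x, k_values, n_runs=10):
    """Return {k: BIC}, using the lowest WSS found over n_runs seeds."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    result = {}
    for k in k_values:
        wss = min(kmeans_wss(x, k, seed=run) for run in range(n_runs))
        result[k] = n * np.log(wss / n) + k * np.log(n)   # assumed BIC form
    return result
```

The value of k with the lowest BIC, or the point where the BIC curve flattens, would then be carried forward as the number of groups for the discriminant analysis.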

Using SNP-sites v2.5.1³⁷, we extracted 601,000 SNPs from the 1267 genome panaroo core gene alignment. We employed the function optim.a.score of the adegenet R package to identify the optimal number of principal components to consider for the analyses, as too many could lead to overfitting, while a low number of components could decrease discriminatory power between groups.

We first ran a DAPC employing the most supported number of clusters/groups as estimated by the find.clusters analysis: K = 6. Initially, we ran the DAPC considering all the sequences in our alignment so as to visualize the entire genetic variability of our data. Given the outlier position of two groups, we subsequently ran another DAPC analysis excluding these outliers to better emphasize the differences among the other clusters. Finally, we computed the posterior group membership probability for each individual. This parameter is based on the retained discriminant functions of the DAPC analysis and represents the probability of each sample being assigned to a group, which can be interpreted to assess how clear-cut or admixed the clusters are. We also ran the DAPC procedure considering K = 17, the same number of clusters identified by the fineSTRUCTURE analysis.

Network analysis of core genes

The core gene alignment obtained with panaroo was used to estimate distances with PAUP³⁸ v4.0a166, using maximum likelihood criteria. Each distance was normalized between 0 and 1 as previously described⁸. With this normalization, 0 means the highest genetic similarity, and 1 signifies the highest dissimilarity between two strains. Next, a complete network is created, where all pairs of strains have a measure of genetic distance based on this previous normalization. Strains are represented as vertices, and their distances are represented as edges. In the beginning, this network is fully connected and has no perceptible structure. A process of edge and node pruning is carried out to reveal the underlying structure of the genetic similarity between strains. This process consists of ranking the values of the edges and removing them subsequently, starting with the most dissimilar (equal or close to one). This process is continued until the network is subdivided into a determined number of Connected Components (CC). We consider a CC as a set of more than two nodes connected between them but isolated from other groups of nodes. If a single node is stripped of all its edges (singleton), we discard this node from the set of nodes of the resulting network. Figure 2 was created following this pruning process with a CC threshold of 2. This means that edges and singletons were removed until the full network was separated into two groups of nodes, with the separated group being the hpAfrica2 group.
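The pruning procedure can be sketched as follows. This is a minimal illustration rather than the code used for Figure 2: it assumes the distances are already normalized to the range 0 to 1, counts only components with more than two nodes (so singletons and pairs are ignored), and recomputes components after every edge removal, which is simple but slow for large networks.

```python
def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sets."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for node in nodes:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        components.append(component)
    return components


def prune_network(dist, target_cc=2, min_size=3):
    """Remove edges from most to least dissimilar until the network has
    target_cc connected components of at least min_size nodes.

    dist maps (strain_a, strain_b) pairs to normalized distances.
    """
    nodes = sorted({node for pair in dist for node in pair})
    edges = sorted(dist, key=dist.get)     # ascending, so pop() drops the largest
    while edges:
        components = [c for c in connected_components(nodes, edges)
                      if len(c) >= min_size]
        if len(components) >= target_cc:
            return components
        edges.pop()
    return []
```

With `target_cc=2`, the function stops at the first point where the fully connected network falls apart into two groups, which corresponds to the threshold described for Figure 2.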

Chromosome painting

Full dataset analysis

To identify the patterns of shared genomic content of H. pylori isolates, we conducted chromosome painting using ChromoPainterV2³⁵, designating all 1011 HpGP genomes as recipients and randomly selecting ~20 isolates per population as donors (335 genomes; Supplementary Data 3). Each strain was painted using all the other donor samples, and the result was visualized in a bar plot built with R.
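One way to turn a painting result into the per-population bars of such a plot is to sum, for each recipient, the amount of DNA copied from the donors of each population and normalize to proportions. The sketch below assumes a hypothetical input layout (a mapping from donor name to copied chunk length for one recipient, plus a donor-to-population lookup); it does not parse ChromoPainter output files.

```python
from collections import defaultdict


def population_proportions(chunk_lengths, donor_population):
    """Aggregate one recipient's painting profile by donor population.

    chunk_lengths:    {donor_name: total length copied from that donor}
    donor_population: {donor_name: population label}
    Returns {population: proportion of the recipient's painted genome}.
    """
    totals = defaultdict(float)
    for donor, length in chunk_lengths.items():
        totals[donor_population[donor]] += length
    grand_total = sum(totals.values())
    return {pop: value / grand_total for pop, value in totals.items()}
```

Each recipient's proportions sum to 1, so stacking them gives one full-height bar per strain.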

HpGP only ancestral chromosome painting

Since we have no genomes from the true historically ancestral populations, genomic ancestry is commonly inferred from contemporary representatives of these populations. A second chromosome painting was thus performed using hpAfrica2, hspAfrica1WAfrica, hpNEAfrica, hpAsia2, hpNorthAsia, hspUral, and hspEAsia as donors to infer ancestral contributions to the populations in the HpGP dataset³,¹⁰. We also only selected donors among the reference collection for which H. pylori population assignment and geographical origin were concordant (Supplementary Data 3).

Core gene multilocus sequence typing (cgMLST)

To investigate the existence of clonal relationships in H. pylori, we estimated the total number of identical loci shared among strains from the HpGP dataset by performing a cgMLST as implemented by chewBBACA³⁹ software v2.8.5. chewBBACA uses a gene-by-gene method to compare coding sequences and assign alleles based on a BLAST Score Ratio (BSR)⁴⁰. We first used Prodigal⁴¹ v2.6.3, including the option -t to create a training file from the assembled version of the 26695 H. pylori reference strain resequenced as part of the HpGP dataset. Then, the “CreateSchema” module of chewBBACA was applied to the 1011 HpGP genomes and the Prodigal training file to estimate a whole-genome MLST (wgMLST) scheme. The 3943 wgMLST genes were then compared with the “AlleleCall” module, using the default BSR threshold of 0.6. A total of 867 genes identified as paralogs were removed from the wgMLST using the “RemoveGenes” module, reducing the scheme to 3076 loci. We then used the “ExtractCgMLST” module to create a cgMLST with all loci present in more than 95 percent of strains (--t 0.95), obtaining a total of 981,110 alleles for 1040 loci, an average of 943 different alleles per locus.
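The core-locus selection step can be illustrated on a small allele table. This is a sketch of the idea only, not chewBBACA's implementation: the input layout (`None` for a missing call) and the function names are assumptions, and loci at exactly the threshold are kept here, which may differ from the tool's boundary handling.

```python
def extract_cgmlst(profiles, threshold=0.95):
    """Return the loci called in at least `threshold` of the strains.

    profiles: {strain: {locus: allele_id or None}}; None marks a missing call.
    """
    strains = list(profiles)
    loci = sorted({locus for calls in profiles.values() for locus in calls})
    core = []
    for locus in loci:
        present = sum(1 for s in strains if profiles[s].get(locus) is not None)
        if present / len(strains) >= threshold:
            core.append(locus)
    return core


def mean_alleles_per_locus(profiles, loci):
    """Average number of distinct alleles observed per locus."""
    counts = [
        len({profiles[s].get(locus) for s in profiles} - {None})
        for locus in loci
    ]
    return sum(counts) / len(counts)
```

The reported average follows directly from the totals given above: 981,110 alleles over 1040 loci is about 943 alleles per locus.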

Lastly, we used the cgMLST allelic profile to calculate pairwise distances with GrapeTree⁴² v1.5.0, running it in “--wgMLST” mode with the “distance” method (--method distance) while ignoring missing data (--missing 0). We analyzed the distribution of cgMLST distances between pairs of strains in categories such as “US clone”, “US clone boundary”, “US non-clone”, “Chile”, “Chilean hspSWEuropeChile”, “non-Chilean hspSWEuropeChile”, “within the same country”, and “between different countries”, as depicted in Fig. 4a.
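A pairwise allelic distance that ignores missing data can be sketched as the number of loci with different alleles, counted only over loci called in both strains. This is a simplified reading of that setting and does not reproduce any additional handling GrapeTree applies in its wgMLST mode; the profile layout (`None` for a missing call) is an assumption.

```python
def allelic_distance(profile_a, profile_b):
    """Count loci with differing alleles, skipping loci missing in either strain.

    Each profile maps locus name to an allele identifier, or None if uncalled.
    """
    shared = [
        locus for locus in profile_a
        if profile_a[locus] is not None and profile_b.get(locus) is not None
    ]
    return sum(1 for locus in shared if profile_a[locus] != profile_b[locus])
```

Under this definition, two strains from the same clone would share most alleles and have a distance close to 0, while unrelated strains would differ at nearly every shared locus.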

Analysis of public US genomes

We downloaded all whole-genome sequences publicly available in the EnteroBase H. pylori database (https://enterobase.warwick.ac.uk/species/index/helicobacter) with the US as the country of isolation as of September 18, 2022 (n = 226). Sixty-seven sequences were excluded because they were isolated from non-human hosts, resulted from experimental infections, were repeated samplings from the same individual, or overlapped with the HpGP set. The remaining 151 genomes (Supplementary Data 4) were combined with the HpGP US genomes and the 255 references in a kmer-based genomic distance analysis using mash v2.3⁴³. The five genomes clustering with the US deep clone were added to the dataset used for in-depth analysis.
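For orientation, the Mash distance between two genomes is derived from the Jaccard index j of their k-mer sets as D = -(1/k)·ln(2j/(1+j)). The sketch below computes it from exact k-mer sets; mash itself estimates j from MinHash sketches and uses canonical k-mers, neither of which is reproduced here, and the function names are illustrative.

```python
import math


def kmers(seq, k):
    """Return the set of all k-mers in a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_distance(seq_a, seq_b, k=21):
    """Mash distance computed from the exact Jaccard index of k-mer sets."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a or not b:
        raise ValueError("both sequences must be at least k bases long")
    j = len(a & b) / len(a | b)
    if j == 0:
        return 1.0                      # no shared k-mers: maximal distance
    return -math.log(2 * j / (1 + j)) / k
```

Identical sequences give a distance of 0, and the distance grows as fewer k-mers are shared, which is why near-identical clonal genomes cluster tightly in this kind of analysis.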

Dating of the US deep clone

A core gene alignment of the highly clonal US genomes, including the five public ones, was generated with panaroo using the settings described above. Three genomes, HpGP-USA-401, HpGP-USA-404, and HpGP-USA-414 had diverged from the clone both by phylogeny and chromosome painting profile and were excluded from further analysis. A phylogenetic tree was computed using PhyML v3.1⁴⁴ and input to ClonalFrameML v1.11-3-g4f13f23⁴⁵, executed using default parameters. Node ages were determined using the R BactDating package⁴⁶, using 10,000 Markov chain Monte Carlo iterations and a mutation rate of 1.38 × 10⁻⁵ per site per year, as has previously been estimated¹⁶.
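To give a feel for the time scale implied by this rate, the expected number of substitutions on a branch under a strict clock is rate × sites × years, which can be inverted to get years. This back-of-the-envelope sketch is not the Bayesian dating performed with BactDating; the alignment length used in the example is hypothetical.

```python
RATE = 1.38e-5  # substitutions per site per year, as stated in the text


def years_for_substitutions(n_substitutions, n_sites, rate=RATE):
    """Years needed to accumulate n_substitutions along one branch of an
    alignment with n_sites sites, assuming a strict molecular clock."""
    return n_substitutions / (rate * n_sites)


# Hypothetical example: a 1 Mb core alignment accumulates about 13.8
# substitutions per year on a branch, so 138 substitutions take ~10 years.
example_years = years_for_substitutions(138, 1_000_000)
```

Because recombination also introduces differences, substitutions should be counted after recombinant regions have been accounted for, as ClonalFrameML does in the workflow above.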

Data visualization

The map figures of the dataset’s geographical distribution, including the gray background map, were plotted using the ggplot2⁴⁷ and ggmap⁴⁸ packages in R. The painting profiles were summarized as described above, and plotting and statistical analysis were performed in R using the ggplot2 and plotly⁴⁹ packages.

Strain availability

The HpGP set of H. pylori strains is available from the US National Cancer Institute for scientific purposes upon a reasonable request. However, restrictions apply to its availability as some samples require authorization from contributing centers to be distributed to third parties.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The whole-genome sequences generated within the HpGP have been deposited in the NCBI GenBank database under BioProject accession code PRJNA529500 [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA529500] (Supplementary Data 1). NCBI or equivalent public accessions for the reference set are listed in Supplementary Data 2. The whole HpGP genome dataset and the 255 reference genomes are also deposited to Zenodo, DOI: 10.5281/zenodo.10048320. Source Data for the individual figures are available with this paper.

Code availability

The computational scripts to process the data and plot figures are available at https://github.com/HpGP/Code-and-Data v1.0. This code is also archived on Zenodo under https://doi.org/10.5281/zenodo.8381170.

References

1. Fox, J. G. & Wang, T. C. Inflammation, atrophy, and gastric cancer. J. Clin. Investig. 117, 60–69 (2007).
2. Conteduca, V. et al. H. pylori infection and gastric cancer: state of the art (review). Int. J. Oncol. 42, 5–18 (2013).
3. Falush, D. et al. Traces of human migrations in Helicobacter pylori populations. Science 299, 1582–1585 (2003).
4. Linz, B. et al. An African origin for the intimate association between humans and Helicobacter pylori. Nature 445, 915–918 (2007).
5. Moodley, Y. et al. Age of the association between Helicobacter pylori and man. PLoS Pathog. 8, e1002693 (2012).
6. Yamaoka, Y. Helicobacter pylori typing as a tool for tracking human migration. Clin. Microbiol. Infect. 15, 829–834 (2009).
7. Sung, H. et al. Global Cancer Statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 71, 209–249 (2021).
8. Munoz-Ramirez, Z. Y. et al. A 500-year tale of co-evolution, adaptation, and virulence: Helicobacter pylori in the Americas. ISME J. 15, 78–92 (2021).
9. Jombart, T., Devillard, S. & Balloux, F. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet. 11, 94 (2010).
10. Thorpe, H. A. et al. Repeated out-of-Africa expansions of Helicobacter pylori driven by replacement of deleterious mutations. Nat. Commun. 13, 6842 (2022).
11. Thorell, K. et al. Rapid evolution of distinct Helicobacter pylori subpopulations in the Americas. PLoS Genet. 13, e1006546 (2017).
12. Berthenet, E. et al. A GWAS on Helicobacter pylori strains points to genetic variants associated with gastric cancer risk. BMC Biol. 16, 84 (2018).
13. You, Y. et al. Genomic differentiation within East Asian Helicobacter pylori. Microb. Genom. https://doi.org/10.1099/mgen.0.000676 (2022).
14. Moodley, Y. et al. Helicobacter pylori’s historical journey through Siberia and the Americas. Proc. Natl Acad. Sci. USA. https://doi.org/10.1073/pnas.2015523118 (2021).
15. Suzuki, R. et al. Helicobacter pylori genomes reveal Paleolithic human migration to the east end of Asia. iScience 25, 104477 (2022).
16. Didelot, X. et al. Genomic evolution and transmission of Helicobacter pylori in two South African families. Proc. Natl Acad. Sci. USA. 110, 13880–13885 (2013).
17. Moodley, Y. & Linz, B. Helicobacter pylori sequences reflect past human migrations. Genome Dyn. 6, 62–74 (2009).
18. Kumar, N., Albert, M. J., Al Abkal, H., Siddique, I. & Ahmed, N. What constitutes an Arabian Helicobacter pylori? Lessons from comparative genomics. Helicobacter. https://doi.org/10.1111/hel.12323 (2017).
19. Kumar, N. et al. Comparative genomic analysis of Helicobacter pylori from Malaysia identifies three distinct lineages suggestive of differential evolution. Nucleic Acids Res. 43, 324–335 (2015).
20. Oleastro, M., Rocha, R. & Vale, F. F. Population genetic structure of Helicobacter pylori strains from Portuguese-speaking countries. Helicobacter. https://doi.org/10.1111/hel.12382 (2017).
21. Li, Y. et al. Global prevalence of Helicobacter pylori infection between 1980 and 2022: a systematic review and meta-analysis. Lancet Gastroenterol. Hepatol. 8, 553–564 (2023).
22. Ford, A. C. et al. Effect of sibling number in the household and birth order on prevalence of Helicobacter pylori: a cross-sectional study. Int. J. Epidemiol. 36, 1327–1333 (2007).
23. Goodman, K. J. & Correa, P. Transmission of Helicobacter pylori among siblings. Lancet 355, 358–362 (2000).
24. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
25. Hunt, M. et al. Circlator: automated circularization of genome assemblies using long sequencing reads. Genome Biol. 16, 294 (2015).
26. Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).
27. Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).
28. Manni, M., Berkeley, M. R., Seppey, M., Simao, F. A. & Zdobnov, E. M. BUSCO Update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 38, 4647–4654 (2021).
29. Li, W. et al. RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation. Nucleic Acids Res. 49, D1020–D1028 (2021).
30. Haft, D. H. et al. RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res. 46, D851–D860 (2018).
31. Tatusova, T. et al. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res. 44, 6614–6624 (2016).
32. Tonkin-Hill, G. et al. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol. 21, 180 (2020).
33. Yahara, K., Didelot, X., Ansari, M. A., Sheppard, S. K. & Falush, D. Efficient inference of recombination hot regions in bacterial genomes. Mol. Biol. Evol. 31, 1593–1605 (2014).
34. Browning, B. L. & Browning, S. R. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet. 84, 210–223 (2009).
35. Lawson, D. J., Hellenthal, G., Myers, S. & Falush, D. Inference of population structure using dense haplotype data. PLoS Genet. 8, e1002453 (2012).
36. Yahara, K. et al. Chromosome painting in silico in a bacterial species reveals fine population structure. Mol. Biol. Evol. 30, 1454–1464 (2013).
37. Page, A. J. et al. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb. Genom. https://doi.org/10.1099/mgen.0.000056 (2016).
38. Wilgenbusch, J. C. & Swofford, D. Inferring evolutionary trees with PAUP*. Curr. Protoc. Bioinformatics Chapter 6, Unit 6.4. https://doi.org/10.1002/0471250953.bi0604s00 (2003).
39. Silva, M. et al. chewBBACA: a complete suite for gene-by-gene schema creation and strain identification. Microb. Genom. https://doi.org/10.1099/mgen.0.000166 (2018).
40. Rasko, D. A., Myers, G. S. & Ravel, J. Visualization of comparative genomic analyses by BLAST score ratio. BMC Bioinforma. 6, 2 (2005).
41. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinforma. 11, 119 (2010).
42. Zhou, Z. et al. GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens. Genome Res. 28, 1395–1404 (2018).
43. Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).
44. Guindon, S., Delsuc, F., Dufayard, J. F. & Gascuel, O. Estimating maximum likelihood phylogenies with PhyML. Methods Mol. Biol. 537, 113–137 (2009).
45. Didelot, X. & Wilson, D. J. ClonalFrameML: efficient inference of recombination in whole bacterial genomes. PLoS Comput. Biol. 11, e1004041 (2015).
46. Didelot, X., Croucher, N. J., Bentley, S. D., Harris, S. R. & Wilson, D. J. Bayesian inference of ancestral dates on bacterial phylogenetic trees. Nucleic Acids Res. 46, e134 (2018).
47. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag New York). https://ggplot2.tidyverse.org (2016).
48. Kahle, D. & Wickham, H. ggmap: spatial visualization with ggplot2. R J. 5, 144–161 (2013).
49. Collaborative Data Science (Plotly Technologies Inc., Montréal, QC, 2015).


Acknowledgements

Our special thanks are extended to all the individuals who were the hosts of the strains of the HpGP collection, who represent the human populations that bear the burden of H. pylori-associated disease, and whose biological samples serve to advance research aimed at reducing this disease burden. We are immensely grateful to Lisa D. Finkelstein, Ramona Bhattacharya, Jillian M. Varonin, Mary Jane Williams, Karen Williams Kinney, and Melissa A. Raymond from the US National Institutes of Health (National Cancer Institute’s Technology Transfer Center, National Cancer Institute’s Division of Cancer Epidemiology and Genetics, and Division of Logistic Services) for their administrative and logistic support in establishing the collaboration agreements and importing the multiple sets of biospecimens. We also thank the US Centers for Disease Control and Prevention’s Import Permit Program. We dedicate this work to our deceased colleagues Pablo Luna, Radislav Nakov, Bongani Kaimila, and Khean Lee Goh, who passed away in recent years.

The HpGP was mainly supported by the Intramural Research Program from the US National Cancer Institute (NCI), National Institutes of Health (NIH). This work was supported in part by the intramural research programs of the US National Library of Medicine, the US National Institute on Minority Health and Health Disparities, and the US National Institute of Allergy and Infectious Diseases. The members of the bioinformatics group received support from the Swedish Society for Medical Research (K.T.), Assar Gabrielsson Foundation (K.T.), and Magnus Bergvall Foundation (K.T.). The collaborating centers for sample collection received grant support from the US NIH (P01CA116087, R01CA077955, R01DK058587 and P30DK058404 to R.M.P.; P01CA028842 and R01CA190612 to K.T.W.; P01CA028842, R01CA190612, K07CA125588, R03CA167773 and P30CA068485 to D.R.M.; K08CA252635 to R.J.H., K22CA226395 to M.G.-P.; and U54GM133807 to M.C.-C.), the German Federal Ministry of Education and Research (BMBF-0315905D, ERA-NET PathoGenoMics to P.M.), the French Association pour la Recherche Contre le Cancer (8412 to F.M.), the French Institut National du Cancer (07/3D1616/IABC-23-12/NC-NG and 2014-152 to F.M.), the Canceropole Grand Sud-Ouest (2010-08-canceropole GSO-Universite Bordeaux 2 to F.M.), the Japanese National Institutes of Health (DK62813 to Y.Y.), the Japanese Ministry of Education, Culture, Sports, Science, and Technology (18KK0266, 19H03473, 21H00346 and 22H02871 to Y.Y.), the National Fund for Innovation and Development of Science and Technology from the Ministry of Higher Education Science and Technology of the Dominican Republic (2012-2013-2A1-65 and 2015-3A1-182 to M.C.), the National Cancer Center of South Korea (2210630, I.J.C.), ArcticNet (RES0010178 to K.J.G.), the Network of Centres of Excellence of Canada, the Canadian Institutes for Health Research (MOP115031 to K.J.G.), Alberta Innovates Health Solutions (201201159 to K.J.G.), the University of Malaya-Ministry of Higher Education (UM.C/625/1/HIR/MOHE/CHAN-02 to J.V.), the Ministry of Science and Technology of Vietnam, the Kyrgyz State Medical Academy, the Italian Ministry of Health for Institutional Research, the Chilean National Fund for Health Research and Development (FONIS A19/0188, FONDECYT 1230504 and ANID-FONDAP 152220002 to A.R.; CONICYT-FONDAP 15130011 and FONDECYT 1231773 to A.H.C.), the Chilean Cancer Prevention and Control Center, the Horizon 2020 Programme of European Union (825832, “CeLac and European consortium for a personalized medicine approach to Gastric Cancer,” LEGACy, to T.F.-K. and A.R.), the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP; 2014/26847-0, 2018/14267-2, 2018/02972-3 to E.D.-N.), the Departamento de Ciência e Tecnologia (DECIT), Ministry of Health, Brazil (PRONON, SIPAR 2500.035-167/2015-23 to E.D.-N.), the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq, 314344/2020-9 to E.T.-S.), the Universidad de Costa Rica (742-B9-310 and 742-90912-19 to V.R.-M.), LABGIPAT (S.D.-B.), the Hospital Clínica Bíblica (C.C.-N.), the Greek Ministry of Culture and Education (InfeNeutra Project, NSRF 2007-2013, MIS450598, D.N.S.), the National Strategic Reference Framework Operational Program “Competitiveness, Entrepreneurship and Innovation” (NSRF 2014-2020, MIS5002486, D.N.S.), the Hellenic Helicobacter pylori Study Group (2012-2016, B.M.-G.), the Hellenic Society of Gastroenterology (National Multicenter Laboratory Surveillance Studies, 2018-2019, B.M.-G.), the Ministry of Science and Technology, Executive Yuan, Taiwan (109-2314-B-002-096; MOST 111-2314-B-002-012; MOST 109-2314-B-002-090-MY3 to J.-M.L. and M.-S.W.), the National Research Foundation of Singapore, the Singapore Ministry of Health’s National Medical Research Council (Open Fund-Large Collaborative Grant, MOH-OFLCG18May-0003), the University of Puerto Rico Comprehensive Cancer Center, the Fondo Nacional de Desarrollo Científico y Tecnológico (196-2015-FONDECYT to C.C.), Universidad Científica del Sur, and Instituto Nacional de Enfermedades Neoplasicas (INEN, Peru).

The computations and data storage required for the analyses presented were enabled by resources in projects snic-2021/22-229 and snic-2021/23-234 provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) and the Swedish National Infrastructure for Computing (SNIC) at the UPPMAX HPC, partially funded by the Swedish Research Council through grant agreements 2022-06725 and 2018-05973.

Funding

Open access funding provided by University of Gothenburg.

Author information

These authors contributed equally: Kaisa Thorell, Zilia Y. Muñoz-Ramírez.
These authors jointly supervised this work: M. Constanza Camargo, Charles S. Rabkin.

Authors and Affiliations

Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg, Sweden
Kaisa Thorell
Facultad de Ciencias Químicas, Universidad Autónoma de Chihuahua, Chihuahua, Chihuahua, México
Zilia Y. Muñoz-Ramírez
Cancer Genomics Research Laboratory, Frederick National Laboratory for Cancer Research, Frederick, MD, USA
Difei Wang, Yunhu Wan, Belynda Hicks, Bin Zhu, Meredith Yeager, Amy Hutchinson, Kedest Teshome, Kristie Jones & Wen Luo
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD, USA
Difei Wang, Alisa M. Goldstein, Nan Hu, Philip R. Taylor, Minkyo Song, Andrés J. Gutiérrez-Escobar, Kai Yu, Bin Zhu, Christian C. Abnet, Stephen J. Chanock, M. Constanza Camargo & Charles S. Rabkin
Instituto Nacional de Medicina Genómica, Ciudad de México, México
Santiago Sandoval-Motta
Consejo Nacional de Ciencia y Tecnologia, Cátedras CONACYT, Ciudad de México, México
Santiago Sandoval-Motta
Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de México, Ciudad de México, México
Santiago Sandoval-Motta
Department of Life Sciences and Biotechnology, University of Ferrara, Ferrara, Italy
Rajiv Boscolo Agostini & Silvia Ghirotto
Centre for Microbes Development and Health, Institute Pasteur Shanghai, Shanghai, China
Roberto C. Torres & Daniel Falush
Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
Judith Romero-Gallo, Uma Krishna, Richard M. Peek Jr, M. Blanca Piazuelo, Keith T. Wilson, John T. Loh & Timothy L. Cover
Department of Natural and Life Sciences, Faculty of Sciences, University of Algiers 1 Benyoucef Benkhedda, Algiers, Algeria
Naïma Raaf
Departamento de Medicina Interna, Hospital Alemán, Buenos Aires, Argentina
Federico Bentolila
Department of Gastroenterology, Dhaka Medical College and Hospital, Dhaka, Bangladesh
Hafeza Aftab
Department of Environmental and Preventive Medicine, Oita University Faculty of Medicine, Yufu, Japan
Junko Akada, Takashi Matsumoto & Yoshio Yamaoka
Department of Pathobiology, Pharmacology and Zoological Medicine, Faculty of Veterinary Medicine, Ghent University, Ghent, Belgium
Freddy Haesebrouck
Hospital Universitario Japones, Santa Cruz de la Sierra, Bolivia
Rony P. Colanzi
A.C.Camargo Cancer Center, São Paulo, São Paulo, Brazil
Thais F. Bartelli, Diana Noronha Nunes, Adriane Pelosof, Claudia Zitron Sztokfisz & Emmanuel Dias-Neto
Núcleo de Pesquisas em Oncologia, Universidade Federal do Pará, Belém, Pará, Brazil
Paulo Pimentel Assumpção
Medical University of Sofia, Sofia, Bulgaria
Ivan Tishkov
Department of Biochemistry, University of Dschang, Dschang, Cameroon
Laure Brigitte Kouitcheu Mabeku
Faculty of Medicine and Dentistry, Department of Medicine, University of Alberta, Edmonton, AB, Canada
Karen J. Goodman, Janis Geary, Taylor J. Cromarty & Nancy L. Price
Queen’s University, Kingston, ON, Canada
Douglas Quilty
Department of Hematology and Oncology, Faculty of Medicine, Pontificia Universidad Católica de Chile, Santiago, Chile
Alejandro H. Corvalan
Department of Pediatric Gastroenterology and Nutrition, Faculty of Medicine, Pontificia Universidad Católica de Chile, Santiago, Chile
Carolina A. Serrano
Department of Gastroenterology, Faculty of Medicine, Pontificia Universidad Católica de Chile, Santiago, Chile
Robinson Gonzalez & Arnoldo Riquelme
Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
Apolinaria García-Cancino & Cristian Parra-Sepúlveda
Cáncer Lab, Departamento de Ciencias Biomédicas, Facultad de Medicina, Universidad Católica del Norte (Coquimbo), Chile, Coquimbo, Chile
Giuliano Bernal
Hospital Hanga Roa, Easter Island, Chile
Francisco Castillo
Grupo de Investigación en Biología del Cáncer, Instituto Nacional de Cancerología, Bogotá DC, Colombia
Maria Mercedes Bravo
Departamento de Biología, Universidad de Nariño, Pasto, Nariño, Colombia
Alvaro Pazos
Escuela de Medicina, Universidad del Valle, Cali, Valle, Colombia
Luis E. Bravo
Division of Comparative Medicine, Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
James G. Fox
Instituto de Investigaciones en Salud, Universidad de Costa Rica, San Jose, Costa Rica
Vanessa Ramírez-Mayorga & Silvia Molina-Castro
Laboratorio de Patología General y Gastrointestinal (LABGIPAT), San Jose, Costa Rica
Sundry Durán-Bermúdez
Servicio de Gastroenterología y Endoscopía Digestiva, Hospital Clínica Bíblica, San Jose, Costa Rica
Christian Campos-Núñez & Manuel Chaves-Cervantes
Faculty of Medicine, Osaka Metropolitan University, Osaka, Japan
Evariste Tshibangu-Kabamba
Faculty of Medicine, University of Mbuji-Mayi, Mbuji Mayi, Kasai-Oriental, Democratic Republic of the Congo
Evariste Tshibangu-Kabamba & Ghislain Disashi Tumba
Faculty of Medicine, University of Kinshasa, Kinshasa, Democratic Republic of the Congo
Antoine Tshimpi-Wola, Patrick de Jesus Ngoma-Kisoko & Dieudonné Mumba Ngoyi
Department of Parasitology, National Institute of Biomedical Research, Kinshasa, Democratic Republic of the Congo
Dieudonné Mumba Ngoyi
Instituto de Microbiología y Parasitología, Universidad Autónoma de Santo Domingo, Santo Domingo, Dominican Republic
Modesto Cruz & Celso Hosking
Dominican-Japanese Digestive Disease Center, Dr Luis E. Aybar Health and Hygiene City, Santo Domingo, Dominican Republic
José Jiménez Abreu
Bordeaux Institute of Oncology, BRIC U1312, INSERM, Bordeaux, and National Reference Center for Campylobacters & Helicobacters, CHU de Bordeaux, Bordeaux, France
Christine Varon, Lucie Benejat, Quentin Jehanne, Philippe Lehours & Francis Megraud
Medical Research Council Unit, The Gambia at the London School of Hygiene & Tropical Medicine, Banjul, The Gambia
Ousman Secka
Department of Gastroenterology, Hepatology and Infectious Diseases, Otto-von-Guericke University Magdeburg, Magdeburg, Germany
Alexander Link & Peter Malfertheiner
Department of Biochemistry, School of Biological Sciences, University of Cape Coast, Cape Coast, Central Region, Ghana
Michael Buenor Adinortey
Department of Internal Medicine and Therapeutics, School of Medical Sciences, University of Cape Coast, Cape Coast, Central Region, Ghana
Ansumana Sandy Bockarie
Department of Molecular Biology and Biotechnology, School of Biological Sciences, University of Cape Coast, Cape Coast, Central Region, Ghana
Cynthia Ayefoumi Adinortey
Department of Biology Education, Faculty of Science Education, University of Education, Winneba, Ghana
Eric Gyamerah Ofori
Laboratory of Medical Microbiology, Hellenic Pasteur Institute, Athens, Greece
Dionyssios N. Sgouras & Beatriz Martinez-Gonzalez
Department of Gastroenterology, Alexandra Hospital, Athens, Greece
Spyridon Michopoulos
Department of Gastroenterology, Athens Medical, P. Faliron Hospital, Athens, Greece
Sotirios Georgopoulos
Facultad de Ciencias Médicas, University of San Carlos of Guatemala, Guatemala City, Guatemala
Elisa Hernandez
Unidad de Gastroenterología, Hospital Roosevelt, Guatemala City, Guatemala
Braulio Volga Tacatic
Gastrocentro, S.A., Guatemala City, Guatemala
Mynor Aguilar
Departamento de Medicina Interna, Hospital de Occidente, Santa Rosa de Copán, Honduras
Ricardo L. Dominguez
School of Medicine, University of Alabama at Birmingham (UAB), Birmingham, AL, USA
Douglas R. Morgan
Landspítali – The National University Hospital of Iceland, Reykjavík, Iceland
Hjördís Harðardóttir, Anna Ingibjörg Gunnarsdóttir, Hallgrímur Guðjónsson, Jón Gunnlaugur Jónasson & Einar S. Björnsson
University of Iceland, Reykjavík, Iceland
Jón Gunnlaugur Jónasson & Einar S. Björnsson
Department of Microbiology, Kasturba Medical College, Manipal Academy of Higher Education, Manipal, Karnataka, India
Mamatha Ballal & Vignesh Shetty
Department of Medicine, University of Cambridge, Cambridge, UK
Vignesh Shetty
Universitas Airlangga, Surabaya, East Java, Indonesia
Muhammad Miftahussurur, Titong Sugihartono, Ricky Indra Alfaray, Langgeng Agung Waskito & Kartika Afrida Fauzia
University of Indonesia, Jakarta, Indonesia
Ari Fahrial Syam & Hasan Maulahela
Digestive Disease Research Institute, Tehran University of Medical Sciences, Tehran, Iran
Reza Malekzadeh & Masoud Sotoudeh
Clinical Microbiology Laboratory and Research Institute, Tzafon Medical Center, affiliated with Azrieli Faculty of Medicine, Bar Ilan University, Poriya, Israel
Avi Peretz & Maya Azrad
Azrieli Faculty of Medicine, Bar Ilan University, Safed, Israel
Avi Peretz, Maya Azrad & Avi On
Pediatric Gastroenterology and Nutrition Unit, Tzafon Medical Center, Poriya, Israel
Avi On
Unit of Immunopathology and Oncological Biomarkers, Centro di Riferimento Oncologico di Aviano, Aviano, Italy
Valli De Re & Stefania Zanussi
Unit of Oncological Gastroenterology, Centro di Riferimento Oncologico di Aviano, Aviano, Italy
Renato Cannizzaro
Unit of Pathology, Centro di Riferimento Oncologico di Aviano, Aviano, Italy
Vincenzo Canzonieri
Department of Gastroenterology and Metabolism, Nagoya City University Graduate School of Medical Sciences, Nagoya, Japan
Takaya Shimura
School of Medicine, Kyorin University, Mitaka, Tokyo, Japan
Kengo Tokunaga, Takako Osaki & Shigeru Kamiya
Jordan University of Science and Technology, Ar-Ramtha, Jordan
Khaled Jadallah & Ismail Matalka
Department of Surgical Diseases Internship, Astana Medical University, Nur-Sultan, Kazakhstan
Nurbek Igissinov
Kyrgyz State Medical Academy, Bishkek, Kyrgyzstan
Mariia Satarovna Moldobaeva & Attokurova Rakhat
Center for Gastric Cancer, National Cancer Center, Goyang, South Korea
Il Ju Choi
Department of Internal Medicine, Chung-Ang University Hospital, Seoul, South Korea
Jae Gyu Kim
College of Medicine, Seoul National University, Seoul, South Korea
Nayoung Kim
Institute of Clinical and Preventive Medicine, University of Latvia, Riga, Latvia
Mārcis Leja, Reinis Vangravs, Ģirts Šķenders, Aiga Rūdule & Ilze Kikuste
Digestive Diseases Centre GASTRO, Riga, Latvia
Mārcis Leja & Aigars Vanags
Riga East University Hospital, Riga, Latvia
Mārcis Leja, Ģirts Šķenders & Dace Rudzīte
Department of Gastroenterology, Institute for Digestive Research, Medical Academy, Lithuanian University of Health Sciences, Kaunas, Lithuania
Juozas Kupcinskas, Jurgita Skieceviciene, Laimas Jonaitis, Gediminas Kiudelis, Paulius Jonaitis, Vytautas Kiudelis & Greta Varkalaite
Department of Medical Microbiology, Faculty of Medicine, Universiti Malaya, Kuala Lumpur, Malaysia
Jamuna Vadivelu, Mun Fai Loke & Kumutha Malar Vellasamy
Medical Education Research and Development Unit, Faculty of Medicine, Universiti Malaya, Kuala Lumpur, Malaysia
Jamuna Vadivelu
Departamento de Patología, Instituto Nacional de Cancerología, Mexico City, Mexico
Roberto Herrera-Goepfert
Servicio de Endoscopía, Instituto Nacional de Cancerología, Mexico City, Mexico
Juan Octavio Alonso-Larraga
Defence Services General Hospital, Yangon, Yangon, Yangon Region, Myanmar
Than Than Yee & Kyaw Htet
Nippon Medical School, Tokyo, Japan
Takeshi Matsuhisa
Department of Gastroenterology, Maharajgunj Medical Campus, Tribhuvan University Teaching Hospital, Kathmandu, Nepal
Pradeep Krishna Shrestha
Division of Health Sciences, Abu Dhabi Women’s Campus, Higher Colleges of Technology, Abu Dhabi, United Arab Emirates
Shamshul Ansari
Department of Community Medicine, Babcock University, Ilishan, Ogun State, Nigeria
Olumide Abiodun
Department of Medicine, Babcock University, Ilishan, Ogun State, Nigeria
Christopher Jemilohun
Department of Medicine, College of Medicine, University of Ibadan, Ibadan, Oyo, Nigeria
Kolawole Oluseyi Akande
Department of Nursing, Crescent University, Abeokuta, Ogun State, Nigeria
Oluwatosin Olu-Abiodun
University Teaching Hospital, University of Jos, Jos, Plateau, Nigeria
Francis Ajang Magaji
University of Calabar Teaching Hospital, Calabar, Cross River, Nigeria
Ayodele Omotoso & Uchenna Okonkwo
University of Nigeria Teaching Hospital, Ituku-Ozalla, Enugu State, Nigeria
Chukwuemeka Chukwunwendu Osuagwu
Department of Internal Medicine, Federal Medical Center Abeokuta, Abeokuta, Ogun State, Nigeria
Opeyemi O. Owoseni
Faculty of Health Sciences, Universidad Científica del Sur, Lima, Peru
Carlos Castaneda
Departamento de investigación, Instituto Nacional de Enfermedades Neoplasicas, Lima, Peru
Miluska Castillo
Universidad Peruana Cayetano Heredia, Lima, Peru
Billie Velapatino
Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
Robert H. Gilman
Department of Microbiology, Wroclaw Medical University, Wrocław, Poland
Paweł Krzyżek & Grażyna Gościniak
Department of Surgery Teaching, Wroclaw Medical University, Wrocław, Poland
Dorota Pawełka
Department of Pharmaceutical Microbiology, Medical University of Lublin, Lublin, Poland
Izabela Korona-Glowniak
Department of Gastroenterology with Endoscopic Unit, Medical University of Lublin, Lublin, Poland
Halina Cichoz-Lach
Instituto Nacional de Saúde Dr. Ricardo Jorge, Lisboa, Portugal
Monica Oleastro
Instituto de Patologia e Imunologia Molecular da Universidade do Porto, Porto, Portugal
Ceu Figueiredo, Jose C. Machado & Rui M. Ferreira
Instituto de Investigação e Inovação em Saúde, Universidade do Porto, Porto, Portugal
Ceu Figueiredo, Jose C. Machado & Rui M. Ferreira
Faculdade de Medicina da Universidade do Porto, Porto, Portugal
Ceu Figueiredo & Jose C. Machado
Department of Pancreatic, Biliary and Upper Digestive Tract Disorders, A. S. Loginov Moscow Clinical Scientific Center, Moscow, Russia
Dmitry S. Bordin
Department of General Medical Practice and Family Medicine, Tver State Medical University, Moscow, Russia
Dmitry S. Bordin
Department of Propaedeutics of Internal Diseases and Gastroenterology, A.I. Yevdokimov Moscow State University of Medicine and Dentistry, Moscow, Russia
Dmitry S. Bordin
Department of Faculty Therapy and Gastroenterology, Omsk State Medical University, Omsk, Russia
Maria A. Livzan
Scientific Research Institute of Medical Problems of the North, Federal Research Centre “Krasnoyarsk Science Centre” of the Siberian Branch of Russian Academy of Science, Krasnoyarsk, Russia
Vladislav V. Tsukanov
Cancer Science Institute of Singapore, National University of Singapore, Singapore, Singapore
Patrick Tan
Cancer and Stem Cell Biology Program, Duke NUS Medical School, Singapore, Singapore
Patrick Tan
Genome Institute of Singapore, Agency for Science, Technology and Research, Singapore, Singapore
Patrick Tan
Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
Khay Guan Yeoh & Feng Zhu
Department of Gastroenterology and Hepatology, National University Health System, Singapore, Singapore
Khay Guan Yeoh
Chris Hani Baragwanath Academic Hospital, Johannesburg, South Africa
Reid Ally
University of the Witwatersrand, Johannesburg, South Africa
Reid Ally
Max von Pettenkofer Institute of Hygiene and Medical Microbiology, Faculty of Medicine, LMU Munich, Munich, Germany
Rainer Haas & Wolfgang Fischer
Hospital Universitario Donostia, San Sebastian, Spain
Milagrosa Montes, María Fernández-Reyes, Esther Tamayo & Jacobo Lizasoain
Department of Gastroenterology, Biodonostia Health Research Institute - Donostia University Hospital, Universidad del País Vasco (UPV/EHU), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), San Sebastian, Spain
Luis Bujanda
Digestive Diseases Unit, Parc Taulí Hospital Universitari. Institut d’Investigació i Innovació Parc Taulí (I3PT-CERCA), Universitat Autònoma de Barcelona, Barcelona, Spain
Sergio Lario, María José Ramírez-Lázaro, Xavier Calvet & Eduard Brunet-Mas
Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas, Instituto de Salud Carlos III, Madrid, Spain
Sergio Lario, María José Ramírez-Lázaro, Xavier Calvet & Eduard Brunet-Mas
Lozano Blesa University Clinic Hospital, Zaragoza, Spain
María José Domper-Arnal & Sandra García-Mateo
Aragon Health Research Institute, Zaragoza, Spain
María José Domper-Arnal & Sandra García-Mateo
Miguel Servet University Hospital, Zaragoza, Spain
Daniel Abad-Baroja
Hospital General de Granollers, Barcelona, Spain
Pedro Delgado-Guillena
Gastroenterology Department, Hospital Clínic of Barcelona, Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas, Facultat de Medicina i Ciències de la Salut, Universitat de Barcelona, Barcelona, Spain
Leticia Moreira
Hospital Universitari de Bellvitge, L’Hospitalet de Llobregat, Barcelona, Spain
Josep Botargues
Department of Gastroenterology, Hospital Universitario Central de Asturias, Oviedo, Asturias, Spain
Isabel Pérez-Martínez & Eva Barreiro-Alonso
Diet, Microbiota and Health Group, Instituto de Investigación Sanitaria del Principado de Asturias, Oviedo, Asturias, Spain
Isabel Pérez-Martínez
Pharmacology Group, Instituto de Investigación Sanitaria del Principado de Asturias, Oviedo, Asturias, Spain
Eva Barreiro-Alonso
Instituto Universitario de Oncología del Principado de Asturias, Oviedo, Asturias, Spain
Eva Barreiro-Alonso
Hospital General Universitario Gregorio Marañón, Madrid, Spain
Virginia Flores
Gastroenterology Unit, Hospital Universitario de La Princesa, Instituto de Investigación Sanitaria Princesa, Madrid, Spain
Javier P. Gisbert
Universidad Autónoma de Madrid, Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas, Madrid, Spain
Javier P. Gisbert
Gastroenterology Department, Hospital Universitario de Navarra, Pamplona, Navarra, Spain
Edurne Amorena Muro
Hospital de Leon, Leon, Spain
Pedro Linares & Laura Alcoba
Institute of Biomedicine, University of León, Consortium for Biomedical Research in Epidemiology and Public Health, Leon, Spain
Vicente Martin
Instituto de Investigación Sanitaria INCLIVA, Hospital Clínico Universitario de Valencia, Valencia, Spain
Tania Fleitas-Kanonnikoff
Biochemistry Department, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
Hisham N. Altayeb
Faculty of Medical Laboratory Science, Sudan University of Science and Technology, Khartoum, Sudan
Hisham N. Altayeb
Centre for Translational Microbiome Research, Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Solna, Sweden
Lars Engstrand
University of Skövde, Skövde, Sweden
Helena Enroth
Institute for Infectious Diseases, University of Bern, Bern, Switzerland
Peter M. Keller
Clinical Bacteriology/Mycology Unit, University Hospital Basel, Basel, Switzerland
Peter M. Keller
Institute of Medical Microbiology, University of Zurich, Zürich, Switzerland
Karoline Wagner
Clinic for Gastroenterology and Hepatology, University Hospital Zurich, Zürich, Switzerland
Daniel Pohl
College of Medicine, National Taiwan University, Taipei City, Taiwan
Yi-Chia Lee, Jyh-Ming Liou & Ming-Shiang Wu
Medical Microbiology Department, Cerrahpasa Medical Faculty, Istanbul University-Cerrahpasa, İstanbul, Türkiye
Bekir Kocazeybek & Suat Sarıbas
General Surgery Department, Cerrahpasa Medical Faculty, Istanbul University-Cerrahpasa, İstanbul, Türkiye
İhsan Tasçı & Süleyman Demiryas
Medical Pathology Department, Cerrahpasa Medical Faculty, Istanbul University-Cerrahpasa, İstanbul, Türkiye
Nuray Kepil
Lawrence General Hospital, Lawrence, MA, USA
Luis Quiel
Carson Tahoe Regional Medical Center, Carson City, NV, USA
Miguel Villagra
White River Medical Center, Batesville, AR, USA
Morgan Norton & Deborah Johnson
Department of Medicine, Stanford University, Stanford, CA, USA
Robert J. Huang & Joo Ha Hwang
Albert Einstein College of Medicine, Bronx, NY, USA
Wendy Szymczak, Saranathan Rajagopalan, Emmanuel Asare, William R. Jacobs Jr. & Haejin In
Rutgers Cancer Institute, New Brunswick, NJ, USA
Haejin In
Georgia Cancer Center’s Biorepository, Augusta University, Augusta, Georgia
Roni Bollag & Aileen Lopez
Augusta University Medical Center, Augusta, Georgia
Edward J. Kruse & Joseph White
Section of Gastroenterology and Hepatology, Department of Medicine, Baylor College of Medicine, Houston, TX, USA
David Y. Graham & Yoshio Yamaoka
Enteric Diseases Laboratory Branch, Division of Foodborne, Waterborne, and Environmental Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
Charlotte Lane, Yang Gao & Patricia I. Fields
Gi Care for Kids, LLC, Children’s Center for Digestive Healthcare, LLC, Atlanta, GA, USA
Benjamin D. Gold
University of Puerto Rico Comprehensive Cancer Center, San Juan, Puerto Rico
Marcia Cruz-Correa & María González-Pons
University of Puerto Rico Medical Sciences Campus, San Juan, Puerto Rico
Marcia Cruz-Correa
Gastrointestinal and Other Cancers Research Group, Division of Cancer Prevention, National Cancer Institute, Rockville, MD, USA
Luz M. Rodriguez
Department of Endoscopy, Cho Ray Hospital, Ho Chi Minh City, Vietnam
Vo Phuoc Tuan, Ho Dang Quy Dung & Tran Thanh Binh
Department of Hepatogastroenterology, 108 Military Central Hospital, Hanoi, Vietnam
Tran Thi Huyen Trang & Vu Van Khien
Sequencing Facility Bioinformatics Group, Bioinformatics and Computational Science Directorate, Frederick National Laboratory for Cancer Research, Frederick, MD, USA
Xiongfong Chen & Yongmei Zhao
Cancer Research Technology Program, Frederick National Laboratory for Cancer Research, Frederick, MD, USA
Castle Raley, Bailey Kessing & Bao Tran
Center for the Evolutionary Origins of Human Behavior, Kyoto University, Inuyama, Japan
Yukako Katsura
Instituto de Ciencias Biomédicas (ICBM), Facultad de Medicina, Universidad de Chile, Santiago, Chile
Patricio Gonzalez-Hormazabal
School of Life Sciences, University of Warwick, Coventry, UK
Xavier Didelot
Department of Biology & Biochemistry, University of Bath, Bath, UK
Sam Sheppard
Departamento de Genética, Ecologia e Evolução, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Eduardo Tarazona-Santos & Roxana Zamudio
National Institute on Minority Health and Health Disparities, Bethesda, MD, USA
Leonardo Mariño-Ramírez
Division of Microbiology, Department of Biology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Steffen Backert
Institute of Experimental Internal Medicine, Otto-von-Guericke-Universität Magdeburg, Magdeburg, Germany
Michael Naumann
Laboratory of Experimental Medicine and Pediatrics, Faculty of Medicine and Health Sciences, University of Antwerp, Antwerp, Belgium
Annemieke Smet
Department of Molecular Microbiology, Washington University School of Medicine, St. Louis, MO, USA
Douglas E. Berg
Genomics and Health Area, FISABIO – Public Health, Valencia, Spain
Álvaro Chiner-Oms
CIBER in Epidemiology and Public Health, Madrid, Spain
Álvaro Chiner-Oms & Iñaki Comas
Tuberculosis Genomics Unit, Instituto de Biomedicina de Valencia, Consejo Superior de Investigaciones Científicas, Valencia, Spain
Iñaki Comas & Francisco José Martínez-Martínez
Quadram Institute Bioscience, Norwich, UK
Roxana Zamudio
National Institute of Infectious Diseases, Tokyo, Japan
Koji Yahara
Center for Advanced Biotechnology and Medicine, Rutgers University, New Brunswick, NJ, USA
Martin J. Blaser
New England Biolabs, Ipswich, MA, USA
Tamas Vincze, Richard D. Morgan & Richard J. Roberts
Bacterial Pathogenesis and Antimicrobial Resistance Unit, National Institute of Allergy and Infectious Diseases, Bethesda, MD, USA
John P. Dekker
Unidad de Investigación en Enfermedades Infecciosas y Parasitarias, UMAE Pediatría, Instituto de Seguro Social, Mexico City, Mexico
Javier Torres
Veterans Affairs Tennessee Valley Healthcare System, Nashville, TN, USA
Timothy L. Cover
Department of Genetics, SOKENDAI University, Mishima, Shizuoka, Japan
Mehwish Noureen
Pathogen Genome Bioinformatics and Computational Biology, Research Institute for Medicines, Faculty of Pharmacy, Universidade de Lisboa, Lisboa, Portugal
Filipa F. Vale
Instituto de Biosistemas e Ciências Integrativas, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
Filipa F. Vale
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Joshua L. Cherry
Division of International Epidemiology and Population Studies, Fogarty International Center, National Institutes of Health, Bethesda, MD, USA
Joshua L. Cherry
Faculty of Information Science and Technology, Hokkaido University, Sapporo, Japan
Naoki Osada
Department of Molecular Oncology, Chiba University, Chiba, Japan
Masaki Fukuyo
Bioinformation and DDBJ Center, National Institute of Genetics, Mishima, Shizuoka, Japan
Masanori Arita
Research Center for Micro-Nano Technology, Hosei University, Tokyo, Japan
Ichizo Kobayashi
National Institute for Basic Biology, National Institutes of Natural Sciences, Aichi, Japan
Ikuo Uchiyama

Authors

Kaisa Thorell
Zilia Y. Muñoz-Ramírez
Difei Wang
Santiago Sandoval-Motta
Rajiv Boscolo Agostini
Silvia Ghirotto
Roberto C. Torres
Daniel Falush
M. Constanza Camargo
Charles S. Rabkin

Contributions

Manuscript conception and design: K.T. and Z.Y.M.R. Analysis supervision: K.T., S.G., and D.F. Data analysis: K.T., Z.Y.M.R., D.W., S.S.M., R.B.A., S.G., and R.C.T. Interpretation of results: K.T., Z.Y.M.R., S.S.M., R.B.A., S.G., R.C.T., D.F., M.C.C., and C.S.R. Manuscript writing: K.T., Z.Y.M.R., D.W., R.C.T., D.F., and M.C.C. Data coordinator: D.W. Editing of the manuscript: S.S.M., R.B.A., S.G., HpGP Research Network and C.S.R. Sample acquisition: HpGP Research Network. Conception and design of the HpGP initiative: M.C.C. and C.S.R. HpGP study coordinators: M.C.C. and C.S.R.

Corresponding author

Correspondence to Kaisa Thorell.

Ethics declarations

Competing interests

J.P.G. has served as a speaker, consultant, and advisory member for or has received research funding from Mayoly, Allergan, Diasorin, Gebro Pharma, and Richen. E.B.-M. has served as a speaker and consultant for Janssen, Chiesi, Kern and Takeda. R.M.F., J.C.M., and C.F. own patent WO/2018/169423 on microbiome markers for gastric cancer, and R.J.R. works for New England Biolabs, a company that sells research reagents, including restriction enzymes and DNA methyltransferases, to the scientific community. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Description of Additional Supplementary Files

Supplementary Dataset 1

Supplementary Dataset 2

Supplementary Dataset 3

Supplementary Dataset 4

Reporting Summary

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Thorell, K., Muñoz-Ramírez, Z.Y., Wang, D. et al. The Helicobacter pylori Genome Project: insights into H. pylori population structure from analysis of a worldwide collection of complete genomes. Nat Commun 14, 8184 (2023). https://doi.org/10.1038/s41467-023-43562-y

Received: 05 September 2023
Accepted: 13 November 2023
Published: 11 December 2023
DOI: https://doi.org/10.1038/s41467-023-43562-y
