Variations in cag pathogenicity island genes of Helicobacter pylori from Latin American groups may influence neoplastic progression to gastric cancer

Helicobacter pylori (HP) colonizes the human stomach and induces acute gastritis, peptic ulcer disease, atrophic gastritis, and gastric adenocarcinoma. Increased virulence in HP isolates derives from harboring the cag (cytotoxin-associated genes) pathogenicity island (cagPAI). We analyzed the microvariants in cagPAI genes with the hypothesis that they may play an important role in determining HP virulence. We tested DNAs from cagA positive patients HP isolates; a total of 74 patients with chronic gastritis (CG, N = 37), intestinal metaplasia (IM, N = 21) or gastric cancer (GC, N = 16) from Mexico and Colombia. We selected 520 non-synonymous variants with at least 7.5% frequency in the original sequence outputs or with a minimum of 5 isolates with minor allele. After adjustment for multiple comparisons, no variants were statistically significantly associated with IM or GC. However, 19 non-synonymous showed conventional P-values < 0.05 comparing the frequency of the alleles between the isolates from subjects with gastritis and isolates from subjects with IM or GC; 12 of these showed a significant correlation with the severity of the disease. The present study revealed that several cagPAI genes from Latin American Western HP strains contains a number of non-synonymous variants in relatively high frequencies which could influence on the clinical outcome. However, none of the associations remained statistically significant after adjustment for multiple comparison.


Results
Performance of genome-wide sequencing. We prepared sequence libraries and performed whole-genome sequencing on 92 HP strains, including reference strain 26695. Details of the sequencing outcome are included in supplementary table 1. We obtained a total number of reads per samples between 53,427 and 1,366,065. Genomes were assembled, and we found that the percentage of reads that aligned to the reference sequence ranged between 0% and 60.51%. We excluded from further analysis 3 genomes for which no reads could be aligned, 9 where cagPAI was absent, 2 strains isolated from the same patient and reference strain 26695 that was analyzed only as quality control (data not shown).
The coverage of the whole genome for the 74 samples selected for further analysis was on average 39.52×(range 6.31×-114.45×), and coverage for the cagPAI was 85.31x on average (range 11.08×-466.87×). The number of reads that aligned to the cagPAI per sample ranged between 3201 and 43462. cagA gene was missing in 11 of 92 sequenced strains and cagγ in one strain.
Variability by gene. We summarized the genetic variability detected in the twenty-six cagPAI genes in Table 1. For each gene we computed the degree of variability as the number of sites showing a different genotype compared with the reference strain out of the total number of nucleotides in the gene, both as synonymous and non-synonymous variations (causing therefore differences at amino acid level). We assessed a range of variability from 9.54% in cagF to 31.22% in cagA at DNA level calculated as ratio of polymorphic position to gene length, while the amino acid variability ranged from 1.8% in the cagE gene, which is the minimum variation that we found (with the exception of the analyzed region of cagY in which we did not find non-synonymous variations), to 17.82% in cagC calculated as ratio of non-synonymous polymorphisms to number of amino acid in the translated protein. The number of polymorphisms identified for each gene are summarized in Table 1.
Comparison of frequencies of polymorphisms showing a differential distribution between gastritis and iM or Gc cases. When comparing the frequency of the 520 selected alleles between the isolates from subjects with gastritis and isolates from subjects with IM or GC we found statistically significant differences in allele distribution for three polymorphisms in cagA gene (Q/K427R, N467G and V1041T), three in cagC gene (V22I, V37I, I45V), one in the cagE gene (K981E), one in cagL gene (S10F), one in cagX gene (G11N), one in cagS (G146D), one in cagζ (S35A), three in cagδ (V353I, P406L, N407E) and one in cagβ (N125A) ( Table 2). Furthermore 4 polymorphisms in cagA (V52I, G65R, S194F and Q/R427K) showed a significant trend with grade of the disease.
When IM and GC were analyzed separately and compared with the non-atrophic gastritis, 7 SNPs showed a marginally (P < 0.05) statistically significant association when comparing gastritis vs. IM and 10 when comparing gastritis vs. GC (Table 3). In the cagA gene 3 of these variants were associated with risk of IM (V52I, S194F and Q/R427K) and one with GC (N467G), as shown in Table 3. In the cagC gene we detected three SNPs with a differential distribution between gastritis and GC (V22I, V37I, I45V, see Table 3). In cagL gene polymorphism (S10F) showed a differential allelic distribution between isolates derived from IM cases and gastritis cases (OR = 0.14; 95% CI 0.04-0.46, P = 0.002) (Fig. 1, Table 3). For the cagX gene polymorphism G11N showed a differential distribution between cases of IM and gastritis (OR = 0.20; 95% CI 0.06-0.71, P = 0.011) (Fig. 1, Table 3). For cagζ one polymorphism (S35A) showed a differential allelic distribution between isolates derived from IM cases and gastritis cases (OR = 8.80; 95% CI 2.29-33.84, P = 0.001) (Fig. 1, Table 3). In cagδ two adjacent polymorphisms www.nature.com/scientificreports www.nature.com/scientificreports/ showed a differential distribution between cases of cancer cases and gastritis in particular the association for P406L showed an OR = 7.97, 95% CI 2.03-31.27 and for N407E OR = 10.29 95% CI 2. 45-43.15. Additionally, the allelic distributions of these two polymorphisms were significantly different between Mexican and Colombian samples (data not shown), namely the variant allele frequencies were extremely low in Colombia, therefore the associations were driven by the Mexican cancer cases.
Next, a multiple comparison analysis was performed by applying a Bonferroni-corrected threshold, and none of the above described SNPs showed a P-value lower than the threshold adjusted for this type of analysis of P = 9.6×10 −5 (0.05/520). None of the SNPs reached this study-wise P-value. Supplementary table 2 lists all the polymorphisms observed in 24 analyzed genes. epiYA and cM motif analysis. We also analyzed EPIYA (A, B or C) and CM (cm) motifs distribution in 72 cagA positive sequenced strains [30][31][32] , while two strains were cagPAI positive but lacking the cagA gene. We found a high degree of variability consisting of 12 different patterns, all of the Western Type (supplementary figure 1). Most of the strains (50) presented the A/B/cm/C/cm pattern, followed by the pattern A/B/cm/C/cm/C/cm (in 9 strains), and the pattern A/B/cm/A/B/cm/C/cm/ (in 3 strains). Other less frequent patterns are described in supplementary figure 1. There was no difference in the distribution of the various patterns when compared between the three different disease groups.

Discussion
This study was conducted in Latin American HP strains, in order to identify specific cagPAI micro-variants associated with high-grade gastric lesions. This effort is of high clinical and translational importance as the presence of cagA gene does not predict outcomes of HP infection, particularly in high-risk populations where the majority of the strains carry cagPAI. The present study not only confirmed the extremely high variability in the cagPAI genes, but also pointed to several variants with potential clinical relevance in a few genes for future studies.
Other study have investigated the whole cagPAI 33 however none have performed an extensive analysis of polymorphic variant of each gene in the region.  Table 1. Overview of genetic variability in genes in HP cagPAI. a Number of polymorphic positions differ from number of variants because we found that several indel variants span more than one nucleotide. b non-synonymous variants with at least 7.5% frequencies when compared to the reference sequence.
Overall sequence variability derived from this study for the selected cagPAI genes was consistent with that reported previously using different sequencing techniques 28,34 , and extend the study to a higher number of genes. These results support the signature of diversifying selection through bacterial evolution in the proteins that are surface-exposed and involved in interactions with host molecules 34 . However, the frequencies of amino acid variants in cagA, cagC and cagγ found in this study were substantially higher than those previously reported 28,34 . This may be partially due to a greater number of strains under investigation in our work compared to previous publications. While we cannot completely rule out artifacts from this high throughput platform, such artifacts should affect both synonymous and non-synonymous variants.
In a previous study conducted with amplicon sequencing by 454 for 84 Mexican and 11 Venezuelan samples we reported 10 non-synonymous SNPs with differential allelic distribution between gastritis and gastric cancer at conventional P-values between 0.01-0.05 28 . In the present project that included equal numbers of Mexican and Colombian strains we did not see any disease association with these 10 variants. Although variant frequencies were not markedly different between Mexico and Colombia strains, particularly for variants showing a significant association with IM + GC, we have previously reported important phylogenetic differences between strains of these two countries 29 . These phylogenetic differences may partly explain discordant results between the previous and current studies.
Previous publications reported an association of GC with the presence of variants in position 58 and 59 of cagL protein; in two studies 35,36 the concurrent presence of tyrosine (Y) in amino acid position 58 and glutamic acid (E) in position 59 (Y58E59) compared with the combination aspartic acid (D58) and Lysine (K59), induced more efficiently a shift of gastric integrin a5b1 in the corpus, which has been related with gastric carcinogenesis. In our previous publication we did not observe the Y amino acid in position 58 in any sample, although we did find that carriers of D at this position are at lower risk of GC in comparison with the asparagine (N) carriers 28 . In the current work we confirmed the absence of polymorphism in Y position 58 and the presence of N58D polymorphism. We also observed the E59K polymorphism but we did not find association with either IM or GC in our populations. It should be noted that there was a major difference between our current and previous studies 28 in the composition of geographical origins of the samples. Our former study included Colombian while the current study included Venezuelan strains; this is relevant considering our recent report where we document adaption of HP genome to different Latin American populations 37 . Furthermore, Gorrel and co-worker 38 performed an expanded analysis of this region analyzing the sequence from amino acid 58 to 62 and found significant differences according to the geographical origin; in this sense, we confirmed the predominance of the DKMGE aminoacid sequence.
Some of the variants may warrant further studies as the SIFT program 39 predicts them to be damaging non-tolerant. In particular the cagL S10F variant (located at codon 10) changes from serine (polar) to phenylalanine (non-polar). Interestingly, among the 43 Asian strains recently sequenced, no single S10F variant was found, suggesting that this is a Western-strain specific variant 35 . Thus, overall, the differences observed in these studies are likely to be driven by geographical origins of HP.
The other variant that is considered to be detrimental non-tolerant is located at residue 11 of cagX, exchanging glycine (non-polar) to asparagine (polar). cagX, Vir9 homologous protein, has been found recently to be necessary for the formation of HP pilus 40 , mediating the stabilization of cagT which is a T4SS structural protein 41,42 . www.nature.com/scientificreports www.nature.com/scientificreports/ Thus, cagX mutants prevent cagA biological activities 40 . In this context, our finding of a protective effect of the N allele is compatible with a lesser virulent behavior.
One cagC variant, located at codon 22, replacing valine with isoleucine, was associated with risk of gastric cancer and predicted to be intolerant by SIFT. Valine to isoleucine substitutions have been reported to result in changes in protein structure, kinetics and stability in both bacteria [43][44][45] and humans [45][46][47] . However, the possible function of this polymorphism remains unclear.
The four polymorphisms in genes cagζ, cagδ and cagβ showing a different distribution in IM or cancer cases were predicted to be tolerated by SIFT.
There were several variants at the N-terminal of cagA that showed rather strong (OR > 4.60) associations with high grade lesions (P = <0.025) ( Table 2) as well as a significant trend by type of lesions (P < 0.05) ( Table 3). Some involved significant changes in amino acid characteristics (e.g., Q427R, N467G), although none were predicted to be detrimental by SIFT. cagA N-terminal region (residue 1-884) has recently received intensive research interest owing to its ability to interact with exogenous molecules, including host tumor suppressors, adhesion molecules, inflammatory mediators as well as chemopreventive agents such as curcumin 48,49 . Further characterization of cagA N-terminal region may shed light on potential function of the variants found in this study.
Strengths of this work include the relatively large number of HP samples completely sequenced. This work adds significantly to the number of HP complete genomes publicly available, and it substantially increases the available number of HP genomes from Latin America, spanning also different stages in the natural history of HP infection and progression to gastric cancer. These new data have already been used for an in-depth phylogenetic analysis of Latin American HP genomes 29 . Additionally, the sequences used in the final analysis are of high quality, with an average coverage of about 40×, which is more than enough for a thorough assessment of sequence variation.
On the other hand, we acknowledge some limitations in this study, which was designed as a first stage to screen candidate cagPAI variants to be validated in a larger study, and sample size felt short to obtain reliable risk estimate for high-grade gastric lesions, particularly when IM and cancer were considered separately. In the same vein, the sample was too small to assess combinations of several potentially interesting variants. Also, our study did not include Asian strains that present marked differences in the CagA EPIYA region, and thus our results do not apply to Asian strains. Furthermore, our sequencing platform HiSeq was not suitable to analyze long repeat regions such as those in cagY, which are rather common in Hp bacterial genome. That is a major reason why we limited cagY analysis to its conserved region (Yc). Finally, our data were exclusively based on cultivable HP from gastric biopsies. Little is known as to whether bacteria that are easy to grow in vitro are genetically different from those that are difficult to grow in vitro but able to survive in the human stomach for extended periods of time.
Despite the several limitations discussed above, the present study revealed that several cagPAI genes from Latin American Western HP strains contain a number of non-synonymous variants in relatively high frequencies. Some of these variants warrant further investigation to better understand their clinical significance in larger association studies, as well as experimental studies to elucidate their biological functions, and bioinformatic analysis to gain structural insights of the sequence variants.

Methods
Study population. Strains analyzed in this study were isolated from patients recruited in the context of a multi-centric study based in Latin America 50,51 . Sequences used in the present work are largely overlapping with those reported in a phylogenetic analysis on Latin American HP strains 29 . Patients attended the gastroenterology or oncology services and were subjected to endoscopy for diagnostic purposes. We isolated HP in 92 of these patients and sequenced them, however 18 were dropped due to poor quality or because they did not carry the cagPAI. Samples included in the following analysis were therefore 74 HP clinical isolates from 74 individuals recruited in Colombia (N = 37) and Mexico (N = 37, nine of which have been already sequenced with the 454 technology for 5 genes 28 ). Thirty-seven of these subjects had non-atrophic or atrophic gastritis, 21 intestinal metaplasia (IM) and 16 distal gastric cancer (GC). Table 4 shows pertinent characteristics of the population. For Mexican samples, all the patients signed an informed consent and the study was approved by ethical committees of the Instituto Mexicano del Seguro Social (IMSS) and General Hospital of the Secretaria de Salud (SS), Mexico City, Mexico 28 . For Colombia, the clinical studies where patients were originally recruited were approved by the Ethical and Research Committee of the Instituto Nacional de Cancerología, and all the patients signed an informed consent. This study was approved by the Ethical and Research Committee of the Instituto Nacional de Cancerología 52 . All research was performed in accordance with relevant guidelines and regulations.
Whole-genome HP sequencing. Nextera XT sample preparation kit (Illumina) was used for preparation of the sequencing libraries according the manufacturer's instructions. 1 ng of dsDNA libraries was quantified with Picogreen and used as input for Illumina sequencing. High Sensitivity DNA Kit (Agilent Technologies) was used to verify the fragment length distribution of the libraries using the agilent Bioanalyzer. Sequencing was performed on an Illumina HiSeq. 2500 sequencer with v4 PE125 chemistry at the Genomics and Proteomics Core Facility of DKFZ.
Bioinformatic and statistical methods. Raw sequences with Phred score >30 were analyzed with FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to further assess quality. The resulting sequencing output, in fastq format, was assembled with the "map by reference" function of the Geneious software platform (http://www.geneious.com/), considering the sequence of the 26695 strain (NC_000915.1) as reference. A consensus sequence was determined for cagA (HP0547), cagC (HP0546), cagE (HP0544), cagF (HP0543), cagI (HP0540), cagL (HP0539), c-terminal sequence of cagY (HP0527) (339 nucleotides at the 3' end of the genes encoding for the last carboxyl terminal 113 amino acids of CagY, cagYc) 28 , cagX (HP0528), cagγ (HP0523), and single nucleotide polymorphisms and small insertions and deletions were identified for each gene. For the analysis of cagA gene an additional strategy was used to better analyze the EPIYA and CM motifs (C-MET motif mediate CagA multimerization and membrane targeting) 30,31 . Illumina reads of the cagA gene were extracted and realigned by the "de novo assemble" option of the same software. To exclude potential artifacts in sequencing and to enrich variants with clinical relevance, we selected a total of 520 non-synonymous variants with at least 7.5% frequency in the HP isolates we included in the analysis or with a minimum of 5 isolates with the variant allele. A Bonferroni-corrected threshold (P = 0.05/520 = 9.6×10 −5 ) was used to adjust for multiple comparisons. The variant alleles were determined using strain 26695 as reference. When more than one variant resulted in substitution of the same amino acid, we also analyzed the frequencies of the combined amino acid variant. We used non-atrophic gastritis as the control group to compare frequencies of variants in IM and GC groups or a combined group with the two pathologies, and genotypes at a given locus were dichotomized, a selected variant vs. all other genotypes. P-values for differences in allelic frequencies between the control and IM/GC combined or separately were determined by the Fisher's exact test (2-sided). We also computed odds ratios (OR) 95% confidence interval (CI) for these gastric pathologies using logit estimators to obtain the CI even for the variants with zero frequencies in any category. For the variants that showed an unadjusted Fisher P-value of <0.05, we further tested linear trends of their frequencies across the three histological groups, control, IM and GC, using Mantel-Haenszel chi-square test. These analyses were performed by SAS version 9. The effect of DNA polymorphisms on the predicted proteins was evaluated with a bioinformatics tool: SIFT (Sorting Intolerant From Tolerant) http://sift.jcvi.org 53 .

Data availability
All sequences from Mexican strains are deposited at DDBJ/ENa/GenBank under the bioprojects PRJNA338771 and PRJNA203445. Genome sequences from Colombia are deposited under the bioproject PRJNA352848. GenBank accession numbers for 69 out of 74 analyzed strains are reported in supplementary table 1, submission of the remaining strains to GenBank is ongoing. In the meantime, the data are available upon request to the corresponding author.