Glycosaminoglycan biosynthesis pathway in host genome is associated with Helicobacter pylori infection

Helicobacter pylori is a causative pathogen of many gastric and extra-gastric diseases. It has infected about half of the global population. There were no genome-wide association studies (GWAS) for H. pylori infection conducted in Chinese population, who carried different and relatively homogenous strain of H. pylori. In this work, we performed SNP (single nucleotide polymorphism)-based, gene-based and pathway-based genome-wide association analyses to investigate the genetic basis of host susceptibility to H. pylori infection in 480 Chinese individuals. We also profiled the composition and function of the gut microbiota between H. pylori infection cases and controls. We found several genes and pathways associated with H. pylori infection (P < 0.05), replicated one previously reported SNP rs10004195 in TLR1 gene region (P = 0.02). We also found that glycosaminoglycan biosynthesis related pathway was associated with both onset and progression of H. pylori infection. In the gut microbiome association study, we identified 2 species, 3 genera and several pathways had differential abundance between H. pylori infected cases and controls. This paper is the first GWAS for H. pylori infection in Chinese population, and we combined the genetic and microbial data to comprehensively discuss the basis of host susceptibility to H. pylori infection.


Association analysis on H. pylori infection status and DOB measurements.
In order to identify genetic loci involved in H. pylori infection, we constructed two models in GWAS after standard quality control (QC) processes. Logistic regression was performed in 224 cases and 256 controls and linear regression was performed in 224 cases. The scree plot of principal component analysis (PCA) was shown in Supplementary  Fig. S2a. The top 2 components could explain most of the variations of the population and no clear population stratification was observed ( Supplementary Fig. S2b). The variant-based associations showed no genome-wide significant single nucleotide polymorphisms (SNPs) (P < 5e−8). There were still SNPs reached suggestive level (P < 1e−5) (Fig. 1), 10 SNPs for logistic regression and 59 SNPs for linear regression (Supplementary Table S1 and Supplementary Table S2). The Q-Q plots of both associations were shown in Supplementary Fig. S3. The inflation factor was 1.00445 for case/control analysis and 1.01076 for DOB values association, indicating there were no population stratification for the analyses.
Then we further investigated the regions harbor suggestive significant SNPs. In case/control association, two signals were located in genes LINC02011 (Fig. 2a) and CACNA1B (Fig. 2b) regions, another signal was located in intergenic region between NEDD1 and RMST (Supplementary Table S1 and Supplementary Fig. S4). Copy number variation in LINC02011 region has been reported to be associated with autism spectrum disorder 18,19 . This region also harbored another non-coding gene FGD5-AS1, it was reported to regulate gastric cancer cell proliferation and chemoresistance 20 . Protein encoded by gene CACNA1B was involved in calcium channel activity. This gene has been found recurrently mutated in 2 or more tumor tissues of 5 gastric cancer patients with H. pylori infection 21 . In linear regression based GWAS, 59 SNPs passed the nominal significance level (P < 1e−5). The signals were located in genes SLC25A3P1, DMRTB1, ST6GALNAC3, KIDINS220, DPP10, CELF2, CCDC77, FTO, ATP2C2-AS1 and LARGE1 regions and some intergenic regions. Their summary statistics were shown in Supplementary Table S2 and Supplementary Fig. S5. Among these regions, FTO (Fig. 2c) was strongly associated with obesity and BMI 22 . Obesity has been reported to be associated with H. pylori infection 23 . It is also a risk factor for gastric cancer 24 . One study found that gastric cancer tissues had high FTO expression as compare with the adjacent non-cancerous tissues 25 . LARGE1 gene (Fig. 2d) was involved in immune response for disease pathogens 26 . Set-based association for further identification of susceptibility genes and pathways for H. pylori infection. To improve GWAS power, set-based analyses were performed, which combined the GWAS summary statistics of all variants within a putative biologically functional unit (coding region and possible regulatory region, genes within one pathway) to obtain a single P value that represented the significance level of the associations between H. pylori infection and the functional unit. Our SNPs were mapped onto 25,928 genes for gene-based association. The top associated genes for case/control comparison were LOC105376331 and LINC02011 (gene-level P < 1e−04). And the most significant genes for linear regression were LARGE1 and LINC00561 (gene-level P < 1e−04). They failed to pass genome-wide gene-based significance threshold (0.05/25928). Then we further mapped these genes to canonical pathways from molecular signature database 27,28 .
164 pathways were included in the analyses. Top pathways associated with H. pylori infection (P < 0.05) were listed in Tables 2 and 3. In the logistic model-based pathway association, the glycosphingolipid biosynthesis related pathway (glycosphingolipid biosynthesis-lacto and neolacto series) was the most significant pathway. www.nature.com/scientificreports/ One study reported that after H. pylori eradicated successfully, glycosphingolipid biosynthesis-lacto and neolacto series pathway was significantly altered 29 . In the linear model-based pathway association, the most significant pathway was antigen processing and presentation pathway. Glycosaminoglycan biosynthesis of heparan sulfate (HS) pathway was significant in the both two models, it was a pathway correlated with bacterial adherence 30

Replication of previous reported H. pylori infection related genes.
We have checked the two signals reported in previous GWAS in European population 9 . The rs10004195 in gene TLR1 was replicated in  www.nature.com/scientificreports/ the linear association study (P = 0.02), of which the minor allele T has reduced effects on H. pylori infection. The SNP rs368433 was filtered out by our QC analysis because it is a low-frequency SNP in Asian (dbSNP: minor allele frequency (MAF) = 0.009). The SNP rs10004195 was located in a region which included multiple TLRs (TLR10, TLR1 and TLR6). TLRs play crucial roles in activating innate immunity. The SNP rs368433 was located in FCGR2A gene region. Further gene-based association combining the information from neighboring SNPs within a region showed that TLRs region was replicated (P < 0.05) based on the linear regression results. FCGR2A was not replicated (P > 0.05).
Differences of gut microbiomes were identified between H. pylori infected groups and controls. Stool samples from 403 individuals were collected and metagenome sequencing was utilized to study the gut microbiome between 185 H. pylori infected cases and 218 controls. In total, we identified 961 taxonomic categories from kingdom to species assigned by MetaPhlAn2 32 . Among them, 782 categories were classified to the kingdom of bacteria and its sub-levels. After removing categories with carriers less than 10%, 301 categories in different taxonomic levels were remained. Then we compared the differences of bacterial abundance between case and control groups at different taxonomic levels. After false discovery rate (FDR) multiple testing adjustment, the abundance of two species were significantly different, Holdemanella biformis (FDR-adjusted P = 0.0579) and Catenibacterium mitsuokai (FDR-adjusted P = 0.0785) ( Table 4). These 2 bacteria categories all demonstrated higher presence ratio in case group rather than control group. H. biformis is one of the species in the family Erysipelotrichaceae which is a key butyrate producing taxa [33][34][35] . The Erysipelotrichaceae was associated with inflammation-related gastrointestinal diseases 36 . As for C. mitsuokai, its abundance would increase in mice after intervention with a high-fat and highsugar diet 37 , and it was closely related to unhealthy fasting serum lipid profile in obese women 38 . This species is the only known species from the genus Catenibacterium. The abundance of three genera were significantly different, 'Erysipelotrichaceae noname' , (FDR-adjusted P = 0.0212), Catenibacterium (FDR-adjusted P = 0.0349), Prevotella (FDR-adjusted P = 0.0861) ( Table 4). 'Erysipelotrichaceae noname' is an unidentified genus annotated by MetaPhlAn2 32 below the family Erysipelotrichaceae mentioned above. Prevotella is one of the bacteria producing acetate while acetic acid participating in cholesterol synthesis 39,40 . The abundance of these categories were higher in the H. pylori infected groups. We didn't observe significant differences between H. pylori infected cases and controls in terms of alpha and beta diversities of bacteria. The species classified below these three genera were listed in Supplementary Table S3.  www.nature.com/scientificreports/ Furthermore, we identified 37 H. pylori infection specific metabolic pathways which have different abundance in the case and control groups through logistic regression at nominal significance level (P < 0.05) (Supplementary Table S4). These results suggested that pathways involved in amino acids biosynthesis (e.g.: L-isoleucine, L-ornithine were observed in H. pylori infection negative individuals, while L-lysine, L-lysine biosynthesis, S-adenosyl-L-methionine cycle I, L-citrulline metabolism and L-histidine degradation pathways had higher abundance in H. pylori infection positive groups), nucleotides metabolism (e.g.: pyrimidine related pathways were enriched in H. pylori infection negative group, while purine related pathways were enriched in H. pylori infection positive groups.), and other energy metabolites (e.g.: pyridoxal 5'-phosphate biosynthesis, glycolysis, homolactic fermentation, mannitol cycle, folate related pathways and peptidoglycan pathways) may be different between H. pylori infection case and control groups. Interestingly, we found that folate biosynthesis pathway was associated with H. pylori infection in both our genetic and microbiome study. It has been reported that folate level was significantly decreased (18.6%) in H. pylori infection individuals than those who were not infected in a low-income population in the United States 31 .

Discussion
In this study, we performed genome-wide association study on 224 H. pylori infected individuals and 256 negative individuals and also compared their differences of gut microbiome with the aim of gaining more insights into the bases and consequences of H. pylori infection. We identified 3 genetic regions which were associated with H. pylori infection status and 14 genetic regions which were related to the bacteria load at suggestive significance level (P < 1e−5). Further pathway-based analysis showed that several pathways were associated with H. pylori infection. Those pathways were related to immune response and glycosylation. One of the two reported loci from previous GWAS analysis was replicated in our data. In gut microbiome analysis, 2 species, 3 genera and several pathways were identified to have different abundance between H. pylori infected cases and controls.
While comparing the results with the reported GWAS about H. pylori infection in 2013 9 , we found that rs10004195 in gene TLR1 was associated with the severity of H. pylori infection (P = 0.02023), but it was not associated with the infection status. TLRs are known to be essential in providing protective immunity against infection. The minor allele of rs10004195 may be associated with less efficiency of anti-inflammatory TLR1-TLR2 signaling 9 . The other reported SNP rs368433 was filtered out by QC in our genetic data due to its low frequency in Chinese population (MAF = 0.009).
One possible cause of these different results may be the distinct methods for identifying H. pylori infection. In our study, we adopted C13 test to detect H. pylori infection. While in the GWAS for European population 9 , they tested anti-H. pylori serum IgG antibody titers. C13 test has higher sensitivity and specificity than anti-H. pylori serum IgG antibody titers test 41 . Although, anti-H. pylori serum IgG antibody titers could represent previous infection, it can hardly represent the current bacteria load in an individual. Besides, C13 test is a more convenient handling method. Therefore, we have chosen C13 breath tests in our study to measure both onset and severity of H. pylori infection. In addition, the population ancestry differences could be another cause for the variation of GWAS results. SNP rs368433 is common in European population and significantly associated with H. pylori infection 9 . However, it is a low frequency SNP (MAF = 0.009) in Chinese population, and it was filtered out in our QC processes. We also checked the adjacent SNPs whose position are close to rs368433, they were not associated with H. pylori infection in our data. The true causal variant could still be hidden in this genomic region, but due to different linkage disequilibrium (LD) structure among European and Chinese populations, the alleles tagging true causal variants may be different. Future fine mapping will be required to extend the region and dig the hidden information.
Another difference for H. pylori infection between Chinese and European population is that they carry distinct H. pylori strains 4 . Almost all Asian H. pylori strains contain the complete cag-pathogenicity island (PAI) which was associated with more severe disease in H. pylori infected individuals 4 .
When comparing our results to existing functional studies, the functions of our newly identified genes and pathways were highly relevant to H. pylori infection. FGD5-AS1 regulates cancer gastric cell proliferation 20 . CAC-NA1B had accumulated somatic mutations in gastric tumor with H. pylori infection 21 . Genetic variants associated with obesity were also ranked top in our results. FTO is one of the leading genes associated with obesity and BMI 42,43 . Studies also found that gastric cancer tissues had high FTO expression compared with the adjacent noncancerous tissues 25 . Obesity was considered as a risk factor for H. pylori infection 44 . Interestingly, differences of obesity related bacteria were also identified in our gut microbiome analysis. C. mitsuokai was enriched in H. pylori infection positive group, which was also enriched in obese women 38 . The associations among obesity, genetics of obesity, obesity related bacteria and H. pylori infection were reported pair wisely, but their causal relationships are still largely unknown. It is reported that H. pylori persistent infection had a negative effect on the fall of BMI values in Chinese obese population 45 . This comorbidity between H. pylori infection and obesity may be caused by the pleiotropic effects of FTO and other genes. We also found that glycometabolism-related genes and pathways were associated with H. pylori infection. LARGE1 encoded a glycosylase, which played an important role in host-pathogen interactions 46 . Our pathway-based associations identified that glycosphingolipid biosynthesis and glycosaminoglycan biosynthesis were associated with H. pylori infection. Glycosphingolipids (GSLs) are components of cell-surface or basement membrane-associated proteoglycans produced by both epithelial and mesenchymal cells 47 . They are known to play a role as receptors in pathogen invasion 48 . Glycosaminoglycans have been shown to be recognized by H. pylori outer-membrane proteins and to be involved in the adhesion of the bacteria in cell-line models 30 . In the functional pathway enrichment of microbiome association study, we found that glycolysis related pathways were correlated with H. pylori infection. Many studies found that H. pylori can metabolize glucose by both oxidative and fermentative pathways [49][50][51][52][53] . H. pylori infected gastric epithelial cells have exhibited increased glycolysis and increased expression of Lon protease 1 (Lonp1), Lonp1 is a key www.nature.com/scientificreports/ regulator of metabolic reprogramming in H. pylori-induced gastric carcinogenesis 54,55 . We also identified two immune-related pathways, including antigen-processing and presentation pathway as well as autoimmune thyroid disease pathway. Study has reported that H. pylori affects the antigen presentation activity through microRNA 56 .
H. pylori infection also has immune evasion strategies. Therefore, it was as expected that the genetic variations of immune-related genes may affect the susceptibilities of H. pylori infection. While extensive number of studies have been focused on the alterations of gastric microbiota 29 since H. pylori mainly colonizes the stomach 57 , little is known about the impact of H. pylori on gut microbiota. We have performed association analyses between H. pylori infection status and gut microbiome using shotgun metagenome sequencing. Compared to existing 16S rRNA gene sequencing for gut microbiota, shotgun sequencing has more power to identify less abundant taxa than 16S sequencing 58 . We identified 2 species, 3 genera and several pathways had differential abundance between H. pylori infected cases and controls. Interestingly, we found that folate biosynthesis pathway was associated with H. pylori infection in both genetic and microbiome studies. Reduction of vitamin B12 and folate concentrations has been reported in H. pylori-infected patients 59 . Folate acid was synthesized by bacteria 60 . Altogether, it is suggested that H. pylori infection could alter gut microbiome and thereby influence the nutrition supply of the host. More metabolomic and nutrition studies may be required to study the consequences of H. pylori infection.
There are some limitations of current study. The sample size was not large enough to boost the signals to genome-wide significance. We have leveraged on the set-based association by combining the SNPs within a biological meaningful set and generated a summed statistic for the whole set. Our set-based association identified glycosaminoglycan and immune-related pathways, which were functional relevant with H. pylori infection. We also studied the gut microbiome differences between H. pylori infected cases and controls, the results showed more supporting evidence that the glycosylation related function was involved in H. pylori infection. Together, we provided a list of candidate genes, pathways and bacteria associated with H. pylori infection for further replication and validation.

Methods
Sample recruitment and data collection. The samples of current study were selected from a cohort study of Shenzhen local area. Individuals who had C13 test and no symptoms of any gastrointestinal conditions were selected. C13 test is used for detecting H. pylori infection. It involves taking C13 isotope-labeled urea molecules by mouth, and then measuring the changes in carbon dioxide in the exhaled carbon by an isotope mass spectrometer. The main measurement index was the δ value, which measured the 13  Whole genome sequencing. The whole blood drawn from the participant vein was stored in the ethylenediaminetetraacetic acid (EDTA) anticoagulant tubes to avoid hemolysis, while the plasma was obtained by centrifugation (3000 rpm, 10 min) and was preserved at -80 ℃ until assay. The white cells were isolated for genomic DNA extraction. Whole-genome sequencing (WGS) to 30× were conducted on BGI-seq500 sequencer. WGS data were aligned and variants called by the Picard/BWA 61 /GATK 62 pipeline. SNPs with mapping quality greater than 40, sequencing depth greater than 4, variant quality greater than 2.0, Phred score of Fisher's test P value for stand bias smaller than 60.0, haplotype score smaller than 13.0 and distance of alternative allele from the end of reads greater than 8.0 were kept for following analyses.
GWAS quality control. The common practice of GWAS QC is conducted through two dimensions of the data: samples and variants. All QC assessments and successive filtering were done using PLINK1.9 63 . Samples had more than 1% missing genotyped SNPs, different genetic sex with the record in the phenotypic database, abnormal autosomal heterozygosity were removed. Sex conflict and abnormal heterozygosity may be caused by DNA sample contamination. One of paired samples who have genetic relationship within three degree of relatedness was filtered out. SNPs had high rate of missing genotypes, and deviated from the Hardy-Weinberg equilibrium (HWE) test (P ≤ 1e−05) as well as SNPs on X and Y chromosomes and mitochondria were also removed. After variant QC, 5,169,391 common SNPs (MAF > 0.05) were including for the association analyses. PCA was performed using SNPs on autosomal chromosome by PLINK1.9 63 to investigate population stratification. No clear sub-cluster was observed. Typical north to south Grandaunt was demonstrated by the first principal component (PC). Linear and logistic regression added top two PCs as covariates.
Single-variant-based association. When phenotypes and genotypes have passed all the QC processes, the statistical analysis of disease associations can be performed. The basic analysis of genome-wide association data is a series of single-locus statistical tests, examining each SNP independently for association to the disease status 64 . The types of statistical tests utilized mostly depend on the types of disease phenotypes. For continuous phenotype, DOB values of the C13 test results, additive model into linear regression was performed in 224 cases to identify genes associated with severity of H. pylori infection. While for binary trait, the case/control status, additive model into logistic regression was utilized in 224 cases and 256 controls to search for genes related to susceptibility of H. pylori. According to the scree plot of PCA ( Supplementary Fig. S2), the top 2 PCs could explain most of the variations of the population and no clear population stratification observed ( Supplementary  Fig. S2) 63 . The Manhattan plots were generated in R 3.6.1. Regional plots were generated by LocusZoom tool 65 (http:// locus zoom. org/).

Gene/pathway-based association.
The standard analysis of genome-wide association study uses single SNP marker as the test unit. However, due to small sample size, the complex LD structures and ethnic differences among different populations, many replication studies have failed. To improve GWAS power, the genebased association has been proposed 66 . The gene-based analysis combined the summary statistics of all variants within a putative gene (coding region and possible regulatory region) to obtain a single P value that represents the significance level of diseases and genes. Similarly, variants within genes on the same pathway were combined for pathway-based analysis. Gene-based and pathway-based analyses were performed using GCTA-fastBAT tool 67,68 . In gene-based analysis, a gene region was defined as ± 8 kb of UTRs of a gene.
Metagenomics sequencing. Fresh stool samples were collected from recruited volunteers for metagenomics sequencing. The fecal DNA extractions were processed following the MetaHIT protocol, then single-end metagenomics sequencing was performed using BGISEQ-500 platform. The low-quality reads were discarded, and the host DNA were removed based on human genome reference (hg38) by SOAP2 69 (version 2.22). Taxonomic analysis was performed using MetaPhlAn2 32 following removal of human reads. The relative abundance of species and genera were used in the current study. Meanwhile, gene family abundance and pathway abundance of each sample were profiled using HUMAnN2 70 with default parameters. HUMAnN2 70 reported the abundance of gene families from the UniProt Reference Clusters (UniRef90), which were further mapped to microbial pathways from the MetaCyc metabolic pathway database. Gut microorganisms belonging to Kingdom Bacteria were extracted for correlation analysis between H. pylori infection and microbial abundance. Bacteria were excluded if they have less than 10% carriers in the population. After filtering, the abundance of each bacterium was rescaled into percentage within an individual. Then, Wilcoxon test was used to compare the abundance of bacteria between H. pylori cases and controls. The P values at the same taxonomic levels were adjusted by FDR. Alpha diversity assessed the bacteria diversity within individuals was compared in different H. pylori infection status groups using Wilcoxon test. Beta diversity illustrating distance between individuals in terms of bacterial composition was demonstrated as principal coordinates analysis (PCoA) plot. Alpha and beta diversity analyses were conducted in R 3.6.2 with vegan 71 , phyloseq 72 and microbiome packages 73 . Additionally, we dichotomized the abundance data to represent whether a bacterium is present in an individual. Logistic regression adjusted for age and sex was then used to analyze the associations between the presence of bacteria and the status of H. pylori infection. FDR method was used for multiple testing adjustment. In pathway analysis, pathways present in over 10% of samples were selected for subsequent analyses.
Ethnics statement. The study protocol was reviewed and approved by the Institutional Review Board of BGI (No. BGI-IRB 16068). Review procedures in BGI-IRB meet GCP principles and relevant national regulations for this application. Written informed consent was obtained from all the study participants prior to their enrolment.

Data availability
The data supported the findings of this study has been deposited into CNGB Sequence Archive (CNGB) 74 of China National GeneBank DataBase (CNGBdb) 75  www.nature.com/scientificreports/