Main

Gut microorganisms and humans have evolved in a symbiotic relationship. Humans provide an intestinal environment with resources for microorganisms to live, and gut microbes can provide bioactive molecules that affect human physiology and mediate the impact of dietary and environmental exposures on humans7,8,9,10,11. Gut microorganisms can also protect their host against other pathogenic microorganisms, train the immune system and play other important roles in human health12,13. Although there are some data showing host–microorganism symbiotic relationships14,15,16, genetics-based evidence remains limited. So far, several human genomic loci have been associated with the abundance of several taxa, including well-replicated associations with the LCT and ABO genes1. However, little is known about genetic interaction between the human genome and the gut microbiome, despite the discovery of population-specific strains17. This led us to reason that associations between genetic variants in the human genome and those in the human metagenome can provide functional insights into the host–microorganism symbiotic relationship. To our knowledge, such analyses have not yet been carried out at the whole-genome scale.

Bacterial genomes are known to evolve rapidly. Genomic variation leads to bacterial strains that can differ in fitness, carbohydrate utilization, metabolizing capacity, pathogenicity and other biological properties18. Bacterial structural variations (SVs) are highly variable genomic segments, of variable lengths, that can exert pronounced effects on microbial functionality, increasing bacterial genome plasticity and enabling rapid adaptation to environments19. SVs are common in human gut microbial genomes and there is a large inter-individual difference in microbial SVs between humans20,21,22. Identification of deletion SVs (dSVs; genomic regions that are either detectable or absent in the metagenomic sample) or variable SVs (vSVs; genomic regions whose abundances are highly variable across samples) using metagenomic sequencing has revealed that gut microbial SVs are related to human health20,21,22. Longitudinal analysis has demonstrated that gut microbial SVs show species-specific temporal stability22. This suggests a potential adaptation of gut bacteria to the individual-specific intestinal environment. However, little is known about how human genetics shapes the individual’s intestinal environment and exerts selective pressure on the genetic landscape of the gut microbiome. The limited studies carried out thus far usually focused on bacterial or viral isolates23,24. Genetic association between human genetic variants and microbial SVs may thus help us understand the mechanisms underlying the symbiotic relationship between gut microorganisms and their human host.

In the present study, we carried out a large-scale meta-analysis of genetic associations between human genotypes and microbial SVs in the gut microbiome, involving 9,015 individuals from four Dutch cohorts. Associations significant at the Bonferroni-corrected P < 0.05 level were then replicated in a Tanzanian cohort (n = 279). Follow-up bioinformatics and experimental validation pinpointed causal genes involved in host–microbiome interaction and improved our functional understanding of human genetic regulation of gut microbial genetic diversity.

Heritability of gut microbial SVs

This study involved 9,015 Dutch individuals for whom both metagenomic and host genetic data were available (Fig. 1a). These individuals came from four Dutch cohorts: the Dutch Microbiome Project8 (DMP; n = 7,372), Lifelines-DEEP25 (LLD; n = 981), the 500 Functional Genomics Project26 (500FG; n = 396) and 300-Obesity21 (300OB; n = 266). To replicate associations in individuals with a different genetic background and lifestyle, we involved the 300-Tanzanian cohort (300TZFG; n = 279) as a replication cohort. The analysis workflow is presented in Fig. 1a.

Fig. 1: An overview of the workflow and microbial SV data.

a, The workflow of this GWAS on gut microbial SVs. In this study we integrated data for 9,015 individuals from whom both gut microbial metagenomic and host genetic data are available from four cohorts. Data from a Tanzanian cohort of 279 individuals were included as a replication cohort. We generated gut microbial SV profiles based on metagenomic sequencing, with the dSVs and vSVs then subjected to association analysis with the genotypes of more than 6 million common SNPs in the human genome. The genetic associations are presented in box plots for vSVs, bar plots for dSVs and Manhattan plots for the whole-genome level. Created with BioRender.com. b, The number of common microbial SVs detected in 49 species for GWAS. Each bar represents a species. The y axis refers to the number of common SVs detected in that species. dSVs and vSVs are coloured in green and blue, respectively. c, A pie chart of the 3,552 common SVs, including 1,666 dSVs and 1,886 vSVs, involved in the GWAS.

We used SGV-Finder20 to generate SV profiles. In brief, this method mapped sequencing reads to reference genomes, resolved possible ambiguous read alignments and then split the microbial genomes into bins. The metagenomic coverage of these bins was compared across samples (Methods). SGV-Finder identifies bins with coverage close to 0 in 25–75% of samples as dSVs and bins that show variable coverage as vSVs. SV identification is possible only for gut microbial species with sufficient metagenomic sequencing coverage (Methods). In total, we detected 14,196 SVs in 108 gut microbial species, including 10,265 dSVs and 3,931 vSVs, with 3–379 SVs per species (Extended Data Fig. 1a,b and Supplementary Table 1). The species with the largest number of SVs were Dorea formicigenerans, Dorea longicatena and Blautia wexlerae (Extended Data Fig. 1c and Supplementary Table 1). The number of samples with sufficient coverage to detect SVs ranged from 11 to 7,716 for different species. The abundance of these species collectively accounted for an average of 80.8% of faecal microbiome composition (range 17.8% to 97.1%; Extended Data Fig. 1d). To ensure statistical power for association with host genetics, we selected those vSVs that were detected in at least 10% of samples and dSVs with deletion rates between 5% and 95% (Methods). This resulted in 3,552 SVs including 1,666 dSVs and 1,886 vSVs from 49 bacterial taxa (Fig. 1b,c and Supplementary Tables 1–3).
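The selection of common SVs described above can be illustrated with a minimal sketch. It is not the SGV-Finder implementation: the per-sample encoding (1 = region present, 0 = region deleted, None = species coverage too low for a call), the function names and the inclusive handling of the 5%, 95% and 10% bounds are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def deletion_rate(calls: Sequence[Optional[int]]) -> Optional[float]:
    """Fraction of callable samples in which a dSV region is deleted.

    Hypothetical encoding: 1 = region present, 0 = region deleted,
    None = species coverage too low to make a call in that sample.
    """
    callable_calls = [c for c in calls if c is not None]
    if not callable_calls:
        return None  # no sample had enough coverage for this species
    return callable_calls.count(0) / len(callable_calls)


def detection_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of all samples in which a vSV could be quantified."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def keep_dsv(calls: Sequence[Optional[int]],
             low: float = 0.05, high: float = 0.95) -> bool:
    """Keep dSVs with a deletion rate between 5% and 95% (bounds assumed inclusive)."""
    rate = deletion_rate(calls)
    return rate is not None and low <= rate <= high


def keep_vsv(values: Sequence[Optional[float]],
             min_detected: float = 0.10) -> bool:
    """Keep vSVs detected in at least 10% of samples (bound assumed inclusive)."""
    return detection_rate(values) >= min_detected
```

Under these assumptions, a dSV deleted in 1 of 100 callable samples would be excluded as too rare to test, whereas one deleted in half of the callable samples would be retained.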

To assess the extent to which the gut microbial SVs can be determined by host genetics, we first estimated the heritability of 1,339 out of 3,552 SVs that were present in 1,092 first- or second-degree relative pairs from the DMP cohort. After correcting for species abundance, family-based heritability estimation (h2) revealed one heritable dSV at a false discovery rate of <0.05 (Supplementary Table 4): a 2-kilobase (kb) dSV of F. prausnitzii (577–579 kb) with an estimated h2 of 0.38. In addition, 26 dSVs and 51 vSVs showed nominally significant heritability (P < 0.05), with an average h2 of 0.28 and 0.41, respectively (Supplementary Tables 4 and 5). Next, we compared SV heritability with species abundance heritability and observed an additional effect of host genetics on microbial SV level (Extended Data Fig. 2 and Supplementary Note 1). However, this study still lacks sufficient power for heritability calculation and comparison. Accurate heritability estimations of species abundance and microbial genetic variation would require a much larger sample size and careful experimental design (for example, twin studies).

ABO locus and F. prausnitzii SVs

Next we associated the 3,552 SVs with more than 6 million human single nucleotide polymorphisms (SNPs) per cohort, followed by a meta-analysis. The genetic associations significant at the Bonferroni-corrected P < 0.05 level were all associations between the ABO locus and SVs of F. prausnitzii, including four dSVs and one vSV (Fig. 2, Extended Data Fig. 3a and Supplementary Tables 6 and 7). The strongest association was found between rs635634 and a 2-kb dSV region (577–579 kb) of F. prausnitzii (βmeta = 0.88, Pmeta = 1.21 × 10−45). The SNP rs635634 is located in the ABO gene, which encodes a glycosyltransferase that modifies oligosaccharides on the cell surface and determines the ABO blood group. The ABO locus is one of the few loci that have repeatedly been associated with the abundances of several gut bacteria, including Collinsella species, Bifidobacterium and Faecalibacterium2,6,27.

Fig. 2: A Manhattan plot of genome-wide associations of human SNPs and gut microbial SVs.

GWAS results for dSVs (top) and vSVs (bottom). The x axis shows the genomic position on the human chromosomes (chromosomes 1–22) for both the top and bottom panels. The y axes in both panels show statistical significance as −log10[P] estimated using a linear mixed model by fastGWA. The plotted P values are not adjusted for multiple testing. The red horizontal lines indicate the study-wide significance cutoffs determined using the Bonferroni method: 3.00 × 10−11 for dSV and 2.65 × 10−11 for vSV associations. Significantly associated loci are highlighted in yellow and labelled with the nearby human gene and the corresponding species name.

We further replicated identified associations in the 300TZFG28 cohort, which had distinct genetic background, lifestyle and environmental exposures (Supplementary Note 2). SVs of F. prausnitzii were detected in 201 individuals from 300TZFG, either at similar or different frequencies compared to those observed in the Dutch cohorts (Extended Data Fig. 3b). We detected 156 associations with the ABO locus at a nominally significant level (P < 0.05; Supplementary Table 8). Two F. prausnitzii dSVs, 575–577 and 577–579, showed association with ABO (Extended Data Fig. 3c,d), encompassing both shared signals and population-specific signals.

In addition to the ABO association, our study also yielded 210 independent suggestive associations (clumping linkage disequilibrium r2 < 0.1) at the genome-wide significance P < 5 × 10−8 level: 58 associations with dSVs involving 17 species and 152 associations with vSVs involving 33 species (Supplementary Tables 6 and 7).
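The study-wide cutoffs reported in Fig. 2 are numerically consistent with dividing the genome-wide threshold (5 × 10−8) by the number of SVs tested in each class. The short sketch below reproduces that arithmetic; treating the cutoff as exactly this ratio is an assumption inferred from the reported numbers, and the function name is illustrative.

```python
GENOME_WIDE_ALPHA = 5e-8  # genome-wide significance threshold cited in the text


def study_wide_cutoff(n_svs: int, alpha: float = GENOME_WIDE_ALPHA) -> float:
    """Bonferroni-style cutoff: genome-wide alpha divided by the number of SVs tested.

    Assumption: this ratio is how the study-wide thresholds were derived;
    it matches the values reported in the Fig. 2 caption.
    """
    if n_svs <= 0:
        raise ValueError("n_svs must be a positive integer")
    return alpha / n_svs


dsv_cutoff = study_wide_cutoff(1666)  # 1,666 common dSVs
vsv_cutoff = study_wide_cutoff(1886)  # 1,886 common vSVs

print(f"dSV cutoff: {dsv_cutoff:.2e}")  # 3.00e-11
print(f"vSV cutoff: {vsv_cutoff:.2e}")  # 2.65e-11

# The top ABO association (P = 1.21e-45) clears the dSV cutoff by a wide margin.
print(1.21e-45 < dsv_cutoff)  # True
```

Associations with P values between these cutoffs and 5 × 10−8 correspond to the suggestive associations described above.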

ABO association is dependent on FUT2

ABO genotype determines host blood type, and we further analysed whether the association with ABO SNPs represented the association with ABO-coded blood groups in Dutch samples. Blood groups were imputed using SNP genotype data (Methods). Indeed, all five ABO-associated F. prausnitzii SVs were associated with the host’s ABO blood group (Extended Data Fig. 4 and Supplementary Table 9). The F. prausnitzii 577–579-kb dSV region was more frequent in individuals with blood group A or AB than in individuals with blood group B or O (Pmeta = 1.24 × 10−44, PDMP = 1.03 × 10−32). The association was also dependent on FUT2 secretor status, which determines whether fucosyl precursors of A- or B-antigens are secreted into body fluids and intestinal mucus. The secretor-determining SNP rs679574 itself was suggestively associated with the presence of this dSV (Pmeta = 2.92 × 10−9), and A-antigen presence was associated with the F. prausnitzii 577–579 dSV only in FUT2 secretors (Pmeta = 4.85 × 10−51, PDMP_secretors = 9.39 × 10−37, PDMP_nonsecretors = 0.88; Fig. 3a). After correcting for the population genetic structure of F. prausnitzii (Extended Data Fig. 5a and Supplementary Table 10), F. prausnitzii associations with the ABO locus remained significant (PDMP = 2.24 × 10−32; Supplementary Table 11).

Fig. 3: GalNAc utilization underlies the ABO association with F. prausnitzii SVs.

a, Association between ABO blood type and the presence or absence of genomic segment 577–579 kb (SV 577–579) of F. prausnitzii is dependent on FUT2 secretor status. Individuals were grouped into different groups on the basis of blood types (A versus B or O blood types) and FUT2 secretor status (secretors versus non-secretors). The y axis refers to the fraction of individuals with the 577–579-kb region in the DMP dataset. Unadjusted association P values are reported, based on linear mixed models. b, A scheme of the GalNAc pathway identified in the associated SV region, which is divided into the initial cleavage from A-antigen (step 0) and five key steps of GalNAc utilization (steps 1–5; Supplementary Note 3). agaF + V + C + D genes encode four subunits of the GalNAc PTS system II complex protein. DHAP, dihydroxyacetone phosphate; FBP, fructose-1,6-bisphosphate; F6P, fructose 6-phosphate; GAP, glyceraldehyde 3-phosphate; PTS, phosphotransferase system; TBP, tagatose 1,6-bisphosphate. Created with BioRender.com. c, A phylogenetic tree of the strains used for the culture experiments and organization of genes involved in GalNAc utilization in the SV region. The x axis indicates the base pair position starting from the flanking gene dinB. The different coloured lines indicate the location of the same genes in different strains. The names of strains with and without the SV region are in red and black, respectively. d,e, Growth curves of F. prausnitzii strains with or without SV 577–579 on medium with GalNAc (d) and galactose (e). The x axis refers to time points in hours. The y axis refers to cell density measured as the optical density at 600 nm (OD600nm). Points with bars on the growth curves represent means ± s.d. of three replicates. f, Fold change of gene expression following GalNAc induction compared to glucose induction. The y axis shows the fold change of the expression of the agaC, agaD, agaV and agaF genes in the strains HTF-495 and ATCC 27768 following GalNAc induction relative to glucose induction. Each dot indicates one replicate. Bars and error bars indicate means ± s.e. of three replicates.

The ABO locus was previously associated with the abundance of Faecalibacterium species in a German cohort, with a rather modest effect size (β = −0.14, P = 4.33 × 10−9)6. This association was not replicated in two other studies2,27 or in our cohorts (PDMP_secretors = 0.08; Extended Data Fig. 5b). Notably, we did observe a significant interaction between the blood group and dSV 577–579 (PDMP_secretors = 1.47 × 10−3) on the abundance of F. prausnitzii, suggesting that the ABO association with F. prausnitzii abundance may depend on the presence of the dSV region.

GalNAc pathway in the SV region

A-antigen is an oligosaccharide that can be secreted into intestinal mucus and degraded by carbohydrate-active enzymes of gut bacteria29,30,31. Therefore, we reasoned that the associated SV regions may give F. prausnitzii the capacity to utilize saccharides released from A-antigen as a carbohydrate source. All five ABO-associated F. prausnitzii SVs were modestly correlated with each other (Spearman correlation R > 0.13, P < 0.05; Supplementary Table 12). After adjusting for other associated SVs, the strength of associations decreased, and the association of two dSVs (577–579 and 1154–1155) out of the five SVs remained significant after Bonferroni correction, suggesting that other SVs partially tag the same signal as the top 577–579 dSV (Supplementary Table 13). However, most of the dSVs still showed significant associations, especially the top ABO-associated 577–579-kb dSV region. This means that the 577–579-kb dSV captured most of the signal, but not all. To fine-map the microbial genomic region that captures the causal genes, we isolated F. prausnitzii from human faeces, carried out whole-genome sequencing and selected 12 distinct F. prausnitzii strains. Five strains showed a deletion that overlaps with the top ABO-associated 577–579 segment (Supplementary Fig. 1), expanding this 2-kb dSV region to a 23-kb region. We then used the F. prausnitzii HTF-238 strain with this complete region (2,640–2,663 kb) as the reference for gene characterization.

In this expanded region, we identified 27 genes (Supplementary Table 14). These include genes involved in carbohydrate metabolism, in particular a cluster of genes responsible for the uptake and metabolism of d-galactosamine and GalNAc (Fig. 3b,c). GalNAc sugar is part of the A-antigen encoded by ABO, and it might be used as an energy source for bacteria when it is secreted to mucus32. Specifically, the region contains one gene, GH109, that encodes a glycoside hydrolase that can cleave GalNAc from A-antigen, as well as nine genes involved in five key metabolic steps of downstream GalNAc utilization (Fig. 3b and Supplementary Note 3). Moreover, the region also contains two genes involved in the galactose degradation pathway (the Leloir and tagatose 6-phosphate (T6P) pathways). Other genes and genetic elements in this region, including transcriptional regulators, transposons and several uncharacterized genes, were not likely to be directly involved in carbohydrate metabolism.

Furthermore, we found that this SV region is likely to be a mobile element. By investigating SV sharedness between cohousing individuals, we found evidence to support the transmission of GalNAc-containing strains between people. Moreover, a 4-year follow-up analysis in 119 individuals shows a higher frequency of gain than of loss of GalNAc-containing strains over time (Extended Data Fig. 6a–e, Supplementary Fig. 2 and Supplementary Note 4).

Bacteria can use GalNAc as a carbon source

As multiple genes involved in carbohydrate metabolism were identified in the SV region of F. prausnitzii, we next investigated whether the genes in this region are crucial for bacterial utilization of the specific monosaccharide substrates, including GalNAc, galactose, glucose, lactose, mannose, N-acetylglucosamine, fructose, N-acetylneuraminic acid and 2′-fucosyllactose. All 12 selected F. prausnitzii strains were subjected to growth rate experiments in yeast casitone fatty acids (YCFA) medium with the monosaccharides above as the sole carbohydrate source, and YCFA without a carbohydrate source was used as a negative control.

The GalNAc utilization pathway turned out to be crucial for bacterial growth in the GalNAc medium. Strains lacking the GalNAc pathway could not grow (Fig. 3d), whereas six out of seven strains (except ATCC 27768) with the GalNAc pathway could grow, although HTF-383 exhibited slightly slower growth and reached a similar cell density level at a later time (Extended Data Fig. 7a). In contrast to the findings for GalNAc utilization, all strains were able to grow on galactose, but those with the region containing the Leloir and T6P pathways showed a higher growth rate than those without (Fig. 3e), indicating that these pathways, although not essential, can improve galactose utilization efficiency. The presence or absence of pathways in this region did not show a notable influence on bacterial utilization of other monosaccharides (Extended Data Fig. 7b).

Inversion affects GalNAc gene expression

ATCC 27768 was the only strain harbouring the GalNAc pathway that did not grow in the GalNAc medium. However, the GalNAc region is inverted in ATCC 27768 (Fig. 3c), and this genomic inversion may result in dysfunction of this pathway. Thus, we carried out a GalNAc induction experiment to investigate the transcription of GalNAc genes and potential regulators (ptsH, rhaR and immR) in this region. ATCC 27768 was first pre-cultured in a glucose medium, and the resulting bacterial culture was split and transferred to either glucose or GalNAc medium (Methods). We then compared the expression fold change in GalNAc medium to that in glucose medium. The positive control was the close relative strain HTF-495, which can grow in GalNAc medium. The negative control was HTF-441, which lacks the GalNAc utilization gene cluster (Extended Data Fig. 8).

Gene expression of GalNAc genes was not detected in HTF-441, confirming their absence (data not shown). Notably, following GalNAc induction, the expression of three GalNAc uptake genes, agaC, agaD and agaV, was only marginally increased in ATCC 27768, whereas these genes showed a marked increase in HTF-495. For instance, GalNAc induction resulted in a 63.5-fold increase in agaC expression in HTF-495 compared to glucose induction, but in only a threefold change in ATCC 27768 (Fig. 3f). However, the expression of other GalNAc genes showed similar fold changes in ATCC 27768 and HTF-495 (Extended Data Fig. 8). This suggests that genomic inversion of ATCC 27768 affects the expression of only GalNAc uptake genes and not GalNAc metabolism genes.

GalNAc pathway in other taxa

So far, the ABO locus has been associated with the abundances of nine bacterial taxa2,6,27 (Supplementary Table 15), including those of three species: C. aerofaciens, Faecalicatena lactaris and Bifidobacterium bifidum. However, except for those of the genus Collinsella, none of these associations have been replicated in multiple studies. We wondered whether the presence of the GalNAc pathway may explain the ABO association with the abundance of those taxa. We therefore extracted 10,487 assembled genomes of ABO-associated species from the Unified Human Gastrointestinal Genome collection33, including 1,103 assemblies of C. aerofaciens, 484 of F. lactaris, 1,109 of B. bifidum and 7,791 of F. prausnitzii (Supplementary Table 16). We then carried out an orthologue search for the GalNAc pathway genes. We found that GalNAc genes were present in 28–95% of assemblies (Fig. 4a and Supplementary Table 16). However, the complete pathway was found in only 2,678 assemblies (26%), including 1,794 F. prausnitzii strains (23%) and 884 C. aerofaciens strains (80%) (Fig. 4b,c and Supplementary Table 16). The high fraction of GalNAc-pathway-containing strains of C. aerofaciens supports the association between Collinsella abundance and ABO. In accordance with these results, we also confirmed GalNAc utilization capacity for two C. aerofaciens strains (Fig. 4d–g). However, we did not detect the complete GalNAc pathway in B. bifidum genomes, suggesting a potentially different underlying mechanism for B. bifidum associations with human blood type.
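The proportions quoted above follow directly from the reported assembly counts, as the short check below shows. Only counts stated in the text are used; the dictionary layout and helper name are illustrative.

```python
# Assembly counts per ABO-associated species, as reported in the text.
assemblies = {
    "C. aerofaciens": 1103,
    "F. lactaris": 484,
    "B. bifidum": 1109,
    "F. prausnitzii": 7791,
}

# Assemblies carrying the complete GalNAc pathway, as reported in the text.
complete_pathway = {
    "F. prausnitzii": 1794,
    "C. aerofaciens": 884,
}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


total_assemblies = sum(assemblies.values())      # 10,487 assemblies in total
total_complete = sum(complete_pathway.values())  # 2,678 with the complete pathway

print(f"All species: {percent(total_complete, total_assemblies):.1f}%")
for species, n_complete in complete_pathway.items():
    print(f"{species}: {percent(n_complete, assemblies[species]):.1f}%")
```

Rounded to whole numbers, these ratios give the 26%, 23% and 80% stated in the text.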

Fig. 4: GalNAc utilization capacity of strains of F. prausnitzii and other ABO-associated species.

a, Completeness of the GalNAc pathway in four gut microbial species associated with human ABO blood type: F. prausnitzii, C. aerofaciens, F. lactaris and B. bifidum. The upper bar plot shows the number of strains (intersection size), and the combination of black dots in each vertical column underneath represents the presence of the genes in the corresponding step. The bar plot on the left represents the number of strains (set size) containing the genes of each corresponding GalNAc metabolism step. b,c, Proportion of strains with a complete GalNAc pathway in F. prausnitzii (b) and C. aerofaciens (c). d–g, Growth curves of C. aerofaciens strains containing GalNAc pathway genes on medium supplemented with different sugars. The x axis indicates the hours after culturing in the medium. The y axis indicates cell density measured as the OD600nm value. Points on the growth curves represent means ± s.d. of three replicates. YCFA medium and YCFA–glucose (Glc) medium were used as negative and positive controls, respectively. GalNAc and galactose (Gal) were supplied in YCFA medium to test whether C. aerofaciens can grow with the monosaccharides released from A-antigen and B-antigen as their sole carbohydrate sources.

GalNAc utilization supports human health

We further estimated the total abundance of GalNAc genes in the whole microbial community. These GalNAc genes showed a strong intercorrelation, indicating that they are probably present as a gene cluster and function collaboratively. Similarly, the abundance levels of GalNAc genes were associated with the ABO blood type in FUT2 secretors (Extended Data Fig. 9 and Supplementary Table 17). The significance observed at the gene level was much stronger than the association with the F. prausnitzii SV region, with the lowest P value of 4.19 × 10−223 observed for lacC (Extended Data Fig. 9 and Supplementary Table 17).

We further reasoned that the abundance of GalNAc genes might be more relevant for human health in individuals with mucosal A-antigens than for those without. To check this, we characterized individuals in our cohorts as having either genetically determined presence or absence of A-antigen in intestinal mucus, based on their ABO and FUT2 genotypes. FUT2 secretors with A-antigens (A or AB blood type) were identified as individuals with mucosal A-antigen, and all others were considered individuals without mucosal A-antigen. In line with our previous findings, the abundance of GalNAc genes showed remarkable differences between individuals with and without mucosal A-antigen. The top associations were found for the lacC gene involved in catalytic step 4 from T6P to tagatose 1,6-bisphosphate (P = 1.30 × 10−280) and the gatYkbaY gene involved in catalytic step 5 from tagatose 1,6-bisphosphate to dihydroxyacetone phosphate or glyceraldehyde 3-phosphate (P = 2.60 × 10−259; Fig. 5a,b and Supplementary Table 17). As many gut microorganisms can have the GalNAc pathway, we further reasoned that the presence of mucosal A-antigen can provide an extra energy source to promote the growth of GalNAc utilizers. In agreement with this, our findings showed that the abundances of GalNAc genes were positively associated with microbial richness and diversity and that these associations were stronger in individuals with mucosal A-antigen (Pheterogeneity < 0.05, I2 > 0.7; Fig. 5c, Extended Data Fig. 10a and Supplementary Table 18). For instance, the correlation between the abundance of the agaF gene and microbial richness was 0.26 (Spearman correlation, P = 1.79 × 10−29) in individuals with mucosal A-antigen but only 0.13 (Spearman correlation, P = 1.13 × 10−16) in individuals without mucosal A-antigen (Supplementary Table 18). We observed similar results after correcting for the presence of the 577–579 dSV and F. prausnitzii and C. aerofaciens abundances.
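To illustrate how a difference between two group-specific correlations can be summarized as a heterogeneity P value and I2, the sketch below applies Cochran's Q to Fisher z-transformed correlations. This is one common approach and not necessarily the procedure used in this study (see Methods). The group sizes are taken from the Fig. 5 caption and may differ from the numbers of individuals actually used for each correlation, and the 1/(n − 3) variance is a simplification borrowed from Pearson correlations.

```python
import math


def correlation_heterogeneity(r1: float, n1: int, r2: float, n2: int):
    """Cochran's Q, I^2 and P value for the difference between two correlations.

    Assumptions (illustrative, not taken from the study's Methods):
    - correlations are Fisher z-transformed with variance 1 / (n - 3);
      for Spearman correlations this slightly underestimates the variance;
    - the two groups are independent.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    w1, w2 = n1 - 3, n2 - 3  # inverse-variance weights
    z_pooled = (w1 * z1 + w2 * z2) / (w1 + w2)
    q = w1 * (z1 - z_pooled) ** 2 + w2 * (z2 - z_pooled) ** 2
    df = 1  # two groups
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_het = math.erfc(math.sqrt(q / 2.0))
    return q, i2, p_het


# agaF abundance versus microbial richness, correlations as reported in the text;
# group sizes from the Fig. 5 caption (assumed to apply to this comparison).
q, i2, p_het = correlation_heterogeneity(r1=0.26, n1=1868, r2=0.13, n2=3866)
print(f"Q = {q:.2f}, I2 = {i2:.2f}, P_heterogeneity = {p_het:.1e}")
```

With these inputs the sketch yields an I2 well above 0.7 and a heterogeneity P value far below 0.05, in line with the criteria quoted for Fig. 5c.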

Fig. 5: Comparison of GalNAc associations between two groups of individuals with or without mucosal A-antigen.

a,b, Comparison of the abundance of GalNAc genes, lacC (a) and gatYkbaY (b), between two groups of individuals (nwithout = 3,866 and nwith = 1,868). The violin plots show ln-transformed gene abundance with units of reads per kilobase million (RPKM). The inner box plots represent summary statistics: the centre line represents the median, the box hinges represent the lower and upper quartiles of the distribution, whiskers extend no further than 1.5× interquartile range from the hinges, and data beyond the end of the whiskers are outliers plotted as individual points. Unadjusted P values are reported, based on linear mixed models. c, Comparison of the correlations of GalNAc metabolism gene abundance with gut microbiome α-diversity and richness between two groups: individuals with mucosal A-antigen (x axis; n = 1,868) and those without (y axis; n = 3,866). Each dot represents a Spearman correlation coefficient between a GalNAc gene and the Shannon index (green) or richness (blue). The error bars indicate the confidence interval of the correlation R estimation based on s.e. d,e, The association between GalNAc metabolism gene abundance and host phenotypes in individuals with (d; n = 1,868) and without (e; n = 3,866) mucosal A-antigens (unadjusted P values are estimated from linear regression). Hash symbols indicate group-specific significant associations. Asterisks indicate significant associations shared by the two groups (Bonferroni-corrected P < 0.05). Exact P values are listed in Supplementary Table 19. Positive associations are in red. Negative associations are in blue. The colour gradients reflect the effect size. BMI, body mass index; BristolType, Bristol stool type; GeneralHealth, general health score; HDL, high-density lipoprotein cholesterol; TG, triglyceride levels.

Similarly, we associated the abundances of microbial GalNAc genes with 240 environmental exposure and health-related parameters in individuals with and without mucosal A-antigen. At the Bonferroni-corrected P < 0.05 level, we detected 50 significant associations in the A-antigen presence group and 17 associations in the A-antigen absence group. Notably, microbial GalNAc gene abundances were significantly associated with blood glucose, Bristol stool type and general health only in individuals with mucosal A-antigen (linear regression, Bonferroni-corrected P < 0.05, Pheterogeneity < 0.05; Fig. 5d,e, Extended Data Fig. 10b and Supplementary Table 19). Although we observed 11 significant associations between GalNAc genes and blood triglycerides and high-density lipoprotein in both groups, the effect sizes were larger in individuals with mucosal A-antigen than in those without (Pheterogeneity < 0.05; Extended Data Fig. 10b).

Discussion

We carried out a genome-wide association study (GWAS) between host genetics and gut microbial SVs in 9,015 individuals from four Dutch cohorts. We found that the human ABO-encoded A blood group is strongly associated with a genomic fragment in F. prausnitzii harbouring a GalNAc metabolism gene cluster. This association was replicated in a Tanzanian cohort. Strain culture experiments showed that the GalNAc pathway is essential for utilization of GalNAc as a carbohydrate source, which may explain the previously observed associations between the ABO locus and the relative abundances of F. prausnitzii and C. aerofaciens.

Several studies have been carried out linking microbial abundance with host genetics in small- or medium-sized cohorts of up to several thousand samples, and genetic effects on microbial abundance were generally found to be small2,3,4,5,6,27,34,35,36,37,38. Although several attempts have been made to extend this to microbial functionality level, these analyses were based on the annotations of metabolic pathways, which are far from complete. Our study demonstrates that associations of host genetics with bacterial SVs can help pinpoint putative causal genes and close the gap from species abundance to functionality. Notably, our study included taxonomic abundance as a covariate in the association analyses to identify associations with specific SV regions that are independent of taxa abundance. Our study highlights the importance of moving from taxonomic abundance measurements to bacterial pathways and gene levels for developing a better understanding of the effect of host genetics on the gut microbiome. We have demonstrated this for the ABO locus, where the A or AB blood type coded by the ABO genotype in FUT2 secretors was associated with bacterial GalNAc gene abundances (lowest P = 4.19 × 10−223) and with an SV region containing the GalNAc pathway in F. prausnitzii (P = 4.85 × 10−51), whereas no ABO association was observed with the abundance of F. prausnitzii (P = 0.08) in our cohorts.

In addition to ABO, our analysis also yielded 210 suggestive associations at the genome-wide significance level (P < 5 × 10−8), including genetic variants associated with diabetic neuropathy (rs10773589, located close to the TMEM132D gene) that affected the presence of an Anaerostipes hadrus dSV and variants affecting expression of the FBLN5 gene (encoding fibulin 5, an extracellular matrix protein that may have a role in bacterial adhesion) that were associated with dSVs of Collinsella species.

The association between ABO and the GalNAc pathway was previously observed in a mosaic pig population39. In pigs, the GalNAc pathway was identified in Erysipelotrichaceae species. However, the abundance of Erysipelotrichaceae species in our human cohorts is relatively low, accounting for only 0.05% of the total community on average. We did not detect any associations between ABO and Erysipelotrichaceae or their SVs in our human cohorts. Instead, F. prausnitzii and C. aerofaciens were likely to be the major GalNAc users in the human gut, with 23% of F. prausnitzii and 80% of C. aerofaciens assemblies containing the complete GalNAc pathway. Moreover, in contrast to the findings of the study in pigs, in which the association between ABO and the GalNAc pathway was independent of the FUT2 genotype, the association we observed in humans was strongly dependent on FUT2 secretor status. Our data also suggest that the presence of GalNAc genes in individuals who are genetically predisposed to have secreted mucosal A-antigen may benefit human health. In addition, we found indications that the GalNAc genes can be made dysfunctional through genomic inversion and that they can be transmitted among bacteria and shared between humans.

The ABO blood group has been associated with various complex diseases and traits in humans, such as venous thromboembolism, lipid levels and other cardiometabolic phenotypes, as well as susceptibility to and severity of many infectious diseases including dengue, malaria and severe acute respiratory syndrome coronavirus 2 infection40,41,42. For example, the ABO A blood group has been found to increase the risk of early childhood asthma and Streptococcus pneumoniae infection43; affect the serum level of ICAM-1, a cell-surface glycoprotein typically expressed on endothelial cells and immune cells44; increase the risk of coronary artery disease45; and affect circulating levels of cardiovascular-disease-related proteins46. The widespread relevance of the ABO locus in human health highlights the importance of our human-based microbiome association study. The strong association between ABO and bacterial GalNAc-metabolizing genes, and the link of the latter to microbial diversity and richness, support a new hypothesis that ABO may affect human health through its effect on the gut microbiome, in addition to already known mechanisms. Given this information, it might be beneficial to increase GalNAc-utilizing strains such as F. prausnitzii and C. aerofaciens to increase microbial diversity, which could have a beneficial impact on the general health of individuals with mucosal A-antigen. In line with this, our data also showed that bacterial GalNAc gene abundance is positively associated with human health, depending on the presence of mucosal A-antigen.

Our study represents a framework for investigating the crosstalk between our human ‘first genome’ and microbial ‘second genome’. We acknowledge several limitations in our study. First, we focused on the common dSVs and vSVs in gut microbial genomes, assessed on the basis of the abundance and distribution of short reads mapped along bacterial genomes. Our study did not capture other types of SV, such as inversions and translocations, whose comprehensive identification will require whole-genome resequencing and de novo assembly of short or, ideally, long reads. Nonetheless, we could show that genomic inversion could result in dysfunction of the GalNAc pathway. Second, our study did not include other types of genetic variation, such as single nucleotide variants (SNVs), which have great potential impact on bacterial functionality and host–microorganism interaction. However, analysing genetic associations across the millions of SNVs in the human genome and the hundreds of millions of SNVs in the metagenome would require a much larger sample size. Moreover, functional annotation of SNVs is still challenging. The third limitation of the current study is related to the use of faecal microbiota data to represent the gut microbiome. It is important to note that the microbiome is not entirely the same across the different intestinal compartments, and further investigation into the microbiome of different gastrointestinal tract segments and mucosal layers would provide a more comprehensive landscape of host–microorganism genetic crosstalk47. Fourth, our primary analyses involved only Dutch cohorts, which are very geographically and genetically homogeneous, although we were able to include a Tanzanian replication cohort with a different genetic background, diet and environmental exposure profile. Future work is needed to assess host genetic and microbial genetic associations in more diverse populations to build a better understanding of host–microbiome co-adaptation and co-divergence, as well as to aid in fine-mapping of causal genes.

Methods

Cohort description

DMP

The DMP consists of 8,719 individuals and is part of the Lifelines study, a multidisciplinary prospective population-based cohort study that utilizes a unique three-generation design to examine health and health-related behaviours in 167,729 people living in the northern Netherlands. Lifelines uses a broad range of investigative procedures to assess the biomedical, socio-demographic, behavioural, physical and psychological factors that contribute to health and disease, with a special focus on multi-morbidity and complex genetics48.

Microbiome data generation for the DMP was described elsewhere8. In brief, fresh-frozen faecal samples were collected from participants of the DMP study. Microbial DNA was isolated using the QIAamp Fast DNA Stool Mini Kit (Qiagen) by the QIAcube automated sample preparation system (Qiagen). Metagenomic sequencing was carried out at Novogene, China using the Illumina HiSeq 2000 sequencer. After filtering, 8,534 DMP samples were used for SV calling.

DMP genotype data generation was described previously2. In brief, genotyping was carried out using the Infinium Global Screening Array MultiEthnic Diseases version. Missing genotypes were imputed using Haplotype Reference Consortium (HRC) panel v.1.1 (ref. 49). Only bi-allelic SNPs with imputation quality >0.4, minor allele frequency (MAF) > 0.05, call rate >0.95 and Hardy–Weinberg equilibrium P-value > 10−6 were retained. A total of 7,738 samples had both metagenomic and genotype data after quality control (QC)2. We further removed 349 samples overlapping with the LLD cohort. This resulted in phenotype, metagenomic and genotype data being available for 7,389 DMP samples.
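The variant-level QC thresholds above can be written as a simple filter. The following is a minimal sketch, assuming each variant is summarized by the four statistics named in the text; the `Variant` record and the example identifiers are hypothetical and do not come from the study's pipeline.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    rsid: str         # hypothetical identifier, for illustration only
    n_alleles: int    # 2 for a bi-allelic SNP
    info: float       # imputation quality score
    maf: float        # minor allele frequency
    call_rate: float
    hwe_p: float      # Hardy-Weinberg equilibrium P value

def passes_qc(v: Variant) -> bool:
    """Apply the thresholds quoted in the text to one variant."""
    return (
        v.n_alleles == 2
        and v.info > 0.4
        and v.maf > 0.05
        and v.call_rate > 0.95
        and v.hwe_p > 1e-6
    )

variants = [
    Variant("snp_a", 2, 0.98, 0.21, 0.99, 0.40),
    Variant("snp_b", 2, 0.35, 0.30, 0.99, 0.50),  # fails imputation quality
    Variant("snp_c", 3, 0.99, 0.10, 0.99, 0.50),  # not bi-allelic
    Variant("snp_d", 2, 0.90, 0.02, 0.99, 0.50),  # fails MAF
]
kept = [v.rsid for v in variants if passes_qc(v)]
print(kept)
```

The sample counts in the paragraph are consistent with this workflow: 7,738 samples after QC minus the 349 overlapping with LLD leaves 7,389.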

LLD

The LLD cohort is another part of the Lifelines cohort consisting of 1,539 individuals. Microbiome data generation for LLD was described elsewhere25. Fresh-frozen faecal samples were collected, and DNA was isolated with the AllPrep DNA/RNA Mini Kit (Qiagen, catalogue number 80204). Sequencing was carried out using the Illumina HiSeq platform at the Broad Institute, Boston. A total of 1,135 metagenomic samples passed QC.

Genotyping was carried out using the CytoSNP and ImmunoChip assays, as previously described50, and missing genotypes were imputed using the HRC v.1.1 reference panel49. A total of 984 samples had phenotype, metagenomic and genotype data.

500FG

The 500FG cohort is part of the Dutch Human Functional Genomics Project (DHFGP) and consists of 534 individuals. The metagenomic data generation was described previously26,51. Briefly, DNA was isolated from faecal samples with the AllPrep DNA/RNA Mini Kit, and libraries were sequenced on the Illumina HiSeq 2000 platform. A total of 450 metagenomic samples passed QC and were included in SV calling.

500FG genotype data generation was described previously52. Briefly, genotyping was carried out using the Illumina HumanOmniExpressExome-8 v.1.0 SNP chip. Missing genotypes were imputed using the Genome of the Netherlands as a reference panel53. After QC, 396 samples had phenotype, metagenomic and genotype data.

300OB

300OB is also part of the DHFGP and consists of 302 individuals with body mass index > 27 kg m−2. Metagenomic data generation was described previously26,54 and was carried out using a similar protocol and analysis pipeline to those of LLD. A total of 302 samples had metagenomic data available for SV calling.

300OB genotype data generation was described previously55. In brief, samples were genotyped on the Illumina HumanCoreExome-24 BeadChip Kit or the Illumina Infinium Omni-express chip. Imputation was carried out using the HRC v.1.1 reference panel49. After genotype QC, 274 samples had phenotype, genotype and metagenomic data available.

300TZFG

For replication in non-European individuals, we included 300TZFG, a population cohort of 323 individuals from both rural and urban areas of Tanzania. This study is part of the DHFGP. Metagenomic data generation has been described previously28. Briefly, bacterial DNA was isolated using the AllPrep 96 PowerFecal DNA/RNA kit (Qiagen), and libraries were sequenced on the Illumina NovaSeq 6000 platform. A total of 320 samples passed QC and were available for SV calling.

Host genotype data generation was described previously56. In brief, samples were genotyped on the Global Screening Array SNP chip, and genotype imputation was carried out using Minimac4 with the HRC v.1.1 reference panel. After genotype QC, phenotype, genotype and metagenomic data were available for 279 samples.

QC of metagenomic sequencing data

We removed host-genome-contaminated reads and low-quality reads from the raw metagenomic sequencing data using KneadData (v.0.7.4), Bowtie2 (v.2.3.4.3)57 and Trimmomatic (v.0.39)58. In brief, the data-cleaning procedure included two main steps: raw reads mapped to the human reference genome GRCh37 (hg19) were filtered out; and adapter sequences and low-quality reads were filtered out using Trimmomatic with default settings (SLIDINGWINDOW:4:20 MINLEN:50).

Taxonomic abundance

We estimated the relative abundance of gut microbial species from the cleaned metagenomic reads using two approaches: Kraken2 (v.2.1.2)59 in conjunction with Bracken (v.2.6.2)60, based on the same reference genomes included in the database of SGV-Finder; and MetaPhlAn 3 (ref. 61), based on the MetaPhlAn database of clade-specific marker genes (mpa_v30). The Kraken2–Bracken estimates were used in the GWAS analysis to remove the confounding effect of species abundance, and the MetaPhlAn 3 estimates were used for the gut microbiome diversity and richness calculation.

Metagenomic SV detection

SVs are highly variable genomic segments within bacterial genomes that can be absent from the metagenomes of some individuals and present with variable abundance in other individuals. On the basis of the cleaned metagenomic reads, we detected microbial SVs using SGV-Finder with default parameters. SGV-Finder (v.1) was developed and described previously20 and can detect two types of SV—vSVs and dSVs.

In brief, the SV-calling procedure includes two main steps: resolving ambiguous reads with multiple alignments according to the mapping quality and genomic coverage using the iterative-coverage-based read assignment algorithm and reassigning ambiguous reads to the most likely reference with high accuracy; and splitting the reference genomes of each microbial species into genomic bins and examining the coverage of genomic bins across all samples. For the determination of dSVs within each species, the genomic bins are classified as deleted (coverage close to 0) or retained (coverage close to median coverage of the genome) bins in each sample, and those that are deleted in 25–75% of samples are kept in the analysis as raw dSVs. The raw dSVs that are highly correlated in co-occurrence are further merged into larger SV regions to produce the final dSV profile. For the determination of vSVs within each species, the coverage of genomic bins within each sample is standardized using the Z-score approach. Each bin is then assessed across all samples, and those that are highly variable on the basis of a β′ distribution are kept as raw vSVs. The raw vSVs that are highly correlated in standardized coverage are further merged into large SV regions to produce the final vSV profile.
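The bin-level logic for dSVs and the within-sample standardization for vSVs can be illustrated with a small sketch. This is not the SGV-Finder implementation: the rule "coverage below 10% of the sample median counts as deleted" (`rel_threshold`) and the inclusive 25–75% boundaries are assumptions made only for illustration, and the merging of correlated bins is omitted.

```python
from statistics import mean, median, pstdev

def deleted_calls(coverage, rel_threshold=0.1):
    """coverage[s][b] is the coverage of bin b in sample s.

    A bin is called deleted when its coverage is far below the sample's
    median coverage (rel_threshold is a hypothetical cut-off).
    """
    calls = []
    for sample in coverage:
        med = median(sample)
        calls.append([c < rel_threshold * med for c in sample])
    return calls

def raw_dsv_bins(calls, low=0.25, high=0.75):
    """Keep bins that are deleted in 25-75% of samples."""
    n_samples = len(calls)
    keep = []
    for b in range(len(calls[0])):
        frac = sum(calls[s][b] for s in range(n_samples)) / n_samples
        if low <= frac <= high:
            keep.append(b)
    return keep

def zscore_within_sample(sample):
    """Standardize bin coverage within one sample (used for vSVs)."""
    mu, sd = mean(sample), pstdev(sample)
    if sd == 0:
        return [0.0 for _ in sample]
    return [(c - mu) / sd for c in sample]

coverage = [
    [10, 0, 9, 11, 10],
    [20, 19, 0, 21, 20],
    [5, 0, 0, 5, 6],
    [8, 8, 0, 9, 8],
]
candidate_bins = raw_dsv_bins(deleted_calls(coverage))
print(candidate_bins)
```

In this toy matrix, bin 1 is deleted in two of four samples and bin 2 in three of four, so both fall inside the 25–75% window, whereas bins that are never deleted are discarded.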

To define the genes that belong to the SV region, we expanded the genomic coordinates of SVs 1 kb upstream and downstream, with the genes that overlap with the expanded genomic region considered genes that belong to the corresponding SV.
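As a sketch of this gene-assignment rule, assuming closed genomic intervals on a single contig and hypothetical gene coordinates:

```python
FLANK = 1_000  # 1 kb upstream and downstream, as stated in the text

def genes_in_sv(sv_start, sv_end, genes, flank=FLANK):
    """Return names of genes overlapping the SV expanded by `flank` bp.

    genes is a list of (name, start, end) tuples; intervals are treated as
    closed, which is an assumption of this sketch.
    """
    lo = max(0, sv_start - flank)
    hi = sv_end + flank
    return [name for name, g_start, g_end in genes
            if g_start <= hi and g_end >= lo]

# Hypothetical coordinates, for illustration only.
genes = [
    ("geneA", 3_500, 4_200),    # reaches into the upstream flank
    ("geneB", 6_000, 7_000),    # inside the SV
    ("geneC", 9_500, 10_000),   # beyond the downstream flank
    ("geneD", 1_000, 3_900),    # ends before the upstream flank
]
assigned = genes_in_sv(5_000, 8_000, genes)
print(assigned)
```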

To identify highly variable genomic segments and detect SVs, we used the reference database provided by SGV-Finder, which is based on the proGenomes database (http://progenomes1.embl.de/)62. We called SVs using default parameters in a larger panel of 13,195 samples from 10 datasets: 7 population cohorts (HMP1 (ref. 63), HMP2 (refs. 64,65), DMP8, LLD baseline25,48, LLD follow-up22, 500FG66 and 300TZFG28) and 3 disease cohorts (300OB67, IBD68 and HIV69). This resulted in 10,265 dSVs and 3,931 vSVs. All bacterial species with SV calling were present in at least 75 samples. For the current study, we focused on the four Dutch cohorts for which host genetic data were also available: DMP, LLD baseline, 500FG and 300OB. We removed samples with <5% of SVs called. After sample removal, SV and genotype data were available for 9,015 samples from the four cohorts: DMP (n = 7,372), LLD baseline (n = 981), 500FG (n = 396) and 300OB (n = 266).

SV filtering and normalization

First, we carried out filtering per cohort. Only SVs that were called in >10% of samples were used in the analyses. In addition, we removed dSVs with a MAF (frequency of either deletion or its absence) <5% and with both reference and alternative allele count ≤80 (this number was determined on the basis of the recommendation that the number of cases and controls is >10× the number of predictors in the generalized linear model association test70; see below). Next, we kept only SVs that were present in at least two cohorts. vSV data were normalized using inverse normal rank transformation for the heritability and association analyses.
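The inverse normal rank transformation applied to vSVs can be sketched as follows. The text does not specify the rank offset or the handling of ties, so the Blom offset (c = 3/8) and average ranks used here are assumptions.

```python
from statistics import NormalDist

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def inverse_normal_transform(values, c=0.375):
    """Map ranks to standard-normal quantiles: (r - c) / (n - 2c + 1)."""
    n = len(values)
    nd = NormalDist()
    return [nd.inv_cdf((r - c) / (n - 2 * c + 1)) for r in average_ranks(values)]

transformed = inverse_normal_transform([3.2, 0.1, 10.0])
print(transformed)
```

The transformation depends only on ranks, so a heavily skewed coverage profile and a symmetric one with the same ordering yield identical values.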

Heritability estimation

We estimated SV heritability using the GREML software from the GCTA toolbox (v.1.94.1). We applied the family-based approach71 implemented in GREML on the SV data from the DMP cohort because this cohort has the largest sample size and contains relatives. A total of 7,389 samples with genotype and microbiome data were used for the analysis. To estimate heritability, we used default settings correcting for age, sex, total metagenomic sequencing read number and species abundance. Heritability estimates for species abundance and the corresponding confidence intervals were obtained from ref. 8, which estimated heritability on the basis of family relations in the same DMP cohort.

GWAS and meta-analysis

The manipulation of human genotype datasets was conducted using PLINK (version alpha 2.1). Association analysis was carried out using fastGWA from the GCTA toolbox (v.1.94.1)72, per cohort per SV. For dSVs, we used the generalized linear mixed model-based version of the tool (--fastGWA-mlm-binary)73. In the association analyses, we used a sparse genetic relationship matrix (GRM) created from the full GRM built on genotyped (non-imputed) SNPs with MAF > 5% using GCTA with default options (--make-grm and --make-bK-sparse 0.05). The following covariates were added to the model: age, sex, total metagenomic sequencing read number and centred log ratio (CLR)-transformed species abundance. The total read count was standardized to have a mean of zero and a variance of one. Meta-analysis was carried out using the Metal software (version 2020-05-05)74 with default options (weighting cohort-based P values according to sample size). To control for multiple testing, we applied the Bonferroni-corrected genome-wide significance threshold (5 × 10−8/SV number) and considered association results with P values below this threshold as statistically significant. For dSVs, the P-value threshold was 5 × 10−8/1,666 = 3.00 × 10−11. For vSVs, it was 5 × 10−8/1,886 = 2.65 × 10−11.
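The two numerical steps in this paragraph, the Bonferroni-corrected threshold and the sample-size-weighted combination of cohort P values, can be sketched as below. This illustrates the standard weighted Z-score scheme (weights equal to the square root of each cohort's sample size) and is not the Metal source code; the cohort results in the example are hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist

GENOME_WIDE = 5e-8
_nd = NormalDist()

def bonferroni_threshold(n_svs):
    """Genome-wide threshold divided by the number of SVs tested."""
    return GENOME_WIDE / n_svs

def signed_z(p_value, beta):
    """Convert a two-sided P value and effect direction to a Z score."""
    z = -_nd.inv_cdf(p_value / 2)
    return z if beta >= 0 else -z

def meta_analyse(results):
    """results: list of (p_value, beta, n) per cohort. Returns (z, p)."""
    num = sum(sqrt(n) * signed_z(p, b) for p, b, n in results)
    den = sqrt(sum(n for _, _, n in results))
    z = num / den
    p = erfc(abs(z) / sqrt(2))  # two-sided P value
    return z, p

print(bonferroni_threshold(1666))  # dSV threshold
print(bonferroni_threshold(1886))  # vSV threshold

# Hypothetical cohort-level results: (P value, effect estimate, sample size).
z_meta, p_meta = meta_analyse([(0.01, 0.2, 100), (0.01, 0.3, 100)])
print(z_meta, p_meta)
```

With two equally sized cohorts that agree in direction, the combined Z score is √2 times the single-cohort Z score; effects in opposite directions cancel.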

Association with ABO blood group

We used two approaches to determine the ABO blood group. In the DMP cohort, we determined the blood group on the basis of three variants (rs8176719, rs41302905 and rs8176746), as described previously2. For LLD and 500FG, in which some of these variants were not genotyped, we used a less sensitive approach based on two SNPs, rs8176693 (T allele determines blood group B) and rs505922 (T allele determines blood group O), as reported in previously published papers75,76. Association of blood groups with F. prausnitzii SVs was carried out in R (v.4.1.0) with (generalized) linear mixed models fitted using the R package lme4qtl (v.0.2.2). This package allows a kinship matrix to be included as a random effect to account for sample relatedness. For each cohort, we created a kinship matrix based on a GRM built by GCTA using the function kinship from the R package kinship2 (v.1.9.6). We corrected for the same covariates as in the GWAS as described above. Meta-analysis was carried out using Metal74.
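The two-SNP approach can be illustrated with a simplified haplotype-counting sketch. It assumes that each T allele at rs8176693 tags one B haplotype, each T allele at rs505922 tags one O haplotype, the two never occur on the same haplotype, and any remaining haplotype is A. These are assumptions of the sketch; the exact calling rules are given in the cited papers75,76.

```python
def abo_from_two_snps(n_t_rs8176693: int, n_t_rs505922: int) -> str:
    """Infer ABO blood group from T-allele counts (0, 1 or 2) at two SNPs."""
    n_b = n_t_rs8176693   # haplotypes tagged as B
    n_o = n_t_rs505922    # haplotypes tagged as O
    if not (0 <= n_b <= 2 and 0 <= n_o <= 2) or n_b + n_o > 2:
        raise ValueError("allele counts are inconsistent with two haplotypes")
    n_a = 2 - n_b - n_o   # remaining haplotypes are assumed to be A
    if n_o == 2:
        return "O"
    if n_b and n_a:
        return "AB"
    if n_b:
        return "B"        # BB or BO
    return "A"            # AA or AO

for counts in [(0, 2), (0, 1), (0, 0), (1, 1), (2, 0), (1, 0)]:
    print(counts, abo_from_two_snps(*counts))
```

For the association with the A or AB blood type described in the text, the resulting call would then be collapsed into a binary indicator (A or AB versus B or O).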

Population genetic structure of F. prausnitzii

We calculated an SV-based between-sample microbial genetic dissimilarity based on Canberra distance for each microbial species separately using the vegdist() function of the R package vegan (v.2.6-2) to generate species-specific genetic distance matrices (MSV). We then carried out a principal coordinate analysis based on MSV using the pcoa() function of the R package ape (v.5.6-2), with the negative eigenvalues corrected with Cailliez’s method53.
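A pure-Python sketch of the between-sample dissimilarity is given below. It assumes the convention described in the vegan documentation, in which the Canberra index is averaged over the entries that are non-zero in at least one of the two samples; missing SV values and the subsequent principal coordinate analysis are not handled here, and the SV profiles are hypothetical.

```python
def canberra(x, y):
    """Canberra dissimilarity, averaged over entries that are not both zero."""
    terms = [abs(a - b) / (abs(a) + abs(b))
             for a, b in zip(x, y) if a != 0 or b != 0]
    return sum(terms) / len(terms) if terms else 0.0

def distance_matrix(profiles):
    """Symmetric matrix of pairwise dissimilarities between samples."""
    n = len(profiles)
    return [[canberra(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical dSV presence/absence profiles for three samples.
profiles = [
    [1, 0, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
]
msv = distance_matrix(profiles)
for row in msv:
    print(row)
```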

Phylogenetic tree construction

For the F. prausnitzii strains with SVs containing the GalNAc utilization gene cluster, we first constructed a phylogenetic tree using the RAxML approach based on 81 accurately selected single-copy marker genes77. We then constructed another phylogenetic tree using RAxML (v.8) based on the GalNAc utilization genes located in the SV region78. The phylogenetic trees were converted to between-strain cophenetic distances using the cophenetic() function from the R package stats (v.4.3.0).

The phylogenetic tree shown in Fig. 3c was constructed using CSI Phylogeny 1.4 on the basis of SNPs of whole-genome sequences of the 12 isolates79 and was visualized using the R packages ggtree (v.3.2.1) and gggenomes (v.0.9.9.9000)80.

Cohousing and SV sharing

Cohousing information at the time of faecal sampling is known for 8,880 individuals from the DMP cohort. For this cohort, we removed individuals not cohousing with any other participant and those with no microbial or genetic information. For 2,631 participants, we assessed whether any individual cohousing with them at the time of sampling had F. prausnitzii 577–579. We then used a logistic regression using the presence or absence of 577–579 as a dependent variable and the secretion of A-antigens and the presence of household SV as independent variables to estimate the effect of the presence of SV in the household on SV presence in an individual. We also assessed the possible gain or loss of F. prausnitzii in 338 individuals whose gut microbiome was profiled again after 4 years22. For 119 individuals, F. prausnitzii SV profiles were generated at both time points.

Genomic island prediction

Genomic islands were predicted by SIGI-HMM81 and IslandPath-DIMOB82 as integrated into IslandViewer 4, a computational tool that integrates multiple genomic island prediction methods83. Both SIGI-HMM and IslandPath-DIMOB have been shown to have high overall accuracy, with IslandPath-DIMOB having a slightly higher recall and SIGI-HMM having a slightly higher precision.

Microbial gene annotation

The genes of F. prausnitzii strains and reference genomes used for gut microbial SV calling were annotated using MicrobeAnnotator (v.2.0.5)84 and Bakta (v.1.8.1)85. For the annotation of genes encoding glycoside hydrolase family 109 (GH109) in F. prausnitzii and C. aerofaciens strains, we first obtained 2,113 GH109 protein sequences from CAZy (http://www.cazy.org/GH109_characterized.html)86 and then conducted a homologue search of GH109 genes in the genomes of F. prausnitzii and C. aerofaciens strains using tblastn (v.2.5.0+)87 with the following parameters: -outfmt 7 -evalue 1e-10.

Homologue search in genes involved in the GalNAc pathway

We downloaded 10,487 assembled genomes of ABO-associated species from the Unified Human Gastrointestinal Genome collection33, including 1,103 assemblies of C. aerofaciens, 484 of F. lactaris, 1,109 of B. bifidum and 7,791 of F. prausnitzii. We then used the sequences of genes located in SV 577–579 as queries and carried out a homologue search in the assemblies using tblastn (v.2.5.0+)87 with the following parameters: -outfmt 7 -evalue 1e-10.
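The tabular output produced with -outfmt 7 can be summarized per assembly with a short parser. This sketch assumes the default 12-column layout (query, subject, percentage identity, alignment length, mismatches, gap opens, query start and end, subject start and end, E value, bit score) and treats an assembly as carrying the complete pathway when every query gene has at least one hit; that completeness criterion and the gene and contig names are assumptions for illustration.

```python
def parse_outfmt7(text):
    """Parse BLAST tabular output with comment lines (-outfmt 7)."""
    hits = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        f = line.split("\t")
        hits.append({
            "query": f[0],
            "subject": f[1],
            "pident": float(f[2]),
            "evalue": float(f[10]),
            "bitscore": float(f[11]),
        })
    return hits

def queries_with_hits(hits, max_evalue=1e-10):
    """Set of query genes with at least one hit at or below max_evalue."""
    return {h["query"] for h in hits if h["evalue"] <= max_evalue}

def pathway_complete(found, required):
    return all(gene in found for gene in required)

# Hypothetical output for one assembly.
EXAMPLE = "\n".join([
    "# TBLASTN 2.5.0+",
    "# Query: geneA",
    "# 1 hits found",
    "\t".join(["geneA", "contig_1", "92.5", "310", "23", "0",
               "1", "310", "1200", "2129", "3e-150", "430"]),
    "# Query: geneB",
    "# 0 hits found",
])
hits = parse_outfmt7(EXAMPLE)
found = queries_with_hits(hits)
print(found, pathway_complete(found, ["geneA", "geneB"]))
```

The assembly counts listed above sum to the stated total: 1,103 + 484 + 1,109 + 7,791 = 10,487.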

Protein family search and profiling with shortBRED

We searched the metagenomes for 27 bacterial proteins identified in the SV segment of F. prausnitzii (excluding dinB and HTF-238_02530, which were used as SV region markers and are not located within the SV), including the genes known to be involved in GalNAc metabolism, using the shortBRED toolkit (v.0.9.5)88. We extracted the genes located in the SV and converted the gene sequences to protein sequences, as required by shortBRED. We used the shortBRED tool shortbred_identify.py (v.0.9.5) to identify unique markers for the query genes, using the UniRef90 database (downloaded on 1 November 2021) as a negative control.

Next, the shortbred_quantify.py tool (v.0.9.5) was used to quantify these markers in metagenomes. First, we assessed the association of these gene abundances with the ABO blood group. We log-transformed the RPKM values provided by shortBRED and carried out a linear mixed model analysis using shortBRED gene abundances as outcomes and ABO A or AB blood group as a predictor accounting for sample relatedness using random effects in the lme4qtl package. We also included other covariates as predictors, including age, sex, total metagenomic sequencing read number and CLR-transformed F. prausnitzii abundance, together with four F. prausnitzii dSVs and one vSV found to be associated with ABO in the primary GWAS analysis.

Next, we estimated the association of gene abundance with the α-diversity (Shannon index and richness) of the gut microbiome in DMP by linear regression with the following formula:

α-diversity = SV 577–579 + F. prausnitzii taxonomic abundance + C. aerofaciens taxonomic abundance + gene abundance.
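The two α-diversity outcomes in this model can be computed from a species abundance profile as sketched below. The natural logarithm is assumed for the Shannon index, and the abundance values are hypothetical.

```python
from math import log

def richness(abundances):
    """Number of species with non-zero abundance."""
    return sum(1 for a in abundances if a > 0)

def shannon(abundances):
    """Shannon index H = -sum(p_i * ln p_i) over species with p_i > 0."""
    total = sum(abundances)
    return -sum((a / total) * log(a / total) for a in abundances if a > 0)

# Hypothetical relative abundances for one sample.
profile = [0.40, 0.30, 0.20, 0.10, 0.0]
print(richness(profile), shannon(profile))
```

For a community of four equally abundant species the index equals ln 4, its maximum for that richness, and it falls to zero when a single species accounts for all reads.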

Bacterial strains and growth

The Faecalibacterium and Collinsella strains used in this study were from culture collections (ATCC and DSMZ) and our local strain collection (Department of Medical Microbiology, University Medical Center Groningen, Groningen, the Netherlands). On the basis of the presence or absence of SVs, the following Faecalibacterium strains were selected: F. prausnitzii A2-165 (DSM 17677), F. prausnitzii ATCC 27768, F. prausnitzii HTF-F (DSM 26943), F. prausnitzii HTF-112, F. prausnitzii HTF-495, F. prausnitzii HTF-238, F. prausnitzii HTF-383, F. prausnitzii 60C2, F. prausnitzii HTF-121, F. prausnitzii HTF-133, F. prausnitzii HTF-441 and F. prausnitzii FM4. Two strains of C. aerofaciens were selected on the basis of the presence of the GalNAc genes: C. aerofaciens 4PBA and C. aerofaciens HTF-129.

Strains were cultured in a modified YCFA medium supplemented with different carbohydrates (glucose, galactose, GalNAc, mannose, lactose, fructose, N-acetylglucosamine, 2-fucosyllactose and N-acetylneuraminic acid). YCFA medium was prepared as for YCFA–glucose (YCFAG) medium described before89 without the addition of glucose. YCFA medium was composed of (g l−1) 10 casitone, 2.5 yeast extract, 4 sodium bicarbonate, 0.45 dipotassium hydrogen phosphate, 0.45 potassium dihydrogen phosphate, 0.9 sodium chloride, 0.09 magnesium (II) sulfate heptahydrate, 0.12 calcium chloride dihydrate, 2.7 sodium acetate, 1 cysteine, 5 ml 0.02% resazurin and 0.2% haemin, 1 ml pink vitamin mixture and yellow vitamin mixture, and the liquid medium. The pink vitamin mixture (per 100 ml) contains 1 mg biotin, 1 mg cobalamin, 3 mg p-aminobenzoic acid, 5 mg folic acid and 15 mg pyridoxamine. The yellow vitamin mixture (per 100 ml) contains 5 mg thiamine and 5 mg riboflavin. The liquid medium includes 600 µl l−1 propionate (≥99% purity, Sigma-Aldrich), 100 µl l−1 isobutyrate (≥99% purity, Sigma-Aldrich), 100 µl l−1 isovalerate (≥99% purity, Sigma-Aldrich) and 100 µl l−1 valerate (≥99% purity, Sigma-Aldrich). The medium is adjusted to a final pH of 6.5.

Growth experiments were carried out in a Bactron 600 anaerobic incubator (Kentron Microbiome BV) using a 24-well flat-bottom-plate with total volume of 1 ml per well YCFA medium supplemented with 4.5 g l−1 of the desired carbohydrate source. Cultures were started at an initial OD600nm range of 0.10–0.15 by the addition of an overnight glucose-grown pre-culture, and growth was monitored anaerobically at 600 nm over 24 h at 37 °C. Readings were taken every 2 h, after 10 s shaking, using Epoch 2 (Agilent BioTek Instruments), and growth curves were generated using Gen5 software. Each growth condition was carried out in triplicate using three independent pre-cultures. Data of growth curves are reported as means ± s.d.
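Summarizing triplicate readings as mean ± s.d. at each time point can be sketched as follows. The OD600 values are hypothetical, and the sample standard deviation (n − 1 denominator) is an assumption, as the text does not state which estimator was used.

```python
from statistics import mean, stdev

def summarise(od_by_time):
    """Map each time point (h) to (mean, s.d.) across replicate cultures."""
    return {t: (mean(v), stdev(v)) for t, v in sorted(od_by_time.items())}

# Hypothetical OD600 readings from three independent pre-cultures.
readings = {
    0: [0.10, 0.12, 0.14],
    2: [0.20, 0.25, 0.30],
}
curve = summarise(readings)
for t, (m, s) in curve.items():
    print(f"{t} h: {m:.3f} ± {s:.3f}")
```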

Gene expression analysis of GalNAc induction

Sample collection

The F. prausnitzii strains HTF-495, HTF-441 and ATCC 27768 were selected to test the mRNA expression level of genes on the basis of the shortest distance within the phylogenetic tree. The F. prausnitzii strains were pre-cultured individually in YCFAG medium overnight anaerobically at 37 °C in triplicate. To get enough biomass, these pre-cultures were used to inoculate fresh triplicates of each strain in a ratio of 1:20 (20 ml) and incubated for 24 h anaerobically at 37 °C in YCFAG medium. Each culture was then split into two tubes (10 ml per tube) and centrifuged at 3,000 r.p.m. for 10 min. The supernatants were removed, and the pellets were resuspended in 10 ml YCFAG or YCFA-GalNAc, separately for each culture, giving a total of 18 samples. After 6 h of incubation, a 1:1 ratio (10 ml) of ice-cold killing buffer (20 mM Tris-HCl pH 7.5, 5 mM MgCl2, 20 mM NaN3) was added to the cultures. Samples were centrifuged at 3,000 r.p.m. for 10 min at 4 °C, and the supernatants were removed. The pellets were resuspended in 1 ml TRIzol (Invitrogen) and stored at −80 °C until further RNA isolation.

RNA isolation and cDNA synthesis

For RNA isolation, 200 µl of RNAse-free chloroform was added to each sample and incubated at room temperature for 5 min. After incubation, the samples were centrifuged at 12,000g at 4 °C, and the aqueous phase was recovered into a new tube. To precipitate RNA, 500 µl of RNAse-free isopropanol was added to each sample and mixed briefly. Samples were incubated for 10 min at room temperature and centrifuged for 10 min at 12,000g and 4 °C. The supernatant was removed, and the pellets were washed in 1 ml of 75% RNAse-free ethanol, vortexed briefly and centrifuged for 5 min at 7,500g at 4 °C. The supernatant was removed, and the pellets were air-dried at room temperature for 10 min. Afterward, the samples were resuspended with RNAse-free water.

Finally, DNA contamination was removed from 10 µg of the sample using TURBO DNA-free Kit (Invitrogen). cDNA was generated using the TaqMan Reverse Transcription Reagents (Invitrogen) with random hexamers.

Quantitative PCR

Samples were diluted to working concentration and used as a template for quantitative PCR (qPCR) amplification of the target genes (for primers, see Supplementary Table 20). Each reaction contained 10 μl of GoTaq qPCR Master Mix (Promega), 9 μl of DNA template (10 ng) and two times 0.5 μl primer solution (20 µM) in a total reaction volume of 20 μl. The amplification was carried out in a 7500 Real-Time PCR System (Applied Biosystems). The amplification program comprised two stages: an initial denaturation step at 95 °C for 2 min, followed by 40 two-step cycles at 95 °C for 15 s and at 60 °C for 1 min. At the end of the run, a melting curve analysis was carried out. The cycle threshold (Ct) value was first determined using the 7500 Real-Time PCR System detection system and then adjusted manually to set the threshold within the exponential phase of the curves. All qPCR reactions were carried out in triplicate. The ΔCt values of the genes of interest were obtained by correction for the Ct value of rpoA as the housekeeping gene. Afterward, the different \(2^{-\Delta C_{\mathrm{t}}}\) values of each strain were calculated per condition. These values were used to determine the relative fold change expression of the genes after GalNAc induction compared to growth in glucose.
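The ΔCt calculation and the resulting fold change can be sketched as follows. The Ct values are hypothetical, and averaging the technical triplicates before computing ΔCt is an assumption of this sketch.

```python
from statistics import mean

def delta_ct(ct_target, ct_reference):
    """Ct of the gene of interest corrected for the housekeeping gene (rpoA)."""
    return mean(ct_target) - mean(ct_reference)

def relative_expression(ct_target, ct_reference):
    """2 ** (-delta Ct) for one strain in one growth condition."""
    return 2 ** (-delta_ct(ct_target, ct_reference))

def fold_change(target_galnac, ref_galnac, target_glucose, ref_glucose):
    """Expression after GalNAc induction relative to growth in glucose."""
    return (relative_expression(target_galnac, ref_galnac)
            / relative_expression(target_glucose, ref_glucose))

# Hypothetical technical triplicates for one strain and one target gene.
fc = fold_change(
    target_galnac=[22.0, 22.0, 22.0], ref_galnac=[20.0, 20.0, 20.0],
    target_glucose=[25.0, 25.0, 25.0], ref_glucose=[20.0, 20.0, 20.0],
)
print(fc)
```

In this example ΔCt is 2 in GalNAc and 5 in glucose, so the target gene is 2^3 = 8-fold more highly expressed after GalNAc induction.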

Ethical approval

The Lifelines study was approved by the ethics committee of the University Medical Center Groningen (METc2007/152). All participants signed an informed consent form before enrolment. Additional written consents were signed by the DMP participants or legal representatives for children aged under 18 years. The LLD study was approved by the Institutional Ethics Review Board of the University Medical Center Groningen (ref. M12.113965), the Netherlands. The 300OB study was approved by the IRB CMO Regio Arnhem-Nijmegen (number 46846.091.13). The 500FG study was approved by the Ethical Committee of Radboud University Nijmegen (NL42561.091.12, 2012/550). The inclusion of volunteers and experiments was conducted according to the principles expressed in the Declaration of Helsinki. All volunteers gave written informed consent before any material was taken. The 300TZFG study was approved by the Ethical Committees of the Kilimanjaro Christian Medical University College (CRERC; number 936) and the National Institute for Medical Research (NIMR/HQ/R.8a/Vol. IX/2290) in Tanzania. The Tanzanian cohort provided consent for the use of their data for the purposes of this analysis.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.