It is increasingly recognized that a comprehensive understanding of the physiology and pathology of organisms requires integrated analysis of the host and its multiple microbiota1. In humans, the composition of the intestinal microbiota is associated with physiological and pathological parameters, including HDL cholesterol, fasting glucose levels and body mass index2. In livestock, ruminal microbiome composition is associated with methane production and feed efficiency3. These correlations reflect a complex interplay between host and microbiota that may include direct (causal) effects of the microbiome on the host’s physiology4. Several phenotypes correlated with microbiota composition are heritable5,6. This leads to the hypothesis that the genotype of the host may in part determine the composition of the microbiota, which may in turn affect the host’s phenotype4. It implies that the composition of the microbiota is partially heritable. Although studies in rodents support this7, evidence is less convincing in humans. Initial reports did not reveal a higher microbiota resemblance between monozygotic twins compared with dizygotic twins, suggesting a limited effect of the host genotype8. Better-powered studies provided evidence for a significant effect of host genetics on the abundance of taxa, particularly Christensenellaceae9. Loci that underpin microbiota heritability have remained difficult to identify in humans. Apart from variants that cause persistent expression of lactase (LCT) and are associated with decreased Bifidobacterium abundance, other GWAS loci have proven to be difficult to replicate2,10,11,12,13,14. The analysis of larger human cohorts is needed to gain a better understanding of the genetic architecture of microbiota composition.

To decipher the genetic architecture of intestinal microbiota composition in a large monogastric omnivore, we report the generation of a mosaic pig population and the longitudinal characterization of its intestinal microbiota. We observed a strong effect of the host genotype on microbiota composition and identified a locus with large effect on the abundance of specific taxa by controlling the concentration of N-acetyl-galactosamine in the gut and thereby affecting some of the species that use this metabolite as a carbon source.

A mosaic pig population to study complex traits

We generated a large (>7,500) mosaic population by intercrossing the offspring of 61 F0 founders from four Chinese and four western breeds for more than 10 generations (Supplementary Table 1 and Extended Data Fig. 1). Animals were reared in uniform housing and feeding conditions. We analysed more than 200 phenotypes (pertaining to body composition, physiology, disease resistance and behaviour), obtained transcriptome, epigenome and chromatin interaction data from multiple tissues, and collected plasma metabolome and microbiome data in up to 954 F6 and 892 F7 animals. The F0 animals were whole-genome sequenced at an average depth of 28.4-fold, and the F6 and F7 animals were sequence at an average depth of 8.0-fold. We called genotypes at 23.8 million single-nucleotide polymorphisms (SNPs) and 6.4 million insertion–deletions (indels) with a minor allele frequency (MAF) of ≥0.03 (>1/100 bp). The nucleotide diversity (π) (that is, the proportion of nucleotide sites that differ between homologous sequences in two breeds) between two Chinese breeds and between two European breeds was similar to that between Homo sapiens and Homo neanderthalensis (~3 × 10−3)15, whereas the π between a Chinese and a European breed approached half of that between human and chimpanzee (~4.3 × 10−3)16. The proportion of the eight founder genomes in F6 and F7 ranged from 11.2% to 14.7% at the genome level, and from 4.9% to 22.1% at the chromosome level. The median number of variants in high linkage disequilibrium (LD) (r2 ≥ 0.9) with an index variant was 30, and the median maximal distance with a variant in high LD (r2 ≥ 0.9) was 54 kb (Extended Data Fig. 1).

The intestinal microbiota of the healthy pig

We collected faeces at days 25 (suckling period), 120 (growing period) and 240 (slaughter day), as well as caecal and ileal content (F6 and F7) and caecal and ileal mucosal scrapings (F7 only) at day 240 (7 sample types, 12 data series). We generated 16S rRNA tags (V3–V4) for an average of 426 animals per data series (5,110 samples) (Supplementary Table 2.1). Tags were rarefied and clustered in 32,032 operational taxonomic units (OTUs). A total of 12,054 OTUs amounting to 98.7% of reads were retained and annotated to 41 phyla, 87 classes, 149 orders, 207 families, 360 genera and 150 species. The average microbiota composition of the 12 data series indicated high consistency across F6 and F7, yet marked differences between sample types (Fig. 1a, b and Supplementary Table 2.2). Even at the family level, some taxa were found to be nearly sample-type specific (Extended Data Fig. 2). The proteobacteria Enterobacteriaceae, Pseudomonadaceae and Pasteurellaceae, the firmicutes Clostridiaceae, Peptostreptococcaceae, Bacillaceae and Leuconostocaceae, and the actinobacteria Microbacteriaceae were at least ten times more abundant in ileal samples than in any other sample type. Among those, Leuconostocaceae were nearly digesta specific, whereas Pseudomonadaceae were nearly mucosa specific. The Bacteroidetes Odoribacteraceae and Rikenellaceae were at least ten times more abundant in day 25 faeces samples than in any other sample type. The firmicutes Christensenellaceae were nearly ten times more abundant in faeces samples than in any other sample type. This confirms that limiting the analysis of the intestinal microbiota to adult faeces can provide only a partial view of its complexity and the factors that determine it17. α-Diversity of faeces was lower at day 25 compared with at days 120–240, reminiscent of the enrichment of the intestinal flora between childhood and adulthood in humans8,18. α-Diversity was lower for ileal content than for caecal content and mucosa (Fig. 1c). β-Diversity tended to be inversely proportional to α-diversity, being higher for day 25 than for day 120 and 240 faeces. Variation in pairwise Bray–Curtis dissimilarities was highest for ileal content, which had the lowest α-diversity (Fig. 1d and Supplementary Table 2.3). The microbiota composition of pig day 240 (D240) faeces was more similar to that of human than to that of mouse faeces (Extended Data Fig. 2).

Fig. 1: Intestinal microbiota of the healthy pig.
figure 1

a, Joint principal coordinate analysis (PCoA) of 5,110 16S rRNA profiles. F6 day 25 faeces (D25, mauve), day 120 faeces (D120, red), day 240 faeces (D240, green), ileal content (IC, light blue), caecum content (CC, dark blue) (left). Middle, as described for the left plot but for F7. Right, F7 ileal content (IC, light blue), caecum content (CC, dark blue), ileal mucosa (IM, pink), caecal mucosa (CM, brown). b, The average microbiota composition of the 12 data series. Taxa are coloured by phylum and family within phylum, highlighting 43 families among the top 15 in at least one data series. The names of the corresponding phyla and families are shown in the key. The average composition of 106 human faeces and 6 mouse faeces (C57BL/6) samples is shown. c, α-Diversity values (Shannon’s index) for the 12 data series coloured as in a. Sample sizes are provided in Supplementary Table 2.1. The box plots show the median (centre line), interquartile range (box limits), 1.5× the interquartile range span (whiskers) and outliers (dots). d, β-Diversity values (pair-wise Bray–Curtis distances) for the 12 data series coloured as in a. Distances were computed for all sample pairs. Sample numbers and box plots are as described in c.

Microbiota heritability in mosaic pigs

We first examined the relationship between genetic relatedness (genome-wide SNP identity by state) and microbiota dissimilarity (Bray–Curtis distance, all taxa combined)2. We used two approaches to mitigate confounding of genetics and environment: (1) we restricted the analyses to full-sibling littermates raised in the same environment, and (2) we confronted genetic similarity and microbiota dissimilarity across generations (F6 and F7). Both approaches supported an effect of genetics on microbiota composition manifested by significant negative correlations between genetic similarity and microbiota dissimilarity (Fig. 2a, b).

Fig. 2: Heritability of microbiota composition in mosaic pigs.
figure 2

a, Correlation between genome-wide kinship (Θ) and microbiome dissimilarity (Bray–Curtis distance) within litter. The correlation (Spearman's r) was measured separately for the 12 data series. P values (one-sided) were computed by permutation. Adjusted r values were below the 50th percentile of the permutation values for 11 out of 12 (P = 0.0029). The empirical P value (one-sided) of r was ≤0.05/12 = 0.004 (Bonferroni corrected) for two data series. P values were combined across the 12 data series yielding an overall P value of 3 × 10−4. The number of litters (l) and animal pairs (n) used are given for each data series. b, The correlation between genome-wide kinship and microbiome dissimilarity (Bray–Curtis distance) across generations. We considered all possible pairs of F6 and F7 animals (not including sow–offspring pairs). Analyses were conducted for the five traits measured in both F6 and F7. r, p and n are as described in a. r values were below the 50th percentile of the permutation values for the five analysed sample types (P = 0.03). The empirical P value (one-sided) of r was ≤0.05/5 = 0.01 (Bonferroni corrected) for one sample type. P values for the five sample types were combined yielding an overall P value of 0.013. c, The frequency distribution of heritabilities of individual taxa sorted by sample type (left) or taxonomic level (right). The values obtained by joint analysis of F6 and F7. d, Total heritabilities2 computed by sample type and taxonomic level.

We then evaluated the heritability (h2) of the abundances of individual taxa/OTUs using F6 and F7 jointly (Fig. 2c and Supplementary Table 3.1). We computed total heritabilities (average heritability weighted by abundance)2 separately by sample type and taxonomic level. Total heritabilities were generally low (≤11.8%), yet higher for faecal (average D240, 10.2%) than for content samples (average ileal content, 5.7%) and lower for OTUs (average, 6.6%) than for higher taxonomic levels (average, 7.5%) (Fig. 2d). The correlation between \(\hat{{h}^{2}}\) estimates obtained separately in F6 and F7 was positive and significant for D120 (P = 4.2 × 10−5), D240 (P = 2.2 × 10−16) and caecum content (P = 2.3 × 10−3), supporting genuine genetic effects (Extended Data Fig. 3). We established a list of the 55 most likely heritable taxa/OTUs on the basis of congruent \(\hat{{h}^{2}}\) estimates in F6 (≥0.15), F7 (≥0.15), and across F6 and F7 (≥0.10) (Supplementary Table 3.1). It included the order Campylobacterales in D240 faeces, the species Bacteroides coprophilus in D25 faeces, and 53 OTUs of which two (D240 faeces, caecum content) were assigned to Christensenellaceae, five to Ruminococcaceae (four in D240 faeces, one in caecum content), one (D25) to Ruminococcus, one (D25) to Phascolarctobacterium, and one (D240) to RF32; that is, taxons characterized by h2 ≥ 0.10 in human studies19. With the exception of RF32, all of these belong to the order Clostridiales.

A miQTL affecting Erysipelotrichaceae species

Heritability quantifies the proportion of additive genetic variance but does not foretell genetic architecture: phenotypes with low heritability may be affected by variants with major effects, whereas highly heritable traits may be very polygenic. To gain insights into the genetic architecture of the gut microbiota composition in this population, we performed a genome-wide association study (GWAS). We applied two statistical models testing SNP effects on taxa abundance and taxa presence/absence, respectively11. We ran GWAS separately by sample type, taxon and generation for a total of 57,557 GWAS (involving 8,490 taxa) (Supplementary Table 4.1). This yielded 1,527 genome-wide significant signals (P ≤5 × 10−8). For these, we performed meta-analyses across sample types, separately in F6 and F7 (Extended Data Fig. 4). We identified six signals exceeding the experiment-wide discovery threshold (P = 1.5 × 10−12) in one cohort and the experiment-wide replication threshold (P = 0.007) in the other (Fig. 3a and Supplementary Table 4.24.3). All lead SNPs mapped within 3,037 bp from each other (chromosome 1) and were in high LD (Fig. 3b). They affected two OTUs (OTU476 and OTU327) as well as the genus p-75-a5, to which OTU476 is assigned. All three are Erysipelotrichaceae.

Fig. 3: A miQTL affecting Erysipelotrichaceae species.
figure 3

a, The result of genome-wide meta-analysis (across sample types) in F6 and F7 for OTU476. Reported log-transformed P values are nominal (that is, not corrected for multiple testing). b, Local magnified views of chromosome 1 (272.8–273 Mb) of OTU476 and OTU327 in F6 and F7. log-transformed nominal P values as in a. c, log[1/P] values in F6 (x-axis) and F7 (y-axis) for the association between SNP 1_272907239 and the abundance of 8,490 OTUs for all of the sample types and two analyses methods (abundance and presence/absence, explaining the two OTU476 values) (left). OTUs belonging to p-75-a5 and Erysipelotrichaceae are shown in red and yellow, respectively. Right, comparison of the distribution of association (1_272907239) P values for p-75-a5 and Erysipelotrichaceae OTUs with other OTUs in F6 and F7. Box plots are as described in Fig. 1c. The notches in the boxes correspond to 95% confidence intervals of the median values. Distributions were compared using Wilcoxon’s rank-sum test. P values (nominal) of the comparisons are given above the horizontal lines. c, LD (r2) between the four top SNPs and the 2.3 kb ABO deletion in F6 and F7.

To determine whether this microbiota QTL (miQTL) affects other taxa, we plotted the F6 and F7 association log[1/P] for lead SNP 1_272907239 and the 8,490 studied OTUs. OTU476 and OTU327 stood out as highly significant in F6 and F7 (Fig. 3c). Yet the P values for the 31 other p-75-a5 OTUs and the 83 other Erysipelotrichaceae OTUs were significantly shifted towards lower P values in both cohorts (Fig. 3c), and the sign was consistent with that for OTU476, OTU327 and p-75-a5 (Extended Data Fig. 4). This suggests that the chromosome 1 miQTL affects other species in this family. Notably, the h2 for p-75-a5 and OTU476 deviated significantly from 0 in D240 (0.10 and 0.14) and D120 (0.13 and 0.13) faeces, and h2 estimates for OTU327 in D120 faeces (0.13) (Supplementary Table 4.4).

QQ plots obtained after removing chromosome 1 variants (272.8–273.1 Mb interval) did not show convincing evidence for residual inflation of log[1/P] (Extended Data Fig. 4). Thus, the residual h2 is most likely highly polygenic.

A 3.5-million-year-old deletion in the pig ABO orthologue

All lead SNPs map to the 3′ end of the porcine N-acetyl-galactosaminyl transferase gene underlying the ABO blood group in humans (Fig. 3b). This is a strong candidate that modulates interactions with several pathogens20, and was recently suggested to affect the intestinal microbiota in healthy humans21,22,23. A 2.3 kb deletion encompassing the last exon has been reported in pigs24. We showed that it was in near-perfect LD (r2 ≥ 0.94) with the lead SNPs in our population (Fig. 3d). We confirmed its boundaries and showed that it results from an intrachromosomal recombination between short interspersed nuclear elements (SINEs). We showed using RNA-sequencing (RNA-seq) and expression quantitative trait loci (eQTL) analysis in 300 F7 caecum samples that the deleted allele produces reduced amounts (around 1/3 of normal) of a 7.4 kb transcript (with alternative eighth and ninth exons) encoding a protein that lost 62% of its amino-acids, including 7 out of 8 active sites25 (Fig. 4a). Our results indicate that the 2.3 kb deletion creates a null allele (hereafter, the O allele) and that is a very strong candidate for the causative miQTL mutation (Extended Data Figs. 5 and 6).

Fig. 4: A 3.5-million-year-old deletion in the pig ABO orthologue causes the miQTL.
figure 4

a, The structure of the porcine AO blood group gene. IGV (Integrated Genome Viewer) view of the genotypes of the 61 F0 animals showing 145 variants in a ~5 kb interval spanning the 2.3 kb deletion. The top two red rectangles show the 2.3 kb deletion. Homozygous alternative (light blue), heterozygous (dark blue) and homozygous reference (grey) genotypes are shown. The horizontal blue arrows mark SINEs that have mediated intrachromosomal recombination. The vertical black arrows mark the top variants from Fig. 3b. The effect of the 2.3 kb deletion on acetyl-galactosaminyl transferase transcripts is shown, including the creation of alternative exons 8 and 9, and the reduction of transcript levels to around one-third of normal. b, The effect of the AO genotype (AA, AO or OO) on the abundance of OTU476. The effect of the A allele is dominant over that of the O allele, and the miQTL effect is detected in the caecum (content and mucosa) and in day 120 and 240 faeces samples. Sample sizes are provided in Supplementary Table 4.1. The box plots are as described in Fig. 3d. c, Unweighted pair group method with arithmetic mean dendrogram based on the sequence similarity between 14 AA and 34 OO animals in a 5 kb window centred on the 2.3 kb deletion. CH, Chinese; EU, European; RU, Russian; DOM, domestic pigs; WB, wild boars; PHAC_AFR, common warthog; SUS_VERR, Javan warty pig; SUS_CEB, Visayan warty pig; SUS_SCR_VII, Sumatran wild boar. Breeds: BX, Bamaxiang; EH, Erhualian; LA, Laiwu; LD, Landrace; LW, Large White; PT, Piétrain; TB, Tibetan; WD, White Duroc. d, The peak of reduced population differentiation coinciding with the 2.3 kb deletion (red) in the porcine AO gene (blue). The position on chromosome 1 is shown on the x-axis; and 1/(mean F statistic) for all variants in a 2 kb sliding window is shown on the y-axis. The F statistic was computed as the ratio of the between-breed mean squares and the within-breed mean squares for the dosage of O allele. Chr., chromosome.

We examined the effect of the AO genotype on the abundance of OTU476, OTU327 and p-75-a5. The results showed that (1) the effect of the A allele is dominant over the O allele (Supplementary Table 5.1), and (2) that the effect manifests in D120 and D240 faeces, caecal content as well as mucosa, but not in D25 faeces and D240 ileal content and mucosa (Fig. 4b and Supplementary Table 5.1). In these samples (D120, D240, caecum content, caecal mucosa), the AO genotype explained 7.9%, 3.2% and 6.6% of the variation in the abundance of OTU476, OTU327 and genus p-75-a5 (Supplementary Table 5.2). Notably, the abundance of OTU476 and OTU327 was highest in caecal content (around 0.92% and 0.02% of reads in AA/AO animals, and around 0.47% and 0.003% of reads in OO animals, respectively; Extended Data Fig. 5).

In primates, the ABO locus is under strong balancing selection that has perpetuated identical-by-descent alleles segregating in humans, gibbons and Old World monkeys for tens of millions of years26. We analysed the sequences of the 61 F0 (Sus scrofa domestica), 18 wild boars (9 Asian, 9 European) (Sus scrofa), 1 Indonesian wild boar from Sumatra (Sus scrofa vittatus), 1 Visayan warty pig from the Philippines (Sus cebifrons), 1 Javan warty pig from Indonesia (Sus verrucosus) and one common warthog from Africa (Phacochoerus africanus) in a 50 kb window spanning the AO gene. Asian and European wild boars diverged from a common S. scrofa ancestor around 1 million years ago (Ma), S. scrofa and S. s. vittatus around 1.5 Ma, S. scrofa and S. cebifrons/verrucosus around 3.5 Ma, and S. scrofa and P. africanus around 10 Ma27. The 2.3 kb deletion segregated in all eight F0 breeds, in all Asian and European/American wild-boar populations, and in S. cebifrons (Supplementary Table 5.3). The sequence of the deletion breakpoint was identical in all of the samples, confirming the identical-by-descent nature of the porcine O allele (Extended Data Fig. 7). Consistent with the hypothesis of an ancestral trans-species polymorphism (rather than hybridization), the O allele of S. cebifrons lay outside the cluster of S. scrofa O alleles (Fig. 4c and Extended Data Fig. 7). Further supporting balancing selection, the AO gene showed a marked decrease in population differentiation maximizing at the 2.3 kb deletion (Fig. 4d). Thus, although largely unknown, the underlying selective forces have operated in at least two mammalian branches (primates and suidae), over substantially long periods and broad geographical ranges, pointing towards their pervasive nature (Extended Data Fig. 7).

The miQTL affects caecal GalNAc concentrations

As in humans, the porcine AO gene is broadly expressed, yet is particularly expressed in the intestines (Extended Data Fig. 7). The N-acetyl-galactosaminyl-transferase encoded by the A allele adds N-acetyl-d-galactosamine (GalNAc, α1-3 linkage) to H and Lewis antigens present on glycan substrates, including the heavily glycosylated mucins, which can be used as carbon source by intestinal bacteria20,28,29 (Extended Data Fig. 8). We reasoned that, owing to the abundance of intestinal mucus, the O allele might reduce the concentration of GalNAc in the intestine, thereby reducing the growth of bacterial species that are dependent on this sugar. We measured the concentration of free GalNAc, N-acetylglucosamine (GlcNAc) and N-acetylmannosamine (ManNAc) isomers (hereafter, HexNAc) in the caecum content of 278 D240 animals (124 F7 and 154 Duroc × Landrace × Large White) using liquid chromatography coupled with tandem mass spectrometry (LC–MS/MS). Intestinal HexNAc in pigs comprises approximately equal proportions of the GlcNAc and GalNAc isomers30. Dosage of the A allele (additive model) increased (P = 5.7 × 10−9) caecal HexNAc concentrations (Fig. 5a). Assuming that the increase is due only to GalNAc (given the enzymatic activity of the A allele), AA animals have at least twice the amount of free GalNAc compared with OO animals (Fig. 5a and Extended Data Fig. 9). Non-zero values in OO animals are primarily due to GlcNAc/ManNAc, and are probably also due to non-A antigen host and dietary GalNAc (GalNAc is ubiquitous at the core of all O-glycans, including intestinal mucins, and is a component of glycosaminoglycans such as chondroitin sulfate). Note that the effect of the A allele appears to be dominant on bacterial abundance, yet additive on GalNAc concentrations (P value of likelihood ratio test against the additive model = 0.12). This indicates that the additional increase in GalNAc availability in AA animals (versus AO animals) does not further favour the growth of OTU476, OTU327 and p-75-a5.

Fig. 5: The miQTL acts by increasing GalNAc concentrations and affects GalNAc-using bacteria.
figure 5

a, The effect of AO genotype on GalNAc concentrations in caecal content (nAA = 33, nAO = 118, nOO = 127). Concentrations were corrected for batch effect and scaled between 0 and 1 to equalize residual variance. P values (two-sided and nominal) for genotype contrasts were computed using Wilcoxon’s tests. The box plots are as described in Fig. 1c. b, The correlation between GalNAc concentration and OTU476 abundance within the AO genotype. Area under the curve (AUC) values for GalNAc corrected for batch effect and AO genotype and scaled as described above. The P value (nominal, two-sided) of Spearman’s correlation is given (P = 0.012). The shaded area corresponds to the 95% confidence region for the regression fit. c, Same as in b, with animals coloured by AO genotype. d, The GalNAc transport and catabolic pathway in OTU476-like strains. GalNAc-6-P, N-acetylgalactosamine-6-phosphate; GalN-6-P, galactosamine-6-P; Tag_6-P, tagatose-6-phosphate; Tag-1,6-PP, tagatose-1,6-biphosphate; GAP, glyceraldehyde-3-phosphate; DHAP, dihydroxyacetone-phosphate; 3PG, 3-phosphoglycerate; Pyr, pyruvate; Lact, lactate. Enzymes encoded in the GalNAc operon are shown in blue. Metabolites considered in the metabolic flux analysis are shown in bold. e, The proportion of 13C-labelled metabolites determined by GC–MS in the OTU476-like strain (4-8-110) fed 13C-labelled (red) versus regular GalNAc (green). f, In vivo (germ-free mice) E. coli versus OTU476-like strain competition with and without GalNAc. The proportion of 16S rRNA reads mapping to the 4-8-110 reference rRNA sequence versus E. coli rRNA sequence (that is, 1 minus the proportions shown in the figure correspond to reads mapping to the E. coli 16S rRNA) in the caecum content and faeces of 10 germ-free mice (Kunming line) inoculated by gavage with a pure culture of 4-8-110 and E. coli and force-fed with GalNAc (red bars) versus PBS (green bars). P values (nominal, two-sided, uncorrected) comparing the difference in abundance were determined using Wilcoxon tests.

Showing that the O deletion (1) reduces caecal GalNAc concentrations, and (2) reduces the abundance of OTU476, OTU327 and p-75-a5 does not prove that one causes the other. The effect on bacterial abundance could be mediated by an unidentified mechanism independent of GalNAc. However, the hypothesis that it is the change in GalNAc concentrations that causes the change in bacterial abundance makes a testable prediction. In this case, we may expect to observe a ‘residual’ correlation between GalNAc concentrations and bacterial abundance ‘within’ the AO genotype, that is, by blocking the effect of the AO genotype and exploiting other sources of variation of GalNAc and bacterial abundance (Extended Data Fig. 8). We performed these analyses and observed the predicted positive correlation for OTU476 (P = 0.012) and p-75-a5 (P = 0.010), consistently within the three genotypes (AA, AO and OO). The correlation was positive (and consistent within the three genotypes) albeit not significant (P = 0.3) for the less abundant OTU327 (Fig. 5b, c and Extended Data Fig. 9).

Taken together, these results make a strong case that the AO genotype affects intestinal GalNAc concentrations, and that this affects the abundance of some Erysipelotrichaceae species.

miQTL-responsive bacteria use GalNAc

To be used as carbon source by intestinal bacteria, GalNAc needs (1) to be released from the glycan structures by secreted glycoside hydrolases (GHs), (2) to be imported across the bacterial membranes by dedicated transport systems (TR), and (3) to be converted into intermediates of central metabolism by a specific catabolic pathway (CP). Although some bacteria have both GHs and TR–CP for specific monosaccharides, other bacteria have only the GHs (donors) or the TR–CP (acceptors)28. We reasoned that the bacteria that would respond to the miQTL would have a complete GalNAc TR–CP system (with or without GHs). To test this, we (1) isolated two bacterial strains (4-8-110 and 4-15-1) with a V3–V4 sequence similarity of 100% and 99.8% with OTU476, respectively, and sequenced their genome using the Oxford Nanopore Promethion system, and (2) built 3,111 metagenomic assembled genomes (MAGs) from shotgun data of 92 porcine intestinal samples, including 248 Erysipelotrichaceae. We compiled a list of 24 genes implicated in GalNAc use (TR–CP)28,31,32,33,34,35,36. These encode (1) 11 components of three GalNAc transporter systems (AgaPTS: agaE, agaF, agaV, agaW; TonB dependent transporter: omp, agaP, agaK; GnbPTS: gnbA, gnbB, gnbC, gnbD), (2) two GalNAc-6P deacetylases (agaA, nagA), (3) two galactosamine-6P (GalN-6P) isomerase and/or deaminases (agaI, agaS), (4) three tagatose-6P kinases (pfka, lacC, fruK), (5) four tagatose-1,6-PP aldolases or aldolase subunits (gatY-kbaY, gatZ-kbaZ, lacD, fba) and (6) two regulon repressors (agaR, gntR), for a total of six essential pathway constituents (Fig. 5d and Supplementary Table 6). Genes involved in the use of specific sugars tend to cluster and form operons of potentially coregulated genes (regulons) that support all or most of the essential TR–CP steps. The steps that are not encoded by the operon may be complemented in trans by genes encoding enzymes that may be less substrate-specific37,38. We searched for orthologues of the 24 genes in the two OTU476-like genomes and 3,111 MAGs. We generated two scores to evaluate the ability of bacterial species to use GalNAc. The first (pathway score) counted the number of essential steps in GalNAc use (out of six) that could be accomplished by the orthologues detected in the genome, irrespective of map position. The second (regulon score) counted the number of essential GalNAc-use steps that could be fulfilled by orthologues that were clustered in the genome, that is, forming a potential operon. We used agaS as an anchor gene to establish the regulon score28. The first notable observation was that at least one orthologue of agaS was found in the two (that is, all) OTU476-like strains, in 31% of Erysipelotrichaceae MAGs (n = 248), yet in only 2.9% of other MAGs (n = 2,863). The second was that both scores were significantly higher for Erysipelotrichaceae compared with the other MAGs (Ppathway = 2.0 × 10−16 and Pregulon = 2.0 × 10−16), and for the two OTU476-like strains compared with Erysipelotrichaceae and non-Erysipelotrichaceae MAGs combined (Ppathway = 2.2 × 10−3 and Pregulon = 1.2 × 10−5) (Extended Data Fig. 10). The two OTU476-like strains were characterized by a cluster with eight GalNAc genes, including orthologues of the four components of the AgaPTS transporter system (agaE, agaF, agaV, agaW), of nagA deacetylase, of agaS deaminase/isomerase, of fruK kinase and of the gatZ-kbaZ aldolase subunit (Extended Data Fig. 10). This amounted to a score of five for both pathway and regulon, corresponding, respectively, to the top 4.7% and 0.35% out of 3,113 MAGs, demonstrating the uncommon status of OTU476-like strains with regard to GalNAc use. Neither the 4-15-1 nor 4-8-110 genome encode GHs that are known to have α-N-acetylgalactosaminidase activity specific for the A antigen39,40,41, suggesting that these strains are acceptors only.

To confirm that the OTU476-like strains are able to import and catabolize GalNAc, we grew the 4-8-110 strain in the presence of 13C-labelled GalNAc and checked the appearance of 13C-labelled catabolites in the bacterial pellet using gas chromatography coupled with MS (GC–MS). After 13 h of culture, around 29% of glyceraldehyde-3-P and around 25% of dihydoxyacetone phosphate (the products of the conversion of tagatose-1,6-bisphosphate by gatZ aldolase) were labelled with 13C, demonstrating the use of GalNAc by strain 4-8-110. Approximately 23% of 3-phosphoglycerate and around 17% of lactate molecules but none of TCA metabolites were 13C-labelled, underscoring the predominance of glycolysis over oxidative phosphorylation in this anaerobe (Fig. 5d, e).

We further performed a gavage experiment in germ free mice to test the effect of the addition of GalNAc on the relative growth of 4-8-110 and Escherichia coli in vivo. We gavaged mice either with GalNAc (200 mg kg−1 live weight, yielding caecal GalNAc concentrations comparable to AA pigs; Extended Data Fig. 9) or phosphate-buffered saline (PBS), as well as a mixture of E. coli and the 4-8-110 OTU476-like strain. We euthanized the mice at day 12 and measured the relative abundance of E. coli and 4-8-110 using 16S rRNA sequencing in the caecum and faeces. Strain 4-8-110 was in essence not detectable in PBS-gavaged mice, whereas it accounted for an average of 0.9% of reads in the caecum content (P = 0.0079) and 0.7% of reads in the faeces (P = 0.0097) of GalNAc-gavaged mice (Fig. 5f).

GalNAc operon of miQTL-sensitive bacteria

At least 65 MAGs (including E. coli and other non-Erysipelotrichacea species) contain orthologues for the five key GalNAc TR–CP steps as do 4-15-1 and 4-8-110. We wondered why the chromosome 1 miQTL does not affect these species. The organization of their GalNAc operons may provide a hint. The GalNAc clusters of non-Erysipelotrichaceae species have the features of genuine regulons. The relevant ORFs tend to be adjacent to each other (spanning around 10 kb) and on the same strand, compatible with polycistronic mRNAs enabling coregulated expression. By contrast, the open reading frames (ORFs) of the GalNAc clusters of the two OTU476-like strains and at least one studied Erysipelotrichaceae span around 50 kb and 30 kb, respectively, and are distributed on both strands. Neither genome contained orthologues of agaR or gntR, which are negative regulators of GalNAc regulons and were observed in all other GalNAc-rich MAGs (Fig. 6a and Extended Data Fig. 11). This suggests that, in contrast to E. coli and other species, the OTU476-like strains and some Erysipelotrichaceae cannot sense GalNAc concentrations and induce expression of the genes necessary for GalNAc use only when needed33,35,36, but rather may express these constitutively. To test this hypothesis, we grew the two OTU476-like strains and E. coli with and without GalNAc in the medium and analysed their transcriptome using RNA-seq. The GalNAc regulon of E. coli was tightly regulated, with a near complete absence of transcription in the absence of GalNAc, and around 300-fold upregulation in its presence. By contrast, in the OTU476-like 4-15-1 strain, GalNAc TR–CP genes were expressed at nearly equal levels with and without GalNAc, as predicted. Strain 4-15-1 expression levels in the absence of GalNAc were higher than E. coli expression levels in the presence of GalNAc. The expression pattern in the 4-8-110 strain was intermediate between E. coli and 4-15-1: GalNAc genes were expressed in the absence of GalNAc (albeit at low levels) yet were upregulated (around five to tenfold) in its presence (Fig. 6a). The difference in the response of the transcriptome to GalNAc addition between OTU476-like strains and E. coli response was not limited to the GalNAc operon. Although addition of GalNAc did not have noticeable effects on genes outside of the GalNAc regulon in E. coli, it altered the expression of ≥605 and 225 genes (q ≤ 0.05) in 4-15-1 and 4-8-110, respectively (Fig. 6b). Six KEGG pathways were consequently perturbed in both strains (FDR ≤ 0.05): metabolic pathways, pyrimidine metabolism, alanine, aspartate and glutamate metabolism, biosynthesis of antibiotics, ABC transporters and biosynthesis of secondary metabolites (Supplementary Table 7).

Fig. 6: The GalNAc operon organization and transcriptome response of miQTL-responsive bacteria.
figure 6

a, The GalNAc operon organization and local transcriptome response to GalNAc addition in OTU476-like strains and E. coli. Top, GalNAc operon organization. Identified ORFs are represented as coloured boxes. Genes implicated in GalNAc import and catabolism are shown in red if they are part of the cluster and in green if located elsewhere in the genome. Genes with a known function unrelated to GalNAc are shown in blue. ORFs with an uncharacterized gene product are shown in grey. Gene acronyms are given next to the corresponding boxes. ORFs transcribed from the top and bottom strand are shown above and below the dotted line, respectively. The respective transcriptional directions are marked by arrows. Bottom, local gene expression levels (fragments per kb of exon model per million mapped reads (FPKM)) with (dark colour) and without (light colour) addition of GalNAc in the growth medium. The colours for the ORFs with a known function (GalNAc gene (red), other gene (blue)) are the same as in the top panels. The error bars are the s.e.m. from three replicates (individual values are shown as dots). b, The global transcriptome response to GalNAc addition in OTU476-like strains and E. coli. The log2-transformed fold change in expression for all genes in the respective genomes (4,419 in E. coli, 1,119 in OTU476-like strains, ranked according to genomic position) after addition of GalNAc in the medium is shown. GalNAc genes are shown in red and other genes are shown in blue. Insets: corresponding QQ plots showing the near absence of effects on genes other than the GalNAc regulon in E. coli versus the widespread response in OTU476-like strains. The P values used to generate the QQ plots are nominal, and were determined using DISEQ2.


In humans, the ABO genotype affects susceptibility to various viral (including SARS-CoV-1 and 2), bacterial and protozoan pathogens20,42,43, and possibly the composition of the intestinal microbiota21,22,23. Invoked mechanisms are usually immune related, including pathogen adhesion, toxin binding, soluble decoys and natural antibodies26,44. We provide strong evidence that the miQTL reported in this research functions by affecting intestinal GalNAc concentrations, and thereby the growth of GalNAc-using bacteria. A puzzling finding is that not all species with the ability to use GalNAc as a carbon source appear to be affected. We suggest that this could be connected to the observation that the GalNAc operon is inducible in species like E. coli, while being constitutively expressed at high levels in at least some of the OTU476-like strains. The GalNAc gene cluster as seen in the OTU476-like strains is a possible evolutionary intermediate towards the formation of a genuine regulon as seen in E. coli, facilitating horizontal transmission of a ‘selfish’ functional gene ensemble even if not yet adaptively coregulated37. The distinct behaviour of the GalNAc operon in the 4-15-1 and 4-8-110 strains in response to GalNAc may be indicative of an evolving regulatory mechanism distinct from the canonical repressor-based regulon system. These findings suggest an alternative modus operandi of the miQTL. Bacteria affected by it may not be at an advantage when GalNAc is present at a relatively higher concentration in the intestinal content (as in AA and AO animals), but rather at a disadvantage when GalNAc is present at low concentrations (as in OO animals) due to wasting energy transcribing and translating useless genes. By tightly regulating the expression of their operon in response to ambient GalNAc availability, species like E. coli may fair equally well in the gut of AA/AO pigs as in that of OO pigs and, therefore, not be affected by the miQTL. Note that, in addition to the distinct organization of their GalNAc operon, the transcriptome of the two studied OTU476-like strain responds more to GalNAc addition than that of E. coli. The precise significance of this observation remains to be established but may indicate alternative or additional factors that underpin differential sensitivity to the miQTL.

FUT2-null alleles segregate in humans and in homozygotic individuals; this precludes synthesis of the type I H antigen and therefore epistatically of the type I A and/or B antigens in non-O individuals20,21 (Extended Data Fig. 8). Although the FUT2H117R variant segregated in our population, there was no evidence for an effect of that or any other FUT2 variant on the abundance of OTU476, OTU327 or p-75-a5 with or without conditioning on AO genotype (data not shown).

The effect of the ABO genotype on intestinal microbiota composition in humans remains somewhat controversial. Despite suggestive evidence in a small (n = 71) cohort of separate microbiota-based clustering of AB and B versus A and O individuals45, a subsequent study in a larger cohort (n = 1,503) could not detect experiment-wide significant effects of the ABO genotype on gut microbiota composition46. More recently, an effect of ABO genotype was reported on the abundance of OTUs assigned respectively to Faecalibacterium and Bacteroides in a cohort of around 9,000 German individuals21, but this was not confirmed in a subsequent study (~18,000 individuals)47. We examined the effect of the ABO blood group on the abundance of around 75 OTUs assigned to Erysipelotrichaceae in human gut samples (Extended Data Fig. 12). None of the OTUs detected in human were as closely related to the pig OTU476, OTU327 or p-75-a5 as these were to each other. We found no evidence for an effect of ABO blood group on the abundance of any of these. What underlies the difference between pigs and humans is unclear. Either Erysipelotrichaceae strains susceptible to the ABO genotype are not present at sufficient levels in humans, or the carbohydrate composition of human intestinal content makes these strains less sensitive to variations in GalNAc concentrations. The abundance of p-75-a5 was found to differ significantly between African subsistence categories and to be highest in pastoralists (as compared to hunter-gatherers and agro-pastoralists) possibly as a result of interaction with livestock48,49. Repeating the experiments in pastoralists may reveal the same miQTL effect detected in this study.

While this paper was in the last phase of the publication process two papers came out reporting an effect of ABO genotype on intestinal microbiota composition in human, albeit on distinct taxa22,23.


Animal rearing and sample collection

This study focused on the sixth (F6) and seventh (F7) generation of a mosaic population generated as follows. An average of 3.6 boars (range, 3–4) and 4 sows (range, 2–5) from four indigenous Chinese pig breeds (Erhualian (EH), Bamaxiang (BX), Tibetan (TB), Laiwu (LA)) and four commercial European/American pig breeds (Landrace (LD), Large White (LW), Duroc (WD) and Piétrain (PT)) were successfully mated, constituting the F0 generation. For each Chinese breed, the boars were mated with the sows of one European breed, and the sows with the boars of another European breed to produce the F1 generation. Thus, every Chinese and every European breed is parent breed of two distinct F1 hybrid combinations each, for a total of eight F1 combinations (BX–LW, BX–PT, LA–PT, LA–LD, TB–LD, TB–WD, EH–WD, EH–LW). The F2 generation was obtained by mating each F1 hybrid combination with two others that did not share parental breeds for a total of eight F2 combinations (BX–LW × LA–PT, BX–PT × LA–LD, LA–PT × TB–LD, LA–LD × TB–WD, TB–LD × EH–WD, TB–WD × EH–LW, EH–WD × BX–LW, EH–LW × BX–PT). Every F2 combination was obtained by reciprocally crossing an average of 4 boars from one F1 combination with an average of 7.25 sows from the other. The F3 generation was obtained by mating each of the eight F2 hybrid combinations with the only complementary F2 combination that did not share any parental breeds for a total of four F3 combinations (BX–LW–LA–PT × TB–LD–EH–WD, BX–PT–LA–LD × TB–WD–EH–LW, LA–PT–TB–LD × EH–WD–BX–LW, LA–LD–TW–WD × EH–LW–BX–PT) expected to each have around 12.5% of their genome from each of the founder breeds. Every F3 combination was obtained by reciprocally crossing an average of 7 boars from one F2 combination with an average of 10.8 sows from the complementary one. The F4, F5, F6 and F7 generations were obtained by intercrossing 57 boars × 75 sows (F3 to F4), 62 boars × 97 sows (F4 to F5), 85 boars × 170 sows (F5 to F6) and 82 boars × 111 sows (F6 to F7) (Supplementary Table 1).

All F6 and F7 animals were born and reared at the experimental farm of the National Key Laboratory for swine Genetic Improvement and Production Technology, Jiangxi Agricultural University (Nanchang, Jiangxi) under standard and uniform housing and feeding conditions. Piglets remained with their mother during the suckling period and were weaned at about 46 days of age. Litters were transferred to 12-pig fattening pens with automatic feeders (Osborne Industries), minimizing splitting and merging of litters. All pigs were fed twice per day with formula diets containing 16% crude protein, 3,100 kJ digestible energy, 0.78% lysine, 0.6% calcium and 0.5% phosphorus. Water was available ad libitum from nipple drinkers. Males were castrated at 80 days. Faecal samples were manually collected from the rectum of experimental pigs at the ages of 25, 120 and 240 days, dispensed in 2 ml tubes, flash-frozen in liquid nitrogen and stored at −80 °C. The animals were slaughtered at day 240. The ileum and caecum were sealed at both ends with a sterile rope and extracted from the carcass. Within 30 min after slaughter, ileal and caecal luminal content were collected (F6 and F7 animals), the ileum and caecum were rinsed with sterile saline solution, and samples of ileal and caecal mucosa were scraped with a sterile microscopic slide (F7 animals only). Approximately 1 g of content or scrapings was packed in 2 ml sterile freezer tubes, flash-frozen in liquid nitrogen and stored at −80 °C. The number of samples of the different types available for further analysis are provided in Supplementary Table 2.1. All of the animals included in the analyses were healthy and did not receive any antibiotic treatment within one month of sample collection. All of the procedures involving animals were approved by the Ethics Committee of the Jiangxi Agricultural University (no. JXAU2011-006).

Genotyping by sequencing the F0, F6 and F7 generations

Genomic DNA was extracted from ear punches using a standard phenol–chloroform-based DNA-extraction protocol. DNA concentrations were measured using the Nanodrop-1000 instrument (Thermo Fisher Scientific), and the DNA quality of all of the samples was assessed by agarose (0.8%) gel electrophoresis. Genomic DNA was sheared to 300–400 bp fragment size. The 3′-ends were adenylated and indexed primers were ligated. The libraries were amplified by PCR using Phusion High-Fidelity DNA polymerase (NEB) according to the recommendations of the manufacturer (Illumina). The libraries were loaded onto the Illumina X-10 instrument (Illumina) for 2 × 150 bp paired-end sequencing by Novogene. We removed reads with a quality score of ≤20 for ≥50% of bases or ≥10% missing (N) bases. Read quality was checked using Fastqc ( Clean reads were aligned to the S. scrofa reference genome assembly 11.150 using BWA (v.0.7.17)51. BAM files of mapped reads were sorted by chromosome position using SAMTools (v.1.6)52. Indel realignment and marking of duplicates were performed using Picard (v.2.21.4, Individual genotypes were called from BAM files using Platypus (v.0.8.1)53. Individual genotypes were merged into a single VCF file using PLINK (v.1.9)54 encompassing a total of 39.3 million variants including 31,094,663 SNPs and 8,266,390 INDELs. Missing genotypes were imputed with Beagle (v.40)55. Genomic variants with MAF < 0.03 were removed.

Computing nucleotide diversity

Nucleotide diversity between pairs of breeds was computed from variant frequencies as follows:

$${\pi }_{i}=(\mathop{\sum }\limits_{j=1}^{{n}_{i}}1-({f}_{ij}^{{\rm{A}}}\times {f}_{ij}^{{\rm{B}}})-((1-{f}_{ij}^{{\rm{A}}})\times (1-{f}_{ij}^{{\rm{B}}})))/w$$

where πi is the nucleotide diversity in window i, ni is the number of variants in window i, \({f}_{{ij}}^{{\rm{A}}}\) is the frequency of variant j of window i in breed A,\(\,{f}_{{ij}}^{{\rm{B}}}\) is the frequency of variant j of window i in breed B, and w is the size of the windows in base pairs. The overall nucleotide diversity for a pair of breeds A and B was computed as the average of πi across all of the windows. The numbers reported are averages of overall nucleotide diversities for multiple pairs of breeds (within European, within Chinese, between European, between Chinese, between European and Chinese), computed for a window size of 1 million base pairs.

Estimating the contribution of the eight founder breeds in the F6 and F7 generation at the genome and chromosome level

We estimated the proportion of the genome of the eight founder breeds in the F6 and F7 generation according to ref. 56. Assume that the total number of variants segregating in the mosaic population is nT. Each of these variants has a frequency in each one of the founder breeds which we denote \({f}_{1}^{0.1}\to {f}_{{n}_{{\rm{T}}}}^{0.1}\) for breed 1, \({f}_{1}^{0.2}\to {f}_{{n}_{{\rm{T}}}}^{0.2}\) for breed 2, and so on, as well as a frequency in the F6 (or F7) generation, which we refer to as \({f}_{1}^{6}\to {{\rm{f}}}_{{n}_{{\rm{T}}}}^{6}\). We assume that there is a total of B breeds. We denote the proportion of the genome of breed 1 in generation F6 (or F7) as P1, of breed 2 in generation F6 (or F7) as P2, and so on. We estimated the values of P1, P2, and so on, using a set of linear equations:

$${f}_{1}^{6}={\sum }_{j=1}^{B}({P}_{j}\times {f}_{1}^{0.j})$$
$$\vdots $$
$${f}_{i}^{6}={\sum }_{j=1}^{B}({P}_{j}\times {f}_{i}^{0.j})$$
$$\vdots $$
$${f}_{{n}_{{\rm{T}}}}^{6}={\sum }_{j=1}^{B}({P}_{j}\times {f}_{{n}_{{\rm{T}}}}^{0.j}).$$

We used standard least square methods (lm function in R) to find the solutions of Pj that minimize the residual sum of squares. This was done for the entire genome, as well as by autosome.

16S rRNA data collection and processing

Microbial DNA was extracted from faeces, luminal content and mucosal scrapings using the QIAamp Fast DNA stool Mini Kit according to the manufacturer’s recommendations (Qiagen). DNA concentrations were measured using the Nanodrop-1000 instrument (Thermo Fisher Scientific), and DNA quality was assessed by agarose (0.8%) gel electrophoresis. The V3–V4 hypervariable region of the 16S rRNA gene was amplified with the barcode fusion primers (338F: 5-ACTCCTACGGGAGGCAGCAG-3, 806R: 5-GGACTACHVGGGTWTCTAAT-3) with 56 °C annealing temperature. After purification, PCR products were used for constructing libraries and sequenced on the Illumina MiSeq platform (Illumina) at Major Bio. The 16S rRNA sequencing data were submitted to the CNGB database and have accession number CNP0001069. The raw 16S rRNA gene sequencing reads were demultiplexed and primer and barcode sequences were trimmed using Trimmomatic (v.0.39)57. Reads with ≥10 consecutive same or ambiguous bases were eliminated. Clean paired-end reads were merged (minimum 10 bp overlap) into tags using FLASH (v.1.2.11)58. The average number of tags per sample was around 40,888 (Supplementary Table 2.1). Chimeric reads were removed using USEARCH (v.7.0.1090)59. Sequencing data were rarefied to 19,632 tags, that is, the lowest number of tags per sample. Tags were clustered in OTUs with VSEARCH (v.2.8.1)60 using 97% as the similarity threshold. OTUs that would not have ≥3 reads in at least two samples or were detected in ≤0.2% of the samples were ignored. In the end, 12,054 OTUs accounting for an average of 98.7% of total reads per sample were used for further analysis. The mean number of tags for the 12,054 OTUs retained for further analyses was 1.6 (range, 0.01– 702.2). OTUs were matched to taxa using the Greengenes (v.13.5) database and the RDP classifier (v.2.2)61. PCoA was performed with the ape and vegan R packages (v.3.5.3) using Bray–Curtis dissimilarities. Shannon’s index was used as α-diversity metric and computed using mothur (v.1.43.0)62. Bray–Curtis dissimilarity was used as β-diversity metric and computed using vegdist of the vegan package in R (v.3.5.3). The mouse faecal microbiome data were from ref. 63. The human faecal microbiome data were 16S rRNA data from 106 healthy individuals (L.S. et al., manuscript in preparation).

Measuring the heritability of microbiota composition (all taxa combined)

We first estimated the effect of host genetics on the composition of the intestinal microbiome by measuring the correlation between genome-wide kinship and microbiome dissimilarity. We computed genome-wide kinship (Θ) for all pairs of relevant individuals using the SNP genotypes at the above-mentioned 30.2 million DNA variants using either GEMMA (v.0.97)64 or GCTA (v.1.26)65. Both programs yielded estimates of Θ with the same distribution after standardization, albeit different raw values. We report results obtained using GEMMA (v.0.97). Microbiome dissimilarity was measured using the Bray–Curtis dissimilarity computed using the vegan R function66 and abundances of all OTUs. We computed Spearman’s (rank-based) correlations using the corrtest function in R (v.3.5.3). We first performed this analysis for each sample type and generation separately within litter, that is, only considering pairs of full-siblings born within the same litter, therefore in essence following ref. 67. We then performed the analysis across the F6 and F7 generations. The pairs of individuals considered were all F6–F7 animal pairs, which included no parent–offspring pairs. To account for dependencies characterizing the data, P values were determined empirically by permutation testing. We performed 1,000 permutations of kinship coefficients and Bray–Curtis dissimilarities within litter. Vectors of OTU abundances were permuted within litters, Bray–Curtis distances were recomputed and correlated with the unpermuted kinships. The empirical P value was determined as the proportion of permutations that yielded a Spearman’s correlation that was as low or lower than that obtained with the real data.

Measuring the heritability of the abundance of individual taxa

Heritabilities of log-transformed abundances of specific taxa/OTUs were estimated using a linear mixed model implemented with the lme4QTL R-package (v.3.5.3)68. P values for the heritability estimates were computed using the associated lrt_h2 function68. We first estimated heritabilities using information from the F6 and F7 generation jointly (F6 plus F7). The model included individual animal (that is, polygenic effect), dam, litter, pen and batch as random effects and generation as a fixed effect. The additive genetic relationship matrix determining covariances between individual animal effects was computed using GCTA (v.1.26)65. This F6 plus F7 analysis extracts information from the correlation between kinship and phenotypic resemblance (1) between F6 animals (within F6), (2) between F7 animals (within F7) and (3) between F6 and F7 animals (between F6 and F7). To evaluate whether these three sources of information were consistent, we also estimated—for each taxon or OTU—the three heritabilities separately. The within F6 and within F7 heritabilities were estimated using the same mixed model (except for the absence of the fixed generation effect) and lme4QTL package as for the F6 plus F7 analysis. The between F6 and F7 heritabilities were estimated by regressing the squared difference in abundance on additive relationship according to ref. 69—narrow sense heritability was estimated as \(-\hat{\beta }/(2{\hat{\sigma }}_{P}^{2})\) where \(\hat{\beta }\) is the least-square regression coefficient and \({\hat{\sigma }}_{P}^{2}\) an estimate of the phenotypic variance (Extended Data Fig. 3). The total heritability of the intestinal microbiome was further computed from the heritabilities and abundance of individual taxa/OTU according to ref. 2.

Mapping miQTL

miQTL were mapped using the GenABEL R package (v.3.5.3)70, applying two models according to ref. 11. The first fitted a linear regression between allelic dosage and log10-transformed taxa abundance (additive model). It was applied to all SNPs with MAF ≥ 0.05 (in the corresponding data series) and taxa (that is, at all taxonomic levels including OTU) with non-null abundance in at least 20% of samples (in the corresponding data series), ignoring samples with null abundance if those represented more than 5% of samples. The second fitted a logistic regression model between allelic dosage and taxon presence/absence in the corresponding sample (binary model). It was applied only to taxa present in ≥20% and ≤95% of individuals and SNPs with MAF ≥ 10% (as the test statistic was inflated under the null when using this model with MAF < 10%; Extended Data Fig. 4). Both models included sex, slaughter batch (21 for F6, 23 for F7) and the three first genomic principal components as fixed covariates. GWAS were conducted separately for each taxon × data series combination and P values were concomitantly adjusted for residual stratification by genomic control. P values were combined across traits and/or taxa using a Z-score. P values were converted to signed Z-values using the inverse of the standard normal distribution and summed to give a Z-score. Z-scores were initially calculated using METAL (v.3.0)71. To compute the P value of the corresponding Z-score while accounting for the correlation that exists between the phenotypic values of a given cohort across traits, we also computed the genome-wide (that is, across all of the tested SNPs) average (\(\bar{Z}\)) and standard deviation (\({\sigma }_{Z}\)) of the Z score. The P value of Z scores was (conservatively) computed by assuming that \(\left(Z-\bar{Z}\right)/{\sigma }_{Z}\) is distributed as N (0,1) under the null hypothesis. Both approaches yielded similar results.

De novo assembly of the A allele of the porcine AO acetyl-galactosaminyl transferase gene

We extracted high-quality genomic DNA from longissimus dorsi of a Bamaxiang female using a phenol–chloroform-based extraction method (Novogene Biotech). A 40 kb SMRTbell DNA library (Pacific Biosciences) was prepared using BluePippin for DNA size selection (Sage Science) and then sequenced on the PacBio Sequel platform (Pacific Biosciences) with P6-C4 chemistry at Novogene Biotech. We obtained a total of 18,148,470 subreads with an N50 length of 17,273 bp. Moreover, a paired-end library with insert size of 350 bp was constructed and sequenced on the Illumina Novaseq 6000 PE150 platform (2 × 150bp reads) at Novogene Biotech. PacBio reads were self-corrected using Canu (v.1.7.1) before assembly with Flye (v.2.4.2)72. Errors in the primary assembly were first corrected using PacBio subreads using racon (v.1.4.10)73, and Illumina paired-end reads were then mapped to the contigs using bwa-mem74 to polish the contigs using Pilon (v.1.23, Broad Institute)75. Lastz (v.1.02.00)76 and Minimap2 (v.2.17-r941)77 were used to compare the Bamaxiang contig and the 40 kb sequence spanning the ABO gene of the S. scrofa build 11.1 reference genome.

Developing a PCR assay to distinguish between AA, AO and OO pigs

We designed two pairs of primers to genotype the deletion in the F6 and F7 populations. The first pair of primers was located respectively within intron 7 of the ABO gene and downstream of the deletion (FP: 5′-GAGTTCCCCTTGTGGCTCAGT-3′, RP: 5′- TTGCCTAAGTCTACCCCTGTGC-3′). The second pair of primers was located in exon 8 (FP2: 5′-CGCCAGTCCTTCACCTACGAAC-3′, RP2: 5′-CGGTTCCGAATCTCTGCGTG-3′). PCR amplification was performed in a 25 µl reaction containing 50 ng genomic DNA and 1.5 U of LA Taq DNA polymerase (Takara) under thermocycle conditions of 94 °C for 4 min; 35 cycles of 94 °C for 1 min, 1 min at specific annealing temperature for each set of primers and 72 °C for 2 min; and 72 °C for 10 min on a PE 9700 thermal cycler (Applied Biosystem).

RNA-seq and eQTL analysis

A total of 300 caecum tissue samples from F7 pigs that also had microbiota and genotype data were used to extract total RNA with TRIzol (Invitrogen) according to the manufacturer’s manual. Total RNA was electrophoresed on 1% agarose gels. RNA purity and integrity were assessed using an eNanoPhotometer spectrophotometer (Implen) and a Bioanalyzer 2100 system (Agilent Technologies). A Qubit3.0 Fluorometer (Life Technologies) was used to measure RNA concentrations. Total RNA (2 µg) of each sample was used to constructed RNA-seq libraries, using the NEBNext UltraTMR NA Library Prep Kit for Illumina (NEB) according to the manufacturer’s protocol. In brief, oligo(dT) magnetic beads (Invitrogen) were used to enrich mRNA, which was then fragmented using a fragmentation buffer (Ambion, USA). cDNA was synthesized using 6 bp random primers and reverse transcriptase (Invitrogen). After purification, cDNA was end-repaired, and index codes and sequencing adaptors were ligated. After PCR amplification, purification and quantification, the libraries were sequenced on the Novaseq-6000 platform using 2 × 150 bp paired-end sequencing. Clean data were obtained by removing adapter reads, poly-N and low-quality reads from raw data. Cleaned reads from each sample were mapped to the complete ABO sequence from the Bamaxiang reference genome with the A allele at the ABO locus constructed by the authors using STAR (v.020201)78. Samtools (v.1.6)52 was used to convert SAM format to BAM format. The read counts mapping to ABO (exon 1 to 7) were quantified for each sample using featureCounts (v.1.6.4)79. The expression abundance of the ABO gene was normalized to FPKM. Gender and batch were treated as covariates to correct for gene expression levels, and the corrected residuals were used for subsequent analyses. GEMMA (v.0.97)64 was used to analyse the association of ABO expression level with genome-wide variants using a linear mixed model.

Whole-genome sequencing and bioinformatic analysis for wild boars, S. verrucosus and S. cebifrons

The genomes of six Russian wild boars, one Sumatran wild boar and one African warty hog were sequenced on the Illumina HiSeq X Ten platform at Novogene Biotech. Furthermore, six Chinese wild boars were sequenced in a previous study80, and we downloaded the genome sequencing data for eight other pigs from NCBI. Finally, we used a total of 22 genomes to call SNPs in the porcine ABO gene using GATK (v.4.2)81. We replaced the ABO gene of the S. scrofa build 11.1 genome with the 50 kb Bamaxiang contig sequence containing the A allele of ABO gene. The cleaned reads of the 22 individuals were aligned to the modified S. scrofa reference genome (build 11.1) using BWA (v.0.7.17)51.

Phylogenetic analysis of the O alleles in the Sus genus

We applied GATK (v4.2) to perform indel realignment, and proceeded to SNP and INDEL discovery and genotyping with UnifiedGenotyper across all 83 samples simultaneously using standard hard filtering parameters according to GATK best practices recommendations81,82. We restricted the analysis to the 14 AA and 34 OO animals, therefore circumventing the need to phase the corresponding genotypes. We defined windows of varying size (0.5 to 50 kb) centred around the 2.3 kb deletion. For all pairs of individuals, we computed a running sum over all of the variants in the window adding 0 when both animals had genotype AA (alternate) or RR (reference), 1 when one animal was AA and the other RR, and 0.5 in all other cases. The nucleotide diversity for the corresponding animal pair was then computed as the running sum divided by the window size in bp. We ignored the variants located in the 2.3 kb deletion in this computation. The ensuing matrix of pairwise nucleotide diversities was then used for hierarchical clustering and dendrogram construction using the hclust (method=“average”) R function corresponding to the unweighted pair group method with arithmetic mean (UPGMA).

Analysis of population differentiation

We quantified the degree of population differentiation by computing the effect of breed on the variance of allelic dosage using a standard one-way ANOVA fixed-effect model and a F-statistic computed as the ratio of the between breed mean squares (BMS) and within breed mean squares (WMS)83. BMS and WMS were computed as:

$${\rm{BMS}}=(\mathop{\sum }\limits_{i=1}^{B}\mathop{\sum }\limits_{j=1}^{{n}_{i}}{(\overline{{y}_{i}}-\overline{{y}_{{\rm{T}}}})}^{2})/(B-1)$$
$${\rm{WMS}}=(\mathop{\sum }\limits_{i=1}^{B}\mathop{\sum }\limits_{j=1}^{{n}_{i}}{({y}_{ji}-\overline{{y}_{i}})}^{2})/({N}_{{\rm{T}}}-B)$$

where yij is the allelic dosage of the alternate allele in individual j (of ni) of breed i (of B), \(\bar{{y}_{{\rm{i}}}}\) is the average allelic dosage in breed i (of B), and \(\bar{{y}_{{\rm{T}}}}\) is the average allelic dosage in the entire dataset. We computed the average of the corresponding F-statistic for all variants within a sliding window of fixed physical size (2 kb in Fig. 5e), and took the inverse of this mean as measure of population similarity. The corresponding profiles were nearly identical to those obtained by computing average FST values (fixation index) across variants (and taking the inverse) following ref. 84.

Profiling ABO gene expression level at various adult and embryo tissues

Total RNA was extracted using Trizol from 15 tissues (lung, hypophysis, skin, spinal cord, liver, spleen, muscle, hypothalamus, heart, blood, brain, caecum, stomach, duodenum and kidney) collected from an adult Bamaxiang sow and a Duroc pig embryo (day 75). RNA quality was monitored by agarose (1%) gel electrophoresis, and using the RNA Nano 6000 Assay Kit of the Bioanalyzer 2100 system (Agilent Technologies). RNA concentration was measured using Qubit RNA Assay Kit in Qubit 2.0 Flurometer (Life Technologies). Total RNA (1 µg) of each sample was used to constructed RNA-seq libraries. Sequencing libraries were generated using the TruSeq RNA Library Preparation Kit (Illumina) according to the manufacturer’s recommendations, and index codes were added to attribute sequences to each sample. In brief, mRNA was purified from total RNA using poly-T oligo-attached magnetic beads. First-strand cDNA was synthesized using random hexamer primer and M-MuLV Reverse Transcriptase (RNase H). Second-strand cDNA synthesis was subsequently performed using DNA polymerase I and RNase H. The remaining overhangs were converted into blunt ends by exonuclease/polymerase activities. After adenylation of the 3′ ends of the DNA fragments, Illumina adaptors were ligated. To select cDNA fragments of preferentially around 350–400 bp in length, the library fragments were purified with AMPure XP system (Beckman Coulter). PCR was performed using Phusion High-Fidelity DNA polymerase, Universal PCR primers and Index (X) Primer. PCR products were purified (AMPure XP system) and the library quality was assessed on the Agilent Bioanalyzer 2100 system. Clustering of the index-coded samples was performed on the cBot Cluster Generation System using the TruSeq PE Cluster Kit v3-cBot-HS (Illumia) according to the manufacturer’s instructions. Sequencing was performed on the Illumina NovaSeq platform and 150 bp paired-end reads were generated. Filtered reads were obtained by removing adapter sequences, poly-N and low-quality reads. Cleaned reads were mapped to the complete ABO sequence from the Bamaxiang reference genome sequence using HISAT2 (v.2.2.1)85. Samtools (v.1.6)52 was used to convert SAM format to BAM format. The read counts mapping to ABO (exon 1 to 7) were quantified for each sample using featureCounts (v.1.6.4)79. To adjust for the effect of sequencing depth, the expression abundance of ABO gene was normalized to Transcripts Per Million (TPM). Expression abundance of ABO was used to cluster and visualize the expression level of the 15 tissues from an adult Bamaxiang sow and a Duroc pig embryo using the functions dist(), hclust(), as.dendrogram() and set() implemented in R package stats and dendextend.

Testing the effect of the AO genotype on other phenotypes

The associations between the 2.3 kb ABO deletion and 150 traits were calculated in the F6 and F7 populations based on a meta-analysis combining the effects. The observed P value for a trait was calculated by testing a weighted mean of Z scores from F6 and F7 generations as follows: Z = (Z1W1 + Z2W2)/(W1 + W2), where Z1 = b1/SE1 and Z2 = b2/SE2, W1 = 1/(SE1)2 and W2 = 1/(SE2)2, where the subscripts 1 and 2 denote F6 and F7 generations, respectively; b1, b2, SE1 and SE2 were additive effects and standard errors of ABO locus on a given trait estimated from a linear mixed model, which accounted for population structure using a genomic relationship matrix derived from whole genome marker genotypes. A total of 250 and 254 traits were tested in the F6 and F7 generations, and 150 traits that were shared in the F6 and F7 generations were used for meta-analysis.

Determination of the concentration of N-acetyl-galactosamine in caecal lumen

Targeted LC–MS/MS analysis was performed to determine the concentration of the mixture of GalNAc, GlcNAc and ManNAc isomers (referred to as HexNAc) in caecal lumen samples. Samples of caecal content were collected at D240 for 124 F7 and 154 animals from a Duroc × Landrace × Large White commercial population. The samples were lyophilized, grounded into powder by a Mixer Mill MM 400 (30 Hz, 1 min) (Retsch) and stored at −80 °C until use. Approximately 15 mg of powder was weighed and extracted with 500 µl of 70% methanol/water. The mixture was vortexed for 10 min, and centrifuged at around 16,000g and 4 °C for 10 min. Supernatant (300 μl) was transferred into a new centrifuge tube, placed at −20 °C for 30 min, and centrifuged again at 16,128g at 4 °C for 3 min. Finally, 150 μl of the supernatant was collected for further LC–MS analysis. Separation was performed in a 50 mm × 2.1 mm × 1.8 µm ACQUITY UPLC BEH C18 Column (Waters) in an ExionLC AD System (AB Sciex). Linear ion trap and triple quadrupole scans were performed on the Applied Biosystems 6500 Triple Quadrupole (QTRAP 6500) system equipped with an ESI Turbo Ion-Spray interface. The operation was performed under positive ion mode and controlled by Analyst software (v.1.6.3) (Sciex). The ESI source parameters were set as follows: ion source: ESI+, source temperature: 550 °C, ion spray voltage (IS): 5,500 V (Positive), curtain gas (CUR): 35 psi. The GalNAc was analysed using multiple reaction monitoring. We established a standard curve using increasing and known amounts of GalNAc standard and used it to determine GalNAc concentrations per gram of dried caecal content.

Measuring the correlation between GalNAc concentrations and bacterial abundance within the AO genotype

Raw AUC (GalNAc) or bacterial abundances were showing considerable batch-to-batch variation both with regards to means and variances. We therefore transformed them to ranks within batch and projected these uniformly between 0 and 1, yielding equalized means of 0.5 and standard deviations ≈ 0.3 (scaled batch-corrected ranks). We then applied the same approach within AO genotype to the scaled batch-corrected ranks yielding scaled, batch- and genotype-corrected ranks. We computed Spearman’s correlations between the GalNAc and abundance scaled ranks.

Isolating 4-8-110 and 4-15-1

Faecal samples were collected from the rectum of healthy pigs at about 120 days and transferred immediately to anaerobic conditions. Fresh samples were homogenized with sterile 1× PBS (pH 7.0) in an anaerobic glovebox (Electrotek), which contained 10% hydrogen, 10% carbon dioxide and 80% nitrogen. The faecal suspension was diluted 10−6, 10−7, 10−8 and 10−9-fold, and plated on GAM medium (Nissui Pharmaceutical). Plates were incubated at 37 °C for 3 days in an anaerobic glovebox. Single clones were picked and streaked until pure colonies were obtained on GAM medium. Full-length 16S rRNA gene sequencing was performed after amplification using the following primers: 27 forward: 5′-AGAGTTTGATCCTGGCCTCAG-3′; and 1492 reverse, 5′-GGTTACCTTGTTACGACTT-3′). The isolates were stored at −80 °C in GAM broth containing 16% of glycerol until further use.

Oxford Nanopore sequencing

The strains 4-8-110 and 4-15-1 were recovered on GAM medium. Cells were collected at the period of logarithmic growth. Genomic DNA was extracted using the Blood & Cell Culture DNA Midi Kit (Qiagen) according to the manufacturer’s protocol. Libraries for whole-genome sequencing of the strains were constructed and sequenced on an ONT PromethION (Oxford Nanopore Technology) at NextOmics. To correct sequencing errors, a library for second-generation sequencing was also constructed for each of the two strains and sequenced (2 × 100 bp) on the BGISEQ platform (BGI). Bioinformatic analyses of sequencing data were performed following refs. 72,86. In brief, after quality control, the sequencing data were assembled using flye (v.2.6)72 with the parameter: --nano-raw, and the assembled genomes were corrected by combining the Oxford Nanopore data with the second-generation sequencing data using pilon with the default parameters. The encoded genes were predicted using prodigal (v.2.6.3) (parameter: -p none-g 11)87.

MAG assembly

A total of 92 faecal samples from eight pig populations, four intestinal locations and different ages were used for metagenomic sequencing and construction of MAGs. Microbial DNA was extracted as described above. The libraries for metagenomic sequencing were constructed according to the manufacturer’s instructions (Illumina), with an insert size of 350 bp for each sample, and 2 × 150 bp paired-ends sequenced on the NovaSeq 6000 platform. Raw sequencing data were filtered to remove adapter sequences and low-quality reads using fastp (v.0.19.41)88. Host genomic DNA sequences were filtered out using BWA (v.0.7.17)51. The clean reads of each sample were assembled into contigs using MEGAHIT (v.1.1.3) with the option ‘--min-count 2 --k-min 27 --k-max 87 --k-step 10 --min-contig-len 500’89. Single-sample metagenomic binning was performed using two different binning algorithms ‘--metabat2 --maxbin2’ using the metaWRAP package (v.1.1.1)90. The bins (MAGs) generated by the two binning algorithms were evaluated for quality and combined to form a MAG set using the bin_refinement module in metaWRAP (v.1.1.1). Metagenomic sequences were further assembled to optimize MAGs using the reassemble_bins module of metaSPAdes (v.3.15.3) in the metaWRAP pipeline. CheckM (v.1.0.12) was used to estimate the completeness and contamination of each MAG91. The MAGs with completeness <50% and contamination >5% were filtered out. Non-redundant MAGs were generated by dRep (v.2.3.2) at threshold of 99% average nucleotide identity (ANI)92.

Bioinformatic analyses

Gene prediction in MAGs was carried out using the annotate_bins module in metaWRAP (v.1.1.1). The FASTA file of amino acid sequences translated from coding genes was used to perform KEGG annotation using Ghost KOALA tool (v.2.2)93 on the KEGG website ( Taxonomic classification of MAGs was performed using PhyloPhlAn (v.0.99)94. The graphs in Fig. 5b, c were generated using custom Perl (v.5.10.1) and R scripts (v.3.5.3). Pathway and regulon scores were computed using a custom Perl script. Both scores included (1) one point for import (having orthologues of either the four components of AgaPTS (agaE, agaF, agaV and agaW) and/or the three components of the TonB dependent transporter (omp, agaP and agaK) and/or the four components of the GnbPTS transporter (gnbA, gnbB, gnbC and gnbD), (2) one point for GalNAc deacetylase activity (having an orthologue of agaA and/or nagA), (3) one point for GalN deaminase/isomerase (having on orthologue of agaS), (4) one point for tagatose-6-P kinase (having an orthologue of pfkA and/or lacC and/or fruK), and (5) one point for tagatose-1,6-PP aldolase (having an orthologue of gatY and/or gatZ and/or lacD and/or fba). For the pathway score, the orthologues could be located anywhere in the MAG; for the regulon score, they had to be located on the same sequence contig and in close proximity (2.5% of genome size) to the anchor gene agaS28. For the top hits, we manually checked whether proximity was confirmed either by the replication of the order in more than one MAG and/or by the colocalization of the genes on one and the same sequence contig. The effect of the MAG type (OTU476-like, Erysipelotrichaceae and others), completion, contig and genome size on pathway and regulon scores was estimated using the R lm function, and was highly significant. The P values for the Erysipelotrichaceae versus other contrast were directly obtained from the lm function. To conservatively estimate the P value of OTU476-like versus Erysipelotrichaceae + others we generated score residuals corrected for completion, contig number and genome size, and determined how many MAGs had scores as high or higher than the OTU476-like strains. The P values reported in Extended Data Fig. 10 correspond to the square of these proportions as there are two OTU476-like strains with same GalNAc cluster organization.

Determining the metabolic flux ratios from 13C experiments by GC–MS

13C-labelled GalNAc (N-acetyl-d-[UL-13C6]galactosamine) (99% isotopic purity of 13C) was bought from ZZBIO (Shanghai). The 4-8-110 strain was cultured in 6 ml of GAM medium (Nissui Pharmaceuticals) adding 1 g l−1 of 13C-labelled GalNAc or regular GalNAc (as control) in an anaerobic glovebox (Electrotek) in triplicate as described above. After 13 h of culture, the cultures were centrifuged for 5 min at 1,500g. Bacterial cells were resuspended in 0.6 ml of cold (−40 °C) 50% aqueous methanol containing 100 µM of norvaline as internal standard, plunged in dry ice for 30 min and then thawed on ice. We added 0.4 ml of chloroform, vortexed for 30 s, centrifuged for 10 min at 17,530g (4 °C), transferred the supernatant to new 1.5 ml tubes and dried the sample at −105 °C in a FreeZone Freeze Dryers (Labconco). Metabolites were derivatized for GC–MS analysis as follows. First, 70 μl of pyridine was added to the dried pellet and incubated for 20 min at 80 °C. After cooling, 30 µl of N-tert-butyldimethylsilyl-N-methyltrifluoroacetamide (Sigma-Aldrich) was added and the samples were reincubated for 60 min at 80 °C before centrifugation for 10 min at 17,530g (4 °C). The supernatant was transferred to an autosampler vial for GC/MS analysis. A Shimadzu QP-2010 Ultra GC-MS system was programmed for an injection temperature of 250 °C and injected with 1 µl of sample. The GC oven temperature started at 110 °C for 4 min, rising to 230 °C at 3 °C min−1 and then to 280 °C at 20 °C min−1 with a final hold at this temperature for 2 min. The GC flow rate with helium carrier gas was 50 cm s−1. The GC column used was a 20 m × 0.25 mm × 0.25 mm Rxi-5 ms. The GC–MS interface temperature was 300 °C and the ion source temperature was set at 200 °C, with 70 V ionization voltage. The mass spectrometer was set to scan m/z range 50–800, with a 1 kV detector. GC–MS data were analysed to determine isotope labelling. To determine 13C labelling, the mass distribution for known fragments of metabolites was extracted from the mass spectra. For each fragment, the retrieved data comprised mass intensities for the lightest isotopomer (without any heavy isotopes, M0), and isotopomers with increasing unit mass relative to M0. The mass distributions were normalized by dividing by the sum over all isoptomers and corrected for the natural abundance of heavy isotopes of the elements H, O and C using 7 × 7 matrix-based probabilistic methods as described previously95,96,97 and implemented in MATLAB (release R2021a). Labelling results are expressed as the average fraction of the particular compound that contains isotopic label from the particular precursor.

Gavage experiment

4-8-110 and an E. coli strain isolated from caecal content (with 4-15-1 and 4-8-110) were cultured in GAM medium (Nissui Pharmaceuticals) for 13 h and 4 h, respectively, in an anaerobic glovebox (Electrotek) as described above. Optical density values were adjusted to 0.2 by addition of GAM medium, glycerol added to 1/6 of the final volume, and aliquots were stored at −80 °C. Germ-free mice (Kunming line) were provided by Huazhong Agricultural University. All of the experimental procedures involving mice were approved by the Ethics Committee in Huazhong Agricultural University (HZAUMO-2021-0077). Two groups (A and B) of five female mice each were housed in two separate cages (temperature, 25 ± 2 °C; humidity, 45–60%; light cycle, 12 h–12 h light–dark; light hours, 06:30–18:30). Mice were gavaged with 100 μl of PBS (group A) or 100 μl of PBS + 80 g l−1 GalNAc (corresponding to 8 mg per 100 μl or ~200 mg kg−1 of live weight) (group B) directly in the mouth with a sterile syringe twice per day (09:00 and 16:00) for 10 consecutive days (days 2 to 11). All of the mice were further gavaged with 150 μl of the 4-8-110 glycerol stocks (see above) once per day (at 16:00) for five consecutive days (days 3 to 7), and with 150 μl of the E. coli glycerol stocks (see above) once per day (at 16:00) for three consecutive days (days 5 to 7). The mice were euthanized 5 days after the last bacterial inoculation (on day 12, based on ref. 98), and caecal content and faeces were collected for each mouse. Bacterial DNA was extracted using the QIAamp fast DNA stool mini kit (Qiagen). The V3–V4 region of the 16S rRNA gene was amplified and sequenced using the methods described above. Reads were mapped to the V3–V4 sequence of the previously determined 4-8-110 and E. coli strains. An average of 99.74% of the reads mapped either to the 4-8-110 or E. coli reference sequences.

Testing the inducibility of the GalNAc operon and transcriptome response upon GalNAc addition in E. coli and the OTU476-like strains

Bacterial growth

Bacteria were grown in an anaerobic glovebox (Electrotek) containing 10% hydrogen, 10% carbon dioxide and 80% nitrogen at 37 °C. Samples (4 ml) were pipetted after gentle shaking to measure optical density at 600 nm (OD600) using an ultraviolet spectrophotometer (Yoke Instrument). Experiments were conducted in triplicate.

RNA extraction

The cultured bacterial cells of the two OTU476 like strains (4-8-110 and 4-15-1) and E. coli were collected at the end of the exponential growth phase (11 h and 4 h, respectively). Total RNA was extracted from bacterial cells using TRIzol Reagent according to the manufacturer’s manuals (Thermo Fisher Scientific), and genomic DNA was removed using DNase I (TaKara). The quantity and quality of total RNA were evaluated using a 2100 Bioanalyser (Agilent) and NanoDrop-2000 (Thermo Fisher Scientific). The RNA samples with OD260/280 = 1.8–2.0, OD260/230 ≥ 2.0, RIN ≥ 6.5, 28S:18S ≥ 1.0, ≥100 ng μl−1 and ≥2 μg were used to construct libraries for RNA-seq.

Construction of libraries and sequencing

The libraries for RNA-seq were constructed according to the TruSeq RNA sample preparation Kit (Illumina) using 2 μg of total RNA. In brief, rRNA was removed from total RNA using the Ribo-Zero Magnetic kit (Epicenter). All mRNAs were broken into short fragments (200 nucleotides) by adding fragmentation buffer. Double-stranded cDNA was synthesized using the SuperScript double-stranded cDNA synthesis kit (Invitrogen) with random hexamer primers (Illumina). cDNA was processed for end-repair, phosphorylation and A base addition according to library construction protocol (Illumina). The constructed libraries were sequenced on the NovaSeq 6000 platform with a 2 × 150 bp paired-end strategy. Base calling and quality value calculations were performed using the Illumina GA Pipeline (v.1.6) (Illumina).

Bioinformatics analysis

To obtain clean reads, low-quality sequences, reads with more than 5% of N bases (unknown bases) and reads containing adaptor sequences were removed from the raw data with a Perl script. The cleaned sequences were mapped to the reference genomic sequences of two OTU476 like strains (generated by ourselves as described above) and E. coli (GenBank: NC_000913.3) using Bowtie2 (v.2.4.2)99. The read counts mapping to the reference genomes were quantified for each sample using featureCounts (v.1.6.4). The fragments per kilobase of exon model per million mapped reads method was used to calculate the gene expression level. We used the DESeq2 package100 in R (v.g3.5.3) to test for differential expression.

Analysing the effect of the ABO genotype on the abundance of Erysipelotrichaceae in humans


The data used correspond to the previously described CEDAR cohort101, which included 300 healthy individuals of European descent who were visiting the University Hospital (CHU) from the University of Liège as part of a national screening campaign for colon cancer. Blood samples and intestinal biopsies (ileum, colon and rectum) were collected with full consent. The experimental protocol was approved by the ethics committee of the University of Liège Academic Hospital. Informed consent was obtained before donation in agreement with the recommendations of the declaration of Helsinki for experiments involving human participants.


For microbiota analysis, DNA was extracted from biopsies using the QIAamp DNA Stool Mini Kit (QIAgen). Three 16S rRNA amplicons corresponding to the V1–V2, V3–V4 and V5–V6 variable regions were generated in separate PCR reactions and processed for paired-end (2 × 300 bp) NGS sequencing on the MiSeq instrument (Illumina) following the standard protocol at the GIGA genomics core facility.

Data processing

Reads were QV20 trimmed from the 3′ end, demultiplexed, primer sequences were removed using the bbduk tool (version 38.82)102. Reads mapping to the human genome were eliminated using the BBTools suite (v.38.82). The corresponding pipeline was constructed using Snakemake (v.7.0.1)103. Further analyses were performed using QIIME 2 (v.2018.11)104. The paired-end reads were denoised and joined using the DADA2 plugin (v.1.16)105 using batch-specific trimming length parameters yielding 9.1± 2.0 kb amplicon sequence variants (ASVs) per run for V1–V2, 4.5 ± 1.6 kb for V3–V4 and 6.8 ± 0.67 kb for V5–V6 amplicons. ASVs mapping to known contaminant taxa as well as ASVs with abundance negatively correlated with coverage depth were removed. Samples that contained more than 20% contaminant ASVs were eliminated from further analyses. ASVs were then clustered to 97% identity level OTUs using the DNACLUST program (v.r3)106. After OTU assignment, read counts were rarefied to 10,000 (V1–V2 and V5–V6) and 5,000 (V3–V4). As intestinal location only explored a minor proportion of the variance in OTU abundance (L.S. et al., manuscript in preparation), OTU abundances were averaged across locations. Local alignment identity of the detected ASVs with the OTU476 and OTU327 from the pig microbiome were measured using blastn107.

Association analysis with ABO blood group

The effect of ABO blood group on standardized abundances of individual OTUs was performed using a linear model (lm R function) including (1) ABO blood group (A, B, AB or O), (2) secretor status, (3) sex, (4) smoking status, (5) age and (6) body mass index. Analyses were conducted separately for the different amplicons.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.