Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

# ABO genotype alters the gut microbiota by regulating GalNAc levels in pigs

## Abstract

The composition of the intestinal microbiome varies considerably between individuals and is correlated with health1. Understanding the extent to which, and how, host genetics contributes to this variation is essential yet has proved to be difficult, as few associations have been replicated, particularly in humans2. Here we study the effect of host genotype on the composition of the intestinal microbiota in a large mosaic pig population. We show that, under conditions of exacerbated genetic diversity and environmental uniformity, microbiota composition and the abundance of specific taxa are heritable. We map a quantitative trait locus affecting the abundance of Erysipelotrichaceae species and show that it is caused by a 2.3 kb deletion in the gene encoding N-acetyl-galactosaminyl-transferase that underpins the ABO blood group in humans. We show that this deletion is a ≥3.5-million-year-old trans-species polymorphism under balancing selection. We demonstrate that it decreases the concentrations of N-acetyl-galactosamine in the gut, and thereby reduces the abundance of Erysipelotrichaceae that can import and catabolize N-acetyl-galactosamine. Our results provide very strong evidence for an effect of the host genotype on the abundance of specific bacteria in the intestine combined with insights into the molecular mechanisms that underpin this association. Our data pave the way towards identifying the same effect in rural human populations.

## Main

It is increasingly recognized that a comprehensive understanding of the physiology and pathology of organisms requires integrated analysis of the host and its multiple microbiota1. In humans, the composition of the intestinal microbiota is associated with physiological and pathological parameters, including HDL cholesterol, fasting glucose levels and body mass index2. In livestock, ruminal microbiome composition is associated with methane production and feed efficiency3. These correlations reflect a complex interplay between host and microbiota that may include direct (causal) effects of the microbiome on the host’s physiology4. Several phenotypes correlated with microbiota composition are heritable5,6. This leads to the hypothesis that the genotype of the host may in part determine the composition of the microbiota, which may in turn affect the host’s phenotype4. It implies that the composition of the microbiota is partially heritable. Although studies in rodents support this7, evidence is less convincing in humans. Initial reports did not reveal a higher microbiota resemblance between monozygotic twins compared with dizygotic twins, suggesting a limited effect of the host genotype8. Better-powered studies provided evidence for a significant effect of host genetics on the abundance of taxa, particularly Christensenellaceae9. Loci that underpin microbiota heritability have remained difficult to identify in humans. Apart from variants that cause persistent expression of lactase (LCT) and are associated with decreased Bifidobacterium abundance, other GWAS loci have proven to be difficult to replicate2,10,11,12,13,14. The analysis of larger human cohorts is needed to gain a better understanding of the genetic architecture of microbiota composition.

To decipher the genetic architecture of intestinal microbiota composition in a large monogastric omnivore, we report the generation of a mosaic pig population and the longitudinal characterization of its intestinal microbiota. We observed a strong effect of the host genotype on microbiota composition and identified a locus with large effect on the abundance of specific taxa by controlling the concentration of N-acetyl-galactosamine in the gut and thereby affecting some of the species that use this metabolite as a carbon source.

## A mosaic pig population to study complex traits

We generated a large (>7,500) mosaic population by intercrossing the offspring of 61 F0 founders from four Chinese and four western breeds for more than 10 generations (Supplementary Table 1 and Extended Data Fig. 1). Animals were reared in uniform housing and feeding conditions. We analysed more than 200 phenotypes (pertaining to body composition, physiology, disease resistance and behaviour), obtained transcriptome, epigenome and chromatin interaction data from multiple tissues, and collected plasma metabolome and microbiome data in up to 954 F6 and 892 F7 animals. The F0 animals were whole-genome sequenced at an average depth of 28.4-fold, and the F6 and F7 animals were sequence at an average depth of 8.0-fold. We called genotypes at 23.8 million single-nucleotide polymorphisms (SNPs) and 6.4 million insertion–deletions (indels) with a minor allele frequency (MAF) of ≥0.03 (>1/100 bp). The nucleotide diversity (π) (that is, the proportion of nucleotide sites that differ between homologous sequences in two breeds) between two Chinese breeds and between two European breeds was similar to that between Homo sapiens and Homo neanderthalensis (~3 × 10−3)15, whereas the π between a Chinese and a European breed approached half of that between human and chimpanzee (~4.3 × 10−3)16. The proportion of the eight founder genomes in F6 and F7 ranged from 11.2% to 14.7% at the genome level, and from 4.9% to 22.1% at the chromosome level. The median number of variants in high linkage disequilibrium (LD) (r2 ≥ 0.9) with an index variant was 30, and the median maximal distance with a variant in high LD (r2 ≥ 0.9) was 54 kb (Extended Data Fig. 1).

## The intestinal microbiota of the healthy pig

We collected faeces at days 25 (suckling period), 120 (growing period) and 240 (slaughter day), as well as caecal and ileal content (F6 and F7) and caecal and ileal mucosal scrapings (F7 only) at day 240 (7 sample types, 12 data series). We generated 16S rRNA tags (V3–V4) for an average of 426 animals per data series (5,110 samples) (Supplementary Table 2.1). Tags were rarefied and clustered in 32,032 operational taxonomic units (OTUs). A total of 12,054 OTUs amounting to 98.7% of reads were retained and annotated to 41 phyla, 87 classes, 149 orders, 207 families, 360 genera and 150 species. The average microbiota composition of the 12 data series indicated high consistency across F6 and F7, yet marked differences between sample types (Fig. 1a, b and Supplementary Table 2.2). Even at the family level, some taxa were found to be nearly sample-type specific (Extended Data Fig. 2). The proteobacteria Enterobacteriaceae, Pseudomonadaceae and Pasteurellaceae, the firmicutes Clostridiaceae, Peptostreptococcaceae, Bacillaceae and Leuconostocaceae, and the actinobacteria Microbacteriaceae were at least ten times more abundant in ileal samples than in any other sample type. Among those, Leuconostocaceae were nearly digesta specific, whereas Pseudomonadaceae were nearly mucosa specific. The Bacteroidetes Odoribacteraceae and Rikenellaceae were at least ten times more abundant in day 25 faeces samples than in any other sample type. The firmicutes Christensenellaceae were nearly ten times more abundant in faeces samples than in any other sample type. This confirms that limiting the analysis of the intestinal microbiota to adult faeces can provide only a partial view of its complexity and the factors that determine it17. α-Diversity of faeces was lower at day 25 compared with at days 120–240, reminiscent of the enrichment of the intestinal flora between childhood and adulthood in humans8,18. α-Diversity was lower for ileal content than for caecal content and mucosa (Fig. 1c). β-Diversity tended to be inversely proportional to α-diversity, being higher for day 25 than for day 120 and 240 faeces. Variation in pairwise Bray–Curtis dissimilarities was highest for ileal content, which had the lowest α-diversity (Fig. 1d and Supplementary Table 2.3). The microbiota composition of pig day 240 (D240) faeces was more similar to that of human than to that of mouse faeces (Extended Data Fig. 2).

## Microbiota heritability in mosaic pigs

We first examined the relationship between genetic relatedness (genome-wide SNP identity by state) and microbiota dissimilarity (Bray–Curtis distance, all taxa combined)2. We used two approaches to mitigate confounding of genetics and environment: (1) we restricted the analyses to full-sibling littermates raised in the same environment, and (2) we confronted genetic similarity and microbiota dissimilarity across generations (F6 and F7). Both approaches supported an effect of genetics on microbiota composition manifested by significant negative correlations between genetic similarity and microbiota dissimilarity (Fig. 2a, b).

We then evaluated the heritability (h2) of the abundances of individual taxa/OTUs using F6 and F7 jointly (Fig. 2c and Supplementary Table 3.1). We computed total heritabilities (average heritability weighted by abundance)2 separately by sample type and taxonomic level. Total heritabilities were generally low (≤11.8%), yet higher for faecal (average D240, 10.2%) than for content samples (average ileal content, 5.7%) and lower for OTUs (average, 6.6%) than for higher taxonomic levels (average, 7.5%) (Fig. 2d). The correlation between $$\hat{{h}^{2}}$$ estimates obtained separately in F6 and F7 was positive and significant for D120 (P = 4.2 × 10−5), D240 (P = 2.2 × 10−16) and caecum content (P = 2.3 × 10−3), supporting genuine genetic effects (Extended Data Fig. 3). We established a list of the 55 most likely heritable taxa/OTUs on the basis of congruent $$\hat{{h}^{2}}$$ estimates in F6 (≥0.15), F7 (≥0.15), and across F6 and F7 (≥0.10) (Supplementary Table 3.1). It included the order Campylobacterales in D240 faeces, the species Bacteroides coprophilus in D25 faeces, and 53 OTUs of which two (D240 faeces, caecum content) were assigned to Christensenellaceae, five to Ruminococcaceae (four in D240 faeces, one in caecum content), one (D25) to Ruminococcus, one (D25) to Phascolarctobacterium, and one (D240) to RF32; that is, taxons characterized by h2 ≥ 0.10 in human studies19. With the exception of RF32, all of these belong to the order Clostridiales.

## A miQTL affecting Erysipelotrichaceae species

Heritability quantifies the proportion of additive genetic variance but does not foretell genetic architecture: phenotypes with low heritability may be affected by variants with major effects, whereas highly heritable traits may be very polygenic. To gain insights into the genetic architecture of the gut microbiota composition in this population, we performed a genome-wide association study (GWAS). We applied two statistical models testing SNP effects on taxa abundance and taxa presence/absence, respectively11. We ran GWAS separately by sample type, taxon and generation for a total of 57,557 GWAS (involving 8,490 taxa) (Supplementary Table 4.1). This yielded 1,527 genome-wide significant signals (P ≤5 × 10−8). For these, we performed meta-analyses across sample types, separately in F6 and F7 (Extended Data Fig. 4). We identified six signals exceeding the experiment-wide discovery threshold (P = 1.5 × 10−12) in one cohort and the experiment-wide replication threshold (P = 0.007) in the other (Fig. 3a and Supplementary Table 4.24.3). All lead SNPs mapped within 3,037 bp from each other (chromosome 1) and were in high LD (Fig. 3b). They affected two OTUs (OTU476 and OTU327) as well as the genus p-75-a5, to which OTU476 is assigned. All three are Erysipelotrichaceae.

To determine whether this microbiota QTL (miQTL) affects other taxa, we plotted the F6 and F7 association log[1/P] for lead SNP 1_272907239 and the 8,490 studied OTUs. OTU476 and OTU327 stood out as highly significant in F6 and F7 (Fig. 3c). Yet the P values for the 31 other p-75-a5 OTUs and the 83 other Erysipelotrichaceae OTUs were significantly shifted towards lower P values in both cohorts (Fig. 3c), and the sign was consistent with that for OTU476, OTU327 and p-75-a5 (Extended Data Fig. 4). This suggests that the chromosome 1 miQTL affects other species in this family. Notably, the h2 for p-75-a5 and OTU476 deviated significantly from 0 in D240 (0.10 and 0.14) and D120 (0.13 and 0.13) faeces, and h2 estimates for OTU327 in D120 faeces (0.13) (Supplementary Table 4.4).

QQ plots obtained after removing chromosome 1 variants (272.8–273.1 Mb interval) did not show convincing evidence for residual inflation of log[1/P] (Extended Data Fig. 4). Thus, the residual h2 is most likely highly polygenic.

## A 3.5-million-year-old deletion in the pig ABO orthologue

All lead SNPs map to the 3′ end of the porcine N-acetyl-galactosaminyl transferase gene underlying the ABO blood group in humans (Fig. 3b). This is a strong candidate that modulates interactions with several pathogens20, and was recently suggested to affect the intestinal microbiota in healthy humans21,22,23. A 2.3 kb deletion encompassing the last exon has been reported in pigs24. We showed that it was in near-perfect LD (r2 ≥ 0.94) with the lead SNPs in our population (Fig. 3d). We confirmed its boundaries and showed that it results from an intrachromosomal recombination between short interspersed nuclear elements (SINEs). We showed using RNA-sequencing (RNA-seq) and expression quantitative trait loci (eQTL) analysis in 300 F7 caecum samples that the deleted allele produces reduced amounts (around 1/3 of normal) of a 7.4 kb transcript (with alternative eighth and ninth exons) encoding a protein that lost 62% of its amino-acids, including 7 out of 8 active sites25 (Fig. 4a). Our results indicate that the 2.3 kb deletion creates a null allele (hereafter, the O allele) and that is a very strong candidate for the causative miQTL mutation (Extended Data Figs. 5 and 6).

We examined the effect of the AO genotype on the abundance of OTU476, OTU327 and p-75-a5. The results showed that (1) the effect of the A allele is dominant over the O allele (Supplementary Table 5.1), and (2) that the effect manifests in D120 and D240 faeces, caecal content as well as mucosa, but not in D25 faeces and D240 ileal content and mucosa (Fig. 4b and Supplementary Table 5.1). In these samples (D120, D240, caecum content, caecal mucosa), the AO genotype explained 7.9%, 3.2% and 6.6% of the variation in the abundance of OTU476, OTU327 and genus p-75-a5 (Supplementary Table 5.2). Notably, the abundance of OTU476 and OTU327 was highest in caecal content (around 0.92% and 0.02% of reads in AA/AO animals, and around 0.47% and 0.003% of reads in OO animals, respectively; Extended Data Fig. 5).

In primates, the ABO locus is under strong balancing selection that has perpetuated identical-by-descent alleles segregating in humans, gibbons and Old World monkeys for tens of millions of years26. We analysed the sequences of the 61 F0 (Sus scrofa domestica), 18 wild boars (9 Asian, 9 European) (Sus scrofa), 1 Indonesian wild boar from Sumatra (Sus scrofa vittatus), 1 Visayan warty pig from the Philippines (Sus cebifrons), 1 Javan warty pig from Indonesia (Sus verrucosus) and one common warthog from Africa (Phacochoerus africanus) in a 50 kb window spanning the AO gene. Asian and European wild boars diverged from a common S. scrofa ancestor around 1 million years ago (Ma), S. scrofa and S. s. vittatus around 1.5 Ma, S. scrofa and S. cebifrons/verrucosus around 3.5 Ma, and S. scrofa and P. africanus around 10 Ma27. The 2.3 kb deletion segregated in all eight F0 breeds, in all Asian and European/American wild-boar populations, and in S. cebifrons (Supplementary Table 5.3). The sequence of the deletion breakpoint was identical in all of the samples, confirming the identical-by-descent nature of the porcine O allele (Extended Data Fig. 7). Consistent with the hypothesis of an ancestral trans-species polymorphism (rather than hybridization), the O allele of S. cebifrons lay outside the cluster of S. scrofa O alleles (Fig. 4c and Extended Data Fig. 7). Further supporting balancing selection, the AO gene showed a marked decrease in population differentiation maximizing at the 2.3 kb deletion (Fig. 4d). Thus, although largely unknown, the underlying selective forces have operated in at least two mammalian branches (primates and suidae), over substantially long periods and broad geographical ranges, pointing towards their pervasive nature (Extended Data Fig. 7).

## The miQTL affects caecal GalNAc concentrations

As in humans, the porcine AO gene is broadly expressed, yet is particularly expressed in the intestines (Extended Data Fig. 7). The N-acetyl-galactosaminyl-transferase encoded by the A allele adds N-acetyl-d-galactosamine (GalNAc, α1-3 linkage) to H and Lewis antigens present on glycan substrates, including the heavily glycosylated mucins, which can be used as carbon source by intestinal bacteria20,28,29 (Extended Data Fig. 8). We reasoned that, owing to the abundance of intestinal mucus, the O allele might reduce the concentration of GalNAc in the intestine, thereby reducing the growth of bacterial species that are dependent on this sugar. We measured the concentration of free GalNAc, N-acetylglucosamine (GlcNAc) and N-acetylmannosamine (ManNAc) isomers (hereafter, HexNAc) in the caecum content of 278 D240 animals (124 F7 and 154 Duroc × Landrace × Large White) using liquid chromatography coupled with tandem mass spectrometry (LC–MS/MS). Intestinal HexNAc in pigs comprises approximately equal proportions of the GlcNAc and GalNAc isomers30. Dosage of the A allele (additive model) increased (P = 5.7 × 10−9) caecal HexNAc concentrations (Fig. 5a). Assuming that the increase is due only to GalNAc (given the enzymatic activity of the A allele), AA animals have at least twice the amount of free GalNAc compared with OO animals (Fig. 5a and Extended Data Fig. 9). Non-zero values in OO animals are primarily due to GlcNAc/ManNAc, and are probably also due to non-A antigen host and dietary GalNAc (GalNAc is ubiquitous at the core of all O-glycans, including intestinal mucins, and is a component of glycosaminoglycans such as chondroitin sulfate). Note that the effect of the A allele appears to be dominant on bacterial abundance, yet additive on GalNAc concentrations (P value of likelihood ratio test against the additive model = 0.12). This indicates that the additional increase in GalNAc availability in AA animals (versus AO animals) does not further favour the growth of OTU476, OTU327 and p-75-a5.

Showing that the O deletion (1) reduces caecal GalNAc concentrations, and (2) reduces the abundance of OTU476, OTU327 and p-75-a5 does not prove that one causes the other. The effect on bacterial abundance could be mediated by an unidentified mechanism independent of GalNAc. However, the hypothesis that it is the change in GalNAc concentrations that causes the change in bacterial abundance makes a testable prediction. In this case, we may expect to observe a ‘residual’ correlation between GalNAc concentrations and bacterial abundance ‘within’ the AO genotype, that is, by blocking the effect of the AO genotype and exploiting other sources of variation of GalNAc and bacterial abundance (Extended Data Fig. 8). We performed these analyses and observed the predicted positive correlation for OTU476 (P = 0.012) and p-75-a5 (P = 0.010), consistently within the three genotypes (AA, AO and OO). The correlation was positive (and consistent within the three genotypes) albeit not significant (P = 0.3) for the less abundant OTU327 (Fig. 5b, c and Extended Data Fig. 9).

Taken together, these results make a strong case that the AO genotype affects intestinal GalNAc concentrations, and that this affects the abundance of some Erysipelotrichaceae species.

## miQTL-responsive bacteria use GalNAc

To be used as carbon source by intestinal bacteria, GalNAc needs (1) to be released from the glycan structures by secreted glycoside hydrolases (GHs), (2) to be imported across the bacterial membranes by dedicated transport systems (TR), and (3) to be converted into intermediates of central metabolism by a specific catabolic pathway (CP). Although some bacteria have both GHs and TR–CP for specific monosaccharides, other bacteria have only the GHs (donors) or the TR–CP (acceptors)28. We reasoned that the bacteria that would respond to the miQTL would have a complete GalNAc TR–CP system (with or without GHs). To test this, we (1) isolated two bacterial strains (4-8-110 and 4-15-1) with a V3–V4 sequence similarity of 100% and 99.8% with OTU476, respectively, and sequenced their genome using the Oxford Nanopore Promethion system, and (2) built 3,111 metagenomic assembled genomes (MAGs) from shotgun data of 92 porcine intestinal samples, including 248 Erysipelotrichaceae. We compiled a list of 24 genes implicated in GalNAc use (TR–CP)28,31,32,33,34,35,36. These encode (1) 11 components of three GalNAc transporter systems (AgaPTS: agaE, agaF, agaV, agaW; TonB dependent transporter: omp, agaP, agaK; GnbPTS: gnbA, gnbB, gnbC, gnbD), (2) two GalNAc-6P deacetylases (agaA, nagA), (3) two galactosamine-6P (GalN-6P) isomerase and/or deaminases (agaI, agaS), (4) three tagatose-6P kinases (pfka, lacC, fruK), (5) four tagatose-1,6-PP aldolases or aldolase subunits (gatY-kbaY, gatZ-kbaZ, lacD, fba) and (6) two regulon repressors (agaR, gntR), for a total of six essential pathway constituents (Fig. 5d and Supplementary Table 6). Genes involved in the use of specific sugars tend to cluster and form operons of potentially coregulated genes (regulons) that support all or most of the essential TR–CP steps. The steps that are not encoded by the operon may be complemented in trans by genes encoding enzymes that may be less substrate-specific37,38. We searched for orthologues of the 24 genes in the two OTU476-like genomes and 3,111 MAGs. We generated two scores to evaluate the ability of bacterial species to use GalNAc. The first (pathway score) counted the number of essential steps in GalNAc use (out of six) that could be accomplished by the orthologues detected in the genome, irrespective of map position. The second (regulon score) counted the number of essential GalNAc-use steps that could be fulfilled by orthologues that were clustered in the genome, that is, forming a potential operon. We used agaS as an anchor gene to establish the regulon score28. The first notable observation was that at least one orthologue of agaS was found in the two (that is, all) OTU476-like strains, in 31% of Erysipelotrichaceae MAGs (n = 248), yet in only 2.9% of other MAGs (n = 2,863). The second was that both scores were significantly higher for Erysipelotrichaceae compared with the other MAGs (Ppathway = 2.0 × 10−16 and Pregulon = 2.0 × 10−16), and for the two OTU476-like strains compared with Erysipelotrichaceae and non-Erysipelotrichaceae MAGs combined (Ppathway = 2.2 × 10−3 and Pregulon = 1.2 × 10−5) (Extended Data Fig. 10). The two OTU476-like strains were characterized by a cluster with eight GalNAc genes, including orthologues of the four components of the AgaPTS transporter system (agaE, agaF, agaV, agaW), of nagA deacetylase, of agaS deaminase/isomerase, of fruK kinase and of the gatZ-kbaZ aldolase subunit (Extended Data Fig. 10). This amounted to a score of five for both pathway and regulon, corresponding, respectively, to the top 4.7% and 0.35% out of 3,113 MAGs, demonstrating the uncommon status of OTU476-like strains with regard to GalNAc use. Neither the 4-15-1 nor 4-8-110 genome encode GHs that are known to have α-N-acetylgalactosaminidase activity specific for the A antigen39,40,41, suggesting that these strains are acceptors only.

To confirm that the OTU476-like strains are able to import and catabolize GalNAc, we grew the 4-8-110 strain in the presence of 13C-labelled GalNAc and checked the appearance of 13C-labelled catabolites in the bacterial pellet using gas chromatography coupled with MS (GC–MS). After 13 h of culture, around 29% of glyceraldehyde-3-P and around 25% of dihydoxyacetone phosphate (the products of the conversion of tagatose-1,6-bisphosphate by gatZ aldolase) were labelled with 13C, demonstrating the use of GalNAc by strain 4-8-110. Approximately 23% of 3-phosphoglycerate and around 17% of lactate molecules but none of TCA metabolites were 13C-labelled, underscoring the predominance of glycolysis over oxidative phosphorylation in this anaerobe (Fig. 5d, e).

We further performed a gavage experiment in germ free mice to test the effect of the addition of GalNAc on the relative growth of 4-8-110 and Escherichia coli in vivo. We gavaged mice either with GalNAc (200 mg kg−1 live weight, yielding caecal GalNAc concentrations comparable to AA pigs; Extended Data Fig. 9) or phosphate-buffered saline (PBS), as well as a mixture of E. coli and the 4-8-110 OTU476-like strain. We euthanized the mice at day 12 and measured the relative abundance of E. coli and 4-8-110 using 16S rRNA sequencing in the caecum and faeces. Strain 4-8-110 was in essence not detectable in PBS-gavaged mice, whereas it accounted for an average of 0.9% of reads in the caecum content (P = 0.0079) and 0.7% of reads in the faeces (P = 0.0097) of GalNAc-gavaged mice (Fig. 5f).

## GalNAc operon of miQTL-sensitive bacteria

At least 65 MAGs (including E. coli and other non-Erysipelotrichacea species) contain orthologues for the five key GalNAc TR–CP steps as do 4-15-1 and 4-8-110. We wondered why the chromosome 1 miQTL does not affect these species. The organization of their GalNAc operons may provide a hint. The GalNAc clusters of non-Erysipelotrichaceae species have the features of genuine regulons. The relevant ORFs tend to be adjacent to each other (spanning around 10 kb) and on the same strand, compatible with polycistronic mRNAs enabling coregulated expression. By contrast, the open reading frames (ORFs) of the GalNAc clusters of the two OTU476-like strains and at least one studied Erysipelotrichaceae span around 50 kb and 30 kb, respectively, and are distributed on both strands. Neither genome contained orthologues of agaR or gntR, which are negative regulators of GalNAc regulons and were observed in all other GalNAc-rich MAGs (Fig. 6a and Extended Data Fig. 11). This suggests that, in contrast to E. coli and other species, the OTU476-like strains and some Erysipelotrichaceae cannot sense GalNAc concentrations and induce expression of the genes necessary for GalNAc use only when needed33,35,36, but rather may express these constitutively. To test this hypothesis, we grew the two OTU476-like strains and E. coli with and without GalNAc in the medium and analysed their transcriptome using RNA-seq. The GalNAc regulon of E. coli was tightly regulated, with a near complete absence of transcription in the absence of GalNAc, and around 300-fold upregulation in its presence. By contrast, in the OTU476-like 4-15-1 strain, GalNAc TR–CP genes were expressed at nearly equal levels with and without GalNAc, as predicted. Strain 4-15-1 expression levels in the absence of GalNAc were higher than E. coli expression levels in the presence of GalNAc. The expression pattern in the 4-8-110 strain was intermediate between E. coli and 4-15-1: GalNAc genes were expressed in the absence of GalNAc (albeit at low levels) yet were upregulated (around five to tenfold) in its presence (Fig. 6a). The difference in the response of the transcriptome to GalNAc addition between OTU476-like strains and E. coli response was not limited to the GalNAc operon. Although addition of GalNAc did not have noticeable effects on genes outside of the GalNAc regulon in E. coli, it altered the expression of ≥605 and 225 genes (q ≤ 0.05) in 4-15-1 and 4-8-110, respectively (Fig. 6b). Six KEGG pathways were consequently perturbed in both strains (FDR ≤ 0.05): metabolic pathways, pyrimidine metabolism, alanine, aspartate and glutamate metabolism, biosynthesis of antibiotics, ABC transporters and biosynthesis of secondary metabolites (Supplementary Table 7).

## Discussion

In humans, the ABO genotype affects susceptibility to various viral (including SARS-CoV-1 and 2), bacterial and protozoan pathogens20,42,43, and possibly the composition of the intestinal microbiota21,22,23. Invoked mechanisms are usually immune related, including pathogen adhesion, toxin binding, soluble decoys and natural antibodies26,44. We provide strong evidence that the miQTL reported in this research functions by affecting intestinal GalNAc concentrations, and thereby the growth of GalNAc-using bacteria. A puzzling finding is that not all species with the ability to use GalNAc as a carbon source appear to be affected. We suggest that this could be connected to the observation that the GalNAc operon is inducible in species like E. coli, while being constitutively expressed at high levels in at least some of the OTU476-like strains. The GalNAc gene cluster as seen in the OTU476-like strains is a possible evolutionary intermediate towards the formation of a genuine regulon as seen in E. coli, facilitating horizontal transmission of a ‘selfish’ functional gene ensemble even if not yet adaptively coregulated37. The distinct behaviour of the GalNAc operon in the 4-15-1 and 4-8-110 strains in response to GalNAc may be indicative of an evolving regulatory mechanism distinct from the canonical repressor-based regulon system. These findings suggest an alternative modus operandi of the miQTL. Bacteria affected by it may not be at an advantage when GalNAc is present at a relatively higher concentration in the intestinal content (as in AA and AO animals), but rather at a disadvantage when GalNAc is present at low concentrations (as in OO animals) due to wasting energy transcribing and translating useless genes. By tightly regulating the expression of their operon in response to ambient GalNAc availability, species like E. coli may fair equally well in the gut of AA/AO pigs as in that of OO pigs and, therefore, not be affected by the miQTL. Note that, in addition to the distinct organization of their GalNAc operon, the transcriptome of the two studied OTU476-like strain responds more to GalNAc addition than that of E. coli. The precise significance of this observation remains to be established but may indicate alternative or additional factors that underpin differential sensitivity to the miQTL.

FUT2-null alleles segregate in humans and in homozygotic individuals; this precludes synthesis of the type I H antigen and therefore epistatically of the type I A and/or B antigens in non-O individuals20,21 (Extended Data Fig. 8). Although the FUT2H117R variant segregated in our population, there was no evidence for an effect of that or any other FUT2 variant on the abundance of OTU476, OTU327 or p-75-a5 with or without conditioning on AO genotype (data not shown).

The effect of the ABO genotype on intestinal microbiota composition in humans remains somewhat controversial. Despite suggestive evidence in a small (n = 71) cohort of separate microbiota-based clustering of AB and B versus A and O individuals45, a subsequent study in a larger cohort (n = 1,503) could not detect experiment-wide significant effects of the ABO genotype on gut microbiota composition46. More recently, an effect of ABO genotype was reported on the abundance of OTUs assigned respectively to Faecalibacterium and Bacteroides in a cohort of around 9,000 German individuals21, but this was not confirmed in a subsequent study (~18,000 individuals)47. We examined the effect of the ABO blood group on the abundance of around 75 OTUs assigned to Erysipelotrichaceae in human gut samples (Extended Data Fig. 12). None of the OTUs detected in human were as closely related to the pig OTU476, OTU327 or p-75-a5 as these were to each other. We found no evidence for an effect of ABO blood group on the abundance of any of these. What underlies the difference between pigs and humans is unclear. Either Erysipelotrichaceae strains susceptible to the ABO genotype are not present at sufficient levels in humans, or the carbohydrate composition of human intestinal content makes these strains less sensitive to variations in GalNAc concentrations. The abundance of p-75-a5 was found to differ significantly between African subsistence categories and to be highest in pastoralists (as compared to hunter-gatherers and agro-pastoralists) possibly as a result of interaction with livestock48,49. Repeating the experiments in pastoralists may reveal the same miQTL effect detected in this study.

While this paper was in the last phase of the publication process two papers came out reporting an effect of ABO genotype on intestinal microbiota composition in human, albeit on distinct taxa22,23.

## Methods

### Animal rearing and sample collection

This study focused on the sixth (F6) and seventh (F7) generation of a mosaic population generated as follows. An average of 3.6 boars (range, 3–4) and 4 sows (range, 2–5) from four indigenous Chinese pig breeds (Erhualian (EH), Bamaxiang (BX), Tibetan (TB), Laiwu (LA)) and four commercial European/American pig breeds (Landrace (LD), Large White (LW), Duroc (WD) and Piétrain (PT)) were successfully mated, constituting the F0 generation. For each Chinese breed, the boars were mated with the sows of one European breed, and the sows with the boars of another European breed to produce the F1 generation. Thus, every Chinese and every European breed is parent breed of two distinct F1 hybrid combinations each, for a total of eight F1 combinations (BX–LW, BX–PT, LA–PT, LA–LD, TB–LD, TB–WD, EH–WD, EH–LW). The F2 generation was obtained by mating each F1 hybrid combination with two others that did not share parental breeds for a total of eight F2 combinations (BX–LW × LA–PT, BX–PT × LA–LD, LA–PT × TB–LD, LA–LD × TB–WD, TB–LD × EH–WD, TB–WD × EH–LW, EH–WD × BX–LW, EH–LW × BX–PT). Every F2 combination was obtained by reciprocally crossing an average of 4 boars from one F1 combination with an average of 7.25 sows from the other. The F3 generation was obtained by mating each of the eight F2 hybrid combinations with the only complementary F2 combination that did not share any parental breeds for a total of four F3 combinations (BX–LW–LA–PT × TB–LD–EH–WD, BX–PT–LA–LD × TB–WD–EH–LW, LA–PT–TB–LD × EH–WD–BX–LW, LA–LD–TW–WD × EH–LW–BX–PT) expected to each have around 12.5% of their genome from each of the founder breeds. Every F3 combination was obtained by reciprocally crossing an average of 7 boars from one F2 combination with an average of 10.8 sows from the complementary one. The F4, F5, F6 and F7 generations were obtained by intercrossing 57 boars × 75 sows (F3 to F4), 62 boars × 97 sows (F4 to F5), 85 boars × 170 sows (F5 to F6) and 82 boars × 111 sows (F6 to F7) (Supplementary Table 1).

All F6 and F7 animals were born and reared at the experimental farm of the National Key Laboratory for swine Genetic Improvement and Production Technology, Jiangxi Agricultural University (Nanchang, Jiangxi) under standard and uniform housing and feeding conditions. Piglets remained with their mother during the suckling period and were weaned at about 46 days of age. Litters were transferred to 12-pig fattening pens with automatic feeders (Osborne Industries), minimizing splitting and merging of litters. All pigs were fed twice per day with formula diets containing 16% crude protein, 3,100 kJ digestible energy, 0.78% lysine, 0.6% calcium and 0.5% phosphorus. Water was available ad libitum from nipple drinkers. Males were castrated at 80 days. Faecal samples were manually collected from the rectum of experimental pigs at the ages of 25, 120 and 240 days, dispensed in 2 ml tubes, flash-frozen in liquid nitrogen and stored at −80 °C. The animals were slaughtered at day 240. The ileum and caecum were sealed at both ends with a sterile rope and extracted from the carcass. Within 30 min after slaughter, ileal and caecal luminal content were collected (F6 and F7 animals), the ileum and caecum were rinsed with sterile saline solution, and samples of ileal and caecal mucosa were scraped with a sterile microscopic slide (F7 animals only). Approximately 1 g of content or scrapings was packed in 2 ml sterile freezer tubes, flash-frozen in liquid nitrogen and stored at −80 °C. The number of samples of the different types available for further analysis are provided in Supplementary Table 2.1. All of the animals included in the analyses were healthy and did not receive any antibiotic treatment within one month of sample collection. All of the procedures involving animals were approved by the Ethics Committee of the Jiangxi Agricultural University (no. JXAU2011-006).

### Genotyping by sequencing the F0, F6 and F7 generations

Genomic DNA was extracted from ear punches using a standard phenol–chloroform-based DNA-extraction protocol. DNA concentrations were measured using the Nanodrop-1000 instrument (Thermo Fisher Scientific), and the DNA quality of all of the samples was assessed by agarose (0.8%) gel electrophoresis. Genomic DNA was sheared to 300–400 bp fragment size. The 3′-ends were adenylated and indexed primers were ligated. The libraries were amplified by PCR using Phusion High-Fidelity DNA polymerase (NEB) according to the recommendations of the manufacturer (Illumina). The libraries were loaded onto the Illumina X-10 instrument (Illumina) for 2 × 150 bp paired-end sequencing by Novogene. We removed reads with a quality score of ≤20 for ≥50% of bases or ≥10% missing (N) bases. Read quality was checked using Fastqc (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Clean reads were aligned to the S. scrofa reference genome assembly 11.150 using BWA (v.0.7.17)51. BAM files of mapped reads were sorted by chromosome position using SAMTools (v.1.6)52. Indel realignment and marking of duplicates were performed using Picard (v.2.21.4, http://broadinstitute.github.io/picard). Individual genotypes were called from BAM files using Platypus (v.0.8.1)53. Individual genotypes were merged into a single VCF file using PLINK (v.1.9)54 encompassing a total of 39.3 million variants including 31,094,663 SNPs and 8,266,390 INDELs. Missing genotypes were imputed with Beagle (v.40)55. Genomic variants with MAF < 0.03 were removed.

### Computing nucleotide diversity

Nucleotide diversity between pairs of breeds was computed from variant frequencies as follows:

$${\pi }_{i}=(\mathop{\sum }\limits_{j=1}^{{n}_{i}}1-({f}_{ij}^{{\rm{A}}}\times {f}_{ij}^{{\rm{B}}})-((1-{f}_{ij}^{{\rm{A}}})\times (1-{f}_{ij}^{{\rm{B}}})))/w$$

where πi is the nucleotide diversity in window i, ni is the number of variants in window i, $${f}_{{ij}}^{{\rm{A}}}$$ is the frequency of variant j of window i in breed A,$$\,{f}_{{ij}}^{{\rm{B}}}$$ is the frequency of variant j of window i in breed B, and w is the size of the windows in base pairs. The overall nucleotide diversity for a pair of breeds A and B was computed as the average of πi across all of the windows. The numbers reported are averages of overall nucleotide diversities for multiple pairs of breeds (within European, within Chinese, between European, between Chinese, between European and Chinese), computed for a window size of 1 million base pairs.

### Estimating the contribution of the eight founder breeds in the F6 and F7 generation at the genome and chromosome level

We estimated the proportion of the genome of the eight founder breeds in the F6 and F7 generation according to ref. 56. Assume that the total number of variants segregating in the mosaic population is nT. Each of these variants has a frequency in each one of the founder breeds which we denote $${f}_{1}^{0.1}\to {f}_{{n}_{{\rm{T}}}}^{0.1}$$ for breed 1, $${f}_{1}^{0.2}\to {f}_{{n}_{{\rm{T}}}}^{0.2}$$ for breed 2, and so on, as well as a frequency in the F6 (or F7) generation, which we refer to as $${f}_{1}^{6}\to {{\rm{f}}}_{{n}_{{\rm{T}}}}^{6}$$. We assume that there is a total of B breeds. We denote the proportion of the genome of breed 1 in generation F6 (or F7) as P1, of breed 2 in generation F6 (or F7) as P2, and so on. We estimated the values of P1, P2, and so on, using a set of linear equations:

$${f}_{1}^{6}={\sum }_{j=1}^{B}({P}_{j}\times {f}_{1}^{0.j})$$
$$\vdots$$
$${f}_{i}^{6}={\sum }_{j=1}^{B}({P}_{j}\times {f}_{i}^{0.j})$$
$$\vdots$$
$${f}_{{n}_{{\rm{T}}}}^{6}={\sum }_{j=1}^{B}({P}_{j}\times {f}_{{n}_{{\rm{T}}}}^{0.j}).$$

We used standard least square methods (lm function in R) to find the solutions of Pj that minimize the residual sum of squares. This was done for the entire genome, as well as by autosome.

### 16S rRNA data collection and processing

Microbial DNA was extracted from faeces, luminal content and mucosal scrapings using the QIAamp Fast DNA stool Mini Kit according to the manufacturer’s recommendations (Qiagen). DNA concentrations were measured using the Nanodrop-1000 instrument (Thermo Fisher Scientific), and DNA quality was assessed by agarose (0.8%) gel electrophoresis. The V3–V4 hypervariable region of the 16S rRNA gene was amplified with the barcode fusion primers (338F: 5-ACTCCTACGGGAGGCAGCAG-3, 806R: 5-GGACTACHVGGGTWTCTAAT-3) with 56 °C annealing temperature. After purification, PCR products were used for constructing libraries and sequenced on the Illumina MiSeq platform (Illumina) at Major Bio. The 16S rRNA sequencing data were submitted to the CNGB database and have accession number CNP0001069. The raw 16S rRNA gene sequencing reads were demultiplexed and primer and barcode sequences were trimmed using Trimmomatic (v.0.39)57. Reads with ≥10 consecutive same or ambiguous bases were eliminated. Clean paired-end reads were merged (minimum 10 bp overlap) into tags using FLASH (v.1.2.11)58. The average number of tags per sample was around 40,888 (Supplementary Table 2.1). Chimeric reads were removed using USEARCH (v.7.0.1090)59. Sequencing data were rarefied to 19,632 tags, that is, the lowest number of tags per sample. Tags were clustered in OTUs with VSEARCH (v.2.8.1)60 using 97% as the similarity threshold. OTUs that would not have ≥3 reads in at least two samples or were detected in ≤0.2% of the samples were ignored. In the end, 12,054 OTUs accounting for an average of 98.7% of total reads per sample were used for further analysis. The mean number of tags for the 12,054 OTUs retained for further analyses was 1.6 (range, 0.01– 702.2). OTUs were matched to taxa using the Greengenes (v.13.5) database and the RDP classifier (v.2.2)61. PCoA was performed with the ape and vegan R packages (v.3.5.3) using Bray–Curtis dissimilarities. Shannon’s index was used as α-diversity metric and computed using mothur (v.1.43.0)62. Bray–Curtis dissimilarity was used as β-diversity metric and computed using vegdist of the vegan package in R (v.3.5.3). The mouse faecal microbiome data were from ref. 63. The human faecal microbiome data were 16S rRNA data from 106 healthy individuals (L.S. et al., manuscript in preparation).

### Measuring the heritability of microbiota composition (all taxa combined)

We first estimated the effect of host genetics on the composition of the intestinal microbiome by measuring the correlation between genome-wide kinship and microbiome dissimilarity. We computed genome-wide kinship (Θ) for all pairs of relevant individuals using the SNP genotypes at the above-mentioned 30.2 million DNA variants using either GEMMA (v.0.97)64 or GCTA (v.1.26)65. Both programs yielded estimates of Θ with the same distribution after standardization, albeit different raw values. We report results obtained using GEMMA (v.0.97). Microbiome dissimilarity was measured using the Bray–Curtis dissimilarity computed using the vegan R function66 and abundances of all OTUs. We computed Spearman’s (rank-based) correlations using the corrtest function in R (v.3.5.3). We first performed this analysis for each sample type and generation separately within litter, that is, only considering pairs of full-siblings born within the same litter, therefore in essence following ref. 67. We then performed the analysis across the F6 and F7 generations. The pairs of individuals considered were all F6–F7 animal pairs, which included no parent–offspring pairs. To account for dependencies characterizing the data, P values were determined empirically by permutation testing. We performed 1,000 permutations of kinship coefficients and Bray–Curtis dissimilarities within litter. Vectors of OTU abundances were permuted within litters, Bray–Curtis distances were recomputed and correlated with the unpermuted kinships. The empirical P value was determined as the proportion of permutations that yielded a Spearman’s correlation that was as low or lower than that obtained with the real data.

### Measuring the heritability of the abundance of individual taxa

Heritabilities of log-transformed abundances of specific taxa/OTUs were estimated using a linear mixed model implemented with the lme4QTL R-package (v.3.5.3)68. P values for the heritability estimates were computed using the associated lrt_h2 function68. We first estimated heritabilities using information from the F6 and F7 generation jointly (F6 plus F7). The model included individual animal (that is, polygenic effect), dam, litter, pen and batch as random effects and generation as a fixed effect. The additive genetic relationship matrix determining covariances between individual animal effects was computed using GCTA (v.1.26)65. This F6 plus F7 analysis extracts information from the correlation between kinship and phenotypic resemblance (1) between F6 animals (within F6), (2) between F7 animals (within F7) and (3) between F6 and F7 animals (between F6 and F7). To evaluate whether these three sources of information were consistent, we also estimated—for each taxon or OTU—the three heritabilities separately. The within F6 and within F7 heritabilities were estimated using the same mixed model (except for the absence of the fixed generation effect) and lme4QTL package as for the F6 plus F7 analysis. The between F6 and F7 heritabilities were estimated by regressing the squared difference in abundance on additive relationship according to ref. 69—narrow sense heritability was estimated as $$-\hat{\beta }/(2{\hat{\sigma }}_{P}^{2})$$ where $$\hat{\beta }$$ is the least-square regression coefficient and $${\hat{\sigma }}_{P}^{2}$$ an estimate of the phenotypic variance (Extended Data Fig. 3). The total heritability of the intestinal microbiome was further computed from the heritabilities and abundance of individual taxa/OTU according to ref. 2.

### Mapping miQTL

miQTL were mapped using the GenABEL R package (v.3.5.3)70, applying two models according to ref. 11. The first fitted a linear regression between allelic dosage and log10-transformed taxa abundance (additive model). It was applied to all SNPs with MAF ≥ 0.05 (in the corresponding data series) and taxa (that is, at all taxonomic levels including OTU) with non-null abundance in at least 20% of samples (in the corresponding data series), ignoring samples with null abundance if those represented more than 5% of samples. The second fitted a logistic regression model between allelic dosage and taxon presence/absence in the corresponding sample (binary model). It was applied only to taxa present in ≥20% and ≤95% of individuals and SNPs with MAF ≥ 10% (as the test statistic was inflated under the null when using this model with MAF < 10%; Extended Data Fig. 4). Both models included sex, slaughter batch (21 for F6, 23 for F7) and the three first genomic principal components as fixed covariates. GWAS were conducted separately for each taxon × data series combination and P values were concomitantly adjusted for residual stratification by genomic control. P values were combined across traits and/or taxa using a Z-score. P values were converted to signed Z-values using the inverse of the standard normal distribution and summed to give a Z-score. Z-scores were initially calculated using METAL (v.3.0)71. To compute the P value of the corresponding Z-score while accounting for the correlation that exists between the phenotypic values of a given cohort across traits, we also computed the genome-wide (that is, across all of the tested SNPs) average ($$\bar{Z}$$) and standard deviation ($${\sigma }_{Z}$$) of the Z score. The P value of Z scores was (conservatively) computed by assuming that $$\left(Z-\bar{Z}\right)/{\sigma }_{Z}$$ is distributed as N (0,1) under the null hypothesis. Both approaches yielded similar results.

### De novo assembly of the A allele of the porcine AO acetyl-galactosaminyl transferase gene

We extracted high-quality genomic DNA from longissimus dorsi of a Bamaxiang female using a phenol–chloroform-based extraction method (Novogene Biotech). A 40 kb SMRTbell DNA library (Pacific Biosciences) was prepared using BluePippin for DNA size selection (Sage Science) and then sequenced on the PacBio Sequel platform (Pacific Biosciences) with P6-C4 chemistry at Novogene Biotech. We obtained a total of 18,148,470 subreads with an N50 length of 17,273 bp. Moreover, a paired-end library with insert size of 350 bp was constructed and sequenced on the Illumina Novaseq 6000 PE150 platform (2 × 150bp reads) at Novogene Biotech. PacBio reads were self-corrected using Canu (v.1.7.1) before assembly with Flye (v.2.4.2)72. Errors in the primary assembly were first corrected using PacBio subreads using racon (v.1.4.10)73, and Illumina paired-end reads were then mapped to the contigs using bwa-mem74 to polish the contigs using Pilon (v.1.23, Broad Institute)75. Lastz (v.1.02.00)76 and Minimap2 (v.2.17-r941)77 were used to compare the Bamaxiang contig and the 40 kb sequence spanning the ABO gene of the S. scrofa build 11.1 reference genome.

### Developing a PCR assay to distinguish between AA, AO and OO pigs

We designed two pairs of primers to genotype the deletion in the F6 and F7 populations. The first pair of primers was located respectively within intron 7 of the ABO gene and downstream of the deletion (FP: 5′-GAGTTCCCCTTGTGGCTCAGT-3′, RP: 5′- TTGCCTAAGTCTACCCCTGTGC-3′). The second pair of primers was located in exon 8 (FP2: 5′-CGCCAGTCCTTCACCTACGAAC-3′, RP2: 5′-CGGTTCCGAATCTCTGCGTG-3′). PCR amplification was performed in a 25 µl reaction containing 50 ng genomic DNA and 1.5 U of LA Taq DNA polymerase (Takara) under thermocycle conditions of 94 °C for 4 min; 35 cycles of 94 °C for 1 min, 1 min at specific annealing temperature for each set of primers and 72 °C for 2 min; and 72 °C for 10 min on a PE 9700 thermal cycler (Applied Biosystem).

### RNA-seq and eQTL analysis

A total of 300 caecum tissue samples from F7 pigs that also had microbiota and genotype data were used to extract total RNA with TRIzol (Invitrogen) according to the manufacturer’s manual. Total RNA was electrophoresed on 1% agarose gels. RNA purity and integrity were assessed using an eNanoPhotometer spectrophotometer (Implen) and a Bioanalyzer 2100 system (Agilent Technologies). A Qubit3.0 Fluorometer (Life Technologies) was used to measure RNA concentrations. Total RNA (2 µg) of each sample was used to constructed RNA-seq libraries, using the NEBNext UltraTMR NA Library Prep Kit for Illumina (NEB) according to the manufacturer’s protocol. In brief, oligo(dT) magnetic beads (Invitrogen) were used to enrich mRNA, which was then fragmented using a fragmentation buffer (Ambion, USA). cDNA was synthesized using 6 bp random primers and reverse transcriptase (Invitrogen). After purification, cDNA was end-repaired, and index codes and sequencing adaptors were ligated. After PCR amplification, purification and quantification, the libraries were sequenced on the Novaseq-6000 platform using 2 × 150 bp paired-end sequencing. Clean data were obtained by removing adapter reads, poly-N and low-quality reads from raw data. Cleaned reads from each sample were mapped to the complete ABO sequence from the Bamaxiang reference genome with the A allele at the ABO locus constructed by the authors using STAR (v.020201)78. Samtools (v.1.6)52 was used to convert SAM format to BAM format. The read counts mapping to ABO (exon 1 to 7) were quantified for each sample using featureCounts (v.1.6.4)79. The expression abundance of the ABO gene was normalized to FPKM. Gender and batch were treated as covariates to correct for gene expression levels, and the corrected residuals were used for subsequent analyses. GEMMA (v.0.97)64 was used to analyse the association of ABO expression level with genome-wide variants using a linear mixed model.

### Whole-genome sequencing and bioinformatic analysis for wild boars, S. verrucosus and S. cebifrons

The genomes of six Russian wild boars, one Sumatran wild boar and one African warty hog were sequenced on the Illumina HiSeq X Ten platform at Novogene Biotech. Furthermore, six Chinese wild boars were sequenced in a previous study80, and we downloaded the genome sequencing data for eight other pigs from NCBI. Finally, we used a total of 22 genomes to call SNPs in the porcine ABO gene using GATK (v.4.2)81. We replaced the ABO gene of the S. scrofa build 11.1 genome with the 50 kb Bamaxiang contig sequence containing the A allele of ABO gene. The cleaned reads of the 22 individuals were aligned to the modified S. scrofa reference genome (build 11.1) using BWA (v.0.7.17)51.

### Phylogenetic analysis of the O alleles in the Sus genus

We applied GATK (v4.2) to perform indel realignment, and proceeded to SNP and INDEL discovery and genotyping with UnifiedGenotyper across all 83 samples simultaneously using standard hard filtering parameters according to GATK best practices recommendations81,82. We restricted the analysis to the 14 AA and 34 OO animals, therefore circumventing the need to phase the corresponding genotypes. We defined windows of varying size (0.5 to 50 kb) centred around the 2.3 kb deletion. For all pairs of individuals, we computed a running sum over all of the variants in the window adding 0 when both animals had genotype AA (alternate) or RR (reference), 1 when one animal was AA and the other RR, and 0.5 in all other cases. The nucleotide diversity for the corresponding animal pair was then computed as the running sum divided by the window size in bp. We ignored the variants located in the 2.3 kb deletion in this computation. The ensuing matrix of pairwise nucleotide diversities was then used for hierarchical clustering and dendrogram construction using the hclust (method=“average”) R function corresponding to the unweighted pair group method with arithmetic mean (UPGMA).

### Analysis of population differentiation

We quantified the degree of population differentiation by computing the effect of breed on the variance of allelic dosage using a standard one-way ANOVA fixed-effect model and a F-statistic computed as the ratio of the between breed mean squares (BMS) and within breed mean squares (WMS)83. BMS and WMS were computed as:

$${\rm{BMS}}=(\mathop{\sum }\limits_{i=1}^{B}\mathop{\sum }\limits_{j=1}^{{n}_{i}}{(\overline{{y}_{i}}-\overline{{y}_{{\rm{T}}}})}^{2})/(B-1)$$
$${\rm{WMS}}=(\mathop{\sum }\limits_{i=1}^{B}\mathop{\sum }\limits_{j=1}^{{n}_{i}}{({y}_{ji}-\overline{{y}_{i}})}^{2})/({N}_{{\rm{T}}}-B)$$

where yij is the allelic dosage of the alternate allele in individual j (of ni) of breed i (of B), $$\bar{{y}_{{\rm{i}}}}$$ is the average allelic dosage in breed i (of B), and $$\bar{{y}_{{\rm{T}}}}$$ is the average allelic dosage in the entire dataset. We computed the average of the corresponding F-statistic for all variants within a sliding window of fixed physical size (2 kb in Fig. 5e), and took the inverse of this mean as measure of population similarity. The corresponding profiles were nearly identical to those obtained by computing average FST values (fixation index) across variants (and taking the inverse) following ref. 84.

### Profiling ABO gene expression level at various adult and embryo tissues

Total RNA was extracted using Trizol from 15 tissues (lung, hypophysis, skin, spinal cord, liver, spleen, muscle, hypothalamus, heart, blood, brain, caecum, stomach, duodenum and kidney) collected from an adult Bamaxiang sow and a Duroc pig embryo (day 75). RNA quality was monitored by agarose (1%) gel electrophoresis, and using the RNA Nano 6000 Assay Kit of the Bioanalyzer 2100 system (Agilent Technologies). RNA concentration was measured using Qubit RNA Assay Kit in Qubit 2.0 Flurometer (Life Technologies). Total RNA (1 µg) of each sample was used to constructed RNA-seq libraries. Sequencing libraries were generated using the TruSeq RNA Library Preparation Kit (Illumina) according to the manufacturer’s recommendations, and index codes were added to attribute sequences to each sample. In brief, mRNA was purified from total RNA using poly-T oligo-attached magnetic beads. First-strand cDNA was synthesized using random hexamer primer and M-MuLV Reverse Transcriptase (RNase H). Second-strand cDNA synthesis was subsequently performed using DNA polymerase I and RNase H. The remaining overhangs were converted into blunt ends by exonuclease/polymerase activities. After adenylation of the 3′ ends of the DNA fragments, Illumina adaptors were ligated. To select cDNA fragments of preferentially around 350–400 bp in length, the library fragments were purified with AMPure XP system (Beckman Coulter). PCR was performed using Phusion High-Fidelity DNA polymerase, Universal PCR primers and Index (X) Primer. PCR products were purified (AMPure XP system) and the library quality was assessed on the Agilent Bioanalyzer 2100 system. Clustering of the index-coded samples was performed on the cBot Cluster Generation System using the TruSeq PE Cluster Kit v3-cBot-HS (Illumia) according to the manufacturer’s instructions. Sequencing was performed on the Illumina NovaSeq platform and 150 bp paired-end reads were generated. Filtered reads were obtained by removing adapter sequences, poly-N and low-quality reads. Cleaned reads were mapped to the complete ABO sequence from the Bamaxiang reference genome sequence using HISAT2 (v.2.2.1)85. Samtools (v.1.6)52 was used to convert SAM format to BAM format. The read counts mapping to ABO (exon 1 to 7) were quantified for each sample using featureCounts (v.1.6.4)79. To adjust for the effect of sequencing depth, the expression abundance of ABO gene was normalized to Transcripts Per Million (TPM). Expression abundance of ABO was used to cluster and visualize the expression level of the 15 tissues from an adult Bamaxiang sow and a Duroc pig embryo using the functions dist(), hclust(), as.dendrogram() and set() implemented in R package stats and dendextend.

### Testing the effect of the AO genotype on other phenotypes

The associations between the 2.3 kb ABO deletion and 150 traits were calculated in the F6 and F7 populations based on a meta-analysis combining the effects. The observed P value for a trait was calculated by testing a weighted mean of Z scores from F6 and F7 generations as follows: Z = (Z1W1 + Z2W2)/(W1 + W2), where Z1 = b1/SE1 and Z2 = b2/SE2, W1 = 1/(SE1)2 and W2 = 1/(SE2)2, where the subscripts 1 and 2 denote F6 and F7 generations, respectively; b1, b2, SE1 and SE2 were additive effects and standard errors of ABO locus on a given trait estimated from a linear mixed model, which accounted for population structure using a genomic relationship matrix derived from whole genome marker genotypes. A total of 250 and 254 traits were tested in the F6 and F7 generations, and 150 traits that were shared in the F6 and F7 generations were used for meta-analysis.

### Determination of the concentration of N-acetyl-galactosamine in caecal lumen

Targeted LC–MS/MS analysis was performed to determine the concentration of the mixture of GalNAc, GlcNAc and ManNAc isomers (referred to as HexNAc) in caecal lumen samples. Samples of caecal content were collected at D240 for 124 F7 and 154 animals from a Duroc × Landrace × Large White commercial population. The samples were lyophilized, grounded into powder by a Mixer Mill MM 400 (30 Hz, 1 min) (Retsch) and stored at −80 °C until use. Approximately 15 mg of powder was weighed and extracted with 500 µl of 70% methanol/water. The mixture was vortexed for 10 min, and centrifuged at around 16,000g and 4 °C for 10 min. Supernatant (300 μl) was transferred into a new centrifuge tube, placed at −20 °C for 30 min, and centrifuged again at 16,128g at 4 °C for 3 min. Finally, 150 μl of the supernatant was collected for further LC–MS analysis. Separation was performed in a 50 mm × 2.1 mm × 1.8 µm ACQUITY UPLC BEH C18 Column (Waters) in an ExionLC AD System (AB Sciex). Linear ion trap and triple quadrupole scans were performed on the Applied Biosystems 6500 Triple Quadrupole (QTRAP 6500) system equipped with an ESI Turbo Ion-Spray interface. The operation was performed under positive ion mode and controlled by Analyst software (v.1.6.3) (Sciex). The ESI source parameters were set as follows: ion source: ESI+, source temperature: 550 °C, ion spray voltage (IS): 5,500 V (Positive), curtain gas (CUR): 35 psi. The GalNAc was analysed using multiple reaction monitoring. We established a standard curve using increasing and known amounts of GalNAc standard and used it to determine GalNAc concentrations per gram of dried caecal content.

### Measuring the correlation between GalNAc concentrations and bacterial abundance within the AO genotype

Raw AUC (GalNAc) or bacterial abundances were showing considerable batch-to-batch variation both with regards to means and variances. We therefore transformed them to ranks within batch and projected these uniformly between 0 and 1, yielding equalized means of 0.5 and standard deviations ≈ 0.3 (scaled batch-corrected ranks). We then applied the same approach within AO genotype to the scaled batch-corrected ranks yielding scaled, batch- and genotype-corrected ranks. We computed Spearman’s correlations between the GalNAc and abundance scaled ranks.

### Isolating 4-8-110 and 4-15-1

Faecal samples were collected from the rectum of healthy pigs at about 120 days and transferred immediately to anaerobic conditions. Fresh samples were homogenized with sterile 1× PBS (pH 7.0) in an anaerobic glovebox (Electrotek), which contained 10% hydrogen, 10% carbon dioxide and 80% nitrogen. The faecal suspension was diluted 10−6, 10−7, 10−8 and 10−9-fold, and plated on GAM medium (Nissui Pharmaceutical). Plates were incubated at 37 °C for 3 days in an anaerobic glovebox. Single clones were picked and streaked until pure colonies were obtained on GAM medium. Full-length 16S rRNA gene sequencing was performed after amplification using the following primers: 27 forward: 5′-AGAGTTTGATCCTGGCCTCAG-3′; and 1492 reverse, 5′-GGTTACCTTGTTACGACTT-3′). The isolates were stored at −80 °C in GAM broth containing 16% of glycerol until further use.

### Oxford Nanopore sequencing

The strains 4-8-110 and 4-15-1 were recovered on GAM medium. Cells were collected at the period of logarithmic growth. Genomic DNA was extracted using the Blood & Cell Culture DNA Midi Kit (Qiagen) according to the manufacturer’s protocol. Libraries for whole-genome sequencing of the strains were constructed and sequenced on an ONT PromethION (Oxford Nanopore Technology) at NextOmics. To correct sequencing errors, a library for second-generation sequencing was also constructed for each of the two strains and sequenced (2 × 100 bp) on the BGISEQ platform (BGI). Bioinformatic analyses of sequencing data were performed following refs. 72,86. In brief, after quality control, the sequencing data were assembled using flye (v.2.6)72 with the parameter: --nano-raw, and the assembled genomes were corrected by combining the Oxford Nanopore data with the second-generation sequencing data using pilon with the default parameters. The encoded genes were predicted using prodigal (v.2.6.3) (parameter: -p none-g 11)87.

### MAG assembly

A total of 92 faecal samples from eight pig populations, four intestinal locations and different ages were used for metagenomic sequencing and construction of MAGs. Microbial DNA was extracted as described above. The libraries for metagenomic sequencing were constructed according to the manufacturer’s instructions (Illumina), with an insert size of 350 bp for each sample, and 2 × 150 bp paired-ends sequenced on the NovaSeq 6000 platform. Raw sequencing data were filtered to remove adapter sequences and low-quality reads using fastp (v.0.19.41)88. Host genomic DNA sequences were filtered out using BWA (v.0.7.17)51. The clean reads of each sample were assembled into contigs using MEGAHIT (v.1.1.3) with the option ‘--min-count 2 --k-min 27 --k-max 87 --k-step 10 --min-contig-len 500’89. Single-sample metagenomic binning was performed using two different binning algorithms ‘--metabat2 --maxbin2’ using the metaWRAP package (v.1.1.1)90. The bins (MAGs) generated by the two binning algorithms were evaluated for quality and combined to form a MAG set using the bin_refinement module in metaWRAP (v.1.1.1). Metagenomic sequences were further assembled to optimize MAGs using the reassemble_bins module of metaSPAdes (v.3.15.3) in the metaWRAP pipeline. CheckM (v.1.0.12) was used to estimate the completeness and contamination of each MAG91. The MAGs with completeness <50% and contamination >5% were filtered out. Non-redundant MAGs were generated by dRep (v.2.3.2) at threshold of 99% average nucleotide identity (ANI)92.

### Bioinformatic analyses

Gene prediction in MAGs was carried out using the annotate_bins module in metaWRAP (v.1.1.1). The FASTA file of amino acid sequences translated from coding genes was used to perform KEGG annotation using Ghost KOALA tool (v.2.2)93 on the KEGG website (https://www.kegg.jp/ghostkoala/). Taxonomic classification of MAGs was performed using PhyloPhlAn (v.0.99)94. The graphs in Fig. 5b, c were generated using custom Perl (v.5.10.1) and R scripts (v.3.5.3). Pathway and regulon scores were computed using a custom Perl script. Both scores included (1) one point for import (having orthologues of either the four components of AgaPTS (agaE, agaF, agaV and agaW) and/or the three components of the TonB dependent transporter (omp, agaP and agaK) and/or the four components of the GnbPTS transporter (gnbA, gnbB, gnbC and gnbD), (2) one point for GalNAc deacetylase activity (having an orthologue of agaA and/or nagA), (3) one point for GalN deaminase/isomerase (having on orthologue of agaS), (4) one point for tagatose-6-P kinase (having an orthologue of pfkA and/or lacC and/or fruK), and (5) one point for tagatose-1,6-PP aldolase (having an orthologue of gatY and/or gatZ and/or lacD and/or fba). For the pathway score, the orthologues could be located anywhere in the MAG; for the regulon score, they had to be located on the same sequence contig and in close proximity (2.5% of genome size) to the anchor gene agaS28. For the top hits, we manually checked whether proximity was confirmed either by the replication of the order in more than one MAG and/or by the colocalization of the genes on one and the same sequence contig. The effect of the MAG type (OTU476-like, Erysipelotrichaceae and others), completion, contig and genome size on pathway and regulon scores was estimated using the R lm function, and was highly significant. The P values for the Erysipelotrichaceae versus other contrast were directly obtained from the lm function. To conservatively estimate the P value of OTU476-like versus Erysipelotrichaceae + others we generated score residuals corrected for completion, contig number and genome size, and determined how many MAGs had scores as high or higher than the OTU476-like strains. The P values reported in Extended Data Fig. 10 correspond to the square of these proportions as there are two OTU476-like strains with same GalNAc cluster organization.

### Determining the metabolic flux ratios from 13C experiments by GC–MS

13C-labelled GalNAc (N-acetyl-d-[UL-13C6]galactosamine) (99% isotopic purity of 13C) was bought from ZZBIO (Shanghai). The 4-8-110 strain was cultured in 6 ml of GAM medium (Nissui Pharmaceuticals) adding 1 g l−1 of 13C-labelled GalNAc or regular GalNAc (as control) in an anaerobic glovebox (Electrotek) in triplicate as described above. After 13 h of culture, the cultures were centrifuged for 5 min at 1,500g. Bacterial cells were resuspended in 0.6 ml of cold (−40 °C) 50% aqueous methanol containing 100 µM of norvaline as internal standard, plunged in dry ice for 30 min and then thawed on ice. We added 0.4 ml of chloroform, vortexed for 30 s, centrifuged for 10 min at 17,530g (4 °C), transferred the supernatant to new 1.5 ml tubes and dried the sample at −105 °C in a FreeZone Freeze Dryers (Labconco). Metabolites were derivatized for GC–MS analysis as follows. First, 70 μl of pyridine was added to the dried pellet and incubated for 20 min at 80 °C. After cooling, 30 µl of N-tert-butyldimethylsilyl-N-methyltrifluoroacetamide (Sigma-Aldrich) was added and the samples were reincubated for 60 min at 80 °C before centrifugation for 10 min at 17,530g (4 °C). The supernatant was transferred to an autosampler vial for GC/MS analysis. A Shimadzu QP-2010 Ultra GC-MS system was programmed for an injection temperature of 250 °C and injected with 1 µl of sample. The GC oven temperature started at 110 °C for 4 min, rising to 230 °C at 3 °C min−1 and then to 280 °C at 20 °C min−1 with a final hold at this temperature for 2 min. The GC flow rate with helium carrier gas was 50 cm s−1. The GC column used was a 20 m × 0.25 mm × 0.25 mm Rxi-5 ms. The GC–MS interface temperature was 300 °C and the ion source temperature was set at 200 °C, with 70 V ionization voltage. The mass spectrometer was set to scan m/z range 50–800, with a 1 kV detector. GC–MS data were analysed to determine isotope labelling. To determine 13C labelling, the mass distribution for known fragments of metabolites was extracted from the mass spectra. For each fragment, the retrieved data comprised mass intensities for the lightest isotopomer (without any heavy isotopes, M0), and isotopomers with increasing unit mass relative to M0. The mass distributions were normalized by dividing by the sum over all isoptomers and corrected for the natural abundance of heavy isotopes of the elements H, O and C using 7 × 7 matrix-based probabilistic methods as described previously95,96,97 and implemented in MATLAB (release R2021a). Labelling results are expressed as the average fraction of the particular compound that contains isotopic label from the particular precursor.

### Gavage experiment

4-8-110 and an E. coli strain isolated from caecal content (with 4-15-1 and 4-8-110) were cultured in GAM medium (Nissui Pharmaceuticals) for 13 h and 4 h, respectively, in an anaerobic glovebox (Electrotek) as described above. Optical density values were adjusted to 0.2 by addition of GAM medium, glycerol added to 1/6 of the final volume, and aliquots were stored at −80 °C. Germ-free mice (Kunming line) were provided by Huazhong Agricultural University. All of the experimental procedures involving mice were approved by the Ethics Committee in Huazhong Agricultural University (HZAUMO-2021-0077). Two groups (A and B) of five female mice each were housed in two separate cages (temperature, 25 ± 2 °C; humidity, 45–60%; light cycle, 12 h–12 h light–dark; light hours, 06:30–18:30). Mice were gavaged with 100 μl of PBS (group A) or 100 μl of PBS + 80 g l−1 GalNAc (corresponding to 8 mg per 100 μl or ~200 mg kg−1 of live weight) (group B) directly in the mouth with a sterile syringe twice per day (09:00 and 16:00) for 10 consecutive days (days 2 to 11). All of the mice were further gavaged with 150 μl of the 4-8-110 glycerol stocks (see above) once per day (at 16:00) for five consecutive days (days 3 to 7), and with 150 μl of the E. coli glycerol stocks (see above) once per day (at 16:00) for three consecutive days (days 5 to 7). The mice were euthanized 5 days after the last bacterial inoculation (on day 12, based on ref. 98), and caecal content and faeces were collected for each mouse. Bacterial DNA was extracted using the QIAamp fast DNA stool mini kit (Qiagen). The V3–V4 region of the 16S rRNA gene was amplified and sequenced using the methods described above. Reads were mapped to the V3–V4 sequence of the previously determined 4-8-110 and E. coli strains. An average of 99.74% of the reads mapped either to the 4-8-110 or E. coli reference sequences.

### Testing the inducibility of the GalNAc operon and transcriptome response upon GalNAc addition in E. coli and the OTU476-like strains

#### Bacterial growth

Bacteria were grown in an anaerobic glovebox (Electrotek) containing 10% hydrogen, 10% carbon dioxide and 80% nitrogen at 37 °C. Samples (4 ml) were pipetted after gentle shaking to measure optical density at 600 nm (OD600) using an ultraviolet spectrophotometer (Yoke Instrument). Experiments were conducted in triplicate.

#### RNA extraction

The cultured bacterial cells of the two OTU476 like strains (4-8-110 and 4-15-1) and E. coli were collected at the end of the exponential growth phase (11 h and 4 h, respectively). Total RNA was extracted from bacterial cells using TRIzol Reagent according to the manufacturer’s manuals (Thermo Fisher Scientific), and genomic DNA was removed using DNase I (TaKara). The quantity and quality of total RNA were evaluated using a 2100 Bioanalyser (Agilent) and NanoDrop-2000 (Thermo Fisher Scientific). The RNA samples with OD260/280 = 1.8–2.0, OD260/230 ≥ 2.0, RIN ≥ 6.5, 28S:18S ≥ 1.0, ≥100 ng μl−1 and ≥2 μg were used to construct libraries for RNA-seq.

#### Construction of libraries and sequencing

The libraries for RNA-seq were constructed according to the TruSeq RNA sample preparation Kit (Illumina) using 2 μg of total RNA. In brief, rRNA was removed from total RNA using the Ribo-Zero Magnetic kit (Epicenter). All mRNAs were broken into short fragments (200 nucleotides) by adding fragmentation buffer. Double-stranded cDNA was synthesized using the SuperScript double-stranded cDNA synthesis kit (Invitrogen) with random hexamer primers (Illumina). cDNA was processed for end-repair, phosphorylation and A base addition according to library construction protocol (Illumina). The constructed libraries were sequenced on the NovaSeq 6000 platform with a 2 × 150 bp paired-end strategy. Base calling and quality value calculations were performed using the Illumina GA Pipeline (v.1.6) (Illumina).

#### Bioinformatics analysis

To obtain clean reads, low-quality sequences, reads with more than 5% of N bases (unknown bases) and reads containing adaptor sequences were removed from the raw data with a Perl script. The cleaned sequences were mapped to the reference genomic sequences of two OTU476 like strains (generated by ourselves as described above) and E. coli (GenBank: NC_000913.3) using Bowtie2 (v.2.4.2)99. The read counts mapping to the reference genomes were quantified for each sample using featureCounts (v.1.6.4). The fragments per kilobase of exon model per million mapped reads method was used to calculate the gene expression level. We used the DESeq2 package100 in R (v.g3.5.3) to test for differential expression.

### Analysing the effect of the ABO genotype on the abundance of Erysipelotrichaceae in humans

#### Samples

The data used correspond to the previously described CEDAR cohort101, which included 300 healthy individuals of European descent who were visiting the University Hospital (CHU) from the University of Liège as part of a national screening campaign for colon cancer. Blood samples and intestinal biopsies (ileum, colon and rectum) were collected with full consent. The experimental protocol was approved by the ethics committee of the University of Liège Academic Hospital. Informed consent was obtained before donation in agreement with the recommendations of the declaration of Helsinki for experiments involving human participants.

#### Sequencing

For microbiota analysis, DNA was extracted from biopsies using the QIAamp DNA Stool Mini Kit (QIAgen). Three 16S rRNA amplicons corresponding to the V1–V2, V3–V4 and V5–V6 variable regions were generated in separate PCR reactions and processed for paired-end (2 × 300 bp) NGS sequencing on the MiSeq instrument (Illumina) following the standard protocol at the GIGA genomics core facility.

#### Data processing

Reads were QV20 trimmed from the 3′ end, demultiplexed, primer sequences were removed using the bbduk tool (version 38.82)102. Reads mapping to the human genome were eliminated using the BBTools suite (v.38.82). The corresponding pipeline was constructed using Snakemake (v.7.0.1)103. Further analyses were performed using QIIME 2 (v.2018.11)104. The paired-end reads were denoised and joined using the DADA2 plugin (v.1.16)105 using batch-specific trimming length parameters yielding 9.1± 2.0 kb amplicon sequence variants (ASVs) per run for V1–V2, 4.5 ± 1.6 kb for V3–V4 and 6.8 ± 0.67 kb for V5–V6 amplicons. ASVs mapping to known contaminant taxa as well as ASVs with abundance negatively correlated with coverage depth were removed. Samples that contained more than 20% contaminant ASVs were eliminated from further analyses. ASVs were then clustered to 97% identity level OTUs using the DNACLUST program (v.r3)106. After OTU assignment, read counts were rarefied to 10,000 (V1–V2 and V5–V6) and 5,000 (V3–V4). As intestinal location only explored a minor proportion of the variance in OTU abundance (L.S. et al., manuscript in preparation), OTU abundances were averaged across locations. Local alignment identity of the detected ASVs with the OTU476 and OTU327 from the pig microbiome were measured using blastn107.

#### Association analysis with ABO blood group

The effect of ABO blood group on standardized abundances of individual OTUs was performed using a linear model (lm R function) including (1) ABO blood group (A, B, AB or O), (2) secretor status, (3) sex, (4) smoking status, (5) age and (6) body mass index. Analyses were conducted separately for the different amplicons.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

## Data availability

All the 16S rRNA sequencing data, the metagenomics sequence data and the RNA-seq data were submitted to the GSA database under accession numbers CRA006230, CRA006239, CRA006240 and CRA006216. The genotype data were deposited at the GVM (http://bigd.big.ac.cn/gvm/getProjectDetail?project=GVM000310) under the GSA database under accession number GVM000310. The GWAS summary statistics are available at Figshare (https://doi.org/10.6084/m9.figshare.19313960). The whole-genome sequencing data of experimental pigs have been deposited in the GSA database (https://ngdc.cncb.ac.cn/gsa/browse/CRA006383) under accession number CRA006383. The source data are available at GitHub (https://github.com/yanghuijxau/Manuscript-microbiota-ABO).

## Code availability

Codes to replicate the findings and the source data are available at GitHub (https://github.com/yanghuijxau/Manuscript-microbiota-ABO).

## References

1. Kundu, P., Blacher, E., Elinav, E. & Pettersson, S. Our gut microbiome: the evolving inner self. Cell 171, 1481–1493 (2017).

2. Rothschild, D. et al. Environment dominates over host genetics in shaping human gut microbiota. Nature 555, 210–215 (2018).

3. O’Hara, E., Neves, A. L. A., Song, Y. & Guan, L. L. The role of the gut microbiome in cattle production and health: driver or passenger? Annu. Rev. Anim. Biosci. 8, 199–220 (2020).

4. Schmidt, T. S. B., Raes, J. & Bork, P. The human gut microbiome: from association to modulation. Cell 172, 1198–1215 (2018).

5. Polderman, T. J. C. et al. Meta-analysis of the heritability of human traits based on 50 years of twin studies. Nat. Genet. 47, 702–709 (2015).

6. Polubriaginof, F. C. G. et al. Disease heritability inferred from familial relationships reported in medical records. Cell 173, 1692–1704 (2018).

7. Benson, A. K. et al. Individuality in gut microbiota composition is a complex polygenic trait shaped by multiple environmental and host genetic factors. Proc. Natl Acad. Sci. USA 107, 18933–18938 (2010).

8. Yatsunenko, T. et al. Human gut microbiome viewed across age and geography. Nature 486, 222–227 (2012).

9. Goodrich, J. K. et al. Human genetics shape the gut microbiome. Cell 159, 789–799 (2014).

10. Blekhman, R. et al. Host genetic variation impacts microbiome compoistion across human body sites. Genome Biol. 16, 191 (2015).

11. Turpin, W. et al. Association of host genome with intestinal microbial composition in a large healthy cohort. Nat. Genet. 48, 1413–1417 (2016).

12. Bonder, M. J. et al. The effect of host genetics on the gut microbiome. Nat. Genet. 48, 1407–1412 (2016).

13. Wang, J. et al. Genome-wide association analysis identifies variation in vitamin D receptor and other host factors influencing the gut microbiota. Nat. Genet. 48, 1396–1406 (2016).

14. Hughes, D. A. et al. Genome-wide associations of human gut microbiome variation and implications for causal inference analyses. Nat. Microbiol. 5, 1079–1087 (2020).

15. Sankararaman, S. et al. The genomic landscape of Neanderthal ancestry in present-day humans. Nature 507, 354–357 (2014).

16. Patterson, N. et al. Genetic evidence for complex speciation of humans and chimpanzees. Nature 441, 1103–1108 (2006).

17. Donaldson, G. P., Lee, S. M. & Mazmanian, S. K. Gut biogeography of the bacterial microbiota. Nat. Rev. Microbiol. 14, 20–32 (2016).

18. Radjabzadeh, D. et al. Diversity, compositional and functional differences between gut microbiota of children and adults. Sci. Rep. 10, 1040 (2020).

19. Goodrich, J. K. et al. Genetic determinants of the gut microbiome in UK twins. Cell Host Microbe 19, 731–743 (2016).

20. Cooling, L. Blood groups in infection and host susceptibility. Clin. Microbiol. Rev. 28, 801–870 (2015).

21. Rühlemann, M. C. et al. Genome-wide association study in 8,956 German individuals identifies influence of ABO histo-blood groups on gut microbiome. Nat. Genet. 53, 147–155 (2021).

22. Lopera-Maya, E. E. et al. Effect of host genetics on the gut microbiome in 7,738 participants of the Dutch microbiome project. Nat. Genet. 54, 143–151 (2022).

23. Qin, Y. et al. Combined effects of host genetics and diet on human gut microbiota and incident disease in a single population cohort. Nat. Genet. 54, 134–142 (2022).

24. Choi, M. K. et al. Determination of complete sequence information of the human ABO blood group orthologous gene in pigs and breed differences in blood type frequencies. Gene 640, 1–5 (2018).

25. Wang, S. et al. Design of glycosyl transferase inhibitors: serine analogues as pyrophosphate surrogates? ChemPlusChem 80, 1525–1532 (2015).

26. Ségurel, L. et al. The ABO blood group is a trans-species poilymorphism in primates. Proc. Natl Acad. Sci. USA 109, 18493–18498 (2012).

27. Groenen, M. A. M. A decade of pig genome sequencing: windo on pig domestication and evolution. Genet. Sel. Evol. 48, 23–32 (2016).

28. Ravcheev, D. A. & Thiele, I. Comparative genomic analysis of the human gut microbiome reveals a broad distribution of metabolic pathways for the degradation of host-synthesized mucin glycans and utilization of mucin-derived monosaccharides. Front. Genet. 8, 111 (2017).

29. Tailford, L. A. et al. Mucin glycan foraging in the human gut microbiome. Front. Genet. 6, 81 (2015).

30. Lien, K. A., Sauer, W. C. & He, J. M. Dietary influences on the secretion into and degradation of mucin in the digestive tract of monogastric animals and humans. J. Anim. Feed Sci. 10, 223–245 (2001).

31. Brinkkötter, A. B., Klöss, H., Alpert, C.-A. & Lengeler, J. W. Pathways for the utilization of N-acetyl-galactosamine and galactosamine in Escherichia coli. Mol. Microbiol. 37, 125–135 (2000).

32. Rodionov, D. A. et al. Genomic encyclopedia of sugar utilization pathways in the Shewanella genus. BMC Genom. 11, 494 (2010).

33. Leyn, S. A., Gao, F., Yang, C. & Rodionov, D. A. N-acetylgalactosamine utilization pathway and regulon in proteobacteria. J. Biol. Chem. 287, 28047–28056 (2012).

34. Hu, Z., Patel, I. R. & Mukherjee, A. Genetic analysis of the roles of agaA, agaI, and agaS genes in the N-acetyl-d-galactosamine and d-galactosamine catabolic pathways in Escherichia coli strains O157:H7 and C. BMC Microbiol. 13, 94 (2013).

35. Bidart, G. N., Rodriguez-Diaz, J., Monedoro, V. & Yebra, M. J. A unique gene cluster for the utilization of the mucosal and human milk-associated glycans galacto-N-biose and lacto-N-biose in Lactobacillus casei. Mol. Microbiol. 93, 521–538 (2014).

36. Zhang, H. et al. Two novel regulators of N-acetyl-galactosamine utilization pathway and distinct roles in bacterial infections. Microbiol. Open 4, 983–1000 (2015).

37. Lawrence, J. Selfish operons: the evolutionary impact of gene clustering in prokrayotes and eukaryotes. Curr. Opin. Genet. Dev. 9, 642–648 (1999).

38. Koonin, E. V. Evolution of genome architecture. Int. J. Biochem. Cell Biol. 41, 298–306 (2009).

39. Lombard, V. et al. The carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res. 42, D490–D495 (2014).

40. Rahfeld, P. et al. An enzymatic pathway in the human gut microbiome that converts A to universal O type blood. Nat. Microbiol. 4, 1475–1585 (2019).

41. Rahfled, P. et al. Prospecting for microbial α-N-acetylgalactosaminidases yields a new class of GH31 O-glycanase. J. Biol. Chem. 294, 16400–16415.

42. Chen, Y. et al. ABO blood group and susceptibility to severe acute respiratory syndrome. JAMA 293, 1450–1451 (2005).

43. Ellinghaus, D. et al. The ABO blood group locus and a chromosome 3 gene cluster associate with SARS-CoV-2 respiratory failure in an Italian-Spanish genome-wide association analysis. Preprint at medRxiv https://doi.org/10.1101/2020.05.31.20114991 (2020).

44. Blancher, A. Evolution of the ABO supergene family. ISBT Sci. Ser. 8, 201–206 (2013).

45. Makivuokko, H. et al. Association between the ABO blood group and the human intestinal microbiota composition. BMC Microbiol. 12, 94 (2012).

46. Davenport, E. R. et al. ABO antigen and secretor statuses are not associated with gut microbiota composition in 1,500 twins. BMC Genom. 17, 941–955 (2016).

47. Kurilshikov, A. et al. Large-scale association analyses identify host factors influencing human gut microbiome composition. Nat. Genet. 53, 156–165 (2021).

48. Malmuthuge, N., Griebel, P. J. & Guan, L. L. Taxonomic identification of commensal bacteria associated with the mucosa and digesta throughout the gastrointestinal tracts of preweaned calves. Appl. Environ. Microbiol. 80, 2021–2028 (2014).

49. Hanson, M. E. B. et al. Population structure of human gut bacteria in a diverse cohort from rural Tanzania and Botswana. Genome Biol. 20, 16 (2019).

50. Warr, A. et al. An improved pig reference genome sequence to enable pig genetics and genomics research. Gigascience 9 (2019).

51. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).

52. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

53. Rimmer, A. et al. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat. Genet. 46, 912–918 (2014).

54. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).

55. Browning, S. R. & Browning, B. L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81, 1084–1097 (2007).

56. Coppieters, W., Karim, L. & Georges, M. SNP-based quantitative deconvolution of biological mixtures: application to the detection of cows with subclinical mastitis by whole genome sequencing of tank milk. Genome Res. 30, 1201–1207 (2020).

57. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).

58. Magoč, T. & Salzberg, S. L. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27, 2957–2963 (2011).

59. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).

60. Rognes, T., Flouri, T., Nichols, B., Quince, C. & Mahe, F. VSEARCH: a versatile open source tool for metagenomics. PeerJ 4, e2584 (2016).

61. Wang, Q., Garrity, G. M., Tiedje, J. M. & Cole, J. R. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microbiol. 73, 5261–5267 (2007).

62. Schloss, P. D. et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 75, 7537–7541 (2009).

63. Cheema, M. U. & Pluznick, J. L. Gut microbiota plays a central role to modulate the plasma and fecal metabolomes in response to angiotensin II. Hypertension 74, 184–193 (2019).

64. Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821–824 (2012).

65. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82 (2011).

66. Dixon, P. VEGAN, a package of R functions for community ecology. J. Veg. Sci. 14, 927–930 (2003).

67. Visscher, P. M. et al. Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLoS Genet. 2, e41 (2006).

68. Ziyatdinov, A. et al. lme4QTL: linear mixed models with flexible covariance structure for genetic studies of related individuals. BMC Bioinform. 19, 68 (2018).

69. Haseman, J. K. & Elston, R. C. The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet. 2, 3–19 (1972).

70. Aulchenko, Y. S., Ripke, S., Isaacs, A. & van Duijn, C. M. GenABEL: an R library for genome-wide association analysis. Bioinformatics 23, 1294–1296 (2007).

71. Willer, C. J., Li, Y. & Abecasis, G. R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190–2191 (2010).

72. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).

73. Vaser, R., Sovic, I., Nagarajan, N. & Sikic, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).

74. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at arXiv https://arxiv.org/abs/1303.3997 (2013).

75. Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).

76. Harris, R. S. Improved Pairwise Alignment of Genomic DNA. PhD thesis, Pennsylvania State Univ. (2007).

77. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

78. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).

79. Liao, Y., Smyth, G. K. & Shi, W. FeatureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).

80. Ai, H. et al. Adaptation and possible ancient interspecies introgression in pigs identified by whole-genome sequencing. Nat. Genet. 47, 217–225 (2015).

81. Van der Auwera, G. A. et al. From FastQ data to high confidence variant calls: the genome analysis Toolkit best practices pipeline. Curr. Protoc. Bioinform. 43, 11.10.11–11.10.33 (2013).

82. DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).

83. Weir, B. S. & Cockerham, C. C. Estimating F-statistics for the analysis of population structure. Evolution 38, 1358–1370 (1984).

84. Nei, M. F-statistics and analysis of gene diversity in subdivided populations. Ann. Hum. Genet. 41, 225–233 (1977).

85. Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).

86. Hunt, M. et al. Circlator: automated circularization of genome assemblies using long sequencing reads. Genome Biol. 16, 294 (2015).

87. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform. 11, 119 (2010).

88. Chen, S., Zhou, Y., Chen, Y. & Jia, G. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).

89. Li, D. et al. MEGAHIT v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods 102, 3–11 (2016).

90. Uritskiy, G. V., DiRuggiero, J. & Taylor, J. MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 6, 158 (2018).

91. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).

92. Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 11, 2864–2868 (2017).

93. Kanehisa, M., Sato, Y. & Morishima, K. BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. J. Mol. Biol. 428, 726–731 (2016).

94. Segata, N., Bornigen, D., Morgan, X. C. & Huttenhower, C. PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes. Nat. Commun. 4, 2304 (2013).

95. Li, M. et al. Aldolase B suppresses hepatocellular carcinogenesis by inhibiting G6PD and pentose phosphate pathways. Nat. Cancer 1, 737–747 (2020).

96. Nanchen, A., Fuhrer, T. & Sauer, U. Determination of metabolic flux rartios from 13C-experiments and gas chromatography-mass spectrometry data: protocol and principles. Methods Mol. Biol. 358, 177–197 (2007).

97. van Winden, W. A. et al. Correcting mass isoptopomer distributions for naturally occurring isotopes. Biotechnol. Bioeng. 80, 477–479 (2002).

98. Staley et al. Stable engraftment of human microbiota into mice with a single oral gavage following antibiotic conditioning. Microbiome 5, 87 (2017).

99. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

100. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).

101. Momozawa, Y. et al. IBD risk loci are enriched in multigenic regulatory modules encompassing putative causative genes. Nat. Commun. 9, 2427 (2018).

102. Bushnell, B. BBMap: A Fast, Accurate, Splice-Aware Aligner (version 38.82) https://sourceforge.net/projects/bbmap/ (2014).

103. Köster, J. & Rahmann, S. Snakemake: a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012).

104. Bolyen, E., Rideout, J. R., Dillon, M. R., Bokulich, N. A. & Caporaso, J. G. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat. Biotechnol. 37, 852–857 (2019).

105. Callahan, B. J. et al. DADA2: high-resolution sample inference from Illumina amplicon data. Nat. Methods 13, 581–583 (2016).

106. Ghodsi, M., Liu, B. & Pop, M. DNACLUST: accurate and efficient clustering of phylogenetic marker genes. BMC Bioinform. 12, 271 (2011).

107. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

108. Srivastava, A. et al. Genomes of the mouse collaborative cross. Genetics 206, 537–556 (2017).

109. Yu, N. et al. Global patterns of human DNA sequence variation in a 10-kb region on chromosome 1. Mol. Biol. Evol. 18, 214–222 (2001).

110. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).

111. Frantz, L. A. F. et al. Evidence of long-term gene flow and selection during domestication from analyses of Eurasian wild and domestic pig genomes. Nat. Genet. 47, 1141–1148 (2015).

112. Charlier, C. et al. NGS-based reverse genetic screen for common embryonic lethal mutations compromising fertility in livestock. Genome Res. 26, 1333–1341 (2016).

113. Georges, M., Charlier, C. & Hayes, B. Harnessing genomic information for livestock improvement. Nat. Rev. Genet. 20, 135–156 (2019).

114. Geraldes, A. et al. Inferring the history of speciation in house mice from autosomal, X-linked, Y-linked and mitochondrial genes. Mol. Ecol. 17, 5349–5363 (2008).

115. Huang, H. et al. Fine-mapping inflammatory bowel disease loci to single-variant resolution. Nature 547, 173–178 (2017).

116. Suzuki, T. A. & Nachman, M. W. Spatial heterogeneity of gut microbioal composition along the gastrointestinal tract in natural populatiions of house mice. PLoS ONE 11, e0163720 (2016).

117. Vuik, F. E. R. et al. Composition of the mucosa-associated microbiota along the entire gastrointestinal tract of human individuals. UEG J. 7, 897–907 (2019).

118. Rowe, J. A. et al. Blood group O protects against severe Plasmodium falciparum malaria through the mechanism of reduced rosetting. Proc. Natl Acad. Sci. USA 104, 17471–17476 (2007).

119. Robinson, M. G., Tolchin, D. & Halpern, C. Enteric bacterial agents and the ABO blood groups. Am. J. Hum. Genet. 23, 135–145 (1971).

120. Camus, D., Bina, J. C., Carlier, Y. & Santoro, F. ABO blood groups and clinical forms of schistosomiasis mansoni. Trans. R. Soc. Trop. Med. Hyg. 71, 182 (1977).

121. Pereira, F. E. L., Bortolini, E. R., Carneiro, J. L. A., da Silva, C. R. M. & Neves, R. C. A, B, O blood groups and hepatosplenic form of schistosomiasis mansoni (Symmer’s fibrosis). Trans. R. Soc. Trop. Med. Hyg. 73, 238 (1977).

122. Ndamba, J., Gomo, E., Nyazema, N., Makaza, N. & Kaondera, K. C. Schistosomiasis infection in relation to the ABO blood groups among school children in Zimbabwe. Acta Trop. 65, 181–190 (1997).

123. Chaudhuri, A. & De, S. Cholera and blood groups. Lancet 2, 404 (1977).

124. Boren, T. et al. Attachment of Helicobacter pylori to human gastric epithelium mediated by blood group antigens. Science 262, 1892–1895 (1993).

125. Lindesmith, L. et al. Human susceptibility and resistance to Norwalk virus infection. Nat. Med. 9, 548–553 (2003).

126. Galili, U. in α-Gal and Anti-Gal (eds Galili, U. & Avila, J. L.) Vol. 32, 1–23 (Springer, 1999).

127. Prather, R. S., Shen, M. & Dai, Y. Genetically modified pigs for medicine and agriculture. Biotechnol. Genetic Eng. Rev. 25, 245–266 (2008).

## Acknowledgements

We thank Y. He, S. Xiao, W. Li, Y. Guo and Y. Xing for assistance in the construction of the experimental mosaic pig populations; Y. Su and J. Li for preparation of reagents and management of samples; Y. Momozawa, R. Mariman, M. Mni, L. Karim and M. Dekkers for generating the CEDAR-1 16S rRNA data; the staff at the Jiangxi Department for Education, the Ministry of Science and Technology of P. R. China, the Ministry of Agriculture and Rural Affairs of P. R. China, and Jiangxi department of Science and Technology for their long-term support of the swine heterogeneous stock project; and the members of the MIQUANT consortium for comments and discussions. L.H. is supported by The National Natural Science Foundation of China (31790410) and National pig industry technology system (CARS-35); C. Chen by the National Natural Science Foundation of China (31772579); H.Y. by the National Postdoctoral Program for Innovative Talent (no. BX201700102); L.S. by the FNRS IBD-GI-Seq project; M.G. by the Chinese Thousand Talents Program, the Belgian EOS ‘Miquant’ project and the FNRS (CDR ‘GEM’ project). C. Charlier is a senior research associate at the FNRS.

## Author information

Authors

### Contributions

H.Y. analysed the 16S rRNA sequence data, performed GWAS, meta-analyses and local association analyses, computed heritabilities of individual taxa, contributed to ABO genotyping and analysed the effect of the 2.3 kb deletion on taxa abundance. J.W. analysed the composition of the microbiome, including PCoA analyses, β- and α-diversity, correlations between kinship and microbiome dissimilarities, isolated the OTU476-like strains, performed the GalNAc feeding experiments, measured the concentrations of GalNAc in the caecal lumen, analysed the GalNAc import and use pathway in the MAGs, and contributed to ABO genotyping. X.H. participated in 16S rRNA sequencing (F6) and GWAS (F6). Y. Zhou performed metagenome sequencing analysis, analysed the GalNAc import and use pathway in MAGs, analysed the RNA-seq data from caecum samples and contributed to ABO genotyping. Y. Zhang participated in the preparation of the genotype data from whole-genome sequence information, participated in the computation of the genomic contribution of the different breeds in the F6 and F7 generation and the definition of expected mapping resolution, performed LD analyses, performed eQTL analysis for the ABO gene, participated in the characterization and sequence analysis of the ABO gene, including definition of the 2.3 kb deletion, and in the balancing selection and trans-species polymorphism analyses. M.L. assisted with the isolation of the OTU476-like strains, the GalNAc feeding experiments and genotyping of the ABO gene. Q.L. assisted with measuring the concentrations of GalNAc in caecal lumen. S.K., M.H., H.F., S.F., X.X., H.J., Z.C. and J.G. assisted with the experiments. Z.Z., X.T.,  Z.W., H.G. and Y.H. assisted with the preparation of genotype data from whole-genome sequencing data and conducted the analysis of the Nanopore data of the ABO region. J.M. assisted with the construction of the mosaic population. H.A. assisted with the bioinformatic analysis of the ABO region, de novo assembly of the A allele, and evolutionary analysis of the ABO alleles. L.S. analysed the effect of ABO genotype on intestinal microbiota composition in humans. W.C. assisted in the analysis of the sequencing data for the trans-species polymorphisms. C. Charlier supervised the characterization of the ABO gene and the 2.3 kb deletion and the corresponding haplotype structure in the F0, F6 and F7 population and for the trans-species polymorphism. B.Y. prepared the genotype data of whole-genome variants, assisted with raising the heterogeneous stock, and participated in the computation of the genomic contribution of the different breeds in the F6 and F7 generation and the definition of expected mapping resolution. M.G. supervised the bioinformatic and statistical analyses, performed bioinformatic and statistical analyses, and wrote the paper. C. Chen codesigned the study, supervised experiments, supervised bioinformatic and statistical analyses of gut microbiome, and wrote the paper. L.H. created the swine heterogeneous stock, designed the study, directed the project, supervised the experiments and analyses, and wrote the paper.

### Corresponding authors

Correspondence to Michel Georges, Congying Chen or Lusheng Huang.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

## Peer review

### Peer review information

Nature thanks Catherine Lozupone, Vincent Plagnol and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Extended data figures and tables

### Extended Data Fig. 1 Generating a large mosaic pig population for genetic analysis of complex phenotypes.

(a) Rotational breeding design used for the generation of a large mosaic pig population for the genetic analysis of complex phenotypes, with sampling scheme for faeces (D25, D120, D240), luminal content of the ileum (IC) and caecum (CC), and mucosal scrapings in the ileum (IM) and caecum (CM). BX: Bamaxiang, EH: Erhualian, LA: Laiwu, TB: Tibetan, LW: Large White, LD: Landrace, PT: Piétrain, WD: White Duroc. (b) Average similarity (1 – π) between allelic sequences sampled within and between the eight founder breeds. The colour intensity ranges from black (breeds with lowest allelic similarity: BX vs WD, 1 - 4.3x10−3) to bright red (breed with highest allelic similarity: WD, 1 - 1.8x10−3). The acronyms for the breeds are as in (a). More than 30 million variants with MAF ≥ 3% segregate in this population, i.e. more than one variant every 100 base pairs. This is slightly lower than the 40 million high quality variants segregating in the mouse collaborative cross108. (c) Comparison of the average nucleotide diversity (π, i.e. the proportion of sites that differ between two chromosomes sampled at random in the population(s)) within and between European (Eur) and Asian (As) domestic pigs, and between modern European (HSEur), Asian humans (HSAs), Neanderthal (Neand) and Chimpanzee (Pan Trogl). The average nucleotide diversity within the four Chinese founder breeds was ~2.5x10−3 and within the four European founder breeds ~2.0x10−3. By comparison, π-values within African and within Asian/European human populations are ~9x10−4 and ~8x10−4, respectively109,110. Thus, against intuition (as domestication is often assumed to have severely reduced effective population size) the within population diversity is >2-fold higher in domestic pigs than in human populations, as previously reported111,112,113. Nucleotide diversities between Chinese founder breeds and between European founder breeds were ~3.6x10−3 and ~2.5x10−3, respectively, i.e. 1.44-fold and 1.25-fold higher than the respective within-breed π-values. These π-values are of the same order of magnitude as the sequence divergence between Homo sapiens and Neanderthals/Denosivans (~3x10−3, ref. 15). By comparison, π-values between Africans, Asians and Europeans are typically ≤ ~1x10−3 (ref. 109). The nucleotide diversity between Chinese and European breeds averaged ~4.3x10−3. This π-value is similar to the divergence between M. domesticus and M. castaneus114, and close to halve the ~1% difference between chimpanzee and human16. Note that Chinese and European pig breeds are derived from Chinese and European wild boars, respectively, which are thought to have diverged ~1 million years ago27, while M. domesticus and M. castaneus are thought to have diverged ≤ 500,000 years ago114. (d) Autosome-specific estimates of the genomic contributions of the eight founder breeds in the F6 and F7 generation. We used a linear model incorporating all variants to estimate the average contribution of the eight founder breeds in the F6 and F7 generation at genome and chromosome level56. At genome-wide level, the proportion of the eight founder breed genomes ranged from 11.2% (respectively 11.5%) to 14.1% (14.7%) in the F6 (F7) generations. At chromosome-specific level, the proportion of the eight founder breeds ranged from 6.7% (respectively 4.9%) to 20.7% (22.1%) in the F6 (F7) generations. The genomic contribution of the eight founder breeds in the F6 and F7 generation is remarkably uniform and close to expectations (i.e. 12.5%) both at genome-wide and chromosome-wide level, suggesting comparable levels of genetic diversity across the entire genome. This does not preclude that more granular examination may reveal local departures from expectations, or under-representation of incompatible allelic combinations at non-syntenic loci. (e-f) Indicators of achievable mapping resolution in the F6 generation: (e) Frequency distribution (density) of the number of variants in high LD (r2 ≥ 0.9) with an “index” variant (was computed separately for all variants considered sequentially as the “index”), corresponding to the expected size of “credible sets” in GWAS115. The red vertical line corresponds to the genome-wide median. The green vertical line corresponds to the mapping resolution achieved in this study for the ABO locus (see hereafter). (f) Frequency distribution (density) of the maximum distance between an index variant and a variant in high LD (r2 ≥ 0.9) with it, defining the spread of credible sets. Red and green vertical lines are as in (D).

### Extended Data Fig. 2 Characterizing the age- and location-specific composition of the intestinal microbiome of the healthy pig.

(a) Definition of a core intestinal microbiome of the pig. A total of 58 OTUs that were annotated to 21 taxa were identified in >95% of day 120 and 240 faeces and caecum content samples of both F6 and F7 generations, hence defined as core bacterial taxa. (b) The compositions of the porcine and human intestinal microbiota are closer to each other than either is to that of the mouse. Boxplots are as is Fig. 1c. The number of samples available for analysis were 1281 pigs, 106 humans and 6 mice. (c) Abundances (F6-F7 averages when available) of the 43 families represented in Fig. 1b in the seven sample types relative to the sample type in which they are the most abundant (red – blue scale). The families are ordered according to the sample type in which they are the most abundant. The colour-code for phyla is as in Fig. 1b. Columns are added for comparison with mouse and human. Mouse data are from Fig. 1 in Suzuki & Nachman116, and human data from Fig. 6 in Vuik et al117. P_I: proximal ileum, D_IL: distal ileum, C: caecum, CO: colon, RE: rectum, F: faeces. The families differing the most with regards to location-specific distribution between species include Helicobacteriaceae, Veillonellaceae, Lactobacillaceae and Streptocaccaceae.

### Extended Data Fig. 3 Evaluating the heritability of intestinal microbiota composition in the mosaic pig population.

Correlation between heritability estimates of taxa/OTUs in F6 and F7 generation by sample type (D25, D120, D240, CC and IC). Correlation coefficients (r) and associated p-values (p) were computed using heritability estimates that were pre-corrected for bacterial abundance (residuals of linear model). Heritability estimates indeed tend to slightly increase with taxa abundance. Yet, results show that this effect cannot account for the observed correlations between F6 and F7 estimates in D120, D240 and CC, hence pointing towards genuine genetic effects. The shaded areas correspond to the 95% confidence region for the regression fit. Correlation coefficients and two-sided p-values were computed using Spearman’s rank-based method. Reported p-values are nominal (i.e. uncorrected for multiple testing).

### Extended Data Fig. 4 Identifying a microbiota QTL (miQTL) with major effect on the abundance of Erysipelotrichaceae species by whole genome sequence based GWAS.

(a) Schematic illustration of the samples and SNPs used for the two types of analyses (abundance and presence/absence) performed for miQTL mapping. (b) (Upper) Distribution of log(1/p) values for 1,527 sets of 11 p-values obtained in 11 data-series for a SNP x taxon x analysis model combination that yielded a genome-wide significant signal (p < 5 x 10−8) in the 12th data-series. (Lower) Distribution of log(1/p) values for 1,527 sets of 11 p-values obtained in the same data-series and with the same analysis model as in (upper) but with randomly selected SNP x taxon combinations matching the ones in (upper) for MAF and taxa abundance. Log(1/p) values were computed using GenABEL as described in Methods. Corresponding p-values are nominal and two-sided. (c) Correlation between the average (F6 and F7) taxon heritability, and the average (F6 and F7) number of genome-wide significant (p ≤5 × 10−8) miQTL for D240 faecal samples. The shaded area corresponds to the 95% confidence region for the regression fit. Correlation coefficient and associated p-values are Spearman’s. (d) QQ plot for 1,527 (number of signals (SNP x taxon x model x one data series in one cohort) exceeding the genome-wide log(1/p) threshold value of 7.3) sets of ≤ 5-7 p-values (same SNP x taxon x model, all data series in the other cohort) for real SNPs (Blue: quantitative model; Green: binary model), and matched sets of ≤ 5-7 p-values corresponding to randomly selected SNP x taxon combinations matched for MAF and abundance or presence/absence rate (Brown: quantitative model; Yellow: binary model). Log (1/p) values were computed using GenABLE as described in Methods. Corresponding p-values are nominal and two-sided. (e) Same QQ plot as in (C) after removal of all SNPs in the chromosome 1: 272.8-273.1Mb interval. Log(1/p) values were computed using GenABLE as described in Methods. Corresponding p-values are nominal and two-sided. (f) Distribution of the association log(1/p) values and corresponding signed z-scores for SNP 1_272907239 and 31 p-75-a5 OTUs (red) and 83 Erysipelotrichaceae (yellow) OTUs, showing an enrichment of effects with same sign as for OTU476 and OTU327. Log(1/p) values were computed using Metal (v3.0) as described in Methods. Corresponding p-values are nominal and two-sided. See also Supplemental discussion 1.

### Extended Data Fig. 5 The chromosome 1 miQTL is caused by a 2.3 kb deletion in the orthologue of the human ABO gene.

(a) Breakpoints of the 2.3 kb deletion showing the role of a duplicated SINE sequence in mediating an intra-chromosomal recombination. (b) Illustrative example of allelic balance for the cG146C SNP in an AA homozygote and of allelic imbalance for the same SNP in an AO heterozygote. (c) (Upper) eQTL analysis for the porcine AO gene maximizing at the exact position of the 2.3 kb deletion (p = 1.9x10−43) and showing the additive effect of the A allele increasing transcript levels ~3-fold (inset; FPKM: Fragments Per Kilobase of transcript per Million mapped reads). The “n’s” correspond to the number of animals of each genotype available for analysis. Boxplots are as in Fig. 1c. (Lower) Genome wide eQTL scan for the porcine ABO gene showing the strong cis-eQTL signal on chromosome 1. eQTL analysis was conducted with GEMMA (v0.97)64. Reported log-transformed p-values are nominal and two-sided. (d) Effect of N-acetyl-galactosaminyl transferase genotype (AA, AO or OO) on abundance of OTU327 and p-75-a5 in the twelve data series. Absence of an effect of N-acetyl-galactosaminyl transferase genotype (AA, AO or OO) on abundance of E. coli in the twelve data series. Sample sizes are as in STable 4.1. Boxplots are as in Fig. 3d. (e) Abundance of OTU476, OTU327 and p-75-a5 in the twelve data series. Violin plots with indication of the median. Numbers (n’s) are as in STable 4.1. See also Supplemental discussion 2.

### Extended Data Fig. 6 cis-eQTL analyses in the vicinity of the chromosome 1 miQTLK supports the causality of the 2.3 kb deletion.

(a) Cis-eQTL analysis for the porcine N-acetyl-galactosaminyl transferase (“ABO”), GBTG1, LCN1 (=OBP2B), MED22 and SURF6 genes in caecum. The blue triangle corresponds to the top SNP for the miQTL. The red triangles correspond to the top SNPs for the respective cis-eQTL. Only for N-acetyl-galactosaminyl transferase are blue and red variants the same. eQTL analyses were conducted with GEMMA (v0.97)64. Reported log-transformed p-values are nominal and two-sided. (b) Effect of AO genotype on the expression levels of the corresponding genes in caecum. There was no evidence for an effect of AO genotype on the expression of any of these genes other than ABO. The number of AA, AO and OO samples available for cis-eQTL analysis for each gene are given (n). Boxplots are as in Fig. 1c. We tested the difference in gene expression level between pairs of genotype classes using a two-sided t-test. (c) Effect of the top cis-eQTL SNPs (blue triangles in A) on OTU476 abundance. Only the top cis-eQTL SNPs for ABO has an effect on OTU476 abundance. The number of AA, AO and OO samples available for miQTL analysis for each gene are given (n). Boxplots are as in Fig. 1c. We tested the difference in bacterial abundance between pairs of genotype classes using a two-sided t-test.

### Extended Data Fig. 7 The 2.3 kb deletion in the orthologue of the human ABO gene is 3.5 million years old and under balancing selection.

(a) UPGMA tree based on nucleotide diversities between 14 AA and 34 OO animals in windows of increasing size (0.5 to 40 kb) centred on the 2.3 kb deletion in the porcine N-acetyl-galactosaminyl transferase gene (porcine O allele). PA: Phacochaerus Africanus, SC: Sus cebifrons, SV: Sus verrucosus, SU: Sus scrofa vittatus, CB: Chinese wild boar, RB: Russian wild boar, EB: European wild boar, ERH: Erhualian, BX: Bamaxiang, T: Tibetan, LA: Laiwu, LR: Landrace, LW: Large White, PI: Piétrain, WD: White Duroc. Context: To gain additional insights in the age of the porcine O allele, we generated phylogenetic trees of the A and O alleles of 14 AA and 34 OO animals including domestic pigs, wild boars, Visayan and Javanese warty pigs, and common African warthog. Examination of their local SNP genotypes (50K window encompassing the ABO gene) reveals traces of ancestral recombinations between O and A haplotypes as close as 300 and 800 base pairs from the proximal and distal deletion breakpoints, respectively, as well as multiple instances of homoplasy that may either be due to recombination, gene conversion or recurrent de novo mutations. On their own, these signatures support the old age of the O allele. We constructed UPGMA trees based on nucleotide diversity for windows ranging from 500 bp to 40 kb centred on the 2.3 kb deletion. Smaller windows have a higher likelihood to compare the genuine ancestral O versus A states, yet yield less robust trees because they are based on smaller number of variants. Larger windows will increasingly be contaminated with recombinant A-O haplotypes blurring the sought signal. Indeed, for windows ≥ 20 kb or more, the gene tree corresponds to the species tree, while for windows ≤ 15 kb the tree sorts animals by AA vs OO genotype. For all windows ≤ 15 kb the Sus cebifrons O allele maps outside of the Sus scrofa O allele supporting a deep divergence (rather than hybridization) and hence the old age of the O allele. Of note, for windows ≤1.2 kb, the warthog A allele is more closely related to the Sus A alleles than to the Sus O alleles (ED7A). This suggests that the O allele may be older than the divergence of the Phacochoerus and Sus A alleles, i.e. > 10 MYA. It will be interesting to study larger numbers of warthog to see whether the same 2.3 kb deletion exists in this and other related species as well. (b) Alignment of ~900 base pairs of the O alleles of domestic pigs (Bamaxian), European and Asian wild boars, and Sus cebufrons demonstrating that these are identical-by-descent. The SINE element that is presumed to have mediated the recombinational event that caused to 2.3 kb deletion is highlighted in red. Context: To further support their identity-by-descent we aligned ~900 base pairs (centred on the position of the 2.3 kb deletion) of the O alleles of domestic pig, European and Asian wild boars and Sus cebifrons. The sequences were nearly identical further supporting our hypothesis. It is noteworthy that the old age of the “O” allele must have contributed to the remarkable mapping resolution (≤3 kb) that was achieved in this study. In total, 42 variants were in near perfect LD (r2 ≥ 0.9) with the 2.3 kb deletion in the F0 generation, spanning 2,298 bp (1,522 on the proximal side, and 762 on the distal side of the 2.3 kb deletion). This 2.3 kb span is lower than genome-wide expectations (17th percentile), presumably due to the numerous cross-overs that have accrued since the birth of the 2.3 kb deletion that occurred in the distant past. Yet the number of informative variants within this small segment is higher than genome-wide average of (57% percentile) also probably due at least in part to the accumulation of numerous mutations since the remote time of coalescence of the A and O alleles (see Fig. 1d in main text). (c) QQ plots for the effect of AO genotype on 150 phenotypes pertaining to meat quality, growth, carcass composition, hematology, health, and other phenotypes in the F6 and F7 generation. P-values were obtained using a mixed model followed by meta-analysis (weighted Z score) across the F6 and F7 generations as described in Methods. log-transformed p-values used for the QQ plot are nominal and two-sided. Context: Our findings in suidae are reminiscent of the trans-species polymorphism of the ABO gene in primates attributed to balancing selection26. The phenotype driving balancing selection remain largely unknown yet a tug of war with pathogens is usually invoked: synthesized glycans may affect pathogen adhesion, toxin binding or act as soluble decoys, while naturally occurring antibodies may be protective20,44. In humans, the O allele may protect against malaria118, E. Coli and Salmonella enteric infection119, SARS-CoV-142, SARS-CoV-243 and schistosomiasis120,121,122, while being a possible risk factor for cholera123, H. pylori124 and norovirus infection125. Whatever the underlying selective force, it appears to have operated independently in at least two mammalian branches (primates and suidae), over exceedingly long periods of time, and over broad geographic ranges, hence pointing towards its pervasive nature. To gain insights in what selective forces might underpin the observed balanced polymorphism, we tested the effect of porcine AO genotype on >150 traits measured in the F6 and F7 generations pertaining to carcass composition, growth, meat quality, hematological parameters, disease resistance and behaviour. No significant effects were observed when accounting for multiple testing, including those pertaining to immunity and disease resistance. (d) Expression profile of the AO gene in a panel of adult and embryonic porcine tissues (own RNA-Seq data).

### Extended Data Fig. 8 The chromosome 1 miQTL affects caecal N-acetyl-D-galactosamine (GalNAc) concentrations which are correlated with the abundance of Erysipelotyrichaceae species within AO genotype: theory.

(a) ABO and α-gal epitopes in pigs and human. The glycosyltransferase gene located on 9q34.2 and underpinning the human ABO blood group is characterized in most human populations by three major alleles: (i) IA encoding a α-3-N-acetyl-D-galactosaminyltransferase that is adding GalNAc to H and Lewis antigens (yielding the A antigen) on various glycoproteins including mucins secreted in the intestinal lumen, (ii) IB encoding a α-3-D-galactosyltransferase that is adding galactose to the same antigens (yielding the B antigen), and (iii) the inactive IO null allele that precludes expression of either the A and/or the B antigen. Mutations in the fucosyltransferase 2 gene (FUT2) preclude formation of the H antigen on secreted proteins and hence the detection of A and B antigens in secretions20. The pig orthologue of the human ABO glycosyltransferase gene is located on the telomeric end of porcine chromosome 1q, and is characterized by two major alleles: (i) the A allele, encoding a α-3-N-acetyl-D-galactosaminyltransferase that is adding GalNAc to H and Lewis antigens, similar to the human IA allele, and (ii) the O allele corresponding to a null allele as a result of a 2.3 kb deletion similar to the human IO allele24. Thus, the B antigen (Galα1-3(Fucα1-2)Galβ1-4GlcNAc-R) is not observed in pig populations. However, what is found abundantly on the surface of cells in many tissues is the so-called “α-gal epitope” (Galα1-3Galβ1-4GlcNAc-R), which results from the addition of a galactose to the Galβ1-4GlcNAc-R precursor by a α1,3galactosyltransferase encoded by the GGTA1 gene. The orthologue of the GGTA1 gene is non-functional in human and Old World non-human primates, which, however, have high titers of circulating anti-α-gal antibodies contributing to acute rejection of xenografts126,127. (b) Identifying whether changes in GalNAc concentration are the cause of the observed changes in abundance of Erysipelotrichaceae species by searching for a correlation between the two phenotypes “within AO genotype”. (b1) If AO genotype is associated with the abundance of Erysipelotrichaceae species and GalNAc concentrations by virtue of different molecular mechanisms (for instance because they involved distinct causative mutations albeit in linkage disequilibrium, or because the gene has an as of yet unknown other activity that is causing the change in bacterial abundance, independently of its glycosyltransferase activity), there is no reason to expect a correlation between bacterial abundance and GalNAc concentration within AO genotype (red horizontal lines in the dotted circles). There is of course a correlation across genotypes that is due to the fact that AO genotype has a (direct or indirect) effect on both phenotypes. (b2) If, on the other hand, AO genotype causes the change in GalNAc concentration (which is very likely given its known enzymatic activity) which then causes the change in the abundance of Erysipelotrichaceae species, one can expect that bacterial abundance and GalNAc concentration will be correlated, also within AO genotype, as indicated by the sloped red lines within the dotted ellipses. This is what is observed with the real data.

### Extended Data Fig. 9 The chromosome 1 miQTL affects caecal N-acetyl-D-galactosamine (GalNAc) concentrations which are correlated with the abundance of Erysipelotyrichaceae species within AO genotype: results.

(a) Positive correlation between caecal GalNAc concentrations and bacterial abundance (upper panels: p-75-a5; lower panels: OTU327) “within AO genotype”. GalNAc concentrations and bacterial abundances were corrected for batch effects and AO genotype and scaled between 0 and 1 to equalize residual variance. Correlations were computed using all samples jointly and Spearman’s rank-based test; corresponding p-values (nominal; two-sided) are given (left panels). Regression lines are shown for the different AO genotypes separately (right panels); all of them are positive. Note that the scatter plots for p-75-a5 are not identical but very similar to those for OTU476 (Fig. 5b, c). This is because OTU476 accounts for most of the p-75-a5 genus in caecum content (see also Extended Data Fig. 5). These data can therefore not be considered to be independent. The shaded areas correspond to the 95% confidence regions for the regression fit. (b) Comparison of the free GalNAc concentrations in caecal content of OO, AO and AA pigs as well as in caecal content of germ-free mice gavaged with 200mg/kg GalNAc. Concentrations were determined in freeze-dried caecal content powder using LC-MS/MS. Number of analyzed samples are given (n). Boxplots are as in Fig. 1c.

### Extended Data Fig. 10 The chromosome 1 miQTL affects bacteria with a functional GalNAc import and catabolic pathway.

Presence anywhere in the genome (green), presence in close proximity to agaS (red), or absence (black) of the orthologues of 24 genes implicated in the GalNAc TR/CP pathway in the genome of (i) two OTU476 like strains (4-15-1 and 4-8-110), (ii) 248 MAGs assigned to the Erysipelotrichaceae family, and (iii) 2,863 MAGs assigned to other bacterial families. The two lanes on the right of the three panels correspond to the Regulon (red) and Pathway (green) score respectively. Both scores range from 0 (black) to 6 (bright red or green). Means (range) for the corresponding dataset are given on top. P-values (nominal, two-sided, uncorrected) of the pathway and regulon scores were computed using a linear model described in Methods.

### Extended Data Fig. 11 Different GalNAc operon structure and transcriptome response in miQTL-sensitive versus -insensitive GalNAc utilizing bacteria.

Maps of GalNAc “operons” in one of the two OTU476-like strains (NB: The organization of the GalNAc gene cluster was identical in both 4-15-1 and 4-8-110 strains), and six MAGs assigned respectively to an Erysipelotrichaceae, E. coli (an Enterobacteriaceae), a Collinsella (a Coriobacteriaceae), a Fusobacteriaceae, a Firmicutes and a Clostridium. Identified Open Reading Frames (ORFs) are represented as coloured boxes. Genes implicated in GalNAc import and catabolism are in red if they are part of the cluster and in green if located elsewhere in the genome. Genes with a known function unrelated to GalNAc are in blue. ORFs with uncharacterized gene product in gray. Gene acronyms are given next to the corresponding boxes. ORFs transcribed from the top (respectively bottom) strand are above (below) the dotted line. The respective transcriptional directions are marked by the arrows. The source of information used to confirm the map order is given (finished genome, multiple MAGs, single contig).

### Extended Data Fig. 12 No effect of ABO genotype on intestinal Erysipelotrichaceae abundance in human.

Volcano and QQ plots for 43 (V1-V2), 20 (V3-V4) and 9 (V5-V6) OTUs classified as Erysipelotrichaceae for the contrasts (a) [AA, AO and AB] versus [BB, BO and OO], (b) [BB, BO and AB] versus [AA, AO and OO], and (c) [OO] versus [all others]. The shaded areas correspond to the 95% confidence intervals of the spread of the QQ plot under the null hypothesis of no QTL. The actual points are always within these intervals precluding us to reject the null hypothesis. P-values (nominal, two-sided) were computed using the linear model described in Methods and hereafter. See also Supplemental discussion 3.

## Supplementary information

### Supplementary Information

Supplementary Discussion 1–3.

## Rights and permissions

Reprints and Permissions

Yang, H., Wu, J., Huang, X. et al. ABO genotype alters the gut microbiota by regulating GalNAc levels in pigs. Nature 606, 358–367 (2022). https://doi.org/10.1038/s41586-022-04769-z

• Accepted:

• Published:

• Issue Date:

• DOI: https://doi.org/10.1038/s41586-022-04769-z