Integrated Counts of Carbohydrate-Active Protein Domains as Metabolic Readouts to Distinguish Probiotic Biology and Human Fecal Metagenomes

Bowel microbiota is a “metaorgan” of metabolisms on which quantitative readouts must be performed before interventions can be introduced and evaluated. The study of the effects of probiotic Clostridium butyricum MIYAIRI 588 (CBM588) on intestine transplantees indicated an increased percentage of the “other glycan degradation” pathway in 16S-rRNA-inferred metagenomes. To verify the prediction, a scoring system of carbohydrate metabolisms derived from shotgun metagenomes was developed using hidden Markov models. A significant correlation (R = 0.9, p < 0.015) between both modalities was demonstrated. An independent validation revealed a strong complementarity (R = −0.97, p < 0.002) between the scores and the abundance of “glycogen degradation” in bacteria communities. On applying the system to bacteria genomes, CBM588 had only 1 match and ranked higher than the other 8 bacteria evaluated. The gram-stain properties were significantly correlated to the scores (p < 5 × 10−4). The distributions of the scored protein domains indicated that CBM588 had a considerably higher (p < 10−5) proportion of carbohydrate-binding modules than other bacteria, which suggested the superior ability of CBM588 to access carbohydrates as a metabolic driver to the bowel microbiome. These results demonstrated the use of integrated counts of protein domains as a feasible readout for metabolic potential within bacteria genomes and human metagenomes.


Results
taxonomic shifts of fecal microbiota associated with CBM588 ingestion. We recruited 7 patients 6 months after their small bowel transplantations (SBTs) ( Table 1; 3 males and 4 females). They took oral CBM588 (1.5 × 10 9 CFU/day) daily for 1 month (Fig. 1A). The median age was 37 years (range, 16-59 years). Stool samples were collected before, 1 week after, and 1 month after CBM588 ingestion. Microbiota were profiled by sequencing 16S rRNA-based amplicons 8 with a paired-end approach (raw read numbers are presented in Supplementary  Fig. S1A). Operational taxonomic units (OTUs) were defined using a USEARCH-based pipeline 15 . Rarefaction curves of distinct OTUs were constructed by administering 10 random samplings for each patient ( Supplementary  Fig. S1B). At each depth, the averages of unique OTUs were plotted. All rarefaction curves showed saturating behaviors with an increase in read depths. To compensate for uneven read numbers among different samples, 10 randomly rarefied data sets with normalized reads in mapped OTU format were prepared from the original data sets before downstream analyses were conducted ( Supplementary Fig. S2A). These 7 SBT recipients experienced no apparent infection or rejection during the study period.
Sample diversities were assayed in units of OTU. Those at the same time point were pooled and averaged before analyses. Hill numbers 16 were adopted to evaluate diversities with a parameter q ( Fig. 1B and Supplementary  Fig. S3A). With increasing values of q, the contributions of OTU abundances were increasingly weighted in results of Hill number-based diversities. Without weights (q = 0), Hill numbers were equal to α diversities (see Methods). All 10 rarefied data sets were evaluated and exhibited almost identical plots ( Supplementary Fig. S4). After 1 week, α diversities positively built up from 226.3 ± 2.6 to 251.0 ± 3.2 (SD), representing a 10.9% increase ( Supplementary Fig. S3A). The trend continued further at 1 month (Fig. 1B), increasing by 24.3% from 226.7 ± 3.6 to 282.0 ± 4.7 (SD). However, no significant differences in profiles were noted from positive-q Hill numbers after CBM588 ingestions for either 1 week or 1 month, implying the absence of dominant OTUs in samples.
Contrasts between samples were summarized using principal component analyses (PCAs). For each sample, percentages of OTUs were Hellinger-transformed before analyses were conducted 17 . All 10 rarefied data sets were evaluated, and few differences were noted among the plots (Supplementary Fig. S5). Supplementary Fig. S3B and Fig. 1C present representative 1-week and 1-month results, respectively. With variances of 28.80 ± 0.03% and 14.85 ± 0.03% (SD) explained by PCA leading components, we revealed that 1 week of exposure did not alter microbiota considerably from the baselines ( Supplementary Fig. S3B, triangles vs. circles of the same color). At 1 month, 30.32 ± 0.08% and 18.34 ± 0.03% (SD) of total variances could be accounted for by the top 2 PCA components. CBM588 samples were separated from CBM588-naïve samples for all patients (Fig. 1C, triangles vs. circles of the same color). This time-dependent divergence suggested a CBM588-driven effect on the taxonomic profiles of gut microbiota. The inability of PCAs to identify clustering for any of the 3 time points implied considerable individual variations in microbe compositions.
Each OTU was taxonomically classified using USEARCH 15 with a SILVA-based reference 18 . 16S rRNA reads of each sample were mapped indirectly to a SILVA taxonomy via OTU. At the phylum level ( Supplementary Fig. S3C and Fig. 1D), 5 (Actinobacteria, Bacteroidetes, Firmicutes, Proteobacteria, and Verrucomicrobia) out of 10 phyla were demonstrated to have changed significantly (p < 0.05) according to mixed linear models 19 (MLMs) of at least half of the rarefied data sets. Out of the 3 most abundant of these 5 phyla, Bacteroidetes increased the most (from www.nature.com/scientificreports www.nature.com/scientificreports/ To identify feature families associated with CBM588 administration, we used support vector classification 20 (SVC) and logistic regression 21 (LR) to determine the signature families from contrasts between naïve and 1-month data sets. There were 69 to 74 SILVA-mapped families among the 10 rarefied data sets. Only those that satisfied both SVC and LR models among at least half of the rarefied data sets were selected. We identified Bacteroidaceae, Enterobacteriaceae, and Veillonellaceae among all rarefactions, and Clostridiaceae was observed in half of the data sets (Supplementary Fig. S3D and Fig. 1E). At 1 month, the abundance of Bacteroidaceae and Veillonellaceae increased from 4.00 ± 0.02% to 21.43 ± 0.06% and 10.42 ± 0.03% to 11.55 ± 0.04% (SD), respectively (Fig. 1E). By contrast, the abundance of Enterobacteriaceae and Clostridiaceae decreased from 41.20 ± 0.04% to 34.16 ± 0.04% and 6.86 ± 0.01% to 2.66 ± 0.02% (SD), respectively. Similar trends were already observable at 1 week ( Supplementary Fig. S3D). functional shifts of fecal metagenome associated with CBM588 ingestion. To investigate metagenome functions from 16S rRNA data sets, all 10 rarefaction data sets were subjected to cross-reference analyses to KEGG pathways 22 using Piphillin 10 . Variations in associated KEGG pathways were further analyzed using PCA. All rarefied data sets yielded similar plots ( Supplementary Fig. S6), with the 1-month administration of CBM588 causing significant shifts in the plots for most patients ( Fig. 2A, triangles vs. circles of the same color). The first 2 components explained 63.02 ± 0.10% and 22.82 ± 0.10% (SD) of total variances, respectively. To select feature pathways that were most associated with the 1-month ingestion of CBM588, SVC 20 and LR 21 models were tested against all rarefied data sets. Only those pathways chosen by both models would be accepted. In each rarefaction, 297 to 301 KEGG pathways were observed; however, only 1 pathway, ko00511 or "other glycan degradation, " was constantly distinct from the others. Figure 1B shows radar plots with model coefficients as pointing hands toward 301 KEGG pathways. The only selected pathway was ko00511 with coefficients of 9.6 ± 0.1 (SD) and 20.0 ± 0.2 (SD) for SVD and LR models, respectively. The corresponding percentages of ko00511 among the samples of all rarefied data sets are displayed in Fig. 2C. Most patients exhibited an upward trend for this "other glycan degradation" pathway with CBM588 ingestion.
To verify these findings, 6 samples from 3 patients (P2, P3, and P5) with distinct profiles of ko00511 pathway percentages ( Fig. 2C) were subjected to the shotgun sequencing of metagenomes. Numbers of raw reads for each sample are listed in Supplementary Fig. S1C. We assumed that the quantities of protein domains involved in carbohydrate processing could serve as indicators of metabolic potential in fecal metagenomes. We employed the dbCAN database 23 to analyze 585 hidden Markov models (HMMs) of carbohydrate-active domains. Reads from shotgun sequencing were routed to a pipeline to define domain fractions per million amino acids per 250 nucleotides (DFPMAA 250 ) for every dbCAN-defined HMM ( Supplementary Fig. S2B). The total number of identified domain fractions on a given HMM after normalizations are conducted is DFPMAA 250 of the given domain. The addition of DFPMAA 250 across all HMMs was used to estimate overall carbohydrate processing capabilities, (i.e., ∑DFPMAA 250 ). Ten rarefied data sets at depths across 3 logs were tested for the robustness of ∑DFPMAA 250 (Fig. 3A), which revealed a stable trend, especially among depths with read numbers above or equal to 62,367.
We evaluated if estimates of ∑DFPMAA 250 (Fig. 3A) could exhibit favorable correlations with Piphillin-reported 10 percentages of the KEGG ko00511 pathway 22 (Fig. 2C), both of which were averaged from 10 rarefactions. The Pearson coefficient between ∑DFPMAA 250 and the ko00511 percentages was up to 0.90, which indicated a significant correlation (p < 0.015; Fig. 3B). All 3 patients had higher ∑DFPMAA 250 values after CBM588 ingestion (p < 0.005 after bootstrapping was conducted 10,000 times) (Fig. 3C). The availability of individual DFPMAA 250 estimates for each HMM enabled the profiling of diversities of carbohydrate-active domains in fecal metagenomes. Hill numbers with varying parameters up to 3.0 were plotted against averages of 10 rarefactions (Fig. 3D), in which higher parameter q values attached more weight to quantitatively dominant domains. Diversities were increased for all 3 patients, especially in the range between 0.0 and 1.0. α diversity, or the zero-q Hill number which equals counts of non-zero domains, increased with CBM588 ingestion from 284 to 311, 294 to 300, and 251 to 277 for P2, P3, and P5, respectively.
To validate above observations, we used HUMAnN2 24 and MetaCyc 25 to evaluate the same data sets of shotgun metagenomes ( Supplementary Fig. S1C). HUMAnN2 is based on an enhanced search upon known reference genomes, and MetaCyc provides a different catalog of metabolic pathways from KEGG. Analyses were repeated 5 times each with randomly selected 5 million paired reads per sample from original shotgun data sets ( Supplementary Fig. S1C). MetaCyc defines "Glycan Degradation" Class with 11 instance pathways. Among the 10 prokaryote-relevant pathways only "glycogen degradation I" was identified by HUMAnN2 in all rarefied data sets. This pathway breaks down intracellular glycogen when carbon sources are limiting 26 . Figure 3E shows the pathway abundance of "glycogen degradation I" in square-rooted units. The values were found to complement ∑DFPMAA 250 scores (Fig. 3C). The Pearson coefficient between both modalities of assessment was up to −0.97, indicating a strong negative correlation (p < 0.002; Fig. 3F).
Distinct properties of CBM588 in carbohydrate metabolism. We applied the same DFPMAA 250 pipeline to shotgun sequences of the CBM588 genome ( Supplementary Fig. S2B except Bowtie 2 filters). Notably, ∑DFPMAA 250 estimated a distinct value for CBM588 at 66.37 ± 0.13 (SEM), which was higher than any of human samples (Fig. 3C). Because human feces would yield averages from all bowel bacteria, we suspected that most bacteria would carry lower values of ∑DFPMAA 250 estimates in their genomes. From the data set of Köser et al. 27 , we found the shotgun genome sequences of several strains of bacteria (Table 2). To compensate for artificial underestimations of ∑DFPMAA 250 due to shorter read lengths ( Supplementary Fig. S7A), only those bacteria with average paired read lengths longer than 187.6 bp were subjected to DFPMAA 250 calculation ( Table 2 and Supplementary Fig. S7B). Unsurprisingly 8 out of the 9 evaluated bacteria had lower ∑DFPMAA 250 values than CBM588, with 1 being on the same level (ATCC BAA-334; Fig. 4A). All Gram-positive bacteria were found to  Fig. 4A and Table 2).
The DFPMAA 250 -based heatmap revealed a similar pattern for all human samples (Fig. 4B), implying a universal repertoire requirement for carbohydrate-active protein domains. Instead the uses and abundances of these protein domains varied considerably among different bacteria (Fig. 4B). We categorized scored counts of families of carbohydrate-active domains 28 among the 10 strains of bacteria studied, including auxiliary activities (AAs), carbohydrate-binding modules (CBMs), carbohydrate esterases (CEs), glycoside hydrolases (GHs), glycosyltransferases (GTs) and polysaccharide lyases (PLs; Fig. 4C). For CBM588, up to 45% of the genome-coded carbohydrate-active domains were in the CBM category, which differed significantly (p < 10 −5 ) from the other 9 bacteria (25.42% ± 1.45%, SEM). Further characterization of CBMs diversities with Hill numbers (Fig. 4D) revealed that CBM588 and ATCC BAA-334 were the two strains with highest values. CBM588, however, had a declining curve at a sharper slope, suggesting an unevener distribution of CBMs relative abundances than ATCC BAA-334.

Discussion
DFPMAA 250 estimates the potentials of carbohydrate metabolisms within bacteria genomes and human metagenomes by integrating counts of relevant protein domains. The underlying assumption is the enzyme amount is the key determinant of chemical reactions within a complex system such as human bowels regardless of the availability of reactants. A system based on a similar logic can successfully predict metabolomic turnover in oceans 29 . With the increasing number of interventions into bowel microbiota for medical benefits, a quantitative www.nature.com/scientificreports www.nature.com/scientificreports/ www.nature.com/scientificreports www.nature.com/scientificreports/ base would be required to make logical decisions and predictions for this metaorgan. Our study demonstrates a potential framework for this purpose. Other metabolisms could be formatted similarly. A set of essential scores would provide a valuable reference for clinicians to assess and evaluate bowel microbiomes in the same manner as creatinine for the kidney and transaminases for the liver. Scored summaries would also be easy for non-experts like patients to follow.
Construction of metabolic models between hosts and microbes have been pioneered by many scientists (reviewed by Heinken et al. 30 ). One category of approaches is "top-down" to find correlations among metabolites by statistical analyses. The other category is "bottom-up" to build metabolic networks with units and coefficients of known pathways plus functional annotations from reference genomes. Designs of "top-down" could help identify novel correlations but often lack mechanistic insights; approaches of "bottom-up" would provide astonishing precision but often take tons of time to calculate. Instead our design can be considered as a middle stop between both ends. Scientists can use domain-based scores as correlation targets to make hypotheses of mechanisms and to design experiments with known properties of the given domains. Involved proteins can be even cloned from nucleotide sequences as mapped to the given HMMs in the data sets. Pathways can be constructed accordingly. These are all advantages not readily available from previous approaches.
Values of ∑DFPMAA 250 were sensitive to mean read lengths but correctable up to a threshold of approximately 187.6 bp ( Supplementary Fig. S7A). This was likely due to an inherent criterion set by dbCAN 23 , which requires covered fractions of HMMs by aligned reads to be higher than 0.3. With 90% of the known protein domains smaller than 200 amino acids 31 , mean read lengths at 187.6 bp would likely be sufficient for calculating DFPMAA 250 values. Although this would set a limit on available choices of sequencing platforms, the demonstrated insensitivity to sequencing depths (Fig. 3A) could adequately compensate for this disadvantage. With rapid improvements on read lengths from the platform manufacturers, more sequence sets ready for the DFPMAA 250 pipeline would be expected.
Enhancements of carbohydrate processing capabilities in fecal metagenomes after CBM588 ingestions were supported by increased ∑DFPMAA 250 scores (Fig. 3C). These results are compatible with previous findings that Clostridia is associated with glycan degradation potential 32 . Because the abundance of the Clostridiaceae family decreased in our study (Fig. 1E), CBM588 had a low likelihood of having a direct contribution by mass effect. Instead, we observed that CBM588 not only carried a higher overall value of ∑DFPMAA 250 than other bacteria (Fig. 4A) but also had a significant proportion of carbohydrate-active protein domains in the carbohydrate-binding module category (Fig. 4C). It is likely that CBM588 indirectly diversifies the microbe community by offering access to more glycan varieties. In addition, the increased Bacteroidetes abundance (Fig. 1D) could positively reinforce the glycan-metabolizing potential of the microbiome 33,34 . Independent validation with HUMAnN2 24 and MetaCyc 25 identified a negative correlation (Fig. 3F) between pathway abundance of "glycogen degradation I" and ∑DFPMAA 250 scores. This finding suggests that microbiomes carrying higher values of ∑DFPMAA 250 can compensate the requirements of "glycogen degradation" in the microbe community. In other words, with diversified availability of carbons via enhanced metabolic potentials as implied by high ∑DFPMAA 250 scores, bacteria can thrive with less dependence upon glycogen degradation. It would be interesting to determine if simultaneous additions of glycan-rich foods with CBM588 could elicit any synergistic effects.
In this study, we used CBM588 instead of the more common Lactobacilli-based probiotics because its safety is well established 35 . Reports of its efficacy against enterohemorrhagic Escherichia coli O157:H7 36 and Clostridium difficile 37 also increased our confidence in its use in immunocompromised patients. In a rat model following SBT, Price et al. found that rejection and graft-versus-host disease are associated with shifts in gut microbiota toward potentially pathogenic organisms 38 , which can be ameliorated using probiotics 39 . In humans, Oh et al. indicated that the presence of the Enterobacteriaceae family significantly increases during episodes of rejection after SBT 14 . Bacterial compositions could also be affected by the presence of ileostomy and the availability of oxygen after a transplantation 40 . In our study, the most notable alterations were decreases in Enterobacteriaceae and increases in Bacteroidaceae after the 1-month ingestion of CBM588 in patients (Fig. 1E). With the known association between the Enterobacteriaceae family and graft rejection 14 , our results would support the use of CBM588 to improve the survival of small intestine allografts. The decrease in Enterobacteriaceae per se would in addition imply a lower likelihood of infection by potential pathogens in the family. Risks of rejection and infection might accordingly be minimized from the use of CBM588. Although number of patients in our cohort was limited, our data did reveal www.nature.com/scientificreports www.nature.com/scientificreports/ www.nature.com/scientificreports www.nature.com/scientificreports/ the time-dependent divergence of microbiota and concordant change of carbohydrate metabolism after CBM588 ingestion. Further large-scale investigations are warranted. The developed scoring system detailed herein could readily provide objective and quantitative readouts to facilitate the establishment of clinical reasoning behind adopting probiotics in the care of SBT recipients.

Methods
Sample collection and ethics approval. All patients received SBT 6 months before enrolment into the study (Table 1). Attendees took CBM588 (1.5 × 10 9 CFU/day) daily for 1 month (Fig. 1A). Stool and/or blood samples were collected before, 1 week, and 1 month after CBM588 ingestion. The experimental protocol was approved by the Institutional Review Board of Far Eastern Memorial Hospital, New Taipei City. The study was conducted in accordance with the relevant guidelines. Informed consent was obtained from patients directly or from their parents if attendees were younger than 18 years old.
Software and hardware. All analyses were performed on a 2013 Mac Pro equipped with 3.7-GHz Quad-Core Intel Xeon E5, 64 GB of memory, and 2 AMD FirePro D700 6 GB graphics cards. Inputs and outputs of various specialized packages were glued with Python 2.7 scripts.
next-generation sequencing. All sequencing libraries including 16S rRNA, shotgun metagenomes, and CBM588 genome were constructed and sequenced using commercial tools. The 16S libraries were sequenced on Illumina MiSeq as 2 × 300 bp paired-end readings, whereas the shotgun fecal metagenomes and CBM588 genome were determined on Illumina NextSeq as 2 × 150 bp paired-end readings. Fig. S1A) carrying primer sequences were trimmed using a self-developed script. Results were paired using PEAR 41 and filtered using USEARCH 15 . Pooled sequences were used to define operational taxonomic units (OTUs) by USEARCH with default criteria. Saturation curves of distinct OTUs ( Supplementary Fig. S1B) were plotted as means of 10 random selections with increasing numbers of paired reads. Ten rarefied subsets of reads in the format of mapped OTUs were prepared by giving each the same number of reads before downstream analyses were conducted.

Diversities in Hill numbers.
Hill numbers define a diversity profile with a parameter q formulated as follows, where S is "species equivalent, " f represents frequency of "species equivalent, " and q denotes the parameter.
With increasing values of q, more weights are given to the more abundant species equivalent. In this study, OTUs (Fig. 1B, Supplementary Figs S3A, and S4), or dbCAN-defined hidden Markov models (HMMs) 23 (Figs 3D and 4D) were used as species equivalents. At q = 0, species equivalents were counted without considering their normalized frequencies. We employed this zero-parameterized Hill number as the definition for α diversity 42 . At q = 1, counts were proportional to their normalized frequencies (i.e., Shannon diversity), whereas at q = 2, only dominant species equivalents were counted. A deeper slope of the curve represents an unevener distribution of relative abundances of species equivalents. principal component analysis. Decomposition with principal component analysis (PCA) was performed with scikit-learn in Python 43 . Values were Hellinger-transformed before analyses 17 . OTUs and KEGG-defined pathways 22 were used as variables to decompose 16S rRNA and metagenome functions, respectively. Selection of signature phyla and families. OTUs were taxonomically classified using USEARCH 15 and a SILVA-based reference 18 . 16S rRNA reads were given the same taxonomic designations as the associated OTUs. Mixed linear models 19 (MLMs) were adopted to identify signature phyla which best discriminated CBM588 effects among at least half of the rarefied data sets. Support vector classification 20 (SVC) and logistic regression 21 (LR) were used to select signature families linked to CBM588 exposure. Only those families that were in agreement in both SVC and LR models among at least half of the rarefied data sets were taken. Parameters were optimized by leave-one-out cross-validations.
16S rRNA-inferred metagenome functions. Piphillin 10 was used to extrapolate metagenomic functions to KEGG pathways 22 with 16S rRNA sequences. In addition to PCA analyses, KEGG pathways were also subjected to signature selections with SVC 20 and LR 21 models. Only those picked by both models among at least half of the rarefied data sets would be accepted. Model parameters were optimized with leave-one-out cross-validations.
Domain fractions for reads of shotgun metagenomes and bacteria genomes. Raw reads ( Supplementary Fig. S1C) were paired using PEAR 41 with quality control defaults. Bowtie 2 44 was used to filter out human sequences for reads of human origin. Genes were predicted using FragGeneScan 45 . CAZy-associated 28 HMMs were downloaded from dbCAN 23 for HMMER scanning 46 upon PEAR-assembled reads, where unassembled reads were excluded. "hmmsearch" with Z-value adjustment to 585 was used instead of "hmmscan" to increase scanning efficiency. Calculations of domain fractions are specified below.
Domain fractions per million amino acids 250. Domain fractions were determined by hmmscan-parser.
sh as downloaded from dbCAN 23 , but minimal changes to switch parameters were made to reflect the use of "hmmsearch" instead of "hmmscan." Other criteria, including p values and a minimal domain fraction of 0. www.nature.com/scientificreports www.nature.com/scientificreports/ as defined by dbCAN, were not altered. For a given domain, summed domain fractions normalized by counts of amino acid inputs and 250 base pairs were designated as "domain fractions per million amino acids per 250 nucleotides" or DFPMAA 250 (Supplementary Fig. S2B). Python scripts for the demonstration purpose is available in the Supplementary Information. HUMAnN2 validation. Evaluations were repeated 5 times against the MetaCyc 25 database. Each run was conducted upon a rarefied data set of 5 million randomly picked paired reads of each sample from the original shotgun data sets ( Supplementary Fig. S1C). There are 11 instance pathways in the MetaCyc 25 "Glycan Degradation" Class, including (1,4)-β-D-xylan degradation, cellulose degradation I, chondroitin sulfate degradation, dermatan sulfate degradation, glycogen degradation I, glycogen degradation II, homogalacturonan degradation, L-arabinan degradation, pectin degradation I, starch degradation I, and xyloglucan degradation I. Abundance of "glycogen degradation II" was discarded because the pathway is restricted to eukaryotes.
CBM588 genome sequencing. The construction of libraries and next-generation sequencing were contracted to commercial service providers. Raw reads (Supplementary Fig. S1D) were assembled by SPAdes 47 into contigs and scaffolds to confirm the probiotic identity by in silico polymerase chain reaction 48 to find corresponding 16S sequences with 5′-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG-3′ and 5′-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC-3′ primers ( Supplementary Fig. S1E). For calculations of DFPMAA 250 raw reads were processed as fecal shotgun metagenomes without contig or scaffold assembling.
DfpMAA 250 for bacteria genomes. Shotgun genomes of bacteria were found from the data sets of Köser et al. 27 , whereas the CBM588 genome was prepared as described in the previous section. Shorter versions of CBM588 shotgun sequences ( Supplementary Fig. S7) were simulated by using random trimmings of 5′ and 3′ ends of paired full-length reads. Only those bacteria with mean lengths of paired reads of over 187.6 bp were subjected to DFPMAA 250 analyses. Gene predictions and protein domain scanning were conducted in the same manner for fecal shotgun metagenomes but without the use of Bowtie 2 filters (Supplementary Fig. S2B).

Data availability
All raw sequencing data generated in this study have been submitted to the European Nucleotide Archive (https:// www.ebi.ac.uk/ena) under accession number PRJEB27661.