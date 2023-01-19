Selection of newly reconstructed organisms and retrieval of whole-genome sequences

First, we retrieved 4,185 genomes of human gut-associated strains that were available on PubSEED53 (Supplementary Note 6). To expand the species coverage, we performed an extensive literature search of species isolated from or detected in the human microbiome with available whole-genome sequences (Supplementary Table 1). This search led to the addition of a further 1,324 strains, which included 127 genomes of mouse-associated strains. The corresponding whole-genome sequences were retrieved in FASTA format from the National Center for Biotechnology Information (NCBI) FTP site (ftp://ftp.ncbi.nlm.nih.gov/). Moreover, we included 26 genomes of Eggerthella lenta strains54 available at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA412637. Finally, we retrieved 761 human microbial genomes from the Human Gastrointestinal Bacteria Culture Collection55 in FASTQ format from https://www.ebi.ac.uk/ena/data/view/PRJEB23845 and https://www.ebi.ac.uk/ena/data/view/PRJEB10915. Together with AGORA1.03, which was obtained from the VMH23, these combined efforts resulted in 7,302 strains and 1,738 species included in AGORA2.

Manual refinement of metabolic pathways and gene annotations through comparative genomics

Of the 7,302 analyzed strains, 5,438 bacterial strains and three archaeal strains were present in the PubSEED resource53,56 (Supplementary Note 6) and could be re-annotated for their metabolic functions through comparative genomics. A total of 34 metabolic subsystems that had been reconstructed previously for a smaller subset of gut microbial strains20,57,58,59,60, as well as a newly created drug metabolism subsystem, were considered for the analysis (Supplementary Table 3a for a comprehensive list of subsystems). All subsystems are available at the PubSEED website.

Curation of subsystems

For annotation of the genes in each subsystem, the PubSEED platform was used53. Functional roles for each subsystem were annotated based on the (1) prescribed functional role for the protein, (2) sequence similarities of the protein to proteins with previously confirmed functional roles and (3) genomic context (Supplementary Note 7).

Metabolic pathway considerations for comparative genomic analysis

The absence of genes for one or more enzymes in a pathway may result in blocked reactions in a metabolic reconstruction. To avoid this, we estimated the completeness of metabolic pathways during genome annotation. For each potentially synthesized metabolite, all biosynthetic pathways were collected in agreement with the KEGG PATHWAY resource61, and the genes of the subsystem were assigned to the corresponding steps of these pathways. The absence of one or more consecutive reactions was defined as a gap. Only pathways with no more than two gaps, each with a gap length of no more than one step (Supplementary Note 8), were subsequently gap-filled and used to generate reactions.
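The acceptance criterion can be illustrated with a minimal sketch. It assumes a strictly linear pathway represented as an ordered list of booleans (one per step, True when a gene was assigned), which simplifies the branched pathways found in KEGG. The function name and data representation are hypothetical and are not taken from the annotation workflow.

```python
from itertools import groupby


def pathway_passes_gap_filter(steps_present, max_gaps=2, max_gap_length=1):
    """Return True if a linear pathway is complete enough to be gap-filled.

    steps_present: ordered list of booleans, one per pathway step
    (True = a gene was assigned to the step). A gap is a run of
    consecutive steps without an assigned gene.
    """
    gap_lengths = [
        sum(1 for _ in run)
        for present, run in groupby(steps_present)
        if not present
    ]
    return len(gap_lengths) <= max_gaps and all(
        length <= max_gap_length for length in gap_lengths
    )
```

Under these assumptions, a five-step pathway missing steps 2 and 4 is accepted, whereas a pathway missing two consecutive steps is rejected.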

Sequence-based gap-filling

For pathways with gaps, the bidirectional best-hit (BBH) method62 was used. (1) The gene corresponding to the gap, taken from the genome of a related organism (belonging to the same species, genus or family), was used as the query for a BLAST search in the genome with the gap. (2) Possible BBHs were defined as homologs whose alignment with the query protein had an e-value < 1e−50 and a protein identity ≥50%. (3) For each possible BBH, a reverse search was performed in the genome that was the source of the query protein. (4) If the query protein and its best homolog in the analyzed genome formed a BBH pair, the gap was filled. (5) A similar genomic context for the query protein and its ortholog was considered additional confirmation of the orthology of the identified BBH pair.
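Steps (1) to (4) can be sketched as follows. This is a simplified illustration rather than the PubSEED implementation: it assumes that BLAST hits have already been parsed into dictionaries, that the "best" hit is the one with the highest bit score, and that the e-value cutoff is read as 1e−50. All function names, field names and gene identifiers are hypothetical.

```python
def best_hit(hits, max_evalue=None, min_identity=None):
    """Return the subject ID of the top-scoring hit, or None.

    hits: list of dicts with keys 'subject', 'evalue', 'identity' (percent)
    and 'bitscore'. Optional thresholds filter hits before ranking.
    """
    passing = [
        h for h in hits
        if (max_evalue is None or h["evalue"] < max_evalue)
        and (min_identity is None or h["identity"] >= min_identity)
    ]
    if not passing:
        return None
    return max(passing, key=lambda h: h["bitscore"])["subject"]


def is_bbh_pair(query_id, forward_hits, reverse_hits):
    """Check whether a query and its best forward hit are bidirectional best hits.

    forward_hits: hits of the query protein in the genome with the gap.
    reverse_hits: maps each candidate gene in that genome to its hits in
    the genome the query came from.
    """
    candidate = best_hit(forward_hits, max_evalue=1e-50, min_identity=50.0)
    if candidate is None:
        return False
    return best_hit(reverse_hits.get(candidate, [])) == query_id


# Toy data: geneA (reference genome) searched against the genome with the gap.
forward = [
    {"subject": "gap_0012", "evalue": 1e-80, "identity": 62.0, "bitscore": 310.0},
    {"subject": "gap_0455", "evalue": 1e-20, "identity": 35.0, "bitscore": 95.0},
]
reverse = {
    "gap_0012": [
        {"subject": "geneA", "evalue": 1e-78, "identity": 62.0, "bitscore": 305.0},
        {"subject": "geneB", "evalue": 1e-30, "identity": 41.0, "bitscore": 120.0},
    ]
}
gap_can_be_filled = is_bbh_pair("geneA", forward, reverse)
```

Genomic context (step 5) is not modeled here; in the described workflow it serves as additional confirmation rather than as a filter.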

Annotation of the drug metabolic genes

To annotate drug-metabolizing genes, we used the following pipeline. (1) Genes known to encode drug-metabolizing enzymes in a range of microbial organisms were identified from the scientific literature (Supplementary Table 5a). (2) Using the amino acid sequences of these known drug-metabolizing genes as queries, we performed a BLAST search in every analyzed genome. (3) The resulting best BLAST hit was then used as the query for a BLAST search in the genome carrying the known drug-metabolizing gene to confirm that the known protein sequence and its best BLAST hit formed a pair of BBHs. (4) All BBHs were used for the construction of a rooted maximum-likelihood tree. (5) All previously known proteins were mapped onto the tree, and all monophyletic branches containing known drug-metabolizing enzymes were determined (Supplementary Fig. 10). (6) All annotated proteins in these branches were considered orthologs of the known drug-metabolizing proteins. All proteins outside branches with known drug-metabolizing proteins were considered to have other functions and were excluded from further analysis. Subsequently, a tree was constructed again for the orthologs of the known drug-metabolizing proteins. (7) For L-tyrosine decarboxylase (TdcA, Enzyme Commission (EC) 4.1.1.25) and cytidine deaminase (cCda, EC 3.5.4.5), we found that the genomic context is conserved between species and therefore also analyzed the genomic context. If the genomic context of a candidate gene was similar to that of a known drug-metabolizing gene, the candidate was considered an ortholog of the known protein. Otherwise, it was considered a false positive prediction and excluded from further analysis (Supplementary Note 9 and Supplementary Fig. 10). As in step (6), the tree was constructed again for only the orthologs of the known proteins. (8) For each tree including only the orthologs of the known genes, we defined the monophyletic branches containing proteins derived from only one species. For each such species-specific branch, we predicted the subcellular localization (Supplementary Note 10) using the CELLO v.2.5 system (cello.life.nctu.edu.tw). (9) For cytoplasmic enzymes, drug transporters were predicted based on genomic context (Supplementary Note 11 and Supplementary Table 3b).

Tools

The PubSEED platform53,56 was used to annotate the subsystems. To search for BBHs for previously known proteins, the BLAST algorithm63 implemented in the PubSEED platform was used. Additionally, the PubSEED platform was used for analysis of the genomic context. To analyze the protein domain structure, we searched the Conserved Domains Database (CDD)64 using the following parameters: an e-value ≤ 0.01 and a maximum number of hits equal to 500. For the prediction of protein subcellular localization, the CELLO65 web tool was used. Alignments were performed using MUSCLE v.3.8.31 (ref. 66). For every multiple alignment, position quality scores were evaluated using Clustal X67,68. Thereafter, all positions with a score of zero were removed from the alignment and the modified alignment was used for construction of the phylogenetic trees. Phylogenetic trees were constructed using the maximum-likelihood method with the default parameters implemented in PhyML-3.0 (ref. 69). The obtained trees were midpoint-rooted and visualized using the interactive viewer Dendroscope, v.3.2.10, build 19 (ref. 70).
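The alignment trimming step can be illustrated with a short sketch. It assumes that the alignment is held as a mapping from sequence name to aligned string and that per-column quality scores (as reported by Clustal X) are available as a list of numbers. The function name and the toy data are hypothetical.

```python
def drop_zero_score_columns(alignment, column_scores):
    """Remove every alignment column whose quality score is zero.

    alignment: dict mapping sequence name -> aligned sequence string.
    column_scores: one numeric quality score per alignment column.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(column_scores)}:
        raise ValueError("all sequences must match the number of column scores")
    keep = [i for i, score in enumerate(column_scores) if score != 0]
    return {
        name: "".join(seq[i] for i in keep)
        for name, seq in alignment.items()
    }


# Toy alignment: the third column is poorly conserved and scored zero.
aligned = {"protA": "MK-LV", "protB": "MR-LI", "protC": "MKALV"}
scores = [100, 45, 0, 100, 60]
trimmed = drop_zero_score_columns(aligned, scores)
```

The trimmed alignment would then be passed to the tree-building program.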

Literature and database searches

Biochemical and physiological characterization papers were retrieved by entering the names of AGORA2 species into PubMed (https://www.ncbi.nlm.nih.gov/pubmed/). Information on 132 carbon sources, 30 fermentation pathways, 64 growth factors, consumption of 73 metabolites and secretion of 51 metabolites was subsequently manually extracted on the species and/or genus level from 732 peer-reviewed papers and >8,000 pages of microbial reference textbooks71. Moreover, the traits of each reconstructed strain including taxonomy, morphology, metabolism and genome size were retrieved through database searches. The taxonomic classification of the strains was retrieved from NCBI Taxonomy (https://www.ncbi.nlm.nih.gov/taxonomy/). Information on morphology, habitat, body site, gram status, oxygen status, metabolism, motility and genome size was manually retrieved from the Integrated Microbial Genomes and Microbiomes72 database (https://img.jgi.doe.gov/) (Supplementary Table 1). All experimental data that were used to refine AGORA2 are available at https://github.com/opencobra/COBRA.papers/tree/master/2021_demeter/input.

Generation of draft reconstructions

Draft reconstructions were generated through the KBase24 narrative interface. Genomes present in KBase were directly imported into the narrative. Otherwise, genomes in FASTA format were uploaded into the Staging Area and, subsequently, imported into the narrative through the ‘Batch Import Assembly From Staging Area’ (https://narrative.kbase.us/#catalog/apps/kb_uploadmethods/batch_import_assembly_from_staging) app. Genomes in FASTQ format were directly imported into the narrative through the ‘Import Paired-End Reads From Web’ (https://narrative.kbase.us/#catalog/apps/kb_uploadmethods/load_paired_end_reads_from_URL) app after retrieving the links to the corresponding files from https://www.ebi.ac.uk/ena/data/view/PRJEB23845 and https://www.ebi.ac.uk/ena/data/view/PRJEB10915. The imported assemblies were annotated using RAST subsystems73 through the ‘Annotate Multiple Assemblies’ (https://narrative.kbase.us/#appcatalog/app/RAST_SDK/annotate_contigsets) app. Draft metabolic reconstructions were generated through the ‘Create Multiple Metabolic Models’ (https://narrative.kbase.us/#appcatalog/app/fba_tools/build_multiple_metabolic_models) app and exported in SBML format through the ‘Bulk Download Modeling Objects’ (https://narrative.kbase.us/#appcatalog/app/fba_tools/bulk_download_modeling_objects) app.

Semiautomated, data-driven refinement pipeline

We developed a semiautomated refinement pipeline, DEMETER19, which had been previously used to build AGORA20. Briefly, DEMETER was developed by testing gap-filling steps in a few reconstructions and propagating the identified solutions to many reconstructions. Curation against experimental data is performed in DEMETER by gap-filling the appropriate reconstructions with a complete pathway for each experimentally demonstrated function. Biomass production under aerobic and anaerobic conditions and on defined media, as well as biosynthesis of cell wall components, is also enabled through gap-filling solutions that had been previously determined in a few reconstructions. Similarly, futile cycles are resolved through corrections to the affected reactions that were identified in a few reconstructions and propagated during the development of DEMETER. More details on DEMETER are provided in ref. 19. A detailed tutorial is available as part of the COBRA Toolbox47.

For the generation of AGORA2, we revised DEMETER substantially. Specifically, we (1) translated ~1,000 additional reactions and ~800 metabolites from KBase to VMH23 nomenclature; (2) introduced additional gap-filling reactions, where needed, to enable biomass production under anoxic conditions on a complex medium with thermodynamically consistent reaction directionalities; (3) removed futile cycles resulting in thermodynamically implausible ATP production by making the responsible reactions irreversible; (4) ensured through gap-filling and/or deletion of appropriate reactions that all reconstructions captured the collected experimental data; and (5) adjusted biomass objective functions to account for class-specific cell membrane and cell wall structures as well as introducing a periplasm compartment (Supplementary Note 3). As described previously20, all refinement and debugging solutions were manually determined for a subset of the reconstructions and subsequently propagated to many reconstructions, as appropriate. All newly included metabolites and reactions were formulated based on literature and/or database23,28,74 searches, while ensuring mass and charge balance through the reconstruction tool rBioNet75. Reactions identified through comparative genomics (Supplementary Table 3b,c) were added to up to 5,438 reconstructions. Non-gene-associated reactions, for which the respective gene could not be found through comparative genomics, were removed from the draft reconstructions if doing so did not abolish biomass production.

Curation efforts were verified via a test suite19. Specifically, it systematically tested whether each reconstruction (1) grew anaerobically on complex medium; (2) had correct reconstruction structure, that is, mass and charge balance, and correct syntax for gene–protein–reaction associations; (3) was thermodynamically feasible, for example, produced realistic amounts of ATP; and (4) captured known metabolic traits of the organism according to the collected experimental and comparative genomic data. Supplementary Table 2 summarizes all features that are tested by the test suite.
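As an illustration of check (2), the following sketch tests a single reaction for mass and charge balance. It is not the DEMETER test suite: the helper names are hypothetical, the formula parser handles only simple formulas without brackets or R groups, and the formulas and charges for ATP hydrolysis are given for illustration.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple chemical formula such as 'C10H12N5O13P3'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def is_balanced(stoichiometry, metabolites):
    """Return True if a reaction conserves every element and the net charge.

    stoichiometry: dict metabolite ID -> coefficient (negative = consumed).
    metabolites: dict metabolite ID -> (formula, charge).
    """
    elements = Counter()
    net_charge = 0
    for met_id, coefficient in stoichiometry.items():
        formula, charge = metabolites[met_id]
        for element, n in parse_formula(formula).items():
            elements[element] += coefficient * n
        net_charge += coefficient * charge
    return net_charge == 0 and all(v == 0 for v in elements.values())


# Illustrative data: atp + h2o -> adp + pi + h
mets = {
    "atp": ("C10H12N5O13P3", -4),
    "h2o": ("H2O", 0),
    "adp": ("C10H12N5O10P2", -3),
    "pi": ("HO4P", -2),
    "h": ("H", 1),
}
atp_hydrolysis = {"atp": -1, "h2o": -1, "adp": 1, "pi": 1, "h": 1}
missing_proton = {"atp": -1, "h2o": -1, "adp": 1, "pi": 1}
```

With these inputs, the complete reaction is balanced, whereas the variant lacking the proton fails both the hydrogen and the charge check.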

For consistency, the existing 818 AGORA1.03 reconstructions (v.25.02.2019, available at https://www.vmh.life/files/reconstructions/AGORA/1.03/AGORA-1.03.zip) also underwent refinement through DEMETER. The AGORA1.03 reconstruction of Staphylococcus intermedius ATCC 27335 was removed since it was a duplicate of the newly reconstructed strain Streptococcus intermedius ATCC 27335. The names of eight AGORA1.03 reconstructions were changed to correct strain determination and/or spelling (Supplementary Table 1).

DEMETER has been implemented in the COBRA Toolbox47 and was run in MATLAB (MathWorks) v.R2020b.

Generation of quality control reports

The quality control reports and associated scores were determined for each AGORA2 reconstruction using the MetaboReport tool in the COBRA Toolbox47. The quality checks included are consistent with the Memote42 checks, as were the calculations of the scores. All 7,302 reports can be accessed via https://metaboreport.live.

Formulation of the drug reactions

A literature search for microbial enzymes known to transform, degrade, activate, inactivate or indirectly influence commonly prescribed drugs was performed, yielding 15 enzymes in total (Fig. 3a and Supplementary Table 5), which are encoded by 25 genes (Supplementary Table 3b). To enable comparative genomic analyses, only drug transformations that could be linked to specific protein-encoding genes were considered. As described above, enzyme-encoding genes were analyzed in their genomic context as outlined in ref. 76 using PubSEED subsystems26,53. Additional information on the presence of the analyzed genes was retrieved from refs. 39,77,78.

Literature and database searches were performed for the metabolic fate of commonly prescribed human-targeted drugs. The structures of 287 drug metabolites and drug degradation products were retrieved from 73 peer-reviewed papers, HMDB79, DrugBank79 and the Transformer database80. Reactions were formulated based on the collected experimentally determined drug structures, drug downstream product metabolite structures and reaction mechanisms. Both cytosolic and extracellular enzymatic reactions were formulated depending on the identified subcellular protein locations. Since at least six drugs undergoing glucuronidation in the human body have been shown to be substrates for the microbial β-glucuronidase81,82 (Supplementary Table 6), it was assumed that all retrieved glucuronidated drug metabolites (118 in total) could serve as substrates. Additionally, β-glucuronidase reactions were formulated for 33 glucuronidated drug metabolites from a previously reconstructed module of human drug metabolism83 and three glucuronidated hormones from Recon3D (ref. 21). New metabolites and reactions were assigned VMH IDs following standards in nomenclature used for COBRA reconstructions9, and formulated while ensuring mass and charge balance through the reconstruction tool rBioNet75. In total, for 98 drugs (Fig. 3b), 353 unique metabolites, 381 enzymatic reactions, 373 exchange reactions and 710 transport reactions (Supplementary Table 6a,b) were formulated.

Atom–atom mapping

The COBRA Toolbox47 function ‘generateChemicalDatabase’ was used to generate atom–atom mappings. The process to obtain the atom–atom mappings for the AGORA2 reconstructions can be summarized as follows: (1) 1,894 out of 3,533 metabolic structures from the AGORA2 reconstructions were collected from the SMILES and InChIs associated with their metabolites and different chemical databases, such as VMH23, KEGG74, HMDB79, PubChem84 and ChEBI85 databases; the metabolic structures were standardized based on the InChI algorithm86 and can be found in the VMH database23; (2) the standardized metabolites and the reaction stoichiometry in the AGORA2 reconstructions were used to generate 5,583 out of 7,300 MDL RXN files; (3) 5,583 out of 7,300 AGORA2 reactions were atom mapped using the Reaction Decoder Tool algorithm87 for active transport reactions and a custom algorithm47 for passive transport reactions and coupled transport reactions. Atom–atom mappings can be found in the VMH database23 and are freely available at https://github.com/opencobra/ctf.

Simulations

All simulations were performed in MATLAB (MathWorks) v.R2020b with IBM CPLEX (IBM) as the linear and quadratic programming solver. Computations were carried out on a tower with a 2.80-GHz processor and 64-GB RAM with 12 cores dedicated to parallelization. The simulations were carried out using functions implemented in the COBRA Toolbox47. Flux balance analysis (FBA)34 was used to simulate metabolic fluxes. All additional scripts for data generation, data analysis and data visualization are available at https://github.com/ThieleLab/CodeBase.
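For reference, FBA is commonly written as the linear program below. The notation is the conventional one and is introduced here for illustration only: S is the m × n stoichiometric matrix, v the vector of reaction fluxes, c the objective coefficients, and l and u the lower and upper flux bounds.

```latex
\begin{aligned}
\max_{v \in \mathbb{R}^{n}} \quad & c^{\top} v \\
\text{subject to} \quad & S v = 0, \\
& l \le v \le u .
\end{aligned}
```

Minimizing the flux through a single reaction j, as is done for uptake through exchange reactions, corresponds to setting c to the negative of the j-th unit vector.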

Retrieval of reconstruction resources

Manually and semiautomatically curated reconstructions compared with AGORA2 were retrieved as follows: 72 fully manually curated reconstructions were downloaded from the BiGG database28 (http://bigg.ucsd.edu/). Reconstructions generated through gapseq18 (8,075 total) were downloaded from ftp://ftp.rz.uni-kiel.de/pub/medsystbio/models/EnzymaticDataTestModels.zip and exported in SBML format through the sybilSBML package in R using a custom script. MAGMA17 reconstructions (1,333 total) were downloaded from https://www.microbiomeatlas.org/data/MSP_GEM_models.zip. To enable comparability with AGORA2, exchange reactions in all retrieved reconstructions were translated to VMH23 nomenclature through custom MATLAB scripts. Moreover, an ATP demand reaction (VMH reaction ID: DM_atp_c_) was added if not already present and otherwise translated to VMH nomenclature.

Generation of reconstructions through CarveMe

Protein FASTA files corresponding to 7,279 AGORA2 strains were downloaded from either NCBI (https://www.ncbi.nlm.nih.gov/assembly) or ENA (https://www.ebi.ac.uk/ena) and subsequently used to run CarveMe. The remaining 23 AGORA2 strains were excluded as a corresponding protein FASTA file was not available. Reconstructions for the 7,279 strains were generated with CarveMe15 v.1.5.1 on Python 3.7.13 (retrieved from https://www.python.org/downloads/release/python-3713), relying on DIAMOND88 v.0.9.14.

Generation of reconstructions through gapseq

Genome FASTA files retrieved as described above were used as the input for gapseq18. A total of 1,767 models were generated with gapseq 1.2, which was run in R89 v.4.1.2 on an Ubuntu 22.04 machine. The R interface of GLPK (package Rglpk) was used as the linear programming solver.

Flux and stoichiometrically consistent reactions

The subset of flux and stoichiometrically consistent reactions, as defined in ref. 29, was retrieved through the ‘findFluxConsistentSubset’ and ‘findStoichConsistentSubset’ functions implemented in the COBRA Toolbox47. The fraction of stoichiometrically and flux consistent reactions, excluding exchange and demand reactions, was subsequently determined for each AGORA2 reconstruction and corresponding KBase draft reconstruction as well as for 5,587 reconstructions generated through CarveMe15, 8,075 reconstructions generated through gapseq18, 1,333 MAGMA17 reconstructions and 73 curated reconstructions from the BiGG database28. Briefly, the subset of stoichiometrically consistent reactions in a reconstruction includes all reactions that are mass and charge conserved, excluding exchange, demand and sink reactions, which are by definition mass and charge imbalanced29. The subset of flux consistent reactions consists of all reactions that can carry flux under the defined set of constraints29.
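Once the consistent subsets have been computed, the reported fraction is a simple ratio. The sketch below assumes that exchange and demand reactions can be recognized by the ID prefixes EX_ and DM_, which is a naming convention and not something the COBRA Toolbox functions above require. The function name and the toy reaction list are hypothetical.

```python
def consistent_fraction(reaction_ids, consistent_ids, excluded_prefixes=("EX_", "DM_")):
    """Fraction of internal reactions that belong to a consistent subset.

    reaction_ids: all reaction IDs in the reconstruction.
    consistent_ids: set of IDs found to be flux (or stoichiometrically) consistent.
    excluded_prefixes: ID prefixes marking exchange and demand reactions (assumed).
    """
    internal = [r for r in reaction_ids if not r.startswith(excluded_prefixes)]
    if not internal:
        return 0.0
    return sum(r in consistent_ids for r in internal) / len(internal)


reactions = ["EX_glc_D(e)", "DM_atp_c_", "PGI", "PFK", "FBA", "TPI"]
flux_consistent = {"EX_glc_D(e)", "PGI", "PFK", "FBA"}
fraction = consistent_fraction(reactions, flux_consistent)
```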

Validation against three independent experimental datasets

For an independent assessment of the predictive potential of genome-scale reconstructions, independent (that is, not used for the reconstruction process) experimental data on metabolite uptake and secretion were retrieved from three sources30,32,33 and mapped onto the VMH23 nomenclature through custom MATLAB scripts. The experimental data included species-level positive and negative metabolite uptake and secretion data for 457 species (5,341 strains) and 269 metabolites in AGORA2 from the NJC19 resource30, and species-level positive metabolite uptake data from ref. 32 for 184 species (328 strains) and 85 metabolites in AGORA2. Moreover, strain-resolved positive and negative metabolite uptake and secretion data for 676 AGORA2 strains and 220 metabolites, and positive and negative enzyme activity data for 881 AGORA2 strains and 31 enzymes, were retrieved from the BacDive database33. The enzyme data were mapped to the respective reactions in each of the compared reconstruction resources’ namespaces. Positive data indicated that the metabolite uptake, secretion capability or enzyme activity had been demonstrated in a microorganism, while negative data indicated that the microorganism had been shown not to possess the capability. For each retrieved positive or negative data point, the capability of the respective model to take up or produce the corresponding metabolite was calculated using FBA on unlimited medium by either minimizing or maximizing the corresponding exchange reaction, respectively. For enzyme data, it was tested whether at least one reaction mapped to the respective enzyme was present in the model and could carry a nonzero flux. If the data point was positive and the corresponding model could also take up or secrete the metabolite or produce flux through the corresponding enzymatic reaction(s), this resulted in a true positive prediction, while a false negative prediction occurred when the microorganism was known to have this capability, but the corresponding model did not capture the trait. If the data point was negative and the corresponding model also could not take up or secrete the metabolite or did not produce flux through any reaction(s) mapped to the enzyme, this resulted in a true negative prediction, and otherwise the prediction was a false positive.
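The four outcomes described above can be tallied as in the following sketch. It assumes that each data point has been reduced to a pair of booleans (experimental observation, model capability), and it uses the standard definition of accuracy as the share of true positive and true negative predictions. The function names are hypothetical.

```python
from collections import Counter


def classify_prediction(observed_positive, model_capable):
    """Map one experimental data point and one model result to TP, FN, FP or TN."""
    if observed_positive:
        return "TP" if model_capable else "FN"
    return "FP" if model_capable else "TN"


def tally(data_points):
    """Count outcomes over an iterable of (observed_positive, model_capable) pairs."""
    counts = Counter({"TP": 0, "FN": 0, "FP": 0, "TN": 0})
    counts.update(classify_prediction(obs, cap) for obs, cap in data_points)
    return counts


def accuracy(counts):
    """Share of data points predicted correctly (TP + TN over all points)."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total if total else float("nan")


example = [(True, True), (True, False), (False, False), (False, True), (True, True)]
counts = tally(example)
```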

Prediction accuracies were calculated for the three experimental datasets. For an assessment of the predictive potential of AGORA2 compared with other reconstruction resources, the analysis was repeated for the strains in KBase draft reconstructions; CarveMe reconstructions; and BiGG, gapseq and MAGMA reconstructions that overlapped with the AGORA2 organisms with available data. To this end, the predictive value of all resources was tested via mixed effect logistic regressions with the in silico prediction as predictor and the in vivo behavior (binary) as response variable, while introducing the model as random effect variable accounting for the stochastic dependencies of predictions for different metabolites stemming from the same model. Moreover, the accuracy per model was calculated for all resources, and then compared with the AGORA2 accuracies via nonparametric sign rank tests. The list of all strains in the compared reconstruction resources that were tested against the three datasets is shown in Supplementary Table 4a. All scripts are available at https://github.com/ThieleLab/CodeBase.

Validation of drug-metabolizing capacities against independent experimental data

A literature search was performed for in vitro experiments demonstrating the capabilities of human microbial strains to metabolize reconstructed drugs through the 15 annotated enzymes, resulting in 253 drug–microbe pairs (Supplementary Table 7). As these data contained both positive and negative data, true positive, true negative, false positive and false negative predictions could occur as described above. If no studies on the specific reconstructed drugs were found for the enzyme, studies on general activity of the enzyme were retrieved. If possible, the tested microorganisms were matched to AGORA2 models on the strain level, and otherwise pan-species models were used. Subsequently, the capabilities to metabolize the drugs through the respective enzymes for the 164 AGORA2 models with available data (Supplementary Table 7) were tested by computing whether the corresponding reaction could carry flux. Accuracy, sensitivity and specificity of predictions were calculated after determining the number of true positive, true negative, false positive and false negative predictions. P values were calculated by Fisher’s exact test and, for sensitivity analysis, by mixed effect logistic regression including the model as random effect variable, accounting for the stochastic dependency of predictions stemming from the same model.
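Sensitivity, specificity and a two-sided Fisher's exact test can be computed from the four counts as sketched below. The arrangement of the 2 × 2 table (observed outcome by predicted outcome) and the two-sided variant, which sums the probabilities of all tables that are no more likely than the observed one, are assumptions; the text does not state which variant was used. The function names are hypothetical.

```python
from math import comb


def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


def fisher_exact_two_sided(tp, fn, fp, tn):
    """Two-sided Fisher's exact test for the table [[tp, fn], [fp, tn]]."""
    row1, row2 = tp + fn, fp + tn
    col1, n = tp + fp, tp + fn + fp + tn

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(tp)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )
```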

Drug yields

To determine each strain’s capability to metabolize drugs, all AGORA2 strains were constrained with a simulated Western diet20 and the flux through the exchange reaction corresponding to each drug was minimized using FBA, corresponding to the maximal uptake rate of the drug. For all AGORA2 organisms capable of taking up at least one drug, the yield of ATP, carbon and ammonia from 1 mmol of the drug per g dry weight per h was evaluated as follows. Each reconstruction was constrained to only allow the uptake of water, phosphate and oxygen (VMH IDs: h2o, pi, o2). Demand reactions for ammonia as well as CO2 and pyruvate (as proxies for carbon sources) (VMH IDs: nh4, co2, pyr) were added, while a demand reaction for ATP (VMH ID: atp) already existed in each reconstruction. Next, the uptake of each drug metabolite (15 in total, one representative for each enzyme) was allowed one by one at an uptake rate of 1 mmol per g dry weight per h. For each drug metabolite, the yields of ATP, ammonia, CO2 and pyruvate were computed using FBA by maximizing the flux through the respective demand reactions. As a control, yields were also computed for 1 mmol per g dry weight per h of glucose and without any metabolites added.

Simulation of drug metabolism by individual gut microbiomes

Previously, metagenomic sequencing from fecal samples of a cohort of 616 Japanese patients with CRC and healthy controls had been performed38. Species-level abundances for this cohort, which had been determined with MetaPhlAn2 (ref. 90), were retrieved from https://www.nature.com/articles/s41591-019-0458-7#MOESM3. Unclassified taxa on the species level, eukaryotes and viruses were excluded. Of the remaining 517 species, 501 (97%) could be mapped onto the 1,738 AGORA2 species. Pan-species models for AGORA2 were created through the ‘createPanModels’ function. From the pan-species models, personalized microbiome models for each of the 616 samples were built through a computationally efficient pipeline43 with the species-level abundances as input data and parameterized as described elsewhere10,60. For each individual, we integrated all microbial models having a nonzero abundance in the sample into one personalized microbiome model. To contextualize the models with appropriate diet constraints, a simulated Average Japanese Diet described previously41 (Supplementary Table 12) was used. To predict the drug conversion potential of each microbiome, the fecal secretion reactions for 13 drug metabolism end products were optimized one by one using FBA34, while providing the respective precursor drug as well as oxygen at a de facto unlimited uptake rate of 1,000 mmol per g dry weight per h.

Shadow price analysis

To determine species in microbiome models that were of importance for the microbiome’s combined potential to metabolize a drug, a shadow price analysis was performed as described previously60. Briefly, shadow prices are a feature of every FBA solution (that is, they are the solution of the dual of the primal linear programming problem) and reflect the contribution of each metabolite in the model to the flux through the objective function8. A nonzero shadow price for a metabolite indicates that this metabolite has importance for the total flux capacity through the optimized objective function, that is, in our case, the secretion of a drug metabolic product. A shadow price of zero indicates that increasing the availability of this metabolite would not change the flux through the objective function. To determine the species that were bottlenecks for the conversion potential of the 13 drugs in each microbiome model, nonzero shadow prices for species biomass metabolites (‘species_biomass[c]’), which reflect the contribution of the species to the community biomass reaction, were retrieved.
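One common way to state this formally is given below. The notation is introduced here for illustration and is not taken from the text: S is the stoichiometric matrix, v the flux vector, c the objective coefficients, l and u the flux bounds, and b a perturbation of the steady-state condition for each metabolite.

```latex
Z^{*}(b) = \max_{v} \left\{ c^{\top} v \;:\; S v = b,\; l \le v \le u \right\},
\qquad
y_i = \left. \frac{\partial Z^{*}}{\partial b_i} \right|_{b = 0}
```

Here y_i is the shadow price of metabolite i, defined wherever the optimal value is differentiable. A value of zero means that a small change in the availability of metabolite i leaves the optimal objective value unchanged. The sign convention differs between solvers, so only the distinction between zero and nonzero values is used in the analysis described above.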

Statistical analysis

We statistically analyzed the net production capacity of 13 drug metabolites (Fig. 6a) among 252 healthy individuals and 364 patients with CRC. For each drug metabolite, we calculated the mean flux and the share of microbiomes with a flux greater than zero. Drug metabolites with a zero flux in over 50% of the cases were dichotomized (can be produced versus cannot be produced) and, subsequently, analyzed via logistic regressions. Drug metabolites with over 50% nonzero entries were analyzed via linear regressions using heteroscedasticity-robust standard errors. First, we investigated potential effects of basic covariates (age, sex and BMI) via generalized linear regressions (logistic or linear) with the net production capacity being the response variable (dichotomized or metric). Age and BMI were introduced into the models as restricted cubic splines91 using four knots (the 5%-percentile, the 33%-percentile, the 66%-percentile and the 95%-percentile), resulting in three spline variables each, to test for potential nonlinear relationships. Significance was then determined by testing the three spline variables belonging to age (or BMI, respectively) simultaneously against zero via the Wald test91. While substantial nonlinearities were found for age, no indication of nonlinear BMI effects could be identified. The final models included, therefore, only the linear BMI term. Second, we tested for potential associations of net production capacities with case–control status. This test was done via generalized linear regressions (logistic or linear) with the net production capacity being the response variable (dichotomized or metric), while adjusting for age (restricted cubic splines), sex (male/female) and BMI (linear). We corrected for multiple testing using the FDR, adjusting significance values for 13 tests per analysis stream. A test was considered nominally significant with P < 0.05 and FDR-corrected significant if FDR < 0.05. For sensitivity analysis, we recomputed the drug-metabolizing capabilities using an average European diet instead of a Japanese diet. Then, we calculated Pearson correlations for each drug metabolite between the secretion potentials under the Japanese and the average European diet. All statistical analyses were performed with STATA 17/MP. All scripts are available at https://github.com/ThieleLab/CodeBase.
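The multiple-testing correction can be sketched as follows. The text states only that the FDR was used; the Benjamini–Hochberg step-up procedure shown here is the most common choice and is an assumption, as are the function name and the example P values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order.

    Each adjusted value is min over ranks k >= rank(i) of p_(k) * m / k,
    where m is the number of tests and p_(k) the k-th smallest P value.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so the minimum can be carried along.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.5]
fdr = benjamini_hochberg(raw_p)
fdr_significant = [q < 0.05 for q in fdr]
```

In this toy example with four tests, three tests are nominally significant (P < 0.05), but only the first remains significant after correction.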

Sign prediction of fecal metabolite–species associations using AGORA2-based community models

We utilized the publicly available metabolome dataset (n = 347) from ref. 38. To test whether AGORA2-based community modeling is capable of predicting the sign of statistical associations between species presence and fecal metabolite concentrations in the CRC sample, we calculated maximal net secretion for 52 metabolites with fecal metabolome data with more than 50% of the samples having concentrations above the limit of detection. Metabolite net secretion was computed using the mgPipe module in the Microbiome Modeling Toolbox10,43 while relying on computationally efficient flux variability analysis92. Then, for each species present in at least 10% and at most 90% of the microbiomes, we calculated the effect of the species (binary predictor: species present versus species not present) on each fecal metabolite concentration in multivariable regressions, adjusting for age, sex, BMI and study group. We then filtered for all species–metabolite associations with P < 0.05. Next, we calculated the effect of the species presence on the community net secretion of the corresponding metabolite in analogous regressions. Finally, we calculated for each metabolite the agreement in signs between the in vivo association statistics and the in silico association statistics. Significance was determined by Fisher’s exact test and FDR correction was applied, accounting for 52 tests. Note that the P values should be treated with care since the signs of the various association statistics may cluster due to the multivariate nature of both the metabolome and the microbiome data.
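The final comparison step can be illustrated with a short sketch. It assumes that, for one metabolite, the regression coefficients of the retained species are available as two equally ordered lists, one from the in vivo and one from the in silico regressions. Skipping pairs with a coefficient of exactly zero is an assumption, and the function names are hypothetical.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_agreement(in_vivo_effects, in_silico_effects):
    """Count species whose in vivo and in silico effects share a sign.

    Returns (number of agreeing pairs, number of compared pairs).
    Pairs in which either coefficient is exactly zero are skipped.
    """
    agreeing = compared = 0
    for vivo, silico in zip(in_vivo_effects, in_silico_effects):
        if vivo == 0 or silico == 0:
            continue
        compared += 1
        agreeing += sign(vivo) == sign(silico)
    return agreeing, compared


# Toy coefficients for four species and one metabolite.
in_vivo = [0.4, -1.2, 0.3, -0.1]
in_silico = [2.0, -0.5, -1.0, 0.0]
agreeing, compared = sign_agreement(in_vivo, in_silico)
```

The resulting counts of agreeing and disagreeing signs form the contingency table that would then be tested for significance.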

Data visualization

The phylogenetic tree of AGORA2 organisms was constructed in PhyloT (https://phylot.biobyte.de/) and visualized in iTOL (https://itol.embl.de/)93. Violin plots were generated in BoxPlotR (http://shiny.chemgrid.org/boxplotr/). Clustering of taxa by reaction presence through t-distributed stochastic neighbor embedding (t-SNE)52 was performed using the t-SNE implementation in MATLAB with Euclidean distance, barneshut set as the algorithm and perplexity set to 30. Taxa with fewer representatives than 0.5% of all clustered strains were excluded from the t-SNE plots. Significance of differences in coordinates across taxonomic units was determined by Kruskal–Wallis tests. Circle plots were generated using the online implementation of Circos94. Figure 6 and Supplementary Fig. 9 were generated with the graphics functions of STATA 16/MP. All other data were visualized in MATLAB and R89.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.