Main

Coordinated changes of genetically encoded metabolic enzymes and transporters, and the metabolites they act on, underpin diverse cancer-associated phenomena, including tumorigenesis1, pluripotency2,3, the onset of drug resistance4,5,6 and the modulation of immune responses7,8,9,10,11. However, despite the high value of joint profiling of metabolites and gene expression/protein levels, previous large-scale studies of tumour metabolism have overwhelmingly focused on the analysis of gene expression data12. Conversely, the few instances of multimodal metabolomic and transcriptomic profiling of human tumour specimens have largely been performed in disparate, unrelated studies by a multitude of research teams13,14,15,16,17,18,19,20,21,22. Integration of metabolomic datasets produced in different patient cohorts is challenging due to technical batch effects and the semi-quantitative nature of untargeted metabolomic data (reported in arbitrary units of relative abundance). Thus, both the scarcity of multimodal metabolomic/transcriptomic data from tissue specimens and the challenges of harmonizing available datasets fundamentally impede the discovery of recurrent, coordinated changes in metabolic gene expression and metabolite abundance across cancers.

Tumours from diverse cancer types differ in their cell-type composition, vascularization and other factors ultimately influencing metabolism. Yet, they share a convergent set of metabolic alterations23,24,25,26,27. For example, several meta-analyses of the tumour metabolic transcriptome have identified recurrent upregulation of genes in one-carbon metabolism and oxidative phosphorylation across cancer types26,28,29. Analogously, meta-analyses of metabolomics data have demonstrated that numerous central carbon metabolites (for example, lactate) and effector metabolites (for example, kynurenine) are at higher abundance in tumour tissue compared to normal tissue across many cancer types23,30. These studies have illustrated the power of meta-analysis for distilling highly recurrent metabolic phenotypes from heterogeneous data but have left unresolved the question of how metabolic gene expression and metabolite abundance are coordinated and ultimately shape tumour physiology.

To systematically investigate gene–metabolite covariation in cancer, we assembled, harmonized and integratively analysed metabolomics and transcriptomics profiles from 988 primary tumour and matched adjacent normal tissue collected in 15 independent studies covering 11 cancer types. The preprocessed and harmonized data constitute the Cancer Atlas of Metabolic Profiles (CAMP), representing what is, to our knowledge, the largest harmonized dataset of multimodal metabolomic and transcriptomic data on primary tumour specimens. The CAMP is publicly available for download on Zenodo (https://doi.org/10.5281/zenodo.7150252) and can be interactively explored at https://rezniklab.shinyapps.io/CAMP-shiny-app/. Leveraging the diversity of diseases in our dataset, we designed a concordance-based statistical meta-analysis approach to discover instances of gene–metabolite interactions (GMIs) that transcended cancer type. This revealed two distinct classes of GMIs: First, we identified a small number of strong interactions between enzymes and metabolites involved in the same or subsequent reactions (‘proximal’ GMIs), suggesting that these enzymes are the primary determinants of their respective metabolite pool sizes. A second group of GMIs consisted of a small number of metabolites broadly correlated to large numbers of genes. Interestingly, this second class of GMIs was enriched for genes specifically expressed in immune cells, and for metabolites related to nicotinamide adenine dinucleotide (NAD+), a pleiotropic metabolite that acts both as a central cofactor in metabolism31 and as a signalling molecule influencing cell identity32. Taken together, these findings suggest that gene–metabolite covariation in tumours emerges, in part, from two complementary phenomena: the expression of enzymes with strong control over metabolite pool size, and the presence of specific cell populations in the tumour microenvironment with characteristic metabolomic profiles.

Results

The Cancer Atlas of Metabolic Profiles

Since metabolomic profiling has so far been excluded from large multimodal tumour profiling projects (for example, the TCGA33), there is no unified resource of metabolomic/transcriptomic data in the cancer research field. However, several groups have independently produced and released matched metabolomic/transcriptomic data in diverse cancer types13,14,15,16,17,18,19,20,21,22. We combined these datasets with several in-house studies to create a comprehensive collection of 988 samples (764 tumour samples and 224 adjacent normal samples) across 11 different cancer types, covering 15 datasets, which we called the CAMP (Table 1 and Fig. 1a). The overall collection includes a total of more than 40,000 unique transcripts and almost 2,500 unique metabolites. To maximize comparability across these heterogeneous studies, we applied a unified workflow to process RNA expression from microarray and RNA-sequencing (RNA-seq) data, harmonize metabolite names and annotations, and standardize data normalization and preprocessing (Methods). The CAMP represents an unprecedented resource to investigate the covariation of metabolite levels and gene expression at scale across diverse lineages of human cancers and normal tissues. To ensure high quality of the data, we evaluated several measures of quality control (QC). These included confirming that changes in metabolite abundance and gene expression between tumour and normal samples recapitulated those from prior work (Extended Data Fig. 1a,b), and demonstrating that covariation between pairs of metabolites was strongest between proximal metabolites in the metabolic network (that is, metabolite pairs acted upon by a common enzyme; Extended Data Fig. 1c).

Table 1 Overview of cancer type, sample size and type of gene expression data in the study
Fig. 1: Summary of the Cancer Atlas of Metabolic Profiles.
figure 1

a, The CAMP integrates metabolomic and transcriptomic data from 15 datasets covering 11 different cancer types, comprising both tumour and normal tissue. BRCA1/BRCA2, breast cancer; COAD, colon adenocarcinoma; DLBCL, diffuse large B cell lymphoma; GMB, glioblastoma; HürthleCC, Hürthle cell carcinoma; OV, high-grade serous ovarian cancer; PDAC, pancreas adenocarcinoma; PRAD, prostate adenocarcinoma; ccRCC1/ccRCC2/ccRCC3/ccRCC4, clear-cell renal carcinoma. b, Overview of non-parametric concordance meta-analysis. For a given transcript (T) and metabolite (M) pair, all measurements from all datasets were considered and weighted according to sample size. The concordance is a non-parametric measure of bivariate association between T and M that can be applied in meta-analysis across multiple datasets.

Source data

We interrogated the CAMP dataset for recurrent covariation between genes and metabolites across datasets and cancers. Such gene–metabolite covariation could emerge via numerous mechanisms including direct metabolic interactions, for example, via expression changes of a rate-limiting enzyme, or by the accumulation/depletion of metabolites as part of a broader phenotype, such as a cytotoxic immune response34,35,36. While each cancer type is likely to demonstrate its own unique pattern of transcriptomic and metabolomic changes, the CAMP enables discovery of metabolomic/transcriptomic covariation that transcends diseases. To identify cancer-type-agnostic metabolite and transcript covariation across CAMP datasets in a statistically principled manner, we developed a concordance-based meta-analysis approach (Fig. 1b and Methods). Concordance is a non-parametric measure of correlation, which enables the identification of consistently positive or negative gene–metabolite associations across datasets37. Our measure of concordance ranges from –1 to 1 and is closely related to non-parametric correlation coefficients such as Kendall’s tau, with a value of –1 corresponding to strong discordance and 1 corresponding to strong concordance (Methods). We focused our analysis on tumour samples only in the CAMP (although analogous analysis could be carried out on normal samples), and focused on the 276 metabolites that were quantified in more than half of the tumour datasets (at least 8 of 15 tumour datasets) and the 16,082 genes that were quantified in all 15 studies. Of all possible gene–metabolite pairs (276 metabolites × 16,082 genes = 4,438,632 pairs), a total of 22,619 pairs (0.51%) were significantly correlated after multiple-testing correction at a false discovery rate (FDR) of 0.01 (Fig. 2a). This included 269 metabolites (~97%) and 7,987 genes (~50%) participating in at least one significant association (Supplementary Table 1), which we refer to as GMIs. Post-hoc QC confirmed that 7,737/22,619 (78%) of GMIs were statistically significant in two or more individual datasets (Extended Data Fig. 1d). Statistically significant GMIs identified in single-study concordance analysis of the BRCA1 (microarray) and BRCA2 (RNA-seq) were highly consistent, indicating that the choice of transcriptomic profiling technology (that is, microarray versus RNA-seq) did not introduce substantial artefacts (Extended Data Fig. 2). Importantly, the results of the concordance were not affected by imputation, and an analogous concordance analysis omitting all imputed data reproduced 20,291/22,619 (89.7%) of statistically significant GMIs. Finally, we examined the consistency of GMIs (considering only those measured in >7 distinct datasets) generated from two distinct, non-overlapping subsamples of the full CAMP dataset. This analysis identified a high degree of consistency between estimates of concordance for each GMI (Spearman rho 0.31, P value < 1016), confirming the robustness of the results of our concordance meta-analysis.

Fig. 2: Meta-analysis across the CAMP captures lineage-transcending gene–metabolite interactions.
figure 2

a, Volcano plot of the GMIs computed between the 16,082 genes present in all datasets, and the 276 metabolites present in at least 8 of our 15 tumour cohorts. The x axis indicates the scaled concordance value, where values above 0 indicate positive association and values below 0 indicate negative association. The y axis represents the corresponding −log10 FDR-adjusted P value. Two-tailed P values were estimated from the unscaled concordance value’s z-score (Methods) and were corrected for multiple testing using the Benjamini–Hochberg method. The horizontal line indicates the significance cut-off of 0.01 FDR. Light grey dots indicate nonsignificant gene–metabolite pairs, and black dots indicate significant pairs. Three top hits have been highlighted. b, Statistically significant GMIs are enriched for proximal interactions, but proximal interactions nevertheless constitute a minority of all statistically significant GMIs. Of all significant gene–metabolite pairs in our concordance meta-analysis, 3,304/22,619 pairs had a defined distance (~14.61%), but only 565/22,619 (~2.50%) of these were proximal. c, Proximal GMI prioritization. GMIs for the 22 metabolites whose strongest GMI was proximal (distance less or equal 2). For each metabolite, we ranked genes by their statistical significance. Two-tailed P values were estimated from the unscaled concordance value’s z-score (Methods) and were corrected for multiple testing using the Benjamini–Hochberg method. Red and black dots indicate proximal and non-proximal genes significantly associated with the corresponding metabolite, respectively, while grey dots indicate genes with nonsignificant associations. Bold metabolites exhibit a large gap between the dominant GMI and all other GMIs for a metabolite. d,e, Scatterplots of the association between kynurenine levels and two proximal genes (IDO1, d; AFMID, e). Metabolite abundances were scaled within each dataset to be displayed together. Two-tailed P values were estimated from the unscaled concordance value’s z-score (Methods) and were corrected for multiple testing using the Benjamini–Hochberg method. f, CRISPR–CAS9-mediated knockout of IDO1 depleted IDO1 protein levels in HCT116 cells. Western blot was performed once and not repeated. g, Kynurenine levels were depleted upon knockout of IDO1 in HCT116 cells (n = 3 in each condition). Data are presented as mean values ± s.d. h, Scatterplot of the association between tryptophan levels and IDO1 in the CAMP. Two-tailed P values were estimated from the unscaled concordance value’s z-score (Methods) and were corrected for multiple testing using the Benjamini–Hochberg method. i, Tryptophan levels increase upon IDO1 knockout (n = 3 in each condition). Data are presented as mean values ± s.d. j, Scatterplot of oxidized glutathione (GSSG) and GGT1. Two-tailed P values were estimated from the unscaled concordance value’s z-score (Methods) and were corrected for multiple testing using the Benjamini–Hochberg method. k, Validation of the GSSG and GGT1 relationship was based on the study from Priolo et al.42. The dataset includes four data points for each condition. The two-tailed P value was estimated with a Wilcoxon rank-sum test. a.u., arbitrary units.

Source data

A subset of gene–metabolite covariation represent direct enzyme–substrate interactions

Among the statistically significant GMIs (Fig. 2a), we noted that the strongest positively correlated GMI (IDO1–kynurenine, adjusted P value = 4.82 × 1035) and the two strongest negatively correlated GMIs, GDA–guanine (adjusted P value = 2.31 × 1020) and CD38–nicotinamide mononucleotide (NMN; adjusted P value = 3.90 × 1016) corresponded to ‘proximal’ metabolic interactions, which are interactions between an enzyme and its direct or nearly direct substrate/product. For example, IDO1 catalyses the catabolism of tryptophan to N-formyl-kynurenine, which is subsequently metabolized to kynurenine38, and both CD38 and GDA directly degrade the metabolites guanine and NMN, respectively39,40. We confirmed that statistical significance for these three GMIs was likely driven by several cancer types rather than a single dataset with very strong associations (Extended Data Fig. 3).

The direct biochemical relationship between the three GMIs above raised the possibility that functional proximity between enzymes and their substrates/products might underlie a large fraction of GMIs. To test this, we systematically computed the biochemical distance between all gene–metabolite pairs using the highly curated Human-1 metabolic network model from Robinson et al.41. In this framework, a distance of one represents molecules that are involved in the same metabolic reaction, and a distance of two indicates a gene–metabolite pair that take part in subsequent reaction steps (Methods). Although statistically significant GMIs were enriched for proximal interactions relative to nonsignificant gene–metabolite pairs (odds ratio 1.42, Fisher’s exact test P value = 3.09 × 1015), proximal interactions themselves constituted only a small fraction of the total ensemble of statistically significant GMIs (2.5%, 565/22,619; Fig. 2b). Thus, while several of the strongest GMIs arose from proximal interactions, gene–metabolite proximity was a weak determinant of the full GMI landscape.

To further investigate the above observations about the strength of specific proximal GMIs and the overall relationship between metabolic proximity and GMIs, we investigated the relative strength of different GMIs affecting a common metabolite. We then focused on the 22 metabolites whose strongest GMI was proximal, covering diverse molecules involved in nucleotide metabolism (guanine, cytidine), cofactor metabolism (NAD+), redox metabolism (cystine, oxidized glutathione (GSSG)) and other pathways (Fig. 2c). Interestingly, we found that for 8 of 22 metabolites (kynurenine, guanine, NMN, GSSG, tryptophan, glycerophosphocholine (GPC), cystine and cytidine), a large gap existed between the most significant GMI and the second-highest correlating transcript. This gap suggested that the pool size of these metabolites was strongly controlled, in a lineage-agnostic manner, by a single, dominant gene. Consequently, we hypothesized that targeted genetic knockdown of dominant GMIs for these 8 metabolites would have a higher likelihood of producing significant changes in pool size than for other metabolites with multiple, comparably strong GMIs, each of which might control the pool size of the metabolite (Fig. 2c). Similar results were found when relaxing the threshold for calling gene–metabolite proximity (Extended Data Fig. 4).

We sought to functionally validate a subset of these predicted, metabolite pool-controlling genes. First, we investigated the association between IDO1 and two metabolites, kynurenine and tryptophan. IDO1 converts tryptophan to N-formylkynurenine, which is subsequently catabolized to kynurenine by AFMID. We observed that kynurenine levels were strongly associated with IDO1 (Fig. 2d) but not AFMID (which acts directly to produce kynurenine; Fig. 2e) expression across the CAMP. These findings were consistent with independent measurements of the metabolome and transcriptome obtained from the Cancer Cell Line Encyclopedia consortium on ~900 cell lines, where IDO1 expression was associated with kynurenine abundance (P = 8.6 × 109) but AFMID expression was not (P = 0.8) (ref. 36) (Extended Data Fig. 5a,b). To experimentally test the hypothesis that disruption of IDO1 impacts both tryptophan and kynurenine pool sizes, we used CRISPR–Cas9-mediated knockout with single-guide RNAs (sgRNAs) targeted against IDO1 human colorectal carcinoma HCT116 cells. These experiments corroborated earlier data indicating that knockout of IDO1 depleted kynurenine pools (Fig. 2f,g). However, while the association between IDO1 and kynurenine has been widely described in the literature36, our analysis indicated that IDO1 is also expected to determine the pool size for tryptophan, an amino acid involved in numerous other reactions in the cell, most obviously the synthesis of proteins (Fig. 2h). Consistent with this observation, we observed that knockout of IDO1 was sufficient to increase tryptophan levels, indicating that the pool size of this highly connected proteinogenic amino acid could be perturbed in part through disruption of IDO1 activity (Fig. 2i). Second, in support of a proximal GMI between GGT1 and GSSG (Fig. 2j), we reanalysed existing metabolomic data from a functional knockdown of GGT1 versus control in human embryonic kidney HEK293T cells42,43. This data confirmed that knockdown of GGT1 was associated with an increase in GSSG levels with respect to mock control (P value = 2.90 × 102), suggesting that GGT1 is a pool-determining consumer of GSSG (Fig. 2k). Taken together, these data demonstrate that lineage-transcending GMIs discovered through pathway-based analysis of the CAMP represent examples of genes exerting strong control over metabolite pool sizes.

Pathway-level metabolic and transcriptomic changes weakly covary

Despite the interesting findings related to proximal GMIs, most (97.5%) statistically significant GMIs represented distant, non-proximal interactions beyond obvious enzyme–substrate metabolic relationships (Fig. 2b). One possible implication of such nonlocal covariation is that genes and metabolites in the same metabolic pathway would show asynchronous and uncorrelated changes across different groups of samples, such as tumours and normal tissues. To investigate this hypothesis, we studied the consistency of transcriptional and metabolic differences in tumour versus adjacent normal tissue across cancer types. To this end, we performed differential analysis of metabolite and transcript levels between tumour and normal tissues in the 7 CAMP datasets where both tissues were available (Table 1) and aggregated the results into 85 Kyoto Encyclopedia of Genes and Genomes (KEGG) metabolic pathways. Of these, we considered 63 pathways with at least one metabolite or gene measured in at least 5 of the 7 CAMP datasets (Supplementary Table 2).

For each KEGG pathway, we evaluated (using a differential abundance (DA) score; Methods) whether metabolites and transcripts showed synchronous accumulation or depletion patterns in tumours relative to normal tissues. Pathways were biased towards asynchronous changes (276/441, 63%), where increases in metabolite levels coincided with decreases in transcript levels, and vice versa (Fig. 3a). Only one pathway (histidine metabolism) demonstrated fully synchronous changes in all datasets, whereas a few others demonstrated uniformly asynchronous changes (for example, primary bile acid biosynthesis). We also assessed whether there was a correlation between the extent of metabolomic versus transcriptomic disruption regardless of the direction (using a differential fraction (DF) score; Methods). A minority of pathways (9/63) demonstrated significant associations (nominal P value < 0.05) between RNA and metabolite DF scores (Fig. 3b; see Fig. 3c as an example). Interestingly, these 9 pathways belonged to just 2 KEGG pathway classes (Supplementary Table 3): amino acid metabolism and carbohydrate metabolism. Enrichment analysis indicated that the class of carbohydrate metabolism pathway (of 8 in total) was significantly over-represented relative to the others (Fisher’s exact test P value: 2.36 × 102). Thus, most pathways showed no evidence of a correlation between metabolomic and transcriptomic disruption, emphasizing the implications of predominantly distally acting GMIs and prompting the question of which biological phenomena produce these distal GMIs.

Fig. 3: Tumour versus normal changes in metabolite and transcript abundance are predominantly asynchronous.
figure 3

a, Heat map of metabolite and transcript DA scores across datasets and pathways, capturing the tendency for metabolites and genes to accumulate or deplete in tumours relative to normal tissues. The size of the dots indicates the number of molecules measured in that pathway, while the colour represents the DA score. b, Spearman correlation coefficients of the metabolite and transcript DF scores in KEGG pathways. Red dots indicate nominal significance (P value < 0.05). A minority of pathways showed significant association between metabolomic and transcriptomic disruption across CAMP datasets. P values were estimated from Spearman’s rank correlation test and were corrected for multiple testing using the Benjamini–Hochberg method. c, Example of Spearman correlation calculation: Distribution of the metabolite (x axis) and transcript (y axis) DF scores across datasets for the citrate cycle (tricarboxylic acid (TCA) cycle) pathway. P values were estimated from Spearman’s rank correlation test.

Source data

Hub metabolites are enriched for immune genes associations

Having established that most GMIs do not represent biochemically proximal interactions, we adopted a broader approach to identify the driving factors of metabolite–transcript correlations. First, we investigated the distribution of GMIs across metabolites and genes, observing that GMIs were strongly concentrated in a small number of metabolites (Fig. 4a). The top three metabolites with the highest number of GMIs, quinolinate, NMN and 5′-methylthioadenosine alone contributed to 17% (3,823/22,619) of all GMIs in our analysis, and the top ten metabolites covered 35% of the GMIs (8,048/22,619), far higher than the fraction covered by the top eight genes (Fig. 4b and Extended Data Fig. 6a,b). That is, a small number of metabolites participated in an exceptionally high number of GMIs, acting as ‘hubs’ for strong covariation with gene expression. Interestingly, hub metabolites concentrated in certain metabolic pathways. Among the top ten most correlated metabolites, we found several constituents of the NAD+ biosynthesis pathway (quinolinate, NMN and NAD+) and nucleotide metabolism (thymine, uracil and adenine).

Fig. 4: Gene–metabolite interactions are concentrated in hub metabolites associated with immune-related genes.
figure 4

a, Distribution of significant GMIs across metabolites. The top ten metabolites with the highest number of significant associations are labelled. b, Distribution of significant GMIs across genes. The top eight genes with the highest number of significant associations are labelled. c, KEGG-based gene pathway enrichment analysis results for metabolites with at least one GMI and at least one significantly enriched pathway. The heat map colour represents the strength of the enrichment as the negative log10(P value) of the pathway enrichment test. Two-sided P values were estimated with a Fisher’s exact test and adjusted for multiple testing using the Benjamini–Hochberg method. Cells coloured in shades of red indicate pathways that were significant after multiple-testing corrections (adjusted P value < 0.01), and grey cells indicate insignificant associations. Individual pathways were classified based on the type of process they describe: metabolism (lilac), immune (green) or other (cream). MTA, 5′-methylthioadenosine.

Source data

To determine whether the genes correlated with a particular metabolite were enriched for specific cellular functions, we performed an unsupervised pathway enrichment analysis. For each metabolite with at least one GMI, we investigated whether specific pathways and processes were over-represented in the transcripts correlated with that metabolite. Overall, the considered transcripts spanned over 146 KEGG pathways (Supplementary Table 4), covering the 85 metabolic processes in Fig. 3 as well as cellular processes (for example, cell growth and death), signalling pathways, genetic processing pathways (for example, transcription and translation) and organismal systems pathways (for example, immune, endocrine and sensory systems). A total of 32 unique pathways were over-represented across 40 metabolites (adjusted P value < 0.01; Fig. 4c). Interestingly, only 2 of those 32 pathways represented metabolic processes: oxidative phosphorylation, for which the top three most associated metabolites were adenine, NMN and FAD, and the TCA cycle, also associated with FAD. The remaining top-ranked pathways were exclusively non-metabolic ones. Three metabolites in particular, quinolinate, NMN and NAD+, showed broad enrichment for immune-related cellular processes, including chemokine signalling, as well as B cell and T cell antigen receptor signalling pathways. This indicates that these metabolites are correlated with the expression of a wide array of genes associated with the immune response.

NAD+-related metabolites associate with immune cell infiltration

Human tumour tissues are heterogeneous compositions of various cell populations, including tumour cells, immune cells and stromal cells. Bulk and single-cell profiling technologies have established that a large subset of genes are exclusively expressed in immune cells or non-immune cell subpopulations (for example, tumour cells)44,45. We reasoned that the correlation between NAD+-related metabolites and immune-related genes (Fig. 4c) could therefore arise if NAD+-related metabolites were at a characteristically higher or lower abundance in immune cells relative to non-immune cells. One implication of this hypothesis is that, while each cancer type might demonstrate its own unique metabolomic changes associated with immune infiltration, NAD+-related metabolites should be expected to demonstrate consistent effects across many different cancer types.

To determine if NAD+-related metabolites were associated with immune infiltration across tumours, we used single-sample gene-set enrichment analysis (ssGSEA)46 to compute a previously validated 141-gene RNA signature of overall immune cell infiltration (ImmuneScore) directly from bulk RNA-seq data47, and identified the individual metabolites correlated with this immune phenotype. Concordance between metabolite levels and the ImmuneScore signature was assessed across all samples in each CAMP dataset (Fig. 5a). In general, covariation between specific metabolite pools and ImmuneScore expression was cancer-type specific. For example, of the metabolites significantly associated with ImmuneScore in intrahepatic cholangiocarcinoma (ICC) and hepatocellular carcinoma (HCC; representing the top two cancer types with the highest number of metabolites significantly associated with ImmuneScore), only 3 metabolites were consistently associated with ImmuneScore in both datasets (Fig. 5b and Supplementary Table 5). We did not observe a significant correlation (Spearman’s rank correlation P value = 0.69) between the percentage of metabolites significantly associated with ImmuneScore and expression of the ImmuneScore signature itself (Fig. 5a), suggesting that the extent of immune infiltration did not confound our analyses.

Fig. 5: The abundance of NAD+-related metabolites is shaped by immune infiltration in the tumour microenvironment.
figure 5

a, Left, association of metabolite abundance with immune infiltration varied significantly across cancer types. The bar plot indicates the fraction of metabolites significantly associated with the ImmuneScore signature in each dataset. Bar colours code for the sample size of the dataset. Sample size is the number of tumour samples of the dataset in Fig. 1a. Middle, box plots indicate the expression range of the ImmuneScore signature within each dataset. Right, plot of scaled concordance calculated between metabolites and ImmuneScore. Red dots indicate metabolites with a positive association with ImmuneScore, blue dots indicate metabolites with negative associations, and grey dots indicate metabolites that were not statistically significant. b, Heat maps of five metabolites—three that were highly correlated with ImmuneScore in the ICC dataset and two that were highly correlated with ImmuneScore in the HCC dataset. Samples were sorted by increasing ImmuneScore. UDP-G, UDP-glucuronate; GPC-16:1, 1-palmitoleoyl-GPC (16:1); GPC-18:2, 1-linoleoyl-GPC (18:2). c, Bar plot indicating the strength of association between metabolites and ImmuneScore from concordance meta-analysis. Two-tailed P values were estimated from the unscaled concordance value’s z-score (Methods) and were corrected for multiple testing using the Benjamini–Hochberg method. Bar length represents the −log10 adjusted P value. Metabolites related to NAD+ metabolism are shown in black. d, Scatterplots of the abundance of two NAD+-related metabolites, quinolinate and NMN, versus ImmuneScore expression across all datasets. Metabolite abundances were scaled within each dataset. e, Bar plot comparing the absolute concordance values of metabolites to ImmuneScore in a pathway compared to all other pathways (one-sided P values were estimated from Wilcoxon rank-sum test and were corrected for multiple testing using the Benjamini–Hochberg method). f, Metabolomic measurements of purified populations of CD45 tumour cells and CD45+ T cells isolated from ovarian cancer tumours. NAD+ was negatively correlated with the ImmuneScore signature in the ovarian cancer dataset (n = 45). NAD+ was similarly lower in abundance in CD4+/CD8+ cells than CD45 (non-immune) cells in the dataset of purified cell populations (n = 24 in CD4+/CD8+; n = 18 in CD45).

Source data

While metabolite concordance with ImmuneScore was generally heterogeneous across CAMP datasets, we observed that several NAD+-related metabolites, including quinolinate and NMN, were recurrently associated with high or low levels of immune infiltration in numerous disease contexts (Fig. 5a). To systematically identify such lineage-agnostic metabolomic correlates of immune infiltration, we again applied concordance meta-analysis, identifying 14 metabolites significantly associated with ImmuneScore across datasets (adjusted P value < 0.05; Fig. 5c), with quinolinate and NMN being the strongest hits (Fig. 5d). Interestingly, 4/14 significantly associated metabolites (quinolinate, NAD+, NMN and kynurenine) were members of the NAD+ biosynthesis pathway. Consistent with this, we identified NAD+ metabolism as the sole pathway whose metabolites demonstrated significantly stronger association with immune infiltration than all other pathways (adjusted P value = 0.04; Fig. 5e).

The above analysis suggested that NAD+-related metabolites were at differential abundance in immune cells relative to non-immune cells, and that this effect produced a consequent accumulation of NAD+-related metabolites in immune-infiltrated tumours. Some support for this hypothesis can be found in previously published immunohistochemical data indicating that the abundance of quinolinate increases dramatically in diverse immune cell populations in response to Toll-like receptor 4 ligands such as lipopolysaccharide48. To provide more evidence for this hypothesis, we compared our findings to a recently published study of metabolomic profiles of purified CD45 tumour cells and CD45+ (CD8+ and CD4+) T cells from ovarian cancer tumours48. In our data, NAD+ was negatively correlated with ImmuneScore in ovarian cancer, suggesting that it was at lower abundance in immune cells relative to non-immune cells (Fig. 5f). Consistent with this, NAD+ was at significantly lower abundance in CD45+ T cells than CD45 tumour cells in the dataset of purified cell populations (NAD+ log2 fold change = −1.22, P value = 1.7 × 104; Fig. 5f). Together, these analyses suggest that the pool sizes of NAD+-related metabolites are at characteristically different abundance in immune cells relative to other cell types, and that this effect ultimately drives the association of quinolinate and other NAD+-related metabolites with immune infiltration in bulk tumour data.

Kynurenine and histamine levels correlate to specific cell populations

Whereas in the previous analysis we investigated the association of metabolite levels with overall immune infiltration, we next turned to investigating the association of metabolite levels and specific immune cell populations (for example, T cells, macrophages and numerous other cell types), each with unique transcriptional phenotypes and immunological functions49,50,51. To investigate how these diverse immune cell populations contributed to the observed GMIs, we estimated the abundance of 23 immune cell types from bulk transcriptomics profiles51 using ssGSEA as shown previously46, and computed their association with metabolite levels across cancers. We again focused on lineage-agnostic relationships by performing a concordance meta-analysis to calculate associations between metabolite levels and immune cell signatures across tumours from all cancer types. In total, 7.3% of all metabolite–signature pairs (466 of 6,348 pairs) demonstrated statistically significant associations (adjusted P value < 0.05; Fig. 6a). Among these, quinolinate was positively associated with almost all immune cell populations (17/23), consistent with prior immunohistochemical data and suggesting that it accumulates in a variety of immune cell types in a cancer-type-agnostic manner52.

Fig. 6: A subset of metabolites associates with specific immune cell lineages.
figure 6

a, Volcano plot of cell-type signature–metabolite interactions. The rug plot at the bottom highlights the numerous associations with quinolinate. b, Manhattan plot of adjusted concordance P values between metabolites and cell types. c, Histamine associates with the abundance of mast cells across most datasets of the CAMP. d, Bar plot of adjusted concordance P values of metabolites correlated with the mast cell gene signature. The red dashed line indicates the significance cut-off of 0.05. Metabolites labelled in red are associated with histamine metabolism. e, HDC expression strongly associates with mast cell abundance across the CAMP. f, Kynurenine abundance associates with an aDC signature across the CAMP. In af, two-tailed P values were estimated from the z-scores of the unscaled concordance values (Methods) and were corrected for multiple testing using the Benjamini–Hochberg method.

Source data

Aside from quinolinate, the two strongest associations between immune cell-type signatures and metabolite levels were two comparatively rare cell populations, namely mast cells and histamine and activated dendritic cells (aDCs) and kynurenine (Fig. 6b). Mast cells are a myeloid cell population that, when stimulated, mediate the inflammatory process by synthesizing histamine from histidine using the enzyme HDC. We found both histamine and its related metabolite 1-methylhistamine were the two metabolites most significantly associated with the presence of mast cells (Fig. 6c,d), and that the association between histamine and mast cell abundance was driven by diverse cancer types in the CAMP (Extended Data Fig. 7a). Histamine levels themselves were strongly associated with the expression of HDC across the CAMP (Fig. 6e). Importantly, the mast cell signature contains HDC, but when the signature was recalculated without HDC the significant positive concordance was preserved (concordance = 0.28, P value = 3.4 × 1017; Extended Data Fig. 7b). Moreover, single-cell data also indicate that HDC expression is strongly elevated in mast cells relative to other cell types (Extended Data Fig. 7c). Interestingly, the median variation of histamine across the tumour datasets in the CAMP was ~740-fold, implying that fluctuations in the abundance of an otherwise rare cell population were sufficient to produce large-magnitude changes in histamine abundance in the bulk tumour.

In contrast to mast cells and their physiological role in producing histamine, DCs are not known to be dedicated sources of kynurenine in the microenvironment, although single-cell data indicate that IDO1 expression is strongly elevated in DCs relative to other cell types (Extended Data Fig. 8a). While the aDC signature contains IDO1 (and IDO1 participates in a strong GMI with kynurenine; Fig. 2), an aDC signature without IDO1 preserved strong positive concordance with kynurenine (concordance = 0.16, P value = 3.3 × 105; Extended Data Fig. 8b), confirming that this association was not solely dependent on IDO1 itself (Fig. 6f). IDO1 and kynurenine corresponded to the strongest GMI in the CAMP, raising two opposing hypotheses on the mechanism underlying the primary source of kynurenine in the tumour microenvironment: either bulk tumour expression of IDO1 is primarily determined by high DC-specific expression or, alternatively, it is driven by a comparatively low expression in far more abundant non-DC-cell populations. Resolving this ambiguity would require single-cell measurements of metabolite levels, which have recently become technically feasible53. Together, these data suggest that the presence of individual and comparatively rare cell populations is associated with shifts in the abundance of immunomodulatory metabolites in the tumour microenvironment.

Discussion

Metabolism is jointly controlled by genetically encoded enzymes and small-molecule metabolites. To study the interactions between genes and metabolites at scale, we assembled and harmonized a database of metabolomic and transcriptomic data from ~1,000 tumour and normal samples, which we refer to as the CAMP. Although large-scale multimodal measurements of metabolism have previously been produced in bacteria54 and yeast55, a comparable resource for human cancers was missing. The CAMP is thus a resource and represents a significant public database for studying the metabolism of complex human tissues and cancers, and for the interrogation of gene–metabolite covariation across different tissue and disease contexts. Our analysis of the CAMP demonstrates that large-scale studies of multimodal metabolic data can reveal fundamental principles of metabolic regulation at the scale of both individual metabolic reactions (Fig. 2) and complex human tissues (Figs. 5 and 6).

Our detailed statistical analysis of the CAMP revealed a multitude of gene–metabolite associations that transcended the tissue of origin. We focused on two specific types of gene–metabolite covariation. The first, induced by functional proximity, identified metabolic genes whose activation or inhibition is likely to have a significant effect on the direct substrate or product of the respective reaction (Fig. 2). While metabolite pools are likely controlled by a large number of genes and other factors, our data-driven analysis identified a small subset of metabolites whose pool size was strongly associated with a single, proximal gene across tissue lineages. The GMIs identified in Fig. 2b offer a data-informed, rational approach for modulating the pool sizes of their metabolite constituents. While each of the highlighted metabolites in Fig. 2b may participate in numerous metabolic reactions, data from the CAMP specifically nominates single genes (for example, GGT1 for GSSG, and IDO1 for tryptophan and kynurenine) as those targets whose perturbation is most likely to disrupt the corresponding pool size.

The second broad form of gene–metabolite covariation, likely induced by cell-type-specific physiology, corresponded to metabolites whose abundance was associated with the presence of specific immune cells in the microenvironment. Thus, a sizeable fraction of non-proximal GMIs was associated with a small number of metabolites (enriched for NAD+-related molecules) that were significantly correlated with a large number of genes, and in particular immune-related genes. This apparent association between the abundance of specific immune cells in a tissue specimen and the levels of numerous NAD+-associated metabolites suggests that immune cells have evolved mechanisms to maintain the concentrations of these metabolites at characteristically different levels relative to other cell types. Because NAD+ is both the central mediator of redox poise in the cell and a cofactor for numerous metabolic and non-metabolic reactions, understanding the mechanisms by which immune cells achieve differential abundance of NAD+-related metabolites, and the selective pressure to do so, can provide insights into the metabolic phenotypes underlying both cancer and other diseases involving dysfunctional immune responses56. Emerging spatial metabolomics (for example, high-resolution MALDI) and single-cell metabolomics (for example, rapid purification with paramagnetic beads, followed by mass spectrometry) technologies hold the promise of revealing the full extent of cell-type-specific metabolomic adaptations. Such approaches can be used alone or in tandem to measure the metabolomic profiles of individual cells in heterogeneous tissue slices57,58,59,60, rendering the full extent of cell-type-specific metabolomic alterations within experimental reach.

Our findings here nominate a new paradigm for understanding tumour-associated changes in the metabolome and how it relates to the cellular composition of the tissue microenvironment. Our discovery that the abundances of NAD+-related metabolites change consistently across tissues as a function of immune cell abundance adds to a limited but pre-existing understanding of how cell-type-specific metabolism enables function (for example, in the accumulation of lipids in adipocytes and clear-cell tumour cells). More importantly, the discovery that the tumour metabolome changes significantly based on immune cell composition has significant implications for the interpretation of metabolomic data in the context of cancer. Prior studies (including those from our own group) have often used adjacent normal tissue as a reference to understand tumour-specific changes in metabolite levels. What the discoveries here reinforce is that changes in the bulk tumour metabolome may not be caused by tumour-cell-intrinsic changes in metabolism, and instead may in some or many cases arise from non-malignant cell populations.

Discoveries about the function of cancer genes have emerged from a combination of untargeted, population-scale genomic surveys61 and mechanistic experiments in specific disease and genetic backgrounds62. Combining these approaches has proven transformative for the discovery of recurrent and large-effect-size alterations and prompted their characterization in model systems of human disease. In contrast, the field of cancer metabolism has primarily been driven by bottom-up experiments with modest support from large-scale (largely genetic or transcriptomic) datasets63,64,65. The CAMP is a counterpoint to these efforts. By assembling and harmonizing in one database metabolomic and transcriptomic data from diverse diseases, the CAMP represents a unique opportunity for de novo discovery of translationally relevant metabolic phenotypes in cancer. Expanding the scope of the CAMP to include both additional multimodal metabolic data as it is published (enabled by open-source code; Data availability and Code availability) and other forms of data, including but not limited to genomic sequencing, epigenetic profiling and proteomic measurements, holds the potential to reveal entirely new and highly recurrent metabolic phenomena in cancer.

Methods

Collection of Cancer Atlas of Metabolic Profiles

We combined 12 published datasets with 3 additional in-house datasets that profiled metabolite and gene expression from the same samples to create a comprehensive collection of 988 samples (764 tumour samples and 224 adjacent normal samples) across 11 different cancer types, covering 15 datasets, which we called the CAMP. Details and references associated with these studies are provided in Table 1 and Fig. 1a. Data are available for download at https://doi.org/10.5281/zenodo.7150252.

Gene expression data processing pipeline

Six of the CAMP datasets included gene expression data captured by the Affymetrix platform (GSE28735, GSE37751, GSE26193, GSE62452, GSE76297 and Cornell_PROSTATE). For these datasets, CEL files were downloaded from the Gene Expression Omnibus or from their source within our respective institutions. Then, we applied the robust multichip average (RMA) algorithm for background subtraction, quantile normalization and summarization (via median-polish) by using the ‘rma’ function implemented in the R oligo package (version 1.40.2) (ref. 66). Each dataset’s rma-normalized expression matrix was then used for downstream analysis. In the COAD dataset, gene expression data were captured by an Agilent custom array, and we downloaded the gene expression matrix deposited at the Gene Expression Omnibus repository (GSE89076) for further downstream analysis.

For CAMP datasets with RNA-seq data, RNA-seq reads were aligned against human genome assembly hg19 by STAR 2-pass alignment67 (version 2.5.3a). QC metrics, such as general sequencing statistics, gene feature and body coverage, were then calculated based on the alignment result through RSeQC68 (version 2.6.4). RNA-seq gene-level count values were computed by using the R package GenomicAlignments69 (version 1.14.2) over aligned reads with UCSC KnownGene70,71 with hg19 as the base gene model. The Union counting mode was used, and only mapped paired reads after alignment quality filtering were considered. Finally, gene-level TPM (transcripts per million) and raw read count values were computed by the R package DESeq2 (version 1.18.1) (ref. 72). All concordance analysis described herein with RNA-seq data used TPM values for RNA-seq data.

Across the 15 CAMP datasets, 16,082 of ~40,000 distinct transcripts were profiled in all cohorts and used for analysis (Extended Data Fig. 9a).

Metabolomics data preprocessing

Metabolomics data for 3/15 datasets (GBM, LiCa1 and LiCa2) were provided preprocessed and were therefore used in their original form. Ovarian cancer data were already normalized and were only log2 transformed before analysis. For the other 11 metabolomics datasets in our study, we obtained the raw metabolomics data from the data owners. In this case, the processing pipeline was standardized across datasets and included batch correction via median scaling if multiple batches of data were present (only necessary for PRAD dataset as it was the only dataset produced in distinct batches) and probabilistic quotient normalization73 using either only normal samples if available, or all tumour samples if no normal sample was included, and only metabolites with less than 20% values missing to create the reference sample. Importantly, nearly all metabolites with a large number of GMIs were imputed in two or fewer studies (Extended Data Fig. 10a). After normalization, metabolite abundances were log2 transformed. For each cohort and tissue type (that is, tumour and normal), metabolites with more than 80% values missing were excluded from the analysis. For the remaining metabolites, missing values were imputed using the minimum value recorded. Data preprocessing was performed using the R package maplet. Metabolite names and annotations were manually harmonized for consistency. First, we identified potentially matching compounds by systematically investigating the HMDB74, KEGG75 and Metabolon platform-specific peak IDs across datasets. The resulting candidates for each query were then manually investigated to make sure they corresponded to the same molecule. Incorrectly aggregated metabolites were discarded, and a single metabolite name was chosen to represent all validated matching compounds. The overlap of metabolites across datasets was heterogeneous (Extended Data Fig. 9b): of 2,411 unique molecules quantified in at least 1 dataset, fewer than 500 were measured in more than 5 datasets, and only 3 metabolites (gluconate, glucose and glucose-3-phosphate) were quantified in all 15 datasets in tumour samples. This high variability can be attributed to a variety of technical and biological factors, including metabolite ionizability on the mass spectrometer and potential specificity of certain metabolites to distinct tissues. A detailed, step-by-step tutorial describing how to harmonize additional future metabolomics datasets can be found in the Zenodo repository (https://doi.org/10.5281/zenodo.7150252, supplementary_dataset_v0.3.4.zip).

Concordance meta-analysis

To identify gene–metabolite pairs that were consistently associated across tumour types and cohorts, we used a stratified, weighted concordance model. Concordance is a non-parametric measure of correlation, similar to Kendall’s tau, that relies on the concept of concordant pairs37.

Briefly, consider sample \(i\) and sample \(j\) in a dataset where both \(m\) (metabolite) and \(g\) (gene) have been measured. The pair of sample i and sample j is defined to be concordant if \({sign}({m}_{i}-{m}_{j})={sign}({g}_{i}-{g}_{j})\), that is, if they have the same order in both samples, and discordant if they have opposite signs. Once pairwise concordance has been estimated for all pairs, the overall concordance c is calculated as

$$c=\frac{{\rm{\#}}{concordant\; pairs}+{\rm{\#}}{tied\; pairs}}{{\rm{\#}}{concordant\; pairs}+{\rm{\#}}{discordant\; pair}s+{\rm{\#}}{tied\; pairs}}$$

To account for the multiple cohorts in our study, we performed a stratified concordance analysis, where pairwise comparisons are only calculated within datasets, but not across. A global concordance value is then estimated by counting the overall number of concordant pairs according to the formula above. Moreover, given the vast heterogeneity in sample sizes across studies, we downweighed each observation by the number of samples in the corresponding study (that is, \(1/{samples}({{dataset}}_{i})\)), so that each dataset would contribute equally to the overall concordance. The concordance calculation was performed using the concordance function from the survival R package (v3.2-3) (ref. 76).

To make this quantity more intuitive, we further scaled the concordance range to values between −1 and 1 through cscaled = 2c − 1, which is analogous to Somers’ D77. Furthermore, in the absence of ties, this value is also identical to Kendall’s rank correlation coefficient tau77. In the figures presented in this paper, a scaled concordance of 0 indicates absence of association, a value greater than 0 indicates positive association, while a value less than 0 indicates negative association. As with Somers’ D and Kendall’s tau, the magnitude of cscaled captures the strength of the effect, with values near –1 and 1 corresponding to strong discordance and concordance, respectively.

A z-score was computed as unscaled concordance (that is, in the range of 0–1) minus 0.5 and divided by the square root of the variance, and the resulting value was used to derive a two-tailed P value78,79. P values were corrected for multiple testing using the Benjamini–Hochberg method to control the FDR80. Code to reproduce this analysis is available in the associated GitHub repository in the 5_Concordance.R script.

For all GMI analysis (for example, relating to Fig. 2) and concordance analysis relating metabolite levels to gene expression signatures (for example, in Figs. 5 and 6), only tumour samples were used.

Generation of IDO1 knockout HCT116 cell line

sgRNAs (oligonucleotide sequences are indicated in Supplementary Table 6) targeting IDO1 as well as non-targeting control were cloned into lentiCRISPRv2 puro plasmid (Addgene, 98290). Lentiviral packaging vectors psPAX2 (Addgene, 12260), pMD2.G (Addgene, 12259), along with sgRNA expressing vector were transfected into HEK293T cells using polyethylenimine transfection reagent. Around 72 h after transfection, supernatant containing lentivirus was harvested and filtered through a Whatman filter (Fisher Scientific) with a pore size of 0.45 µm to remove cell debris. Target cells (HCT116, human colon cancer cell line, American Type Culture Collection, CCL-247) were transduced with lentivirus using 8 µg ml−1 polybrene (Sigma). Around 72 h after lentivirus transduction, 3 µg ml−1 puromycin (Thermo Fisher) was added to cell culture medium to select for virus-infected cells. Two weeks after puromycin selection, target gene knockout was confirmed by western blot. For induction of IDO1 gene expression, cells were treated with 100 ng ml−1 interferon-γ for 48 h.

Metabolomics profiling on the IDO1 knockout HCT116 cell line

Cells were plated in six-well tissue culture plates at a density of 100,000 cells per well. Around 72 h after cell seeding, metabolites were extracted and analysed by liquid chromatography–mass spectrometry (LC–MS). For metabolite extraction, culture medium was aspirated, and cells were washed once with ice-cold PBS. After PBS washing, 1 ml of ice-cold extraction solvent (methanol:water ration of 80:20) was added. After overnight incubation at −80 °C, cells and extraction solvent were collected into 1.5-ml microcentrifuge tubes using a cell scraper. Samples were centrifuged at 20,000g for 20 min at 4 °C. Supernatant (900 μl) was collected and dried in a vacuum evaporator (Genevac EZ-2 Elite).

For LC–MS, dried extracts were resuspended in 30 μl of a 97:3 ratio of water:methanol containing 10 mM tributylamine and 15 mM acetic acid. Samples were vortexed, incubated on ice for 20 min, and clarified by centrifugation at 20,000g for 20 min at 4 °C. LC–MS analysis used a Zorbax RRHD Extend-C18 column (150 mm × 2.1 mm, 1.8-μm particle size, Agilent Technologies). Solvent A was 10 mM tributylamine, 15 mM acetic acid in a 97:3 ratio of water:methanol, and solvent B was 10 mM tributylamine and 15 mM acetic acid in a 3:97 ratio of water:methanol, prepared according to the manufacturer’s instructions (MassHunter Metabolomics dMRM Database and Method, Agilent Technologies). LC separation was coupled to a 6470 triple-quadrupole mass spectrometer (Agilent Technologies), which was operated in dynamic MRM scan type and negative ionization mode.

Tumour versus normal pathway analysis

There were seven CAMP datasets that had both tumour and normal samples available (BRCA1, COAD, GBM, PRAD, PDAC, ccRCC3 and ccRCC4). We applied differential gene expression tests between tumour and normal samples in each dataset using limma-voom (limma package, version 3.5.2). Genes with an FDR-adjusted P value < 0.1 were considered significantly differentially expressed. Similarly, we also conducted differential metabolite abundance testing between tumour and normal samples using Wilcoxon rank-sum tests. Metabolites with an FDR-adjusted P value < 0.1 were considered significantly differentially abundant.

For each KEGG pathway, we calculated the DF score and the DA score for genes and metabolites separately:

DF score = (number of significantly up-regulated constituents + number of significantly down-regulated constituents) / number of measured constituents in a pathway

DA score = (number of significantly up-regulated constituents − number of significantly down-regulated constituents) / number of measured constituents in a pathway

Constituents are either genes or metabolites in the above formula.

Conceptually, the DF score captures the overall disruption of the constituents of a pathway, whereas the DA score captures the tendency for pathway constituents to increase or decrease in abundance relative to a reference (in this case, normal tissue) state.

Gene–metabolite distance

To define a distance between genes and metabolites, we considered the highly manually curated genome-scale human metabolic model from Robinson et al.41, referred to as the ‘Human-1’ model, which describes the metabolic reaction network of transporters, enzymes and metabolites. We systematically computed the distance between a gene and a metabolite according to how many reaction steps separate the two molecules. If a gene and a metabolite participate in the same reaction, they will be assigned a distance of one; if they take part in subsequent reactions, they will be assigned a a distance of two, and so on. We defined an interaction as ‘proximal’ if the corresponding gene–metabolite pair had a distance of one or two (Extended Data Fig. 10b).

Overall, we could compute a distance for 473,206/4,438,632 (~10.66%) gene–metabolite pairs in our analysis, and 78,672/473,206 (~16.62%) of these were classified as proximal (that is, a distance of one or two). Of all significant gene–metabolite pairs in our concordance meta-analysis, 3,304/22,619 pairs had a defined distance (~14.61%), but only 565/22,619 (~2.50%) of these were proximal.

Gene-set enrichment analysis

For the analysis in Fig. 4, we ran a pathway enrichment analysis among all significant genes for each metabolite with at least one significant gene association. This set of genes was mapped to a total of 146 KEGG pathways, and the enrichment test was performed using classical hypergeometric testing81. P values were adjusted using the Benjamini–Hochberg method for controlling the FDR80. Adjusted P values < 0.01 were considered significant.

Results were then aggregated into a metabolite × pathway matrix and visualized as a heat map, where metabolites and pathways were clustered based on the –log10(P value). Clustering was performed using the pheatmap function82 (pheatmap package, v1.0.12) with Ward linkage and Euclidean distance.

Bulk gene expression deconvolution analysis

To dissect the role of the immune compartment in the tumour microenvironment, we calculated the ImmuneScore through the estimate R package83. To calculate cell-type-specific infiltration patterns in Fig. 6, we used ssGSEA46 for bulk gene expression deconvolution analysis. Signature gene lists of immune cell types and immune features were obtained from Bindea et al.51. Briefly, ssGSEA takes the sample TPM expression values as the input and computes an enrichment score for a given gene list as compared to all the other genes in the sample transcriptome.

Statistics and reproducibility

No statistical method was used to predetermine sample size. No data points were excluded from the analyses. The statistical tests used for the individual analyses are indicated in the figure legends. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.