Main

Molecular subtyping in cancer has identified biomarkers that stratify tumors according to biological and clinical phenotypes, providing an improved understanding of the signaling underpinning tumor development and treatment response. Numerous studies have leveraged gene-level expression values to identify tumor subtypes1,2,3, followed by downstream characterization using collections of pathway-level gene signatures that represent biologically important phenotypes4. The value of this approach is exemplified in CRC by the CMS5, colorectal intrinsic subtypes (CRIS3) and the single-cell RNA sequencing (scRNA-seq)-based intrinsic subtypes (iCMS)6. These gene-level subtyping approaches are dominated by associations with genetic alterations, particularly those underpinning the Vogelstein paradigm7,8. Although iCMS sub-stratified stromal CMS4 tumors, it did not identify any heterogeneity within the largest subtype of epithelial-rich tumors, CMS2.

Although subtype discovery using gene-level data represents the most commonly deployed approach, pathway-level data can provide a closer link with molecular mechanisms and clinical phenotypes9. Therefore, we reasoned that by using pathway-level data, initially in KRAS mutant (KRASmut) CRC tumors, we can identify a more comprehensive view of biological signaling related to disease.

Using transcriptional data from a series of colorectal tumors, we identify, validate and characterize a set of pathway-derived subtypes (PDS), providing unique insights into tumor biology regardless of mutational status. Although this was developed in bulk CRC, PDS classification reveals a previously unseen continuum of epithelial cell states within CMS2, associated with cell cycle, transcriptional activity and lineage differentiation states. These intrinsic biological traits are distinct from the iCMS classification system when applied to the same single-cell cohorts.

Overall, these data support the use of the PDS system in conjunction with existing subtyping approaches to ensure that tumor studies are informed by multiple tiers of cancer-relevant information that cannot be fully revealed by individual subtyping methods.

Results

Pathway-led CRC subtype discovery and classification

We generated a matrix of pathway-level single sample gene set enrichment analysis (ssGSEA) scores (n = 1,783 per tumor) as an initial framework for mapping the tumor activation status in a subset of the FOCUS10,11 clinical trial (ISRCTN79877428, n = 360 tumors; S:CORT cohort) across n = 640,000+ combinations of biological processes, including BIOCARTA, PID, KEGG and REACTOME collections from the Molecular Signature Database (MSigDB4). Following pathway-level conversion, KRASmut samples (n = 165) were used for unsupervised class discovery that, through k-means clustering, revealed three distinct PDS: PDS1 (~27%), PDS2 (~38%) and PDS3 (~35%) (Fig. 1a, Extended Data Fig. 1a–c and Supplementary Table 1). Assessment of mutations according to PDS revealed no distinct mutational type for KRAS itself (Fig. 1b) or numerous key mutations (Fig. 1c). Visualization of the n = 626 gene sets most significantly associated with PDS revealed how transcriptionally distinct these genetically indistinguishable groups were (Fig. 1d and Supplementary Table 2). These analyses also indicated that CMS2 tumors, and to a lesser extent CMS3 tumors, were distributed across PDS1 and PDS3, whereas the CMS1 and CMS4 inflammatory and stromal tumors were merged within PDS2 (Fig. 1e,f and Extended Data Fig. 1d).

Fig. 1: PDS of CRC highlights two subsets of CMS2.
figure 1

a, Schematic of class discovery: gene expression to pathway matrix using gene signature databases (see Methods) on KRASmut CRC samples (n = 165) from the FOCUS cohort (used as the discovery set) defines three PDS following series of dimensionality reduction (t-SNE) and unsupervised k-means clustering. b, Bar chart highlighting the proportion of KRASmut variants across PDS. c, Oncoprint with key cancer driver genes across PDS in KRASmut CRC samples. d, Heatmap depicting the PDS-specific ssGSEA scores across the discovery set (n = 165) with PDS and CMS annotation. e, UMAP from the PDS-specific ssGSEA scores on the discovery set with PDS (left) and CMS (right) annotations, using all samples from the discovery set (top) and PDS1 and PDS3, and CMS2 and CMS3 samples only (bottom). f, Sankey plot focusing on CMS2 and CMS3 primarily subdivided into PDS1 and PDS2. g, Heatmap visualization of the ‘Hallmark’ gene sets ssGSEA scores across the discovery set, with annotated PDS discovery calls and CMS. h, Schematic of classifier development: the FOCUS discovery set was divided into training (n = 125) and test (n = 40) sets, and subsequently, the svmRBF classification algorithm was trained on the training set and tested on the test set to finalize performance of the classification model. The classification prediction threshold was determined and the PDSclassifier R package was developed. i, Ternary plot displaying PDS prediction probabilities using PDSclassifier on the FOCUS test subset (n = 40). The red line denotes the default PDS prediction threshold at 0.6 and the dashed black line represents the PDS prediction threshold at 0.5 and 0.7. j, Sankey plot highlighting the overall concordance of PDS calls between discovery and classifier calls, suggesting robustness of the classifier. UNK, unknown.

Consistent with CMS1–CMS4 associations, PDS2 tumors were enriched for inflammatory and immune signaling pathways, such as interferon-α and interferon-γ response as well as stromal-related epithelial-to-mesenchymal and transforming growth factor β (TGF-β) activation using the Hallmark collection4 (Fig. 1g). PDS1 tumors displayed elevated cell-cycle-related pathways, including MYC and E2F targets and G2M checkpoint, whereas there was near universal repression of cancer-associated hallmark signaling in PDS3 (Fig. 1g). Overall, these three distinct transcriptional classes confirm the extent of signaling heterogeneity within KRASmut CRC and suggest that PDS classification provides a novel basis for forward and reverse translation studies.

To ensure that PDS classification could be performed on additional datasets, we explored different classification algorithms, in which the PDS-specific n = 626 gene set signatures (Fig. 1d) were used as features for training and developing the classification model (Fig. 1h). Given the best overall accuracy displayed by the support vector machine via radial basis function (svmRBF) algorithm (Extended Data Fig. 1e,f), the PDS classifier was based on this model. The PDS classification model also includes a prediction probability scoring read-out so users can clearly enumerate intratumoral subtype heterogeneity in individual samples. We defined 0.6 as the default threshold for PDS classification, with tumors that do not reach this threshold termed ‘mixed’ (Fig. 1i and Extended Data Fig. 1g,h). To ensure that this classifier can be easily used by the wider research community, an R-based classification package was developed: PDSclassifier (see Code availability), which accurately classified tumors into the same robust clusters identified during class discovery (Fig. 1h and Extended Data Fig. 1i).

Genomic and mutational landscapes of PDS1–PDS3 tumors

Although PDS was developed initially within KRASmut tumors, when tested on the entire FOCUS cohort (n = 360) it classified 87% of all samples independent of KRAS mutational status, PDS1 = 26%, PDS2 = 31%, PDS3 = 30%, mixed = 13% (Fig. 2a). The same PDS–CMS associations were observed in KRAS wild-type (WT) and KRASmut tumors (Fig. 2b); PDS2 were predominantly CMS4 or CMS1, and CMS2 and CMS3 were distributed across PDS1 and PDS3 (Fig. 2c,d). No clinical or pathological differences were identified between PDS subtypes (Extended Data Fig. 2a). In line with CMS associations, there was enrichment of BRAF mutations and fewer APC mutations in PDS2 than in PDS1 or PDS3 (Fig. 2e and Extended Data Fig. 2b). Despite such distinct biological signaling between PDS1 and PDS3, these tumors had identical mutational profiles across key genes within the WNT, MAPK, PIK3CA, cell cycle or TGF-β pathways (Fig. 2e and Extended Data Fig. 2b). This was further observed when copy number estimates were assessed (Fig. 2f and Extended Data Fig. 2c,d).

Fig. 2: PDS tumors are genomically indifferent but transcriptionally distinct with prognostic value.
figure 2

a, PDS classification on all FOCUS samples (n = 360) yields ~12.8% of samples with ‘mixed’ PDS biology that did not reach the threshold of 0.6. b, Heatmap depicting the PDS-specific ssGSEA across the FOCUS cohort, recapturing the same pattern regardless of KRAS mutational status. The top annotation bar indicates PDS, CMS and KRAS mutation. c, Sankey plot highlighting the FOCUS cohorts showing CMS–PDS alignment (top) and how CMS2 and CMS3 are subdivided into PDS1 and PDS3 (bottom). d, UMAP from the PDS-specific ssGSEA scores on the FOCUS cohort with PDS (left) and CMS (right) annotations using all samples from FOCUS cohort (top) and CMS2 and CMS3 and PDS1 and PDS3 samples only (bottom). e, Cancer driver mutations displayed in oncoprint with (red *) BRAF enrichment and low APC mutation in PDS2. f, Copy number alterations in heatmap indicating no significant difference across PDS. Statistics: Fisher’s exact test between PDS1 and PDS3. g, Significant ‘Hallmark’ ssGSEA heatmaps in the FOCUS and GSE39582 cohorts, with top annotation of PDS prediction probabilities and PDS calls. h, Radar plots highlight highly upregulated Hallmark gene sets prominent for each PDS in the FOCUS (top) and GSE39582 (bottom) cohorts. i, Transcriptional factor activity shown in the heatmaps displays transcriptional factors activated or repressed across PDS. j, Kaplan–Meier RFS plots in GSE39582 and PETACC-3 colon cancer cohorts (top). Univariate Cox proportional analysis outcome displayed in a tabular format (bottom), comparing PDS1 with PDS2 and PDS3 with a hazard ratio (HR), 95% confidence intervals (CI) and P value.

Overall, despite being developed within KRASmut tumors, these PDS identify distinct biology across all CRC tumors, regardless of mutational status.

Transcriptional characteristics of PDS1–PDS3 tumors

Using the Hallmark and DoRothEA12 algorithms, we characterized PDS-specific traits in the entire FOCUS cohort (GSE156915; n = 360), alongside CRC samples from two additional independent non-randomized cohorts, including the cohort used in the development of CMS (GSE39582 (ref. 2); n = 566) and another independent cohort from the S:CORT13,14 program (SPINAL cohort, GSE248381; n = 258). Again, there was a clear transcriptional distinction between each subtype, with the most prominent biological signaling cascades representing cell-cycle-related activity in PDS1 and stromal or inflammatory signaling in PDS2; the only consistent signal elevated in PDS3 was the repression signature ‘KRAS signaling down’ (Fig. 2g,h and Extended Data Fig. 3a,b). These transcriptional landscapes were underpinned by distinct transcription factor (TF) activity, with cell-cycle-related TFs, such as E2F2, being significantly active in PDS1 (Fig. 2i). PDS2 tumors were significantly activated for n = 32 stromal and inflammatory TFs, including SMAD3, STAT3, IRF1 and ERG. Although repressed for biological pathways within the Hallmark collection in general, PDS3 exhibited activation of TFs relating to a diverse set of hormonal and developmental processes, including FOXA2 and NR2F2 (Fig. 2i).

To test the clinical relevance of PDS classification, we assessed relapse-free survival (RFS) rates, which revealed that PDS2 and PDS3 displayed poor prognosis compared to PDS1 in the GSE39582 cohort (Fig. 2j) regardless of KRAS mutational status (Extended Data Fig. 2e). These findings were independently validated in the randomized PETACC-3 trial cohort15, further reinforcing the poor prognosis of PDS3 compared to PDS1 (Fig. 2j and Extended Data Fig. 2e). When tested in microsatellite stable tumors only, the prognostic relevance of PDS3 remains (Extended Data Fig. 2e). Given that the majority of PDS1 and PDS3 tumors were also classified as CMS2 (Extended Data Fig. 2f), we sub-stratified this otherwise uniform subtype, revealing that CMS2–PDS1 tumors have a significantly better outcome than CMS2–PDS3 tumors (Extended Data Fig. 2g), with PDS3 tumors having as poor an outcome as the stroma-rich CMS4 tumor subtype (Fig. 2j and Extended Data Fig. 2h).

Intratumoral heterogeneity across PDS

To identify the cellular source of the signaling underpinning PDS, we assessed gene expression scores in micro-dissected tumors (GSE31279), which highlighted that the PDS1 and PDS2 gene sets are primarily derived from epithelial and stromal regions, respectively (Fig. 3a,b), whereas the PDS3 gene sets do not show any significant association between epithelium or stroma. Using hematoxylin and eosin (H&E) slides from FOCUS and SPINAL samples, we set out to identify distinguishing histological features between PDS classes using a methodology similar to the image-based CMS (imCMS) classifier16,17 (Fig. 3c). PDS1 and PDS2 tumors contain visual features that our models used to discriminate between the two classes (PDS1 vs PDS2 and PDS3: area under the receiver operating characteristic curve (AUROC), 0.740 ± 0.019; F1-score, 0.574; PDS2 vs PDS1–PDS3: AUROC, 0.810 ± 0.033; F1-score, 0.618; Fig. 3d,e). However, our trained models were unable to identify PDS3-specific morphological patterns, as the learned features that were discriminative for PDS1 and PDS2 were also prominent in PDS3-labeled images, resulting in poor cross-validation performances for PDS3 (PDS3 vs PDS1 and PDS2: AUROC, 0.557 ± 0.026; F1-score, 0.338; Fig. 3d,f). The mixed morphology of PDS3 was also supported by the intermediate description of these tumors using ESTIMATE18 (Fig. 3g and Extended Data Fig. 3b,c), making our models erroneously classify PDS3 images as either PDS1 or PDS2. Even a more stringent 0.8 PDS classification threshold, which identifies transcriptionally more homogeneous PDS tumors and maintains the same PDS prognostic value (Fig. 3h–j), did not improve the image-based classifier.

Fig. 3: PDS intratumoral heterogeneity exits with PDS3 displaying no defined histological features.
figure 3

a, Heatmap visualization of the ssGSEA score using PDS-specific gene in laser capture micro-dissected CRC dataset with annotated epithelium or stroma region per sample. b, Comparison of PDS-specific gene set ssGSEA score between epithelium (n = 8) and stroma (n = 10), using two-sided Wilcoxon rank-sum test. Boxplots depict the median and interquartile range, with whiskers extending to the minimum and maximum values (excluding outliers as dots). c, Workflow schematic on development of the image-based PDS classifier (imPDSclassifier). d, Confusion matrix displaying the PDS prediction based on the imPDSclassifier to the transcriptomic-based PDS calls. e,f, imPDS predictions on the digital whole slide H&E images with tile-level confidence probability on PDS1, PDS2 (e) and PDS3 (f) samples. Scale bars, 2 mm. g, Stromal and immune ESTIMATE score across PDS in the FOCUS (PDS1, n = 93; PDS2, n = 113; PDS3, n = 108) and SPINAL (PDS1, n = 80; PDS2, n = 54; PDS3, n = 82) cohorts. Boxplots inside the violin plots depict the median and interquartile range, with whiskers extending to the minimum and maximum values (excluding outliers as dots). P values: two-sided Wilcoxon rank-sum test. h, Ternary plots represent the utility of PDS prediction probability thresholds, whereby increasing the threshold from 0.6 (left) to stringent 0.8 (right) can lead to more homogeneous PDS tumor samples while i, j, still retaining its prognostic value (PDS1, n = 109; PDS2, n = 83; PDS3, n = 80). Error bars, 95% confidence intervals. k, PDS prediction probability stacked bars and l, density plots highlighting the level of PDS heterogeneity found in each sample, with dashed lines representing the default threshold of 0.6 (top); increasing threshold, for example, to 0.8 (bottom), makes it more transcriptionally homogeneous.

Overall, these data indicate that although PDS1 and PDS2 tumors are dominated by epithelial and stromal influences, respectively, PDS3 tumors are morphologically indistinguishable from these two classes. However, despite this histological heterogeneity, PDS can be used to identify tumors that are transcriptionally homogeneous, particularly when the classification threshold is increased, ensuring that sample classification is based on robust and clearly distinct signaling (Fig. 3l).

PDS3 tumors are devoid of canonical LGR5+ and regenerative ANXA1+ stem traits

The lack of robust morphological patterns in PDS3 tumors prompted us to next assess their stem cell and precursor associations. Using the intestinal stem cell (ISC) index19, we observed an association between PDS1 tumors and the canonical LGR5+ crypt-base columnar cell (CBC) signature, PDS2 tumors aligned strongly with LGR5/ANXA1+ regenerative stem cell (RSC) signature and PDS3 tumors scored low for both stem signatures (Fig. 4a and Extended Data Fig. 4a–d). A subset of cases (n = 20) from the SPINAL cohort was selected for fluorescence in situ hybridization, which further confirmed increased epithelial LGR5 expression in PDS1 tumors, epithelial ANXA1 expression in PDS2 tumors and an almost complete absence of staining for both markers in the PDS3 tumors tested (Fig. 4b). Furthermore, PDS1 and PDS2 tumors were associated with tubular and serrated gene signatures, respectively20, but PDS3 tumors had no clear association with these precursor features. This was confirmed using a cohort of transcriptionally profiled polyp samples (S:CORT polyp cohort19) pathologically defined as tubulovillous adenomas, sessile serrated lesions and traditional serrated adenomas, in which tubulovillous adenomas were predominantly PDS1 and sessile serrated lesions associated with PDS2, but PDS3 tumors lacked any precursor association (Fig. 4c and Extended Data Fig. 4b–d). Additionally, Ki67+ immunohistochemistry and transcriptional signatures of proliferation21 and replication stress22 further highlighted these features in PDS1, with significantly lower signaling in PDS3 tumors (Fig. 4d–f).

Fig. 4: PDS3 tumors indicate no stem-like and precursor lesion associations.
figure 4

a, Scatter gradient-density plot representing LGR5+ CBC and LGR5/ANXA1+ RSC ssGSEA scores with PDS sample annotations in the SPINAL cohort (top). Stem cell plasticity landscape schematic highlighting the PDS1 association with CBC and the PDS2 association with RSC gene signatures along the stem cell index proposed in a previous study19 (bottom). b, In situ hybridization multiplex staining with LGR5, ANXA1 and DAPI across PDS samples in the SPINAL (PDS1, n = 7; PDS2, n = 7; PDS3, n = 6) cohort. Scale bars, 1 mm and 100 μm. c, Schematic representing PDS classification of the colorectal polyp dataset with categorized polyp types (left); confusion matrix depicting the association between PDS calls and polyps (right). d, Proliferation index and replication stress measure across PDS in the FOCUS (PDS1, n = 93; PDS2, n = 113; PDS3, n = 108) and GSE39582 (PDS1, n = 186; PDS2, n = 140; PDS3, n = 122) cohorts. e, Representative immunohistochemistry images of Ki67+ staining. Scale bars, 50 μm. f, Violin plot showing the tumor area (mm2) determined from H&E digital histological scores using HALO (PDS1, n = 91; PDS2, n = 111; PDS3, n = 107) and the per cent of tumor Ki67+ per sample (represented in e) across PDS in the FOCUS (PDS1, n = 91; PDS2, n = 108; PDS3, n = 105) cohort. Boxplots inside the violin plots (in d and f) depict the interquartile range, median, minimum and maximum values (excluding outliers as dots). P values: two-sided Wilcoxon rank-sum test. g, ShinyApp platform (SubtypeExploreR), integrated with the FOCUS and GSE39582 cohorts allows interrogation into gene expressions and signatures across three different CRC subtypes, including PDS, CMS and iCMS. TVA, tubulovillous adenomas; SSL, sessile serrated lesions; TSA, traditional serrated adenomas.

Overall, these results indicate the presence of numerous well-characterized cancer-related features associated with both PDS1 and PDS2 tumors; however, these approaches provided limited insights into the biology underpinning the ~30% of CRC tumors classified as PDS3.

To complement the non-exhaustive PDS characterizations presented here and to support the FAIR principles23, we developed the ‘SubtypeExploreR’ platform. This enables any user to interrogate transcriptional genes and/or signatures, including existing signatures from numerous databases or an unlimited combination of de novo unpublished classifiers, according to PDS and other CRC subtypes in our bulk cohorts (Fig. 4g; https://subtypeexplorer.qub.ac.uk).

This resource will ensure that these data are not just accessible, but (re)usable to a much wider audience.

Intrinsic stem-to-differentiated axis aligns with PDS1 and PDS3 tumors

To uncover biological interactions that underpin PDS3, genes within the PDS3-specific gene sets (n = 961) were assessed using Enrichr24 and STRING25 (Extended Data Fig. 5a,b), revealing an association with the polycomb repressive complex (PRC), which has an established role in repressing PRC targets that regulate cellular differentiation. Assessment of the gene expression of key PRC genes, such as EZH2, revealed that they were significantly lower in PDS3 tumors (Extended Data Fig. 5c), resulting in enrichment of PRC targets genes26 in PDS3 tumors compared to PDS1 and PDS2 tumors (Fig. 5a and Extended Data Fig. 5e). On the contrary, MYC targets, known for their role in maintaining a stem-like or pluripotent state26, were enriched in PDS1 and PDS2 compared to in PDS3 tumors (Fig. 5b and Extended Data Fig. 5f). We found a highly significant negative correlation between individual tumor scores in these bulk tumor tissue along a MYC–PRC axis, with PDS1 tumors displaying high-MYC–low-PRC target expression and PDS3 tumor displaying low-MYC–high-PRC target expression, and PDS2 tumors being intermediate (Fig. 5c and Extended Data Fig. 5g). In combination with our data indicating lower levels of cell cycle and transcriptional activity, and an absence of canonical and RSCs (Figs. 3 and 4), these data suggest that the cellular epithelial states that dominate PDS3 tumors may have shifted towards a greater proportion of differentiated lineages than in PDS1 and PDS2 tumors.

Fig. 5: MYC–PRC targets biological axis associated with stem–differentiation cellular dynamics.
figure 5

a, GSEA enrichment and violin plot displaying ssGSEA scores compared across PDS in the FOCUS cohort, using PRC targets gene signature. b, as in a, but with MYC targets gene signature (PDS1, n = 93; PDS2, n = 113; PDS3, n = 108). Boxplots inside the violin plots depict the interquartile range, median, minimum and maximum value (excluding outliers as dots). P values (violin plots): two-sided Wilcoxon rank-sum test. GSEA: Benjamini–Hochberg adjusted P value. NES, normalized enrichment score. c, Scatterplot depicting the inverse correlation between MYC targets and PRC targets ssGSEA scores, with annotated PDS samples. P value: two-sided Pearson correlation coefficient. d, Schematic of mouse intestinal crypt epithelium transcriptional dataset (GSE143915 (ref. 27)). e, PRC targets and Myc targets ssGSEA scores across colon epithelial cells. f, as in c, but with colon epithelial cell types. g, Schematic of murine organoid-derived epithelial single-cell dataset28 used for RNA velocity and pseudotime trajectory analysis. h, PHATE visualizations with cells annotated as MYC targets-high and PRC targets-high (top) and epithelial cell types (bottom). Arrows: RNA velocity based on CellRank39 method. i, Pseudotime score comparisons across groups (top) and cell types (bottom). The red and green dash lines represent the median of MYC and PRC targets ssGSEA scores, respectively. j, Schematic of SMI, made available in the PDSclassifier R package for both bulk and single-cell data. k, PHATE visualization with SMI. l, Comparison of SMI across epithelial cell types in GSE143915 and PDS in the FOCUS cohort (n as in b). P values: two-sided Wilcoxon rank-sum test. m, Heatmap visualization of the differentiation, PRC targets and MYC targets gene expressions in the FOCUS cohort, with annotated PDS prediction probabilities, PDS and SMI. n, In addition to the stem phenotypic landscape described in a previous work19, an overall tumor epithelial landscape describes the stem–differentiation cellular dynamics along the SMI, in which PDS3 tumors are highly enriched for differentiated-like traits that correspond to high PRC target expression. AbsPro, absorptive progenitor; SecPDG, secretory progenitor/deep crypt secretory cells/goblet, Ent, enterocytes; EEC, enteroendocrine cell.

To test this premise, we analyzed transcriptional data from an independent cohort of mouse colonic epithelial cells across six differentiation states (GSE143915 (ref. 27); Fig. 5d), which revealed a significant association between Myc-PRC target gene expression levels and cellular or biological (as opposed to tumor grade or histological) differentiation status. Stem and progenitor cells displayed highly proliferative and PDS1-like (high-Myc–low-PRC target) gene expression, whereas more differentiated enterocyte and enteroendocrine cell populations displayed low proliferation and PDS3-like (high-PRC–low-Myc target) gene expression (Fig. 5e). These data again highlight the strong association between cellular stem–differentiation and the MYC–PRC axis, with the phenotypic traits associated with the PDS classification system in bulk tumor data (Fig. 5f and Extended Data Fig. 5d,h).

To provide more granularity, we used scRNA-seq data (n = 29,452 epithelial cells) generated from murine CRC organoids mono-cultured or co-cultured with fibroblasts and/or macrophages with existing lineage annotations28 (Fig. 5g). We defined MYC targets-high and PRC targets-high cell populations based on the MYC and PRC targets ssGSEA score, representing PDS1-like and PDS3-like cells, respectively. Notably, we observed that MYC targets-high cells strongly aligned with the annotated canonical stem cell populations28 (colonic stem cells and hyper-proliferative colonic stem cells) and, to a lesser extent, the regenerative or revival stem cells, in a gradient of gene expression along the cell fate trajectory towards PRC targets-high cells, which were closely associated with differentiated cell populations, including transit-amplifying, goblet and late enterocyte cells (Fig. 5h,i and Extended Data Fig. 6c). Given the absence of stem populations in PDS3 and an association of MYC–PRC with stem–differentiation, we proposed a PDS-related stem–differentiation single sample scoring system for use in bulk and single-cell data, termed the ‘stem maturation index’ (SMI; Fig. 5j). Evaluation of the SMI in single-cell and bulk tissue datasets confirmed a clear correlation with PRC targets scores, differentiated colonic cells and PDS3 tumors (Fig. 5k,l and Extended Data Fig. 5i,j). Additionally, expression levels of several differentiation marker genes were also elevated in PDS3 and positively correlated with the SMI, further confirming PDS association along the phenotypic landscape of stem-to-differentiation, with PDS3 tumors having reduced stem populations and an increased abundance of epithelial cells that are further along the differentiated trajectory (Fig. 5m, Extended Data Figs. 5k,l and 6a,b).

Taken together, although the use of CMS and stem cell signatures can enumerate the canonical versus regenerative stem-like state of a tumor, our data supports that the use of these existing approaches in conjunction with PDS and SMI classification provides an otherwise unseen view of the overarching stem–differentiation cellular landscape and information about the overall differentiation state (distinct from the tumor grade) of a bulk tumor or a single cell (Fig. 5n).

Pre-clinical models fail to recapitulate PDS3 tumor biology

Using equivalent mouse pathways for PDS classification and bulk tumor transcriptional data from n = 51 genetically engineered mouse models (GEMMs) across n = 6 genotypes29—Apcfl/+ (A); Apcfl/+ KrasG12/+ (AK); BrafV600E/+ Trp53fl/fl (BP) and BrafV600E/+ Trp53fl/fl Notch1Tg/+ (BPN), KrasG12D/+ Trp53fl/fl Notch1Tg/+ (KPN); KrasG12D/+ Trp53fl/fl (KP)—we assessed how well mouse models align to human PDS (Fig. 6a). The A and AK models were exclusively PDS1, whereas the KP, KPN and BPN models were divided into PDS2 and PDS3 (n = 1 BPN as PDS1), with BP models aligning with PDS3 (Fig. 6a and Extended Data Fig. 6d). There was a clear alignment of the biological hallmarks, TF activation states and CMS classifications associated with PDS1 and PDS2 between human and mouse tumors (Fig. 6b,c). However, in contrast to human PDS3 biology, GEMM tumors classified as PDS3 were not associated with CMS2 or the transcriptional repression for hallmarks signaling observed in human tumors and displayed signaling similar to GEMMs classified as a ‘mixed’ subtype (Fig. 6b,c), indicating that the mouse models used did not accurately represent human PDS3 biology (Fig. 2). Assessment of proliferation index and replication stress according to both PDS classification and genotype demonstrated elevation of both phenotypes in PDS1, but no further suppression was observed in PDS3 compared to PDS2 (Fig. 6d and Extended Data Fig. 6e). Finally, although elevation for the Myc–PRC targets signaling continuum is maintained in PDS1 GEMMs, there was again little distinction between PDS2 and PDS3 (Fig. 6e).

Fig. 6: PDS3 closely resembles WT normal differentiation patterns and lacks pre-clinical models.
figure 6

a, Six genotypic GEMM primary tumors were sequenced and PDS were called. b, Heatmap displaying the ssGSEA scores for the Hallmark gene sets in GEMMs across PDS, including ‘mixed’ samples. Top annotation bar indicates PDS calls, mouse CMS (MmCMS) and genotypes. c, Heatmap displays TF activity from DoRothEA12. d, Violin plots showing proliferative index (top) and replication stress (bottom) across PDS in GEMMs (PDS1, n = 11; PDS2, n = 15; PDS3, n = 12). Boxes inside the violin plots depict the interquartile range, median, minimum and maximum values. P values: two-sided Wilcoxon rank-sum test. e, Scatterplot highlighting the inverse correlation between Myc targets and PRC targets ssGSEA scores recapitulated in GEMMs, with PDS annotation. P value: two-sided Pearson correlation coefficient. f, PHATE visualization on murine organoid scRNA-seq data with cell density and cell type annotations for AK (left) and WT (right). g, Waddington landscape representing AK and WT models annotated with cell types (top) or MYC–PRC target-high groups (bottom), closely resembling PDS1 stem-like vs PDS3 differentiated-like traits, respectively. h, PHATE visualization for WT cultured in WENR (W, WNT3A; E, EGF; N, Noggin; R, R-spondin-1) in which cellular states are skewed towards stem cell enrichment with limited differentiation. i, Waddington landscape of WT + WENR annotated with cell types (top) and MYC–PRC target-high groups (bottom), closely resembling PDS1 stem enrichment. j, Representative Ki67+ ISH images of murine organoids (AK, AK + ENR, WT, WT + ENR; n = 2 replicates each per group). Scale bars: AK, 50 μm; AK + ENR, 100 μm; WT, 50 μm; WT + ENR, 50 μm. k, PHATE visualization for proliferation index scores across all populations.

In line with the murine tumor tissue above, AK organoids remain strongly associated with a PDS1-like stem-enrichment (Fig. 6f) and also display a high entropy and differentiation potential when viewed using a Waddington-like landscape (Fig. 6g)30. By contrast, in the absence of mutations and culture media supplements (WNT3A, EGF, Noggin and R-spondin-1; WENR), WT murine organoids display a normal-like homeostatic distribution of stem-to-differentiated cell populations (Fig. 6f) and reduced differentiation potential (Fig. 6g); a pattern that most closely reflects PDS3-like lineage distributions. Importantly, the lack of PDS3 pre-clinical models may be a result of pre-clinical in vitro and/or ex vivo culturing conditions, as organoid media containing these growth factors drives a strong selective pressure towards an AK-like stem-enriched and highly proliferative cellular hierarchy, even in WT models (Fig. 6h–k and Extended Data Fig. 6f–h). Although WT organoids are most closely aligned to PDS3 stem-differentiation patterns, it must be clearly stated that such models do not capture the full extent of overarching PDS3 tumor biology.

These findings indicate a critical limitation for pre-clinical modeling of the worst prognostic group of CRC tumors, restricting opportunities for improving our mechanistic understanding and testing of PDS3-specific therapeutic options.

Assessment of PDS-specific phenotypes from bulk to single-cell

We next leveraged the human scRNA-seq data used to develop iCMS, n = 49,155 epithelial cells from n = 63 CRC tumors6. In line with bulk tumors and epithelial-specific intestinal lineages, these analyses revealed that neoplastic cells states can be classified along an MYC–PRC target axis, in individual patient clusters (Fig. 7a–c and Extended Data Fig. 7a,b) or when clustered according to the pathway-level gene sets used for PDS development (Fig. 7d). Cell cycle annotation31 revealed that PDS1-like neoplastic cells displaying higher MYC target signaling were more likely to be in S or G2M phase, whereas PDS3-like cells elevated for PRC targets expression were more likely to be in G1 phase (Fig. 7e). These results are more pronounced when assessed in cells at either end of the MYC–PRC continuum, which are most representative of PDS1 and PDS3 biology (Fig. 7c and Extended Data Fig. 7c,d). Additional characterization further confirms faithful alignment between bulk and scRNA-seq, whereby phenotypes of elevated proliferation and replication stress are observed in MYC targets-high PDS1-like cells alongside elevation of the KRAS signal repression signature (KRAS signal DN) in PRC targets-high PDS3-like cells (Fig. 7f,g and Extended Data Fig. 8e). Moreover, CBC stem cells are aligned with MYC targets-high, whereas RSC coincides more closely between the boundary of MYC targets-high and PRC targets-high cells, highlighting the features of PDS2-like intrinsic traits (Fig. 7h,i and Extended Data Fig. 7f). These findings suggest that biology identified by PDS1–PDS3 tumors in bulk tissue strongly correlates with the intrinsic biology defined by MYC–PRC target gene signatures at a single-cell level.

Fig. 7: Single-cell analysis of CRC confirms MYC–PRC targets axis.
figure 7

a, Epithelial scRNA-seq dataset from 63 patients with CRC over five cohorts; UMAP displaying cell clustering based on patients. b, UMAP visualization indicating cells annotated with MYC targets ssGSEA score (left), PRC targets ssGSEA score (center) and a blend of the two (right). c, Scatterplot highlighting an inverse correlation between MYC targets and PRC targets at the single-cell level (color key as in b). d, UMAP visualization using PDS gene set scores with cells annotated with MYC targets (left), PRC targets (center) and a blend of the two (right). e, UMAP visualization annotated with cell cycle phases per cell (left) and stacked bar chart display of the proportion and number of cells at different G1, S or G2M cell cycle phases. f, UMAP visualization annotated with cells indicating proliferative index, replicative stress and KRAS signal DN. g, Boxplots displaying a comparison between MYC targets-high (n = 9,221) and PRC targets-high (n = 8,964) cells for proliferative index, replicative stress and KRAS signal DN. Boxplots depict the interquartile range, median, minimum and maximum values (excluding outliers as dots). P values: two-sided Wilcoxon rank-sum test. h, UMAP visualization annotated with cells indicating CBC and RSC stem gene signatures. i, Boxplots as in g for CBC and RSC.

Cellular differentiation trajectory status in colorectal tumors

We next assessed the RNA velocity and pseudotime trajectories (Fig. 8a). Pseudotime analysis revealed that the cell fate trajectories in each patient cluster were highly correlated with the PDS1–PDS3 axis of MYC targets–PRC targets or SMI (Fig. 8b and Extended Data Fig. 8a,b). Moreover, RNA velocity further validated that the directionality of eventual cell fate transcriptional changes—from MYC targets-high to PRC targets-high cell populations—was present in each patient (Fig. 8c–e). As the original iCMS classifier provides patient-level classification, it provides limited information on the strength and/or extent of iCMS-related biological signaling in individual cells. Therefore, it is unsurprising that iCMS is not strongly aligned with the fundamental cell fate trajectory associated with the stem–differentiation dynamics that are found in each patient (Fig. 8e–h and Extended Data Fig. 8c,d). To provide more clarity into iCMS at the single-cell level, we applied single sample iCMS2–iCMS3 classifier scores from the top tertile of each group, which clearly revealed that the cells least aligned with either iCMS2 or iCMS3 are those that are most aligned with PDS3, SMI and PRC targets-high cells (Fig. 8i and Extended Data Fig. 8e).

Fig. 8: PDS3 differentiation-like trait relates to tumor biology unexplained by single-cell-derived iCMS.
figure 8

a, RNA velocity and pseudotime analysis using the CellRank39 method, applied to the CRC epithelial scRNA-seq data. b, UMAP visualization displays patient-based clusters (left) and scatterplot between MYC targets and PRC targets ssGSEA scores (right) with cells annotated with pseudotime. P value: two-sided Pearson correlation coefficient. c, UMAP visualization of PDS gene set scores with MYC and PRC targets-high cell annotations and arrows indicating RNA velocity using the CellRank method. d, UMAP visualizations (top) of data from two patients with CRC, with MYC and PRC targets-high cell annotations and RNA velocity displayed with arrows. Violin plots (bottom) display pseudotime analysis for the respective CRC patients. e, UMAP visualizations (top) of data from all patients, annotated with MYC and PRC targets-high (left) or iCMS calls (right), and violin plots (bottom) displaying pseudotime analysis for the MYC–PRC groups and iCMS. f, Violin plots display differentiation potency using correlation of connectome and transcriptome (CCAT) comparing MYC–PRC targets-high (left) and iCMS (right). g, Violin plots for CCAT measure per iCMS2 (left) or iCMS3 (right) patients. h, Scatterplot displaying the association between CCAT and SMI. i, UMAP visualization on PDS gene set scores, cell annotation with iCMS2 (top-left) and image-based CMS3 (top-right) ssGSEA scores, a blend of the two (bottom-right) and SMI (bottom-left). j, Summarized schematic describing the CRC bulk tumor classified into PDS, outlining the biological and clinical features of PDS tumors, in which MYC targets in association with PDS1 and PRC targets in association with PDS3 led to the identification of an SMI that enumerates stem–differentiation shifts in cellular state.

Collectively, our findings uncovered a phenotypically subtle subset of colorectal tumors characterized by a lack of numerous cancer-related features, many of which are viewed as essential hallmarks for tumor growth. Overall, the use of PDS, in conjunction with existing CRC classifications, provides a more comprehensive overview of both the dominant and nuanced biology underpinning CRC. It is only through the assessment of all these phenotypic traits—not alone but in combination—that the field can advance in its goals to understand and therapeutically target the mechanisms underpinning CRC (Fig. 8j).

Discussion

Numerous studies have used individual gene-level data for molecular subtyping class discovery3,5. However, in this study, we leverage pathway-level data for class discovery, using signatures that mechanistically underpin important cancer-relevant functions to provide a more direct link with cancer phenotypes in a way that transcends current transcriptional subtypes. This approach identified three PDS, PDS1–PDS3, which merged the inflammatory/stromal CMS1 and CMS4 tumors as PDS2 and sub-stratified epithelial-rich CMS2 and CMS3 tumors into two distinct PDS: PDS1 and PDS3. Biological and molecular characterization reveal that PDS3 lesions display lower cell cycle and stress response, reduced proliferation and stem cell populations, alongside increased cellular differentiation compared to PDS1 and PDS2 tumors. Given the repression for many cancer-related biological phenotypic features that dominate PDS3, these lesions display the worst RFS rates in multiple adjuvant cohorts, including the randomized PETACC-3 clinical trial15 (summarized in Fig. 8j). Our data also clearly demonstrate how substantial phenotypic and clinical heterogeneity persist within genetically identical tumors. The PDS classifier can be used to stratify tumor samples in a genotype-agnostic manner across numerous independent cohorts, providing an additional informative link between gene expression data and clinical or biological phenotypes.

A recent scRNA-seq study with tumor epithelial cells used gene-level data for class discovery, resulting in sub-stratification based on intrinsic biology (iCMS2 and iCMS3). However, iCMS did not identify different subtypes within epithelial-rich CMS2 tumors6. In the data presented here, we clearly demonstrate that by using pathway-level data for class discovery in bulk data rather than a gene-level method in single-cell neoplastic lineages, CMS2 tumors can be robustly and reproducibly sub-stratified into clearly distinct prognostic groups driven by subtle biological signaling that is not evident in iCMS or CMS alone. Importantly, these traits are evident even in the bulk and scRNA-seq transcriptomic datasets used for the development of the CMS and iCMS classifiers. As such, the biological pathway activation approach that underpins PDS classification provides a novel basis for forward and reverse translation studies in combination with these existing subtyping approaches.

The poor outcomes associated with PDS3 tumors in the adjuvant setting require further investigation, as it is possible that the repressed and differentiated patterns that dominate PDS3 bulk tumor profiles are concealing small subsets of aggressive and/or chemotherapy-resistant subclones, similar to mesenchymal subpopulations recently described in melanoma32, or stem populations not characterized by our CBC and RSC investigations33. The poor outcomes associated with PDS3 align with a recently reported role for PRC components in determining how well tumors and tumor cells survive during sustained environmental stress; whereby cells that are deficient in epigenetic regulators, particularly EZH2, have a superior fitness advantage over EZH2-proficient tumors34. These stress-resistant subpopulations were characterized by low cycling ‘transcriptional numbness’, similar to the characteristics of PDS3 tumors presented here. Finally, as our study focused on molecular information, there may be a series of unaccounted-for epidemiological, microbial or viral factors that underpin both the biological traits of the PDS classes and their associated clinical outcomes.

The histological features and molecular profiles in PDS3 lesions were indistinguishable from the other classes, yet these tumors have clearly distinct biological and transcriptional states. The apparent depletion of canonical or RSCs in PDS3 tumors, alongside their slow-cycling nature and elevated proportion of mature and apparently differentiated lineages sits in contrast with the definitive pathological adenocarcinoma classification and CMS2-dominant nature of these tumors. This demonstrates a clear disconnect between biological or lineage-related differentiation and pathological or grade-related differentiation and suggests the presence of additional stem-like populations that underpin tumor viability and maintenance33,35,36. Given the distinct activation states across a series of cancer-relevant signaling cascades and phenotypic features, future work is now needed to detangle the histological and biological mechanisms underpinning these lesions.

Data presented here demonstrate that PDS1 and PDS2 tumors align well with a range of available GEMMs and organoids currently used in cancer research, whereas there appears to be no GEMMs or organoids that faithfully align to the biological signaling that characterize PDS3 in human tumors; a point that currently limits more comprehensive mechanistic studies. The lack of accurate PDS3 models is intriguing but perhaps not surprising based on how current systems are developed. Given the relatively short lifespan of mice, there is a bias toward the development of fast-growing, stem-rich lesions that thrive following an ‘all at once’ induction of driver genes37. This scenario does not replicate the sequential accumulation of genetic alterations and environmental changes seen in human tumor development, which can occur over a 15+ year period38, nor the highly differentiated nature of PDS3 tumors. Furthermore, even when slow-cycling WT organoids are grown in enriched media of growth factor ligands, these additives rapidly drive models towards the PDS1 and PDS2 high-entropy biological traits rather than retaining the more differentiated PDS3 tumor biology. As such, if PDS3 biology is to be modeled in a way that more faithfully aligns with human tumors, methodologies underpinning model development require urgent attention.

In summary, our study presents an approach to tumor classification that relies on patterns observable within broad biological signaling rather than individual gene clustering. Although it would be easy to view independent subtyping approaches as competing, such as CMS, iCMS, CRIS and PDS, it will only be through the classification of samples using these complementary approaches in parallel that we can reveal the biological granularity and a greater mechanistic understanding that would otherwise be missed by using them in isolation. Maximizing the amount of phenotypic information that we derived from tumor data, using classification tools that provide synergistic insights into different yet equally important transcriptional signaling, will provide the most comprehensive landscape for the stratification of tumors into discrete biological groups.

Methods

The FOCUS data (GSE156915 ref. 11) used for class discovery and classifier development in this study were accessed through the UK Medical Research Council S:CORT program14. All subjects provided written informed consent on their samples at the time of entry to the clinical trials for further research. The clinical trial (FOCUS; reference 79877428) and the study (S:CORT; reference 15/EE/0241) were approved by the National Research Ethics Service in the United Kingdom11.

Bulk datasets

The FOCUS (GSE156915; ref. 11) and SPINAL (GSE248381) datasets were generated within the S:CORT program14, in which microarray gene expression profiles, mutation, clinical, immunohistochemistry, tissue blocks and tumor microarrays were available. FOCUS is a Medical Research Council-funded randomized trial cohort consisting of 360 formalin-fixed paraffin-embedded primary tumor samples for metastatic CRC. SPINAL consists of 258 formalin-fixed paraffin-embedded samples from patients with CRC, at mixed stages. Other publicly available datasets were accessed from Gene Expression Omnibus with accession numbers GSE39582 (ref. 2), GSE31279 (ref. 40), GSE143915 (ref. 27) and GSE218776 (ref. 29), and from ArrayExpress E-MTAB-6363 (ref. 41). The validation of clinical association was carried out with the PETACC-3 cohort15. The Cancer Genome Atlas (TCGA) dataset for colon and rectal adenocarcinoma (COREAD42) was accessed and extracted from the Genomic Data Commons (GDC) through TCGAbiolinks43. In all microarray datasets, the probes-to-genes were collapsed using the collapseRows function in the WGCNA R package (v.1.70-3), in which the probe with the highest average value per gene was selected44. For TCGA COREAD RNA-seq data, the TCGAbiolinks R package (v.2.16.1) was used to download HT-seq counts. Using the varFilter in the genefilter R package (v.1.70.0), low-variance genes (var.cutoff = 0.25) were excluded followed by quantile normalization and log2 transformation to the count matrix.

Single-cell datasets

Two epithelial scRNA-seq datasets were also used for the study—a CRC tissue-derived scRNA-seq merged dataset from five different cohorts6 and a scRNA-seq dataset derived from murine organoids mono-cultured or co-cultured with fibroblasts and/or macrophages28. For the scRNA-seq human CRC dataset6, the processed count expression matrix for n = 49,155 epithelial cells and the corresponding epithelial metadata were downloaded through the Synapse under accession code syn26844071. The murine organoid scRNA-seq dataset28 consists of n = 29,452 epithelial cells from WT mouse colonic organoids and at least five different genotypic CRC organoids, including shApc (A), KrasG12D/+ (K), shApc and KrasG12D/+ (AK), KrasG12D/+ and Trp53R172H/- (KP) and shApc, KrasG12D/+ and Trp53R172H/- (AKP), and all the corresponding metadata were also downloaded from a previous publication28.

GEMMs

All animal experiments were performed according to a UK Home Office license (Project License 70/8646) and were reviewed by the animal welfare and ethical board of the University of Glasgow. Both male and female 6–12-week-old mice were systematically induced with a single injection of 2 mg tamoxifen (Sigma-Aldrich, T5648) using intraperitoneal injection and sampled at clinical endpoint, which was defined as weight loss and/or hunching and/or cachexia. All experiments were performed on mice with a C57BL/6 background. RNA was extracted using either an RNeasy mini kit (Qiagen) or TRIzol reagent (ThermoFisher Scientific) and its concentrations were assessed using a NanoDrop 200c spectrophotometer (ThermoScientific). RNA quality was evaluated using an Agilent 220 TapeStation ScreenTape and sequenced using an Illumina TruSeq RNA sample prep kit, then run on an Illumina NextSeq using the High Output 75 cycles kit (2 × 36 cycles, paired-end reads, single index). The quality of the raw sequence was assessed using the FastQC algorithm (v.0.11.8). Sequences were trimmed to remove adaptors and low-quality base calls, defined as those with a Phred score of <20, using the Trim Galore tool (v.0.6.4). Thereafter, the trimmed sequences were aligned to the mouse genome build GRCm38.98 using HISAT2 (v.2.1.0), then FeatureCounts (v.1.6.4) was used to determine raw counts per gene. Two mouse cohorts that were used in our previous study29, namely small cohort (n = 18; E-MTAB-6363 (ref. 41)) and large cohort (n = 39; GSE218776 (ref. 29)), were amalgamated to create a larger cohort of GEMMs. APN and AP models were excluded, as the batch they were sequenced in was deeply confounded by genotype, resulting in a collection of n = 51 tumor samples, including six genotypes: Apcfl/+ (A); Apcfl/+ KrasG12/+ (AK); BrafV600E/+ Trp53fl/fl (BP); BrafV600E/+ Trp53fl/fl Notch1Tg/+ (BPN); KrasG12D/+ Trp53fl/fl Notch1Tg/+ (KPN); and KrasG12D/+ Trp53fl/fl (KP). After batch correction using the ComBat_seq function in the sva R package (v.3.40.0), the vst function in the DESeq2 R package (v.1.32.0) was applied to normalize the read counts.

Intestinal organoid culture

Organoid lines were generated from the small intestine of uninduced WT and tamoxifen-induced VillinCreER Apcfl/fl KrasG12D/+ (AK) mice. Two biological replicates per genotype were used. Tissue segments were collected from the small intestine (5–15 cm from the pyloric sphincter), and samples were immediately washed with PBS and kept on ice. For isolation45, the samples were dissociated in 2 mM EDTA at 4 oC for 30 min and subsequently shaken and separated into four fractions. Fractions two to four, containing intestinal crypt material, were filtered through a 70 µm cell strainer and centrifuged at 600 rpm for 3 min. The cell pellet was resuspended in Matrigel (BD Bioscience, 356231) and seeded in accordance with the pellet volume obtained.

For culturing46, organoid base media was made up of Advanced DMEM/F12 (ThermoFisher Scientific, 12634010) supplemented with 1 mM l-glutamine (ThermoFisher Scientific, 25030024), 10 µM HEPES Buffer (ThermoFisher Scientific, 15630056), penicillin–streptomycin (100 U ml−1) (ThermoFisher Scientific, 15140122), N2 (ThermoFisher Scientific, 17502001) and B27 (ThermoFisher Scientific, 12587001) and is hereafter referred to as complete organoid base media. Intestinal crypts were cultured in complete organoid base media further supplemented with 50 ng ml−1 Recombinant Human EGF (Peprotech, AF10015), 100 ng ml−1 Recombinant Murine Noggin (Peprotech, 25038) and 500 ng ml−1 Recombinant Murine R-spondin1 (R&D Systems, 3474-RS), hereafter referred to as ENR media.

To assess the effect of ENR withdrawal on organoid transformation, WT and AK organoid lines were split at a ratio of 1:6 following the formation of large budding structures and then cultured for 48 h with either complete organoid media supplemented with ENR or complete organoid base media only. Organoids were removed from the culture plate using ice-cold PBS and then transferred to a 15 ml falcon and subjected to 3× wash and centrifugation steps at 600 rpm for 3 min. The resulting cell pellet was fixed in 10% paraformaldehyde at 4 °C for 10 min. After aspirating the supernatant, the cell pellet was resuspended in 2% agarose (Melford, A20090-500) and embedded in paraffin blocks. The following antibodies were used according to standard histological processing techniques: BrdU (BD Biosciences, 347580, TRS High, 1:250), Ki67 (Cell Signaling, 12202, ER2 20 min, 1:1000), Chromogranin-A (AbCam, ab108388, TRS High, 1:600) and Synaptophysin (Cell Signaling, 36406, TRS High, 1:150). In situ hybridization was performed on the Leica Bond Rx Autostainer using the following RNAscope probes according to the manufacturer’s instructions: Lgr5 (312178), Anxa1 (509298), Clu (427898) and Olfm4 (311838). Positive (mmPpib, 313918) and negative (mm-DapB, 312038) control probes were implemented to ensure staining integrity and accuracy. Images were acquired using an Olympus BX51 and Zen2 Lite Blue imaging software at ×20 magnification.

Gene signatures

The majority of the gene set signatures, including the Hallmark, BIOCARTA, KEGG, PID and REACTOME gene set collections, were accessed from the MSigDB using the msigdbr R package (v.7.4.1) for both human and mouse species4. Other gene set signatures included: (1) stem-related CBC and RSC19; (2) precursor polyp-related tubular and serrated signatures obtained from differential gene expression analysis between tubular and serrated adenomas from GSE45270 (ref. 20) using the limma R package47 (v.3.46.0); (3) MYC and PRC target modules26; and (4) a list differentiation-specific markers28,48,49 (Supplementary Table 3) used for the pathway-based analyses. These gene signatures were also applied to mouse data where they were directly applicable using MSigDB, or the gene signatures were converted to the mouse orthologues with ensembl using the biomaRt R package50 (v.2.50.3).

Unsupervised discovery of PDS

Four well-curated publicly available gene set collections, BIOCARTA (gene set, n = 289), KEGG (n = 186), PID (n = 196) and REACTOME (n = 1499), were accessed from the MSigDB with the msigdbr R package (v.7.0.1) and were used to generate a matrix of ssGSEA scores from gene expressions of the FOCUS cohort using the GSVA R package (v.1.26.0). The parameters min.sz = 10, method = ‘ssgsea’ and ssgsea.norm = T were set, resulting in the scores for 1,783 gene sets. A subset of 165 KRASmut primary tumor CRC samples were selected from the ssGSEA score matrix (excluding KRAS WT, BRAF and NRAS mutants) for the unsupervised class discovery phase. The score matrix was subjected to dimensionality reduction (t-SNE) analysis, and the two continuous variables (Dim1 and Dim2) were obtained using the Rtsne R package (v.0.15). The variables were scaled before applying unsupervised k-means clustering. The silhouette width and elbow methods determined k = 3 as an optimal number of clusters (cluster R package v.2.1.2; factorextra R package v.1.0.7), and the bootstrap resampling method from the fpc R package (v.2.2.3) identified k = 3 as the highly stable number for clustering. Following unsupervised k-means clustering, three groups of clusters were named: PDS1, PDS2 and PDS3.

Development and application of the PDS classification system

Using the discovery subset (n = 165) in which the PDS classes were defined previously, the discovery set was randomly divided into a training set (n = 125) and a test set (n = 40) based on the bootstrap resampling method using the caret R package (v.6.0-90). Three different classification algorithms were tested, including the nearest shrunken centroid (or prediction analysis of microarrays), lasso and elastic-net regularized generalized linear model (glmnet) and svmRBF, implemented in the caret R package (v.6.0-90). As a feature selection step to reduce the number of gene sets and draw out only subgroup-specific gene sets, ssGSEA scores that were above average specific to each subgroup were selected and further highly correlated gene sets (>0.9) were excluded, resulting in 626 gene sets in total (Supplementary Table 2). To make it feasible for users, the test run was performed on the gene expression matrix, and the ssGSEA score conversion steps were implemented within the classification model. The gene expression matrix of the test samples was first converted to the ssGSEA scores with the gsva function along with the parameters ssgsea.norm = F, which generated unscaled scores. The unscaled ssGSEA scores were scaled using the min–max scaling method in the gsva function, in which the minimum and maximum values determined during the class discovery phase were used. Once the score matrix of the test samples was generated, it was batch-corrected against the training set as the reference batch using the ComBat function from the sva R package (v.3.42.0).

Before running the classification algorithms, leave-one-out cross-validation was used in the training set to minimize overfitting of classification. The classification algorithms were trained on the training set and the corresponding hyperparameters were adjusted to finalize the models. Out of the three classification models, the svmRBF algorithm displayed a high classification performance on the test data; therefore, it was selected to develop the PDS classification system. The PDSclassifier R package has been developed for the PDS classification model that is available to share (see Code availability).

Pathway analysis

To define subtype-specific biological associations, GSEA was used with the fgsea R package51 (v.1.21.0) with eps = 0 and nPerSimple = 10,000. For fgsea, a ranked gene list was first obtained for each comparison in each dataset with the limma R package (v.3.50.3). The comparison between subtypes was made in a grouped pairwise manner. Statistical significance was measured with a Benjamini–Hochberg false discovery rate of <0.05, and normalized enrichment scores indicate upregulation (positive value) or down-regulation (negative value).

The ssGSEA scores in bulk datasets were generated with the GSVA R package (v.1.42.0) with the ‘ssgsea’ method52 and ssgsea.norm = T. For scRNA-seq datasets, ssGSEA scores per cell from the signatures were calculated using the enrichIt function in the escape R package53 (v.1.6.0) with method = ‘ssgsea’ and min.size = 1.

Proliferative index, replication stress and ISC index

Replication stress was calculated from a collection of transcriptional signatures (n = 20; Supplementary Table 4) associated with cell cycle and DNA repair extracted from the MSigDB for both species, and GSVA scores were generated using ‘gsva’ method from GSVA R package52 (v1.42.0), followed by the total sum of the GSVA scores per sample across gene sets22. The list of gene sets was used to determine the enrichment scores per cell using escape followed by the sum of enrichment scores per cell to obtain replication stress scores, which were subsequently scaled between −1 and 1 using the rescale function in the scales R package (v.1.2.0) in the scRNA-seq dataset. The transcriptomic measure of proliferation was calculated with the ProliferativeIndex R package21 (v.1.0.1), which calculates a proliferative index from a list of proliferative cell nuclear antigen-associated genes. For the mouse model, the proliferative cell nuclear antigen-associated gene signature was converted to mouse orthologues as mentioned previously before applying the signature to calculate the proliferative index score. The gene signature was extracted directly from the R package and used to calculate enrichment scores per cell using escape in the scRNA-seq cohort. Likewise, the CellCycleScoring function in the Seurat R package31,54 was also used to predict the cell cycle phase per cell into G1, S and G2M. The ISC index provides a continuum score from the gene expression dataset that represents the stem cell phenotype, with the extreme ends of the scoring scale as either a strongly conventional CBC or an RSC phenotype. The ISCindex R package19 (v.0.0.0.9) was downloaded and used for the bulk dataset. Additionally, these CBC and RSC gene signatures were also used to obtain ssGSEA scores for the bulk and scRNA-seq datasets.

TF activity analysis

For the quantification of TF activity from the gene expression profiles, the collection of well-curated TFs and their targets in the DoRothEA12 database was accessed; the TF-target interaction with high confidence A and B were selected for the analyses of both human and mouse using the dorothea R package (v.1.6.0). Statistically significant differences (P < 0.05) in TF activity between subtypes were determined using the ‘rowTtest’ function from the viper R package55 (v.1.28.0). A list of uniquely activated TFs per subtype across cohorts was identified and visually presented in a heatmap.

Cell lineage analysis

ESTIMATE18 was applied to produce immune and stromal fractions as well as tumor purity scores using the estimate R package (v.1.0.13) and visualized using the ggplot2 R package.

CRC molecular subtyping

Tumor samples in each cohort were classified into CMS using the random forest method from CMSclassifier (v.1.0.0) at a default threshold, with the exception of the FOCUS and SPINAL cohorts, for which the CMS posterior probability threshold levels were reduced to 0.4 (ref. 5). CMS for mouse data were called using the ‘Option C’ classification in the MmCMS R package29 (v.0.1.0). CRIS classification was made using CRISclassifier (v.1.0.0) at default settings3. Unclassified samples were determined using the recommended Benjamini–Hochberg false discovery rate of >0.2. For iCMS bulk classification6, the iCMS gene signatures were extracted to create the iCMS template, which was subsequently used with the nearest template prediction method embedded in CMScaller (v.2.0.1)56. Samples above a false discovery rate of 0.05 were classed as ‘unknown’. In the scRNA-seq dataset, iCMS labels were used as previously defined. Furthermore, iCMS2-high and iCMS3-high labels were created for each cell using the ssGSEA score from the iCMS gene signatures and selecting top tertiles.

Immunohistochemistry and digital histology scoring

Tissue microarrays were scanned at ×20 magnification and imported into QuPath57 (v.0.2.3). The suitability for inclusion of individual Ki67+ immunohistochemistry-stained cores was determined by manual visual assessment of the scanned images, after application of the QuPath tissue microarray dearraying tool. Color deconvolution was applied to separate stains, followed by tissue detection (pixel threshold, resolution, low (7.96 µm per pixel); channel, average channels; Gaussian prefilter; smoothing sigma, 2.0; tissue threshold, 235). Cells were detected within the annotated tissue (requested pixel size, 0.5; nucleus background radius, 15.0; median filter radius, 0.0; sigma, 2.5; minimum area, 10.0; maximum area, 300.0; threshold, 0.1, maximum background intensity, 1.0; exclude DAB, false; cell parameters, default; general parameters, default) and smoothed (radius, 25 µm). An object classifier (random trees; default settings) was trained by examples of annotated tumor epithelium and stroma. Set cell intensity classification (nucleus: DAB OD mean, 0.18) to differentiate positive and negative tumor epithelium and stroma. Additionally, an H-score was also generated by setting a three-tier cell intensity classification (nucleus: DAB OD mean, 0.10; 0.25; 0.42). This resulted in a total of 354 patients with matched PDS call and Ki67+ assessment. The total tumor area was also enumerated and assessed based on the digital histology scoring using the HALO platform (Indica Labs, Albuquerque, NM, USA) on the H&E whole slide images (WSIs) from the FOCUS cohort (n = 356)58.

Digital image-based PDS classifier

We developed a set of deep-learning classifiers and analyzed their performances using WSIs from the FOCUS and SPINAL cohorts. After rejecting images of poor quality and images from patients with undefined PDS calls after pathologist review, we used a dataset of 997 WSIs of H&E-stained resection specimens from 520 patients (PDS1, n_slides = 329; PDS2, n_slides = 314; PDS3, n_slides = 354). We conducted experiments under a fivefold cross-validation protocol: for each fold, we split the data according to a 60–20–20 training–validation–test distribution such that classes and cohorts were stratified. We made sure that the validation and test splits did not overlap across the five folds and that images from the same patient were always in the same split. Tumor regions were manually annotated in all WSIs; we then restricted our experiments to the use of image data from these regions to prevent potential classification bias from non-tumor regions. The tumor regions of these WSIs were tiled into sets of image patches of size 318 × 318 pixels, extracted at ×5 magnification (resolution ~2 μm per pixel) with 50% overlap. We used a customized 50-layer ResNet as a deep-learning architecture to process the input image patches and output a probability density over the PDS classes. Each image patch was labeled using the PDS class of their WSI of origin, and our models were trained to maximize the output probability for the target class (minimization of the cross-entropy loss) using mini-batches of size 16. At inference time, we applied the trained models on all the generated tiles from a given WSI and then averaged their predicted probability densities to produce slide-level probability estimates. We then selected the class with the highest relative probability score as the image-based PDS call for this WSI. This approach was based on the weakly supervised learning protocol proposed within the image-based CMS study16. The classification performance of the trained models was systematically assessed using the test patrician for each fold.

Mutation and copy number profiles

Mutational and copy number associations between PDS were determined, when available, across at least four different CRC cohorts: FOCUS11, GSE39582) ref. 2), SPINAL (S:CORT cohort) and TCGA COREAD42. The proportion of driver mutations and their variants (KRAS, BRAF and TP53) were examined in detail across these cohorts. The mutational data for TCGA COREAD were retrieved from the GDC with the TCGAbiolinks R package (v.2.25.2) and analyzed using the maftools R package (v.2.12.0)59. Oncoprint was also used to interrogate and visualize the genetic alternations in key driver genes along signaling pathways, including WNT, MAPK, PIK3CA, cell cycle and TGF-β pathways, using the ComplexHeatmap (v.2.12.0) and circlize (v.0.4.15) R packages60. For the FOCUS and SPINAL cohorts, copy number chromosomal arm calls and copy number estimations per gene were available via the S:CORT consortium. For TCGA COREAD, the GISTIC data were accessed through the GDC data portal42.

Survival analyses

Two different cohorts (GSE39582 and PETACC-3) were explored for clinical and prognostic evaluation. Only PDS samples were considered for the analysis, excluding ‘mixed’ samples. Moreover, patients with missing information on RFS status, relapse-free months or chemotherapy treatment status, and patients with records dating back less than one month were excluded from all survival analyses in GSE39582. The Cox proportional hazards method was also performed to calculate the hazard ratio and confidence intervals for statistical group comparisons. The analysis for the PETACC-3 clinical trial (NCT00026273) was performed by P.K.T. All survival analyses and visualizations were carried out with the survival (v.3.2-13), survminer (v.0.4.9) and ggplot2 (v.3.3.6) R packages.

Single-cell human data analyses

The count expression matrix for tumor epithelial cells was normalized using SCTransfrom in the Seurat R package54 (v.4.1.1) with method = ‘glmGamPoi’ implemented from the glmGamPoi R package (v.1.8.0). Using Seurat functions, the data were subsequently clustered and visualized using uniform manifold approximation and projection (UMAP). Additionally, the n = 626 PDS-specific gene sets were used to generate a single-cell matrix of ssGSEA scores using the enrichit function from the escape R package. The matrix was used as an assay in the Seurat object, in which it was clustered and visualized in a UMAP plot. Cells were defined as ‘high’, ‘mid’ or ‘low’ based on tertile using MYC targets and PRC targets ssGSEA scores, whereby MYC-high–PRC-low were considered as MYC targets-high and PRC-high–MYC-low were considered as PRC targets-high. Similarly, in addition to the previous classification of iCMS on the data, iCMS gene signatures were used to generate iCMS2 and iCMS3 ssGSEA scores and, with the same approach, iCMS2-high and iCMS3-high were also defined. In addition to UMAP visualization from Seurat, the data were also visualized with the high-dimensionality reduction PHATE method61 using the phateR (v.1.0.7) R package.

Trajectory inference analysis and differentiation potency

The single-cell fate mapping and trajectory inference were examined using the Python-based CellRank (v.1.5.2) method39 in both single-cell datasets. To maintain consistency in both datasets, CytoTRACEkernal was used to compute pseudotime, and the information was projected onto the UMAP with arrows displaying RNA velocity-like directionality towards increasing differentiation status.

Further assessment of differentiation status was performed using the SCENT (v.1.0.3) R package, which computed the correlation of connectome and transcriptome scores using a human or murine version of the net17Jan16 protein–protein interaction network for both scRNA-seq data, which highlights the differentiation potency of an epithelial cell62. Waddington-like landscapes were visualized in 3D using SideFX Houdini 19.5 as previously described28.

Calculation of the SMI

The SMI provides a method of transcriptomic measure along the stem–differentiation continuum in association with stem-like or differentiation-like properties. SMI can be calculated from the gene expression profile of both bulk tumor tissue and single-cell data and is simply the difference between the PRC targets and MYC targets ssGSEA score scaled between a value of −1 and 1. The method has also been implemented in the PDSclassifier R package in the form of the calculateSMI function.

Statistics and data visualization

Data interrogations, analyses, visualizations and interpretation were mostly processed using R (v4.2.1) in RStudio and Python (v.3.9). Statistical analyses were performed in R using the stats (v.4.2.1) or ggpubr (v.0.4.0) package for plots and included the two-sided Wilcoxon rank-sum test, Kruskal–Wallis rank-sum test, Fisher’s exact test and Pearson’s correlation coefficient test. For copy number by arm analysis, Pearson’s chi-squared test post-hoc analysis was performed using the chisq.posthoc.test R package (v.0.1.2), and Benjamini–Hochberg adjusted P values were determined using the p.adjust function in the stats R package. Other R packages that were used for data analysis and visualization include ggtern (v.3.3.5), ComplexHeatmap (v.2.10.0), circlize (v.0.4.15), umap (v.0.2.8.0), ggplot2 (v.3.3.6), patchwork (v.1.1.1), riverplot (v.0.10), ggforce (v.0.3.3) and RColorBrewer (v.1.1-2).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.