Image-based consensus molecular subtyping in rectal cancer biopsies and response to neoadjuvant chemoradiotherapy

The development of deep learning (DL) models to predict the consensus molecular subtypes (CMS) from histopathology images (imCMS) is a promising and cost-effective strategy to support patient stratification. Here, we investigate whether imCMS calls generated from whole slide histopathology images (WSIs) of rectal cancer (RC) pre-treatment biopsies are associated with pathological complete response (pCR) to neoadjuvant long course chemoradiotherapy (LCRT) with single agent fluoropyrimidine. DL models were trained to classify WSIs of colorectal cancers stained with hematoxylin and eosin into one of the four CMS classes using a multi-centric dataset of resection and biopsy specimens (n = 1057 WSIs) with paired transcriptional data. Classifiers were tested on a held out RC biopsy cohort (ARISTOTLE) and correlated with pCR to LCRT in an independent dataset merging two RC cohorts (ARISTOTLE, n = 114 and SALZBURG, n = 55 patients). DL models predicted CMS with high classification performance in multiple comparative analyses. In the independent cohorts (ARISTOTLE, SALZBURG), cases with WSIs classified as imCMS1 had a significantly higher likelihood of achieving pCR (OR = 2.69, 95% CI 1.01–7.17, p = 0.048). Conversely, imCMS4 was associated with lack of pCR (OR = 0.25, 95% CI 0.07–0.88, p = 0.031). Classification maps demonstrated pathologist-interpretable associations with high stromal content in imCMS4 cases, associated with poor outcome. No significant association was found in imCMS2 or imCMS3. imCMS classification of pre-treatment biopsies is a fast and inexpensive solution to identify patient groups that could benefit from neoadjuvant LCRT. The significant associations between imCMS1/imCMS4 with pCR suggest the existence of predictive morphological features that could enhance standard pathological assessment.


INTRODUCTION
Important progress has been made in the treatment of high-risk rectal cancer (RC) patients in the past decades [1].Prospective clinical trials of neoadjuvant chemoradiotherapy have shown that approximately 15% of patients can reach pathological complete response (pCR) by neoadjuvant chemoradiotherapy (CRT) [2] and that response rates can be further increased to about 30% pCR by total neoadjuvant treatment (TNT) [3][4][5][6].Organ sparing approaches may be realistic for patients with pCR, and can have major impact on long-term quality after definite treatment.However, intensified neoadjuvant treatment is associated with increased frequency and severity of adverse events that can present during or after CRT [7].At the same time, responses are heterogeneous with 30-40% of patients presenting with tumour regression of different grades while 7-30% of patients classify as non-responders [8,9].Interpretable predictive biomarkers to guide personalized treatment are therefore of utmost importance to further improve patient selection for intensified treatment regimens and treatment outcomes [10].The setting for the identification of predictive markers in the clinical pathway of RC patients remains highly challenging: Turnaround time needs to be short and only scarce material from diagnostic biopsies is available for study, limiting the utility of molecular and functional analysis methods.
The Consensus Molecular Subtypes (CMS) define four distinct subtypes of colorectal cancer (CRC) by common patterns of gene expression [11].The prognostic value of CMS classes has been reported in several studies [12,13], yet their association with treatment outcome is an open research topic [11].Recently, Domingo et al. [14] have described an association between pCR to neoadjuvant chemoradiotherapy and transcriptional CMS signatures in diagnostic RC biopsies, suggesting that molecular classification may also be used as a predictive biomarker for treatment stratification.Overall, early profiling of CMS classification in the clinical pathway promises to be highly relevant for personalized treatment decisions [15].Limitations of transcriptional CMS classification include a high cost and time requirement for RNA sequencing, a high failure rate due to the small amount of tumour material available and difficulty to standardize single samples [13,16].In addition, even successful transcriptome generation shows higher frequencies of unclassified cases in biopsies than resections [13,17].
Machine learning has been shown to be a promising alternative solution for predicting molecular signatures across multiple cancer types from diagnostic histopathology sections [18].In particular, Sirinukunwattana et al. [13] showed that deep learning models can predict image-based CMS (im-CMS) classes that match transcriptional CMS calls directly from hematoxylin-and-eosin (H&E) stained histopathology whole slide images (WSIs) of CRC tumour specimens, using three independent cohorts.This prior work also highlighted the potential of imCMS to stratify patients for whom transcriptional CMS failed.Yet, evidence of the generalization of imCMS on biopsy cohorts was limited by the need for domain-adversarial training and fine-tuning, requiring labeled data from the target cohort, which is a shortcoming for the implementation of imCMS in the clinic.
Here, we set out to show whether imCMS classifiers can be trained without domain adaptation and perform reproducibly in independent pre-treatment biopsy datasets.By investigation of real-world cohorts sourced from the Medical Research Council (MRC) and Cancer Research UK (CRUK) Stratification in Colorectal cancer (S:CORT), we estimated via simulation experiments how many cancer biopsies are sufficient to achieve near-optimal imCMS classification performance.We then assessed whether imCMS classification is associated with pCR to neoadjuvant long course chemoradiotherapy (LCRT) with single agent fluoropyrimidine treatment in two independent held out diagnostic biopsy cohorts, comprising a total of 169 advanced RC patients with outcome data.

Study Design
The study design, cohorts and aims are outlined in (Figure 1) and detailed methods for all experiments and statistical analysis are provided in (Supplementary Material and Methods).

Rectal Cancer LCRT Treatment Protocol
All patients from the GRAMPIAN, ARISTOTLE and SALZBURG cohorts included in the study received a "standard" treatment protocol for advanced RC by pelvic irradiation (45-50.4Gy in 25 fractions over 5 weeks) combined with Capecitabine (825mg•m −2 BD on treatment days).Detailed information on these cohorts is available in Supplementary Material and Methods.The primary endpoint for all cases was pCR after completion of LCRT as assessed by histopathological analysis of the surgical RC resection specimen by an expert gastrointestinal pathologist according to established guidelines.

Processing of Tissue Samples and Slide Scanning
For the four S:CORT cohorts, serial 5-µm sections were cut from one pathologist-selected representative tumour block for H&E staining followed by up to nine unstained sections for RNA extraction.H&E slides were reviewed by an expert gastrointestinal pathologist and invasive cancer regions were annotated to guide RNA and DNA extraction by macrodissection.Regions of extensive necrosis and non-tumour tissue were excluded according to standard practice for molecular tumour profiling.All H&E slides were scanned on an Aperio scanner at magnification 20× (0.5µm•px −1 ).All digital slides were re-reviewed and a board-certified pathologist annotated tumour regions while excluding areas containing folds or debris.The data filtering procedure, cohort sizes and additional details for each cohort are summarized in (Supplemental Figure S1).

Transcriptional CMS Classification
For the S:CORT cohorts, RNA expression was obtained by microarray (Xcel, Affymetrix), and raw CEL files underwent the robust multiarray average normalization with the Affymetrix package (v1.56.0) [19] in R [20].Batch-corrected transcriptional CMS calls were derived for each sample using the protocol described in Supplementary Material and Methods.Any variation in the number of patients and slides in the same cohorts used in the study by Sirinukunwattana et al. [13] stems from the updated procedure for transcriptional CMS.For TCGA [21], the same transcriptional CMS calls were used as previously reported by Sirinukunwattana et al. [13].

Transcriptomic Immune Profiling
The batch-corrected version of the S:CORT transcriptome was used to derive estimates of immune and stromal cell types with MCPcounter [22] and Xcell [23] by applying the original R packages.Although scores for both signatures are not comparable between cell types, they were scaled from 0 to 1 to facilitate visualization.

Deep Learning imCMS Classification
To develop a WSI-based CMS classifier, we considered the model proposed by Sirinukunwattana et al. [13] as a baseline (imCMSv1), and re-implemented a new version (imCMS v1.5) that facilitates the training procedure, and reproducibility of our experiments while keeping comparable performance and interpretability.The version 1.5 of imCMS is based on the three-stage process illustrated in (Figure 1D).First, manually annotated tumour regions of an input WSI are tiled with patches of size 318×318px at magnification 5× (∼2µm•px −1 ) with 50% overlap, and all tiles with less than 50% overlap with the annotated tumour regions were excluded.Then, all the extracted patches are fed as input to a trained deep learning model that outputs probability scores for each target CMS class.Third, all tile-level probability scores are averaged to produce slide-level probability scores, and the class with the highest score is considered as the imCMS call prediction for the input WSI.
We kept the first stage identical to imCMSv1 but opted for a fixed magnification for tile extraction at magnification 5× based on the results of [13], suggesting optimal performance in both resection and biopsy images.With the last two stages, we moved from a count-based assumption to a collective assumption for determination of imCMS calls [24], thus enabling more fine-grained contributions of each tile to the slide-level predictions.Further, for imCMS1.5,we changed the pre-trained InceptionV3 backbone architecture used in imCMSv1 by a randomly initialized customized ResNet architecture, to ensure that our results can be reproduced without having to rely on a pre-training procedure.See Supplementary Material and Methods for more details about the implemented model architecture and training procedure.

Effect of multi-cohort training on imCMS performance
With the goal of developing a generic imCMS classifier that can perform well on both resection and biopsy specimens (without the need for a domain adaptation procedure as required in imCMSv1 [13]), we first assessed the effect of combining resection and biopsy cohorts for training the models, while keeping independent resection and biopsy cohorts held out for evaluation purposes.We designed five experiments with different combinations of cohorts (Figure 1B).For each experiment, the development datasets were split into five bins (with stratified cohorts and classes at the patient level) to build five folds of training/validation splits by using each bin once as an internal validation set (for model selection purposes), while the remaining four bins are used for training.Classification performances were separately measured for each trained model using the corresponding held out test sets, as well as for the ensemble model that combines the output of the five models (Figure 1C).
The macro-average area under the receiver operating characteristic curves (AUROC) of the trained models are reported in (Figure 2), the detailed ROC curves, and confusion matrices of the best performing trained models are reported in (Supplemental Figure S2).For each test set, we achieved the best performance when both resection and biopsy images were used jointly for training.The ensemble model resulting from the experiment [E3a] (TCGA: macro-average AUROC .813;ARISTOTLE: macro-average AUROC .798)was used for subsequent analysis of outcome association.

Association between imCMS and pCR to neoadjuvant LCRT
To investigate the association between imCMS calls and pCR to neoadjuvant LCRT with single agent fluoropyrimidine, we identified two RC cohorts treated with the same treatment regimen (pelvic irradiation 45-50.4Gy in 25 fractions over 5 weeks, combined with single agent Capecitabine on treatment days) (Supplementary Material and Methods).We applied the imCMS ensemble model resulting from experiment [E3a] to WSIs of all pre-treatment biopsies of the ARISTOTLE and SALZBURG cohorts with available pCR and pre-treatment T/N stage data.imCMS calls at the case level were defined as the majority vote of the calls of the five different trained models that constitute the ensemble model.Cases with undecided majority were classified as "mixed" (n=2).The analysis of imCMS calls with clinicopathological data and treatment outcomes was conducted with 114 patients of the ARISTOTLE cohort including 24 patients with pCR to neoadjuvant LCRT, and 55 patients of the SALZBURG cohort with 6 patients with pCR to neoadjuvant LCRT.Patients were split into groups based on their predicted imCMS class.Odds ratios for pCR were calculated separately for each group and adjusted by T-stage, N-stage and cohort.Significant associations (p-value<0.05) were found for both imCMS4 with lack of pCR (OR 0.25, 95%CI 0.07-0.88,p=0.03) and imCMS1 with pCR (OR 2.69, 95%CI 1.01-7.17,p=0.048) (Figure 3), suggesting that imCMS calls generated from pre-operative biopsy material associate with response of the primary tumour to combined chemoradiotherapy.
Development Sets Numbers are macro-averages of the AUROC for ensemble models of five models trained and validated with five different folds of the same development sets.Each model of an ensemble model was evaluated separately: the mean and standard deviation of the corresponding macro-average AUROCs is reported in parenthesis.The ensemble model resulting of the experiment [E3a] was used for subsequent analysis of association between imCMS and pCR to neoadjuvant LCRT.

Simulation: Effect of biopsy sampling on imCMS classification
To assess how reliable biopsy-based imCMS predictions are in comparison to resection-based imCMS predictions in a control scenario, we estimated the change in classification performance of imCMS as a function of the number of biopsy fragments in a sample.It is theoretically possible to conduct an experiment in which we would generate a dataset with many tissue fragments biopsied from each patient's tumour site along with paired resection specimens and associated CMS calls, then comparing imCMS classification performance when combining different numbers of fragments.However, since repeated sampling cannot be ethically justified, we argue that such an experimental protocol can be approximated via the generation of virtual biopsy fragments randomly sampled from existing resection images.
We generated 26 simulations of biopsy datasets, each with a varying number of biopsy fragments and sourced from either of the two datasets of resection specimens, FOCUS or SPINAL.For every WSI in each resection dataset, we collected 10 000 subsets of randomly cropped non-overlapping biopsy-shaped images.These subsets were concatenated to simulate different biopsy sampling scenarios (Figure 4B-C).Cropping was performed using true shapes of tumour biopsy fragments annotated in the GRAMPIAN and ARISTOTLE cohorts.Here, each subset simulates a random biopsy sampling event that an endoscopist could potentially perform including both superficial and deep samples.Examples of simulated biopsy samples from a source resection specimen are illustrated in (Figure 4B-C).We designed these simulated datasets to approximate the characteristics of actual biopsy datasets under the assumption that, (a) sampling locations are uniformly distributed in a tumour region, (b) the shape of the biopsies is random and independent of the sampling location, (c) sampling locations of consecutive biopsy fragments are random and non-overlapping.Consequently, we argue that the distribution of images of such a simulated biopsy from a single resection image closely approximates the distribution of images of actual random biopsy samples excised from the same tumour specimen.
To assess the performance of imCMS in the simulated biopsy datasets, we made sure to evaluate models on data not seen during training by using the training and test partitions described in (Figure 1E).Each trained imCMS model resulting from the training procedures [E4a] and [E4b] was applied to all its respective simulated test sets.As a result, we observed a rapid increase of classification performances when the number of fragments in a biopsy sample increases until convergence to the performance level that the models achieved when using the resection data without sampling (Figure 4D).For both test conditions, we report a AUROC difference lower than 3% of the AUROC achieved with the original resection data when the number of simulated tumour biopsy fragments in a sample is above five, suggesting that reliable imCMS classification can be achieved at diagnosis in a large fraction of cases [25].

Stability of distribution of cell types between resection and biopsy samples
As endoscopists excise biopsies in specific targeted regions of tumours, this procedure is subject to variance between operators as well as patient-specific factors.This can potentially induce a variation of morphological information in images of biopsies in comparison to resection specimens.This may particularly affect heterogeneously distributed factors in the tumour microenvironment such as cancer, immune and stromal cell populations.We therefore investigated whether there are systematic differences in microenvironment composition between biopsies and resection specimens of CRC specimens.Specifically, we confirmed the absence of confounding factors with real-world biopsies and investigated whether systematic differences of cell distribution exist between biopsy samples and resection specimens by deconvolution of stromal, immune and tumour-related signatures from transcriptomic data in 529 biopsies and 565 resections from four S:CORT cohorts (FOCUS, SPINAL, GRAMPIAN and ARISTO-TLE).First, we generated absolute abundance estimates for key immune and stromal cell types with two  different tools (MCPcounter [22] and xCell [23]) using bulk transcriptomic data.With both methods, we detected comparable levels of key immune lineages (T-cells including CD8+ and differentiated cytotoxic T-cell subpopulations, B-cells, NK-cells, monocytes, myeloid dendritic cells and neutrophils) as well as endothelial and stromal cell populations in biopsies and resections across CMS subtypes (Figure 5).This result suggests that biopsies may be representative of the broad abundance of cell types in the tumour microenvironment compared to resection specimens, according to their transcriptomic CMS classes.

Tissue microenvironment patterns related to CMS classes
CMS classification is driven by microenvironment and tumour-related factors.To investigate consistency of imCMS classification in RC biopsy material with underlying biological patterns, we visualized image tiles with the highest predicted probability score for each imCMS class in biopsy material (Figure 6).In agreement with prior studies, we found a consistent pattern of high stromal content and dissociative tumour growth (tumour budding) in tumour tiles classified as imCMS4 from all three cohorts.imCMS1 tiles more frequently contain lymphocytic infiltration and focal mucinous differentiation, although this feature was less frequent than previously observed in CRC resection specimens (compare TCGA, bottom (Figure 6)), which is consistent with a lower representation of mucinous differentiation in RC. imCMS2 and imCMS3 features were consistent with previously described patterns, with imCMS2 showing epithelial-rich glandular and cribriform tumour growth with focal comedo-like necrosis while imCMS3 was characterized by glandular differentiation with tubular growth, focal mucin and a minor villiform component.Bioinformatic deconvolution of cell composition supported the observed tile-level associations.In biopsy specimens classified as imCMS4, a significantly higher frequency of fibroblast signatures were observed, while imCMS1 samples showed a tendency towards increased immune signatures, supporting biological interpretability of imCMS predictions at the case level.

DISCUSSION
There are currently no clinically established predictive markers for response to neoadjuvant chemoradiotherapy in RC.Decision making for RC patients and strategies for assignment to neoadjuvant treatment protocols are still taken at the cohort level and it is challenging to predict which patients will respond to neoadjuvant chemoradiotherapy.Current pathology assessment of pre-operative biopsies is limited to the confirmation of cancer diagnosis and a limited panel of molecular studies such as testing for mismatch-repair deficiency (MMRd) and the testing for common driver mutations if clinically desired.
Recent studies have demonstrated a strong benefit of neoadjuvant treatment of patients with MMRd RC with PD-1 Blockade, but with a frequency of approximately 5-10%, this genotype is infrequent in RC [26].Other studies have suggested that the absence of tumour budding [27,28], low stromal content [28] or the quantification of cytotoxic T-cells [29] may aid in identifying patients with favourable prognosis, but these methods are based on subjective visual features and incompletely capture the complex biology related to neoadjuvant chemoradiotherapy response.Better methods to supply a biologically informed and clinically relevant classification of RC biology from pre-operative biopsy material are therefore needed.
In an earlier study [13], we developed an image-based approach to predict the consensus molecular subtypes from WSIs of clinical CRC specimens and provided the proof of principle for generating imagebased molecular calls even with minimum input material.In a clinical setting, such image-based morphomolecular classifiers have the benefit to offer a level of interpretability linked to known molecular profiles, as opposed to black-box classifiers whose interpretation is limited.The predictive ability of image-based molecular calls for response to chemoradiotherapy remained an open question.For the current study, we re-implemented the imCMS analysis pipeline and trained classifiers using a combination of CRC resection and biopsy cohorts, across different stages.We found that combining these datasets was a viable option to enable generalization to external biopsy cohorts without having to rely on domain adversarial training or fine-tuning as previously required [13].This incremental improvement showed that such computational pipelines can directly and accurately classify images in new cohorts without the need for additional data or re-training.Despite distributional shifts between training cohorts (e.g., biopsy sampling artefacts, relative proportions of different tissue morphology) can make models subject to learning cohort-specific features that can limit classification performance, we observed a consistent improvement of performance when combining resection and biopsy modalities for training.This result is the most striking when comparing the performances of the models tested using TCGA: models trained solely with biopsy data (GRAMPIAN or ARISTOTLE) were not able to generalize unless resection data (FOCUS and SPINAL) was used.Furthermore, we conjecture that combining these datasets enables learning of modality-agnostic features that are invariant to inter-cohort variations.This hypothesis is supported by our visual assessment of the image patches with highest probability scores from different patients across the TCGA, ARISTOTLE and SALZBURG cohorts, which illustrates the stability of morphology of top-contributing tiles across independent cohorts.
The significant associations between imCMS1 and pCR to neoadjuvant LCRT and between imCMS4 and lack of pCR to neoadjuvant LCRT suggest the existence of predictive morphological features that our models were able to capture.This result paves the way for future work for the validation of such a tool as a support for clinical decision in the neoadjuvant treatment setting.Specifically, imCMS1 calls could identify patients for total neoadjuvant treatment with favorable prognosis.Due to an enrichment for immune-activated cases and MMRd cases, imCMS1 cases could also have a greater likelihood to achieve complete remission with immune-checkpoint blockade [30,31], but this association requires further study.In contrast, patients classified as imCMS4 could be selected for clinical trials adding additional chemotherapy or biological agents to CRT, as the likelihood for response to standard LCRT protocols with single agent fluoropyrimidine is low.This result is aligned with current understanding of CMS4associated biology: based on the high stromal content and TGF-β signatures in this subset, resistance to cytotoxic treatment is common [32].For imCMS4 patients, tumour-stroma-based therapeutic targeting may therefore offer a potentially efficacious and biologically informed treatment alternative [30,31].Ten Hoorn et al. suggest that prospective studies are required to further establish the CMS taxonomy and confirm treatment efficacy by CMS subtype in clinical practice [12].However, CRC subtyping via current RNA sequencing technologies is deficient, in scenarios with low tumour material: other modalities such as biopsy imaging thus constitute a promising, fast and cost-effective alternative.Using an image-based approach, we demonstrate successful imCMS classification for all samples in the present study, compared to technical failure rates of up to 35% using state-of-the-art panel sequencing approaches [13].
Current ESGE guidelines [25] recommend sampling a minimum of six biopsies for the diagnosis of colorectal carcinomas, to ensure sampling of tumour fragments, and to reliably represent the overall tumour phenotype.Yet whether this number of biopsies is optimal to determine the classification of CRC according to biological subtypes remains an open question.We thus undertook comprehensive simulation experiments on existing fully digitized clinical cohorts to capture the morphology related to transcriptional CMS calls and to address whether clinically established sampling protocols are sufficient to describe tumour heterogeneity at the gene-expression level.As expected and due to the spatially heterogeneous nature of CRC tumours [33,34], our results suggest that sampling less than five tumour biopsies is not sufficient to properly predict the CMS call that would be obtained from an equivalent resection specimen of the same tumour.Our results corroborate the 2021 ESGE recommendation as our experiments show on two independent external test datasets of 147 and 266 patients, that five or more standard tumorous biopsy fragments are sufficient to reliably capture the global tumour phenotype needed to achieve CMS classification performance with fidelity close to full resection specimens.Thus, the current sampling protocol established in clinical practice is likely sufficient to capture tumour biology from several cancerous fragments, and to enable informed patient stratification by computational analysis.
A limitation of the current imCMS pipeline is the reliance on manual annotations of tumour regions as a pre-processing step, making the whole pipeline semi-automated.We suggest that future work should investigate strategies to either fully automate this step or to accurately classify WSIs independently of localized tumour regions.The pipeline of imCMSv1.5 was designed as an incremental extension of the work of Sirinukunwattana et al. [13], yet with the goal of increasing classification and generalization performance.Other deep learning architectures and training procedures such as recent proposed solutions to biopsy image classification problems [35,36] should be considered in future work.A weakness of our simulation experiments is the strong assumption for randomness of the sampling distribution of biopsy fragments and we concede that the distribution of true biopsy fragments may be different from our simulation, e.g., tumour fragments from the lumen are more likely to be sampled in a real-world scenario.We also recognize that the guidelines requiring six biopsies per case is designed to ensure there is a high chance of identifying invasive cancer, rather than slough or non-invasive malignancy.Here, our approach shows five or more biopsy fragments containing invasive tumour are required to achieve optimal imCMS classification.Although we did not observe any significant difference in terms of distribution of cell types between the studied resection and biopsy cohorts, such sampling information should be accounted for in future work.
To conclude, we found that deep learning models can automatically capture the morphology associated with transcriptional CMS in imaged biopsies from unseen cohorts.We found that patients stratified according to biopsy-based imCMS respond differently to neoadjuvant LCRT.Beyond CMS in CRC, the on-going development morpho-molecular classification models across cancer types and molecular signatures offer a new type of cost-effective computational tools to support clinical decision making [18].This surrogate for transcriptional analysis can also be of use for research studies with restricted funding, or to revisit existing trial cohorts by studying associations between image-based CMS classification and clinical variables of interest without the need for extra tissue material.computational pathology to provide biopharma services.TM: consultant for GTL.FH: honoraria from Pierre Fabre, Amgen, Servier, Daiichi Sankyo, BMS, Merck, Sanofi; travel support from Servier, BMS, Roche, Merck, Pharmamar, Pfizer, Pierre Fabre, Sanofi, Daiichi Sankyo, Gilead; scientific advisory role for Servier, Daiichi Sankyo, BMS; holds stock options in Guardant Health.DN: consultant for Boerhinger Ingelheim, Lilly.All other authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript.This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.

Additional Details on Clinical Cohorts
To train and validate the deep learning models developed in this study, H&E-stained tissue specimens of diverse stage and clinical settings were used from three cohorts from the Medical Research Council (MRC) and Cancer Research UK (CRUK) Stratification in COloRecTal cancer (S:CORT) programme: FOCUS (colon and rectal resections, n=365 patients; n=704 slides), randomised clinical trial testing different strategies of sequential and combination chemotherapy for patients with advanced CRC after surgical resection (MRC FOCUS, ISRCTN79877428) [37].
SPINAL (colon and rectal resections, n=206 patients; n=410 slides), patients with primary tumors without previous treatment balanced according to T stage, N stage, location (colon/rectum) and recurrence/metastatic disease (either at diagnosis or during follow-up).Samples were obtained from Birmingham and Manchester Hospitals, United Kingdom and the COIN clinical trial (ISRCTN27286448) [38].
GRAMPIAN (rectal preoperative biopsies, n=233 patients; n=414 slides; sequential cohort of highrisk RC with threatened or involved circumferential rectal fascia on pre-treatment MRI scan treated at Aberdeen Royal Infirmary, United Kingdom.For investigation with clinical outcomes, we used H&Estained tissue from the UK national clinical ARISTOTLE trial and a sequential cohort of high-risk RC with threatened or involved circumferential rectal fascia on pre-treatment MRI scan treated at the University Clinic in Salzburg, Austria (SALZBURG).All patients in the The ARISTOTLE and SALZBURG cohorts were strictly selected to have undergone the same treatment protocol for advanced rectal cancer by pelvic irradiation combined with single agent fluoropyrimidine.
ARISTOTLE (rectal preoperative biopsies, n=300 patients; n=300 slides, (base cohort)) UK national clinical trial (ISRCTN09351447) [39] which compared the efficacy of standard CRT with (intervention) or without (control) irinotecan in high-risk RC with threatened or involved circumferential rectal fascia on pre-treatment MRI scan.Cases were selected from the control arm of the ARISTOTLE trial and in whom biopsies were available for molecular analysis.Pathological response was assessed centrally according to a pre-specified pathology protocol using the Dworak method (NPW Leeds).
SALZBURG (rectal preoperative biopsies, n=61 patients; n=61 slides); sequential cohort of highrisk RC treated at the III rd Department of Internal Medicine of the Paracelsus Medical University Salzburg, Salzburg, Austria; Patients received neoadjuvant long-course chemoradiotherapy with single agent capecitabine as detailed in "RC Treatment".Pathological response was assessed by detailed histopathological assessment of the resection specimen, undertaken 6-12 weeks after CRT using the Dworak method.
Clinical data was anonymized by S:CORT number and was provided comprising demographic data, baseline stage generated from pre-treatment pelvic MRI scans and CT scans TAP, and outcome data.

Ethical Approval
All samples in the S:CORT cohorts were obtained following individual informed consent and ethical approval by the National Research Ethics Service in the United Kingdom (ref 15/EE/0241; IRAS reference 169363).The SALZBURG cohort was reviewed by the ethical board of the provincial government of Salzburg, Austria (415-E/2343/5-2018), although under Austrian law informed consent is not needed for research use and is therefore not available for all cases.

Statistical Analysis
Four logistic regression models for presence vs. lack of presence of each imCMS class were built for the endpoint of pCR.Models were adjusted by cohort (i.e., ARISTOTLE and SALZBURG) and the clinical confounders pretreatment T stage and pretreatment N stage, determined from pretreatment MRI assessments (i.e., at the time of clinical decision).The category 'mixed imCMS' was not analysed as it only contained two cases.p-values <0.05 were considered statistically significant.Statistical analyses were conducted using R [20].

Figure 1 .
Figure 1.Study Flow Figure.A) Description of the image dataset used in this study.B) Table of the cohort-level partitions of the dataset for training and validation of the models compared in this study.C) Illustration of the 5-fold experimental setup and ensembling approach investigated in this study.D) Flowchart of the 3-stage imCMS classification pipeline investigated in this study.E) Table of the cohort-level partitions of the dataset used for training and validation of biopsy simulations (left), and illustration of the biopsy sampling simulation protocol investigated in this study.

Figure 2 .
Figure 2. Classification performances of imCMS models trained and evaluated using data from different combinations of cohort.

imCMS1 14 .Figure 3 .
Figure 3. A) Illustration of the analysis of association between imCMS groups resulting from the application of an ensemble model (experiment [E3a]) and outcome in the independent holdout cohorts ARISTOTLE and SALZBURG.B) Comparison of odds ratio of pCR outcome by neoadjuvant LCRT between different groups of patients of the ARISTOTLE and SALZBURG cohorts based on imCMS calls derived from pre-operative biopsy images.Patients with biopsies classified as imCMS1 were significantly more likely to achieve pCR by LCRT (OR 2.69; 95%CI: 1.01-7.17,p=0.048) than all other imCMS groups, whereas patients with imCMS4 calls rarely achieved pCR (OR 0.25; 95%CI: 0.07-0.88,p=0.031).Two patients had mixed imCMS classification calls and were not shown in the odds ratio comparison.

Figure 4 .
Figure 4. Biopsy Simulation Experiments.(A) Example of a WSI with heterogeneous tile-level imCMS classification.(B) Illustration of a biopsy simulation procedure based on a source imaged resection specimen.Biopsy samples are randomly generated using pre-existing masks of tumor biopsies by groups of m=1 fragments.imCMS calls are generated for each simulated biopsy sample and resulting distributions of class probability for 10 000 simulations are shown on the right.(C) same as (B) with m=3 fragments.The spatial heterogeneity of tile-level prediction (A) explains the observed difference between the distributions.D) Classification performances of trained imCMS models on simulated biopsy datasets with varying number of biopsies per sample.Dots with lines indicate the mean and standard deviation of the macro-average AUROC obtained by five models for each simulated dataset: [E4a] (left); [E4b] (right).The red dotted line indicates a 3% difference score with the macro-average AUROC obtained on corresponding fully imaged resection specimens.

Figure 5 .
Figure 5.Comparison of measures of abundance scores for ten representative cell types between 529 biopsy and 565 resection primary CRC samples of the cohorts used in this study (FOCUS, SPINAL, GRAMPIAN and ARISTOTLE) using the transcriptomebased MCP-counter (A,B) and xCell tools (C,D).Cell enrichment scores were measured both by CMS subgroup (B,D) and sample type (A,C).Cell type distributions by CMS-class do not differ between biopsy and resection samples, indicating that transcriptional CMS calls derived from biopsy samples robustly capture the underlying biology of CRC shown on the actual tissue.

Figure 6 .
Figure 6.Gallery of image patches (side 630µm) from three cohorts (ARISTOTLE, SALZBURG, TCGA).For each cohort, we selected the 16 image patches from different patients, with highest probability score for each imCMS class (based on the ensemble model of experiment [E3a]).Visual interpretation confirms the existence of distinct morphological patterns within each predicted imCMS class.(A), imCMS1: increased lymphocytic infiltrates, poor tumor differentiation, focal mucin; (B), imCMS2: glandular differentiation with cribriform growth patterns and comedo-like necrosis associated with imCMS2; (C), imCMS3: glandular differentiation with ectatic mucin-filled glandular structures in combination with a minor component showing papillary and cribriform morphology; (D), imCMS4: prominent desmoplastic stromal reaction and dissociative tumor growth (tumor budding).

Figure S2 .
Figure S2.Detailed Receiver Operating Curves (ROC) and confusion matrices (CMs) of the trained imCMS models of experiments [E3a] with TCGA (A) and ARISTOLE (C) as test sets, and of experiments [E3b] with GRAMPIAN (B) as test set.Area under the ROC (AUROC) are shown for each CMS class and each of the five trained models of each experiment.Results of majority voting of five models (ensemble model) are reported in the right-most CMs.Cases without absolute majority were classified as "mixed".The reported macro-average F1-scores are derived from the confusion matrices and are ranging from 0 to 1, with 1 indicating perfect classification.