Regression-based Deep-Learning predicts molecular biomarkers from pathology slides

Deep Learning (DL) can predict biomarkers from cancer histopathology. Several clinically approved applications use this technology. Most approaches, however, predict categorical labels, whereas biomarkers are often continuous measurements. We hypothesize that regression-based DL outperforms classification-based DL. Therefore, we develop and evaluate a self-supervised attention-based weakly supervised regression method that predicts continuous biomarkers directly from 11,671 images of patients across nine cancer types. We test our method for multiple clinically and biologically relevant biomarkers: homologous recombination deficiency score, a clinically used pan-cancer biomarker, as well as markers of key biological processes in the tumor microenvironment. Using regression significantly enhances the accuracy of biomarker prediction, while also improving the predictions’ correspondence to regions of known clinical relevance over classification. In a large cohort of colorectal cancer patients, regression-based prediction scores provide a higher prognostic value than classification-based scores. Our open-source regression approach offers a promising alternative for continuous biomarker analysis in computational pathology.


Introduction
The collection and pathological examination of tissue specimens is used for accurate diagnosis of patients with malignant tumors, providing information related to histology grade, subtype, stage and other tumor biomarkers. Digital pathology describes the computational analysis of tissue specimen samples in the form of whole slide images (WSIs). Numerous studies have shown that alterations in individual genes [1][2][3], microsatellite instability [4][5][6], and the expression of individual genes 7 or expression patterns of groups of genes 8,9 can be predicted directly from WSIs. This research area has also enabled genetic changes to be correlated with morphological patterns (i.e. genotypic-phenotypic correlations) 10 , which facilitates the prediction of patient outcome. 11 Consistent with their clinical application, several of these methods have been approved for clinical use by regulatory agencies 12 , to the extent that the prediction of biomarkers from pathological diagnostic workflows based on deep learning (DL) is becoming increasingly relevant, not only in the research setting, but also as a de facto clinical application. 2,12,13
The prediction of genotypic-phenotypic correlations, which involves predicting genetic biomarkers from WSIs, is a weakly supervised problem in DL. To accomplish this task, a DL model correlates phenotypic features from WSIs with a single ground truth obtained from molecular genetic sequencing of tumor tissue at the patient level. Nevertheless, as these WSIs are of gigapixel resolution, neural network processing requires breaking them into smaller regions referred to as tiles or patches. These regions may, however, contain less relevant tissues such as connective tissue or fat, which might not contribute to biomarker predictability.
This problem can be addressed with attention-based multiple instance learning (attMIL) [14][17][18]. To implement this strategy, feature vectors are first extracted from pre-processed tiles. These vectors are then aggregated by a multi-layer perceptron with an attention component, allowing for a patient-level prediction of the WSI.
Despite the current attMIL approach yielding a high accuracy for biomarker prediction from WSIs 15,19,20 , almost all published approaches are limited to classification problems with categorical values (e.g.1,22 Nonetheless, the ground truth of many biomarkers is available as continuous values, which are then binarized prior to being utilized as ground truth for DL. This is true for whole-genome duplications, copy number alterations, homologous recombination deficiency (HRD), gene expression values, protein abundance, and many other measurements. Studies that pursue regression analysis of continuous values often opt for dichotomization or custom thresholds for categorization. For example, prior to modeling, Fu et al. utilized a LASSO approach for the classification of continuous chromosome data into three classes. 10 Schmauch et al. trained a regression model to predict continuous biomarkers and subsequently used percentile thresholds for the evaluation of the models through a categorical representation. 7 However, binarization or dichotomization of these values results in information loss 23 , which presumably limits the performance of DL systems predicting these biomarkers from pathology slides. A more suitable alternative to classification in histopathological WSI analysis would be regression. Regression 24 is a modeling approach used to investigate the relationship between variables, such as morphological features from a WSI, and continuous numerical values, such as genetic biomarkers. To date, there is a paucity of data exploring this approach. A recent study by Graziani et al. presented a novel approach to predict continuous values from pathological images 25 , yet their regression network was not systematically compared with the more-explored classification approach and requires more extensive validation.
In this study, we systematically compared classification- and regression-based approaches for prediction of continuous biomarkers across multiple cancer types. We hypothesized that regression outperforms classification in weakly supervised analyses of pathology hematoxylin-and-eosin (H&E) stained WSIs for biomarker predictability, model interpretability and prognostic capability. In addition to various tumor entities, our work also explores several clinically relevant biomarkers represented as continuous numerical values. As a result, we developed a new contrastively-clustered attention-based multiple instance learning (CAMIL) regression approach, which combines self-supervised learning (SSL) with attMIL, and systematically compared it with the CAMIL classification approach and the regression method proposed by Graziani et al. 25 The evaluation and application of regression versus classification on multiple datasets, organs and biomarkers fills a gap in the computational pathology literature.

Regression predicts HRD from histology
We developed a new regression-based DL approach which combines a feature extractor trained by SSL 26 and an attMIL 14 model (Fig. 1A-C), referred to as contrastively-clustered attention-based multiple instance learning (CAMIL) regression. We tested the abilities of this approach for prediction of HRD directly from pathology images. We chose HRD because it is a pan-cancer biomarker that is measured as a continuous score, but can be binarized at a clinically validated cutoff. We used The Cancer Genome Atlas (TCGA) cohorts for breast cancer (BRCA), colorectal cancer (CRC), glioblastoma (GBM), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), pancreatic adenocarcinoma (PAAD), and endometrial cancer (UCEC) to train a regression DL model for each cancer type and evaluated their performance by cross-validation (Fig. 1D). To mitigate batch effects, which are problematic in the TCGA cohort, we used site-aware cross-validation splits 27 . We found that our CAMIL regression models were able to predict HRD status with an area under the receiver operating characteristic curve (AUROC) above 0.70 in 6 out of 7 tested cancer types. The AUROCs with 95% confidence interval (CI) were 0.78 (± 0.02) in BRCA, 0.76 (± 0.12) in CRC, 0.75 (± 0.40) in GBM, 0.72 (± 0.06) in PAAD, 0.72 (± 0.05) in LUAD, 0.57 (± 0.05) in LUSC, and 0.82 (± 0.03) in UCEC (Fig. 2A, Suppl. Table 1). We validated the models on CPTAC, a set of external validation cohorts, in which images and HRD status were available for LUSC, LUAD, PAAD and UCEC. In these cohorts, the models reached AUROCs of 0.68 (± 0.04) in PAAD, 0.81 (± 0.03) in LUAD, and 0.96 (± 0.01) in UCEC. The lowest AUROC was 0.62 (± 0.06) in LUSC. Together, these data show that regression-based DL can predict HRD status from pathology images alone.

Regression outperforms the state-of-the-art classification-based approach
We compared the performance of our new DL approach, CAMIL regression, against two state-of-the-art approaches: the Graziani et al. regression method 25 and the CAMIL classification method. In order to compare classification with regression, we chose the AUROC as an evaluation metric. In the site-aware-split test set of the TCGA cohort, CAMIL regression outperformed both of the previous approaches in HRD prediction in all 7 of the tested cancer types (Fig. 2A, Suppl. Table 1). In 5 out of 7 cancer types, an ANOVA test showed that the difference in mean AUROCs was statistically significant with p<0.05 (Suppl. Table 2 and 3). In TCGA-LUSC, all three methods performed equally poorly, reaching AUROCs of 0.57 (± 0.05), 0.57 (± 0.04) and 0.57 (± 0.03) for CAMIL regression, Graziani et al. regression, and CAMIL classification, respectively. In the external validation cohorts, all models reached comparable performance (Suppl. Table 1 and 2). In the external validation cohorts (Fig. 2B), a t-test showed that the mean AUROCs of CAMIL regression were not statistically significantly better than the classification model, whereas the Graziani et al. model outperformed the CAMIL classification model in 1 out of 4 external validation cohorts (Suppl. Table 3).
Next, we compared CAMIL regression to Graziani et al. 25 regression by assessing the coefficient of determination R² of the predicted scores compared to the clinically-derived ground-truth scores. In TCGA, the CAMIL regression model reached higher R² scores than the Graziani et al. 25 model in all of the 7 selected cohorts (Suppl. Table 5). In the CPTAC validation cohort, the CAMIL regression model reached higher R² scores than the Graziani et al. 25 model in all 4 of the selected cohorts (Suppl. Table 5). To determine the reason for our superior performance over Graziani et al. 25 regression, we conducted an ablation study of the CAMIL regression approach. These results revealed that the inferior performance of the Graziani et al. 25 approach for predicting clinical biomarkers is mainly due to its standard stochastic gradient descent optimizer, compared to the stochastic gradient descent with adaptive moments (Adam) optimizer in our CAMIL regression approach (Suppl. Table 7). Taken together, these data indicate that the CAMIL regression method outperforms the Graziani et al. 25 regression method and the CAMIL classification method. Consequently, the regression method by Graziani et al. 25 is not further compared to CAMIL regression and classification in subsequent experiments.
Moreover, we investigated additional aspects of model performance which the AUROC does not capture 28 . We compared CAMIL regression to CAMIL classification by quantifying the absolute distance between the medians of the normalized scores for the positive and negative samples (Fig. 2C-F). For example, for detection of HRD status in endometrial cancer, the AUROC on the CPTAC test cohort was 0.98 ± 0.02 for CAMIL classification and 0.96 ± 0.01 for CAMIL regression. This difference was not statistically significant (p = 0.095). When the distribution of the CAMIL regression model output (Fig. 2C-F) was visualized, we found a greater separation of the predicted HRD scores in positive and negative patients compared to the CAMIL classification approach (Suppl. Table 4). The absolute distance between the peaks of the score distributions of positive and negative patients was higher for CAMIL regression than for CAMIL classification. We further quantified this in all tumor entities and found that in all 7 of the selected TCGA cohorts, this distance was larger for CAMIL regression, resulting in a greater class separability. In CPTAC, as compared to the classification-based approach, class separability was improved in 2 out of 5 cohorts when using the regression approach. Overall, our CAMIL regression approach improves the separation distance of the groups' medians by 378% for the test set of the TCGA training cohort, and 19% for the external CPTAC test cohort (Suppl. Table 4).
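The separation measure described above can be sketched in a few lines. This is a minimal illustration, assuming min-max normalization of the scores; the text only states that scores were normalized, so the normalization scheme and the function names `min_max_normalize` and `median_separation` are assumptions.

```python
from statistics import median


def min_max_normalize(scores):
    """Rescale scores to [0, 1] (assumed normalization; not specified in the text)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def median_separation(scores, labels):
    """Absolute distance between the group medians of the normalized scores.

    labels holds 1 for positive (e.g. HRD+) and 0 for negative patients.
    """
    norm = min_max_normalize(scores)
    pos = [s for s, y in zip(norm, labels) if y == 1]
    neg = [s for s, y in zip(norm, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both groups need at least one sample")
    return abs(median(pos) - median(neg))
```

Because both model types are normalized to the same range before the medians are compared, the distance can be compared between a classification model with outputs in [0, 1] and a regression model with unbounded outputs.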

Regression predicts key biological process biomarkers from histology
Having shown that our CAMIL regression method can predict HRD from histology WSIs, we expanded our experiments to additional biomarkers. We investigated biomarkers related to the three key components of solid tumors: tumor cells, stroma, and immune cells. For tumor cells, we aimed to predict proliferation, as measured by an RNA expression signature 29 . For stroma, we aimed to predict stromal fraction (SF), as assessed via DNA methylation analysis 29 . For immune cells, we investigated the tumor infiltrating lymphocytes regional fraction (TIL RF), the leukocyte fraction (LF), and the lymphocyte infiltration signature score (LISS) 29 . We found that our CAMIL regression method was able to predict all of these five biomarkers with high AUROCs across cancer types in the TCGA cohort (Suppl. Table 9). For example, in breast cancer, the AUROCs for these five biomarkers were 0.88 (± 0.02) in TIL RF, 0.83 (± 0.05) in proliferation, 0.81 (± 0.03) in leukocyte fraction, 0.80 (± 0.03) in LISS and 0.80 (± 0.03) in stromal fraction. In colorectal cancer, these AUROCs were 0.79 (± 0.07), 0.59 (± 0.12), 0.76 (± 0.06), 0.70 (± 0.01) and 0.77 (± 0.04), respectively. In all other cancer types, mean AUROCs of above 0.70 were reached (Suppl. Table 9). These findings show that the regression-based DL model can be trained to predict tumor cell proliferation, stromal fraction and immune-cell-related biomarkers from H&E histopathology.
We compared this to the state-of-the-art CAMIL classification approach using the AUROC with 95% CI as evaluation metric. Using site-aware splits, our proposed CAMIL regression approach outperformed CAMIL classification in 8 out of 34 instances, with the remainder of cases having equal performance for the classification and regression approach (Fig. 3B). Regression outperformed classification in TCGA-BRCA in two targets, LF (0.80 ± 0.02, p < 0.0001) and LISS (0.80 ± 0.03, p < 0.0001). In TCGA-CRC, the performance between regression and classification was equal for all five targets. In TCGA-LIHC, regression outperformed classification in LISS (0.70 ± 0.01, p < 0.001). In TCGA-LUAD, regression outperformed classification in proliferation (0.84 ± 0.04, p < 0.0001). In TCGA-LUSC, regression outperformed classification in TIL RF (0.88 ± 0.04, p < 0.0001). In TCGA-STAD, regression outperformed classification in proliferation (0.87 ± 0.07, p = 0.06), but neither classification nor regression reached a statistically significant AUROC (p > 0.05). In TCGA-UCEC, regression outperformed classification in the two lymphocyte-based targets, TIL RF (0.82 ± 0.04, p < 0.0001) and LISS (0.73 ± 0.06, p < 0.001). These findings collectively demonstrate that utilizing the CAMIL regression approach leads to an average 4% increase in the AUROCs, as compared to employing the CAMIL classification approach for the same task of predicting key biological process biomarkers from histology.

Regression improves interpretability of biomarker predictions from histology
Next, we investigated the interpretability of the CAMIL classification model compared to the CAMIL regression model. We evaluated the biological plausibility of spatial prediction heatmaps obtained by deploying the regression model and the classification model on tumors in the site-aware split test set of the TCGA cohort. We used the models trained to predict the LISS in breast cancer. Although the LISS is only available as a weak label (one score per WSI), a good model should be able to highlight regions which are associated with the LISS, and these regions should contain lymphocytes. Indeed, we saw that both the classification model and the regression model placed their attention on lymphocyte-rich regions (Fig. 3C-0). In the evaluated WSIs, however, the LISS regression model yielded a sharper delineation of lymphocyte-rich regions and placed less attention on areas where histologic features are less relevant. In contrast, the LISS classification model demonstrated relatively less confidence in areas with a dense lymphocyte population compared to the regression model, as indicated by a lower attention score (Fig. 3C-1). The classification model assigned importance to regions without any presumed clinical relevance, as evidenced by the fact that the model highlighted the tissue edge, which lacks high-density lymphocyte regions (Fig. 3C-2). We quantified these findings by a blinded interpretability review of 42 attention heatmaps from the classification and regression models by KJH, a pathology resident. Based on the expert review, the CAMIL regression approach produced the most interpretable attention heatmaps in 34 out of 42 cases. In 6 out of 42 cases, the CAMIL classification approach was more interpretable. Similar interpretability between the CAMIL classification and regression approaches was observed in 2 out of 42 cases. Hence, CAMIL regression outperforms CAMIL classification in interpretability in 81% of cases as observed in a blinded review. Taken together, these data demonstrate that the regression approach gives a statistically significantly better AUROC for the investigated biomarkers (p < 0.05; Suppl. Table 11), and a markedly improved interpretability, compared to the classification approach.

Regression-based biomarkers improve survival prediction in colorectal cancer
Biological processes of tumor cell proliferation, deposition of stromal components, and infiltration by lymphocytes are biologically relevant during tumorigenesis and progression, and are known to be related to clinical outcome. 30,31 Thus, prediction of lymphocytic infiltration from H&E pathology slides should be relevant for prognostication. We investigated this in a large cohort of 2,297 patients with colorectal cancer from the Darmkrebs: Chancen der Verhütung durch Screening (DACHS) study, for which H&E WSIs and long-term (10 years) follow-up data were available for overall survival (Suppl. Table 15).
First, we deployed the CAMIL classification models that were trained on colorectal cancer patients in TCGA, which obtained similar AUROCs in all biomarkers (Fig. 3B). We deployed these models on WSIs from patients enrolled in DACHS, obtaining a binarized prediction label for each patient. We then assessed the prognostic impact of this predicted label with univariate and multivariate Cox Proportional Hazard models for overall survival (Fig. 4A and 4B), yielding hazard ratios (HR). We found that the classification models reached significant risk-group stratification in 3 out of 5 biomarkers (Fig. 4A, Suppl. Table 12): leukocyte fraction (HR=0.74, p < 0.0001), LISS (HR=0.74, p < 0.0001), and stromal fraction (HR=0.77, p < 0.0001). These hazard ratios represent only a modest predictability of survival. In the multivariate survival model (Fig. 4B, Suppl. Table 13), the classification models show significant prognostic capabilities in only 2 out of 5 biomarkers: leukocyte fraction (HR=0.83, p = 0.0394) and LISS (HR=0.82, p = 0.0265).
When we repeated the procedure with continuous scores obtained from the CAMIL regression models, we found that the regression models markedly improved the survival prediction. The regression model reached significant risk-group stratification in 3 out of 5 biomarkers (Fig. 4A): leukocyte fraction (HR=0.18, p < 0.01), LISS (HR=0.03, p < 0.0001) and TIL regional fraction (HR=0.21, p < 0.01). This effect was also observed when the scores obtained from the CAMIL regression model were binarized at the median before using them as an input for the univariate Cox Proportional Hazard model (Suppl. Table 14), showing consistent risk-group stratification superiority for regression-based biomarkers. For the multivariate survival model (Fig. 4B, Suppl. Table 13), the regression models show significant prognostic capabilities in the same 2 biomarkers: leukocyte fraction (HR=0.20, p < 0.01) and LISS (HR=0.14, p < 0.01). Again, the HRs for regression are significantly further from the null value (HR=1), with non-overlapping 95% CIs, compared to the classification models. Similar observations were made for the models trained on breast cancer patients from TCGA and deployed on colorectal cancer patients from DACHS, corroborating the improved generalizability of regression on biomarkers across different cancer types (Fig. 4C and 4D).
Taken together, these data demonstrate that by training models on biologically relevant biomarkers with weakly supervised learning, the resultant regression models are better predictors of survival than their classification counterparts. Therefore, regression models enhance the use of weakly supervised learning to build DL systems of potential clinical utility.

Discussion
Since 2018, the field of digital pathology has rapidly expanded to include the development of tools for predicting molecular biomarkers from routine tumor pathology sections, which has led to the development of clinically approved products. Traditional DL methods have limited the analysis of many biomarkers, including HRD and gene expression signatures, which are continuous values, by categorizing them into discrete classes. Our study provides direct evidence that novel regression networks, such as the CAMIL regression method described in this study, which builds on recent work using attention-based multiple instance learning and self-supervised pre-training of the feature extractor 18,20,26 , outperform traditional classification networks in predicting these biomarkers. This approach unlocks a key clinical application area for pathology-based biomarker prediction.
Our proposed CAMIL regression approach has shown promising results in improving the accuracy and separability of biomarker predictions compared to CAMIL classification. This improvement is particularly noticeable for biomarkers that have a clinically established threshold for categorization, such as HRD. Similar improvements are observed for biomarkers that do not have any clinically relevant cut-off point and would traditionally necessitate dichotomization for analysis, such as immune biomarkers. In addition, our CAMIL regression approach demonstrates better generalization capabilities than the regression approach by Graziani et al. 25 , as seen in the external test cohort. We identified that the optimizer used in Graziani et al. 25 predominantly caused the regression model to converge to the mean, which explains the observed difference.
Furthermore, our study highlights the advantages of regression-based biomarker prediction over classification-based prediction in terms of interpretability. We demonstrated that, for tumor infiltrating lymphocytes, attention heatmaps generated through regression were preferred in 81% of cases for their interpretability compared to those generated through classification. Regression also resulted in an improvement in survival prediction based on immunologic biomarkers, as it allowed for more effective stratification of risk groups for overall survival compared to classification models. The biomarkers were deliberately chosen on the basis of their prognostic capabilities [32][33][34][35], and are better reflected by the tumor morphology analysis through the CAMIL regression approach as compared to the CAMIL classification approach.
This study has several limitations. The experiments were limited to a select number of tumors and clinical targets, and not all analyzed clinical targets had an external test set with the same clinical information available. This resulted in meta-external test sets through site-aware splits, and blind deployments on an external cohort. Additionally, none of the hyperparameters of the trained models were optimized. Further research could expand the analysis to a wider variety of cancers and clinical targets, while also exploring potential pitfalls of regression in computational pathology. The approaches described here, however, provide a proof-of-principle for the use of regression-based attMIL systems and their potential impact for the inference of biomarkers and prediction of outcomes from histologic WSIs, expanding the repertoire of applications of DL in precision medicine.

Ethics statement
We examined anonymized patient samples from several academic institutions in this investigation. This analysis has been approved by the ethical boards at DACHS. CPTAC and TCGA did not require formal ethics approval for a retrospective study of anonymized samples. The overall analysis was approved by the Ethics commission of the Medical Faculty of the Technical University Dresden (BO-EK-444102022).

Image Data and Cohorts
A total of 11,671 raw WSIs were scanned by an Aperio ScanSlide scanner and pre-processed in this study. Two types of clinical targets were analyzed to observe the performance of the classification and regression models: 1) continuous variables with a known clinically relevant cut-off for categorization, and 2) continuous variables with unknown clinically relevant cut-offs, thus requiring categorization by splitting at the median. These categories of targets were chosen because theory describes the loss of information caused by splitting at the median 23 , but does not address the loss of information when clinically relevant cut-offs are applied before training the model.
The target with a clinically relevant cut-off is homologous recombination deficiency (HRD) (Suppl. Table 16), a clinically relevant biomarker in solid tumor types, such as breast cancer. One way to calculate HRD is by adding up the three subscores, Loss of Heterozygosity (LOH), Telomeric Allelic Imbalance (TAI) and Large-scale State Transitions (LST), giving us a continuous value ranging from 0 to 103 in the training sets. A clinically relevant cut-off point of HRD >= 42 was used to binarize the continuous score 36 .
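As a minimal sketch of the scoring rule above, the HRD score is the sum of the three subscores and is binarized at the stated cut-off. The function names and the example subscore values are illustrative assumptions, not taken from the study's code.

```python
HRD_CUTOFF = 42  # cut-off stated in the text (HRD >= 42)


def hrd_score(loh, tai, lst):
    """Continuous HRD score as the sum of the LOH, TAI and LST subscores."""
    return loh + tai + lst


def is_hrd_positive(score, cutoff=HRD_CUTOFF):
    """Binarized HRD status used as the classification label."""
    return score >= cutoff


# Hypothetical patient: subscores chosen only for illustration.
example = hrd_score(loh=15, tai=12, lst=15)
```

The regression models are trained on the continuous value returned by `hrd_score`, whereas the classification models only see the boolean label.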
The targets without a known clinically relevant cut-off point are biological process biomarkers (Suppl. Table 17), which are interesting to analyze due to their prominent role in immunotherapy outcome prediction 29,37,38 : stromal fraction (SF) with range [0, 0.92] and leukocyte fraction (LF) with range [0, 0.96], as assessed via DNA methylation analysis; lymphocyte infiltrating signature score (LISS) with range [-3.49, 4.17] and proliferation (Prolif.) with range [-2.86, 1.59], as measured by RNA expression data; and tumor infiltrating lymphocytes regional fraction (TIL RF) with range [0, 63.65], quantified using a DL-based classification. For TCGA-LIHC, there was no data available for TIL regional fraction, leading to an analysis of 5 targets in 7 cancer types with 5-fold cross-validation, resulting in (35-1)*5 = 170 models for each modeling type, of which the AUROC ± 95% CI of the 5 folds per target and tumor type was reported.
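The bookkeeping above can be checked with a short sketch that enumerates the target and cohort combinations and binarizes a continuous target at its median. The cohort list is taken from the experimental setup; whether values equal to the median fall into the high or low group is not stated, so the strict `>` used here is an assumption.

```python
from statistics import median

TARGETS = ["SF", "LF", "LISS", "Prolif.", "TIL RF"]
COHORTS = ["BRCA", "CRC", "LUAD", "LUSC", "LIHC", "STAD", "UCEC"]
MISSING = {("LIHC", "TIL RF")}  # no TIL RF data for TCGA-LIHC
N_FOLDS = 5


def count_models():
    """Number of trained models per modeling type: (35 - 1) * 5."""
    pairs = [(c, t) for c in COHORTS for t in TARGETS if (c, t) not in MISSING]
    return len(pairs) * N_FOLDS


def binarize_at_median(values):
    """Label 1 for values above the cohort median, else 0 (tie rule assumed)."""
    m = median(values)
    return [1 if v > m else 0 for v in values]
```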

Model description
The entire image processing pipeline, from whole-slide image (WSI) to patient-level predictions, consisted of three main steps: 1) image pre-processing, 2) feature extraction, and 3a) classification-based attMIL or 3b) regression-based attMIL for score aggregation, resulting in patient-level predictions (Fig. 1A and 1B).
All WSIs in the experiments were tessellated into image patches at a resolution of 224 by 224 pixels with an edge length of 256 µm, resulting in a Microns Per Pixel (MPP) value of approximately 1.14. After tessellation, every image patch underwent a rejection filter using the Canny edge detection method 39 , removing blurry patches and the white background of the image when two or fewer edges were detected in the patches. The remaining patches were color-normalized in order to reduce the H&E-staining variance across patient cohorts according to the Macenko spectral matching technique 40 , with a prior added step of brightness standardization. For pre-processing, our end-to-end WSI pre-processing pipeline was utilized. The target image used to define the color distribution was uploaded to the GitHub repository.
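The tile geometry and the rejection rule can be illustrated with a small sketch. The edge count is assumed to come from an external Canny edge detector that is not implemented here, and dropping partial tiles at the slide border is an assumption; the function names are hypothetical.

```python
TILE_PX = 224     # tile edge length in pixels
TILE_UM = 256.0   # tile edge length in micrometers


def microns_per_pixel(tile_um=TILE_UM, tile_px=TILE_PX):
    """Resolution implied by the tile size: 256 um / 224 px."""
    return tile_um / tile_px


def tile_grid(slide_width_um, slide_height_um, tile_um=TILE_UM):
    """(columns, rows) of whole tiles; partial border tiles are dropped (assumption)."""
    return int(slide_width_um // tile_um), int(slide_height_um // tile_um)


def keep_tile(n_edges, min_edges=3):
    """Reject a tile when two or fewer edges were detected in it.

    n_edges is assumed to be supplied by a Canny edge detector.
    """
    return n_edges >= min_edges
```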
Every pre-processed image patch was turned into a 2048-dimensional feature vector through inference of RetCCL 26 , an ImageNet-weighted ResNet50-based self-supervised contrastive clustering model fine-tuned on 32,000 WSIs from different cancer types. The feature extraction resulted in an (n x 2048) feature matrix per patient, where n is the number of (224 x 224 pixels) pre-processed image patches.
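To illustrate how an (n x d) feature matrix can be reduced to one patient-level vector, the following sketch implements the basic (non-gated) attention pooling of Ilse et al., a_k = softmax_k(w^T tanh(V h_k)) and z = sum_k a_k h_k, in plain Python. It is a toy under stated assumptions: the parameters V and w would be learned in practice, the dimensions here are tiny rather than 2048, and whether the study used the gated or non-gated variant is not stated in the text.

```python
import math


def attention_weights(features, V, w):
    """Softmax attention over tiles.

    features: list of n tile vectors of length d
    V: list of L rows of length d (hidden projection)
    w: list of L weights (attention vector)
    """
    logits = []
    for h in features:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        logits.append(sum(wj * uj for wj, uj in zip(w, hidden)))
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, V, w):
    """Attention-weighted sum of tile vectors: one vector per patient."""
    a = attention_weights(features, V, w)
    dim = len(features[0])
    return [sum(ak * h[i] for ak, h in zip(a, features)) for i in range(dim)]
```

The pooled vector is then passed to a task-specific head: a sigmoid output for classification or an unbounded linear output for regression. The attention weights are the values visualized in the attention heatmaps.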

Experimental setup and implementation details
For the experiments, 5-fold cross-validation on patient level with site-aware splits was performed to train the models. Site-aware splits ensure that patients are stratified and grouped by the hospital the WSI originated from, creating a stratified random 80-20 split which forces all patients from the same hospital to exist in either the training and internal validation set, or the internal test set, while retaining ground-truth class distributions. Specifically, in The Cancer Genome Atlas (TCGA), site-specific histological features were shown to be present in the WSIs, causing biased evaluations of the model when not accounted for accordingly during the training procedure 27 . The basis for the weakly supervised classification and regression was adapted from the attention-based multiple instance learning (attMIL) method by Ilse et al. 41 Our proposed model used Balanced MSE 42 as a loss function to account for the natural class imbalance in clinical settings, as well as the Adam optimizer 43 and an attention component followed by an MLP head 41 which was trained for 25 epochs. The dropout layer was removed, due to loss of performance in regression in tabular data settings 44 . The attMIL variant in our proposed CAMIL regression differs from Ilse et al. by swapping their feature extractor with a pre-trained ResNet50 with ImageNet weights, fine-tuned on 32,000 histopathology images in a self-supervised manner using contrastive clustering, shown to yield significantly better results on WSI image analysis 26 . Moreover, the classification head consisting of a fully-connected (FC) layer and sigmoid operation was swapped with custom heads to allow for classification and regression tasks to be performed. The attention component was not altered.
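The core idea of a site-aware split, that no hospital contributes patients to both sides of the split, can be sketched as follows. This is a simplified greedy illustration and not the procedure of reference 27: it does not preserve the ground-truth class distribution, and the function name and input format are assumptions.

```python
import random
from collections import defaultdict


def site_aware_split(patient_sites, test_fraction=0.2, seed=0):
    """Split patients so that every site lands entirely in train or in test.

    patient_sites: dict mapping patient id -> site id.
    Whole sites are added to the test set until it holds roughly
    test_fraction of all patients; class stratification is omitted here.
    """
    by_site = defaultdict(list)
    for patient, site in patient_sites.items():
        by_site[site].append(patient)
    sites = sorted(by_site)
    random.Random(seed).shuffle(sites)
    target = test_fraction * len(patient_sites)
    train, test = [], []
    for site in sites:
        bucket = test if len(test) < target else train
        bucket.extend(by_site[site])
    return sorted(train), sorted(test)
```

Because whole sites are assigned at once, the realized test fraction only approximates 20% when site sizes are uneven.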
To evaluate the relative performance of classification and regression, first, the CAMIL regression method was compared with 1) the regression method from Graziani et al. and 2) the CAMIL classification method on the continuous HRD score and clinically-relevant binarized HRD score, respectively. Then, CAMIL regression was compared to CAMIL classification on continuous biomarkers related to biological processes with no known clinically-relevant cut-off points, where the median score per target was used for binarizing. Moreover, an expert review by a pathology resident was conducted on attention heatmaps produced by CAMIL classification and CAMIL regression to determine which method yielded the most interpretable heatmaps. Finally, the prognostic capabilities of CAMIL regression versus CAMIL classification were evaluated on an external data cohort, DACHS-CRC, by predicting survival of groups stratified by the models which were trained on the same biological process biomarkers and extracted features. For the HRD scores, the models were trained on TCGA-BRCA, TCGA-CRC, TCGA-GBM, TCGA-LUAD, TCGA-LUSC, TCGA-PAAD and TCGA-UCEC, and externally validated on CPTAC-LUAD, CPTAC-LSCC, CPTAC-PDA and CPTAC-UCEC. For the biological process biomarkers, the models were trained on TCGA-BRCA, TCGA-CRC, TCGA-LUAD, TCGA-LUSC, TCGA-LIHC, TCGA-STAD and TCGA-UCEC. Every model that was compared, both regression and classification, used the exact same patients for training, internal validation, internal testing and external testing (Suppl. Table 16 and 17).
For the regression method from Graziani et al., we introduced the self-supervised component as feature extractor 26 followed by embedding-level attention aggregation, instead of the ImageNet-weighted ResNet18 backbone followed by patch-level attention aggregation in the original study by Graziani et al. 25 As it was shown that the self-supervised backbone increases performance and generalizability compared to an ImageNet-weighted architecture as backbone 26 , we added the self-supervised component in order to compare the regression heads in isolation. The commonalities between the models are the learning rate (1.00E-04), weight decay (1.00E-02), patience (12 epochs), the attention component 41 and the fit-one-cycle learning rate scheduling policy 45 . The differences in the models' hyperparameters and optimization strategies (Suppl. Table 6) between Graziani et al. and our CAMIL regression model were broken down in an ablation study to find the reason for the performance differences of the regression heads.
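The shared and model-specific settings reported in the text can be collected in a small configuration object, which also makes a single ablation arm explicit. The field names and string identifiers are hypothetical; only the numeric values and the named components come from the text, and the full list of differences in Suppl. Table 6 is not reproduced here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainingConfig:
    learning_rate: float = 1e-4      # shared by the compared models
    weight_decay: float = 1e-2       # shared by the compared models
    patience_epochs: int = 12        # shared early-stopping patience
    lr_schedule: str = "one_cycle"   # fit-one-cycle policy
    max_epochs: int = 25             # reported for the proposed model
    optimizer: str = "adam"          # reported for the proposed model
    loss: str = "balanced_mse"       # reported for the proposed model
    n_folds: int = 5


CAMIL_REGRESSION = TrainingConfig()

# One ablation arm: only the optimizer is changed to plain SGD, the factor
# the ablation study identifies as the main cause of the performance gap.
SGD_ABLATION_ARM = replace(CAMIL_REGRESSION, optimizer="sgd")
```

Changing one field at a time in this way mirrors the logic of an ablation study, in which each difference between two methods is isolated in turn.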

Statistics and endpoints
The classification and regression methods were made comparable by evaluating both with the area under the receiver operating characteristic (AUROC) metric. For the definition of the binarized groups required for the AUROCs, the clinically-relevant cut-off for HRD was used, while for the biological process biomarkers, the continuous targets were split at the median. The prediction scores of the classification model, in [0, 1], and the predictions of the regression models, in (−∞, ∞), were used as continuous scores across all possible thresholds of the AUROC. 46 This approach made it possible to test which type of model, when provided with the same binarized ground-truth label, had the least overlap between the predicted score distributions of the different groups and therefore achieved the highest AUROC. However, the AUROC measures only the separation of the groups' score distributions and does not account for the distance between the distributions. Therefore, to determine whether the distance between distributions increased, the median and interquartile range (IQR) were calculated for the clinically relevant HRD+ and HRD- groups. This calculation was not performed for the biological process biomarkers due to the unclear relevance of the distance between the dichotomized groups.
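The comparison described above can be illustrated with a minimal sketch. It is not the study's evaluation code: the function names, the toy values, and the choice to place values equal to the median in the positive group are all assumptions made for illustration. It shows that an AUROC only needs a ranking, so a bounded classification score and an unbounded regression output can be evaluated against the same binarized label.

```python
from statistics import median


def binarize_at_median(targets):
    """Split continuous targets at their median (ties go to the positive group; an assumption)."""
    cutoff = median(targets)
    return [1 if t >= cutoff else 0 for t in targets]


def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both groups must be non-empty")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy continuous ground truth and two hypothetical sets of model outputs.
truth = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]
labels = binarize_at_median(truth)                       # [0, 0, 0, 1, 1, 1]
classification_scores = [0.2, 0.6, 0.3, 0.55, 0.9, 0.8]  # bounded in [0, 1]
regression_scores = [-1.2, 0.3, 0.1, 2.5, 1.9, 4.0]      # unbounded

print(auroc(classification_scores, labels))
print(auroc(regression_scores, labels))
```

Because only the ordering of the scores matters, neither score type has to be rescaled before the AUROC is computed.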
To determine statistical significance of the AUROCs, the 95% confidence interval (CI) across the 5 training folds was calculated for each model. To identify whether the AUROCs of the three compared models (CAMIL classification, regression from Graziani et al., and our proposed CAMIL regression) differed significantly for the HRD target, a repeated measures ANOVA was performed, which resulted in an F value for each tumor type the three models were trained on. If the difference between the three models was statistically significant, a dependent one-sided t-test for paired samples was performed to determine whether CAMIL regression outperformed CAMIL classification, resulting in a t-statistic with 95% CI for every model comparison for every analyzed tumor type of the internal test set from TCGA. For the external test set, the repeated measures ANOVA was also performed, after which two dependent one-sided t-tests with Bonferroni correction were performed, resulting in two t-statistics with 97.5% CI for every model comparison of every analyzed tumor type. For the biological process biomarkers' models, a dependent two-sided t-test with 95% CI was performed to test the alternative hypothesis that the 5-fold means of the CAMIL classification and CAMIL regression AUROCs were significantly different from each other.
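The paired one-sided test with Bonferroni correction can be sketched as follows. This is a simplified illustration, not the authors' analysis script: the function names and the fold-wise AUROC values are hypothetical, and instead of computing a p-value the sketch compares the t-statistic with tabulated one-sided critical values of Student's t distribution for 4 degrees of freedom (five folds).

```python
import math
from statistics import mean, stdev

# One-sided critical values of Student's t for df = 4, from standard tables:
# alpha = 0.05 for a single test, alpha = 0.025 for two Bonferroni-corrected tests.
T_CRITICAL_DF4 = {1: 2.132, 2: 2.776}


def paired_t_statistic(a, b):
    """t-statistic of the paired differences a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("paired differences have zero variance")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


def regression_outperforms(reg_aurocs, cls_aurocs, n_tests=1):
    """One-sided paired test over five folds; n_tests=2 applies the Bonferroni-corrected threshold."""
    if len(reg_aurocs) != 5 or len(cls_aurocs) != 5:
        raise ValueError("critical values are tabulated for five folds only")
    t = paired_t_statistic(reg_aurocs, cls_aurocs)
    return t, t > T_CRITICAL_DF4[n_tests]


# Hypothetical fold-wise AUROCs for one tumor type.
reg = [0.80, 0.82, 0.78, 0.85, 0.81]
cls = [0.75, 0.76, 0.74, 0.79, 0.77]
t_stat, significant = regression_outperforms(reg, cls, n_tests=2)
print(round(t_stat, 2), significant)
```

Pairing the folds matters here: both models are evaluated on the same test patients in each fold, so the test is run on the per-fold differences rather than on two independent samples.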
To determine the prognostic capabilities of the biological process biomarkers' models, survival prediction analysis was done on an external cohort, DACHS. All 5 models trained through site-aware splits were blindly deployed, and the mean of their predicted scores was used for further analysis. Univariate (UV) and multivariate (MV) Cox proportional-hazards (PH) regression analyses were independently performed to determine the hazard ratio (HR) of the classification and regression models' predictive biomarker. The continuous scores from the regression models were used for the Cox PH analyses, as well as the binarized continuous scores, to rule out that differences in prognostic capability arose solely from which variant of the continuous score was used. The prognostic capabilities of the classification and regression models were independently analyzed together with three covariates: age (continuous, ℝ+), sex (binary, 0: female, 1: male) and tumor stage (continuous, ℤ ∈ [1, 4]). Thus, one model's scores per target and the three covariates were analyzed for each model independently. Statistical significance of the HR is reached when the 95% CI does not cross HR = 1.
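Two steps of this analysis lend themselves to a short sketch: averaging the five fold models' scores per patient, and deciding significance from the HR confidence interval. The sketch assumes a Wald-type interval computed from a Cox coefficient and its standard error, which the text does not specify; the function names and numbers are hypothetical, and fitting the Cox model itself is not shown.

```python
import math


def ensemble_scores(scores_per_model):
    """Average the predictions of several models patient by patient."""
    n_models = len(scores_per_model)
    return [sum(patient) / n_models for patient in zip(*scores_per_model)]


def hazard_ratio_with_ci(beta, standard_error, z=1.96):
    """HR and Wald-type 95% CI from a Cox coefficient (assumed to come from a fitted model)."""
    return (
        math.exp(beta),
        math.exp(beta - z * standard_error),
        math.exp(beta + z * standard_error),
    )


def is_significant(ci_lower, ci_upper):
    """Significant when the confidence interval does not cross HR = 1."""
    return ci_lower > 1.0 or ci_upper < 1.0


# Hypothetical scores from five fold models for three patients.
fold_scores = [
    [0.2, 0.6, 0.9],
    [0.3, 0.5, 0.8],
    [0.1, 0.7, 0.9],
    [0.2, 0.6, 1.0],
    [0.2, 0.6, 0.9],
]
print(ensemble_scores(fold_scores))

hr, lower, upper = hazard_ratio_with_ci(beta=math.log(0.5), standard_error=0.2)
print(round(hr, 2), round(lower, 2), round(upper, 2), is_significant(lower, upper))
```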

Visualization and explainability
To compare the separability of the CAMIL classification and CAMIL regression models' score distributions for HRD at a similar scale, all values of each model were min-max normalized individually to rescale every model's score output to [0, 1]. To explain the classification and regression CAMIL models' capability of decision-making using clinically relevant features, the attention component of the attMIL model architecture was utilized. The attention heatmaps were created by loading the attMIL model architectures for classification and regression into a fully convolutional equivalent 47 with their respective weights from the training procedure, which allows for a high-resolution attention heatmap rather than one at the resolution of the 224×224 patches the model was trained on. By running inference on the WSIs of a patient, the attention layer output that resulted from the patient-wise prediction was extracted and used as an overlay on the WSI to indicate hot zones which the model used in decision making. The TCGA-BRCA cohort was chosen for visualization to observe the contrast between equal and superior performance of the regression model compared to the classification model in lymphocyte-based targets. For each target, the classification and regression models were trained, validated and tested on the exact same patients using site-aware splits. The attention heatmaps for the blinded review were generated from the 42 patients with the highest expression of the LISS biomarker in the unseen internal TCGA-BRCA test set through the trained classification and regression models, resulting in 84 heatmaps in total. The models' clinical interpretability was reviewed by a pathologist, who chose the most interpretable attention heatmap for each of the 42 patients.
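The per-model min-max normalization, and the distance between group medians that it makes comparable across models, can be sketched as below. The function names and the toy scores are illustrative assumptions rather than values from the study.

```python
from statistics import median


def min_max_normalize(scores):
    """Rescale one model's scores to [0, 1] using that model's own minimum and maximum."""
    low, high = min(scores), max(scores)
    if high == low:
        raise ValueError("scores are constant; min-max normalization is undefined")
    return [(s - low) / (high - low) for s in scores]


def median_gap(normalized_scores, labels):
    """Distance between the median normalized score of the positive and negative group."""
    pos = [s for s, y in zip(normalized_scores, labels) if y == 1]
    neg = [s for s, y in zip(normalized_scores, labels) if y == 0]
    return median(pos) - median(neg)


# Hypothetical regression outputs on the original target scale, with binarized labels.
regression_scores = [10, 20, 40, 55]
labels = [0, 0, 1, 1]

normalized = min_max_normalize(regression_scores)
print(normalized)
print(median_gap(normalized, labels))
```

Normalizing each model separately keeps the comparison about the shape of the score distributions rather than about the units in which each model happens to predict.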

Data and Code availability
All source codes are available under an open-source license on GitHub. The pre-processing pipeline is found at https://github.com/KatherLab/end2end-WSI-preprocessing/releases/tag/v1.0.0preprocessing, the classification pipeline is found at https://github.com/KatherLab/marugoto/releases/tag/v1.0.0-classification, the regression pipeline is found at https://github.com/KatherLab/marugoto/releases/tag/v1.0.0-regression, and the classification and attention heatmaps are found at https://github.com/KatherLab/highres-WSI-heatmaps/releases/tag/v1.0.0-heatmaps. The slides for TCGA are available at https://portal.gdc.cancer.gov/. The slides for CPTAC are available at https://proteomics.cancer.gov/data-portal. The molecular data for TCGA is available at https://www.cbioportal.org/ and additional biomarker data is available from Thorsson et al.

Figure 1 (continued): C) The performance metrics and their respective confidence intervals (CI) to evaluate the performance of the three separately trained heads of the model, including the coefficient of determination (R²) for the regression models, the area under the receiver operating characteristic (AUROC) for all models, analysis of variance (ANOVA) with repeated measures for the homologous recombination deficiency (HRD) and biological process biomarkers, and expert review of attention heatmaps with univariate (UV) and multivariate (MV) Cox proportional-hazard (PH) models for the biological process models. D) The cohorts used for training and external validation represented in the inner and outer circle, respectively. The training cohorts are from The Cancer Genome Atlas (TCGA) programme for all clinical targets, with the external validation cohorts coming from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) effort and the Darmkrebs: Chancen der Verhütung durch Screening (DACHS) study for the HRD target and the biological process biomarkers, respectively. The biological process biomarkers are tumor infiltrating lymphocytes regional fraction (TIL RF), proliferation (Prolif.), leukocyte fraction (LF), lymphocytes infiltrating signature score (LISS) and stromal fraction (SF). The considered cancer types in this study are breast cancer (BRCA), colorectal cancer (CRC), glioblastoma (GBM), lung adenocarcinoma (LUAD), lung squamous cell cancer (LUSC), pancreas adenocarcinoma (PAAD), endometrial cancer (UCEC), liver hepatocellular carcinoma (LIHC) and stomach cancer (STAD).

Figure 3: CAMIL classification versus CAMIL regression for the prediction of continuous biological process biomarkers of the tumor microenvironment. A) The scope in which we analyzed the tumor microenvironment (TME) consists of tumor cells, stroma and immune cells. B) Heatmap depicting area under the receiver operating characteristic (AUROC) deltas between CAMIL regression and CAMIL classification for 5 biological process biomarkers (tumor infiltrating lymphocytes regional fraction (TIL RF), proliferation (Prolif.), leukocyte fraction (LF), lymphocytes infiltrating signature score (LISS) and stromal fraction (SF)) on the test sets of breast cancer (BRCA), colorectal cancer (CRC), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell cancer (LUSC), pancreas adenocarcinoma (PAAD), stomach cancer (STAD) and endometrial cancer (UCEC) from The Cancer Genome Atlas (TCGA) program for site-aware split folds. The higher the positive delta, the better the CAMIL regression model performed. Statistical significance is indicated with an asterisk as a result of a dependent one-sided t-test (α = 0.05). C) Attention heatmap of a slide from the test set of TCGA-BRCA. Image 0 shows the entire slide, with an area of interest for diagnostics in image 1. Image 2 shows an area presumably containing non-essential diagnostic information. This is repeated for the original slide, the attention heatmap using the classification model, and the attention heatmap using our CAMIL regression model in fold 0 for LISS. The higher the attention score of an area, the more important it is for the model's decision making. Icon source: smart.servier.com

Suppl. Table 3: One-sided dependent t-tests to determine if regression outperforms classification in the models for homologous recombination deficiency (HRD) prediction using site-aware splits. For predicting HRD, models were trained in a site-aware manner on The Cancer Genome Atlas (TCGA) cohorts of breast cancer (BRCA), colorectal cancer (CRC), glioblastoma (GBM), lung adenocarcinoma (LUAD), lung squamous cell cancer (LUSC), pancreatic adenocarcinoma (PAAD) and endometrial cancer (UCEC). In the Clinical Proteomic Tumor Analysis Consortium (CPTAC) effort, LUAD, LUSC, PAAD and UCEC were used as external validation cohorts. In TCGA, the CAMIL regression model is compared with the CAMIL classification approach through a dependent one-sided t-test, with significance for p < 0.05. In CPTAC, two hypotheses are tested through dependent one-sided t-tests: whether CAMIL regression is better than CAMIL classification, and whether Graziani's regression is better than CAMIL classification. Here, significance is reached at p < 0.025 when using Bonferroni's correction for multiple hypothesis testing.

Suppl. Table 4: Median and interquartile range (IQR) of the min-max normalized predicted scores for the CAMIL classification and CAMIL regression approach. The median and IQR of the min-max normalized prediction scores for CAMIL classification and CAMIL regression are calculated for each model trained on the HRD score. A positive percentage indicates a larger distance between the median peaks of the groups' distribution using regression, whereas a negative percentage indicates a larger distance between the median peaks of the groups' distribution using classification. In The Cancer Genome Atlas (TCGA), breast cancer (BRCA), colorectal cancer (CRC), glioblastoma (GBM), lung adenocarcinoma (LUAD), lung squamous cell cancer (LUSC), pancreatic adenocarcinoma (PAAD) and endometrial cancer (UCEC) were used for site-aware training. In the Clinical Proteomic Tumor Analysis Consortium (CPTAC) effort, LUAD, LUSC, PAAD and UCEC were used as external validation cohorts.
Suppl. Table 5: Comparison of Graziani regression with CAMIL regression through the coefficient of determination (R²). The p-value follows from a two-sided independent t-test with significance reached at p < 0.05, indicating the correlation coefficient is non-zero. Within a range of 0 to 1, the higher the R², the better the regression model. In The Cancer Genome Atlas (TCGA), breast cancer (BRCA), colorectal cancer (CRC), glioblastoma (GBM), lung adenocarcinoma (LUAD), lung squamous cell cancer (LUSC), pancreatic adenocarcinoma (PAAD) and endometrial cancer (UCEC) were used for site-aware training. In the Clinical Proteomic Tumor Analysis Consortium (CPTAC) effort, LUAD, LUSC, PAAD and UCEC were used as external validation cohorts.

CAMIL regression has better generalization capabilities than Graziani regression
Given how AUROCs are computed, the continuous output score of the regression models can be used in combination with a categorical target, such as the clinically-relevant binarized HRD score. However, AUROCs only indicate how well the given continuous score separates the negative and positive class, i.e. they reward a model that outputs a low intra-class variance and a high inter-class variance with a high AUROC, irrespective of the absolute range of the prediction scores. Comparing the regression methods, it was found that CAMIL regression predicts more clinically-relevant output scores which are closer to the absolute ground-truth (Suppl. Fig. 1). For deeper analysis of the differences between the regression heads of Graziani et al. regression and CAMIL regression, an ablation study (Suppl. Table 7) was performed on CAMIL regression using the TCGA-BRCA cohort. The TCGA-BRCA cohort was chosen for the ablation study as both the Graziani et al. regression model and the CAMIL regression model gave statistically significant AUROCs for all 5 folds, which were in a similar range with low variance, in contrast to TCGA-LUAD, which showed more variance among the 5 folds due to an outlying fold (Fig. 2).
Suppl. Table 11: Dependent two-sided t-test with 95% confidence interval for CAMIL classification and CAMIL regression models trained with site-aware splits. For the models which were trained with site-aware splits, a two-sided dependent t-test was performed to determine whether the means of the area under the receiver operating characteristics (AUROC) from the CAMIL classification and CAMIL regression approach were significantly different (p < 0.05). The models were trained on cohorts from The Cancer Genome Atlas (TCGA), breast cancer (BRCA), colorectal cancer (CRC), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell cancer (LUSC), gastric cancer (STAD) and endometrial cancer (UCEC) on biological process biomarkers: tumor infiltrating lymphocytes regional fraction (TIL RF), proliferation (Prolif.), leukocyte fraction (LF), lymphocyte infiltration signature score (LISS), and stromal fraction (SF).
Suppl. Figure 2: CAMIL regression models' distribution plots with the coefficient of determination (R²) and Spearman's rank correlation coefficient (ρ). The CAMIL regression models were trained using site-aware splits on cohorts from The Cancer Genome Atlas (TCGA), breast cancer (BRCA), colorectal cancer (CRC), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell cancer (LUSC), gastric cancer (STAD) and endometrial cancer (UCEC) on biomarkers for tumor infiltrating lymphocytes regional fraction (TIL RF), leukocyte fraction, lymphocyte infiltration signature score (LISS), and stromal fraction. The R² and ρ are shown as performance metrics for the test set of all five site-aware splits. The model's score distributions are displayed on the y-axis, whereas the ground-truth continuous score is displayed on the x-axis.

Suppl. Table 13: Hazard ratios (HR) with 95% confidence interval (CI) and corresponding p-values of multivariate Cox proportional-hazard models for the models for biological process biomarkers. The CAMIL regression and CAMIL classification models were trained using site-aware splits on cohorts from The Cancer Genome Atlas colorectal cancer (CRC) and breast cancer (BRCA) on biomarkers for tumor infiltrating lymphocytes regional fraction (TIL RF), proliferation (Prolif.), leukocyte fraction (LF), lymphocyte infiltration signature score (LISS), and stromal fraction (SF). These models were deployed on CRC patients from the Darmkrebs: Chancen der Verhütung durch Screening (DACHS) study. For the classification models, the predicted dichotomised labels were used, whereas our regression model used the predicted continuous scores for the univariate Cox proportional-hazard models. The covariates used in the analysis are sex, age, and tumor stage. An HR equal to 1 indicates non-significant prognostication capability.

Figures

Figure 1: End-to-end experimental workflow overview with image pre-processing, modeling, performance metrics and used cohorts. A) The image pre-processing pipeline and tile-level feature extraction by running inference on a ResNet50 with pre-trained ImageNet weights and retrieval contrastive clustering (RetCCL) model for a feature matrix for each patient. B) The modeling architecture using attention-based multiple instance learning (attMIL) on the self-supervised extracted features with three separately trained heads for CAMIL classification, regression as proposed by

Figure 2: Performance overview of classification versus regression approaches predicting the homologous recombination deficiency (HRD) score. Panel A) and B) show boxplots of area under the receiver operating characteristic (AUROC) values from HRD predictions of (I) CAMIL classification, (II) regression by Graziani et al. and (III) CAMIL regression on the internal test set from The Cancer Genome Atlas (TCGA) and the external test set from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) effort, respectively. Cancer types include glioblastoma (GBM), pancreas adenocarcinoma (PAAD), endometrial cancer (UCEC), colorectal cancer (CRC), breast cancer (BRCA), lung adenocarcinoma (LUAD) and lung squamous cell cancer (LUSC). Non-significant AUROC values are shown as transparent violin instances, and statistical tests include an analysis of variance with repeated measures displayed at the bottom and dependent one-sided t-tests, with Bonferroni correction for multiple hypothesis testing in the external test set, displayed on top. Panel C) and D) show the proportional distribution plot of the normalized classification scores of the internal test set from the CAMIL classification model trained on TCGA-UCEC, and the external test set CPTAC-UCEC, respectively. Panel E) and F) show the proportional distribution plot of the normalized regression scores of the internal test set from the CAMIL

Figure 4: Overview of the externally validated prognostic capabilities of the trained models to predict overall survival. Panel A) and B) display a univariate (UV) Cox proportional-hazard (PH) analysis of the models trained on The Cancer Genome Atlas (TCGA) program, deployed on the external colorectal cancer (CRC) samples from the Darmkrebs: Chancen der Verhütung durch Screening (DACHS) study for the TCGA-CRC and TCGA breast cancer (BRCA) models, respectively. Panel C) and D) display a multivariate (MV) Cox PH analysis of the trained immune cell models, deployed on the external DACHS-CRC cohort for the TCGA-CRC and TCGA-BRCA models, respectively. Each model's output, from CAMIL classification (categorical class predictions) and CAMIL regression (continuous score predictions), is considered independently together with the three covariates tumor stage, age and sex for the MV Cox PH analysis. The observed biological process biomarkers are tumor infiltrating lymphocytes regional fraction (TIL RF), proliferation (Prolif.), leukocyte fraction (LF), lymphocyte infiltration signature score (LISS), and stromal fraction (SF). Stars indicate statistical significance (p < 0.05) for the hazard ratios (HR) and their 95% confidence interval (CI).

On the CPTAC-LUAD external test cohort (Suppl. Fig. 1), the prediction ranges were (29, 34) and (10, 55) for the Graziani et al. regression model and our CAMIL regression model, respectively. For this analysis, LUAD was chosen as it was the only tumor type with statistically significant results for the regression models that had both an internal and an external validation set for the HRD target. With an R² of 0.16 (3.06E-22) for the Graziani et al. regression and an R² of 0.29 (1.39E-40) for our proposed CAMIL regression model, superior generalization capabilities of our proposed CAMIL regression over the regression method by Graziani et al. are observed. However, the AUROC would indicate that the regression by Graziani et al. has superior performance, showing the limitations of comparing regression models solely through AUROC values.

Suppl. Figure 1: Comparison of CAMIL regression and Graziani et al. regression on an external testing cohort. A) The correlation plot of the regression approach by Graziani et al. on the external cohort of lung adenocarcinoma (LUAD) from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) effort in the original range of the homologous recombination deficiency (HRD) continuous ground-truth, and a zoom-in of the same data in the range of the model's prediction. B) The corresponding area under the receiver operating characteristic (AUROC) curve using the continuous prediction scores plotted in panel A and the HRD binary ground-truth. C) The correlation plot of the CAMIL regression approach from this study on the external cohort CPTAC-LUAD in the original range of the HRD continuous ground-truth, and a zoom-in of the same data in the range of the model's prediction. D) The corresponding AUROC curve using the continuous prediction scores plotted in panel C and the HRD binary ground-truth.

IEEE International Symposium on Biomedical Imaging: From Nano to Macro 1107-1110 (2009).
41. Ilse, M., Tomczak, J. M. & Welling, M. Attention-based Deep Multiple Instance Learning. arXiv [cs.LG] (2018).
42. Ren, J., Zhang, M., Yu, C. & Liu, Z. Balanced MSE for Imbalanced Visual Regression. arXiv
46. Bradley, A. P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30, 1145-1159 (1997).
47. Pathak, D., Shelhamer, E., Long, J. & Darrell, T. Fully Convolutional Multi-Class Multiple Instance Learning. arXiv [cs.CV] (2014).

Table 1: Area under the receiver operating characteristics (AUROC) and area under the precision-recall characteristics (AUPRC) with 95% confidence interval (CI) and corresponding p-values of the homologous recombination deficiency (HRD) target with site-aware splits. The evaluation AUROC and AUPRC for CAMIL classification, Graziani et al. regression and CAMIL regression are calculated for each model trained on the HRD score. In The Cancer Genome Atlas (TCGA), breast cancer (BRCA), colorectal cancer (CRC), glioblastoma (GBM), lung adenocarcinoma (LUAD), lung squamous cell cancer (LUSC), pancreatic adenocarcinoma (PAAD) and endometrial cancer (UCEC) were used for site-aware training. In the Clinical Proteomic Tumor Analysis Consortium (CPTAC) effort, LUAD, LUSC, PAAD and UCEC were used as external validation cohorts. The p-values are a result of an independent two-sided t-test comparing the means of the positive and negative scores of the models' predictions. Statistical significance is reached at p < 0.05 and indicates a difference in the means between the positive and negative scores. Statistically insignificant AUROCs and AUPRCs for the HRD score prediction are marked with gray.

Table 2: Repeated measures analysis of variance (ANOVA) of the three modeling approaches for the homologous recombination deficiency (HRD) score with site-aware splits.
The repeated measures ANOVA was performed for the models trained on the HRD score through the CAMIL classification, Graziani's regression and our CAMIL regression approach, resulting in F values with the degrees of freedom (DoF) 1 and 2. Statistical significance is reached at p < 0.05 and indicates a difference in the means between the three modeling approaches. Statistically insignificant F values for the models trained on the HRD score are marked with gray. In The Cancer Genome Atlas (TCGA), breast cancer (BRCA), colorectal cancer (CRC), glioblastoma (GBM), lung adenocarcinoma (LUAD), lung squamous cell cancer (LUSC), pancreatic adenocarcinoma (PAAD) and endometrial cancer (UCEC) were used for site-aware training. In the Clinical Proteomic Tumor Analysis Consortium (CPTAC) effort, LUAD, LUSC, PAAD and UCEC were used as external validation cohorts.

Table 6: Overview of the differences between the three modeling approaches.
The main differences between contrastively-clustered attention-based multiple instance learning (CAMIL) classification, Graziani et al. regression and CAMIL regression.

Table 7: Ablation study of CAMIL regression with adaptations from Graziani et al. regression for the test set of The Cancer Genome Atlas (TCGA) breast cancer cohort.
Using the same site-aware splits for all the models, our CAMIL regression modeling approach was altered according to the differences found in the Graziani et al. regression approach: adding a layer with 20% dropout, swapping the Adam optimizer for stochastic gradient descent (SGD), increasing the epochs to 100, and removing the kernel-based balancing. The regression approaches are compared through their prediction score range within [0, 91], where a wider range is better, the coefficient of determination (R²) with standard deviation (std), and the area under the receiver operating characteristic (AUROC) with 95% confidence interval (CI) across all five splits.

Table 9: Area under the receiver operating characteristic (AUROC) and p-values with 95% confidence interval (CI) with site-aware splits.
The performance of CAMIL classification and CAMIL regression models with site-aware splits is measured on cohorts from The Cancer Genome Atlas (TCGA), breast cancer (BRCA), colorectal cancer (CRC), liver hepatocellular carcinoma (LIHC),

Table 10: Dependent two-sided t-test with 95% confidence interval with patient-level splits.
For the models which were trained without site-aware splits, a two-sided dependent t-test was performed to determine whether the means of the area under the receiver operating characteristics (AUROC) from the CAMIL classification and CAMIL regression approach were significantly different (p < 0.05).The models were trained on cohorts from The Cancer Genome Atlas (TCGA), breast cancer (BRCA), colorectal cancer (CRC), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell cancer (LUSC), gastric cancer (STAD) and endometrial cancer (UCEC) on biomarkers for tumor infiltrating lymphocytes regional fraction (TIL RF), proliferation, leukocyte fraction (LF), lymphocyte infiltration signature score (LISS), and stromal fraction (SF).