A spatial analytics computational and systems biology platform predicts risk of colorectal cancer recurrence and identifies emergent spatial domain networks associated with recurrence

An unmet clinical need in solid tumor cancers is the ability to harness the intrinsic spatial information in primary tumors that can be exploited to optimize prognostics, diagnostics and therapeutic strategies for precision medicine. We have developed a transformational spatial analytics (SpAn) computational and systems biology platform that predicts clinical outcomes and captures emergent spatial biology that can potentially inform therapeutic strategies. Here we apply SpAn to primary tumor tissue samples from a cohort of 432 chemo-naïve colorectal cancer (CRC) patients iteratively labeled with a highly multiplexed (hyperplexed) panel of fifty-five fluorescently tagged antibodies. SpAn predicted the 5-year risk of CRC recurrence with a mean area under the ROC curve of 88.5% (SE of 0.1%), significantly better than current state-of-the-art methods. SpAn also inferred the emergent network biology of the tumor spatial domains revealing a synergistic role of known features from CRC consensus molecular subtypes that will enhance precision medicine.

. The use of chemo-naïve (no administration of neoadjuvant or adjuvant therapies for the 5+ years of follow-up) CRC patient cohort provides SpAn the opportunity to interrogate unperturbed primary tumor biology.
SpAn uses HxIF spatial proteomics data to learn recurrence-guided and spatially-informed prognosis of CRC recurrence SpAn performs a virtual three-level spatial-dissection of the tumor microenvironment, by first explicitly decomposing the TMA into epithelial and stromal regions as detailed in Methods and illustrated in Fig. S2. The cells in the epithelial region are identified using E-cadherin cell-cell adhesion labeling and pan-cytokeratin, with individual epithelial cells segmented using a Na + K + ATPase cell membrane marker, ribosomal protein S6 cytoplasmic marker, and DAPI-based nuclear staining. The resulting epithelial spatial domain of the TMA in Fig. 1a is shown in Fig. 1c.
The remaining cells are assigned to the stromal domain and are visualized in Fig. 1d. These stromal cells have diverse morphologies. 20 Based on the epithelial and stromal domains, SpAn also identifies a third epithelial-stromal domain, shown in Fig. 1e, to explicitly capture a 100 µm boundary wherein the stroma and malignant epithelial cells interact in close proximity. Together these three intra-tumor spatial domains comprise the virtual three-level spatial dissection of the tumor microenvironment that forms the basis for the SpAn spatial model overviewed in Fig. 1f  SpAn is used in place of the more typical approach of implicitly incorporating correlations as interactions between covariates (average biomarker expressions) within the prediction model 28 .
These explicit features not only detect the association between two biomarkers presumably mediated by intracellular and intercellular networks all within the same spatial domain but also by mediators (e.g., exosomes) derived from another spatial domain. As an example, SpAn finds enrichment of KEGG 'microRNAs in cancer' pathway in the epithelial and epithelial-stromal domains ( Figure 5), while concurrently selecting correlation between CD163 and PTEN as a feature in the stromal domain for recurrence prognosis (see Fig. 2c). As has been reported in gastrointestinal cancers, tumor cell derived exosomal miRNAs mediate crosstalk between tumor cells and the stromal microenvironment, and induce polarization of the macrophages to the antiinflammatory and tumor-supportive M2 state via activation of the PTEN-PI3K signaling cascade under hypoxic conditions resulting in enhanced metastatic capacity. 29,30 SpAn then uses CRC recurrence-guided learning to determine those specific spatial domain features that constitute the optimal subset for prognosis via model selection based on L1-penalized Cox proportional hazard regression method ( Figure 2b). 31,32 See Methods for details on penalized regression, Fig. S3 for validity of the proportional hazard assumptions, and Figure S4 for determination of threshold for concordance with recurrence outcome. A follow-up analysis of the selected features is performed to test the stability of their contribution to recurrence prognosis through testing the stability of the sign of the corresponding coefficients at the 90% threshold. The final domain-specific features are shown in Fig. 2c Fig. S5. As detailed in Methods, SpAn combines these domain-specific features weighted by their corresponding coefficients into a single recurrence guided spatial domain prognostic model, whose performance is shown in Fig. 3a. The results were obtained by bootstrapping (sampling with replacement) patient data set to generate 500 pairs of independent training and testing sets using stratified sampling that ensured the proportion of patients in whom cancer recurred in each of the five years remained the same in each bootstrap. For each bootstrap, SpAn used the training data for learning and the independent testing data to compute the receiver operating characteristic (ROC) curve. These ROC curves are shown in Fig. 3a along with the mean ROC curve. The mean area under the curve (AUC) for bootstrapped ROC curves is 88.5% with a standard error of 0.1%, demonstrating the stable performance of SpAn. We also maximized Youden's index 33 to identify the clinically relevant operating point on the ROC curves that minimized the overall misdiagnosis rate. Figure 3b shows the resulting sensitivity and specificity values for all bootstrap runs, with mean values respectively of 80.3% (standard error of 0.4%) and 85.1% (standard error of 0.3%). High specificity limits SpAn from misidentifying no-evidence-ofdisease patients as being at high risk of CRC recurrence, while at the same time good sensitivity allows SpAn to not miss high-risk patients. This is emphasized by a high positive likelihood ratio value of 7.2 (standard error of 0.23), which quantifies the large factor by which odds of CRC recurring in a patient go up, when SpAn identifies the patient as at being risk of CRC recurrence.
At the same time a small negative likelihood value of 0.22 (standard error of 0.003) quantifies the decrease in odds of CRC recurrence in a patient when SpAn identifies the patient as being at lowrisk. Finally, these results are brought together in Fig. 3c, which show the large separation in recurrence-free survival curves of patients identified by SpAn at low-and high-risk of five-year CRC recurrence.
Validating the rationale behind SpAn The rationale behind our 'virtual-dissection followed by combination of the three specific spatial domains' approach is motivated by the acknowledged active role of the microenvironment and its spatial organization, and the differential role played by the epithelial and stromal domains in tumor growth and recurrence. 2,3,14,34 We tested the validity of this rationale within the context of our data by comparing the performance of SpAn with the null model, which is based on recurrence-guided learning on the spatially undissected patient TMA spot. The learning procedure for the null model was identical to the domain-specific learning within SpAn. Figure 3d shows the improvement in performance quantified by the AUC of the bootstrapped ROC curves achieved by SpAn over the null model. This improvement is statistically significant with a p-value less than 0.005, allowing us to reject the null model at the 99% confidence level. The improved performance highlights the importance of explicitly modeling the epithelial and stromal spatial domains associated with the TME. Interestingly, beyond supporting our rationale, this comparative test shows that our approach of exploiting both biomarker expressions and their correlations results in absolute performances of both the null model and SpAn that are better than those from models using biomarker expressions alone, without correlations ( Fig. S6) such as published state-of-theart approaches that include Immunoscore®. 15,16 SpAn is highly predictive of 5-year CRC recurrence for TNM Stages I-III in individual patients The ability to identify patients in whom CRC will recur, especially for those patients in Stages II and III of tumor progression is highly clinically relevant. Figure 3e shows that SpAn can consistently identify patients in whom risk of CRC recurrence is high for Stages I through III, with mean AUC of bootstrapped ROC curves for the three stages respectively being 82.1%, 89.4% and 88.6%.
Standard error of these mean AUC values respectively is 0.4%, 0.2%, and 0.2%, demonstrating the stability of SpAn performance. Although the overall performance across all three stages is highly significant with the potential of improving prognosis, the relative reduction in Stage I performance is a consequence of a small cohort of only ten patients in Stage I with CRC recurrence.
The ability of SpAn to predict risk of recurrence in individual patients from all three Stages, is relevant in the context of administering adjuvant therapy, especially for Stage II patients. Current guidelines for treating Stage II CRC patients from The National Comprehensive Cancer Network (NCCN), 35 the American Society of Clinical Oncology (ASCO), 36

and the European Society of Medical
Oncology (ESMO) 37 do not recommend routine adjuvant chemotherapy for Stage II patients, but do state that it should be considered for sub-population of Stage II patients that are at higher risk and might benefit from being put on adjuvant therapy regimen. 38 The personalized prognostic potential of SpAn implies that we could triage Stage II patient cohorts into low and high-risk groups, with the latter being further considered for therapy. Furthermore, SpAn could help with postoperative surveillance of high-risk Stage II patients with more intensive follow-up regimes. 39 While 20% to 30% of Stage II CRC patients are at high-risk of recurrence, there are Stage III patients that have good prognoses of stable 5-year recurrence-free survival. SpAn, therefore, could also be used to fine-tune their postoperative surveillance and adjuvant chemotherapy regimens.
Prognostic performance of SpAn remains stable over the 5-year time period A majority of CRC recurrence occurs in the first five years, with 90% occurring in the first four. 40,41 We, therefore, consider the time-dependent performance 42 of SpAn during the first fiveyear period. Figure 3f plots the AUC for time-dependent ROC performance. The performance of SpAn in predicting risk of recurrence remains consistent and stable (95% confidence interval shown) with only a small, and gradual reduction in time-dependent AUC values as we move away from the resection and imaging timepoint. This result suggests SpAn captures the critical biological underpinnings of recurrence in the primary tumor. Supplementary Fig. S7 shows the timedependent AUCs for domain-specific temporal performance of SpAn.
SpAn identifies spatial domain networks as emergent properties that explain the robust ability to distinguish patients in which CRC recurs Given the highly prognostic performance of SpAn, we took a systems perspective to understand and to explain the underlying network biology responsible for this performance within each of the three spatial domains. For each domain, we quantified the unique associations between biomarkers included in the selected features through partial correlations between every biomarker pair, when controlling for other biomarkers as described in Methods. This approach was performed for all patients. The resulting partial correlation for every biomarker pair was separated into two groups according to no-evidence-of-CRC and CRC-recurrence patient cohorts and the information distance based on Jensen-Shannon divergence 43 was computed between them (see Methods for more details). The resulting domain-specific distance matrices, shown in Figs. 4a-c, define associated graphs with the nodes being the biomarkers and edge weights quantifying the differential change, the information distance, in biomarker association between patients in which CRC recurred and those in which there was no evidence of recurrence. The stronger the weights, the larger the distance and the more significant the differential change in association between the two markers for the two patient cohorts. We defined the graphs generated by the distance matrices thresholded at the 99 th percentile as the spatial domain networks that were most significant for CRC recurrence prognosis. Figures 4d-f show the resulting networks for the three spatial domains that reveal the heterogeneous nature of the cell populations and signaling pathways leveraged by SpAn in CRC recurrence prognosis.
The epithelial-stromal domain network is comprised of three dominant sub-networks associated with tumor-invading T lymphocytes, 44 disruption in DNA mismatch repair cellular process, and the role of cancer associated fibroblasts (CAFs) in the desmoplastic microenvironment as indicated by the strong edge weight between smooth muscle actin (SMA) and collagen IV. CAFs are well known to promote EMT 45 and the differential expression of betacatenin and phosphorylated-MET in Fig. 4f is also consistent with the epithelial-stromal domain supporting the mesenchymal phenotype. 46 These features have also been identified with those distinguishing consensus molecular subtype (CMS) 4 that is associated with a poor prognosis in comparison to the other 3 subtypes in the transcriptome-based classification. 12,13 Interestingly, the epithelial-stromal spatial domain also reveals the presence of DNA mismatch repair network that has been associated with regulation of T lymphocyte infiltration, a prominent feature of CMS1. Thus, the epithelial-stromal spatial domain associated with recurrence combines two features, where in contrast, each alone is associated with two different CMS subtypes. This theme extends to the epithelial spatial domain in Fig. 4d, where metabolic deregulation, a prominent feature of CMS3, and DNA mismatched repair, a hallmark of CMS1 are evident. The association of these two subnetworks in the epithelial domain has the potential to promote tumor cell growth while escaping immune surveillance. Finally, we observe a prominent tumor associated macrophage (TAM) network in the stromal spatial domain (Figure 4e). TAM polarization towards the M2 phenotype regulated by AKT/PTEN has been associated with poor prognosis in CRC that could result from their immunosuppressive and matrix remodeling phenotypes. 30 Spatial domain networks reveal domain-specific network biology of CRC-relevant pathways We used STRING 47 and KEGG 48 databases to identify pathways enriched by biomarkers within each of the spatial domain networks and further corroborate their connections to prominent features in the CMS subtype classification. Figure 5 shows the pathways enriched in each of the three spatial domains, and further identifies those that are common to a majority of at least two of the three spatial domains. Since their identification is based on the spatial domain networks we computationally identified as significant for CRC recurrence prognosis, these pathways play a differentially important role in prognosis of CRC recurrence.
CMS2 tumors are associated with chromosomal instability pathway and enrichment of genes associated with cell cycle and proliferation. Interestingly, both Pi3k-Akt signaling and cell cycle pathways enriched in our analysis are associated with CMS2 tumor subtype, with almost 60%-70% of CRCs associated with dysregulation of Pi3k-Akt signaling pathways. 49 Tumors associated with the CMS4 mesenchymal phenotype show upregulated expression of genes involved in epithelial-to-mesenchymal transition along with increased stromal invasion, angiogenesis and transforming growth factor-β (TGF-β) activation. 12,13 Interestingly, proteoglycans in cancer, focal adhesion and microRNAs in cancer pathways enriched in our analysis enable the mesenchymal phenotype. For example, non-coding microRNAs both regulate and are targets of upstream regulators for modulating the epithelial to mesenchymal phenotype by targeting EMT-transcription factors such as ZEB1, ZEB2, or SNAIL. 50 Similarly, the focal adhesion pathway through the integrin family of transmembrane receptors mediates attachment to the extracellular matrix, and when dysregulated promotes cell motility and the mesenchymal phenotype. 51,52 Furthermore, extracellular and cell surface proteoglycans with their interaction with cell surface proteins such as CD44 have been known to promote tumor cell growth and migration. 53,54 Our analysis suggests that by capturing correlation-based crosstalk between heterocellular signaling pathways, SpAn leverages the interconnections between the subtypes for a high performing CRC recurrence prognosis and reveals a synergistic role of the CMS subtypes in CRC progression and recurrence. We note that the ability of SpAn to leverage these interconnections is due to the spatial-context-preserving sampling of a diverse set of CRC-relevant biomarkers enabled by HxIF imaging.
Interestingly, this network biology paradigm also shows enrichment of pathways specific to a single spatial domain whose oncogenic or tumor suppressive roles in CRC is an active area of research but whose differential role in CRC recurrence has not been widely studied. For example, in the epithelial domain our analysis shows the enrichment of Thyroid hormone signaling pathway that has been associated with a tumor suppressive role in CRC development. 55,56 In contrast, the bacterial invasion pathway, enriched in epithelial-stromal boundary region, has been implicated in the oncogenic role of the colonic microbiome in CRC development. 57,58 Our analysis also reveals enrichment of certain other pathways, such as the hypoxiainducible factor 1 (HIF-1), human epidermal growth factor receptor 2 (HER2) and T-cell receptor signaling pathways in the epithelial domain. Hypoxia is typical in many solid tumors in CRC with HIF-1 regulating tumor adaptation to hypoxic stress. 59 Alterations in Her-2 signaling, either through genomic amplification or mutations is tumor promoting, and anti-HER2 therapies for preventing CRC recurrence and are a focus of on-going work. 60 We finally note that MAPK and PI3K-AKT signaling cascades are implicated in many of the above discussed pathways.

Discussion
This study highlights the importance of spatial context of the primary tumor microenvironment in conferring distinct malignant phenotypes such as recurrence in CRC. We show how a computationally unbiased approach can be implemented through statistical modeling of spatially defined domains leading to a highly specific and sensitive platform for prognostic and diagnostic tests, as well as potentially inferring therapeutic strategies ( Figure 6). SpAn provides testable hypotheses regarding how the spatial association of common networks could potentially lead to SpAn, when used in combination with a hyperplexing imaging platform such as Cell DIVE TM allows mechanistic hypotheses to be tested through iterative probing of the same spatial domains with additional biomarkers inferred by the pathway analyses ( Figures 5 and 6). We expect even more specific mechanistic biomarkers to emerge based on the iterative hyperplexed imaging approach that incorporates finer stage-based focus, thereby reducing the total number of biomarkers needed for optimal analyses. We will be pursuing this in subsequent studies. This feature of SpAn combined with a hyperplexing imaging platform will not only allow refinement of its prognostic ability, but since the iterative analysis can be potentially conducted in real time with further advances in the technologies, it may allow the prognosis to be specific to individual patients. Importantly, by enabling the testing of mechanistic hypotheses in patient samples directly connected to a specific clinical outcome, SpAn may inform therapeutic strategies to prevent the outcome. The next phase of this work will take advantage of sampling multiple regions of primary tumors with larger TMAs and/or whole side sections, exploring other spatial analytics ranging from simple to sophisticated spatial heterogeneity metrics 65 and incorporating a combination of protein and nuclei acid biomarkers. 66 The present retrospective study provides the foundation for the implementation of SpAn in prospective studies predicting disease outcomes in patients with CRC and other carcinomas.
The high specificity and sensitivity of SpAn lies in its ability to unbiasedly identify emergent networks that appear to be closely associated and likely to be mechanistically linked to recurrence.
We anticipate that hyperplexed datasets based on multiple imaging modalities will be generated faster and become less expensive as the technology evolves to become a mainstay tool to analyze solid tumors.

Methods
Cell DIVE-based hyperplexed imaging (HxIF) of tissue microarrays (TMA). The 55 biomarkers plus DAPI nuclear counterstain included in this study are described in Supplementary Table S1. HxIF imaging of a TMA slide was performed using sequentially multiplexed labeling and imaging of 2 to 3 biomarkers along with DAPI counterstain through a label-image-chemical-inactivation iterative cycle visualized in Supplementary Fig. S1, and previously described 1 in detail in the supporting information therein.
Broadly, the supporting information details the hyperplexed immunofluorescent workflow with information on iterative cycles of antibody labeling of single 5 µm formalin-fixed and paraffin embedded tissue sections and TMA slides, autofluorescence removal, imaging, and dye inactivation in tissue. All samples were stained and imaged in a single batch for 2 to 3 biomarkers and DAPI at a time.
Image processing and single cell analysis. DAPI based nuclear staining was used to register and align sequentially labeled and imaged TMA spots prior to downstream image analysis steps. 1,2 Autofluorescence was removed from the stained images, 1,3,4 which were then segmented into epithelial and stromal regions (Fig. S2), differentiated by epithelial E-cadherin staining. This was followed by segmentation of individual cells in both the epithelium and stroma. Epithelial cells were segmented using Na + K + ATPase based cell-membrane staining to delineate cell borders and membrane regions, the cytoplasmic ribosomal protein S6 for cytoplasm identification, and DAPI stain for nuclear regions.Protein expression level and standard deviation were subsequently quantified in each cell.
Quality checks and data normalization. Following single cell segmentation, several data preprocessing steps were conducted. These included cell filtering, spot exclusion, log2 transformation and slide to slide normalization. Cells were included for downstream analysis if their size was greater than 10 pixels at 20X magnification. The hyperplexing process can result in the tissue being damaged, folded or lost. Image registration issues can also result in poor-quality cell data.
Therefore, a tissue quality index based on the correlation of that image with DAPI was calculated for each cell for each round. Only those cells whose quality index equals to or greater than 0.9 (meaning that at least 90% of the cells overlapped with DAPI) were included. All the slides for all the biomarkers were adjusted to a common exposure time per channel. The data were then log2 Kendall rank-correlation was chosen as the correlation metric because it is a non-parametric measure of association between two biomarkers. Moreover, its use of concordant and discordant pairs of rank-ordered biomarker expression for computing correlation coefficients allows it to robustly capture biomarker associations in presence of measurement noise and small sample size.
Rank-correlation for each pair of biomarkers was computed for each spatial-domain from all cells across the spatial-domain expressing the biomarkers. This approach is distinctly different from prediction models that typically consider correlations via interactions, implicit within the models, between mean biomarker intensity expressions -with the biomarker expressions being the only covariates of the model. 5 We emphasize that we did not compute correlations through mean is set to 1 in penalty :,< ( ), followed by (2) L2-regularized estimation, with set to 0 in the penalty term. 6,7 As part of model selection, SpAn selected the model that maximally explained the deviance of the null model (model with only intercept and no predictive features) from the biased but perfect model with an exact fit to the recurrence data. Due to the LASSO penalty, the resulting optimal model forces the coefficients of vector that correspond to features that play a minimal role in predicting risk of recurrence to zero. The subset of features with non-zero coefficients defines the selected model. However, to ensure that this subset identifies a stable set of features, SpAn repeated model selection over 500 bootstraps, and included only those features that were stably concordant at the 90% level with the recurrence outcome. The rationale for 90% concordance is discussed in supplementary Fig. S4. The values of coefficients corresponding to the features stably selected using L1-penalized Cox regression, however, are conditioned on the 1540 input features. To remove this dependence, the second step of SpAn re-estimated the beta coefficients corresponding to the selected features only, by maximizing the partial likelihood function with L2-regularization as the penalty. The resulting beta-coefficients were passed through a final stability check, where the stability of the coefficient sign in 90% of the 500 bootstrap runs was tested, and only features that passed this threshold (Fig. 2c) were included in the spatial domain model. SpAn performed this process independently for each of the three spatial-domains resulting in domain-specific recurrence-guided features (Fig. 2c) and their coefficients (Fig. S5).
Spatial-Model. Each of the three recurrence-guided domain-specific models defined a hazard risk given by 1 . 2 , N . O and 1 a . a 2 for the epithelial, stromal, and epithelial-stromal domains respectively. SpAn then defined the final overall risk of recurrence model as ∏ Partial correlations and spatial-domain networks. For each spatial domain, the selected features identified a set of biomarkers specific to predicting risk of CRC recurrence. SpAn used them to define a space of biomarkers within which partial correlations between every pair was computed by controlling for confounding effect of biomarkers not defining the pair. 8 The process performed on each patient was as follows: Let the set of biomarkers identified by the selected features be N (<= 55). Using the already computed Kendall rank-correlations between the 55 biomarkers, an × correlation matrix corresponding to the biomarkers was constructed, with small shrinkage-based modification to guarantee its positive definiteness, and therefore, its invertibility.
Next, the × precision matrix was computed by inverting . The partial correlation between any two biomarkers P and t within the set identified by the selected features, was then computed using vA , ,vA w = xy z{ , ,z{ w | y z{ , ,z{ , •y z{ w ,z{ w , where vA , ,vA w is the ( , ) •€ element of the precision matrix . The partial correlations were performed for all patients and were then separated into two groups corresponding to patients with no evidence of disease and those patients in which CRC recurred. Probability distributions of the partial correlations -on the compact set [−1,1] -within each group were computed and the information distance between these two distributions was computed using the Jensen-Shannon divergence. This information distance defines the differential change in the association -partial correlation -between biomarkers P and t in the two patient cohorts. Greater the distance, larger the differential     Epithelial-stromal domain spatial domain networks obtained by thresholding the corresponding spatial domain information distance matrices at the 99 th percentile, which identify differential change most significant for CRC recurrence prognosis. Figure 5. CRC recurrence-specific network biology inferred by SpAn. Domain-specific biomarkers identified by the spatial domain networks are used to interrogate the KEGG and STRING databases to identify domain-specific pathways enriched by the biomarkers. The epithelial, stromal and epithelial-stromal domain are respectively shown in green, red and blue, with pathways unique to those domains also coded with the same colors. Pathways that are enriched in more than one domain are coded with a color combination of those respective domains. For example, the PI3K-AKT signaling pathway is enriched in all three spatial-domains, and therefore, has a boundary box color-coded with all three colors. On the other hand, the mismatch repair pathway is enriched in the epithelial and epithelial-stromal domains, and is therefore, color-coded by red and blue colors.       Figure S1. Cell DIVE hyperplexed immunofluorescence imaging and processing scheme. For the K th iteration, autofluorescence image of the tissue is acquired prior to labeling it with 2-3 fluorescent dye-conjugated primary antibodies and DAPI counterstain. Fluorescent-labeled tissue images are then acquired, followed by inactivation of the dyes and start of the next iteration. 20 Figure S2. Tissue and cell segmentation. TMA spot, visualized here through a virtual Hematoxylin and Eosin image, is segmented into epithelial and stromal regions using E-cadherin to identify epithelial region from the stroma, and is followed by individual cell identification using Na + K + ATPase (cell membrane marker), ribosomal protein S6 (cytoplasmic marker) and DAPI (nuclear counterstain). Figure S3. Validity of the proportional hazard assumption in penalized Cox regression. The pvalues (shown in log scale) indicate the validity of the proportional hazard assumption for the overall Cox regression for all the 500 bootstraps and for each of the three domains. It can be seen that the overall global test is not statistically significant at the 95% confidence level (indicated by the dashed line) for almost all Cox models generated for the 500 bootstrap runs, demonstrating that the proportional hazard assumption is consistently valid. Figure S4. Rationale for choosing 90% concordance rate. Plot of concordance of the penalized Cox regression model as a function of a threshold function that identifies the biomarker features most consistently selected by L1 penalization at the concordance level corresponding to the threshold. The larger the threshold the more stringent the consistency requirement on feature selection, and smaller the number of selected features. As shown in the plot, for low threshold values, the concordance value is saturated, and therefore, in this region injective correspondence between threshold value and concordance does not exist. In the monotonic decay region such a correspondence can be identified. The 90% concordance level identifies such a correspondence for all three spatial domains without compromising performance. Figure S5. Coefficients for the recurrence-guided and domain-specific penalized Cox regression models of SpAn. Boxplots for coefficients that control the contribution of the selected features (obtained using L1-penalty) to each of the recurrence-guided and domain-specific penalized Cox regression under L2 regularization. The coefficients were computed for all 500 bootstrap runs and the boxplots capture the spread of values. It is worth noting that for all bootstraps the coefficients maintain their sign, which quantifies the nature of their contribution. A positive coefficient implies worse prognosis for increase in the corresponding feature value, while negative coefficient implies the converse. Figure S6. Performance of SpAn platform for predicting risk of 5-year CRC recurrence for intensity-based features. SpAn ROC curves for predicting risk of 5-year CRC recurrence in patients with resected CRC primary tumor using only biomarker expressions. The plot shows ROC curves for 500 bootstrap runs with independent training and validation sets. Mean area under the ROC curve is 72% with a standard error of 0.2%. Figure S7. Time-dependent AUCs for domain-specific temporal performance of SpAn. Temporal performance of SpAn for the three spatial-domains illustrated by the time-dependent AUC values plotted as a function of time in years. The 95% confidence interval computed using the 500 bootstraps for each of the three spatial-domains is also shown by the yellow shaded area around the mean time-dependent AUC values depicted by the solid black line.