## Introduction

Biomarkers improve patient care and impact therapeutics development1. The risk of a disease, e.g., recurrence of cancer, is routinely assessed by prognostic biomarkers2. The development of tissue prognostic biomarkers for breast cancer (exemplary of most diseases) has historically followed the one-biomarker-at-a-time approach, in which candidate biomarkers underwent representative multiregion sampling to suppress the spatial intratumor heterogeneity highlighted by multiregion sequencing and gene expression3. The multiregion sampling may take a virtual form such as multiple microscopic views of one histologic tissue section, or a physical form such as multiple microdissections from one tissue section, multiple tissue sections/blocks or tumor microarray cores from one primary tumor, and specific multisite sampling4. The former is frequently used in standard (H&E) histology and immunohistochemistry (IHC), whereas the latter in molecular profiling5 (MP) such as multigene assays6. Regardless of the associated natures (morphological vs. molecular) and methods (H&E, IHC, and MP), single spatially homogenized biomarkers robust against the intratumor heterogeneity have been sequentially developed from underlying in situ biomarkers and included into various multivariate prognostic models for clinical use (Fig. 1a).

This homogenization of biomarkers is beneficial due to: (i) reliable pathology reports with less interobserver discordance; (ii) simple patient stratification for biomarker validation; and (iii) easy integration with other non-in-situ biomarkers into various prognostic models7. On the other hand, it is costly due to: (i) the discordances between primary tumors and distant metastases, e.g., discordance of estrogen receptor (ER) (or progesterone receptor PR) observed in 7–25% (or 16-49%) of breast cancer patients8; (ii) the subjectivity of disease-dependent criteria and thresholds, e.g., the histological grade based on averaged tubule formation but the highest mitotic count and degree of nuclear pleomorphism over sampled regions/views9; and (iii) small number of homogenized biomarkers in comparison to in situ biomarkers, which contributes to few reported biomarkers implemented into clinical practice10. A natural question arises whether the homogenization self-evidently leads to a high benefit-to-cost ratio, or whether it is self-evident to conduct breast cancer prognosis using the homogenized (rather than in situ) biomarkers (Fig. 1b).

In the simplest case of 1-to-1 correspondence between one homogenized biomarker and one in situ biomarker (e.g., regional percentage of ER + tumor nuclei vs. ER overall status, Fig. 1a), the benefits of the homogenization generally outweigh the corresponding costs because the candidate biomarker (ER) is not placed in the context of (does not spatially interact with) other biomarkers. However, ER overall status is rarely used as a standalone prognostic biomarker but routinely combined with other single homogenized biomarkers in a multivariate prognostic model. In this case of N-to-N correspondence between homogenized and in situ biomarkers, the balance of benefit-to-cost ratio may tip against the homogenization owing to the loss of information on implicit biomarker-biomarker interactions from multi-biomarker spatial heterogeneity (Fig. 1c). That is, multiple in situ biomarkers with co-registered multiregion sampling may derive differential prognostic value from these interactions over the multivariate non-graph prognostic model wherein the in situ biomarkers are developed into the corresponding homogenized biomarkers (Fig. 1c). This “hidden” prognostic value has not been demonstrated to date possibly due to: (i) technical difficulty to spatially co-register many in situ biomarkers among consecutive sections and different microscopic methods or scales (H&E and IHC); (ii) high cost of multiregion MP for a large cohort of patients; and (iii) lack of a computational representation for the biomarker-biomarker interactions.

We aim to demonstrate this value by representing the recently reported tumor-associated collagen signatures11 (TACS1-8), i.e., multiple co-registered biomarkers from multiphoton microscopy (MPM), with a graph neural network12 of primary breast tumor13. It should be noted that TACSs and other collagen-based biomarkers have been extensively studied14 and explored for clinical translation15. The resulting prognostic model of intratumor graph neural network (IGNN) not only forms a high-performance kernel to include additional in situ biomarkers (from H&E, IHC, MP, etc.), but also provides post hoc interpretations of various implicit interactions among the included biomarkers with distinct biological/medical insights (Fig. 1a−c). Our distance-less IGNN model based on MPM may motivate the development of similar models based on other imaging and non-imaging tissue assessment technologies (Supplementary Table 1), as long as the biomarkers possess spatial heterogeneity.

## Results

### Construction of personized IGNN structure

For a breast cancer patient characterized by one histologic formalin-fixed paraffin-embedded (FFPE) tissue block, the co-registration between H&E and MPM from two consecutive histologic tissue sections (4 µm) ensured the localization of several local MPM regions in global H&E-revealed tumor area and extraction of in situ TACS data11. One section was stained with H&E for whole slide imaging, in which a pathologist confirmed the presence of tumor cells and their borders. Dependent on the size of tumor area for sufficient sampling, several (4–20) ~2.8-mm-sized non-overlapping regions of interest (ROI) were located mainly at the tumor invasive front and then labeled (numbered) in the H&E images. The other (unstained) section was deparaffinized by alcohol and xylene to collect label-free dual-modal MPM of second harmonic generation (SHG) and two-photon excited (intrinsic) fluorescence (TPEF) images16 for all labeled ROIs (Fig. 1d).

Based on the resulting personized MPM regions and heterogeneous regional distribution of TACS1-8 (8 in situ binary biomarkers), we constructed the IGNN to represent the MPM regions (nodes) with specific TACS1-8 distributions (node attributes present as 8-bit vectors) and their interactions (edges) with learnable parameters (Fig. 1d). This structure cast the prognostic prediction of many patients as a supervised Cox proportional hazards regression learning task by establishing a nonlinear mapping between TACS1-8 spatial heterogeneity and observed prognosis (Supplementary Fig. 1 and Supplementary Fig. 2). It consisted of multiple stacked nonlinear functional layers (Fig. 1d, Supplementary Fig. 3, and Supplementary Table 2), which were trained end-to-end by back propagation algorithm. First, graph convolution was used to represent intratumor graph nodes and infer their regional interactions. Two graph convolution layers followed neighborhood aggregation framework17 in which the node attribute embeddings of graph structure were alternately updated by implementing message-passing mechanism18. Then, attention mechanism and optional gated recurrent units (GRU)19 were introduced within graph convolution layers to capture more distinguishing features and reduce the gradient disappearance and over-smoothing problems during the model training phase. Later, the convoluted graph representation from all node attribute embeddings was further aggregated within a global pooling layer and abstracted by a fully connection layer. Finally, the graph representation was converted into IGNN score by a prognostic risk prediction layer to predict patient prognosis.

As a framework based on neural network, our IGNN might incorporate other quantifiable prognostic biomarkers as multimodal input by adding the appropriate multilayer perceptron module20 and feature fusion layer before the first full connection layer. To evaluate whether combined IGNN score and traditional clinicopathological factors could synergistically improve the performance of cancer prognosis, we developed an extended model (IGNN-E) by incorporating clinical information (routine prognostic biomarkers) to the basic IGNN model (Supplementary Fig. 3 and Supplementary Table 3), just like how an extended model (Nomogram) of multivariate Cox proportional hazard regression was developed by incorporating this clinical information to the basic TACS1-8 model11,21. The cost of biomarker homogenization was assessed by the differential prognostic value of the IGNN (or IGNN-E) model over the TACS1-8 (or Nomogram) model (Fig. 1c, d).

### Differential prognostic value from diverse patients

We compared four prognostic models with different biomarkers (Supplementary Table 4). Using a training cohort of 731 patients, we performed the pre-validation22 based on 3-fold cross-validation for each of the four models (Supplementary Fig. 1). Personalized prediction scores for each instance held out during cross-validation were computed, and then all results from single cross-validation steps were combined as the model-specific validation results for the training cohort to assess their risk stratification capabilities (Supplementary Fig. 4 and Supplementary Table 5, see Methods). We subsequently retrained these models using the whole training cohort and finally validated the trained models using an independent validation cohort of 264 patients to further assess their prognostic value (Supplementary Fig. 1). For both cohorts in external validation, a correlation analysis was conducted between model scores and disease-free survival (DFS) through the Pearson correlation, which indicated stronger association of IGNN (or IGNN-E) score with DFS than TACS (or Nomogram) score (Supplementary Fig. 5a). The relative strength of various prognostic biomarkers to predict DFS, which was assessed by multivariate Cox proportional hazard regression analysis (Supplementary Table 6), indicated stronger dominance over other prognostic biomarkers by the IGNN score than the TACS score (Supplementary Fig. 6). Just like the TACS score, the IGNN score functioned as an independent prognostic factor along with tumor size, lymph node status, and molecular subtype (Supplementary Table 6).

We next calculated model-dependent survival curves in which patients in the training (or validation) cohort were stratified into low- and high-risk groups via the optimal cutoff point dictated by the two-sided log-rank test statistics in model training phase. In comparison to the TACS1-8 (or Nomogram) model, the IGNN (or IGNN-E) model produced more diverged survivals for low- and high-risk patients and more diverged distributions of predicted score for patients with DFS less versus more than 5 years (Supplementary Fig. 5b). The stronger prognostic strength of the IGNN (or IGNN-E) model over the TACS1-8 (or Nomogram) model was also revealed by larger values of hazard ratio (HR) (stratification ability), concordance index (C-index), integrated cumulative/dynamic AUC (iAUC), the associated area under receiver operating characteristic curves (AUC) (Fig. 2a), and specificity/sensitivity to predict 5-year DFS rate (Supplementary Table 7). The relatively large number of patients from rather disparate populations and institutions in southern (training cohort) and northern China (validation cohort) strengthened the statistical significance of the IGNN model free of new biomarkers.

All these results demonstrated the differential prognostic value of the IGNN over TACS1-8 model by simply accounting for the biomarker-biomarker interactions from multi-biomarker spatial heterogeneity. With no new biomarker beyond TACS1-8 (8 in situ biomarkers), our IGNN recovered this differential prognostic value from regional interactions among TACS1-8, which approximated the differential prognostic value from the combination of routine biomarkers (i.e., the Nomogram over TACS1-8 model) (Fig. 2a), particularly for patients with a small (<2 cm) tumor (Fig. 2b). These two differential prognostic values were largely additive from the TACS1-8 to IGNN-E model (Fig. 2a, b), indicating rather independent prognosis of these regional interactions from the routine prognostic biomarkers. Thus, the combined IGNN score and traditional clinicopathological factors synergistically empower cancer prognosis.

### Assessment on patient subgrouping and informed treatment

To examine how the observed differential prognostic value benefited different subgroups of patients, we combined the patients in both cohorts (n = 995) and divided them into different subgroups according to various clinicopathological factors (Supplementary Table 8). Interestingly, the observed differential prognostic value of patients with a small ≤2 cm tumor approximated that from the combination of routine biomarkers, whereas patients with a moderate 2–5 cm or large >5 cm tumor obtained a less significant result (Fig. 2b). The observed disparity in differential prognostic value was likely caused by the representative (nonrepresentative) multiregion sampling of a small (large) tumor using one section from a FFPE tissue block (~2 × 2 × 0.4 cm3). This disparity might be correlated with similar disparities observed from different histological grades, nodal statuses, and TNM-stages (Supplementary Fig. 7).

The single most surprising result was the large differential prognostic value from patients with a small tumor (n = 445). Consistently, the dominance of the IGNN score (in contrast to the TACS score) over other prognostic biomarkers was more pronounced in this subgroup of patients than the whole group (Fig. 2c vs. Supplementary Fig. 6). This implied that the recovery of previously neglected regional biomarker-biomarker interactions among existing biomarkers could empower cancer prognosis more than the addition of new prognostic biomarkers. Thus, for this subgroup of patients, in situ biomarkers should be chosen over homogenized biomarkers for more accurate cancer prognosis.

To evaluate the benefit of the IGNN model beyond risk assessment, we investigated the corresponding utility to inform the postoperative adjuvant therapy of all 995 patients according to the consensus treatment guideline23. Both TACS score and IGNN score reclassified a significant portion of low-risk (high-risk) patients according to the guideline into high-risk (low-risk) patients, and thus could minimize undertreatment (overtreatment) (Fig. 3a). Based on the retrospectively determined 5-year DFS rate, in contrast to the TACS score that might spare 125/112 patients from overtreatment/undertreatment according to the guideline, the IGNN score would spare 138/106 patients from this undertreatment/overtreatment (Fig. 3b). As to the 445-patient subgroup, in contrast to the TACS score that might overtreat 94 patients, the IGNN score would overtreat 52 patients and thus spare 42 patients from this overtreatment (Fig. 3c, d).

To sum up, the shift from the TACS to IGNN model not only empowered cancer prognosis but also could improve subsequent informed treatment, particularly for patients with a small tumor (representative multiregion sampling).

### Post hoc interpretations of biomarker-biomarker interactions

How the IGNN model learns the implicitly prognostic interactions among TACS1-8 is valuable to understand the TACS1-8 themselves beyond morphology. The message passing of the graph convolutional layers allows nodes to progressively aggregate the most relevant messages from their neighborhoods via the attention mechanism. In the training process, observable and understandable model responses are important to investigate how the node states evolve in individual patients. Without loss of generality, we randomly interrupt a training process of the IGNN model after sufficient training without any prior knowledge (see Methods), then obtain the corresponding feature vectors from the input and graph convolutional layers of the trained IGNN (Fig. 1d) and compute the similarity between regional pairs through the heat map of Pearson correlation coefficient matrix (Supplementary Fig. 8). The initial node state is encoded according to regional presence of TACS(s). The correlation of the regional pair with the initial state reflects the extent to which they contain the same TACS(s). However, as nodes and their neighbors interact through the graph convolutional layers, the node state vectors are constantly updated, resulting in the correlation changes among the nodes.

We apply the above method to the training cohort and find that node (ROI) state feature output from graph convolutional layer 2 exhibits distinct pattern and order according to regional interactions among TACSs. The heat map clearly divides thousands of MPM regions from 731 patients into three clusters. The regions with TACS5 or TACS6 accounts for 79% in Cluster 1, the regions with TACS4 accounts for 90% in Cluster 2, and the regions with TACS1 accounts for 92% in Cluster 3 (Fig. 4a). Analysis of similarities (ANOSIM) of the output feature vectors from the graph convolutional layer 2 for the nodes (ROIs) shows significant differences between the three clusters (Fig. 4a). Similar results are obtained from the validation cohort (Fig. 4b) and patients with a small tumor (Fig. 4c), indicating that the IGNN model captures prognostic value of TACS1-8 interactions by learning from a specific task.

A number of post hoc interpretations of TACS1-8 emerge in consistency with their morphologies (Fig. 4d): (i) high positive correlation between regions with TACS5 and TACS6 indicates their converging function and possibly synergistic interaction in tumor cell invasion; (ii) low correlation between regions with TACS4 and TACS5,6 implies that their effects on tumor cell invasion are relatively independent or different (TACS4 appears to block tumor cell invasion); (iii) low or negative correlations between regions with TACS1 and regions with TACS4,5,6 confirm that TACS1 is characteristic more of tumorigenesis than tumor cell invasion; and (iv) no clusters with high correlation between intra-cluster regions contain a high percentage of regions with TACS2, TACS3, TACS7, and TACS8, which may be due to the relative rarity of TACS2,3,7,8 in the samples. Additionally, the proportions of TACS2,3,7,8 are unevenly distributed across different clusters (Fig. 4a–c, middle), and ANOSIM shows that ROIs with TACS3 or TACS8 correlate more with Cluster 1 than other clusters while ROIs with TACS2 (TACS7) correlate more with Cluster 3 (Cluster 2) (Supplementary Fig. 9). Considering collagen morphologies, we hypothesize that TACS2,3,7,8 affect survival by interacting with other TACSs. For example, sparsely distributed collagen fibers at the tumor invasion front (TACS8) is conducive to TACS5,6 that enables tumor cell migration, while densely distributed collagen fibers at the tumor invasion front (TACS7) is conducive to TACS4 that restricts tumor cell migration. It should be noted that these insightful interpretations are not available from the TACS1-8 model with homogenized TACSs11.

Another post hoc interpretation arises from the correlation analysis between individual TACSs and TACS/IGNN score, which shows a generally lower correlation of individual TACSs with IGNN score in comparison to the TACS score (Fig. 4e). Because the IGNN model accounts for the interactions among TACSs, the prognostic importance (absolute values of Spearman’s rank correlation coefficients) of TACSs becomes less in the IGNN model vs. TACS1-8 model (Fig. 4e).

## Discussion

The results demonstrated in this study promote a shift of disease prognosis from conventional (spatially homogenized) biomarkers to in situ biomarkers that reflect intratumor heterogeneity (Fig. 1a, b). Also, for existing homogenized biomarkers (e.g., ER in Fig. 1a), the corresponding in situ biomarkers may be added to TACS1-8 to further improve the IGNN-E model (Fig. 2a). The required co-registration of in situ biomarkers among MPM, IHC, and H&E can be ensured by collecting and examining consecutive FFPE tissue sections. Moreover, the individual biomarkers of IGNN may be expanded from binary biomarkers (TACS1-8) to those with multiple or continuous values (e.g., regional percentage of ER+ tumor nuclei, Fig. 1a), in order to retain full prognostic information. Finally, further optimization of region size (2.8 mm in this study) for more informative spatial biomarker distribution (Supplementary Fig. 10) and optimization of model depth for a larger patient size may increase the performance of prognosis. Thus, cost-effective precision medicine may be obtained by taping current underappreciated biomarkers rather than developing novel biomarkers, because the hidden prognostic value of regional biomarker-biomarker interactions can be recovered by an IGNN model.

This study exploits intratumor heterogeneity to improve biomarker utility, defying the conventional wisdom that intratumor heterogeneity is dismal for biomarker development24. A few other studies attempted to derive prognostic values from the spatial heterogeneity of cellular25 or genetic biomarkers26,27. However, whether biomarker intratumor heterogeneity can serve as a practical prognostic biomarker remains controversial28,29. Here we avoid this controversy by deriving prognostic values from regional biomarker distributions (TACS1-8 model) and biomarker-biomarker interactions (IGNN model) originated from the intratumor heterogeneity, rather than from the diversity of regionally distributed TACS1-8 corresponding to the intratumor heterogeneity itself.

As an early attempt to exploit the benefits of intratumor heterogeneity, we have chosen a size of ~2.8 mm for ROIs, each of which produces ~1.3 distinct features of TACS1-8 (Supplementary Table 9). For this size, representative sampling of the tumor invasion front of a typical histological section requires 4-20 ROIs, depending on the dimension of the tumor area. The number of ROIs seem adequate because random removal of 20% of ROIs for individual patients results in rather small inconsistency in prognosis (Supplementary Table 10). Thus, it is unlikely that our particular choice of ROIs (number and position) introduces an arbitrary level of intratumor heterogeneity. However, our combination of ROI size, sampling site (invasion front), and number of ROIs may not be optimal to attain the full potential of the IGNN model. Future efforts to balance the trade-off between ROI size (content of TCASs in one region) and ROI number (content of TACS-TACS interaction among regions), and the trade-off of overall TACS content versus cost, may yield better overall performance in a clinical setting. The extension of TACS content from tumor invasion front to tumor center may further improve the performance.

Our multi-biomarker IGNN enables unique integrative research on tumor microenvironment from the perspective of cancer risk. The corresponding model based on MPM-revealed extracellular matrix remodeling30 achieves a high performance (Figs. 2 and 3) among tumor microenvironment-centered prognostic models based on MP-revealed stromal gene expression31, H&E-revealed stromal architectures32, H&E-revealed tumor-infiltrating lymphocytes33 or stromal cells34, and IHC-revealed specific immune cells35. This high performance may arise from the general role of extracellular matrix in modulating the hallmarks of cancer36. High-performance cancer prognosis has not been attained in previous studies14,37 on TACS1-3 and aligned collagen possibly due to their emphasis on developing single and/or homogenized biomarkers. To further improve the IGNN model, it is reasonable to complement TACS1-8 with the above tumor microenvironment components by co-registering the corresponding biomarkers (Fig. 1c). A more complete picture of tumor microenvironment should further include tumor cell components (e.g., histological or molecular type discussed above). The crosstalk among various tumor cell and microenvironment components may be interpreted post hoc from the regional biomarker-biomarker interactions (Fig. 4). It will be interesting to compare the independent generation of prognostic values under the conditions of single- versus multiple-biomarker spatial heterogeneity (Fig. 1c).

The graph-based prognostic model improves the performance of its conventional non-graph counterpart, and at the same time, retains or enhances the interpretability of the underlying biomarkers. In contrast to various prognostic morphological biomarkers derived from graph-based38 and non-graph machine/deep learning in computational histopathology39,40,41,42, our in situ biomarkers (regional heterogeneity of TACS1-8) were identified in a discovery process11 independent of subsequently developed IGNN prognostic model (Fig. 1b), so that they are more amendable for human interpretations (Fig. 4). Also, the contextual information connecting sampling regions are conveniently represented by the edges connecting the nodes (Fig. 1d), relieving the somewhat challenging task in typical deep learning to model the spatial correlations between neighboring patches (by combining low- and high-resolution inputs)43. Our IGNN significantly departs from prior GNN research that centers graph nodes on cells44, which assumes that tumor cell-cell distances and interactions have the highest prognostic value13. Strikingly, the “functional” interactions among TACSs recover a differential prognostic value larger than that of combined routine prognostic biomarkers for small tumors (Fig. 2b), without accounting for internode distances in the graph edges (see Methods). It is reasonable to assume that future inclusion of these distances may further improve the performance of cancer prognosis, particularly that associated with large tumors after more representative multiregion sampling.

One perceived weakness of our approach to manually recognize TACS1-8 may be overcome by an automatic recognition procedure after weakly supervised deep learning45. Also, our IGNN model is restricted to adjuvant therapy patients, and is not applicable to neoadjuvant therapy patients whose tumors and TACS features have been perturbed by the neoadjuvant therapy46. Moreover, the insufficient sampling of a tumor with specific selection of ROIs may introduce prognosis uncertainty at individual level (but not necessarily population level) due to uncertain intratumoral heterogeneity such as stroma versus cellularity ratio. On the other hand, our IGNN highlights an overlooked advantage of biospecimen sampling using multiple laser-capture microdissections and tissue microarray cores from one sample over homogenized tissue sampling47 and liquid biopsy. Although this study is limited to a specific disease (breast cancer) and prognostic prediction (predicting DFS), similar methodology can be adopted for other diseases and prognostic predictions, e.g., predicting treatment response to a therapy/drug.

## Methods

### Patients, samples, image acquisition, and TACS recognition

This research used anonymous data for retrospective study and was conducted under a protocol approved by the Institutional Review Boards (IRB) of Fujian Medical University Union Hospital and Harbin Medical University Cancer Hospital. Our study included a training cohort of 731 patients from Fujian Medical University (FMU) Union Hospital and a validation cohort of 264 patients from Harbin Medical University (HMU) Cancer Hospital. The demographic and clinicopathologic characteristics of patients, sample preparation, MPM image acquisition, and TACS recognition have been reported previously11. Only a small portion (~5%) of patients has multifocal tumors, and in this case, the sample preparation was performed on one of the foci with the largest size. Raw data including TACS coding observed from MPM imaging, clinical factors and follow-up information of patients is available with this paper.

### Graph dataset

As typical graph machine learning models, IGNN and IGNNE were implemented with specific irregular graph structures as input. We developed a toolset based on the PyTorch-Geometric library to build special graph structures from raw data of patients and generate specialized graph dataset (TACS_G) containing node-level attributes and connectivity matrix from a batch of graph structures, and the toolset and pre-established TACS_G dataset for this study are available within the source code.

### Personalized IGNN structure

For each given tissue section (patient), TACS distribution is encoded in a graph data structure defined as G = (V, E), which consists of an N-node set V and an edge set E. Each ROI in MPM image in the set ROIset = {ROI1, ROI2, … , ROIN} is regarded as a node object in V. An 8-bit binary coding vector $${{{\bf{v}}}}_{i}\in {\mathbb{R}}^{1\times 8}$$ forms the initial feature vector of the ROIi-associated node i V to describe the TACS1-8 distribution in ROIi (e.g., ROIi including TACS4 and TACS7 resulted in vi = 01001000), and $${{{\bf{F}}}}={[{{{\bf{v}}}}_{1}^{{{\rm{T}}}},\cdots ,{{{\bf{v}}}}_{N}^{{{\rm{T}}}}]}^{{{\rm{T}}}}\in {\mathbb{R}}^{N\times 8}$$ is defined as the initial node feature matrix of G. The edge eij E between node i and j is then constructed according to the k-nearest neighbor algorithm

$${e}_{ij}=1,{{{\rm{if}}}}\,j\,\in {{{\rm{KNN}}}}(i)\;{{{\rm{and}}}}\,{{{\bf{v}}}}_{i}\cdot {{{\bf{v}}}}_{j}^{{{\rm{T}}}}\ge 0$$
(1)

where the inner product $${{{{{{\bf{v}}}}}}}_{i}\cdot {{{{{{\bf{v}}}}}}}_{j}^{{{{{{\rm{T}}}}}}}[0,\cdots ,8]$$ indicates the number of TACS types presented in both ROIi and ROIj and KNN(i) indicates k-nearest neighbors of node i. Finally, personized TACS information is converted into an IGNN structure.

### Architectures of IGNN and IGNN-E models

The IGNN model takes the IGNN graph structure data as input and outputs a score for prognostic prediction (IGNN score). Given a set of G = {G1, G2, ... ,GN} with corresponding survival status labels Y = {y1, y2, ... ,yN}, the IGNN model constructs a multilayer nonlinear mapping to extract significant representation from graph structure data for prognostic prediction. It contains three components: (a) graph convolution network module; (b) fully connected network modules; and (c) prognostic regression layer.

The core part of graph convolution network module contains multiple stacked graph convolution layers, designed as a “message-passing” architecture that drives nodes to aggregate information with their neighbors along edges and perceives their interactions. The graph convolution is separated into two steps: node information propagation and node information aggregation. In the first step, node features are propagated as the information of node state along the edge between neighboring nodes. Attention mechanism is incorporated into the propagation step, which follows a self-attention strategy to aggregate the neighboring node states of node i by attending over them with different weights. The propagation scheme conducted in the attention layer is formulized as

$${{{\bf{X}}}}_{i}^{(t)}=\mathop{\sum}\limits_{j\in {{{\rm{N}}}}(i)}{a}_{ij}{{{\bf{H}}}}_{j}^{(t)}$$
(2)

where the set $${{{{{\rm{N}}}}}}(i)=\{\,j\in {{{{{\rm{V}}}}}}|{e}_{ij}=1\}$$ represents the set of 1-hop neighbors of node i V in the graph, .. represents the calculated weighted aggregation result of state information propagated from the neighbors of node i, $${{{\bf{H}}}}_{j}^{(t)}\in {\mathbb{R}}^{1\times {{{\rm{c}}}}^{(t)}}$$ represents the c(t)-dimensional feature vector of state information for node j V in the t-th graph convolutional layer, aij is the attention-based weight coefficient of state information propagated by node j to node i according to the attention mechanism of

$${a}_{ij}=\frac{\exp ({\beta }^{(t)}\langle {{{{{{\bf{H}}}}}}}_{i}^{(t)}{{{{{{\bf{W}}}}}}}_{{{{{{\rm{H}}}}}}}^{(t)},{{{{{{\bf{H}}}}}}}_{j}^{(t)}{{{{{{\bf{W}}}}}}}_{{{{{{\rm{H}}}}}}}^{(t)}\rangle )}{{\sum }_{k\in {{{{{\rm{N}}}}}}(i)}\exp ({\beta }^{(t)}\langle {{{{{{\bf{H}}}}}}}_{i}^{(t)}{{{{{{\bf{W}}}}}}}_{{{{{{\rm{H}}}}}}}^{(t)},{{{{{{\bf{H}}}}}}}_{k}^{(t)}{{{{{{\bf{W}}}}}}}_{{{{{{\rm{H}}}}}}}^{(t)}\rangle )}$$
(3)

where $${\beta }_{i}^{(t)}\in {\mathbb{R}}$$ is learnable normalized weight adjustment parameter, $${{{\bf{W}}}}_{{{\rm{H}}}}^{(t)}\in {\mathbb{R}}^{{{{\rm{c}}}}^{(t)}\times {{{\rm{c}}}}^{(t)}}$$ is the learnable weight matrix shared among all the nodes in t-th graph convolutional layer, 〈·〉 denotes the inner product operator on two vectors. In the second step, each node in the graph updates the current state $${{{{{{\bf{H}}}}}}}_{i}^{(t+1)}$$ by aggregating past state information from itself ($${{{{{{\bf{H}}}}}}}_{i}^{(t)}$$) and its neighbors ($${{{{{{\bf{X}}}}}}}_{i}^{(t)}$$). The attention weight aij are assigned to different nodes and edges according to the underlying dependencies to direct the network to the most prognostic parts of the TACS-based graph structure.

In addition, an optional aggregator based on Gated Recurrent Units (GRU) is designed to alleviate the potential over-smoothing issue during the information aggregation process as the network depth increases

$${{{{{{\bf{H}}}}}}}_{i}^{(t+1)}=\left\{\begin{array}{c}{{{{{\rm{GRU}}}}}}\big({{{{{{\bf{H}}}}}}}_{i}^{(t)},\,{{{{{{\bf{X}}}}}}}_{i}^{(t)}\big),{{{{{\rm{G}}}}}}{{{{{\rm{RU}}}}}}\,{{{{{\rm{is}}}}}}\,{{{{{\rm{available}}}}}}\\ {{{{{{\bf{H}}}}}}}_{i}^{(t)}+{{{{{{\bf{X}}}}}}}_{i}^{(t)},{{{{{\rm{GRU}}}}}}\,{{{{{\rm{is}}}}}}\,{{{{{\rm{unavailable}}}}}}\end{array}\right.$$
(4)

and the basic recursive formula of GRU aggregator is

$${{{\bf{H}}}}_{i}^{(t+1)} ={{{\bf{Z}}}}^{(t)}\odot {\widetilde{{{\bf{H}}}}}_{i}^{(t+1)}+(1-{{{\bf{Z}}}}^{(t)})\odot {{{\bf{H}}}}_{i}^{(t)}\\ {\widetilde{{{\bf{H}}}}}_{i}^{(t+1)} =\,{{{\mathrm{Tanh}}}}\;\big({{{\bf{X}}}}_{i}^{(t)}{{{\bf{W}}}}_{{{\rm{xh}}}}+\big({{{\bf{R}}}}^{(t)}\odot {{{\bf{H}}}}_{i}^{(t)}\big){{{\bf{W}}}}_{{{\rm{hh}}}}+{{{\bf{b}}}}_{{{\rm{H}}}}\big)\\ {{{\bf{Z}}}}^{(t)} ={{{\rm{Selu}}}}\;\big({{{\bf{X}}}}_{i}^{(t)}{{{\bf{W}}}}_{{{\rm{xz}}}}+{{{\bf{H}}}}_{i}^{(t)}{{{\bf{W}}}}_{{{\rm{hz}}}}+{{{\bf{b}}}}_{{{\rm{Z}}}}\big)\\ {{{\bf{R}}}}^{(t)} ={{{\rm{Selu}}}}\;\big({{{\bf{X}}}}_{i}^{(t)}{{{\bf{W}}}}_{{{\rm{xr}}}}+{{{\bf{H}}}}_{i}^{(t)}{{{\bf{W}}}}_{{{\rm{hz}}}}+{{{\bf{b}}}}_{{{\rm{R}}}}\big)$$
(5)

where ʘ denotes elementwise multiplication, Tanh (∙) is hyperbolic tangent nonlinear activation function, Selu () is the nonlinear activation function termed as scaled exponential linear units, Wxh, Wxz, Wxr, Whh, Whz, Whr $$\in {{\mathbb{R}}}^{{{{{{{\rm{c}}}}}}}^{(t)}\times {{{{{{\rm{c}}}}}}}^{(t)}}$$ and bH, bZ, bR $$\in {{\mathbb{R}}}^{1{{{{{{\rm{c}}}}}}}^{(t)}}$$ indicate learnable weight matrixes and bias parameters, respectively. The reset gate denoted as $${{{\bf{R}}}}^{(t)}\in {\mathbb{R}}^{1{{{\rm{c}}}}^{(t)}}$$ is used to control how the previous node state is merged into the current candidate node state, and update gate denoted as $${{{\bf{Z}}}}^{(t)}\in {\mathbb{R}}^{1{{{\rm{c}}}}^{(t)}}$$ is used to control how node states are updated by combining past states with current candidate state. Node information propagation and aggregation is performed hierarchically along the layers. The $$t$$-th graph convolutional layer takes the node state information feature matrix $${{{\bf{H}}}}^{(t)}={[{({{{\bf{H}}}}_{1}^{(t)})}^{{{\rm{T}}}},{({{{\bf{H}}}}_{2}^{(t)})}^{{{\rm{T}}}},\cdots ,{({{{\bf{H}}}}_{N}^{(t)})}^{{{\rm{T}}}}]}^{{{\rm{T}}}}\in {\mathbb{R}}^{N\times {{{\rm{c}}}}^{(t)}}$$as input, and outputs a new feature matrix $${{{\bf{H}}}}^{(t+1)}={[{({{{\bf{H}}}}_{1}^{(t+1)})}^{{{\rm{T}}}},{({{{\bf{H}}}}_{2}^{(t+1)})}^{{{\rm{T}}}},\cdots ,{({{{\bf{H}}}}_{N}^{(t+1)})}^{{{\rm{T}}}}]}^{{{\rm{T}}}}\in {\mathbb{R}}^{N\times {{{\rm{c}}}}^{(t)}}$$updated via node information propagation and aggregation operations. Specially, in the non-linearity mapping process for converting the initial node state vectors $${{{\bf{F}}}}={[{{{\bf{v}}}}_{1}^{{{\rm{T}}}},\cdots ,{{{\bf{v}}}}_{N}^{{{\rm{T}}}}]}^{{{\rm{T}}}}\in {\mathbb{R}}^{N\times 8}$$ into higher-level node feature embedding $${{{{{{\bf{H}}}}}}}^{(1)}={{{{{\rm{Selu}}}}}}\;({{{{{\bf{F}}}}}}{{{{{{\bf{W}}}}}}}_{{{{{{\rm{f}}}}}}})$$, the feature vectors of F are weighted by a learnable weight matrix $${{{\bf{W}}}}_{{{\rm{f}}}}\in {\mathbb{R}}^{8\times {{{\rm{c}}}}^{(t)}}$$ shared among all the nodes to highlight their relative importance in H(1). The node states are propagated and aggregated across the multiple graph convolutional layers, which lead to the high-level node state features. A global pooling layer follows the last graph convolution layer to generate graph-level output for a single graph G by averaging all node state features in G across the node dimension

$${{{{{{\bf{f}}}}}}}^{(l)}=\mathop{\sum }\limits_{i\,=1}^{N}{{{{{{\bf{H}}}}}}}_{i}^{(t)}$$
(6)

where $${{{\bf{f}}}}^{(l)}={\mathbb{R}}^{1\times {{{\rm{c}}}}^{(t)}}$$ is computed as the comprehensive representation of the graph structure, which extracts multi-scale localized patterns of substructure overall G. The Global pooling layer is followed by the fully connected network module consisting of multiple fully connected layers. Fully connected network module weights the portions of feature h(t) with different importance levels via a learnable weight matrix $${{{{{{\bf{W}}}}}}}_{{{{{{\rm{h}}}}}}}^{(t)}$$ and recombines them into more discriminative higher-level feature h(t+1) by the non-linearity mapping of

$${{{\bf{h}}}}^{(t+1)}={{{\rm{Selu}}}}\;\big({{{\bf{h}}}}^{(t)}{{{\bf{W}}}}_{{{\rm{h}}}}^{(t)}+{{{\bf{b}}}}_{{{\rm{h}}}}^{(t)}\big)$$
(7)

where $${{{\bf{h}}}}^{(t)}={\mathbb{R}}^{1\times {h}^{(t)}}$$ and $${{{\bf{h}}}}^{(t+1)}={\mathbb{R}}^{1\times {h}^{(t+1)}}$$are the input and output feature vectors of the t-th full connected layer, respectively, $${{{{{{\bf{W}}}}}}}_{{{{{{\rm{h}}}}}}}^{(t)}$$ $$\in {{\mathbb{R}}}^{{{{{{{\rm{h}}}}}}}^{(t)}\times {{{{{{\rm{h}}}}}}}^{(t+1)}}$$ and $${{{\bf{b}}}}_{{{\rm{h}}}}^{(t)}\in {\mathbb{R}}^{1\times {h}^{(t+1)}}$$ are the learnable connected weight matrix and bias parameter, respectively. In particular, in the original architecture of IGNN, $${{{{{{\bf{h}}}}}}}^{(1)}={{{{{{\bf{f}}}}}}}^{(l)}$$ was used as the input of the first full connected layer, while in the IGNN-E model, the feature vectors made up of 8 clinicopathological factors are cascaded with $${{{{{{\bf{f}}}}}}}^{(l)}$$ as the input of the first fully connected layer

$${{{{{{\bf{h}}}}}}}^{(l)}=\left\{\begin{array}{c}{{{{{{\bf{f}}}}}}}^{(l)},{{{{{\rm{IGNN}}}}}}\hfill \\ {{{{{{\bf{f}}}}}}}^{(l)}\cup {{{{{\rm{Selu}}}}}}\;({{{{{\bf{r}}}}}}{{{{{{\bf{W}}}}}}}_{{{{{{\rm{r}}}}}}}),{{{{{\rm{IGNN}}}}}}-{{{{{\rm{E}}}}}}\end{array}\right.$$
(8)

where concatenates two feature vectors along an assigned axis, $${{{\bf{W}}}}_{{{\rm{r}}}}\in {\mathbb{R}}^{6\times r}$$ is the learnable connected weight matrix, r represents additional clinicopathological vectors, $${{{\bf{h}}}}^{(l)}={\mathbb{R}}^{1\times {h}^{(t)}}$$ denotes the output high-level feature vector of the fully connected layer. Finally, these features are fed into the prognostic risk prediction layer along with survival data to produce the prediction score

$$p={{{{{\rm{Selu}}}}}}\;({{{{{{\bf{h}}}}}}}^{(l)}{{{{{{\bf{W}}}}}}}_{{{{{{\rm{risk}}}}}}})$$
(9)

$${{{\bf{W}}}}_{{{\rm{cox}}}}\in {\mathbb{R}}^{{{\rm{h}}}^{(t)}\times 1}$$ is learnable weight vector.

The IGNN model is currently a small-scale architecture consisting of only 2 graph convolution layers and 2 fully connected layers, where the learnable weight matrixes $${{{{{{\bf{W}}}}}}}_{{{{{{\rm{H}}}}}}}^{(t)},{{{{{{\bf{W}}}}}}}_{{{{{{\rm{xh}}}}}}},{{{{{{\bf{W}}}}}}}_{{{{{{\rm{xz}}}}}}},{{{{{{\bf{W}}}}}}}_{{{{{{\rm{xr}}}}}}},{{{{{{\bf{W}}}}}}}_{{{{{{\rm{hh}}}}}}},{{{{{{\bf{W}}}}}}}_{{{{{{\rm{hz}}}}}}},{{{{{{\bf{W}}}}}}}_{{{{{{\rm{hr}}}}}}}$$ in graph convolutional layers have fixed size (8,8) and each fully connected layer consists of 32 neural units. As to the IGNN-E model, the weight matrix $${{{{{{\bf{W}}}}}}}_{r}$$ applied to the additional feature vector r has a size of (8,16). The number of parameters for the IGNN (IGNN-E) model is ~2300 (~2500). The more learnable parameters of the model, the more training data are required to avoid the over-fitting in machine learning. It is thus reasonable to design a small-scale architecture for a relatively small patient number (995 in total). However, with increasing number of patients for training, the original models can be extended to more complex depth models by stacking more graph convolutional layers and fully connected layers.

To evaluate prognostic risk, the learning object of IGNN (IGNN-E) was to minimize the negative log partial likelihood of Cox proportional hazards regression loss as follow

$${\mathcal L} ({{{{{\boldsymbol{\theta }}}}}},{{{{{\bf{b}}}}}})=-\mathop{\sum}\limits_{i:{y}_{i}=1}({p}_{i}-\,\log (\mathop{\sum}\limits_{j:{D}_{j}\ge{D}_{i}}\exp ({p}_{j})))$$
(10)

where survival status label $${y}_{i}=1$$ indicates the occurrence of disease-related events (recurrence or death) for patient i, Di denotes the DFS of patient i,θ and b are the set of learnable weights and biases, respectively.

### Training and validation

In the pre-validation process, 3-fold cross-validation was adopted to train and evaluate the prognostic models within the training cohort (FMU dataset). The training cohort was first divided into 3 folds ($${n}_{{{{{{\rm{fold1}}}}}}}$$ = 243, $${n}_{{{{{{\rm{fold2}}}}}}}$$ = 244, $${n}_{{{{{{\rm{fold3}}}}}}}$$ = 244). Within each cross-validation, two folds were used for training and the remaining fold for validation. Since there were significantly more long-survival cases (patients with DFS > 5 years, $${n}_{P}$$ = 470) than short-survival cases (patients with DFS ≤ 5 years, nN = 261), in each cross-validation, we randomly selected the same number of long- and short-survival cases from the two-fold data to compose the training data, in order to avoid the adverse effect of unbalanced training data on the training process. To train the IGNN and IGNNE models, the Xavier initialization was used for all the layers. An adaptive moment estimation (ADAM) optimizer was adopted to learn model parameters with a batch size of 16 and initial learning rate of 0.01. The maximum training iteration were set to 325 for the IGNN model and 59 for the IGNNE model. Due to the difficulty to augment a graph dataset as a traditional image dataset, two strategies were used to prevent overfitting: first, dropout layers (Pdrop = 0.1) were applied before graph convolutional layers during the training phase, and normalized layer were applied before the full connection layer. Second, an adaptive stopping scheduler (Astopper) was developed to automatically select a suitable epoch for interrupting training process. Specifically, three checkpoints were set up across the training process, and the learning rate and weight decay rate of ADAM optimizer would be updated according to the change of training loss at Checkpoint1 or Checkpoint2, respectively. Starting from Checkpoint3 until the preset maximum iteration, the epoch with the lowest training loss would be called back as the best time to interrupt training and freeze model parameters. All results from single cross-validation steps were combined as the model-specific validation results for the training cohort.

In the external validation process, the prognostic models with the same architecture as in the pre-validation were retrained with full-scale cases from the training cohort and then evaluated on both the training and validation cohorts. For IGNN (IGNNE), model parameters were again initialized by the Xavier distribution, and optimized using ADAM optimizer with a batch size of 16(128) and an initial learning rate of 0.005. The maximum training iteration were set to 656 for IGNN and 1000 for IGNNE. During training, the best number of epochs to interrupt training was determined by Astopper under the same strategy as in the pre-validation process.

### Interpretability

To demonstrate how IGNN revealed the implicit prognostic interactions among TACS features, we trained IGNN based on the training cohort without hyperparameter tuning. Specifically, the model parameters were randomly initialized and optimized using ADAM optimizer with a batch size of 16 and a learning rate of 0.01. During training, instead of choosing the stopping moment by Astopper, the training process was randomly interrupted at the 558th epoch, and the output feature vectors of each graph convolutional layer were extracted as the observed model responses.

To quantify the relative importance of different TACS features within patient-specific TACS graph data fed into IGNN, we introduced integrated gradient (IG), a post hoc gradient-based feature imputation method that attributes the model prediction to their inputs with different contributing factors. For a straight-line path from the input F to the associated baseline F′, IG is defined as the path integral of the gradients along the straight-line path from F to F′, which can be efficiently approximated numerically. In practice, IG for the ith element of F is computed following numerical Riemman approximation

$${{{{{{\rm{IG}}}}}}}_{i}({{{{{\bf{F}}}}}})\colon\colon =\frac{1}{m}({{{{{{\bf{F}}}}}}}_{i}-{{{{{{\bf{F}}}}}}}_{i}^{{\prime} })\mathop{\sum }\limits_{k=1}^{m}\frac{\partial {{{{{\rm{IGNN}}}}}}({{{{{\bf{F}}}}}}^{\prime} +\frac{k}{m}({{{{{\bf{F}}}}}}-{{{{{\bf{F}}}}}}^{\prime} ))}{\partial {{{{{{\bf{F}}}}}}}_{i}}$$
(11)

where F is the initial node feature matrix of graph data, F′ is the starting point to calculate $${{{{{{\rm{IG}}}}}}}_{i}({{{{{\bf{F}}}}}})$$ which is set to an all-zero matrix, IGNN(·) is the IGNN prediction, $$\frac{\partial {{{{{\rm{IGNN}}}}}}(\cdot )}{\partial {{{{{{\bf{F}}}}}}}_{i}}$$ is the gradient of IGNN prediction function relative to the ith element of F, k is the scaled feature perturbation constant, and m is the number of steps in the Riemann approximation of the integral. With the patient-specific TACS graph data as input, IG reveals the relative prognostic importance of different TACS features at the ROI level.

Training and validation of IGNN and IGNNE models were implemented using PyTorch (version 1.6.0) framework and Torch-Geometric library (version 1.6.1) with Windows 10 operating system on a computer equipped with one NVIDIA GTX1080 Ti GPU and one INTEL Core i7-6700K CPU @ 4.0 GHz.

### Statistical analysis

The Kaplan-Meier and two-sided log-rank tests were used to plot and compare the survival curves with dichotomized predictive score from the prognostic models, which provided risk stratification assignment. The HRs of different risk groups in survival analysis and the association of clinical factors with patient DFS were measured using Cox proportional hazards regression analysis. Time-dependent AUC curves and corresponding iAUC (the measure of prognosis appropriate for censored time-to-event data) were used to assess the performance for DFS prediction of different prognostic models. The receiver operating characteristic (ROC) curves and associated AUCs were adopted to reflect the discriminatory accuracy of the prognostic models to predict 5-year DFS rate. The relative strengths of multiple prognostic factors to predict DFS was assessed using the Chi-squared test. The subgroups were compared using two-sided unpaired t-test with no adjustments made for multiple comparisons. All hypothesis analyzed with a two-sided p value < 0.05 were treated as statistically significant. The 95% CIs were calculated by percentile method while t-SNE technology was used for dimensionality reduction and visual analysis of high-dimensional features. Data and statistical analyses were performed using R software (version 3.6.0). The Cox Ridge regression adopted by TACS model was performed using the “glmnet” package and the Multivariate Cox proportional hazard regression adopted by Nomogram model was performed using the “survival” package.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.