Deep learning features encode interpretable morphologies within histological images

Convolutional neural networks (CNNs) are revolutionizing digital pathology by enabling machine learning-based classification of a variety of phenotypes from hematoxylin and eosin (H&E) whole slide images (WSIs), but the interpretation of CNNs remains difficult. Most studies have considered interpretability in a post hoc fashion, e.g. by presenting example regions with strongly predicted class labels. However, such an approach does not explain the biological features that contribute to correct predictions. To address this problem, here we investigate the interpretability of H&E-derived CNN features (the feature weights in the final layer of a transfer-learning-based architecture). While many studies have incorporated CNN features into predictive models, there has been little empirical study of their properties. We show such features can be construed as abstract morphological genes (“mones”) with strong independent associations to biological phenotypes. Many mones are specific to individual cancer types, while others are found in multiple cancers especially from related tissue types. We also observe that mone-mone correlations are strong and robustly preserved across related cancers. Importantly, linear mone-based classifiers can very accurately separate 38 distinct classes (19 tumor types and their adjacent normals, AUC = \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$97.1\% \pm 2.8\%$$\end{document}97.1%±2.8% for each class prediction), and linear classifiers are also highly effective for universal tumor detection (AUC = \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$99.2\% \pm 0.12\%$$\end{document}99.2%±0.12%). This linearity provides evidence that individual mones or correlated mone clusters may be associated with interpretable histopathological features or other patient characteristics. In particular, the statistical similarity of mones to gene expression values allows integrative mone analysis via expression-based bioinformatics approaches. We observe strong correlations between individual mones and individual gene expression values, notably mones associated with collagen gene expression in ovarian cancer. Mone-expression comparisons also indicate that immunoglobulin expression can be identified using mones in colon adenocarcinoma and that immune activity can be identified across multiple cancer types, and we verify these findings by expert histopathological review. Our work demonstrates that mones provide a morphological H&E decomposition that can be effectively associated with diverse phenotypes, analogous to the interpretability of transcription via gene expression values. Our work also demonstrates mones can be interpreted without using a classifier as a proxy.


Results
We analyzed the InceptionV3 20 features of tiles from whole slide images of The Cancer Genome Atlas (TCGA) 1 for 19 cancers (see Supplementary File 1 for the full list). We used features derived from this architecture because predictive models based on Inception have shown high accuracy for identifying phenotypes in prior studies 1 . We hereafter denote each of the 2048 outputs of the global average pooling layer of the Inception V3 network as mones (morphological genes). We use this terminology because mones have analogies to genes with individual expression values. Tile level mones can be combined to construct slide level mones (see "Methods"). Unless otherwise stated, in the studies below "mone" refers to a slide level characterization. Figure 1 provides an overview of interpretive mone analyses and their connection to current interpretation techniques in the field.
Individual mones differentiate phenotypes. We first investigated to what extent individual mones can differentiate phenotypes, focusing on TCGA tumor/normal slide comparisons. We initially identified individual mones with significant differences in distribution between breast cancer (BRCA) tumor and adjacent normal slides (> 1800 mones were statistically significant with FDR < 5%). A clustermap of the top 100 such mones was able to clearly separate these classes (see "Methods", Fig. 2a, and Supplementary Fig. 1, clustermap AUC = 89%, rand score = 96%, and adjusted rand score = 85%). These results were typical of mone behaviors in many tumor types-in any given tumor type, many mones were able to separate frozen tumor from normal slides (see "Methods" and Supplementary File 1). We applied several statistical methods to test the robustness of such mones. In each cancer, at least 40% of mones were statistically significant irrespective of the statistical test used (FDR = 5%, Supplementary File 1), and 75.7% of mones significant by at least one method were significant by all tests (see "Methods", Fig. 2b). A smaller subset of these mones showed strong effect sizes, as identified by optimal Bayesian filter (OBF) 21 statistics (see "Methods"). 22% of all mone-cancer pairs met this criterion based on distributional differences between tumor and normal slides ( Supplementary Fig. 2).
As an example, mone 983 is a tumor marker with a distributional difference between frozen tumor and adjacent normal breast cancer (BRCA) slides (Fig. 2c). It also behaves similarly in several other tumor types ( Supplementary Fig. 3). It strongly correlates with cell density in frozen BRCA slides (see "Methods", To further test whether mones exhibit consistent behavior in different cancer types, we analyzed the four cancer families of 1 File 4). We observed the ability of some mones to distinguish between tumor and normal is impacted by differences between frozen and FFPE modalities ( 421 ± 112 across all cancers, see "Methods", Supplementary File 4), though the majority of mones behave similarly in frozen and FFPE.
Mone clusters provide robust encodings of cancer phenotypes. We next investigated to what extent mones are independent or encode behaviors together. We did this by calculating pairs of mones that were significantly correlated, for each cancer type. We analyzed this first by restricting to frozen slides (to avoid Simpson's paradox), and second by combining frozen and FFPE slides (to avoid false correlations due to frozenspecific artifacts). The results were generally robust between the two methods-a few cancers had a non-trivial difference in the ratio of correlated mone pairs ascertained by the two methods (ESCA, KICH, SARC, and PRAD 22.3% ± 2.22% ), while the remaining cancers had small differences ( 7.8% ± 3.9% , Supplementary File 6). Therefore we subsequently analyzed frozen samples only unless otherwise specified. Overall, we found that correlated mones are prevalent among tumor slides (Fig. 2f,g, and Supplementary File 6). For example, 83.9% and 88.7% of mone-mone pairs are correlated within LUAD and within LUSC, respectively (see "Methods" and Supplementary File 6). We observed similar results for other cancers: 68.8% ± 13% of all mone-pairs were statistically significant across cancers (Supplementary File 6). Remarkably, we observed that pairwise correlations are preserved within cancer families-more than 45% of mone-pairs statistically significant in one cancer are significant in all cancers of the family (pan-GYN, pan-KIDNEY, pan-LUNG, and pan-GI, Supplementary Fig. 6).

Figure 1.
An overview of interpretation methods in deep learning. Blue arrows denote methods that require a trained classifier, and green arrows denote methods that do not require a trained classifier. (a) Several methods identify regions which drive the network's prediction. These masks can be generated by the network, e.g. spatial self-attention 62 , or as a post-process via visualization methods 7 such as GradCam, or prediction heatmaps 1 . Heatmaps of individual mones and mone-based classifiers can be used to detect predictive regions. (b) Channel attention 62 and sparse models, including sparse mone-based classifiers, identify subsets of features that are predictive of class labels. Differential mone analysis identifies discriminative features without training of a classifier. (c) Methods in (a) and (b) are typically used to select example image regions that have high attention, affect predictions the most, or affect the value of a feature. Mone analysis can be used to (d) identify features that encode a given phenotype of interest and (e) identify the morphology a feature of interest encodes without training a classification model. www.nature.com/scientificreports/ For example, the mone-mone correlations in lung adenocarcinoma and lung squamous cell carcinoma are nearly identical (Fig. 2f,g). We also calculated mone-mone correlations in the normal slides associated with each cancer type, hypothesizing that the difference in mone-mone correlations between tumor and normal might be important to distinguishing tumor from normal images. However, these differential mone correlations are weaker and less preserved across cancers (Supplementary File 6 and Supplementary Fig. 6). Different cancers can be distinguished by different sets of mones. For example, while mone 983 separates tumor and normal slides of BRCA, it does not differentiate COAD tumor and normal slides ( Supplementary  Fig. 3). We identified 105 mones that are highly correlated with mone 983 in BRCA (Null: |r| ≤ 0.5 , FDR = 0.1% , see "Methods"), among which 52, 14, and 29 also differentiate tumor from normal slides in COAD, READ, and STAD, respectively (t-test FDR < 0.1% ). The first principal component of these COAD-overlapping mones (explaining 41% of the variance across COAD samples) strongly correlates with cell density in frozen slides (Cellpose cellularity estimates: Pearson r = 0.22, p-value = 2.2e−11, HoverNet cellularity estimates: Pearson r = 0.43, P-value = 1.9e−47, see "Methods" and Supplementary File 2). Thus the high cellularity in BRCA 22,23 and COAD 24,25 involve incompletely overlapping mone sets.
Linear models of mones can detect and distinguish tumors. We investigated linear models of mones for predicting phenotypes, as they allow direct interpretation of mone values. We observed that linear models of mones can efficiently distinguish tumor from adjacent normal slides, as well as the cancer type from which they are derived (19 cancers, 38 classes, see "Methods", Fig. 3a-c). We tested two linear models, multi-class linear discriminant analysis [MLDA, One versus Rest (OVR)-AUC = 97.1% ± 4.6% , Supplementary File 7] and multinomial logistic regression with LASSO penalty (LR-LASSO, OVR-AUC = 97.1% ± 4.2% , Supplementary File 7). MLDA encodes mone patterns indicative of class labels into a low dimensional space (i.e. the number of classes -1), yielding t-SNE visualizations with improved interpretability over naïve t-SNE (compare Fig. 3a,b and Supplementary Fig. 7a,b). LR-LASSO, on the other hand, is a linear model based on small mone sets, so its regression coefficients can be interpreted directly with risks incurred by each mone.
Although CNN methods typically use difficult-to-interpret fully connected layers at the classification step, we found that efficiently designed linear models can replace fully connected layers while still achieving high prediction AUCs. Combining tumor probabilities of the LR-LASSO classifier we obtained a universal tumor detector www.nature.com/scientificreports/ with extremely high AUC ( 99.2% ± 0.12% , see "Methods"), out-performing the fully deep learning model of 26 (Reported AUC = 0.95 ± 0.02 ). Furthermore, LR-LASSO is effective at cross-classification similar to CNNs with fully connected classification layers 1 , i.e., LR-LASSO trained to distinguish tumor/normal for one cancer type can distinguish tumor/normal for other cancer types as well ( Fig. 3d and Supplementary Fig. 7c). While the LR-LASSO model has smaller average AUC (0.84) compared to the fully deep learning model of 1 (0.88), logistic regression is more interpretable than a multi-layer perceptron. Our slide level tumor detectors also produce meaningful tile level predictions. Independent review by our pathology team supports most tumor regions having high tumor probability, and most non-tumor regions have low tumor probability in these images, with the cases of misclassification tending to be prediction of non-tumor regions to be tumor ( Supplementary Fig. 8). These results indicate that the LR-LASSO slide level tumor markers are effective at the tile level.
Mones have interpretable correlations with gene expression. We next investigated whether mones are linearly associated with gene expression values, as this could provide transcriptional interpretability for mones (see "Methods" for data stratification and pre-processing). We observed many mones significantly correlated with individual genes. Across five cancers analyzed (OV, COAD, KIRC, LUAD, and LUSC) between 83 Mones encode collagen content. We first used unsupervised analysis to study clusters of correlated mone-gene pairs. We identified a cluster of highly correlated mones and collagen genes in OV ( Fig. 4a and Supplementary  Fig. 9). These mone values can be efficiently combined for association with phenotype using PCA (PC-1 explains 63% of the variance). Histopathological review by our pathology team confirmed tiles with high PC-1 values as typically rich in collagen ( Fig. 4d and Supplementary Fig. 10), and tiles with low PC-1 values as having low collagen but increased cellularity. These mone-gene associations may be clinically relevant, as high expression of collagen genes correlates with multi-drug resistance 27 and poor prognosis in ovarian cancers 28 . Mone 1062-one of the mones in the identified cluster-was additionally highly correlated with ECM2, THBS1 and THBS2, which have been suggested to play a significant role in ovarian cancer drug resistance and metastasis [28][29][30] .
Mones identify immunoglobulin gene expression in highly cellular colon adenocarcinoma tumors. Supervised correlation analysis can also clarify finer behaviors within WSIs. For example, highly cellular COAD tumors typically show high expression of immunoglobulin (IG) genes. We considered the 52 mones correlated with mone 983 and which differentiate COAD tumor from normal ("Mone clusters provide robust encodings of cancer phenotypes" section). The PC-1 dimension of these mones significantly correlates with 87 genes in COAD (FDR < 0.05 , Supplementary File 11 for the full gene list), as well with the average expression of these 87 genes (r = 0.33, p-value = 3.8e−12). Immunoglobulin (IG) genes dominate this gene set, and we define their average expression as a sample's IG score.
B-cells express immunoglobulin, and as expected we observed a statistically significant correlation between the log normalized B-cell estimates of 33 and IG score (p-value = 2.4e−12). Interestingly, however, B cells comprise only a small fraction of cells in each sample ( 0.86% ± 1.5% ), and we did not observe a significant correlation between the B-cell estimates and mone PC-1 (correlation coefficient = − 0.009, P-value = 0.85). Thus mone analysis suggests that IG expression is not due solely to B-cells. This supports recent studies suggesting colon cancer cells themselves express IG genes 34 .
Similar to TCGA, CPTAC-LUAD slides tend to have a higher mone 1914 value than LUSC slides (t-test raw p-value < 1e−50). Pathologist review suggests LUAD slides with high values of mone 1914 tend to have gland and micropapillary formations, and have finely vacuolated cytoplasm ( Supplementary Fig. 5). These features are indicative of adenocarcinoma and were not observed in CPTAC-LUSC slides with low mone 1914 values ( Supplementary Fig. 5).

Discussion
Mone analysis provides interpretability of deep neural networks through the linear correlations of deep learning features against phenotypes and gene expression profiles. While the richness of CNN features suggests potential linear correlations of individual features with human interpretable morphologies, validation of such relations is necessary, as human interpretability is not explicitly enforced in the training of CNNs. Furthermore, identification of the encoded morphology by each mone is a non-trivial task. Although correlation is not causation, our mone analysis empirically shows that mones efficiently encode strong morphological features that can often be used to replace multi-layer perceptrons with robust and interpretable linear classification models. Mone-analysis is flexible, can be used in a diverse set of interpretation modalities, and can be applied to features engineered through various training methodologies. Moreover, we have demonstrated that integrative mone-gene correlation analysis can identify specific transcriptional processes from images, and verified these through expert pathological review.

Mones provide image-based interpretability. Linear models based on mones have several empirical
and theoretical strengths for image analysis. Individual mones and small mone clusters directly correlate with phenotypes ("Individual mones differentiate phenotypes" and "Linear models of mones can detect and distinguish tumors" sections), enabling a simpler interpretation of CNNs compared with methods that integrate all CNN features together. Most interpretation models assume deep feature representations are complex and nonlinear, and therefore provide interpretability primarily through example regions identified by problem-specific classifiers. On the other hand, our results demonstrate CNNs decompose H&E stained images into interpretable features that linearly correlate with phenotypes, and the highly non-linear feature representation assumption can be relaxed for interpreting CNNs trained on H&E slides.
Some prior linear analyses of deep learning features have enabled partial interpretation of CNNs trained on one cancer type 12,37 . In contrast to 12 , which only reported on individual mones through their heatmaps overlaid with H&E images, our mone analysis identifies individually correlated mones and genes, providing finer interpretability. The canonical correlation analysis of 37 also has less interpretability than our approach as it investigates many-to-many relationships between mones and gene expression. These works also focus on individual cancers, while our study provides pan-cancer interpretations. For example, our results demonstrate a pan-cancer mone can encode conserved morphological features across multiple cancer types ("Individual mones differentiate phenotypes" section), and a conserved morphological feature can be encoded by distinct mones in different www.nature.com/scientificreports/ cancer types ("Mone clusters provide robust encodings of cancer phenotypes" section). Moreover, mones within strongly correlated clusters can be linearly combined to better identify shared encoded morphology ("Mone clusters provide robust encodings of cancer phenotypes" section and 11 ). Pan-cancer mones and correlations between mones can facilitate and improve robustness of interpretation across cancer types. For example, mone983 encodes cellularity (both in TCGA and CPTAC, see "Individual mones differentiate phenotypes" and "External validation on CPTAC " sections) and distinguishes tumor and normal slides of multiple cancers ( Supplementary Fig. 3), but not GI cancers (Supplementary Fig. 3). However, mones encoding cellularity in COAD can be identified through their correlations with mone983 ("Mone clusters provide robust encodings of cancer phenotypes" section), even though mone983 is not itself associated with cellularity in COAD. Another example is correlation of mone1895 and mone327 with immune genes in LUAD and pan-GI cancers (Fig. 4).
Mone analysis improves low dimensional visualization of large image sets, such as via t-SNE plots. Pre-trained CNNs are universal feature extractors that encode morphological features predictive of a multitude of labels 1,38,39 .
Because not all high-dimensional complex relations can be easily embedded in 2D, mone analysis can be used as a first phase of targeted search to identify the mones relevant to classes of interest ("Linear models of mones can detect and distinguish tumors" section), allowing more accurate visualization of class separations in mone space.
Mone analysis guides classifier design by exploring statistical properties of the learned features. For example, linear associations of correlated mone clusters with phenotypes suggests utility of sparse linear models for reliable classification ("Linear models of mones can detect and distinguish tumors" and "External validation on CPTAC " sections). Robustness of linear models using a small number of mones ("Linear models of mones can detect and distinguish tumors" and "External validation on CPTAC " sections) provides empirical evidence for the theoretical results establishing robustness of sparse deep learning models [40][41][42] . Furthermore, removing mones encoding tissue-specific structures and keeping mones encoding morphological features of interest can help build classifiers robust to tissue specific patterns.
Correlation analysis of mones with gene expression values is a powerful approach for interpreting mones. We identified clusters of highly correlated mone-gene sets, demonstrating clear connection of mones to the underlying genetics. Some recent studies have used exhaustive sets of deep learning features to predict expression profiles [16][17][18] , but our work shows that small mone-gene clusters can be sufficient and provide simpler interpretability. Both supervised and unsupervised analyses identify meaningful clusters ("Mones have interpretable correlations with gene expression" section). Supervised analysis using fixed gene sets is particularly interesting, as it enables direct assessment of genes of interest via mones ("Mones encode immune infiltration" section).
Strong correlations of individual mones with image-derived features ("Individual mones differentiate phenotypes" section and Fig. 1d) or gene expression values ("Mones have interpretable correlations with gene expression" section and Fig. 4b) facilitate direct interpretation, but moderate correlations require further analysis. Dimensionality reduction methods, such as PCA, combine moderately correlated mones and boost the associations between mones and the encoded morphology. Whether moderate correlations are due to multiple mones being weak detectors of a complex morphology, individual mones partially encoding multiple morphologies, or gene expression profiles having moderate correlation with morphology (see [16][17][18]  Interpretation without classifier training. A notable advantage of the mone approach is that it does not require a trained classifier, which is especially desirable when feature engineering and classification are decoupled (e.g. transfer learning, unsupervised learning). Interpretation models that require a trained classifier, such as attention-based models, are restricted to the morphologies that are predictive of a predetermined set of classes and are utilized by the classifier. Additionally, different classification architectures applied to a fixed learned feature space may use different features and morphologies for predicting a class label. Mone analysis does not require a classifier to determine if the learned features encode a given phenotype ("Mones encode immune infiltration" and "Mones identify immunoglobulin gene expression in highly cellular colon adenocarcinoma tumors" sections). It can be immediately applied to any learned feature space irrespective of training methodology, and can be used in an unsupervised fashion to identify encoded morphologies ("Mones encode collagen content" section).
Optimization of training classifiers that are robust to stain differences across datasets is an open research question. However, an advantage of mone analysis is that it can be more robust to stain differences than pure classification models, as we observed for cell classification, where our mone-based model enjoyed better generalization to external datasets than HoverNet ("External validation on CPTAC " section). It can also be used to analyze how stain differences affect feature representations and a classifier's ability to make reliable predictions.

Mone analysis across architectures and data modalities.
Our analyses demonstrate that mones provide an efficient and interpretable CNN embedding of image data, but a caveat is that they have been restricted to Inception V3 mones. Architectures with fully connected layers tend to increase non-linearity in feature representations. Therefore, models that do not utilize multiple sequences of fully connected layers, such as Inception, are more appropriate for linear mone analysis. For example, recent work suggests a small subset of VGG19 features may also be interpreted via their direct association with phenotypes 12 . However, we believe Inception V3 mones are more appropriate for linear association studies because they are the direct inputs to the classification layer. A few other studies have explored correlations between deep autoencoder features and gene 37  www.nature.com/scientificreports/ Xception mones to Inception V3 mones using autoencoders with reasonable accuracy ( R 2 ≈ 0.5 ). Recent work suggests InceptionV3 and ResNet features are almost equivalent 44 .These studies suggest that integrative linear mone-gene correlation analysis can be made effective across a range of deep learning architectures. While this work focuses solely on H&E WSIs, we believe mone-based interpretation will be valuable for extension to other spatial data types. Immunohistochemistry (IHC) images are primed for rapid progress, as recent work has shown that several IHC markers can be virtualized from H&E 45,46 . Generalizing mone analysis for other data types such as spatial transcriptomics and multi-channel protein data is also an exciting and open area, though new architectures will need to be explored to handle the high dimensionality of such images. The interpretation of CNNs for these image types is a challenging but important task, and we expect that integrative multi-modal multi-architecture mone analysis will be a potent and informative approach.

Methods
Data acquisition and pre-processing. The 20X H&E WSIs of TCGA were pre-processed, tiled, and passed through an InceptionV3 model pre-trained on image-net data as described in 1 . Consistent with 1 nonoverlapping 512 pixel-by-512 pixel tiles were used, and only tiles containing at least 50% tissue were passed through InceptionV3. The cached 2048 global average pooling layer features of InceptionV3 (called mones in this manuscript) were written to disk and analyzed. For each of these 2048 mones, the median value across all tiles within a slide was computed to yield the slide-level mone value. CPTAC-LUAD and CPTAC-LUSC cohorts were processed similarly. Differential mone analysis. Differential mone analysis identifies mones with statistically significant distributional differences across classes. Welch's t-test, Kolmogorov Smirnov (KS) test, Wilcoxon Rank Sum (WRS) test, and optimal Bayesian Filter (OBF) (see 21 for details) were used for statistical analysis. t-test, WRS test, and KS test use the Benjamini-Hochberg procedure 47 for FDR correction. The scipy python package 48 was used to implement t-test, KS test, and WRS test. The statsmodels 49 implementation of the Benjamini-Hochberg procedure was used.
Minimal risk OBF (see 21 for details) identifies mones with posterior probabilities larger than 1 − α , where α is the FDR rate. FDR-OBF (see 50 for details) outputs the feature set that bounds the sample conditioned FDR by α . OBF can report the FDR of any arbitrary feature set. Unless otherwise stated FDR-OBF is used. OBF uses Jeffrey's prior, assumes the prior probability of a mone having distributional differences is 50% (to model no preference on the identity of a mone, i.e., with or without distributional differences across classes), and sets the normalization constant of the prior to 0.1. Mones with distributional differences across classes are hereafter called markers, and mones without distributional differences are called non-markers. The posterior probabilities of OBF can be used to estimate the first two moments (mean and standard deviation) of the number of markers (see 50 for details). Assuming mone identities (marker or non-marker) are independent across cancers, the posterior probabilities can be multiplied to calculate the probability of a joint event.
OBF intrinsically computes the ratio between sample variance and weighted geometric mean of class-conditioned variances, hereafter denoted by a(m) for mone m. Similar to many ANOVA-based analyses this ratio measures distributional differences and is closely related to Bhattacharyya distance 51 Fig. 2).
Structured multi-class OBF (see 52 for details) considers the four possible relations (known as structures by OBF) between frozen normal, frozen tumor, and FFPE tumor slides: (A) a mone does not differentiate between slides (prior probability = 0.5), (B) a mone has one distribution for frozen slides (both tumor and adjacent normal slides) and another distribution for FFPE slides (prior probability = 0.5/3), (C) a mone has one distribution for tumor slides (both frozen and FFPE) and another distribution for frozen normal slides (prior probability = 0.5/3), and (D) a mone has one distribution for FFPE tumor slides and frozen adjacent normal slides and another distribution for frozen tumor slides. Mones with structure B for which frozen tumor and FFPE tumors lie on both sides of frozen adjacent normal slides (based on mean values) are considered ineffective due to FFPE/frozen differences.

Mone correlation analysis.
For each cancer type, we calculated correlations between mones over all samples of the given cancer type. This analysis was done for each cancer type. We use the Ledoit-Wolf shrinkage 53 implementation of the scikit-learn python package 54 for computing covariance matrices. We then compute the correlation matrix from the covariance matrix. We apply the Fisher transform to correlation coefficients and approximate the null with its asymptotic Gaussian distribution. Benjamini-Hochberg 47 procedure is used for FDR correction. We use seaborn package 55 with default values to generate clustergrams. Correlation matrices are averaged to compute the pooled correlation matrix. Statistically significant mone-mone correlations are referred to as "correlated mone pairs".
Differential mone correlations denote the difference between the correlation coefficient of tumor and normal slides, i.e., for each cancer, differential matrix is computed by subtracting the correlation coefficient matrix of normal slides form the correlation coefficient matrix of tumor slides. Differential mone correlation analysis uses the asymptotic Gaussian distribution of the difference between Fisher transformed correlation coefficients to compute the p-values. Statistically significant differential mone correlations are referred to as "differentially correlated mone pairs". www.nature.com/scientificreports/ Linear classification models. We implement MLDA and LR-LASSO using the scikit-learn python package 54 with default values except we set C = 100 and use the "saga" solver 56 in non-binary problems for LR-LASSO. We observed little sensitivity of AUCs to the C values ranging from 1 to 1000, and hence use C = 100 throughout. We adopted a Monte-Carlo cross-validation strategy and randomly split data to train and test sets 10 times using scikit-learn's train _ test _ split function. The splits were made at the patient level, and class ratios were preserved across train and test portions. The mean and variance of the AUCs across all splits are reported. We used the Scipy package 48 to implement the Wilcoxon signed rank test to compare AUCs. 2D t-SNE visualizations of the MLDA space were implemented using scikit-learn. PCA initialization was used to improve reproducibility of the t-SNE plots and separation of classes. Number of neighbors was set to 50. All other parameters were set to the default values.
Cell segmentation and classification. Cellpose 57 was used to segment and count number of cells in BRCA, LUAD, and COAD tiles. Given the fixed magnification and tile size the number of cells per tile captures tile level cell density. Median number of cells per tile was used as slide level cell density index. HoverNet 58 was used to segment, count, and classify nuclei within COAD tumor slides. HoverNet was executed using the pre-trained PanNuke model 59 , such that nuclei were classified into one of five types: neoplastic epithelial, nonneoplastic epithelial, connective (including fibroblasts and endothelial), inflammatory (including leukocytes, lymphocytes, and macrophages), and dead nuclei. Median number of cells nuclei across tiles were used as cell density. Median number of predicted inflammatory nuclei across tiles were used to characterize presence of immune cells.
Integrative mone-gene analysis. Gene expression data were downloaded from the GDC portal 60 . We only used slide-gene expression pairs where both the slide and the expression profile were from the same vial. Log normalized FPKMs were used. Genes with zero counts in more than half the mone-gene pairs or expression standard deviation below 0.25 were removed. Given a set of mone-gene pairs, we stack the mone and gene vectors and compute the covariance matrix using the Ledoit-Wolf shrinkage method 53 implemented in the scikit-learn python package 54 . Correlation values are computed given the covariance matrix similar to mone correlation analyses. Statistical significance tests are performed similar to mone correlation analyses.
Immune profiling and analysis. Leukocyte fractions of TCGA samples were obtained from 33 . All T-cell and B-cell categories were summed to obtain T-cell and B-cell proportions, respectively. The fractions of T-cell and B-cells were summed to obtain lymphocyte fractions. Log normalization of fractions were used throughout. Correlation analysis of immune scores with mones and IG score was performed similar to mone correlation analysis. B-cell percentages above 3% were removed for computing the B-cell correlations as they were deemed outliers.

Data availability
TCGA data are publicly available through the GDC portal (https:// portal. gdc. cancer. gov) and CPTAC data are publicly available through the cancer imaging archive 61 portal (https:// www. cance rimag ingar chive. net/).