Dissection of gene expression datasets into clinically relevant interaction signatures via high-dimensional correlation maximization

Gene expression is controlled by many simultaneous interactions, frequently measured collectively in biology and medicine by high-throughput technologies. It is a highly challenging task to infer from these data the generating effects and cooperating genes. Here, we present an unsupervised hypothesis-generating learning concept termed signal dissection by correlation maximization (SDCM) that dissects large high-dimensional datasets into signatures. Each signature captures a particular signal pattern that was consistently observed for multiple genes and samples, likely caused by the same underlying interaction. A key difference to other methods is our flexible nonlinear signal superposition model, combined with a precise regression technique. Analyzing gene expression of diffuse large B-cell lymphoma, our method discovers previously unidentified signatures that reveal significant differences in patient survival. These signatures are more predictive than those from various methods used for comparison and robustly validate across technological platforms. This implies highly specific extraction of clinically relevant gene interactions.

T he analysis of gene expression (GE) using microarrays or high-throughput RNA-sequencing 1 allows the determination of molecular interactions and GE programs in cancer cells 2,3 . However, these technologies measure concurrent GE programs in all cells of a sample collectively. They cannot directly determine which genes cooperate in specific cellular functions. It remains a major challenge to discover, dissect and extract such interactions.
Essential characteristics of cancer samples [4][5][6][7][8][9][10][11][12][13][14] are represented by GE signatures, i.e. by sets of cooperating genes. For diffuse large B-cell lymphoma (DLBCL) 2 , the most frequent lymphoma type in adults 15 , hierarchical clustering (HC) detected GE signatures that distinguish two molecular subtypes derived from different cells of origin (COO) 12 . These are clinically relevant, as patients of the germinal center B-cell-like subtype (GCB DLBCL) have a 5-year overall survival (OS) of approximately 80% compared to only 50% for patients diagnosed with the activated Bcell-like subtype (ABC DLBCL) 2 . DLBCL is a molecularly heterogeneous disease 2,16,17 and it remains a challenge to discover all disease-specific GE programs and their influence on patient outcome. Even within ABC and GCB DLBCLs, heterogeneity prevails and significant GE differences remain within these subgroups. A significantly better understanding of this heterogeneity is required to ultimately be able to treat affected patients differently.
Detecting interactions in GE datasets for many genes and samples is a highly challenging underdetermined problem. Supervised methods such as gene set enrichment analysis (GSEA) 18 use previously described information in their search for associations, while unsupervised methods aim to identify novel signatures in an unbiased manner based on GE data only. Generally, any gene subset showing significant co-expression for any subset of samples could represent a biological interaction. Detection methods therefore utilize search strategies. HC uses distance measures and linkage methods 19 to identify interacting subsets. Since its first systematic application to GE datasets 20 , HC has provided many important insights [4][5][6][7][8][9][10][11][12][13][14][21][22][23] , such as tumorimmune cell interactions in breast cancer 6 . Principal component analysis (PCA) [24][25][26] obtains principal components (PCs) by searching for maximal variance. Often few PCs already capture most of the GE variability [27][28][29] . Non-negative matrix factorization (NNMF) 30 minimizes a divergence functional 31 and has also identified clinically relevant tumor subtypes 32 . Biclustering methods such as FABIA 33 or PLAID 34 cluster genes and samples simultaneously and successfully rediscovered, e.g., breast cancer subclasses 33 . Independent component analysis (ICA) 35,36 maximizes statistical independence. Obtained independent components (ICs) may be interpreted as specific oncogenic pathways or regulatory modules 37 .
All these methods have broad ranges of applicability, but various limitations, e.g., all require missing values to be imputed in advance. While HC can identify isolated signatures 5,8,9 , it cannot dissect overlapping interactions 10,13,14 . As genes influenced by multiple interactions may show higher variability, resulting PCs may represent unspecific GE mixtures that obstruct biological understanding 25 . ICA tries to prevent such mixtures by maximizing a measure of non-Gaussianity 36 , but this does not work for normally distributed data 38 . NNMF and most biclustering methods require that the unknown number of signatures is provided in advance. Additionally, NNMF is restricted to positive signals, i.e., it cannot model gene suppression.
Here, we present signal dissection by correlation maximization (SDCM) that overcomes these limitations and that significantly extends the class of detectable interaction signatures. With interaction we generically refer to any cause of correlations between arbitrary genes in arbitrary subsets of samples. Multiple interactions may affect the same genes and samples. An interaction signature aims to extract all traces of correlation in the signal that originate from one particular interaction. After introducing SDCM, we thoroughly validate our approach and systematically compare it to 17 other approaches of unsupervised learning. We apply our algorithm together with several comparison methods to real GE data from human DLBCL samples and rank all results based on their ability to reveal differences in patient survival.

Results
SDCM concepts. The original idea of SDCM is a unifying graphical model for biologically relevant GE data that is typically represented by a heatmap depicting gene and sample dimensions simultaneously (such as Fig. 3 in ref. 3 ). Our model assumes that an individual GE program is associated with specific orders of genes and samples. In absence of any overlapping effects and measurement noise, we model GE data that was sorted by such specific gene and sample orders as a heatmap in which each participating gene row and each sample column is comprised of monotonic expression values only, i.e., all follow a common order. This consistency idea is the key for guiding data regression and dissecting overlapping GE programs that were measured simultaneously as a sum of all contributing intensities. SDCM dissects the complete GE dataset into such bi-monotonic signatures for specific gene and sample orders. Despite empowering dissection, this non-linear consistency model also significantly increased versatility and scope of detectable interaction signatures compared to linear methods.
To illustrate SDCM concepts, we first show a low-dimensional gene space and compare our approach with PCA. PCA searches for directions of maximal data variance. Quantification of data variance in any given direction ignores perpendicular distances of data points. In contrast, our search functional is maximal for directions, to which as many data points as possible are aligned as consistently as possible. To quantify this, we compute uncentered weighted correlations of data points with candidate directions (aka weighted cosine similarities). Weights decrease with angular distance to that direction. Geometrically, these weights enable SDCM to focus on data points in a double-cone around any given direction (with cone tips touching at origin, as illustrated in Fig. 1c-f). While a PC represents a data point distribution as a linear axis, SDCM extends this linear concept to nonlinear monotonic curves obtained by regression. Points with relatively low weights (outside of the cone) have lower influence on this regression than data points in the cone. In this way, SDCM extracts data structures locally, whereas PCA globally reduces data dimension by projection along the PC of maximal variance. This can result in a 1-to-many relationship between GE program and PCs, unnecessarily obstructing biological insight. We illustrate this problem in our 3D concept example that contains four simulated GE programs for different partitions of samples. For this dataset, the first PC of maximal variance passes through the empty space between the three larger simulated GE programs (Fig. 1i). As PCs are orthogonal per construction by projection, there are only three PCs in total for this 3D gene space. Hence, the optimal 1:1 relationship with the four simulated programs can principally not be obtained with PCs. The same is true for any other method utilizing projections for dimension reduction, e.g., ICA.
This problem is aggravated in high-dimensional data spaces, as more genes could play multiple roles in GE programs of different patient subsets. These subsets are then hard to represent by orthogonal PCs, similar to the 3D example. Furthermore, in highdimensional biological data, samples usually harbor multiple GE programs that are often overlapping (sharing genes). Hence, real GE datasets usually do not give rise to an unambiguous patient partitioning as simulated for the 3D example. Due to these complexities of real GE data, we shifted goal from dimension reduction, as pursued by PCA, to consistent 1:1 representation of initially overlapping interaction signatures. Similar to biclustering methods, SDCM works in full data space, treating gene and sample space on equal footing, and rather than reducing dimensions, the data signal itself is iteratively reduced to a noise cloud around zero (cf. Fig. 1g).
Because of these significant conceptual differences we had to mathematically develop SDCM from scratch based on a general superposition approach M 0 ¼ P k E k þ η. Here, the input signal, represented as a genes*samples matrix M 0 , is dissected into a sum of signature matrices E k of the same size, additively overlaid with a noise matrix η. While each signature E k formally has the same size as the input, it explains non-zero signal parts only for typically small subsets of associated genes and samples. Noise is implicitly defined by adjustable significance thresholds for signal strengths (projections of data points onto a candidate direction) and for uncentered weighted correlations (angular distances of data points to a candidate direction). As long as data structures remain in the signal that are significant with respect to these conditions, SDCM continues to dissect signatures by iterating the following four steps: (1) search for an initial candidate direction using a signature functional, (2) optimize this direction by locally maximizing the signature functional, a b c d e f g h i Fig. 1 Concepts of SDCM illustrated by a 3-dimensional example. Four interactions indicated by different colors are simulated in a space spanned by three genes. Points represent simulated log 2 (gene expression ratios) and are analyzed by SDCM (without color information). a A first representative of the simulated blue interaction is identified by the search strategy, defining the initial gene axis (yellow; Methods/Step 1). b Additional representatives (red) are selected that maximize the signature functional (Methods/Step 2). The converged gene axis (yellow) approximates the direction of interaction by a linear combination (like a principal component). c The signal of the blue interaction is regressed according to our bi-monotonic signature model. Regression focuses on a local subspace using weights decreasing by angular distance, as illustrated by cones (depicting isosurfaces for uncentered unweighted correlations of [x|a] |1〉 := 0.9). The resulting signature signal (illustrated by its corresponding gene curve, yellow) approximates the interaction more precisely (Methods/Step 3). d The discovered blue interaction (implemented as a logistic function simulating a saturation) is dissected without disturbing the signal from other interactions (Methods/Step 4). In the second detection iteration, steps a-c are repeated, resulting in the regressed gene curve for the green interaction (a super-linear polynomial). e After dissection of the green interaction, the (linear) red interaction is detected. f After dissection of the red interaction, the magenta interaction (simulating a one-sided activation threshold) is detected. g The residual signal no longer contains any qualifying signature representative and SDCM terminates. h All four detected signature signals are shown in form of their corresponding gene curves (yellow). i All three orthogonal principal components are depicted (yellow), as obtained by PCA for the same data points. (3) regress a monotonic curve through data points in the weights-based cone and (4) selectively dissect the signal parts that are consistent with this curve.
All these steps are illustrated for the 3D example ( Fig. 1) and described in the section Conceptual introduction in Methods. Identical steps are performed in case of high-dimensional signals, as illustrated in Fig. 2. We provide definitions of algorithmic steps and concepts in the Methods section. Supplementary Notes 1-7 detail subroutines. In particular we explain two key concepts, the signature functional (that guides the initial search of a linear axis and its optimization) and our bimonotonic consistency requirements (that guide signature regression onto a nonlinear monotonic curve and subsequent signal dissection).
Versatility test and method validation. To validate detection versatility for high-dimensional data, we designed seven signatures E sim k of different size and form that mimic real-world GE signatures (Supplementary Table 1). Each was simulated bimonotonically in a random gene and sample order (Fig. 3a). The test signal M sim 0 was defined as superposition P 7 k¼1 E sim k of all unordered signatures and normal noise (Fig. 3b). The challenge was to precisely detect and dissect all seven signatures and to avoid detection of false positive (FP) signatures. SDCM detected all seven signatures, reconstructed most of their signals (Fig. 3c) and stopped when the remaining signal (Fig. 3d) did no longer contain any further qualifying signature representatives. In particular, it did not return any FPs. To quantify reconstruction quality, simulated signature axes were correlated with detected ones (Fig. 3e and Supplementary Note 8). SDCM extracted all simulated signatures with correlations close to 1 (high sensitivity, red diagonals). Additionally, there was no strong correlation of detected signatures to other simulated signatures (high specificity, dark off-diagonals). These results are representative for 49 repetitions ( Supplementary Fig. 1 To test the scope of our signature model, we applied SDCM to four external benchmarks that were previously defined to rank 13 biclustering methods 33 . Each benchmark consists of 100 datasets, one was generated with a multiplicative data model, the other three with an additive data model for different signal-to-noise ratios 33 . SDCM scored best in all four benchmarks, followed by FABIAS (Supplementary Table 2). For an in-depth comparison with PCA, we designed a more challenging test signal containing two additional instances of signatures #2, #3, and #4 from the above 7-signature test (for different random subsets and orders of genes and samples). 49 simulations of this versatility test with 13 superposed signatures were analyzed with both SDCM and PCA. Detected gene axes (respectively PCs) were correlated with simulated gene axes ( Fig. 4b and Supplementary Fig. 9). SDCM was significantly more sensitive than PCA for 10/13 signatures and significantly more specific for 12/13 signatures (Supplementary Table 3  Limits. To determine the minimal signal strength required for detection, we simulated single instances of signature #6 for decreasing maximum signal strengths relative to the noise level σ. SDCM detected this signature down to approximately 0.5σ. Below 0.5σ, SDCM often terminated without any detected signature ( Supplementary Fig. 10 Fig. 15a). Even for 80% missing values (Fig. 4d), signatures #1-#4 were still detected with high correlations and the majority of their missing values were reconstructed ( Supplementary Fig. 15b).
Application of SDCM to GE data from DLBCL samples. For signature discovery in real data, we selected the largest GE dataset of human DLBCL samples available at time of this analysis with 498 patients 3 (Supplementary Note 9). SDCM dissected this dataset into 105 signatures (Supplementary Data 1 and 2). For direct usage of discovered signatures in a supervised analysis such as GSEA 18 , we also provide derived sets of signature top genes (Supplementary Data 3). Real datasets typically contain cohortspecific laboratory effects. While SDCM dissects these from biological signals, resulting FP signatures have to be filtered out.
To verify signatures on GE level and test their robustness after changing the technological platform, we transferred them to both a microarray-based validation cohort of 233 independent DLBCL patient samples 2 and to a RNA-sequencing based validation cohort of 624 independent DLBCL patient samples 39 Fig. 16 shows signature k = 6 as example). Additionally, strongly bimodal sample strengths in signature k = 6 induced binary clusters that were significantly associated with independent patient gender data (detection cohort: p ¼ 2:8 10 À83 , microarray-based validation cohort: p ¼ 5:7 10 À50 , RNA-sequencing based validation cohort: p ¼ 6:2 10 À98 , χ 2 -tests). Albeit as disease-unspecific as gender, this signature is a first biological proof of concept and it may also serve as a control signature for future dissections of human GE studies.
To filter signatures, they were associated with previously described disease-specific gene signatures or clinical data. Interpretations of these associations for all discovered signatures are beyond the scope of this study. Here, we focus on those of key importance for method validation. In the study by Lenz et al. 2 , three GE signatures were identified and combined into a survival predictor model. This model was able to identify patients with significant differences in survival. SDCM rediscovered all three signatures, as confirmed by high and significant gene set enrichment 18  . In particular, we refer to iteration k = 12 as rediscovered COO 40 signature, as it showed the highest enrichment for known ABC and GCB DLBCL gene sets. This confirms the utility of SDCM in discovering cancer subclasses. Size, inner correlation and amount of explained signal for all 105 discovered signatures are depicted in Supplementary Fig. 20. Signatures detected in early iterations already explain most of the signal, while at later iterations, signatures are typically smaller (fewer genes in their Signature focus w g j i; w s j i ð Þ , see Methods).
As general statistics like signature size or the amount of explained signal cannot robustly indicate biological relevance, we systematically tested the association of every signature with patient survival data (see below).
Method comparison with real data. We also applied PCA, ICA, NNMF and FABIA/S to this real dataset (same 13 configurations as tested for simulated data). We compared detected gene axes (PCs, ICs, NNMF or FABIA/S loading vectors) with SDCM signatures via weighted correlations (Supplementary Note 8).
Results are shown in Supplementary Figs. 21-24 and summarized in Supplementary Note 13. In brief, while the strong genderassociated signature and the large Stromal-1 signature were rediscovered by several configurations, none of the comparison methods rediscovered the top survival signatures identified in this study with high correlation. Additionally, artefacts were revealed, such as splitting the signal of the gender-associated signature over multiple components. Overall, ICA results showed the highest similarity to SDCM.
Survival models and method comparison. To determine in an unbiased fashion which of the signatures are potentially responsible for patient survival differences, and which detection method extracted the most predictive signatures, we constructed alternative Cox proportional hazard models (CPHM) 41 . We modeled the progression-free survival (PFS) of patients with combinations of up to three signatures from any of the detection methods (Supplementary Note 14 and Supplementary Fig. 25). We tested every combination of two or three signatures discovered by SDCM (including rediscovered stromal signatures) or by any of the compared methods. The best 2-signature model was Signatures k = 27 and k = 11 represent overlapping opposite survival associations (fitted Cox coefficients have different signs). Signature k = 11 did not seem to be predictive when tested alone (q LRT = 0.47, likelihood ratio test on top of the age-only model, Bonferroni-corrected for the SDCM signature family). Neither was it predictive on top of the rediscovered COO signature k = 12 (q LRT = 0.30). However, prediction was highly synergistic when combined with signature k = 27 (q LRT = 6.8 × 10 −6 ). Kaplan-Meier survival estimates for four molecular risk groups visualize E (Eq. 29; green triangles in rows). Samples are ordered by their signature-specific sample strengths (light orange triangles in columns). Heatmaps show GE intensities in the detection cohort (left), the microarray-based validation cohort (center) and the RNA-sequencing based validation cohort (right). Colored lines below heatmaps indicate outcome (light red; PFS available for microarray-based cohorts; OS for RNA-sequencing) and subtype. In all cohorts, ABC DLBCL cases (light blue) are more frequent at negative sample strengths, while better outcome occurs for higher sample strengths. Signature details on transcript cluster level are presented in Supplementary Fig. 26. Gray lines indicate missing matches between platforms. Tabular definitions (signature axes and strengths for microarray-based cohorts) are provided in Supplementary Data 1 and 2. b The second survival-associated GE signature in the best predictor model was detected in iteration k = 11. As in (a), its top 40 genes are displayed. Similar to (a) and in all cohorts, colored lines indicate a higher frequency of GCB DLBCL cases (orange) at higher sample strengths. Other than in (a), these are paired with an untypically high rate of progressions or deaths for this subtype (light red). our predictor (Fig. 6a). Between high molecular risk (patients with ≥175% of the average risk in the detection cohort) and low molecular risk, a difference in 5-year PFS of 58% was detectable (p ¼ 1:2 10 À11 via log-rank test reasserts the Cox model; Supplementary Fig. 30c shows OS). In the smaller validation cohort, the difference in 5-year PFS was still 53% for identical molecular risk groups (Fig. 6b, p ¼ 2:3 10 À5 ; Supplementary Fig.  30d shows OS). To further confirm robustness after changing the technological platform, we applied our predictor to OS in the independent RNA-sequencing validation cohort (unfortunately no PFS available). While expected cohort-to-cohort variations were observed, strong survival differences in 5-year OS of 40% remained between identical molecular risk groups (p ¼ 4:9 10 À7 ; Supplementary Fig. 30e).
Collectively, this indicates that identified signatures represent specific characteristics of DLBCL and not just of a single DLBCL To visualize the survival trend revealed by our continuous 2-signature predictor, we show Kaplan-Meier estimates for partitions of predicted molecular risk (of ≥175%, ≥100%, ≥1⁄175%, and <1⁄175% the average risk in the detection cohort). Significant differences in progression-free survival were predicted in both the detection cohort (a, 470 R-CHOP treated DLBCL patients) and the validation cohort (b, 220 patients). (Statistics of each survival curve and OS can be found in Supplementary Fig. 30.) c, d Within GCB DLBCL (as originally classified), the quartile containing highest predicted molecular risks revealed significantly and strongly adverse survival, whereas the other three quartiles shared average survival (merged into one curve; details and ABC DLBCL are depicted in Supplementary Fig. 32; OS and second validation cohort in Supplementary Fig. 33). e, f To analyze and visualize the relationship between molecular risks and known clinical risks, we repartitioned all samples having an international prognostic index (IPI) 42 into terciles. In each clinical risk class, significant survival differences remained between top and bottom terciles of predicted molecular risks (details in Supplementary Fig. 35).
cohort. In contrast, survival differences between previously described DLBCL subtypes did not generalize as well across these cohorts ( Supplementary Fig. 31).
Furthermore, the identified 2-signature model discriminated patient survival significantly better than the 3-signature model of Lenz et al. 2 (p LRT = 1.3 × 10 −8 , likelihood ratio test on top of previous predictor scores, Supplementary Note 16). Vice versa, the former 3-signature predictor could not add significant predictive information on top of our 2-signature model (p LRT ¼ 0:07).
Applied separately to DLBCL subtypes ( Supplementary Fig. 32), our predictor revealed a subgroup within GCB DLBCL that was associated with significantly adverse survival (p ¼ 1:0 10 À9 , logrank test, Fig. 6c). This subgroup was confirmed in both the smaller cohort (p ¼ 4:6 10 À5 , Fig. 6d) and for OS in the RNAsequencing validation cohort (p = 0.001 after crossing cohort, technology and survival type; OS results for all cohorts in Supplementary Fig. 33). Sample strengths in both survival signatures showed that ABC DLBCL samples cluster densely at lower strengths in both signatures. In contrast, GCB DLBCL cases expressed these signatures heterogeneously and were scattered from high to low risk ( Supplementary Fig. 34). To quantify the information provided by our molecular predictor on top of known clinical risk factors, we grouped samples by their international prognostic index (IPI) 42 that combines the clinical parameters performance status, age, stage, number of extranodal sites and serum LDH level. Interestingly, predicted survival differences remained significant in all clinical risk groups (Fig. 6e, f and Supplementary Fig. 35).
Biological associations. To elucidate biological associations of the two top survival signatures in an unbiased manner, we analyzed their gene ranks by discovered gene strengths for enrichment 18 of 16.513 previously identified GE signatures 40,43,44 (Supplementary  Note 12). Interestingly, both correlated and anti-correlated genes of k = 11 were significantly enriched with signatures differentiating Burkitt lymphoma (BL) from DLBCL samples in a supervised analysis 45 (HUMMEL_BURKITTS_LYMPHOMA_DN with enrichment score es ¼ À0:903; HUMMEL_BURKITTS_LYM-PHOMA_UP with es ¼ 0:741; each with p 0:001 by permutation tests). In all DLBCL cohorts, correlated samples (sorted to the right in Fig. 5b) showed a BL-like GE pattern. As expected, both k = 27 and k = 11 are associated with germinal center signatures (e.g., Germinal_center_Bcell_DLBCL; es ¼ 0:76 for k = 27 and es ¼ 0:55 for k = 11; p 0:001). Likewise, both are associated with a signature determined in an analysis based on survival data 2 (Ger-minal_center_B_cell_DLBCL_survival_predictor; es ¼ 0:78 for k = 27 and es ¼ 0:63 for k = 11; p 0:001). In the k = 11 signature, we also detected various previously identified target genes of the oncogenic nuclear factor-kappa B (NF-κB) signaling pathway such as BCL2A1, IRF4, TNFAIP3 or IL21R. This observation was confirmed in the unbiased GSEA, as anti-correlated genes of k = 11 (upper side in Fig. 5b) were significantly enriched with a previously identified NF-κB signature (NFkB_Up_bothOCILy3andLy10; es ¼ 0:68; p 0:001). In contrast, k = 27 was not associated with NF-κB signaling (es ¼ À0:25; enrichment results are shown in Supplementary Fig. 36).

Discussion
We provide an unsupervised learning concept for dissection of large datasets that are typically comprised of the collective signal from many unknown interactions and substantial measurement noise. As we have demonstrated, SDCM is able to discover and extract interaction signatures from such data with high sensitivity and specificity. In cancer transcriptomics, this superior performance will most likely lead to more precise dissections of entities that are molecularly heterogeneous, allowing a more specific identification of potential oncogenic programs. However, our method is generally applicable, as underlined by its top rank among 13 biclustering methods in four external benchmark datasets. Furthermore, SDCM reliably dissected highly overlapping interactions, determined the unknown number of signatures with low false discovery rates and reconstructed missing values. Finally, using outcome as independent indicator, two SDCM signatures for human DLBCL GE data showed the highest impact on patient survival of all signatures from all compared methods, suggesting high biological signature specificity. This supports the design of required comprehensive functional validation experiments to biologically identify driving interactions of these survival differences.
A central reason for high specificity and for the broad applicability of SDCM is our flexible signature model. While PCs and ICs approximate interactions only linearly, detected E k are equivalent to non-linear monotonic curves (Fig. 1c, h). They can precisely regress a much broader class of interaction signatures. Precision is crucial, as residuals missed during dissection would lead to FPs or artificially disturb not yet dissected signatures from overlapping interactions. Another reason is our signature functional E that guides detection and optimization. It is the higher the more genes and samples take part in an interaction and the more consistent correlations this interaction causes across participating genes and samples. This optimizes signature specificity. In contrast, PCA maximizes the data variance explained by each PC, instead of correlations to it. Resulting PCs therefore represented mixtures of several interactions in superposition tests and in real data (cf. Supplementary Fig. 21b, c). An additional key advantage compared to projection-based methods (including PCA and ICA) is that axes of different signatures are not constrained to be orthogonal. This enabled precise dissections even of partially correlated interactions (as illustrated in Fig. 1), whereas orthogonality forced them to be represented by several unspecific components (Fig. 1i). Similar artifacts were observed for NNMF and FABIA/S when dissecting real GE data ( Supplementary  Figs. 23 and 24). As we use a perturbative-like approach (first searching a linear signature axis and then regressing a bimonotonic signature curve), non-linearities that deviate from linearity too strongly cannot be described by bimonotonic signatures. The same is true for all comparison methods. However, we did not find any biologically meaningful pattern of this type in the context of GE data.
While our primary goal was method validation against real data, dissection of DLBCL GE data into 105 signatures has provided important insights into the molecular heterogeneity of this diagnostic category. Already the gender-associated control signature and rediscovered stromal DLBCL signatures validate the utility of SDCM. For comparison, we additionally obtained signatures using PCA, ICA, NNMF and FABIA/S for the same data. To determine which of these signatures have the largest impact on patient survival, we constructed Cox proportional hazard models and ranked them using the Akaike information criterion. The best bivariate model was comprised of two SDCM signatures. It revealed significantly higher differences in patient survival than the previously described 3signature predictor by Lenz et al. 2 that included two stromal signatures. Instead of utilizing limited survival data directly to find associated genes by supervised learning, which may cause a high false discovery rate, we dissected GE data in an unsupervised manner first. We also validated both signatures on GE level before using any survival data. Consequently, their strong survival associations (Fig. 6a, b) in both the detection and validation cohort are independent validations, indicating biological signature specificity and relevance. This was underlined on both GE and survival level in the RNAsequencing based validation cohort that only became available after method development, signature dissection and predictor construction.
While both signatures sorted samples roughly from ABC to GCB DLBCL over increasing sample strengths (Fig. 5), they were associated with overlapping anti-aligned survival trends and their combination predicted survival highly synergistically. As such anti-aligned survival trends can neutralize each other, they are hard to detect by supervised analyses. They revealed a 45% difference in 5-year PFS for patients with DLBCL currently classified with a high clinical risk (Fig. 6f) and identified a subgroup within GCB DLBCL patients with more than 40% inferior 5-year PFS (Fig. 6c, d). Further analyses revealed that this subgroup is characterized by a Burkitt lymphoma like GE profile and low expression of target genes of the oncogenic NF-κB pathway. This might potentially link to the just recently identified molecular high-grade B-cell lymphoma (MHG) subtype 46 that shares these survival and BL GE characteristics. Collectively, this confirms that SDCM is able to discover novel and clinically relevant subtypes, without using prior knowledge of GE signatures or survival data. Remaining GCB DLBCLs with relative high expression of NF-κB (left in Fig. 5b) consequently showed superior survival relative to average GCB DLBCL patients. This subgroup might be linked to a previously identified GCB DLBCL subtype with favorable outcome that is driven by BCL2 and active NF-κB signaling 47 . Discovered PFS differences were of similar magnitude in both microarraybased cohorts (Fig. 6b, d) and remained significant for OS and after crossing the technological border to RNA-sequencing ( Supplementary Fig. 33). This was not expected, as average survival differences based on previously classified COO-related DLBCL subtypes and underlying signatures did not generalize as well (Supplementary Fig. 31).
Our results implicate that SDCM is able to dissect overlapping biological interactions by searching for signatures of maximal GE correlation. Top genes of identified survival signatures (Fig. 5) might provide a more robust and more specific representation of genetic interactions responsible for survival differences in DLBCL than previous COO-related signatures. Signatures k = 11, 12, and 27 are all partially correlated, but their differences are biologically important. Similar to the 3D example ( Fig. 1), partially correlated interactions do not need to be identical and should be dissected precisely.
Due to constraints such as linearity or orthogonality, previously described comparison methods have a narrower scope of detectable interactions. Hence, they often failed to dissect overlapping interaction signatures in the signal, potentially missing biologically important correlations. Our perturbativelike expansion towards monotonically non-linear interactions provides a high versatility in dissecting collectively measured molecular interactions. This might be a substantial advance and could lead to discoveries of novel correlations in transcriptome data of different cancer types with possible translational impact on diagnosis and therapy. This is a central research goal in context of the precision medicine initiative 48 that currently quantifies transcriptomes for many cancer entities. However, as the provided SDCM toolbox is based on generic mathematical concepts, namely correlation maximization, bi-monotonic regression and iterative focused dissection, it may likewise be applied to dissect many other data sources, such as mutational frequencies based on DNA-sequencing or even non-biological data such as light spectra in astrophysics 49 to infer star classes. Essentially, any data domain that was analyzed by PCA might also be suitable for dissection with SDCM, especially if PCs were hard to interpret in the respective system context.

Methods
Conceptual introduction. Here, we conceptually explain the four iterated main steps of SDCM, as introduced in the section SDCM concepts in Results.
As the number of all possible directions (GE programs) grows exponentially with data space dimension, the search strategy in step 1 concentrates on the subset of actually measured directions. Each sample column in the input matrix (each point in Fig. 1) represents one direction in the space spanned by genes. Likewise, each measured gene row defines one direction in sample space. By restricting the search to only these n + m measured directions we assume that some of these candidate directions pass through point clouds that SDCM is designed to detect: We are looking for GE programs that are characterized by a consistent intrinsic order of co-expression of participating genes in multiple samples. The intrinsic gene order defines a characteristic direction in gene space. Samples expressing this program at various intensities are then scattered along that common direction, giving rise to a specific sample order. Our search functional obtains higher values for such characteristic directions than for axes that only pass through their generating data point with no others nearby. Many deviations from purely linear patterns (i.e., from constant gene co-expression ratios for all samples) are conceivable for real GE programs, as long as their characteristic co-expression order remains consistent. This has led to the model of monotonic curves. Compared to linear point clouds at orthogonal directions that can also be represented 1:1 by PCs, our class of detectable interaction signatures along nonlinear monotonic curves is huge. For this class, our above assumption for search space restriction always holds, as one participating sample is always in the middle of its directional point cloud. In contrast, SDCM does not search for, e.g., circular point clouds or points scattered in higher-dimensional manifolds of the gene space, as we do not expect such structures to exist in real GE datasets. Concentrating on the broad class of directional point clouds, our search strategy aims to select the candidate direction with the globally highest signature functional.
Such an initial direction generated by only a single data point is rarely an optimal representation of a GE program that is active in many samples. Therefore, starting from an initial direction, step 2 selects additional data points nearby to refine that direction by propagating it to the weighted mean of all selected data points (Fig. 1b). This continues iteratively, as long as the signature functional has not yet obtained its local maximum.
Once the linear direction has been optimized, step 3 regresses a curve through the data points in its vicinity. Regression weights of data points decrease with their angular distance, as illustrated by double-cones in Fig. 1c-f. Regression is guided by the key consistency constraint of our data model that after ordering gene rows and sample columns by signature strengths (given by projections on the axis from step 2) the resulting signature matrix is bimonotonic (for a detailed motivation for this requirement see the next subsection). Depicted as a heatmap after sorting by signature-specific gene and sample strengths, each regressed signature matrix E k thus displays with all-monotonic rows and columns (see Fig. 2d for a highdimensional regression result).
Real data patterns are only approximately bi-monotonic (see, e.g., Fig. 2c). Observed differences to exact bi-monotonicity might not be caused by noise alone but often contain valuable information about other overlapping GE programs. Therefore, SDCM only dissects those parts from the signal in step 4 that are exactly consistent with our bimonotonic model. The residuals remaining in the signal might then contribute to the detection of signatures dissected in later iterations.
In the following subsections, we introduce SDCM in a formal way. In particular, we detail key concepts such as the signature functional that guides the initial search of a linear axis and its optimization and our bimonotonic consistency requirements that guide signature regression onto a nonlinear monotonic curve and subsequent signal dissection.
Formal introduction. Here, we describe our signal and signature model again and provide an overview of the algorithmic structure, with links to detailed mathematical notes about each formal aspect.
SDCM analyzes datasets as two-dimensional matrices M 0 with n columns representing samples and m rows representing genes (or other features). We aim to extract interactions as signatures defined by their signal matrices E k of same size as M 0 . Components E k i; j ð Þ 2 R quantify the expression (or suppression) of each gene i in each sample j for one specific interaction. We assume that M 0 is the result of a superposition of different E k , i.e., M 0 ¼ Pk k¼1 E k þ β wherek is the initially unknown number of signatures and η is a normal random matrix describing measurement noise. This allows genes playing multiple roles; overlapping roles may cause partial correlations between signatures that need to be dissected into individual E k . Similar superposition hypotheses underlie PCA, ICA, and FABIA, as these methods are equivalent to matrix factorizations of M 0 33,37 .
The dissection of M 0 into individual E k is an underdetermined problem. To overcome this, we first constrain the class of signals E k . We assume that biological interactions are characterized by a hierarchy of expression of participating genes ranging from weak to strong. While expression strengths may also vary between samples, we assume that the hierarchy of genes is consistent in all samples influenced by the same interaction. Likewise, the hierarchy of samples must be consistent for all participating genes. Formally, each E k is required to have a permuted gene order I k and sample order J k such that their resorted signal matrix e E k E k I k ; J k ð Þis bi-monotonic, i.e., monotonic over all genes for each sample and vice versa (see the section Model in Methods). Genes not participating in the respective interaction and unaffected samples have E k i; j ð Þ ¼ 0. This requirement is based on the following assumption: The order of genes by their signature strengths must be consistent for each sample participating in the signature, as this gene order should represent the intrinsic structure of a single GE program that cannot change from patient to patient. Symmetrically, the order of samples by their signature strengths should be consistent for each gene participating in the signature, as this order quantifies the involvement of samples in the whole GE program that cannot change from gene to gene. Of course, a single gene of any given GE program typically cannot reliably sort samples by their involvement in the whole program due to noise or due to additional roles that gene may play in overlapping programs. However, the monotonic ties between many samples and the required repetition of that pattern in many genes and vice versa, as required by our bi-monotonic consistency model, leads to statistically significant and reliable orderings of both samples and genes. Each signature discovered by SDCM is based on such bimonotonic ties, representing one specific underlying GE program (or another statistically significant structure in the data).
A basic example is a purely linear interaction. Here, participating genes are expressed at a fixed ratio relative to each other, inducing a gene order I and defining a line in gene space, such as a PC or IC (gene axis). The same interaction may differ in strength between samples, inducing a sample order J along that line (cf. red points in Fig. 1a). Hence, every linear interaction is automatically bi-monotonic. Bi-monotonic e E k generalize this linear concept and are geometrically equivalent to monotonic curves (Fig. 1h). They can represent all nonlinear monotonic interactions, such as saturations (blue in Fig. 1a) or super-linear interactions (green). Furthermore, this model covers conventional clusters representing, e.g., activation thresholds (magenta). Projections of samples on such a signature curve quantify their strengths in the respective signature (see the section Signature strengths u g j i; u s j i ð Þin Methods). In case of many gene dimensions, more complex monotonic deviations from purely linear interactions become possible. Introduced curves enable more precise approximations of such signatures, while purely linear models may lead to large residuals, effectively splitting the description of a single signature into several artificial parts. Generally, genes may interact indirectly in a complex combinatorial fashion. As GE datasets measure an ensemble of non-synchronized cells at one time point only, they usually do not provide sufficient information to extract individual gene-gene interactions. However, many well-researched gene signatures, e.g., hallmark gene sets of the Molecular Signature Database 43 , show that genes realizing the same cellular function are effectively being co-expressed in the cell ensemble of affected samples. Hence, SDCM can detect signatures of such gene networks, even if participating genes are interacting only indirectly, provided the same effective co-expression is measured for several genes and samples.
Furthermore, we assume that larger signatures represent more robust interactions and that higher correlations are indicative for more specific (and less mixed) interactions. With the purpose of detecting and quantifying directions in the gene or sample space that maximize those characteristics, we defined a signature functional E (see the section Signature functional in Methods). Its maximization in conjunction with the signature model suffices to infer signatures from M 0 unambiguously. SDCM is a deterministic algorithm that detects signatures iteratively. Each iteration k dissects one signature signal E k from the current signal matrix M kÀ1 until termination, effectively dissecting M 0 into all E k . Each iteration consists of four major steps: (1) Search strategy: An initial representative (a sample or a gene) is selected that is associated with signature axes to which many other samples or genes have a high correlation (as quantified by E). Figure 1 illustrates SDCM concepts in a 3D gene space. Figure 1a shows an initial sample and its associated gene axis. Generally, a sample or a gene may be selected as initial representative.
Unlike other methods such as ICA, SDCM evaluates measured correlation structures simultaneously in both the gene and sample space to stabilize the search (see Methods/Step 1 for symmetrization details). (2) Correlation maximization: To make the linear signature estimation robust and independent of the initial representative, we iteratively add representatives that further maximize E until convergence to a local maximum (Methods/Step 2). Correlation-based gene and sample weights smoothly quantify signature membership (see the section Signature focus w g j i; w s j i ð Þ in Methods). Optimized signature axes represent locally maximized correlation (between signature axes and members) and maximized signature size (Fig. 1b).
(3) Bimonotonic regression: As true signatures are not necessarily linear, we employ our bimonotonic model to capture a much broader class of onedimensional manifolds. First, the signal matrix is reordered specific to gene and sample strengths in the detected signature (see section Signature strengths u g j i; u s j i ð Þin Methods). To extract the bimonotonic signature signal, we apply regression to this sorted signal matrix (Methods/Step 3). Any residuals, i.e. parts of the signal that are not monotonic, are interpreted as belonging to overlapping foreign signatures (or as noise). The signature signal is equivalent to a monotonic curve over the corresponding signature axis (Fig. 1c). (4) Dissection: Finally, the detected signature signal is dissected from the overall signal matrix (Methods/Step 4). While methods based on matrix factorization or projection like PCA or ICA reduce space dimension with every PC or IC, SDCM always retains full space dimension. Rather than reducing the signal of all genes and samples by projecting the dataset along a linear axis, we utilize gene and sample weights to precisely dissect only those parts from the signal that are consistent with the detected signature (Fig. 1d). This enables dissection in cases of partially correlated signatures, i.e. when more signatures pass through the same (sub)space than this (sub)space has dimensions, e.g., four signatures in the 3D example.
These steps are repeated for every signature (Fig. 1d-g). Afterk dissections, the remaining signal Mk no longer contains any qualifying signature representative and SDCM terminates (Fig. 1g). Termination determines the number of signatures in the signal via significance thresholds for correlations and signal strengths. As the sum of all E k and the noise residual restores the input signal matrix M 0 , SDCM provides a complete dissection of the signal.
When dissecting high-dimensional signals, identical steps are performed Figure 2 illustrates a detection and dissection iteration for many genes in coordinate view. (For the 3D example, this coordinate view would show a heatmap with only three rows for x, y and z genes and as many columns as there are data points in Fig. 1.) Generally, signatures may share the same (sub)space, as already illustrated by the 3D example. Additionally, they may overlap, i.e., they may affect measurement values for the same (gene, sample)-combinations. As long as underlying correlations are sufficiently dissimilar (i.e., as long as the angle between corresponding linear signature axes is sufficiently nonzero), such overlapping signatures can still be dissected. Their regressed bimonotonic E k then estimate the original non-overlapped source signals (e.g., Fig. 2d).
In the remainder of this section, we provide a comprehensive mathematical description of SDCM. For practical applications, we provide a MALTAB® analysis toolbox including the full SDCM source code, unit tests and examples (see Code availability).
Framework. For the mathematical description of SDCM, we use the language of vector spaces in its established notation known from quantum mechanics.
The complete measured dataset of m genes (or, more generally, features) and n samples (e.g. tumor samples from patients) is represented as a single matrix M 0 2 R m n , where R m n fðX ij Þ i¼1 m;j¼1 n jX ij 2 Rg is the signal space.
The gene space V g is a vector space spanned by m basis genes fje ; m f g . Similarly, the sample space V s is a vector space spanned by n basis samples fje s j ijj ¼ 1 ng. Every je s j i has coordinates ðδ νj Þ ν¼1 n 2 R n in the sample reference order J 0 1; ; n ð Þ . The upper index s indicates elements of this space. For each gene, all samples have been measured and hence genes are points in this vector space spanned by all basis samples. Gene i is represented by the vector jg i i ¼ P n j¼1 he s j jg i ije s j i with expression values he s j jg i i M 0 ði; jÞ for sample indices j 2 f1; ; ng.
Model. In general terms, we aim to detect interactions in subsets of genes that exist in subsets of samples. The size of subsets is initially unknown, the participation intensity in these interactions may vary from gene to gene and from sample to sample, and subsets of different interactions may be overlapping.
We assume that the signal is a linear superposition ofk signature signals E k plus a noise term η: This superposition model is ambiguous without further constraints. In order to maximize the range of detectable interactions, we decided to only weakly restrict the functional form of the E k . We require that a gene order I k and a sample order J k exist, such that every row and every column of the reordered matrix E k I k ; J k ð Þis monotonic. This signature model can fit any interaction following a curve in the gene space (or sample space) that is monotonic over the signature axis in the same space, as explained in the main text. Formally, this bi-monotonicity of all E k is defined by: 8k : 9I k 2 perm I 0 ð Þ; J k 2 perm J 0 ð Þ : where perm X ð Þ π X ð Þjπ : X ! X bijective f g is the set of all permutations of a finite set X.
To determine signatures, we utilize a maximization principle as additional constraint. As detailed in step 2 below, we iteratively maximize the signature functional E that quantifies both the signature size and correlations of genes and samples to signature axes based on our measure for interactions defined next.
Measure for interactions. To define signatures of interaction, we first have to quantify how the interactions we look for can be observed in the measured data.
We correlate genes or samples |x〉 with a signature axis |a〉 that linearly approximates a potential direction of interaction in the respective vector space by computing the uncentered weighted correlation defined by where |w〉 are context-dependent weights and dots denote component-wise products. xja ½ w j i measures the consistency of regulation as it only depends on directions of |x〉 and |a〉.
In situations where signal strengths are important, we utilize the weighted projection of |x〉 on |a〉 The upper index 0 indicates normalization by w:a k k.
Algorithm overview. SDCM is performed iteratively. In every iteration k, a signature is detected and its contribution to the signal is regressed, yielding the signature signal E k . The remaining signal is the input for the next iteration. This continues until there are no more qualifying signature axes. Each detection and dissection iteration processes the following four steps.
Step 1 Initial representative or termination. We test if measured vectors in gene and sample space qualify as initial representative of some signature. To detect large and consistent signatures (having many highly correlated top genes and top samples) first, we search for maximal signature functional E by the following loop. For each gene jg i i and each sample js j i of the signal matrix M kÀ1 • Compute the following initial signature vectors: • For gene jg i i choose ja s i jg i i as sample axis, for sample js j i choose ja g i js j i as gene axis.
• Compute initial weights jw s initial i respectively jw g initial i based on the standardized signal (computed as described in Supplementary Note 1).
• Complete the pair of signature axes a g j i; a s j i ð Þvia weighted projections: • For gene g i , define the gene axis a g j i by computing projections of all other genes g i 0 onto g i . Normalizing by w s initial yields values that can directly be interpreted in units of the input signal.
• For sample js j i, define the sample axis ja s i by computing projections e s j 0 ja s s j 0 js j of all other sample js j 0 i onto js j i.

•
The completed axes pair a g j i; a s j i ð Þpoints to a potential signature. This pair is the basis of all subsequent computations. In this way, a symmetric treatment of genes and samples is guaranteed.
• To quantify the relationship of all genes and samples to axes ðja g i; ja s iÞ, compute uncentered weighted correlations ðjr g i; jr s iÞ to them via • Compute probabilities ðjp g i; jp s iÞ that these correlations ðjr g i; jr s iÞ have been caused by noise (see significance of correlations in Supplementary Note 2).
• Refine signature vectors to make them more specific: • Define more specific weights ðjw g i; jw s iÞ (and ðjv g i; jv s iÞ) based on now available correlations ðjr g i; jr s iÞ and their ðjp g i; jp s iÞ values. We refer to these weights as the (extended) signature focus, as they are used to focus subsequent computations on those genes and samples that are most correlated to the detected underlying interaction (see the section Signature focus ðjw g i; jw s iÞ in Methods section).
• For gene jg i i, refine its gene axis ja g i by he g i 0 ja g i hg i 0 jg i i 0 jw s i =kw s k. For sample js j i, refine its sample axis ja s i by he s j 0 ja s i hs j 0 js j i 0 jw g i =kw g k.

•
Refine and update correlations ðjr g i; jr s iÞ to signature axes: he g i 0 jr g i ½g i 0 ja s jw s i and he s j 0 jr s i ½s j 0 ja g jw g i .

•
Update probabilities ðjp g i; jp s iÞ that observed correlations may have been caused by noise.
• To quantify the consistency and size of a potential signature along axes ðja g i; ja s iÞ, compute the signature functional E½ja g i; ja s i (see the section Signature functional in Methods).
• To determine if axes ðja g i; ja s iÞ qualify for further processing, several thresholds for signature size or significance for correlation and signal strength are tested (see Supplementary Notes 2 and 3 and default qualification thresholds in Supplementary Note 4). Default thresholds have been optimized to exclude false negative signatures as primary objective and minimize false positives as secondary objective.
If qualified axes pairs ðja g i; ja s iÞ are obtained by the loop, select the pair with the highest E½ja g i; ja s i and pass corresponding signature vectors on to step 2.
If no gene or sample in the current signal matrix M kÀ1 yields qualifying signature axes, SDCM terminates.
For performance optimization, we do not process every gene and sample for each iteration k. Instead we presort all genes and samples based on their uncentered standard deviations and apply a lookahead scheme (see presort order and lookahead scheme in Supplementary Note 5). This may slightly influence the order in which existing signatures are detected, but not whether they are detected. In any case, SDCM stays deterministic.
• Compute their p-values ðjp g lþ1 i; jp s lþ1 iÞ as well.
• Compute the signature focus ðjw g lþ1 i; jw s lþ1 iÞ and the extended signature focus ðjv g lþ1 i; jv s lþ1 iÞ for these correlations as in step 1.
• Compute the signature functional E½jb g lþ1 i; jb s lþ1 i and store it for the tested gene or sample.
• Finally, select that gene or sample as representative l + 1 that leads to the highest signature functional.
Once convergence criteria have been met for a certainl, b are the final signature axes. They linearly approximate contributions of the detected interaction to the signal, similar to PCs or ICs. In step 3, they serve as basis for a monotonic nonlinear (and hence more precise) approximation.
Step 3 Bi-monotonic regression and smoothening. The contributions of a signature to the signal matrix M kÀ1 are regressed by one outer and one inner loop until convergence. The outer loop (with index o) is structured as follows: • Compute signature strengths ju Mô ;q Þ is passed on to step 4.
Step 4 Signature signal E k and its dissection. Dissection strengths D k 2 R m n for signature k are defined as square roots of the components of the outer weights product jw ĝ l i jw ŝ l i, i.e., e D DðI k;ô ; J k;ô Þ denotes these dissection strengths in the final signature order. The signature signal is defined in signature order as component-wise product (e.g., Fig. 2d; gray shadings depict e D and fully grayed-out pixels correspond to zero dissection strengths).
The signature signal in reference order is obtained by sorting back via E k ðI k;ô ; J k;ô Þ Ẽ k .
Dissection of signature k is now simply a matrix subtraction of its signature signal E k from the current signal M kÀ1 . The remaining signal is the input for the next detection and dissection iteration k þ 1 (e.g., Fig. 2e or Fig. 1d).
Signature focus. Purpose of the signature focus is to define where a detected effect ends. While this may seem easy for plateau-like clusters (having few highly correlated genes or samples while all others only have low correlations), often real effects have no clear edge. In this case, we prefer a smooth decrease in membership weights. The signature focus should not influence ranks of top members, as they need to be determined by signature strengths for optimal bimonotonic regression (cf. Eqs. 29 and 30). But non-members should be identified and excluded by the signature focus to minimize the influence of noise. The signature focus consists of correlation-based gene and sample weights that allow the computation of all vectors and scores as specific as possible, even if the detected interaction only affects a small subset of measured genes and samples. To retain specificity for such small signatures, we set weight components he g i jw g i and he s j jw s i exactly to zero for all (and potentially very many) genes and samples that have only weak or insignificant correlations to detected signature axes.
For samples je s j i, let where the second factor decreases quadratically to zero with the noise probability of sample correlations (cf. Supplementary Note 2). Sample weights are defined relative to x s j . Values ≥50% of the maximum of all x s j are already mapped to full weight: Consequently, weights have no influence on the signature's order of top samples, as is intended. To exclude any unspecific influence of samples with relatively weak or insignificant correlation, we finally set all weights to zero that are lower than two thirds of their quantile: For signature size estimation and qualification thresholds, mapping all x-values above 50% to full weight is not optimal. To keep the full dynamic range of weights for these tasks, we additionally define the extended signature focus v g j i; v s j i ð Þ ð 21Þ by increasing the upper threshold from 50% to 100% (i.e., y g i x g i and y s j x s j ) and by decreasing the lower specificity threshold from two thirds to 0.4. Otherwise, Þare computed in the same way. Importantly, specificity thresholds of the signature focus exclude noise genes from computation that could otherwise reduce both specificity of signatures and detection robustness. In case of many measured genes and small signatures, this can also speed up computation.
Signature functional. To detect and unambiguously dissect a signature, first its optimal linear directions along which it extends in gene and sample space have to be determined. The goal of the signature functional is to score candidate directions during the initial search (Methods/Step 1) and during optimization (Methods/ Step 2).
The signature functional E½ja g i; ja s i assigns to every possible axes pair ðja g i; ja s iÞ a scalar 2 R. E is the larger, the more genes or samples are correlated to these axes, and the higher these correlations are. The selection of an initial representative with help of this functional (step 1) and maximizing this functional (step 2) guides SDCM to one of the largest and most consistent signatures in the current signal M kÀ1 . (Thus, in all versatility tests, the large signature #1 was always detected first, and narrow or weak signatures #5, #6, and #7 were detected last; see detection ranks in Supplementary Fig. 1.) The signature size, i.e., the number of participating genes or samples, is estimated by summing gene weights v Next, these separate scores for genes and samples need to be combined. As m k n k corresponds to the number of measurement values supporting the signature, a natural choice for the combined signature size would be their geometric average ffiffiffiffiffiffiffiffiffiffi ffi m k n k p . However, signatures affecting the same number m k n k of measured (gene, sample)-values in M 0 may be caused by noise or by artifacts with NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-019-12713-5 ARTICLE NATURE COMMUNICATIONS | (2019) 10:5417 | https://doi.org/10.1038/s41467-019-12713-5 | www.nature.com/naturecommunications different probabilities. For example, a narrow signature concerning 10,000 genes in two samples is more likely a measurement artefact than a signature representing an interaction that affects 100 genes in 200 samples. To detect and dissect broad and robust signatures first, we therefore optimized the usual geometric average by putting 90% weight on the order dimension with lower size. Hence, the combined signature size is defined as s k ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi min m k ; n k ð Þ 0:9 max m k ; n k ð Þ 0:1 q ð24Þ Putting 90% weight on min m k ; n k ð Þeffectively increases the signature functional for the broad signature of same area and decreases it for narrow stripelike signatures. The latter will still be detected, but in a later detection iteration.
A natural choice to combine the average correlations r g k for genes and r s k for samples is an arithmetic average that takes into account the number of measured values underlying the computation of these weighted correlations: To put emphasis on the most consistent signatures (that have average correlations of r k near one) and to detect them first, we modify this to r k = ffiffiffiffiffiffiffiffiffiffiffiffi 1 À r 2 k p (inspired by the t-statistic of a correlation; cf. significance of correlations in Supplementary Note 2). Hence, our final signature functional is defined by: Signature strengths. As preparation for bi-monotonic regression, genes and samples need to be sorted such that their measured signals in the signature focus are arranged as bimonotonic as possible with respect to detected signature axes. To achieve this, we sort them by their signature strengths. Whereas correlations only quantify the consistency of observed GE data with a signature axis, signature strengths are additionally proportional to the magnitude of observed signals in direction of a signature axis. In this way, signature strengths reflect gene activities in the underlying interaction (inducing a gene order) or, respectively, the activities of this interaction in samples (inducing a sample order). Signature strengths are also useful candidates for downstream analyses, e.g., they may serve as covariates in Cox survival models.
For the first regression iteration o = 1 (in step 3), the gene strengths vector ju g k;1 i 2 V g is defined by i.e. by weighted projections of each gene on the detected sample axis jb ŝ l i in the final sample focus jw ŝ l i, normalized by kw ŝ l k. This normalization guarantees values that can be directly interpreted in units of measurement values (i.e., typically as log 2 (ratios) for GE data). Likewise, the sample strength vector ju s k;1 i 2 V s for o = 1 is defined by For higher precision, we utilize regressed signature curves from the last iteration o À 1 as projection targets instead of signature axes for all iterations o > 1. Signature curves are defined as projections of the bi-monotonically regressed signal matrix e X. The associated gene curve is a vector-valued function jc g i of sample indices j defined by he g i jc g ðjÞi e Xði; jÞ. Analogously, the associated sample curve c s j i is a vector-valued function of gene indices i defined by he s j jc s ðiÞi e Xði; jÞ. For any iteration o > 1, we consider signature curves for the regression result from the last iteration o -1, i.e., for e X S ju g k;oÀ1 i;ju s k;oÀ1 i ð e M oÀ1;q Þ (see algorithm step 3). With these signature curves as targets, again weighted projections are utilized to determine refined signature strengths: Smoothening operator. Purpose of smoothening is that the signature signal becomes (nearly) the same for all genes and samples with (nearly) equal signature strengths. In particular, the signature signal for all genes with signature strengths near zero, i.e., all genes outside of the signature's focus, are averaged by smoothening. Other than regression, smoothening is a local operation that cannot guarantee monotonicity if applied to a non-monotonic signal. However, if applied to an already monotonic signal, this monotonicity is always preserved. Let I denote the gene order by components of gene strengths ju g i, J the sample order by components of sample strengths ju s i and e X X I; J ð Þ a sorted signal matrix. To compute the adaptive smoothening S ju g i;ju s i ð e XÞ, e X is first rescaled using signature strengths ju g i and ju s i. A resolution of m À 512 rows and n À 512 columns is sufficient for this rescaled space, as only monotonic changes need to be resolved. Basis vectors je gÀ i À i of the rescaled (and downscaled) gene space correspond for i À 2 ½1; m À to equidistant gene strengths u g Ið1Þ þ ðÀu g Ið1Þ þ u g IðmÞ Þ Á i À m À , where u g i abbreviates he g i ju g i. Likewise, basis vectors je sÀ j À i of the rescaled and downscaled sample space correspond for j À 2 1; n À ½ to equidistant sample strengths u s Jð1Þ þ ðÀu s Jð1Þ þ u s JðnÞ Þ Á j À n À . The rescaled and downscaled signal e X À 2 R m À n À is computed by arithmetic averaging of e X components in corresponding grid cells of signature strength. If no (gene, sample) pair has signature strengths in corresponding intervals, nearest-neighbor interpolation is employed.
A smoothening window of constant width in this rescaled space of equidistant signature strengths corresponds to an adaptive smoothening window in the original signal space R m n . Let G σ g ;σ s 2 R m À n À denote a Gaussian kernel centered at indices ð m À 2 ; n À 2 Þ. Default standard deviations σ g and σ s correspond to eight pixels in the downscaled space R m À n À , i.e., to a smoothening window size of 1 64 of the distance between minimum and maximum signature strength.
The smoothening is given by the convolution e X À Ã G σ g ;σ s . As the direct implementation of convolutions has a complexity that is quadratic in the number of points, we apply the convolution theorem 50 : This reduces the problem to the component-wise multiplication of two Fouriertransformed matrices and one inverse Fourier transform of the result. Fast Fourier transform implementations for F and F À1 (provided by fft2 and ifft2 functions in MATLAB®) are employed and have only log-linear complexity.
To finally obtain the smoothened signal for original gene and sample strengths, we look up values in this rescaled smoothened matrix by linear interpolation. Let I u g j i; u s j i denote this linear interpolation; then the final smoothened matrix is This signal smoothening procedure is the development result of maximizing robustness. It is robust against too strong smoothening that would otherwise introduce dissection artefacts into the signal when the signature signal changes abruptly. Additionally, it is robust against too weak smoothening which would cause a too fine-grained dissection, thereby dissecting and losing information about overlapping signatures. Both robustness goals were achieved by rescaling the signal to and smoothening it over a grid representing equidistant signature strengths. Hence, if signature strength decreases rapidly from one gene to its neighbor in signature strength order, there are many window sizes in-between (sharp signal edges are retained). Conversely, all member genes with approximately equal signature strength end up in the same smoothening window (signal roughness due to noise is completely averaged here). This adaptiveness makes smoothening robust and generic. For the same reason, our smoothening concept is largely independent of actual window width, within a certain range. The actually chosen grid size of 512*512 pixels is more than sufficient to resolve bi-monotonic patterns. As robustness was already taken care of, we optimized this resolution and the window width subsequently for 2D FFT performance (hence the powers of 2).
Bi-monotonic regression. To approximate interactions by our signature model as precisely as possible, we start from measured data in the signature focus ordered by signature strengths and regress it to the nearest bimonotonic representation. Residuals that do not fit bi-monotonicity are assumed to originate from overlapping distinct signatures that will be dissected in later iterations.
Bi-monotonic regression is based on weighted 1D isotonic regressions as implemented by the Generalized Pool Adjacent Violator (GPAV) algorithm 51 . The following loop until convergence (with index j) realizes the 2D bi-monotonic regression.
For q = 0, start with the presorted but not yet regressed signal matrix where W 2D 0 2 R m n is defined by the components of the outer product jw ĝ l i jw ŝ l i. Each iteration q of the convergence loop is structured as follows: • 1D regression of genes: Apply GPAV to every gene row in e M o;q and use corresponding rows of W e 2D q as regression weights. This results in monotonically regressed genes that are collected again as matrix e G qþ1 2 R m n . Each gene row in e G qþ1 is a step function that consists of blocks with constant regressed GE values, while expression values of neighboring blocks are either all monotonically increasing or all monotonically decreasing. Each block corresponds to a sample interval in the sample order J k;o . GPAV also updates sample weights (individually for each gene) by averaging input weights for each block; they are collected as rows of the matrix W e G qþ1 .
• 1D regression of samples: Likewise, regressions of sample columns in e M o;q are realized with GPAV, using corresponding columns of W e 2D q as regression weights. This results in a matrix of monotonic columns e S qþ1 2 R m n and in updated gene weights for each sample W e S qþ1 2 R m n . Algorithmic complexity. Algorithms having a complexity that grows like the volume of a vector space, i.e., exponentially in gene or sample dimensions, are typically of no practical use when analyzing high-dimensional datasets (curse of dimensionality). Here we show that SDCM's complexity is bounded by a low-order polynomial.
Our search strategy tests up to m + n initial representatives (genes or samples) to identify the one with maximal signature functional. Each candidate leads to a pair of signature axes (see the section Algorithm overview in Methods). Correlating the gene axis with all n samples has complexity O(nm), as each correlation is computed with linear complexity in the number of genes per sample, i.e., O(m). Correlating the sample axis with all m genes likewise results in O(nm). Taken together, the search strategy has complexity Oððm þ nÞ Á ðnm þ mnÞÞ ¼ Oððm þ nÞ Á nmÞ. As all other steps are essentially bounded by constants (e.g., by the unknown but finite size of the signature focus), they are irrelevant for complexity estimation. In addition to input size, complexity of input signals influences runtime. Due to the iterative structure of SDCM, this translates into a linear dependency on the number of detectable signaturesk. Overall, SDCM's complexity is To test this complexity analysis and to quantify SDCM performance in practice, we analyzed signals for gene spaces with up to m = 80,000 dimensions. Consistently, the empirical complexity showed runtime asymptotics that are quadratic in m (Supplementary Fig. 14). Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
Analyzed DLBCL GE datasets are publicly available in NCBI GEO 52 via accessions GSE31312 3 for the detection cohort and GSE10846 2 for the validation cohort. Data for the RNA-sequencing based add-on validation cohort are available from the European Genome-Phenome Archive (https://www.ega-archive.org) via accession EGAS00001002606 39 . Discovered DLBCL signatures and derived top gene sets are provided as Supplementary Data 1-3.