Abstract
Gene expression is controlled by many simultaneous interactions, frequently measured collectively in biology and medicine by highthroughput technologies. It is a highly challenging task to infer from these data the generating effects and cooperating genes. Here, we present an unsupervised hypothesisgenerating learning concept termed signal dissection by correlation maximization (SDCM) that dissects large highdimensional datasets into signatures. Each signature captures a particular signal pattern that was consistently observed for multiple genes and samples, likely caused by the same underlying interaction. A key difference to other methods is our flexible nonlinear signal superposition model, combined with a precise regression technique. Analyzing gene expression of diffuse large Bcell lymphoma, our method discovers previously unidentified signatures that reveal significant differences in patient survival. These signatures are more predictive than those from various methods used for comparison and robustly validate across technological platforms. This implies highly specific extraction of clinically relevant gene interactions.
Similar content being viewed by others
Introduction
The analysis of gene expression (GE) using microarrays or highthroughput RNAsequencing^{1} allows the determination of molecular interactions and GE programs in cancer cells^{2,3}. However, these technologies measure concurrent GE programs in all cells of a sample collectively. They cannot directly determine which genes cooperate in specific cellular functions. It remains a major challenge to discover, dissect and extract such interactions.
Essential characteristics of cancer samples^{4,5,6,7,8,9,10,11,12,13,14} are represented by GE signatures, i.e. by sets of cooperating genes. For diffuse large Bcell lymphoma (DLBCL)^{2}, the most frequent lymphoma type in adults^{15}, hierarchical clustering (HC) detected GE signatures that distinguish two molecular subtypes derived from different cells of origin (COO)^{12}. These are clinically relevant, as patients of the germinal center Bcelllike subtype (GCB DLBCL) have a 5year overall survival (OS) of approximately 80% compared to only 50% for patients diagnosed with the activated Bcelllike subtype (ABC DLBCL)^{2}. DLBCL is a molecularly heterogeneous disease^{2,16,17} and it remains a challenge to discover all diseasespecific GE programs and their influence on patient outcome. Even within ABC and GCB DLBCLs, heterogeneity prevails and significant GE differences remain within these subgroups. A significantly better understanding of this heterogeneity is required to ultimately be able to treat affected patients differently.
Detecting interactions in GE datasets for many genes and samples is a highly challenging underdetermined problem. Supervised methods such as gene set enrichment analysis (GSEA)^{18} use previously described information in their search for associations, while unsupervised methods aim to identify novel signatures in an unbiased manner based on GE data only. Generally, any gene subset showing significant coexpression for any subset of samples could represent a biological interaction. Detection methods therefore utilize search strategies. HC uses distance measures and linkage methods^{19} to identify interacting subsets. Since its first systematic application to GE datasets^{20}, HC has provided many important insights^{4,5,6,7,8,9,10,11,12,13,14,21,22,23}, such as tumorimmune cell interactions in breast cancer^{6}. Principal component analysis (PCA)^{24,25,26} obtains principal components (PCs) by searching for maximal variance. Often few PCs already capture most of the GE variability^{27,28,29}. Nonnegative matrix factorization (NNMF)^{30} minimizes a divergence functional^{31} and has also identified clinically relevant tumor subtypes^{32}. Biclustering methods such as FABIA^{33} or PLAID^{34} cluster genes and samples simultaneously and successfully rediscovered, e.g., breast cancer subclasses^{33}. Independent component analysis (ICA)^{35,36} maximizes statistical independence. Obtained independent components (ICs) may be interpreted as specific oncogenic pathways or regulatory modules^{37}.
All these methods have broad ranges of applicability, but various limitations, e.g., all require missing values to be imputed in advance. While HC can identify isolated signatures^{5,8,9}, it cannot dissect overlapping interactions^{10,13,14}. As genes influenced by multiple interactions may show higher variability, resulting PCs may represent unspecific GE mixtures that obstruct biological understanding^{25}. ICA tries to prevent such mixtures by maximizing a measure of nonGaussianity^{36}, but this does not work for normally distributed data^{38}. NNMF and most biclustering methods require that the unknown number of signatures is provided in advance. Additionally, NNMF is restricted to positive signals, i.e., it cannot model gene suppression.
Here, we present signal dissection by correlation maximization (SDCM) that overcomes these limitations and that significantly extends the class of detectable interaction signatures. With interaction we generically refer to any cause of correlations between arbitrary genes in arbitrary subsets of samples. Multiple interactions may affect the same genes and samples. An interaction signature aims to extract all traces of correlation in the signal that originate from one particular interaction. After introducing SDCM, we thoroughly validate our approach and systematically compare it to 17 other approaches of unsupervised learning. We apply our algorithm together with several comparison methods to real GE data from human DLBCL samples and rank all results based on their ability to reveal differences in patient survival.
Results
SDCM concepts
The original idea of SDCM is a unifying graphical model for biologically relevant GE data that is typically represented by a heatmap depicting gene and sample dimensions simultaneously (such as Fig. 3 in ref. ^{3}). Our model assumes that an individual GE program is associated with specific orders of genes and samples. In absence of any overlapping effects and measurement noise, we model GE data that was sorted by such specific gene and sample orders as a heatmap in which each participating gene row and each sample column is comprised of monotonic expression values only, i.e., all follow a common order. This consistency idea is the key for guiding data regression and dissecting overlapping GE programs that were measured simultaneously as a sum of all contributing intensities. SDCM dissects the complete GE dataset into such bimonotonic signatures for specific gene and sample orders. Despite empowering dissection, this nonlinear consistency model also significantly increased versatility and scope of detectable interaction signatures compared to linear methods.
To illustrate SDCM concepts, we first show a lowdimensional gene space and compare our approach with PCA. PCA searches for directions of maximal data variance. Quantification of data variance in any given direction ignores perpendicular distances of data points. In contrast, our search functional is maximal for directions, to which as many data points as possible are aligned as consistently as possible. To quantify this, we compute uncentered weighted correlations of data points with candidate directions (aka weighted cosine similarities). Weights decrease with angular distance to that direction. Geometrically, these weights enable SDCM to focus on data points in a doublecone around any given direction (with cone tips touching at origin, as illustrated in Fig. 1c–f). While a PC represents a data point distribution as a linear axis, SDCM extends this linear concept to nonlinear monotonic curves obtained by regression. Points with relatively low weights (outside of the cone) have lower influence on this regression than data points in the cone. In this way, SDCM extracts data structures locally, whereas PCA globally reduces data dimension by projection along the PC of maximal variance. This can result in a 1tomany relationship between GE program and PCs, unnecessarily obstructing biological insight. We illustrate this problem in our 3D concept example that contains four simulated GE programs for different partitions of samples. For this dataset, the first PC of maximal variance passes through the empty space between the three larger simulated GE programs (Fig. 1i). As PCs are orthogonal per construction by projection, there are only three PCs in total for this 3D gene space. Hence, the optimal 1:1 relationship with the four simulated programs can principally not be obtained with PCs. The same is true for any other method utilizing projections for dimension reduction, e.g., ICA.
This problem is aggravated in highdimensional data spaces, as more genes could play multiple roles in GE programs of different patient subsets. These subsets are then hard to represent by orthogonal PCs, similar to the 3D example. Furthermore, in highdimensional biological data, samples usually harbor multiple GE programs that are often overlapping (sharing genes). Hence, real GE datasets usually do not give rise to an unambiguous patient partitioning as simulated for the 3D example. Due to these complexities of real GE data, we shifted goal from dimension reduction, as pursued by PCA, to consistent 1:1 representation of initially overlapping interaction signatures. Similar to biclustering methods, SDCM works in full data space, treating gene and sample space on equal footing, and rather than reducing dimensions, the data signal itself is iteratively reduced to a noise cloud around zero (cf. Fig. 1g).
Because of these significant conceptual differences we had to mathematically develop SDCM from scratch based on a general superposition approach \({\mathbf{M}}_0 = \sum_{k} {\mathbf{E}}_{k} + {\mathbf{\upeta}}\). Here, the input signal, represented as a genes*samples matrix M_{0}, is dissected into a sum of signature matrices E_{k} of the same size, additively overlaid with a noise matrix η. While each signature E_{k} formally has the same size as the input, it explains nonzero signal parts only for typically small subsets of associated genes and samples. Noise is implicitly defined by adjustable significance thresholds for signal strengths (projections of data points onto a candidate direction) and for uncentered weighted correlations (angular distances of data points to a candidate direction). As long as data structures remain in the signal that are significant with respect to these conditions, SDCM continues to dissect signatures by iterating the following four steps:

(1)
search for an initial candidate direction using a signature functional,

(2)
optimize this direction by locally maximizing the signature functional,

(3)
regress a monotonic curve through data points in the weightsbased cone and

(4)
selectively dissect the signal parts that are consistent with this curve.
All these steps are illustrated for the 3D example (Fig. 1) and described in the section Conceptual introduction in Methods. Identical steps are performed in case of highdimensional signals, as illustrated in Fig. 2. We provide definitions of algorithmic steps and concepts in the Methods section. Supplementary Notes 1–7 detail subroutines. In particular we explain two key concepts, the signature functional (that guides the initial search of a linear axis and its optimization) and our bimonotonic consistency requirements (that guide signature regression onto a nonlinear monotonic curve and subsequent signal dissection).
Versatility test and method validation
To validate detection versatility for highdimensional data, we designed seven signatures \({\mathbf{E}}_k^{{\mathrm{sim}}}\) of different size and form that mimic realworld GE signatures (Supplementary Table 1). Each was simulated bimonotonically in a random gene and sample order (Fig. 3a). The test signal \({\mathbf{M}}_0^{{\mathrm{sim}}}\) was defined as superposition \(\mathop {\sum}\nolimits_{k = 1}^7 {{\mathbf{E}}_k^{{\mathrm{sim}}}}\) of all unordered signatures and normal noise (Fig. 3b). The challenge was to precisely detect and dissect all seven signatures and to avoid detection of false positive (FP) signatures. SDCM detected all seven signatures, reconstructed most of their signals (Fig. 3c) and stopped when the remaining signal (Fig. 3d) did no longer contain any further qualifying signature representatives. In particular, it did not return any FPs. To quantify reconstruction quality, simulated signature axes were correlated with detected ones (Fig. 3e and Supplementary Note 8). SDCM extracted all simulated signatures with correlations close to 1 (high sensitivity, red diagonals). Additionally, there was no strong correlation of detected signatures to other simulated signatures (high specificity, dark offdiagonals). These results are representative for 49 repetitions (Supplementary Fig. 1).
Method comparisons with simulated data
We applied HC, NNMF, FABIA, FABIAS, ICA, and PCA to the same versatility tests (Supplementary Figs. 2–6 and Supplementary Note 9). Only PCA performed comparably well. HC found only signature #1 and ICA, FABIA, FABIAS and NNMF found only a few. The latter three also mixed signals from multiple signatures (red offdiagonals). For the 3D test, neither PCA (Fig. 1i), nor HC or FABIAS (Supplementary Figs. 7 and 8) were able to recover simulated interactions (ICA is restricted to three components like PCA and FABIAS; NNMF is only applicable to positive values). To test the scope of our signature model, we applied SDCM to four external benchmarks that were previously defined to rank 13 biclustering methods^{33}. Each benchmark consists of 100 datasets, one was generated with a multiplicative data model, the other three with an additive data model for different signaltonoise ratios^{33}. SDCM scored best in all four benchmarks, followed by FABIAS (Supplementary Table 2). For an indepth comparison with PCA, we designed a more challenging test signal containing two additional instances of signatures #2, #3, and #4 from the above 7signature test (for different random subsets and orders of genes and samples). 49 simulations of this versatility test with 13 superposed signatures were analyzed with both SDCM and PCA. Detected gene axes (respectively PCs) were correlated with simulated gene axes (Fig. 4b and Supplementary Fig. 9). SDCM was significantly more sensitive than PCA for 10/13 signatures and significantly more specific for 12/13 signatures (Supplementary Table 3).
Number of signatures
In contrast to the other methods, SDCM can infer the typically unknown number of signatures (termination rules in Methods/Step 1 and Supplementary Note 4). Defining FPs as detected signatures with correlations <0.4 to all simulated signatures, 53% of the 49 13signature tests had zero FPs, 33% had one, 10% had two, 4% had three and none had ≥4 FPs (overall false discovery rate 0.048). Defining false negatives as simulated signatures with correlations <0.4 to all detected signatures, there was only 1/49 simulations with 1/13 signatures that was not detected (overall sensitivity 0.998; Supplementary Fig. 9).
Limits
To determine the minimal signal strength required for detection, we simulated single instances of signature #6 for decreasing maximum signal strengths relative to the noise level σ. SDCM detected this signature down to approximately 0.5σ. Below 0.5σ, SDCM often terminated without any detected signature (Supplementary Fig. 10). No FP signatures were returned. For correlations r ≥ 0.8 to the simulated gene axis (indicating good reproducibility), SDCM required approximately 0.75σ. Here, PCA performed better by requiring only 0.63σ. PCA always returns PCs. However, for weak strengths ≤0.4σ, all were FPs (r ≤ 0.2). Next, we asked how many interactions may overlap while still being dissectible. We simulated test signals with an increasing number of superposed signatures (plus noise). For series based on signatures #3 or #4, SDCM showed high detection rates. The series for the weak signature #6 illustrates superposition limits of our method. PCA performance broke down completely for all three superposition series (Fig. 4c and Supplementary Figs 11–13). Scaling limits of SDCM are determined by its analytic complexity \(O(\hat k(m + n)nm)\) (see Eq. 37). Consistently, measured computation times scaled quadratically in m (Supplementary Fig. 14). A typical runtime is 5 min per signature \((m = 20000,n = 100)\).
Missing values and their reconstruction
Various measurement problems or quality filters can lead to missing values. The other methods require either exclusion of affected genes or replacement values. As regression in SDCM is based on weights (see the section Signature focus \(\left( {\left {{\mathbf{w}}^{\mathrm{g}}} \right\rangle ,\left {{\mathbf{w}}^{\mathrm{s}}} \right\rangle } \right)\) in Methods), we support missing values in a natural way by assigning zero weights. SDCM additionally predicts specific contributions from missing values to each detected signature based on our signature model. For the 7signature versatility test, SDCM detected all signatures with high correlations up to missing value rates of 35% (Supplementary Fig. 15a). Even for 80% missing values (Fig. 4d), signatures #1–#4 were still detected with high correlations and the majority of their missing values were reconstructed (Supplementary Fig. 15b).
Application of SDCM to GE data from DLBCL samples
For signature discovery in real data, we selected the largest GE dataset of human DLBCL samples available at time of this analysis with 498 patients^{3} (Supplementary Note 9). SDCM dissected this dataset into 105 signatures (Supplementary Data 1 and 2). For direct usage of discovered signatures in a supervised analysis such as GSEA^{18}, we also provide derived sets of signature top genes (Supplementary Data 3). Real datasets typically contain cohortspecific laboratory effects. While SDCM dissects these from biological signals, resulting FP signatures have to be filtered out.
To verify signatures on GE level and test their robustness after changing the technological platform, we transferred them to both a microarraybased validation cohort of 233 independent DLBCL patient samples^{2} and to a RNAsequencing based validation cohort of 624 independent DLBCL patient samples^{39} (Supplementary Note 11). To this end, only gene axis and gene weights of each signature are transferred. Projections of target samples onto transferred axes then yield signaturespecific orderings of the target cohort. Hence, signatures are validated on GE level if the target cohort exhibits similar significant differences between samples as in the source cohort, resulting in similar correlations between genes (Supplementary Fig. 16 shows signature k = 6 as example). Additionally, strongly bimodal sample strengths in signature k = 6 induced binary clusters that were significantly associated with independent patient gender data (detection cohort: \(p = 2.8 \times 10^{  83}\), microarraybased validation cohort: \(p = 5.7 \times 10^{  50}\), RNAsequencing based validation cohort: \(p = 6.2 \times 10^{  98}\), \(\chi ^2\)tests). Albeit as diseaseunspecific as gender, this signature is a first biological proof of concept and it may also serve as a control signature for future dissections of human GE studies.
To filter signatures, they were associated with previously described diseasespecific gene signatures or clinical data. Interpretations of these associations for all discovered signatures are beyond the scope of this study. Here, we focus on those of key importance for method validation. In the study by Lenz et al.^{2}, three GE signatures were identified and combined into a survival predictor model. This model was able to identify patients with significant differences in survival. SDCM rediscovered all three signatures, as confirmed by high and significant gene set enrichment^{18} (Supplementary Figs. 17–19 and Supplementary Note 12). In particular, we refer to iteration k = 12 as rediscovered COO^{40} signature, as it showed the highest enrichment for known ABC and GCB DLBCL gene sets. This confirms the utility of SDCM in discovering cancer subclasses. Size, inner correlation and amount of explained signal for all 105 discovered signatures are depicted in Supplementary Fig. 20. Signatures detected in early iterations already explain most of the signal, while at later iterations, signatures are typically smaller (fewer genes in their Signature focus \(\left( {\left {{\mathbf{w}}^{\mathrm{g}}} \right\rangle ,\left {{\mathbf{w}}^{\mathrm{s}}} \right\rangle } \right)\), see Methods). As general statistics like signature size or the amount of explained signal cannot robustly indicate biological relevance, we systematically tested the association of every signature with patient survival data (see below).
Method comparison with real data
We also applied PCA, ICA, NNMF and FABIA/S to this real dataset (same 13 configurations as tested for simulated data). We compared detected gene axes (PCs, ICs, NNMF or FABIA/S loading vectors) with SDCM signatures via weighted correlations (Supplementary Note 8). Results are shown in Supplementary Figs. 21–24 and summarized in Supplementary Note 13. In brief, while the strong genderassociated signature and the large Stromal1 signature were rediscovered by several configurations, none of the comparison methods rediscovered the top survival signatures identified in this study with high correlation. Additionally, artefacts were revealed, such as splitting the signal of the genderassociated signature over multiple components. Overall, ICA results showed the highest similarity to SDCM.
Survival models and method comparison
To determine in an unbiased fashion which of the signatures are potentially responsible for patient survival differences, and which detection method extracted the most predictive signatures, we constructed alternative Cox proportional hazard models (CPHM)^{41}. We modeled the progressionfree survival (PFS) of patients with combinations of up to three signatures from any of the detection methods (Supplementary Note 14 and Supplementary Fig. 25). We tested every combination of two or three signatures discovered by SDCM (including rediscovered stromal signatures) or by any of the compared methods. The best 2signature model was comprised of SDCM signatures k = 11 and k = 27. The secondbest model only had a relative likelihood of 0.02 (Supplementary Table 4a). The best 3signature model was comprised of the same two signatures and the 54^{th} IC detected by ICA with the Gaussian contrast function^{35} (Table 4b). All three top survival signatures were present in the independent validation cohort (Supplementary Figs. 26–28). Due to better predictor generalization properties in this validation cohort, we selected the simpler 2signature model for further analysis (Supplementary Figs 29–30). Comprising survival signatures are presented in Fig. 5. Corresponding pvalues in the final Cox model were \(p_{{\mathrm{CPHM}},k = 27} = 2.1 \times 10^{  11}\) and \(p_{{\mathrm{CPHM}},k = 11} = 2.2 \times 10^{  9}\) (details including Cox β coefficients provided in Supplementary Note 15).
Signatures k = 27 and k = 11 represent overlapping opposite survival associations (fitted Cox coefficients have different signs). Signature k = 11 did not seem to be predictive when tested alone (q_{LRT} = 0.47, likelihood ratio test on top of the ageonly model, Bonferronicorrected for the SDCM signature family). Neither was it predictive on top of the rediscovered COO signature k = 12 (q_{LRT }= 0.30). However, prediction was highly synergistic when combined with signature k = 27 (q_{LRT} = 6.8 × 10^{−6}). KaplanMeier survival estimates for four molecular risk groups visualize our predictor (Fig. 6a). Between high molecular risk (patients with ≥175% of the average risk in the detection cohort) and low molecular risk, a difference in 5year PFS of 58% was detectable (\(p = 1.2 \times 10^{  11}\) via logrank test reasserts the Cox model; Supplementary Fig. 30c shows OS). In the smaller validation cohort, the difference in 5year PFS was still 53% for identical molecular risk groups (Fig. 6b, \(p = 2.3 \times 10^{  5}\); Supplementary Fig. 30d shows OS). To further confirm robustness after changing the technological platform, we applied our predictor to OS in the independent RNAsequencing validation cohort (unfortunately no PFS available). While expected cohorttocohort variations were observed, strong survival differences in 5year OS of 40% remained between identical molecular risk groups (\(p = 4.9 \times 10^{  7}\); Supplementary Fig. 30e).
Collectively, this indicates that identified signatures represent specific characteristics of DLBCL and not just of a single DLBCL cohort. In contrast, survival differences between previously described DLBCL subtypes did not generalize as well across these cohorts (Supplementary Fig. 31).
Furthermore, the identified 2signature model discriminated patient survival significantly better than the 3signature model of Lenz et al.^{2} (p_{LRT} = 1.3 × 10^{−8}, likelihood ratio test on top of previous predictor scores, Supplementary Note 16). Vice versa, the former 3signature predictor could not add significant predictive information on top of our 2signature model (\(p_{{\mathrm{LRT}}} = 0.07\)).
Applied separately to DLBCL subtypes (Supplementary Fig. 32), our predictor revealed a subgroup within GCB DLBCL that was associated with significantly adverse survival (\(p = 1.0 \times 10^{  9}\), logrank test, Fig. 6c). This subgroup was confirmed in both the smaller cohort (\(p = 4.6 \times 10^{  5}\), Fig. 6d) and for OS in the RNAsequencing validation cohort (p = 0.001 after crossing cohort, technology and survival type; OS results for all cohorts in Supplementary Fig. 33). Sample strengths in both survival signatures showed that ABC DLBCL samples cluster densely at lower strengths in both signatures. In contrast, GCB DLBCL cases expressed these signatures heterogeneously and were scattered from high to low risk (Supplementary Fig. 34). To quantify the information provided by our molecular predictor on top of known clinical risk factors, we grouped samples by their international prognostic index (IPI)^{42} that combines the clinical parameters performance status, age, stage, number of extranodal sites and serum LDH level. Interestingly, predicted survival differences remained significant in all clinical risk groups (Fig. 6e, f and Supplementary Fig. 35).
Biological associations
To elucidate biological associations of the two top survival signatures in an unbiased manner, we analyzed their gene ranks by discovered gene strengths for enrichment^{18} of 16.513 previously identified GE signatures^{40,43,44} (Supplementary Note 12). Interestingly, both correlated and anticorrelated genes of k = 11 were significantly enriched with signatures differentiating Burkitt lymphoma (BL) from DLBCL samples in a supervised analysis^{45} (HUMMEL_BURKITTS_LYMPHOMA_DN with enrichment score \({\mathrm{es}} =  0.903;\) HUMMEL_BURKITTS_LYMPHOMA_UP with \({\mathrm{es}} = 0.741\); each with \(p \le 0.001\) by permutation tests). In all DLBCL cohorts, correlated samples (sorted to the right in Fig. 5b) showed a BLlike GE pattern. As expected, both k = 27 and k = 11 are associated with germinal center signatures (e.g., Germinal_center_Bcell_DLBCL; \({\mathrm{es}} = 0.76\) for k = 27 and \({\mathrm{es}} = 0.55\) for k = 11; \(p \le 0.001\)). Likewise, both are associated with a signature determined in an analysis based on survival data^{2} (Germinal_center_B_cell_DLBCL_survival_predictor; \({\mathrm{es}} = 0.78\) for k = 27 and \({\mathrm{es}} = 0.63\) for k = 11; \(p \le 0.001\)). In the k = 11 signature, we also detected various previously identified target genes of the oncogenic nuclear factorkappa B (NFκB) signaling pathway such as BCL2A1, IRF4, TNFAIP3 or IL21R. This observation was confirmed in the unbiased GSEA, as anticorrelated genes of k = 11 (upper side in Fig. 5b) were significantly enriched with a previously identified NFκB signature (NFkB_Up_bothOCILy3andLy10; \({\mathrm{es}} = 0.68\); \(p \le 0.001\)). In contrast, k = 27 was not associated with NFκB signaling (\({\mathrm{es}} =  0.25\); enrichment results are shown in Supplementary Fig. 36).
Discussion
We provide an unsupervised learning concept for dissection of large datasets that are typically comprised of the collective signal from many unknown interactions and substantial measurement noise. As we have demonstrated, SDCM is able to discover and extract interaction signatures from such data with high sensitivity and specificity. In cancer transcriptomics, this superior performance will most likely lead to more precise dissections of entities that are molecularly heterogeneous, allowing a more specific identification of potential oncogenic programs. However, our method is generally applicable, as underlined by its top rank among 13 biclustering methods in four external benchmark datasets. Furthermore, SDCM reliably dissected highly overlapping interactions, determined the unknown number of signatures with low false discovery rates and reconstructed missing values. Finally, using outcome as independent indicator, two SDCM signatures for human DLBCL GE data showed the highest impact on patient survival of all signatures from all compared methods, suggesting high biological signature specificity. This supports the design of required comprehensive functional validation experiments to biologically identify driving interactions of these survival differences.
A central reason for high specificity and for the broad applicability of SDCM is our flexible signature model. While PCs and ICs approximate interactions only linearly, detected E_{k} are equivalent to nonlinear monotonic curves (Fig. 1c, h). They can precisely regress a much broader class of interaction signatures. Precision is crucial, as residuals missed during dissection would lead to FPs or artificially disturb not yet dissected signatures from overlapping interactions. Another reason is our signature functional \(\cal{E}\) that guides detection and optimization. It is the higher the more genes and samples take part in an interaction and the more consistent correlations this interaction causes across participating genes and samples. This optimizes signature specificity. In contrast, PCA maximizes the data variance explained by each PC, instead of correlations to it. Resulting PCs therefore represented mixtures of several interactions in superposition tests and in real data (cf. Supplementary Fig. 21b, c). An additional key advantage compared to projectionbased methods (including PCA and ICA) is that axes of different signatures are not constrained to be orthogonal. This enabled precise dissections even of partially correlated interactions (as illustrated in Fig. 1), whereas orthogonality forced them to be represented by several unspecific components (Fig. 1i). Similar artifacts were observed for NNMF and FABIA/S when dissecting real GE data (Supplementary Figs. 23 and 24). As we use a perturbativelike approach (first searching a linear signature axis and then regressing a bimonotonic signature curve), nonlinearities that deviate from linearity too strongly cannot be described by bimonotonic signatures. The same is true for all comparison methods. However, we did not find any biologically meaningful pattern of this type in the context of GE data.
While our primary goal was method validation against real data, dissection of DLBCL GE data into 105 signatures has provided important insights into the molecular heterogeneity of this diagnostic category. Already the genderassociated control signature and rediscovered stromal DLBCL signatures validate the utility of SDCM. For comparison, we additionally obtained signatures using PCA, ICA, NNMF and FABIA/S for the same data. To determine which of these signatures have the largest impact on patient survival, we constructed Cox proportional hazard models and ranked them using the Akaike information criterion. The best bivariate model was comprised of two SDCM signatures. It revealed significantly higher differences in patient survival than the previously described 3signature predictor by Lenz et al.^{2} that included two stromal signatures. Instead of utilizing limited survival data directly to find associated genes by supervised learning, which may cause a high false discovery rate, we dissected GE data in an unsupervised manner first. We also validated both signatures on GE level before using any survival data. Consequently, their strong survival associations (Fig. 6a, b) in both the detection and validation cohort are independent validations, indicating biological signature specificity and relevance. This was underlined on both GE and survival level in the RNAsequencing based validation cohort that only became available after method development, signature dissection and predictor construction.
While both signatures sorted samples roughly from ABC to GCB DLBCL over increasing sample strengths (Fig. 5), they were associated with overlapping antialigned survival trends and their combination predicted survival highly synergistically. As such antialigned survival trends can neutralize each other, they are hard to detect by supervised analyses. They revealed a 45% difference in 5year PFS for patients with DLBCL currently classified with a high clinical risk (Fig. 6f) and identified a subgroup within GCB DLBCL patients with more than 40% inferior 5year PFS (Fig. 6c, d). Further analyses revealed that this subgroup is characterized by a Burkitt lymphoma like GE profile and low expression of target genes of the oncogenic NF\(\kappa\)B pathway. This might potentially link to the just recently identified molecular highgrade Bcell lymphoma (MHG) subtype^{46} that shares these survival and BL GE characteristics. Collectively, this confirms that SDCM is able to discover novel and clinically relevant subtypes, without using prior knowledge of GE signatures or survival data. Remaining GCB DLBCLs with relative high expression of NFκB (left in Fig. 5b) consequently showed superior survival relative to average GCB DLBCL patients. This subgroup might be linked to a previously identified GCB DLBCL subtype with favorable outcome that is driven by BCL2 and active NFκB signaling^{47}. Discovered PFS differences were of similar magnitude in both microarraybased cohorts (Fig. 6b, d) and remained significant for OS and after crossing the technological border to RNAsequencing (Supplementary Fig. 33). This was not expected, as average survival differences based on previously classified COOrelated DLBCL subtypes and underlying signatures did not generalize as well (Supplementary Fig. 31).
Our results implicate that SDCM is able to dissect overlapping biological interactions by searching for signatures of maximal GE correlation. Top genes of identified survival signatures (Fig. 5) might provide a more robust and more specific representation of genetic interactions responsible for survival differences in DLBCL than previous COOrelated signatures. Signatures k = 11, 12, and 27 are all partially correlated, but their differences are biologically important. Similar to the 3D example (Fig. 1), partially correlated interactions do not need to be identical and should be dissected precisely.
Due to constraints such as linearity or orthogonality, previously described comparison methods have a narrower scope of detectable interactions. Hence, they often failed to dissect overlapping interaction signatures in the signal, potentially missing biologically important correlations. Our perturbativelike expansion towards monotonically nonlinear interactions provides a high versatility in dissecting collectively measured molecular interactions. This might be a substantial advance and could lead to discoveries of novel correlations in transcriptome data of different cancer types with possible translational impact on diagnosis and therapy. This is a central research goal in context of the precision medicine initiative^{48} that currently quantifies transcriptomes for many cancer entities. However, as the provided SDCM toolbox is based on generic mathematical concepts, namely correlation maximization, bimonotonic regression and iterative focused dissection, it may likewise be applied to dissect many other data sources, such as mutational frequencies based on DNAsequencing or even nonbiological data such as light spectra in astrophysics^{49} to infer star classes. Essentially, any data domain that was analyzed by PCA might also be suitable for dissection with SDCM, especially if PCs were hard to interpret in the respective system context.
Methods
Conceptual introduction
Here, we conceptually explain the four iterated main steps of SDCM, as introduced in the section SDCM concepts in Results.
As the number of all possible directions (GE programs) grows exponentially with data space dimension, the search strategy in step 1 concentrates on the subset of actually measured directions. Each sample column in the input matrix (each point in Fig. 1) represents one direction in the space spanned by genes. Likewise, each measured gene row defines one direction in sample space. By restricting the search to only these n + m measured directions we assume that some of these candidate directions pass through point clouds that SDCM is designed to detect: We are looking for GE programs that are characterized by a consistent intrinsic order of coexpression of participating genes in multiple samples. The intrinsic gene order defines a characteristic direction in gene space. Samples expressing this program at various intensities are then scattered along that common direction, giving rise to a specific sample order. Our search functional obtains higher values for such characteristic directions than for axes that only pass through their generating data point with no others nearby. Many deviations from purely linear patterns (i.e., from constant gene coexpression ratios for all samples) are conceivable for real GE programs, as long as their characteristic coexpression order remains consistent. This has led to the model of monotonic curves. Compared to linear point clouds at orthogonal directions that can also be represented 1:1 by PCs, our class of detectable interaction signatures along nonlinear monotonic curves is huge. For this class, our above assumption for search space restriction always holds, as one participating sample is always in the middle of its directional point cloud. In contrast, SDCM does not search for, e.g., circular point clouds or points scattered in higherdimensional manifolds of the gene space, as we do not expect such structures to exist in real GE datasets. Concentrating on the broad class of directional point clouds, our search strategy aims to select the candidate direction with the globally highest signature functional.
Such an initial direction generated by only a single data point is rarely an optimal representation of a GE program that is active in many samples. Therefore, starting from an initial direction, step 2 selects additional data points nearby to refine that direction by propagating it to the weighted mean of all selected data points (Fig. 1b). This continues iteratively, as long as the signature functional has not yet obtained its local maximum.
Once the linear direction has been optimized, step 3 regresses a curve through the data points in its vicinity. Regression weights of data points decrease with their angular distance, as illustrated by doublecones in Fig. 1c–f. Regression is guided by the key consistency constraint of our data model that after ordering gene rows and sample columns by signature strengths (given by projections on the axis from step 2) the resulting signature matrix is bimonotonic (for a detailed motivation for this requirement see the next subsection). Depicted as a heatmap after sorting by signaturespecific gene and sample strengths, each regressed signature matrix E_{k} thus displays with allmonotonic rows and columns (see Fig. 2d for a highdimensional regression result).
Real data patterns are only approximately bimonotonic (see, e.g., Fig. 2c). Observed differences to exact bimonotonicity might not be caused by noise alone but often contain valuable information about other overlapping GE programs. Therefore, SDCM only dissects those parts from the signal in step 4 that are exactly consistent with our bimonotonic model. The residuals remaining in the signal might then contribute to the detection of signatures dissected in later iterations.
In the following subsections, we introduce SDCM in a formal way. In particular, we detail key concepts such as the signature functional that guides the initial search of a linear axis and its optimization and our bimonotonic consistency requirements that guide signature regression onto a nonlinear monotonic curve and subsequent signal dissection.
Formal introduction
Here, we describe our signal and signature model again and provide an overview of the algorithmic structure, with links to detailed mathematical notes about each formal aspect.
SDCM analyzes datasets as twodimensional matrices M_{0} with n columns representing samples and m rows representing genes (or other features). We aim to extract interactions as signatures defined by their signal matrices E_{k} of same size as M_{0}. Components \({\mathbf{E}}_k\left( {i,j} \right) \in {\Bbb R}\) quantify the expression (or suppression) of each gene i in each sample j for one specific interaction. We assume that M_{0} is the result of a superposition of different E_{k}, i.e., \({\mathbf{M}}_0 = \mathop {\sum}\nolimits_{k = 1}^{\hat k} {{\mathbf{E}}_k} + {\mathbf{\eta }}\) where \(\hat k\) is the initially unknown number of signatures and η is a normal random matrix describing measurement noise. This allows genes playing multiple roles; overlapping roles may cause partial correlations between signatures that need to be dissected into individual E_{k}. Similar superposition hypotheses underlie PCA, ICA, and FABIA, as these methods are equivalent to matrix factorizations of M_{0}^{33,37}.
The dissection of M_{0} into individual E_{k} is an underdetermined problem. To overcome this, we first constrain the class of signals E_{k}. We assume that biological interactions are characterized by a hierarchy of expression of participating genes ranging from weak to strong. While expression strengths may also vary between samples, we assume that the hierarchy of genes is consistent in all samples influenced by the same interaction. Likewise, the hierarchy of samples must be consistent for all participating genes. Formally, each E_{k} is required to have a permuted gene order I_{k} and sample order J_{k} such that their resorted signal matrix \(\widetilde {\mathbf{E}}_k \equiv {\mathbf{E}}_k\left( {{\mathbf{I}}_k,{\mathbf{J}}_k} \right)\) is bimonotonic, i.e., monotonic over all genes for each sample and vice versa (see the section Model in Methods). Genes not participating in the respective interaction and unaffected samples have \({\mathbf{E}}_k\left( {i,j} \right) = 0\).
This requirement is based on the following assumption: The order of genes by their signature strengths must be consistent for each sample participating in the signature, as this gene order should represent the intrinsic structure of a single GE program that cannot change from patient to patient. Symmetrically, the order of samples by their signature strengths should be consistent for each gene participating in the signature, as this order quantifies the involvement of samples in the whole GE program that cannot change from gene to gene. Of course, a single gene of any given GE program typically cannot reliably sort samples by their involvement in the whole program due to noise or due to additional roles that gene may play in overlapping programs. However, the monotonic ties between many samples and the required repetition of that pattern in many genes and vice versa, as required by our bimonotonic consistency model, leads to statistically significant and reliable orderings of both samples and genes. Each signature discovered by SDCM is based on such bimonotonic ties, representing one specific underlying GE program (or another statistically significant structure in the data).
A basic example is a purely linear interaction. Here, participating genes are expressed at a fixed ratio relative to each other, inducing a gene order I and defining a line in gene space, such as a PC or IC (gene axis). The same interaction may differ in strength between samples, inducing a sample order J along that line (cf. red points in Fig. 1a). Hence, every linear interaction is automatically bimonotonic. Bimonotonic \(\widetilde {\mathbf{E}}_k\) generalize this linear concept and are geometrically equivalent to monotonic curves (Fig. 1h). They can represent all nonlinear monotonic interactions, such as saturations (blue in Fig. 1a) or superlinear interactions (green). Furthermore, this model covers conventional clusters representing, e.g., activation thresholds (magenta). Projections of samples on such a signature curve quantify their strengths in the respective signature (see the section Signature strengths \(\left( {\left {{\mathbf{u}}^{\mathrm{g}}} \right\rangle ,\left {{\mathbf{u}}^{\mathrm{s}}} \right\rangle } \right)\) in Methods).
In case of many gene dimensions, more complex monotonic deviations from purely linear interactions become possible. Introduced curves enable more precise approximations of such signatures, while purely linear models may lead to large residuals, effectively splitting the description of a single signature into several artificial parts. Generally, genes may interact indirectly in a complex combinatorial fashion. As GE datasets measure an ensemble of nonsynchronized cells at one time point only, they usually do not provide sufficient information to extract individual genegene interactions. However, many wellresearched gene signatures, e.g., hallmark gene sets of the Molecular Signature Database^{43}, show that genes realizing the same cellular function are effectively being coexpressed in the cell ensemble of affected samples. Hence, SDCM can detect signatures of such gene networks, even if participating genes are interacting only indirectly, provided the same effective coexpression is measured for several genes and samples.
Furthermore, we assume that larger signatures represent more robust interactions and that higher correlations are indicative for more specific (and less mixed) interactions. With the purpose of detecting and quantifying directions in the gene or sample space that maximize those characteristics, we defined a signature functional \(\cal{E}\) (see the section Signature functional in Methods). Its maximization in conjunction with the signature model suffices to infer signatures from M_{0} unambiguously.
SDCM is a deterministic algorithm that detects signatures iteratively. Each iteration k dissects one signature signal E_{k} from the current signal matrix \({\mathbf{M}}_{k  1}\) until termination, effectively dissecting M_{0} into all E_{k}. Each iteration consists of four major steps:

(1)
Search strategy: An initial representative (a sample or a gene) is selected that is associated with signature axes to which many other samples or genes have a high correlation (as quantified by \(\cal{E}\)). Figure 1 illustrates SDCM concepts in a 3D gene space. Figure 1a shows an initial sample and its associated gene axis. Generally, a sample or a gene may be selected as initial representative. Unlike other methods such as ICA, SDCM evaluates measured correlation structures simultaneously in both the gene and sample space to stabilize the search (see Methods/Step 1 for symmetrization details).

(2)
Correlation maximization: To make the linear signature estimation robust and independent of the initial representative, we iteratively add representatives that further maximize \(\cal{E}\) until convergence to a local maximum (Methods/Step 2). Correlationbased gene and sample weights smoothly quantify signature membership (see the section Signature focus \(\left( {\left {{\mathbf{w}}^{\mathrm{g}}} \right\rangle ,\left {{\mathbf{w}}^{\mathrm{s}}} \right\rangle } \right)\) in Methods). Optimized signature axes represent locally maximized correlation (between signature axes and members) and maximized signature size (Fig. 1b).

(3)
Bimonotonic regression: As true signatures are not necessarily linear, we employ our bimonotonic model to capture a much broader class of onedimensional manifolds. First, the signal matrix is reordered specific to gene and sample strengths in the detected signature (see section Signature strengths \(\left( {\left {{\mathbf{u}}^{\mathrm{g}}} \right\rangle ,\left {{\mathbf{u}}^{\mathrm{s}}} \right\rangle } \right)\) in Methods). To extract the bimonotonic signature signal, we apply regression to this sorted signal matrix (Methods/Step 3). Any residuals, i.e. parts of the signal that are not monotonic, are interpreted as belonging to overlapping foreign signatures (or as noise). The signature signal is equivalent to a monotonic curve over the corresponding signature axis (Fig. 1c).

(4)
Dissection: Finally, the detected signature signal is dissected from the overall signal matrix (Methods/Step 4). While methods based on matrix factorization or projection like PCA or ICA reduce space dimension with every PC or IC, SDCM always retains full space dimension. Rather than reducing the signal of all genes and samples by projecting the dataset along a linear axis, we utilize gene and sample weights to precisely dissect only those parts from the signal that are consistent with the detected signature (Fig. 1d). This enables dissection in cases of partially correlated signatures, i.e. when more signatures pass through the same (sub)space than this (sub)space has dimensions, e.g., four signatures in the 3D example.
These steps are repeated for every signature (Fig. 1d–g). After \(\hat k\) dissections, the remaining signal \({\mathbf{M}}_{\hat k}\) no longer contains any qualifying signature representative and SDCM terminates (Fig. 1g). Termination determines the number of signatures in the signal via significance thresholds for correlations and signal strengths. As the sum of all E_{k} and the noise residual restores the input signal matrix M_{0}, SDCM provides a complete dissection of the signal.
When dissecting highdimensional signals, identical steps are performed Figure 2 illustrates a detection and dissection iteration for many genes in coordinate view. (For the 3D example, this coordinate view would show a heatmap with only three rows for x, y and z genes and as many columns as there are data points in Fig. 1.) Generally, signatures may share the same (sub)space, as already illustrated by the 3D example. Additionally, they may overlap, i.e., they may affect measurement values for the same (gene, sample)combinations. As long as underlying correlations are sufficiently dissimilar (i.e., as long as the angle between corresponding linear signature axes is sufficiently nonzero), such overlapping signatures can still be dissected. Their regressed bimonotonic E_{k} then estimate the original nonoverlapped source signals (e.g., Fig. 2d).
In the remainder of this section, we provide a comprehensive mathematical description of SDCM. For practical applications, we provide a MALTAB® analysis toolbox including the full SDCM source code, unit tests and examples (see Code availability).
Framework
For the mathematical description of SDCM, we use the language of vector spaces in its established notation known from quantum mechanics.
The complete measured dataset of m genes (or, more generally, features) and n samples (e.g. tumor samples from patients) is represented as a single matrix \({\mathbf{M}}_0 \in {\Bbb R}^{m \times n}\), where \({\Bbb R}^{m \times n} \equiv \{ {( {{\mathbf{X}}_{ij}} )_{i = 1 \ldots m,j = 1 \ldots n}{\mathrm{}}{\mathbf{X}}_{ij} \in {\Bbb R}} \}\) is the signal space.
The gene space V^{g} is a vector space spanned by m basis genes \(\{ { {{\mathbf{e}}_i^{\mathrm{g}}} \rangle {\mathrm{}}i = 1 \ldots m} \}\). Every \( {{\mathbf{e}}_i^{\mathrm{g}}} \rangle\) has coordinates \(( {\delta _{\mu i}} )_{\mu = 1 \ldots m} \in {\Bbb R}^m\) in the gene reference order \({\mathbf{I}}_0 \equiv \left( {1, \ldots ,m} \right)\), where \(\delta _{\mu i}\) is the Kronecker delta, i.e. \(\delta _{\mu i} = 1\), if \(\mu = i\) and zero otherwise. We use the upper index ^{g} to indicate elements of this space. For each sample, all genes have been measured. Thus, samples are points in this vector space spanned by all basis genes. Sample j is represented by the vector \( {{\mathbf{s}}_j} \rangle = \mathop {\sum}\nolimits_{i = 1}^m {\langle {{\mathbf{e}}_i^{\mathrm{g}}{\mathrm{}}{\mathbf{s}}_j} \rangle }  {{\mathbf{e}}_i^{\mathrm{g}}} \rangle\) with expression values \(\langle {{\mathbf{e}}_i^{\mathrm{g}}{\mathrm{}}{\mathbf{s}}_j} \rangle \equiv {\mathbf{M}}_0( {i,j} )\) for gene indices \(i \in \left\{ {1, \ldots ,m} \right\}\).
Similarly, the sample space V^{s} is a vector space spanned by n basis samples \(\{ { {{\mathbf{e}}_j^{\mathrm{s}}} \rangle j = 1 \ldots n} \}\). Every \( {{\mathbf{e}}_j^{\mathrm{s}}} \rangle\) has coordinates \(( {\delta _{\nu j}} )_{\nu = 1 \ldots n} \in {\Bbb R}^n\) in the sample reference order \({\mathbf{J}}_0 \equiv \left( {1, \ldots ,n} \right)\). The upper index ^{s} indicates elements of this space. For each gene, all samples have been measured and hence genes are points in this vector space spanned by all basis samples. Gene i is represented by the vector \( {{\mathbf{g}}_i} \rangle = \mathop {\sum}\nolimits_{j = 1}^n {\langle {{\mathbf{e}}_j^{\mathrm{s}}{\mathrm{}}{\mathbf{g}}_i} \rangle }  {{\mathbf{e}}_j^{\mathrm{s}}} \rangle\) with expression values \(\langle {{\mathbf{e}}_j^{\mathrm{s}}{\mathrm{}}{\mathbf{g}}_i} \rangle \equiv {\mathbf{M}}_0( {i,j} )\) for sample indices \(j \in \{ {1, \ldots ,n} \}\).
Model
In general terms, we aim to detect interactions in subsets of genes that exist in subsets of samples. The size of subsets is initially unknown, the participation intensity in these interactions may vary from gene to gene and from sample to sample, and subsets of different interactions may be overlapping.
We assume that the signal is a linear superposition of \(\hat k\) signature signals E_{k} plus a noise term η:
This superposition model is ambiguous without further constraints. In order to maximize the range of detectable interactions, we decided to only weakly restrict the functional form of the E_{k}. We require that a gene order I_{k} and a sample order J_{k} exist, such that every row and every column of the reordered matrix \({\mathbf{E}}_k\left( {{\mathbf{I}}_k,{\mathbf{J}}_k} \right)\) is monotonic. This signature model can fit any interaction following a curve in the gene space (or sample space) that is monotonic over the signature axis in the same space, as explained in the main text. Formally, this bimonotonicity of all E_{k} is defined by:
where \({\mathrm{perm}}\left( {\mathbf{X}} \right) \equiv \left\{ {\pi \left( {\mathbf{X}} \right){\mathrm{}}\pi :{\mathbf{X}} \to {\mathbf{X}}\,{\mathrm{bijective}}} \right\}\) is the set of all permutations of a finite set X.
To determine signatures, we utilize a maximization principle as additional constraint. As detailed in step 2 below, we iteratively maximize the signature functional \(\cal{E}\) that quantifies both the signature size and correlations of genes and samples to signature axes based on our measure for interactions defined next.
Measure for interactions
To define signatures of interaction, we first have to quantify how the interactions we look for can be observed in the measured data.
We correlate genes or samples x〉 with a signature axis a〉 that linearly approximates a potential direction of interaction in the respective vector space by computing the uncentered weighted correlation defined by
where w〉 are contextdependent weights and dots denote componentwise products. \(\left[ {{\mathbf{x}}{\mathrm{}}{\mathbf{a}}} \right]_{\left {\mathbf{w}} \right\rangle }\) measures the consistency of regulation as it only depends on directions of x〉 and a〉.
In situations where signal strengths are important, we utilize the weighted projection of x〉 on a〉
The upper index ^{0} indicates normalization by \(\left\ {{\mathbf{w}}.{\mathbf{a}}} \right\\).
Algorithm overview
SDCM is performed iteratively. In every iteration k, a signature is detected and its contribution to the signal is regressed, yielding the signature signal E_{k}. The remaining signal
is the input for the next iteration. This continues until there are no more qualifying signature axes. Each detection and dissection iteration processes the following four steps.
Step 1 Initial representative or termination
We test if measured vectors in gene and sample space qualify as initial representative of some signature. To detect large and consistent signatures (having many highly correlated top genes and top samples) first, we search for maximal signature functional \(\cal{E}\) by the following loop.
For each gene \( {{\mathbf{g}}_i} \rangle\) and each sample \( {{\mathbf{s}}_j} \rangle\) of the signal matrix \({\mathbf{M}}_{k  1}\)

Compute the following initial signature vectors:

For gene \( {{\mathbf{g}}_i} \rangle\) choose \( {{\mathbf{a}}^{\mathrm{s}}} \rangle \equiv  {{\mathbf{g}}_i} \rangle\) as sample axis, for sample \( {{\mathbf{s}}_j} \rangle\) choose \( {{\mathbf{a}}^{\mathrm{g}}} \rangle \equiv  {{\mathbf{s}}_j} \rangle\) as gene axis.

Compute initial weights \( {{\mathbf{w}}_{{\mathrm{initial}}}^{\mathrm{s}}} \rangle\) respectively \( {{\mathbf{w}}_{{\mathrm{initial}}}^{\mathrm{g}}} \rangle\) based on the standardized signal (computed as described in Supplementary Note 1).

Complete the pair of signature axes \(\left( {\left {{\mathbf{a}}^{\mathrm{g}}} \right\rangle ,\left {{\mathbf{a}}^{\mathrm{s}}} \right\rangle } \right)\) via weighted projections:

For gene \(\left {{\mathbf{g}}_i} \right\rangle\), define the gene axis \(\left {{\mathbf{a}}^{\mathrm{g}}} \right\rangle\) by computing projections
$$\begin{array}{*{20}{c}} {\left\langle {{\mathbf{e}}_{i' }^{\mathrm{g}}{\mathrm{}}{\mathbf{a}}^{\mathrm{g}}} \right\rangle \equiv \left\langle {{\mathbf{g}}_{i'}{\mathrm{}}{\mathbf{g}}_i} \right\rangle _{\left {{\mathbf{w}}_{{\mathrm{initial}}}^{\mathrm{s}}} \right\rangle }^0/\left\ {{\mathbf{w}}_{{\mathrm{initial}}}^{\mathrm{s}}} \right\} \end{array}$$(6)of all other genes \(\left {{\mathbf{g}}_{i' }} \right\rangle\) onto \(\left {{\mathbf{g}}_i} \right\rangle\). Normalizing by \(\left\ {{\mathbf{w}}_{{\mathrm{initial}}}^{\mathrm{s}}} \right\\) yields values that can directly be interpreted in units of the input signal.

For sample \( {{\mathbf{s}}_j} \rangle\), define the sample axis \( {{\mathbf{a}}^{\mathrm{s}}} \rangle\) by computing projections
$${\big\langle {{\mathbf{e}}_{j' }^{\mathrm{s}}{\mathrm{}}{\mathbf{a}}^{\mathrm{s}}} \big\rangle \equiv \left\langle {{\mathbf{s}}_{j' }{\mathrm{}}{\mathbf{s}}_j} \right\rangle _{\left {{\mathbf{w}}_{{\mathrm{initial}}}^{\mathrm{g}}} \right\rangle }^0/\left\ {{\mathbf{w}}_{{\mathrm{initial}}}^{\mathrm{g}}} \right\}$$(7)of all other sample \( {{\mathbf{s}}_{j' }} \rangle\) onto \( {{\mathbf{s}}_j} \rangle\).

The completed axes pair \(\left( {\left {{\mathbf{a}}^{\mathrm{g}}} \right\rangle ,\left {{\mathbf{a}}^{\mathrm{s}}} \right\rangle } \right)\) points to a potential signature. This pair is the basis of all subsequent computations. In this way, a symmetric treatment of genes and samples is guaranteed.


To quantify the relationship of all genes and samples to axes \(( { {{\mathbf{a}}^{\mathrm{g}}} \rangle , {{\mathbf{a}}^{\mathrm{s}}} \rangle } )\), compute uncentered weighted correlations \(( { {{\mathbf{r}}^{\mathrm{g}}} \rangle , {{\mathbf{r}}^{\mathrm{s}}} \rangle } )\) to them via
$${\left\langle {{\mathbf{e}}_{i' }^{\mathrm{g}}{\mathrm{}}{\mathbf{r}}^{\mathrm{g}}} \right\rangle \equiv \left[ {{\mathbf{g}}_{i' }{\mathrm{}}{\mathbf{a}}^{\mathrm{s}}} \right]_{\left {{\mathbf{w}}_{{\mathrm{initial}}}^{\mathrm{s}}} \right\rangle }\ {\mathrm{and}}\ {\left\langle {{\mathbf{e}}_{j' }^{\mathrm{s}}{\mathrm{}}{\mathbf{r}}^{\mathrm{s}}} \right\rangle \equiv \left[ {{\mathbf{s}}_{j' }{\mathrm{}}{\mathbf{a}}^{\mathrm{g}}} \right]_{\left {{\mathbf{w}}_{{\mathrm{initial}}}^{\mathrm{g}}} \right\rangle }}}$$(8) 
Compute probabilities \(( { {{\mathbf{p}}^{\mathrm{g}}} \rangle , {{\mathbf{p}}^{\mathrm{s}}} \rangle } )\) that these correlations \(( { {{\mathbf{r}}^{\mathrm{g}}} \rangle , {{\mathbf{r}}^{\mathrm{s}}} \rangle } )\) have been caused by noise (see significance of correlations in Supplementary Note 2).


Refine signature vectors to make them more specific:

Define more specific weights \(( { {{\mathbf{w}}^{\mathrm{g}}} \rangle , {{\mathbf{w}}^{\mathrm{s}}} \rangle } )\) (and \(( { {{\mathbf{v}}^{\mathrm{g}}} \rangle , {{\mathbf{v}}^{\mathrm{s}}} \rangle } )\)) based on now available correlations \(( { {{\mathbf{r}}^{\mathrm{g}}} \rangle , {{\mathbf{r}}^{\mathrm{s}}} \rangle } )\) and their \(( { {{\mathbf{p}}^{\mathrm{g}}} \rangle , {{\mathbf{p}}^{\mathrm{s}}} \rangle } )\) values. We refer to these weights as the (extended) signature focus, as they are used to focus subsequent computations on those genes and samples that are most correlated to the detected underlying interaction (see the section Signature focus \(( { {{\mathbf{w}}^{\mathrm{g}}} \rangle , {{\mathbf{w}}^{\mathrm{s}}} \rangle } )\) in Methods section).

For gene \( {{\mathbf{g}}_i} \rangle\), refine its gene axis \( {{\mathbf{a}}^{\mathrm{g}}} \rangle\) by \(\langle {{\mathbf{e}}_{i' }^{\mathrm{g}}{\mathrm{}}{\mathbf{a}}^{\mathrm{g}}} \rangle \equiv \langle {{\mathbf{g}}_{i'}{\mathrm{}}{\mathbf{g}}_i} \rangle _{ {{\mathbf{w}}^{\mathrm{s}}} \rangle }^0/\ {{\mathbf{w}}^{\mathrm{s}}} \\). For sample \( {{\mathbf{s}}_j} \rangle\), refine its sample axis \( {{\mathbf{a}}^{\mathrm{s}}} \rangle\) by \(\langle {{\mathbf{e}}_{j' }^{\mathrm{s}}{\mathrm{}}{\mathbf{a}}^{\mathrm{s}}} \rangle \equiv \langle {{\mathbf{s}}_{j' }{\mathrm{}}{\mathbf{s}}_j} \rangle _{ {{\mathbf{w}}^{\mathrm{g}}} \rangle }^0/\ {{\mathbf{w}}^{\mathrm{g}}} \\).

Refine and update correlations \(( { {{\mathbf{r}}^{\mathrm{g}}} \rangle , {{\mathbf{r}}^{\mathrm{s}}} \rangle } )\) to signature axes: \(\langle {{\mathbf{e}}_{i' }^{\mathrm{g}}{\mathrm{}}{\mathbf{r}}^{\mathrm{g}}} \rangle \equiv [ {{\mathbf{g}}_{i' }{\mathrm{}}{\mathbf{a}}^{\mathrm{s}}} ]_{ {{\mathbf{w}}^{\mathrm{s}}} \rangle }\) and \(\langle {{\mathbf{e}}_{j' }^{\mathrm{s}}{\mathrm{}}{\mathbf{r}}^{\mathrm{s}}} \rangle \equiv [ {{\mathbf{s}}_{j' }{\mathrm{}}{\mathbf{a}}^{\mathrm{g}}} ]_{ {{\mathbf{w}}^{\mathrm{g}}} \rangle }\).

Update probabilities \(( { {{\mathbf{p}}^{\mathrm{g}}} \rangle , {{\mathbf{p}}^{\mathrm{s}}} \rangle } )\) that observed correlations may have been caused by noise.

To quantify the consistency and size of a potential signature along axes \(( { {{\mathbf{a}}^{\mathrm{g}}} \rangle , {{\mathbf{a}}^{\mathrm{s}}} \rangle } )\), compute the signature functional \(\cal{E}[ { {{\mathbf{a}}^{\mathrm{g}}} \rangle , {{\mathbf{a}}^{\mathrm{s}}} \rangle } ]\) (see the section Signature functional in Methods).

To determine if axes \(( { {{\mathbf{a}}^{\mathrm{g}}} \rangle , {{\mathbf{a}}^{\mathrm{s}}} \rangle } )\) qualify for further processing, several thresholds for signature size or significance for correlation and signal strength are tested (see Supplementary Notes 2 and 3 and default qualification thresholds in Supplementary Note 4). Default thresholds have been optimized to exclude false negative signatures as primary objective and minimize false positives as secondary objective.


If qualified axes pairs \(( { {{\mathbf{a}}^{\mathrm{g}}} \rangle , {{\mathbf{a}}^{\mathrm{s}}} \rangle } )\) are obtained by the loop, select the pair with the highest \(\cal{E}[ { {{\mathbf{a}}^{\mathrm{g}}} \rangle , {{\mathbf{a}}^{\mathrm{s}}} \rangle } ]\) and pass corresponding signature vectors on to step 2.
If no gene or sample in the current signal matrix \({\mathbf{M}}_{k  1}\) yields qualifying signature axes, SDCM terminates.
For performance optimization, we do not process every gene and sample for each iteration k. Instead we presort all genes and samples based on their uncentered standard deviations and apply a lookahead scheme (see presort order and lookahead scheme in Supplementary Note 5). This may slightly influence the order in which existing signatures are detected, but not whether they are detected. In any case, SDCM stays deterministic.
Step 2 Signature axes via maximization principle
To make identified signature axes independent of individual features of the initial representative, additional representative genes and/or samples are collected and combined until signature axes have converged (as in Fig. 1b).
Denote signature axes based on l representatives by \(( { {{\mathbf{b}}_l^{\mathrm{g}}} \rangle , {{\mathbf{b}}_l^{\mathrm{s}}} \rangle } )\). For l = 1, these vectors are defined by \( {{\mathbf{a}}^{\mathrm{g}}} \rangle\) and \( {{\mathbf{a}}^{\mathrm{s}}} \rangle\) from step 1. The same index l is used for w〉, v〉, r〉 and p〉.
While signature axes have not converged or are not based on sufficiently many representatives (see convergence criteria for step 2 in Supplementary Note 6), compute the following:

Sort all genes and/or samples in descending order by their absolute correlations to current axes \(( { {{\mathbf{b}}_l^{\mathrm{g}}} \rangle , {{\mathbf{b}}_l^{\mathrm{s}}} \rangle } )\), i.e. by \( {\langle {{\mathbf{e}}_i^{\mathrm{g}}{\mathrm{}}{\mathbf{r}}_l^{\mathrm{g}}} \rangle } \) and by \( {\langle {{\mathbf{e}}_j^{\mathrm{s}}{\mathrm{}}{\mathbf{r}}_l^{\mathrm{s}}} \rangle } \). Then, compute updated signature axes for the top candidates in this order as follows and finally select the one leading to maximal signature functional.

Test the gene or sample as representative l + 1 by computing accumulated signature axes \(( { {{\mathbf{b}}_{l + 1}^{\mathrm{g}}} \rangle , {{\mathbf{b}}_{l + 1}^{\mathrm{s}}} \rangle } )\) as follows:

For all representatives \(l' = 1 \ldots l + 1\), denote their individual signature axes as \(( { {{\mathbf{a}}_{l'}^{\mathrm{g}}} \rangle , {{\mathbf{a}}_{l'}^{\mathrm{s}}} \rangle } )\) (each computed as described in step 1).

The accumulated sample axis \( {{\mathbf{b}}_{l + 1}^{\mathrm{s}}} \rangle \in V^{\mathrm{s}}\) is defined as arithmetic average of all individual sample axes \( {{\mathbf{a}}_{l' }^{\mathrm{s}}} \rangle\), using correlations to the current sample axis \( {{\mathbf{b}}_l^{\mathrm{s}}} \rangle\) as weights:
$${\left {{\mathbf{b}}_{l + 1}^{\mathrm{s}}} \right\rangle \equiv \mathop {\sum }\limits_{l' = 1}^{l + 1} \left[ {{\mathbf{a}}_{l' }^{\mathrm{s}}{\mathrm{}}{\mathbf{b}}_l^{\mathrm{s}}} \right]_{\left {{\mathbf{w}}_l^{\mathrm{s}}} \right\rangle }\left {{\mathbf{a}}_{l'}^{\mathrm{s}}} \right\rangle \bigg/\mathop {\sum }\limits_{l' = 1}^{l + 1} \left {\left[ {{\mathbf{a}}_{l'}^{\mathrm{s}}{\mathrm{}}{\mathbf{b}}_l^{\mathrm{s}}} \right]_{\left {{\mathbf{w}}_l^{\mathrm{s}}} \right\rangle }} \right}$$(9) 
Analogously, the accumulated gene axis \( {{\mathbf{b}}_{l + 1}^{\mathrm{g}}} \rangle \in V^{\mathrm{g}}\) is defined as:
$${\left {{\mathbf{b}}_{l + 1}^{\mathrm{g}}} \right\rangle \equiv \mathop {\sum }\limits_{l' = 1}^{l + 1} \left[ {{\mathbf{a}}_{l' }^{\mathrm{g}}{\mathrm{}}{\mathbf{b}}_l^{\mathrm{g}}} \right]_{\left {{\mathbf{w}}_l^{\mathrm{g}}} \right\rangle }\left {{\mathbf{a}}_{l' }^{\mathrm{g}}} \right\rangle \bigg/\mathop {\sum }\limits_{l' = 1}^{l + 1} \left {\left[ {{\mathbf{a}}_{l' }^{\mathrm{g}}{\mathrm{}}{\mathbf{b}}_l^{\mathrm{g}}} \right]_{\left {{\mathbf{w}}_l^{\mathrm{g}}} \right\rangle }} \right}$$(10)


As in step 1, compute correlations \(( { {{\mathbf{r}}_{l + 1}^{\mathrm{g}}} \rangle , {{\mathbf{r}}_{l + 1}^{\mathrm{s}}} \rangle } )\) to the accumulated axes, i.e.
$${\left\langle {{\mathbf{e}}_{i' }^{\mathrm{g}}{\mathrm{}}{\mathbf{r}}_{l + 1}^{\mathrm{g}}} \right\rangle \equiv \left[ {{\mathbf{g}}_{i' }{\mathrm{}}{\mathbf{b}}_{l + 1}^{\mathrm{s}}} \right]_{\left {{\mathbf{w}}_l^{\mathrm{s}}} \right\rangle }\ {\mathrm{and}}\ \left\langle {\mathbf{e}}_{j'}^{\mathrm{s}}{\mathrm{}}{\mathbf{r}}_{l + 1}^{\mathrm{s}}\right\rangle \equiv \left[ {{\mathbf{s}}_{j' }{\mathrm{}}{\mathbf{b}}_{l + 1}^{\mathrm{g}}} \right]_{\left {{\mathbf{w}}_l^{\mathrm{g}}} \right\rangle }}$$(11) 
Compute their pvalues \(( { {{\mathbf{p}}_{l + 1}^{\mathrm{g}}} \rangle , {{\mathbf{p}}_{l + 1}^{\mathrm{s}}} \rangle } )\) as well.

Compute the signature focus \(( { {{\mathbf{w}}_{l + 1}^{\mathrm{g}}} \rangle , {{\mathbf{w}}_{l + 1}^{\mathrm{s}}} \rangle } )\) and the extended signature focus \(( { {{\mathbf{v}}_{l + 1}^{\mathrm{g}}} \rangle , {{\mathbf{v}}_{l + 1}^{\mathrm{s}}} \rangle } )\) for these correlations as in step 1.

Compute the signature functional \({\cal{E}}[ { {{\mathbf{b}}_{l + 1}^{\mathrm{g}}} \rangle , {{\mathbf{b}}_{l + 1}^{\mathrm{s}}} \rangle } ]\) and store it for the tested gene or sample.


Finally, select that gene or sample as representative l + 1 that leads to the highest signature functional.
Once convergence criteria have been met for a certain \(\hat l\), \(\left( {\left {{\mathbf{b}}_{\hat l}^{\mathrm{g}}} \right\rangle ,\left {{\mathbf{b}}_{\hat l}^{\mathrm{s}}} \right\rangle } \right)\) are the final signature axes. They linearly approximate contributions of the detected interaction to the signal, similar to PCs or ICs. In step 3, they serve as basis for a monotonic nonlinear (and hence more precise) approximation.
Step 3 Bimonotonic regression and smoothening
The contributions of a signature to the signal matrix \({\mathbf{M}}_{k  1}\) are regressed by one outer and one inner loop until convergence. The outer loop (with index o) is structured as follows:

Compute signature strengths \( {{\mathbf{u}}_{k,o}^{\mathrm{g}}} \rangle\) for genes and \( {{\mathbf{u}}_{k,o}^{\mathrm{s}}} \rangle\) for samples (e.g., overlaid curves in Fig. 2c; see the section Signature strengths \(\left( {\left {{\mathbf{u}}^{\mathrm{g}}} \right\rangle ,\left {{\mathbf{u}}^{\mathrm{s}}} \right\rangle } \right)\) in Methods).

Determine signature orders \(( {{\mathbf{I}}_{k,o},{\mathbf{J}}_{k,o}} )\) as follows. The gene order \({\mathbf{I}}_{k,o}\) sorts genes by \(\langle {{\mathbf{e}}_i^{\mathrm{g}}{\mathrm{}}{\mathbf{u}}_{k,o}^{\mathrm{g}}} \rangle\), i.e.,
$${\forall i \in \left[ {1,m  1} \right]:\left\langle {{\mathbf{e}}_{{\mathbf{I}}_{k,o}\left( {i + 1} \right)}^{\mathrm{g}}{\mathrm{}}{\mathbf{u}}_{k,o}^{\mathrm{g}}} \right\rangle \ge \left\langle {{\mathbf{e}}_{{\mathbf{I}}_{k,o}\left( i \right)}^{\mathrm{g}}{\mathrm{}}{\mathbf{u}}_{k,o}^{\mathrm{g}}} \right\rangle }$$(12) 
Likewise, the sample order \({\mathbf{J}}_{k,o}\) sorts samples by \(\langle {{\mathbf{e}}_j^{\mathrm{s}}{\mathrm{}}{\mathbf{u}}_{k,o}^{\mathrm{s}}} \rangle\), i.e.,
$${\forall j \in \left[ {1,n  1} \right]:\left\langle {{\mathbf{e}}_{{\mathbf{J}}_{k,o}\left( {j + 1} \right)}^{\mathrm{s}}{\mathrm{}}{\mathbf{u}}_{k,o}^{\mathrm{s}}} \right\rangle \ge \left\langle {{\mathbf{e}}_{{\mathbf{J}}_{k,o}\left( j \right)}^{\mathrm{s}}{\mathrm{}}{\mathbf{u}}_{k,o}^{\mathrm{s}}} \right\rangle }$$(13) 
Resort rows and columns of the current signal matrix \({\mathbf{M}}_{k  1}\) by these signature orders. This yields an already roughly bimonotonic signal \({\mathbf{M}}_{k  1}( {{\mathbf{I}}_{k,o},{\mathbf{J}}_{k,o}} )\) (e.g., the heatmap in Fig. 2c). However, this sorted signal matrix still needs to be regressed and smoothened to obtain exact bimonotonicity and finally to subtract solely the signature’s contributions to the signal. (Directly subtracting the sorted but otherwise unmodified \({\mathbf{M}}_{k  1}( {{\mathbf{I}}_{k,o},{\mathbf{J}}_{k,o}} )\) would lead to a zero signal, i.e., a loss of all information.)

Compute the bimonotonically regressed matrix \(\widetilde {\bf{\cal{M}}}_{o,\hat q}\) for the presorted signal matrix \({\mathbf{M}}_{k  1}( {{\mathbf{I}}_{k,o},{\mathbf{J}}_{k,o}} )\) by processing the inner convergence loop with index q, where \(\hat q\) denotes the inner iteration that reached convergence (see the section Bimonotonic regression in Methods).

Apply adaptive smoothening \({\cal{S}}_{ {{\mathbf{u}}_{k,o}^{\mathrm{g}}} \rangle , {{\mathbf{u}}_{k,o}^{\mathrm{s}}} \rangle }( {\widetilde {\bf{\cal{M}}}_{o,{\hat{q}}}} )\) by rescaling to the signature strengths space and via 2D Fourier transforms (see the section Smoothening operator in Methods). The smoothened matrix is still bimonotonic, as smoothening is a local operation that cannot change monotonicity.

Once convergence criteria are met for a certain \(\hat o\) (Supplementary Note 7), \( {{\mathbf{u}}_{k,\hat o}^{\mathrm{g}}} \rangle\) and \( {{\mathbf{u}}_{k,\hat o}^{\mathrm{s}}} \rangle\) are the final signature strengths and the result \({\cal{S}}_{ {{\mathbf{u}}_{k,\hat o}^{\mathrm{g}}} \rangle , {{\mathbf{u}}_{k,\hat o}^{\mathrm{s}}} \rangle }( {\widetilde{\bf{\cal{M}}}_{\hat{o},\hat{q}}} )\) is passed on to step 4.
Step 4 Signature signal E_{k} and its dissection
Dissection strengths \({\mathbf{D}}_k \in {\Bbb R}^{m \times n}\) for signature k are defined as square roots of the components of the outer weights product \( {{\mathbf{w}}_{\hat l}^{\mathrm{g}}} \rangle \otimes  {{\mathbf{w}}_{\hat l}^{\mathrm{s}}} \rangle\), i.e.,
\(\widetilde {\mathbf{D}} \equiv {\mathbf{D}}( {{\mathbf{I}}_{k,\hat o},{\mathbf{J}}_{k,\hat o}} )\) denotes these dissection strengths in the final signature order.
The signature signal is defined in signature order as componentwise product
(e.g., Fig. 2d; gray shadings depict \(\widetilde {\mathbf{D}}\) and fully grayedout pixels correspond to zero dissection strengths).
The signature signal in reference order is obtained by sorting back via \({\mathbf{E}}_k( {{\mathbf{I}}_{k,\hat o},{\mathbf{J}}_{k,\hat o}} ) \equiv {\tilde{\mathbf{E}}}_k\).
Dissection of signature k is now simply a matrix subtraction of its signature signal \({\mathbf{E}}_k\) from the current signal \({\mathbf{M}}_{k  1}\). The remaining signal
is the input for the next detection and dissection iteration \(k + 1\) (e.g., Fig. 2e or Fig. 1d).
Signature focus
Purpose of the signature focus is to define where a detected effect ends. While this may seem easy for plateaulike clusters (having few highly correlated genes or samples while all others only have low correlations), often real effects have no clear edge. In this case, we prefer a smooth decrease in membership weights. The signature focus should not influence ranks of top members, as they need to be determined by signature strengths for optimal bimonotonic regression (cf. Eqs. 29 and 30). But nonmembers should be identified and excluded by the signature focus to minimize the influence of noise.
The signature focus consists of correlationbased gene and sample weights that allow the computation of all vectors and scores as specific as possible, even if the detected interaction only affects a small subset of measured genes and samples. To retain specificity for such small signatures, we set weight components \(\langle {{\mathbf{e}}_i^{\mathrm{g}}{\mathrm{}}{\mathbf{w}}^{\mathrm{g}}} \rangle\) and \(\langle {{\mathbf{e}}_j^{\mathrm{s}}{\mathrm{}}{\mathbf{w}}^{\mathrm{s}}} \rangle\) exactly to zero for all (and potentially very many) genes and samples that have only weak or insignificant correlations to detected signature axes.
For samples \( {{\mathbf{e}}_j^{\mathrm{s}}} \rangle\), let
where the second factor decreases quadratically to zero with the noise probability of sample correlations (cf. Supplementary Note 2). Sample weights are defined relative to \(x_j^{\mathrm{s}}\). Values ≥50% of the maximum of all \(x_j^{\mathrm{s}}\) are already mapped to full weight:
Consequently, weights have no influence on the signature’s order of top samples, as is intended. To exclude any unspecific influence of samples with relatively weak or insignificant correlation, we finally set all weights to zero that are lower than two thirds of their quantile:
Analogously, gene weights \(\left {{\mathbf{w}}^{\mathrm{g}}} \right\rangle\) are given by
where \(y_i^{\mathrm{g}} \equiv {\mathrm{min}}\left( {1,x_i^{\mathrm{g}}/\left( {\frac{1}{2}\max _{i\prime }\left( {x_{i\prime }^{\mathrm{g}}} \right)} \right)} \right)\) and \(x_i^{\mathrm{g}} \equiv  {\langle {{\mathbf{e}}_i^{\mathrm{g}}{\mathrm{}}{\mathbf{r}}^{\mathrm{g}}} \rangle }  \cdot ( {1  \langle {{\mathbf{e}}_i^{\mathrm{g}}{\mathrm{}}{\mathbf{p}}^{\mathrm{g}}} \rangle } )^2\).
For signature size estimation and qualification thresholds, mapping all xvalues above 50% to full weight is not optimal. To keep the full dynamic range of weights for these tasks, we additionally define the extended signature focus
by increasing the upper threshold from 50% to 100% (i.e., \(y_i^{\mathrm{g}} \equiv x_i^{\mathrm{g}}\) and \(y_j^{\mathrm{s}} \equiv x_j^{\mathrm{s}}\)) and by decreasing the lower specificity threshold from two thirds to 0.4. Otherwise, weights \(\left( {\left {{\mathbf{v}}^{\mathrm{g}}} \right\rangle ,\left {{\mathbf{v}}^{\mathrm{s}}} \right\rangle } \right)\) and \(\left( {\left {{\mathbf{w}}^{\mathrm{g}}} \right\rangle ,\left {{\mathbf{w}}^{\mathrm{s}}} \right\rangle } \right)\) are computed in the same way.
Importantly, specificity thresholds of the signature focus exclude noise genes from computation that could otherwise reduce both specificity of signatures and detection robustness. In case of many measured genes and small signatures, this can also speed up computation.
Signature functional
To detect and unambiguously dissect a signature, first its optimal linear directions along which it extends in gene and sample space have to be determined. The goal of the signature functional is to score candidate directions during the initial search (Methods/Step 1) and during optimization (Methods/Step 2).
The signature functional \(\cal{E}[ { {{\mathbf{a}}^{\mathrm{g}}} \rangle , {{\mathbf{a}}^{\mathrm{s}}} \rangle } ]\) assigns to every possible axes pair \(( { {{\mathbf{a}}^{\mathrm{g}}} \rangle , {{\mathbf{a}}^{\mathrm{s}}} \rangle } )\) a scalar \(\in {\Bbb R}\). \(\cal{E}\) is the larger, the more genes or samples are correlated to these axes, and the higher these correlations are. The selection of an initial representative with help of this functional (step 1) and maximizing this functional (step 2) guides SDCM to one of the largest and most consistent signatures in the current signal \({\mathbf{M}}_{k  1}\). (Thus, in all versatility tests, the large signature #1 was always detected first, and narrow or weak signatures #5, #6, and #7 were detected last; see detection ranks in Supplementary Fig. 1.)
The signature size, i.e., the number of participating genes or samples, is estimated by summing gene weights \({\mathrm{v}}_i^{\mathrm{g}} \equiv \langle {{\mathbf{e}}_i^{\mathrm{g}}{\mathrm{}}{\mathbf{v}}^{\mathrm{g}}} \rangle\) respectively sample weights \({\mathrm{v}}_j^{\mathrm{s}} \equiv \langle {{\mathbf{e}}_j^{\mathrm{s}}{\mathrm{}}{\mathbf{v}}^{\mathrm{s}}} \rangle\) of the extended signature focus:
Average absolute correlations for all genes and all samples in the extended signature focus are computed as weighted means:
Next, these separate scores for genes and samples need to be combined.
As m_{k}n_{k} corresponds to the number of measurement values supporting the signature, a natural choice for the combined signature size would be their geometric average \(\sqrt {m_kn_k}\). However, signatures affecting the same number m_{k}n_{k} of measured (gene, sample)values in M_{0} may be caused by noise or by artifacts with different probabilities. For example, a narrow signature concerning 10,000 genes in two samples is more likely a measurement artefact than a signature representing an interaction that affects 100 genes in 200 samples. To detect and dissect broad and robust signatures first, we therefore optimized the usual geometric average by putting 90% weight on the order dimension with lower size. Hence, the combined signature size is defined as
Putting 90% weight on \(\min \left( {m_k,n_k} \right)\) effectively increases the signature functional for the broad signature of same area and decreases it for narrow stripelike signatures. The latter will still be detected, but in a later detection iteration.
A natural choice to combine the average correlations \(r_k^{\mathrm{g}}\) for genes and \(r_k^{\mathrm{s}}\) for samples is an arithmetic average that takes into account the number of measured values underlying the computation of these weighted correlations:
To put emphasis on the most consistent signatures (that have average correlations of r_{k} near one) and to detect them first, we modify this to \(r_k/\sqrt {1  r_k^2}\) (inspired by the tstatistic of a correlation; cf. significance of correlations in Supplementary Note 2).
Hence, our final signature functional is defined by:
Signature strengths
As preparation for bimonotonic regression, genes and samples need to be sorted such that their measured signals in the signature focus are arranged as bimonotonic as possible with respect to detected signature axes. To achieve this, we sort them by their signature strengths. Whereas correlations only quantify the consistency of observed GE data with a signature axis, signature strengths are additionally proportional to the magnitude of observed signals in direction of a signature axis. In this way, signature strengths reflect gene activities in the underlying interaction (inducing a gene order) or, respectively, the activities of this interaction in samples (inducing a sample order). Signature strengths are also useful candidates for downstream analyses, e.g., they may serve as covariates in Cox survival models.
For the first regression iteration o = 1 (in step 3), the gene strengths vector \( {{\mathbf{u}}_{k,1}^{\mathrm{g}}} \rangle \in V^{\mathrm{g}}\) is defined by
i.e. by weighted projections of each gene on the detected sample axis \( {{\mathbf{b}}_{\hat l}^{\mathrm{s}}} \rangle\) in the final sample focus \( {{\mathbf{w}}_{\hat{l}}^{\mathrm{s}}} \rangle\), normalized by \(\ {{\mathbf{w}}_{\hat l}^{\mathrm{s}}} \\). This normalization guarantees values that can be directly interpreted in units of measurement values (i.e., typically as log_{2}(ratios) for GE data). Likewise, the sample strength vector \( {{\mathbf{u}}_{k,1}^{\mathrm{s}}} \rangle \in V^{\mathrm{s}}\) for o = 1 is defined by
For higher precision, we utilize regressed signature curves from the last iteration \({\it{o}}\ \!\!\!\! \ 1\) as projection targets instead of signature axes for all iterations \({\it{o}}\ > \ 1 \). Signature curves are defined as projections of the bimonotonically regressed signal matrix \(\widetilde{\mathbf{X}}\). The associated gene curve is a vectorvalued function \( {{\mathbf{c}}^{\mathrm{g}}} \rangle\) of sample indices j defined by \(\langle {{\mathbf{e}}_{i}^{\mathrm{g}}{\mathrm{}}{\mathbf{c}}^{\mathrm{g}}( j )} \rangle \equiv {\widetilde{\mathbf{X}}}( {i,j} )\). Analogously, the associated sample curve \(\left {{\mathbf{c}}^{\mathrm{s}}} \right\rangle\) is a vectorvalued function of gene indices i defined by \(\langle {{\mathbf{e}}_{j}^{\mathrm{s}}{\mathrm{}}{\mathbf{c}}^{\mathrm{s}}( i )} \rangle \equiv {\widetilde{\mathbf{X}}}( {i,j} )\). For any iteration \({o} \ > \ {1}\), we consider signature curves for the regression result from the last iteration \({o} \ – \ {1}\), i.e., for \({\widetilde{\mathbf{X}}}\, {\equiv}\, {\cal{S}}_{ {\mathbf{u}}_{k,o1}^{\mathrm{g}} \rangle , {\mathbf{u}}_{k,o1}^{\mathrm{s}} \rangle }( {\widetilde{\mathbf{\cal{M}}}}_{o1,{\hat{q}}})\) (see algorithm step 3). With these signature curves as targets, again weighted projections are utilized to determine refined signature strengths:
Smoothening operator
Purpose of smoothening is that the signature signal becomes (nearly) the same for all genes and samples with (nearly) equal signature strengths. In particular, the signature signal for all genes with signature strengths near zero, i.e., all genes outside of the signature’s focus, are averaged by smoothening. Other than regression, smoothening is a local operation that cannot guarantee monotonicity if applied to a nonmonotonic signal. However, if applied to an already monotonic signal, this monotonicity is always preserved.
Let I denote the gene order by components of gene strengths \( {{\mathbf{u}}^{\mathrm{g}}} \rangle\), J the sample order by components of sample strengths \( {{\mathbf{u}}^{\mathrm{s}}} \rangle\) and \(\widetilde {\mathbf{X}} \equiv {\mathbf{X}}\left( {{\mathbf{I}},{\mathbf{J}}} \right)\) a sorted signal matrix. To compute the adaptive smoothening \({\cal{S}}_{ {{\mathbf{u}}^{\mathrm{g}}} \rangle , {{\mathbf{u}}^{\mathrm{s}}} \rangle }( {\widetilde{\mathbf{X}}} )\), \(\widetilde {\mathbf{X}}\) is first rescaled using signature strengths \( {{\mathbf{u}}^{\mathrm{g}}} \rangle\) and \( {{\mathbf{u}}^{\mathrm{s}}} \rangle\). A resolution of \(m^  \equiv 512\) rows and \(n^  \equiv 512\) columns is sufficient for this rescaled space, as only monotonic changes need to be resolved. Basis vectors \( {{\mathbf{e}}_{i^  }^{{\mathrm{g}}  }} \rangle\) of the rescaled (and downscaled) gene space correspond for \(i^  \in [ {1,m^  } ]\) to equidistant gene strengths \({\mathrm{u}}_{{\mathbf{I}}( 1 )}^{\mathrm{g}} + ( {  {\mathrm{u}}_{{\mathbf{I}}( 1 )}^{\mathrm{g}} + {\mathrm{u}}_{{\mathbf{I}}( m )}^{\mathrm{g}}} ) \cdot \frac{{i^  }}{{m^  }}\), where \({\mathrm{u}}_i^{\mathrm{g}}\) abbreviates \(\langle {{\mathbf{e}}_i^{\mathrm{g}}{\mathrm{}}{\mathbf{u}}^{\mathrm{g}}} \rangle\). Likewise, basis vectors \( {{\mathbf{e}}_{j^  }^{{\mathrm{s}}  }} \rangle\) of the rescaled and downscaled sample space correspond for \(j^  \in \left[ {1,n^  } \right]\) to equidistant sample strengths \({\mathrm{u}}_{{\mathbf{J}}( 1 )}^{\mathrm{s}} + ( {  {\mathrm{u}}_{{\mathbf{J}}( 1 )}^{\mathrm{s}} + {\mathrm{u}}_{{\mathbf{J}}( n )}^{\mathrm{s}}} ) \cdot \frac{{j^  }}{{n^  }}\). The rescaled and downscaled signal \({\widetilde{\mathbf{X}}}^  \in {\Bbb{R}}^{m^ \times n^ }\) is computed by arithmetic averaging of \(\widetilde {\mathbf{X}}\) components in corresponding grid cells of signature strength. If no (gene, sample) pair has signature strengths in corresponding intervals, nearestneighbor interpolation is employed.
A smoothening window of constant width in this rescaled space of equidistant signature strengths corresponds to an adaptive smoothening window in the original signal space \({\Bbb R}^{m \times n}\). Let \({\mathbf{G}}_{\sigma ^{\mathrm{g}},\sigma ^{\mathrm{s}}} \in {\Bbb R}^{m^ \times n^ }\) denote a Gaussian kernel centered at indices \(( {\frac{{m^  }}{2},\frac{{n^  }}{2}} )\). Default standard deviations σ^{g} and σ^{s} correspond to eight pixels in the downscaled space \({\Bbb R}^{m^ \times n^ }\), i.e., to a smoothening window size of \(\frac{1}{{64}}\) of the distance between minimum and maximum signature strength.
The smoothening is given by the convolution \(\widetilde{\mathbf{X}}^  \ast {\mathbf{G}}_{\sigma ^{\mathrm{g}},\sigma ^{\mathrm{s}}}\). As the direct implementation of convolutions has a complexity that is quadratic in the number of points, we apply the convolution theorem^{50}:
This reduces the problem to the componentwise multiplication of two Fouriertransformed matrices and one inverse Fourier transform of the result. Fast Fourier transform implementations for \(\cal{F}\) and \({\cal{F}}^{  1}\) (provided by fft2 and ifft2 functions in MATLAB®) are employed and have only loglinear complexity.
To finally obtain the smoothened signal for original gene and sample strengths, we look up values in this rescaled smoothened matrix by linear interpolation. Let \({\cal{I}}_{\left {{\mathbf{u}}^{\mathrm{g}}} \right\rangle ,\left {{\mathbf{u}}^{\mathrm{s}}} \right\rangle }\) denote this linear interpolation; then the final smoothened matrix is
This signal smoothening procedure is the development result of maximizing robustness. It is robust against too strong smoothening that would otherwise introduce dissection artefacts into the signal when the signature signal changes abruptly. Additionally, it is robust against too weak smoothening which would cause a too finegrained dissection, thereby dissecting and losing information about overlapping signatures. Both robustness goals were achieved by rescaling the signal to and smoothening it over a grid representing equidistant signature strengths. Hence, if signature strength decreases rapidly from one gene to its neighbor in signature strength order, there are many window sizes inbetween (sharp signal edges are retained). Conversely, all member genes with approximately equal signature strength end up in the same smoothening window (signal roughness due to noise is completely averaged here). This adaptiveness makes smoothening robust and generic. For the same reason, our smoothening concept is largely independent of actual window width, within a certain range. The actually chosen grid size of 512*512 pixels is more than sufficient to resolve bimonotonic patterns. As robustness was already taken care of, we optimized this resolution and the window width subsequently for 2D FFT performance (hence the powers of 2).
Bimonotonic regression
To approximate interactions by our signature model as precisely as possible, we start from measured data in the signature focus ordered by signature strengths and regress it to the nearest bimonotonic representation. Residuals that do not fit bimonotonicity are assumed to originate from overlapping distinct signatures that will be dissected in later iterations.
Bimonotonic regression is based on weighted 1D isotonic regressions as implemented by the Generalized Pool Adjacent Violator (GPAV) algorithm^{51}. The following loop until convergence (with index j) realizes the 2D bimonotonic regression.
For q = 0, start with the presorted but not yet regressed signal matrix
from the outer loop o of algorithm step 3. Regression weights are initialized with
where \({\mathbf{W}}_0^{2{\mathrm{D}}} \in {\Bbb R}^{m \times n}\) is defined by the components of the outer product \( {{\mathbf{w}}_{\hat l}^{\mathrm{g}}} \rangle \otimes  {{\mathbf{w}}_{\hat l}^{\mathrm{s}}} \rangle\).
Each iteration q of the convergence loop is structured as follows:

1D regression of genes: Apply GPAV to every gene row in \(\widetilde {\bf{\cal{M}}}_{o,q}\) and use corresponding rows of \({\mathbf{W}}_q^{\widetilde {2{\mathrm{D}}}}\) as regression weights. This results in monotonically regressed genes that are collected again as matrix \(\widetilde {\mathbf{G}}_{q+1} \in {\Bbb R}^{m \times n}\). Each gene row in \(\widetilde {\mathbf{G}}_{q + 1}\) is a step function that consists of blocks with constant regressed GE values, while expression values of neighboring blocks are either all monotonically increasing or all monotonically decreasing. Each block corresponds to a sample interval in the sample order \({\mathbf{J}}_{k,o}\). GPAV also updates sample weights (individually for each gene) by averaging input weights for each block; they are collected as rows of the matrix \({\mathbf{W}}_{q + 1}^{{\widetilde{\mathbf{G}}}}\).

1D regression of samples: Likewise, regressions of sample columns in \(\widetilde {\bf{\cal{M}}}_{o,q}\) are realized with GPAV, using corresponding columns of \({\mathbf{W}}_q^{\widetilde {2{\mathrm{D}}}}\) as regression weights. This results in a matrix of monotonic columns \(\widetilde {\mathbf{S}}_{q + 1} \in {\Bbb R}^{m \times n}\) and in updated gene weights for each sample \({\mathbf{W}}_{q + 1}^{\widetilde {\mathbf{S}}} \in {\Bbb R}^{m \times n}\).

Adaptively smoothen the signal, i.e. compute \({\cal{S}}_{ {{\mathbf{u}}_{k,o}^{\mathrm{g}}} \rangle , {{\mathbf{u}}_{k,o}^{\mathrm{s}}} \rangle }( {\widetilde {\bf{\cal{M}}}_{o,q}} )\) (see Eq. 32). This smoothening result will only be relevant below in Eq. (36) for genes and samples with weak or zero regression weights, i.e. for those outside of the signature focus.

For a smooth and quick convergence, we combine weights symmetrically by
$$\begin{array}{*{20}{c}} {{\mathbf{W}}_{q + 1}^{\widetilde {2{\mathrm{D}}}} \equiv \left( {n{\mathbf{W}}_q^{\widetilde{2{\mathrm{D}}}} + m{\mathbf{W}}_{q + 1}^{\widetilde {\mathbf{S}}}} \right)/\left( {2\left( {n + m} \right)} \right) + \left( {m{\mathbf{W}}_q^{\widetilde {2{\mathrm{D}}}} + n{\mathbf{W}}_{q + 1}^{\widetilde {\mathbf{G}}}} \right)/\left( {2\left( {n + m} \right)} \right)} \end{array}$$(35) 
The approximation of the bimonotonic signal by iteration o is computed as the weighted average of monotonic genes, monotonic samples and the smoothened signal for pixels with low weight:
$$\begin{array}{*{20}{c}} {\widetilde {\bf{\cal{M}}}_{o,q + 1} = \frac{{\left( {n{\mathbf{W}}_q^{\widetilde {2{\mathrm{D}}}} + m{\mathbf{W}}_{q + 1}^{\widetilde {\mathbf{S}}}} \right)}}{{2\left( {n + m} \right)}}.{\widetilde{\mathbf{G}}}_{q + 1} + \frac{{\left( {m{\mathbf{W}}_q^{\widetilde {2{\mathrm{D}}}} + n{\mathbf{W}}_{q + 1}^{\widetilde {\mathbf{G}}}} \right)}}{{2\left( {n + m} \right)}}.{\widetilde{\mathbf{S}}}_{q + 1} + \left( {1  {\mathbf{W}}_{q + 1}^{\widetilde {2{\mathrm{D}}}}} \right).{\cal{S}}_{\left {{\mathbf{u}}_{k,o}^{\mathrm{g}}} \right\rangle ,\left {{\mathbf{u}}_{k,o}^{\mathrm{s}}} \right\rangle }\left( {\widetilde {\bf{\cal{M}}}_{o,q}} \right)} \end{array}$$(36) 
Finally, we test for convergence. (In the default configuration, the pixel standard deviation of remaining differences \(\widetilde {\bf{\cal{M}}}_{o,q+1}  \widetilde {\bf{\cal{M}}}_{o,q}\) must be less than \(\frac{1}{{1000}}\) of the data noise level.) Once converged (iteration \(\hat q\)), return the bimonotonic \(\widetilde {\bf{\cal{M}}}_{o,\hat q}\) to algorithm step 3.
Algorithmic complexity
Algorithms having a complexity that grows like the volume of a vector space, i.e., exponentially in gene or sample dimensions, are typically of no practical use when analyzing highdimensional datasets (curse of dimensionality). Here we show that SDCM’s complexity is bounded by a loworder polynomial.
Our search strategy tests up to m + n initial representatives (genes or samples) to identify the one with maximal signature functional. Each candidate leads to a pair of signature axes (see the section Algorithm overview in Methods). Correlating the gene axis with all n samples has complexity O(nm), as each correlation is computed with linear complexity in the number of genes per sample, i.e., O(m). Correlating the sample axis with all m genes likewise results in O(nm). Taken together, the search strategy has complexity \(O((m + n) \cdot (nm + mn)) = O((m + n) \cdot nm)\). As all other steps are essentially bounded by constants (e.g., by the unknown but finite size of the signature focus), they are irrelevant for complexity estimation. In addition to input size, complexity of input signals influences runtime. Due to the iterative structure of SDCM, this translates into a linear dependency on the number of detectable signatures \(\hat k\). Overall, SDCM’s complexity is
To test this complexity analysis and to quantify SDCM performance in practice, we analyzed signals for gene spaces with up to m = 80,000 dimensions. Consistently, the empirical complexity showed runtime asymptotics that are quadratic in m (Supplementary Fig. 14).
Further methods
While not essential for core detection concepts, SDCM subroutines are detailed in Supplementary Notes 1–7 for completeness. Procedures for method comparison and validation including details about analyzed GE datasets are provided in Supplementary Notes 8–12. Finally, patient survival analyses are detailed in Supplementary Notes 14–16.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
Analyzed DLBCL GE datasets are publicly available in NCBI GEO^{52} via accessions GSE31312^{3} for the detection cohort and GSE10846^{2} for the validation cohort. Data for the RNAsequencing based addon validation cohort are available from the European GenomePhenome Archive (https://www.egaarchive.org) via accession EGAS00001002606^{39}. Discovered DLBCL signatures and derived top gene sets are provided as Supplementary Data 1–3.
Code availability
A MATLAB® analysis toolbox including the full SDCM source code is provided at https://github.com/GrauLab/SDCM (the first release commit is the paper version). The toolbox is ready to run and includes low and highdimensional examples for unit testing and for a quick start with your own datasets (simply open selfTest_lowDim.m or selfTest_highDim.m in your MATLAB® editor and press F5 to run it). It is free to use for any academic purpose. For commercial contexts, SDCM is free for all testing and development purposes (contact otherwise). We used MATLAB® (versions R2013bR2016b, The MathWorks® Inc., Natick, Massachusetts, United States) for development and all computations.
References
Mortazavi, A., Williams, Ba, McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNASeq. Nat. Methods 5, 621–628 (2008).
Lenz, G. et al. Stromal gene signatures in largeBcell lymphomas. N. Engl. J. Med. 359, 2313–2323 (2008).
Visco, C. et al. Comprehensive gene expression profiling and immunohistochemical studies support application of immunophenotypic algorithm for molecular subtype classification in diffuse large Bcell lymphoma. Leukemia 26, 2103–2113 (2012).
Hicks, C. et al. Analysis of patterns of gene expression variation within and between ethnic populations in pediatric BALL. Cancer Inform. 12, 155–173 (2013).
Lindgren, D. et al. Combined gene expression and genomic profiling define two intrinsic molecular subtypes of urothelial carcinoma and gene signatures for molecular grading and outcome. Cancer Res. 70, 3463–3472 (2010).
Nagalla, S. et al. Interactions between immunity, proliferation and molecular subtype in breast cancer prognosis. Genome Biol. 14, R34 (2013).
Beck, A. H. et al. Discovery of molecular subtypes in leiomyosarcoma through integrative molecular profiling. Oncogene 29, 845–854 (2010).
Pawitan, Y. et al. Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two populationbased cohorts. Breast Cancer Res. 7, R953–R964 (2005).
Holleman, A. et al. Geneexpression patterns in drugresistant acute lymphoblastic leukemia cells and response to treatment. N. Engl. J. Med. 351, 533–542 (2004).
Sorlie, T. et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl Acad. Sci. USA 100, 8418–8423 (2003).
Bignotti, E. et al. Trefoil factor 3: a novel serum marker identified by gene expression profiling in highgrade endometrial carcinomas. Br. J. Cancer 99, 768–773 (2008).
Alizadeh, A. A. et al. Distinct types of diffuse large Bcell lymphoma identified by gene expression profiling. Nature 403, 503–511 (2000).
Broyl, A. et al. Gene expression profiling for molecular classification of multiple myeloma in newly diagnosed patients. Blood 116, 2543–2553 (2010).
Lapointe, J. et al. Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc. Natl Acad. Sci. USA 101, 811–816 (2004).
Nogai, H., Dörken, B. & Lenz, G. Pathogenesis of nonHodgkin’s lymphoma. J. Clin. Oncol. 29, 1803–1811 (2011).
Lam, L. T. et al. Cooperative signaling through the signal transducer and activator of transcription 3 and nuclear factor{kappa}B pathways in subtypes of diffuse large Bcell lymphoma. Blood 111, 3701–3713 (2008).
Rosenwald, A. et al. The use of molecular profiling to predict survival after chemotherapy for diffuse largeBcell lymphoma. N. Engl. J. Med. 346, 1937–1947 (2002).
Subramanian, A. et al. Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
Murtagh, F. & Contreras, P. Algorithms for hierarchical clustering: an overview. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2, 86–97 (2012).
Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. Cluster analysis and display of genomewide expression patterns. Proc. Natl Acad. Sci. USA 95, 14863–14868 (1998).
Dawany, N. et al. Identification of a 251 gene expression signature that can accurately detect M. tuberculosis in patients with and without HIV coinfection. PLoS ONE 9, e89925 (2014).
Sridhar, S. et al. Smokinginduced gene expression changes in the bronchial airway are reflected in nasal and buccal epithelium. BMC Genomics 9, 259 (2008).
Spira, A. et al. Effects of cigarette smoke on the human airway epithelial cell transcriptome. Proc. Natl Acad. Sci. USA 101, 10143–10148 (2004).
Pearson, K. On lines and planes of closest fit to systems of points in space. Lond., Edinb., Dublin Philos. Mag. J. Sci. 2, 559–572 (1901).
Ma, S. & Dai, Y. Principal component analysis based Methods in bioinformatics studies. Brief. Bioinform. 12, 714–722 (2011).
Bro, R. & Smilde, A. K. Principal component analysis. Anal. Methods 6, 2812 (2014).
Ringnér, M. What is principal component analysis? Nat. Biotechnol. 26, 303–304 (2008).
Alter, O., Brown, P. O. & Botstein, D. Singular value decomposition for genomewide expression data processing and modeling. Proc. Natl Acad. Sci. USA 97, 10101–10106 (2000).
Freeman, J. et al. Mapping brain activity at scale with cluster computing. Nat. Methods 11, 941–950 (2014).
Lee, D. D. & Seung, H. S. Learning the parts of objects by nonnegative matrix factorization. Nature 401, 788–791 (1999).
Brunet, J. P., Golub, T. R., Tamayo, P. & Mesirov, J. P. Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl Acad. Sci. USA 101, 4164–4169 (2004).
Frigyesi, A. & Höglund, M. Nonnegative matrix factorization for the analysis of complex gene expression data: identification of clinically relevant tumor subtypes. Cancer Inform. 6, 275–292 (2008).
Hochreiter, S. et al. FABIA: factor analysis for bicluster acquisition. Bioinformatics 26, 1520–1527 (2010).
Lazzeroni, L. & Owen, A. Plaid Models for Gene Expression Data. Stat. Sin. 12, 61–86 (2002).
Hyvärinen, A. Fast and robust fixedpoint algorithms for independent component analysis. Ieee. Trans. Neural Netw. 10, 626–634 (1999).
Kong, W., Vanderburg, C. R., Gunshin, H., Rogers, J. T. & Huang, X. A review of independent component analysis application to microarray gene expression data. Biotechniques 45, 501–520 (2008).
Teschendorff, A. E., Journée, M., Absil, Pa, Sepulchre, R. & Caldas, C. Elucidating the altered transcriptional programs in breast cancer using independent component analysis. PLoS Comput. Biol. 3, e161 (2007).
Hyvärinen, A. & Oja, E. Independent component analysis: algorithms and applications. Neural Netw. 13, 411–430 (2000).
Reddy, A. et al. Genetic and functional drivers of diffuse large B cell lymphoma. Cell 171, 481–494.e15 (2017).
Shaffer, A. L. et al. A library of gene expression signatures to illuminate normal and pathological lymphoid biology. Immunol. Rev. 210, 67–85 (2006).
Cox, D. R. Regression models and life tables. J. R. Stat. Soc. Ser. B 34, 187–220 (1972).
Shipp, M. A. et al. A predictive model for aggressive nonhodgkin’s lymphoma. N. Engl. J. Med. 329, 987–994 (1993).
Liberzon, A. et al. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Syst. 1, 417–425 (2015).
Culhane, A. C. et al. GeneSigDB: a manually curated database and resource for analysis of gene expression signatures. Nucleic Acids Res. 40, D1060–D1066 (2012).
Hummel, M. et al. A biologic definition of Burkitt’s lymphoma from transcriptional and genomic profiling. N. Engl. J. Med. 354, 2419–2430 (2006).
Sha, C. et al. Molecular highgrade BCell lymphoma: defining a poorrisk group that requires different approaches to therapy. J. Clin. Oncol. JCO.18.01314 https://doi.org/10.1200/JCO.18.01314 (2018).
Jing, H. et al. Opposing roles of NFκB in anticancer treatment outcome unveiled by crossspecies investigations. Genes Dev. 25, 2137–2146 (2011).
Collins, F. S. & Varmus, H. A new initiative on precision medicine. N. Engl. J. Med. 372, 793–795 (2015).
Alam, S. et al. The Eleventh and Twelfth Data Releases of the Sloan Digital Sky Survey: Final Data from SDSSIII. Astrophys. J. Suppl. Ser. 219, 12 (2015).
Bronstein, I. N. et al. TeubnerTaschenbuch der Mathematik. TeubnerTaschenbuch der Mathematik 2, (B. G. Teubner Verlag, 2003).
Burdakov, O., Sysoev, O., Grimvall, A. & Hussian, M. An O(n^{2}) Algorithm for Isotonic Regression. LargeScale Nonlinear Optim. SE  3 83, 25–33 (2006).
Edgar, R., Domrachev, M. & Lash, A. E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210 (2002).
Acknowledgements
We thank Eyke Hüllermeier for helpful discussions and comments on the manuscript. This work was supported by the Marburg Center for Synthetic Microbiology (SYNMIKRO), by a Doctoral scholarship to M.G. from the Philipps University of Marburg as well as by research grants to G.L. from the German Research Foundation (DFG), the Deutsche Krebshilfe, the Swiss National Science Foundation, the Else KrönerFreseniusStiftung, and the DFG CellsinMotion Cluster of Excellence (EXC 1003).
Author information
Authors and Affiliations
Contributions
P.L. coordinated research. M.G. conceived and implemented SDCM. All authors contributed to study and validation design. M.G. performed computations. All authors analyzed signatures. G.L. interpreted DLBCL biology. P.L. and M.G. formulated the algorithm mathematically. All authors wrote the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks AikChoon Tan, Wayne Xu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Grau, M., Lenz, G. & Lenz, P. Dissection of gene expression datasets into clinically relevant interaction signatures via highdimensional correlation maximization. Nat Commun 10, 5417 (2019). https://doi.org/10.1038/s41467019127135
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467019127135
This article is cited by

Identifying plastics with photoluminescence spectroscopy and machine learning
Scientific Reports (2022)

Molecular and functional profiling identifies therapeutically targetable vulnerabilities in plasmablastic lymphoma
Nature Communications (2021)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.