Abstract
Advanced measurement and data storage technologies have enabled highdimensional profiling of complex biological systems. For this, modern multiomics studies regularly produce datasets with hundreds of thousands of measurements per sample, enabling a new era of precision medicine. Correlation analysis is an important first step to gain deeper insights into the coordination and underlying processes of such complex systems. However, the construction of large correlation networks in modern highdimensional datasets remains a major computational challenge owing to rapidly growing runtime and memory requirements. Here we address this challenge by introducing CorALS (Correlation Analysis of Largescale (biological) Systems), an opensource framework for the construction and analysis of largescale parametric as well as nonparametric correlation networks for highdimensional biological data. It features offtheshelf algorithms suitable for both personal and highperformance computers, enabling workflows and downstream analysis approaches. We illustrate the broad scope and potential of CorALS by exploring perspectives on complex biological processes in largescale multiomics and singlecell studies.
Main
The advancement of modern technologies, including singlecell^{1} and multiomics approaches^{2}, wearable devices^{3}, and integrated electronic health records^{4,5}, have enabled an exciting era of precision medicine. These technologies regularly produce datasets with hundreds of thousands of variables (here referred to as features), allowing for unprecedented profiling of complex biological processes such as diseases, pregnancy or healing^{2,6,7}. Correlation analysis is typically the first step to gain insights into a complex system (56% of all papers on the preprint server bioRxiv contain the word ‘correlation’). While erroneous data may skew correlation analyses^{8} and correlation does not imply causation, correlation analysis can guide coordinated subprocesses in complex systems for further investigation. Consequently, a broad range of algorithms have been developed for analyzing largescale correlation networks from the perspective of topology, connectivity patterns or community structures^{9,10}. In addition, extensive gene graphs and celltocell relations derived from largescale correlation networks are integrated in modern deep learning and graph neural network applications^{11,12}.
Despite diverse applications, the construction of correlation networks for large datasets remains a major computational challenge (for example, for only n = 1, 000 features, at least 499,000 pairs need to be examined). As such, computation time and memory requirements for constructing correlation networks grow rapidly and quickly exceed computational resources as the dimensionality of the datasets increases. Current approaches for constructing correlation networks (Supplementary Section 5) either rely on specialized parallel processing and highperformancecomputing frameworks (for example, graphics processing units, MapReduce and so)^{13,14,15,16,17}, focus on specialized correlation measures^{18,19} or address only limited aspects of time and memory requirements^{20,21,22,23}.
Thus, here we introduce an opensource framework for highdimensional correlation analysis called CorALS (Correlation Analysis of Largescale (biological) Systems). CorALS enables efficient correlation computation, as well as topk and differential correlation network approximations without requiring specialized software or hardware. We provide CorALS as an easytouse and extensible Python package. The corresponding routines are notably faster and require substantially less memory than commonly employed methods, allowing for complex analyses even on regular laptop computers. To achieve this, CorALS combines specialized vector projections, modern optimized linear algebra routines, spatial partitioning techniques and big data programming models (Fig. 1). CorALS supports Pearson correlation, the nonparametric Spearman correlation and the Phi coefficient for binary variables. We demonstrate the broad scope and potential of CorALS by analyzing several highdimensional datasets, and present complementary perspectives on largescale multiomics and singlecell datasets, revealing concerted coordination of biological systems during pregnancy. Overall, CorALS will allow practitioners to integrate largescale correlation network analysis into rapid turnaround workflows that have previously been inaccessible owing to time and resource limitations, and enable the development of downstream applications for deeper insights into complex systems.
Results
CorALS enables efficient, largescale correlation analysis for highdimensional data by employing a combination of versatile vector projections and big data computation models. In particular, features (Fig. 1a, left) are projected onto correlation vectors (Fig. 1a, middle) or embedded into a differential space (Fig. 1a, right). On the basis of these feature projections, CorALS exploits the tight relation of the scalar product and Euclidean distance in the corresponding vector space to derive efficient indexing schemes (Fig. 1b, left). These index structures are then embedded into a computational pipeline that splits correlation analysis tasks into batches (Fig. 1b, middle), which yield intermediate results based on a specifically designed approximation scheme. This scheme ensures that aggregating intermediate topk results yield accurate approximations of the global correlation network (Fig. 1b, right). Importantly, the batched approach allows for effective memory management and inherent parallelization. Further technical details can be found in ‘Efficient approximation of correlation networks’ in Methods. On the basis of this framework, CorALS enables a wide variety of efficient analytical components (Fig. 1c). The following sections introduce these components and illustrates their runtime and memory advantages. For this, we employ several realworld highdimensional datasets, in the context of pregnancyrelated disease (preeclampsia), healthy pregnancy and cancer^{1,6,24,25,26}. See Table 1 for basic dataset statistics and ‘Datasets’ in Methods for more information. For these highdimensional datasets, the goal is to investigate the correlations between large amounts of features (n) often based on a comparably small number of samples (m).
Efficient computation of correlation matrices
Full correlation matrix computation is the task of calculating the correlations between all feature pairs in a highdimensional dataset. CorALS employs correlation projections (Fig. 1b) in combination with modern linear algebra routines for lowrank matrix multiplication to calculate complete correlation networks (Fig. 1c, left). The runtime and memory results for this task are shown in Supplementary Data 1 including a comparison with existing software libraries in R. This includes packages such as WGCNA(Weighted Correlation Network Analysis)^{27}, Rfast (A collection of Efficient and Extremely Fast R Functions)^{28}, coop (CoOperation: Fast Covariance, Correlation, and Cosine Similarity Operations)^{29} and HiClimR (Hierarchical Climate Regionalization)^{30} with different advantages and disadvantages. For example, WGCNA can handle small amounts of missing values, and HiClimR tries to save memory by calculating only the upper half of the correlation matrix. These methods either use efficient custom C implementations, which are often based on (multithreaded) nested forloops (for example, Rfast^{28}) or use efficient BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage) matrix multiplication routines with projected vectors similar to CorALS (for example, coop^{29}). In addition, we also investigated a selected set of methods from the category of highperformance computing, parallel and distributed frameworks including Deep Graph, Dask and Spark. For a more detailed discussion on these and further alternatives, we refer to Supplementary Section 3. An analysis for varying numbers of features and samples is given in Supplementary Fig. 2. Consistently, CorALS outperforms other approaches and implementations, particularly, the baseline implementations in Julia (statistics.cor), Python (numpy.corrcoef) and R (stats::cor) (Supplementary Data 1). The results for CorALS are based on a Python implementation (single core). Furthermore, the matrix multiplication routines used by CorALS can take advantage of multiple central processing units (here with 64 cores), considerably reducing runtimes (multicore). Note that some baseline implementations (such as Python) also support parallelization. However, their baseline singlecore version is already slower than CorALS and thus we skip these experiments, and illustrate only the scaling capabilities of CorALS with increasing computational resources. Other methods from the category of highperformance computing, parallel and distributed frameworks performed slower, did not return results or were not easily available (Supplementary Section 3). Memory requirement differences across all methods are negligible. The cancer and singlecell datasets illustrate how memory requirements can easily exceed the resources of even specialized highperformance computing hardware, when naively calculating full correlation matrices. However, investigating full correlation matrices may not be necessary, as often only the most prominent correlation structures are of interest. This makes focused correlation network construction schemes (for example, based on topk correlations, as introduced in the following sections) useful tools to explore and analyze largescale correlation structures, while avoiding resource limitations.
Efficient approximation of largescale correlation networks
Correlation analysis is often focused on the strongest correlations in a study, that is, by selecting the topk correlations. Straightforward implementations of this approach are based on calculating the overall correlation matrix, and utilizing default sorting algorithms to extract the topk correlations. However, this incurs substantial runtime and memory overhead as illustrated by the baseline implementations in R, Julia and Python (Table 2). To the best of our knowledge, no easytouse efficient algorithms exist to calculate only the topk correlations of a given set of features. To address this, for systems with large amounts of memory, CorALS provides a basic algorithm (matrix) that utilizes the previously introduced fast correlation matrix routine (Supplementary Data 1) together with selection algorithms that are able to efficiently partition topk values from the remaining correlations^{31}. However, on the one hand, memory requirements quickly exceed available resources (see, for example, memory use in the cancer (0.50) dataset in Table 2), and, on the other hand, the employed partitioning algorithms are not easily parallelizable while consuming the majority of runtime of the topk search (compare matrix in Supplementary Data 1 and matrix in Table 2). To address this, CorALS employs a combination of specific feature vector projections (Fig. 1a), space partitioning techniques and an efficient computation pipeline (Fig. 1b) to approximate the set of strongest correlations in a network (Fig. 1c, middle). The use of efficient space partitioning techniques circumvents calculating the overall correlation matrix, thus avoiding large memory requirements while allowing for substantial runtime improvements (Table 2, index). For more results on runtime, memory and parallelization efficiency for varying numbers of samples and features, also see Supplementary Section 2. The employed approximation scheme trades off resource usage against accuracy. Thus, we provide a theoretical analysis of lower bounds on the amount of potentially found values and the associated sensitivity across various approximation factors (‘Approximate search for global topk correlations’ in Methods). Note that in practice, the sensitivity associated with specific approximation factors can be much higher then the provided estimates. Consequently, CorALS can even yield perfect results with smaller approximation factors increasing efficiency (Supplementary Section 5). The runtime efficiency of the employed space partitioning techniques grows as the ratio of the number features and samples increases. Thus, particularly on highdimensional, realworld datasets, CorALS reduces runtimes substantially. In addition, this approach is inherently parallelizable and by employing multiple cores (here, 64), CorALS achieves notable performance gains (for example, from 8 hours to 11 minutes for the cancer (1.00) dataset) and outperforms all baseline for any of the considered datasets by a large margin (parallel). Finally, and importantly, the methods provided by CorALS have a very small memory consumption profile of only a fraction of the baseline implementations. This can be even further reduced depending on the application scenario, for example, by lowering the number of top correlations to extract, introducing explicit correlation thresholds or decreasing the size of the batches in the CorALS computation pipeline (Fig. 1b, middle). Thus, CorALS enables largescale correlation analyses that are not possible with any of the baseline or basic implementations, even on dedicated highperformance computation hardware (Table 2, cancer (0.5), cancer (1.0) and single cell).
Overall, CorALS allows the calculation of largescale topk correlation networks on personal computers, enabling accessible workflows that previously required dedicated highperformance infrastructures. For additional runtime, memory and accuracy analyses, see Supplementary Sections 2–5.
Differential analysis of correlation networks
Differential network analysis^{32,33,34}, and specifically systematically studying the largest differences in correlation networks across more than one condition (or timepoint), can be instrumental to understanding the underlying processes of complex systems^{35}. To enable this, CorALS represents features as vectors in a ‘differential space’ (Fig. 1), each of which combines information from two conditions (or timepoints) simultaneously (Fig. 1c). This, allows CorALS to employ an algorithmic approach similar to topk correlation search, enabling efficient topk differential correlation discovery (Fig. 1c, right) with analogous runtime and memory characteristics. Comparable methods such as Differential Gene Correlation Analysis (DGCA) or DiffCor^{35,36} provide approaches for ensuring statistical robustness of their results based on sampling. However, even with sampling disabled, these methods are substantially slower than CorALS. Other approaches such as Discordant and Differential Correlation across Ranked Samples (DCARS)^{37,38}, do not allow for topk functionality and thus will quickly run into memory issues. Thus, CorALS allows for a much more efficient discovery of topk correlation discovery. For a more indepth discussion, we refer to Supplementary Section 3.3. To ensure robustness, either CorALS can be used as an efficient candidate selection step, which can then potentially be tested with the methods mentioned above, or similar sampling techniques can be implemented in CorALS. We apply such a sampling approach in ‘Largescale multiomics correlation analysis across pregnancy’, where we take advantage of the efficient runtime characteristics of CorALS to account for spurious correlations by employing a corresponding samplingbased strategy.
Explicit visualization of correlation structure
Feature embeddings are an essential tool to represent and analyze features in lowdimensional spaces. For example, Fig. 2 shows features visualized using tdistributed stochastic neighbor embeddings (tSNE). However, tSNE is generally based on Euclidean distances and thus does not directly represent the correlation structure of features. Although some tSNE implementations support custom correlationbased distance information, this is often inefficient owing to algorithmic overhead. To address this, CorALS uses correlation projections (Fig. 1a, middle) to exploit the direct relationship between correlation and Euclidean distance (Supplementary Section 6). This allows to employ any existing distancebased method for embedding features without adding substantial computational overhead. All feature and cell visualizations throughout this paper are based on this approach (for example, Figs. 1–3).
Largescale multiomics correlation analysis across pregnancy
Understanding maternal biological changes during and immediately after pregnancy is a fundamental step to improving diagnostic and therapeutic strategies in peripartum management, to prevent critical conditions that extend well into the child’s adulthood (for example, preterm birth, the single largest cause of death in children under 5 years of age). Despite this, previous studies have not investigated possible changes in the crosstalk across various biological modalities^{6,39}. To demonstrate CorALS’s utility in highdimensional multiomics studies, we analyzed a dataset containing third trimester and postpartum measurements of biospecimens from 17 healthy pregnant women^{6}. Each sample in the corresponding data contains more than 60,000 synchronized measurements from seven different omics (Fig. 2) from which we selected ~41,000 by filtering features with missing or constant values. Details on assays and the measured biomarkers can be found in ‘Datasets’ in Methods.
We used CorALS to calculate the top10% Spearman correlations between all feature pairs for the third trimester. Furthermore, we extracted the top0.1% strongest differential correlations in contrast to postpartum by employing CorALS’s corresponding implementation. To focus the results on strong signals, we selected feature pairs passing a correlation threshold of 0.8 in the third trimester for further analysis. For visualization (Fig. 2), we utilized CorALS’s correlationbased feature embeddings based on tSNE^{40} for each individual omic.
The visualization reveals various prominent changes in correlation structure between the different omics from the third trimester to postpartum. In particular, the correlation changes between transcriptome and microbiome, as well as between the transcriptome (cellfree RNA) and immunome (including phenotypical, and the functional markers measured by mass cytometry or cytometry by time of flight, are prominent. These correlations appear in the third trimester but vanish postpartum (edges marked in dark gray). Refer to the Supplementary Section 8 for details and an expanded biological analysis.
Overall, while establishing causal links requires careful followup studies and biological validations, the results outlined in Supplementary Section 8 are a powerful example that illustrates how the efficient analysis of largescale correlation networks as enabled by CorALS can drive the generation of biological hypotheses.
Correlated functional changes across immune cells
While recent advances in singlecell technologies have enabled the production of large immunological datasets, data analysis approaches for singlecell data have remained limited to traditional analysis of changes in the frequency and signaling pathways of cell types. In this example, we demonstrate that CorALS allows to derive a complementary perspective on the dynamic coordination of functional characteristics across several immunesystem components on the singlecell level. We analyzed a dataset of more than 24 million cells from 17 participants tracking the immune system through pregnancy using mass cytometry^{1}. Notably, this dataset contains simultaneous measurements of both phenotypic markers as well as intracellular proteins, the latter serving as markers for endogenous signaling responsiveness of single cells. The phenotypic markers were used to identify various cell populations via manual gating^{1}, and CorALS was used to study shifts in cell similarities across the signaling pathways of various cell types using the available ten functional markers (Fig. 3) based on Spearman correlation. To increase the robustness of the dynamic changes identified, this analysis requires repeated sampling and topk correlation calculations across millions of individual cells, making CorALS an essential component of the analytical pipeline by substantially reducing runtime and memory requirements (processing a single sample corresponds to the singlecell experiment in Table 2). Refer to the Supplementary Section 9 for details and an expanded biological analysis.
Figure 3 shows a summary of this analysis and visualizes the amount and direction of change in the relative number of functional cell correlations attributed individual cell type pairs within the topk functional cell correlations between the third trimester and postpartum. These changes mostly revolve around B cells and CD56^{dim}CD16^{+} natural killer (NK) cells. While a detailed analysis may be of interest, we focus on these changes as an example to illustrate the complementary perspectives enabled by CorALS. In general, from the third trimester to postpartum, B cells and CD56^{dim}CD16^{+} NK cells show a higher degree of similarity in terms of signaling response signatures postpartum (orange edges) to cell types of the adaptive immune system (light background). At the same time, they share less similarity (blue edges) in their intracellular signaling response signature with the cells of the innate immune system (dark background). We further visualize this trend through density plots in Fig. 3, directly comparing the number of topk correlations of B cells and CD56^{dim}CD16^{+} NK cells, respectively, with the total pool of innate or adaptive immunecell subsets in the third trimester versus postpartum. This analysis provides a complementary perspective on the coordination of singlecell systems during pregnancy, and suggests that B cells and CD56^{dim}CD16^{+} NK cells acquire innatelike functional characteristics in the third trimester, and that, postpartum, these two cell types and various Tcell subsets shift functionally to resemble each other.
On the basis of these conjectures, and given further datasets for validation, the changes observed in Fig. 3 may guide further research on the role of B cells and CD56^{dim}CD16^{+} NK cells and the phenotypes they acquire over the course of pregnancy. This serves as a practical example on how CorALS can enable complementary perspectives on many different domains, including the coordination of singlecell systems, by enabling the efficient implementation and application of largescale correlation analysis.
Discussion
Modern biological profiling techniques will enable the collection of datasets with increasingly high dimensions and sample sizes. Therefore, the consistent analysis of evolving datasets will require continuous improvements. We can further advance CorALS with advanced indexing and sorting algorithms, ondisk sorting algorithms, or employing distributed computing environments. The computational pipeline of CorALS is designed to support such extensions.
For example, the current version of CorALS is optimized for highdimensional datasets with small sample sizes. However, as sample sizes increase, the efficiency of the employed indexing structure can deteriorate. Alternatively, approximate indexing structures increase runtime in exchange for sensitivity. Also, approaches based on a batched computation of partial correlation matrices combined with thresholding may be an alternative (see ‘Feature projections’ in Methods for details). However, the latter approach will require careful balance between the number of batches, the number of concurrent tasks, threshold size, memory availability and runtime, as a threshold does not provide memory guarantees. To tackle this, various methods to cache data outside of the main memory can be employed. A principled approach to this are distributed frameworks, for example, based on MapReduce^{41}. CorALS already supports such distributed computation on various backends. We provide a Jupyter notebook that exemplifies running CorALS on a Spark cluster^{42}. However, while the implementation of CorALS already contains many of the previously mentioned extensions, a detailed comparison and analysis is beyond the scope of this work. For further practical consideration, also see ‘Practical considerations’ in Methods.
Also, the CorALS implementation provides tools to derive P values to gauge the significance of the measured correlations (‘Pvalue calculation and multiple testing correction’ in Methods), and supports the nonparametric Spearman correlation (‘Correlation coefficient classes’ in Methods) to account for outliers or certain error types^{8,43}. However, P values and the Spearman coefficient do not generally address challenges such as data errors and noisy data. To tackle this issue, correlation measures are often calculated based on computationally expensive techniques, for example, based on bootstrapping^{43}, making their application in highdimensional data impractical. In this context, CorALS can be used either to efficiently sample correlations using full correlation matrix calculation or to first select topk correlations for which robust methods can then be applied selectively. Similarly, CorALS does not account for confounding or causation. However, more advanced approaches to account for these effects, such as partial correlation or Bayesian networks^{44,45}, are often restricted to small datasets and do not scale for highdimensional data. In this context, CorALS can be used to effectively suggest highly correlated components of the data for further investigation with such methods. Thus, overall, investigating correlation networks can be broadly applied to gain insight into the underlying functional structures, which then may provide input for downstream analysis and also for more advanced methods such as graph neural networks^{11,12}.
Finally, as the number of features increases with advancing technologies, it may be necessary to introduce more sophisticated methods that find correlated compounds, for example, based on existing domain knowledge, rather than individual correlations, for which CorALS can lay the computational foundation.
Overall, owing to its wide range and scope, we anticipate CorALS to be a catalyst that will be adopted to enable a multitude of downstream applications of largescale correlation networks. For example, in ‘Correlated functional changes across immune cells’, the efficiency characteristics of CorALS’s top correlation network estimation allow to derive an innovative samplingbased approach to analyze the interaction of hundreds of thousands of cells simultaneously. In future work, CorALS may also support advanced tensor and network analysis or deep learning and graph neural network modeling (for example, for geneinteraction graphs and celltocell relationships^{11,12}). Thus, it will lay the analytical foundations and provide computational tools to unravel the intricate interactions of biological systems as developing computational approaches are able to analyze increasingly complex network structures.
Methods
Derivation of efficient feature representations by CorALS
The different components of CorALS rely on transforming features into specific vector representations that connect the scalar product of these vectors to efficient correlation computations. In the following, we outline the derivation of these transformations for correlation projections (used for efficient correlation matrix calculation, top correlation network approximation and correlation embeddings) as well as differential projections (used for top differential correlation search), respectively. It is noted that the following feature representations are derived for the Pearson correlation coefficient; however, without loss of generality, these derivations hold for Spearman’s rank correlation coefficient by replacing individual feature values with ranks per feature. This is supported by CorALS’s implementation.
Correlation projections
By transforming feature representations appropriately, correlation computation can be formulated as a scalar product of two preprocessed vectors^{46}. We refer to this preprocessing step as correlation projection. In particular, the Pearson correlation cor(x, y) between two features x and y with respective sample vectors x = (x_{1}, ..., x_{m}) and y = (y_{1}, ..., y_{m}), can be rewritten as follows:
where \({\mu}_{\mathbf z}\) is the mean of vector z. Thus, the operator corresponds to the correlation projection that allows the transformation of the original sample vectors so that their scalar product is equal to their correlation. CorALS exploits this vector representation to formulate correlation matrix computation as an efficient matrix product.
This transformation allows to derive a direct relationship between the correlation cor(x, y) of any two vectors and the Euclidean distance \({d}_{{\mathrm{e}}}(\hat{{{{\bf{x}}}}},\hat{{{{\bf{y}}}}})\) of their correlation projections^{46}. In particular, cor(x, y) and \({d}_{{\mathrm{e}}}(\hat{{{{\bf{x}}}}},\hat{{{{\bf{y}}}}})\) are orderequivalent and it holds that:
CorALS exploits this relationship between correlation and Euclidean distance, for example, in top correlation approximation and correlationbased embeddings. For more details and corresponding proofs, see Supplementary Section 6.1.
Differential projections
CorALS further introduces a dual feature representation in a differential space that allows to calculate correlation differences across two conditions or timepoints using a single scalar product. In particular, for two features x and y, let \({{{{\bf{x}}}}}_{1}=({x}_{1,1},...,{x}_{1,{m}_{1}})\) and \({{{{\bf{y}}}}}_{1}=({y}_{1,1},...,{y}_{1,{m}_{1}})\) denote respective sample vectors in the first condition/timepoint and \({{{{\bf{x}}}}}_{2}=({x}_{2,1},...,{x}_{2,{m}_{2}})\) and \({{{{\bf{y}}}}}_{2}=({y}_{2,1},...,{y}_{2,{m}_{2}})\) in the second condition/timepoint. Then, the goal is to find vector transformations δ(x_{1}, x_{2}), κ(y_{1}, y_{2}) that represent information form both conditions/timepoints simultaneously so that
Given the correlation projection from ‘Correlation projections’, the following definitions for \(\delta\) and κ provide such a dual vector representation.
We call the vector space containing the codomain of these functions differential space.
Similar to the connection of Euclidean distance and basic correlation (see above), the dual feature representations in the differential space exhibit a connection between Euclidean distance and correlation difference across conditions or timepoints. In particular, for two features x and y with sample vectors x_{1}, x_{2} and y_{1}, y_{2} across two conditions or timepoints, cor(x_{1}, y_{1}) − cor(x_{2}, y_{2}) and −d_{e}(δ(x_{1}, x_{2}), κ(y_{1}, y_{2})) are orderequivalent and it holds that:
Thus, analogously to correlation projections, CorALS exploits this order equivalence of Euclidean distance and correlation differences for top differential correlation approximation. For more details and corresponding proofs, see Supplementary Section 6.2.
Efficient calculation of full correlation matrices
Efficiently calculating full correlation matrices is achieved by recognizing that the inner product formulation in equation (1) allows to condense the correlation calculation between all possible feature pairs in a dataset to a single matrix product \({\hat{{{{X}}}}}^{\top }\hat{{{{X}}}}\). Here, \(\hat{{{{X}}}}\in {{\mathbb{R}}}^{m\times n}\) is the samplefeature matrix representing the corresponding dataset with m samples and n features where each column corresponds to the correlation projected sample vector of each feature, respectively (see ‘Correlation projections’). This approach can be directly formulated in any recent programming language without requiring additional software packages, and is able to take advantage of builtin efficient linear algebra routines such as BLAS and LAPACK^{47,48}, which inherently support parallelization as showcased in Supplementary Data 1 and Supplementary Section 3. This approach outperforms many other implementations employing similar concepts as demonstrated in Supplementary Table 2.
Efficient approximation of correlation networks
Top correlation computation as a query search problem
By default correlation networks are fully connected. However, often it is more valuable to study only the most interesting interactions, that is, the strongest correlations. For this, it is common to either define a fixed threshold or concentrate the analysis on the topk correlations. A straightforward approach to achieve this is to calculate the full correlation network and then keep only those correlations that are sufficiently strong according to either criterion. However, for highdimensional data, calculating the full correlation matrix between features is often not feasible owing to memory restrictions, and in the topk case, the subsequent sorting operation has more than cubic complexity with the number of features n (\({{{\mathcal{O}}}}({n}^{2}\log\,n)\)). And even when using partial sorting techniques based on selection algorithms for topk search, this may result in impractical runtimes (\({{{\mathcal{O}}}}({n}^{2}+k\log\,k)\))^{31,49}.
To address this, we fist observe that owing to the symmetry property of correlation measures, a single feature can never be strongly correlated to all other features (except in cases where all features highly correlated). Thus, we assume that the top global correlations can be approximated by finding and merging the top correlations locally, for example, for each feature separately, given an appropriate local margin (coined approximation factor as introduced below). This allows CorALS to reinterpret the task of top correlation computation as a query search problem^{50} where an indexed set of elements is efficiently queried based on a set of query vectors and a given distance measure. In particular, CorALS constructs an efficient index structure T_{X} over a set of features X and then interprets another (often the same) set of features as queries Y to find the top correlated feature pairs. This approach prevents the construction of the complete correlation matrix and the corresponding implementation is inherently parallelizable, resulting in substantially reduced runtimes and memory requirements.
In the following, we describe the individual steps to enable this approach. This includes (1) the construction of an optimized indexing and query method that circumvents limitations of the previously derived relation between Euclidean distance and correlation (‘Joint ball trees for local top correlation discovery’), (2) the description of an approximation scheme to generalize singlequerybased search to return global topk correlations (‘Approximate search for global topk correlations’), and (3) a discussion on the implementation of thresholdbased search (‘Thresholdbased correlation filtering’).
Joint ball trees for local top correlation discovery
While in principle, any metricbased knearestneighbor algorithm can be used for CorALS, we focus on space partitioning algorithms that allow for efficient topk as well as thresholdbased queries in highdimensional settings. Ball trees (or metric trees) in particular automatically adjust their structure to the represented data, provide good averagecase performance and can cope well with highdimensional entities^{50,51}. While such indexing structures are mostly optimized for metrics such as the Euclidean distance, CorALS takes advantage of the correlation projection introduced in ‘Correlation projections’ and its properties (see ‘Correlation projections’) to enable top correlation and differential correlation search.
In particular, CorALS first represents each feature as a correlation vector by applying the correlation projection introduced in ‘Correlation projections’ to their respective sample vectors. These correlation vectors X are then indexed using ball tree space partitioning resulting in index T_{X}. On the basis of the relation between Euclidean distance and correlation derived in ‘Correlation projections’, this index allows to search for topk positively correlated features search(T_{X}, y, k) based on a given query feature y ∈ Y. It also allows to search for a set of features search(T_{X}, y, t) passing a positive correlation threshold t with respect to the query feature y.
Note that this setup has two specific limitations that we address in the following. First, ball trees generally only support to search for top correlations relative to a single reference feature y. The algorithm to generalize this to a set of features will be described in ‘Approximate search for global topk correlations’ and ‘Thresholdbased correlation filtering’. Second, by default, only feature pairs with positive correlations are returned because only positive correlations correspond to small Euclidean distances while negative correlations will result in large distances (see equation (2) and the corollary in Supplementary Section 6.1).
To address the latter, CorALS takes advantage of the fact that correlation (as well as the scalar product) is associative with respect to scalar multiplication. In particular, changing the sign of a sample vector also changes the sign of the correlation:
Now, without loss of generality, we focus on topk search in the following derivation. Assuming that at least k features with positive correlations to a query feature y exist in X, then all correlations returned by search(T_{X}, y, k) are positive. Similarly, assuming that at least k negative correlations exist, switching the sign of all features in the dictionary X, that is, search(T_{−X}, y, k), or switching the sign of the query, that is, search(T_{X}, −y, k), allows to also extract the strongest negative correlations (see equation (6)). Thus, a simple solution to find those features with the top positive and negative correlations is to run the search twice, once to extract positive and once to extract negative correlations, followed by a merging step.
However, for topk search, this merging step, involves returning the topk correlations twice, resulting in a sorting step that orders 2k elements, which can double memory requirements. This can be prevented by building the ball tree based on positive and negative dictionary features simultaneously, that is, search(T_{−X∪X}, y, k). This search only returns k elements, and thus can reduce runtime and memory requirements. See Supplementary Table 1 for a comparison of top0.1% search on realworld datasets (Table 1). The corresponding experiments are based on the CorALS’s Python implementation and were repeated ten times; reported medians had no substantial fluctuations between runs. While the runtime improvements are marginal, the memory consumption can be reduced by half. Also note that for multiple queries, ball trees support to preprocess the set of queries resulting in a dualtree approach^{52} for speeding up the search. Supplementary Table 1 also demonstrates the effectiveness of this approach. For the final implementation of CorALS, we jointly build the ball tree structure on negative and positive features and employ the dualtree search whenever provided by the underlying software library.
Approximate search for global topk correlations
Focusing on the topk correlations can be an effective way to construct interpretable visualizations of correlation matrices without having to explicitly specify a threshold. For this, k is often large, defined either as a multiple of the number of features (for example, 100n, 1,000n), or as a percentage (say 0.1% of all correlations ~⌈n^{2} * 0.001⌉). However, the ball tree algorithm (see ‘Joint ball trees for local top correlation discovery’) returns only the top correlations for each feature rather than the overall topk correlations between all features. To address this, CorALS employs an approximation scheme.
In particular, for each query feature y ∈ Y, CorALS heuristically sets the number of k′ top correlated features to retrieve and then merges the results to approximate the global set of topk features. Selecting k′ presents a tradeoff. On the one hand, if k′ is greater than or equal to the number of features n, all feature pairs will be considered, thus allowing for an exact determination of the topk features but no gain in runtime. On the other hand, if k′ < n, then there is no guarantee that the exact topk features are retrieved; however, the runtime can be substantially improved as only a subset of candidates is returned and processed. To address this, CorALS uniformly draws top correlation candidates across all query features with a sufficient margin that accounts for biases in the correlation structure. That is, we chose k′ to be dependent on k with \({k}^{{\prime} }=a \lceil \frac{k}{n}\rceil\) as a middle ground between drawing the exact number of required candidates from each query \({k}^{{\prime} }=\lceil \frac{k}{n}\rceil\) and considering all candidates from each query k′ = n. Here a is called the approximation factor and regulates how many correlations are inspected per feature. The approximation factor can be selected so that CorALS returns results up to a specific sensitivity s. In particular, for a desired sensitivity up to s ≤ 0.75, the approximation factor can be chosen based on \(a=s\frac{n}{\sqrt{k}}\); and for a desired sensitivity s ≥ 0.75, the approximation factor can be chosen based on \(a=\frac{s n}{2\sqrt{k}\sqrt{1s}}\). When formulating k in terms of the overall number of correlations n^{2}, that is, k = rn^{2}, for a sensitivity of s ≤ 0.75, the approximation factor can be calculated via \(a=\frac{s}{\sqrt{r}}\), and for s ≥ 0.75 it can be calculated via \(a=\frac{s}{2\sqrt{r}\sqrt{1s}}\). However, in practice the number of missed correlations can be substantially smaller as correlations are usually not distributed according to the the worst case (Supplementary Figure 5). The derivation of sensitivity estimates as well as a study of the effects of a itself can be found in Supplementary Section 5. Supplementary Algorithm 1 summarizes the overall approach.
Thresholdbased correlation filtering
To calculate all correlations greater than a threshold t, for each feature y ∈ Y, we can also employ the ball tree data structure (see ‘Joint ball trees for local top correlation discovery’) by issuing radius queries. For this, the correlation threshold needs to be converted into an Euclidean radius using equation (2). Thus, for each query feature, the respective query returns all indexed features with correlations greater than the respective correlation threshold. The results of each query are then merged to retrieve the final list of the filtered feature pairs. This approach is more memory efficient than calculating correlations for all possible feature pairs, for example, using the methodology introduced in ‘Efficient calculation of full correlation matrices’. However, it can also result in substantially increased runtimes compared with calculating the complete correlation matrix. The corresponding algorithm is implemented analogously to the topk search in Supplementary Algorithm 1 but replaces k with a correlation threshold that is converted into a corresponding Euclidean radius via equation (2) to be used by the ball tree index structure.
Top correlation difference search
To efficiently calculate the top differences in correlation between pairs of features across more than one timepoint or condition, the naive implementation involves calculating the full correlation matrices for two conditions or timepoints, subtracting them and then extracting the top differences, for example, through thresholding or by identifying the topk candidates. As previously shown for topk correlation search, this is runtime and memory extensive if implemented naively and thus can easily exceed computational resources (Table 2).
To address this, CorALS builds on the dual feature representation introduced in ‘Differential projections’. In particular, it exploits the connection of correlation difference and Euclidean distance between the dual representation of features in differential space and then applies the same query search approach as for top correlation search (see ‘Efficient approximation of correlation networks’).
Thus, this first requires representing all features x ∈ X as their dual representations δ(x) ∈ δ(X) and κ(x) ∈ κ(X). Then, analogously to ‘Joint ball trees for local top correlation discovery’, a combined ball tree T_{δ(X)∪−δ(X)} is constructed to cover negative as well as positive differences. This ball tree can then be used to query the topk (or thresholded) correlation differences search(T_{δ(X)∪−δ(X)}, y, k) by querying with the feature representations κ(x) ∈ κ(X). This already includes positive and negative correlation differences as we index positive and negative projections δ(X) ∪ − δ(X), while indexing only δ(X) would solely return the top positive correlation differences (see equation (2) and the corollary in Supplementary Section 6.2). After the construction of T_{δ(X)∪−δ(X)}, the same approximation approach as laid out in ‘Approximate search for global topk correlations’ and ‘Thresholdbased correlation filtering’ is employed to query the top correlation differences across all query features κ(X).
Correlation embeddings
tSNE^{40} was used to embed highdimensional data points into lowdimensional spaces, for example, for visualization. In this work, we employ tSNE to embed features based on their correlation structure across samples. However, tSNE is based on Euclidean distance and thus does not directly represent the correlation structure of features.
In particular, tSNE reduces the dimensionality of data by minimizing the Kullback–Leibler divergence between a probability distribution, P, in the highdimensional space and a probability distribution, Q, in the lowdimensional space^{40}:
where the probabilities p_{ij} and q_{ij} represent probabilities for features j to belong to the neighborhood of feature i based on Euclidean distance in the corresponding space:
with ∥z_{i} − z_{j}∥^{2} and \(\parallel {\tilde{{{{\bf{z}}}}}}_{i}{\tilde{{{{\bf{z}}}}}}_{j}{\parallel }^{2}\) representing pairwise Euclidean distances between features i and j for highdimensional z and lowdimensional feature representations \(\tilde{{{{\boldsymbol{z}}}}}\), respectively.
Now, by projecting features onto correlation vectors, CorALS establishes an order equivalence between Euclidean distance and correlation as introduced in ‘Correlation projections’. This allows to directly employ distancebased embeddings methods such as tSNE on the projected features without adding substantial computational overhead or requiring implementations that support customized distance information. A performance example is given in Supplementary Section 7.
Correlation coefficient classes
The underlying computation of CorALS is based on the Pearson correlation coefficient as discussed in the previous sections. On this basis, CorALS also supports any class of correlation coefficients that can be reduced to the Pearson calculation scheme. In particular, Spearman correlation can be calculated using the Pearson formula by replacing individual feature values with featurelocal ranks, which may help to account for outliers or certain error types^{8,43}. CorALS provides the corresponding capabilities to switch between Pearson and Spearman. Similarly, the Phi coefficient for binary variables can be calculated using the Pearson formula^{53}. Finally, other correlation coefficient classes may be supported by future versions of CorALS by finding a mapping between the corresponding coefficient and Euclidean distance as derived in the previous section for the Pearson correlation coefficient.
Pvalue calculation and multiple testing correction
P values for Pearson correlation coefficients r, can be derived from the correlation coefficient together with the number of samples n. That is, first the tstatistic can be derived using \(t=r\frac{\sqrt{n2}}{\sqrt{1{r}^{2}}}\). Then, the P value can be calculated by examining the cumulative tdistribution function p: P = 2 ⋅ p(T > t) where T follows a tdistribution function with N − 2 degrees of freedom. This approach is implemented in CorALS as derive_pvalues and can be applied as a postprocessing step.
Note that owing to the large amount of correlations calculated, multiple test correction is necessary when working with P values. The most straightforward approach is to control for familywise error rate using Bonferroni correction, which multiplies the corresponding P values by the number of compared correlation coefficients \(\frac{{n}^{2}n}{2}\). Other approaches such as the false discovery controlling procedure Benjamini–Hochberg generally require the full P value distribution, which is not available when applying topk correlation discovery. In these cases, padding the calculated P values with 1s for unknown P values can provide an upper bound for adjusted P values. However, this generally requires instantiating the full number of P values, which causes memory issues like in the full correlation matrix case Supplementary Table 1. To address this we provide a truncated version of the Benjamini–Hochberg procedure that avoids this issue.
The Benjamini–Hochberg (BH) procedure yields adjusted P values^{54} through
with \({P}_{(i)}^{{\mathrm{BH}}}\) representing the BH corrected P value at rank (i) for ascendingly ranked P values, m being the number of overall P values, for example, \(m=\frac{({n}^{2}n)}{2}\), and j represents the rank of the P value P_{j}. On the basis of this formula, a truncated upperbound version of BH calculates the adjusted P values for all topk P values. Then a upperbound adjusted value is calculated by \(u=\frac{m\cdot 1}{k+1}\). If P_{k} > u, then all adjusted P values P with P = P_{k} are replaced by u. This yields a minimally invasive truncated BH procedure for adjusted P values without instantiating the full distribution of P values. The approach is implemented in CorALS as multiple_test_correction and can be applied as a postprocessing step.
Extensible framework for largescale correlation analysis
The computational framework of CorALS is based on three steps (Fig. 1b): a feature projection step, a dynamic batching step and a reduction step. As such, the general structure is compatible with the the big data computation model MapReduce^{41}.
The feature projection step (Fig. 1b, left) allows for preparing the data so that it can be split and processed independently in an efficient manner. In this paper, we specifically focus deriving an indexing structure based on space partitioning that allows for efficiently querying topk correlations.
The dynamic batching step (Fig. 1b, middle) then splits the data matrix into multiple batches. The prepared data (and indexing structures) are then used to locally extract the relevant values in each batch independent of the other batches. Batches can be processed sequentially, in parallel or even in a distributed manner. Thus, the smaller the batches and the smaller the number of batches that run simultaneously, the less memory is required. This finegrained control over batches introduces an effective mechanism to manage and tradeoff memory requirements and runtime based on the available resources. Furthermore, batches may store their results on disk rather than inmemory, further reducing memory requirements. In this paper, for each batch of features, we focus on utilizing the previously mentioned indexing structure to extract the local topk correlations in line with the corresponding approximation factor (see ‘Approximate search for global topk correlations’). We also provide a thresholding feature that can reduce memory requirements of the batch results.
Finally, the batch results are reduced into the final result by merging batches. Dependent on the batch implementation and the local results, this can be done directly in memory for the fastest runtimes, sequentially by merging one batch result at a time or even mostly on disk, which can be used to further reduce memory requirements in favor of computation time. In the implementation of the final join analyzed in this paper, the results from the batches consist of individual correlations, which are merged, partitioned and then sorted to return the final topk values.
Feature projections
Note that the implementation provided by CorALS is highly extensible and nearly all aspects can be replaced by custom implementations to optimize for particular application scenarios. For example, during the feature projection step, the index structures employed in the current implementation are based on ball trees, which optimizes for highdimensional datasets with small samples sizes by employing correlation and differential spaces (Fig. 1a). However, this index structure can easily be replaced by implementations with different computational characteristics. For example, it may make sense to consider approximate nearestneighbor methods^{55} to replace the current index, which may potentially reduce runtimes for a cut in sensitivity. Similarly, particularly for larger sample sizes, instead of using indexing structures, it may be advantageous to directly calculate correlations for smaller batches via the efficient matrix multiplication scheme introduced in ‘Efficient calculation of full correlation matrices’. While this direct calculation and partitioning of correlations increases time complexity from \({{{\mathcal{O}}}}(n\log n)\) to \({{{\mathcal{O}}}}({n}^{2})\), this may be faster than the currently employed ball tree indexing structure as the corresponding search performance of \({{{\mathcal{O}}}}(\log n)\) may deteriorate to \({{{\mathcal{O}}}}(n)\) with increasing dimensionality (in our case sample size). Here it is important to appropriately select the number of simultaneous batches to limit the memory requirements of this approach (for example, if only one batch is used, the complete correlation matrix will be instantiated). A corresponding implementation is provided by CorALS. A detailed comparison with indepth parameter optimization and the corresponding relation to more efficient approximate nearestneighbor schemes is left for future studies.
Distributed computation
The methods in this paper are focused on inmemory computations. However, as mentioned earlier, the computational framework of CorALS allows for sequential computation of batch results which can be cached on disk, circumventing potential memory limitations and allowing for calculating correlations for massive datasets. Furthermore, CorALS also supports distributed computation of correlation and differential matrices through the joblib backend (https://github.com/joblib/joblib). This directly enables Spark (https://github.com/joblib/joblibspark), Dask (https://ml.dask.org/joblib.html) or Ray (https://docs.ray.io/en/latest/joblib.html).
In principle, the batchbased design of CorALS also allows for more specialized implementations based on the MapReduce paradigm^{41}. Thus, overall, CorALS provides a very flexible algorithmic framework for largescale correlation analysis that can be easily extended and adjusted to the application at hand.
Practical considerations
Full correlation matrix calculation
On the basis of the results in Table 2 and Supplementary Table 3, where CorALS substantially outperforms all other methods, we recommend generally using CorALS for full correlation matrix calculation. As the number of features grows, however, the full correlation matrix will not fit into memory. For example, at n = 32,000 features, the full matrix uses more than 8 GB of memory; at n = 64,000 features, it already requires more than 32 GB. This can be calculated roughly by assuming 64bit float values (default in Python) and the formula: \({\mathrm{memory}}(n)=\frac{64 {n}^{2}}{8\times 1{0}^{9}}\). Thus, we recommend switching to topk correlation analysis after n = 32,000 features.
Topk correlation search
For topk correlation search, we recommend using the basic CorALS implementation (referred to as matrix in Table 2) as long as the full correlation matrix fits into memory, independent of the number of samples. However, as the number of features increases, memory issues will make this approach impossible to use. When this is the case, switching to the indexbased CorALS implementation is the best option. With increasing sample numbers, CorALS becomes slower, which may warrant other heuristics such as dimensionality reduction such as locality sensitive hashing or random projections (see ‘Discussion’).
Note that, by default, the topk approximation approach does not guarantee symmetric results, that is, even if cor(x, y) is returned, cor(y, x) may be missing. This can be addressed by various postprocessing steps, for example, by adding missing values. CorALS provides the option to enable this feature. In the experiments, this is not enabled as symmetric results are redundant for practical purposes.
Correlation structure visualization
For practical purposes, there are two properties of the proposed correlation structure visualization to consider. First, by design, CorALS visualizes strongly positively correlated features close to each other while the distance to strongly negatively correlated features will be large (see corollary in ‘Correlation projections’). In some settings it may be desirable to simultaneously visualize negatively correlated features close to each other, which is currently not supported by CorALS. Second, the relationship between Euclidean distance and correlation established in is not linear, which may result in bias toward tightly clustering highly correlated features. See Supplementary Fig. 1 for an illustration of the relation between correlation and the corresponding Euclidean distance.
Investigating the coordination of singlecell functions
For the analysis in ‘Correlated functional changes across immune cells’ and Fig. 3, we first divide cells into 20 individual nonoverlapping cell types based on manual gating^{1}. We then repeatedly sample 10,000 cells from each cell type across all patients using a dual bootstrapping scheme to ensure appropriate variations in cell types where less than 10,000 cells are present. The dual bootstrapping scheme first samples n cells from each cell type with replacement, where n is the number of available cells for that cell type. From this intermediate sample, we sample the final 10,000 cells for that cell type with replacement.
On the resulting sample of 200,000 cells across cell types, we calculate the top0.01% Spearman correlations across all sampled cells based on their functional markers. We then count the number of top correlations between each pair of cell types. This allows to measure the relative correlation strengths between cell types.
By generating pairs of samples in each repetition, one from thirdtrimester cells and one from postpartum cells, we calculate the effect size (Cliff’s δ) of the topk frequency differences between each pair of cell types. Supplementary Fig. 6 depicts a single instance of such a pair. We sample 1,000 times. Very large effect sizes defined by a corresponding effect size threshold (t = 0.622) are visualized in Fig. 3. This threshold has been derived based on analogous interpretation intervals proposed for Cohen’s d (refs. ^{56,57}).
As described above, this procedure requires repeated sampling and topk correlation calculations across millions of individual cells, making CorALS an essential component of this pipeline, enabling this analysis on our available servers by substantially reducing runtime and particularly memory requirements.
Datasets
The four realworld datasets we use for runtime and memory evaluation stem from biological applications in the context of preeclampsia, healthy pregnancy and cancer.All previously reported feature counts are subject to the following preprocessing procedure. We set negative values to 0, remove features that have only a single value and drop duplicate features (features are considered duplicates if all their sample values are the same). Dataset statistics are summarized in Table 1. For dataset availability, see Section ‘Data availability’.
The preeclampsia dataset^{24,26} contains aligned measurements from the immunome, transcriptome, microbiome, lipidome, proteome and metabolome, from 23 pregnant women with and without preeclampsia across the three trimesters of pregnancy. In brief, women of at least 18 years of age in their first trimester of a singleton pregnancy were recruited to the study after providing their informed consent and under institutional review board (IRB)approved protocols. Whole blood, plasma and urine samples, and vaginal swabs were collected throughout pregnancy and processed to generate immunome, transcriptome, microbiome, lipidome, proteome and metabolome datasets. After aligning omics and dropping features with missing or only homogeneous values, 32 samples with 16,897 features where obtained.
The pregnancy dataset^{6} contains 68 samples from 17 pregnancies with four samples per woman in the first, second and third trimesters as well as postpartum, respectively. Each sample contains immunome, transcriptome, microbiome, proteome and metabolome measurements obtained simultaneously. In brief, women of at least 18 years of age in their first trimester of a singleton pregnancy were recruited to the study after providing their informed consent and under IRBapproved protocols. Whole blood, plasma and serum samples, and vaginal, stool, saliva and gum swabs were collected throughout pregnancy and processed to generate immunome, transcriptome, microbiome, proteome and metabolome datasets. After aligning omics and dropping features with missing or only homogeneous values, 32,211 features where obtained.
The cancer dataset contains samples from 443 patients with gastric adenocarcinoma^{58} and 185 patients with esophageal carcinoma^{59}, for a total of 628 samples obtained via the LinkedOmics platform^{25}. In brief, fresh frozen tumor samples and accompanying healthy tissue were collected from patients after providing their informed consent and under IRBapproved protocols. Samples were used to generate DNA methylation profiling at the CpGsite and gene levels (methylation CpGsite level, HM450K; methylation gene level, HM450K), wholeexome sequencing (mutation gene level), messenger RNA sequencing (HiSeq, gene level), reversephase protein array (analyte level) and somatic copy number variation (gene level, logratio) datasets. After aligning omics and dropping features with missing or only homogeneous values, the dataset consisted of samples from 258 patients. For our runtime and memory experiments, we sample increasing numbers of features (25%, 50% and 100%).
The singlecell dataset^{1} contains 68 mass cytometry samples from 17 pregnancies with four samples per woman in the first, second and third trimesters as well as postpartum, respectively. In brief, women of at least 18 years of age in their first trimester of a singleton pregnancy were recruited to the study after providing their informed consent and under IRBapproved protocols. Whole blood samples were collected throughout pregnancy and processed to generate an immunome dataset. For the benchmark experiments, samples from the third trimester were used. We process the data by sampling 10,000/30,000 cells from each of the 20 cell types, resulting in a dataset with 200,000/600,000 cells and 10 functional markers per cell.
We also add one more dataset (sim) that corresponds to 400,000 features and 500 samples to test larger sample sizes. The data are generated randomly.
Experimental settings for runtime and memory analysis
Experiments were repeated from 3 to 10 times depending on their runtime, the first sample was always dropped (to account for burnins, for example, for Julia’s JIT compiler), and respective medians are reported. No substantial runtime or memory fluctuations were observed. The experiments were run on a bare metal server with two AMD EPYC 7452 32Core Processors and hyperthreading enabled amounting to 128 processing units. The machine provided 314 GB of memory and ran on Ubuntu 20.04.1 LTS. We use Python 3.9.1 and R 4.0.3 with current packages installed from condaforge and Bioconductor. The employed Julia version was 1.5.3. Multithreading was disabled explicitly if not otherwise specified.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The preeclampsia dataset is available from a public repository^{60}. The multiomics pregnancy dataset is available from a public repository^{60}, and the original authors’ website^{6}. Intermediate data to produce Fig. 2 are provided through a public repository^{60}. The cancer dataset is derived from a multiomics study available from LinkedOmics (http://linkedomics.org/data_download/TCGASTAD/). In particular, we integrate the datasets methylation (CpGsite level, HM450K), methylation (gene level, HM450K), mutation (gene level), RNA sequencing (HiSeq, gene level), reversephase protein array (analyte level) and somatic copy number variation (gene level, logratio). The singlecell dataset used to derive the benchmark dataset single cell and to support the findings is available from FlowRepository (http://flowrepository.org/id/FRFCMZY3Q). Preprocessed data for benchmarking as well as intermediate data to produce Fig. 3 are provided through a public repository^{60}. We provide source data for all figures and tables, as well as download instructions and preprocessing scripts through a public repository^{60,61} and via https://nalab.stanford.edu/corals/.
Code availability
The complete code for CorALS, code to reproduce all experiments and figures in this paper, and links and instructions to prepare the corresponding datasets are available in a public repository^{61}, and are listed at https://nalab.stanford.edu/corals/.
References
Aghaeepour, N. et al. An immune clock of human pregnancy. Sci. Immunol. 2, eaan2946 (2017).
Hasin, Y., Seldin, M. & Lusis, A. Multiomics approaches to disease. Genome Biol. 18, 83 (2017).
Preece, S. J., Goulermas, J. Y., Kenney, L. P. & Howard, D. A comparison of feature extraction methods for the classification of dynamic activities from accelerometer data. IEEE Trans. Biomed. Eng. 56, 871–879 (2008).
Jensen, P. B., Jensen, L. J. & Brunak, S. Mining electronic health records: towards better research applications and clinical care. Nat. Rev. Genet. 13, 395–405 (2012).
De Francesco, D. et al. Datadriven longitudinal characterization of neonatal health and morbidity. Sci. Transl. Med. 15, eadc9854 (2023).
Ghaemi, M. S. et al. Multiomics modeling of the immunome, transcriptome, microbiome, proteome and metabolome adaptations during human pregnancy. Bioinformatics 35, 95–103 (2019).
Subramanian, I., Verma, S., Kumar, S., Jere, A. & Anamika, K. Multiomics data integration, interpretation, and its application. Bioinform. Biol. Insights 14, 1177932219899051 (2020).
Saccenti, E., Hendriks, M. H. & Smilde, A. K. Corruption of the pearson correlation coefficient by measurement error and its estimation, bias, and correction under different error models. Sci. Rep. 10, 438 (2020).
Benson, A., Gleich, D. F. & Leskovec, J. Higherorder organization of complex networks. Science 353, 163–166 (2016).
Nassar, H., Kennedy, C., Jain, S., Benson, A. R. & Gleich, D. F. Using cliques with higherorder spectral embeddings improves graph visualizations. In Proc. Web Conference 2020 2927–2933 (Association for Computing Machinery, 2020).
Rao, J., Zhou, X., Lu, Y., Zhao, H. & Yang, Y. Imputing singlecell RNAseq data by combining graph convolution and autoencoder neural networks. iScience 24, 102393 (2021).
Wang, J. et al. scGNN is a novel graph neural network framework for singlecell RNAseq analyses. Nat. Commun. 12, 1882 (2021).
Traxl, D., Boers, N. & Kurths, J. Deep graphs—a general framework to represent and analyze heterogeneous complex systems across scales. Chaos 26, 065303 (2016).
Chang, D.J., Desoky, A. H., Ouyang, M. & Rouchka, E. C. Compute pairwise Manhattan distance and Pearson correlation coefficient of data points with GPU. In 10th ACIS International Conference on Software Engineering, Artificial Intelligences, Networking and Parallel/Distributed Computing 501–506 (IEEE, 2009).
Kijsipongse, E., Suriya, U., Ngamphiw, C. & Tongsima, S. Efficient large Pearson correlation matrix computing using hybrid MPI/CUDA. In Eighth International Joint Conference on Computer Science and Software Engineering 237–241 (IEEE, 2011).
Wang, S. et al. Optimising parallel R correlation matrix calculations on gene expression data using MapReduce. BMC Bioinformatics 15, 351 (2014).
Chilson, J., Ng, R., Wagner, A. & Zamar, R. Parallel computation of highdimensional robust correlation and covariance matrices. Algorithmica 45, 403–431 (2006).
Kim, S. ppcor: an R package for a fast calculation to semipartial correlation coefficients. Commun. Stat. Appl. Methods 22, 665 (2015).
Xiong, H., Brodie, M. & Ma, S. TOPCOP: mining topk strongly correlated pairs in large databases. In Sixth International Conference on Data Mining 1162–1166 (IEEE, 2006).
Langfelder, P. & Horvath, S. Fast R functions for robust correlations and hierarchical clustering. J. Stat. Softw. 46, 1–17 (2012).
Papadakis, M. et al. Rfast: a collection of efficient and extremely fast R functions. R package version 2.0.1 https://CRAN.Rproject.org/package=Rfast (2020).
Badr, H. S., Zaitchik, B. F. & Dezfuli, A. K. A tool for hierarchical climate regionalization. Earth Sci. Inform. 8, 949–958 (2015).
Schmidt, D. Cooperation: fast correlation, covariance, and cosine similarity. R package version 0.62 https://cran.rproject.org/package=coop (2019).
Han, X. et al. Differential dynamics of the maternal immune system in healthy pregnancy and preeclampsia. Front. Immunol. 10, 1305 (2019).
Vasaikar, S. V., Straub, P., Wang, J. & Zhang, B. Linkedomics: analyzing multiomics data within and across 32 cancer types. Nucleic Acids Res. 46, D956–D963 (2018).
Marić, I. et al. Early prediction and longitudinal modeling of preeclampsia from multiomics. Patterns 3, 100655 (2022).
Langfelder, P. & Horvath, S. Fast R functions for robust correlations and hierarchical clustering. J. Stat. Softw. 46, i11 (2012).
Papadakis, M. et al. Rfast: a collection of efficient and extremely fast R functions. R package version 2.0.3 https://CRAN.Rproject.org/package=Rfast (2021).
Schmidt, D. Cooperation: fast correlation, covariance, and cosine similarity. R package version 0.63 https://cran.rproject.org/package=coop (2021).
Badr, H. S., Zaitchik, B. F. & Dezfuli, A. K. A tool for hierarchical climate regionalization. Earth Sci. Inform. 8, 949–958 (2015).
Musser, D. R. Introspective sorting and selection algorithms. Softw. Pract. Exp. 27, 983–993 (1997).
Jardim, V. C., Santos, S. d. S., Fujita, A. & Buckeridge, M. S. Bionetstat: a tool for biological networks differential analysis. Front. Genet. 10, 594 (2019).
Tu, J.J. et al. Differential network analysis by simultaneously considering changes in gene interactions and gene expression. Bioinformatics 37, 4414–4423 (2021).
Ha, M. J., Baladandayuthapani, V. & Do, K.A. DINGO: differential network analysis in genomics. Bioinformatics 31, 3413–3420 (2015).
McKenzie, A. T., Katsyv, I., Song, W.M., Wang, M. & Zhang, B. DGCA: a comprehensive r package for differential gene correlation analysis. BMC Syst. Biol. 10, 106 (2016).
Fukushima, A. DiffCorr: an R package to analyze and visualize differential correlations in biological networks. Gene 518, 209–214 (2013).
Siska, C., Bowler, R. & Kechris, K. The discordant method: a novel approach for differential correlation. Bioinformatics 32, 690–696 (2016).
Ghazanfar, S., Strbenac, D., Ormerod, J. T., Yang, J. Y. & Patrick, E. DCARS: differential correlation across ranked samples. Bioinformatics 35, 823–829 (2019).
Espinosa, C. et al. Datadriven modeling of pregnancyrelated complications. Trends Mol. Med. https://doi.org/10.1016/j.molmed.2021.01.007 (2021).
Maaten, Lvd. & Hinton, G. Visualizing data using tSNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Dean, J. & Ghemawat, S. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008).
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S. & Stoica, I. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing 1–10 (HotCloud 2010).
Bishara, A. J. & Hittner, J. B. Reducing bias and error in the correlation coefficient due to nonnormality. Educ. Psychol. Meas. 75, 785–804 (2015).
Epskamp, S. & Fried, E. I. A tutorial on regularized partial correlation networks. Psychol. Methods 23, 617 (2018).
Pearl, J. Bayesian networks. Department of Statistics, UCLA (2011).
Greenacre, M. & Primicerio, R. Multivariate Analysis of Ecological Data (Fundacion BBVA, 2014).
Anderson, E. et al. LAPACK Users’ Guide 3rd edn (Society for Industrial and Applied Mathematics, 1999).
Blackford, L. S. et al. An updated set of basic linear algebra subprograms (BLAS). ACM Trans. Math. Softw. 28, 135–151 (2002).
Martınez, C. Partial quicksort. In Proceedings of the Sixth Workshop on Algorithm Engineering and Experiments and the First Workshop on Analytic Algorithmics and Combinatorics 224–228 (2004).
Ram, P. & Gray, A. G. Maximum innerproduct search using cone trees. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining 931–939 (KDD 2012).
Omohundro, S. M. Five Balltree Construction Algorithms Technical Report TR89063 (International Computer Science Institute, 1989).
Curtin, R., March, W., Ram, P., Anderson, D., Gray, A. & Isbell, C. TreeIndependent DualTree Algorithms. In Proceedings of the 30th International Conference on Machine Learning 1435–1443 (ICML 2013).
Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442–451 (1975).
Benjamini, Y., Heller, R. & Yekutieli, D. Selective inference in complex research. Phil. Trans. R. Soc. A 367, 4255–4271 (2009).
Aumüller, M., Bernhardsson, E. & Faithfull, A. ANNBenchmarks: a benchmarking tool for approximate nearest neighbor algorithms. In Proceedings of the 10th International Conference on Similarity Search and Applications 34–49 (SISAP 2017).
Sawilowsky, S. S. New effect size rules of thumb. J. Mod. Appl. Stat. Methods 8, 26 (2009).
Romano, J., Kromrey, J. D., Coraggio, J., Skowronek, J. & Devine, L. Exploring methods for evaluating group differences on the nsse and other surveys: are the ttest and Cohen’s d indices the most appropriate choices? In Annual Meeting of the Southern Association for Institutional Research 1–51 (2006).
The Cancer Genome Atlas Research Network Comprehensive molecular characterization of gastric adenocarcinoma. Nature 513, 202–209 (2014).
The Cancer Genome Atlas Research Network Integrated genomic characterization of oesophageal carcinoma. Nature 541, 169–175 (2017).
Becker, M. et al. CorALS—intermediate data. Zenodo https://doi.org/10.5281/zenodo.7713898 (2023).
Becker, M. et al. CorALS—source code. Zenodo https://doi.org/10.5281/zenodo.7714039 (2023).
Acknowledgements
This work was supported by NIH R35GM138353 (N.A.), 1R01HL139844 (N.A., D.S., G.S., B.G., M.S.A.), the March of Dimes (N.A., D.S., G.S., B.G., M.S.A.), Burroughs Wellcome Fund 1019816 (N.A.), the Robertson Foundation (D.S), and the Bill and Melinda Gates Foundation INV001734, OPP1113682, INV003225 (N.A., D.S., G.S., B.G., M.S.A.), and the Alfred E. Mann Family Foundation (N.A.). The authors are solely responsible for the content of this article, which does not necessarily represent the official views of the US Department of Health and Human Services (HHS). The results published here are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.
Author information
Authors and Affiliations
Contributions
Based on the CRediT model, M.B., H.N. and N.A. were responsible for conceptualization; M.B., H.N. and M.X. for data curation; M.B., H.N. and C.E. for formal analysis; M.B. and N.A. for funding acquisition; M.B., H.N. and C.E. for investigation; M.B., H.N., C.E. and N.A. for methodology development and design; M.B. and N.A. for project administration; N.A. for providing resources; M.B., H.N. and M.X. for software; N.A. for supervision; M.B., C.E. and I.A.S. for validation; M.B. for visualization; M.B. and C.E. for writing the original draft; and all authors for reviewing and editing.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks Ali Rahnavard and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Ananya Rastogi and Kaitlin McCardle, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Sections 1–10, Algorithm 1, Figs. 1–6 and Tables 1–3.
Supplementary Data 1
Full correlation matrix calculation—runtime and memory comparison. CorALS outperforms baselines as well as specialized methods broadly used scientific computing platforms. Runtime is reported as hours:minutes:seconds. The memory requirements are given in GB and measure the internal memory (RAM) being used by the methods. Multicore refers to 64 cores. Dashes represent the lack of runtime measurements for examples exceeding our server resources. Bolded entries mark estimated memory consumption for examples exceeding our server resources.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Becker, M., Nassar, H., Espinosa, C. et al. Largescale correlation network construction for unraveling the coordination of complex biological systems. Nat Comput Sci 3, 346–359 (2023). https://doi.org/10.1038/s4358802300429y
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s4358802300429y
This article is cited by

Omics correlation for efficient network construction
Nature Computational Science (2023)