Abstract
Genes are linked by underlying regulatory mechanisms and by jointly implementing biological functions, working in coordination to perform different tasks in the cell. Assessing the coordination level between genes from single-cell transcriptomic data, without a priori knowledge of the map of gene regulatory interactions, is a challenge. A ‘top-down’ approach has recently been developed to analyze single-cell transcriptomic data by evaluating the global coordination level between genes (called GCL). Here, we systematically analyze the performance of the GCL in typical scenarios of single-cell RNA sequencing (scRNA-seq) data. We show that an individual anomalous cell can have a disproportionate effect on the GCL calculated over a cohort of cells. In addition, we demonstrate how the GCL is affected by the presence of clusters, which are very common in scRNA-seq data. Finally, we analyze the effect of the sampling size of the jackknife procedure on the GCL statistics. The manuscript is accompanied by a description of a custom-built Python package for calculating the GCL. These results provide practical guidelines for properly preprocessing and applying the GCL measure to transcriptional data.
Introduction
Most of the functions of living cells are carried out by interacting genes, whose interactions are dictated by delicate gene regulatory networks^{1,2}. Characterizing these networks and their relation to different phenotypes is of great scientific interest^{3,4}. Recent progress in experimental and bioinformatic techniques provides large datasets of single-cell transcriptomic profiles and has the potential to advance the study of gene regulatory networks associated with specific tissues, cell types and conditions^{5,6,7,8,9}. However, complete reconstruction of the gene regulatory network is still a major challenge, not only because of the huge number of functional genes^{10,11,12}, but also because of the inherent stochasticity of such systems^{13}. Even so, certain important and general features of this network can sometimes be extracted without fully inferring it. For example, the network connectivity, i.e., the fraction of actual gene–gene interactions among all possible ones, may have important implications for general processes in the cells, as it can be related to the percolation properties of the network, which affect signal transmission^{2}. Furthermore, in several processes, such as aging or cancer, cellular gene regulatory networks are subject to alterations that are stochastic in nature, so the affected elements differ across individual cells of the same tissue^{14}. In such cases, the tissue-level effect may be captured better by general measures than by detailed network inference.
To tackle this difficulty we have recently introduced the Global Coordination Level (GCL) measure in the context of aging research^{15}. Using this new method, we were able to show that the inherent stochastic processes in aging cells are associated with a decrease in the coordination between genes, a pattern that is observed consistently across different cell types and organisms. The GCL measures the average multivariate dependency between the expression levels of random subsets of genes in single-cell RNA-seq data. A key advantage of the GCL is that it is a top-down measure, i.e., it does not require the complete reconstruction of the gene regulatory network. This is in contrast to network analysis of the gene regulatory relations extracted in a bottom-up approach, e.g., using co-expression analysis^{1}. By avoiding the network reconstruction, the GCL also has the advantage that it does not assume pairwise interactions or a specific form of relation (e.g., linear relations), and it does not assume the same type of interaction for all the pairs of interacting genes. The GCL was also found useful in distinguishing between cell-to-cell variability induced by biological processes and variability induced by technical noise in simulations of cellular networks^{16}.
In this manuscript, we systematically investigate several key aspects of the GCL and, based on this analysis, provide precise instructions and recommendations on how to use this method on scRNA-seq data, including the required preprocessing and filtering steps and the recommended parameter selections. As mentioned, the GCL works by calculating the general multivariate dependencies between the genes for a cohort of cells. It does so by a repeated procedure of randomly selecting gene subsets and calculating the distance correlation between them^{17}. Averaging over many such subsets of genes yields a single numerical value (the GCL) which evaluates the total dependencies between the genes. Before doing so, however, several pre- and post-processing steps must be taken into account. There are three major steps: (1) Clustering cells into homogeneous groups. (2) Filtering cells that differ significantly from the other cells (‘outliers’), or cells which are too similar to each other (‘inliers’). (3) Performing jackknife analysis to ensure the stability of the results. These steps are required because, as we show, the GCL (like other correlation measures) is sensitive to unusual cells and heterogeneous cohorts, while scRNA-seq data is typically sparse, noisy and characterized by outliers and clusters. Based on a systematic analysis of the effect of the above steps, the recommended guidelines presented here are important to ensure stable and reliable results of the GCL measure.
Material and methods
Global coordination level (GCL)
As mentioned in the introduction, the GCL is a ‘top-down’ computational method to evaluate the system-wide multivariate dependency between the genes, by calculating the distance correlation of random gene sets. A set of M measured cells with N genes is represented by the matrix \(X\in {\mathbb {R}}^{N\times M}\), where each column vector \(\mathbf{x}^{(\nu )}\) (\(\nu =1\dots M\)) is the measurement of an individual cell \(\nu\), and each element within the vector, \(x_i^{(\nu )}\) (\(i=1\dots N\)), is the measured expression value of gene i. The GCL is then computed as follows. The genes (rows of X) are divided into two random complementary parts G and \(\overline{G}\). The two parts are represented by the matrices \(X_G\in \mathbb {R}^{N/2 \times M}\) and \(X_{\overline{G}}\in \mathbb {R}^{N/2 \times M}\) such that \(X_G\) represents one randomly selected subset of the genes of size N/2, while \(X_{\overline{G}}\) represents the rest of the genes (\(G\cap \overline{G}=\varnothing\) and \(G \cup \overline{G}=\{1,\dots ,N\}\)). Then, we calculate the bias-corrected distance correlation (bcdCorr)^{17} between groups G and \(\overline{G}\). The bcdCorr, a refined version of the distance correlation (dCorr) measure^{18}, evaluates the level of dependence between two high-dimensional variables by testing how the distance between the samples, with respect to one (high-dimensional) variable, is changed compared to the distance between the same samples with respect to the other (high-dimensional) variable. Accordingly, when applied to gene-expression data, the bcdCorr is a measure of the dependency level between the gene sets G and \(\overline{G}\). A short description of how to calculate the bcdCorr between two variables is presented below. The process of calculating the bcdCorr is repeated m times, for different random divisions of the genes. Finally, the GCL is defined as the average across these divisions, i.e.,
\[\mathrm{GCL}=\frac{1}{m}\sum_{k=1}^{m}\mathrm{bcdCorr}\left(X_G^k, X^k_{\overline{G}}\right),\]
where all m random divisions of the data \((X_G^k, X^k_{\overline{G}})\) are independently chosen. For a large homogeneous cohort, the empirical GCL is bounded between zero and one: zero corresponds to the case of independent gene expression, whereas a significantly non-zero GCL reflects coordinated transcriptional expression, which could be interpreted as a result of underlying molecular dynamics, such as gene-to-gene regulatory interactions.
Calculating the bcdCorr
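Here we briefly outline the procedure for calculating the empirical bcdCorr between two variables, as defined by Székely and Rizzo^{17}. Consider M observations of two high-dimensional variables \(X_i\in \mathbb {R}^p\) and \(Y_i\in \mathbb {R}^q\), \(i=1,\dots ,M\), where \(X_i=(X_{i, 1}, \dots , X_{i,p})\) and \(Y_i=(Y_{i, 1}, \dots , Y_{i, q})\). Note that q does not necessarily have to be equal to p (i.e., the two variables may have unequal dimensions). The M observations are represented by the \(p\times M\) matrix X and the \(q\times M\) matrix Y. The empirical \(\mathrm {bcdCorr}(X, Y)\) is defined as
\[\mathrm{bcdCorr}(X,Y)=\frac{\mathcal{V}^*_M(X,Y)}{\sqrt{\mathcal{V}^*_M(X,X)\,\mathcal{V}^*_M(Y,Y)}},\]
where
\[\mathcal{V}^*_M(X,Y)=\frac{1}{M(M-3)}\left[\sum_{i,j=1}^{M}A^*_{i,j}B^*_{i,j}-\frac{M}{M-2}\sum_{i=1}^{M}A^*_{i,i}B^*_{i,i}\right],\]
and \(A^*_{i, j}\) and \(B^*_{i, j}\) are matrices defined as
\[A^*_{i,j}=\begin{cases}\dfrac{M}{M-1}\left(A_{i,j}-\dfrac{a_{ij}}{M}\right), & i\ne j,\\[2mm] \dfrac{M}{M-1}\left(\bar{a}_i-\bar{a}\right), & i=j,\end{cases}\qquad B^*_{i,j}=\begin{cases}\dfrac{M}{M-1}\left(B_{i,j}-\dfrac{b_{ij}}{M}\right), & i\ne j,\\[2mm] \dfrac{M}{M-1}\left(\bar{b}_i-\bar{b}\right), & i=j.\end{cases}\]
\(A_{i, j}\) and \(B_{i, j}\) are matrices defined as
\[A_{i,j}=a_{ij}-\bar{a}_i-\bar{a}_j+\bar{a},\qquad B_{i,j}=b_{ij}-\bar{b}_i-\bar{b}_j+\bar{b},\]
where
\[a_{ij}=\left\Vert X_i-X_j\right\Vert ,\qquad \bar{a}_i=\frac{1}{M}\sum_{j=1}^{M}a_{ij},\qquad \bar{a}=\frac{1}{M^2}\sum_{i,j=1}^{M}a_{ij},\]
\(\left\Vert X\right\Vert = \langle X, X\rangle ^{1/2}\) is the Euclidean norm and \(b_{ij}\), \(\bar{b}_i\), \(\bar{b}_j\) and \(\bar{b}\) are defined similarly for Y. It can be shown that this estimator for the distance correlation is unbiased with respect to the dimensionality of \(X_i\) and \(Y_i\), namely, p and q.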
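The two definitions above (the bcdCorr of Székely and Rizzo and the GCL as its average over random gene divisions) can be illustrated with a minimal NumPy sketch. This is not the code of the published package: the function names `bcdcorr` and `gcl`, the default of 50 divisions and the convention of returning zero when a variance term is not positive are assumptions made for this illustration only.

```python
import numpy as np

def _modified_distance_matrix(Z):
    """Z has shape (dimensions, M). Returns the M x M matrix A* used by the bcdCorr."""
    M = Z.shape[1]
    # Pairwise Euclidean distances between the M columns (observations).
    sq = np.sum(Z ** 2, axis=0)
    a = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z.T @ Z, 0.0))
    np.fill_diagonal(a, 0.0)
    row_mean = a.mean(axis=1)          # a-bar_i (the zero diagonal is included)
    grand_mean = a.mean()              # a-bar
    A = a - row_mean[:, None] - row_mean[None, :] + grand_mean
    A_star = (M / (M - 1.0)) * (A - a / M)                    # off-diagonal terms
    np.fill_diagonal(A_star, (M / (M - 1.0)) * (row_mean - grand_mean))
    return A_star

def _v_star(A_star, B_star):
    """Bias-corrected distance covariance statistic built from two A*-type matrices."""
    M = A_star.shape[0]
    total = np.sum(A_star * B_star)
    diag = np.sum(np.diag(A_star) * np.diag(B_star))
    return (total - M / (M - 2.0) * diag) / (M * (M - 3.0))

def bcdcorr(X, Y):
    """Bias-corrected distance correlation; columns of X and Y are the same M samples."""
    A, B = _modified_distance_matrix(X), _modified_distance_matrix(Y)
    vxy, vxx, vyy = _v_star(A, B), _v_star(A, A), _v_star(B, B)
    if vxx <= 0 or vyy <= 0:           # assumed convention for degenerate input
        return 0.0
    return vxy / np.sqrt(vxx * vyy)

def gcl(X, n_divisions=50, rng=None):
    """Average bcdCorr over random splits of the genes (rows of X) into two halves."""
    rng = np.random.default_rng(rng)
    n_genes = X.shape[0]
    half = n_genes // 2
    values = []
    for _ in range(n_divisions):
        perm = rng.permutation(n_genes)
        values.append(bcdcorr(X[perm[:half]], X[perm[half:]]))
    return float(np.mean(values))
```

For a genes-by-cells array `X`, `gcl(X, n_divisions=100, rng=0)` returns one GCL value; the estimator needs at least four cells because of the \(M-3\) factor.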
Clusters detection
Methods for identifying and removing clusters from data are numerous and well studied. In this manuscript, we advise detecting and removing clusters in the following manner. We apply the k-means algorithm while varying the total number of clusters, \(1<K<10\). The distance between the samples is measured using the Spearman dissimilarity.
We follow this with a silhouette coefficient analysis that determines the optimal number of clusters. For each of the clusters \(C_k\) (\(k=1\dots K\)), we calculate the mean distance, \(a_i\), between each cell \(\mathbf{x}^i\) in the cluster and all the other cells which belong to the same cluster, \(\mathbf{x}^j\) (\(j\in C_k, j\ne i\)), namely,
\[a_i=\frac{1}{\left|C_k\right|-1}\sum_{j\in C_k,\, j\ne i}\mathrm{Spearman}(\mathbf{x}^i, \mathbf{x}^j),\]
where \(\left|C_k\right|\) is the number of samples in cluster k, and \(\mathrm {Spearman}(\mathbf{x}^i, \mathbf{x}^j)\) is the distance between cells i and j based on the Spearman dissimilarity. Additionally, we calculate the smallest mean distance between cell i and the cells of any other cluster, of which i is not a member, \(b_i\),
\[b_i=\min_{l\ne k}\frac{1}{\left|C_l\right|}\sum_{j\in C_l}\mathrm{Spearman}(\mathbf{x}^i, \mathbf{x}^j).\]
The silhouette score of each cell, \(s_i^K\) (\(i=1\dots M\)), is then defined as
\[s_i^K=\frac{b_i-a_i}{\max (a_i, b_i)},\]
where the superscript K indicates that the value is calculated for this particular choice of total number of clusters. The value of the silhouette score is bounded, \(-1\le s_i^K\le 1\). Values close to 1 mean that the data is clustered correctly with respect to cell i. We average the silhouette value across all the cells to get a unified value
\[S^K=\frac{1}{M}\sum_{i=1}^{M}s_i^K.\]
Finally, the optimal value of K is chosen by maximizing the value of \(S^K\) with 10 realizations of the k-means algorithm. In general, as each cluster detection method has its own advantages and disadvantages, we recommend supplementing the detection step with a visual inspection using dimensionality reduction analysis, such as t-SNE.
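The silhouette-based selection of K can be sketched as follows. The sketch assumes that the Spearman dissimilarity is one minus the Spearman rank correlation between two cells, and that the cluster labels for each K have already been produced by some k-means implementation; the helper names `spearman_dissimilarity`, `mean_silhouette` and `choose_k` are hypothetical.

```python
import numpy as np

def average_ranks(v):
    """Ranks of a 1-D array, with tied values sharing their average rank."""
    _, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)                 # rank of the last member of each tie group
    return (last_rank - (counts - 1) / 2.0)[inverse]

def spearman_dissimilarity(X):
    """X is genes x cells. Returns the cells x cells matrix 1 - Spearman correlation."""
    ranks = np.apply_along_axis(average_ranks, 0, X)   # rank the genes within each cell
    return 1.0 - np.corrcoef(ranks.T)

def mean_silhouette(D, labels):
    """Average silhouette score S^K for a distance matrix D and labels (two or more clusters)."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    scores = np.zeros(len(labels))
    for i, k in enumerate(labels):
        same = labels == k
        if same.sum() == 1:
            continue                              # singleton cluster: score left at 0
        a = D[i, same].sum() / (same.sum() - 1)   # D[i, i] is 0, so it does not contribute
        b = min(D[i, labels == l].mean() for l in clusters if l != k)
        if max(a, b) > 0:
            scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

def choose_k(D, labelings):
    """labelings maps each candidate K to its label vector; returns the K with the largest S^K."""
    return max(labelings, key=lambda K: mean_silhouette(D, labelings[K]))
```

Computing the silhouette from a precomputed matrix keeps the dissimilarity measure interchangeable, so the same functions apply if another measure replaces the Spearman dissimilarity.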
Effect of an individual cell on the GCL
In the “Removal of abnormal cells” section of this manuscript, we measure the effect of a single outlier or inlier on the GCL value. The relative effect \(\Delta GCL_i\) of cell i on the cohort of cells is defined as
where ‘\(\mathrm {GCL\ with\ cell\ } i\)’ is the GCL of the entire cohort and ‘\(\mathrm {GCL\ without\ cell\ } i\)’ is the GCL of the cohort but with cell i excluded. In general, outliers and inliers tend to increase the GCL value of the cohort (see e.g. Figure 1), making the \(\Delta GCL_i\) positive.
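A leave-one-cell-out calculation of this kind can be sketched as below. The sketch assumes that the relative difference is normalized by the GCL without cell i; this normalization is an assumption of the example. The function `delta_gcl` and its `gcl_fn` argument (any callable that maps a genes-by-cells matrix to a GCL value) are hypothetical names.

```python
import numpy as np

def delta_gcl(X, i, gcl_fn):
    """Relative change of the GCL caused by cell i (column i of the genes x cells matrix X)."""
    with_cell = gcl_fn(X)
    without_cell = gcl_fn(np.delete(X, i, axis=1))   # same cohort, cell i excluded
    # Assumed normalization: relative to the GCL of the cohort without cell i.
    return (with_cell - without_cell) / without_cell
```

With this sign convention, a cell that inflates the GCL of the cohort yields a positive value.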
Transcription data sets used in this work
In the demonstrations presented here, we used single-cell RNA-seq datasets from the following sources (Table 1).
Numerical model for synthetic gene expression data
In the Results we investigate the effect of the presence of clusters in the data on the GCL value with synthetic gene expression data obtained from simulated numerical models of gene regulatory dynamics. The expression profile \(\mathbf{x}^{(\nu )}\) of cell \(\nu\) is modelled as the steady state of a set of coupled ordinary differential equations (ODEs), representing the gene regulatory dynamics^{23,24,25}. Specifically, we use the following set of ODEs,
where we set the number of genes, \(N=200\). The first term expresses the self-degradation of gene i. The second term is responsible for the growth of \(x^{(\nu )}_i\) as a Michaelis-Menten kinetics function^{23} of \(x^{(\nu )}_j\), i.e., gene i is activated by gene j. The activation relation can be represented as a link in the gene regulatory network (GRN) with weight \(w_{i,j}\). In our simulation we use GRNs with random links between the nodes, i.e., each pair of genes is connected with a constant probability in the form of an Erdős-Rényi network with an average degree of three. Finally, the GRN weights \(w_{i,j}\) (for existing links) are randomly selected from the uniform distribution \(\mathbb {U}(0,2)\). To create different cells from the same GRN, a random subset of genes in each cell is set as inoperative, i.e., their expression levels are set to zero. Specifically, in our simulations we randomly choose 5 out of the \(N=200\) genes to be inoperative. The expression profile of each cell \(\mathbf{x}^{(\nu )}\) is generated by solving the GRN differential equations with the same initial conditions (\(x_i^{(\nu )}(t=0)=0.5\)) and evaluating the steady state using the ode45 MATLAB function.
Using this procedure, we create two distinct cohorts of cells, which differ by their GRNs. Each cohort, A and B, has its own weight matrix, \(w^A_{i,j}\) and \(w^B_{i,j}\). The cohorts have the same number of cells (\(M=50\)) and genes (\(N=200\)). The GRNs of the two cohorts have a fraction \(1-p\) of shared links, i.e., the GRN of cohort A has a fraction p of links that are not present in B and vice versa. If \(p=0\) the GRNs of the cohorts are identical. As p is increased, the number of shared links decreases, and when \(p=1\) the GRNs of the two cohorts have no shared links.
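The generation of one synthetic cell can be sketched in Python under explicit assumptions. The sketch assumes the form \(\mathrm{d}x_i/\mathrm{d}t=-x_i+\sum_j w_{i,j}\,x_j/(1+x_j)\) for the degradation and Michaelis-Menten activation terms, treats links as directed with probability \(3/(N-1)\), keeps the inoperative genes clamped at zero during the integration, and replaces ode45 with a plain forward-Euler loop. None of these choices is taken from the original simulation code, and the function names are hypothetical.

```python
import numpy as np

def random_grn(n_genes, mean_degree=3.0, rng=None):
    """Weight matrix of a random directed GRN; W[i, j] > 0 means gene j activates gene i."""
    rng = np.random.default_rng(rng)
    p_link = mean_degree / (n_genes - 1)
    mask = rng.random((n_genes, n_genes)) < p_link
    np.fill_diagonal(mask, False)                       # no self-activation links
    return mask * rng.uniform(0.0, 2.0, size=(n_genes, n_genes))

def steady_state_cell(W, n_inoperative=5, x0=0.5, dt=0.05, n_steps=4000, rng=None):
    """Integrate the assumed dynamics to (approximate) steady state for one cell."""
    rng = np.random.default_rng(rng)
    n_genes = W.shape[0]
    off = rng.choice(n_genes, size=n_inoperative, replace=False)
    x = np.full(n_genes, x0)
    x[off] = 0.0
    for _ in range(n_steps):
        dx = -x + W @ (x / (1.0 + x))                   # degradation + saturating activation
        x = x + dt * dx
        x[off] = 0.0                                    # inoperative genes stay silent
    return x
```

A cohort would then be assembled by calling `steady_state_cell` repeatedly with the same `W` and a different random set of inoperative genes for each cell.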
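After generating the cells in each cohort, the sample-to-sample mean distance, D, between the cohorts is measured. It is defined as the mean distance between the cohorts divided by the sum of the mean distances inside each cohort, i.e.,
\[D=\frac{\langle d\rangle _{AB}}{\langle d\rangle _{AA}+\langle d\rangle _{BB}},\]
where,
\[\langle d\rangle _{AB}=\frac{1}{M^2}\sum_{i\in A}\sum_{j\in B}d(\mathbf{x}^{i},\mathbf{x}^{j}),\qquad \langle d\rangle _{AA}=\frac{1}{M(M-1)}\sum_{i,j\in A,\, i\ne j}d(\mathbf{x}^{i},\mathbf{x}^{j}),\]
\(\langle d\rangle _{BB}\) is defined analogously for cohort B, and \(d(\cdot ,\cdot )\) denotes the distance between the expression profiles of two cells.
When the cohorts are generated with the same GRN (i.e., \(p=0\)), \(D=0.5\). As p is increased, D gets larger. The dependency of the GCL value on these parameters is presented in the Results section.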
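The ratio D can be sketched in a few lines. The Euclidean distance between expression profiles is an assumption of this example (the distance measure is not restated here), and `cohort_separation` is a hypothetical name.

```python
import numpy as np

def _pairwise_distances(XA, XB):
    """Euclidean distances between every column of XA and every column of XB."""
    diff = XA[:, :, None] - XB[:, None, :]
    return np.sqrt(np.sum(diff ** 2, axis=0))

def _mean_within(X):
    """Mean distance between distinct cells of one cohort (self-distances excluded)."""
    M = X.shape[1]
    return _pairwise_distances(X, X).sum() / (M * (M - 1))

def cohort_separation(XA, XB):
    """Mean between-cohort distance divided by the sum of the two within-cohort means."""
    between = _pairwise_distances(XA, XB).mean()
    return between / (_mean_within(XA) + _mean_within(XB))
```

Two cohorts drawn from the same distribution give a value close to 0.5, in line with the \(p=0\) case described above.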
Results
Here we describe the necessary preprocessing steps and the application of the GCL measure. Specifically, this includes: (1) Removal of abnormal cells, (2) Clustering analysis, and (3) Jackknife resampling.
Removal of abnormal cells
In our previous work^{15}, during the analysis of single-cell transcriptomic data, we noticed a significant sensitivity of the GCL to single abnormal cells. The sensitivity is not only towards outliers, i.e., cells which are far away from the rest of the cells, but also towards cells which are unusually similar to other cells, which we refer to here as ‘inliers’. The abnormality of the cells with respect to the majority of the cohort can be observed by examining the distribution of ‘distances’ (Spearman dissimilarity) between all the cells (Fig. 1a). While the cell-to-cell distances among most of the cells are distributed near a characteristic value, there is a small number of cells associated with exceptionally large or small values compared with the mean (red and blue shaded areas, respectively). Figure 1b shows the effects of outliers and inliers in a typical cohort of cells. The data is a cohort of short-term hematopoietic stem cells (ST-HSCs) from young C57 mice^{19}. We measure the effect of each individual cell on the GCL, \(\Delta \mathrm {GCL}_i\), defined as the relative difference between the GCL with and without cell i (see “Material and methods” section). For each cell we calculate the minimal distance between it and any of the other cells in the cohort. The minimal distance of an outlier would be relatively large, whilst the minimal distance of an inlier would be relatively small. When plotting \(\Delta \mathrm {GCL}\) versus the minimal distance, a characteristic parabola-like shape is apparent (Fig. 1b), where both outliers and inliers have a significant relative effect on the value of the GCL (and outliers seem to be more significant than inliers). To ensure that the GCL is not biased by the effect of a few outlier or inlier cells, we recommend filtering them out prior to the analysis.
This can be done, for example, by examining the distribution of distances between all the cells, and removing cells for which some cell-to-cell distance lies more than two standard deviations above or below the mean distance between the cells (Fig. 1a). Note the inherent difference between outliers and inliers with respect to the cohort. When an inlier is present in the data, it means that there are two cells which are abnormally similar to each other. In this case, it is sufficient to remove only one of the two cells in order to increase the homogeneity of the cohort. When an outlier is present, there is no such arbitrary choice in the filtering process.
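One possible reading of this filtering rule is sketched below. It flags a cell as an outlier when even its nearest neighbour is more than two standard deviations above the mean pairwise distance, and as an inlier when its nearest neighbour is more than two standard deviations below the mean, removing only one cell of each abnormally similar pair. The function name, the use of the nearest-neighbour distance and the tie-breaking rule for inlier pairs are assumptions of this example.

```python
import numpy as np

def flag_abnormal_cells(D, n_sigma=2.0):
    """D is a symmetric cells x cells distance matrix. Returns (outliers, inliers) index lists."""
    M = D.shape[0]
    upper = np.triu_indices(M, k=1)
    mu, sigma = D[upper].mean(), D[upper].std()     # statistics of all pairwise distances

    masked = D.astype(float).copy()
    np.fill_diagonal(masked, np.inf)                # ignore self-distances
    nearest = masked.argmin(axis=1)
    min_dist = masked.min(axis=1)

    outliers = [int(i) for i in np.flatnonzero(min_dist > mu + n_sigma * sigma)]

    inliers = set()
    for i in np.flatnonzero(min_dist < mu - n_sigma * sigma):
        # Keep one cell of each abnormally similar pair: skip i if its partner is already removed.
        if int(nearest[i]) not in inliers:
            inliers.add(int(i))
    return outliers, sorted(inliers)
```

Because the mean and standard deviation are themselves affected by extreme cells, it may be useful to apply such a filter iteratively and to inspect the distance histogram after each pass.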
To examine this effect thoroughly, we introduce a simulated cell into a cohort of real cells and evaluate its \(\Delta \mathrm {GCL}\), as follows. First, we choose two random cells from the cohort of real cells (blue and black circles in Fig. 1c). Then, we replace one of these cells with a simulated cell, constructed as a linear combination of the two cells, with a tuning parameter, \(0\le \lambda \le 1\). When \(\lambda =0\) the simulated cell is identical to the replaced cell. As \(\lambda\) increases towards 1, the simulated cell becomes more similar to the other cell. For each value of \(\lambda\), we measure both the minimal distance and the \(\Delta \mathrm {GCL}\) associated with the simulated cell. When the minimal distance decreases, the effect on the GCL increases (blue lines in Fig. 1d). We repeat this process for 10 random choices of the initial cells.
We then perform a similar process on an outlier cell, by replacing one of the outlier cells (red circle in Fig. 1c) with a simulated cell, constructed as a linear combination of it and a random cell (the black circle in Fig. 1c). This time, in order to focus on the effect of the outlier with respect to the cohort and not on the similarity between two cells (the ‘inlier’ effect), we also remove the random cell from the analysis. This manipulation results in a simulated cell that, by tuning the \(\lambda\) value, moves along the trajectory between a normal cell and an outlier cell, i.e., when \(\lambda =0\) the simulated cell is identical to the randomly selected cell, and when \(\lambda =1\) the simulated cell is identical to the outlier cell. Figure 1d shows that when the minimal distance increases, the effect of the simulated cell on the GCL significantly increases. Both the inlier and the outlier effects, which are shown in Fig. 1d using simulated cells, demonstrate a pattern similar to the parabola-like shape observed for the real data.
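Both constructions rely on the same interpolation between two expression profiles, which can be written as a short sketch. The convex form \((1-\lambda )\,\mathbf{x}_{\mathrm{start}}+\lambda \,\mathbf{x}_{\mathrm{end}}\) is an assumption consistent with the limiting cases described in the text, and the helper names are hypothetical.

```python
import numpy as np

def simulated_cell(x_start, x_end, lam):
    """Linear combination that equals x_start at lam = 0 and x_end at lam = 1."""
    return (1.0 - lam) * np.asarray(x_start, dtype=float) + lam * np.asarray(x_end, dtype=float)

def cohort_with_simulated_cell(X, replaced, new_cell, removed=None):
    """Copy of the genes x cells matrix X with column `replaced` set to `new_cell`.

    If `removed` is given, that column is dropped afterwards (used in the outlier
    experiment, where the random partner cell is excluded from the analysis).
    """
    Y = np.array(X, dtype=float, copy=True)
    Y[:, replaced] = new_cell
    if removed is not None:
        Y = np.delete(Y, removed, axis=1)
    return Y
```

In the inlier experiment `x_start` is the replaced cell and `x_end` the other random cell; in the outlier experiment `x_start` is the random cell, `x_end` the outlier, and the random cell is passed as `removed`.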
Finally, we test the effects of outliers and inliers on three additional datasets from independent sources^{19,21,22} (Fig. 1e). The figure shows the GCL values of cohorts of cells before and after the removal of outliers/inliers. While it is clear that their presence tends to increase the GCL, the most pronounced effect is visible in the ST-HSC cohort from Ref.^{19} (Fig. 1a). There, the presence of a few (5) outliers/inliers can cause the GCL of the jackknife group to increase dramatically to the maximum value (1). In fact, a bimodal distribution can be observed when the outliers/inliers are present, which represents jackknife groups with and without some of these unusual cells.
The effect of clusters
Like other statistical inference tools, the GCL is also sensitive to heterogeneity in the data. A common source of heterogeneity is the presence of clusters in the cohort of cells. In this section, we detail how the presence of clusters affects the GCL value, and provide recommended guidelines on how to filter out clusters. In general, clusters in the cohort tend to increase the GCL values, sometimes even to extreme values close to 1. This effect is analogous to the biases of Pearson correlation values calculated over heterogeneous data. The goal of the GCL is to provide a sense of the strength of the interdependencies between the genes. To this aim, we wish to calculate the GCL for isolated homogeneous cohorts, in order to avoid a type of Simpson’s paradox, where the presence of two or more clusters in the data affects the apparent relations between the genes (see “Discussion”).
Figure 2a1–a2 shows this effect in a heterogeneous cohort, artificially induced by combining multipotent progenitor (MPP) cells from two different mice (ages 3 months (‘young’) and 22 months (‘old’)) (from Ref.^{20}, see “Material and methods”). Calculating the GCL over the entire cohort yields an extreme value which is very close to 1, whereas the GCL values of each separate group (young and old) are much more sensible, lower than 0.5 (Fig. 2a2). To verify that this change of the GCL values is not due to the change of sample size (number of cells), we also calculate it over a fourth group, which is composed of a random assortment of the same number of cells as one of the clusters, selected from both the young and old groups. Figure 2a2 shows that this group also has an artificially large GCL value. Figure 2b–d also shows the effect of heterogeneity, but for naturally occurring clusters (from Refs.^{19,20}). The clusters are marked with blue and green colors. The effect of the clusters on the GCL is evident (although less dramatic than in the artificial case in panel a). A mixed cohort will tend to have larger GCL values than any of the separate clusters.
We also verify this effect using numerical models of synthetic gene expression data (see “Material and methods”). Two cohorts, A and B, with increasingly different gene regulatory networks are generated, and their GCL is measured (Fig. 3). The cohorts have the same number of cells (\(M=50\)) and genes (\(N=200\)). When the GRNs of the two cohorts are identical, the GCL value of each cohort approximately equals the GCL of the joint cohort, which is composed of all the cells from both A and B. As the differences between the GRNs of the cohorts increase, the GCL of each individual cohort remains stable. This shows that the GCL, which measures the general dependencies between the genes, is not sensitive to small-scale differences. However, the GCL of the combined cohort increases. This is due to the heterogeneity in the data, which is a confounding factor that artificially increases the GCL, similar to what is observed in real data.
We have shown here how the presence of clusters artificially inflates the GCL value. In “Material and methods”, we detail our recommended method for detecting and filtering clusters in the data.
Jackknife resampling procedure
Generally, the stability of the GCL calculation, i.e., how sensitive it is to the exact cell composition of the cohort, is related to two factors: the homogeneity of the cell population, and the sample size (number of cells in the analyzed cohort). To evaluate the stability of the GCL for a given cohort of cells, we apply the jackknife resampling procedure^{26}. Specifically, we randomly resample s subsets of k cells, with repetition, and calculate the GCL of each of them. If the cell population is substantially homogeneous, the variance of the GCL values calculated over the different subsets will mainly reflect the finite-size effect. In contrast, if the cell population includes a small number of outlier cells (see “Removal of abnormal cells”), they will be sampled and affect the GCL values in only some of the subsets, increasing the variability of the GCL values. Therefore, the jackknife procedure is well-suited for estimating the stability of the GCL.
However, resampling subsets of cells, with repetitions, from a finite cohort raises the issue of choosing the overlap level between the different subsets. Thus, the predetermined value of the k parameter affects the variance of the GCL values. For large values of k, the overlap between the different subsets is high, and consequently the variance of the GCL values will be relatively small. On the other hand, for small values of k (a small number of cells in each subset), the GCL results are highly vulnerable to finite-size noise and hence the variance of the GCL values will be relatively large.
To test this effect, we systematically calculate the GCL values with the jackknife procedure using different values of k. The data presented in Fig. 4 contains the gene expression profiles of \(N=2000\) genes (those with the largest mean expression) from \(M=113\) LT-HSCs from a young C57 mouse^{19}.
Figure 4a shows the GCL distributions of the jackknife realizations for different values of k. While the average GCL is roughly independent of k, the distributions become increasingly narrow for larger values of k (specifically, when \(k=113\), all the subsets are identical and their GCL values converge to a single value). When plotting the standard deviation of the GCL distributions, \(\sigma (\mathrm {GCL})\), versus the number of cells, k, on a log-log plot, two regions can be observed (Fig. 4b). For \(k<k^*\), where \(k^*\approx 85\), \(\sigma (\mathrm {GCL})\sim k^{-1}\) (a linear relation on the log-log plot, red zone). For \(k>k^*\), \(\sigma (\mathrm {GCL})\) rapidly decreases as k approaches the cohort size (blue zone). The two regions correspond to the two effects mentioned above. For small values of k, the decrease of \(\sigma (\mathrm {GCL})\) is mainly due to the increased sample size of each jackknife realization. For large values of k, the large overlap between the resampled subsets becomes the dominant effect.
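A scaling of the form \(\sigma (\mathrm {GCL})\sim k^{-1}\) corresponds to a slope of \(-1\) on the log-log plot. The slope can be estimated with an ordinary least-squares fit in log space, as in the sketch below; `loglog_slope` is a hypothetical helper and the fit is meant to be restricted to the region \(k<k^*\).

```python
import numpy as np

def loglog_slope(k_values, sigma_values):
    """Least-squares slope of log(sigma) versus log(k), i.e. the power-law exponent."""
    log_k = np.log(np.asarray(k_values, dtype=float))
    log_sigma = np.log(np.asarray(sigma_values, dtype=float))
    slope, _intercept = np.polyfit(log_k, log_sigma, 1)
    return float(slope)
```

Comparing the fitted slope over successively larger ranges of k is one simple way to locate the crossover value beyond which the overlap effect dominates.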
These results suggest that \(k=k^*\) may be considered an optimal choice, as it decreases the ‘finite-size effect’ with a minimal ‘overlap effect’. Next, we test how the value of \(k^*\) depends on the cohort size. We repeat the same analysis as in Fig. 4b for smaller cohort sizes (selected from the same dataset). Figure 4c shows that the crossover value, \(k^*\), between the linear decay (on the log-log scale) and the rapid decrease of \(\sigma (\mathrm {GCL})\) can be approximated as \(\approx 75\%\) of the cohort size.
We repeat the analysis for shuffled data, with a much larger sample size (\(M=1000\)). The shuffling procedure removes the relationship between the genes, causing the average GCL to be equal to 0 (Fig. 4d). With a cohort size much larger than k, the overlap effect is negligible and the sample-size effect is dominant. As shown in Fig. 4e, the sample-size effect yields a \(k^{-1}\) dependency of the standard deviation of the GCL, similar to the observation in Fig. 4b. When analyzing cohorts of shuffled data with different sizes, we find that the \(k^*\) value is \(\approx 75\%\) of the cohort size (Fig. 4f).
To conclude, when analyzing the stability of the GCL results in other cohorts, we recommend repeating the analysis presented here to find the optimal value of k for the jackknife procedure. Otherwise, we recommend choosing \(k\approx 75\%\times M\) as a default value.
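The resampling procedure with the default choice \(k\approx 0.75\,M\) can be sketched as follows. The sketch assumes that each subset contains k distinct cells, so that the subsets overlap with each other but no cell is repeated within a subset; this is consistent with all subsets being identical when k equals the cohort size. The function `jackknife_gcl` and its `gcl_fn` argument (any callable returning the GCL of a genes-by-cells matrix) are hypothetical names.

```python
import numpy as np

def jackknife_gcl(X, gcl_fn, fraction=0.75, n_realizations=50, rng=None):
    """GCL values of `n_realizations` random subsets, each with k = fraction * M cells."""
    rng = np.random.default_rng(rng)
    M = X.shape[1]
    k = min(M, int(round(fraction * M)))
    values = np.empty(n_realizations)
    for r in range(n_realizations):
        cells = rng.choice(M, size=k, replace=False)   # k distinct cells per subset
        values[r] = gcl_fn(X[:, cells])
    return values
```

The returned vector can be summarized by its mean and standard deviation, or plotted as a histogram to check for the bimodality that unusual cells may produce.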
GCL script description
We publish on GitHub^{27} an open-source code package written in Python 3.9, which includes a tutorial and a working example of the GCL calculation as well as a GCL library. The tutorial demonstrates the calculation of the GCL for two scRNA-seq data sets of LT-HSCs from young (3 months) and old (22 months) mice from Ref.^{19}. These data sets are in CSV format, where rows represent the expression levels of the 3000 genes with the highest mean expression, and columns represent individual cells. The output of the tutorial is a figure showing the distributions of the jackknife realization values of both scRNA-seq data sets. The figure compares the two histograms, showing how different parameters of the input (‘percentage’, ‘jackknife realizations’ and ‘divisions number’) influence the results.
The library can be imported into an existing Python project or executed as a standalone program to calculate the GCL for given data (a vector of GCL values for the jackknife realizations). Different features of the GCL calculations can be manually chosen. For example, Jackknife_percentage determines the fraction of cells in each jackknife realization from the entire cohort, k/M; and Jackknifes determines the number of jackknife realizations, s. Figure 5 shows a typical output of the Python script for a cohort of LT-HSC cells. The GCL distributions become narrower for larger values of k, where k/M is set to 0.5, 0.7 and 0.9 in Fig. 5a–c, respectively. A second script, which only calculates the GCL using MATLAB, is also freely available^{28}.
Discussion
In this manuscript, we present general guidelines and recommended practices on how to appropriately utilize the GCL method on scRNA-seq datasets. A main consideration when applying any bulk analysis method to a cohort of cells, such as co-expression analysis or the GCL, is to ensure that the analyzed cohort is as homogeneous as possible. This does not mean that the transcription profiles of all the cells must be identical, or very similar to each other, but rather that the cell-to-cell variability should not be dominated by a small number of outlier cells or by the presence of distinct clusters. The presence of outliers or clusters may lead to spurious results when applying measures of gene–gene interrelations, such as co-expression analysis or the GCL. Therefore, we consider the removal of outliers and clusters as a standard preprocessing procedure and a necessary step in analyses that focus on the interrelations between genes.
The presence of ‘inliers’, i.e., cells which have unusually similar transcription profiles, is another condition that should be taken into account when calculating the GCL. Two very similar cells could be present due to biological effects, for example, if the sampling coincided with a recent division of the cell. We show that the GCL may be disproportionately affected by such cases in the cohort, with a tendency to artificially increase its value. We therefore recommend viewing such cells as another form of heterogeneity, and removing one of them before the calculation of the GCL.
We recommend accompanying the calculation of the GCL with a jackknife procedure, where the GCL is calculated for a number of resampled subsets of cells, yielding a distribution of values. There are two benefits to this step. First, it can serve as an additional layer for testing the homogeneity of the analyzed cohort. Specifically, if the bulk calculation of the GCL is disproportionately affected by individual cells, the subsets that include them will have elevated GCL values that may be observed by examining the distribution (see e.g. Fig. 1e). Such cells may be missed in the initial preprocessing steps described above. Note, however, that the jackknife procedure is not suited for detecting the presence of clusters, since this kind of heterogeneity is preserved in the resampled subsets, as demonstrated in Fig. 2. Second, when comparing GCL values from two different cohorts, the jackknife distributions are essential. For this purpose, a rational choice of the resampling size (the value of k) is required. We propose a methodology to balance between two extremes: a value of k that is too small introduces statistical noise due to the small sample size, whereas a value of k that is too close to the cohort size will not generate the representative variability of the GCL, due to the high overlap between the resampled subsets. Additionally, the jackknife resampling is an effective way to eliminate the potential bias due to a different number of cells when comparing two or more cohorts of different sizes, by calculating the GCL over resampled subsets of the same size.
The recommendation to apply the GCL on homogeneous cohorts may seem counterproductive to the usefulness of scRNA-seq in revealing the heterogeneity of the data. Indeed, the heterogeneity can uncover hidden information, such as identifying new subtypes of cells. Importantly, we distinguish between two types of heterogeneity: (1) heterogeneous cohorts that contain different cell types/subtypes, or individual outlier/inlier cells; (2) the natural cell-to-cell variability of gene expression profiles among cells of the same tissue, cell type and biological conditions. The goal of the GCL is to unveil underlying order, in terms of gene–gene coordination, from the latter type, i.e., natural cell-to-cell variability. For the GCL analysis, like other correlation analysis methods, heterogeneity of the first type in the analyzed data may be detrimental. It has the proclivity of generating spurious relations, e.g., strong observed correlations due to confounding factors (Simpson’s paradox). Therefore, the preprocessing steps should minimize the first type of heterogeneity while preserving the natural cell-to-cell variability.
The added value of the GCL with respect to the traditional bottom-up approaches can be illustrated by the following example. Consider an experiment that tests the effect of a given condition by comparing the scRNA-seq data of control and case cohorts. We can divide the apparent differences between these two cohorts into two kinds. The first kind is ‘small-scale’, i.e., gene-specific differences. They can be detected as differentially expressed genes (DEGs) between the case and control, or as ‘differential gene–gene interactions’, which may indicate an alteration of specific elements in the underlying regulatory mechanism. The second kind is ‘large-scale’ differences, i.e., differences that are not gene-specific. For example, during aging, cells accumulate stochastic genetic and epigenetic damage that affects each individual cell differently. The group effect of such random damage is not observed in the same genes across different cells, but rather may be observed as a general decline of the gene-to-gene dependency level. The GCL is a measure of this kind.
In practice, when comparing two cohorts we recommend first detecting any small-scale differences between them using the traditional approaches, e.g. by identifying the DEGs between them and filtering them out. Then, the GCL can be applied to the remaining genes, i.e., the non-differentially expressed genes. This is beneficial in two ways. On the one hand, it rules out the potential bias due to the effect of the DEGs on the GCL. On the other hand, if significant differences are observed between the GCL values calculated over the non-differentially expressed genes, then they may be interpreted as a result of large-scale alterations.
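The two-step workflow above can be sketched as follows. This is only a rough illustration under explicit assumptions: the per-gene Welch t-statistic with a fixed cutoff (`t_threshold`, a hypothetical parameter) is a crude stand-in for a dedicated differential-expression analysis, the helper names `welch_t` and `non_deg_mask` are invented for this example, and the data are synthetic.

```python
import numpy as np

def welch_t(case, control):
    """Per-gene Welch t-statistic; inputs are (n_cells, n_genes) arrays."""
    m1, m0 = case.mean(axis=0), control.mean(axis=0)
    v1 = case.var(axis=0, ddof=1) / case.shape[0]
    v0 = control.var(axis=0, ddof=1) / control.shape[0]
    return (m1 - m0) / np.sqrt(v1 + v0)

def non_deg_mask(case, control, t_threshold=4.0):
    """True for genes NOT flagged as differentially expressed (crude screen)."""
    return np.abs(welch_t(case, control)) < t_threshold

rng = np.random.default_rng(0)
control = rng.normal(size=(200, 30))
case = rng.normal(size=(200, 30))
case[:, :5] += 3.0  # genes 0-4 are shifted in the case cohort

keep = non_deg_mask(case, control)
case_rest = case[:, keep]
control_rest = control[:, keep]
# A coordination measure such as the GCL would now be computed separately
# on case_rest and control_rest, ideally with jackknife resampling.
```

Applying the same gene mask to both cohorts keeps the two matrices comparable, so any remaining difference in the coordination measure cannot be attributed to the filtered gene-specific shifts.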
As a final remark, the filtering process proposed here uses auxiliary algorithms (Silhouette, Spearman-based outlier removal, etc.) that might not be best suited to all cases. There is a plethora of cluster identification and outlier removal algorithms in the scientific literature, and it is certainly possible that one of them would be more advantageous than the ones presented here for other scRNA-seq datasets. For example, UMAP can replace t-SNE in the visual inspection step, DBSCAN can be used for cluster identification, and the Spearman matrix can be replaced by other dissimilarity measures, such as Euclidean distance, normalized mutual information, adjusted Rand index, Fowlkes–Mallows index or the Jaccard index. The goal of this manuscript is to demonstrate the sensitivity of the GCL to the structure of the data and to propose possible solutions.
The GCL is therefore a complementary measure to the existing classical toolset of gene-expression analysis. We hope that the GCL method, along with the guidelines recommended in this manuscript, will be useful in future analyses of scRNA-seq data.
References
Karlebach, G. & Shamir, R. Modelling and analysis of gene regulatory networks. Nat. Rev. Mol. Cell Biol. 9, 770–780 (2008).
Alon, U. An Introduction to Systems Biology: Design Principles of Biological Circuits (Chapman and Hall/CRC, 2006).
Levine, M. & Davidson, E. H. Gene regulatory networks for development. Proc. Natl. Acad. Sci. 102, 4936–4942 (2005).
Sorek, M., Balaban, N. Q. & Loewenstein, Y. Stochasticity, bistability and the wisdom of crowds: A model for associative learning in genetic regulatory networks. PLoS Comput. Biol. 9, e1003179 (2013).
Kolodziejczyk, A. A., Kim, J. K., Svensson, V., Marioni, J. C. & Teichmann, S. A. The technology and biology of single-cell RNA sequencing. Mol. Cell 58, 610–620 (2015).
Hwang, B., Lee, J. H. & Bang, D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. Mol. Med. 50, 1–14 (2018).
Potter, S. S. Single-cell RNA sequencing for the study of development, physiology and disease. Nat. Rev. Nephrol. 14, 479–492 (2018).
Fiers, M. W. et al. Mapping gene regulatory networks from single-cell omics data. Brief. Funct. Genom. 17, 246–254 (2018).
Sonawane, A. R., DeMeo, D. L., Quackenbush, J. & Glass, K. Constructing gene regulatory networks using epigenetic data. NPJ Syst. Biol. Appl. 7, 1–13 (2021).
Banf, M. & Rhee, S. Y. Computational inference of gene regulatory networks: Approaches, limitations and opportunities. Biochim. Biophys. Acta Gene Regul. Mech. 1860, 41–52 (2017).
Chai, L. E. et al. A review on the computational approaches for gene regulatory network construction. Comput. Biol. Med. 48, 55–65 (2014).
Mochida, K., Koda, S., Inoue, K. & Nishii, R. Statistical and machine learning approaches to predict gene regulatory networks from transcriptome datasets. Front. Plant Sci. 9, 1770 (2018).
Chalancon, G. et al. Interplay between gene expression noise and regulatory network architecture. Trends Genet. 28, 221–232 (2012).
Vijg, J. From DNA damage to mutations: All roads lead to aging. Ageing Res. Rev. 20, 101316 (2021).
Levy, O. et al. Age-related loss of gene-to-gene transcriptional coordination among single cells. Nat. Metab. 2, 1305–1315 (2020).
Vaknin, D., Amit, G. & Bashan, A. A top-down measure of gene-to-gene coordination for analyzing cell-to-cell variability. Sci. Rep. 11, 1–8 (2021).
Székely, G. J. & Rizzo, M. L. The distance correlation t-test of independence in high dimension. J. Multivar. Anal. 117, 193–213 (2013).
Székely, G. J., Rizzo, M. L. & Bakirov, N. K. Measuring and testing dependence by correlation of distances. Ann. Stat. 35, 2769–2794 (2007).
Kowalczyk, M. S. et al. Single-cell RNA-seq reveals changes in cell cycle and differentiation programs upon aging of hematopoietic stem cells. Genome Res. 25, 1860–1872 (2015).
Mann, M. et al. Heterogeneous responses of hematopoietic stem cells to inflammatory stimuli are altered with age. Cell Rep. 25, 2992–3005 (2018).
Grover, A. et al. Single-cell RNA sequencing reveals molecular and functional platelet bias of aged haematopoietic stem cells. Nat. Commun. 7, 1–12 (2016).
Davie, K. et al. A single-cell transcriptome atlas of the aging Drosophila brain. Cell 174, 982–998 (2018).
Klipp, E. Systems Biology in Practice: Concepts, Implementation and Application (Wiley, 2005).
Alon, U. An Introduction to Systems Biology: Design Principles of Biological Circuits (Chapman and Hall CRC, 2006).
Karlebach, G. & Shamir, R. Modelling and analysis of gene regulatory networks. Nat. Rev. Mol. Cell Biol. 9, 770–80. https://doi.org/10.1038/nrm2503 (2008).
Shao, J. & Tu, D. The Jackknife and Bootstrap (Springer, 2012).
Acknowledgements
A.B. thanks the Israel Science Foundation (Grant No. 1258/21), the German-Israeli Foundation for Scientific Research and Development and the Azrieli Foundation for supporting this research.
Author information
Contributions
G.A. and A.B. conceived the research, G.A., D.V.B.P. and O.L. performed the analysis, O.H. wrote the Python package and the association section, G.A., D.V.B.P. and A.B. wrote the manuscript. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Amit, G., Vaknin Ben Porath, D., Levy, O. et al. Global coordination level in single-cell transcriptomic data. Sci Rep 12, 7547 (2022). https://doi.org/10.1038/s41598-022-11507-y