Introduction

Microarray-based gene coexpression analysis has been demonstrated as an emerging field that offers opportunities to the researcher to discover coregulation pattern among gene expression profiles. Genes with similar transcriptomal expression are more likely to be regulated by the same process. Coexpression analysis seeks to identify genes with similar expression patterns, which can be believed to associate with the common biological process1,2,3. Recent approaches are interested to find the differences between coexpression pattern of genes in two different conditions4,5. This is essential to get a more informative picture of the differential regulation pattern of genes under two phenotype conditions. Identifying the difference in coexpression patterns, which is commonly known as differential coexpression, is no doubt a challenging task in computational biology. Several computational studies exist for identifying a change in gene coexpression patterns across normal and disease states6,7,8,9. Finding differentially coexpressed (DC) gene pairs, gene clusters, and dysregulated pathways between normal and disease states is most common6,10,11,12,13. Another way for identifying DC gene modules is to find gene cluster in one condition, and test whether these clusters show a change in coexpression patterns in another condition significantly8,10.

For example, CoXpress10 utilizes hierarchical clustering to model the relationship between genes. The modules are identified by cutting the dendrogram at some specified level. It used a resampling technique to validate the modules coexpressed in one condition but not in the other. Another approach called DiffCoex11 utilized a statistical framework to identify DC modules. DiffCoex proposed a score to quantify differential coexpression between gene pairs and transform this into dissimilarity measures to use in clustering. A popularly used tool WGCNA (Weighted Gene Coexpression Network Analysis) is exploited to group genes into DC clusters14. Another method called DICER (Differential Correlation in Expression for meta-module Recovery)15 also identifies gene sets whose correlation patterns differ between disease and control samples. Dicer not only identifies the differentially coexpressed module, but it goes one step beyond and identifies metamodules or a class of modules where a significant change in coexpression patterns is observed between modules, while the same patterns exist within each module.

In another approach, Ray and Maulik16 proposed a multiobjective framework called DiffCoMO to detect differential coexpression between two stages of HIV-1 disease progression. Here, the algorithm operates on two objective functions that simultaneously optimize the distances between two correlation matrices obtained from two microarray data of HIV-infected individuals.

Most of the methods proposed some scoring technique to capture the differential coexpression pattern and utilized some searching algorithm to optimize it. Here, we have proposed CODC Copula-based model to identify Differential Coexpression of genes under two different conditions. Copula17,18 produces a multivariate probability distribution from multiple uniform marginal distribution. It was extensively used in high-dimensional data applications. In the proposed method, first, a pairwise dependency between gene expression profile is modeled using an empirical copula. As the marginals are unknown, so we used empirical copula to model the joint distribution between each pair of gene expression profiles. To investigate the difference in coexpression pattern of a gene pair across two conditions, we compute a statistical distance between the joint distributions. We hypothesized that the distance between two joint distributions can model the differential coexpression of a gene pair between two conditions. To investigate this fact, we have performed a simulation study that provides the correctness of our method. We have also validated the proposed method by applying it in real-life datasets. For this, we have used five pan-cancer RNA-Seq data from TCGA: Breast-invasive carcinoma [BRCA], Head and Neck squamous carcinoma [HNSC], Liver hepatocellular carcinoma [LIHC], Thyroid carcinoma [THCA] and Lung adenocarcinoma [LUAD], which are publicly available in the TCGA data portal (https://tcga-data.nci.nih.gov/docs/publications/tcga/).

Results

Dataset preparation

We have evaluated the performance of the proposed method in five RNA-seq expression data downloaded from the TCGA data portal (https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga). We have downloaded a matched pair of tumor and normal samples from five pan-cancer datasets: Breast-invasive carcinoma (BRCA, #samples = 112), head and neck squamous cell carcinoma (HNSC, #samples = 41), liver hepatocellular carcinoma (LIHC, #samples = 50), thyroid carcinoma (THCA, #samples = 59), and Lung Adenocarcinoma (LUAD, #samples = 58). For preprocessing the dataset, we first take those genes that have raw read count greater than two in at least four cells. The filtered data matrix is then normalized by dividing each UMI (Unique Molecular Identifiers) count by the total UMI counts in each cell, and subsequently, these scaled counts are multiplied by the median of the total UMI counts across cells19. The top 2000 most variable genes were selected based on their relative dispersion (variance/mean) with respect to the expected dispersion across genes with similar average expression. Transcriptional responses of the resulting genes were represented by the log2(fold change) of gene expression levels from paired tumor and normal samples. A brief description of the datasets used in this paper is summarized in Table 1. Figure 1a, b represents box and violin plot of the average expression value of samples for each dataset.

Table 1 Tumor types and number of TCGA RNA-seq samples used in the analysis.
Fig. 1
figure 1

Description of TCGA data used in the analysis: Panel-A and -B describes box and violin plots of mean expression values of the used datasets.

Detection of DC gene pair

Differential coexpression between a gene pair is modeled as a statistical distance between the joint distributions of their expression profiles in a paired sample. Joint distribution is computed by using empirical copula that takes the expression profile of a gene as marginals in normal and tumor samples. The K–S distance, computed between the joint distribution, served as differential coexpression score between a gene pair. The score for a gene pair (gi, gj) can be formulated as \({\rm{DC}}\_{\rm{Copula}}(g_i,g_j)\) = \({\rm{KS}}\_{\rm{dist}}({\rm{e.c}}(g_i^{{\rm{tumor}}},g_j^{{\rm{tumor}}}),{\rm{e.c}}(g_i^{{\rm{normal}}},g_j^{{\rm{normal}}}))\), where KS-dist represents Kolmogorov–Smirnov (K–S) distance between two joint probability distributions, e.c represents empirical copula, and \(g_i^P\) represents the expression profile of gene gi at phenotype P. For each RNA-seq data, we have computed the DC_Copula matrix, from which we identify differentially coexpressed gene pairs.

To know how the magnitude of differential coexpression is changing with the score, we plot the distribution of correlation values of gene pairs with their scores in Fig. 2. The figure also shows the number of gene pairs having positive and negative correlations in each stage (normal/tumor). It can be noticed from the figure that high scores produce differentially coexpressed gene pairs having a higher positive and negative correlation. We collected the gene pairs having the score greater than 0.56 and plot the correlations values in Fig. 3. This figure shows plots of all gene pairs having a positive correlation in normal and the negative correlation in tumor (shown in panel a) and vice versa (shown in panel b). The density of the correlation values is shown in panels c and d for each case. In Fig. 4, we create a visualization of top differentially coexpressed gene pairs in BRCA data, which show a strong positive correlation in tumor stage and negative correlation in normal stage. The figure shows a heatmap of a binary matrix constructed from the expression data of those gene pairs in tumor and normal stages. The expression values showing the same pattern for a gene pair are assumed 1, while 0 represents a nonmatching pattern. From the figure, it is quite understandable that most of the entry in the normal stage is 0 (nonmatch) while in tumor stage, it is 1 (match). For other datasets, the plots are shown in Supplementary Fig. 2.

Fig. 2: The figure shows the distribution of correlation values in normal and cancer samples of BRCA data with the DC_Copula score.
figure 2

a shows the distribution for different DC_Copula scores. Here, four pirate plots are shown in each facet, two for positive and two for negative correlations. The violins in each facet represent the distribution of positive and negative correlations of gene pairs in normal and cancer samples. b shows a bar plot representing the number of positive and negatively correlated gene pairs in normal and cancer samples in each facet.

Fig. 3: The figure shows visualizations of gene pairs having DC_copula score greater than 0.56.
figure 3

a, b show the visualization of correlation values of gene pairs having a positive correlation in normal and negative correlation in tumor and vice versa, respectively. c, d represent the distribution of correlation values according to a, b, respectively.

Fig. 4: The figure shows a heatmap representation of a binary matrix constructed from the expression matrix of top differentially coexpressed gene pairs in normal and tumor stages.
figure 4

Expression values of a gene pair showing the same pattern are indicated as 1, and showing a different pattern is indicated as 0 in the matrix. The columns represent differentially coexpressed gene pairs, while rows are the samples of BRCA data.

Stability performance of CODC

To prove the stability of CODC, we have performed the following analysis:

First, we add Gaussian noise to the original expression data of normal and cancer samples to transform these into noisy datasets. We use the rnorm function of R to create normally distributed noise with mean 0 and standard deviation 1, and we add this into the input data. We have utilized BRCA data for this analysis.

First, we compute the K–S distance and then obtain DC_Copula matrix for both original and noisy datasets. Let us denote these two matrices as D and D′.

The usual way is to pick a threshold t for D (or D′) and extract the gene pair (i,j) for which D(i, j)(orD′(i, j)) ≥ t. First, we set t as the maximum of D and D′, and then decrease it continuously to extract the gene pairs. For each t, we observe the number of common gene pairs obtained from D and D′. Figure 5 shows the proportion of common genes selected from D and D′ for different threshold selection and different level of noise. Theoretically, CODC produces D with scores no more than D′ (see the above section for details). So, it is quite obvious that the number of common genes increased with a lower threshold value. From the property (see the above section for details), it can be noticed that the scores in D get preserved in D′. So, it is expected that obtained gene pairs from original data are also preserved in noisy data. Figure 5 shows the evidence for this case. As can be seen from the figure that even the noise label is 80%, for threshold value above 0.25, more than 55% of the gene pairs are common between noisy and original datasets.

Fig. 5
figure 5

Performance of CODC in noisy datasets.

Detection of differentially coexpressed modules

Detection of DC gene modules is performed by using hierarchical clustering on the DC matrix. Here, the differential coexpression score obtained from each gene pair is treated as the similarity measure between genes. The distance between a gene pair is formulated as dist_copula(gi, gj) = 1 − DC_Copula(gi, gj). For each dataset, modules are extracted using average linkage hierarchical clustering by using the dist_copula as a dissimilarity measure between a pair of a gene. For BRCA and HNSC data, we have identified 15 modules, for LIHC data 14 modules, for LUAD 21 modules, and for THCA 22 modules are identified. For studying the relationship between the modules, we have identified module eigengene networks for each dataset. According to ref. 14 module, eigengene represents a summary of the module expression profiles. Here, module eigengene network signifies coexpression relationship among the identified modules in two stages. We create visualizations of the module eigengene network for normal and tumor stages in Fig. 6. The upper triangular portion of the correlation matrix represents the correlation between module eigengenes for normal samples, whereas the lower triangular portion represents the same for tumor samples. This figure shows the heatmap for BRCA, HNSC, and LUAD datasets. It is clear from Fig. 6 that most of the modules show differential coexpression pattern in normal and tumor stages. For a differentially coexpressed module, it is expected that it shows an opposite correlation pattern in two different phenotype conditions. Here, the correlation pattern between two modules is represented as the correlation between the module eigengenes. In the heatmap of Fig. 6, we can observe that in all three datasets, the correlation pattern between most of the MEs in normal and tumor stages has the opposite direction. For example, from panel a, it can be noticed that for BRCA data, the modules have a negative correlation in the normal phase while showing a strong positive correlation in tumor phase. For the HNSC dataset, the opposite case is observed. Modules have a strong positive correlation in normal phases while having a negative correlation in tumor phase. In Supplementary Fig. 2, the visualization of all datasets is given.

Fig. 6: Heatmap of differentially coexpressed modules.
figure 6

Here the heatmap is shown for module eigengenes.The upper triangular portion of the matrix represents correlations of module eigengenes in normal samples, whereas the lower triangular portion signifies the same for tumor samples. Left and right sidebar of the heatmap represents −log(p value) of significantly enriched GO terms and pathways, respectively. “NA” stands for unavailability of significant pathway or GO terms. The upper annotation bar of the heatmap shows the DC_copula score of the module. a Shows the heatmap for BRCA data, whereas (b, c) Show heatmap HNSC and LUAD data.

Comparisons with competing methods

For comparison purpose, we have taken three competing techniques, such as Diffcoex, coXpress, and DiffCoMO, and compared them with our proposed method. All these methods are extant DC based, which look for gene modules with altered coexpression between two classes. DiffCoEx performed hierarchical clustering on the distance matrix complied from correlation matrices of two phenotype stages. CoXpress detects a correlation module in one stage and finds the alternation of the correlation pattern within the module in other classes. DiffCoMO uses the multiobjective technique to detect differential coexpression modules between two phenotype stages. We have made two approaches for comparing our proposed method with competing methods. We first compare the efficacy of these methods for detecting differential coexpressed gene pairs, and next compare the modules identified in each case. For the first case, we take the top 1000 gene pairs having high DC_Copula scores from the DC matrix, and perform classification using normal and tumor samples. The expression ratio of each DC gene pair from the expression matrix was taken and compiled a n × 1000, where n represents the number of samples in each data. For the other three methods, we have also selected the same number of differentially coexpressed gene pairs for the classification task. Table 2 shows some parameters we have used for the selection of the gene pairs. For CoXpress, first, we have used “cluster.gene” and “cuttree” function with default parameters provided in the R package of CoXpress to get the gene clusters according to the similarity of their expression profiles. These groups are then examined by the coXpress R function to identify the differentially coexpressed modules by comparing with the t statistics generated by randomly resampling the dataset 10,000 times for each group. We have taken the top 10 modules based on the robustness parameter, which tells the number of times that the group was differentially coexpressed in 1000 randomly resampled data. Now, we have selected 1000 gene pairs randomly from those modules. For the DiffCoEx method, we collected the DC gene pairs before partitioning them in modules. We used the code available in the supplementary file of the original paper of DiffCoEx, to get the distance score matrix that is used in the hierarchical clustering for module detection. We sort the score of the distance matrix and pick the top 1000 gene pairs based on the scores. For DiffCoMO, we use the default parameters to cluster the network to obtained differentially coexpressed modules. As it utilized the multiobjective method, all the Pareto optimal solutions of the final generation are taken as selected modules. We then choose 1000 gene pairs randomly from the identified modules. Classification is performed by treating normal and tumor samples as class labels. A toy example of the comparison is shown in Fig. 7. Note that all these methods are meant for differentially coexpressed module detection. So, for comparison, we collected the DC gene pairs before partitioning them in modules. We train four classifiers Boosted GLM, Naive Bayes, Random Forest, and SVM with the data and take the classification accuracy. The classification results are shown in Fig. 8. It can be noticed from the figure that for most of the dataset, the proposed method achieved high accuracy compared with the other methods.

Table 2 Table shows the different parameters/threshold we have used for selecting differentially coexpressed gene pairs for other methods.
Fig. 7: A toy example of performing classification on differentially coexpressed gene pairs.
figure 7

From the DC matrix, the top gene pairs are selected based on DC_copula score. The expression ratio is computed for each gene pair for normal and tumor samples. The final matrix is then transposed and subsequently, classification is performed using normal and tumor samples as class labels.

Fig. 8
figure 8

Performance comparison with state-of-the-art:Comparison of classification accuracy for five datasets with four classifiers Boosted -GLM, Naive Bayes, Random Forest, and SVM.

To assess the performance of all the methods for detecting differential coexpression modules, we check the distribution of the correlation score of gene pairs within top modules in normal and tumor samples. Extant methods do a comparison by computing the absolute change in correlation value between a pair of a gene within a module. The problem for this type of comparison is that the score ignores a small change in differential coexpression. It also fails to consider the gene pair having a low score and correlation of opposite sign in two conditions. For example, it emphasized the gene pair with correlation value 0.2 in normal and 0.7 in the tumor (here the score is 0.5) rather than the gene pair whose correlation value is −0.2 in normal and 0.2 in the tumor (here the score is 0.4). So, for comparison, it is required to investigate the number of gene pairs having correlation values of an opposite sign over −1 to +1. So, for all identified modules, we calculate the correlation score of each gene pair in two different samples (normal and cancer) and plot the frequency polygon in Fig. 9. To investigate whether the gene pairs within the modules show a good balance in positive and negative correlations, we have computed the correlation score for all the identified modules of DiffCoMO, DiffCoEx, and CoXpress. Figure 9 shows the comparisons of the correlation scores. It is noticed from the figure that gene pairs within the identified modules of the proposed method show good balance in positive and negative correlation values. DiffCoMO and DiffCoEX have also achieved the same, whereas most of the gene pairs within the coXpress modules shifted toward positive correlation in both tumor and normal samples. In Fig. 9a, we have also shown the boxplot of the correlation values obtained from different methods. As can be seen from the figure, the median line of correlation values for the proposed method is nearer to 0, which signifies good distribution of correlation scores in normal and tumor samples over −1 to +1. Thus, the proposed method can able to detect differentially coexpressed gene pairs having correlation values well distributed between −1 and +1.

Fig. 9: Distribution of correlation scores of the gene pairs in normal and tumor stage.
figure 9

Each facet shows the distribution for different methods.

To compare CODC with the other methods, we tested its performance in a simulated dataset also. To create the simulated data, we have used HNSC RNA-seq expression dataset. We create the simulated data as follows:

  1. 1.

    For each gene gi in the normal sample, we simulated the expression profile of sample sj as \(X_{ij} = {\cal{N}}(m_i,\sigma _i^2)\) where mi represents mean expression value of gene gi across all 51 HNSC normal samples, and σ2 represents their variance.

  2. 2.

    Similarly, we simulated the expression profile of tumor sample \(s_j^\prime\) as \(Y_{i,j} = {\cal{N}}(m_i^\prime ,\sigma _i^{\prime 2})\), where \(m_i^\prime\) is mean of the expression value of gene \(g_i^\prime\) and \(\sigma _i^{\prime 2}\) is the variance. Here, we assume that a gene pair is truly differentially coexpressed if the following condition holds: DC_score(gi, gj) > 0.5 and the correlation between gi and gj has opposite sign in tumor and normal stage.

  3. 3.

    We then add different levels of Gaussian noise to the Yij to create different noisy expression data (\(Y_{ij}^\prime\)) from the simulated tumor samples. We use rnorm function of R to produce normally distributed noise with mean 0 and standard deviation 1.

Now, we compute differentially coexpressed gene pairs between simulated normal (Xij) and noisy sample (\(Y_{ij}^\prime\)) and compare them with the underlying true differentially coexpressed gene pairs. We compute the proportion of matched gene pairs and plot the results against all the different noise levels in Fig. 10. We have done this analysis for all competing methods. To select the differentially coexpressed gene pairs from simulated, normal, and noisy tumor samples, we use DC_Copula threshold as 0.6. For other competing methods, we use the threshold and other parameters, same as the previous analysis, which are provided in Table 2.

Fig. 10: Figure shows the percentage of matching between identified gene pairs with true differentially coexpressed gene pairs of simulated data.
figure 10

Results are shown for different noise levels and for different methods.

Pathway analysis

To compare functional enrichment of identified modules, we have utilized KEGG pathway enrichment analysis. We defined enrichment score of a pathway in a module as \({\rm{pathway}}\_{\rm{score}} = \frac{n}{m}\), where n is the fraction of the pathway genes in the module and m is the fraction of the pathway genes in the dataset. We compare the pathway score for the modules identified for DiffCoEx, DiffCoMO, CoXpress, and the proposed method. For comparison purpose, we have utilized the HNSC dataset. For identifying modules in other competing methods, we have utilized the the parameters provided in Table 2. For coXpress, we take the modules with robustness parameter value greater than 760, which produces 19 clusters. For DiffcoEx, we have used the default parameters for cutreeDynamic function (cutHeight = 0.996, minClusterSize = 20), which produces 42 clusters. Figure 11 shows the result. The Y axis represents CCDF (complementary cumulative distribution function), which represents how often the number of modules is above a certain value of pathway score. It is clear from the figure that more modules for the proposed method achieved a high pathway score compared with other competing methods. In Fig. 6, we have shown heatmaps of differentially coexpressed modules for BRCA, HNSC, and for LUAD data. The heatmap also provides pathways and GO terms significantly enriched with the modules. The p value for KEGG pathway and GO enrichment is computed by using the hypergeometric test with 0.05 FDR corrections. We have utilized GOstats, kegg.db, and GO.db R package for that. It can be seen from Fig. 6a that some pathways such as “Complement and coagulation cascades”, “Proximal tubule bicarbonate reclamation”, “Caffeine metabolism”, “Protein digestion and absorption”, “Tryptophan metabolism”, and “ABC transporters”, are strongly associated with the identified modules of BRCA. “Tryptophan metabolism” has eminent evidence to link with malignant progression in breast cancer20. In ref. 21, the association between ABC transporters with breast carcinoma has been established. From panel b, it can be seen that drug metabolism–cytochrome P45022 ECM–receptor interaction23, “Nitrogen metabolism”, and “Protein digestion and absorption” are significantly associated with the modules of HNSC data. Some pathways such as “Drug metabolism–cytochrome P450” and “ECM–receptor interaction” have strong evidence associated with the head and neck squamous cell carcinomas22,23. Similarly from panel c, it can be noticed that pathways such as “Metabolism of xenobiotics by cytochrome P450”, “Pancreatic secretion”, and “Linoleic acid metabolism” are significantly associated with modules of LUAD data. Among them, there exists strong evidence for pathways: “Metabolism of xenobiotics by cytochrome P450”24, “Pancreatic secretion”25, and “Linoleic acid metabolism”26 to be associated with lung carcinoma.

Fig. 11: Distribution of pathway score for each of the comparing methods.
figure 11

The figure shows the fraction of identified modules having a pathway score above a certain value.

Discussion

In this paper, we have proposed CODC, a copula-based model to detect differential coexpression of genes in two different samples. CODC seeks to identify the dependency between expression patterns of a gene pair in two conditions separately. The Copula is used to model the dependency in the form of two joint distributions. K–S distance between two joint distributions is treated as differential coexpression score of a gene pair. We have compared CODC with three competing methods DiffCoex, CoXpress, and DiffCoMO in five pan-cancer RNA-Seq data of TCGA. CODC’s ability for delineating a minor change of coexpression in two different samples makes it unique and suitable for differential coexpression analysis. The scale-invariant property of copula inherited into CODC to make it robust against noisy expression data. It is advantageous for detecting the minor change in correlation across two different conditions, which is the most desirable feature of any differential coexpression analysis.

Under the premise that the differential coexpressed genes are likely to be important biomarkers, we demonstrate that CODC identifies those that achieve better accuracy for classifying samples. Moreover, CODC goes a step further from the pairwise analysis of genes, and seeks modules wherein differential coexpression is prevalent among each pair of genes. We have also analyzed the identified modules enriched with different biological pathways, and highlighted some of these such as “Complement and coagulation cascades”, “Tryptophan metabolism”, “Drug metabolism–cytochrome P450”, and “ECM–receptor interaction”.

We have evaluated the efficacy of CODC on five different pan-cancer datasets to effectively extract differential coexpression gene pairs. Besides that, we have also compared different methods for detecting differentially coexpressed modules in those data. It is worth mentioning that CODC improves upon the competing methods. We have also proved that the scale-invariant property of copula makes CODC more robust for detecting differential coexpression in noisy data. The most important part of the DC analysis is to reveal changes in gene correlation that would not be detected by traditional DE analysis. CODC uses copula for measuring gene–gene dependency, and copula is a multivariate measure, so it can be easily extendible to use in the measurement of dependence structure of multiple genes.

Methods

In this section, we have briefly introduced the proposed method.

Modeling differential coexpression using Copula

Differential coexpression is simply defined as the change in coexpression patterns of a gene pair across two conditions. A straightforward method to measure this is to take the absolute difference of correlations between two gene expression profiles in two conditions. For a gene pair genei and genej, this can be formally stated as \({\rm{DC}}\_{\rm{Score}}_{i,j}^{p1,p2} = |{\rm{Sim}}(x_i,x_j)^{p1} - {\rm{Sim}}(x_i,x_j)^{p2}|\), where p1, p2 are two different phenotype conditions, and xi,xj represent expression profile of genei and genej, respectively. Here Sim(xi, xj)p signifies Pearson correlation between xi and xj for phenotype p.

In the statistical analysis, a simple way to measure the dependence between the correlated random variable is to use copulas27. Copula is extensively used in high-dimensional data applications to obtain joint distributions from a random vector, easily by estimating their marginal functions.

Copulas can be described as a multivariate probability distribution function for which the marginal distribution of each variable is uniform. For a bivariate case, copula is a function: C: [0, 1]2 → [0, 1], and can be defined as C(x, y) = P(Xx, Yy), for 0 ≤ x, y ≤ 1, where X and Y are uniform random variables. Let, Y1 and Y2 be the random vectors whose marginals are uniformly distributed in [0, 1] and having marginal distribution FY1 and FY2, respectively. By Sklar’s theorem28, we have the following: there exists a copula C such that F(y1, y2) = C(FY1(y1), FY2(y2)), for all y1 and y2 in the domain of FY1 and FY2. In other words, there exists a bivariate copula that represents the joint distribution as a function of its marginals. For the multivariate case, the copula (C) function can be represented as

$$F_Y(y_1,y_2, \ldots ,y_n) = C(F_1(y_1),F_2(y_2), \ldots ,F_n(y_n)),$$
(1)

where (Y1, Y2, …, Yn) be the random vectors whose marginals are F1(y1), F2(y2), …, Fn(yn). The converse of the theorem is also true. Any copula function with individual marginals Fi(yi) as the arguments, represents valid joint distribution function. Assuming that F(Y1, Y2, …, Yn) has nth-order partial derivatives, the relation between the joint probability-density function and the copula-density function, say c, can be obtained as

$$\begin{array}{l}f(y_1,y_2, \ldots ,y_n) =\frac{{\partial ^n(F(Y_1,Y_2, \ldots ,Y_n))}}{{\partial Y_1\partial Y_2 \ldots \partial Y_n}}\\ =\frac{{\partial ^n\left( {C\left( {F_1(y_1),F_2(y_2), \ldots ,F_n(y_n)} \right)} \right.}}{{\partial Y_1\partial Y_2 \ldots \partial Y_n}}\\ =cF_1(y_1),F_2(y_2), \ldots ,F_n(y_n)\mathop {\prod}\limits_i {f_i(y_i)} \\ \end{array}$$
(2)

where, we define

$$c(y_1, \ldots ,y_n) = \frac{{\partial ^nC(y_1, \ldots ,y_n)}}{{\partial y_1 \cdots \partial y_n}}.$$
(3)

So, Copula is also known as joint distribution-generating function with a separate choice of marginals. Hence, different families (parametric and nonparametric) of copulas exist, which model different types of dependence structure. The example includes Farlie–Gumbel–Morgenstern family (parametric), Archimedean Copula (parametric), Empirical Copula (nonparametric), Gaussian (parametric), and t (parametric). Empirical copulas are governed by the empirical distribution functions, which try to estimate the underlying probability distribution from given observations.

Empirical Copula is defined as follows:

Let Y1, Y2, …, Yn be the random variables with marginal cumulative distribution function F1(y1), F2(y2), …, Fn(yn), respectively.

The empirical estimate of (Fi, i = 1, …, n), based on a sample, {yi1, yi2, …, yim}, of size m is given by

$$\hat F_i(y) = \frac{1}{m}\mathop {\sum}\limits_{j = 1}^m {1_{\{ Y_{ij}\, \le \,y\} }} ,[i = 1, \ldots ,n]$$
(4)

The Empirical Copula of Y1, Y2, …, Yn is then defined as

$$\begin{array}{l}\hat C(u_1,u_2, \ldots ,u_n)\\ = \frac{1}{m}\mathop {\sum}\limits_{j = 1}^m 1 \{ \widehat {F_1}(y_{1,j}) \le u_1,\widehat {F_2}(y_{2,j}) \le u_2, \ldots ,\widehat {F_n}(y_{n,j}) \le u_n\} ,\end{array}$$
(5)

for ui [0, 1], [i = 1, …, n]. Here, we model the dependence between each pair of gene expression profile using empirical copulas. As we were unaware of the distributions of expression profiles, so empirical copulas are the only choice here. Notably, we have estimated joint empirical copula density from the marginals of each gene expression profile. We have used beta-kernel estimation to determine the copula density directly from the given data. The smoothing parameters are selected by minimizing the asymptotic mean-integrated squared error (AMISE) using the Frank copula as the reference copula. The input to the copula-density estimator is of size n × 2, where n is the number of samples in different datasets. For each pair of samples, we estimate the empirical copula density using beta-kernel estimator. We have shown some marginal normal contour plots of copula density during the estimation process of the BRCA dataset in Supplementary Fig. 1. To model the differential coexpression of a gene pair, we have measured a statistical distance between two joint distributions provided by the copulas. We have utilized the K–S test to quantify the distance between two empirical distributions. The value of d statistic represents the distance here. Thus, the distance obtained for a gene pair is treated as a differential coexpression score.

To check whether the distance between the joint distribution perfectly models the differential coexpression, we have performed an analysis. To show the concordance between the DC_Score with the proposed distance, we have performed the following analysis. We create a 20 × 20 matrix M, whose rows (i) and columns (j) annotated with correlation values from −1 to +1 with 0.1 spacing. We create two pairs of marginals (Fx1, Fx2) and (Fy1, Fy2) having correlations i and j, respectively. For the generation of marginals, we have used mvnorm function of MASS R-package and set the “empirical” parameter of mvnorm as “TRUE”. This generates (Fx1, Fx2) or (Fy1, Fy2) with an exact correlation of i or j. Next, we compute joint distributions using copula function FX1X2 = C(Fx1, Fx2), FY1Y2 = C(Fy1, Fy2) and finally compute KS distance between FX1X2 and FY1Y2. Each entry of (i,j) in M is filled with this distance value. We generate M 100 times following the same method. Now, to visualize the matrices, each row is represented as a series of boxplots in Fig. 12. For a fixed row, the DC_Score will increase from left to right along the column as it ranges from correlation value −1 to +1. Each facet in the figure corresponds to a row/column in the matrix, which represents 20 sets of 100 distances corresponding to the correlations ranging from −1 to +1 with a spacing of 0.1. Considering each facet of the plot, it can be noticed that distances are gradually increasing with the increase in the DC_Score. For example, considering the second facet (corr value = −0.9), the distances increased from left to right gradually. So, it is evident from the figure that there exists a strong correlation between the distance and DC_Score, which signifies that the proposed method can model the difference in coexpression patterns.

Fig. 12: Boxplot showing the dependency between DC_Score and K–S distance, between two joint distributions.
figure 12

a shows the distances for the facets from correlation −1 to −0.4 and b shows the same for correlation −0.3 to +1.

Stability of CODC

CODC is stable under noisy expression data. This is because of the popular “non-parametric”, “distribution-free”, or “scale-invariant” nature of the copula29. The properties can be written as follows: let CXY be a copula function of two random variables X and Y. Now, suppose α and β are two functions of X and Y, respectively. The relation of C(α(X),β(Y)) and CXY can be written as follows:

  • Property 1: If α and β are strictly increasing functions, then the following is true:

    $$C_{\alpha (X)\beta (Y)}(u,v) = C_{XY}(u,v)$$
    (6)
  • Property 2: If α is strictly increasing and β is strictly decreasing, then the following holds:

    $$C_{\alpha (X)\beta (Y)}(u,v) = u - C_{XY}(u,1 - v)$$
    (7)
  • Property 3: If α is strictly decreasing and β is a strictly increasing function, then we have

    $$C_{\alpha (X)\beta (Y)}(u,v) = v - C_{XY}(v,1 - u)$$
    (8)
  • Property 4: If both α and β are strictly decreasing functions, then the following holds:

$$C_{\alpha (X)\beta (Y)}(u,v) = u + v - 1 - C_{XY}(1 - u,1 - u)$$
(9)

These properties of copula are used to prove that the distance measure used in CODC is approximately scaled invariant. Theoretical proofs are described below, and the simulation result is given later in section (Stability performance of CODC). The proof is as follows: we know that the K–S statistic for a cumulative distribution function F(x) can be expressed as

$$D = {\rm{sup}}_x|H_n(x) - F(x)|,$$

where Hn is an empirical distribution function for n i.i.d observation Xix, and sup corresponds to supremum function. The two-sample K–S test used in CODC can be described similarly

$$\begin{array}{c}D = {\rm{sup}}_{x,y}|(H_n^1(x,y) - F(x,y)) - (H_n^2(x,y) - F(x,y))|\\ = {\rm{sup}}_{x,y}|(H_n^1(x,y) - H_n^2(x,y))|,\end{array}$$
(10)

where \(H_n^1\), \(H_n^2\) are denoted as the joint empirical distribution for two samples taken from normal and cancer, respectively. Now the D statistic can be written as

$$\begin{array}{l}D = {\rm{sup}}_{x,y}|(H_n^1(x,y) - H_n^2(x,y))|\\ = {\rm{sup}}_{x,y}|C(F_1(x),F1(y)) - C(F_2(x),F2(y))|\\ = {\rm{sup}}_{x,y}|C_{XY}(u,v) - C_{XY}(\tilde u,\tilde v)|,\end{array}$$
(11)

where C(.) is copula function and u = F1(x), v = F1(y), \(\tilde u = F_2(x),\tilde v = F_2(y)\) are uniform marginals of joint distributions \(H_n^1\) and \(H_n^2\).

Let us assume that both α and β functions are strictly increasing. Then from Eqs. (6) and (11), the distance D between \(H_n^1\)(α(x), β(y)) and \(H_n^2\)(α(x), β(y)) has the form

$$\begin{array}{*{20}{l}}D ={\rm{sup}}_{x,y}|H_n^1(\alpha (x),\beta (y)) - H_n^2(\alpha (x),\beta (y))|\\ ={\rm{sup}}_{x,y}|C(F_1(\alpha ),F1(\beta )) - C(F_2(\alpha ),F2(\beta ))|\\ ={\rm{sup}}_{x,y}|C_{\alpha (x),\beta (y)}(u,v) - C_{\alpha (x),\beta (y)}(\tilde u,\tilde v)|\\ ={\rm{sup}}_{x,y}|C_{XY}(u,v) - C_{XY}(\tilde u,\tilde v)|\\ {\left[{{\rm{By}}\,{\rm{using}}\,{\rm{the}}\,{\rm{property}}\,{\rm{in}}\,{\rm{Eq}}. (6)} \right]}\\={{{\sup }_{x,y}}\left| {C({F_1}(x),F1(y)) - C({F_2}(x),F2(y))} \right|}\\={{{\sup }_{x,y}}\left| {(H_n^1(x,y) - H_n^2(x,y))} \right|}.\end{array}$$
(12)

Now, if α is strictly increasing and β is strictly decreasing, then D can be written as

$$\begin{array}{*{20}{l}}D^{\prime} ={\rm{sup}}_{x,y}|H_n^1(\alpha (x),\beta (y)) - H_n^2(\alpha (x),\beta (y))|\\ ={\rm{sup}}_{x,y}|C(F_1(\alpha ),F1(\beta )) - C(F_2(\alpha ),F2(\beta ))|\\ ={\rm{sup}}_{x,y}|C_{\alpha (x),\beta (y)}(u,v) - C_{\alpha (x),\beta (y)}(\tilde u,\tilde v)|\\ ={\rm{sup}}_{x,y}|u - C_{XY}(u,1 - v) - \tilde u - C_{XY}(\tilde u,1 - \tilde v)|\\\left[{\mathrm{By}}\,{\mathrm{using}}\,{\mathrm{the}}\,{\mathrm{property}}\,{\mathrm{in}}\,{\mathrm{Eq.}}\,(7)\right]\\ ={\rm{sup}}_{x,y}|(u - \tilde u) + C_{XY}(\tilde u,1 - \tilde v) - C_{XY}(u,1 - v)|\\ ={\rm{sup}}_{x,y}|(u - \tilde u) + C_{XY}(\tilde u,\tilde m) - C_{XY}(u,m)|\\ \ge\,{\rm{sup}}_{x,y}|[C_{XY}(\tilde u,\tilde m) - C_{XY}(u,m)]|\\ \ge\,{\rm{sup}}_{x,y}|[C_{XY}(u,m) - C_{XY}(\tilde u,\tilde m)]|\\ ={\rm{sup}}_{x,y}|H_n^1(x,y) - H_n^2(x,y)|\\ =D\end{array}$$
(13)

Similarly, for strictly increasing β and strictly decreasing α, the distance D′ between \(H_n^1\)(α(x), β(y)) and \(H_n^2\)(α(x), β(y)) can be shown to satisfy the relation

$$\begin{array}{*{20}{l}}D^{\prime} = {\rm{sup}}_{x,y}|H_n^1(\alpha (x),\beta (y)) - H_n^2(\alpha (x),\beta (y))|\\ \ge {\rm{sup}}_{x,y}|H_n^1(x,y) - H_n^2(x,y)|\\ = D.\end{array}$$
(14)

Finally, let us consider that both α and β are strictly decreasing functions. The distance D′ can be described as

$$\begin{array}{*{20}{l}} D^{\prime} ={\rm{sup}}_{x,y}|H_n^1(\alpha (x),\beta (y)) - H_n^2(\alpha (x),\beta (y))|\\ ={\rm{sup}}_{x,y}|C(F_1(\alpha ),F1(\beta )) - C(F_2(\alpha ),F2(\beta ))|\\ ={\rm{sup}}_{x,y}|C_{\alpha (x),\beta (y)}(u,v) - C_{\alpha (x),\beta (y)}(\tilde u,\tilde v)|\\ ={\rm{sup}}_{x,y}|u + v - 1 + C_{XY}(1 - u,1 - v)\\-\, \tilde u + \tilde v - 1 + C_{XY}(1 - \tilde u,1 - \tilde v)|\\ {\left[{{{\rm By}\,{\rm using}\,{\rm the}\,{\rm property}\,{\rm in}\,{\rm Eq.}}}\,(8)\right]}\\ =\,{\rm{sup}}_{x,y}|(u - \tilde u) + (v - \tilde v) + C_{XY}(1 - u,1 - v)\\-\, C_{XY}(1 - \tilde u,1 - \tilde v)|\\ \ge\, {\rm{sup}}_{x,y}|C_{XY}(1 - u,1 - v) - C_{XY}(1 - \tilde u,1 - \tilde v|)|\\ ={\rm{sup}}_{x,y}|C_{XY}(m,n) - c_{XY}(\tilde m,\tilde n)|\\ ={\rm{sup}}_{x,y}|H_n^1(x,y) - H_n^2(x,y)|.\\ \end{array}$$
(15)

Thus, the value of D′ between two joint distributions \(H_n^1\)(α(x), β(y)) and \(H_n^2\)(α(x), β(y)) is the same as that when we add Gaussian noise to the original expression data of normal and cancer samples to transform these into noisy datasets of D that represents the distances \(H_n^1\)(x, y) and \(H_n^2\)(x, y) when both α and β are increasing functions. For other cases of α and β, D′ attains at least the value of D. So, the distance for two random variables α(X) and β(Y) is equal or at least that of the random variables X and Y. CODC treats the distance D as differential coexpression score; thus, it remains the same (or at least equal) under any transformation of X and Y.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.