Introduction

Clustering algorithms have been widely used to group genes based on their similarities in expression1,2,3,4. The gene expression coherence is often related to functional coherence. A recent comparative assessment of 21 existing clustering algorithms showed that the clustering algorithms report more functionally meaningful clusters by exploiting the relationships between all pairs of genes1. Clustering algorithms are also known to perform better grouping of co-functional genes if they search for similarities between expression patterns rather than similarities between expression values1, 5.

Functional relations between genes may vary over conditions6, 7. For example, a group of genes may act coherently under one set of conditions but may become inactive or perform different functions separately under another set of conditions. Clustering algorithms that obtain grouping based on similarities over all the samples in a dataset are not effective for detecting condition-dependent coexpression patterns. Biclustering algorithms have been developed to address this problem1, 8,9,10,11,12. A bicluster consists of a group of genes and a set of conditions under which the genes are co-expressed. Searching for bicluster is a challenging problem because the number of potential biclusters is exponential to the number of genes and samples. An important difference between various biclustering algorithms is how they apply heuristic rules to detect biclusters. Most biclustering algorithms use local search heuristics that may miss many biclusters. Conventional way of finding biclusters depends on the selection of random seeds of genes and/or samples followed by their augmentation based on a scoring function. However, random selection of initial seeds is often unable to efficiently cover the search space and to obtain a consistent set of biclusters from multiple runs on the same input data. The selection of scoring function for the heuristic search is also important for finding the biologically meaningful biclusters. Most common biclustering approaches adopt arithmetic mean of the gene expression or an up/down-regulation patterns on corresponding discretized data matrix. Both are inefficient for finding the interesting co-regulatory modules where genes are expressed with similar or opposite patterns of expression but the expression values are very different. Such modules are important as they may represent relations between genes in the same biological functions1, 13.

A Pearson Correlation Coefficient based scoring function was introduced by Correlated Pattern Biclusters (CPB)14. However, the performance CPB is still depends on the random selection of samples. Benefits of Pearson Correlation Coefficient based similarity measures over the conventional mean square residue based bicluster scores again was demonstrated by Bi-Correlation Clustering algorithm (BCCA) that looks for positively correlated biclusters and reports biclusters for each pair of genes present in a dataset13. Initially, for each pair of genes all the samples are selected and then a greedy search is made to eliminate samples one at a time until the gene pair shows positive correlation higher than a predefined high positive correlation threshold. The gene pair is then augmented with other genes that are correlated to form a bicluster. However the backward elimination of samples in a greedy search is subject to finding a local optimal subset of samples, hence in reality it imposes a restriction on the search space for finding the optimal set of biclusters. BIclustering by Correlated and Large number of Individual Clustered seeds (BICLIC) introduces another alternative correlation based biclustering. BICLIC forms biclusters by starting from a large number of clustered seeds followed by augmentation and deletion of rows and columns based on the correlations with seeds15. The main drawback of BICLIC which is also true for BCCA is finding a large number of overlapping biclusters which are very much identical. Moreover, BICLIC, BCCA and other correlation based biclustering methods, only look for finding positively correlated gene groups as bicluster. However, in reality negatively correlated genes are equally relevant and may represent important regulatory mechanisms in biological functions.

Here we propose a more effective correlation-based biclustering algorithm named Condition-dependent Correlation Subgroup (CCS). It integrates several important features for developing an effective algorithm for comprehensive discovery of functionally coherent biclusters1. A significant challenge in genomic data analysis is to utilize the fast growing high performance computing capacity to process and analyze large complex datasets efficiently16,17,18. CCS is particularly suitable for parallel computing. We used the GPGPU computing for a parallel implementation of the algorithm in CUDA C which shows a substantial speedup compared to the sequential C program. The performance of CCS was compared with CPB, BCCA, BICLIC and ten other widely-used biclustering algorithms on 5 synthetic and 5 real gene expression datasets. CCS outperforms the other biclustering algorithms in all the comparisons. We also showed that there is equivalence between the CCS biclusters and condition-dependent coexpression network modules.

Methods

A bicluster is a group of genes with a related pattern of expression over a group of samples. We defined the related pattern of gene expression in terms of positive and negative Pearson correlation coefficients. Let us consider a gene expression data set D = G × E where G = {g1, g2, …, gn} is a set of “n” genes and E = {e1, e2, …, es, …, em} is a set of “m” samples. For each gene gi there is an m-dimensional vector xi. In vector xi, xis is the value of es for gene gi. In our algorithm we defined a bicluster as a group of genes “I” over a group of samples “J” where each gene gi in “I” is correlated to all the other genes gj in “I” with an absolute correlation value greater than a threshold θ. Mathematically a bicluster “C” is represented as C = (I; J) where “I” is a subset of “G” and “J” is a subset of “E”. The Pearson correlation coefficient between gi and gj over a subset of samples “J” is represented as r(gi, gj)J and defined as

$$r{(gi,gj)}_{J}=\frac{{\sum }_{s=1}^{m}(({x}_{is}\times {b}_{s}-\bar{{x}_{i}\times b})({x}_{js}\times {b}_{s}-\bar{{x}_{j}\times b}))}{({\sum }_{s=1}^{m}{({x}_{is}\times {b}_{s}-\bar{{x}_{i}\times b})}^{2})({\sum }_{s=1}^{m}{({x}_{js}\times {b}_{s}-\bar{{x}_{j}\times b})}^{2})},$$
(1)

where b is a bit vector of size “m”. If sample es is in “J” then we set the sth bit of b s  = 1, otherwise we set b s  = 0. The terms xis and xjs represent sth sample values for gene gi and gj respectively. Average expression values of gi and gj for samples in “J” are represented as \(\overline{{x}_{i}\times b}\) and \(\overline{{x}_{j}\times b}\) respectively. We considered a pair of genes gi and gj similar for a subset of samples “J” if \(|r{(gi,gj)}_{J}| > \theta \), where θ is the correlation threshold and \(|r{(gi,gj)}_{J}|\,\,\)is the absolute value of correlation between gi and gj for the subset of samples “J”. For each gene gi in a dataset “D”, CCS considers gi as the base gene to start forming a bicluster for each gi.

Search space sorting and base-gene selection

The gene with the lower variability over the samples (commonly known as housekeeping genes) are often considered as less significant for the corresponding biological conditions, hence the less important candidates for forming a bicluster. Prior to biclustering, the search space (input data matrix) was sorted by the standard deviations of the gene expression values. In an ordered data matrix the genes with the higher variability were placed before the other with the lower variability.

CCS algorithm selects ‘base_number’ of genes as base-gene from the sorted data matrix in the decreasing order of variability. Ordering of the search space ensured that a gene with higher variability considered for forming a bicluster and also for augmentation before the other with lower variability. The ‘base_number’ can be set between “1” to “n” (total number of genes). Here we set the base_number value equals 1,000 for the biological dataset to restrict the search time.

Similarity pattern classes and sample selections

CCS algorithm considers three classes of gene expression pattern similarities: (i) up-regulated positive correlation: gi and gj are positively correlated and the expression values of gi and gj for the selected samples are higher than the arithmetic mean expression values over all smaples; (ii) down-regulated positive correlation: gi and gj are positively correlated and the expression values of gi and gj for the selected samples are lower than the arithmetic mean expression values over all smaples; (iii) negative correlation: gi and gj are negatively correlated over the selected samples and their expression values are at the opposite sides to their respective arithmetic mean expressions.

For each pair of genes gi and gj the sample sets “J1: for up-regulated positive correlation”, “J2: for down-regulated positive correlation” and “J3: for negative correlation” are selected from sample selection rules 1, 2 and 3 respectively. For each sample es, s = 1:m, we defined the following three rules to determine J1, J2 and J3: Rule 1. IF \(({x}_{is}-\overline{{x}_{i}}) > 0\,AND\,({x}_{js}-\overline{{x}_{j}}) > 0\) THEN include es to J1 [Expression values of gi and gj for the samples in J1 are higher than the arithmetic mean expression values], Rule 2. IF \(({x}_{is}-\overline{{x}_{i}}) < 0\,AND\,({x}_{js}-\overline{{x}_{j}}) < 0\) THEN include es to J2 [Expression values of gi and gj for the samples in J2 are lower than the arithmetic mean expression values], Rule 3. IF \(({x}_{is}-\overline{{x}_{i}})\times ({x}_{js}-\overline{{x}_{j}}) < 0\) THEN include es to J3 [Expression values of gi and gj for the samples in J3 are opposite to their arithmetic mean expression values]. Here \(\overline{{x}_{i}}\) and \(\overline{{x}_{j}}\) are the mean expression value of gene gi and gj over all samples. The correlations between gi and gj for sample sets Jk=1:3 are determined first by computing Pearson Correlation Coefficient ‘r’ and then by comparing against a threshold ‘θ’ which is denoted by |r(gi, gj)Jk| > θ in the algorithm.

Defining biclusters as condition dependent correlation modules

CCS defines biclusters as condition dependent correlation modules where the genes in a bicluster are expected to show correlations only for the samples in that bicluster. CCS introduces a scoring function named BScore as defined in Equation (2). The BScore is designed to compare two sets of correlated gene pairs N and M. The correlations in set ‘N’ are computed over the samples that are included in a bicluster while the correlations in set ‘M’ are computed over the rest of the samples that are not included in the same bicluster. Thus, the BScore measures the degree to which the gene coexpression in a bicluster is condition-specific. In this work, we used a small BScore threshold (<0.01) to ensure the discovered biclusters are based on condition-specific coexpression.

$${\rm{BScore}}=\frac{|N{\cap }^{}M|}{|N{\cup }^{}M|}$$
(2)

Algorithm

The algorithm at each iteration of step 2 starts with a new base-gene gi. In each iteration of step 5, gi is paired with a gene gj (i < j ≤ n). In step 6, the sample sets are selected for gene pair gi, gj. In step 8, the algorithm computes the absolute value of the correlation |r(gi, gj)Jk| for each sample set Jk=1:3 (Equation (1)). If |r(gi, gj)Jk|\( > \theta \) then the algorithm starts a bicluster with an initial gene set Ii = {gi, gj}. In step 10, augmentation of the gene set is performed by including a new gene gp in Ii. In step 16, \({\rm{BScore}}\)(Ii,Ji) is compared against the previous best \({\rm{BScore}}\)(I,J) to update the sets I and J. The most condition dependent bicluster for each base gene gi is selected in step 22. In step 26, all the overlapping biclusters are merged while keeping \({\rm{BScore}} < 0.01\)

Input: (i) A gene expression data set D = G × H, G = {g1, g2,…, gn} is a set of n genes, H = {e1, e2,…, em} is a set of m samples for gene set G. (ii) A correlation threshold θ.

Output: A set of biclusters S = {bicluster(g1), bicluster(g2),…, bicluster(gbase_number)}, base_number ≤ n.

1. S ← NULL

2. for all gi \(\in \,\)G, i ≤ base_number do

3.       I ← NULL

4.       J ← NULL

5.       for all gj \(\in \,\)G, i < j ≤ n do

6.          apply Rules(1,2,3) on {gi, gj} for sample sets Jk(k=1,2,3)

7.          for all Jk, k = 1:3 do

8.                if |r(gi, gj)Jk| > θ then

9.                        Ii ← {gi, gj}

10.                       for all gp not in Ii do

11.                           if |r(gp, gq)Jk| > θ, for all gq \(\in \,\,\)Ii then

12.                                 Ii ← Ii gp

13.                             end if

14.                       end for

15.                end if

16.                If (BScore(Ii,Ji) < BScore(I,J) < 0.01) or

                        (BScore(Ii,Ji) = BScore(I,J) < 0.01 and |Ii| > |I|) then

17.                       I ← Ii

18.                       J ← Ji

19.                end if

20.          end for

21.       end for

22.       if I ≠ NULL and J ≠ NULL then

23.        bicluster(gi) ← {I, J}

24.       end if

25. end for

26. for each bicluster({I, J}) do

27.       for each bicluster({K, L})\(\ne \,\)NULL do

28.          if BScore(I  K, J  L) < 0.01 and I ∩ K ≠ NULL then

29. bicluster({I, J}) ← {I K, JL}

30.                 bicluster({K, L}) ← NULL

31.          end if

32.   end for

33. end for

for the final set of biclusters ‘S’.

The input parameter for our biclustering algorithm is the correlation threshold θ, which is the minimum absolute value of correlation between genes in a bicluster. Allocco et al. investigated the relation between co-regulation and coexpression and came up with a conclusion that for a correlation greater than 0.84 there is at least 50% chance of co-regulation19. Here we set θ = 0.8. If the number of biclusters is zero for θ = 0.8 then we decrease θ by 0.05 until we get a nonempty set of biclustering result or θ drops to 0.

GPU computing for biclustering analysis of large datasets

GPU was initially introduced for rendering video graphics on display devices. The GPU designing company NVIDIA introduced their CUDA architecture in 2007 for general purpose computing on graphics processing unit (GPGPU computing). The single instruction set and multiple dataset architecture (SIMD) of GPU is particularly useful for executing a single instruction over thousands of core processors in a GPU card. CUDA is currently the most popular programming model for general purpose GPU computing. CUDA C is an extension of regular C that supports GPGPU computing. The parallel part is implemented as a CUDA C kernel function. Thousands of instances of a CUDA C kernel function run parallel on a group of GPU core processors. Each group of GPU cores is also known as a thread block. Each thread block is a virtual processor formed from one or more GPU core processors, registers, local memories, shared memory and global memory. CUDA C kernel functions are sent to the GPU for parallel execution, and sequential parts of the code run on the CPU.

We used GPGPU computing to reduce the execution time. In this work, we executed our CUDA C code for CCS on NVIDIA Tesla K20 (2,496 CUDA cores) and K80 GPU (4,992 CUDA cores) accelerator. For compilation of CUDA C code we installed CUDA 7.5 toolkit in our Linux workstation. Transferring big datasets between CPUs and GPUs is time consuming. To reduce the data transfer overhead we moved the entire process of finding new biclusters to GPU.

Step 2 of CCS executes “n” times when “base_number” is set at “n” for a dataset with “n” rows. For each execution the inner loop at step 5 executes a maximum “n-1” times. To achieve increased speed from parallel GPGPU computing, we removed the step 2 loop from CCS algorithm and implement steps 5 to 25 of CCS as a CUDA C kernel function.

Synthetic datasets

Five synthetic datasets were generated using the BiBench-0.2 python library8. The synthetic datasets contains two types of biclustering: Constant biclusters and Shift-Scale biclusters. Constant biclusters are sub-matrices with a constant value and Shift-Scale biclusters are biclusters that are formed from both shifting and scaling a base row by random numbers. Table 1 describes the synthetic datasets for their types, dimensions and number of true biclusters.

Table 1 Summary of the synthetic datasets.

Gene expression datasets

Five publicly available gene expression datasets GDS531, GDS589, GDS3603, GDS3966 and GDS4794 were downloaded from Gene Expression Omnibus (GEO). Genes with standard deviation less than 2.0 were removed. For genes appeared multiple times, only the row (probe set) with the highest standard deviation was kept. The dataset information is summarized in Table 2.

Table 2 Summary of the gene expression datasets and number of CCS biclusters obtained for θ = 0.8.

Evaluating biclustering results

We evaluated the results of the biclustering algorithms on synthetic datasets using Recovery and Relevance metrics8, 20. They compare the set of found biclusters ‘F’ against the expected biclusters ‘E’ (the true bicluster). Both for Recovery and Relevance the scores are ranging between ‘0’ and ‘1’. The highest score ‘1’ for \({\rm{Recovery}}\) denotes that all the expected biclusters are found. Similarly, the highest ‘1’ for \({\rm{Relevance}}\) denotes that all the found biclusters are expected.

To evaluate the results on the real gene expression datasets, we performed functional enrichment tests for each bicluster1, 8, 13, 20. We used gProfileR21 for selecting the enriched gene annotation terms. gProfilerR retrieves a comprehensive set of gene annotations from Gene Ontology (GO) terms, biological pathways, regulatory motifs in DNA, microRNA targets and protein complexes. From gProfiler search, all the annotation terms with a Benjamini-Hochberg FDR less than 0.01 were considered as enriched. From the biclustering results on all five gene expression datasets we computed the average number of enriched annotation terms. Higher average number of enriched terms indicates better functional grouping. We also evaluated the performance of biclustering algorithms from percentage of total biclusters that have one or more enriched annotation terms.

Results and Discussions

We obtained the biclustering results from CCS and 13 other biclustering algorithms namely Cheng and Church Biclustering algorithm (CC)22, Iterative Signature Algorithm (ISA)23, Plaid24, Spectral25, Order Preserving Sub Matrix (OPSM)26, xMotifs27, SAMBA28, Bimax20, BicSPAM29, UniBic30, CPB14, BICLIC15 and BCCA13. We used BicAT20 software package for ISA, CC, OPSMs, xMotifs and Bimax. We also used ‘biclust’ R package (https://cran.r-project.org/web/packages/biclust/) for Plaid and Spectral. For UniBic, BicSPAM, BCCA, CPB and BICLIC the published versions of the software packages were used. SAMBA biclustering was performed using the Expander package31.

Performance on synthetic data

Figure 1 shows the Recovery and Relevance scores on five synthetic datasets (Table 1) for CCS and the best among the other 13 algorithms for comparison. For all of the five synthetic datasets CCS significantly outperformed other algorithms for finding the expected biclusters (recovery) and the most relevant set of biclusters (relevance). For constant bicluster on a datasets of 100 rows and 100 columns, CCS obtained 0.88 score for both recovery and relevance. For larger size datasets and shift and scale biclusters, still the recovery and relevance scores are higher than the other algorithms which clearly demonstrate the accuracy of CCS for recovering the relevant biclusters irrespective of the number and pattern of the biclusters and size of the datasets.

Figure 1
figure 1

Recovery and Relevance scores on five synthetic datasets for CCS and the next-best performing algorithm.

Performance on gene expression data

CCS biclustering algorithm obtained total 25 biclusters from five gene expression datasets (Table 2). Figure 2 shows the average number of enriched terms from gProfiler enrichment analysis. Figure 2 shows there are more enriched terms per bicluster from CCS than the others. Figure 3 compares the biclustering algorithms for the percentage of biclusters with at least one enriched term. Again, CCS obtained the highest percentage of enriched biclusters (Figure 3).

Figure 2
figure 2

Average number of enriched terms on five gene expression datasets. All the enriched gene ontology terms with Benjamini-Hochberg FDR less than 0.01 were considered.

Figure 3
figure 3

Percentage of bicluster from five gene expression datasets that have at list one enriched term. All the enriched gene ontology terms with Benjamini-Hochberg FDR less than 0.01 were considered.

GPU accelerated biclustering

The CPU version of the CCS algorithm is computationally expensive. If the ‘base_number’ is ‘n’ then step 5 executes ‘n’ times. Again for ‘n’ iterations of step 5 the augmentation task at step 10 executes ‘n’ times. Hence, the worst case time complexity of CCS is O(n3). The CUDA C implementation of CCS eliminates the outer loop and runs from step 5 to step 25 as CUDA C kernel function. This reduces the time complexity to O(n2). We compared the execution time of the CPU and GPU implementations of CCS. Figure 4 shows the speedup gained from CCS running on the NVIDIA Tesla K20 and K80 GPUs. For larger datasets the GPU-based CCS runs more than 20 times faster than the CPU-based CCS.

Figure 4
figure 4

Comparison between the execution time for the CPU and GPU implementations of CCS on gene expression dataset GDS531. The “x” axis shows the number base genes for bicluster search. The “y” axis shows GPU speedup from CPU vs. GPU execution times.

CCS biclusters are equivalent to condition-dependent coexpression network modules

A network module is usually defined as a highly connected sub-network. A module in coexpression network consists of a group of genes whose expression levels are highly correlated32. The correlative relations in a coexpression network often depends on conditions such as genotype, environment, treatment, cell type, disease state or developmental stage33,34,35,36,37,38. Therefore, coexpression network modules may also change with those conditions. There is an interesting relation between CCS biclusters and coexpression network modules. A CCS bicluster consists of genes whose expression levels are highly correlated under a set of conditions. Therefore, a CCS bicluster is equivalent to a condition-dependent module in a coexpression network. This has been illustrated here from two CCS-biclusters. We picked two CCS-biclusters from GDS589 dataset. The first bicluster (bicluster 1) includes 166 genes and 70 samples. The second bicluster (bicluster 2) includes 81 genes and 60 samples. We also identified all the neighboring genes of the biclusters. Neighboring genes are the genes that are not from a bicluster but they are correlated with at least one gene in that bicluster. Figure 5(A) shows the coexpression network of the genes in two biclusters and their neighboring genes. Because this network shows coexpression over all the samples, none of the biclusters are forming a network module. In contrast, when the coexpression network is constructed using correlations over the samples of bicluster 1, the genes of this bicluster form a network module, but the genes of bicluster 2 do not form a module (Figure 5(B)). Similarly, genes of bicluster 2 form a module when the coexpression network is based on correlations over samples of bicluster 2, but genes from bicluster 1 do not form a module in this case (Figure 5(C)).

Figure 5
figure 5

The coexpression network related to two CCS biclusters from GDS589. The blue, green and white nodes represent the genes from bicluster 1, bicluster 2 and the neighboring genes respectively. Red nodes are common genes between bicluster 1 and 2. Edges represent correlation >=0.8 or <=−0.8. (A) Coexpression network based on correlations over all the samples in GDS589. (B) Coexpression network based on correlations over the samples of bicluster 1 (C) Coexpression network based on correlations over the samples of bicluster 2.

Conclusion

In this work we designed a novel algorithm, CCS, to discover the functionally coherent biclusters from large-scale gene expression datasets. The performance of CCS was compared with thirteen widely used biclustering algorithms. CCS consistently outperforms all the other algorithms in the comparison for obtaining true biclusters from synthetic datasets and discovering functionally enriched biclusters from the real gene expression datasets. Moreover, the CCS biclusters are equivalent to condition-dependent modules in coexpression networks. This important feature makes the CCS algorithm also useful for the study of the condition-dependent structural characteristics of the coexpression networks. The biclusters and network modules of co-regulated and co-functional genes discovered by the CCS algorithm may provide an important entry point for many other analyses such as gene set enrichment analysis, regulatory network inference and disease genes identification.

The recent development of fast and accurate data acquisitions and quantifications in genomics, transcriptomics and proteomics greatly increased the data volume that needs to be processed by clustering algorithms. To enable rapid discovery of high quality biclusters, we implemented CCS algorithm using CUDA C for parallel GPGPU computing. For large datasets, the GPU implementation of CCS achieved about 20 fold speedup compared to the sequential version of CCS. We expect that the GPU- accelerated biclustering will have important applications in the time-sensitive analysis of genomic medicine data.