A GPU-accelerated algorithm for biclustering analysis and detection of condition-dependent coexpression network modules

In the analysis of large-scale gene expression data, it is important to identify groups of genes with common expression patterns under certain conditions. Many biclustering algorithms have been developed to address this problem. However, comprehensive discovery of functionally coherent biclusters from large datasets remains challenging. Here we propose a GPU-accelerated biclustering algorithm based on searching for the largest Condition-dependent Correlation Subgroup (CCS) for each gene in a gene expression dataset. We compared CCS with thirteen widely used biclustering algorithms; CCS consistently outperformed all thirteen on both synthetic and real gene expression datasets. As a correlation-based biclustering method, CCS can also be used to find condition-dependent coexpression network modules. We implemented the CCS algorithm in C and a parallelized version in CUDA C for GPU computing. The source code of CCS is available from https://github.com/abhatta3/Condition-dependent-Correlation-Subgroups-CCS.

expressed with similar or opposite patterns of expression but the expression values are very different. Such modules are important as they may represent relations between genes involved in the same biological functions 1,13 .
A Pearson correlation coefficient based scoring function was introduced by Correlated Pattern Biclusters (CPB) 14 . However, the performance of CPB still depends on the random selection of samples. The benefit of Pearson correlation based similarity measures over conventional mean squared residue based bicluster scores was again demonstrated by the Bi-Correlation Clustering Algorithm (BCCA), which looks for positively correlated biclusters and reports a bicluster for each pair of genes present in a dataset 13 . Initially, all samples are selected for each pair of genes, and then a greedy search eliminates samples one at a time until the gene pair shows a positive correlation higher than a predefined threshold. The gene pair is then augmented with other correlated genes to form a bicluster. However, the backward elimination of samples in a greedy search tends to find a locally optimal subset of samples, and thus restricts the search space for finding the optimal set of biclusters. BIclustering by Correlated and Large number of Individual Clustered seeds (BICLIC) introduced another correlation-based alternative. BICLIC forms biclusters by starting from a large number of clustered seeds, followed by augmentation and deletion of rows and columns based on their correlations with the seeds 15 . The main drawback of BICLIC, which also holds for BCCA, is that it finds a large number of overlapping, nearly identical biclusters. Moreover, BICLIC, BCCA and other correlation-based biclustering methods only search for positively correlated gene groups as biclusters. In reality, negatively correlated genes are equally relevant and may represent important regulatory mechanisms in biological functions.
Here we propose a more effective correlation-based biclustering algorithm named Condition-dependent Correlation Subgroup (CCS). It integrates several features that are important for comprehensive discovery of functionally coherent biclusters 1 . A significant challenge in genomic data analysis is to utilize the fast-growing high-performance computing capacity to process and analyze large, complex datasets efficiently [16][17][18] . CCS is particularly suitable for parallel computing. We used GPGPU computing for a parallel implementation of the algorithm in CUDA C, which shows a substantial speedup compared to the sequential C program. The performance of CCS was compared with CPB, BCCA, BICLIC and ten other widely used biclustering algorithms on five synthetic and five real gene expression datasets. CCS outperformed the other biclustering algorithms in all comparisons. We also showed that there is an equivalence between CCS biclusters and condition-dependent coexpression network modules.

Methods
A bicluster is a group of genes with a related pattern of expression over a group of samples. We defined the related pattern of gene expression in terms of positive and negative Pearson correlation coefficients. Let us consider a gene expression dataset D = G × E, where G = {g_1, g_2, …, g_n} is a set of n genes and E = {e_1, e_2, …, e_s, …, e_m} is a set of m samples. Each gene g_i is associated with an m-dimensional vector x_i, in which x_is is the value of sample e_s for gene g_i. In our algorithm, a bicluster is a group of genes I over a group of samples J in which each gene g_i in I is correlated with every other gene g_j in I with an absolute correlation greater than a threshold θ. Mathematically, a bicluster C is represented as C = (I, J), where I is a subset of G and J is a subset of E. The Pearson correlation coefficient between g_i and g_j over a subset of samples J is denoted r(g_i, g_j)_J and defined as

r(g_i, g_j)_J = [ Σ_{s=1}^{m} b_s (x_is − x̄_i^b)(x_js − x̄_j^b) ] / sqrt( [ Σ_{s=1}^{m} b_s (x_is − x̄_i^b)^2 ] · [ Σ_{s=1}^{m} b_s (x_js − x̄_j^b)^2 ] )    (1)

where b is a bit vector of size m: if sample e_s is in J then b_s = 1, otherwise b_s = 0. The terms x_is and x_js represent the value of sample e_s for genes g_i and g_j respectively, and x̄_i^b and x̄_j^b are the average expression values of g_i and g_j over the samples in J. We considered a pair of genes g_i and g_j similar over a subset of samples J if |r(g_i, g_j)_J| > θ, where θ is the correlation threshold and |r(g_i, g_j)_J| is the absolute value of the correlation between g_i and g_j over J. For each gene g_i in the dataset D, CCS considers g_i as the base gene from which to start forming a bicluster.
Search space sorting and base-gene selection. Genes with lower variability over the samples (commonly housekeeping genes) are often considered less significant for the corresponding biological conditions, and hence less important candidates for forming a bicluster. Prior to biclustering, the search space (the input data matrix) was sorted by the standard deviation of the gene expression values. In the ordered data matrix, genes with higher variability were placed before those with lower variability.
The CCS algorithm selects 'base_number' genes as base genes from the sorted data matrix, in decreasing order of variability. Ordering the search space ensures that a gene with higher variability is considered for forming a bicluster, and for augmentation, before genes with lower variability. The 'base_number' can be set between 1 and n (the total number of genes). Here we set base_number to 1,000 for the biological datasets to restrict the search time. For each pair of genes g_i and g_j, the sample sets J_1 (up-regulated positive correlation), J_2 (down-regulated positive correlation) and J_3 (negative correlation) are selected using sample-selection rules 1, 2 and 3, respectively. For each sample e_s, s = 1…m, we defined the following three rules:

Rule 1. IF x_is > x̄_i AND x_js > x̄_j THEN include e_s in J_1 [expression values of g_i and g_j for the samples in J_1 are higher than their arithmetic mean expression values].

Rule 2. IF x_is < x̄_i AND x_js < x̄_j THEN include e_s in J_2 [expression values of g_i and g_j for the samples in J_2 are lower than their arithmetic mean expression values].

Rule 3. IF (x_is − x̄_i)(x_js − x̄_j) < 0 THEN include e_s in J_3 [expression values of g_i and g_j for the samples in J_3 are on opposite sides of their arithmetic mean expression values].

Here x̄_i and x̄_j are the mean expression values of genes g_i and g_j over all samples. The correlation between g_i and g_j for each sample set J_k (k = 1, 2, 3) is determined by first computing the Pearson correlation coefficient r and then comparing it against the threshold θ, which is denoted |r(g_i, g_j)_Jk| > θ in the algorithm.
Defining biclusters as condition-dependent correlation modules. CCS defines biclusters as condition-dependent correlation modules, where the genes in a bicluster are expected to show correlations only for the samples in that bicluster. CCS introduces a scoring function named BScore, defined in Equation (2). The BScore compares two sets of correlated gene pairs, N and M. The correlations in set N are computed over the samples included in a bicluster, while the correlations in set M are computed over the remaining samples not included in that bicluster. Thus, the BScore measures the degree to which the gene coexpression in a bicluster is condition-specific. In this work, we used a small BScore threshold (<0.01) to ensure that the discovered biclusters are based on condition-specific coexpression.
Algorithm. At each iteration of step 2, the algorithm starts with a new base gene g_i. In each iteration of step 5, g_i is paired with a gene g_j (i < j ≤ n). In step 6, the sample sets are selected for the gene pair g_i, g_j. In step 8, the algorithm computes the absolute value of the correlation |r(g_i, g_j)_Jk| for each sample set J_k (k = 1, 2, 3) using Equation (1). If |r(g_i, g_j)_Jk| > θ, the algorithm starts a bicluster with the initial gene set I_i = {g_i, g_j}. In step 10, the gene set is augmented by including a new gene g_p in I_i. In step 16, BScore(I_i, J_i) is compared against the previous best BScore(I, J) to update the sets I and J. The most condition-dependent bicluster for each base gene g_i is selected in step 22. In step 26, all overlapping biclusters are merged while keeping BScore < 0.01 for the final set of biclusters S.

The input parameter of our biclustering algorithm is the correlation threshold θ, the minimum absolute value of correlation between genes in a bicluster. Allocco et al. investigated the relation between co-regulation and coexpression and concluded that, for a correlation greater than 0.84, there is at least a 50% chance of co-regulation 19 . Here we set θ = 0.8. If no biclusters are found at θ = 0.8, we decrease θ by 0.05 until we obtain a nonempty set of biclusters or θ drops to 0.

GPU computing for biclustering analysis of large datasets. The GPU was initially introduced for rendering video graphics on display devices. In 2007, NVIDIA introduced the CUDA architecture for general-purpose computing on graphics processing units (GPGPU computing). The single-instruction, multiple-data (SIMD) architecture of the GPU is particularly useful for executing a single instruction across thousands of core processors on a GPU card. CUDA is currently the most popular programming model for general-purpose GPU computing.
CUDA C is an extension of regular C that supports GPGPU computing. The parallel part of a program is implemented as a CUDA C kernel function. Thousands of instances of a kernel function run in parallel on groups of GPU core processors known as thread blocks. Each thread block acts as a virtual processor formed from one or more GPU core processors together with registers, local memories, shared memory and global memory. CUDA C kernel functions are sent to the GPU for parallel execution, while the sequential parts of the code run on the CPU.
We used GPGPU computing to reduce the execution time. In this work, we executed our CUDA C code for CCS on NVIDIA Tesla K20 (2,496 CUDA cores) and K80 (4,992 CUDA cores) GPU accelerators. For compilation of the CUDA C code, we installed the CUDA 7.5 toolkit on our Linux workstation. Transferring large datasets between CPUs and GPUs is time consuming; to reduce the data transfer overhead, we moved the entire process of finding new biclusters to the GPU.
Step 2 of CCS executes n times when base_number is set to n for a dataset with n rows. For each execution, the inner loop at step 5 executes at most n − 1 times. To gain speed from parallel GPGPU computing, we removed the step 2 loop from the CCS algorithm and implemented steps 5 to 25 as a CUDA C kernel function.

Synthetic datasets. Five synthetic datasets were generated using the BiBench-0.2 python library 8 . The synthetic datasets contain two types of biclusters: constant biclusters and shift-scale biclusters. Constant biclusters are sub-matrices with a constant value, while shift-scale biclusters are formed by both shifting and scaling a base row by random numbers. Table 1 describes the synthetic datasets by type, dimensions and number of true biclusters.

Gene expression datasets.
Five publicly available gene expression datasets, GDS531, GDS589, GDS3603, GDS3966 and GDS4794, were downloaded from the Gene Expression Omnibus (GEO). Genes with a standard deviation less than 2.0 were removed. For genes that appeared multiple times, only the row (probe set) with the highest standard deviation was kept. The dataset information is summarized in Table 2.
Evaluating biclustering results. We evaluated the results of the biclustering algorithms on the synthetic datasets using the Recovery and Relevance metrics 8,20 . They compare the set of found biclusters F against the expected biclusters E (the true biclusters). Both Recovery and Relevance scores range between 0 and 1. A Recovery of 1 denotes that all the expected biclusters are found; similarly, a Relevance of 1 denotes that all the found biclusters are expected.
To evaluate the results on the real gene expression datasets, we performed functional enrichment tests for each bicluster 1,8,13,20 . We used gProfileR 21 for selecting the enriched gene annotation terms. gProfileR retrieves a comprehensive set of gene annotations from Gene Ontology (GO) terms, biological pathways, regulatory motifs in DNA, microRNA targets and protein complexes. All annotation terms with a Benjamini-Hochberg FDR less than 0.01 were considered enriched. From the biclustering results on all five gene expression datasets we computed the average number of enriched annotation terms; a higher average indicates better functional grouping. We also evaluated the performance of the biclustering algorithms by the percentage of biclusters with one or more enriched annotation terms.

Results and Discussions
We obtained biclustering results from CCS and 13 other biclustering algorithms, including the Cheng and Church biclustering algorithm (CC) 22 and the Iterative Signature Algorithm (ISA) 23 .

Performance on synthetic data. Figure 1 shows the Recovery and Relevance scores on the five synthetic datasets (Table 1) for CCS and, for comparison, the best of the other 13 algorithms. On all five synthetic datasets, CCS significantly outperformed the other algorithms in finding the expected biclusters (recovery) and the most relevant set of biclusters (relevance). For constant biclusters on a dataset of 100 rows and 100 columns, CCS obtained a score of 0.88 for both recovery and relevance. For larger datasets and for shift-and-scale biclusters, the recovery and relevance scores remain higher than those of the other algorithms, which demonstrates the accuracy of CCS in recovering the relevant biclusters irrespective of the number and pattern of the biclusters and the size of the datasets.
Performance on gene expression data. The CCS biclustering algorithm obtained a total of 25 biclusters from the five gene expression datasets (Table 2). Figure 2 shows the average number of enriched terms from the gProfileR enrichment analysis: CCS produced more enriched terms per bicluster than the other algorithms. Figure 3 compares the biclustering algorithms by the percentage of biclusters with at least one enriched term; again, CCS obtained the highest percentage of enriched biclusters.

Table 2. Summary of the gene expression datasets and the number of CCS biclusters obtained for θ = 0.8.

GPU accelerated biclustering. The CPU version of the CCS algorithm is computationally expensive. With base_number set to n, the outer loop at step 2 executes n times; for each of these iterations, the pairing loop at step 5 executes up to n times and the augmentation task at step 10 executes another n times. Hence, the worst-case time complexity of CCS is O(n^3). The CUDA C implementation of CCS eliminates the outer loop and runs steps 5 to 25 as a CUDA C kernel function, reducing the time complexity to O(n^2). We compared the execution times of the CPU and GPU implementations of CCS. Figure 4 shows the speedup gained by CCS running on the NVIDIA Tesla K20 and K80 GPUs: for larger datasets, the GPU-based CCS runs more than 20 times faster than the CPU-based CCS.

CCS biclusters are equivalent to condition-dependent coexpression network modules. A network module is usually defined as a highly connected sub-network. A module in a coexpression network consists of a group of genes whose expression levels are highly correlated 32 . The correlative relations in a coexpression network often depend on conditions such as genotype, environment, treatment, cell type, disease state or developmental stage [33][34][35][36][37][38] . Therefore, coexpression network modules may also change with those conditions. There is an interesting relation between CCS biclusters and coexpression network modules: a CCS bicluster consists of genes whose expression levels are highly correlated under a set of conditions, so a CCS bicluster is equivalent to a condition-dependent module in a coexpression network. We illustrate this with two CCS biclusters picked from the GDS589 dataset. The first bicluster (bicluster 1) includes 166 genes and 70 samples; the second (bicluster 2) includes 81 genes and 60 samples. We also identified all the neighboring genes of the biclusters, i.e., genes that are not in a bicluster but are correlated with at least one gene in that bicluster. Figure 5(A) shows the coexpression network of the genes in the two biclusters and their neighboring genes. Because this network shows coexpression over all the samples, neither bicluster forms a network module. In contrast, when the coexpression network is constructed using correlations over the samples of bicluster 1, the genes of this bicluster form a network module, but the genes of bicluster 2 do not (Figure 5(B)). Similarly, the genes of bicluster 2 form a module when the coexpression network is based on correlations over the samples of bicluster 2, but the genes of bicluster 1 do not (Figure 5(C)).

Conclusion
In this work we designed a novel algorithm, CCS, to discover functionally coherent biclusters from large-scale gene expression datasets. The performance of CCS was compared with thirteen widely used biclustering algorithms. CCS consistently outperformed all the other algorithms, both in obtaining true biclusters from synthetic datasets and in discovering functionally enriched biclusters from real gene expression datasets. Moreover, CCS biclusters are equivalent to condition-dependent modules in coexpression networks. This important feature also makes the CCS algorithm useful for studying the condition-dependent structural characteristics of coexpression networks. The biclusters and network modules of co-regulated and co-functional genes discovered by the CCS algorithm may provide an important entry point for many other analyses, such as gene set enrichment analysis, regulatory network inference and disease gene identification. Recent developments in fast and accurate data acquisition and quantification in genomics, transcriptomics and proteomics have greatly increased the data volume that needs to be processed by clustering algorithms. To enable rapid discovery of high-quality biclusters, we implemented the CCS algorithm using CUDA C for parallel GPGPU computing. For large datasets, the GPU implementation of CCS achieved about a 20-fold speedup compared to the sequential version. We expect that GPU-accelerated biclustering will have important applications in the time-sensitive analysis of genomic medicine data.