Gene regulatory network inference using fused LASSO on multiple data sets

Devising computational methods to accurately reconstruct gene regulatory networks given gene expression data is key to systems biology applications. Here we propose a method for reconstructing gene regulatory networks by simultaneous consideration of data sets from different perturbation experiments and corresponding controls. The method imposes three biologically meaningful constraints: (1) expression levels of each gene should be explained by the expression levels of a small number of transcription factor coding genes, (2) networks inferred from different data sets should be similar with respect to the type and number of regulatory interactions, and (3) relationships between genes which exhibit similar differential behavior over the considered perturbations should be favored. We demonstrate that these constraints can be transformed in a fused LASSO formulation for the proposed method. The comparative analysis on transcriptomics time-series data from prokaryotic species, Escherichia coli and Mycobacterium tuberculosis, as well as a eukaryotic species, mouse, demonstrated that the proposed method has the advantages of the most recent approaches for regulatory network inference, while obtaining better performance and assigning higher scores to the true regulatory links. The study indicates that the combination of sparse regression techniques with other biologically meaningful constraints is a promising framework for gene regulatory network reconstructions.

. Inferred Network from all data sets by applying proposed approach (E. coli data sets). A) Applying the proposed approach on 'heat', 'cold', 'oxidative' and 'lactose-diauxic' data sets results in inferrence of four similar networks for each data set (in concordance with the fusion term in the proposed approach). Therefore, the gene regulatory networks from all data sets can be inferred either by apply maximum or average of the obtained edge weights from each data set. B) Histogram of the differences between the regression coefficients obtained based on each of the four data sets for the same response and regressor. . The colored edges belong to the subnetwork retrieved from RegulonDB, where red edges denote activating, while blue edges indicate repressing regulatory relationships. The edges marked in green are of unspecified regulatory type. If an edge was predicted by a method but is not included in the network from RegulonDB, it is colored in black. The predicted edges for L 1 , L 1/2 , and L 0 regularization-based models, CLR, and GENIE3 are marked by 'L 1 ', 'L 1/2 ', 'L 0 ', 'R' and 'G', respectively, next to the corresponding edges. The letters are color-coded -red, blue or green fonts represent activating, repressing or unspecified relationships, respectively. The dotted edges denote the relationships predicted by the proposed approach. Illustrated are the predicted regulatory relationships and their types based on data from (A) cold, (B) heat, (C) oxidative stress, and (D) lactose-diauxic shift time-series experiments.  Figure S3. Histogram of the differences between the regression coefficients (Mus musculus data sets). The differences were calculated between the regression coefficients of the same regressor with all possible permutations (i.e., orderings) of the data sets. (A)-(D) Differences between the regression coefficients in networks inferred from all data sets, and each single data set from ES cell lines V6.5, R1, and J1, considering all possible orderings of data sets, respectively.  Table S1. ROC-based statistics for the compared methods(E. coli data set). This table includes AUROC values and the corresponding confidence intervals (CI), p-values from the comparison between proposed method and the compared method at each row, the corresponding adjusted p-values and the D-statistics which are obtained based on the four different data sets and their combination by using the compared methods. The overall AUROCscore, mean, and standard deviation of the AUROCs for each method is summarized in the last three columns, respectively. The ROC statistics are calculated using "pROC" package in R. In columns "padj" and "D", the entities highlighted in blue indicate the cases in which the proposed method significantly outperformed the other compared methods while the entities highlighted in yellow show the cases that the corresponding compared methos significantly outperformed the proposed method and the non-highlighted entities show the cases that the performance of the proposed method and the respective compared methods are comparable.

Time
47" Proposed approach per each regression/gene model (without cross validation) 3" Proposed approach per each regression/gene model (with cross validation) 9' 75"  Table S3. ROC-based statistics for the compared methods (MTB data set). The left (right) four columns contain the values for the AUROCs and AUPRs obtained from the sub-networks including the top 100 (31) highly ranked edges predicted by the compared methods. To infer the gene regulatory networks, all the compared methods applied on both: the gene expression levels (Expr. values) and the obtained log2-transformed gene expression fold changes between the control (time point 0) and the hypoxia and re-aeration time-series samples (LFC).

Data
Expr. values LFC Number of predicted true interactions in the top 100 highly ranked edges according to the gold standard network which includes 31 edges  Table S4. Inference of GRN based on the gene expression values vs. log2-transformed gene expression fold changes (MTB data set). The left part contains the values obtained from the networks inferred by applying the compared methods on the gene expression levels (Expr. values) and the values in the right part obtained from networks inferred by applying the compared methods on the log2-transformed gene expression fold changes between the control (time point 0) and the hypoxia and re-aeration time-series samples (LFC). In each part, the first 3 columns include the predicted edge weights for the interactions: Rv0081 −→ Rv3597c (Lsr2), Rv0081 −→ Rv3416 (whiB3), and Rv3133c ←→ Rv2034 (DosR). The value '--' indicates that the interaction is not predicted by the corresponding method. Fourth column indicates the number of interactions and the minimum edge rank in overlap between the gold standard network (with 31 edges) and the top 100 highly rankd edges predicted by the compared methods. The last three columns include the values for the AUROCs and AUPRs obtained from the sub-network with edges above the median ranks of the predicted interactions. Table S5. Top 100 highly ranked edges predicted by the proposed approach (MTB data set). The proposed approach has been applied to the log2-transformed gene expression fold changes between the control (time point 0) and the hypoxia and re-aeration time-series samples. The rows in bold represents predicted true interactions with respect to the gold standard network which includes 31 edges (page 180, [2]).

Rv0081
Rv2034 1  Table S6. Top 100 highly ranked edges predicted by the proposed approach (MTB data set). The proposed approach has been applied to the gene expression levels from combination of both conditions: hypoxia and re-aeration. The rows in bold represents predicted true interactions with respect to the gold standard network which includes 31 edges (page 180, [2]).

Rv0081
Rv2989 1  Table S7. ROC-based statistics for the compared methods (Mus musculus data set). The first four columns contain the values for the AUROCs and AUPRs obtained from the sub-networks including the top 500 and 248 highly ranked edges predicted by the compared methods. The fifth column includes selector values, while the sixth columns contains the number of interactions and the minimum edge rank in overlap between the gold standard network (with 248 edges) and the top 248 highly rankd edges predicted by the compared methods. The last three columns include median edge rank as well as the values of the AUROCs, and AUPRs obtained from the sub-network with edges above the median ranks of the predicted interactions.  Table S8. ROC-based statistics for the proposed approach considering all possible orderings of data sets (Mus musculus data set). To investigate the effect of the order in which the fusion penalty is formulated (Eq. 2), the networks were inferred from all possible permutation (i.e., orderings) of data sets from ES cell lines V6.5, R1, and J1. ROC-based statistics is obtained for the inferred networks. The first and the last two columns contain the values for the AUROCs and AUPRs obtained from the sub-networks including the top 500 and 248 highly ranked edges, respectively. The third column includes selector values, while the fourth columns contains the number of interactions and the minimum edge rank in overlap between the gold standard network (with 248 edges) and the top 248 highly rankd edges. The fifth, sixth and sevenths columns include median edge rank as well as the values of the AUROCs, and AUPRs obtained from the sub-network with edges above the median ranks of the predicted interactions.