Gene regulatory network inference using fused LASSO on multiple data sets

Omranian, Nooshin; Eloundou-Mbebi, Jeanne M. O.; Mueller-Roeber, Bernd; Nikoloski, Zoran

doi:10.1038/srep20533

Download PDF

Article
Open access
Published: 11 February 2016

Gene regulatory network inference using fused LASSO on multiple data sets

Nooshin Omranian^1,2,
Jeanne M. O. Eloundou-Mbebi¹,
Bernd Mueller-Roeber² &
…
Zoran Nikoloski¹

Scientific Reports volume 6, Article number: 20533 (2016) Cite this article

11k Accesses
94 Citations
1 Altmetric
Metrics details

Subjects

Abstract

Devising computational methods to accurately reconstruct gene regulatory networks given gene expression data is key to systems biology applications. Here we propose a method for reconstructing gene regulatory networks by simultaneous consideration of data sets from different perturbation experiments and corresponding controls. The method imposes three biologically meaningful constraints: (1) expression levels of each gene should be explained by the expression levels of a small number of transcription factor coding genes, (2) networks inferred from different data sets should be similar with respect to the type and number of regulatory interactions and (3) relationships between genes which exhibit similar differential behavior over the considered perturbations should be favored. We demonstrate that these constraints can be transformed in a fused LASSO formulation for the proposed method. The comparative analysis on transcriptomics time-series data from prokaryotic species, Escherichia coli and Mycobacterium tuberculosis, as well as a eukaryotic species, mouse, demonstrated that the proposed method has the advantages of the most recent approaches for regulatory network inference, while obtaining better performance and assigning higher scores to the true regulatory links. The study indicates that the combination of sparse regression techniques with other biologically meaningful constraints is a promising framework for gene regulatory network reconstructions.

Gene regulatory network inference from sparsely sampled noisy data

Article Open access 13 July 2020

Atte Aalto, Lauri Viitasaari, … Jorge Gonçalves

Relationship between gene regulation network structure and prediction accuracy in high dimensional regression

Article Open access 01 June 2021

Yuichi Okinaga, Daisuke Kyogoku, … Kei Hirose

RENGE infers gene regulatory networks using time-series single-cell RNA-seq data with CRISPR perturbations

Article Open access 28 December 2023

Masato Ishikawa, Seiichi Sugino, … Atsushi Mochizuki

Introduction

Biological systems on different levels of organization, from organelles and single cells to tissues, organs and entire organisms, constantly sense the environment and modulate their behavior to ensure optimal performance and fitness^1,2,3. The sensing of the environment is accomplished via numerous molecular mechanisms which ultimately result in coordinate activation and suppression of, often multiple, regulatory cascades affecting different and mutually dependent cellular processes. By propagating the perceived signal, the expression levels of genes coding for transcription factors (TFs) are adequately altered, leading to changes in the levels of transcripts encoding enzymatic proteins which affect metabolism and organism’s tasks⁴. Therefore, accurate reconstruction of the complete set of regulatory interactions, forming gene regulatory networks, is one of the key tasks in systems biology⁵.

In biological systems, the gene regulatory interactions are transitory, as they depend on different factors, including: developmental, environmental, as well as internal, given by the genetic make-up of the organism⁶. High-throughput technologies for simultaneous measurement of gene expression have been used to capture the transitory behavior of thousands of genes upon internal and external perturbation in different biological systems, from bacteria and yeast to algae, plants and animals^7,8,9. The gathered gene expression levels reflect the underlying regulatory relationships, and, thus, can readily be used to reconstruct the operational regulatory networks.

With the increasing number of performed time-series experiments methods are needed to extract gene regulatory networks supported by all gathered data sets simultaneously. These experiments are over different time domains, with different sampling frequency under various conditions and conducted in different laboratories, which may affect the success of network reconstruction¹⁰. In addition, each of these experiments is usually accompanied by a corresponding reference control experiment, whose profiles are used to determine differential gene behaviors^11,12.

Reconstruction of gene regulatory networks is a classical problem in computational systems biology and various methods based on different sets of assumptions and applicable on data from particular experiments have been proposed, critically assessed and systematically reviewed^13,14,15. In general, inference of gene regulatory networks begins with application of a similarity measure of choice (reviewed in¹³) on the investigated data set, resulting in a square similarity matrix. This similarity matrix can be sparsified by retaining only the values which are statistically significant after multiple hypothesis testing.

The resulting (sparsified) matrix usually includes many indirect relationships which need to be detected and removed to increase the power of the network inference approach. Therefore, a major challenge in inferring gene regulatory networks is the identification of direct (causal) relationships between genes. The classical approach to detect indirect relationships is based on partial correlations, which imposes the control of one gene on the relationship of others. For instance, the approaches of^16,17,18 attempt to find the causal interactions with partial correlations of small order. Moreover, in Gaussian Graphical Models (GGM)^19,20,21,22 two genes are deemed to be directly related if their levels are dependent after removing the effect of all other variables²³. To this end, extensions based on regularization methods (L₁-norm) have been developed²⁴. Other approaches rely on information theoretic-based similarity measures^25,26,27; for instance, the algorithm for the reconstruction of accurate cellular networks (ARACNE)²⁵ is based on Gaussian kernel estimator to determine mutual information between the expression profiles of genes with a sparsity constraint (to filter out non-significant as well as indirect interactions); or context likelihood of relatedness (CLR) which estimates the likelihood of each mutual information score based on the local network context. In addition, approaches rooted in Bayesian Networks (BN) employ probabilistic graphical models in order to infer causal relationships between genes as random variables^28,29,30,31. Moreover, the combination of L₁-penalized likelihood approach has been applied to structure learning of causal Gaussian Bayesian networks³²; however, this approach also cannot be used to predict feedback loops and the type of regulatory interactions. The most recent methods aim at detecting indirect relationships based on (weighted) structural investigations of networks following eigenvalue decomposition and its derivatives^33,34, leading to higher inference power than the classically applied approaches. The latter approaches have been thoroughly evaluated for their dependence on parameters and usage with particular data sets, indicating that claims about their performance warrants further examination on additional test cases³⁵.

While these approaches aim at reconstructing bilateral relationships, regression-based methods extract one-to-many relationships between genes based on the corresponding transcriptomics profiles^36,37,38. A frequently applied regression-based technique in the analysis of high dimensional data involves regularization, whereby the learned models are sparse. Regularized regression models, including ridge and least absolute shrinkage and selection operator (LASSO)³⁹, have been already used to infer gene regulatory networks with promising results^40,41. In addition, LASSO has been successfully used in model learning not only in computational biology, but also signal processing, medical imaging and economics^42,43,44. GENIE3 decomposes the inference of gene regulatory networks into P (number of genes) regression problems and tree-based ensemble methods are used to predict the expression profiles of the transcription factor target genes from the expression profiles of the remaining genes²¹. Recently, Qin et al. applied L₀, L_1/2 and L₁-regularization models to infer gene regulatory networks⁴⁵. While the performances of these models were comparable to the other compared approaches, increase in accuracy was observed due to the consideration of ChIP-seq/ChIP-chip data in fixing the initial connections between TFs and targets (i.e., initial values for the regression coefficients).

Finally, modern techniques try to harness the “wisdom of crowd” concept by integrating the predictions from different computational approaches (thoroughly evaluated in the DREAM contest⁴⁶) as well as over multiple data sets in a single network termed consensus gene regulatory network^14,47,48. For instance, Hase et al. grouped the algorithms by applying Euclidean distance on the confidence scores of the links in the corresponding inferred networks⁴⁸. They showed that integration of diverse algorithms outperform the combination of the individual inference methods. Moreover, Villa-Vialaneix et al. introduced a GGM based on two penalties: the first supports the global sparsity, while the second minimizes the differences between the coefficients of networks inferred from different data sets and a consensus network⁴⁷. The consensus network was obtained in three ways: by the average of the estimated coefficients over conditions, a priori biological knowledge and pre-calculated coefficients from application of GGM on the combined data sets.

Despite these advances, the problem of accurate reconstruction of gene regulatory networks is far from fully resolved, especially since many of these methods make assumptions about the underlying dynamics of the networks leading to the particular read-outs employed in the reconstruction. Here, we propose a new approach for reconstructing gene regulatory networks, whereby we control for the sparsity while simultaneously harnessing the evidence from multiple data sources. Integrating multiple data sets from different sets of conditions is rooted in the “wisdom of crowds” concept to predict the consensus gene regulatory network. Furthermore, the novelty of the approach lies in combining the similarity in differential behavior between genes with regularized regression methods. To this end, we demonstrate that the approach can be formulated in a form of a fused LASSO model. Our comparative analysis shows that the proposed method is comparable to the most recent contending alternatives as well as classical methods when applied on time-series expression data obtained from two prokaryotic species, Escherichia coli and Mycobacterium tuberculosis, as well as one eukaryotic species, mouse (Mus musculus), but is superior with respect to the performance and assigns higher scores to the true regulatory links. The comparative study includes not only the network deconvolution algorithm and the global silencing (with specification of the used parameter values and precise steps of execution), but also the methods against which these approaches have been gauged in the recent critique³⁵—Gaussian graphical models as well as Bayesian networks, ARACNE, CLR and GENIE3, in addition to the different regularization-based models: L₀, L_1/2 and L₁ (see Methods). Furthermore, although it has been reported that integrating ChIP-seq/ChIP-chip and gene expression data sets enhances the performance of the regularization-based approaches⁴⁵, the majority of existing computational approaches still rely on gene expression data sets; the usage of data sets from one type facilitates the fair comparison of performance of different approaches. The results of the comparison indicate that the combination of sparse regression techniques with time-resolved differential gene behavior is a powerful approach which can be readily extended to include other biologically meaningful constraints in inferring gene regulatory interactions from time-series data.

Results

Formulation of the proposed method

LASSO is a linear regression that penalizes the sum of absolute value of the regression coefficients⁴⁹. It relies on combining the L₂-norm, i.e., the residual sum of squares, with the L₁-norm of the regression coefficients, which amounts to sparsity due to shrinking the coefficients towards zero. Moreover, the proposal for considering fusion in the classical LASSO formulation was intended to address problems with a meaningful order in the considered features (i.e., regressors). In fused LASSO, minimization of the L₁-norm is imposed not only on the (ordered) regression coefficients, like in LASSO, but also on the consecutive differences of regression coefficients based on the assumed order of corresponding regressors⁵⁰.

In this section we formulate the approach which extends the concept of fused LASSO by incorporating information about the similarity of differential behavior for the response and regressor genes. Here, every gene is taken as response and the genes which code for transcription factors (known as TF genes) are used as regressors. The fusion imposes the similarity constraint on the obtained networks from each data set. In this way, the approach ensures sparsity, commonly observed in gene regulatory networks, with biologically meaningful data-driven evidence for the inferred relationships. In addition, we introduce the experimental scenario and type of data on which the approach is readily applicable.

Experimental scenario and weight matrices

For consistency throughout the text, all superscripts refer to the experiments/conditions and subscripts denote genes. We will define the approach for P genes used as regressors and a single gene used as a response. The approach assumes that k gene expression data sets, denoted by Xⁱ, are gathered under k different conditions alongside corresponding reference (control) data sets X^c,i, 1 ≤ i ≤ k, all of them containing the expression levels of P genes over Nⁱ, 1 ≤ i ≤ k, time-points or perturbations (Fig. 1, step 1 and 2). Note that the data sets from the k different conditions do not have to be over the same time domain or with the same sampling frequency; the only requirement is that each data set allows to reliably obtain the corresponding covariance matrix for the P genes. In addition, the profiles Yⁱ for the single response gene over the k conditions together with the corresponding control profiles Y^c,i, 1 ≤ i ≤ k are given.

Since reference data sets X^c,i, 1 ≤ i ≤ k, are available, we first determine the differential behavior for each of the regressors as well as for the response in each condition. To this end, we rely on the gene-specific B-statistic¹¹ corresponding to the log-odds that the gene is differentially expressed at a given time point of a particular condition in comparison to the control. Let , 1 ≤ j ≤ P, 1 ≤ i ≤ k, 1 ≤ t ≤ Nⁱ be the probability that gene j at time point t under condition i is differentially expressed. Then the matrix Prⁱ of dimension gathers the probabilities for the time-dependent differential behavior of the considered genes, estimated by the corresponding B-statistics. The B-statistics was estimated for each gene and at each time point by comparing data set Xⁱ from the treatment and the respective control X^c,i, 1 ≤ i ≤ k.

The derived probabilities can be used to define weight matrices for each condition i (1 ≤ i ≤ k), denoted by , which capture the information on the similarities between the response gene and each of the regressor genes based on their differential behavior (Fig. 1, step 3), by using the following:

where and are the probabilities of differential behavior for the response gene Y and the regressor gene j (1 ≤ j ≤ P), respectively, at time point/perturbation t.

If the value of is close to zero, the j^th regressor gene has, on average, a similar differential behavior to the response over all time points/perturbations in a data set from one condition. More specifically, is close to zero if either both genes are differentially expressed or they are both not affected by the perturbation. Due to the symmetric nature of the absolute value in Eq. (1), the weight matrix Wⁱ is symmetric with zeros along the diagonal.

Regression-based model

Having k time-resolved gene expression data sets including P + 1 genes and Nⁱ, 1 ≤ i ≤ k, time points together with the corresponding control data set(s), we aim at formulating a model which captures the following three criteria: (1) the expression levels of each gene should be explained by the expression levels of a small number of transcription factor (TF) coding genes, i.e., the corresponding regression should be sparse to reduce the number of false positives and likely increase the detection of direct relationships; (2) the regulatory networks should be inferred simultaneously over the given k data sets to simultaneously explain all analyzed conditions; and (3) a direct link should be preferred for genes which exhibit similar differential behavior over the considered conditions, since a differentially behaving gene (TF) is likely to alter the behavior of a direct target. Here, for the purpose of building the regression models, we focus on using the data sets from the k conditions only, as they are potentially more informative about the responsive genes in comparison to the reference state. In addition, the control data set enter the modeling through the probabilities of differential behavior and are, thus, also employed in the reconstruction of the gene regulatory network.

The first criterion can be captured by the classical LASSO, so that the regressors with non-zero regression coefficients are considered to be involved in a direct relationship with the response gene, yielding the gene regulatory network. To address the second criterion of simultaneously inferring k networks from the k different data sets, one has to ensure that: (i) the integration and transformation of the data should be performed in such a way that the LASSO penalty is simultaneously imposed to all data sets and (ii) the k reconstructed networks should be as close as possible in terms of edges (i.e., relationships) their strength given by the coefficients and the sign of the relationship (activating or repressing).

The integration and transformation of the data sets is achieved as follows: the k transcriptomics data sets are combined into a single block matrix X of dimensions kN × kP, where , Y is a collection of the profiles for the response gene from the different conditions, resulting in a vector of dimension kN × 1. As a result, the estimated coefficients form a kP × 1-vector corresponding to the edges between the response gene Y and regressor genes in the k network. The vector of estimated regression coefficients is represented with β.

To ensure that the k reconstructed networks are as close as possible, we add the fusion LASSO term , where β′ = [β¹, β², …, β^{k − 1}]^T, β′ = [β², β³, …, β^k]^T and the order of the k data sets is arbitrarily selected. In a regression setting, this fusion term imposes the constraint that the sum of absolute differences between the estimated coefficients of the same regressor over the consecutive data sets, with the assumed arbitrary order, is minimized (Supplementary Fig. S3 and Supplementary Table S8). This idea differs from the most recent approach whereby networks reconstructed from different data sets are individually obtained and later combined in a consensus network by existing techniques^14,51.

The third criterion about the similarity of differential behavior between the response and regressors implies the inclusion of the weight matrix Wⁱ, 1 ≤ i ≤ k, so that the regressor of higher explanatory power, which are associated with non-zero regression coefficients, over the multiple data sets are less penalized in the fused LASSO regression. This can be achieved by multiplying the regression coefficients with the weight matrix W (of size kP × kP), gathering the similarity of differential behavior between each of the P regressors and the single response gene under the k conditions. As a result, the expression is included as a modified penalty in the regression.

Therefore, the final model for reconstructing the gene regulatory interactions is given by the following fusion LASSO formulation over the k given data sets (Fig. 1, step 4):

where Y is the response gene which is regulated by the regressors with non-zero regression coefficients.

Gene expression data sets

Escherichia coli

The responses of Escherichia coli to stress conditions have already been well-investigated, resulting in characterization of the general and condition-specific components that regulate transcriptional changes underlying the adjustment to changing environments^52,53. The gathered data sets provide an excellent test case to which the performance of the proposed method and the contending alternatives can be readily compared.

The time-resolved transcriptomics data sets gathered with microarray technology were obtained from⁵⁴, where changes in the expression of genes from E. coli strain MG1655 were monitored under four stress conditions, including: non-lethal temperature shifts, i.e., heat and cold treatment, oxidative stress (by adding hydrogen peroxide), lactose-diauxic shift (i.e., change of primary carbon source) relative to cultures grown under optimal conditions, referred to as control. All cultures were grown simultaneously in the same condition and different perturbations were applied at the early mid-log phase (OD 0.6). The sampling was carried out from time points 10–50 min post-perturbation (at 10 min intervals) and two control time points before each perturbation for all considered conditions (data are available at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20305).

The reasons for selecting and using these data sets are that the predictions can be readily validated against the gold standard benchmark for E. coli provided by the DREAM5 challenge¹⁴ as well as the experimentally verified regulatory network interactions from RegulonDB⁵³. In RegulonDB, the information about the effect of the regulators (in terms of activating and repressing interactions) on the target genes is also provided. Moreover, these data sets satisfy the requirements of the proposed approach, since the single control can be used to establish the differential gene behavior across the data sets upon application of the stresses. In addition, the data are gathered from the same laboratory following the same protocol, thus, reducing the level of noise. Finally and most importantly, we aimed at using real-world data sets to estimate the actual performance of the proposed method in a realistic setting rather than from simulated instances.

Mycobacterium tuberculosis

Mycobacterium tuberculosis (MTB) is a pathogenic bacterium whose gene regulatory network is poorly understood. However, recently Galagan et al.⁵⁵ made the first step in reconstructing the complete regulatory network of MTB based on the ChIP-seq and microarray gene expression data gathered under conditions of hypoxia and re-aeration. They performed time-resolved transcriptomics experiments in which the expression level of genes were measured at 1, 2, 3, 5 and 7 days after culturing MTB strain H37Rv in bacteriostatic hypoxia conditions. The samples were then placed back in the aerobic rolling culture and the gene expression levels were measured after 1, 2, 4, 5 and 7 days of re-aeration. The expression levels were also measured at time point 0 which is considered as the control data set for the differential analysis (data are available at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE43466).

The reasons for selecting and using these data sets are that: (i) the predictions can be validated with respect to recent real-world data sets from non-model species, (ii) it satisfies the requirements of our approach for inferring the gene regulatory network from multiple data sets, as the gathered time series can be considered as two separate time-resolved data sets (conditions): hypoxia and re-aeration while including time point 0 as a control and (iii) the predictions can be readily compared to the small sub-network of MTB which was partially verified through experiments in⁵⁵ (Fig. 2 in⁵⁵).

Mus musculus

Mus musculus represents a model organism for understanding human biology and disease; therefore, studying its large-scale gene regulatory network is an important challenging task. Many experiments have been performed on mouse embryonic stem (ES) cells to explore the details of their pluripotency and ability to self-propagate and renew. To this end, Sene et al.⁵⁶, measured the expression level of genes from three genetically distinct mouse ES cell lines (R1, J1 and V6.5) during differentiation at 11 time points: 0 h, 6 h, 12 h, 18 h, 24 h, 36 h, 48 h, 4 d, 7 d, 9 d and 14 d. The expression levels at time point 0 are considered as the control data set for the differential analysis (the processed data has been downloaded from http://www.maayanlab.net/ESCAPE/browse.php).

This data set provides a good test case for comparing the performance of the compared approaches for gene regulatory network extraction in a higher organism. Moreover, the literature-based embryonic stem cell network is used as a global standard network to validate the predictions obtained from the compared approaches (downloaded from iScMiD (Integrated Stem Cell Molecular Interactions Database) http://amp.pharm.mssm.edu/iscmid/literature/index.htm).

Comparative analysis

For the comparative analysis we considered the most recent as well as the state-of-the-art approaches which include two criteria, sparsity of the network and removal of indirect relationships. To this end, we used the following approaches: the global silencing³³, network deconvolution³⁴, Gaussian Graphical Models (GGM)¹⁹, mutual information (ARACNE²⁵ and CLR²⁶), Bayesian Network (catnet)⁵⁷, the GENIE3 approach²¹ and different regularization-based models⁴⁵ (more details in the Methods section).

Escherichia coli

The resulting networks from these approaches were compared to the network including experimentally verified regulatory relationships from RegulonDB and the gold standards from the DREAM5 challenge. The findings were summarized in terms of the resulting true positive and false positive rates, which resulted in the corresponding ROC (receiver operating characteristic) curves illustrated in Fig. 2 (based on the used threshold value for retaining a weighted edge in the network). We used the R package minet⁵⁸ to plot the ROC curves and the statistics obtained by using this package are summarized in Table 1. As illustrated in Fig. 2, the proposed approach allowed the prediction of regulatory relationships at a higher percentage of true positives in comparison to the GENIE3, network deconvolution and the global silencing methods, while GGM, ARACNE, CLR and regularization-based models performed poorly for accurate inference of the regulatory network. Moreover, it is evident that, on the used data sets, the proposed method based on the extension of the fused LASSO performed similarly to the considered contending approaches when only the connectivity (i.e., presence of an edge/relationship) is considered. To further quantify the performance of the compared methods, we calculated the area under the ROC curves (AUROCs), the area under PR (precision-recall) curves (AUPRs) and the true positive rates (TPR) at low false positive rate (most of the compared methods showed maximum difference between TPR and FPR at FPR = 0.03), presented in Table 1. We used the R package pROC⁵⁹ to estimate the AUROC curves and their respective confidence intervals (CI) (Supplementary Table S1). The other statistics resulting from the comparison of the considered approaches are summarized in Supplementary Table S1. Altogether, our findings demonstrate that the performance of the proposed approach is statistically similar to the performance of GENIE3 and network deconvolution, while outperforming the remaining of contending methods.

Table 1 ROC-based statistics for the compared methods (E. coli data sets).

Full size table

In our approach, the network over all data sets was obtained by taking the maximum or the average of the corresponding regression coefficients (see Supplementary Fig. S1 (a) for illustration). Since the distribution of the differences between the regression coefficients obtained based on each of the four data sets for the same response and regressors are almost identical (see Supplementary Fig. S1 (b)), the two networks are not expected to differ. Indeed, the comparison between the two networks obtained by taking the average and the maximum of the regression coefficients does not show a significant difference with respect to the AUROC values (p-value = 0.6257, see Supplementary Table S1).

The performance of each method over the four data sets and their combination was summarized by the geometric mean for the five corresponding AUROCs and AUPRs, while the OverallScore is the average of the calculated AUROCscores and AUPRscores¹⁴:

The OverallScore in Eq. (3) can be used to rank the compared methods, such that a larger value corresponds to a better-performing method. As shown in Table 1, the performance of our method was superior to that of other contending methods based on the OverallScore. Moreover, the standard deviation of the AUROC curves for the proposed method is smaller in comparison to the contenders (see Supplementary Table S1). The stability of the obtained networks further supports the usage of the fusion term in the proposed approach, which was considered for the very purpose of producing similar networks across different data sets.

We next determined the smallest normalized weight at which the first true positive edge can be detected—termed selector value; selector values closer to 1 indicate that the true positive edges are likely assigned higher weights. As shown in Table 2, the proposed method performed the best for networks obtained over all data sets. Moreover, our approach is in the two best performing methods when considering individual data sets. The discrepancy can be explained by the inclusion of the fusion term which insists on context-independent regulatory relationships. Since the edge weights of the networks obtained by ARACNE with each of the four data sets were identical, the corresponding selector values are not informative and cannot be used in the comparison. Interestingly, the network deconvolution and the global silencing did not perform well with respect to the selector value, despite recent claims on synthetic data. The reasonably high selector values obtained from regularization-based models further confirm the power of regularization-based approaches at assigning higher scores to the true positive edges.

Table 2 Selector value - E. coli data sets.

Full size table

Furthermore, the accuracy of inferred gene regulatory networks depends on the ability to predict the type of the regulatory relationships—activating or repressing. Therefore, we compared the performance of the methods regarding the type of regulatory relationships predicted. To this end, we used the network from RegulonDB which includes in total 4566 relationships. Table 3 summarizes the percentages of the predicted true activating and repressing regulatory interactions for the four data sets and the considered methods. In addition, it includes the percentages of the true positive and true negative relationships predicted by different methods, irrespective of the type of regulation.

Table 3 Prediction of regulatory types - E. coli data sets.

Full size table

ARACNE and CLR as well as GENIE3 are not able to infer the type of regulatory relationships. As mentioned in Section 3.2, the original algorithms for the global silencing and the network deconvolution do not provide the sign of interactions, since they aim at ranking the regulatory interactions. As shown in Table 3, our findings indicate that the performance of the proposed method with respect to the prediction of the type of regulatory relationships is promising. In addition, GGM as well as regularization-based models resulted in the largest fraction of true negative edges as well as the smallest fraction of true positive edges. The deconvolution method, however, outperformed the rest of the approaches with respect to the fraction of true positive edges for cold and oxidative stress as well as lactose-diauxic shift (albeit with much smaller selector values, see Table 2).

In addition, to specifically inspect the efficiency of the inferred networks, we selected a subnetwork from RegulonDB including four sigma factors, RopD, RopE, RopH and RopS, included in the analyzed data sets. Sigma factors are proteins required for initiation of RNA synthesis⁶⁰ and their activities depend on the environmental conditions. While RopD is the primary sigma factor which transcribes most of the genes, the activities of the other three are environment-specific; for instance, RopE is the extreme heat stress sigma factor, RopH is the heat stress sigma factor activated upon heat exposure and RopS is the stationary phase sigma factor. Interestingly, for the four sigma factors, no edges were predicted by GGM. With respect to the six other approaches as well as catnet (applied only on genes included in the illustrated networks), by inspecting Fig. 3 (for the sake of readability, the resulting networks for CLR, GENIE3 , , and -regularization models are shown in the Supplementary Fig. S2), it becomes evident that the proposed approach predicted a considerably smaller number of false positives, due to the sparsity constraints and a comparable number of true positives, due to the regression-based formulation. In addition, as a result of the imposed constraints in the fused LASSO formulation, the networks extracted for the four different data sets based on our method are exactly the same (except for slight differences in the weights - see Supplementary Fig. S1), which is biologically expected. However, this was not the case for the networks reconstructed by ARACNE, CLR, GENIE3, catnet, the network deconvolution and the global silencing methods; in addition, the networks from GENIE3, catnet and the deconvolution method were the densest over all data sets. Altogether, regularization-based models largely underestimated the true positive edges, although they were consistent in predicting the edges for each single data set. The proposed approach performed particularly well with respect to predicting the type and not only the presence of a regulatory relationship, evident in the cases of the RpoH and RpoD sigma factors.

Finally, we obtained the run time of each inference method which is presented in Supplementary Table S2. Since the proposed method includes many independent regression models, we parallelized the approach (as described in Methods). Therefore, the run time of the proposed method is only provided for performing a single regression. Clearly, the algorithms which rely on matrix inversions in the presence of hidden parameters (e.g., network deconvolution) are faster in comparison to the proposed method which requires solving multiple regressions. While cross validation is expected to increase the computational demand, if a single value is used across all genes (i.e., models) the regulatory interaction in the vicinity of a gene can be inferred in a drastically reduced time in order of a few seconds.

Mycobacterium tuberculosis

Motivated by the predictions from applying the proposed approach on the combination of all E. coli data sets, we next investigated the comparative analysis only on the combination of both data sets from MTB (i.e., hypoxia and re-aeration). To obtain gene regulatory networks, we applied all compared methods to the pre-processed gene expression values as well as the log 2-transformed gene expression fold changes between control (time point 0) and the hypoxia and re-aeration time-series samples. The ROC analysis for the compared methods was then obtained by using the R package minet⁵⁸ for both: the top 31 (as the gold standard network includes 31 edges) and 100 highly ranked edges and the corresponding statistics are summarized in Supplementary Table S3. It is evident that the proposed approach allowed the prediction of regulatory relationships at a higher percentage of true positives in both cases of using gene expression values and log 2-transformed gene expression fold changes when comparing to the performance of GENIE3, network deconvolution and global silencing methods. Likewise with E. coli, GGM, ARACNE, CLR and regularization-based models performed poorly for accurate inference of the regulatory network; however, the performance of the majority of the compared methods was increased upon applying to the log 2-transformed gene expression fold changes. In addition, the selector values for all compared methods were the same and equal to one.

Furthermore, to specifically inspect the efficiency of the inferred networks, we investigated on three experimentally verified TF regulatory interactions that are highlighted in the Result section from Galagan et al. (page 180,⁵⁵): TF Rv0081 negatively regulates TFs Rv3597c and Rv3416 (whiB3) (Rv0081 → Rv3597c (Lsr2), Rv0081 → Rv3416 (whiB3)) and TF Rv3133c and Rv2034 (DosR) negatively regulate each other (Rv3133c ↔ Rv2034 (DosR)). All interactions and their regulatory types were successfully predicted by the proposed approach with sufficiently high ranks (Supplementary Table S4). Global silencing as well as network deconvolution methods were not successful with respect to the type of regulatory interactions and obtained rather low ranks, while GENIE3 predicted all interactions with high ranks but is not able to infer the type of regulatory relationships. Likewise, GGM, ARACNE and CLR were not successful in predicting the interactions, while regularization-based models performed very inconsistent with respect to the sign and predicting true edges.

Moreover, we counted the number of predicted true interactions in the top 100 highly ranked edges (Supplementary Tables S4, S5 and S6) considering the gold standard sub-network which includes 31 interactions (Fig. 2 in⁵⁵). It is evident that the proposed approach obtained the highest overlap with the gold standard while the minimum rank of the intersected interactions is remarkably high (above 0.6). Finally, the networks predicted by the compared approaches were filtered by the corresponding median edge ranks and the ROC analysis has been performed to the resulting sub-networks. Here, too, our proposed approach resulted in the highest AUROC and AUPR values while the edges in the filtered sub-network have considerably higher ranks (above 0.6075 and 0.379 when applied to gene expression levels and log 2-transformed gene expression fold changes, respectively) in comparison to the median ranks of the other contending methods.

Mus musculus

Finally, to verify the performance of the proposed approach in a higher organism, we performed a comparative analysis on the combination of tissue-specific time-series data sets from three genetically distinct mouse ES cell lines during differentiation. We applied all compared methods to the log 2-transformed gene expression fold changes between control (time point 0) and the rest of time-series from the corresponding ES cell lines, motivated by the resulting improvement in the performance of the inference approaches applied to the MTB data sets. The ROC analysis for the compared methods was then carried out on the top 248 (as the gold standard network includes 248 edges) and 500 highly ranked edges and the corresponding statistics are summarized in Supplementary Table S7.

It is evident that the proposed approach allowed the prediction of regulatory relationships at a higher percentage of true positives when compared to the performance of GENIE3, network deconvolution and global silencing methods. Likewise with other data sets CLR as well as regularization models performed poorly for accurate inference of the regulatory network; however, the performance of the GGM is increased in comparison to its performance on the other data sets, while ARACNE failed to predict any true positive edge. In addition, the selector values for most of the compared methods were high and close (or equal) to one.

Moreover, we counted the number of predicted true interactions in the top 248 highly ranked edges (Supplementary Tables S7 and S9) considering the gold standard sub-network (which includes 248 interactions). It is evident that the proposed approach obtained the highest overlap with the gold standard, while the minimum rank of the intersected interactions is remarkably high (above 0.5). Finally, the networks predicted by the compared approaches were filtered by the corresponding median edge ranks and the ROC analysis has been performed to the resulting sub-networks. Once again the proposed approach resulted in the highest AUROC and AUPR values, while the edges in the filtered sub-network have similar ranks as the other predicted sub-networks by the other contending methods.

Discussion

Here we proposed a novel computational method for the reconstruction of gene regulatory networks which harnesses the information available in multiple data sets (time-series or perturbations) from different conditions. The approach is based on a fused LASSO formulation which incorporates not only the constraint that the sparse networks extracted from the different data sets are supposed to be as similar as possible, but also the requirement that the genes predicted to participate in a regulatory interaction should exhibit similar differential behavior under different conditions. We applied the proposed method to the different real-world gene expression data sets from two bacteria as well as Mus musculus. The thorough comparative analysis with the most recent as well as the state-of-the-art approaches demonstrated that the proposed method (i) showed highest performance on individual data sets (E. coli data set), (ii) outperformed the other contending approaches with respect to the predicted networks from MTB as well as Mus musculus on combination of all data sets, (iii) was promising for inferring the type of regulatory interactions (E. coli data set), (iv) partially outperformed the alternatives on the combined data sets (E. coli data set), (v) was superior for selector values and finally (vi) was more stable with respect to the inferred networks from single data sets. In addition, solving the problems relying on computing inverse matrices (e.g., inverse of covariance, used in GGM and global silencing) or distance matrices (e.g., based on mutual information) can be solved in shorter time than individual regressions for each gene (as a response in the regression model), but since covariances for most of the gene expression data sets are ill-conditioned, the proposed approach opts for regularized regression. In addition, with respect to the results from MTB, we could show that the performance of the majority of the inference approaches increased by using fold changes (obtained from differential analysis) rather than gene expression values; however, this needs to be further investigated and specifically evaluated. Finally and of most interest, the proposed approach provides predictions of regulatory interactions which can be readily interpreted due to the incorporation of biologically meaningful constraints in the usage of data.

Methods

Data preprocessing

Escherichia coli

We used the preprocessed and normalized data for each data set (detailed description in⁵⁴). The final data sets contain the expression of 4400 genes at the aforementioned time points. To test our approach, we used the set of transcription factors present in RegulonDB (TF-gene and TF-TF interactions) as well as DREAM5 challenge. Based on the available microarray data sets, the profiles of altogether 210 regulators (TFs) and 1561 non-regulators were considered for the inference of the gene regulatory networks.

The model proposed in Eq. (2) was then solved for each of the 1561 genes as responses in the regression and the 210 TFs as regressors. We calculated the weight matrices Wⁱ, 1 ≤ i ≤ 4 and the transformations were applied on the four data sets. The weight matrices were calculated using the limma package⁶¹ in R⁶². The probabilities of being differentially expressed at each time point were estimated by the B-statistic of a contrast (for a time point).

Mycobacterium tuberculosis

We used the preprocessed and normalized data for each data set (available in ArrayExpress with the accession number E-GEOD-43466, http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-43466/). The final data sets used for inferring gene regulatory network include only 20 TFs present in the experimentally verified TF regulatory sub-network available in⁵⁵ (Fig. 2, page 180).

The model proposed in Eq. (2) was then solved for each of the 20 TFs considered similarly as responses and regressors in the regression. We calculated the weight matrices Wⁱ, 1 ≤ i ≤ 2 and the transformations were applied on the two data sets. The weight matrices were calculated in the same fashion as for E. coli.

Mus musculus

We used the preprocessed and normalized data for each data set (available in ESCAPE databse, http://www.maayanlab.net/ESCAPE/browse.php). We used the gene expression data obtained from MOE430A array due to measuring genes which are better characterized. The final data sets contain the expression of 8274 genes (the genes with more than one expression profiles in the array were removed) at the aforementioned time points. To test our approach, we used the set of transcription factors present in the literature-based ES network (iScMiD, http://amp.pharm.mssm.edu/iscmid/literature/index.htm). Based on the available microarray data sets, the profiles of altogether 26 regulators (TFs) and 61 genes were considered for the inference of the gene regulatory networks.

The model proposed in Eq. (2) was then solved for each of the 61 genes as responses in the regression and the 26 TFs as regressors. We calculated the weight matrices Wⁱ, 1 ≤ i ≤ 3 and the transformations were applied on the three data sets. The weight matrices were calculated in the same fashion as for E. coli.

Methods used in the comparative analysis

Escherichia coli

For each of these methods used in the comparative analysis (except catnet, due to its time complexity), we employed the data sets from each perturbation separately and all together and divided the profiles into the 210 TFs (regulators) and 1561 non-regulator genes. For the network deconvolution and the global silencing methods, we proceeded as follows: First, for the given data set, we calculated the Pearson correlation coefficients matrix ⁶³. Given g₁ regulators and g₂ non-regulators, with g = g₁ + g₂, the correlation matrix can be modified as , where O denotes the zero matrix, to include biological roles (TF and non-TF genes). We then applied the network deconvolution and global silencing methods to the modified correlation matrix S′ⁱ, 1 ≤ i ≤ 4 from the four data sets (i.e., cold, heat, oxidative stress and lactose-diauxic shift) and the combination of all, resulting in five gene regulatory networks per method. However, the global silencing depends on finding the inverse of the correlation matrix which is rank-deficient in the case p >> n, like with the data analyzed here. Since finding an inverse for a rank-deficient matrix is an ill-posed problem, we resolved it by adding a noise term which renders the matrix positive-definite. In the comparative analysis, we selected the best result from 10 runs of the procedure as a final outcome used in the comparative analysis.

We used the package GeneNet⁶⁴ to infer gene regulatory networks based on the GGM proposed in¹⁹ to also obtain four gene regulatory networks for each of the stresses and one network for the combination of data sets. Likewise, R packages minet⁵⁸ and catnet⁵⁷ were applied to reconstruct gene regulatory networks based on ARACNE²⁵ (as well as CLR²⁶) and categorical Bayesian network⁶⁵, respectively. For GENIE3, we used the corresponding online tool available in GP-DREAM^14,66 (http://dream.broadinstitute.org/gp/pages/index.jsf). Finally, for different regularization models; L₁, L_1/2 and L₀, we used the available source code from http://jjwanglab.org/LpRGNI/.

To be consistent with the obtained edge weights from each network inference method, we rescaled them to positive values between 0 and 1 (by dividing the absolute value of the edge weights by the largest absolute value of the edge weights). Then the edges were sorted in a decreasing order of their weights and the top 100,000 highly ranked edges were selected for the rest of the statistical assessments. We note that the resulting networks from global silencing and network deconvolution are rescaled by default. In addition, the actual edge weights (before rescaling) were stored for each network to estimate the percentages of predicted true positive and negative regulatory effects (i.e., activation and inhibition, respectively). The inference of regulatory type, in terms of activation or inhibition, would not be possible without capturing the edge weights before rescaling, described above.

Mycobacterium tuberculosis

The comparative analysis was performed only on the combination of all data sets resulting in one gene regulatory network per method. As the gold standard network for MTB includes only transcription factors, we applied the network deconvolution and the global silencing methods to the correlation matrix S′ⁱ, 1 ≤ i ≤ 2 from the combination of both data sets (i.e., hypoxia and re-aeration) without further modification. The rest of the analysis was performed in the same way as for E. coli, except the selection of highly ranked edges for which due to the smaller network size, we selected the top 31 (the size of gold standard network) and 100 highly ranked edges.

Mus musculus

The comparative analysis was performed only on the combination of all data sets resulting in one gene regulatory network per method. The inference of the gene regulatory networks and the comparative analysis of the compared approaches were performed in the same manner as for E. coli, except the selection of highly ranked edges for which due to the smaller network size, we selected the top 248 (the size of gold standard network) and 500 highly ranked edges.

Implementation notes for the proposed approach

To implement the modified fused LASSO approach for reconstruction of gene regulatory networks over different data sets, we defined a penalty matrix D composed of two blocks: The first block D₁ is a diagonal kP × kP matrix for the LASSO penalty which includes W in diagonal, while the second block D₂ corresponds to the fusion penalty, expressed as:

To solve the proposed extension to the fused LASSO, we used the lqa package in R⁶⁷, with remarkable computational performance due to the Cholesky decomposition⁶⁸ which requires operations, in which n is the number of observations and while p is the number of regressors. We also slightly changed the function fused.lasso in lqa package to allow inclusion of the penalty matrices defined in Eq.(4). The regression coefficients were robustly estimated by 10-fold cross validation based on the optimum values for λ₁ and λ₂ from the sets {0.05, 0.1, 0.5, 1, 1.5} and {0.1, 0.5, 1, 1.5, 2}, respectively. To further speed up the algorithm, we used mclapply function from R package parallel⁶². In addition, we used minet²⁰ and pROC⁵⁹ packages in R to draw ROC curves and estimate statistics for them.

The code snippets for all approaches used in this study is available at http://mathbiol.mpimp-golm.mpg.de/Mul-fLASSO/index.html. The complete implementation of the proposed approach for the data sets used in this study as well as the general analysis flow for applying the proposed approach on new data sets are available at https://github.com/omranian/inference-of-GRN-using-Fused-LASSO.

Additional Information

How to cite this article: Omranian, N. et al. Gene regulatory network inference using fused LASSO on multiple data sets. Sci. Rep. 6, 20533; doi: 10.1038/srep20533 (2016).

References

López-Barneo, J., Pardal, R. & Ortega-Sáenz, P. Cellular mechanism of oxygen sensing. Annu Rev Physiol 63, 259–287, doi: 10.1146/annurev.physiol.63.1.259 (2001).
Article PubMed Google Scholar
Rolland, F., Baena-Gonzalez, E. & Sheen, J. Sugar sensing and signaling in plants: conserved and novel mechanisms. Annu Rev Plant Biol 57, 675–709, doi: 10.1146/annurev.arplant.57.032905.105441 (2006).
Article CAS PubMed Google Scholar
Veal, E. A., Day, A. M. & Morgan, B. A. Hydrogen peroxide sensing and signaling. Mol Cell 26, 1–14, doi: 10.1016/j.molcel.2007.03.016 (2007).
Article CAS PubMed Google Scholar
Jacob, F. & Monod, J. Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol 3, 318–356, doi: 10.1016/S0022-2836(61)80072-7 (1961).
Article CAS PubMed Google Scholar
Karlebach, G. & Shamir, R. Modelling and analysis of gene regulatory networks. Nat Rev Mol Cell Biol 9, 770–780, doi: 10.1038/nrm2503 (2008).
Article CAS PubMed Google Scholar
Cases, I. & de Lorenzo, V. Promoters in the environment: transcriptional regulation in its natural context. Nat Rev Microbiol 3, 105–118, doi: 10.1038/nrmicro1084 (2005).
Article CAS PubMed Google Scholar
Schulze, A. & Downward, J. Navigating gene expression using microarraysa technology review. Nat cell biol 3, E190–E195, doi: 10.1038/35087138 (2001).
Article CAS PubMed Google Scholar
Blencowe, B. J., Ahmad, S. & Lee, L. J. Current-generation high-throughput sequencing: deepening insights into mammalian transcriptomes. Gene Dev 23, 1379–1386, doi: 10.1101/gad.1788009 (2009).
Article CAS PubMed Google Scholar
Rehrauer, H. et al. Agronomics1: a new resource for arabidopsis transcriptome profiling. Plant physiol 152, 487–499, doi: 10.1104/pp.109.150185 (2010).
Article CAS PubMed PubMed Central Google Scholar
Sima, C., Hua, J. & Jung, S. Inference of gene regulatory networks using time-series data: A survey. Curr Genomics 10, 416–429, doi: 10.2174/138920209789177610 (2009).
Article CAS PubMed PubMed Central Google Scholar
Smyth, G. K. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mo B 3, doi: 10.2202/1544-6115.1027 (2004).
Rapaport, F. et al. Comprehensive evaluation of differential gene expression analysis methods for rna-seq data. Genome Biol 14, R95, doi: 10.1186/gb-2013-14-9-r95 (2013).
Article CAS PubMed PubMed Central Google Scholar
Hempel, S., Koseska, A., Nikoloski, Z. & Kurths, J. Unraveling gene regulatory networks from time-resolved gene expression data - a measures comparison study. BMC Bioinformatics 12, 292, doi: 10.1186/1471-2105-12-292 (2011).
Article PubMed PubMed Central Google Scholar
Marbach, D. et al. Wisdom of crowds for robust gene network inference. Nat Methods 9, 796–804, doi: 10.1038/nmeth.2016 (2012).
Article CAS PubMed PubMed Central Google Scholar
Omony, J. Biological network inference: A review of methods and assessment of tools and techniques. Annu Res Rev Bioly 4, 577–601, doi: 10.9734/ARRB/2014/5718 (2014).
Article Google Scholar
Han, L. & Zhu, J. Using matrix of thresholding partial correlation coefficients to infer regulatory network. Biosystems 91, 158–165, doi: 10.1016/j.biosystems.2007.08.008 (2008).
Article CAS PubMed Google Scholar
Peng, J., Wang, P., Zhou, N. & Zhu, J. Partial correlation estimation by joint sparse regression models. J Am Stat Assoc 104, 735–746, doi: 10.1198/jasa.2009.0126 (2009).
Article CAS MathSciNet PubMed PubMed Central MATH Google Scholar
Yuan, Y., Li, C.-T. & Windram, O. Directed partial correlation: inferring large-scale gene regulatory network through induced topology disruptions. PLoS One 6, e16835, doi: 10.1371/journal.pone.0016835 (2011).
Article CAS ADS PubMed PubMed Central Google Scholar
Schäfer, J. & Strimmer, K. An empirical bayes approach to inferring large-scale gene association networks. Bioinformatics 21, 754–764, doi: 10.1093/bioinformatics/bti062 (2005).
Article PubMed Google Scholar
Meyer, P. E., Kontos, K., Lafitte, F. & Bontempi, G. Information-theoretic inference of large transcriptional regulatory networks. EURASIP J Bioinform Syst Biol 79879, doi: 10.1155/2007/79879 (2007).
Huynh-Thu, V. A., Irrthum, A., Wehenkel, L. & Geurts, P. Inferring regulatory networks from expression data using tree-based methods. PLoS One 5, doi: 10.1371/journal.pone.0012776 (2010).
Chun, H., Chen, M., Li, B. & Zhao, H. Joint conditional gaussian graphical models with multiple sources of genomic data. Front Genet 4, doi: 294 10.3389/fgene.2013.00294 (2013).
Lauritzen, S. L. in Graphical models 1st edn, Vol. 17, Ch. 3, 28–60 (Oxford Statistical Science Series, 1996).
Google Scholar
Krämer, N., Schäfer, J. & Boulesteix, A.-L. Regularized estimation of large-scale gene association networks using graphical gaussian models. BMC Bioinformatics 10, 384, doi: 10.1186/1471-2105-10-384(2009).
Margolin, A. A. et al. Reverse engineering cellular networks. Nat Protoc 1, 662–671, doi: 10.1038/nprot.2006.106 (2006).
Article CAS PubMed Google Scholar
Faith, J. J. et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS biol 5, e8, doi: 10.1371/journal.pbio.0050008 (2007).
Article CAS PubMed PubMed Central Google Scholar
Zoppoli, P., Morganella, S. & Ceccarelli, M. Timedelay-aracne: Reverse engineering of gene networks from time-course data by an information theoretic approach. Bmc Bioinformatics 11, 154, doi: 10.1186/1471-2105-11-154 (2010).
Article CAS PubMed PubMed Central Google Scholar
Aliferis, C. F., Statnikov, A., Tsamardinos, I., Mani, S. & Koutsoukos, X. D. Local causal and markov blanket induction for causal discovery and feature selection for classification part i: Algorithms and empirical evaluation. J Mach Learn Res 11, 171–234 (2010).
MathSciNet MATH Google Scholar
Statnikov, A. & Aliferis, C. F. Analysis and computational dissection of molecular signature multiplicity. PLoS comput biol 6, e1000790, doi: 10.1371/journal.pcbi.1000790 (2010).
Article CAS ADS PubMed PubMed Central Google Scholar
Toni, T. & Stumpf, M. P. Simulation-based model selection for dynamical systems in systems and population biology. Bioinformatics 26, 104–110, doi: 10.1093/bioinformatics/btp619 (2010).
Article CAS PubMed Google Scholar
Dondelinger, F., Husmeier, D. & Lèbre, S. Dynamic bayesian networks in molecular plant science: inferring gene regulatory networks from multiple gene expression time series. Euphytica 183, 361–377, doi: 10.1007/s10681-011-0538-3 (2012).
Article Google Scholar
Fu, F. & Zhou, Q. Learning sparse causal gaussian networks with experimental intervention: regularization and coordinate descent. J Am Stat Assoc 108, 288–300, doi: 10.1080/01621459.2012.754359 (2013).
Article CAS MathSciNet MATH Google Scholar
Barzel, B. & Barabási, A.-L. Network link prediction by global silencing of indirect correlations. Nat Biotechnol 31, 720–725, doi: 10.1038/nbt.2601 (2013).
Article CAS PubMed PubMed Central Google Scholar
Feizi, S., Marbach, D., Mdard, M. & Kellis, M. Network deconvolution as a general method to distinguish direct dependencies in networks. Nat Biotechnol 31, 726–733, doi: 10.1038/nbt.2635 (2013).
Article CAS PubMed PubMed Central Google Scholar
Bray, N. & Pachter, L. The network nonsense of albert-lászló barabási (2014) Date of access:10/02/2014. URL http://liorpachter.wordpress.com/2014/02/10/the-network-nonsense-of-albert-laszlo-barabasi/.
Yuan, M. & Lin, Y. Model selection and estimation in regression with grouped variables. J Roy Stat Soc B 68, 49–67, doi: 10.1111/j.1467-9868.2005.00532.x (2006).
Article MathSciNet MATH Google Scholar
Pihur, V., Datta, S. & Datta, S. Reconstruction of genetic association networks from microarray data: a partial least squares approach. Bioinformatics 24, 561–568, doi: 10.1093/bioinformatics/btm640 (2008).
Article CAS PubMed Google Scholar
Haury, A.-C., Mordelet, F., Vera-Licona, P. & Vert, J.-P. Tigress: Trustful inference of gene regulation using stability selection. BMC Syst Biol 6, 145, doi: 10.1186/1752-0509-6-145 (2012).
Article PubMed PubMed Central Google Scholar
Hastie, T. et al. The elements of statistical learning, vol. 2, doi: 10.1007/978-0-387-84858-7 (Springer, 2009).
Bonneau, R. et al. The inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biol 7, R36, doi: 10.1186/gb-2006-7-5-r36 (2006).
Article CAS PubMed PubMed Central Google Scholar
Cai, X., Bazerque, J. A. & Giannakis, G. B. Inference of gene regulatory networks with sparse structural equation models exploiting genetic perturbations. PLoS Comput Biol 9, e1003068, doi: 10.1371/journal.pcbi.1003068 (2013).
Article CAS ADS PubMed PubMed Central Google Scholar
Hesterberg, T., Choi, N. H., Meier, L. & Fraley, C. Least angle and l1 penalized regression: A review. Stat Surv 2, 61–93, doi: 10.1214/08-SS035 (2008).
Article MathSciNet MATH Google Scholar
Yang, A. Y., Sastry, S. S., Ganesh, A. & Ma, Y. Fast l1-minimization algorithms and an application in robust face recognition: A review. IEEE Image Proc, doi: 10.1007/s10114-013-2740-9 (2010).
Dasgupta, A., Sun, Y. V., König, I. R., Bailey-Wilson, J. E. & Malley, J. D. Brief review of regression-based and machine learning methods in genetic epidemiology: the genetic analysis workshop 17 experience. Genet Epidemiol 35, S5–11, doi: 10.1002/gepi.20642 (2011).
Article PubMed PubMed Central Google Scholar
Qin, J., Hu, Y., Xu, F., Yalamanchili, H. K. & Wang, J. Inferring gene regulatory networks by integrating chip-seq/chip and transcriptome data via lasso-type regularization methods. Methods 67, 294–303, doi: 10.1016/j.ymeth.2014.03.006 (2014).
Article CAS PubMed Google Scholar
Stolovitzky, G., Monroe, D. & Califano, A. Dialogue on reverse-engineering assessment and methods. Ann NY Acad Sci 1115, 1–22, doi: 10.1196/annals.1407.021 (2007).
Article ADS PubMed Google Scholar
Villa-Vialaneix, N., Vignes, M., Viguerie, N. & San Cristobal, M. Inferring networks from multiple samples with consensus lasso. Qual Technol Quant Manag 11, 39–60 (2014).
Article Google Scholar
Hase, T., Ghosh, S., Yamanaka, R. & Kitano, H. Harnessing diversity towards the reconstructing of large scale gene regulatory networks. PLoS Comput Biol 9, e1003361, doi: 10.1371/journal.pcbi.1003361 (2013).
Article CAS ADS PubMed PubMed Central Google Scholar
Tibshirani, R. Regression shrinkage and selection via the lasso. J Roy Stat Soc B 58, 267–288, doi: 10.2307/41262671 (1996).
Article MathSciNet MATH Google Scholar
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. & Knight, K. Sparsity and smoothness via the fused lasso. J Roy Stat Soc B 91–108, doi: 10.1111/j.1467-9868.2005.00490.x (2005).
Zhong, R., Allen, J. D., Xiao, G. & Xie, Y. Ensemble-based network aggregation improves the accuracy of gene network reconstruction. PLoS ONE 9, e106319, doi: 10.1371/journal.pone.0106319 (2014).
Article CAS ADS PubMed PubMed Central Google Scholar
Zaslaver, A. et al. A comprehensive library of fluorescent transcriptional reporters for escherichia coli. Nat Methods 3, 623–628, doi: 10.1038/nmeth895 (2006).
Article CAS PubMed Google Scholar
Gama-Castro, S. et al. Regulondb version 7.0: transcriptional regulation of Escherichia coli k-12 integrated within genetic sensory response units (gensor units). Nucleic Acids Res 39, D98–D105, doi: 10.1093/nar/gkq1110 (2011).
Article CAS PubMed Google Scholar
Jozefczuk, S. et al. Metabolomic and transcriptomic stress response of Escherichia coli. Mol Syst Biol 6, 364, doi: 10.1038/msb.2010.18 (2010).
Article PubMed PubMed Central Google Scholar
Galagan, J. E. et al. The Mycobacterium tuberculosis regulatory network and hypoxia. Nature 499, 178–183, doi: 10.1038/nature12337 (2013).
Article CAS ADS PubMed PubMed Central Google Scholar
Sene, K. H. et al. Gene function in early mouse embryonic stem cell differentiation. BMC Genomics 8, 85, doi: 10.1186/1471-2164-8-85 (2007).
Article CAS PubMed Central Google Scholar
Balov, N. & Salzman, P. catnet: categorical bayesian network inference. URL http://CRAN.R-project.org/package=catnet (2012).
Meyer, P. E., Lafitte, F. & Bontempi, G. minet: A R/bioconductor package for inferring large transcriptional networks using mutual information. BMC bioinformatics 9, 461, doi: 10.1186/1471-2105-9-461 (2008).
Article PubMed PubMed Central Google Scholar
Robin, X. et al. proc: an open-source package for r and s+ to analyze and compare roc curves. BMC Bioinformatics 12, 77, doi: 10.1186/1471-2105-12-77 (2011).
Article PubMed PubMed Central Google Scholar
Gruber, T. M. & Gross, C. A. Multiple sigma subunits and the partitioning of bacterial transcription space. Annu Rev Microbiol 57, 441–466, doi: 10.1146/annurev.micro.57.030502.090913 (2003).
Article CAS PubMed Google Scholar
Smyth, G. K. in Bioinformatics and Computational Biology Solutions Using R and Bioconductor 1st edn, Ch. 23, 397–420 (Springer New York, 2005).
R Core Team R: A Language and Environment for Statistical Computing., R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/ (2013).
Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 14863–14868 (1998).
Article CAS ADS PubMed PubMed Central Google Scholar
Schäfer, J., Opgen-Rhein, R. & Strimmer, K. GeneNet: Modeling and Inferring Gene Networks. URL http://CRAN.R-project.org/package=GeneNet. R package version 1.2.8. (2013).
Balov, N. A categorical network approach for discovering differentially expressed regulations in cancer. BMC Med Genomics 6, S1, doi: 10.1186/1755-8794-6-S3-S1 (2013).
Article PubMed PubMed Central Google Scholar
Reich, M. et al. Genepattern 2.0. Nat Genet 38, 500–501, doi: 10.1038/ng0506-500 (2006).
Article CAS PubMed Google Scholar
Ulbricht, J. lqa: Penalized Likelihood Inference for GLMs. URL http://CRAN.R-project.org/package=lqa (2012).
Lawson, C. L. & Hanson, R. J. Solving Least Squares Problems., Series in Automatic Computation. Prentice-Hall, Englewood Cliffs, NJ 07632, USA, doi: 10.1137/1.9781611971217 (1974).

Download references

Author information

Authors and Affiliations

Max Planck Institute for Molecular Plant Physiology, Systems Biology and Mathematical Modelling Group, Am Muehlenberg 1, Potsdam, 14476, Germany
Nooshin Omranian, Jeanne M. O. Eloundou-Mbebi & Zoran Nikoloski
Department of Molecular Biology, University of Potsdam, Karl-Liebknecht-Str. 24-25, Haus 20, Potsdam, 14476, Germany
Nooshin Omranian & Bernd Mueller-Roeber

Authors

Nooshin Omranian
View author publications
You can also search for this author in PubMed Google Scholar
Jeanne M. O. Eloundou-Mbebi
View author publications
You can also search for this author in PubMed Google Scholar
Bernd Mueller-Roeber
View author publications
You can also search for this author in PubMed Google Scholar
Zoran Nikoloski
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Conceived and designed the study: N.O. and Z.N. Developed the model: N.O., Z.N. and J.M.O.E.M. Implemented: N.O. N.O., J.M.O.E.M., B.M.R. and Z.O. wrote the manuscript, made comments and approved the final version submitted.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Electronic supplementary material

Supplementary Information

Rights and permissions

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

Reprints and permissions

About this article

Cite this article

Omranian, N., Eloundou-Mbebi, J., Mueller-Roeber, B. et al. Gene regulatory network inference using fused LASSO on multiple data sets. Sci Rep 6, 20533 (2016). https://doi.org/10.1038/srep20533

Download citation

Received: 19 May 2015
Accepted: 06 January 2016
Published: 11 February 2016
DOI: https://doi.org/10.1038/srep20533

This article is cited by

Integrative genetic and genomic networks identify microRNA associated with COPD and ILD
- Ana B. Pavel
- Carly Garrison
- Joshua D. Campbell
Scientific Reports (2023)
Widening the landscape of transcriptional regulation of green algal photoprotection
- Marius Arend
- Yizhong Yuan
- Dimitris Petroutsos
Nature Communications (2023)
Unsupervised relational inference using masked reconstruction
- Gerrit Großmann
- Julian Zimmerlin
- Verena Wolf
Applied Network Science (2023)
Inferring and analyzing gene regulatory networks from multi-factorial expression data: a complete and interactive suite
- Océane Cassan
- Sophie Lèbre
- Antoine Martin
BMC Genomics (2021)
Bayesian differential analysis of gene regulatory networks exploiting genetic perturbations
- Yan Li
- Dayou Liu
- Yungang Zhu
BMC Bioinformatics (2020)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Introduction

Results

Formulation of the proposed method

Experimental scenario and weight matrices

Regression-based model

Gene expression data sets

Escherichia coli

Mycobacterium tuberculosis

Mus musculus

Comparative analysis

Escherichia coli

Mycobacterium tuberculosis

Mus musculus

Discussion

Methods

Data preprocessing

Escherichia coli

Mycobacterium tuberculosis

Mus musculus

Methods used in the comparative analysis

Escherichia coli

Mycobacterium tuberculosis

Mus musculus

Implementation notes for the proposed approach

Additional Information

References

Author information

Authors and Affiliations

Contributions

Ethics declarations

Competing interests

Electronic supplementary material

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Comments

Search

Quick links