Abstract
Gene regulatory networks (GRNs) describe regulatory relationships between transcription factors (TFs) and their target genes. Computational methods to infer GRNs typically combine evidence across different conditions to infer context-agnostic networks. We develop a method, Network Reprogramming using EXpression (NetREX), that constructs a context-specific GRN given context-specific expression data and a context-agnostic prior network. NetREX remodels the prior network to obtain the topology that provides the best explanation for expression data. Because NetREX utilizes prior network topology, we also develop PriorBoost, a method that evaluates a prior network in terms of its consistency with the expression data. We validate NetREX and PriorBoost using the “gold standard” E. coli GRN from the DREAM5 network inference challenge and apply them to construct sex-specific Drosophila GRNs. NetREX constructed sex-specific Drosophila GRNs that, on all applied measures, outperform networks obtained from other methods, indicating that NetREX is an important milestone toward building more accurate GRNs.
Introduction
Maintenance of cell type-specific states, response to stress, sexual dimorphism, and other cell functions are controlled by gene regulatory programs. In particular, gene regulatory networks (GRNs) capture the regulatory relationships between transcription factors (TFs) and their target genes. Since GRNs provide information that is essential for a global understanding of the logic of gene–gene interactions, inference of these networks is one of the key challenges in systems biology. Methods to infer GRNs typically combine computational approaches and experimental data collected from different sample types, different conditions, different techniques, and different labs. Such data integration leverages dependencies that can be confidently uncovered thanks to the multitude of surveyed conditions, but leads to context-agnostic wiring diagrams^{1,2,3}. These context-agnostic networks do not reflect the reality of regulatory programs, which are specific to tissue types, developmental stages, sex, and other factors.
To study tissue-, developmental stage-, or sex-specific gene regulation, context-specific regulatory networks are needed. Drosophila sex differentiation is an ideal test for such context-dependent models, as sexual dimorphism results in subtle differences in every germ layer and tissue^{4}. Thus, models of sex-biased expression will show many differences between the sexes, but also a core of gene regulatory relationships that should be similar between the sexes. The most readily accessible context-specific data type is context-specific gene expression. Therefore, a spectrum of methods that construct GRNs from gene expression data alone has been developed in recent years, counting on the relation between expression of TFs and expression of their target genes. Early methods inferred regulatory relationships using mutual information between the expression levels of gene pairs^{5,6}. These approaches have been followed by more sophisticated ones that account for more complex regulatory scenarios^{7,8,9,10,11,12}. The recent DREAM5 network inference challenge^{13} evaluated over 30 expression-based network inference methods and identified a random forest-based method, GENIE3, as the best performer. However, the results of this challenge demonstrated that expression-only methods are far from solving the GRN inference problem, suggesting that relying on expression alone is not enough. One of the factors that led to the limited success of these methods is the complicated relationship between expression of TFs and their regulatory activity^{14}, indicating that it might be beneficial to rely on the TF regulatory activities inferred from the data rather than TF expression per se. For example, network component analysis (NCA) has been shown to be a successful approach to infer such regulatory activities^{15}.
Unfortunately, NCA requires prior knowledge of the GRN in order to infer TF activities but, in our setting, the GRN is largely unknown. Because of this difficulty, effort has been expended to integrate prior knowledge from different types of experiments, or even from different conditions, to provide additional ways to boost inference of such networks^{16,17,18,19,20,21}. For example, the Inferelator^{21} method uses a prior network in place of a true network as the input to the NCA procedure to infer TF activities, and then predicts a GRN based on relationships between the inferred TF activities and gene expression^{21}.
Here we introduce NetREX, a method that constructs GRNs by iteratively reprogramming a prior network using expression data. In applications that predict context-specific GRNs, the prior network is assumed to reflect prior information that might not be context specific, while the expression data provide the context. NetREX can be applied to any situation where a prior network is to be improved by expression data. The main idea of NetREX is to reprogram the prior network by adding and removing edges to obtain a network that provides the best explanation of the observed gene expression. Simultaneously, NetREX optimizes several other objectives to ensure that the resulting network is biologically relevant. NetREX is an approach that systematically explores the landscape of possible GRN topologies to generate context-specific GRNs.
NetREX, and all other models that use a prior, assume that there is some similarity or overlap between the prior network and the target GRN, and thus these tools bias the optimization procedure toward networks that overlap with the prior. Therefore, in the case of significant discrepancies between the prior and the target network, the prior might be misleading rather than helpful. To address this challenge we developed PriorBoost, a computational approach to gauge the usefulness of the prior network for obtaining a good estimate of the target GRN.
We validated NetREX and PriorBoost, first on simulated data and then on the “gold standard” E. coli GRN used in the DREAM5^{13} challenge. As an additional evaluation, we compared how well the methods predicted novel regulatory edges that were added to the E. coli RegulonDB^{22} after the DREAM5 challenge. NetREX outperforms other methods on different metrics. Additionally, PriorBoost successfully identifies priors that are likely to lead to misleading results.
We then apply NetREX and PriorBoost to construct sex-specific GRNs for adult Drosophila melanogaster using a previously constructed context-agnostic network as the prior^{2}. To generate the sex-specific GRNs, we supply NetREX with a large expression dataset for adult female and male flies in which perturbations in expression were achieved by heterozygosity for multilocus deletions^{23,24}. We assess performance by evaluating the subnetwork centered on the sex-specific transcription factor Doublesex (DSX), which is the key gene controlling, directly or indirectly, the majority of sex differentiation in Drosophila^{25}. DSX occupancy in D. melanogaster and the comparative genomics of DSX binding motifs in the Drosophila genus have been extensively mapped, providing a good test of the connectivity predicted by NetREX. Furthermore, we illustrate that, among all competing methods, only DSX targets predicted by NetREX are enriched in genes with sex-biased expression. Finally, we demonstrate that while GRNs inferred by NetREX show differences between the sexes, their regulatory programs overlap, consistent with the similarities between the sexes.
Results
NetREX and PriorBoost overview
The main idea of NetREX is to construct a context-specific GRN by leveraging an existing GRN, for example a GRN constructed in a related tissue or organism, or a noisy or incomplete network for the same context. The context of interest is provided by a set of expression data. NetREX edits the prior network by removing and adding edges to obtain a network topology that provides the best explanation for the entirety of the expression data. To accomplish this, NetREX requires four components: (i) a measure of how well a network topology explains the expression data, (ii) a strategy for exploring biologically relevant network topologies, (iii) an algorithmic technique guaranteeing convergence of the network search procedure, and (iv) a method to test whether the given prior is consistent with the data and likely to provide an advance over prior-free methods. Below we provide the basic intuition underlying these four components. The details of the method and its mathematical underpinning are described in the Methods section.
To measure how well a given network’s topology explains the expression data, we need a mathematical model linking network topology to gene expression. NetREX uses the network component analysis (NCA) model^{26} (Supplementary Figure 1), which assumes that each TF is characterized by its activity (TF activity), a variable that is not directly measured but introduced to account for unknown factors, such as protein levels, nuclear localization, and phosphorylation status. In addition, in the NCA model, each edge of the GRN has a weight representing the regulatory potential (or strength) with which the TF regulates the gene. Finally, the expression of a gene is assumed to be a linear combination of the activities of the TFs that regulate the gene, weighted by the regulatory potentials of the regulatory edges (Supplementary Figure 1). NetREX scores a topology by how well the optimal NCA model restricted to that topology fits the expression data. Although this model is relatively simple (Discussion), we verified its efficacy by showing that the computed explanatory power correlates with the number of “gold standard” edges in the E. coli GRN (Supplementary Figure 2), motivating our use of this metric as a measure of the relationship of network topology to expression data.
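The linear model described above can be illustrated with a minimal sketch. It assumes, for simplicity, that the TF activities are already known and only the edge weights of a fixed topology are refitted by least squares; NetREX itself estimates activities and topology jointly, so this is a toy illustration and not the NetREX procedure. The function names `relative_residual` and `refit_weights` and the toy data are hypothetical.

```python
import numpy as np

def relative_residual(E, S, A):
    """Fraction of the expression signal that the linear model S @ A leaves unexplained."""
    return float(np.linalg.norm(E - S @ A) / np.linalg.norm(E))

def refit_weights(E, A, mask):
    """Least-squares edge weights for a fixed topology (assumes activities A are known).

    mask[i, j] is True when TF j may regulate gene i; all other weights stay zero.
    """
    S = np.zeros(mask.shape)
    for i in range(mask.shape[0]):
        tfs = np.flatnonzero(mask[i])
        if tfs.size:
            # Solve A[tfs].T @ w ~= E[i] for the weights of gene i's allowed regulators.
            w, *_ = np.linalg.lstsq(A[tfs].T, E[i], rcond=None)
            S[i, tfs] = w
    return S

# Toy data: 6 genes, 3 TFs, 20 samples, noise-free expression E = S_true @ A.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 20))
true_mask = np.array(
    [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 1]], dtype=bool
)
S_true = true_mask.astype(float)
E = S_true @ A

S_fit = refit_weights(E, A, true_mask)     # correct topology
S_wrong = refit_weights(E, A, ~true_mask)  # deliberately wrong topology
print(relative_residual(E, S_fit, A))
print(relative_residual(E, S_wrong, A))
```

In this toy setting the correct topology reproduces the expression matrix, while the wrong topology leaves a large unexplained residual, which is the intuition behind using model fit as a measure of topology quality.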
Starting from the prior, NetREX iteratively reprograms it by adding and removing edges, giving preference to topologies where coexpressed genes are coregulated and TFs with correlated activities coregulate the same genes (Fig. 1 and Methods), and penalizing the number of changes (see Methods and Supplementary Methods: The Formulation of NetREX).
Computationally, NetREX is formulated as an optimization problem involving the l_{0} norm, which makes the problem nonconvex and NP-hard. We addressed this challenge by using a recent technique known as proximal alternating linearized minimization (PALM)^{27}, as described in Supplementary Methods: Optimization Behind the NetREX Algorithm.
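One reason proximal schemes are attractive for l_{0}-penalized problems is that the proximal step for an l_{0} penalty has a closed form, namely hard thresholding. The sketch below shows this generic building block next to the soft thresholding that arises for an l_{1} penalty. It is not the NetREX update itself, which is given in the Supplementary Methods; the function names and the `step` parameter are illustrative assumptions.

```python
import numpy as np

def prox_l0(v, lam, step=1.0):
    """Elementwise minimizer of lam * ||x||_0 + ||x - v||^2 / (2 * step): hard thresholding."""
    v = np.asarray(v, dtype=float)
    threshold = np.sqrt(2.0 * lam * step)
    # Entries above the threshold are kept unchanged; the rest are set exactly to zero.
    return np.where(np.abs(v) > threshold, v, 0.0)

def prox_l1(v, lam, step=1.0):
    """Elementwise minimizer of lam * ||x||_1 + ||x - v||^2 / (2 * step): soft thresholding."""
    v = np.asarray(v, dtype=float)
    # Every entry is shrunk toward zero by lam * step, so surviving entries are biased.
    return np.sign(v) * np.maximum(np.abs(v) - lam * step, 0.0)

v = np.array([3.0, 0.5, -2.0, 0.1])
print(prox_l0(v, lam=0.5))
print(prox_l1(v, lam=0.5))
```

Hard thresholding either keeps a weight as it is or removes it, which matches the discrete add-or-remove nature of edge edits, whereas soft thresholding also shrinks the weights that survive.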
NetREX is a prior-based method, and therefore its performance critically depends on the prior. To avoid erroneous solutions due to a poor prior, we developed PriorBoost to evaluate the usefulness of a prior network for the task of reconstructing a GRN consistent with a given expression dataset (Methods).
Benchmarking NetREX
While benchmarking against a true network is ideal, no current GRNs are perfect. Therefore, we first tested the performance of NetREX on simulated data. Overall, the NetREX solution provided a consistent improvement over the initial prior, and the improvement increased with less noise in the expression and/or a higher fraction of true positive edges in the prior (Supplementary Figure 3).
Next, to see how the method handles non-random error in the prior network, we simulated the scenario where the prior is consistent with the true network in most cases except for one truly differential module of genes. NetREX performed very well even when all true edges leading to the module had been removed from the prior (Supplementary Figure 4).
To complement the benchmarking on simulated data, we evaluated NetREX on the E. coli network, currently the most complete GRN^{22}. Following the strategy used in the DREAM5 challenge^{13}, we used the same experimentally validated high-confidence interactions from the curated dataset RegulonDB^{22} as a reasonable “gold standard” set and the same expression data that was provided to the DREAM5 competitors. We evaluated the ability of NetREX to recover this “gold standard” network as a function of the quality of the prior. As in the case of simulated data, we constructed prior networks of various quality by randomly selecting a subset of edges from the “gold standard” network as true positives and randomly adding false positive edges. We compared NetREX with Inferelator^{21}, MERLIN+P^{20}, and CoRegNet^{7}, all of which use a prior network (see parameter selection in Supplementary Note 6). In addition, we included GENIE3^{11}, the best performer in the DREAM5 challenge, which uses expression data only (no prior). We varied the difficulty of the network inference problem by using prior networks generated in two ways. The first set of noisy prior networks had the same number of total edges, but different percentages of true edges. The second set of noisy priors had the same number of true edges, but different numbers of total edges, controlled by the ratio of true to false edges. We assessed the quality of the predicted networks by AUPR (the Area Under the Precision vs. Recall curve) scores. The results using AUROC (Area Under the Receiver Operating Characteristic curve) are similar and are provided in Supplementary Tables 1 and 3. Except for the cases when the prior network contained only 10% of true edges (Fig. 2a–c) or no true edges (ratio of true to false edges is 0:1 in Fig. 2d–f), NetREX outperformed all other methods under most test conditions.
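For readers unfamiliar with AUPR, the following minimal sketch scores a ranked list of predicted edges against a gold standard using average precision, a common step-wise estimate of the area under the precision-recall curve. It is an illustration only: the exact AUPR computation used in the benchmark (for example, interpolation and tie handling) may differ, and the function name `average_precision` and the toy edge scores are assumptions.

```python
import numpy as np

def average_precision(scores, is_true_edge):
    """Step-wise estimate of the area under the precision-recall curve.

    scores: confidence assigned to each candidate edge (higher means more confident).
    is_true_edge: whether each candidate edge is in the gold standard.
    Ties in scores are broken by input order, a simplification.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_true_edge, dtype=bool)
    order = np.argsort(-scores, kind="stable")  # most confident edges first
    hits = labels[order]
    if hits.sum() == 0:
        return 0.0
    true_positives = np.cumsum(hits)
    precision_at_rank = true_positives / np.arange(1, hits.size + 1)
    # Average the precision observed at the rank of each recovered gold-standard edge.
    return float(precision_at_rank[hits].mean())

# Four candidate edges; the 1st and 3rd ranked ones are in the gold standard.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
print(round(ap, 4))  # (1/1 + 2/3) / 2 = 0.8333
```

A ranking that places all gold-standard edges first scores 1.0, and the score drops as false edges are ranked above true ones.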
GENIE3 outperformed all other methods when the prior network contained a very low percentage of true edges (Fig. 2a, b, d, e), which is consistent with the expectation that if the prior is a poor match, algorithms that do not use the prior gain an advantage. The performance of MERLIN+P was overall not significantly influenced by the quality of the prior and was close to the performance of GENIE3. Interestingly, NetREX was the only method that provided a consistent improvement over the provided prior (the NetREX curves in Fig. 2a, d are always above the curves of the prior). When the prior contained >60% correct edges, the network constructed by Inferelator was actually worse than the prior network provided as input. In this respect, we attribute the superior performance of NetREX in part to the fact that it gives preference to solutions that are close to the prior. We also tested the impact of sample size on the method’s performance. NetREX provides improvement over the prior with as few as 10 samples; performance continues to increase steeply with sample size and plateaus around 100 samples (Supplementary Figure 7).
Even though the E. coli GRN is currently the most complete network, it is not perfect. Therefore we performed additional validation using new data. Specifically, we acquired 230 novel high-confidence interactions from RegulonDB^{22} (Methods) that were added to RegulonDB^{22} after the DREAM5 challenge was completed, and thus were not included in the “gold standard” set. We then tested whether those novel edges were uncovered by the competing algorithms, using the log(p-value) from a hypergeometric test of the enrichment of these novel edges among the new edges predicted by each algorithm. Again, except for the case of the lowest quality prior (where CoRegNet has the best performance in predicting novel edges), NetREX outperformed other methods (see Supplementary Tables 5 and 6).
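A hypergeometric enrichment test of this kind can be sketched as follows. How the population, the set of novel validated edges, and the set of predicted edges are defined in the actual analysis is described in the Methods and Supplementary Tables; the mapping used in the comments here, the function name, and the toy numbers are illustrative assumptions.

```python
from math import comb, log10

def hypergeom_enrichment_pvalue(population, successes, draws, observed):
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws).

    population: number of candidate edges considered (assumed definition).
    successes:  number of candidates that are novel validated edges.
    draws:      number of new edges predicted by the method.
    observed:   number of predicted edges that are novel validated edges.
    """
    lower = max(observed, 0)
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(lower, upper + 1)
    )
    return tail / total

# Toy example: 10 candidate edges, 3 validated, 4 predicted, 2 of them validated.
p = hypergeom_enrichment_pvalue(population=10, successes=3, draws=4, observed=2)
print(p, log10(p))
```

Smaller p-values (more negative log values) indicate that the predicted edges contain more validated novel edges than expected by chance.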
Finally, we used the E. coli network to validate our PriorBoost scoring system. Because of the dependence on the prior, NetREX, or any other model that uses a prior, could be misled by a prior that is mostly wrong. This is observed in Fig. 2: when the prior network had 90% false negatives (the leftmost points in Fig. 2a), both NetREX and Inferelator perform badly. To evaluate the prior network in the absence of “gold standard” truth, PriorBoost applies the above described theoretical model to the E. coli data (the given expression data and priors). Figure 2c and f shows the robustness of PriorBoost scores for the perturbed prior networks (used by NetREX, Inferelator, MERLIN+P, and CoRegNet for Fig. 2a, b, d, e) with different noise levels. As demonstrated in Fig. 2c, f, PriorBoost scores correlate with the quality of the prior. In addition, a negative PriorBoost score correctly identified the situation in which NetREX cannot improve over GENIE3.
Reconstruction of Drosophila sex-specific GRNs
We applied NetREX and PriorBoost to construct sex-specific female and male GRNs for Drosophila. The adult female and male gene expression data were obtained from a large collection of expression profiles (99 lines of flies, with females and males profiled separately in replicates) that were perturbed by altering gene dose^{23,24}. This dataset provides a relatively large number of related samples that also have broad variability in gene expression patterns due to gene dosage alteration. Specifically, the dataset is derived from engineered chromosomal deletions, each of which deletes one of the two copies of a block of genes from a different region. Because all these deletions are heterozygous (viable and fertile in this state), there are no secondary (and worse) effects due to defects in development. All the flies are morphologically wild type. As demonstrated in refs. ^{28,29}, the expression changes caused by these genetic perturbations propagate and dissipate in gene network space, making this an ideal set for expression-based network reconstruction. Specifically, while transcriptional effects are perturbed, the underlying GRN is unbroken. These significantly perturbed expression profiles explore the expression space for the whole genome, as collectively essentially all genes show differential expression in at least one deletion. In addition, our estimates suggest that this set of ~100 experiments per sex (each in two biological replicates) is a sufficiently large dataset for NetREX to perform exceptionally well (Supplementary Note 4). For the prior network, we used a previously constructed context-agnostic network^{2}. This network was constructed by integrating diverse functional genomics datasets in a supervised learning framework.
Since much of the evidence used for the construction of this network was based on experiments performed on tissue culture cells, which show significantly different expression patterns relative to sexed adult flies, it was clear that extensive rewiring would be required to construct adult sex-specific networks. The prior networks for females and males are basically the same and correspond to the network predicted in refs. ^{30,31}. However, genes that were not expressed were removed from the prior. Since the set of non-expressed genes in females and males is not exactly the same, this introduces a subtle difference between the two priors (Supplementary Table 7). To test the validity of using this prior for adult sex-specific networks, we first used PriorBoost to test the consistency of the prior GRN with female and male gene expression data. The PriorBoost score was positive for the female expression data, indicating an informative prior, but was low for the male data (Fig. 3).
As an indirect way to evaluate the topology of a network, we used protein–protein interaction (PPI) scores and gene ontology (GO) scores (Methods)^{2}. Starting with the assumption that coregulated genes are more likely to belong to the same pathway, these scores measure enrichment in PPIs and consistency of GO annotations of coregulated genes. While these scores do not measure correctness of the network, they provide a coherency estimate to determine whether the network topology has expected network properties. We revised these scoring functions relative to their original definition (Methods) and show, using the E. coli network, that the revised scores have improved correlation with network quality (Supplementary Figure 5). We used these scores to gauge the quality of NetREX and GENIE3 networks under the same cutoffs (e.g., top 50,000, 100,000, 150,000… weighted edges). Consistent with PriorBoost scores, the networks produced by NetREX had very high scores for the female networks but relatively low scores for the male networks (Fig. 3 for PPI scores and Supplementary Tables 8–10 for GO scores).
The good performance for females was gratifying, but the poor performance of the prior for males was unsurprising, as several lines of evidence indicate that the organizational principles of the regulatory program of the testis are unique^{32,33,34,35,36,37,38,39,40,41,42,43}. The Drosophila testis has a radically different gene expression machinery compared to any other tissue^{32,33,35}. There are probably several causes of this special gene expression profile. Given that little of this unique “TF free” expression program (see Supplementary Note 3 for further discussion of this issue) was represented in the prior, this was a reassuring test for PriorBoost. If the poor performance of the prior for the male-specific GRN was indeed due to the peculiar nature of testis gene expression, then removing testis-biased expression should improve the performance of the prior. Indeed, the PriorBoost score for the prior network of the remaining genes was positive, and thus we used this network as a prior for reconstructing a male-specific GRN without genes highly expressed in testis. The resulting network also showed good performance, as measured by PPI scores (Fig. 3). In the remaining analysis, to avoid any bias, we did not include genes highly expressed in testis (for males) or ovary (for females). The female-specific and male-specific GRNs constructed by NetREX are provided in Supplementary Data 1 and Supplementary Data 2.
To validate the resulting GRNs, we measured the overlap of the predicted targets of doublesex (DSX), the key transcription factor controlling the majority of sex-biased expression in flies, with the targets identified from a combination of occupancy, binding motif, and comparative genomics data^{25}. Neither DSX occupancy nor DSX binding sites were included in the prior. Expression data resulting from direct perturbation of DSX activity were not used either. Since the prior network is based largely on embryos and tissue culture cells, not surprisingly, it contained only three of the thousands of predicted DSX targets. Therefore the performance of the method in predicting the DSX targets is particularly informative. NetREX was able to identify, with high precision, hundreds of these independently verified target genes (Fig. 4a, b). In particular, the top 100 NetREX predictions had 72 verified targets (the highest of all sets listed in Fig. 4c), as compared to MERLIN+P and GENIE3, which predicted 52 and 66 verified targets in their top 100 predictions, respectively. Inferelator inferred only three interactions. Overall, NetREX clearly outperformed other approaches on this test.
For an additional validation, we utilized the fact that, since DSX controls sexual development, the targets of DSX are expected to be enriched in genes that are differentially expressed between females and males, even though not all DSX targets are sex-specifically expressed at any given time in development^{25}. To test for the enrichment, and to avoid any confounding from validating with the same expression dataset used to generate the network models, we obtained a second dataset of sex-biased expression from GEO Series accession number GSE99574 (96 samples from GSM2647254 to GSM2647349) and used it to identify genes with sex-biased expression (details in Supplementary Note 4). When we asked which genes were predicted to be DSX targets in the predicted GRNs, we found that there were significantly more genes with sex-biased expression^{44} among those predictions in the NetREX models (hypergeometric test in Fig. 4d; gene set enrichment analysis in Fig. 4e). The other tested GRNs failed to show a significant enrichment for sex-biased gene expression among the predicted DSX targets. These data indicate that NetREX can successfully predict gene expression patterns in a novel experimental dataset.
As yet another test, we evaluated similarities between the female and male GRNs. NetREX predicted 505,548 and 293,458 edges for the female and male GRNs, respectively. We found that 149,462 edges are common to the female and male GRNs. Of these, 136,404 are included in the prior and 13,058 edges were newly predicted. While many differences between the GRNs exist, these networks are expected to be related, as there is also much in common between female and male adult Drosophila, and there are many genes that do not show, or show only modest, sex-biased expression. We measured the similarity of regulatory programs by comparing, for each gene, the agreement between the TFs predicted to regulate that gene in the female and male GRNs (Fig. 5a). We separately evaluated the consistency of regulatory programs for genes with sex-biased expression and for genes with non-sex-biased expression (Fig. 5b). Female and male GRNs inferred by NetREX show overall good consistency between their regulatory programs (Jaccard index above 0.2 in all sets) (Fig. 5b). This is in contrast to the other methods, where the average intersection/union (Jaccard index) values in all the tests are much smaller. Thus, NetREX shows an outstanding improvement in identifying common aspects of gene expression among the sexes. Furthermore, accounting for imperfections in the GRN prediction, we still expect that genes that are not differentially expressed between males and females have higher similarity of regulatory interactions than genes that show sex-specific expression. This is indeed what we found in Fig. 5b. The average Jaccard index for non-sex differentially expressed genes is much larger than the average Jaccard index for sex differentially expressed genes.
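The per-gene comparison described above can be illustrated with a short sketch that computes the Jaccard index (intersection over union) between the sets of TFs predicted to regulate each gene in two networks. The dictionary representation, the function name, and the handling of genes present in only one network (scored as 0) are assumptions made for this illustration; the gene and TF labels are placeholders.

```python
def regulator_jaccard(female_grn, male_grn):
    """Per-gene Jaccard index between predicted regulator sets in two GRNs.

    Each GRN is a dict mapping a gene to the set of TFs predicted to regulate it.
    Genes with no regulators in either network are skipped.
    """
    result = {}
    for gene in female_grn.keys() | male_grn.keys():
        female_tfs = set(female_grn.get(gene, ()))
        male_tfs = set(male_grn.get(gene, ()))
        union = female_tfs | male_tfs
        if union:
            result[gene] = len(female_tfs & male_tfs) / len(union)
    return result

female = {"g1": {"TF1", "TF2"}, "g2": {"TF3"}}
male = {"g1": {"TF2", "TF3"}, "g2": {"TF3"}, "g3": {"TF1"}}
scores = regulator_jaccard(female, male)
print(sorted(scores.items()))
# Average over a gene set, e.g. the genes without sex-biased expression.
print(sum(scores.values()) / len(scores))
```

Averaging these per-gene values separately over sex-biased and non-sex-biased gene sets gives the kind of summary compared across methods.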
Discussion
Gene regulation is context dependent. Gene regulatory networks depend on tissue, sex, developmental stage, and disease status, among many other conditions. Ultimately, every cell type at any given time has a slightly different network than spatially or temporally neighboring cells. Clearly, universal network models will fail to capture this complexity. However, capturing this regulatory complexity is essential for elucidating the differences between regulatory networks in healthy and disease states, during development, and in essentially any other biological condition. Thus, context-specific models are fundamental for understanding global regulatory mechanisms. However, direct measurement and modeling of context-dependent GRNs is a tremendous challenge, as a human, for example, is composed of roughly 37 trillion cells^{45}. Despite advances in single cell genomics, inferring GRNs for each organism/tissue/cell/condition separately through accumulation of huge numbers of condition-specific measurements is, and is likely to remain, impractical. We need methods that can leverage a smaller number of prior networks. For example, a reference GRN for Drosophila melanogaster might provide information about GRNs for related species, and wild-type models of specific Drosophila tissues and stages might inform the changes that occur when those networks are perturbed by mutations and/or environmental conditions. In this way, a context-agnostic network provides a good first approximation prior for the wiring diagram of a context-specific network that explains developmental progression or disease. Gene expression information is currently one of the most easily accessible context-specific data types. Therefore, it is important to be able to utilize these data, along with the prior knowledge, for construction of context-specific GRNs. To address this need, we introduced here a GRN inference method, NetREX.
The unique property of NetREX is that, starting from a prior network, it utilizes expression data to iteratively remodel the network, by adding and removing edges, until it converges on a network that explains the observed expression patterns. The fact that NetREX explores the network space around a prior network gives it a unique advantage when the target network is at least marginally similar to the prior network. The evaluation of the method on the E. coli network suggested that NetREX outperforms other methods once the overlap between the prior and the target network reaches ~20%. In addition, it is the only method that continues to improve over the prior network even when this network is already quite good. NetREX also performs very well on novel experimental datasets, both in terms of predicting independently validated interactions and in terms of network consistency. For all these reasons, NetREX is a significant milestone in the development of context-dependent network models from a limited set of adaptable reference networks.
While poor network models might someday be rare, currently many prior networks will have such poor quality that rewriting is futile, or can even degrade the performance of the model. In those cases, one would want to start from a model that does not use a prior. In the last decade, significant effort has been devoted to prior-free construction of GRNs from gene expression alone. Thus, it is important to have a method to evaluate the trade-off between prior-based and prior-free approaches. To address this challenge, we introduced PriorBoost, a method allowing a researcher to quantitatively gauge whether a given prior is helpful for constructing a context-specific network model from a given expression dataset. We demonstrated that PriorBoost was valuable for evaluating context-agnostic networks as possible priors for constructing both E. coli and sex-specific Drosophila GRNs, and used this method to show that a prior for males was inappropriate when unusual testis-specific expression was included. By using PriorBoost we could eliminate ad hoc decision-making on the utility of prior models.
Several new methodological advancements introduced in this work contributed to the success of NetREX. These contributions include the design of the objective function which, in addition to evaluating the fit of the network, steers the network search toward biologically relevant topologies. However, since adding and removing edges proceeds in discrete steps, the function optimized by NetREX is not continuous. A typical way of dealing with this issue is to convert the function to a continuous one (in this case by replacing the l_{0} norm, which is discrete, by the l_{1} norm, which is continuous) and to apply standard optimization techniques to the modified problem, even though it is not equivalent to the original one. An additional contribution of this work is the development of mathematical underpinnings that allow us to guarantee convergence of the NetREX search on the original l_{0} problem by utilizing the PALM optimization framework^{27}. These algorithmic advances, especially the convergence of calculations that include the l_{0} norm, are applicable more broadly to diverse feature selection approaches^{46,47}.
A key feature of NetREX is the ability to score the quality of a network topology given expression data, in the absence of ground truth. For this purpose, NetREX utilizes the NCA model. This model is based on the assumption that gene expression can be modeled as a linear combination of the activities of the regulating TFs, which is an oversimplification, but might approach the truth. Engineered gene expression modules in Drosophila show that TFs and enhancers act in a largely additive fashion as simple input/output devices^{48}.
While it was remarkable that, in the case of E. coli, NetREX was able to improve over a network that was already 80% or more correct, the ultimate test for a GRN is to use it to make biological predictions. We not only used NetREX to construct the first sex-specific regulatory networks for Drosophila, but also demonstrated that NetREX outperformed networks obtained with alternative methods. For example, NetREX identified the Darkener of apricot (Doa) locus as a female target of DSX. The Doa locus encodes a kinase that is a positive feedback regulator of the DSX pre-mRNA splicing event that generates the female-specific DSX TF^{49,50}. We also provide methods to avoid inappropriate application of NetREX. PriorBoost allowed us to directly determine whether a prior was suitable for rewriting a context-agnostic network, as demonstrated for accommodating unusual testis gene expression regulation due to specialized basal transcriptional machinery.
Overall, our results show that NetREX is a powerful method for integrating prior knowledge and expression data to reconstruct context-specific GRNs. Although NetREX relies strongly on the initial prior, the PriorBoost technique introduced here helps it avoid using an inappropriate prior and being misled by it.
Methods
NetREX
In contrast to most previous methods, which rely on the predictive power of the mRNA level of the TF (which might not reflect the cellular activity of the TF^{51}), NetREX reconstructs a GRN based on unknown TF activities A. NetREX simultaneously estimates the unknown TF activities A and rewires the prior network G_{0} until the structure of the rewired network S and the predicted TF activities A optimally explain the context-specific expression data E under the linear relationship \(E(i,:) = \sum_j S(i,j)\,A(j,:) + \Gamma(i,:)\), where E(i, :) represents the expression of gene i, S(i, j) represents the interaction between TF j and gene i and its regulatory potential, A(j, :) is the activity of TF j, and Γ(i, :) represents the noise. NetREX is therefore formulated as an optimization problem (1) that aims to find the optimal linear model, with several additional terms controlled by λ, κ, η, ξ, and μ that are designed to enforce important properties of the target regulatory network, as described below.
where S is the adjacency matrix of the network G that is the output of NetREX. ║·║_{0}, ║·║_{F}, and ║·║_{∞} are the l_{0}, Frobenius, and infinity norms, respectively. The ║·║_{0} norm computes the number of nonzero elements in the matrix of interest. More mathematical details about the formulation can be found in Supplementary Methods.
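As a small illustration of the linear model and the norms named above, the following sketch builds a toy expression matrix from a sparse S and random activities A, then evaluates the three norms. The matrix sizes, the values, and the helper names are hypothetical, and the infinity norm is taken here as the largest absolute entry, which is an assumption about how it is used to bound the activities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): N genes, M TFs, L samples.
N, M, L = 6, 3, 5

# S: gene-by-TF matrix of regulatory weights; zero means "no edge".
S = np.zeros((N, M))
S[0, 0], S[1, 0], S[2, 1], S[4, 2] = 1.5, -0.7, 0.9, 2.0

# A: TF-by-sample activities; Gamma: small additive noise.
A = rng.uniform(-1.0, 1.0, size=(M, L))
Gamma = 0.01 * rng.standard_normal((N, L))

# Linear model: E(i, :) = sum_j S(i, j) * A(j, :) + Gamma(i, :)
E = S @ A + Gamma


def l0_norm(X):
    """Number of nonzero elements of X."""
    return int(np.count_nonzero(X))


def frobenius_norm(X):
    """Square root of the sum of squared entries of X."""
    return float(np.sqrt(np.sum(X ** 2)))


def infinity_norm(X):
    """Largest absolute entry of X (elementwise convention, assumed here)."""
    return float(np.max(np.abs(X)))


print("edges in S:", l0_norm(S))
print("residual ||E - SA||_F:", frobenius_norm(E - S @ A))
print("largest |activity|:", infinity_norm(A))
```

The residual printed here equals the norm of the noise term, because S and A are the true generating matrices in this toy setting.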
The term controlled by λ restricts the number of edge changes from the prior network (Supplementary Methods: The Formulation of NetREX). A larger λ means that only a small number of edges can be added or removed, which controls how far the predicted network G is from the prior network G_{0}. The term controlled by κ (the graph embedding term^{52}) encourages related genes encoded in the gene–gene network G^{E} to be coregulated by the same TFs (Supplementary Methods: The Formulation of NetREX). Here G^{E} is the gene correlation network constructed from the gene expression data E, and L is the corresponding Laplacian matrix. The terms controlled by the parameters η and ξ, which we call the l_{0} elastic net, encourage sparsity of the final network with a group effect (Supplementary Methods: The Formulation of NetREX). For the reader familiar with the elastic net model, we point out that the l_{0} elastic net is analogous to the l_{1} elastic net^{53}. Notably, the graph embedding and the l_{0} elastic net only encourage edges with certain properties; they do not remove edges. NetREX removes an edge only if it finds TFs whose activities explain the expression of the gene(s) better than the TFs in the prior network. Finally, the last term, controlled by μ, enforces smoothness of the activities in A by limiting the number of elements in A that reach the limit {−b,b}. The strategy for selecting the NetREX parameters is discussed in Supplementary Note 6.
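Two of the terms described above can be made concrete with a short sketch. It assumes that edge changes are counted as entries whose zero/nonzero status differs between S and the prior, and that the graph embedding term takes the standard Laplacian form tr(SᵀLS) used in Laplacian eigenmaps^{52}. Both forms are assumptions for illustration, and the exact expressions are those in Supplementary Methods. All function names are hypothetical.

```python
import numpy as np


def edge_changes(S, S0):
    """Count edges added to or removed from the prior (support differences)."""
    return int(np.count_nonzero((S != 0) != (S0 != 0)))


def graph_laplacian(W):
    """Unnormalized Laplacian L = D - W of a symmetric gene-gene weight matrix."""
    return np.diag(W.sum(axis=1)) - W


def embedding_penalty(S, Lap):
    """tr(S^T L S): small when linked genes have similar regulator rows in S."""
    return float(np.trace(S.T @ Lap @ S))


def pairwise_form(S, W):
    """Equivalent form: 0.5 * sum_ij W_ij * ||S(i, :) - S(j, :)||^2."""
    diff = S[:, None, :] - S[None, :, :]
    return float(0.5 * np.sum(W * np.sum(diff ** 2, axis=2)))


# Toy gene-gene network: genes 0 and 1 are linked, gene 2 is isolated.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
Lap = graph_laplacian(W)

S_prior = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # linked genes share TF 0
S_rewired = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]) # gene 1 moved to TF 1

print(edge_changes(S_rewired, S_prior))      # one edge removed, one added
print(embedding_penalty(S_prior, Lap))       # linked genes coregulated
print(embedding_penalty(S_rewired, Lap))     # linked genes no longer coregulated
```

The penalty grows when two genes that are linked in the gene–gene network are assigned different regulators, which is the behavior the text attributes to the κ term.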
The optimization problem (1) with given parameters can be solved using the Proximal Alternating Linearized Minimization (PALM) algorithm^{27}, which guarantees convergence (Supplementary Methods: Optimization Behind the NetREX Algorithm). The outputs of the PALM algorithm, A and S, are the estimated TF activities and the predicted context-specific GRN, respectively. We can rank the edges in S by a confidence score B that measures their impact on the overall performance of the linear model^{18} (Supplementary Methods: Ranking Interactions and Bootstrapping).
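To give a feel for how a PALM-style iteration handles an l_{0} term, here is a deliberately simplified sketch. It is not the NetREX algorithm: it keeps only the least-squares fit, a single l_{0} sparsity penalty on S with weight `eta`, and a box constraint on A, and it omits the prior, graph embedding, and remaining terms of problem (1). Each block takes a gradient step on the smooth fit term followed by a proximal map, which is hard thresholding for the l_{0} penalty and clipping for the box. All names and parameter values are hypothetical.

```python
import numpy as np


def hard_threshold(V, tau):
    """Proximal map of tau * ||.||_0: zero out entries with |v| <= sqrt(2 * tau)."""
    out = V.copy()
    out[np.abs(V) <= np.sqrt(2.0 * tau)] = 0.0
    return out


def objective(E, S, A, eta):
    """0.5 * ||E - S A||_F^2 + eta * ||S||_0 (simplified stand-in objective)."""
    return 0.5 * np.sum((E - S @ A) ** 2) + eta * np.count_nonzero(S)


def palm_sketch(E, S0, A0, eta=0.05, b=2.0, gamma=1.1, n_iter=50):
    S, A = S0.copy(), np.clip(A0, -b, b)
    history = [objective(E, S, A, eta)]
    for _ in range(n_iter):
        # S-block: step size from the Lipschitz constant ||A A^T||_2 of the gradient.
        c = gamma * max(np.linalg.norm(A @ A.T, 2), 1e-12)
        grad_S = (S @ A - E) @ A.T
        S = hard_threshold(S - grad_S / c, eta / c)
        # A-block: gradient step, then projection onto the box [-b, b].
        d = gamma * max(np.linalg.norm(S.T @ S, 2), 1e-12)
        grad_A = S.T @ (S @ A - E)
        A = np.clip(A - grad_A / d, -b, b)
        history.append(objective(E, S, A, eta))
    return S, A, history


# Toy data generated from a sparse S and bounded activities.
rng = np.random.default_rng(0)
S_true = np.zeros((8, 3))
S_true[[0, 1, 2, 3, 4, 5], [0, 0, 1, 1, 2, 2]] = [1.0, -1.5, 2.0, 0.8, -1.2, 1.7]
A_true = rng.uniform(-1.0, 1.0, size=(3, 10))
E = S_true @ A_true + 0.01 * rng.standard_normal((8, 10))

S_hat, A_hat, history = palm_sketch(
    E, rng.standard_normal((8, 3)), rng.uniform(-1.0, 1.0, size=(3, 10))
)
print("objective: %.4f -> %.4f" % (history[0], history[-1]))
print("nonzero entries in S_hat:", np.count_nonzero(S_hat))
```

Because each step size exceeds the block Lipschitz constant (`gamma > 1`), the objective does not increase from one iteration to the next, which is the sufficient-decrease property that underlies the convergence analysis of PALM.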
To further improve the inference and make it more robust against overfitting and sampling errors, we use a bootstrapping strategy, in which we resample the gene expression data with replacement and solve problem (1) on the new dataset. This procedure is repeated several times, and the resulting lists of edges are combined into a final ranked list as in ref. ^{54}. For reconstruction of GRNs in a new context, where we do not have any ground truth information, different parameters are applied and the final ranking of the edges is obtained by consensus over the results under different parameters^{54} (Supplementary Methods: Model Selection of NetREX). Parameter settings of NetREX for all experiments are given in Supplementary Note 6.
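The bootstrapping strategy described above can be sketched as follows. The sketch assumes that samples are the columns of E and that the per-replicate edge lists are combined by mean rank, which is one common consensus rule and is an assumption here. The scoring function is a hypothetical stand-in (absolute correlation with the first few rows treated as TFs); in NetREX, each replicate would instead be scored by solving problem (1).

```python
import numpy as np


def toy_scores(E, n_tfs):
    """Hypothetical stand-in scorer: |Pearson correlation| of each gene with
    the first n_tfs rows of E. Not the NetREX confidence score."""
    C = np.abs(np.corrcoef(E))[:, :n_tfs]
    C[np.arange(n_tfs), np.arange(n_tfs)] = 0.0  # ignore self-edges
    return C


def bootstrap_rank(E, score_fn, n_boot=20, seed=0):
    """Resample samples (columns) with replacement and average edge ranks."""
    rng = np.random.default_rng(seed)
    n_samples = E.shape[1]
    rank_sum = None
    for _ in range(n_boot):
        cols = rng.integers(0, n_samples, size=n_samples)  # with replacement
        scores = score_fn(E[:, cols])
        order = np.argsort(-scores, axis=None)   # strongest edge first
        ranks = np.empty(order.size)
        ranks[order] = np.arange(order.size)     # rank 0 = strongest edge
        ranks = ranks.reshape(scores.shape)
        rank_sum = ranks if rank_sum is None else rank_sum + ranks
    return rank_sum / n_boot  # lower mean rank = more consistently strong edge


# Toy data: gene 4 closely follows TF 1; everything else is noise.
rng = np.random.default_rng(1)
E = rng.standard_normal((6, 12))
E[4] = E[1] + 0.01 * rng.standard_normal(12)

mean_rank = bootstrap_rank(E, lambda X: toy_scores(X, n_tfs=2))
best = np.unravel_index(np.argmin(mean_rank), mean_rank.shape)
print("most consistent edge (gene, TF):", best)
```

Edges that score highly only in a few replicates receive a poor mean rank, which is how resampling guards against edges supported by a handful of samples.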
Efficiency and scalability are important for utility. NetREX needs to store the expression data and the prior network; therefore, the space complexity of NetREX is O(NL+NM), where N is the number of genes, L is the number of samples, and M is the number of TFs. Based on Algorithm 1 (Supplementary Methods: Optimization Behind the NetREX Algorithm), the heaviest computation in each iteration of NetREX is the evaluation of the partial derivatives of the objective function, so the time complexity of each iteration is O(NML). The overall time complexity of NetREX is therefore O(CNML), where C is the number of iterations that NetREX takes in a run. Both the space and time complexities scale linearly with the number of samples L.
PriorBoost
The assessment of the prior network suitability is based on two ideas. First, the quality of any network G can be estimated by the consistency between the structure of the network and the expression data. Such consistency is validated in E. coli data (Supplementary Methods: The PriorBoost Score and Supplementary Figure 2) and can be computed by the following equation.
\(S \in G\) means that the nonzero pattern of S conforms to the structure induced by G. Equation (3) is in fact the original formulation of NCA^{26}, and q(G) is the optimal objective function value after solving NCA. Second, if a prior network is consistent with the given expression data, the network predicted by a prior-based method should be better than the network inferred by an expression-based method. The expression-based method we used here is Genie3, which was the winner of the DREAM4^{54} and DREAM5^{13} challenges.
Specifically, suppose we have a prior network G_{0} and expression data E. G^{*} is the network predicted using both expression E and the prior G_{0}, and \(\bar G\) is the network predicted by Genie3 using only expression E. Let \(G_c^ \ast\) and \(\bar G_c\) be networks obtained by keeping the top c edges in G^{*} and \(\bar G\) based on their edge weights, respectively. Then, the PriorBoost score for the prior network G_{0} can be estimated by
where C is a set of different cutoffs. Positive Q(G_{0}) indicates that the network predicted using E and G_{0} is more consistent with the expression data E than the network predicted by Genie3. A positive Q(G_{0}) also implies that the prior network is informative, while a negative Q(G_{0}) indicates the opposite.
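The PriorBoost comparison can be sketched as below, under two loudly stated assumptions. First, q(G) is approximated by a simple alternating least-squares fit of E ≈ SA with S restricted to the edges of G; this is a heuristic stand-in, not the NCA solver used in the paper. Second, Q(G_{0}) is taken as the sum over cutoffs of q for the expression-only network minus q for the prior-based network, so that a positive value means the prior-based network fits the expression data better; the exact aggregation used in the paper may differ. All function names are hypothetical.

```python
import numpy as np


def nca_residual(E, mask, n_iter=30, seed=0):
    """Heuristic q(G): squared residual of E ~ S A with supp(S) inside mask."""
    rng = np.random.default_rng(seed)
    n_genes, n_tfs = mask.shape
    S = mask * rng.standard_normal((n_genes, n_tfs))
    for _ in range(n_iter):
        A = np.linalg.lstsq(S, E, rcond=None)[0]          # activities given S
        for i in range(n_genes):                          # weights given A
            idx = np.flatnonzero(mask[i])
            S[i] = 0.0
            if idx.size:
                S[i, idx] = np.linalg.lstsq(A[idx].T, E[i], rcond=None)[0]
    return float(np.sum((E - S @ A) ** 2))


def top_c_mask(W, c):
    """Boolean mask of the c edges with the largest absolute weight in W."""
    flat = np.abs(W).ravel()
    mask = np.zeros(flat.size, dtype=bool)
    mask[np.argsort(-flat)[:c]] = True
    return mask.reshape(W.shape) & (W != 0)


def priorboost_score(E, W_star, W_bar, cutoffs, q=nca_residual):
    """Sum over cutoffs of q(expression-only network) - q(prior-based network)."""
    return sum(q(E, top_c_mask(W_bar, c)) - q(E, top_c_mask(W_star, c))
               for c in cutoffs)


# Toy edge-weight matrices (4 genes x 2 TFs).
W_star = np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 0.2], [0.0, 0.0]])  # prior-based
W_bar = np.array([[0.0, 0.9], [0.0, 0.8], [0.7, 0.0], [0.6, 0.5]])   # expression-only

rng = np.random.default_rng(2)
E = rng.standard_normal((4, 8))
print("Q(G0) on toy data:", priorboost_score(E, W_star, W_bar, cutoffs=[2, 3]))
```

Passing `q` as an argument keeps the comparison logic separate from the fitting procedure, so a proper NCA solver could be substituted without changing `priorboost_score`.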
Novel TF-gene interactions for E. coli
In addition to the 2066 TF-gene interactions used in the DREAM5 challenge, we identified 230 additional interactions that were discovered after DREAM5 from RegulonDB 9.2 (version 09082016)^{22}. We utilized these 230 interactions to test the ability of each method to predict novel interactions.
The PPI score
One way to validate a GRN is to test whether physically interacting genes are preferentially coregulated. Here we introduce and validate a modification of the previously proposed score based on this idea^{2}. We consider two genes to be coregulated if the Jaccard similarity coefficient between the TF set regulating the first gene and the TF set regulating the second gene is >0.5. The Jaccard similarity coefficient between two sets is the ratio of the size of their intersection to the size of their union. Our measure is based on the following hypergeometric test. Suppose that there are N PPIs among M gene pairs, and that there are m coregulated gene pairs in the predicted network, n of which have PPIs. The p-value is the probability of selecting more than n PPIs when we choose m gene pairs at random. The PPI score is defined as −log_{10}(p-value). We tested the PPI score on simulated E. coli GRNs with different noise levels, controlled by the percentage of true edges and the ratio of true to false edges. We found that the PPI score defined in this way is more consistent with the quality of the network than the previously proposed measure^{2} (Supplementary Figure 5).
While the PPI score can be very useful, it should be used with caution. In particular, it should not be used to compare networks that are sparse (a network has to have a significant number of coregulated genes for the score to be meaningful) and, like any p-value-based score, it should not be used to compare networks of very different sizes.
Finally, note that the PPI score is independent of expression data and thus it can be used to evaluate topology of the network but not its relation to the experimental data.
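The PPI score defined above can be sketched with the standard library alone. The sketch follows the text literally: coregulation requires a Jaccard coefficient above 0.5, and the p-value is the hypergeometric probability of drawing strictly more than n PPIs among m randomly chosen gene pairs. Taking all unordered pairs of genes in the network as the universe of M pairs is an assumption, as are the function names and the toy data.

```python
from fractions import Fraction
from math import comb, inf, log10


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coregulated_pairs(regulators, threshold=0.5):
    """Gene pairs whose regulator sets have Jaccard similarity > threshold."""
    genes = sorted(regulators)
    return {frozenset((g, h))
            for i, g in enumerate(genes) for h in genes[i + 1:]
            if jaccard(regulators[g], regulators[h]) > threshold}


def hypergeom_tail(M, N, m, n):
    """P(X > n) when m pairs are drawn from M pairs that contain N PPIs."""
    hits = sum(comb(N, k) * comb(M - N, m - k)
               for k in range(n + 1, min(N, m) + 1))
    return Fraction(hits, comb(M, m))


def ppi_score(regulators, ppi_pairs):
    """-log10 of the hypergeometric p-value; infinite if the p-value is 0."""
    M = comb(len(regulators), 2)            # assumed universe: all gene pairs
    ppis = {frozenset(p) for p in ppi_pairs}
    coreg = coregulated_pairs(regulators)
    p = hypergeom_tail(M, len(ppis), len(coreg), len(coreg & ppis))
    return inf if p == 0 else -log10(float(p))


# Toy network: gene -> set of regulating TFs, plus a toy PPI list.
regulators = {"g1": {"T1", "T2"}, "g2": {"T1", "T2"},
              "g3": {"T3"}, "g4": {"T3"}, "g5": {"T1"}}
ppi_pairs = [("g1", "g2"), ("g1", "g3"), ("g2", "g5")]

print(sorted(tuple(sorted(p)) for p in coregulated_pairs(regulators)))
print(ppi_score(regulators, ppi_pairs))
```

Note that with the strict "more than n" tail, the p-value is zero whenever every coregulated pair is a PPI, which the sketch reports as an infinite score; an inclusive tail (at least n) is a common alternative convention.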
The GO score
The GO score of coregulated genes was computed analogously to the PPI score^{2}, with the following modification. For each coregulated gene pair, we again use the Jaccard similarity coefficient to measure the similarity between the GO annotation set of the first gene and that of the second gene, and we consider the coregulated genes to be functionally similar if the similarity is >0.5. Instead of using all GO terms^{55}, we only considered high-level GO terms with information content (IC) larger than two so that we can better understand the functional specificity of the coregulated gene pairs^{55,56}. The IC of a GO term g is defined as −ln(g/root), where ‘root’ is the corresponding root GO term (either F, P, or C) of g^{55,56}. We also used the hypergeometric test to obtain a p-value indicating the enrichment of functionally similar gene pairs among the coregulated gene pairs inferred by the networks. The GO score is likewise defined as −log_{10}(p-value). We illustrated the effectiveness of GO scores on simulated E. coli GRNs (Supplementary Figure 5).
As in the case of PPI scores, computing GO scores might not be meaningful in some situations.
The DSX targets
The experimentally supported DSX target genes are the union of two sets. The first set of genes was obtained based on ChIP-seq gene-level occupancy scores^{25}. The second set was collected based on conserved motif scores^{25}. The experimentally supported DSX target gene set served as the ground truth for investigating the predictive power of different methods (details are in Supplementary Note 4).
Highly expressed genes in ovary or testis
We used the quantification of tissue-specific expression from modENCODE as summarized in FlyBase^{57}. FlyBase assigns genes to bins depending on their expression in a given tissue. “Bin_value” is an integer that ranges from 0 to 6, where 0 means that a gene has very low expression and 6 means that it has extremely high expression. We identified all genes expressed in ovary or testis with “Bin_value” >5 and treated them as genes highly expressed in ovary or testis.
Code availability
The integrative networks, input and validation datasets, as well as the source code used for network inference and validation are provided in online supplementary information and on the companion website of the paper (https://www.ncbi.nlm.nih.gov/CBBresearch/Przytycka/index.cgi#netrex (Matlab) and https://github.com/ncbi/NetREX (Python)).
Data availability
All the data used in this study (data for E. coli, female, and male flies) are included in https://www.ncbi.nlm.nih.gov/CBBresearch/Przytycka/index.cgi#netrex. The female-specific and male-specific GRNs constructed by NetREX are provided in Supplementary Data 1 and Supplementary Data 2.
Change history
29 July 2019
An amendment to this paper has been published and can be accessed via a link at the top of the paper
References
Banf, M. & Rhee, S. Y. Enhancing gene regulatory network inference through data integration with markov random fields. Sci. Rep. 7, 41174 (2017).
Marbach, D. et al. Predictive regulatory models in Drosophila melanogaster by integrative inference of transcriptional networks. Genome Res. 22, 1334–1349 (2012).
Novershtern, N., Regev, A. & Friedman, N. Physical Module Networks: an integrative approach for reconstructing transcription regulation. Bioinformatics 27, i177–85 (2011).
Clough, E. & Oliver, B. Genomics of sex determination in Drosophila. Brief. Funct. Genom. 11, 387–394 (2012).
Fletcher, M. N. C. et al. Master regulators of FGFR2 signalling and breast cancer risk. Nat. Commun. 4, 2464 (2013).
Reverter, A. & Chan, E. K. F. Combining partial correlation and an information theory approach to the reversed engineering of gene coexpression networks. Bioinformatics 24, 2491–2497 (2008).
Nicolle, R., Radvanyi, F. & Elati, M. CoRegNet: reconstruction and integrated analysis of coregulatory networks. Bioinformatics 31, 3066–3068 (2015).
Haury, A.-C., Mordelet, F., Vera-Licona, P. & Vert, J.-P. TIGRESS: trustful inference of gene regulation using stability selection. BMC Syst. Biol. 6, 145 (2012).
Faith, J. J. et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 5, e8 (2007).
Statnikov, A. & Aliferis, C. F. Analysis and computational dissection of molecular signature multiplicity. PLoS Comput. Biol. 6, e1000790 (2010).
Huynh-Thu, V. A., Irrthum, A., Wehenkel, L. & Geurts, P. Inferring regulatory networks from expression data using tree-based methods. PLoS ONE 5, e12776 (2010).
Küffner, R., Petri, T., Tavakkolkhah, P., Windhager, L. & Zimmer, R. Inferring gene regulatory networks by ANOVA. Bioinformatics 28, 1376–1382 (2012).
Marbach, D. et al. Wisdom of crowds for robust gene network inference. Nat. Methods 9, 796–804 (2012).
Vogel, C. & Marcotte, E. M. Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat. Rev. Genet. 13, 227–232 (2012).
Liao, J. C. et al. Network component analysis: reconstruction of regulatory signals in biological systems. Proc. Natl Acad. Sci. USA 100, 15522–15527 (2003).
Friedman, J., Hastie, T., Höfling, H. & Tibshirani, R. Pathwise coordinate optimization. Ann. Appl. Stat. 1, 302–332 (2007).
Mukherjee, S. & Speed, T. P. Network inference using informative priors. Proc. Natl Acad. Sci. USA 105, 14313–14318 (2008).
Greenfield, A., Hafemeister, C. & Bonneau, R. Robust data-driven incorporation of prior knowledge into the inference of dynamic regulatory networks. Bioinformatics 29, 1060–1067 (2013).
Petralia, F., Wang, P., Yang, J. & Tu, Z. Integrative random forest for gene regulatory network inference. Bioinformatics 31, i197–205 (2015).
Siahpirani, A. F. & Roy, S. A prior-based integrative framework for functional transcriptional regulatory network inference. Nucleic Acids Res. https://doi.org/10.1093/nar/gkw1160 (2016).
Arrieta-Ortiz, M. L. et al. An experimentally supported model of the Bacillus subtilis global transcriptional regulatory network. Mol. Syst. Biol. 11, 839 (2015).
Gama-Castro, S. et al. RegulonDB version 9.0: high-level integration of gene regulation, coexpression, motif clustering and beyond. Nucleic Acids Res. 44, D133–D143 (2016).
Ryder, E. et al. The DrosDel deletion collection: a Drosophila genome-wide chromosomal deficiency resource. Genetics 177, 615–629 (2007).
Lee, H. et al. Effects of gene dose, chromatin, and network topology on expression in Drosophila melanogaster. PLoS Genet. 12, e1006295 (2016).
Clough, E. et al. Sex- and tissue-specific functions of Drosophila doublesex transcription factor target genes. Dev. Cell 31, 761–773 (2014).
Liao, J. C. et al. Network component analysis: reconstruction of regulatory signals in biological systems. Proc. Natl Acad. Sci. USA 100, 15522–15527 (2003).
Bolte, J., Sabach, S. & Teboulle, M. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146, 459–494 (2013).
Méndez-Cruz, C. F. et al. First steps in automatic summarization of transcription factor properties for RegulonDB: classification of sentences about structural domains and regulated processes. Database 2017, bax070 (2017).
Lee, H. et al. Effects of gene dose, chromatin, and network topology on expression in Drosophila melanogaster. PLoS Genet. 12, e1006295 (2016).
Clough, E. et al. Sex- and tissue-specific functions of Drosophila doublesex transcription factor target genes. Dev. Cell 31, 761–773 (2014).
Marbach, D. et al. Predictive regulatory models in Drosophila melanogaster by integrative inference of transcriptional networks. Genome Res. 22, 1334–1349 (2012).
Graveley, B. R. et al. The developmental transcriptome of Drosophila melanogaster. Nature 471, 473–479 (2011).
Andrews, J. et al. Gene discovery using computational and microarray analysis of transcription in the Drosophila melanogaster testis. Genome Res. 10, 2030–2043 (2000).
Parisi, M. et al. Paucity of genes on the Drosophila X chromosome showing male-biased expression. Science 299, 697–700 (2003).
Parisi, M. et al. A survey of ovary-, testis-, and soma-biased gene expression in Drosophila melanogaster adults. Genome Biol. 5, R40 (2004).
Brown, J. B. et al. Diversity and dynamics of the Drosophila transcriptome. Nature 512, 393–399 (2014).
Lu, C. & Fuller, M. T. Recruitment of mediator complex by cell type and stage-specific factors required for tissue-specific TAF dependent gene activation in an adult stem cell lineage. PLoS Genet. 11, e1005701 (2015).
Hiller, M. et al. Testis-specific TAF homologs collaborate to control a tissue-specific transcription program. Development 131, 5297–5308 (2004).
Chen, X., Hiller, M., Sancak, Y. & Fuller, M. T. Tissue-specific TAFs counteract Polycomb to turn on terminal differentiation. Science 310, 869–872 (2005).
Santel, A., Kaufmann, J., Hyland, R. & Renkawitz-Pohl, R. The initiator element of the Drosophila beta2 tubulin gene core promoter contributes to gene expression in vivo but is not required for male germ-cell specific expression. Nucleic Acids Res. 28, 1439–1446 (2000).
Bielinska, B., Lü, J., Sturgill, D. & Oliver, B. Core promoter sequences contribute to ovoB regulation in the Drosophila melanogaster germline. Genetics 169, 161–172 (2005).
Olenkina, O. M. et al. Promoter contribution to the testis-specific expression of Stellate gene family in Drosophila melanogaster. Gene 499, 143–153 (2012).
Bai, Y., Casola, C. & Betrán, E. Quality of regulatory elements in Drosophila retrogenes. Genomics 93, 83–89 (2009).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Rozenblatt-Rosen, O., Stubbington, M. J. T., Regev, A. & Teichmann, S. A. The Human Cell Atlas: from vision to reality. Nature 550, 451–453 (2017).
Aben, N., Vis, D. J., Michaut, M. & Wessels, L. F. A. TANDEM: a two-stage approach to maximize interpretability of drug response models based on multiple molecular data types. Bioinformatics 32, i413–i420 (2016).
Das, J., Gayvert, K. M., Bunea, F., Wegkamp, M. H. & Yu, H. ENCAPP: elastic-net-based prognosis prediction and biomarker discovery for human cancers. BMC Genom. 16, 263 (2015).
Crocker, J., Ilsley, G. R. & Stern, D. L. Quantitatively predictable control of Drosophila transcriptional enhancers in vivo with engineered transcription factors. Nat. Genet. 48, 292–298 (2016).
Du, C., McGuffin, M. E., Dauwalder, B., Rabinow, L. & Mattox, W. Protein phosphorylation plays an essential role in the regulation of alternative splicing and sex determination in Drosophila. Mol. Cell 2, 741–750 (1998).
Rabinow, L. & Samson, M.-L. The role of the Drosophila LAMMER protein kinase DOA in somatic sex determination. J. Genet. 89, 271–277 (2010).
Vogel, C. & Marcotte, E. M. Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat. Rev. Genet. 13, 227–232 (2012).
Belkin, M. & Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15, 1373–1396 (2003).
Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67, 301–320 (2005).
Marbach, D. et al. Revealing strengths and weaknesses of methods for gene network inference. Proc. Natl Acad. Sci. USA 107, 6286–6291 (2010).
Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucleic Acids Res. 43, D1049–D1056 (2015).
Wang, Y. & Qian, X. Functional module identification in protein interaction networks by interaction patterns. Bioinformatics 30, 81–93 (2014).
Gramates, L. S. et al. FlyBase at 25: looking to the future. Nucleic Acids Res. 45, D663–D671 (2017).
Acknowledgements
We thank members of the Przytycka and Oliver labs for useful discussions. This research was supported by the Intramural Research Programs of the National Library of Medicine (NLM), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), and the Korean Visiting Scientist Training Award (KVSTA, HI13C1282 to H.L.).
Author information
Contributions
Y.W., D.C., H.L. and J.F. conceived the research project. B.O. and T.P.M. supervised the research project. Y.W. and D.C. designed the computational formulation and algorithm. Y.W. implemented NetREX and tested it on the benchmark data. Y.W., D.C., H.L., J.F., B.O. and T.P.M. designed evaluation methods to validate the predicted female and male fly GRNs. Y.W. applied the evaluation methods to analyse the results. Y.W., B.O. and T.P.M. wrote the manuscript with support from all the authors.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, Y., Cho, DY., Lee, H. et al. Reprogramming of regulatory network using expression uncovers sex-specific gene regulation in Drosophila. Nat Commun 9, 4061 (2018). https://doi.org/10.1038/s41467-018-06382-z