Abstract
The ability to accurately predict the causal relationships from transcription factors to genes would greatly enhance our understanding of transcriptional dynamics. This could lead to applications in which one or more transcription factors could be manipulated to effect a change in genes leading to the enhancement of some desired trait. Here we present a method called OutPredict that constructs a model for each gene based on time series (and other) data and that predicts each gene's expression at a previously unseen subsequent time point. The model also infers causal relationships based on the most important transcription factors for each gene model, some of which have been validated by previous physical experiments. The method benefits from known network edges and steady-state data to enhance predictive accuracy. Our results across B. subtilis, Arabidopsis, E. coli, Drosophila and the DREAM4 simulated in silico dataset show improved predictive accuracy ranging from 40% to 60% over other state-of-the-art methods. We find that gene expression models can benefit from the addition of steady-state data to predict expression values of time series. Finally, we validate, based on limited available data, that the influential edges we infer correspond to known relationships significantly more than expected by chance or by state-of-the-art methods.
Introduction
State-of-the-art methods for gene regulatory network inference^{1,2,3,4} use machine learning on genome-wide sequencing data to predict the interactions between transcriptional regulators and target genes. A typical approach to gene network inference is to take the results of an assay, most often binding assays such as ChIP-seq, and divide the data into training and test sets. This involves excluding some of the transcription factor-target binding observations, and using the remaining training set to infer the hidden data by some method. An issue with this approach is that it presumes that the majority of binding events are physiologically meaningful, in the sense that they influence the expression of the target gene. However, it has been shown that the physiological importance of binding can be minor^{5}.
Another frequent issue with the paradigmatic network inference approach is that the resulting networks encode linear interactions (sum of weighted effects of causal elements). This modeling strategy makes pragmatic sense in the common situation in which the number of possible interactions is much greater than the experimental data points, because linear models have fewer parameters to fit^{6}. Unfortunately, genomic interactions are decidedly nonlinear, noisy and incomplete^{7}.
For these reasons, we have approached the causality problem differently: we first attempt to build a model for each gene g that can predict the expression of that gene in leftout time points. If our model is good, then the transcription factors that most influence gene g likely constitute the causal elements for g.
The form of the model is important here. Small data sizes relative to the number of causal elements preclude the use of neural networks and, in particular, deep neural networks, which would increase the number of model parameters. The presence of nonlinear relationships excludes linear methods. As a compromise, therefore, this work uses Random Forests (RF) because they model nonlinear synergistic interactions of features and perform well even when sample sizes are small^{8}, though noise is always an issue.
The Random Forests within our new method OutPredict (OP) consist of an ensemble of regression trees tuned through extensive bootstrap sampling. We show the following: (i) The OutPredict model allows for nonlinear dependencies of target genes on causal transcription factors; (ii) OutPredict can incorporate time series, steady-state, and prior (e.g. known transcription factor-target interactions) information to bias the forecasts; (iii) OutPredict forecasts the expression value of genes at an unseen time point better than state-of-the-art methods, partly because of steady-state and known interaction data; and (iv) the important edges inferred from OutPredict correspond to validated edges significantly more often than other state-of-the-art methods.
We compare the OutPredict method to state-of-the-art forecasting algorithms, such as Dynamic Genie3^{9}, that support forecasting and nonlinear relationships, but currently lack the ability to incorporate priors. Other time-based machine learning methods such as Inferelator^{6} and Dynamic Factor Graph^{10}, which we used in our previous studies^{11,12}, are based on regularized linear regression. We also compare OutPredict with a neural net-based method built to predict gene expression time series^{13}.
Another relevant time series method from the literature is Granger causality, which has been used successfully for small numbers of genes^{14,15}. Granger causality is a vector autoregressive method that can be used to infer important transcription factors. In our case, however, we are trying to optimize predictive power using a large number of candidate transcription factors using very short time series (e.g. 6 time points). As is well known^{16}, Granger causality can give misleading results in such a setting because the time series are short, causal relationships are nonlinear, and the time series are nonstationary.
Data
Public datasets vary greatly by organism with respect to experimental design, data density, time series structure and assay technologies. To show its general applicability, we test OutPredict on five different datasets (Table 1): (i) a Bacillus subtilis dataset, (ii) an Arabidopsis dataset in shoot tissue, (iii) an Escherichia coli dataset, (iv) a Drosophila time series dataset, and (v) the DREAM4 one-hundred-node in silico challenge. When applicable, we denote data as "gold standard" when it is highly curated regulatory or binding data.
B. subtilis
This dataset consists of time series and steady-state data capturing the response of B. subtilis to a variety of stimuli^{17}. The gold standard network prior is a curated collection of high-confidence edges from high-throughput ChIP-seq and transcriptomics assays on SubtiWiki^{18} (we used the parsed dataset provided in ref.^{19}).
Arabidopsis thaliana in shoots
This dataset consists of gene expression levels measured from shoots over the 2-hour period during which the plants are treated with nitrogen^{12}. As gold standard network data, we used experimentally validated edges from the plant cell-based TARGET assay, which was used to identify directly regulated genome-wide targets of N uptake/assimilation regulators^{12}.
E. coli
This dataset includes the E. coli gene expression values, measured at multiple time points following five distinct perturbations (i.e., cold, heat, oxidative stress, glucose-lactose shift and stationary phase)^{20}. We used as gold standard ancillary data the regulatory interactions aggregated from a variety of experimental and computational methods that have been collected and described in RegulonDB^{21}. We retrieved both the parsed expression dataset and the gold standard data from ref.^{9}.
Drosophila melanogaster
This dataset consists of gene expression levels covering a 24-hour period; it captures the changes during which the embryogenesis of the fruit fly Drosophila occurs^{22}. As gold standard network data, we used the experimentally validated TF-target binding interactions in the DroID database^{23}. These interactions come from a combination of ChIP-chip/ChIP-seq, DNase footprinting, in vivo/in vitro reporter assays and EMSA assays across various tissues from 235 publications. Huynh-Thu et al.^{9} also used this Drosophila data.
DREAM4 synthetic data
This synthetic dataset from the DREAM4 competition consists of 100 genes and 100 TFs (any gene can be a regulator)^{24}. Because this is synthetic data, the underlying causality network is known.
Methods
Time series predictions using Random Forests
OutPredict learns a function that maps the expression values of all active transcription factors at time t to the expression value of each target gene (whether a transcription factor or not) at the next time point. Thus, for each target gene, OutPredict learns a many-to-one nonlinear model relating transcription factors to that target gene.
The gene function is embodied in a Random Forest, as used previously in Genie3^{25}, iRafNet^{26}, DynGenie3^{9}. When used on a single time series, the Random Forest for each gene is trained on all consecutive pairs of time points except the last time point. For example, if there are seven time points in the time series, then the Random Forest is trained based on the transitions from time point 1 to 2, 2 to 3, …, 5 to 6. Time point 7 will be predicted based on the trained function when applied to the data of time point 6. The net effect is that the testing points are not used in the training in any way because the test set includes only the last time points of each time series.
When multiple time series are available, OutPredict trains the Random Forest on all consecutive pairs of time points (always excluding the last time point of each series) across all time series. Further, OutPredict treats replicates independently, viz. if there are k1 replicates for time point t1 and k2 for subsequent time point t2, then we consider k1 × k2 combinations in the course of our training. The result of the training is to construct a single function f for each target gene that applies to all time series. To test the quality of function f, we evaluate the mean squared error (MSE) on the last point of every time series for that target gene.
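The training and test split described above can be sketched as follows. This is a minimal illustration rather than the OutPredict implementation; the function name `build_training_pairs` and the nested-list data layout are assumptions made for the example.

```python
from itertools import product

def build_training_pairs(series):
    """Split one time series into training transitions and a held-out test case.

    `series` is a list of time points; each time point is a list of replicate
    measurements (a hypothetical layout chosen for this sketch).
    """
    train_pairs = []
    # Use transitions t_1->t_2, ..., t_{n-2}->t_{n-1}; the last time point is never seen.
    for i in range(len(series) - 2):
        # k1 replicates at t_i and k2 at t_{i+1} give k1 * k2 training pairs.
        for rep_now, rep_next in product(series[i], series[i + 1]):
            train_pairs.append((rep_now, rep_next))
    # Test case: predict the last time point from the penultimate one.
    held_out = (series[-2], series[-1])
    return train_pairs, held_out

# Seven time points; two replicates at the first, three at the second, one elsewhere.
series = [[1.0, 1.1], [2.0, 2.1, 2.2], [3.0], [4.0], [5.0], [6.0], [7.0]]
train_pairs, held_out = build_training_pairs(series)
print(len(train_pairs))  # 2*3 + 3*1 + 1 + 1 + 1 = 12
print(held_out)          # ([6.0], [7.0])
```

With several time series, the same function would be applied to each series and the resulting training pairs concatenated, so that no final time point ever enters training.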
The Random Forest uses bootstrap aggregation, where each new tree is trained on a bootstrap sample of the training data points. The out-of-bag error for a given training data point is estimated by computing the average difference between the actual value for that data point and the predictions of the trees that do not include it in their bootstrap sample. Bootstrap sampling is done with replacement, so each tree sees approximately 2/3 of the distinct training data points, and the remaining 1/3 is used to compute the out-of-bag score. Thus, the out-of-bag calculation is done on training data only.
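A small sketch may make the out-of-bag idea concrete. It assumes a toy stand-in for a regression tree (each "tree" simply predicts the mean of its in-bag targets) and scores each point by squared error; both choices, and the name `oob_error`, are illustrative assumptions and not the scoring used inside sklearn or OutPredict.

```python
import random

def oob_error(y, n_trees=500, seed=0):
    """Return (out-of-bag squared error, mean out-of-bag fraction per tree)."""
    rng = random.Random(seed)
    n = len(y)
    pred_sum = [0.0] * n    # sum of predictions from trees that did not see point i
    pred_count = [0] * n    # number of such trees
    oob_fraction_sum = 0.0
    for _ in range(n_trees):
        # Bootstrap sample: n draws with replacement.
        in_bag = [rng.randrange(n) for _ in range(n)]
        bag = set(in_bag)
        # Toy "tree" (assumption): predict the mean of the in-bag targets.
        tree_prediction = sum(y[i] for i in in_bag) / n
        oob = [i for i in range(n) if i not in bag]
        oob_fraction_sum += len(oob) / n
        for i in oob:
            pred_sum[i] += tree_prediction
            pred_count[i] += 1
    # Score each point only with the trees for which it was out of bag.
    sq_errors = [(y[i] - pred_sum[i] / pred_count[i]) ** 2
                 for i in range(n) if pred_count[i] > 0]
    return sum(sq_errors) / len(sq_errors), oob_fraction_sum / n_trees

y = [float(i % 7) for i in range(200)]
error, oob_fraction = oob_error(y)
print(round(oob_fraction, 2))  # close to 1/e, i.e. roughly one third of the points
```

Because every point is scored only by trees that never saw it, the resulting error behaves like a validation error computed without setting aside any training data.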
All our experiments used Random Forest ensembles of 500 trees to avoid overfitting. Pruning did not improve the out-of-bag score, so the experiments used the default pruning parameters of RandomForestRegressor in sklearn^{27}.
Incorporation of gold-standard data as priors
OutPredict uses prior data to bias the training of the Random Forest model. Specifically, each decision tree node within a tree of the Random Forest will be biased to include a transcription factor X_{1} for the model of gene g in preference to transcription factor X_{2} if the prior data indicates a relationship between X_{1} and g but none between X_{2} and g.
The gold standard for OutPredict is a matrix [Genes × TFs] containing 0s and 1s, which indicate whether we have prior knowledge about the interaction of a transcription factor (TF) and a gene. Hence, if the entry for a TF and gene g is 1, then there is a known inductive or repressive edge; if it is 0, then there is no known edge.
In order to compute prior weights from the gold standard prior knowledge, we assign a value v to all interactions equal to 1 (i.e., the True Positive interactions) and 1/v to the interactions identified by 0 (the set of values tried for v is specified in Supplementary Table S2).
During the tree construction, our Weighted Random Forest, at each node d, selects r candidate features (transcription factors) X_{1}, X_{2}, …, X_{r} according to the prior weights (Fig. 1); r is the number of features sampled at each node d, which is set to the square root of the total number of transcription factors.
The r candidate transcription factors are a subset of all transcription factors and are randomly sampled at each tree node, biased by the weights of the priors, as in iRafNet^{26}. In addition, OutPredict calculates the I(d) criterion (variance reduction × prior weight, defined below in formula (3) of the Mathematical formulation section) for every transcription factor in the selected subset at each node and branches on the transcription factor with the highest I(d).
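The prior-biased candidate selection can be sketched as below. This is an illustrative assumption about one way to do it (sequential weighted draws without replacement); the helper names `prior_weights` and `sample_candidates` are hypothetical, and the value v = 10 is only an example, not one of the tuned values.

```python
import math
import random

def prior_weights(prior_row, v):
    """Map a 0/1 gold-standard row for one gene to weights v (known edge) or 1/v."""
    return [v if known else 1.0 / v for known in prior_row]

def sample_candidates(weights, rng):
    """Draw r = floor(sqrt(#TFs)) distinct TF indices, biased by the prior weights."""
    r = max(1, math.isqrt(len(weights)))
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(r):
        # Weighted draw among the TFs not yet chosen (assumed sampling scheme).
        pick = rng.choices(remaining, weights=[weights[i] for i in remaining], k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen

# 16 TFs; the gold standard lists TF 0 and TF 1 as regulators of this gene.
prior_row = [1, 1] + [0] * 14
weights = prior_weights(prior_row, v=10.0)
rng = random.Random(0)
draws = [sample_candidates(weights, rng) for _ in range(200)]
known_hits = sum(0 in d for d in draws)     # how often a known regulator is a candidate
unknown_hits = sum(15 in d for d in draws)  # how often an unlisted TF is a candidate
print(known_hits, unknown_hits)
```

Transcription factors without a known edge can still be drawn, only less often, so the prior biases the trees without forbidding new edges.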
OutPredict incorporates steady-state (SS) data into the same Random Forest model as the time series (TS) data (an "integrated" approach, denoted as the RF_{SS+TS} model). Further, each prior dataset can be evaluated separately depending on how helpful it is for making predictions on time series. By contrast, iRafNet^{26}, for example, combines all prior datasets and weights them equally at each tree node. An equal weighting strategy may decrease overall performance when, for example, one prior dataset is less informative or is error-rich. As an aside, iRafNet can make out-of-sample predictions but only on steady-state data.
Mathematical formulation
Let X be the expression values of the set of features (in our case, transcription factors), and y_{j} be a target. We seek a function that maps X to y_{j}, either in steady-state or for time series. For steady-state data, we use all experimental conditions to infer a function y_{j} = fsteady_{j}(X) where X must not include y_{j}. That is, for each gene y_{j}, we seek a function from all other genes to y_{j}. For time series, OutPredict supports two types of models:
1. TimeStep (TS) model:
\[ y_{j}(t_{i+1}) = f_{j}(X(t_{i})) \qquad (1) \]
2. Ordinary Differential Equation natural logarithm (ODElog) model:
\[ \frac{y_{j}(t_{i+1}) - y_{j}(t_{i})}{\ln(t_{i+1} - t_{i})} + \alpha\, y_{j}(t_{i}) = f_{j}(X(t_{i})) \qquad (2) \]
where X(t_{i}) denotes the expression values of all the transcription factors at time t_{i}, y_{j}(t_{i+1}) denotes the expression of gene j at t_{i+1}, and α is the degradation term. All genes are assumed to have the same α.
OutPredict integrates steady-state (SS) data with time series (TS) data in a single Random Forest.
We have found that the ODElog model achieves a better out-of-bag score compared to just using the linear difference (t_{i+1} − t_{i}) in the denominator. This makes some intuitive sense because many phenomena in nature show a decay over time. Empirically, for example, the difference in expression value between 5 and 20 is more than 1/3 the difference between 5 and 60 in the Arabidopsis time series. Further, Supplementary Fig. S5 illustrates the absolute difference in gene expression decreasing over time for most of the species.
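To illustrate how a log-scaled time difference would enter training and forecasting, the sketch below assumes that the ODElog model has the form (y_j(t_{i+1}) − y_j(t_i)) / ln(t_{i+1} − t_i) + α·y_j(t_i) = f_j(X(t_i)). The function names and the guard on the time step are assumptions made for this example, not part of the published code.

```python
import math

def odelog_target(y_now, y_next, t_now, t_next, alpha):
    """Regression target the forest would be trained on under the assumed ODElog form."""
    dt = t_next - t_now
    if dt <= 1:
        # ln(dt) is zero or negative here; this sketch simply refuses such steps.
        raise ValueError("time step must exceed 1 in the chosen units")
    return (y_next - y_now) / math.log(dt) + alpha * y_now

def odelog_forecast(f_value, y_now, t_now, t_next, alpha):
    """Invert the same relation to turn a forest output into a predicted expression."""
    dt = t_next - t_now
    if dt <= 1:
        raise ValueError("time step must exceed 1 in the chosen units")
    return y_now + math.log(dt) * (f_value - alpha * y_now)

# Round trip on one transition from t = 5 to t = 20 with an example alpha.
target = odelog_target(y_now=2.0, y_next=3.5, t_now=5, t_next=20, alpha=0.1)
forecast = odelog_forecast(target, y_now=2.0, t_now=5, t_next=20, alpha=0.1)
print(round(forecast, 6))  # recovers 3.5
```

Under the TimeStep model the target is simply the next expression value, so no inversion step is needed.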
During training, one of the TimeStep or ODElog models is selected based on the out-of-bag score on the training data. We have found that the relative performances of the two OutPredict techniques, TimeStep and ODElog, are very data dependent, with TimeStep performing better than ODElog on B. subtilis and Drosophila, while the opposite is observed on Arabidopsis, E. coli and DREAM4 (Supplementary Table S1 shows the best model based on out-of-bag score).
In detail, during training, OutPredict determines (i) which of these two methods (ODElog or TimeStep) to use, (ii) the prior weights of the TFs, and (iii) the degradation term for the ODElog model. As far as we know, this is the first time the choice of model and the degradation parameter value have been treated as trainable hyperparameters. We show in Supplementary Table S2 the set of hyperparameter values tested for the degradation term α and for the prior weights when calculating the out-of-bag score.
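Computationally, at a given node d in a tree, OutPredict computes the product of (i) the standard Random Forest importance measure, which is defined as the total reduction of the variance of y, and (ii) the weight given by the priors. Here is the formula used for the reduction of variance^{8}, modified by the prior weighting:
\[ I(d) = w_{X_{i},y} \left( S_{num}\, \mathrm{var}_{y}(S) - S_{l_{num}}\, \mathrm{var}_{y}(S_{l}) - S_{r_{num}}\, \mathrm{var}_{y}(S_{r}) \right) \qquad (3) \]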
where d is the current decision node being evaluated, S is the subset of samples that are below decision node d in the tree, S_{l} and S_{r} are the subsets of experiments on the left and right branches of decision node d, respectively; var_{y} is the variance of the target gene in a given subset, and \({S}_{num},{S}_{{l}_{num}},{S}_{{r}_{num}}\) denote the number of training samples in each subset associated with a specific target gene. Finally, \({w}_{{X}_{i},y}\) is the prior weight from a given feature X_{i} to a given target gene y, which causes features with high prior weights to be chosen with higher probability when splitting a tree node during tree construction. Because the model for each target gene is independent, OutPredict calculates the model for the target genes in parallel.
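As a worked illustration of the quantities just defined, the following sketch computes a prior-weighted variance reduction for one candidate split. It assumes the criterion is the prior weight times (size × variance of the parent subset minus size × variance of each child subset); the function name is hypothetical.

```python
from statistics import pvariance

def weighted_variance_reduction(y_parent, y_left, y_right, prior_weight):
    """Prior weight times the drop in (subset size x variance) achieved by a split."""
    def size_times_var(values):
        # An empty branch contributes nothing; pvariance needs at least one value.
        return len(values) * pvariance(values) if values else 0.0
    reduction = (size_times_var(y_parent)
                 - size_times_var(y_left)
                 - size_times_var(y_right))
    return prior_weight * reduction

# Target-gene expression in the samples reaching a node, and a split that separates them cleanly.
y_parent = [1.0, 1.0, 5.0, 5.0]
y_left, y_right = [1.0, 1.0], [5.0, 5.0]
print(weighted_variance_reduction(y_parent, y_left, y_right, prior_weight=2.0))  # 32.0
print(weighted_variance_reduction(y_parent, y_left, y_right, prior_weight=0.5))  # 8.0
```

The same split therefore scores higher when the splitting transcription factor has a known edge to the target gene, which is how the prior tilts the choice among candidates.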
For the purpose of inferring the relative influence of transcription factors on genes and constructing a network of such potential causal edges, let T be the number of trees and D_{i} be the set of nodes that branch on transcription factor (feature) X_{i}. The overall importance score of feature X_{i} is:
\[ s_{i} = \frac{1}{T} \sum_{d \in D_{i}} I(d) \qquad (4) \]
Computationally, the importance score s_{i} of X_{i} is the sum of the variance improvements I(d) over all nodes d in D_{i} divided by the number of trees T. The resulting variable importance value s_{i} is more robust than the value obtained from any single tree because of the variance reduction resulting from averaging the score over all the trees^{8}. High importance scores identify the set of the likely most influential transcription factors for each target gene.
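The aggregation described above amounts to a few lines of bookkeeping. In this sketch the list of (feature, I(d)) records is a hypothetical representation of the split nodes collected from all trees of one gene model.

```python
from collections import defaultdict

def importance_scores(split_records, n_trees):
    """Sum I(d) over the nodes that branch on each feature and divide by the number of trees."""
    totals = defaultdict(float)
    for feature, improvement in split_records:
        totals[feature] += improvement
    return {feature: total / n_trees for feature, total in totals.items()}

# Split nodes gathered from a two-tree forest (illustrative values).
records = [("TF1", 4.0), ("TF2", 1.0), ("TF1", 2.0)]
scores = importance_scores(records, n_trees=2)
print(scores)  # {'TF1': 3.0, 'TF2': 0.5}
```

Ranking the transcription factors of each target gene by this score yields the candidate causal edges evaluated in the Results.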
Results
We measure the prediction performance of our algorithm using the Mean Squared Error (MSE) of the predictions on out-of-sample data. For each species tested, we compare the performance of the different algorithms on time series alone and on time series data with prior information.
As mentioned above, we compared our weighted Random Forest with two related works: (i) a Neural Network (NN) with a hidden layer^{13}, which is an approach developed specifically for time series gene expression prediction (results in the supplement). For this network, we perform hyperparameter optimization for the learning rate of the stochastic gradient descent optimizer and for the dropout rate; regularization is thus applied through dropout, which helps reduce overfitting. (ii) the Random Forest algorithm DynGenie3^{9}, which is an extension of Genie3^{25} that is able to handle both steady-state and time series experiments through the adaptation of the same ordinary differential equation (ODE) formulation as in the Inferelator approach^{6}. iRafNet^{26}, as noted above, does not handle time series data as the main input data.
DynGenie3 was primarily designed for gene regulatory network inference, but the authors show the performance of DynGenie3 at predicting both time series and steady-state data in the validation sets. Therefore, we evaluate DynGenie3 for predicting left-out time series data in order to compare it with OutPredict. As a baseline for all algorithms, we consider the penultimate value prediction: the expression of a gene at a given time point is predicted to be the same as the expression of that gene at the immediately previous time point. To evaluate the performance of our forecasting predictions, we compare the predicted expression values to the actual expression values for each gene (Figs. 2A, 3A) and calculate the Mean Squared Error (MSE) across all genes.
Quantitative results
We show in Figs. 2B and 3B overall bar plots for B. subtilis and Arabidopsis. Similar results hold for other species (Supplementary Figs S1, S2, S3). A table showing which method and data were used for each can be found in Table 2. Our basis of comparison is Mean Squared Error, which is a measure of the error in the predictions in which smaller values indicate more accurate predictions. Given a species, the mean squared error (MSE) is calculated as follows: given the prediction and actual value for each replicate of each gene at the last time point, first compute the squared error for each replicate. Second, take the mean to get the mean squared error for that gene. Third, compute the global mean squared error as the mean of the mean squared errors of each gene. Figures 2A and 3A show qualitatively that the actual values closely track the predicted values. OutPredict outperforms DynGenie3, Neural Nets, and penultimate value predictions over all species using these datasets.
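The three-step MSE calculation can be written out directly. The sketch below assumes predictions and actual values are stored per gene as aligned lists of replicate values at the last time point; the function name and the toy numbers are illustrative only.

```python
def global_mse(predicted, actual):
    """Mean over genes of the per-gene mean squared error across replicates."""
    per_gene = []
    for gene, truth in actual.items():
        # Step 1: squared error for each replicate of this gene.
        squared = [(p - a) ** 2 for p, a in zip(predicted[gene], truth)]
        # Step 2: mean squared error for this gene.
        per_gene.append(sum(squared) / len(squared))
    # Step 3: global MSE as the mean of the per-gene values.
    return sum(per_gene) / len(per_gene)

actual = {"g1": [2.0, 4.0], "g2": [1.0]}
model = {"g1": [2.0, 3.0], "g2": [1.5]}
# Penultimate value baseline: reuse each replicate's value at the previous time point.
penultimate = {"g1": [1.0, 3.0], "g2": [2.0]}
print(global_mse(model, actual))        # 0.375
print(global_mse(penultimate, actual))  # 1.0
```

Averaging within each gene first means that a gene with many replicates does not dominate the global score.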
In B. subtilis (Fig. 2), OutPredict performs 30% better than Penultimate Value (P < 0.05, based on a nonparametric paired test), and 50% better than Dynamic Genie3 (P < 0.05, based on a nonparametric paired test) (Fig. 2B). As OutPredict allows the incorporation of priors into the model, such as gold-standard network data, we compared the forecasting performance of OutPredict using time series and steady-state data with that of OutPredict using time series data, steady-state data and gold-standard regulated edges as priors (Supplementary Fig. S4). In these tests, the inclusion of validated gold-standard edges as priors improved predictions compared to excluding priors (Supplementary Fig. S4, 11% improvement, P < 0.05, nonparametric paired test).
The nonparametric paired test we use throughout this paper compares any two prediction methods M1 and M2 as follows: (i) format the data from the original experiment as a series of rows with one row for each gene containing the gene identifier, the M1 prediction for that gene, the M2 prediction, and the real value (call this series of rows Orig); (ii) calculate the figure of merit (for example, the squared error) for each gene and each method (e.g., the square of (M1 prediction − real value)); (iii) calculate the difference, Diff, in the average of the figure of merit (for example, the difference of the mean squared errors) of the M1 values and the M2 values; (iv) without loss of generality, assume Diff is positive; (v) randomization test: for some large number of times N (e.g., N = 10,000), starting each time with Orig, for each gene g, swap the M1 and M2 values for gene g with probability 0.5. Now recalculate the overall difference of the figure of merit for M1 and for M2 and see if that difference is greater than Diff. If so, that run is considered an exception; (vi) the p-value of Diff (and therefore of the change in the figure of merit) is the number of exceptions divided by N. When the p-value is small, the observed difference is unlikely to have happened by chance.
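Steps (i) to (vi) can be sketched as a short randomization routine. This is an illustrative reading of the procedure with squared error as the figure of merit; the function name, the seed handling, and the toy data are assumptions made for the example.

```python
import random

def paired_randomization_test(m1_pred, m2_pred, actual, n_rounds=10000, seed=0):
    """p-value for the difference in mean squared error between two methods."""
    rng = random.Random(seed)
    # Step (ii): per-gene figure of merit for each method.
    err1 = [(p - a) ** 2 for p, a in zip(m1_pred, actual)]
    err2 = [(p - a) ** 2 for p, a in zip(m2_pred, actual)]
    n = len(actual)
    # Steps (iii)-(iv): observed difference, oriented to be non-negative.
    diff = (sum(err1) - sum(err2)) / n
    if diff < 0:
        err1, err2 = err2, err1
        diff = -diff
    # Step (v): swap the two methods' values per gene with probability 0.5.
    exceptions = 0
    for _ in range(n_rounds):
        shuffled = 0.0
        for e1, e2 in zip(err1, err2):
            if rng.random() < 0.5:
                e1, e2 = e2, e1
            shuffled += e1 - e2
        if shuffled / n > diff:
            exceptions += 1
    # Step (vi): fraction of runs whose difference exceeds the observed one.
    return exceptions / n_rounds

actual = [0.0] * 30
p_clear = paired_randomization_test([1.0] * 30, [0.0] * 30, actual)

# Two methods that each win on about half of 21 genes.
m1_noisy = [1.0 if i % 2 == 0 else 0.0 for i in range(21)]
m2_noisy = [0.0 if i % 2 == 0 else 1.0 for i in range(21)]
p_noisy = paired_randomization_test(m1_noisy, m2_noisy, [0.0] * 21)
print(p_clear, p_noisy)
```

Swapping the two predictions for a gene is equivalent to swapping their squared errors, because both are compared against the same real value; the sketch uses that shortcut.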
We show in Table 2 the different models that were compared for the experimental results: each model (built with a given algorithm) is associated with a given species, a specific main input dataset and a prior dataset. Recall that, in OutPredict, the priors bias the Random Forest by adjusting the weights that determine feature inclusion.
Furthermore, we show the results using the OutPredict (OP) technique (either TimeStep or ODElog) that validation analysis found to be the best model using the out-of-bag score. We found that the weights/importance found in high-quality prior data significantly improve predictions in B. subtilis (Fig. 2B), though less so in Arabidopsis shoots (Fig. 3B). There is no improvement in E. coli, Drosophila or DREAM4 (Supplementary Figs S1, S2, S3). The precise reasons may vary: gold standard data may contain inaccurate regulatory interactions, may be incomplete, or may depend on specific experimental conditions. The DREAM4 dataset shows that prior data contribute more to out-of-sample predictions when there are few time series than when there is abundant time series data (Supplementary Fig. S8); similarly, the improvement in out-of-sample predictions from using steady-state data, relative to time series data alone, decreases as the number of time series increases (Supplementary Fig. S7).
As a test of the usefulness of OutPredict's importance scores, or measures of influence, for all the TFs on every target gene, we evaluate the OP-Priors model importances in Arabidopsis. The dataset consists of 162 TFs and 2173 targets, totaling 352,026 TF–target edges. To refine these time-based TF–target predictions, we retained the highest-confidence edges, specifically the top 2% of the edges according to the score, resulting in 7042 edges. We used 1754 validated TF–target edges from physical experiments on 11 TFs^{28,29,30,31,32,33,34,35} (the data for the 11 TFs are described in Supplementary Table S4), which is a dataset disjoint from the one used for the priors. This analysis establishes the precision (i.e., the proportion of predicted TF-target edges that are validated) and recall (i.e., the proportion of validated TF-target edges that are predicted) of the OutPredict top 2% edges for the 11 validated TFs. The results showed that precision and recall for the TF–target predictions in the top 2% edges were 0.246 (76/309) and 0.043 (76/1754), respectively. Both were significantly greater than the mean for 1000 random samples of 309 edges of these 11 TFs (random precision mean ≈0.161 and random recall mean ≈0.028) (Table 3). Moreover, the precision of OP-Priors for the top 2% outperforms OP-TSonly (precision = 0.226) and DynGenie3 (precision = 0.158). We further compared the performance of the OP-Priors model importances with OP-TSonly and DynGenie3, and computed the Area under Precision-Recall (AUPR) using the 1754 validated TF–target edges from physical experiments on 11 TFs in Arabidopsis. The AUPR of OutPredict with priors (OP-Priors) is 15% better than random (p-value < 0.01, nonparametric paired test), the AUPR of OutPredict without priors (OP-TSonly) is 7.5% better than random (p-value < 0.01, nonparametric paired test), while DynGenie3 is no better than random (Fig. 4). In the supplement (Supplementary Fig. S9), we show that similar results hold for the DREAM4 synthetic dataset (where causal edges are known). This shows the promise of using prediction to infer influence and suggests that good out-of-sample prediction leads to good causality models.
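The precision and recall figures quoted above follow from simple set arithmetic, sketched below. The helper names and the four-edge toy network are hypothetical; only the counts 76, 309 and 1754 come from the text.

```python
def top_fraction(scored_edges, fraction):
    """Return the highest-scoring fraction of edges from a {(tf, gene): score} dict."""
    ranked = sorted(scored_edges, key=scored_edges.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]

def precision_recall(predicted_edges, validated_edges):
    """Precision = hits / predicted edges; recall = hits / validated edges."""
    predicted, validated = set(predicted_edges), set(validated_edges)
    hits = len(predicted & validated)
    return hits / len(predicted), hits / len(validated)

# Toy network: keep the top half of four scored edges and compare with two validated edges.
scores = {("TF1", "g1"): 0.9, ("TF1", "g2"): 0.8, ("TF2", "g1"): 0.1, ("TF2", "g3"): 0.05}
validated = {("TF1", "g1"), ("TF2", "g3")}
precision, recall = precision_recall(top_fraction(scores, 0.5), validated)
print(precision, recall)  # 0.5 0.5

# The reported Arabidopsis values are the same ratios on the real counts.
print(round(76 / 309, 3), round(76 / 1754, 3))  # 0.246 0.043
```

A random baseline of the kind reported in Table 3 could be obtained by repeatedly drawing the same number of edges at random and averaging the two ratios.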
Discussion
OutPredict is a nonlinear machine learning method based on an ensemble of regression trees for time series forecasting. It can incorporate steady-state data, temporal data and prior knowledge, as well as a variety of differential equation models for this purpose. OutPredict both predicts the future states of a given organism and gives a quantitative measure of the importance of a given transcription factor on a target gene.
There are four reasons for the relative success of OutPredict compared to other methods: (i) the use of Random Forests, which provide a nonlinear model (in contrast to regression models) that requires little data (in contrast to neural net approaches), (ii) the incorporation of prior information such as gold standard network data (in contrast to DynGenie3), (iii) the adjustment of weights of predictors (in contrast to all other time series based methods), and (iv) the selection during training of the optimal technique between the TimeStep and our ODElog model, which includes a degradation term that is also tuned (in contrast to all other methods).
In summary, OutPredict achieves high prediction accuracy and significantly outperforms baseline and state-of-the-art methods on datasets from four different species and the in silico DREAM data as measured by mean squared error. Further, as a proof of concept, we have seen that the high-importance edges correspond to individually validated regulation events much more often than expected by chance in both Arabidopsis and DREAM. The code is open source and is available at https://github.com/jacirrone/OutPredict (https://doi.org/10.5281/zenodo.3611488).
Change history
19 August 2020
An amendment to this paper has been published and can be accessed via a link at the top of the paper.
References
Marbach, D. et al. Wisdom of crowds for robust gene network inference. Nature Methods (2012).
Chai, L. E. et al. A review on the computational approaches for gene regulatory network construction. Computers in Biology and Medicine 48, 55–65 (2014).
Novere, N. L. Quantitative and logic modelling of molecular and gene networks. Nature Reviews Genetics 16, 146–158 (2015).
Delgado, F. M. & Gómez-Vela, F. Computational methods for gene regulatory networks reconstruction and analysis: A review. Artificial Intelligence in Medicine 95 (2019).
Gitter, A. et al. Backup in gene regulatory networks explains differences between binding and knockout results. Molecular Systems Biology (2009).
Greenfield, A., Hafemeister, C. & Bonneau, R. Robust data-driven incorporation of prior knowledge into the inference of dynamic regulatory networks. Bioinformatics (2013).
Slattery, M. et al. Absence of a simple code: how transcription factors read the genome. Trends in Biochemical Sciences 39(9), 381–399 (2014).
Breiman, L. Classification and regression trees. Chapman & Hall CRC (1984).
Huynh-Thu, V. A. & Geurts, P. dynGENIE3: dynamical GENIE3 for the inference of gene networks from time series expression data. Scientific Reports (2018).
Mirowski, P. & LeCun, Y. Dynamic factor graphs for time series modeling. Machine Learning and Knowledge Discovery in Databases, Pt II 5782, 128–43 (2009).
Brooks, M. D. et al. Network walking charts transcriptional pathways for dynamic nitrogen signaling using validated and predicted genome-wide interactions. Nature Communications (2019).
Varala, K. et al. Temporal transcriptional logic of dynamic regulatory networks underlying nitrogen signaling and use in plants. Proceedings of the National Academy of Sciences (PNAS) (2018).
Smith, M. R., Clement, M., Martinez, T. & Snell, Q. Time series gene expression prediction using neural networks with hidden layers. BIOT (2010).
Christopher, P. & David, W. How to infer gene networks from expression profiles. Interface Focus (2011).
Zou, C. & Feng, J. Granger causality vs. dynamic Bayesian network inference: a comparative study. BMC Bioinformatics (2009).
Maziarz, M. A review of the Granger-causality fallacy. The Journal of Philosophical Economics: Reflections on Economic and Social Issues VIII (2015).
Nicolas, P. et al. Condition-dependent transcriptome reveals high-level regulatory architecture in Bacillus subtilis. Science (2012).
Michna, R., Commichau, F., Todter, D., Zschiedrich, C. & Stulke, J. SubtiWiki-a database for the model organism Bacillus subtilis that links pathway, interaction and expression information. Nucleic Acids Research 42, D692–D698 (2014).
Arrieta-Ortiz, M. L. et al. An experimentally supported model of the Bacillus subtilis global transcriptional regulatory network. Molecular Systems Biology (2015).
Jozefczuk, S. et al. Metabolomic and transcriptomic stress response of Escherichia coli. Molecular Systems Biology (2010).
Salgado, H. et al. RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more. Nucleic Acids Research 41, D203–D213 (2013).
Hooper, S. D. et al. Identification of tightly regulated groups of genes during Drosophila melanogaster embryogenesis. Molecular Systems Biology (2007).
Murali, T. et al. DroID 2011: a comprehensive, integrated resource for protein, transcription factor, RNA and gene interactions for Drosophila. Nucleic Acids Research (2011).
Greenfield, A., Madar, A., Ostrer, H. & Bonneau, R. DREAM4: Combining genetic and dynamic information to identify biological networks and dynamical models. Edited by Mark Isalan. PLoS ONE 5 (10). Public Library of Science (PLoS): e13397 (2010).
Huynh-Thu, V. A., Irrthum, A., Wehenkel, L. & Geurts, P. Inferring regulatory networks from expression data using tree-based methods. Edited by Mark Isalan. PLoS ONE 5 (9). Public Library of Science (PLoS): e12776 (2010).
Petralia, F., Wang, P., Yang, J. & Tu, Z. Integrative random forest for gene regulatory network inference. Bioinformatics 31 (12). Oxford University Press (OUP) (2015).
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011).
Rubin, G., Tohge, T., Matsuda, F., Saito, K. & Scheible, W.-R. Members of the LBD family of transcription factors repress anthocyanin synthesis and affect additional nitrogen responses in Arabidopsis. Plant Cell (2009).
Bastakis, E., Hedtke, B., Klermund, C., Grimm, B. & Schwechheimer, C. LLM-domain B-GATA transcription factors play multifaceted roles in controlling greening in Arabidopsis. Plant Cell (2018).
Behringer, C., Bastakis, E., Ranftl, Q., Mayer, K. & Schwechheimer, C. Functional diversification within the family of B-GATA transcription factors through the leucine-leucine-methionine domain. Plant Physiology (2014).
Luo, X. et al. Integration of light- and brassinosteroid-signaling pathways by a GATA transcription factor in Arabidopsis. Developmental Cell (2010).
Fan, M. et al. The bHLH transcription factor HBI1 mediates the trade-off between growth and pathogen-associated molecular pattern-triggered immunity in Arabidopsis. Plant Cell (2014).
Marchive, C. et al. Nuclear retention of the transcription factor NLP7 orchestrates the early response to nitrate in plants. Nature Communications (2013).
Gregis, V. et al. Identification of pathways directly regulated by SHORT VEGETATIVE PHASE during vegetative and reproductive development in Arabidopsis. Genome Biology (2013).
Bustos, R. et al. A central regulatory system largely controls transcriptional activation and repression responses to phosphate starvation in Arabidopsis. PLoS Genetics (2010).
Acknowledgements
The authors gratefully acknowledge funding from the following sources: NIH NIGMS Grant GM032877 to G.M.C. and D.E.S., NSF-PGRP IOS-1339362 to G.M.C. and D.E.S., an NIH NIGMS Fellowship 1F32GM116347 to M.D.B., and a Plant Genomics Grant from the Zegar Family Foundation (A160051).
Author information
Contributions
J.C., M.D.B., R.B., G.M.C., and D.E.S. designed research, conceived the experiments and reviewed the manuscript. J.C. and M.D.B. analyzed the data. J.C. contributed new analytical tools and performed the experiments.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Cirrone, J., Brooks, M.D., Bonneau, R. et al. OutPredict: multiple datasets can improve prediction of expression and inference of causality. Sci Rep 10, 6804 (2020). https://doi.org/10.1038/s41598-020-63347-3