Abstract
The gene regulatory network (GRN) of a cell executes genetic programs in response to environmental and internal cues. Two distinct classes of methods are used to infer regulatory interactions from gene expression: those that use only the observed changes in gene expression, and those that use both the observed changes and the perturbation design, i.e. the targets used to cause the changes in gene expression. Considering that the GRN by definition converts input cues to changes in gene expression, it may be conjectured that the latter methods would yield more accurate inferences, but this has not previously been investigated. To address this question, we evaluated a number of popular GRN inference methods that either do or do not use the perturbation design. For the evaluation we used targeted perturbation knockdown gene expression datasets with varying noise levels generated by two different packages, GeneNetWeaver and GeneSPIDER. The accuracy was evaluated on each dataset using a variety of measures. The results show that on all datasets, methods using the perturbation design matrix consistently and significantly outperform methods not using it. This was also found to be the case on a smaller experimental dataset from E. coli. Targeted gene perturbations combined with inference methods that use the perturbation design are indispensable for accurate GRN inference.
Introduction
Accurate identification of gene interactions that regulate biochemical mechanisms in a living organism can help identify physiological and pathological mechanisms and enable researchers to e.g. understand the cause of genetic diseases. Prediction of these gene regulatory interactions can be performed from gene expression data via gene regulatory network inference methods, which differ among each other in terms of their mathematical models. The accuracy of one inference method may fluctuate based on the properties of the dataset, e.g. noise levels^{1,2}. Several benchmark studies have been published from the Dialogue on Reverse Engineering Assessment and Methods (DREAM) network inference challenges^{3,4,5}, where different sources of networks and data were used in each challenge, and performance comparisons were made for different methods. These benchmarks, especially the fifth round of DREAM^{5}, provided a broad selection of GRN inference methods; however, they did not assess the performance of these methods at different data properties such as noise levels. A benchmark of ten GRN inference methods by Bellot et al.^{6} included an analysis of the impact of noise, but did not consider link direction, and all measured accuracies are very low, which reduces its usefulness. Another benchmark by Pirgazi et al.^{7} provided acceptable accuracy levels, but here the noise levels were not varied enough to draw noise-related conclusions. Other smaller benchmarks are also found in the publications of new methods^{8,9,10}, but there both the selection of benchmarked methods and data properties are very limited. Taken together, all these benchmarks include a large number of inference methods, yet fail to provide the community with clear guidance on the strengths and weaknesses of the surveyed methods on data with different properties, which is useful for identifying the most suitable method for a particular dataset.
In addition to the mentioned shortcomings of the current benchmarks, one key aspect that has not previously been examined is the importance of knowing and using the experimental perturbation design, which only some GRN inference methods are capable of. To investigate this, we benchmarked several GRN inference methods based on diverse mathematical models, divided into two categories based on whether they use knowledge of the perturbation design (P-based methods, where P refers to the perturbation design matrix) or not (non-P-based methods). Perturbation in GRN inference, and gene expression studies in general, can take many forms such as overexpression using plasmids in yeast^{11} or knockdown experiments using RNAi^{12}. Regardless of how the perturbation is performed, methods that use the knowledge of the perturbation design can use this information in different ways, either as part of the system model, as prior information, or to filter data, in order to build a GRN. P-based methods, by mapping the perturbations to measured gene expression, can identify the causality behind the gene regulation^{13,14,15,16}, a crucial aspect in GRN inference when the ultimate goal is to identify genetic mechanisms and propose possible therapies. In contrast, most methods that do not utilize the perturbation design are limited to finding associations between genes. We applied methods that either do or do not use the perturbation design to in silico datasets generated using GeneNetWeaver^{17} and GeneSPIDER^{2} with varying noise levels. Inferred GRNs were compared to their gold standards, and their accuracy was evaluated in terms of several metrics (Fig. 1).
The results show that P-based methods are significantly more accurate than non-P-based ones, and that only P-based methods were able to reach near-perfect inference accuracy.
Results
We applied five P-based and five non-P-based GRN inference methods to 100- and 250-gene synthetic data with three levels of Gaussian noise: high, medium, and low (Eq. (1) in “Methods”). The high noise level corresponds roughly to the noise level of biological datasets, the medium level can be achieved following a successful preprocessing approach^{18}, and at the low noise level the minimum signal is equivalent to the noise, meaning that it is relatively easy to reconstruct the underlying system. For each noise level, we measured the inference accuracy across all error levels as the area under the precision-recall (AUPR) curve. To support the validity of the drawn conclusions we also calculated the area under the receiver operating characteristic (AUROC) curve, F1-score, and Matthews correlation coefficient (MCC) for the 100-gene datasets.
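To illustrate how such accuracy measures are computed, the sketch below scores an inferred weight matrix against a binary gold-standard adjacency matrix by ranking all possible links by absolute weight and sweeping the cut-off. It uses the average-precision form of AUPR; this is a minimal stand-in, not the evaluation code used in the benchmark, and the toy matrices are invented for illustration.

```python
import numpy as np

def aupr(weights, gold):
    """Area under the precision-recall curve (average-precision form):
    flatten both matrices, rank links by |weight|, sweep the threshold."""
    y_true = (gold != 0).astype(int).ravel()
    order = np.argsort(-np.abs(weights).ravel())  # strongest links first
    hits = y_true[order]
    tp = np.cumsum(hits)                          # true positives at each cut-off
    precision = tp / np.arange(1, len(hits) + 1)
    # Average precision: precision summed at each newly recovered true link.
    return float(np.sum(precision * hits) / hits.sum())

# Toy example: 3-gene network with two true links, both ranked on top.
gold = np.array([[0, 1, 0],
                 [0, 0, 1],
                 [0, 0, 0]])
weights = np.array([[0.0, 0.9, 0.1],
                    [0.2, 0.0, 0.8],
                    [0.1, 0.3, 0.0]])
score = aupr(weights, gold)  # 1.0: every true link outranks every false one
```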
Utilizing the perturbation design leads to more accurate GRN inference
A general noise-related trend was observed in GRN inference accuracy: AUPR levels increased with decreasing noise, especially from the ‘high’ to the ‘medium’ noise level (Fig. 2; Suppl. Table S2; Suppl. Fig. S12). The increase in AUPR levels was larger for the GeneSPIDER datasets than for GeneNetWeaver, and was significant in both cases (p < 0.05). The transition from noise level ‘medium’ to ‘low’ was still significant for the P-based methods, but not for the non-P-based methods (p > 0.05). At all noise levels a significant difference in accuracy is observed between the P-based and non-P-based methods, where the former outperformed the latter without exception (Suppl. Table S1).
At the high noise level, Z-score^{19} was the most accurate of all methods, followed by other P-based methods, both on data generated by GeneNetWeaver and GeneSPIDER. All non-P-based methods performed poorly at this noise level, and there was no clear winner among them. The increase in accuracy when going from ‘high’ to ‘medium’ noise is more noticeable for the GeneSPIDER data than for the GeneNetWeaver data. A smaller increase in AUPR levels was observed when the noise level decreased from ‘medium’ to ‘low’, but this is still statistically significant for the P-based methods, and resulted in some of the P-based methods achieving a perfect AUPR on the GeneSPIDER data. For ‘medium’ and ‘low’ noise levels, GENIE3^{8} was the top performer among the non-P-based methods, closely followed by BC3NET^{20}, yet they were without exception outperformed by the least accurate of the P-based methods. PLSNET^{10} and CLR^{21} were the least accurate of all methods across all datasets.
We also calculated the maximum F1-scores (Suppl. Fig. S5) and MCC levels (Suppl. Fig. S6) on these 100-gene datasets to support the validity of our hypothesis. The same trend as for AUPR was observed with both these alternate measures, with P-based methods always outperforming the non-P-based ones.
Correct knowledge of the perturbation design is crucial for accurate GRN inference
To further investigate the effect of the information stored in the design matrix on the inference accuracy of the P-based methods, we randomly displaced every perturbation in the perturbation design matrix, and applied the P-based methods to these 100-gene datasets where the connection between gene expression and its perturbation design is broken. The results showed that, regardless of any decrease in noise levels, the performance from the incorrect perturbation design remained around the random line in terms of AUPR (Fig. 3, Suppl. Table S1). This occurs because the P-based methods are built on the assumption that the input P matrix represents the actual perturbations, and they can reach almost perfect accuracy thanks to utilizing the correct perturbation design.
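The displacement can be sketched as a permutation of the gene labels in the design matrix. The derangement below is one concrete way to guarantee that every perturbation ends up on the wrong gene; the paper only states that perturbations were randomly displaced, so the exact shuffling scheme here is an assumption.

```python
import numpy as np

n_genes, n_reps = 5, 3

# Correct design: experiment block j knocks down gene j (three replicates each).
P = np.kron(np.eye(n_genes), np.ones((1, n_reps))) * -1   # genes x experiments

# "Displace" every perturbation by permuting the gene labels (rows) with a
# derangement, so that no experiment keeps its true target.
perm = np.roll(np.arange(n_genes), 1)   # shift-by-one derangement
P_wrong = P[perm, :]

# Sanity check: no perturbation entry survives in its original position.
assert not np.any((P != 0) & (P_wrong != 0))
```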
Stratification of P-based and non-P-based methods
We have shown examples of how P-based methods outperform non-P-based methods under changing noise levels, and that the correct perturbation design is crucial for the P-based methods to be able to perform accurately. To provide an overview of the benchmark results, we plotted the AUPR scores versus the AUROC scores for the 100-gene datasets at all noise levels (Fig. 4). This highlights the separation between P-based and non-P-based methods, and shows that only P-based methods, when provided with the correct knowledge of the perturbation design, can achieve near-perfect levels of GRN inference accuracy in terms of both AUPR and AUROC, whereas the accuracy of non-P-based methods remains limited (AUPR < 0.6) even at low noise levels (Fig. 4).
Similarity between methods
In order to investigate the similarity between the methods in terms of predicted interactions, we used the benchmarked results on the 100-gene synthetic data and measured the Jaccard index of the predicted edges in the maximum-F1-score GRNs for each method pair, averaged for all datasets with the same properties (Fig. 5). The results show that the GRNs from P-based and non-P-based methods cluster within the categories, where the P-based methods have an average Jaccard index at all noise levels of 0.75 for GeneNetWeaver data and 0.72 for GeneSPIDER data, while the non-P-based methods are less similar at on average 0.33 for GeneNetWeaver data and 0.30 for GeneSPIDER data. The Jaccard indices between categories were statistically significantly different at all noise levels (Suppl. Table S3). The similarity between P-based and non-P-based methods is very low at high noise (0.09 and 0.02 on average for the GeneNetWeaver and GeneSPIDER data, respectively) but increases for the lower noise levels.
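The pairwise similarity measure reduces to the Jaccard index of two edge sets, where each directed edge is a (regulator, target) pair; a minimal sketch with invented edge lists:

```python
def jaccard(edges_a, edges_b):
    """Jaccard index of two predicted edge sets: |intersection| / |union|.
    Edges are directed (regulator, target) pairs."""
    a, b = set(edges_a), set(edges_b)
    if not a and not b:
        return 1.0          # two empty predictions agree trivially
    return len(a & b) / len(a | b)

# Two methods agreeing on 2 of 4 distinct predicted links:
method_1 = [(0, 1), (1, 2), (2, 0)]
method_2 = [(0, 1), (1, 2), (0, 2)]
similarity = jaccard(method_1, method_2)  # 2 shared / 4 total = 0.5
```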
We also calculated the fraction of true interactions for each method pair, i.e. the portion of the overlap between two methods found in the true GRN (Suppl. Fig. S9). One minus this fraction accordingly gives the agreement on false positives. The overall trend followed the trend in method prediction overlap (Fig. 5), where the fraction of true edges is higher for the overlap between the P-based methods than between the non-P-based ones. The fraction of true edges remained considerably lower for the 100-gene GeneNetWeaver datasets across all noise levels than for the 100-gene GeneSPIDER datasets. For the GeneNetWeaver datasets at the high noise level, the highest true fraction was observed for the overlap between the non-P-based methods, but this was not the case for the other noise levels. For the GeneSPIDER datasets, the true fraction was always considerably higher for the overlap between the P-based methods, and showed an increasing trend with decreasing noise.
Speed benchmark
Among all benchmarked methods, Z-score and CLR were the fastest in CPU time, but Z-score was faster in real time, followed by BC3NET in both CPU and real time. TIGRESS^{21} was by far the slowest, followed by PLSNET and GENIE3, in CPU time. In real time, however, GENIE3 was the slowest because TIGRESS and PLSNET use MATLAB’s parallelization (Fig. 6, Suppl. Table S5).
The effect of self-loops on GRN inference accuracy
A fundamental difference that we observed between the benchmarked P-based and non-P-based methods is that the former infer self-loops while the latter do not. Self-loops represent the rate of gene transcript degradation, which is an important parameter for the system’s stability, making them an essential part of the true network. To make a fair assessment of both method categories, we included the self-loops in the true networks for the P-based methods, and removed them from the true network for the non-P-based methods. As GeneNetWeaver networks do not contain all self-loops, the missing ones were added for the P-based methods. This procedure prevents P-based methods from suffering from false positives, and non-P-based methods from sacrificing accuracy due to false negatives. Since the main assessment of GRN inference accuracy is made in terms of AUPR, both errors have an equal effect on the inference accuracy. In order to investigate the specific effect of self-loops on GRN accuracy, we removed them both from the true network and the inferred GRNs of the 100-gene datasets for all methods, including the P-based ones, and observed that even though self-loops are responsible for a large part of the accuracy under high noise levels, they do not have any effect on accuracy at the medium and low noise levels (Suppl. Fig. S10). One exception is the Z-score method, whose accuracy is not strongly affected by the treatment of self-loops at any noise level.
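The self-loop handling described above amounts to editing the diagonal of an adjacency matrix before scoring; a minimal sketch, where the −1 degradation weight used to complete missing self-loops is an illustrative placeholder, not a value from the paper:

```python
import numpy as np

def strip_self_loops(net):
    """Zero the diagonal: used for the non-P-based comparisons and for the
    self-loop sensitivity analysis."""
    out = net.copy()
    np.fill_diagonal(out, 0)
    return out

def complete_self_loops(net, weight=-1.0):
    """Add a self-loop (degradation term) wherever one is missing, as when
    completing the GeneNetWeaver gold standards for the P-based methods."""
    out = net.copy()
    diag = np.diag(out)
    np.fill_diagonal(out, np.where(diag == 0, weight, diag))
    return out
```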
Benchmarking on biological data from DREAM5
The fifth round of DREAM^{5} includes a GRN inference challenge based on perturbation-induced expression data for E. coli, and RegulonDB as the gold standard GRN. We identified a subset of this dataset (network 3)^{5} with known-target knockout or overexpression perturbations, i.e. with a known P matrix, and performed GRN inference by all methods that were run on the synthetic data above. Note that most methods that participated in DREAM5 performed very poorly in this challenge, as no method scored an AUPR above 0.15. However, on the known-target subset, the P-based methods were able to achieve considerably higher AUPR levels, ranging between 0.30 and 0.38 (Suppl. Fig. S11). This was not the case for the non-P-based methods, which only reached AUPR levels between 0.01 and 0.03, which is even lower than in the DREAM5 challenge. For instance, GENIE3 achieved an AUPR of ~ 0.10 on the full challenge dataset^{5} but on the selected subset only 0.03. Taken together, this agrees with the previous results suggesting that knowing and using the P matrix can augment GRN inference accuracy considerably.
Discussion
Previous benchmarks of GRN inference methods have provided a broad perspective of the accuracy that is achievable with different mathematical approaches on different datasets. However, a very important aspect has so far been missed, namely the type of information that the methods utilize. This study for the first time assesses the importance of using the perturbation design for accurate GRN inference. All conclusions in this study are based on the presented results, which were obtained by using data that inherently represent the targeted perturbation steady-state condition (knockdown), and inference methods developed for this type of data.
As expected, a lower noise level in the data generally led to higher accuracy. For all noise levels, there was however a clear separation between P-based and non-P-based methods in that the former consistently outperformed the latter. Furthermore, P-based methods were able to achieve almost perfect accuracy, whereas the accuracy of the non-P-based methods remained below an AUPR of 0.6. P-based methods have previously been shown to continuously improve with decreasing noise and achieve perfect accuracy under good data conditions^{1,2}, supporting the present results. This suggests that, if the informativeness of real data can be improved by reducing the noise level of the system from high to at least medium via for instance preprocessing approaches^{18,22}, close to perfect accuracy can be achieved by P-based methods that use the knowledge of the perturbation design. As a result, accurate and reliable prediction of gene regulatory interactions could be performed to identify novel regulatory mechanisms and treatment targets, which would not be possible with the non-P-based methods, whose accuracy is considerably lower due to not utilizing the essential knowledge of the perturbation design.
We tested the effect of correct perturbation information on the performance of the P-based methods by misplacing every perturbation in the design matrix, breaking the connection between the gene expression and its design, which resulted in the same accuracy as random predictions. This shows that the P-based methods are very powerful with correct and complete design information, but also potentially vulnerable to errors in the design matrix, which could have either experimental or data processing causes. The value of using the design matrix was previously shown in the DREAM5 network inference challenge^{5}, where inference methods using it were better able to predict the targets of transcription factors. Despite utilizing the perturbation design, these methods were not able to achieve high accuracy levels on the DREAM5 data because there only transcription factors, a small fraction of all genes, were perturbed. This is different from the data in this benchmark, where a majority of the genes in the system are perturbed, which makes highly accurate GRN reconstruction possible unless the noise level is too high. The effect of an imperfect P matrix has previously been explored^{22}, where it was shown that the connection between the intended perturbation design and measured gene expression may be broken due to high noise levels or off-target effects of perturbations, leading to an incorrect mapping of the P matrix to gene expression and lower accuracy. P-based methods using a P matrix that was inferred from the measured gene expression were shown to perform better than when the intended P was used, providing additional support for the importance of utilizing the correct P that suits the data.
A drawback of P-based methods is that they cannot be applied to data without a targeted perturbation design matrix, which limits their application. Non-P-based methods, on the contrary, can be applied to both targeted and untargeted perturbation data, which increases their generality but comes at a great sacrifice of accuracy. This can be seen as a trade-off between generality and accuracy, where each method category sacrifices one or the other.
Given the low GRN accuracies of inference methods for data from both GeneNetWeaver and GeneSPIDER, especially at high noise levels, it is clear that only a portion of the true regulatory interactions was captured by each method. The Jaccard index of the overlap between the interactions inferred by method pairs shows that the agreement between the non-P-based methods is considerably lower compared to the P-based methods. Combining these two results, one possible reason for this discrepancy is that the different mathematical backgrounds of the non-P-based methods may capture different patterns in the data, resulting in low overlap in the top inferred interactions. In contrast, P-based methods not only share a larger overlap but also a higher fraction of true interactions compared to the non-P-based methods, at least for GeneSPIDER data. This is most likely due to the fact that penalized regression methods such as LASSO, ElasticNet and Ridge regression are all related to least squares and therefore are more likely to capture the same patterns in the data. It is noteworthy that the predictions of P-based methods can be combined due to the large fraction of true interactions in their overlaps, and in the absence of a true GRN, the intersection of these methods is likely to provide the best GRN that can be obtained.
Autoregulation is of crucial importance for a system’s stability. Therefore, all synthetic true networks we generated contain self-loops to effectuate autoregulation, and the data generated from these networks are partly based on self-loops. The two categories of inference methods however follow different approaches regarding the self-loops, as the P-based methods tend to always infer these, while the non-P-based methods do not. Not including the self-loops in the true GRN would result in false positives for the P-based methods, and including them would result in false negatives for the non-P-based methods. Our solution is to treat both categories fairly by keeping all self-loops in the true GRN for the P-based methods, and removing them from the true GRN when comparing to the inferred GRNs of the non-P-based methods. Since the accuracy assessment was mainly done by AUPR, this self-loop treatment has the same effect in both situations, as false positives and false negatives carry equal weight in AUPR. We also explored an alternative approach of removing the self-loops from both the true GRN and the inferred GRNs of the P-based methods. This had little effect at the low and medium noise levels, but at high noise most of the accuracy was lost compared to the standard benchmarking method, for all methods except Z-score, which only lost about half the AUPR.
Without considering self-loops, Z-score stands out as much more accurate than all other methods at high noise levels. Also when considering self-loops, the most reliable method for high noise was Z-score. Previous studies have shown similar results where the Z-score approach is among the top performing methods despite its simplicity^{17,19,23,24}. Although it is the winner at the most difficult noise level, Z-score is outperformed by the other P-based methods at medium and low noise on the GeneSPIDER data. It can thus be either the best or the worst of the P-based methods depending on the conditions.
We observed a surprising behavior in the direction of interactions inferred by GENIE3 in the benchmark, where the reverse edge direction was much more accurate than the original direction (Suppl. Note 1). This was however not the case for the results from the DREAM4 in silico multifactorial network inference challenge. Our analysis suggests that the reverse direction generally gets the highest weight between regulators, which results in a much higher accuracy when the true GRN has a large fraction of regulators, as in this benchmark (Suppl. Fig. S14). According to the GENIE3 authors, it yields the wrong link direction when applied to single gene knockdown data, which is used here (P. Geurts, personal communication). Note that even when reversing the direction of the GENIE3 GRNs to optimize its performance, it could not compete with the worst performing P-based method on any dataset generated for this benchmark. As the effect of the link direction depends on the data, we encourage other researchers to investigate whether they can verify our findings with their own data.
A large portion of GRN inference methods infer directed interactions, where an edge is drawn from a regulator gene to its target(s). Some of these methods also assign a sign to these interactions to indicate activation or inhibition. Some methods, however, only infer undirected interactions, where the source and the target genes are unknown, and the networks are symmetric. The accuracy calculation in this study was performed by considering the direction of the interactions but not their sign. The direction of an interaction is of crucial importance in a GRN, but two of the benchmarked methods in this study, CLR and BC3NET, infer the interactions without direction, i.e. all inferred links are in both directions. While this will degrade the performance of undirected methods in a directed benchmark, it is not obvious how to avoid it. In a real situation one might derive a direction from regulators to targets, but providing this extra information in a benchmark would give these methods an unfair advantage. Another aspect of the present benchmarking method is that ignoring the sign of the interactions could potentially result in a GRN being rated perfectly accurate even though all the links have the wrong sign. However, given that the sign of the interaction is not inferred by any of the non-P-based methods, we did not include the sign of the predicted links in the benchmark, even though this gives such methods an advantage.
In this study we performed a broad benchmark built on the novel idea of comparing GRN inference methods by the category of their mathematical model, and showed that knowledge of the perturbation design is essential for accurate GRN inference: the methods utilizing this information are significantly more accurate and reliable than the ones not utilizing it. Given that one of the main aims of GRN inference is to reveal unknown mechanisms that may help in understanding and treating genetic diseases, this demonstration of the positive effect of using the perturbation design on GRN accuracy may lead to a significant change in the field, both in terms of the biological data generated and the methods developed. Based on our results, we therefore strongly recommend experimentalists to perform targeted perturbation experiments, and computational systems biologists to use and develop perturbation-based methods for more accurate and reliable GRN inference, especially when the ultimate goal is to infer novel regulatory interactions as treatment targets, where false predictions would lead to wasted efforts.
Materials and methods
Networks and datasets
We generated synthetic networks and datasets via GeneNetWeaver and GeneSPIDER for benchmarking the gene regulatory network inference methods. We also extracted a known-target perturbation subset from the Escherichia coli dataset (network 3) of the DREAM5^{5} network inference challenge.
In silico true network generation via GeneNetWeaver
Five subnetworks of 100 genes each were extracted from the complete E. coli network. All genes were requested to be regulators, but GeneNetWeaver does not assign the exact requested number, resulting in a varying number of regulators per subnetwork. The vertices were drawn randomly with the “greedy” edge selection. The true sparsity of the 5 GRNs, without self-loops, ranges between 1.48 and 1.95 links per gene.
In silico true network generation via GeneSPIDER
Five synthetic networks were generated with scale-free topology and directed, signed edges. Each gene is allowed to be a regulator, and on average three links per node were assigned. The true sparsity of the 5 GRNs, without self-loops, ranges between 2.22 and 2.38 links per gene.
In silico perturbation design
To be able to observe the regulatory effect of a gene on one or more others, it is important to introduce alterations to the system. These alterations are called “perturbations”, which can be applied to all genes in the system or target specific genes, one at a time. Some GRN inference methods can infer GRNs from measurements of both types of perturbations (called non-P-based in this study), while some methods require known targeted perturbations (P-based). To investigate the importance of the knowledge of targeted perturbations in GRN inference, we generated single-target perturbation matrices with three replicates per perturbation experiment, to be later used as the input cue to the true regulatory system when generating perturbation-based data from it. The perturbation information is stored in a binary N-by-M matrix, where N refers to genes and M to experiments, assigning − 1 (for knockdown) to all perturbations and 0 to all other cells. This matrix is throughout the paper referred to as the P matrix.
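The construction described above can be sketched in a few lines; the layout below (replicates of the same knockdown in consecutive columns) is an assumption about the column ordering, which the text does not specify.

```python
import numpy as np

def knockdown_design(n_genes, n_reps=3):
    """Binary N-by-M perturbation design matrix for single-gene knockdowns:
    -1 where the gene (row) is the intended target of the experiment
    (column), 0 elsewhere; each knockdown is repeated n_reps times."""
    P = np.zeros((n_genes, n_genes * n_reps))
    for gene in range(n_genes):
        P[gene, gene * n_reps:(gene + 1) * n_reps] = -1
    return P

P = knockdown_design(100)   # 100 x 300, matching the datasets in this study
```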
In silico noisefree data generation via GeneNetWeaver
For each of the five subnetworks, a noise-free gene expression dataset of steady-state knockdown perturbations was generated from ordinary differential equations. No normalization was performed, and noise-free fold changes were calculated by the log_{2} ratio between the gene expression and its wild type value. The fold change matrix was transposed and replicated three times, to simulate a perturbation experiment with three replicates. The resulting noise-free fold change gene expression matrix is therefore of size 100 × 300 (genes × experiments).
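The fold-change and replication step can be sketched as follows; the expression and wild-type values below are random placeholders standing in for GeneNetWeaver's steady-state output (rows assumed to be experiments, columns genes, before the transpose).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Placeholder steady states: one knockdown experiment per row, genes in columns.
expression = rng.uniform(0.5, 2.0, size=(n, n))
wild_type = rng.uniform(0.5, 2.0, size=n)     # unperturbed expression per gene

fold_change = np.log2(expression / wild_type)  # noise-free log2 fold changes
X = np.tile(fold_change.T, (1, 3))             # transpose, replicate -> 100 x 300
```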
In silico noisefree data generation via GeneSPIDER
For each of the five synthetic networks a noise-free fold change gene expression dataset with three replicates was generated. Unlike GeneNetWeaver, GeneSPIDER directly generates fold changes instead of generating gene expression and wild type separately. GeneSPIDER also inherently allows for replicates, therefore no manual replication was necessary. The resulting noise-free fold change gene expression matrix is of size 100 × 300 (genes × experiments).
Noise generation
Noise was generated in the same way for the data from both generation tools to allow for a fair comparison based on the signal-to-noise ratio (SNR). Given a target SNR, we used Eq. (1) to calculate the required variance (λ). Then we generated a random noise matrix of the same size as the data with the desired SNR using the derived variance.
In Eq. (1), svd(X) denotes the set of singular values of the noise-free fold change gene expression matrix X, 1 − α is the confidence level (0.99), N is the number of genes, M is the number of experiments, and λ is the variance. Following this approach, we generated three different noise matrices of ‘high’, ‘medium’, and ‘low’ noise levels from SNR levels of 0.01, 0.1, and 1, respectively. The generated noise matrices were added to their corresponding noise-free gene expression matrix to obtain the noisy data used for GRN inference.
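Once λ has been derived from Eq. (1) (not reproduced in this section), the noise step reduces to sampling an i.i.d. Gaussian matrix of the data's size and adding it; the λ value below is illustrative only, not one computed from Eq. (1).

```python
import numpy as np

rng = np.random.default_rng(42)
N, M = 100, 300
X = rng.standard_normal((N, M))   # placeholder noise-free fold changes
lambda_ = 0.25                    # illustrative variance; Eq. (1) sets this

E = rng.normal(0.0, np.sqrt(lambda_), size=(N, M))  # Gaussian noise matrix
Y = X + E                                           # noisy data for inference
```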
DREAM5 E. coli subset
A known-target perturbation subset was extracted from the DREAM5 E. coli (network 3) challenge dataset^{5} using the mapping file called “chip features”, where the “DeletedGenes” and “OverexpressedGenes” columns refer to knockout and overexpression experiments, respectively. The entries with the same experiment number without any perturbation information were considered the control experiments, and their average was used to calculate the fold change gene expression. Knockdown experiments were discarded since their targets were not specified. This subset is of size 41 × 193 (genes × experiments) while the full dataset is 4297 × 805.
Benchmarked GRN inference methods
We investigated several methods from different mathematical backgrounds, i.e. regression, mutual information, random forests, and Bayesian models, and gathered state-of-the-art methods from each background. The mutual information based methods CLR and BC3NET were selected because they performed well in the DREAM network inference challenges. Note that in the DREAM competition, the P-based methods used here were not run with a P matrix as input. P-based methods such as LASSO and related regression-based methods were chosen because they are well known and commonly used in the field. To complement these we chose the Z-score approach for P-based observed-effect GRN inference.
Least squares
Least squares regression provides the optimal fit between the dependent variable and the independent variables by minimizing the sum of squared residuals. Assuming the steady-state model AY + P = 0 ^{1}, the GRN A is estimated as − P × Y^{†}, where Y is the observed expression matrix, P the perturbation design matrix, and ^{†} denotes the Moore–Penrose inverse. P contains a − 1 in the experiment/gene cell if it is a knockdown perturbation, and + 1 for overexpression (not used here). Noise terms are modeled implicitly here.
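The estimator is one line of linear algebra. The sketch below checks it on simulated data built from a known network (the network construction is invented for the check, not taken from the paper): a steady state Y is generated from A_true, and the estimator recovers A_true exactly in the noise-free case.

```python
import numpy as np

def lsq_grn(Y, P):
    """Least-squares GRN estimate under the steady-state model A Y + P = 0:
    A = -P Y^+, with Y^+ the Moore-Penrose pseudoinverse."""
    return -P @ np.linalg.pinv(Y)

# Consistency check: build Y from a known A and recover it.
rng = np.random.default_rng(0)
n = 10
A_true = -np.eye(n) + 0.2 * rng.standard_normal((n, n))  # stable-ish, invertible
P = -np.eye(n)                        # one knockdown experiment per gene
Y = np.linalg.solve(A_true, -P)       # steady state satisfying A Y = -P
A_hat = lsq_grn(Y, P)                 # recovers A_true in the noise-free case
```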
ElasticNet^{25,26,27}
ElasticNet is a regression model that combines LASSO’s L1 regularization with the L2 penalty of Ridge regression to overcome LASSO’s limitations, especially when the data is ill-conditioned. We used Matlab’s Glmnet implementation with α = 0.7, which corresponds to ElasticNet (0 < α < 1).
LASSO^{26,27}
Least absolute shrinkage and selection operator (LASSO) is a regression-based variable selection and regularization method that utilizes the perturbation design and applies an L1 penalty. We used Matlab’s Glmnet implementation with α = 1, which corresponds to LASSO.
Ridge regression^{25,26,27}
Ridge regression is a regression method that uses the L2 penalty to estimate the regression coefficients of highly correlated explanatory variables. We used Matlab’s Glmnet implementation with α = 0.
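The three penalized regressions differ only in the elastic-net mixing parameter. As an illustrative sketch (not the Matlab Glmnet pipeline used in the paper), the same row-wise regression of A Y = −P can be set up with scikit-learn, where `l1_ratio` plays the role of Glmnet’s α; the toy matrices in the test are hypothetical:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def penalized_grn(Y, P, l1_ratio, alpha=0.01):
    """Estimate each row of A from A Y = -P with an elastic-net penalty.
    l1_ratio plays the role of glmnet's alpha: 1 -> LASSO,
    0 < l1_ratio < 1 -> ElasticNet (for pure Ridge, sklearn's Ridge
    class is the recommended choice). Sketch only."""
    N = Y.shape[0]
    A = np.zeros((N, N))
    for i in range(N):
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                           fit_intercept=False, max_iter=10000)
        model.fit(Y.T, -P[i, :])   # predictors: expression; target: -perturbation row
        A[i, :] = model.coef_
    return A
```

In practice each penalty strength `alpha` yields its own network of a different sparsity, which is why the benchmark sweeps a vector of penalty values rather than a single fit.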
Z-score^{19}
The Z-score is the distance between an observed gene expression value and the mean of the gene sample it is compared to, divided by the standard deviation of that sample. In this study, we implemented a Z-score-based approach that utilizes the perturbation design matrix. Each Z-score value is considered as a weight between the gene and the intended target of the perturbation.
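One plausible implementation of this scoring (the exact normalization in the paper may differ; `targets` is our notation for the mapping from each experiment to its intended target gene):

```python
import numpy as np

def zscore_grn(Y, targets):
    """Z-score GRN sketch: for each perturbation experiment, score every gene
    by how far its expression deviates from that gene's mean across all
    experiments, in standard deviations. The scores of experiment `exp`,
    which perturbs gene targets[exp], fill column targets[exp] of the GRN,
    i.e. edges from the perturbed gene to each responding gene."""
    mu = Y.mean(axis=1, keepdims=True)
    sd = Y.std(axis=1, ddof=1, keepdims=True)
    Z = (Y - mu) / sd                      # per-gene z-scores, genes x experiments
    N = Y.shape[0]
    A = np.zeros((N, N))
    for exp, j in enumerate(targets):
        A[:, j] = np.abs(Z[:, exp])        # column j: regulator j -> all genes
    return A
```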
GENIE3^{8}
GENIE3 uses random forest-like tree ensembles to build weighted, directed, unsigned interactions from each gene to the others. It uses gene expression profiles without requiring knowledge of the perturbation design. The weights in the output GRN correspond to the strength of the regulation from the regulator gene to its target. GENIE3 outperformed its competitors in the DREAM4 in silico network inference challenge, but it is computationally expensive due to its tree-based algorithm. We used GENIE3 in Matlab with its default parameters, i.e. random forests with 1000 trees at each step, and with all genes treated as transcriptional regulators. We tested interpreting the output in two ways: assuming that regulators are in rows or in columns. As the latter gave much higher accuracy, this is how we used the output (Suppl. Fig. S14).
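The core idea, regressing each gene on all other genes with a random forest and reading edge weights off the feature importances, can be sketched with scikit-learn (a simplification of GENIE3, not the Matlab implementation used in the benchmark):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def genie3_like(Y, n_trees=100, seed=0):
    """GENIE3-style sketch: for each target gene i, fit a random forest
    predicting Y[i] from all other genes' expression (experiments are the
    samples) and use the feature importances as weights of the edges from
    each candidate regulator to gene i."""
    N = Y.shape[0]
    W = np.zeros((N, N))                       # W[i, j]: regulator j -> target i
    for i in range(N):
        others = [j for j in range(N) if j != i]
        rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        rf.fit(Y[others, :].T, Y[i, :])
        W[i, others] = rf.feature_importances_
    return W
```

Because every gene requires its own forest, the cost grows with both the number of genes and the number of trees, which is the computational expense noted above.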
PLSNET^{10}
PLSNET uses the partial least squares approach to construct weighted, directed but unsigned gene regulatory networks. It does not utilize the perturbation design. We used PLSNET in Matlab with its default parameters, which are ‘nfac’ = 5, ‘K’ = 20, ‘T’ = 1000.
CLR^{21}
Context likelihood of relatedness (CLR) is based on mutual information and outputs weighted but undirected and unsigned gene regulatory networks. We used the settings ‘method’ = ‘rayleigh’, ‘n’ = 10, ‘k’ = 3.
TIGRESS^{9}
TIGRESS is a regression-based gene regulatory network inference method that outputs directed but unsigned networks, where the weights correspond to the strength of the regulation. We used default settings except 'R' = 1000, as the default of 10,000 was too slow.
BC3NET^{20}
BC3NET is a Bayesian bootstrapped, mutual information-based gene regulatory network inference method that outputs unsigned and undirected but weighted networks, where the edge weights denote the ensemble consensus rate of the corresponding mutual information estimates. We used its R implementation with default parameters (100 bootstraps).
Accuracy calculation and metrics
Accuracy of GRN inference is evaluated mainly in terms of the area under the precision-recall curve (AUPR), but also the area under the receiver operating characteristic curve (AUROC), the Matthews correlation coefficient (MCC), and the F1-score. Even though the true GRNs are signed, and some of the benchmarked methods also infer signed interactions, the sign of an interaction was not used in the accuracy calculation, but the direction was.
In Eqs. (2)–(5), TP, FP, TN, and FN refer to the total number of true positives, false positives, true negatives, and false negatives, respectively. In Eq. (2), TPR denotes the true positive rate, or recall. In Eq. (3), FPR denotes the false positive rate. In Eq. (5), MCC refers to the Matthews correlation coefficient.
Area under the precision-recall (AUPR) curve
On a coordinate plane, recall (the true positive rate, Eq. 2) is placed on the horizontal axis (x-axis) and precision (Eq. 4) on the vertical axis (y-axis) for different sparsity levels, tracing a curve from the top of the y-axis, where precision is 1 and recall 0, towards the right of the x-axis, where recall is 1. The area under this curve is called the AUPR, and its value is between 0 and 1, where 1 denotes perfect prediction; a random predictor is expected to achieve an AUPR equal to the fraction of true links in the gold standard.
Area under the receiver operating characteristic (AUROC) curve
On a coordinate plane, the false positive rate (Eq. 3) is placed on the horizontal axis (x-axis) and the true positive rate (Eq. 2) on the vertical axis (y-axis) for different sparsity levels, tracing a curve from the bottom left corner, where both FPR and TPR equal 0, to the top right corner, where both equal 1. The area under this curve is called the AUROC, and its value is between 0 and 1, where 0 refers to a fully misclassified system and 1 denotes perfection. A random performance gives an AUROC of 0.5.
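Both areas can be computed directly from the flattened weight and gold-standard matrices, e.g. with scikit-learn; the toy vectors below are hypothetical, and `average_precision_score` is a step-wise estimate of the AUPR:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Score the flattened predicted weights against the flattened gold-standard
# adjacency (sign ignored, direction kept, as in the accuracy calculation).
truth  = np.array([1, 1, 0, 0, 1, 0, 0, 0])               # hypothetical true links
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1, 0.05, 0.0])
aupr  = average_precision_score(truth, scores)
auroc = roc_auc_score(truth, scores)
```

Here both metrics happen to equal 13/15 ≈ 0.867: the AUROC counts the fraction of (positive, negative) pairs ranked correctly, while the AUPR averages precision over the recall steps.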
Matthews correlation coefficient (MCC)
MCC (Eq. 5) is an accuracy measure that takes all prediction outcomes (true positives, false positives, true negatives, and false negatives) into account to compute a correlation coefficient between the true and predicted values. MCC is commonly used in the field and is a trusted quantity known to be robust to class imbalance. It ranges between − 1 and 1, where the former refers to complete misclassification and the latter denotes perfect classification. A random prediction has an MCC of 0.
F1score
The F1-score (Eq. 6) is an accuracy measure defined as the harmonic mean of precision and recall. It is preferred over MCC in some situations where there is uncertainty regarding the true negatives, i.e. whether they are actual negatives or merely not yet known. It ranges between 0 and 1, where the former occurs when either precision or recall is 0, and the latter implies perfect prediction.
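Both MCC and the F1-score follow directly from the four confusion counts; a minimal sketch corresponding to Eqs. (5) and (6):

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion counts (Eq. 5)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f1(tp, fp, fn):
    """F1-score as the harmonic mean of precision and recall (Eq. 6).
    Note that the true negatives do not enter, which is why F1 is useful
    when negatives are uncertain."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```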
Sparsity selection approach
We applied two different approaches: one for the penalty-based methods (LASSO, ElasticNet, and Ridge regression), and one for all the others. The penalty-based methods infer an independent GRN for each value in the input penalty vector, so each inferred GRN is treated on its own. For these methods, we inferred 30 GRNs whose sparsity ranges from full to empty using logspace(− 6, 0, 30), and the AUPR was calculated across these 30 GRN accuracy points. The other methods either output a fully connected GRN of weighted interactions or produce a single network with an optimized sparsity, i.e. BC3NET; for these we used every unique value in the GRN as a cutoff and thresholded the GRN at each of these values. For 100-gene networks this means up to 10,000 data points were considered in the GRN accuracy calculation (for the methods that output a fully connected GRN). Unlike for the penalty-based methods, each GRN of a different sparsity is contained in the method's initial output.
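The cutoff sweep for fully weighted outputs can be sketched as follows (the function name and the gold-standard adjacency in the test are ours; one confusion-count point is produced per unique weight):

```python
import numpy as np

def sparsity_sweep(W, gold):
    """For a fully weighted GRN W, use every unique |weight| as a cutoff,
    binarize the network at that cutoff, and collect one (TP, FP, TN, FN)
    point per cutoff against the gold-standard adjacency `gold`.
    Sketch of the thresholding approach described above."""
    w = np.abs(W).ravel()
    g = gold.ravel().astype(bool)
    points = []
    for cut in np.unique(w):               # ascending cutoffs: dense -> sparse
        pred = w >= cut
        tp = int(np.sum(pred & g))
        fp = int(np.sum(pred & ~g))
        fn = int(np.sum(~pred & g))
        tn = int(np.sum(~pred & ~g))
        points.append((tp, fp, tn, fn))
    return points
```

The resulting list of confusion counts is what the AUPR, AUROC, MCC, and F1 computations operate on.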
Speed benchmark
The benchmarked methods were run on a computer with 16 Intel Xeon E5620 2.40 GHz CPUs and 70 GB of RAM.
Data availability
The datasets supporting the conclusions of this article, and the code used to generate the results presented in this article, are available at https://bitbucket.org/sonnhammergrni/benchmark/.
References
Tjärnberg, A., Nordling, T. E. M., Studham, M., Nelander, S. & Sonnhammer, E. L. L. Avoiding pitfalls in L1-regularised inference of gene networks. Mol. Biosyst. 11, 287–296 (2015).
Tjärnberg, A., Morgan, D. C., Studham, M., Nordling, T. E. M. & Sonnhammer, E. L. L. GeneSPIDER—Gene regulatory network inference benchmarking with controlled network and data properties. Mol. Biosyst. 13, 1304–1312 (2017).
Madar, A., Greenfield, A., Vanden-Eijnden, E. & Bonneau, R. DREAM3: Network inference using dynamic context likelihood of relatedness and the inferelator. PLoS ONE 5, e9803 (2010).
Greenfield, A., Madar, A., Ostrer, H. & Bonneau, R. DREAM4: Combining genetic and dynamic information to identify biological networks and dynamical models. PLoS ONE 5, e13397 (2010).
Marbach, D. et al. Wisdom of crowds for robust gene network inference. Nat. Methods 9, 796–804 (2012).
Bellot, P., Olsen, C., Salembier, P., Oliveras-Vergés, A. & Meyer, P. E. NetBenchmark: A bioconductor package for reproducible benchmarks of gene regulatory network inference. BMC Bioinform. 16, 312 (2015).
Pirgazi, J., Olyaee, M. H. & Khanteymoori, A. KFGRNI: A robust method to inference gene regulatory network from time-course gene data based on ensemble Kalman filter. J. Bioinform. Comput. Biol. 19, 2150002 (2021).
Huynh-Thu, V. A., Irrthum, A., Wehenkel, L. & Geurts, P. Inferring regulatory networks from expression data using tree-based methods. PLoS ONE 5 (2010).
Haury, A.-C., Mordelet, F., Vera-Licona, P. & Vert, J.-P. TIGRESS: Trustful Inference of gene REgulation using stability selection. BMC Syst. Biol. 6, 145 (2012).
Guo, S., Jiang, Q., Chen, L. & Guo, D. Gene regulatory network inference using PLSbased methods. BMC Bioinform. 17, 545 (2016).
Prelich, G. Gene overexpression: Uses, mechanisms, and interpretation. Genetics 190, 841–854 (2012).
Han, H. RNA interference to knock down gene expression. Methods Mol. Biol. 1706, 293–302 (2018).
Pearl, J. An Introduction to Causal Inference (CreateSpace, 2015).
Meinshausen, N. et al. Methods for causal inference from gene perturbation experiments and validation. Proc. Natl. Acad. Sci. USA. 113, 7361–7368 (2016).
Ud-Dean, S. M. M. & Gunawan, R. Ensemble inference and inferability of gene regulatory networks. PLoS ONE 9, e103812 (2014).
Ud-Dean, S. M. M. & Gunawan, R. Optimal design of gene knockout experiments for gene regulatory network inference. Bioinformatics 32, 875–883 (2016).
Schaffter, T., Marbach, D. & Floreano, D. GeneNetWeaver: In silico benchmark generation and performance profiling of network inference methods. Bioinformatics 27, 2263–2270 (2011).
Seçilmiş, D. et al. Uncovering cancer gene regulation by accurate regulatory network inference from uninformative data. NPJ Syst. Biol. Appl. 6, 37 (2020).
Prill, R. J. et al. Towards a rigorous assessment of systems biology models: The DREAM3 challenges. PLoS ONE 5, e9202 (2010).
de Matos Simoes, R. & Emmert-Streib, F. Bagging statistical network inference from large-scale gene expression data. PLoS ONE 7, e33624 (2012).
Faith, J. J. et al. Largescale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 5, e8 (2007).
Seçilmiş, D., Hillerton, T., Nelander, S. & Sonnhammer, E. L. L. Inferring the experimental design for accurate gene regulatory network inference. Bioinformatics https://doi.org/10.1093/bioinformatics/btab367 (2021).
Aalto, A., Viitasaari, L., Ilmonen, P., Mombaerts, L. & Gonçalves, J. Gene regulatory network inference from sparsely sampled noisy data. Nat. Commun. 11, 3493 (2020).
Maetschke, S. R., Madhamshettiwar, P. B., Davis, M. J. & Ragan, M. A. Supervised, semisupervised and unsupervised inference of gene regulatory networks. Brief. Bioinform. 15, 195–211 (2014).
Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Series B (Stat. Methodol.). 67, 301–320 (2005).
Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B Stat. Methodol. 58, 267–288 (1996).
Acknowledgements
This work was supported by the Swedish Foundation for Strategic Research, and by the Ministry of Science and Technology in Taiwan, Grants MOST 108-2634-F-006-009 and MOST 109-2224-E-006-003.
Funding
Open access funding provided by Stockholm University.
Author information
Authors and Affiliations
Contributions
D.S. implemented the benchmarking pipeline, generated the GeneNetWeaver data and Gaussian noise matrices for both tools, tested and ran the non-P-based methods and implemented the Z-score method, performed the analyses, made the figures and wrote the manuscript; T.H. generated the GeneSPIDER data, implemented the wrapper for running ridge regression, and wrote the scripts for accuracy calculation; A.T. implemented the wrappers for running LASSO and ElasticNet; S.N. provided support and comments on the study; E.L.L.S. and T.E.M.N. conceived and supervised the study and revised the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Seçilmiş, D., Hillerton, T., Tjärnberg, A. et al. Knowledge of the perturbation design is essential for accurate gene regulatory network inference. Sci Rep 12, 16531 (2022). https://doi.org/10.1038/s41598-022-19005-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-022-19005-x
This article is cited by
RENGE infers gene regulatory networks using timeseries singlecell RNAseq data with CRISPR perturbations
Communications Biology (2023)