Introduction

Gene regulatory networks (GRNs) are models that describe how transcription factors (TFs) orchestrate the expression of other genes1,2,3. In GRNs the nodes are TFs and genes, and the edges represent regulatory interactions. TFs bind to the promoters of genes to activate or silence the production of mRNA4. By doing so, TFs help control gene expression and thus modulate or enable cellular processes5. Most existing approaches to studying gene regulation focus on the effect of a single TF on the rest of the genes. However, TFs can regulate the expression of other TFs, so a perturbation of a single TF can propagate throughout the whole system. The potential for a single TF to control a system depends on certain network features: a TF that regulates many genes will obviously have a large influence, but the converse is not necessarily true. Even if a TF directly regulates only a small number of genes, it can still play a major role in the system if these genes in turn regulate many others. This is captured by the concept of closeness centrality in graph theory. Therefore, understanding a system’s GRN architecture can reveal the TFs that are important for a specific phenotype, and thus can explain molecular pathogenesis and reveal potential drug targets.

Several methods have been developed to predict GRNs from gene expression data2,3,6,7,8,9. These methods formulate GRN inference as an unsupervised classification problem: the GRN is represented as a weighted graph, learned from unlabeled data, whose edge weights represent the predicted probability that each TF regulates each other gene (including other TFs).

Several groups have used algorithms based on different mathematical formulations to infer GRNs from gene expression data. These include Bayesian networks, correlation metrics, mutual information methods, and parametric and non-parametric regression. A seminal paper published in 2012 showed that correlation, mutual information and Bayesian network methods tended to perform far worse than methods based on regression3. For this reason, in this work we focused only on regression-based GRN inference methods.

Regression-based GRN inference methods build a mathematical model of the expression level of a target gene given the expression levels of different TFs. The central assumption of these methods is that if the expression level of a TF predicts the expression level of a target gene, the TF is likely to regulate it. The goodness of fit of the model is used to estimate the probability that the TFs in a model regulate the target gene: if the expression of a TF poorly predicts the expression of a gene, the TF is unlikely to regulate it, and vice versa. Some of the popular regression-based GRN inference algorithms are summarized in Fig. 1.

Figure 1

Regression-based Gene Regulatory Network Inference Algorithms. Schematic describing how different regression-based GRN inference methods work. We show six methods based on different machine learning algorithms that differ in the form of the model and in the way models for different TFs are combined. They were selected because they represent the major types of machine learning methods used for GRN reconstruction and because of their high performance in the DREAM 4 and DREAM 5 challenges. PLSNET uses partial least squares and fits a linear model between TFs and targets. TIGRESS uses a linear model with different lasso parameters. Both rely on the assumption that the expression of a gene is proportional to the expression levels of the TFs that regulate it. GRNBoost2 and ENNET use boosting to learn different tree models between TFs and targets. GENIE3 also uses tree models; however, it iteratively resamples different subsets of observations and potential TFs per target and creates an ensemble of tree models. Unlike linear models, tree models do not rely on any assumption about the relationship between a TF and a target; however, they are not continuous models.

Parametric GRN inference algorithms use a mathematical formulation that describes gene regulation with a predefined expected form. The most popular algorithms in this class are based on linear regression. In other words, they assume that the expression of a target gene is a linear combination of the expression levels of the TFs that regulate it8,9,10,11. PLSNET and TIGRESS both use a linear regression approach8,9. PLSNET uses a form of partial least squares known as SIMPLS8, while TIGRESS uses the least angle regression algorithm (LARS)9. LARS is an effective approach for obtaining different solutions for the linear regression coefficients with different absolute norms12. The algorithm is iterative: at each step, the residuals between the response variable and the linear regression model are compared with the explanatory variables, and because the models are trained iteratively on the residuals it can be described as a boosting algorithm12. TIGRESS couples LARS with stability selection and uses the frequency with which each TF is chosen across iterations of LARS as a proxy for the predictive probability that a TF regulates a gene9. The effectiveness of linear model-based approaches relies on how well a target gene’s expression levels can be approximated by linear functions.
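For illustration, a TIGRESS-style scoring step might look like the sketch below: for each target gene, LARS is run on resampled observations and the frequency with which each TF enters the model is used as an edge score. The helper name, resampling scheme and parameter values are illustrative simplifications under these assumptions, not the TIGRESS implementation.

```python
import numpy as np
from sklearn.linear_model import Lars

def lars_stability_scores(expr, tf_idx, n_steps=5, n_resamples=100, seed=0):
    """Selection frequencies (genes x TFs) from LARS with stability selection.

    expr: (n_samples, n_genes) standardized expression matrix.
    tf_idx: list of column indices that correspond to TFs.
    """
    rng = np.random.default_rng(seed)
    n, g = expr.shape
    scores = np.zeros((g, len(tf_idx)))
    for j in range(g):                              # one regression per target gene
        predictors = [p for p in tf_idx if p != j]  # a gene cannot predict itself
        X, y = expr[:, predictors], expr[:, j]
        for _ in range(n_resamples):                # stability selection by resampling
            rows = rng.choice(n, size=n // 2, replace=False)
            model = Lars(n_nonzero_coefs=n_steps).fit(X[rows], y[rows])
            for a in model.active_:                 # TFs on the LARS path get a vote
                scores[j, tf_idx.index(predictors[a])] += 1.0 / n_resamples
    return scores
```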

For this reason, nonparametric regression methods have been proposed. In particular, regression trees have been very successful for GRN inference on different datasets2,6,7. Regression trees partition the explanatory variables into regions and give the response variable a value depending on the region in which each observation resides13,14. The GENIE3 method uses the regression version of the random forest algorithm2,13: this algorithm iteratively samples different observations and explanatory variables and builds regression trees with a fixed depth. The predictions of the different trees are then combined to form the final prediction13. GENIE3 runs the random forest algorithm once per target gene. At each step, the variable importance of each TF is calculated as the proportion of explained variance of the target gene. The GENIE3 algorithm won two important GRN inference challenges, namely the DREAM 4 multifactorial sub-challenge and the DREAM 5 challenge2. The datasets for these challenges are now widely used benchmarks to test the performance of different GRN inference algorithms. Besides the random forest approach, regression tree models can also be trained using stochastic gradient boosting. In the context of GRN inference, ENNET and GRNBoost2 are based on stochastic gradient boosting of regression trees6,7. Stochastic gradient boosting is a form of gradient-based function optimization applied in model space14. Very generally speaking, tree boosting is similar to LARS except that it uses tree models instead of linear models12.
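A GENIE3-style scoring step can be sketched in the same spirit: one random forest per target gene, with the TF feature importances used as edge weights. The helper and its parameter values are illustrative assumptions and do not reproduce GENIE3's exact settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_importance_scores(expr, tf_idx, n_trees=500, seed=0):
    """Random-forest feature importances (genes x TFs) as edge scores."""
    n, g = expr.shape
    scores = np.zeros((g, len(tf_idx)))
    for j in range(g):                                   # one forest per target gene
        predictors = [p for p in tf_idx if p != j]       # exclude self-regulation
        rf = RandomForestRegressor(n_estimators=n_trees,
                                   max_features="sqrt",  # subsample TFs at each split
                                   random_state=seed)
        rf.fit(expr[:, predictors], expr[:, j])
        for a, p in enumerate(predictors):
            scores[j, tf_idx.index(p)] = rf.feature_importances_[a]
    return scores
```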

Here we present KBoost, a method that uses kernel PCA (principal component analysis) regression and gradient boosting to reconstruct gene regulatory networks. Kernel PCA regression (KPCR) is a nonparametric regression technique that does not require the data to follow a strict form15,16. This is accomplished by transforming the explanatory variables, in our case the TF expression levels, using a kernel function. The kernel function yields a symmetric matrix that corresponds to the dot product of nonlinear features that are a function of the explanatory variables. For many kernel functions these implicit nonlinear features correspond to polynomials of the explanatory variables17. In fact, the radial basis function (RBF) kernel is the dot product of an infinite number of polynomial features, without these ever having to be calculated explicitly18. Using polynomials for regression is an attractive approach because, according to the Weierstrass approximation theorem, any continuous real function can be uniformly approximated by polynomials. Furthermore, kernel PCA lets us obtain the principal components of this large number of polynomials by working with the kernel matrix, which has much smaller dimensions. Thus, with KPCR we can use the first principal components of a large number of implicit features to approximate any continuous function, in our case the relationship between a TF and a target gene.
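To make the polynomial-feature interpretation concrete, the one-dimensional RBF kernel (written here with the same width parameter \(\gamma \) as in Eq. (6) below) can be expanded as a dot product of infinitely many scaled monomial features; this is a standard identity, shown only for illustration:

$$\exp\left(-\frac{{\left(x-y\right)}^{2}}{\gamma }\right)={e}^{-{x}^{2}/\gamma }{e}^{-{y}^{2}/\gamma }\sum_{k=0}^{\infty }\frac{{(2/\gamma )}^{k}}{k!}{x}^{k}{y}^{k}=\sum_{k=0}^{\infty }{\phi }_{k}\left(x\right){\phi }_{k}\left(y\right),\quad {\phi }_{k}\left(t\right)=\sqrt{\frac{{(2/\gamma )}^{k}}{k!}}\,{t}^{k}{e}^{-{t}^{2}/\gamma }.$$

Each implicit feature \({\phi }_{k}\) is a monomial damped by a Gaussian factor, so evaluating the kernel amounts to taking the dot product of infinitely many polynomial-type features without ever computing them.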

For simplicity, instead of computing the KPCs of the expression levels of multiple TFs jointly, we build an ensemble of models of KPCs. This is done through gradient boosting, which iteratively combines predictions from different models, in our case KPCRs from different TFs7,14. For each gene, we start by selecting the TF whose KPCs best predict its expression levels according to the posterior. Then, we search among every TF’s KPCs to see which best predicts the residuals, that is, the difference between the current prediction and the actual gene expression values. At the end of each iteration, we update the predictions by adding the model fit on the residuals. This greedy model search allows us to drastically reduce the running time. The posterior probabilities of the models queried are used to estimate the GRN. Because of this Bayesian formulation, we can include information from previous experimental data and increase the accuracy of our predictions11,19. We tested KBoost against other algorithms on several benchmark datasets. The results show that KBoost performs better than other algorithms even in the absence of prior information.

Methods

KBoost overview

The aim of KBoost is to provide a fast and scalable algorithm for the accurate inference of GRNs (Fig. 2). It takes mRNA expression data (microarray or RNA-seq) and, optionally, previously found TF-target interactions (e.g., from ChIP-seq) as input. KBoost uses KPCR and boosting coupled with Bayesian model averaging (BMA) to estimate the probabilities of genes regulating each other, and thereby reconstructs GRNs10,11,20. We use a greedy approach to sample the model space. That is, for every gene we build a mathematical model that predicts its expression using the kernel PCA of the expression levels of a likely subset of TFs. We then compare different KPCR TF-gene models and estimate the probability that a TF regulates a gene using BMA. The model fit is used as a proxy for the probability that a set of TFs regulates a specific gene. If information on likely TF targets is available from other sources, it can be combined with the model fit in the form of a prior. The output of KBoost is an estimate of the probability that each TF regulates each gene.

Figure 2

Overview of KBoost. KBoost uses KPCR in a boosting framework to infer GRNs from gene expression data. KBoost uses a list of predefined TFs and builds KPCR models to predict the expression of other genes (including TFs). These models are combined using gradient boosting. The residuals are used to estimate the probability that a TF regulates a gene.

Modelling gene expression

We formulate the data as \(n\) independent observations \({\varvec{x}}={\left[{x}_{1}, \dots , {x}_{n}\right]}^{T}\), each with real-valued measurements of \(G\) genes: \({x}_{i}=({x}_{i}^{\left(1\right)}, \dots , {x}_{i}^{\left(G\right)})\). We assume that the expression of each gene \(j\), \({x}_{i}^{(j)}\), is a linear combination of non-linear functions of the expression of a subset of the gene’s TFs, denoted \({A}_{j}\), plus some random noise \({\varepsilon }_{i}^{(j)}\), which comes from a normal distribution with zero mean and variance \({\sigma }_{j}^{2}\):

$${x}_{i}^{(j)}=\sum_{p \in { A}_{j}}f({x}_{i}^{(p)})+ {\varepsilon }_{i}^{(j)}.$$
(1)

Gene regulatory network inference as unsupervised classification

The gene regulatory network, \({\varvec{C}}\), is then defined as a directed graph with \(G\) nodes, where each node is a gene. We write \({\varvec{C}}\) as a \(G\times P\) matrix, where \(P\) is the number of genes in \(G\) that are TFs. The entry \({C}_{j,p}\) takes the value 1 if TF \(p\) regulates gene \(j\) and 0 otherwise. Like many other algorithms, we formulate the problem of estimating \({\varvec{C}}\) from \({\varvec{x}}\) as an unsupervised classification problem, where the fit of different subsets of TFs in Eq. (1) is used to estimate the posterior probability that any TF regulates any gene; in other words, how well different combinations of TFs, \({A}_{j}\), fit the data \({x}_{i}^{(j)}\). We can then consider all combinations of TF subsets \({A}_{j}=({A}_{j}^{\left(1\right)},\dots , {A}_{j}^{\left({2}^{P}\right)})\), where \({A}_{j}^{\left(d\right)}\) contains the integer indices of the genes that are TFs and are part of subset \(d\). The predictive probability that an element \({C}_{j,p}\) equals 1 is:

$$P\left({C}_{j,p}=1|{\varvec{x}}\right)=\frac{{\sum }_{d:p \in {A}_{j}^{(d)} }P\left({A}_{j}^{\left(d\right)}\right)\pi \left({A}_{j}^{(d)}\right)}{{\sum }_{t=1}^{{2}^{P}}P\left({A}_{j}^{\left(t\right)}\right)\pi \left({A}_{j}^{(t)}\right)}.$$
(2)

Here, \(P\left({A}_{j}^{\left(d\right)}\right)\) is the marginal likelihood of the model in Eq. (1) given the subset \({A}_{j}^{\left(d\right)}\), and \(\pi \left({A}_{j}^{(d)}\right)\) is the prior probability that the TFs in subset \({A}_{j}^{(d)}\) regulate gene \(j\). We formulate the prior probability that a subset of TFs \({A}_{j}^{(d)}\) regulates a gene \(j\) as a product of independent Bernoulli (binomial) probabilities:

$$\pi \left({A}_{j}^{(d)}\right)={\prod }_{p=1}^{P}{w}_{j,p}^{{\mathbf{1}}\left(p \in {A}_{j}^{(d)}\right)}{\left(1-{w}_{j,p}\right)}^{{\mathbf{1}}\left(p \notin {A}_{j}^{(d)}\right)}.$$
(3)

Here, the parameters of the prior are in a matrix \({\varvec{W}}=[{w}_{1}, \dots , {{w}_{G}]}^{T}\) with the same dimensions as the gene regulatory network \({\varvec{C}}\). For simplicity the marginal likelihood, \(P\left({A}_{j}^{\left(d\right)}\right)\), is obtained by assuming \(\sum_{p \in {A}_{j}}f({x}_{i}^{(p)})\) to be fixed. This assumption facilitates the inference process and reduces the computation time required in KBoost. We give the variance of the noise a Jeffreys prior, \(\pi \left({\sigma }_{j}^{2}\right)\propto \frac{1}{{\sigma }_{j}^{2}}\). This yields the following expression for the marginal likelihood:

$$P\left({A}_{j}^{\left(d\right)}\right) \propto {\left({\sum }_{i=1}^{n}{\left({x}_{i}^{\left(j\right)}-\sum_{p \in {A}_{j}^{(d)}}f\left({x}_{i}^{\left(p\right)}\right)\right)}^{2} \right)}^{-\frac{n-1}{2}}.$$
(4)
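As a concrete illustration of how Eqs. (2) and (4) combine, the sketch below computes the log marginal likelihood of a candidate TF subset from its residual sum of squares and then forms the BMA posterior for a single edge over a set of explored subsets. The helpers are simplified, hypothetical functions written for clarity (working in log space for numerical stability), not the KBoost implementation.

```python
import numpy as np

def log_marginal_likelihood(residuals, n):
    """log P(A) up to a constant: -((n - 1) / 2) * log(RSS), as in Eq. (4)."""
    rss = np.sum(residuals ** 2)
    return -0.5 * (n - 1) * np.log(rss)

def edge_posterior(explored_subsets, log_ml, log_prior, tf):
    """P(C_{j,tf} = 1 | x): BMA over the explored subsets that contain `tf`."""
    log_w = np.array(log_ml) + np.array(log_prior)
    log_w -= log_w.max()                      # stabilize before exponentiating
    w = np.exp(log_w)
    in_tf = np.array([tf in A for A in explored_subsets])
    return w[in_tf].sum() / w.sum()
```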

Greedy model selection and boosting

So far, we have omitted the details of two crucial tasks: first, how we calculate \(\sum_{p \in {A}_{j}}f\left({x}_{i}^{\left(p\right)}\right)\), and second, how we go through the different subsets \({A}_{j}^{\left(d\right)}\). For the second task, we chose a greedy approach where we iteratively select the TF subsets with the highest posteriors \(P\left({A}_{j}^{\left(d\right)}\right)\) and expand them for each gene. This choice is motivated by two goals: (1) to reduce the computational burden of calculating \(P\left({A}_{j}^{\left(d\right)}\right)\) for all possible combinations of TFs, and (2) to reduce the number of false positives. Genes that are regulated by the same set of TFs will have highly correlated expression levels. We assume that for each gene \(j\), there are several \({A}_{j}^{\left(d\right)}\) which yield a relatively high \(P\left({A}_{j}^{\left(d\right)}\right)\) due to coregulation, and thus we perform a greedy search keeping only the TFs that yield the highest \(P\left({A}_{j}^{\left(d\right)}\right)\) at each boosting iteration. This is similar to the LARS procedure, except that in Eq. (2) we include all explored models, as in the Occam's razor version of BMA12,21. Therefore, Eq. (5) uses only the subsets \(\widehat{{A}_{j}}\) explored by our greedy search, i.e.:

$${\widehat{C}}_{j,p}= P\left({C}_{j,p}=1|{\varvec{x}}\right)\cong \frac{{\sum }_{d:p \in \widehat{{A}_{j}^{(d)}} }P\left({A}_{j}^{\left(d\right)}\right)\pi \left({A}_{j}^{(d)}\right)}{\sum_{\widehat{{A}_{j}^{(q)}}}P\left({A}_{j}^{\left(q\right)}\right)\pi \left({A}_{j}^{(q)}\right)}.$$
(5)

Kernel principal components

Regarding the task of calculating \(\sum_{p \in {A}_{j}}f\left({x}_{i}^{\left(p\right)}\right)\), we chose to use KPCR because it has three important advantages in a boosting framework. First, KPCR is a kernel-based regression method that can approximate any potential shape that the relationship between a TF and a target gene might have. We use the RBF kernel function, which is the exponential of minus the squared Euclidean distance between observations divided by a width hyperparameter. KPCR allows us to compute the principal components in the infinite-dimensional feature space directly from the kernel matrix. A more detailed description is presented in the appendix. Second, unlike other kernel regression methods, which would need an \(n\times n\) kernel matrix per TF, in KPCR we only need a few principal components per TF. This means a substantial reduction in memory usage while trying out different subsets \({A}_{j}^{\left(d\right)}\). Finally, the boosting framework allows us to effectively combine principal components from kernels of different TFs. The formulation is shown below.

For each TF \(p\), we first calculate the RBF kernel with width parameter \(\gamma \):

$${K}^{(p)}\in {\mathbb{R}}^{n\times n}, {K}_{i,u}^{(p)}=exp \left(-\frac{{\left({x}_{i}^{\left(p\right)}-{x}_{u }^{\left(p\right)}\right)}^{2}}{\gamma } \right)$$
(6)

Then, we implicitly center the infinite-dimensional features by double-centering the kernel matrix:

$$\overline{{K }^{(p)}}={K}^{(p)}-{1}_{nn}{K}^{(p)}-{K}^{(p)}{1}_{nn}+{1}_{nn}{K}^{(p)}{1}_{nn}$$
(7)

Here, \({1}_{nn}\) is an \(n\times n\) matrix with all values equal to \(\frac{1}{n}\). Next, we perform an eigendecomposition of \(\overline{{K}^{(p)}}\):

$$\overline{{K }^{(p)}}{\alpha }^{(p)}={\alpha }^{(p)} {\lambda }^{(p)},$$
(8)

where \({\alpha }^{(p)}\) are the eigenvectors and \({\lambda }^{(p)}\) is a diagonal matrix with the eigenvalues in descending order. The first \(z\) KPCs are then obtained by

$$g\left({x}_{i}^{\left(p\right)}\right)= \overline{{K }_{i}^{\left(p\right)}}{\alpha }_{1:z}^{\left(p\right)}.$$
(9)
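The following sketch walks through Eqs. (6)-(9) for a single TF: compute the RBF kernel, double-center it, eigendecompose it and keep the leading components. The helper name and the eigenvalue threshold are illustrative assumptions; the default \(\gamma \) of 60 follows the parameter settings reported in the performance assessment below.

```python
import numpy as np

def kernel_pcs(x_tf, gamma=60.0, eig_threshold=1e-6):
    """x_tf: (n,) expression of one TF. Returns the retained KPCs (n x z)."""
    n = x_tf.shape[0]
    d2 = (x_tf[:, None] - x_tf[None, :]) ** 2
    K = np.exp(-d2 / gamma)                                    # Eq. (6)
    one_nn = np.full((n, n), 1.0 / n)
    K_bar = K - one_nn @ K - K @ one_nn + one_nn @ K @ one_nn  # Eq. (7)
    eigval, eigvec = np.linalg.eigh(K_bar)                     # Eq. (8), ascending
    order = np.argsort(eigval)[::-1]                           # sort descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    keep = eigval > eig_threshold                              # eigenvalue threshold
    pcs = K_bar @ eigvec[:, keep]                              # Eq. (9)
    pcs /= np.linalg.norm(pcs, axis=0)                         # unit-norm KPCs
    return pcs, eigval[keep]
```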

In KBoost we select KPCs by using a threshold on their corresponding eigenvalues. The KPCs are then normalized so that they have unit norm. After obtaining \(g\left({x}_{i}^{\left(p\right)}\right)\), the matrices \(\overline{{K }^{(p)}}\) and \({K}^{(p)}\) can be overwritten and thus this procedure allows us to reduce the amount of memory required. Note that the columns of \(g\left({x}_{i}^{\left(p\right)}\right)\) are orthogonal to each other. This means that for a model with one TF, \({A}_{j}^{(d)}=\{p\}\), we can fit the model in Eq. (1) by:

$$ \begin{aligned} x_{i}^{\left( j \right)} & = \mathop \sum \limits_{{p \in A_{j} }} f\left( {x_{i}^{\left( p \right)} } \right) + \varepsilon_{i}^{\left( j \right)} = \widehat{{\beta_{j} }}^{\left( p \right)T} g\left( {x_{i}^{\left( p \right)} } \right) + \varepsilon_{i}^{\left( j \right)} , \\ & \quad \widehat{{\beta_{j} }}^{\left( p \right)} = \nu \lambda^{*\left( p \right)} g\left( {x^{\left( p \right)} } \right)^{T} \varepsilon^{\left( j \right)} . \\ \end{aligned} $$
(10)

Here, \({{\widehat{\beta }}_{j}}^{\left(p\right)}\) are the coefficients of the KPC regression multiplied by a constant \(0<\nu \le 1\), and \({{\lambda }^{*}}^{\left(p\right)}\) is a diagonal matrix whose entries are the reciprocals of the eigenvalues that correspond to the KPCs. The constant \(\nu \) is known as a shrinkage parameter. We propose using \(\nu \) to counteract the effect that the number of observations \(n\) has on the marginal likelihood. Since, in Eq. (4), the residual sum of squares is raised to the power \(-(n-1)/2\), the resulting GRNs would change dramatically depending on \(n\): with a large \(n\), most TFs' posterior probabilities would be close to zero while a few would be close to one. This can lead to an overestimation of the confidence we have in the predicted GRN. By shrinking the estimates of \({{\widehat{\beta }}_{j}}^{\left(p\right)}\), we introduce a heuristic that makes the resulting sums of squares less different between TF subsets, \({A}_{j}^{(d)}\), and thus reduces the effect of \(n\).

Boosting

The last equation (Eq. 10) shows how we fit the model in Eq. (1) in the case that \({A}_{j}^{(d)}\) contains 1 TF. For simplicity we then follow a greedy boosting approach that is very similar to the least angle regression method. In KBoost, we first fit every single TF subset \({A}_{j}^{(d)}\) (i.e. only one TF in each model) and calculate their unnormalized posteriors. Then, we select the TF with the highest unnormalized posterior, \({p}^{*}\), and calculate its residuals, \({\varepsilon }^{(j)}\). Next, we form new subsets by iteratively adding a TF, and fit the TFs to the residuals:

$$ \begin{aligned} & \varepsilon_{i}^{\left( j \right)} = x_{i}^{\left( j \right)} - \widehat{{\beta_{j} }}^{{\left( {p^{*} } \right)}} g\left( {x_{i}^{{\left( {p^{*} } \right)}} } \right), \\ & \hat{\beta }_{j}^{\left( p \right)} = \nu \lambda^{*\left( p \right)} g\left( {x^{\left( p \right)} } \right)^{T} \varepsilon^{\left( j \right)} , \\ & A_{j}^{\left( d \right)} = \left\{ {p^{*} ,p} \right\}, \\ & \mathop \sum \limits_{{p \in A_{j} }} f\left( {x_{i}^{\left( p \right)} } \right) = \widehat{{\beta_{j} }}^{\left( p \right)T} g\left( {x_{i}^{\left( p \right)} } \right) + \widehat{{\beta_{j} }}^{{\left( {p^{*} } \right)T}} g\left( {x_{i}^{{\left( {p^{*} } \right)}} } \right). \\ \end{aligned} $$
(11)

Similarly to the previous step, we calculate their unnormalized posteriors and select the TF with the highest posterior to expand next (Fig. 3). When we use a \(\nu \) below 1, there is a possibility that the same TF will be chosen more than once. This might reflect a TF having a very strong relationship with a particular target gene. At the end we estimate \(P\left({C}_{j,p}=1|x\right)\) by performing BMA on all explored models. In KBoost, the number of iterations also represents the maximum number of TFs per gene to be considered. The resulting network is tuned using the heuristic proposed by Slawek and Arodz7, in which we multiply each column \(p\) of \(P\left({C}_{j,p}=1|x\right)\) by the sample variance of the whole column. This has the effect of preserving the sparse out-degree distribution found in most biological networks. KBoost is available as an R package at https://github.com/Luisiglm/KBoost and as a Bioconductor software package.
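Putting Eqs. (10) and (11) together, the sketch below outlines the greedy boosting loop for one gene: at each iteration the TF whose KPCs best fit the current residuals is added, the coefficients are shrunk by \(\nu \), and the unnormalized posteriors of the explored models are kept for BMA, followed by the column-variance heuristic. It is a simplified, hypothetical reimplementation written for illustration (noninformative prior, unit-norm KPCs so the \({\lambda }^{*}\) term drops out), not the KBoost R code.

```python
import numpy as np

def kboost_like_scores(expr, tf_idx, kpcs, nu=0.5, n_iter=3):
    """expr: (n, g) standardized expression; tf_idx: list of TF column indices;
    kpcs: dict mapping TF index -> (n, z) unit-norm KPCs (e.g. from kernel_pcs above)."""
    n, g = expr.shape
    scores = np.zeros((g, len(tf_idx)))
    for j in range(g):
        resid = expr[:, j].copy()
        subset, explored, log_post = [], [], []
        for _ in range(n_iter):
            fits = {}
            for p in tf_idx:
                if p == j:
                    continue
                beta = nu * (kpcs[p].T @ resid)         # Eqs. (10)-(11), unit-norm KPCs
                r_new = resid - kpcs[p] @ beta
                fits[p] = (-0.5 * (n - 1) * np.log(np.sum(r_new ** 2)), r_new)
            best = max(fits, key=lambda p: fits[p][0])  # greedy choice of the next TF
            subset = subset + [best]
            explored.append(list(subset))
            log_post.append(fits[best][0])
            resid = fits[best][1]
        w = np.exp(np.array(log_post) - max(log_post))  # BMA weights, as in Eq. (5)
        for A, wt in zip(explored, w):
            for p in set(A):
                scores[j, tf_idx.index(p)] += wt / w.sum()
    # heuristic of Slawek and Arodz: scale each TF column by its sample variance
    return scores * scores.var(axis=0, keepdims=True)
```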

Figure 3

Gene Expression Modelling with KPCR and Boosting. The procedure starts by selecting the TF whose KPCR model has the highest posterior for each gene. Then we combine different KPCR models by boosting. At each boosting iteration we fit a new KPCR model to each gene expression’s residuals instead of the actual values. We then select the TF whose KPCR model has the highest performance in each case and update the KPCR model ensemble.

Performance assessment

We compared KBoost against five other GRN reconstruction methods that were chosen because of their high performance in previous benchmarking exercises, such as the DREAM 4 and DREAM 5 challenges. Specifically, the algorithms included are GENIE3, PLSNET, ENNET, TIGRESS and GRNBoost22,6,7,8. All algorithms were run using their default parameters, including KBoost (\(\nu = 10/n\) if \(n>10\), otherwise \(\nu =0.5\); \(\gamma =60\); and 3 iterations). We used three different datasets to assess the comparative performance of our algorithm: IRMA22, the in silico networks from the DREAM 4 multifactorial sub-challenge2, and the DREAM 5 dataset, which contains three sets of gene expression data: one from a simulated network and two from real-life networks3. Before the analysis we standardized each gene's expression to a sample variance of one and a sample mean of zero. For this comparison, we did not use any prior information on the relationships between TFs and potential targets. Therefore, we gave every TF a probability of 0.5 of regulating any gene, so that every possible combination of TFs had the same prior probability.


Benchmarking consisted of comparing a predicted GRN with the real GRN using common performance metrics, namely the area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic curve (AUROC). Both metrics are computed by varying a probability threshold from close to 0 to 1. Regulatory interactions with a probability higher than the threshold are kept, while those below the threshold are discarded. The inferred regulatory interactions are then compared with the real GRN. To obtain the AUPR we calculate the precision (number of correctly predicted interactions divided by the total number of predicted interactions) and the recall (number of correctly predicted interactions divided by the total number of actual interactions) at each threshold. Precision is then plotted against recall and the area under the curve is calculated. Similarly, for the AUROC, at each threshold we calculate the recall and the false positive rate (number of incorrectly predicted interactions divided by the actual number of non-interactions). The recall is plotted against the false positive rate and the area under the curve is calculated. The AUROC value ranges from 0 to 1; 0.5 is the value expected from random guessing and 1 is the perfect score23. Besides the AUROC and AUPR, we also use the specific scores developed for the DREAM 4 and DREAM 5 challenges. The authors of these challenges devised a score to benchmark different methods on their datasets, calculated as the mean of the log10 of the empirical P-values of randomly guessing a network of the same accuracy as the one inferred by each specific method3. For consistency, we used the DREAM 4 and DREAM 5 evaluation scripts to compare the different algorithms.
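For illustration, computing the two curve-based metrics on a flattened prediction matrix might look like the sketch below (using scikit-learn; the helper is hypothetical and ignores details such as excluding self-edges and unscored pairs, which the official DREAM scripts handle).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate_grn(pred, gold):
    """pred, gold: (genes x TFs) arrays of edge scores and 0/1 gold-standard labels."""
    y_true, y_score = gold.ravel(), pred.ravel()
    return {
        "AUROC": roc_auc_score(y_true, y_score),
        # average precision is a common step-wise estimate of the AUPR
        "AUPR": average_precision_score(y_true, y_score),
    }
```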

Results

Performance assessment of KBoost

IRMA dataset

The IRMA dataset is a small experimental synthetic network in yeast, which is composed of five genes that regulate each other and are responsive to galactose and glucose22. To implement IRMA in vivo, yeast cells were transduced with expression vectors that contained five single TF promoters linked to TF encoding genes. The five genes used were ASH1, GAL80, SWI5, GAL4 and CBF1. The network can be activated by the addition of galactose and silenced by glucose. Thus, this dataset contains two small steady state sub-datasets: IRMA switch on, and IRMA switch off, respectively. The expression of the five genes was measured by q-PCR. Both IRMA switch on and IRMA switch off contain 24 observations22. We used the IRMA switch off dataset to select the default values of the parameters for KBoost, and these were kept fixed for the rest of the benchmark performance assessments (see “Supplementary data S1”). The results show that KBoost performs similarly well in both datasets (Table 1). Furthermore, it had the highest performance in both metrics for the IRMA switch off dataset.

Table 1 IRMA dataset benchmark.

DREAM 4 multifactorial challenge dataset

The DREAM 4 dataset contains several different network inference challenges. We focused only on the multifactorial sub-challenge, as it most closely resembles gene expression data from patients2. This challenge consists of five simulated steady-state expression profiles of 100 genes in 100 samples that were obtained by perturbing all genes simultaneously. The expression levels in the dataset correspond to steady-state levels after perturbations, with added normal and log-normal noise to resemble a real-life microarray experiment. The results show that KBoost, using the default parameters, generally achieved the highest performance compared with state-of-the-art algorithms (Table 2). GENIE3, the original sub-challenge winner, obtained a score of 37.5, whereas KBoost scored 55.93. This score is also clearly higher than those achieved by the other algorithms, which were developed after GENIE3. The results show that for several networks KBoost achieved the highest AUROC and AUPR.

Table 2 DREAM 4 dataset benchmark.

DREAM 5 dataset

The DREAM 5 dataset contains three sets of gene expression data, one from a simulated network (Net 1) and two from real-life networks. The three gene expression sets contain data simulated or obtained from different kinds of experiments, such as drug perturbations, genetic perturbations (gene deletions or overexpression), time series and steady-state sets. The first network (Net 1) is simulated, with 195 TFs and 1643 genes. The second network is from Escherichia coli, with 334 TFs and 4511 genes (Net 3), and the third network is from Saccharomyces cerevisiae, with 333 TFs and 5950 genes (Net 4)3. The results show that KBoost compares favorably to most algorithms and has a similar overall performance to ENNET, a tree gradient boosting algorithm (Table 3). The results also show that most algorithms did not perform well on the experimental networks, Net 3 and Net 4.

Table 3 DREAM 5 dataset benchmark.

Running time

The boosting framework of KBoost allows us to dramatically reduce the number of computational operations. That is because KBoost adds only one TF at a time, and the maximum likelihood solution to the regression on the KPCs of a single TF does not require a matrix inversion, since KPCs coming from one kernel are orthogonal. The resulting worst-case computational complexity is governed by the eigendecomposition in the kernel PCA, \(O(p\cdot {n}^{3})\), where \(n\) is the number of observations and \(p\) the number of TFs. This might seem extreme, but GRN inference datasets typically have at most a few hundred observations.

We ran all algorithms on the same desktop with an Intel® Core™ i7-8700 CPU at 3.70 GHz, with 6 cores, 12 logical cores and 32 GB of RAM. The results show that KBoost is much faster than all the other algorithms on all datasets. The summary of all the benchmark analyses is shown in Table 4. KBoost has the best performance for all datasets except the IRMA switch on dataset, where it performed second best. However, KBoost dramatically shortens running times, particularly for the DREAM 5 dataset.

Table 4 Results summary of running time and performance.

Effect of an informative prior network

For several organisms, many years of research have linked certain TFs with specific genes. This information can be included in KBoost in the form of the prior \({\varvec{W}}\). In the previous sections, we used a \({\varvec{W}}\) with 0.5 for every element, giving every TF-gene relationship the same prior probability. In this section, we explore what could happen in a real-life setting if the user had prior information from previous research studies linking certain TFs to certain genes. We used the gold standards of the DREAM 4 multifactorial dataset as prior networks and added noise by randomly modifying different fractions of the edges. We gave the edges present in the prior networks a weight of 0.6 and the edges that were absent a weight of 0.4. The results show that even with a prior network with relatively high levels of noise (up to 50% of the edges added or deleted from the original network), the average AUPR and AUROC are higher than when a noninformative prior was used (Fig. 4). All the posterior networks had a higher AUPR than the corresponding prior networks, while for the AUROC only the prior with 5% of the edges modified had a higher AUROC than the corresponding posterior network. In general, this shows the potential to increase the accuracy and precision of the predictions by including prior information (Fig. 4).
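As an illustration of how such a noisy prior could be assembled, the sketch below flips a fraction of the entries of a gold-standard adjacency matrix and then maps present edges to 0.6 and absent edges to 0.4, as described above. The helper is a hypothetical reconstruction of the procedure, not the exact script used for the experiment.

```python
import numpy as np

def noisy_prior(gold, flip_fraction, seed=0):
    """gold: (genes x TFs) 0/1 adjacency matrix. Returns the prior matrix W."""
    rng = np.random.default_rng(seed)
    noisy = gold.copy().astype(int)
    n_flip = int(flip_fraction * noisy.size)
    idx = rng.choice(noisy.size, size=n_flip, replace=False)
    noisy.flat[idx] = 1 - noisy.flat[idx]        # randomly add or delete edges
    return np.where(noisy == 1, 0.6, 0.4)        # prior probability per edge
```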

Figure 4

Effect of an Informative Prior in KBoost on the DREAM 4 Multifactorial Dataset. The red line represents the average of the prior networks, the green line the average results with a non-informative prior, and the black line the average of the posterior networks with an informative prior. The results highlight the advantage of having prior information even if it is limited to a few TF-target gene pairs. They also show that when the prior information is not accurate it can have a detrimental effect on performance.

Application to the METABRIC dataset

To examine the usefulness of KBoost for biological research, we applied it to the METABRIC breast cancer dataset downloaded from cBioPortal24. This dataset contains data for 1904 patients, of whom 1898 have been stratified into subtypes. Breast cancer was one of the first cancer types in which gene expression profiles were used for a molecular stratification into subtypes, which correlate with disease phenotypes, prognosis and treatment response25,26. Studies have identified four major breast cancer subtypes (luminal A, luminal B, HER2-enriched and basal), a claudin-low subtype and a normal-like group. The claudin-low subtype is sometimes described as a particularly aggressive form of the basal subtype27,28. We wanted to test whether the subtypes have characteristic GRNs that can be used to distinguish them. For this, we reconstructed a network for each of the six subtypes using KBoost with a prior based on the GRN obtained from ChIP-Seq experiments on several TFs29. We gave the edges seen in the prior network a prior probability of 0.6, while for TFs in the prior network but not in our system, we gave a probability of 0.4 to the edges that were not observed in the ChIP-Seq data. The dataset contains microarray experiments with expression levels for 24,368 genes. To identify TFs, we overlapped the list of 24,368 genes in METABRIC with a previously published list of human TFs30. KBoost was run on a desktop with an Intel® Core™ i7-8700 CPU at 3.70 GHz, with 6 cores, 12 logical cores and 32 GB of RAM. The resulting network was filtered by keeping edges with a posterior probability higher than 0.2. This threshold was found by using the F1 metric on the IRMA switch off dataset. The whole dataset of 24,368 genes by 1345 TFs and 1898 patients was run in 104 min. This is less than what most algorithms took for the DREAM 5 challenge, which is a significantly smaller dataset.

The results show that KBoost reconstructed GRNs that distinguish the five main subtypes: Normal-like, Luminal A, Luminal B, HER2-enriched, and Basal/triple-negative breast cancer (TNBC) (Fig. 5 and Table 5). We analyzed the GRNs using the closeness centrality measure, which is an indication of the number of genes a TF regulates directly or indirectly.
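One way to compute such a centrality profile from a KBoost output is sketched below: the posterior matrix is thresholded (0.2, as above), turned into a directed graph, and each TF's outgoing closeness centrality is computed with networkx. The helper and its arguments are illustrative assumptions, not the analysis code used in the paper.

```python
import networkx as nx
import numpy as np

def tf_closeness(posterior, gene_names, tf_names, threshold=0.2):
    """posterior: (genes x TFs) matrix of posterior edge probabilities;
    gene_names / tf_names: lists of node labels matching the matrix axes."""
    G = nx.DiGraph()
    G.add_nodes_from(gene_names)
    G.add_nodes_from(tf_names)
    rows, cols = np.where(posterior > threshold)        # keep edges above the cut-off
    G.add_edges_from((tf_names[c], gene_names[r]) for r, c in zip(rows, cols))
    # networkx closeness_centrality follows incoming edges by default, so use the
    # reversed graph to score how close each TF is to the genes it regulates.
    closeness = nx.closeness_centrality(G.reverse())
    return {tf: closeness[tf] for tf in tf_names}
```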

Figure 5

Closeness Centralities of Breast Cancer Subtypes. The peaks correspond to the closeness centrality of different TFs among the subtypes. The results show that breast cancer subtypes have differences in their underlying GRNs.

Table 5 METABRIC summary.

Discussion

Inferring GRNs from gene expression data is an important task in bioinformatics. KBoost is a novel algorithm that uses KPCR, boosting and BMA coupled with greedy model selection. Both boosting and BMA have been used in this setting before. ENNET and GRNBoost2 use regression tree boosting, while TIGRESS uses least angle regression, which can be considered a form of boosting. BMA, on the other hand, has mainly been used with linear regression. KPCR has advantages over both regression trees and linear regression. Unlike linear regression, neither KPCR nor regression trees require the relationship between a TF and its target gene to be linear. However, regression trees produce discontinuous functions, which only approximate a continuous function once many trees are combined by random forests or boosting. In contrast, KPCR directly yields continuous functions.

We show that KBoost has a very competitive performance using three different datasets. KBoost had the highest score on the multifactorial subset of DREAM 4 and the IRMA switch off dataset and tied for first place with ENNET in DREAM 5. Furthermore, though we did not use any prior information for the benchmarking, we show that in the presence of prior information the performance of KBoost can increase dramatically. This is particularly relevant as there are several datasets with predicted or observed TF binding sites, ChIP-seq networks and other forms of prior information available. In terms of running time, the advantage of using KBoost is especially clear, as it was the fastest among all algorithms across all datasets. In addition, we were able to run KBoost on the METABRIC dataset containing over 24,000 genes with over 1300 TFs and over 1800 patients in less than two hours on a standard desktop computer.

We compared the GRNs from different breast cancer subtypes using closeness centrality. Closeness centrality is a measure of a node's importance in a network and is calculated from the number of other nodes to which it is directly and indirectly connected. Different subtypes showed differences in TF centrality. The patterns can easily be distinguished by the naked eye and correlate with increasing aggressiveness. According to the current transcriptome-based classification24,28, the following phenotypes can be discerned: normal or normal-like, the most benign phenotype, which may simply reflect normal breast tissue due to a low tumor cell content in the samples; luminal A and luminal B, which are relatively benign; HER2-enriched, which is more aggressive; and basal, which has the worst prognosis.

In the claudin-low subgroup, the gene FOXM1 had the highest closeness centrality. FOXM1 regulates the cell cycle and is required for the proliferation of normal breast epithelia, but it also promotes the proliferation of malignant cells and is often overexpressed in breast cancer31. The Luminal A and B GRNs were similar. In both GRNs, the TF with the highest closeness centrality was REPIN1/AP4. This gene is a zinc finger protein that has previously been associated with enhanced proliferation, migration and cisplatin resistance in breast cancer cells32. Interestingly, REPIN1/AP4 was also reported to reduce breast cancer cell proliferation33, which may be related to the slow-growing phenotype of luminal A/B breast cancers. The gene AKAP8L had the highest closeness centrality in HER2-enriched breast cancer. Previous studies have found AKAP8L to be associated with metastasis suppression in luminal A, luminal B and HER2-enriched breast cancer34. For the basal subtype, the gene TBX21 had the highest closeness centrality. This gene is often overexpressed in breast cancer, indicating poor prognosis35. Finally, we found the gene TSC22D1 (TGFβ stimulated clone-22) to be important for the normal-like subtype GRN. TSC22D1 seems to have a dual role. It was identified as a tumor suppressor in several cancer types, including breast cancer36. However, its expression has also been reported to correlate with tamoxifen resistance and breast cancer recurrence37. TSC22D1 functions in the TGFβ pathway, which can suppress tumorigenesis in early stages but promote it in later stages38.

Together, these networks highlight the gene expression differences between the breast cancer subtypes that might contribute to their observed differences in outcome. As this molecular classification of breast cancer subtypes is based on gene expression data25,26, it is not surprising that the subtypes differ in their GRNs. However, the identification of the TFs responsible is useful: it suggests pathogenetic mechanisms, and the dimensionality reduction associated with classifying molecular phenotypes based on GRN reconstruction facilitates the analysis and identification of potentially actionable targets.