## Abstract

Reconstructing gene regulatory networks is crucial to understand biological processes and holds potential for developing personalized treatment. Yet, it is still an open problem as state-of-the-art algorithms are often not able to process large amounts of data within reasonable time. Furthermore, many of the existing methods predict numerous false positives and have limited capabilities to integrate other sources of information, such as previously known interactions. Here we introduce KBoost, an algorithm that uses kernel PCA regression, boosting and Bayesian model averaging for fast and accurate reconstruction of gene regulatory networks. We have benchmarked KBoost against other high performing algorithms using three different datasets. The results show that our method compares favorably to other methods across datasets. We have also applied KBoost to a large cohort of close to 2000 breast cancer patients and 24,000 genes in less than 2 h on standard hardware. Our results show that molecularly defined breast cancer subtypes also feature differences in their GRNs. An implementation of KBoost in the form of an R package is available at: https://github.com/Luisiglm/KBoost and as a Bioconductor software package.

### Similar content being viewed by others

## Introduction

Gene regulatory networks (GRNs) are models that describe how transcription factors (TFs) orchestrate the expression of other genes^{1,2,3}. In GRNs the nodes are TFs and genes, and the edges represent regulatory interactions. TFs bind to the promoters of genes to activate or silence the production of mRNA^{4}. By doing so, TFs help control gene expression and thus modulate or enable cellular processes^{5}. Most existing approaches of studying gene regulation focus on the effect of a single TF on the rest of the genes. However, TFs can regulate the expression of other TFs and thus a perturbation of a single TF can propagate information throughout the whole system. The potential for a single TF to control a system depends on certain network features: a TF that regulates many genes will obviously have a large influence, but the contrary is not necessarily true. Even if a TF regulates only a small number of genes directly, it can still play a major role in the system if these genes in turn regulate many others. This is the concept of closeness centrality in graph theory. Therefore, understanding a system’s GRN architecture can reveal the TFs that are important for a specific phenotype and thus can explain molecular pathogenesis and reveal potential drug targets.

Several methods have been developed to predict GRNs using gene expression data^{2,3,6,7,8,9}. These methods formulate the problem of inferring a GRN as an unsupervised classification problem. Typically, a GRN is formulated as a weighted graph, trained on unlabeled data, whose edge values represent the predicted probability that each TF regulates other genes (including other TFs).

Several groups have used different algorithms based on different mathematical formulations to infer GRNs from gene expression data. These include Bayesian networks, correlation metrics, mutual information methods and parametric and non-parametric regression. A seminal paper published in 2012 showed that correlation, mutual information and Bayesian networks tended to perform far worse than methods based on regression^{3}. For this reason, in this work we focused only on regression based GRN inference methods.

Regression based GRN inference methods build a mathematical model of the expression levels of a target gene given the expression levels of different TFs. The central assumption in these methods is that if the expression level of a TF predicts the expression level of a target gene it is likely regulating it. The model goodness of fit is used to estimate the probability that the TFs in a model regulate the target gene. That is, if the expression of a TF poorly predicts the expression of a gene, then it is not likely to regulate it and vice versa. Some of the popular regression based GRN inference algorithms are summarized in Fig. 1.

Parametric GRN inference algorithms use a mathematical formulation that describes gene regulation using a predefined expected form. The most popular algorithms belonging to this class are linear regression based. In other words, they assume that the expression of a target gene is a linear combination of the expression levels of the TFs that regulate it^{8,9,10,11}. PLSNET and TIGRESS both use a linear regression approach^{8,9}. PLSNET uses a form of partial least squares known as SIMPLS^{8}, while TIGRESS uses the least angle regression algorithm (LARS)^{9}. LARS is an effective approach for obtaining different solutions for the linear regression coefficients with different absolute norms^{12}. This algorithm takes an iterative approach: at each step, the residuals between the response variable and the linear regression model are compared with the explanatory variables, and because the models are trained iteratively on the residuals it can be described as a boosting algorithm^{12}. TIGRESS couples LARS with stability selection and uses the frequency with which each TF is chosen in iterations of LARS as a proxy for the predictive probability that a TF regulates a gene^{9}. The effectiveness of linear model-based approaches relies on how well a target gene’s expression levels can be approximated using linear functions.

For this reason, nonparametric regression methods have been proposed. In particular, regression trees have been very successful for GRN inference in different datasets^{2,6,7}. Regression trees separate the explanatory variables into regions and give the response variable a value depending on what region of the explanatory variables each observation resides in^{13,14}. The GENIE3 method uses the regression version of the random forest algorithm^{2,13}—this algorithm iteratively samples different observations and explanatory variables and builds regression trees with a fixed depth. The different trees’ predictions are then combined to form the final prediction^{13}. GENIE3 runs the random forest algorithm once per target gene. At each step, the variable importance of each TF is calculated as the proportion of explained variance of the target gene. The GENIE3 algorithm won two important GRN inference challenges, namely the DREAM 4 multifactorial sub-challenge and the DREAM 5 challenge^{2}. The datasets for these challenges are now widely used benchmarks to test the performance of different GRN inference algorithms. Besides the random forest approach, regression tree models can also be trained using a stochastic gradient boosting approach. In the context of GRN inference, ENNET and GRNBoost2 are based on regression tree stochastic gradient boosting^{6,7}. Stochastic gradient boosting is a form of a gradient based function optimization applied on model space^{14}. Very generally speaking, tree boosting is similar to LARS except that it uses tree models instead of linear models^{12}.

Here we present KBoost, a method that uses kernel PCA (Principal Component Analysis) regression and gradient boosting to reconstruct gene regulatory networks. Kernel PCA regression (KPCR) is a nonparametric regression technique that does not require the data to follow a strict form^{15,16}. This is accomplished by transforming the explanatory variables, in our case the TF expression levels, using a kernel function. The kernel function yields a symmetric matrix that corresponds to the dot product of nonlinear features that are a function of the explanatory variables. For many kernel functions these implicit nonlinear features correspond to polynomials of the explanatory variables^{17}. In fact, the radial basis kernel function (RBF) is the dot product of an infinite number of polynomials without having to calculate them explicitly^{18}. Using polynomials for regression is an attractive approach, as according to the Weierstrass approximation theorem any continuous real function can be uniformly approximated by polynomials. Furthermore, kernel PCA lets us obtain the principal components of this large number of polynomials by using the kernel matrix with much smaller dimensions. Thus, with KPCR we can use the first principal components of a large number of implicit features to approximate any continuous function—in our case the relationship between a TF and a target gene.

For simplicity, instead of calculating KPCs of the expression levels of multiple TFs, we build an ensemble of models of KPCs. This is done through gradient boosting, which iteratively combines predictions from different models, in our case KPCRs from different TFs^{7,14}. For each gene, we start by selecting a TF whose KPCs predict its expression levels the best according to the posterior. Then, we search among every TF’s KPCs to see which can predict the residuals, that is the difference with the original prediction and the actual gene expression values. At the end of each iteration, we update the predictions, by adding the model fit on the residuals. This greedy model search that allows us to drastically reduce the running time. The posterior probabilities of the models queried are used to estimate the GRN. Because of this Bayesian formulation, we can include information from previous experimental data and increase the accuracy of our predictions^{11,19}. We tested KBoost against other algorithms on several benchmark datasets. The results show that KBoost performs better than other algorithms even in the absence of prior information.

## Methods

### KBoost overview

The aim of KBoost is to provide a fast and scalable algorithm for the accurate inference of GRNs (Fig. 2). It takes mRNA expression (microarray or RNA-seq) and, optionally, previously found TF-target interactions (e.g. ChIPseq) as input. KBoost uses KPCR and boosting coupled with Bayesian model averaging (BMA) to estimate the probabilities of genes regulating each other, and thereby reconstructs GRNs^{10,11,20}. We use a greedy approach to sample the model space. That is, for every gene we build a mathematical model that predicts its expression using the kernel PCA of the expression levels of a likely subset of TFs. We then compare different KPCR TF-gene models and estimate the probability that a TF regulates a gene using BMA. The model fit is used as a proxy of the probability that a set of TFs regulate a specific gene. If there is information from other sources on likely TF targets, it can be combined with the model fit in the form of a prior. The output of KBoost is then an estimate of the probability that each TF regulates each gene.

#### Modelling gene expression

We formulate the data as \(n\) independent observations \({\varvec{x}}={\left[{x}_{1}, \dots , {x}_{n}\right]}^{T}\), each with real-valued measurements of G genes: \({x}_{i}=({x}_{i}^{\left(1\right)}, \dots , {x}_{i}^{\left(G\right)})\). We assume that each gene’s \(j\) expression, \({x}_{i}^{(j)}\), is a linear combination of non-linear functions of the expression of a subset of the gene’s TFs, noted as \({A}_{j}\), plus some random noise \({\varepsilon }_{i}^{(j)}\), which comes from a normal distribution with zero mean and variance \({\sigma }_{j}^{2}\):

#### Gene regulatory network inference as unsupervised classification

The gene regulatory network, \({\varvec{C}}\), would then be defined as a directed graph with *G* nodes, where each node is a gene. We write \({\varvec{C}}\) as a \(G\times P\) matrix, where \(P\) is the number of genes in \(G\) that are TFs. The entry \({C}_{j,p}\) gets the value 1 if TF \(p\) regulates gene \(j\) and 0 otherwise. As many other algorithms, we formulate the problem of estimating \({\varvec{C}}\) from \({\varvec{x}}\) as an unsupervised classification problem, where the fit of different subsets of TFs in Eq. (1) is used to estimate the posterior probability that any TF regulates any gene; in other words, how well different combinations of TFs, \({A}_{j}\), fit the data \({x}_{i}^{(j)}\). We can then consider all the combinations of TF subsets \({A}_{j}=({A}_{j}^{\left(1\right)},\dots , {A}_{j}^{\left({2}^{P}\right)}\), where \({A}_{j}^{\left(d\right)}\) contains the integer indexes of the genes that are TFs and are part of subset \(d\). The predictive probability of an element of \({C}_{j,p}\) to be 1:

Here, \(P\left({A}_{j}^{\left(d\right)}\right)\) is the marginal likelihood of the model in Eq. (1) given the subset \({A}_{j}^{\left(d\right)}\), and \(\pi \left({A}_{j}^{(d)}\right)\) is the prior that the TFs in this subset \({A}_{j}^{(d)}\) regulate gene \(j\). We formulate the prior probability of a subset of TFs \({A}_{j}^{(d)}\) to regulate a gene, \(j\), as coming from a binomial distribution:

Here, the parameters of the prior are in a matrix \({\varvec{W}}=[{w}_{1}, \dots , {{w}_{G}]}^{T}\) with the same dimensions as the gene regulatory network \({\varvec{C}}\). For simplicity the marginal likelihood, \(P\left({A}_{j}^{\left(d\right)}\right)\), is obtained by assuming \(\sum_{p \in { A}_{j}}f({x}_{i}^{(k)})\) to be fixed. This assumption facilitates the inference process and reduces the computation time required in KBoost. We give the variance of the noise Jeffreys’ prior, \(\left({\sigma }_{j}^{2}\right)\propto \frac{1}{{\sigma }_{j}^{2}}\). This yields the following expression for the marginal likelihood:

#### Greedy model selection and boosting

So far, we have omitted the details of two crucial tasks, first how we calculate \(\sum_{p \in { A}_{j}}f\left({x}_{i}^{\left(k\right)}\right)\), and second how we go through the different subsets \({\mathrm{A}}_{\mathrm{j}}^{\left(\mathrm{d}\right)}\). For the second task, we chose a greedy approach where we iteratively select the TF subsets with the highest posteriors \(\mathrm{P}\left({\mathrm{A}}_{\mathrm{j}}^{\left(\mathrm{d}\right)}\right)\) and expand them for each gene. This choice is motivated by two goals: (1) to reduce the computational burden of calculating \(\mathrm{P}\left({\mathrm{A}}_{\mathrm{j}}^{\left(\mathrm{d}\right)}\right)\) for all possible combinations of TFs, and (2) to reduce the number of false positives. Genes that are regulated by the same set of TFs will have a high correlation between their expression levels. We assume that for each gene \(j\), there are several \({\mathrm{A}}_{\mathrm{j}}^{\left(\mathrm{d}\right)}\) which yield a relatively high \(\mathrm{P}\left({\mathrm{A}}_{\mathrm{j}}^{\left(\mathrm{d}\right)}\right)\) due to coregulation and thus we perform a greedy search keeping only the TFs that yield the highest \(\mathrm{P}\left({\mathrm{A}}_{\mathrm{j}}^{\left(\mathrm{d}\right)}\right)\) at each boosting iteration. This is similar to the LARS procedure, except that in Eq. (2) we include all explored models, like in the Occam Razor version of BMA^{12,21}. Therefore, Eq. (5) uses only the subsets \(\widehat{{A}_{j}}\) explored by our greedy search, i.e.:

#### Kernel principal components

Regarding the task of calculating \(\sum_{p \in { A}_{j}}f\left({x}_{i}^{\left(k\right)}\right)\), we chose to use KPCR because it has three important advantages in a boosting framework. First, KPCR is a kernel-based regression method which can approximate any potential shape that the relationship between a TF a target gene might have. We use the RBF kernel function which is the exponential of minus the squared Euclidean distance between observations divided by a width hyperparameter. KPCR is a technique that allows to compute the principal components in the infinite dimensional feature space directly from the kernel matrix. A more detailed description is presented in the appendix. Second, unlike other kernel regression methods, which would need an \(n\times n\) kernel matrix per TF, in KPCR we only need a few principal components per TF. This means a substantial reduction in memory usage while trying out different subsets \({A}_{j}^{\left(d\right)}\). Finally, the boosting framework allows us to effectively combine principal components from kernels of different TFs. The formulation is shown below.

For each TF \(p\), we first calculate the RBF kernel with width parameter \(\gamma \):

Then, we normalize the covariance matrix of the infinite dimensional features implicitly:

here, \({1}_{nn}\) is an \(n\times n\) matrix with all the values equal to \(\frac{1}{n}\). Next, we perform an eigendecomposition of \(\overline{K }\):

where \(\alpha \) are the eigenvectors and \(\lambda \) is a diagonal matrix with the eigenvalues in descending order The first \(z\) KPCs are then obtained by

In KBoost we select KPCs by using a threshold on their corresponding eigenvalues. The KPCs are then normalized so that they have unit norm. After obtaining \(g\left({x}_{i}^{\left(p\right)}\right)\), the matrices \(\overline{{K }^{(p)}}\) and \({K}^{(p)}\) can be overwritten and thus this procedure allows us to reduce the amount of memory required. Note that the columns of \(g\left({x}_{i}^{\left(p\right)}\right)\) are orthogonal to each other. This means that for a model with one TF, \({A}_{j}^{(d)}=\{p\}\), we can fit the model in Eq. (1) by:

here, \({{\widehat{\beta }}_{j}}^{\left(p\right)}\) are the coefficients of the KPC regression multiplied by a constant \(0<\nu \le 1\), and \({{\lambda }^{*}}^{\left(p\right)}\) is a diagonal matrix whose entries are the reciprocals of the eigenvalues that correspond to the KPCs. The constant \(\nu \) is known as a shrinkage parameter. We propose using \(\nu \) to counteract the effect that the number of observations \(n\) has on the marginal likelihood. Since the marginal likelihood changes exponentially to the power of \((n-1)/2\), the resulting GRNs would change dramatically depending on \(n\). Namely, with a large \(n\), most TFs posterior probability would be close to zero while a few would be close to one. This can lead to an overestimation of the confidence we have on the predicted GRN. By shrinking the estimates of \({{\widehat{\beta }}_{j}}^{\left(p\right)}\), we introduce a heuristic measure that will make the resulting sum of squares less different between TF subsets, \({A}_{j}^{(d)}\), and thus reduce the effect of \(n\).

#### Boosting

The last equation (Eq. 10) shows how we fit the model in Eq. (1) in the case that \({A}_{j}^{(d)}\) contains 1 TF. For simplicity we then follow a greedy boosting approach that is very similar to the least angle regression method. In KBoost, we first fit every single TF subset \({A}_{j}^{(d)}\) (i.e. only one TF in each model) and calculate their unnormalized posteriors. Then, we select the TF with the highest unnormalized posterior, \({p}^{*}\), and calculate its residuals, \({\varepsilon }^{(j)}\). Next, we form new subsets by iteratively adding a TF, and fit the TFs to the residuals:

Similarly, to the previous step, we calculate their unnormalized posterior and select the TF with the highest posterior to expand next (Fig. 3). When we use a \(\nu \) under 1, there is a possibility that the same TF will be chosen more than once. This might reflect a TF having a very strong relationship with a particular target gene. At the end we estimate \(P\left({C}_{j,p}=1|x\right)\) by performing BMA on all explored models. In KBoost, the number of iterations represents also the maximum number of TFs per gene to be considered. The resulting network is tuned using the heuristic proposed by Slawek and Arodz^{7}, in which we multiply each column \(p\) of \(P\left({C}_{j,p}=1|x\right)\) by the sample variance of the whole column. This has the effect of preserving the sparse outdegree distribution found in most biological networks. KBoost is available as an R package at: https://github.com/Luisiglm/KBoost and as a Bioconductor software package.

### Performance assessment

We compared KBoost against five other GRN reconstruction methods that were chosen because of their high performance in previous benchmarking exercises, such as the DREAM 4 and DREAM 5 challenges. Specifically, the algorithms included are GENIE, PLSNET, ENNET, TIGRESS, and GRNBoost2^{2,6,7,8}. All algorithms were run using their default parameters, including KBoost (\(if \, n>10: v= \frac{10}{n}, else \, v=0.5,\), \(\gamma =60\) and 3 iterations). We used three different datasets to assess the comparative performance of our algorithm: IRMA^{22}, the in silico networks from the DREAM 4 multifactorial sub-challenge^{2}, and the DREAM 5 dataset, which contains three sets of gene expression data—one from a simulated network and two from real-life networks^{3}. Before the analysis we standardized each gene’s expression to a sample variance of one and a sample mean of zero. For this comparison, we did not use any prior information on the relationships between TFs and potential targets. Therefore, we gave every TF a probability of 0.5 of regulating any gene, so that every possible combination of TFs had the same prior probability.

Benchmarking consisted of comparing a predicted GRN with the real GRN using common performance metrics such as the area under the precision recall curve (AUPR) and the area under the receiver operator curve (AUROC). Both methods work by changing a threshold of probability from close to 0 to 1. Regulatory interactions with a probability higher than the threshold are kept while those below the threshold are discarded. The inferred regulatory interactions are compared with the real GRN. To obtain AUPR we calculate the precision (number of correctly predicted interactions divided by the total number of predicted interactions) and the recall (number of correctly predicted interactions divided by the total number of actual interactions) at each threshold. Then, both precision and recall are plotted and the area under the curve is calculated. Similarly, for AUROC, at each threshold we calculate the recall and the false positive rate (number of incorrectly predicted interactions divided by the actual number of non-interactions). The recall is plotted against the false positive rate and the area under the curve is calculated. The AUROC value ranges from 0 to 1; 0.5 represents the number that would be achieved by random guessing and 1 is the perfect score^{23}. Besides using the AUROC and AUPR, we also use the specific scores developed for the DREAM 4 and DREAM 5 challenges. The authors of the DREAM 4 and DREAM 5 challenge devised a score to benchmark different methods on their datasets. They calculated these scores using the mean of the log10 of the empirical P-value to guess a network randomly of the same accuracy as the one inferred by each specific method^{3}. For consistency, we used the DREAM 4 and DREAM 5 evaluation scripts to compare the different algorithms.

## Results

### Performance assessment of KBoost

#### IRMA dataset

The IRMA dataset is a small experimental synthetic network in yeast, which is composed of five genes that regulate each other and are responsive to galactose and glucose^{22}. To implement IRMA in vivo, yeast cells were transduced with expression vectors that contained five single TF promoters linked to TF encoding genes. The five genes used were ASH1, GAL80, SWI5, GAL4 and CBF1. The network can be activated by the addition of galactose and silenced by glucose. Thus, this dataset contains two small steady state sub-datasets: IRMA switch on, and IRMA switch off, respectively. The expression of the five genes was measured by q-PCR. Both IRMA switch on and IRMA switch off contain 24 observations^{22}. We used the IRMA switch off dataset to select the default values of the parameters for KBoost, and these were kept fixed for the rest of the benchmark performance assessments (see “Supplementary data S1”). The results show that KBoost performs similarly well in both datasets (Table 1). Furthermore, it had the highest performance in both metrics for the IRMA switch off dataset.

#### DREAM 4 multifactorial challenge dataset

The DREAM 4 dataset contains several different network inference challenges. We focused only on the multifactorial sub-challenge as it more closely resembles gene expression data from patients^{2}. This challenge consists of five simulated steady-state expression profiles of 100 genes in 100 samples that were obtained by perturbing all genes simultaneously. The expression levels in the dataset correspond to steady state levels after perturbations with added normal and log-normal noise to resemble a real-life microarray experiment. The results show that KBoost, using the default parameters, generally achieved the highest performance, compared to state-of-the-art algorithms (Table 2). GENIE3, the original sub-challenge winner, had obtained a score of 37.5, whereas KBoost scores 55.93. This score is also clearly higher than that achieved by the other algorithms, which were developed after GENIE3. The results show that for several networks KBoost achieved the highest AUROC and AUPR.

#### DREAM 5 dataset

The DREAM 5 dataset contains three sets of gene expression data, one from a simulated network (Net 1) and two from real-life networks. The three gene expression sets contain data simulated or obtained from different kinds of experiments such as drug perturbations, genetic perturbations (gene deletions or overexpression), time series sets and steady-state sets. The first network is a simulated network with 195 TFs and 1643 genes. The second network is from *Escherichia coli* with 334 TFs and 4511 genes (Net 3), and the third network is from *Saccharomyces cerevisiae* and has 333 TFs and 5950 genes (Net 4)^{3}. The results show that KBoost compares favorably to most algorithms and has a similar overall performance as ENNET, a tree gradient boosting algorithm (Table 3). The results also show that most algorithms did not perform well on the experimental networks, Net 2 and Net 3.

#### Running time

The boosting framework of KBoost allows us to dramatically reduce the number of computational operations. That is because KBoost adds only one TF at a time, and the maximum likelihood solution to the regression of KPCs of only one TF does not require an inversion, since KPCs coming from one kernel are orthogonal. The resulting worst case computational complexity is governed by the eigendecomposition of the KPCA \(O({p\cdot (n}^{3}))\) \(n\) the number of observations and \(p\) the number of TFs. This might seem extreme but typically GRN inference datasets have at most a few hundred of observations.

We ran all algorithms on the same desktop with an Intel^{®} Core™ i7-8700 CPU with 370 GHz, 6 cores and 12 logical cores, and 32 GB of RAM. The results show that KBoost is much faster for all datasets compared to all the other algorithms. The summary of all the benchmark analysis is shown in Table 4. KBoost has the best performance for all datasets except for the IRMA switch on dataset, where it performed second best. However, KBoost dramatically shortens running times, particularly for the DREAM 5 dataset.

#### Effect of an informative prior network

For several organisms, many years of research have linked certain TFs with specific genes. This information can be included into KBoost in the form of the prior \({\varvec{W}}\). In the previous sections, we used a \({\varvec{W}}\) with 0.5 for every of its element, giving every TF-gene relationship the same prior probability. In this section, we explore what could happen in a real-life setting, if the user had prior information from previous research studies linking certain TFs to certain genes. We used the gold standards of the DREAM 4 multifactorial dataset as prior networks and added some noise by randomly modifying different fractions of the edges. We gave the edges present in the prior networks a weight of 0.6 and the edges that were absent a weight of 0.4. The results show that even with a prior network with relatively high levels of noise (up to 50% of the edges were added or deleted from the original network) the average AUPR and AUROC are higher than when a noninformative prior was used (Fig. 3). All the posterior networks had a higher AUPR than the corresponding prior networks, while for the AUROC only the prior with 5% of the edges modified had a higher AUROC in the prior than the posterior network. In general, this shows the potential to increase the accuracy and precision of the predictions by including prior information (Fig. 4).

#### Application to the METABRIC dataset

To examine the usefulness of KBoost to biological research we applied KBoost to the METABRIC dataset of breast cancer downloaded from cBioportal^{24}. This dataset contains data for 1904 patients, of which 1898 have been stratified into subtypes. Breast cancer was one of the first cancer types where gene expression profiles were used for a molecular stratification of cancer subtypes, which correlate with disease phenotypes, prognosis and treatment response^{25,26}. Studies have identified 4 major breast cancer subtypes (luminal A, luminal B, HER2-enriched and basal), a claudin-low subtype and a normal-like group. The claudin-low is sometimes described as a particularly aggressive form of the basal subtype subtype^{27,28}. We wanted to test whether the subtypes have characteristic GRNs that can be used to distinguish them. For this, we reconstructed a network for each of the six subtypes using KBoost with a prior based on the GRN obtained from ChIP-Seq experiments on several TFs^{29}. We gave the edges seen in the prior network a prior probability of 0.6, while for TFs in the prior network but not on our system, we gave 0.4 probability to the edges that were not observed in the ChIP-Seq. The dataset contains microarray experiments with expression levels for 24,368 genes. To identify TFs we overlapped the list of 24,368 genes in METABRIC with a previously published list of human TFs^{30}. KBoost was ran on a desktop with an Intel^{®} Core™ i7-8700 CPU with 370 GHz, 6 cores and 12 logical cores, and 32 GB of RAM. The resulting network was filtered by keeping edges with a posterior probability higher than 0.2. This threshold was found by using the F1 metric on the IRMA Off dataset. The whole dataset of 24,368 genes by 1345 TFs and 1898 patients was run in 104 min. This is less than what most algorithms took for the DREAM 5 challenge which is significantly smaller.

The results show that KBoost reconstructed GRNs that distinguished the five main subtypes, Normal-like, Luminal A, Luminal B, HER2 enriched, and Basal/Triple negative breast cancer (TNBC) (Fig. 5 and Table 5). KBoost. We analyzed the GRNs using the closeness centrality measure, which is an indication of the number of genes a TF regulates directly or indirectly.

## Discussion

Inferring GRNs from gene expression data is an important task in bioinformatics. KBoost represents a novel algorithm that uses KPCR, boosting and BMA coupled with greedy model selection. Both boosting and BMA have been previously used in this setting. ENNET and GRNBoost2 use regression tree boosting, while TIGRESS uses a least angle regression which can be considered a form of boosting. On the other hand, BMA has mainly been used with linear regression. KPCR has advantages over using regression trees and linear regression. Both KPCR and regression trees do not need the relationship between a TF and its target gene to be linear, unlike linear regression. However, regression trees produce discontinuous functions which can be added together after several iterations to approximate a continuous function by using random forests or boosting. In contrast, KPCR directly yields continuous functions.

We show that KBoost has a very competitive performance using three different datasets. KBoost had the highest score on the multifactorial subset of DREAM 4 and the IRMA switch off dataset and tied for first place with ENNET in DREAM 5. Furthermore, though we did not use any prior information for the benchmarking, we show that in the presence of prior information the performance of KBoost can increase dramatically. This is particularly relevant as there are several datasets with predicted or observed TF binding sites, ChIP-seq networks and other forms of prior information available. In terms of running time, the advantage of using KBoost is especially clear, as it was the fastest among all algorithms across all datasets. In addition, we were able to run KBoost on the METABRIC dataset containing over 24,000 genes with over 1300 TFs and over 1800 patients in less than two hours on a standard desktop computer.

We compared the GRNs from different breast cancer subtypes using the closeness centrality. Closeness centrality is a measure of a node’s importance in a network and is calculated using the number of other nodes to which it is directly and indirectly connected. Different subtypes showed differences in TF centrality. The patterns can easily be distinguished by the naked eye and correlate with increasing aggressiveness. According to the current transcriptome based classification^{24,28} the following phenotypes can be discerned: normal or normal-like is the most benign phenotype that may simply reflect normal breast tissue due to a low tumor cell content in the samples; luminal A and luminal B, which are relatively benign; HER2 enriched, which is more aggressive; and basal which has the worst prognosis. In the claudin-low subgroup, the gene FOXM1 had the highest closeness centrality. FOXM1 regulates the cell cycle and is required for the proliferation of normal breast epithelia but also promotes the proliferation of malignant cells and is often overexpressed in breast cancer^{31}. The Luminal A and B GRNs were similar. In both GRNs, the TF with highest closeness centrality was REPIN1/AP4. This gene is a zinc finger protein that has been previously associated with enhanced proliferation, migration, and cisplatin resistance in breast cancer cells^{32}. Interestingly, REPIN1/AP4 also was reported to reduce breast cancer cell proliferation^{33}, which may be related to the slow growing phenotype of luminal A/B breast cancers. The gene AKAP8L had the highest closeness centrality in HER2 enriched breast cancer. Previous studies have found AKAP8L to be associated with metastasis suppression in Luminal A, B and HER2 enriched breast cancer^{34}. For the basal subtype, the gene TBX21 had the highest closeness centrality. This gene is often overexpressed in breast cancer indicating poor prognosis^{35}. Finally, we found the gene TSC22D1 (TGFβ stimulated clone-22) to be important for the normal-like subtype GRN. TSC22D1 seems to have a dual role. It was identified as tumor suppressor in several cancer types including breast cancer^{36}. However, it also was described that its expression correlates with tamoxifen resistance and breast cancer recurrence^{37}. TSC22D1 functions in the TGFβ pathway, which can suppress tumorigenesis in early stages but promote it in later stages^{38}. Together these networks highlight the gene expression differences between the breast cancer subtypes that might contribute to their observed differences in outcome. As this molecular classification of breast cancer subtypes is based on gene expression data^{25,26}, it is not surprising that the subtypes differ in their GRNs. However, the identification of the TFs responsible is useful. It suggests pathogenetic mechanisms, and the dimensionality reduction associated with classifying molecular phenotypes based on GRN reconstruction facilitates the analysis and identification of potentially actionable targets.

## References

Alon, U.

*An Introduction to Systems Biology: Design Principles of Biological Circuits*(Chapman & Hall/CRC, 2007).Huynh-Thu, V. A., Irrthum, A., Wehenkel, L. & Geurts, P. Inferring regulatory networks from expression data using tree-based methods.

*PLoS ONE.*https://doi.org/10.1371/journal.pone.0012776 (2010).Marbach, D.

*et al.*Wisdom of Crowds for Robust Gene Network Inference.*Nat. Methods***9**, 796. https://doi.org/10.1038/nmeth.2016 (2012).Whitmarsh, A. J. & Davis, R. J. Regulation of transcription factor function by phosphorylation.

*Cell. Mol. Life Sci.***57**, 1172–1183. https://doi.org/10.1007/pl00000757 (2000).Fischer, M., Grossmann, P., Padi, M. & DeCaprio, J. A. Integration of TP53, DREAM, MMB-FOXM1 and RB-E2F target gene analyses identifies cell cycle gene regulatory networks.

*Nucleic Acids Res.***44**, 6070–6086. https://doi.org/10.1093/nar/gkw523 (2016).Moerman, T.

*et al.*GRNBoost2 and Arboreto: Efficient and scalable inference of gene regulatory networks.*Bioinformatics***35**, 2159–2161. https://doi.org/10.1093/bioinformatics/bty916 (2019).Slawek, J. & Arodz, T. ENNET: Inferring large gene regulatory networks from expression data using gradient boosting.

*BMC Syst. Biol.*https://doi.org/10.1186/1752-0509-7-106 (2013).Guo, S., Jiang, Q. S., Chen, L. F. & Guo, D. H. Gene regulatory network inference using PLS-based methods.

*BMC Bioinform.*https://doi.org/10.1186/s12859-016-1398-6 (2016).Haury, A. C., Mordelet, F., Vera-Licona, P. & Vert, J. P. TIGRESS: Trustful Inference of Gene REgulation using Stability Selection.

*BMC Syst. Biol.*https://doi.org/10.1186/1752-0509-6-145 (2012).Iglesias-Martinez, L. F., Kolch, W. & Santra, T. BGRMI: A method for inferring gene regulatory networks from time-course gene expression data and its application in breast cancer research.

*Sci. Rep.*https://doi.org/10.1038/srep37140 (2016).Young, W. C., Raftery, A. E. & Yeung, K. Y. Fast Bayesian inference for gene regulatory networks using ScanBMA.

*BMC Syst. Biol.*https://doi.org/10.1186/1752-0509-8-47 (2014).Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. Least angle regression.

*Ann. Stat.***32**, 407–451 (2004).Breiman, L. Random forests.

*Mach. Learn.***45**, 5–32. https://doi.org/10.1023/a:1010933404324 (2001).Friedman, J. H. Greedy function approximation: A gradient boosting machine.

*Ann. Stat.***29**, 1189–1232. https://doi.org/10.1214/aos/1013203451 (2001).Scholkopf, B., Smola, A. & Muller, K. R. Nonlinear component analysis as a kernel eigenvalue problem.

*Neural Comput.***10**, 1299–1319. https://doi.org/10.1162/089976698300017467 (1998).Rosipal, R., Girolami, M., Trejo, L. J. & Cichocki, A. Kernel PCA for feature extraction and de-noising in nonlinear regression.

*Neural Comput. Appl.***10**, 231–243. https://doi.org/10.1007/s521-001-8051-z (2001).Scholkopf, B. & Smola, A. J. A short introduction to learning with kernels.

*Adv. Lect. Mach. Learn.***2600**, 41–64 (2002).Sch©œlkopf, B. & Smola, A. J.

*Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond*(MIT Press, 2002).Madigan, D. & Raftery, A. E. Model selection and accounting for model uncertainty in graphical models using Occams window.

*J. Am. Stat. Assoc.***89**, 1535–1546. https://doi.org/10.2307/2291017 (1994).Raftery, A. E., Madigan, D. & Hoeting, J. A. Bayesian model averaging for linear regression models.

*J. Am. Stat. Assoc.***92**, 179–191. https://doi.org/10.2307/2291462 (1997).Hoeting, J. A., Madigan, D., Raftery, A. E. & Volinsky, C. T. Bayesian model averaging: A tutorial.

*Stat. Sci.***14**, 382–401 (1999).Cantone, I.

*et al.*A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches.*Cell***137**, 172–181. https://doi.org/10.1016/j.cell.2009.01.055 (2009).Saito, T. & Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets.

*PLoS ONE*https://doi.org/10.1371/journal.pone.0118432 (2015).Curtis, C.

*et al.*The genomic and transcriptomic architecture of 2000 breast tumours reveals novel subgroups.*Nature***486**, 346–352. https://doi.org/10.1038/nature10983 (2012).Prat, A. & Perou, C. M. Deconstructing the molecular portraits of breast cancer.

*Mol. Oncol.***5**, 5–23. https://doi.org/10.1016/j.molonc.2010.11.003 (2011).Perou, C. M.

*et al.*Molecular portraits of human breast tumours.*Nature***406**, 747–752. https://doi.org/10.1038/35021093 (2000).Herschkowitz, J. I.

*et al.*Identification of conserved gene expression features between murine mammary carcinoma models and human breast tumors.*Genome Biol.***8**, R76. https://doi.org/10.1186/gb-2007-8-5-r76 (2007).Fougner, C., Bergholtz, H., Norum, J. H. & Sørlie, T. Re-definition of claudin-low as a breast cancer phenotype.

*Nat. Commun.***11**, 1787. https://doi.org/10.1038/s41467-020-15574-5 (2020).Gerstein, M. B.

*et al.*Architecture of the human regulatory network derived from ENCODE data.*Nature***489**, 91–100. https://doi.org/10.1038/nature11245 (2012).Lambert, S. A.

*et al.*The human transcription factors.*Cell***172**, 650–665. https://doi.org/10.1016/j.cell.2018.01.029 (2018).Saba, R., Alsayed, A., Zacny, J. P. & Dudek, A. Z. The role of forkhead box protein M1 in breast cancer progression and resistance to therapy.

*Int. J. Breast Cancer***2016**, 9768183. https://doi.org/10.1155/2016/9768183 (2016).Wang, L., Meng, Y., Xu, J. J. & Zhang, Q. Y. The transcription factor AP4 promotes oncogenic phenotypes and cisplatin resistance by regulating LAPTM4B expression.

*Mol. Cancer Res. MCR***16**, 857–868. https://doi.org/10.1158/1541-7786.Mcr-17-0519 (2018).Amin, S., Kumar, A., Nilchi, L., Wright, K. & Kozlowski, M. Breast cancer cells proliferation is regulated by tyrosine phosphatase SHP1 through c-jun N-terminal kinase and cooperative induction of RFX-1 and AP-4 transcription factors.

*Mol. Cancer Res. MCR***9**, 1112–1125. https://doi.org/10.1158/1541-7786.Mcr-11-0097 (2011).Hu, X.

*et al.*The RNA-binding protein AKAP8 suppresses tumor metastasis by antagonizing EMT-associated alternative splicing.*Nat. Commun.***11**, 486–486. https://doi.org/10.1038/s41467-020-14304-1 (2020).Yu, H.

*et al.*T-box transcription factor 21 expression in breast cancer and its relationship with prognosis.*Int. J. Clin. Exp. Pathol.***7**, 6906–6913 (2014).Kester, H. A., van der Leede, B. M., van der Saag, P. T. & van der Burg, B. Novel progesterone target genes identified by an improved differential display technique suggest that progestin-induced growth inhibition of breast cancer cells coincides with enhancement of differentiation.

*J. Biol. Chem.***272**, 16637–16643. https://doi.org/10.1074/jbc.272.26.16637 (1997).Meijer, D.

*et al.*TSC22D1 and PSAP predict clinical outcome of tamoxifen treatment in patients with recurrent breast cancer.*Breast Cancer Res. Treat.***113**, 253–260. https://doi.org/10.1007/s10549-008-9934-3 (2009).Syed, V. TGF-β signaling in cancer.

*J. Cell. Biochem.***117**, 1279–1287. https://doi.org/10.1002/jcb.25496 (2016).

## Acknowledgements

We thank Dr. Colm Ryan for helpful discussions in the preparation of this manuscript.

## Funding

This work has been supported by the Irish Cancer Society CCRC BREAST-PREDICT Grant CCRC13GAL, and by Science Foundation Ireland under Grants 14/IA/2395 and 18/SPP/3522. B.D.K. was funded through an Irish Research Council 2017/2018 Laureate Award awarded to Dr. Colm J. Ryan.

## Author information

### Authors and Affiliations

### Contributions

L.F.I.M. and W.K. designed the study. L.F.I.M. wrote the R package for KBoost and B.D.K. reviewed the code and wrote the vignette. L.F.I.M. wrote the main manuscript text and Figs. 1, 3 and 4. B.D.K. prepared Fig. 1. All authors reviewed the manuscript.

### Corresponding author

## Ethics declarations

### Competing interests

The authors declare no competing interests.

## Additional information

### Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Supplementary Information

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

## About this article

### Cite this article

Iglesias-Martinez, L.F., De Kegel, B. & Kolch, W. KBoost: a new method to infer gene regulatory networks from gene expression data.
*Sci Rep* **11**, 15461 (2021). https://doi.org/10.1038/s41598-021-94919-6

Received:

Accepted:

Published:

DOI: https://doi.org/10.1038/s41598-021-94919-6

## Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.