KBoost: a new method to infer gene regulatory networks from gene expression data

Reconstructing gene regulatory networks is crucial to understanding biological processes and holds potential for developing personalized treatments. Yet it remains an open problem, as state-of-the-art algorithms are often unable to process large amounts of data within a reasonable time. Furthermore, many existing methods predict numerous false positives and have limited capabilities to integrate other sources of information, such as previously known interactions. Here we introduce KBoost, an algorithm that uses kernel PCA regression, boosting and Bayesian model averaging for fast and accurate reconstruction of gene regulatory networks. We have benchmarked KBoost against other high-performing algorithms on three different datasets, and the results show that our method compares favorably across them. We have also applied KBoost to a large cohort of close to 2,000 breast cancer patients and 24,000 genes in less than 2 h on standard hardware. Our results show that molecularly defined breast cancer subtypes also feature differences in their GRNs. An implementation of KBoost in the form of an R package is available at https://github.com/Luisiglm/KBoost and as a Bioconductor software package.


Hyperparameter Selection for KBoost
KBoost has three hyperparameters: the shrinkage parameter ν, the RBF kernel width σ, and the number of boosting iterations. The parameter ν is the shrinkage parameter for gradient boosting. It reduces the contribution of each iteration, is related to the learning rate in gradient descent, and is restricted to values between 0 and 1. The parameter σ is the width parameter of the RBF kernel and needs to be larger than 0; in general terms, the larger σ is, the smoother the resulting regression function will be. Finally, the number of iterations controls the maximal number of TFs put together in a regression model. ChIP-Seq studies have suggested that genes have an average of 3 TFs regulating them (Gerstein et al., 2012).
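To make the role of the width parameter concrete, below is a minimal R sketch of an RBF kernel regression on toy data. The Gaussian parameterization exp(−d²/2σ²) and the ridge-style fit are assumptions for illustration only and are not KBoost's kernel PCA regression.

```r
# Illustrative RBF kernel; the parameterization exp(-d^2 / (2 * sigma^2)) is
# an assumption for this sketch, not necessarily KBoost's exact form.
rbf_kernel <- function(X, Y, sigma) {
  d2 <- outer(rowSums(X^2), rowSums(Y^2), "+") - 2 * X %*% t(Y)
  exp(-d2 / (2 * sigma^2))
}

set.seed(1)
x <- matrix(seq(0, 10, length.out = 50), ncol = 1)
y <- sin(x) + rnorm(50, sd = 0.2)

# Kernel ridge fit as a stand-in regression; the small ridge term keeps the
# 50 x 50 linear system well conditioned.
fitted_vals <- function(sigma, lambda = 1e-3) {
  K <- rbf_kernel(x, x, sigma)
  K %*% solve(K + lambda * diag(50), y)
}
sum((y - fitted_vals(0.1))^2)  # small width: near-interpolation of the noise
sum((y - fitted_vals(5))^2)    # large width: smoother fit, larger residuals
```

A small σ lets the regression pass through every point, including the noise, while a large σ yields the smoother function described above.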
The shrinkage parameter, ν, has an additional effect in our case. From equations 4 and 10, the marginal likelihood is as follows:

\[ P\left(Y_j \mid A_j^{(d)}\right) \propto \left( \frac{1}{n} \sum_{i=1}^{n} \left( y_{ij} - \hat{y}_{ij}^{(d)} \right)^2 \right)^{-n/2} \]

As mentioned in the Methods section, A_j^{(d)} is the subset of TFs regressed on gene j, n is the number of observations, and the sum over i of (y_{ij} − ŷ_{ij}^{(d)})² is the sum of squared errors. Hypothetically speaking, if we hold the sum of squared errors divided by n fixed for two arbitrary subsets A_j^{(d)} and A_j^{(d+1)}, where the sum of squared errors is larger for subset d+1 than for d, the ratio P(A_j^{(d)})/(P(A_j^{(d)}) + P(A_j^{(d+1)})) becomes closer to 1 as n increases, while P(A_j^{(d+1)})/(P(A_j^{(d)}) + P(A_j^{(d+1)})) becomes closer to zero. This is easier to see in log form. Let Q be a constant and let s_d and s_{d+1} be the sum of squared errors divided by n for A_j^{(d)} and A_j^{(d+1)}, respectively; then

\[ \log P\left(A_j^{(d)} \mid Y_j\right) = Q - \frac{n}{2} \log s_d . \]

The difference between the two log posteriors,

\[ \log P\left(A_j^{(d)} \mid Y_j\right) - \log P\left(A_j^{(d+1)} \mid Y_j\right) = \frac{n}{2} \left( \log s_{d+1} - \log s_d \right), \]

increases with n. This means that for two models with almost the same fit, we can observe very different Bayesian model averages depending on n. If n is sufficiently large and s_d is lower than s_{d+1}, then P(A_j^{(d+1)}) ≪ P(A_j^{(d)}), yielding P(A_j^{(d)}) + P(A_j^{(d+1)}) ≅ P(A_j^{(d)}). Under these conditions, the Bayesian model averages would be 1 for A_j^{(d)} and 0 for A_j^{(d+1)}. In our case, we use BMA as an estimate of the probability that a TF regulates a gene, so under the conditions described above we could overestimate the confidence of our predictions. As a heuristic, we propose reducing ν in the boosting algorithm to balance this. If ν decreases as a function of n, the difference log(s_{d+1}) − log(s_d) becomes less pronounced because the contributions of the individual TF models are reduced. This can be seen in Supplementary Figure 1. Furthermore, since we use the same shrinkage parameter for all models, the model ranking remains identical when using one boosting iteration.

Figure S1. Effect of the Shrinkage Parameter on the Posterior
We ran KBoost on DREAM4 multifactorial network 1, using ν values of 1 and 0.1 with 1 iteration and a width parameter σ of 60. The results highlight the effect of the shrinkage parameter at reducing the disparity between models which is a product of a large n.
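The collapse of the model averages can be reproduced numerically. The R sketch below uses made-up s values and assumes only the Q − (n/2) log(s) form above; Q cancels on normalization.

```r
# Minimal sketch: BMA weights for two TF subsets with near-identical fits,
# assuming log P(A | Y) = Q - (n/2) * log(s); Q cancels on normalization.
bma_weights <- function(s, n) {
  logp <- -(n / 2) * log(s)
  w <- exp(logp - max(logp))  # subtract the max for numerical stability
  w / sum(w)
}
s <- c(0.50, 0.55)        # illustrative SSE/n values for two subsets
bma_weights(s, n = 10)    # approx. 0.62 and 0.38: a moderate preference
bma_weights(s, n = 1000)  # approx. 1 and 0: the posterior collapses
```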
To see how different parameters affect the performance of KBoost, we ran it on the IRMA Off dataset with different combinations of values for each parameter and used the AUPR and AUROC metrics to assess its performance. We fixed the number of iterations to 3, as it represents the maximum number of TFs considered to regulate a gene together. For the other parameters we used ν values ranging from 0.0001 to 1 and σ values from 1 to 100. We focused on finding ranges of values in which both metrics were consistently high, rather than a single maximizing value, which could be an artifact specific to this dataset. The results are shown in Supplementary Figure 2. The performance seemed to plateau for large widths (σ > 50) given shrinkage parameters under 0.5.
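In outline, this experiment is a grid search. The sketch below is hypothetical: run_kboost(), auroc(), aupr(), irma_off and gold_standard are placeholders standing in for the actual package API and data, not functions the KBoost package is known to export under these names.

```r
# Hypothetical grid search over shrinkage (nu) and width (sigma) values;
# run_kboost(), auroc(), aupr(), irma_off and gold_standard are placeholders.
grid <- expand.grid(nu = c(1e-4, 1e-3, 0.01, 0.1, 0.5, 1),
                    sigma = c(1, 10, 25, 50, 75, 100))
grid$auroc <- grid$aupr <- NA_real_
for (i in seq_len(nrow(grid))) {
  net <- run_kboost(irma_off, nu = grid$nu[i], sigma = grid$sigma[i], ite = 3)
  grid$auroc[i] <- auroc(net, gold_standard)
  grid$aupr[i]  <- aupr(net, gold_standard)
}
# Inspect regions where both metrics stay high, rather than a single maximum.
subset(grid, sigma > 50 & nu < 0.5)
```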
Interestingly, the behavior is similar in the five networks of the DREAM4 dataset (Figure S3). While the shrinkage parameter, ν, affects the sparsity of the posterior, and the number of iterations corresponds to the maximum number of TFs per gene, the width parameter, σ, can be associated with overfitting. An RBF kernel regression model with a low width can fit every data point exactly, which increases the chance of fitting spurious patterns. We investigated the effect of σ on network 1 of the DREAM4 multifactorial challenge dataset. We performed a three-fold cross-validation over a grid of iterations (1, 3, 5, 10 and 20), shrinkages ν (0.001, 0.1, 0.3, 0.5 and 1) and widths σ (0.1, 10, 20, 60 and 100). In all cases, σ ≥ 10 yielded a lower sum of squared errors on the validation set (Figure S4).
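The cross-validation logic can be sketched as follows; a plain RBF kernel ridge regression on synthetic data stands in for KBoost's kernel PCA regression, so only the width σ is varied here and the numbers are illustrative.

```r
# Self-contained sketch of the three-fold cross-validation logic: score a
# kernel width by its validation sum of squared errors, using an RBF kernel
# ridge regression as a stand-in for KBoost's kernel PCA regression.
set.seed(7)
x <- matrix(rnorm(90), ncol = 1)
y <- sin(2 * x) + rnorm(90, sd = 0.3)
cv_sse <- function(sigma, k = 3, lambda = 1e-2) {
  folds <- sample(rep(seq_len(k), length.out = nrow(x)))
  total <- 0
  for (f in seq_len(k)) {
    tr <- folds != f
    K  <- exp(-as.matrix(dist(x))^2 / (2 * sigma^2))
    alpha <- solve(K[tr, tr] + lambda * diag(sum(tr)), y[tr])
    pred  <- K[!tr, tr] %*% alpha
    total <- total + sum((y[!tr] - pred)^2)
  }
  total
}
sapply(c(0.1, 1, 10), cv_sse)  # very small widths tend to overfit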

Figure S2. Hyperparameter Selection for KBoost
The results show the effect of the width and shrinkage parameters on performance on the IRMA Off dataset. The performance seems to plateau at large widths for any value of ν. We fixed the number of iterations at 3, assuming a maximum of 3 TFs per gene.