Robust phenotype prediction from gene expression data using differential shrinkage of co-regulated genes

Discovery of robust diagnostic or prognostic biomarkers is a key to optimizing therapeutic benefit for select patient cohorts - an idea commonly referred to as precision medicine. Most discovery studies to derive such markers from high-dimensional transcriptomics datasets are weakly powered with sample sizes in the tens of patients. Therefore, highly regularized statistical approaches are essential to making generalizable predictions. At the same time, prior knowledge-driven approaches have been successfully applied to the manual interpretation of high-dimensional transcriptomics datasets. In this work, we assess the impact of combining two orthogonal approaches for the discovery of biomarker signatures, namely (1) well-known lasso-based regression approaches and its more recent derivative, the group lasso, and (2) the discovery of significant upstream regulators in literature-derived biological networks. Our method integrates both approaches in a weighted group-lasso model and differentially weights gene sets based on inferred active regulatory mechanism. Using nested cross-validation as well as independent clinical datasets, we demonstrate that our approach leads to increased accuracy and generalizable results. We implement our approach in a computationally efficient, user-friendly R package called creNET. The package can be downloaded at https://github.com/kouroshz/creNethttps://github.com/kouroshz/creNet and is accompanied by a parsed version of the STRING DB data base.


Methods
The starting point of our method is a set of per-patient baseline gene expression measurements as well as a gene regulatory network. Our method, creNET, consists of two integral parts, (1) the discovery of significant upstream regulators and accompanying weights from literature-derived biological networks, (2) the efficient incorporation of this information into a group lasso procedure.
Upstream regulator detection. We recently introduced a new 'active gene regulator' detection method that is suitable for use with public protein-protein, protein-gene interaction networks with a mix of signed and unsigned causal edges. In this context, let G = (V, E) be a given gene regulatory networks, where V is the set of vertices, typically consisting of Transcription Factors, Proteins, miRNAs, Compounds, etc., and E is the set of edges. The existence of an edge between a source node v and a target node g (v → g) implies that v regulates g. For the purpose of this paper we assume that the target nodes g are genes. If the edge between v and g is signed, then the direction of the regulation is known with + indicating 'v up-regulates g' , and − indicating 'v down-regulates g' . The network groups the gene into classes (or gene sets) according to the connections of 'upstream regulators' to downstream genes. Figure 1 shows a schematic plot of gene sets defined by a gene regulatory network.
The method outputs likely upstream regulators based on differentially expressed genes between two conditions. It quantifies the statistical significance of each potential regulator with respect to a null model of random assignment of differential expression status to genes in the network. If edges in the network are signed, the direction of regulation will influence the test statistic and, thus, give more credence to upstream regulators that respect the regulation pattern of their downstream genes. For more details, we refer to the original publication 16 . The significance estimates for each potential regulator will serve as weights for the corresponding gene set in our creNET procedure.
There are many academic 25,26 and commercial (Ingenuity: www.ingenuity.com and Selventa: www.selventa. com) sources of biological networks. In our current implementation, we utilize a commercially curated collection of interactions that consists of approximately 87,140 nodes and 671,371 edges, where each edge is supported by an experiment reported in the literature. We refer to this network as Causal Reasoning Engine Knowledge Base or creKB for short. However, our method also works with other types of network. In particular, the released package SCientifiC RepoRts | (2018) 8:1237 | DOI:10.1038/s41598-018-19635-0 comes with version 10 of the STRING DB data base 25 and we provide equivalent figures for all results based on the STRING DB in the Additional File 1. We refer to this network as publicKB.
Shrinkage of co-regulated gene sets. In order to integrate information on regulatory interactions and co-regulated gene sets into the classification In order to integrate information on regulatory interactions and co-regulated gene sets into the classification task, we utilize group-wise regularization of covariates as follows. Given a set of training data y p is the vector of normalized gene expression values for the i-th patient and y i ∈ {−1, 1} is a binary response, we construct an n × p design matrix X and a binary response vector y = (y 1 , …, y n ) T . The classification problem can be quite generally stated as an optimization problem of the form: where Φ is a loss function that links the covariates to the response via the linear function β T x i , g is a penalty term and λ is a tuning parameter 11 . Here  ( , ) ∈ is the vector of unknown coefficients. Several choices of loss functions such as linear regression and Support Vector Machines (SVMs) can be formulated as above. For our purpose, we utilize logistic regression: For ease of presentation, we do not consider the intercept term β 0 separately; addition of this term to the loss function can be trivially achieved by adding an extra column of 1's at the beginning of the design matrix X.
The group lasso penalty introduced in 7 provides a natural way for simultaneous shrinkage of co-regulated gene sets. Assume that the gene regulatory network categorizes the genes into K (possibly overlapping) groups Fig. 1). Moreover, let β ( )  be the vector obtained from β by keeping all components of β that correspond to the group  g ( ) identical and setting all other components to zero. Note that . The group-lasso norm is then defined by ∈ This penalty is in fact the  1 norm of the  2 norm of the group coefficient vectors (β (1) , …β (K) ). This quantity is non-differentiable at 0 ( ) β =  and hence acts as a regular  1 penalty, but sets the coefficient of the entire gene group to 0. If some of the groups overlap, the quantity (3) is still a norm whose unit ball has singularities when some of the  0 ( ) β = . In this case, it is straightforward to show that g c 0 where   0 ⊂ denotes the set of groups such that β (g) = 0 and the exponent, c, denotes the set complement. An important implication of the above equation is that in presence of overlapping groups, the group-lasso penalty will result is sparse solutions where the active (i.e., non-zero) groups may include genes with zero coefficients. For instance if groups  g ( ) and g (k) have covariates x 1 , …x m in common, and if g ( )  is set to zero, the coefficients of these same covariates in g (k) will also be set to zero. This is not a desirable effect. In particular, consider a situation where the covariates x 1 , …x m weakly correlate with the phenotype and no other covariate in group  g ( ) is significant, so the entire group is set to zero. However, same covariate, when considered with the remaining covariates in group g (k) may make the entire group significant and hence should remain in the model. As most regulatory biological networks will have overlapping groups, we are interested in a penalty such that the support of β is a union of groups, i.e., the opposite of the above equation. In 10 , the authors introduced an overlapping group-penalty defined by The regulators R 1 , R 2 and R 3 group the downstream genes into gene sets g (1) = {g 1 , g 2 , g 3 , g 4 }, g (2) = {g 1 , g 2 , g 3 , g 4 , g 5 , g 6 , g 7 , g 8 } and g (3) = {g 5 , g 6 , g 7 , g 8 }. Note that the gene sets may overlap.
This penalty has the desirable property that if a group is active (non-zero), all of the down-stream genes will have a non-zero coefficient, even if the genes are shared by inactive (i.e., zero) groups, i.e., where   1 ⊂ denotes the set of groups such that v (g) ≠ 0. In order to achieve within-group sparsity, an 1  penalty of covariates can be added resulting in a penalty of the form in the objective function (1), where α ∈ [0,1] is a tradeoff parameter between group-wise and single gene penalties.
The calculation of the overlapping group norm (5) based on its direct definition amounts to a nonsmooth convex optimization problem under linear constraints, which is in general highly nontrivial. To facilitate computations (at the expense of artificially increasing the size of the data), a simple duplication device can be used to replace the overlapping group norm with an equivalent non-overlapping group-lasso norm via 10 . More precisely, ( ) be the total number of genes, taking their multiplicity into consideration. For each group of co-regulated genes g (j) , let X (j) be the n × #(g (j) ) slice of the gene expression matrix X that contains the genes regulated by the j-th regulator.
The classification can now be performed on the duplicated gene expression matrix ∼ X with regression coefficients β ∈   M and non-overlapping group norm β . A formal derivation of the equivalence between overlapping group penalty and duplicated non-overlapping group-penalty is presented in the Additional File 1.

Differential shrinkage of gene sets based on the significance of upstream regulators. The penalty
In the original introduction of the group-lasso norm and its subsequent developments, the weights w j are generally set to #(g (j) ) 1/2 , i.e., the square root of the number of genes in the j-th regulation group 7 . While this choice seem reasonable, it may have unintended consequences. Indeed these weights may cause larger groups to be set to zero and only let smaller groups survive, regardless of the biological importance of the regulator under consideration. For instance, an important regulator in Kidney Rejection (one of the diseases that we study in this paper) is IFNG. This regulator corresponds to a very large gene set that is heavily penalized under the square-root penalty and is not retained in the fitted model.
Intuitively, gene sets should be penalized based on the relevance of the set to the phenotype. The inferred significance of upstream regulators of the gene sets provides a natural way to achieve this. Below, we present four sets of weights, designed to systematically assess the impact of differential shrinkage on model performance and robustness.
1. unweighted: No weights are applied to groups. This weighting is equivalent to unweighted SGL 7,9 . 2. sqrt-weighted: Square root of the number of genes in the groups. This weighting is equivalent to weighted SGL 7,9 . 3. regulator-weighted: Inferred p-values (of active upstream regulators of the gene group). 4. regulator-sqrt-weighted: Inferred p-values times the square root of the group size.
creNET: implementation. For the fitting part of the algorithm, we modified the source C++ implementation of the fitting process from the R package SGL 9 . As part of the modification, we added the inclusion of arbitrary weights into the fitting process. We then developed an independent R package called creNET that includes several auxiliary and utility functions. These include functions for network and data processing, data normalization, calculation of an optimal path of λ values 9 , data filtering and weight generation, goodness-of-fit measures such as AUC, nested cross-validation, independent training and testing, and plotting utilities. Moreover, we provide a parsed version of the STRING DB database as well as scripts for running various tests. The package is available to download from github at https://github.com/kouroshz/creNet. See Additional File 1 for additional algorithmic details.
Evaluation procedure. We compared the proposed methods to well-established benchmark methods using artificial as well as empirical clinical data sets. Our numerical experiments were designed to test various aspects of our model including (1) the advantage of group-based vs. single gene shrinkage, (2) the importance of gene sets defined by upstream regulatory interactions, (3) the impact of weighted shrinkage based on the inferred activity of upstream regulators, (4) the generalizability of trained predictive models to new datasets and (5) the biological relevance of the significant predictors to the phenotype. We used simulation and randomization studies to assess items 1 and 2, while 3 and 4 were tested on real clinical trial datasets by cross-validation as well as independent training and testing. An advantage of our method is that in addition to the class labels, it outputs gene sets and upstream regulators with significant discriminatory power. We studied the biological relevance of these regulators to the phenotype in each of our datasets to address item 5. Table 1 gives an overview of the benchmark methods. Ridge (#1) and lasso (#2) logistic regressions are popular approaches for phenotypic prediction 27 . To assess the impact of feature selection by the upstream regulator detection method, we also defined the Filtered Gene lasso (#3). In this method, we simply restrict the input transcriptomics data to genes that are regulated by active upstream regulators. Finally, we consider the four group lasso-based approaches defined in the previous section. All parameters for detection of upstream regulators are the same across methods. For a fair comparison between methods, input train and test gene expression data are filtered to include only genes that are present in the network of gene regulatory interactions. The lasso and ridge regression models were trained using the R glmnet package 27 .
Nested cross-validation vs. independent data sets. In all methods under study, internal parameters such as the lasso regularization parameter λ in (1) or the tradeoff parameter α in (7) must be fit. In addition, we perform a feature selection based on upstream regulator detection prior to classification. The literature abounds with articles in which this situation is inappropriately handled by simple cross-validation. For instance, one could detect transcripts regulated by active regulators on the complete dataset and subsequently employ a cross-validation procedure for the regularized classifier only. This will lead to a strong overestimation of expected classification performance 28 .
The appropriate approach is nested cross-validation which wraps feature selection, parameter fitting and classification into nested loops such that no information on the complete dataset is ever used for fitting on the current training slice. In Additional File 1, we outline the employed procedure in more detail. As performance metrics we will focus on balanced accuracy (i.e. the average of sensitivity and specificity) which is an appropriate measure when positive and negative class sizes are not equal. Even appropriate nested cross-validation can, in principle, not account for shifts in the underlying population when the trained classifier is applied in a production setting. Therefore, we chose biological datasets for which we could also evaluate our approaches in two completely independent scenarios.
Generation of simulated data. If the difference between responders and non-responders in a given dataset is governed through previously described upstream regulatory mechanisms 29 , our proposed method should perform well. In contrast, on a random network our approach should perform no better than purely statistical regularization approaches such as lasso. To test this hypothesis, we simulated a total of 100 datasets with 1,2,…, 10 randomly selected regulators in the network regarded as 'active' . Each dataset contains 50 responders and 50 non-responders for training and the same number of samples for testing. The simulation procedure uses a previously described Bayesian network model 15 to obtain the set of differentially expressed genes given active regulators. Next, we used a negative binomial model to generate gene expression values based on their differential expression status 30 . For unregulated genes, we used a negative binomial model with mean μ 0 and dispersion r = μ 0 /3. For up-regulated genes, the mean was set to μ 1 = cμ 0 for responders and μ 2 = (1/c)μ 0 for non-responders. Here c denotes the fold change. The mean values were switched in the case of down regulated genes. In our simulations we used μ 0 = 3 and fold change c = (3) 1/2 . The simulated values were then normalized using the R package voom. Figure 2 shows a schematic plot of the simulation process.
Processing of clinical data. Clinical trial datasets from (a) acute kidney rejection 31,32 (GSE50058 and GSE21374) and (b) response to infliximab therapy in ulcerative colitis (UC) 33 (GSE12251 and GSE14580) were obtained and RMA normalized and differentially expressed genes (DEGs) were identified. All analyses were performed using standard pipelines from the R Bioconductor package 34 . A p-value of 0.05 and a fold change of 1.5 were used as cutoff values in differential gene expression analysis. The DEGs were then used as input to the QuaternaryProd package 16 to calculate the statistical significance of the upstream regulators. A cutoff of 0.01 was used for the p-values to identify significant groups.

Results
In a series of numerical experiments using artificial and real clinical trial datasets, we tested the ability of our model to predict phenotypic response and to identify relevant associated biomarkers. In the following sections we discuss the results of these experiments. Simulations. Figure 3 shows results of our simulations across the range of simulation parameters. If the network captures the underlying regulatory structure correctly, the proposed creNET method performs well.
Assessing model performance. Figure 4 shows an overview of performance in terms of balanced accuracy split by cross-validation and independent test set runs. All runs are based on the creKB. First, we focus on balanced accuracy under nested cross-validation, i.e. the expected performance of the trained classifier in a new class-balanced dataset from the same underlying population. Red diamonds in Fig. 4 show that performance of tested method is broadly consistent across the board -with our proposed methods (#6 and #7) showing superior performance in the UC2 and Kidney1 datasets.
As outlined before, we are particularly interested in the robustness of trained classifier when used in independent studies with clinically highly similar phenotypes. Blue circles in Fig. 4 depict balanced accuracy in the corresponding independent datasets. Arrows give a visual indication of the magnitude of the change between cross-validation and independent test set results. In virtually all cases, classifier performance deteriorates when   Figure 3. The bar plots show the achieved balanced accuracy in simulated test data for method #6 CREweighted group-lasso using the original network and the permuted network in contrast with #2 lasso-based logistic regression as well as #4 Unweighted group-lasso. Across all 10 runs, our method is able to exploit the network-structure to improve performance with respect to lasso-based regression. For a random network, performance deteriorates to the level of lasso-based regression.
applying the fitted models to the independent datasets with slightly different underlying characteristics. This is especially pronounced for the kidney datasets in which the non-group-lasso methods (#1, #2, and #3) either deteriorate to random predictions (Kidney2 data) or to weak predictivity (Kidney1 data). Notably, our proposed methods show robust performance in all cases.
Active regulators. In addition to its robust performance, creNET facilitates the biological interpretation of the predictive signals by providing 'regulators' that can be easily related to the biology that likely drives the response. To further assess the reproducibility and consistency of the selected regulators, we performed a bootstrap analysis as follows. For each dataset, a bootstrap sample was selected and the CRE weighted GL (#6) was trained using cross validation and the regulators selected by the model were recorded. This process was repeated a 100 times and the frequency of every selected predictors was recorded. Figure 5 shows the frequencies of selected regulators in each dataset. Any regulator picked at least once in any dataset is displayed providing an overview of the heterogeneity within and between datasets. For example, in the case of infliximab response in UC, the predominantly selected 'regulator' is Lipopolysaccharide;LPS (and its synonym lipopolysaccharide), which is a key activator of the toll-like receptor 4 (TLR4). TLR4 signaling in intestinal epithelial cells in turn is a key component of homeostasis and activation of the TLR4 pathway and has been associated with both Ulcerative colitis and Crohn's disease 35,36 . The link between infliximab sensitivity and TLR4 is further supported by findings that infliximab induced blockade of TNF-alpha can lead to strong suppression of TLR4 expression 37 . While other components of inflammatory signals may contribute to the overall differential gene expression, creNET directly establishes the hypothesis that differences in the state of TLR4 signaling (via sensitivity to LPS) at baseline may be a key distinguishing factor in the response to infliximab. In the case of acute kidney rejection, the regulators selected by the algorithm appear at the first glance to be 'asymmetrical' , i.e. creNET consistently selects Interferon gamma (IFNG) associated transcripts in the GSE21374 cohort, while IFNG is not selected as much in the GSE500058 cohort and combinations of CFTR, Serpinb9b, Vancomycin and HNF4A are selected more frequently instead. However, acute kidney rejection can be mechanistically separated into two distinct concepts: T-cell mediated rejection (TCMR) and antibody mediated rejection (ABMR) 38,39 . IFNG is well documented as a cytokine involved in innate immune response and genes induced by IFNG are shared between TCMR and ABMR 39 . It is therefore intuitive that IFNG regulated transcripts are suitable in predicting rejection. Additionally, a third concept should also be taken into consideration, namely that of acute kidney injury, which appears to be more dominantly represented in the GSE50058 cohort. CFTR, and HFN4A can be interpreted as indicators of kidney function and Vancomycin is well established as a driver for kidney injury. Serpinb9b, the last relevant regulator identified in GSE50058 is again a marker of cytotoxic T-cell activity 40 and would therefore be consistent with T-cell mediated rejection.

Conclusions
In this paper we presented a way to integrate biological regulatory networks into a penalized regression and classification framework. The method is specifically designed for phenotype prediction using omics-scale data. We demonstrated that the new penalization and the integration of the relevant biology of gene regulation can lead to more robust predictions that generalize well to independent datasets (Fig. 4). An integral part of the method is the identification of important gene sets defined by the regulators in the network and differential shrinkage of the groups according to their biological relevance (Fig. 5).
Our results show that even when proper care is taken in training and assessing classifiers on high dimensional transcriptomics datasets, models that don't incorporate prior biological knowledge may still result in poor generalization to independent test data with slightly shifted characteristics. By integrating prior knowledge into the classification framework the model appears to select predictors that are more biologically relevant across a wider, more heterogeneous population.  Figure 4. Overview of model performance in nested cross-validation (red diamonds) and independent test sets (blue squares). Each row depicts performance in one of the methods outlined in Table 1. The x-axis corresponds to the average achieved balanced accuracy in multiple runs using the best λ value in each run. The horizontal error bars correspond to one standard deviation. Arrows indicate the magnitude of the change between crossvalidation and independent tests. Note that in almost all cases performance deteriorates in the independent setting and that prior-knowledge based methods are better able to retain high predictive performance. We provide a user-friendly R-package creNET for download from github at https://github.com/kouroshz. The package is accompanied with a parsed version of the STRING DB, but any other regulatory network can be used. Moreover, we provide command-line scripts to facilitate batch runs. Note that performance of the model depends on the assumption that the network captures the underlying biology of the disease. Figures and results have been reproduced using the current freely available version of the STRING DB network 25 and can be found in the Additional File 1. With the public network we still see equivalent or better performance when compared to the lasso approach in most scenarios, albeit not as strongly as with our proprietary knowledge base. Especially the generalization performance to independent datasets is appreciably weaker using the public network. We regard these results as a call to improve publicly available networks to provide even better substrate for prior-knowledge driven prediction approaches.
As future work we plan the extension of our method to non-linear classification. The described approach should generalize to SVM-based prediction approaches and we hope to further push the boundary for robust classification of phenotypes based on omics data. To summarize, we provide a method that encodes prior biological knowledge of regulatory interactions into the classification process for prediction of clinical phenotypes using baseline or early stage gene expression measurements. This integration results in optimal model sparsity. Our method outputs inferred active upstream regulators, which increases the interpretability of the results. Figure 5. The heatmap shows heterogeneity in picked regulators within and between datasets based on 100 bootstrap runs. Any regulator that was picked at least once in any dataset is displayed. Numbers indicate the number of times a specific regulator was chosen in the 100 bootstrap runs conducted for each dataset.