Introduction

Immune checkpoint inhibitors are established treatments for locally advanced or metastatic urothelial cancer (aUC). Although mechanisms underlying anticancer immunity and immune checkpoint inhibition have been studied extensively, prospective use of biomarkers to identify patients who are most likely to obtain long-term durable benefits from these agents remains unrealized, in part due to variability in assay platform and interpretation across studies. A recent trial (JAVELIN Bladder 100, NCT02603432) showed that the addition of avelumab as first-line maintenance therapy to best supportive care (BSC) significantly prolonged overall survival (OS) compared with BSC alone and established avelumab as a new first-line standard-of-care treatment for aUC1. The isolation of avelumab through randomization as the only active treatment covariate in the maintenance setting for aUC provides a unique opportunity to investigate biomarkers that are associated with survival benefit2. In particular, the study provides an opportunity to evaluate various methodologies with respect to selecting biomarkers among various candidates, estimating the effects of biomarkers, and combining multiple biomarkers into accurate models.

In published literature, various models have been proposed to address these challenges. For low-dimensional data, the Cox proportional hazards model is the most popular method to study associations of biomarkers with time-to-event endpoints3. However, in the context of high-dimensional data (number of biomarkers > number of observations) or in the presence of severe collinearity in the data, the proportional hazards model may not be suitable. Various penalized regression methods have been proposed to overcome these hurdles, among which ridge, lasso (least absolute shrinkage and selection operator), and elastic net are most popular4. As an alternative to penalized regression methods, Bayesian methods have also been proposed for variable selection. The main advantage of the Bayesian framework for variable selection is that it allows the incorporation of any prior information regarding the data into the model in addition to transparent quantification of uncertainty. Work by Park and Casella5 and Li and Lin6 has shown that the frequentist approaches mentioned previously can be outperformed by Bayesian variable selection methods. Various choices of prior distributions have been proposed in the literature for variable selection; however, one type of priors, “spike and slab,” has gained widespread attention due to its intuitive nature and ease of implementation. In the context of survival analysis, Tang et al.7 introduced a double-exponential (DE) spike-and-slab prior distribution that was successfully utilized to analyze genes associated with breast cancer in a Dutch dataset. Subsequently, the authors extended their work and proposed group spike-and-slab lasso Cox (gsslasso Cox) to conduct variable selection by incorporating group structures into the model. Tree-based methods have also been proposed as a flexible alternative to the Cox proportional hazards model for modeling survival time and variable selection. In particular, the random survival forest (RSF) has been developed to identify significant covariates and their interactions8. Its main advantage over other methods is that it can model complex nonlinear and high-dimensional survival data without strong assumptions regarding the data-generating process.

The main objective of this study was to examine biomarkers associated with survival benefit in the JAVELIN Bladder 100 aUC population using popular variable selection approaches for high-dimensional data and to evaluate variable selection methods using both simulation and the existing biological understanding of aUC biomarkers. In Section “Methods”, we describe various methods that were studied as part of the project and a simulation study conducted to assess the performance of various methods. We also introduce a modified thresholding rule based on Bayesian information criterion (BIC) to select variables based on the posterior estimates of the parameters. In Section “Results”, we present the results obtained by implementing variable selection methods on simulated data and the JAVELIN Bladder 100 dataset. Concluding remarks are provided in Section “Discussion”.

Methods

Cox proportional hazards model

In survival analysis, the dataset usually takes the form \(({T}_{i}, {\delta }_{i}, {x}_{i})\), where \({T}_{i}\) is the observed time (either the failure time or the censoring time), \({\delta }_{i} \in \{\mathrm{0,1}\}\) is the censoring indicator, with \({\delta }_{i}=1\) in the case of a failure or death and \({\delta }_{i}=0\) if the observation is censored, and \({x}_{i}\) denotes the p-dimensional vector of observed covariates for the ith individual. The Cox proportional hazards model3 is the most popular method for studying the relationship between the observed survival response and explanatory variables. It assumes that \(\lambda \left(t|X\right),\) the hazard at time t given the vector of explanatory variables \(X\), takes the form:

$$\lambda \left(t|X\right)={\lambda }_{0}(t)\mathrm{exp}({\beta }^{T}X)$$

where \({\lambda }_{0}(t)\) is the baseline hazard function, and \(X\) and β are the vectors of explanatory variables and coefficients, respectively. Here, \({\beta }^{T}X\) is the linear predictor, also called the risk score. The parameter β can be estimated without specifying \({\lambda }_{0}(t)\) by maximizing the partial log-likelihood:

$$pl\left(\beta \right)=\sum_{i=1}^{n}{\delta }_{i}log\left(\frac{\mathrm{exp}({\beta }^{T}{X}_{i})}{{\sum }_{{i}{\prime}\in R({t}_{i})}\mathrm{exp}({\beta }^{T}{X}_{{i}{\prime}})}\right)$$

Here, \(R({t}_{i})\) denotes the risk set at time \({t}_{i}\), which contains all subjects who are at risk of an event.

For low-dimensional data, the Cox proportional hazards model can help to understand the relationship between covariates and the observed survival response. However, for high-dimensional data, the model is not identifiable, and in the presence of severe collinearity, the regression coefficient estimates \(\widehat{\beta }\) fail to converge. Several methods have been proposed to handle such cases, including penalized regression models.
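For reference, a minimal sketch of fitting a Cox model in R with the survival package is shown below; the data frame dat and its columns os_time, os_event, x1, and x2 are hypothetical placeholders rather than variables from the trial dataset.

```r
# Minimal sketch: Cox proportional hazards fit on a low-dimensional dataset.
# 'dat' is assumed to contain the observed time (os_time), the event
# indicator (os_event; 1 = event, 0 = censored), and covariates x1 and x2.
library(survival)

fit <- coxph(Surv(os_time, os_event) ~ x1 + x2, data = dat)
summary(fit)                              # hazard ratios exp(beta) and Wald tests
risk_score <- predict(fit, type = "lp")   # linear predictor beta^T X (risk score)
```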

Penalized regression methods

Lasso

Lasso is a regularization method that uses the L1 norm as its penalty9. Coefficients are estimated by minimizing the penalized negative log partial-likelihood:

$$Q\left(\beta \right)=-pl\left(\beta \right)+\lambda \sum_{j=1}^{p}|{\beta }_{j}|$$

where λ is the regularization or shrinkage parameter. The estimates of β depend on the value of λ: a larger value of λ induces stronger shrinkage and therefore fewer non-zero regression coefficient estimates. We used a 10-fold cross-validation procedure with grid search to find an ‘optimal’ value of λ. This procedure was implemented using the glmnet package in R.
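A minimal sketch of this procedure with glmnet is shown below; the covariate matrix x and the vectors time and status are placeholder objects, not the trial data.

```r
# Sketch: lasso-penalized Cox regression with 10-fold cross-validation over lambda.
library(glmnet)

y <- cbind(time = time, status = status)   # response format expected for family = "cox"
set.seed(1)
cv_lasso <- cv.glmnet(x, y, family = "cox", alpha = 1, nfolds = 10)

# Coefficients at the lambda minimizing the cross-validated partial-likelihood
# deviance; variables with non-zero coefficients are the ones selected by lasso.
beta_hat <- coef(cv_lasso, s = "lambda.min")
selected <- rownames(beta_hat)[as.numeric(beta_hat) != 0]
```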

Elastic net

Zou and Hastie10 showed that if a group of variables has very high pairwise correlations, the lasso tends to select only one variable from the group, essentially at random. To address this issue and other limitations of lasso, they proposed the elastic net, a penalized regression method whose penalty is a convex combination of the L1 and L2 norms. The additional L2 term in the penalty promotes a grouping effect and removes lasso's restriction that at most n variables can be selected when p > n. Here, coefficients are estimated by minimizing the penalized negative log partial-likelihood:

$$Q\left(\beta \right)= -pl\left(\beta \right)+\lambda (\alpha \sum_{j=1}^{p}\left|{\beta }_{j}\right|+(1-\alpha )\sum_{j=1}^{p}{\left|{\beta }_{j}\right|}^{2})$$

The elastic net penalty is controlled by mixing parameter α to bridge the gap between the lasso regression (α = 1) and ridge regression (α = 0). The parameters (λ, α) can be estimated using cross-validation with grid search. Because the elastic net has two tuning parameters, we cross-validated on a two-dimensional surface. We first selected a value of α from a grid of values, then, for each α, we selected a value of λ using 10-fold cross-validation.
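A sketch of this two-dimensional search is shown below, again with placeholder objects; fixing the fold assignments ensures that different values of α are compared on the same folds.

```r
# Sketch: elastic net tuning over a grid of alpha values, with lambda chosen
# by 10-fold cross-validation for each alpha.
library(glmnet)

y <- cbind(time = time, status = status)
alphas <- seq(0.1, 1, by = 0.1)
foldid <- sample(rep(1:10, length.out = nrow(x)))   # common folds across alphas

cv_fits <- lapply(alphas, function(a)
  cv.glmnet(x, y, family = "cox", alpha = a, foldid = foldid))

best        <- which.min(sapply(cv_fits, function(f) min(f$cvm)))
best_alpha  <- alphas[best]
best_lambda <- cv_fits[[best]]$lambda.min
```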

Adaptive lasso

The adaptive lasso introduces a variable-specific weight \({w}_{j}\) into the lasso penalty11. The main objective is to penalize larger coefficients less than smaller coefficients, thereby reducing the bias of the penalized coefficient estimates obtained with lasso. The penalized negative log partial-likelihood is given by:

$$Q\left(\beta \right)=-pl\left(\beta \right)+\lambda \sum_{j=1}^{p}{w}_{j}\left|{\beta }_{j}\right|$$

The variable-specific weights \({w}_{j}\) take the form \(1/|\widetilde{{\beta }_{j}}|\), where \(\widetilde{{\beta }_{j}}, j=\mathrm{1,2},\dots, p,\) are initial estimates of the coefficients. We used the ridge estimates obtained from our data as \(\widetilde{{\beta }_{j}}\). The optimal value of the parameter λ can be found by cross-validation, as for lasso.
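A sketch of this two-stage procedure with glmnet is shown below (placeholder objects as before); the ridge estimates supply the weights through the penalty.factor argument.

```r
# Sketch: adaptive lasso for the Cox model. Ridge estimates are obtained first,
# then lasso is refitted with variable-specific penalty weights 1/|beta_ridge|.
library(glmnet)

y <- cbind(time = time, status = status)
cv_ridge   <- cv.glmnet(x, y, family = "cox", alpha = 0, nfolds = 10)
beta_ridge <- as.numeric(coef(cv_ridge, s = "lambda.min"))

w <- 1 / abs(beta_ridge)                     # variable-specific weights w_j
cv_alasso <- cv.glmnet(x, y, family = "cox", alpha = 1,
                       penalty.factor = w, nfolds = 10)
beta_alasso <- coef(cv_alasso, s = "lambda.min")
```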

Random survival forest

RSF is a nonparametric method that has been proposed for modeling survival data8. It combines the ideas of bootstrap aggregation and random selection of variables. In our work, survival trees were built according to the parameters recommended by the authors8 in the case of high-dimensional data, using the randomForestSRC package in R.

We also performed variable selection using the minimal depth method proposed by Ishwaran et al.12. This is a simple and robust method for selecting variables from high-dimensional survival data. Minimal depth evaluates the predictiveness of a variable by its depth relative to the root node of a tree. The idea can be made precise by defining a maximal subtree for a variable v as the largest subtree whose root node is split using v and for which no parent node of the subtree is split using v. The shortest distance from the root of the tree to the nearest maximal subtree of v is the minimal depth of v.
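A sketch of this workflow with the randomForestSRC package is shown below; dat and its columns are placeholders, and the tree-growing parameters recommended for high-dimensional data are not reproduced here, so the defaults should be adjusted accordingly.

```r
# Sketch: random survival forest and minimal depth variable selection.
library(randomForestSRC)
library(survival)

rsf_fit <- rfsrc(Surv(os_time, os_event) ~ ., data = dat,
                 ntree = 1000, importance = TRUE)   # permutation importance (VIMP)

# Minimal depth selection: variables whose minimal depth falls below the
# data-driven threshold are retained (returned in 'topvars').
md <- var.select(object = rsf_fit, method = "md")
md$topvars
```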

Bayesian variable selection methods

Stochastic search variable selection

Stochastic search variable selection (SSVS)13 was proposed for variable selection in the context of linear regression. In SSVS, the coefficients \({\beta }_{j}\) are assumed to follow a suitable Gaussian mixture prior, which places a positive prior probability on the hypothesis \({H}_{0}: {\beta }_{j}=0\). Such prior distributions, which are mixtures of two continuous distributions and place high probability mass close to zero, are referred to as “spike-and-slab” priors. The mathematical formulation of the SSVS prior setup is the following:

$${\beta }_{j}|{\gamma }_{j}\sim \left(1-{\gamma }_{j}\right)N\left(0,{\tau }_{j}^{2}\right)+{\gamma }_{j}N(0,{c}_{j}^{2}{\tau }_{j}^{2})$$
$${\gamma }_{j}|{p}_{j}\sim Bernoulli({p}_{j})$$
$${p}_{j}\sim Uniform(\mathrm{0,1})$$

Here, \({\gamma }_{j}\in \{\mathrm{0,1}\}\) acts as a latent indicator variable that facilitates variable selection. The parameter \({p}_{j}\) can be thought of as the prior probability that \({\beta }_{j}\) is non-zero, i.e., that \({X}_{j}\) should be included in the model. The parameters \({\tau }_{j}\) and \({c}_{j}\) are data-dependent tuning parameters: \({\tau }_{j}\) is set to be small so that if \({\gamma }_{j}=0\), \({\beta }_{j}\) can be estimated as 0, while \({c}_{j}\) is set to be large so that a non-zero estimate of \({\beta }_{j}\) can be included in the final model.

We fixed \({c}_{j}\) at a relatively large value, \(1/{\tau }_{j}\), and identified optimal values of \({\tau }_{j}\) using 10-fold cross-validation with grid search.

Spike-and-slab lasso Cox

Spike-and-slab lasso (sslasso)14 was proposed to integrate two popular methods—lasso and Bayesian spike-and-slab models—into one unifying framework. This method has also been extended to the Cox proportional hazards model to perform variable selection in survival analysis. The sslasso Cox model7 was developed by extending the DE prior into the spike-and-slab model, because lasso can be expressed as a hierarchical model with a DE prior on the coefficients. The mathematical formulation of the prior setup is the following:

$${\beta }_{j}|{\gamma }_{j}\sim \left(1-{\gamma }_{j}\right)DE\left(0,{s}_{0}\right)+{\gamma }_{j}DE(0,{s}_{1})$$
$${\gamma }_{j}|{p}_{j}\sim Bernoulli({p}_{j})$$
$${p}_{j}\sim Uniform(\mathrm{0,1})$$

where the preset scale value s0 is chosen to be small to induce strong shrinkage, whereas s1 is chosen to be large to induce weak shrinkage. The R package BhGLM implements the sslasso Cox prior formulation. To find optimal parameters, Tang et al.7 suggested setting the slab scale s1 to a relatively large value (e.g., 1) and using cross-validation to find an optimal value of s0.
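A minimal sketch of this tuning procedure with BhGLM is given below. It assumes the bmlasso() and cv.bh() interfaces described by the package authors (with the ss argument holding the spike and slab scales), and x, time, and status are placeholder objects; argument names and the structure of the cross-validation output should be checked against the package documentation.

```r
# Sketch: sslasso Cox via BhGLM, with s1 fixed at 1 and s0 tuned by
# cross-validation over a small grid (interface as described by Tang et al.).
library(BhGLM)
library(survival)

y <- Surv(time, status)
s0_grid <- seq(0.01, 0.10, by = 0.01)

fits <- lapply(s0_grid, function(s0)
  bmlasso(x, y, family = "cox", ss = c(s0, 1)))

# Cross-validate each fit and keep the s0 with the best cross-validated
# deviance; the exact names in cv.bh()'s 'measures' output may differ.
cvs <- lapply(fits, function(f) cv.bh(f, nfolds = 10))
dev <- sapply(cvs, function(cv) cv$measures["deviance"])
best_fit  <- fits[[which.min(dev)]]
beta_post <- coef(best_fit)   # coefficient estimates used for the thresholding rules
```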

Group spike-and-slab lasso Cox

Group structure can also be incorporated into the sslasso model by assigning a group-specific Bernoulli distribution to the indicator variables15. Suppose there are K groups, with the kth group containing \({m}_{k}\) variables. For a coefficient \({\beta }_{{k}_{j}}\) in group k, where k = 1, 2, …, K and \(j=1, 2, \dots , {m}_{k}\), the mathematical formulation is given by:

$${\beta }_{{k}_{j}}|{\gamma }_{{k}_{j}}\sim \left(1-{\gamma }_{{k}_{j}}\right)DE\left(0,{s}_{0}\right)+{\gamma }_{{k}_{j}}DE(0,{s}_{1})$$
$${\gamma }_{{k}_{j}}|{p}_{k}\sim Bernoulli({p}_{k})$$
$${p}_{k}\sim beta(a,b)$$

If group k includes important predictors, the parameter \({p}_{k}\) will be estimated to be relatively large, implying that the other predictors in the group are also likely to be important. For the probability parameters, a beta prior is adopted, which reduces to the uniform hyperprior \({p}_{k}\sim U(\mathrm{0,1})\) when \(a=b=1\).
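Continuing the sketch above, the group structure can in principle be supplied to bmlasso() through its group argument; the format assumed here (a list of column-name vectors) and the group definitions themselves are illustrative assumptions that should be checked against the BhGLM documentation.

```r
# Sketch: gsslasso Cox, passing hypothetical groups of predictors.
library(BhGLM)

grp <- list(group1 = c("x1", "x2", "x3"),   # hypothetical group definitions
            group2 = c("x4", "x5"))
fit_gss  <- bmlasso(x, y, family = "cox", ss = c(0.04, 1), group = grp)
beta_gss <- coef(fit_gss)
```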

For all methods, decision rules to determine hyperparameters are described in the Appendix.

Simulation study

During exploratory data analysis, it was observed that the data were characterized by extreme collinearity and were sparse in nature. To assess the ability of various methods to detect the true variables in the presence of these issues, a simulation study was conducted. The simulated data (n = 450 and p = 200, where n is the number of observations and p is the number of variables) were created by varying the number of true variables in the model, their effect size (the relative risk reduction associated with a one-unit increase in the variable), and the type of correlation structure among the explanatory variables. Survival times were generated from an exponential distribution.

Simulation settings

Following the simulation study conducted by Tibshirani9 and after understanding the structure of our data, we randomly generated blocks of correlated variables from the standard normal distribution with an autoregressive correlation structure, i.e., with homogeneous unit variances and with correlation (\(\rho \)) declining exponentially with distance within blocks: \({\Sigma }_{ij}={\rho }^{|i-j|}\). We considered \(\rho =0.9\) and block size = 50 in the simulation study. The number of true biomarkers in the model (q) was 5 or 10. To generate survival time, we assumed that median survival time was 4 years and considered two values of β in the simulations: \({\beta }_{LOW}\) and \({\beta }_{HIGH}\), where:

  • \({\beta }_{LOW}\): coefficients of true biomarkers between − 0.4 and − 0.1, and 0 otherwise.

  • \({\beta }_{HIGH}\): coefficients of true biomarkers between − 1 and − 0.5, and 0 otherwise.

The censoring time was generated from an exponential distribution with the rate parameter chosen to keep the censoring rate near 50%. In total, 4 designs with an autoregressive correlation structure were created as part of the simulation study. Information about the simulated datasets is summarized in Table 1, and a sketch of the data-generating process is shown below.

Table 1 Simulated data.
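The sketch below illustrates this data-generating process under the stated assumptions (block-diagonal AR(1) correlation, exponential survival and censoring times); the exact positions of the true biomarkers and the tuning of the censoring rate used in the study may differ.

```r
# Sketch: one simulated dataset with blocks of AR(1)-correlated covariates,
# exponential survival times under a Cox model, and exponential censoring.
library(MASS)

n <- 450; p <- 200; block <- 50; rho <- 0.9; q <- 5

ar1   <- rho^abs(outer(1:block, 1:block, "-"))    # AR(1) correlation within a block
Sigma <- kronecker(diag(p / block), ar1)          # block-diagonal correlation matrix
x     <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

beta <- c(runif(q, -0.4, -0.1), rep(0, p - q))    # beta_LOW setting; 0 elsewhere

# Baseline rate chosen so that median survival is ~4 years at linear predictor 0.
lambda0   <- log(2) / 4
surv_time <- rexp(n, rate = lambda0 * exp(as.vector(x %*% beta)))
cens_time <- rexp(n, rate = lambda0)              # tune this rate for ~50% censoring
time      <- pmin(surv_time, cens_time)
status    <- as.numeric(surv_time <= cens_time)
mean(status == 0)                                 # observed censoring proportion
```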

Measures for evaluation of results

The data were generated randomly under each design 100 times. The variable selection capability of different models was judged by computing three operating characteristics based on the parameter estimates: true positive rate (TPR or sensitivity), true negative rate (TNR or specificity), and false positive rate (FPR). The formulas for these operating measures are as follows:

$$TPR=\frac{Number \, of \, true \, variables \, correctly \, entered \, into \, the \, model}{Total \, number \, of \, true \, variables}$$
$$TNR=\frac{Number \, of \, irrelevant \, variables \, correctly \, excluded \, from \, the \, model}{Total \, number \, of \, irrelevant \, variables}$$
$$FPR=\frac{Number \, of \, irrelevant \, variables \, mistakenly \, entered \, into \, the \, model}{Total \, number \, of \, irrelevant \, variables}$$
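For concreteness, a small helper implementing these measures on a set of selected variable names is sketched below; the object names are placeholders, and the denominators follow the definitions given above (the numbers of true and irrelevant variables, respectively).

```r
# Sketch: selection operating characteristics from selected variable names.
# 'true_vars' are the q true biomarkers; 'all_vars' are the p candidate variables.
selection_metrics <- function(selected, true_vars, all_vars) {
  irrelevant <- setdiff(all_vars, true_vars)
  c(TPR = sum(selected %in% true_vars)     / length(true_vars),
    TNR = sum(!(irrelevant %in% selected)) / length(irrelevant),
    FPR = sum(selected %in% irrelevant)    / length(irrelevant))
}
```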

For penalized methods, variable selection was performed using the non-zero coefficient estimates. For RSF, variables were selected using the minimal depth procedure. For the Bayesian methods, however, because the posterior estimates are never exactly zero, variables were selected using three alternative rules:

  • Confidence interval (CI) rule: variables whose \(100\left(1-\alpha \right)\%\) CI for the coefficient estimate, \([\widehat{\beta }-{Z}_{\alpha /2}*SE\left(\widehat{\beta }\right), \widehat{\beta }+{Z}_{\alpha /2}*SE\left(\widehat{\beta }\right) ]\), does not contain 0 are selected by the model. Here, \({Z}_{\alpha /2}\) is the critical value for which the right-tailed area under the standard normal distribution is α/2, i.e.,

    $$P\left(Z>{Z}_{\alpha }\right)=\alpha $$

    where \(Z\sim N(\mathrm{0,1})\). We considered \(\alpha =0.25, 0.1,\;or\; 0.05\) for calculating the CIs.

  • BIC thresholding rule: Lee, Chakraborty, and Sun16 proposed the BIC thresholding rule for variable selection. Here, the absolute posterior estimates of \({\beta }_{j}\) are first arranged in descending order, and BIC values are computed in a stepwise manner by sequentially adding covariates. The BIC for the model containing the \(j\) variables with the largest absolute posterior estimates is written as:

    $${BIC}_{j}= -2\left(l\left({\widehat{\beta }}_{\left(1:j\right)}\right)-l\left(0\right)\right)+jlog(n)$$

    where n is the number of observations, \(l\left({\widehat{\beta }}_{\left(1:j\right)}\right)\) denotes the maximized log-likelihood under the model that includes the variables corresponding to the \(j\) largest absolute posterior estimates, denoted \({\widehat{\beta }}_{\left(1:j\right)}\), and \(l\left(0\right)\) denotes the log-likelihood under the null model. \({BIC}_{j}\) is computed by sequentially adding variables, and the model at which its minimum occurs is chosen. To shortlist variables before computing BIC, we considered the top 50 variables with the largest non-zero absolute posterior estimates (i.e., j ranges from 1 to 50). A sketch of both thresholding rules is given after this list.

  • Modified BIC thresholding rule: We also considered a modified version of the BIC thresholding rule above, proceeding as follows:

    1. Select the top 50 variables with the largest non-zero absolute posterior estimates.
    2. Include the variable with the largest absolute posterior estimate in the model.
    3. From the remaining variables, add the one whose inclusion gives the minimum BIC.
    4. Continue adding variables and computing BIC as in Step 3 until no variables remain.
    5. Take the model with the minimum BIC value along this path as the final model.

Our modified approach is likely to be less conservative in selecting variables than the original BIC approach. However, the computation time increases due to additional comparisons.
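A sketch of both thresholding rules is given below. It assumes a named vector of posterior coefficient estimates (beta_post, e.g., from a fitted Bayesian model) and a data frame dat with hypothetical columns os_time and os_event plus the covariates; following the formula above, BIC values are obtained by refitting Cox models with the survival package.

```r
# Sketch: BIC thresholding (original and modified) on posterior estimates.
library(survival)

bic_of <- function(vars, dat) {
  f   <- as.formula(paste("Surv(os_time, os_event) ~", paste(vars, collapse = " + ")))
  fit <- coxph(f, data = dat)
  # fit$loglik holds the null and fitted partial log-likelihoods
  -2 * (fit$loglik[2] - fit$loglik[1]) + length(vars) * log(nrow(dat))
}

# Shortlist: top 50 variables by absolute non-zero posterior estimate.
top50 <- head(names(sort(abs(beta_post[beta_post != 0]), decreasing = TRUE)), 50)

# Original rule: add variables in order of decreasing |estimate| and keep the
# model size at which BIC is minimized.
bic_seq      <- sapply(seq_along(top50), function(j) bic_of(top50[1:j], dat))
selected_bic <- top50[1:which.min(bic_seq)]

# Modified rule: greedy forward selection by BIC among the shortlisted
# variables, starting from the variable with the largest estimate.
current   <- top50[1]
remaining <- top50[-1]
path_bic  <- bic_of(current, dat)
while (length(remaining) > 0) {
  cand_bic  <- sapply(remaining, function(v) bic_of(c(current, v), dat))
  best_var  <- names(which.min(cand_bic))
  current   <- c(current, best_var)
  remaining <- setdiff(remaining, best_var)
  path_bic  <- c(path_bic, min(cand_bic))
}
selected_mod <- current[1:which.min(path_bic)]
```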

Ethics approval and consent to participate

The trial was conducted in accordance with the ethics principles of the Declaration of Helsinki and with the Good Clinical Practice guidelines defined by the International Council for Harmonization. All the patients provided written informed consent. The experimental protocol and amendments were approved by Pfizer.

Results

Simulated data

Results were obtained for all 100 replications of the four designs considered in the study, and the average TPR, TNR, and FPR for each method are shown in Tables 2, 3, 4 and 5. In our simulation study, we compared penalized regression methods, RSF, and Bayesian spike-and-slab models. Optimal parameters were found for each model using the cross-validation procedures described in the Methods section. Table 2 shows that elastic net had the highest TPR (sensitivity) among the penalized regression methods across designs, followed by lasso, adaptive lasso, and RSF. However, RSF, followed by lasso and elastic net, had high FPRs and consequently lower specificity than adaptive lasso. RSF selected a very large number of variables, resulting in an extremely high FPR and poor performance across designs. Adaptive lasso achieved a better balance between TPR and FPR and had a substantially lower FPR than RSF and the other two penalized regression methods. This decrease in FPR came at the cost of fewer true discoveries, resulting in relatively low sensitivity compared with the other models.

Table 2 Simulation results for penalized regression methods and RSF.
Table 3 Simulation results for Bayesian methods with BIC rule for variable selection.
Table 4 Simulation results for gsslasso Cox prior with CI rule for variable selection.
Table 5 Simulation results for SSVS prior with CI rule for variable selection.

The results for the Bayesian methods are summarized in Tables 3, 4, and 5. In Table 3, we compare our modified BIC approach with the original BIC approach for both Bayesian models. The modified approach performed better than the original approach for both models. Of the two Bayesian models, gsslasso Cox performed better, showing higher sensitivity and specificity than the SSVS model.

For the CI rule, Table 4 shows that the 90% CI rule performed better than the other choices for the gsslasso Cox prior owing to its higher sensitivity. For the SSVS model, as shown in Table 5, the 95% CI rule was the most appropriate choice owing to the high FPRs of the other two CI levels. Overall, however, SSVS performed less well than gsslasso Cox.

From our simulation study, the adaptive lasso and gsslasso Cox were concluded to be the most appropriate models for variable selection from the two classes of models. Both had moderate sensitivity and high specificity. Additionally, the gsslasso Cox model had the advantage of incorporating the group structure into the model. Because keeping the false discovery rate low is highly important in a clinical setting, both methods were concluded to be more appropriate for variable selection in the presence of high collinearity than their counterparts.

Real data

We assessed the performance of the various methods described in Section “Methods” using data from a recent phase 3 trial: JAVELIN Bladder 100 (NCT02603432). The data contained information from 688 patients for 189 variables. Among the 189 variables, two were related to the OS outcome (OS_EVENT and OS), one was the treatment arm (TRT01P1; “avelumab + BSC” is the treatment arm and “BSC” is the control arm), four were baseline patient characteristics (SEX, AGE, STRATI11 [best response to first-line chemotherapy], and STRATI21 [metastatic disease site at first-line chemotherapy]), and the remaining 182 were biological features (biomarkers) of interest.

There were 344 observations in the treatment group and 344 in the control group (Table 6). Of the 688 observed times, 320 were uncensored (failure times) and 368 (53.4%) were censored. However, only 429 of the 688 observations had complete data for all variables (37.6% of observations had missing data). After excluding observations with missing data, 222 observations remained in the treatment group and 207 in the control group. Of these 429 observed times, 198 were uncensored and 231 (53.8%) were censored. For our analysis, we excluded all observations with missing data and analyzed complete cases only.

Table 6 Data description.

The data (excluding variables related to the OS outcome) originally consisted of 5 binary variables and 182 numerical variables. Three new binary variables were created and included in the analysis. After feature engineering, the data had 8 binary variables and 178 numerical variables. The data also showed a high degree of sparsity in some features, with 12 of 186 numeric variables (excluding OS_EVENT and OS from the total 188 variables) having at least 60% zero values. Although some of the sparse features had a small number of unique values, they were still treated as numerical variables in our analysis.

In exploratory data analysis, we found that the data exhibited severe multicollinearity. Additionally, the variables had a natural group structure, with 5 groups of varying sizes and some variables not belonging to any group (listed in Supplement S2). The variables within a group exhibited both “inner correlation” (correlation among themselves) and “outer correlation” (correlation with variables outside their own group).

In summary, the main characteristics of our data are:

  1. Severe collinearity among the explanatory variables,
  2. High dimensionality relative to the available sample size,
  3. Sparsity in the data (8.06%), and
  4. A high percentage of censored observations (53.8%).

Comparing the full data with the treatment-only data, we found that collinearity was more severe in the treatment-only data. The percentage of censored observations also increased in the treatment-only data (59.4% vs 53.8%), whereas the sparsity remained similar (7.82% vs 8.06%).

We analyzed both the full data, which contained observations from patients assigned to the treatment and control arms, and the subset containing only observations from patients assigned to treatment. Optimal parameters for all methods except RSF were found using the cross-validation procedures described above.

For RSF, we used the parameter values recommended by Ishwaran et al.8. RSF gave very poor predictions, with high prediction error rates for both the full and treatment-only data (46.96% and 46.48%, respectively). Tuning the parameters did not reduce the prediction error. We nevertheless performed variable selection by shortlisting the top 15 variables with the highest variable importance. The results are presented in two supplementary tables (Tables S1 and S2) that summarize the variables selected by the different methods. For the Bayesian models, based on their performance in the simulation study, we used only the 90% or 95% CI rule and our modified BIC approach for variable selection.

Full data

Table 7 reports the results for the analysis of the full data, showing only those variables selected by at least two of the methods considered in our assessment; the full list of selected variables is reported in Table S1. No variable was selected by all the models. However, all the penalized models, gsslasso Cox, and RSF selected the treatment variable, validating its relevance. Very few variables were selected by both penalized and Bayesian methods. Apart from the treatment variable, only “cytopro.effector_memory_CD8.positive_alpha.beta_T_cell,” “IC_PD_L1_Status1,” and “LM22.Mast_cells_activated” were selected by both classes of methods. All the penalized models selected the variables “cytopro.neutrophil,” “cytopro.effector_memory_RA_CD8.positive_alpha.beta_T_cell_.TEMRA,” and “STRATI21,” whereas these were not identified as relevant by the Bayesian methods. In contrast, the Bayesian methods selected “LM22.T_cells_CD8” and “LM22.NK_cells_resting,” whereas these were not identified as relevant by the penalized models. These differences highlight a stark contrast between the two classes of methods in selecting variables from the data, making the relevance of these variables unclear. However, the substantially lower false discovery rate of the Bayesian methods compared with the penalized models favors the Bayesian methods. Furthermore, gsslasso Cox performed better than sslasso Cox owing to the incorporation of group structure in the model.

Table 7 Results (full data).

Treatment-only data

Table 8 reports the results for the analysis of the treatment-only data, showing only those variables selected by at least two of the methods considered in our assessment; the full list of selected variables is reported in Table S2. The results were similar to those observed in the analysis of the full data. As observed for the treatment variable in the full data, “STRATI21” was selected by all of the penalized models and by gsslasso Cox, but not by RSF. All of the penalized models selected the variables “Number_high_affinity_FCGR_alleles” and “cytopro.CD8.positive_alpha.beta_T_cell,” whereas these were not identified as relevant by the Bayesian methods. In contrast, the Bayesian methods selected “HALLMARK_IL2_STAT5_SIGNALING” and “LM22.Mast_cells_activated,” whereas these were not identified as relevant by the penalized models. “TUMOR_CELL_STAINING_analytePDL1” and “cytopro.effector_memory_CD8.positive_alpha.beta_T_cell” were selected by both the penalized and Bayesian models.

Table 8 Results (treatment-only data).

Discussion

We assessed various methods for variable selection in survival analysis. Our objectives were to identify important variables in a recent clinical trial dataset using different methods and to find a suitable method for variable selection. To understand the properties of the different methods, we performed a simulation study that reflected key issues present in our data. In particular, we assessed the performance of penalized regression models, RSF, and Bayesian spike-and-slab models in the presence of groups of highly correlated variables with low signals within the data. In the simulation study, we found that RSF selected the highest number of variables, followed by the penalized regression methods and the Bayesian methods, and had higher sensitivity. However, because of its less restrictive nature in selecting variables, RSF also had a higher FPR than the Bayesian methods. Elastic net had the highest sensitivity and the lowest specificity among the classic penalized regression models in all scenarios. Adaptive lasso achieved a balance between TPR and FPR, with a substantially lower FPR than lasso and elastic net at the cost of a marginal decrease in sensitivity. Among the Bayesian methods, the gsslasso Cox model had the best overall performance in all scenarios owing to its extremely high specificity and moderate sensitivity. We also considered various rules for variable selection because, unlike in penalized regression models, posterior estimates of coefficients are not exactly zero. For the gsslasso Cox model, we found that the 90% CI rule and our modified BIC approach provided similarly good results. However, for the SSVS model, the CI rule performed poorly in terms of specificity across designs and was inferior to the BIC approach. Following the simulation study, we applied all of the methods studied, including the penalized regression models, the Bayesian spike-and-slab models, and RSF, to the clinical trial data but observed poor performance. Results for the penalized regression models and the Bayesian spike-and-slab models were somewhat inconsistent, with few variables identified by both classes of models in either the full or the treatment-only data. These differences made the relevance of the selected variables unclear. However, variables selected by both classes of models can be considered potentially significant. In the full data, “TRT01P1,” “IC_PD_L1_Status1,” and “LM22.Mast_cells_activated” were selected by both model classes, and in the treatment-only data, “STRATI21” and “TUMOR_CELL_STAINING_analytePDL1” were selected by both model classes. Because the Bayesian spike-and-slab models had very low false discovery rates, the probability that these common variables were false discoveries is low.

RSF was also applied to the same datasets, but its performance was poor owing to high prediction error. Tuning the parameters did not improve the model, but we still performed variable selection using the minimal depth procedure. Despite its poor prediction performance, some of the variables selected by RSF were also selected by other methods.

There remains much room for further research. We only considered main effects in our regression models. A future direction of work may focus on incorporating interaction effects in the regression model in the presence of a high amount of noise in the data.

In conclusion, we assessed various well-known variable selection methods and identified potentially significant variables or biomarkers. However, these methods may not be well suited to analyzing this type of dataset because of the presence of extreme collinearity and low signal. The gsslasso Cox model can partially overcome the collinearity owing to its ability to incorporate group structure; however, it does not perform well when signals are weak.