Detecting survival-associated biomarkers from heterogeneous populations

Detection of prognostic factors associated with patients’ survival outcomes helps to gain insights into a disease and to guide treatment decisions. The rapid advancement of high-throughput technologies has yielded plentiful genomic biomarkers as candidate prognostic factors, but most are of limited use in clinical applications. As the price of the technology drops over time, many genomic studies are conducted to explore a common scientific question in different cohorts in order to identify more reproducible and credible biomarkers. However, new challenges arise from heterogeneity in study populations and designs when multiple studies are analyzed jointly. For example, patients from different cohorts show different demographic characteristics and risk profiles. Existing high-dimensional variable selection methods for survival analysis are nonetheless restricted to single-study analysis. We propose a novel Cox-model-based two-stage variable selection method called “Cox-TOTEM” to detect survival-associated biomarkers common to multiple genomic studies. Simulations showed that our method greatly improved the sensitivity of variable selection compared with applying existing methods separately to each study, especially when the signals are weak or when the studies are heterogeneous. An application of our method to TCGA transcriptomic data identified essential survival-associated genes related to the common disease mechanism of five Pan-Gynecologic cancers.


ADMM algorithm for solving the optimization problem at the regularization stage
The optimization problem (7) at the regularization stage of the Cox-TOTEM algorithm is solved by the alternating direction method of multipliers (ADMM). For notational simplicity, we suppress the subscript $M^{[1]}$ in $\beta^{(k)}_{M^{[1]}}$ and use $d$ in place of $d_1$. We also write $X^{(k)}$, instead of $X^{(k)}_{M^{[1]}}$, to denote the features selected in the screening stage. Let $\ell_k(\beta^{(k)})$ be the partial log-likelihood function for the $k$th study of size $n_k$, where $\beta^{(k)} = (\beta^{(k)}_1, \ldots, \beta^{(k)}_d)^T$ is the vector of regression coefficients in the Cox model of the $k$th study. Let $\beta_j = (\beta^{(1)}_j, \ldots, \beta^{(K)}_j)^T \in \mathbb{R}^K$ be the vector consisting of the $j$th elements of the $\beta^{(k)}$'s. The optimization problem (7) can be written in the following form:

Let $y \in \mathbb{R}^{d \times K}$, $z \in \mathbb{R}^{d \times K}$, and $u \in \mathbb{R}^{d \times K}$. We denote the rows and columns of these matrices, written as column vectors, by e.g. $y_{j\cdot} = (y_{j,1}, y_{j,2}, \ldots, y_{j,K})^T$ and $y_{\cdot k} = (y_{1,k}, y_{2,k}, \ldots, y_{d,k})^T$. Then the equivalent optimization problem is

Now, the ADMM consists of the iterations
at the $(m + 1)$th update. Here $x_+ = \max\{x, 0\}$. The update of $y_{\cdot k}$ can be carried out by the Newton-Raphson algorithm. A stopping criterion is for a prescribed $\epsilon > 0$.
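The structure of these iterations can be illustrated with a minimal Python sketch. It assumes the standard scaled-form ADMM for a group lasso problem, that is, minimizing $f(y) + \lambda \sum_j \|z_{j\cdot}\|_2$ subject to $y = z$, with $u$ as the scaled dual variable. The penalty parameter `rho`, the function names, the residual-based stopping rule, and the toy quadratic loss standing in for the negative partial log-likelihood are all assumptions made for illustration and are not taken from the paper. In the Cox setting, the `y_update` callback would instead run Newton-Raphson steps on each column $y_{\cdot k}$.

```python
import numpy as np


def group_soft_threshold(v, kappa):
    """Row-wise group soft-thresholding: (1 - kappa / ||v_j||_2)_+ * v_j."""
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    scale = np.maximum(1.0 - kappa / np.maximum(norms, 1e-12), 0.0)
    return scale * v


def admm_group_lasso(y_update, d, K, lam, rho=1.0, eps=1e-8, max_iter=1000):
    """Scaled-form ADMM for min f(y) + lam * sum_j ||z_j||_2 subject to y = z.

    y_update(v, rho) is a user-supplied (hypothetical) callback returning
    argmin_y f(y) + (rho / 2) * ||y - v||_F^2.
    """
    y = np.zeros((d, K))
    z = np.zeros((d, K))
    u = np.zeros((d, K))
    for _ in range(max_iter):
        y = y_update(z - u, rho)                    # smooth-loss update
        z_old = z
        z = group_soft_threshold(y + u, lam / rho)  # penalty update, uses (x)_+
        u = u + y - z                               # scaled dual update
        primal = np.linalg.norm(y - z)              # assumed stopping rule:
        dual = rho * np.linalg.norm(z - z_old)      # primal and dual residuals
        if max(primal, dual) < eps:
            break
    return z


# Toy usage: f(y) = 0.5 * ||y - a||_F^2, whose y-update has a closed form.
a = np.array([[3.0, 4.0], [0.3, 0.4]])
z_hat = admm_group_lasso(lambda v, rho: (a + rho * v) / (1.0 + rho), 2, 2, lam=1.0)
```

For this toy loss the minimizer is the group soft-thresholding of `a` itself, which gives a simple check that the loop converges: the row with small norm is set to zero as a whole group, while the row with large norm is shrunk toward zero.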
2 Sensitivity analysis results for the choice of $\alpha_1$ and $\alpha_2$

3 Cross-validation scheme to select $\lambda$
To select the optimal tuning parameter $\lambda$ in the group lasso problem, we propose a multi-study cross validation procedure. The proposed methodology is characterized by (1) k-fold cross validation within studies, and (2) prediction of survival as the performance measure. One way to perform cross validation in the analysis of multiple studies is to leave one study out, as in Zhu et al. (2017). The idea behind this approach is that all studies are more or less comparable to each other. Our proposed group lasso, however, aims at identifying potentially different sets of predictors in each study while borrowing strength from the others. Because multiple data sets exhibit selection bias and heterogeneity in patient characteristics, keeping data from all studies in cross validation serves the objective of the proposed group lasso well.

The choice of performance measure in cross validation is closely related to the objective of the statistical analysis. The existing literature (van Houwelingen et al., 2006; Simon et al., 2011; Dai and Breheny, 2019) used variants of the partial likelihood to measure goodness of fit through the Kullback-Leibler divergence. In this paper, we adopt the prediction of individual survival at a fixed time point. This intuitive measure is easy to interpret and simple to compute. The estimated survival probability for the $i$th subject in the $k$th study at time $t$ in the $l$th cross validation is

$$\hat{S}^{(k)}_{i,CV_l}(t) = \exp\left\{-\hat{\Lambda}^{(k)}(t)\,\exp\left(\hat{\beta}^{(k)T}_{CV_l} X^{(k)}_i\right)\right\},$$

where $X^{(k)}_i$ is the covariate vector of the $i$th subject, $\hat{\beta}^{(k)}_{CV_l}$ is the group lasso estimator, and $\hat{\Lambda}^{(k)}(t)$ is the corresponding Breslow estimator of the cumulative hazard function. We predict that a subject survives if $\hat{S}^{(k)}_{i,CV_l}(t) > 0.5$ and fails otherwise. With this performance measure, we perform multi-study cross validation as follows:
1. For each study, divide the data into $L$ pieces.
2. Leave one piece out from each study at the same time to create the training data, and then apply the group lasso algorithm.
3. Predict survival at a prescribed time $t$ for the uncensored subjects in the testing data and compare the predictions with the observed outcomes.
4. Repeat steps 2 and 3 for all $L$ pieces.
5. Select the $\lambda$ that achieves the most successful average prediction (i.e., maximizes the prediction accuracy).
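
The five steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. Each study is assumed to be a dictionary with arrays `X`, `time`, and `event` (1 for an observed failure, 0 for censoring), and `fit` is a hypothetical user-supplied function that takes the training studies and a value of $\lambda$ and returns one coefficient vector per study. The Breslow estimator is written in its usual form for the baseline cumulative hazard, and the survival probability follows the Cox model with the 0.5 threshold described above.

```python
import numpy as np


def breslow_cumhaz(time, event, X, beta, t):
    """Breslow estimate of the baseline cumulative hazard at time t."""
    risk = np.exp(X @ beta)
    total = 0.0
    for i in np.flatnonzero((event == 1) & (time <= t)):
        # Sum of relative risks over subjects still at risk at time[i].
        total += 1.0 / risk[time >= time[i]].sum()
    return total


def predict_survived(X_new, beta, cumhaz_t):
    """Predict survival past t when the estimated probability exceeds 0.5."""
    surv = np.exp(-cumhaz_t * np.exp(X_new @ beta))
    return surv > 0.5


def multi_study_cv(studies, lambdas, fit, t, L=5, seed=0):
    """Within-study L-fold cross validation shared across all studies."""
    rng = np.random.default_rng(seed)
    # Step 1: divide each study into L pieces (random, roughly equal sizes).
    folds = [rng.permutation(len(s["time"])) % L for s in studies]
    accuracy = {}
    for lam in lambdas:
        correct = 0
        total = 0
        for l in range(L):  # Step 4: repeat for all L pieces.
            # Step 2: leave piece l out of every study at the same time.
            train = [{key: val[f != l] for key, val in s.items()}
                     for s, f in zip(studies, folds)]
            betas = fit(train, lam)
            # Step 3: predict on uncensored test subjects and compare.
            for s, f, tr, beta in zip(studies, folds, train, betas):
                test = (f == l) & (s["event"] == 1)
                cumhaz_t = breslow_cumhaz(tr["time"], tr["event"], tr["X"], beta, t)
                pred = predict_survived(s["X"][test], beta, cumhaz_t)
                correct += int(np.sum(pred == (s["time"][test] > t)))
                total += int(test.sum())
        accuracy[lam] = correct / max(total, 1)
    # Step 5: select the lambda with the highest prediction accuracy.
    best = max(accuracy, key=accuracy.get)
    return best, accuracy
```

Because the folds are drawn within each study, every training set contains subjects from all studies, which matches the motivation given above for not leaving a whole study out. Ties in accuracy are broken in favor of the first value in `lambdas`, which is an arbitrary choice of this sketch.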