## Background

The multikinase inhibitor, sorafenib, was the first agent shown to offer survival benefit to patients with advanced hepatocellular carcinoma (aHCC) in a prospective placebo-controlled clinical trial (The SHARP trial).1 Similar results were subsequently reported from the analogous Asia-Pacific study.2 Sorafenib remains3 the current standard of care for patients with aHCC in most countries offering a survival advantage of about three months compared to placebo. Although there are several potential new treatments for aHCC,4 sorafenib remains a first line treatment.5

Median overall survival (OS) for patients undergoing sorafenib treatment is about 10 months.1,6,7 However, at the individual patient level, there is a wide variation ranging from 3 months to more than 2 years.8,9As part of a personalised approach to sorafenib treatment, information on a patient’s baseline clinical features can potentially be utilised to make individual survival predictions. Previous studies looked at utility of current HCC staging systems in survival prediction but, none of them were found to be optimal.10 Even within the Barcelona Clinic Liver Cancer staging system, where sorafenib is the recommended treatment (BCLC C), better survival varied considerably according to baseline clinical features.9

Subsequent studies have examined prognostic factors affecting OS among sorafenib patients as well as other variables predictive of sorafenib OS benefit. Factors found to influence prognosis included neutrophil-to-lymphocyte ratio (NLR), alpha-fetoprotein (AFP), tumour size/stage, extra-hepatic spread (EHD), Child-Pugh score, aspartate aminotransferase (AST), compensated cirrhosis, ascites, macroscopic vascular invasion (MVI), performance status (PS), albumin and bilirubin levels.9,11,12,13,14,15,16 Variables that predict greater sorafenib benefit compared to placebo included lower neutrophil-to-lymphocyte ratio, absence of extra-hepatic spread (EHS) and being hepatitis C positive (HCV).8,16,17,18,19,20

In view of the aforementioned wide survival variation, we develop here for the first time a statistical model that predicts individual patient survival, using baseline clinical and laboratory features including those of prognostic and/or predictive significance.

## Methods

### Sorafenib clinical trials

We had access to the sorafenib control arms of two multinational phase III randomised clinical trials (RCT) that compared sorafenib with each of brivanib6 and sunitinib.7 The brivanib trial sorafenib arm consists of 588 patients with aHCC accrued between 2009 and 2011. The sunitinib trial sorafenib arm comprised 542 patients with aHCC, recruited between July 2008 and May 2010. Inclusion criteria (see Supplementary table 1) in both trials were very similar. All data were obtained with permission from Bristol-Myers Squibb and Pfizer respectively.

### Variables

Baseline variables available for analysis that were common to both trials are shown in Table 1. All variables were measured at baseline before the start of the treatment. Those used for model building were: age, sex, race, Eastern Cooperative Oncology Group (ECOG) score, creatinine, bilirubin, AST, AFP, albumin, international normalised ratio (INR), aetiology, tumour type (solitary/multiple), tumour size, presence of extra-hepatic spread and vascular invasion. Aetiology was categorised as HCV-related, hepatitis B virus (HBV)-related and “other”. The “other” group of aetiology also included patients with no known risk factors. Extra-hepatic spread included patients with lymph node involvement. Tumour number was presented as a binary variable (“solitary” or “multiple”) rather than discrete because there was more missing data in the latter (6.1% versus 11.2%). This discrepancy in missingness is due to some patients being marked as having multiple tumours rather than the actual number being given. Child-Pugh grade was excluded from the model building and replaced instead by its derivation components albumin, bilirubin and INR. The amount of missingness for each variable is summarised in Supplementary table 2. Patients with missing data in any of the listed variables above were excluded from the modelling analysis.

### Statistical methods

Analysis was carried out using Stata/SE 14.1 (StataCorp, Texas, USA). Continuous variables were reported as mean (with standard deviation [SD]) or median (with interquartile range [IQR]), the latter for variables with highly skewed distributions. Categorical variables were presented as counts and percentages. Continuous variables which exhibited extreme skewness were log transformed. Overall survival (OS) was measured from date of randomisation until date of death (any cause). Patients who were still alive were censored at their date of last follow-up. Overall survival curves for each dataset were plotted using the Kaplan–Meier (KM) method and median survival with 95% confidence interval (95% CI) were reported. Survival distributions were compared using hazard ratio (HR) and corresponding p values.

Univariable and multivariable analyses were undertaken using a flexible parametric survival model (see below). A multivariable model that predicts survival based on the baseline clinical features of patients was built using the sorafenib arm of the brivanib trial (training set). The sorafenib arm from the sunitinib trial was used as a validation set. The model was validated and tested according to the methodology suggested by Royston and Altman.21

#### Flexible parametric survival model

Flexible parametric survival regression was proposed by Royston and Parmar22 in 2002. It is an extension of the Weibull model, and models the log baseline cumulative hazard using restricted cubic splines. The mathematical basis of the model and its advantages over the traditional Cox regression is described in the supplementary appendix titled “The flexible parametric survival model” and Supplementary figure 1.

#### Model building and validation

Univariable analysis was undertaken to examine the prognostic influence of each individual variable. The hazard ratio, 95% CI and p values were reported. A multivariable model was then built using a stepwise backward selection of variables significant at the 5% level. Any strong interactions between the variables in the model were also examined. The hazard ratio, 95% CI and p values for the multivariable model were reported.

A time-dependent (TD) effect for each variable in the final model was sequentially added and tested using the likelihood ratio (LR) test to inspect for any proportion hazards violation. The optimal degrees of freedom (d.f.) or knots for the restricted cubic spline function was chosen by testing and comparing different d.f. using the LR test. The functional forms of the continuous variables were examined by plotting a smoothed curve through Martingale residuals estimates, with zero gradient signalling an appropriate form.

The linear predictor of the final model was then derived using its coefficients. In order to generate four risk categories, previously suggested cut-offs21 at the 15th, 50th and 85th centiles were applied to the linear predictor of the training set. Subsequent model predictions were grouped according to this classification. Individual-level survival prediction was undertaken by calculating the survival function at time t (i.e. probability of a patient surviving past time t). The method for deriving the survival function formula is described in the supplementary appendix.

The derived model was validated on the sorafenib arm of the second RCT (sunitinib trial). KM survival curves according to the risk categories were plotted and visually inspected for both the training and validation sets. Median OS and HR were also calculated for each risk category.

#### Model calibration

Model predictions according to the risk categories were visually inspected by overlaying the observed KM and predicted mean survival curves into one graph and examining how closely they agree. In addition to this, the corresponding observed versus predicted median survival as well as observed versus predicted percentage survival at 12 months were derived and reported. This was carried out for both training and validation sets.

#### Model discrimination

Model discriminative performance was measured using Harrell’s c-index23 and Royston-Sauerbrei’s R2D.24 Harrell’s c-index measures the proportion of patient pairs where the survival predictions and observed outcomes are in agreement with respect to rank. R2D assesses the level of explained variation on the log relative hazard scale. Model parameters derived from the training set were first applied to the validation set before calculation of Harrell’s c-index and R2D. A higher value of Harrell’s c-index and R2D is indicative of better model discrimination.

#### Missing data

In order to investigate the nature of missingness and its effect on the final model parameters, multiple imputation of missing data using chained equations25,26,27 was undertaken and coefficients and p values between the complete case final model and the one with the imputed data were compared for any divergence.

## Results

Both sorafenib arms had similar baseline features (Table 1) with the exception of the presence of Child-Pugh B patients (n = 46) in the brivanib trial patients (none in the sunitinib trial). Comparing the KM plots between the two datasets showed that there was no evidence of a statistical difference in survival (HR = 1.11, 95% CI: 0.97, 1.28, p = 0.128) (Supplementary figure 2).

### Univariable and multivariable analyses

Supplementary table 3 reports the hazard ratios, 95% CI and p values of the univariable flexible parametric survival models. They show that age, ECOG, aetiology, tumour size, extra-hepatic spread, vascular invasion, log(bilirubin), log(AST), log(AFP), albumin and INR were statistically significant prognostic factors.

Table 2 shows the variables that were selected for the final multivariable model, along with the hazard ratios, 95% CI and p values. The variables in the final model were vascular invasion, age, ECOG, log(AFP), albumin, log(creatinine), log(AST), extra-hepatic spread and aetiology, along with an interaction between age and vascular invasion.

Partitioning the linear predictor using the prescribed cut-offs produced four distinct risk categories (ranked 1 to 4) in both the training and validation sets (Fig. 1a, b). The observed group-level median OS was comparable in both the training and validation sets, ranging approximately from 4 months in risk category 4 to 30 months in risk group 1.

The curves in Fig. 1 were generated as follows. First the linear predictor was derived:

$$\begin{array}{l}{\mathrm{Linear}}\,{\mathrm{predictor,}}\,{\mathrm{\eta }} = \left( {{\mathrm{0}}{\mathrm{.327}}\,{\times}\,{\mathrm{vascular}}\,{\mathrm{invasion}} \ast } \right) + \\ \left( { - {\mathrm{0}}{\mathrm{.0231}}\, \times \,\left( {{\mathrm{Age}} - {\mathrm{60}}} \right)} \right) + \\ \left( {{\mathrm{0}}{\mathrm{.0303}}\,{\mathrm{ \times }}\,\left[ {\left( {{\mathrm{Age}} - {\mathrm{60}}} \right){\times}\,{\mathrm{vascular}}\,{\mathrm{invasion}} \ast } \right]} \right){\mathrm{ + }}\\ \left( {{\mathrm{0}}{\mathrm{.455}}\,{\times}\,{\mathrm{ECOG}} \ast \!\ast } \right) + \\ \left( {{\mathrm{0}}{\mathrm{.0831}}\,{\times}\,{\mathrm{ln}}\left( {{\mathrm{AFP}}} \right)} \right) + \\ \left( {{\mathrm{ - 0}}{\mathrm{.0553}}\,{\times}\,{\mathrm{albumin}}} \right) + \\ \left( {{\mathrm{0}}{\mathrm{.709}}\,{\mathrm{x}}\,{\mathrm{ln}}\left( {{\mathrm{creatinine}}} \right)} \right) + \\ \left( {{\mathrm{0}}{\mathrm{.349}}\,{\times}\,{\mathrm{ln}}\left( {{\mathrm{AST}}} \right)} \right) + \\ \left( {{\mathrm{0}}{\mathrm{.298}}\,{\times}\,{\mathrm{extra}} {\hbox{-}} {\mathrm{hepatic}}\,{\mathrm{spread}} \ast } \right) + \\ \left( {\it{0.526}} \, \times \, {{HBV}} \ast \right) + \\ \left( {{\it{0}}{\it{.507}}\,{\times} \,{{other}}\,{{aetiology}}\,{{if}}\,{{not}}\,{{HCV}}/{{HBV}} \ast } \right)\end{array}$$

Where

$$\begin{array}{l}\ast {\mathrm{0}} = {\mathrm{no}}\,{\mathrm{and}}\,{\mathrm{1}} = {\mathrm{yes,}}\, \\ \ast \!\ast {\mathrm{0}} = {\mathrm{ECOG}}\ 0 \ {\mathrm{and}}\,{\mathrm{1}} = {\mathrm{ECOG}} \, 1\end{array}$$

HCV is the reference group

The following cutoffs were then applied to the linear predictor, η, to generate the four risk categories: ≤2.898 (risk category 1), >2.898 to ≤3.666 (risk category 2), >3.666 to ≤4.559 (risk category 3) and >4.559 (risk category 4).

To calculate the survival function for an individual patient at time t, the following three steps were undertaken. The derivation of these equations is explained in more detail in the supplementary.

(1) The log cumulative baseline hazard (spline function) at time t was derived as follows:

$${{s}}\left( {{\mathrm{log}}\,{{t}}} \right) = - {\mathrm{8}}{\mathrm{.167}} + {\mathrm{2}}{\mathrm{.465}}\,{\mathrm{log}}\,{{t}} + {\mathrm{0}}{\mathrm{.113}}\,{{z}}_{\mathrm{2}}$$

where $${{{z}}_{\mathrm{2}} = \left( {{\mathrm{log}}\,{{t}} - {\mathrm{1}}{\mathrm{.833}}} \right)_ + ^3 - {\mathrm{0}}{\mathrm{.361}}\,\left( {{\mathrm{log}}\,{{t}} + {\mathrm{1}}{\mathrm{.017}}} \right)_ + ^3 - {\mathrm{0}}{\mathrm{.639}}\,\left( {{\mathrm{log}}\,{{t}} - {\mathrm{3}}{\mathrm{.443}}} \right)_ + ^3}$$ The “+” notation denotes (x)+  = max(0, x)

(2) Baseline survival function, at time t, was expressed as:

$${{S}}_{\mathrm{0}}\left( {{t}} \right) = {\mathrm{exp}}\left( { - {\mathrm{exp}}\left( {{{s}}\left( {{\mathrm{log}}\,{{t}}} \right)} \right)} \right)$$

(3) Survival function, S(t), at time t for an individual subject can then be calculated by:

$${{S}}\left( {{t}} \right) = {{S}}_{\mathrm{0}}\left( {\mathrm{t}} \right)^{{\mathrm{exp(\eta )}}}$$

where η is the linear predictor.

The values for S0(t) at time points 3, 6, 12, 18 and 24 months were 0.997, 0.991, 0.977, 0.965 and 0.955 respectively. For other time points, S0(t) can be calculated by following Steps 1 and 2.

For example, a patient with the following baseline features: No vascular invasion, 71 years of age, ECOG 1, AFP 850.3 ng/ml, albumin 46 g/l, creatinine 35.36 μmol/l, AST 29 U/l, no EHD and “other” aetiology (non-HBV, non-HCV) will have a survival function of 97%, 90%, 77%, 67% and 59% at times 3, 6, 12, 18 and 24 months, respectively.

An online calculator to generate the survival predictions at a group and individual level is available at: https://jscalc.io/calc/oGSDLHDsDg9g2XBF

### Model calibration

Observed KM and model-predicted survival curves according to the risk categories were closely matched in both the training and validation sets (Fig. 2a, b). This was also reflected by the similarities between the observed and predicted median OS as well as observed and predicted percentage survival at 12 months (Table 3) in both datasets. There was some discrepancy, however, between the observed and predicted median OS in the lowest risk category of the training set although the percentage survival at 12 months was almost identical (78 vs. 76%). Table 3 shows the median OS, percentage survival at 12 months, hazard ratio and p value according to each risk category.

### Model discrimination

There was a slight fall in the Harrell’s c-index (from 0.72 to 0.70) and R2D (from 0.27 to 0.18) in the validation set compared to the training set (Table 3); signalling a small deterioration in predictive power of the model. However, both figures were indicative of good prediction.

### All patients combined

Since both the training and validation sets showed similar survival both overall and within each risk category, they were merged and KM survival curves (Fig. 3a) and calibration plots (Fig. 3b) were generated.

Observed percentage survival at 12 months for patients within risk categories 1 to 4 were 76%, 55%, 28% and 6% respectively, with corresponding HR (in comparison to risk category 1) being 1.93 (95% CI: 1.46, 2.54; p < 0.0001), 3.68 (95% CI: 2.81, 4.81; p < 0.0001) and 8.16 (95% CI: 6.03, 11.04; p < 0.0001). Model-predicted and observed survival were very similar, with the exception of some discrepancy in median OS in the lowest risk category, as mirrored in the individual training and validation set results. Table 3 shows the median OS, hazard ratio and p value when the model is applied to all the patients combined.

Finally, comparing the parameters between the complete case model and the one using imputed data showed very similar coefficients and p values (Supplementary table 4), therefore indicating that the final model was not greatly affected by missing data.

## Discussion

Although the SHARP trial and the Asia-Pacific study1,2 clearly demonstrated a significant improvement in survival over placebo in patients with advanced HCC, the absolute improvement in median survival was less than 3 months. Furthermore, as we show here, there is heterogeneity around the median survival figures. In this study, where patient inclusion criteria were analogous to that of the SHARP trial, survival ranged from less than one month to more than two years. Others have found similar heterogeneity.8,9 At present the recommended indication for sorafenib treatment is patients with well-preserved liver function and being unsuitable for loco-regional therapies,5 and this does not take into account likely survival. These figures are put into further context by a recent review of sorafenib amongst over 1000 Medicare beneficiaries28 noting that ‘survival is exceptionally short…..and downsides of sorafenib use - high drug-related symptom burden and high drug cost - must be considered in light of this minimal benefit’. To overcome this issue, in this paper, we have developed a statistical model [henceforth known as PROSASH (PRediction Of Survival in Advanced Sorafenib-treated HCC] that allows personalised survival predictions with a view to aiding patient counselling and trial design.

Using data collected for regulatory purposes with similar criteria to that used for the SHARP trial and Asia-Pacific study,1,2 we show that it is possible to predict survival on the basis of clinical features available at the time of diagnosis. Within the entire cohort of patients, we identified four risk categories (Fig. 1) whose median overall survival ranged from 4 months in the highest risk category to over 20 months in the lowest risk category (Table 3). The corresponding percentage survival at 12 months for risk categories 1 to 4 were approximately 8%, 28%, 55%, and 76% respectively (Table 3). This stratification in addition to the individual patient predictions, permits ‘personalisation’ of sorafenib therapy and may be useful in clinical trials to optimally stratify patients where sorafenib is the appropriate control arm. Thus, it would be possible to ensure that patients in a randomised phase III trial would have equivalent and matching prognostic features. Furthermore, the median survival figures in the higher risk groups are, in fact, worse than in the control (placebo) group of the SHARP study and this may lead clinicians to consider if the toxicity consequent on sorafenib therapy is worth any small potential survival benefit.

The model also offers insight into some of the factors that influence survival (Table 2). Notably, using HCV as the reference aetiology, the prognosis is clearly much worse in the HBV and ‘other’ groups. This is consistent with recent findings both from a retrospective review of the SHARP trial and Asia-Pacific study1,2,16 and meta-analysis studies,8,20 which identify HCV positivity as a key predictive factor for benefit after sorafenib. Our model thus contains the major factors that have been found to be either predictive of sorafenib benefit such as extra-hepatic spread, or prognostic, such as vascular invasion and AFP in the combined analysis of the SHARP and Asia-Pacific trials16 apart from NLR, which was not a recorded in our datasets. Although such data permits optimisation of the patient groups in whom sorafenib is administered, molecular markers related to the mechanism of action of targeted agents remain an important and, as yet unfulfilled, goal.29

The inclusion of aetiological factors in our model is clearly justified on the basis of the previously mentioned evidence that HCV positivity is a predictor of survival benefit compared to placebo. However, the inclusion of aetiological factors other than HBV and HCV, which are relatively ‘objective’ is problematical. Thus, the lifetime consumption of alcohol is very difficult to record accurately and when both alcohol and a type of viral hepatitis are recorded, attribution to one specific aetiology becomes highly speculative. The diagnosis of NAFLD is equally difficult in the setting of HCC since evidence of NAFLD has often disappeared by the time the patient develops cirrhosis and HCC.30,31 In the light of these observations we believe that categorisation aetiology as HCV, HBV or ‘other’ is the fairest option.

Another limitation of our study is that we did not have sufficient data to take into account pre- and post- study treatments. Since these are not predicated in the trial protocol, there is a wide variation in the treatment options and their duration, which makes it difficult to model statistically.

Since the guidelines for sorafenib treatment are based on clinical trial data it seems reasonable to build our prognostic model on similar datasets. Nonetheless, it will be important to validate the performance of the model in larger datasets and in patients treated in the routine clinical practice setting. Although the predicted survival in our model was very close to that observed at each risk category (Fig. 2), there was some discrepancy between the observed and predicted median survival in the low risk category of the training set towards the end of the follow-up period (after about 15 months). Possible explanations for this discrepancy may be due to the small number of patients within that group surviving beyond 20 months such that (a) there was not enough information for the model to correctly extrapolate survival for and (b) such patients may have different features (compared to others) and factors that affect their survival may not be accounted for by the model. The observed and predicted percentage survival at 12 months in this category was, however, very close (78 vs 76%).

We believe that the statistical approach adopted here could be used to generate a ‘virtual control group’ for phase II, single arm, trials of new agents. Thus, we can generate survival curves that predict the outcome of patients in such trials had they received sorafenib although quantitative estimation of differences between the trial arms (sorafenib predicted vs actual new agent) remains methodologically challenging. Such an approach might be a useful preliminary screen for new agents when a go /no go decision concerning progression from phase II to phase III has to be made.