Main

Randomised phase III trials are usually based on several hundred or thousand subjects. New interventions are often found to be ineffective, or the observed effect is smaller than expected and clinically unimportant, despite preliminary evidence of benefit. If an intervention is ineffective, it is worth considering whether the trial could have been stopped earlier after examining interim data, thus avoiding recruiting additional subjects and giving them an ineffective therapy, particularly if there are side effects. The trial treatment could also be stopped among those already taking it. Stopping for futility has other potential advantages, including savings in staff and financial resources.

Stopping trials early for futility has been discussed as far back as the 1980s (Halperin et al, 1982; Lan et al, 1982; Lan and Wittes, 1988), and work in this area is ongoing (Whitehead and Matsushita, 2003; Pocock, 2006; Lachin, 2009). An increasing number of trials appear to incorporate futility analyses, either prespecified in the protocol or requested by the Independent Data Monitoring Committee (IDMC) during the trial. The Food and Drug Administration, for example, gives some guidance on this (FDA, 2006).

Futility methods use the results from patients recruited up to a specific point to make assumptions about future data, so there are inherent limitations. For example, with time-to-event outcomes, the assumption of proportional hazards could be violated during the trial. The two main methods to assess futility are group sequential methods and conditional power (CP) (Whitehead and Matsushita, 2003; Snapinn et al, 2006; Hughes et al, 2009). There are various approaches to futility analysis based on CP (Halperin et al, 1982; Lan et al, 1982; Lan and DeMets, 1983; Lan and Wittes, 1988). Other approaches include a Bayesian method to estimate an ‘average’ CP, called predictive power (Spiegelhalter et al, 1986; DeMets, 2006), and the use of a phase II surrogate end point in a phase III trial (Herson et al, 2011).

The Cancer Research UK and UCL Cancer Trials Centre has conducted clinical trials for many years, of which several have not shown a worthwhile benefit. We retrospectively examined futility analyses in these trials.

Materials and Methods

Ten randomised phase III superiority trials were included, all of which reached their target sample size (none stopped early); Table 1. These comprised all trials on which the authors had worked that showed either no effect (n=5) or a moderate effect (n=4); one additional study with a large benefit was chosen for comparison. We aimed to see whether examining futility would have stopped the five ‘negative’ trials early (and if so, what the savings could be), but not any of the others. In UKHAN, no effect was shown in patients with prior surgery, whereas those without surgery did benefit, so we regarded them as two separate trials (UKHAN_1 and UKHAN_2, respectively). In ZIPP, results for two endpoints were used to show how different results could arise.

Table 1 Summary characteristics and final results of the 10 trials examined

The CP is the chance of obtaining a statistically significant result at the end of the trial given the data so far. At each analysis, the distribution of future data is assumed to be consistent with the target hazard ratio (HR) (Snapinn et al, 2006). The CP calculation incorporates the observed and target number of events, and the observed and target HR; see Appendix 1 for the statistical methods, which are described in full elsewhere (Lan and Wittes, 1988; Proschan et al, 2006). For time-to-event outcomes, CP is usually based on the expected total number of events. However, one should be cautious in choosing the denominator of the information fraction, and it is recommended not to change it dramatically (Proschan et al, 2006). Royston et al (2003) describe a stopping rule approach based on the expected total number of events in the control group, for multiple experimental arms when each is compared with the control. Generally, the CP should get closer to 100% over time as the observed HR approaches or exceeds the planned HR (i.e., when there is a real treatment effect), and it gets closer to 0% for studies of ineffective treatments. The CP should be low to provide sufficient supporting evidence to stop early, though there is no standard threshold in practice (Snapinn et al, 2006). Here, we suggest CP ≤ 15%. Stata v10 (StataCorp, College Station, TX, USA) was used to calculate CP (Appendix 2). We also used the method in which CP is based on the observed treatment effect at the interim analysis, rather than the target effect (Snapinn et al, 2006). Although this method can be used alongside the first by the IDMC to examine the interim results under various assumptions, it is not often used in practice. For time-to-event outcomes, caution should be used when interpreting CP if the proportional hazards assumption is violated. This is not so much a concern for calculating the CP, but rather a limitation of the statistic as a measure of treatment benefit in the presence of non-proportional hazards.
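To make the calculation concrete, the CP described above can be sketched in code. The following is an illustrative Python implementation of the B-value formulation (Lan and Wittes, 1988; Proschan et al, 2006) for a time-to-event outcome, not the Stata program in Appendix 2; it assumes 1:1 randomisation, proportional hazards and a two-sided test, and the function name and arguments are ours.

```python
from math import log, sqrt
from statistics import NormalDist

def conditional_power(obs_hr, target_hr, events, target_events, alpha=0.05):
    """Conditional power for a time-to-event trial, assuming future data
    follow the *target* HR (B-value formulation).
    Assumes 1:1 randomisation, proportional hazards, two-sided test."""
    nd = NormalDist()
    t = events / target_events                    # information fraction
    z_t = -log(obs_hr) * sqrt(events / 4)         # approx. interim log-rank Z
    b_t = z_t * sqrt(t)                           # B-value: B(t) = Z(t)*sqrt(t)
    theta = -log(target_hr) * sqrt(target_events / 4)  # drift under target HR
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # Chance the final Z-statistic exceeds the critical value, given B(t)
    return 1 - nd.cdf((z_crit - b_t - theta * (1 - t)) / sqrt(1 - t))
```

Under these assumptions, an observed HR of 1.0 at half the target events with a target HR of 0.75 gives a CP of about 23%, above the suggested ≤15% stopping threshold, whereas an observed HR of 1.3 at the same point gives a CP below 1%.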

Three interim analyses were specified, after 25, 50 and 75% of events had occurred, or patients had been recruited. Many researchers trigger interim analyses on events, but recruitment milestones are also used. Outcomes were: (i) CP, (ii) the number of patients left to recruit to reach the target sample size and (iii) cost savings if a trial were stopped early.

If analyses are triggered on a specified percentage of recruited patients, we allowed some follow-up so that events could occur in the last patients accrued: 3 months for advanced disease (lung and biliary tract cancer), and 6 months for the others. Further patients would be recruited during this time, but with minimal contribution to the analysis. In addition to this, and also for interim analyses triggered on a specified percentage of events, we allowed two extra months, during which the IDMC would meet, discuss the results, and then make decisions with the trial investigators. Both allowances are expected in practice.

To provide some estimate of uncertainty when interpreting a single observed CP from a trial, we also simulated 1000 bootstrap samples for each trial when 50% of patients or 50% of events had occurred. Sampling was with replacement and stratified by treatment arm, so that the number of patients in each arm was the same as that observed; a given patient could therefore appear in a bootstrap sample once, more than once, or not at all. The patients were sorted by date of randomisation and date of events where they occurred, and bootstrapping was conducted based on the order in which patients entered the trial, thereby replicating the interim analyses as they would have occurred prospectively. For each of the 1000 bootstrap samples we calculated the HR and corresponding CP, in order to assess the proportion of samples that would indicate stopping the trial early (i.e., where CP ≤ 15%).

Cost savings were examined in the five trials with no overall benefit. The same unit costs were specified for all studies for comparability, without considering inflation and increased expenses over time. The costs were applied to the number of months left to complete the target recruitment at each interim analysis. Investigational drugs were always provided free of charge by the manufacturer or health service provider, as were extra follow-up clinic visits and assessments. Because only the direct costs of conducting the trial were considered, any estimates of savings are conservative.
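The bootstrap procedure can be sketched as follows. This is an illustrative Python sketch, not the code used in the study: it represents each patient as a tuple (arm, follow-up time, event indicator), estimates the HR under a simple exponential model (ratio of event rates) rather than a Cox model, omits the recruitment-order aspect described above, and reuses the B-value form of CP; all function names and the CP ≤ 15% default are ours.

```python
import random
from math import log, sqrt
from statistics import NormalDist

def cond_power(obs_hr, target_hr, events, target_events, alpha=0.05):
    """B-value conditional power, future data assumed to follow the target
    HR; 1:1 randomisation, proportional hazards, two-sided test."""
    nd = NormalDist()
    t = events / target_events                     # information fraction
    b = -log(obs_hr) * sqrt(events / 4) * sqrt(t)  # B(t) = Z(t) * sqrt(t)
    theta = -log(target_hr) * sqrt(target_events / 4)
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return 1 - nd.cdf((z_crit - b - theta * (1 - t)) / sqrt(1 - t))

def hazard_ratio(sample):
    """Crude HR estimate under an exponential model: ratio of event rates
    (events / person-time), experimental arm over control arm."""
    rates = []
    for arm in (0, 1):
        n_events = sum(e for a, _, e in sample if a == arm)
        person_time = sum(t for a, t, _ in sample if a == arm)
        rates.append(n_events / person_time)
    return rates[1] / rates[0]

def prop_stopping(patients, target_hr, target_events,
                  n_boot=1000, threshold=0.15, seed=1):
    """Proportion of stratified bootstrap replicates whose CP is at or
    below `threshold`. `patients` is a list of tuples
    (arm 0/1, follow-up time, event indicator 0/1)."""
    rng = random.Random(seed)
    arms = {a: [p for p in patients if p[0] == a] for a in (0, 1)}
    stopped = 0
    for _ in range(n_boot):
        # resample with replacement within each arm, keeping arm sizes fixed
        sample = [rng.choice(arms[a]) for a in (0, 1) for _ in arms[a]]
        d = sum(e for _, _, e in sample)
        cp = cond_power(hazard_ratio(sample), target_hr, d, target_events)
        if cp <= threshold:
            stopped += 1
    return stopped / n_boot
```

For simulated data with a harmful observed trend, almost all replicates fall below the threshold; for a strongly beneficial trend, almost none do, which is the kind of supporting evidence the bootstrap is intended to provide.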

Results

The 10 trials are summarised in Table 1, of which 6 were relatively large (>500 patients). The observed HRs at the end of the study in the 5 trials with no overall treatment benefit were just below or above 1.0, though one (TOPICAL) showed a clear benefit among patients who had first-cycle erlotinib rash. Among the four trials with moderate effects, the HRs were no lower than 0.78. The proportional hazards assumption was met in all trials.

Interim analyses triggered after a specified percentage of events are observed

None of the five trials with no overall benefit would be stopped early after observing 25% of events (Table 2). After 50% of events had occurred, Study 12 and UKHAN_1 could have been stopped (low CP of 2% and 3%, respectively), by which point the percentage of patients left to complete accrual would be 12% (n=83) and 14% (n=36), respectively. Based on the bootstrap estimates, the proportion of samples in which the CP was ≤15% was 83.6% (Study 12) and 78.2% (UKHAN_1). Therefore, a decision to stop these trials using CP ≤ 15% carries a 16% and 22% probability, respectively, that the trials would have continued (Table 3); that is, the converse percentages. Study 14 also had low CP (≤15%), but recruitment would already have finished. After observing 75% of events, these same three trials had very low CP, but all had finished recruitment. Study 8 also had low CP, but with only 17% of patients (n=54) left to recruit. The TOPICAL trial would not have been stopped at any point, though at 75% of events the CP (17%) was close to our specified cutoff.

Table 2 Interim analyses based on a fixed percentage of target events (assumes future data are consistent with the target HR)
Table 3 CP based upon 50% of patients recruited or 50% of events observed, assuming that future data follow the planned HR distribution, with 1000 subsequent bootstrap replicates

Among trials with a moderate effect, only ACT I had low CP, after 50 and 75% of events had occurred (CP=4% and 2%, respectively). However, the HR estimates at these times (1.16 and 0.95) are noticeably different from the final estimate of 0.86, so the interim results on overall survival (OS) would have been misleading and inconsistent with other trial end points (i.e., local failure, for which a clear benefit was shown), had the study been stopped early. None of the other three trials had low enough CP to be terminated early.

As expected, the trial with the large treatment effect (ABC02) would not be stopped early for futility at any point.

Interim analyses triggered after a specified percentage of patients are recruited

Table 4 shows the results at each of the three specified time points. None of the five trials with no benefit had sufficiently low CP (≤15%) after either 25 or 50% of patients were recruited. For example, even after half the patients were randomised in Study 12 (26% (156 out of 609) of the target number of events observed), the CP was still 55% and the HR was 1.07 (final HR=1.09). However, four trials could have been stopped early after 75% of patients had been recruited, where the CP was 0.2%, 3%, 10% and 8% in Study 8, Study 12, TOPICAL and UKHAN_1, respectively. At this point, 22%, 9%, 10% and 17% of patients remained to be recruited to complete the original targets for these trials.

Table 4 Interim analyses based on a fixed percentage of recruited patients (assumes future data are consistent with the target HR)

Among trials with a moderate treatment benefit, there are two instances in which recruitment could have been terminated early: UKHAN_2 (CP=11%; 50% of patients) and ACT I (CP=7%; 75% of patients). Stopping these two trials early would be particularly concerning because the interim data for OS would not indicate any benefit at all (HRs 1.12 and 1.29 for ACT I and UKHAN_2, respectively) – very different from the final estimates (0.86 and 0.81). After 13 years of follow-up in ACT I, the HR was 0.86 (95% CI 0.70–1.04; Northover et al, 2010), and there was a clear benefit on event-free survival for UKHAN (HR=0.72, P=0.004; Tobias et al, 2010). For ACT I, at 50% of events, about 22% of bootstrapped CPs were >15% (Table 3), which shows some uncertainty in any decision to stop early.

Futility assuming future data would be consistent with the observed HR

Table A1 shows the CP when the futility analyses assume that data from future patients follow the same distribution as that observed so far (rather than the original target HR). All of the other results (observed HR, number of patients recruited, and number of patients and events left to accrue) are the same as in Tables 2 and 4.

After 50% recruitment, four trials could have been stopped (Study 12, Study 14, Study 8 and UKHAN), but not TOPICAL. All five trials had low CP after 75% recruitment. In TOPICAL, the CP was expected to decrease over time (given that there was no overall effect), but it was low at first (6%), then higher (26%) and then low again (0.8%). This is because the HR at 50% of patients (0.88) happened by chance to be closer to the target (HR=0.75). However, all four trials with a moderate benefit could also have been stopped early.

Time and cost savings for the trials that showed no evidence of an overall benefit

When interim analyses are based on percentage of events, the number of months left is, as expected, lower than when based on percentage of patients recruited, but there could still be cost savings (Table 5). For example, Study 12 could be terminated early after observing 50% of events (CP=2%), but only 4 more months would be needed to complete recruitment, and the saving associated with early stopping would be £44 000. Overall, after seeing 50% of events, three trials could be stopped early, avoiding 4–24 more months of accrual and saving £44 000–231 000 in two of these (Study 12 and UKHAN_1); in Study 14 no savings are made because recruitment had already finished.

Table 5 Potential savings (time and costs) associated with the interim analyses shown in Tables 2 and 4 for the five trials in which there was no overall treatment effect

With the futility analysis at 75% of events, only one trial with low CP is still recruiting (Study 8), but the savings would be 15 fewer months of recruitment and £144 000 lower costs.

Table 5 also shows the estimated time and cost savings when the analyses are based on recruited patients. The trials could only be stopped after 75% of patients had been recruited, with 4–28 fewer accrual months and £44 000–270 000 lower costs. For example, in Study 12 only 66 more patients were needed to reach the target sample size, which actually took only 4 months; had this study been stopped early, the savings would be £44 000. However, the number of months left to complete accrual was 19 for Study 8, 6 for TOPICAL and 28 for UKHAN_1, so even after recruiting 75% of patients there could be significant cost savings by stopping early: £183 000, £58 000 and £270 000, respectively. The observed monthly accrual rate, which was high in Study 12, is an important factor when considering whether to stop early.

Discussion

To the best of our knowledge, this is the first application of futility analysis to several real phase III oncology trials. Early stopping of those with an ineffective intervention has obvious appeal – primarily not exposing further patients to it when there is no benefit but there could be side effects. However, we show that the decision to stop recruitment early is not straightforward (unless it is based on safety concerns and there is clearly more harm in one group than the other). There are trials with no overall benefit that might not be stopped early but, worse still, there are studies with moderate effects that could be. Similar conclusions have been found elsewhere (Barthel et al, 2009). Conducting clinical trials is expensive and takes several years, so a secondary consideration is the potential for significant savings in accrual time and financial costs, which could be of interest to funding organisations, but these should be outweighed by the ethical issues. All of these considerations should be balanced against maximising the sample size to obtain a more reliable estimate of the treatment effect; examination of secondary end points (DeMets, 2006) and important, pre-specified subgroup analyses; and not missing an intervention with a moderate benefit, which is still clinically worthwhile.

Occasionally, by the time there is sufficient evidence for futility, recruitment is not far from the target, so it is sometimes best to continue to the end because the savings in time and costs are minimal (e.g., Study 12) – but only if there is no unacceptable harm to patients. A further consideration is whether patients are still on treatment. A trial in which all patients have finished the trial treatments, but are still in follow-up, could continue if there are no concerns over the schedule of clinic assessments. Continuing follow-up in a trial that has been stopped early has the advantages of minimising bias and obtaining more data on adverse events.

The worst situation is a trial that appears to show no benefit at an interim analysis but in fact has a moderate effect. It would be unsatisfactory to stop such trials early on the basis of insufficient patients or events. We give examples (ACT I and UKHAN_2) where interim HRs were close to or exceeded 1.0, with low CP, but the final HR indicated a clinically important effect.

The results and conclusions of three of the trials with no overall effect provided useful information after reaching the target sample size, especially when examining important subgroup analyses. Study 8, whose results were unexpectedly inconsistent with a preceding Canadian trial (despite having the same protocol), led to a systematic review showing that early radiotherapy only improved survival if patients completed chemotherapy (Spiro et al, 2006). A post-hoc subgroup analysis in Study 14 (Lee et al, 2009b) indicated that patients with squamous histology who had at least stable disease by chemotherapy cycle 3 had an OS HR of 0.71, and this has led to a randomised phase II trial using another antiangiogenic agent in these particular patients. In prespecified subgroup analyses in TOPICAL (Lee et al, 2010), OS and progression-free survival were significantly improved only among those who developed first-cycle erlotinib rash, but the reliability of these results would have been less clear if based on fewer patients and events. Continuing to the planned end in order to have reliable subgroup analyses has sometimes been used as justification for not conducting futility analyses, especially if there is unlikely to be an overall effect. However, there must be clear justification for these subgroup analyses, acknowledging the problems with data dredging. Also, if there is a positive treatment effect in one subgroup when no overall effect is found, there may be a negative effect in another subgroup.

Our analysis has several key strengths. First, it is based only on trials that reached the original target sample size. Second, we use real clinical trial data, not just statistical simulations. Third, we took a practical approach to the interim analyses by allowing time for follow-up and for the IDMC to meet and make decisions with the trial team. Fourth, the trials had a range of effect sizes and sample sizes. Fifth, we undertook bootstrap simulations to provide estimates of the uncertainty around any decision to stop early, supporting the analyses based on a single CP estimate from each trial. We are not aware of any previously published report that has examined the application of futility with all these considerations in mind.

Stopping a trial early is a crucial decision to be made between the IDMC and trial team. The evidence should be robust and based on several pieces of information, not just one statistic, be it the CP or otherwise. On the basis of our findings, a list of considerations for stopping for futility is shown in Box 1, so that only truly ‘negative’ trials are likely to be stopped early. It is worthwhile having two successive interim analyses to see whether the data are consistent, thereby strengthening the justification to terminate. Herson et al (2011) suggest that stopping trials early might miss late treatment effects, and so futility methods should be used with caution. Freidlin et al (2010) comment on the need to strike a balance between aggressive and conservative stopping rules, suggesting a repeated monitoring approach. Overly aggressive stopping rules in the second half of a study may result in trials with moderate effects being stopped early. For example, in ACT I (after 50% of events) the HR was 1.16 and the CP was 4%, but the bootstrapping analysis indicates that there was still a 22% chance of reaching the target HR. Conversely, conservative stopping rules may allow trials to continue past the point at which sufficient evidence to stop early has been attained.

Assumptions about the distribution of future data and the timing of the interim looks are important. The CP method we used is based on the target HR (Snapinn et al, 2006). In another method, CP is estimated using the observed HR as the new target; the problem with this is that the observed HR is likely to be unreliable early in the trial, whereas CP based on the target effect size is relatively insensitive to the early results. Deciding whether to trigger the interim analysis on the proportion of patients recruited or of events observed is also important. The observed effect size early in a trial may fluctuate too much and so be unreliable, especially if there is treatment imbalance (Herson et al, 2011), regardless of the method or assumptions used. Many researchers use percentage of events to trigger the interim analysis, a reasonable approach given that the statistical analyses are often influenced most by the number of events, and hence this might be more reliable than percentage of patients. In the set of trials we examined, futility analyses triggered on events (after 50 or 75%) could stop four out of the five trials with no overall benefit, and only one trial with a moderate effect, whereas analyses triggered on patients could also stop four out of five studies with no benefit, but two trials with a moderate effect. An important consideration is that analyses triggered on events are more likely to be based on longer follow-up, so the potential savings are generally less than for analyses triggered on number of patients (Table 5).
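The alternative just mentioned – taking the observed HR as the distribution of future data – differs from the target-based calculation only in how the drift parameter is set. A minimal Python sketch, under the same B-value formulation and 1:1 randomisation assumption as before (the function name is ours, not from the paper):

```python
from math import log, sqrt
from statistics import NormalDist

def cp_observed_trend(obs_hr, events, target_events, alpha=0.05):
    """Conditional power assuming future data follow the *observed* HR
    (the 'current trend' variant), rather than the original target."""
    nd = NormalDist()
    t = events / target_events                    # information fraction
    z_t = -log(obs_hr) * sqrt(events / 4)         # approx. interim log-rank Z
    b_t = z_t * sqrt(t)                           # B-value
    theta = -log(obs_hr) * sqrt(target_events / 4)  # drift from observed HR
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return 1 - nd.cdf((z_crit - b_t - theta * (1 - t)) / sqrt(1 - t))
```

With an observed HR of 1.0 at half the target events, this variant gives a CP below 1%, illustrating how much more aggressive it is than assuming future data follow the original target, and why it is sensitive to unreliable early estimates.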

Further research using modelling and simulations could examine an appropriate frequency of interim analyses, specify situations in which futility analysis may or may not be appropriate, and determine which method(s) are suitable, including whether to trigger the early looks on percentage of events or of patients observed. Terminology from medical screening could be useful: detection rate (DR; the proportion of truly negative trials that are stopped early) and false-positive rate (FPR; the proportion of trials with moderate treatment effects that are stopped early). A good method will have a high DR and a low FPR, and these parameters could be examined in relation to trial size, the timing of interim analyses and different statistical methods. Other authors have discussed futility in relation to falsely stopping studies (Hughes et al, 2009). Methods examining two or more end points could also be developed.

In summary, careful application of futility methods can lead to ineffective treatments not being given to future trial patients, and this could also lead to shorter trial duration and reduced financial costs. However, there are situations when the end of the trial is not far off, so the research team may as well complete it. A major concern is that there are studies with modest treatment effects that could be inappropriately stopped early, and a clinically important effect missed. Therefore, unless there is very clear and sufficient evidence for futility, it is often best to continue to the planned end.