Main

Phase II clinical trials are common in medical research, particularly in oncology. They are often based on a relatively small or moderate number of patients (typically 40–70), and allow a preliminary assessment of a new intervention before embarking on a larger and expensive randomised controlled trial (i.e., phase III). Many new drugs are not investigated beyond phase II, because of evidence that they are ineffective. The Fleming single-stage procedure (Fleming, 1982 (for the situation where K=1); Machin et al, 1997; A’Hern, 2001) has been a widely used approach in early-phase drug development. It involves having a single treatment group, and all patients are given the test intervention (often called a single-arm, single-stage design). The observed data are considered in relation to historical data expected to be associated with the control/standard treatment in order to design a subsequent phase III study.

Multistage designs involve conducting one or more interim analyses to decide whether all patients planned for a trial should be recruited. A decision to proceed to phase III is determined by using patient data from all stages of accrual. One important advantage of a multistage design is that fewer patients receive ineffective experimental treatments, because a trial can be stopped early for futility (Schlesselman et al, 2006).

Recently, randomised controlled phase II trials have become more popular, particularly for common cancers. In these trials, patients are randomly allocated to receive either the test intervention or the control (e.g., the standard treatment or placebo), or they are randomised to several experimental treatments. An important advantage of this approach is that the control group data are collected prospectively, that is, at the same time as those given the new intervention, and this usually yields more reliable data from which to design a subsequent phase III trial. A review and discussion of phase II designs is given in Rubinstein et al (2005), Ratain and Sargent (2009), Daniel et al (2009) and Sargent and Taylor (2009).

The basic idea of a phase II design is that a new therapy is worth considering further if it demonstrates a level of treatment response, P1 (e.g., tumour response or lower disease progression), which is greater than the response rate for the current or standard treatment, P0. Values of P1 and P0 are estimates of π1 and π0, respectively, the true probability of response, and used for sample-size calculations, along with desired statistical power and level of statistical significance.

The sample-size method in a single-stage Fleming design uses a normal approximation to the binomial distribution, and this helped facilitate the calculations for multiple stage testing. However, sample sizes based on this approximation differ from those based on exact methods, particularly for relatively small studies. This is discussed in A’Hern (2001), who also provided sample-size tables using exact methods; these sample sizes are larger than those obtained using the Fleming (1982) design. The difference is noticeable for studies of say <50 patients. The Fleming design can also produce anomalous results in that the confidence interval for the observed proportion could include P0, even though the P-value is <0.05 (A’Hern, 2001). It is therefore better to use exact tests. Although the idea of using alpha (α) levels greater than threshold values, such as 5 or 10%, was mentioned briefly, the impact on sample size was not discussed (A’Hern, 2001). A’Hern (2001) provides sample sizes based on approximate α levels and power.
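To illustrate the kind of normal-approximation calculation referred to above, the following minimal sketch uses the usual one-sided, one-sample formula for a binomial proportion. It is assumed here, not confirmed by the text, that this is the form given for the Fleming single-stage design in Machin et al; the function name normal_approx_sample_size is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def normal_approx_sample_size(p0, p1, alpha, power):
    """One-sided, one-sample sample size from the normal approximation.

    Assumed formula (standard textbook form, not taken from the paper):
    n = [(z_alpha * sqrt(p0 * q0) + z_beta * sqrt(p1 * q1)) / (p1 - p0)]^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value for type I error
    z_beta = NormalDist().inv_cdf(power)       # critical value for power
    numerator = z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))
    return ceil((numerator / (p1 - p0)) ** 2)  # round up to a whole patient


# P0=10%, P1=20%, one-sided alpha=5%, power=80%
n_approx = normal_approx_sample_size(0.10, 0.20, alpha=0.05, power=0.80)
print(n_approx)
```

For P0=10%, P1=20%, one-sided α=5% and power=80%, this sketch returns n=69, which agrees with the Fleming figure quoted in Example 1; the sketch does not compute the number of responders r.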

Sample-size software (Hintze, 2001; Machin et al, 2009) and tables (A’Hern, 2001; Machin et al, 2009) are available for the Fleming design and A’Hern exact sample sizes. However, they are based on conventional significance levels, such as 5 or 10%, and power, such as 80 or 90%. This means that the sample sizes produced are based on ensuring α ≤5% and power ≥80%, whereas researchers might think that α=5% and power=80% exactly. In some situations, smaller sample sizes for α=5.1% and power=79.9% might be possible, but they are ignored. It can be difficult to choose from a wider set of possible sample sizes if software or tables only offer a solution where α is ≤5% and power is ≥80%. Even with some specialist software, such as PASS (Hintze, 2001), it would require inputting a large number of non-standard α and power values either one at a time or in some other way. If we accept an α level that is ‘around’ 10% and power ‘around’ 80%, then more than one possible sample size can arise (shown later). The consequence of this is that software programs and the tables presented by A’Hern (2001) give sample sizes based on approximate α levels and power.

This paper considers the implications of such approximations for clinical trials in practice and presents sample sizes for exact α levels based on the exact test. We can then use this approach to examine several sample sizes for the same treatment effect and choose the smallest. This would be especially useful for studies of novel agents (where little is known about the treatment) or for rare disorders, where it is appropriate to minimise the sample size.

Materials and methods

With a single-stage design, the standard response rate is assumed to be P0 (under the null hypothesis); and the new therapy is considered worthy of further research if we can reject the null hypothesis in favour of the alternative hypothesis, where the response rate is P1. For example, a current therapy may be associated with a 50% tumour response rate for a given cancer, and a new agent or intervention is considered potentially useful if it can increase the rate to at least 65%. The consequent decision rule provides the sample size (n) and minimum number of responders (i.e., r) that are required to warrant further investigation of the new therapy, such that statistical significance is achieved. If the number of responders is <r, statistical significance is not achieved; r−1 is therefore the maximum number of responders for which the null hypothesis is not rejected.

The above can be formally stated as:

H0 : P0 ≤ 50% vs H1 : P1 ≥ 65%

The term Bin(n, P1, r) states that responses are from a Binomial distribution with parameters P1, the probability of a response, r, the number of responders to the new treatment and n, the sample size.

The approach to computing (1) and (2) is shown below in (3) and (4) respectively, as described by Chow et al (2003).
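For reference, a sketch of these quantities in the standard exact binomial form. It assumes that X denotes the number of responders among n patients and that the null hypothesis is rejected when X ≥ r; the symbol X and the exact layout are assumptions rather than a transcription of Chow et al (2003).

```latex
\begin{align}
\alpha &= P(X \ge r \mid P_0)
        = \sum_{i=r}^{n} \binom{n}{i} P_0^{\,i} (1 - P_0)^{\,n-i} \\
1 - \beta &= P(X \ge r \mid P_1)
        = \sum_{i=r}^{n} \binom{n}{i} P_1^{\,i} (1 - P_1)^{\,n-i}
\end{align}
```

Under these assumptions, the first line gives the exact significance level for a design (n, r) and the second gives its exact power.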

From Equation 3, we generate the observed significance level (α) and this value is compared against the pre-specified significance level (i.e., 5 or 10%). We also require values from Equation 4 to be ≥80%. Sample sizes are then chosen when the observed α levels are <5% (or 10%) and when the power is ≥80%.
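The calculation of the exact α and power for a given design can be sketched as follows. This is a minimal illustration, assuming the null hypothesis is rejected when at least r of n patients respond; the function names are hypothetical and are not those of the SAS macro or any published software.

```python
from math import comb


def binomial_upper_tail(n, r, p):
    """P(X >= r) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(r, n + 1))


def exact_alpha_and_power(n, r, p0, p1):
    """Exact type I error (under p0) and exact power (under p1)
    for the rule 'reject H0 if at least r of n patients respond'."""
    return binomial_upper_tail(n, r, p0), binomial_upper_tail(n, r, p1)


# Design of 13 responders out of 78 with P0=10% and P1=20%
alpha, power = exact_alpha_and_power(78, 13, 0.10, 0.20)
print(f"alpha = {alpha:.4f}, power = {power:.4f}")
```

The same function can be applied to any candidate pair (n, r), which makes it straightforward to compare the actual error rates of designs whose nominal α and power are the same.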

Results

Table 1 uses a range of differences between P0 and P1 to show how sample size can vary if we accept a significance level or power that is not exactly equal to conventionally accepted levels. Interest is in whether there is a value of α that is not much larger than the usual specified level of 5 or 10%, or a value of power which is not much <80%, but where there is a reduction in sample size. Table 1 shows the exact α and power values compared with the tables from A’Hern (2001). In Table 1, we show only the first five solutions ordered by sample size where available, for sample sizes >20, for absolute differences between P0 and P1 ranging from 10 to 30%. Smaller differences (i.e., <10%) are not considered clinically important in most trials. The value of α has been increased to a limit of 8% where the planned (target) α was 5% (target +3%). For a target α of 10%, the limit is 13%. The sample sizes are presented with power always ≥77%. The exact α and power values are computed only when the sample size requirement is >20, because sample sizes <20 pose less of a problem in recruitment terms. A SAS macro is also provided (Appendix I), which can be used to derive sample sizes, power and α values for all possibilities of P0 and P1, but other software such as R can also be used. The SAS program requires the user to input ranges of P0 and P1 and also uses cut-offs of 0.08 for α (i.e., target α +3%) and 0.77 for power, which we consider as constituting a ‘small trade-off’. We illustrate the use of Table 1 using some examples. A corresponding R function is also available in the Clinfun package written by Seshan (2012), see Appendix II.
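The search described above can be sketched in a few lines. This is not the SAS macro from Appendix I; it is a hedged illustration that assumes, for each n >20, that only the smallest r meeting the α cut-off is of interest (because a larger r can only lower power). The function name candidate_designs, the upper search bound n_max and the limit argument are all hypothetical.

```python
from math import comb


def upper_tail(n, r, p):
    """P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(r, n + 1))


def candidate_designs(p0, p1, max_alpha=0.08, min_power=0.77,
                      n_min=21, n_max=150, limit=5):
    """List (n, r, exact alpha, exact power) ordered by sample size.

    Cut-offs of 0.08 and 0.77 follow the 'small trade-off' in the text;
    n_max is an arbitrary search bound chosen for this sketch.
    """
    designs = []
    for n in range(n_min, n_max + 1):
        for r in range(1, n + 1):
            alpha = upper_tail(n, r, p0)
            if alpha > max_alpha:
                continue  # r too small: false-positive rate still too high
            power = upper_tail(n, r, p1)
            if power >= min_power:
                designs.append((n, r, alpha, power))
            break  # a larger r would only reduce power, so stop here
    return designs if limit is None else designs[:limit]


first_five = candidate_designs(0.10, 0.20)
for n, r, alpha, power in first_five:
    print(f"{r} out of {n}: alpha = {alpha:.2%}, power = {power:.2%}")
```

Passing limit=None returns every qualifying design up to n_max, which is useful when checking whether a particular design, such as 11 out of 65, falls within the chosen cut-offs.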

Table 1 Sample sizes based on exact binomial test

Example 1: single-arm phase II study

Aogi et al (2011) report a single-arm trial designed to detect a small difference of 10% in Japanese breast cancer patients. We use the same parameters in the context of a single-arm trial with P0=10% and P1=20%, α=5% (one sided) and power=80% to demonstrate the impact of trading-off type I and II error rates. There is no exact solution for this design. The Fleming approximation, using the formulae presented in Machin et al (2009), gives a sample size of n=69 and number of responders r=12, which (based on the exact method) actually corresponds to an exact α=4% and exact power=75.04%. The A’Hern method in the sample-size software (Machin et al, 2009) gives the solution to the same problem as 13 out of 78, which actually corresponds to α=4.53% and power=80.81% (Table 1, first row entries in bold).

Further examination of Table 1 shows alternative sample sizes obtained by relaxing α and power (savings in sample sizes are shown by comparing the bold figures in the first row with the sample sizes immediately below), and also for differences >10%. By accepting α=5.67% and power=77.7%, both of which are reasonably close to the specified levels of 5 and 80%, the solution is 11 out of 65, which is smaller than the Fleming solution (12 out of 69), and the power is also higher. Compared with the A’Hern sample size of 78, there is a saving of 13 patients. This means that the chance of declaring the new agent as being beneficial when in reality it is not has only increased by 1.14 percentage points (from 4.53% to 5.67%), and power has decreased by 3.10 percentage points. On the other hand, the increase in α is only 0.67 percentage points when compared with the conventional 5%. This could be considered worthwhile in relation to the potential saving in financial costs and accrual time, as well as exposing fewer patients to a novel agent that may have serious side effects.

Example 2: randomised controlled phase II study (1 : 1 allocation)

We take P0=30% and P1=40%; and specify α=10% and power=80%. Again, there is no exact solution for this when α=10% and power=80%. Sample-size software and tables from A’Hern (2001) give the solution as n=107 and r=39, which has an exact α=9.00% and power=80.13%. However, by accepting α=10.43% and power=77.33% (Table 1), both of which are reasonably close to the specified levels of 10 and 80%, the sample size could be 90 patients in one treatment group (instead of 107). Because there are an equal number of patients in the other treatment group (which could be another new treatment or a control group), there would be a total saving of 34 patients. The increase in α has therefore been 1.43 percentage points (and only 0.43 percentage points from the usual α of 10%); and power has decreased by only 2.8 percentage points. If the same trial had a 2 : 1 allocation instead of a 1 : 1 (in favour of the new intervention group), the saving would be 25 patients (17 in the experimental plus roughly half the number in the control). It should be noted that although this approach to randomised studies was common, a more efficient approach is to have a design that involves a direct comparison of the two treatment groups (Rubinstein et al, 2005; Jung, 2008).

A practical example

We extend the trade-off approach to an example of a phase II randomised trial in lung cancer patients using a two-stage design, which was stopped for lack of efficacy after stage I. A Simon two-stage minimax design with P0=50% and P1=65%, α=10% and power=90% using the software by Machin et al (2009) gives a required stage I sample size of 20 out of 40 and a total sample size of 42 out of 72 in each intervention arm. Therefore, the total sample size was 144 because of randomisation. Our exact calculation reveals that the actual α was 9.7% and power was 90.4% in the sample-size calculation. The trial had major recruitment problems and it could have been designed with α=10.4% and power=88.8% to give a stage I sample size of 13 out of 27 and a total sample size after stage II of 38 out of 65, saving 26 patients at stage I and 14 patients in total. It is worth noting that both the original design (n=144) and the alternative design (n=130) have probabilities of early termination under the null hypothesis of <50%.
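The operating characteristics of such a two-stage design can be checked with a short exact calculation. The sketch below assumes a particular reading of the notation: ‘r1 out of n1’ is taken to mean that the trial continues to stage II only if at least r1 of the first n1 patients respond, and ‘r out of n’ that the treatment is declared promising if at least r of all n patients respond. This convention and the function names are assumptions made for illustration, not a description of the software by Machin et al (2009).

```python
from math import comb


def pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(n, r, p):
    """P(X >= r); equals 1 when r <= 0 and 0 when r > n."""
    return sum(pmf(n, k, p) for k in range(max(r, 0), n + 1))


def two_stage_characteristics(n1, r1, n, r, p):
    """Return (probability of early termination, probability of a positive trial)
    for a true response rate p, under the assumed convention:
    continue if stage I responders >= r1; positive if total responders >= r."""
    n2 = n - n1
    pet = 1 - upper_tail(n1, r1, p)
    positive = sum(pmf(n1, x1, p) * upper_tail(n2, r - x1, p)
                   for x1 in range(r1, n1 + 1))
    return pet, positive


# Original design per arm: 20 out of 40 at stage I, 42 out of 72 overall
pet_orig, alpha_orig = two_stage_characteristics(40, 20, 72, 42, 0.50)
_, power_orig = two_stage_characteristics(40, 20, 72, 42, 0.65)

# Alternative design per arm: 13 out of 27 at stage I, 38 out of 65 overall
pet_alt, alpha_alt = two_stage_characteristics(27, 13, 65, 38, 0.50)
_, power_alt = two_stage_characteristics(27, 13, 65, 38, 0.65)

print(f"original:    PET = {pet_orig:.3f}, alpha = {alpha_orig:.3f}, power = {power_orig:.3f}")
print(f"alternative: PET = {pet_alt:.3f}, alpha = {alpha_alt:.3f}, power = {power_alt:.3f}")
```

Evaluating the function at P0 gives the probability of early termination and the exact α; evaluating it at P1 gives the exact power, so the two designs can be compared on all three quantities.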

Discussion

There are an increasing number of early phase II trials being conducted, given the availability of many new therapies, which are used on their own or in combination with the standard treatments. Furthermore, there is an emerging preference for randomised controlled phase II studies, which increases the total trial size (Lee and Feng, 2005; Cannistra, 2009). Phase II trials need to be conducted as quickly as possible with the minimum of resources, in order to reject apparently ineffective interventions early on in drug development and move on to other treatments, or to further investigate those that look promising. Traditionally, phase II trials are designed on the basis of the active (new) treatment arm only, in that the sample size is based on the expected treatment effect in that arm. If there is a control arm, the number of patients may be taken to be the same as or half of that in the active arm, depending on 1 : 1 or 2 : 1 allocation, respectively.

The financial costs of conducting a clinical trial have increased, particularly in light of the current regulations and governance, so that it can take many months (>6) to set up a study (Hackshaw et al, 2008). Having a small study, where acceptable, can therefore have clear benefits in terms of shorter trial duration, which is associated with lower costs. Another benefit is that fewer patients are exposed to a novel agent that has serious side effects but is eventually shown to be ineffective. Minimising sample size is particularly important for rare disorders where recruiting even 10–15 patients could take several months.

When designing studies, most researchers use established values for α of 5 or 10% and power of 80% (occasionally 90%). In our paper, we show that by allowing slightly higher α and lower power for these exact tests, there could be a material reduction in sample size, particularly for studies with say <50 patients. We believe that such an approach is useful for two reasons. First, phase II trials are usually only meant to provide preliminary evidence of efficacy, therefore relaxing the design parameters should not be of great concern. Second, the conventional values of α=5% and power 80% were somewhat arbitrary when originally stipulated; they were not selected on the basis of scientific principles (α=5% was judged sufficiently low and power=80% as sufficiently high). However, these values were primarily meant for large confirmatory studies, but researchers and reviewers involved in grant applications have not often relaxed them for exploratory studies, such as phase II trials. Recently, it has become more common to have values of α of 10% in cancer trials (Rubinstein et al, 2005). Therefore, accepting α of 7% instead of 5%, or power of 77% instead of 80%, could be considered a worthwhile trade-off for having a smaller study, particularly when the largest savings are made with randomised controlled phase II trials.

Our approach to sizing studies is not just limited to single-stage designs, but can also be extended to two-stage (Simon, 1989) and other n-stage designs where exact methods are used. In some two-stage designs, trade-offs in the expected sample size are considered for smaller overall sample sizes (Jung et al, 2001). By compromising α and power in addition to the expected sample size, it is possible that savings in sample size are even greater. However, additional complexities such as the probability of early termination might also be important when considering any trade-off.

The implication of trading-off type I and II errors is that the risk of a false-positive may be slightly above the conventional 5%, and power slightly below the conventional 80%. The specific type of trade-off is likely to be based on feasibility and may vary from trial to trial. However, in phase II trials, which are often about finding preliminary evidence of effect, a trade-off in either direction may be possible. It is important to point out that such a trade-off does not influence the size of the treatment benefit.

A limitation of our suggested approach is that the final result ideally needs to be considered in relation to the α level used in the sample-size calculation, which is not a round number such as 5% or 10%. However, even when sample sizes come from A’Hern (2001) or software, the interpretation of the primary result is based on α=5%, even though the actual value might be 4.5%. Moreover, reported P-values such as ‘0.052’ or ‘0.057’ (in the context of phase III trials) are not readily dismissed for lack of effect (Hackshaw and Kirkwood, 2011), and therefore powering a trial with non-standard α and power may also be considered a reasonable approach for phase II study designs. Nevertheless, the decision on whether or not to investigate a new treatment further should not be based on a single numerical cut-off for α, but perhaps on consideration of several pieces of information, including other clinically important efficacy end points, safety and accrual rates. It is often the case that a smaller treatment effect is observed, and precision would be lost by having a study that is too small, making it difficult to determine whether to investigate the new therapy further or not. We therefore do not recommend that sample sizes be reduced to <20 patients per treatment group.

In conclusion, it is worthwhile examining a fuller range of sample sizes when using exact methods for single-stage phase II trials, so that the smallest acceptable sample size could be chosen after allowing a slightly higher α level (error rate) than the conventional 5 or 10%, and lower power than the nominal 80%. This can lead to benefits such as shorter study duration and lower financial costs, which are key considerations when investigating treatments for uncommon disorders or new agents in proof-of-concept studies, and this could make a project proposal being considered for funding more attractive when peer-reviewed. When the decision rule is based on the experimental arm alone, but the study is a randomised parallel group design, the sample sizes from the approach described here can be up to 25% lower than those presented by A’Hern, after allowing for small trade-offs in α and power.