Statistical strategies to improve the efficiency of molecular studies of colorectal cancer prognosis

The evaluation of tumour molecular markers may be beneficial in prognosis and predictive in therapy. We develop a stopping rule approach to assist in the efficient utilisation of resources and samples involved in such evaluations. This approach has application in determining whether a specific molecular marker has sufficient variability to yield meaningful results after the evaluation of molecular markers in the first n patients in a study of sample size N (n⩽N). We evaluated colorectal tumours for mutations (microsatellite instability, K-ras, B-raf, PI3 kinase, and TGFβR-II) by PCR and protein markers (Bcl2, cyclin D1, E-cadherin, hMLH1, ki67, MDM2, and P53) by immunohistochemistry. Using this method, we identified and abandoned potentially uninformative molecular markers in favour of more promising candidates. This approach conserves tissue resources, time, and money, and may be applicable to other studies.

Research on the molecular biology of colorectal cancer has increased our expectation that a better understanding of molecular changes in colorectal tumours may improve our knowledge of aetiology and treatment. Recently, investigators have recognised that molecular characteristics of colorectal cancers are associated with prognosis and therapeutic response. Studies suggest that some of the major genetic players in colorectal neoplasia, such as p53 mutations, are associated with poorer prognosis (Hardingham et al, 1998). Other studies report correlations between K-ras mutations, tumour stage, and survival (Andreyev et al, 1998; Samowitz et al, 2000). In a population-based study of 607 colorectal cancer patients, Gryfe et al (2000) observed that high-frequency microsatellite instability (MSI) conferred a significant survival advantage independent of other prognostic factors, including tumour stage.
Molecular studies in colorectal cancer may help us better understand how genetic alterations could alter prognosis or impact response to cytotoxic agents. However, there are limitations in the analysis of molecular markers in studies of colorectal cancer prognosis. Oftentimes, studies have a limited amount of tissue samples or have samples from a small number of subjects. Furthermore, variation in the expression of markers in tumour samples might be too small to detect differences in prognosis, thus limiting the utility of some markers. Therefore, there is a need to devise strategies to utilise resources efficiently in studies of molecular markers of prognosis.
In the conduct of a population-based study to determine prognostic and predictive molecular factors for colorectal cancer, we used data from more than 100 patients to develop a strategy to determine whether specific molecular markers possess sufficient variability to yield meaningful results in a study of sample size 1000. Using this method, molecular markers that were unlikely to be informative were abandoned in an early stage of the study in favour of mutations or protein markers showing more promise. This method allowed us to conserve time and resources, and may be applicable to other molecular studies.

MATERIALS AND METHODS
We are conducting a population-based study of colorectal cancer in 33 county areas of North Carolina. This study, Cancer Care Outcomes and Surveillance (CanCORS), is a multicentre population-based study, funded by the National Cancer Institute, to evaluate patient, physician, and treatment factors that influence colorectal cancer outcomes. As part of the CanCORS study at the University of North Carolina, we collected tumour tissue from consenting subjects, and constructed tissue microarrays (Kononen et al, 1998) to be used for immunohistochemistry and mutational analysis as part of the UNC GI Specialized Programme in Research Excellence (SPORE) grant. We enrolled 1000 patients (N = 1000) into the study, and the study was approved by the Institutional Review Board (IRB) of the UNC School of Medicine. From more than 100 patients, we evaluated genetic mutations in p53 (Angelopoulou and Diamandis, 1998; Curtin et al, 2004), K-ras, B-raf, TGFβR-II, and MSI (Boland et al, 1998), and examined protein expression of MDM2, Bcl2, cyclin D1, Ki67, P53, hMLH1, and E-cadherin by immunohistochemistry using commercial antibodies.

Binary mutation marker data
In our study of N = 1000 patients, our objective was to develop a stopping rule that might be applied after the first n patients were evaluated (n ≤ N) to improve efficiency and lower cost. A binary mutation marker variable takes a value of 0 or 1 to represent the absence or presence of a mutation, respectively. To assess the effectiveness of any marker, one typically employs a regression model to correlate the marker variable with the outcome. A problem arises in the early stages of a study when time-to-event outcomes are not yet available because of short follow-up, hindering the evaluation of marker effectiveness in terms of survival. However, one can still make informed decisions on marker effectiveness by evaluating marker variability. If nearly all of the first n (n ≤ N) patients either carry the mutation or do not carry it, this suggests that the marker has little variability and is unlikely to be informative about prognosis.
To evaluate marker effectiveness through marker variability without survival outcome data, we find it appropriate to use the power and sample size relation. Let α denote the significance level and $Z_{1-\alpha}$ the (1−α) × 100th percentile of the standard normal distribution. Assuming the Cox proportional hazards model, Schoenfeld (1983) derived a sample size and power relation for two-sample comparisons (eg, mutations vs non-mutations) in which the proportion of the mutation group p satisfies:
$$p(1-p) = \frac{(Z_{1-\alpha} + Z_{1-\beta})^2}{D\,(\log \Delta)^2} \qquad (1)$$
where Δ is the hazard ratio between the two samples, D is the total number of deaths among N patients (which can also be written as D = N × d, where d is the overall death rate), and 1−β is the power. This formula shows the relationship between the hazard ratio (or effect size) Δ, the variability p(1−p), and the statistical power when all other parameters are fixed. Clearly, when a marker has little variability, the power to detect a specific effect Δ between mutations and non-mutations can be too low. This suggests that we can compute a lower and an upper bound of the mutation rate from (1), so that there is sufficient power (≥80%) to detect a specific effect Δ if the mutation rate falls between the bounds. When marker data from n patients (n ≤ N) are available, but survival data are not, we can construct a 95% confidence interval for the mutation rate and compare it with the bounds. If the 95% confidence interval falls completely below the lower bound (or completely above the upper bound), it suggests that the marker might have too little variability to be effective in predicting survival, even if marker data were collected from all N patients. In such circumstances, investigators can make an informed decision on whether to continue data collection on a marker that is unlikely to be effective, or to direct resources to other markers showing more promise.
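As an illustration of how the bounds could be computed, the following minimal Python sketch solves the quadratic in p, assuming formula (1) takes the Schoenfeld form p(1−p) = (Z_{1−α} + Z_{1−β})² / (D (log Δ)²) with a one-sided Z_{1−α}. The function name and its parameters are hypothetical and are not taken from the study's own software.

```python
from math import log, sqrt
from statistics import NormalDist


def mutation_rate_bounds(hazard_ratio, death_rate, n_total=1000,
                         alpha=0.05, power=0.80):
    """Return (p_lower, p_upper) for the mutation rate, or None if no
    mutation rate can reach the requested power.

    Assumes p(1 - p) = (Z_{1-alpha} + Z_{1-beta})^2 / (D * log(HR)^2),
    with D = n_total * death_rate expected deaths.
    """
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1 - alpha) + std_normal.inv_cdf(power)
    deaths = n_total * death_rate
    required = z_sum ** 2 / (deaths * log(hazard_ratio) ** 2)  # minimum p(1-p)
    if required > 0.25:
        # p(1 - p) never exceeds 0.25, so the power target is unreachable.
        return None
    half_gap = sqrt(0.25 - required)  # roots of p^2 - p + required = 0
    return 0.5 - half_gap, 0.5 + half_gap


# Setting used in the text: N = 1000, d = 0.6, hazard ratio 1.5.
print(mutation_rate_bounds(1.5, 0.6))  # approximately (0.067, 0.933)
```

Because p(1−p) is symmetric about 0.5, the two roots always sum to 1, and the function returns None when the required variability exceeds the maximum of 0.25.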
To decide how large n should be, we provide the following formula:
$$n = \frac{Z_{0.975}^2\, p(1-p)}{L^2} \qquad (2)$$
where $Z_{0.975}$ = 1.96, p can be taken as 0.5, and L is a prespecified precision defined as the half-width of a 95% confidence interval. By using a small L, we can expect to have an accurate estimate of the mutation rate based on only n (n ≤ N) data points. It is important to note that, in addition to variability, effect size plays an important role in (1). When calculating the lower and upper bounds (or simply the variance bound) at 80% power, we have to supply a value for the effect size. Unfortunately, the true effect size of a marker is unknown and cannot be estimated in the absence of survival data. Supplying a value lower than the true effect size results in a higher variance bound, making it easier to reject a marker; supplying a value larger than the true effect size only makes it harder to reject a marker. We recommend supplying an upper bound for the effect size to minimise the chance of throwing away important markers that may have very low variability, but huge effects on survival.
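The calculation of n can be sketched as follows, assuming formula (2) is the usual normal-approximation sample size for a proportion, n = Z²p(1−p)/L², with L treated as the half-width of the 95% confidence interval. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def interim_sample_size(precision, p=0.5, confidence=0.95):
    """Smallest n giving a confidence-interval half-width of at most
    `precision` for a proportion p (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return ceil(z ** 2 * p * (1 - p) / precision ** 2)


# p = 0.5 maximises p(1 - p), so this is the worst-case requirement.
print(interim_sample_size(0.1))  # 97
```

Taking p = 0.5 is the conservative choice: any other mutation rate needs fewer patients to reach the same precision.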

Continuous protein marker data
The protein markers under investigation were assessed by immunohistochemistry. The scoring system was based upon the proportion of cells that were stained and the intensity of staining (Hoos and Cordon-Cardo, 2001). The final score took continuous values between 0 and 5. Similar to the mutation data, our goal was to develop a stopping rule for protein marker data, which might be applied after the first n patients were evaluated (n ≤ N). For a continuous marker with variance σ², the corresponding sample size and power relation is:
$$\sigma^2 = \frac{(Z_{1-\alpha} + Z_{1-\beta})^2}{D\,(\log \Delta)^2} \qquad (3)$$
where Δ is the hazard ratio associated with one unit of increase in marker values. Similar to the binary marker case, a lower bound for marker variance can be computed by solving (3) for σ², such that there is at least 80% power to detect a specific survival effect Δ, given the overall death rate d (D = N × d). Unlike in the binary marker case, there is no upper bound for marker variance in the continuous case. Again, effect size plays an important role in (3), in addition to variability. We do not want to underestimate the true effect size of a marker when calculating the variance lower bound. On the other hand, overestimating the true effect size would only make the method conservative. If only continuous markers were evaluated in a study, one could use the following formula to compute the required sample size n to satisfy a certain precision L:
$$2L = \frac{(n-1)s^2}{\chi^2_{n-1,\,0.025}} - \frac{(n-1)s^2}{\chi^2_{n-1,\,0.975}}$$
where s² is an estimate of σ² from a pilot study. However, when both binary and continuous markers are evaluated in a study, there is no need to compute n twice. In that case, one can compute n based on formula (2) because of its simplicity.
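The variance lower bound for a continuous marker can be sketched in the same way, assuming formula (3) has the form σ² = (Z_{1−α} + Z_{1−β})² / (D (log Δ)²) with a one-sided Z_{1−α}. The function name and defaults are hypothetical.

```python
from math import log
from statistics import NormalDist


def minimum_marker_variance(hazard_ratio, death_rate, n_total=1000,
                            alpha=0.05, power=0.80):
    """Smallest marker variance that still gives the requested power to
    detect `hazard_ratio` per one-unit increase in the marker."""
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1 - alpha) + std_normal.inv_cdf(power)
    deaths = n_total * death_rate  # D = N * d
    return z_sum ** 2 / (deaths * log(hazard_ratio) ** 2)


# Setting used in the text: N = 1000, d = 0.6, hazard ratio 1.5.
print(round(minimum_marker_variance(1.5, 0.6), 3))  # 0.063
```

Supplying a larger hazard ratio lowers the bound, which is why an upper bound on the effect size makes the rule conservative.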

RESULTS
Binary mutation marker data
We need at most n = 97 (= 1.96² × 0.25/0.1²) patients to satisfy a precision of 0.1 in evaluating a binary mutation marker (in formula (2), let p = 0.5, as it gives the highest possible value for the right-hand side). Table 1 displays the lower and upper bounds of the mutation rate, denoted pL and pU, for a range of overall death rates and effect sizes where power is fixed at 80%, N = 1000, and α = 0.05. The bounds add up to 1 for each combination of overall death rate d and effect size Δ because of the symmetry in the left side of formula (1). When the overall death rate is lower than 20% and the effect size is also low, there are no solutions for pL and pU, because the power is insufficient (<80%) regardless of the mutation rate. Figure 1 displays pL and pU when power is fixed at 80%, α = 0.05, d = 0.6, and N = 1000. If the 95% confidence interval for the mutation rate of a genetic marker falls completely in the grey area, it suggests that the marker has little variability and is unlikely to be effective. For our study, we evaluated mutation markers, such as PI3 kinase, K-ras, B-raf, TGFβR-II, and MSI (Table 2). At the time of the development of this method, we had collected data for more than 97 patients. We expected an overall death rate of 0.6 among the N = 1000 registered patients, and a hazard ratio of no more than 1.5 (i.e., 1.5 was an upper bound for the effect size between the mutation and non-mutation groups).
Figure 1 Lower and upper bounds calculated at 80% power, α = 0.05, overall death rate d = 0.6, and total sample size N = 1000 for comparison with the mutation rate estimated from n (n ≤ N) patients. The shaded area represents the rejection region. A 95% confidence interval of the mutation rate from n patients falling completely within this region suggests that the marker has little variability and likely insufficient power to predict survival, even if all data from N patients were collected.
As shown in Table 1, the lower and upper bounds of the mutation rate are 0.067 and 0.933, respectively (for a 0.6 overall death rate and a hazard ratio of 1.5). Among the markers, only TGFβR-II had a 95% confidence interval, (0.019, 0.057), falling completely below the 0.067 lower bound. This suggests that less than 6% of the population had TGFβR-II mutations, a range unlikely to give sufficient power (≥80%) to predict prognosis, even if we gathered TGFβR-II mutation data from all N = 1000 patients. Thus, we decided to stop further genetic analysis of TGFβR-II, and to focus attention on the other markers.
Continuous protein marker data
Table 3 presents the minimum variance required to detect a specific hazard ratio for a range of overall death rates when power is fixed at 80%, N = 1000, and α = 0.05. In our study, protein markers were measured for Bcl2, cyclin D1, E-cadherin, hMLH1, Ki67, MDM2, and P53. We computed a 95% confidence interval for the variance of each marker (Table 4). Again, the overall death rate was expected to be 0.6, and the true effect size was assumed to be no more than 1.5 for one unit of increase in the protein marker values. According to Table 3, the minimum variance required for each marker is 0.063. As the lower confidence limits of all the markers were larger than 0.063, none of the markers met the stopping criterion at this early stage of the study. Figure 2 displays the rejection region of variance when power is fixed at 80%, α = 0.05, d = 0.6, and N = 1000, which is the case for our study.
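The stopping decisions described in this section amount to simple interval comparisons, sketched below. The function names are hypothetical. The TGFβR-II interval and the bounds are the values quoted in the text, whereas the variance interval in the second example is an invented placeholder, since the Table 4 values are not reproduced here.

```python
def binary_marker_meets_stopping_rule(ci, bounds):
    """True if the 95% CI for the mutation rate lies entirely below the
    lower bound or entirely above the upper bound."""
    ci_low, ci_high = ci
    p_lower, p_upper = bounds
    return ci_high < p_lower or ci_low > p_upper


def continuous_marker_meets_stopping_rule(variance_ci, min_variance):
    """True if the 95% CI for the marker variance lies entirely below
    the minimum required variance."""
    _, ci_high = variance_ci
    return ci_high < min_variance


# TGFbR-II: CI (0.019, 0.057) against bounds (0.067, 0.933) from the text.
print(binary_marker_meets_stopping_rule((0.019, 0.057), (0.067, 0.933)))  # True

# Hypothetical variance CI, for illustration only (not a Table 4 value).
print(continuous_marker_meets_stopping_rule((0.80, 1.40), 0.063))  # False
```

A marker whose interval merely overlaps the rejection region is retained, which reflects the conservative intent of the rule.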

DISCUSSION
The prospect that we might use the molecular characteristics of tumours to determine patient prognosis and predict response to chemotherapy is compelling. Studies to date have shown promising results, and there is every expectation that continued research will further improve our prognostic and predictive abilities.
Although it is tempting to perform molecular analyses on an entire study sample, depending on the size of the study sample and the variability in a marker, the analysis might not be informative.
In this study, we have illustrated one potential approach to evaluating marker effectiveness in the early stage of a study, when survival data are not available and the number of markers under consideration is limited. We recommend supplying an upper bound for the true effect size when calculating the marker variance bounds. In doing so, we minimise the chance of throwing away important markers that may have very low variability but huge effects on survival. The method is conservative in that we do not abandon markers early unless they show extremely low variability. However, if any markers are identified as ineffective, the savings in money, time, and resources may be significant. Institutional review boards and funding agencies generally demand power calculations (Friedman et al, 1999) as a requisite for study approval. The stakes are lower for molecular studies upon existing samples, but the ethical impetus remains to make efficient use of resources and precious, often irreplaceable, patient samples. Our approach helps identify uninformative markers in the early stage of a large molecular study to conserve time and resources. To fully assess this approach, future researchers should consider evaluating the real gains and losses of applying it to a large, completed study.
Figure 2 Minimum variance calculated at 80% power, α = 0.05, overall death rate d = 0.6, and total sample size N = 1000 for comparison with the marker variance estimated from n (n ≤ N) patients. The shaded area represents the rejection region. A 95% confidence interval of the marker variance from n patients falling completely within this region suggests that the marker has little variability and likely insufficient power to predict survival, even if all data from N patients were collected.