A prognostic model for ovarian cancer

About 6000 women in the United Kingdom develop ovarian cancer each year and about two-thirds of the women will die from the disease. Establishing the prognosis of a woman with ovarian cancer is an important part of her evaluation and treatment. Prognostic models and indices in ovarian cancer should be developed using large databases and, ideally, with complete information on both prognostic indicators and long-term outcome. We developed a prognostic model using Cox regression and multiple imputation from 1189 primary cases of epithelial ovarian cancer (with median follow-up of 4.6 years). We found that the significant (P ≤ 0.05) prognostic factors for overall survival were age at diagnosis, FIGO stage, grade of tumour, histology (mixed mesodermal, clear cell and endometrioid versus serous papillary), the presence or absence of ascites, albumin, alkaline phosphatase, performance status on the ZUBROD-ECOG-WHO scale, and debulking of the tumour. This model is consistent with other models in the ovarian cancer literature; it has better predictive ability and, after simplification and validation, could be used in clinical practice. © 2001 Cancer Research Campaign http://www.bjcancer.com


Statistical methods
Univariate analyses on continuous prognostic factors were performed using Cox regression and categorical prognostic factors were analysed using the Kaplan-Meier and the log-rank methods (Hosmer and Lemeshow, 1999). The linearity (or more complex forms) of the effect of continuous prognostic factors was assessed using fractional polynomials (Royston and Altman, 1994).
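As an illustration of the Kaplan-Meier method mentioned above, the following is a minimal sketch of the product-limit estimator in plain Python. It is not the S-Plus code used in the study; the function name `kaplan_meier` and the toy follow-up times are assumptions made purely for illustration.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time.

    times  : follow-up times in any consistent unit (e.g. days)
    events : 1 if the patient died at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # everyone who leaves the risk set at time t
    at_risk = len(times)
    surv, curve = 1.0, []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the conditional survival at t.
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= leaving[t]
    return curve


# Toy data, invented for illustration only (not from the study).
toy_times = [5, 8, 8, 12, 15, 20]
toy_events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(toy_times, toy_events)
```

Censored patients contribute to the number at risk up to their censoring time but never trigger a drop in the curve, which is why the estimate changes only at observed death times.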
Because about 70% of the patients (831 of 1189) had at least one prognostic factor missing, multiple imputation (Rubin, 1987) was applied to account for missing prognostic factor data when considering all variables at once. Our approach, assuming that these data are missing at random (MAR) (Rubin, 1987), is similar to that described by Van Buuren et al (1999). This involved creating 10 complete datasets by replacing missing values with simulated values from a Gibbs sampling procedure (Gelfand and Smith, 1990). The analysis presented is a pooled summary of the results from the 10 datasets.
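The pooling step can be sketched as follows, assuming the standard combining rules of Rubin (1987) were used; the text says only that results were pooled, so this is an assumption. The function name `pool_rubin` and the toy coefficients are hypothetical.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets (Rubin's rules).

    estimates : the coefficient (e.g. a log hazard ratio) from each dataset
    variances : its squared standard error from each dataset
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Toy values for m = 10 imputed datasets, invented for illustration only.
toy_estimates = [0.10, 0.12, 0.08, 0.11, 0.09, 0.10, 0.13, 0.07, 0.10, 0.10]
toy_variances = [0.0004] * 10  # i.e. a standard error of 0.02 in every dataset
pooled_est, pooled_se = pool_rubin(toy_estimates, toy_variances)
```

The pooled standard error exceeds the within-dataset standard error because the between-imputation term carries the extra uncertainty due to the missing values.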
The principal method of multivariate analysis was Cox regression. The models were formulated by systematically removing predictors that were not significant (P > 0.05), starting from a (full) model containing all the prognostic factors (including factors that were non-significant in the univariate analysis). The proportional hazards assumption for each predictor was tested using an approximate score statistic of linear correlation between the rank order of failure times in the sample and Schoenfeld partial residuals (Grambsch and Therneau, 1994).
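The elimination strategy described above can be sketched generically. This sketch assumes that predictors are removed one at a time, least significant first, which the text does not state explicitly. The callback `fit_pvalues` is hypothetical: in practice it would refit the Cox model (pooled over the imputed datasets) and return one P value per factor. The toy P values are invented, apart from the 0.54 reported for log CA125.

```python
def backward_eliminate(candidates, fit_pvalues, alpha=0.05):
    """Drop the least significant predictor until every P value is <= alpha.

    candidates  : names of the predictors in the full model
    fit_pvalues : hypothetical callback; takes a list of names, refits the
                  model and returns {name: p_value}
    """
    kept = list(candidates)
    while kept:
        pvals = fit_pvalues(kept)
        worst = max(kept, key=lambda name: pvals[name])
        if pvals[worst] <= alpha:
            break  # everything remaining is significant
        kept.remove(worst)
    return kept


# Toy stand-in for model fitting: fixed P values regardless of the model.
TOY_P = {"age": 0.001, "figo_stage": 0.0001, "log_ca125": 0.54, "ascites": 0.01}


def toy_fit(names):
    return {name: TOY_P[name] for name in names}


selected = backward_eliminate(["age", "figo_stage", "log_ca125", "ascites"], toy_fit)
```

For a multi-level factor such as histology, the P value returned by the callback would need to be an overall test of the factor rather than a test of any single level.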
We evaluated the predictive performance of models by considering measures of discrimination and calibration. Discrimination refers to the ability to distinguish between high-risk and low-risk patients, and was quantified using the c-index and Nagelkerke's $R^2$ ($R_N^2$) (Harrell, 1999). The c-index, a generalisation of the area under the Receiver Operating Characteristic (ROC) curve, is the probability of concordance between predicted and observed survival, with c = 0.5 for random predictions and c = 1 for a perfectly discriminating model. Similarly, $R_N^2 = 0$ indicates no predictive ability and $R_N^2 = 1$ indicates perfect predictions. Calibration (or reliability) refers to whether the predicted probabilities agree with observed probabilities. Calibration was quantified using an estimate of slope shrinkage (Harrell, 1999), based on 200 bootstrap samples; a value of 1 indicates perfect calibration.
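To make the c-index concrete, here is a minimal sketch of a concordance calculation for right-censored data. It is a simplified version (pairs with tied survival times are ignored) and not the Hmisc/Design implementation used in the study; the function name `c_index` and the toy data are assumptions.

```python
def c_index(times, events, risk):
    """Concordance between predicted risk and observed survival.

    A pair (i, j) is usable when patient i has an observed death strictly
    before patient j's follow-up time. It is concordant when the patient
    who died earlier also has the higher predicted risk.
    """
    concordant = ties = usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier death
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    ties += 1
    return (concordant + 0.5 * ties) / usable


# Toy data, invented for illustration only.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
c_perfect = c_index(toy_times, toy_events, [0.9, 0.7, 0.2, 0.4])
c_reversed = c_index(toy_times, toy_events, [0.1, 0.3, 0.8, 0.6])
c_constant = c_index(toy_times, toy_events, [0.5, 0.5, 0.5, 0.5])
```

A constant prediction scores 0.5 because every usable pair counts as a tie, matching the interpretation of c = 0.5 as no better than chance.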
All statistical analyses were carried out with S-Plus 2000 (Release 3) using the Hmisc, Design and MICE software libraries. The MICE library is available from www.multipleimputation.com.

Patient characteristics
Patients were aged between 15 and 90 years at the time of diagnosis (median 61 years), presented at initial diagnosis with predominantly FIGO stages III and IV (64.6%), and predominantly with a serous papillary (51.8%) histology. The potential prognostic factors are summarised in Table 1. Performance status levels 3 and 4 were combined because of small numbers. CA125 and alkaline phosphatase laboratory results were skewed, and a summary of both transformed using natural logs is also presented. The median times between diagnosis and the CA125, alkaline phosphatase and albumin were 27 days (range 0-86 days), 30 days (range 0-88 days) and 30 days (range 0-88 days), respectively.
The treatment data are summarised in Table 2. 1140 (95.9%) patients underwent surgery, and of those, 860 (75.4%) had chemotherapy. Of the 893 (75.1%) patients who received chemotherapy as part of first-line treatment, 656 (73.5%) patients were treated with single-agent platinum regimens. The median time between diagnosis and first course of chemotherapy was 32 days (range 1-115 days). 1080 (94.7%) patients were diagnosed at time of surgery. Of those 49 (4.1%) patients who did not have surgery, 42 were FIGO stage III or IV, and 7 were missing a FIGO stage. Of those 296 (24.9%) patients who did not have chemotherapy, 96 (32.4%) were FIGO stage III or IV, 159 (53.7%) were FIGO stage I, and 10 were missing a FIGO stage.

Long-term outcome
Follow-up information was available on all 1189 patients. 842 (70.8%) patients had died at the time of censoring the data. Median follow-up in the 347 (29.2%) patients who were living at the point of censoring was 1665 days (range 29-5852 days). Five-year survival in the cohort was 29.6% (95% CI: 26.8-32.5%), in keeping with international mortality rates.

Univariate analysis
Univariate analyses suggested that age, FIGO stage, the presence or absence of ascites, performance status, histology, debulking, albumin, grade, (log) CA125 and alkaline phosphatase were significant (P ≤ 0.05) prognostic factors for overall survival. Investigations of linearity of albumin, age, alkaline phosphatase and CA125 in the univariate models suggested that alkaline phosphatase and CA125 required a natural log transformation. Missing data were not included in these analyses.

Missing data
The frequency of missing values (labelled as 'unknown') for each prognostic factor is presented in Table 1. Overall, there were 2045 (17.2%) missing values distributed among 831 (69.9%) patients. 236 (19.8%) patients had 4 or more missing values, and only 4 (0.4%) patients had 7 or more missing prognostic factors. 1739 (85.0%) of the missing values resulted from missing information on albumin, alkaline phosphatase, CA125 and performance status. The number of patients contributing to a complete case analysis using all the prognostic factors in Table 1 would be 358 (245 deaths), and the number contributing to the final model (Table 3) would be 450 (323 deaths). The missing data methodology produced values for the missing data that were consistent with the non-missing data, accounted for the uncertainty in those imputed values, and ultimately allowed us to perform an analysis with greater power on all 1189 patients.
Table 3 shows the results of the Cox regression analysis. There was significantly greater risk of mortality associated with older age, higher FIGO stage, more poorly differentiated grade, the presence of ascites, worse performance status, greater residual disease, a lower albumin level and a higher (log) alkaline phosphatase level. We combined FIGO stages III and IV because their (log) hazard ratios were identical to one decimal place (P = 0.66). Histology comparisons, which used the serous group as the reference, were statistically non-significant, except for comparisons of the 'endometrioid', 'mixed mesodermal', and 'clear cell' groups with the 'serous' group (P ≤ 0.05). There were also differences between the 'mixed mesodermal' group and the 'mucinous' and 'endometrioid' groups (P < 0.05). Using a likelihood-ratio test, the overall histology effect was significant (P < 0.0001), and therefore all histological comparisons were included in the final model.
The only variable omitted was log CA125, which had a hazard ratio of 1.022 (95% CI: 0.951-1.097) and was not statistically significant (P = 0.54). There were no significant overall interaction effects between FIGO stage and log CA125 (P = 0.32), FIGO stage and grade (P = 0.74), or grade and histology (P = 0.06). Although there was no evidence of an overall FIGO stage and histology interaction (P = 0.62), there was weak evidence (P = 0.03) that patients with a mucinous histology and advanced disease have more aggressive disease and a worse prognosis than those with a serous histology (HR: 2.050; 95% CI: 1.073-3.916).
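The reported hazard ratios, confidence intervals and P values can be checked against one another with a short calculation. The sketch below assumes the intervals are symmetric Wald intervals on the log scale with a normal reference distribution; with multiple imputation the authors may have used a different reference distribution, so agreement is only approximate. The function name `wald_from_ci` is hypothetical.

```python
import math


def wald_from_ci(hr, lower, upper, z_crit=1.959964):
    """Recover the standard error, z statistic and two-sided P value
    from a hazard ratio and its 95% confidence limits."""
    log_hr = math.log(hr)
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = log_hr / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return se, z, p


# Values taken from the text above.
_, _, p_ca125 = wald_from_ci(1.022, 0.951, 1.097)     # log CA125
_, _, p_mucinous = wald_from_ci(2.050, 1.073, 3.916)  # mucinous, advanced disease
```

Under these assumptions the recovered P values are about 0.55 and 0.03, close to the reported 0.54 and 0.03; the small discrepancy for log CA125 is consistent with rounding of the published limits.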

Multivariate analysis
The estimated slope shrinkage (0.948) is close to 1, indicating little need for recalibration and little evidence of over-fitting.
Reasonably large values for the c-index (0.786) and Nagelkerke $R_N^2$ (0.511) indicate that the set of prognostic factors explains the variation in outcome reasonably well, which suggests good prediction for individual patients. Figure 1 illustrates good discrimination ability for 4 risk groups constructed by partitioning patients at the quartiles of their predicted risks based on the model. The survival curves and 95% confidence intervals for the risk groups are well separated. The c-index and $R_N^2$ in those with early disease (FIGO stages I/II) were 0.715 and 0.250, respectively, whilst in advanced cases (FIGO stages III/IV) they were 0.727 and 0.371, respectively. This demonstrates that the model also discriminates well within groups with early or advanced disease.
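The construction of the 4 risk groups can be sketched as follows. This assumes the predicted risk is summarised by each patient's linear predictor from the Cox model and that the cut-points are the sample quartiles; the function name `quartile_risk_groups` and the toy values are hypothetical.

```python
from statistics import quantiles


def quartile_risk_groups(linear_predictors):
    """Assign each patient to risk group 1 (lowest) to 4 (highest)."""
    q1, q2, q3 = quantiles(linear_predictors, n=4)

    def group(lp):
        if lp <= q1:
            return 1
        if lp <= q2:
            return 2
        if lp <= q3:
            return 3
        return 4

    return [group(lp) for lp in linear_predictors]


# Toy linear predictors, invented for illustration only.
toy_lp = [0.5, 0.1, 0.8, 0.3, 0.2, 0.7, 0.4, 0.6]
toy_groups = quartile_risk_groups(toy_lp)
```

Kaplan-Meier curves would then be drawn separately within each group to produce a display like Figure 1.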

Application of the model
Our final prognostic model may be used to calculate expected survival probabilities at various times for different patients. Figure 2 is a nomogram (Lubsen et al, 1978) which enables a clinician to calculate a median survival time or 2- and 5-year survival probabilities for patients. For each level of the prognostic factors there is a number of points allocated, and the total number of points from all prognostic factors can be converted into a median survival time or predicted survival probabilities at 2 and 5 years. For example, consider a 60-year-old woman (~36 points) with FIGO stage III (~39 points), a grade II tumour (= 13 points), a serous papillary histology (~8 points), no ascites present (= 0 points), a performance status of level 1 (~7 points), residual disease less than 2 centimetres (= 0 points), an albumin level of 30 (~14 points), and a log alkaline phosphatase level of log 90 (= 4.50) (~30 points). This patient has a total score of approximately 147, which translates into survival probabilities at 2 and 5 years of approximately 0.7 and 0.25, respectively. The median survival time is approximately 2.5 years.
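The arithmetic of the worked example can be reproduced directly. The points below are the approximate values quoted in the text for this one patient; the dictionary keys are labels chosen for illustration. Converting the total into survival probabilities requires reading the scales of the nomogram in Figure 2, which is not reproduced here.

```python
import math

# Approximate nomogram points for the example patient, as quoted in the text.
example_points = {
    "age 60 years": 36,
    "FIGO stage III": 39,
    "grade II": 13,
    "serous papillary histology": 8,
    "no ascites": 0,
    "performance status 1": 7,
    "residual disease < 2 cm": 0,
    "albumin 30": 14,
    "log alkaline phosphatase 4.50": 30,
}

total_points = sum(example_points.values())

# The natural log of an alkaline phosphatase level of 90, to 2 decimal places.
log_alk_phos = round(math.log(90), 2)
```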

DISCUSSION
Despite the plethora of published prognostic models in cancer, very few are used routinely in clinical practice (Wyatt and Altman, 1995). One problem associated with the acceptance of these models relates to their construction, which predominantly relies on small data sets. Sometimes limited analysable data are a result of low prevalence of disease and/or missing data. When performing a retrospective evaluation of large clinical databases, missing data can be a problem. It is usual to exclude from an analysis those individuals on whom data are missing; this practice reduces the statistical power of an analysis and can also lead to invalid results if the excluded group is a selective sub-sample of the entire dataset with respect to prognosis. We applied a missing data methodology (multiple imputation) and Cox regression to construct a prognostic model for overall survival in 1189 ovarian cancer patients. This model included age, FIGO stage, ascites, performance status (2, 3 and 4 versus 0), grade, histology (particularly 'mixed mesodermal' versus other types), albumin, (log) alkaline phosphatase and debulking as (significant) prognostic factors (P ≤ 0.05). Measures of predictive performance indicated that this model is well calibrated (i.e. has good ability to produce unbiased estimates of outcome), and has good discrimination abilities (i.e. has the ability to provide reasonably accurate predictions for individual patients). We also produced a nomogram for our model, which enables calculation of median survival times and expected survival probabilities for individual patients at 2 and 5 years.
Could the model be used in clinical practice? To help answer this question we have addressed 3 issues. First, we have reviewed previously published models, specifically the composition of models, the sample size of the studies used to construct them, and ultimately their predictive ability. Second, we consider the generalisability and transportability of the model to other populations of ovarian cancer patients. Third, we comment on the practicability of applying the current model in a clinical setting.

Existing models
We found many reports of multivariate models, but there are problems in synthesising this information into a meaningful overview. These problems include: small sample size of studies, limited details regarding which factors were considered in the analysis, inconsistencies in methodology between studies evaluating the same factor (e.g. categorisation of continuous factors or handling missing data), and non-uniform patient populations both between and within studies. The latter can be minimised when the data are from a prospective trial. Those analyses that are retrospective are often flawed, first, by considering response to treatment as a potential prognostic factor and, second, by not accounting for the use of different chemotherapy regimens.
We have restricted our review of prognostic models to those published since 1985 and with more than 200 patients in the study. This can be justified by potential problems with changes in treatment practice and small sample sizes. We have focused on variables that were available for analysis and their statistical significance in the model. There was no consideration of a variable's form or the magnitude of its hazard ratio(s). Table 4 is a summary of our results. The above limitations ignored, the most consistent 'statistically significant' prognostic factors are stage and post-operative residual disease. Histology and grade were incorporated into most assessments, but were less often significant. Some studies included either early or advanced ovarian cancer, while others included both groups. Although treatment practices may differ between early and advanced stage groups, it seems sensible to use all FIGO stages together in a dataset (if possible) because, given a larger number of patients, multivariate analyses can detect heterogeneity in prognostic factor effects between the various stages by fitting interaction terms.
In those studies (Swenerton et al, 1985; Hogberg et al, 1993; Hartmann et al, 1994; Kehoe et al, 1994; Kosary, 1994; Meden et al, 1995; Marx et al, 1997; Brun et al, 2000) that considered a complete (stage) case mix, grade and FIGO stage were consistently significant, with residual disease, histology, performance status, ascites, and albumin being significant in some, but not all, models. The largest study (Kosary, 1994) used the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) data on 21 240 patients diagnosed between 1973 and 1987. Although there was substantial power, data on residual disease and performance status were not presented. Among other large studies, Kehoe et al (1994) presented a study of 1184 patients, but because patients with any missing data were removed (an approach that could bias their results), the multivariate model was based on just 451 (38%) patients. Swenerton et al (1985) developed a model based on 556 patients with all FIGO stages, but their patients were treated with pre-platinum-based regimens. Brun et al (2000) included a chemotherapy factor in their multivariate model as a way of adjusting for those patients with different chemotherapy regimens, and comparing those who received and did not receive chemotherapy. Although there was a statistically significant difference between those who received and did not receive chemotherapy, there was homogeneity between the effects of different chemotherapy regimens. In our multivariate model, the addition of a chemotherapy factor (applied versus not applied) was statistically significant (P = 0.02). This is not unexpected as treatment decisions are based on baseline prognostic factors. For example, women with early stage disease are less likely to receive chemotherapy. This is confirmed in our cohort, with the proportion of those treated with chemotherapy in FIGO stages I, II and III/IV being 43%, 74% and 88%, respectively.
We also considered the effects of different chemotherapy regimens in the modelling framework above, and found there was no difference in overall survival between single platinum and combination regimens (P = 0.92). Although these are not randomised comparisons and interpretation is problematic, this result may be due to the high proportion (83%) of those chemotherapy patients receiving any form of platinum-based therapy, and the effect of less follow-up in the combination group (median 1.8 years). Fewer than 5% of chemotherapy patients received new first-line regimens, such as taxanes, and in the future our model may have to account for the potential survival improvement from these and other therapeutic advances.
Given that 65% of our cohort were FIGO stage III or IV, it seems reasonable to compare our model with those from studies of advanced ovarian cancer. Our dataset (n = 768) is one of the largest when considering these cases alone. A larger study, Marsoni et al (1990), combined data from 4 clinical trials on FIGO stage III and IV patients. This study of 914 patients produced a model including age, FIGO stage, histology and residual disease. However, it was found that performance status could replace age and residual disease in those 721 patients not missing performance status. Of those other studies with more than 200 advanced cases (Marsoni et al, 1990; Alberts et al, 1993; Bruzzone et al, 1995; Peters-Engl et al, 1999), the maximum size was 512 cases, and stage and residual disease were the core factors, with grade, histology, performance status, ascites, and albumin being in some, but not all, models. We have not found a statistically significant difference between FIGO stages III and IV.
Once again, there are studies with multivariate analyses performed using substantially fewer patients than those initially recruited to the study. Some studies (Van Houwelingen et al, 1989; Lund and Williamson, 1991; Kosary, 1994; Nagele et al, 1995; Peters-Engl et al, 1999; Brun et al, 2000) have dealt with missing data for each factor by including it as a separate level or category. This practice can lead to unrealistic hazard ratios and underestimation of standard errors (Greenland and Finkle, 1995), but may be better than excluding whole cases.
CA125 was not significant in our model, and it is now widely accepted that baseline measurements do not predict long-term survival of patients, especially in patients with advanced ovarian cancer (Peters-Engl et al, 1999). There are some prognostic factors from Table 4 that are not present in our model. These include race, type of surgeon, and new molecular markers. Decreased survival has occasionally been observed in non-white women, and it is unknown whether this is due to racial differences in stage distribution, histology or grade, or whether it reflects treatment differences, access to care or other socioeconomic factors. Race has been an important prognostic factor in   (Junor et al, 1999). This is not a concern in our dataset as more than 90% of the Edinburgh patients have been operated on by a gynaecologist. In the past few years there have been several reports about the prognostic impact of molecular markers, including oncogene products (her-2/neu, p21), tumour suppressor gene products (p53, p16, pRB) and measures of drug sensitivity (Pgp, LRP, MRP, GST, BAX). There is difficulty in distilling these data into a clear conclusion on which of these measures are important prognostic factors, since there are few reports of substantial size. Two studies presented include her-2/neu, but we do not have sufficient data to test this factor. Although it is foreseeable that computer chip technology may produce new (molecular and genetic) prognostic factors and the prognostic models for the future are likely to be completely different, there are currently not enough data or follow-up to test them rigorously. The best way to do this is by large prospective studies, randomised if possible (Simon and Altman, 1994). None of the studies in Table 4 formally assessed predictive ability using the calibration and discrimination measures we applied.
Factors that affect predictive ability include: sample size, number of events, quality of the study design, quality of the data, and efficient model construction. Because multiple imputation allowed us to retain all patients in the analysis, the model compares favourably to other studies with respect to sample size, number of deaths and power. Development of prognostic models is an exploratory analysis. By searching among many variables, there is a risk of including some variables that are not truly prognostic (Simon and Altman, 1994). During the construction of the model, we minimised false positive errors by assessing a limited number of predetermined variables. We also assessed assumptions associated with the multivariate Cox model, including the linearity of continuous variables and proportionality of hazards, and these were satisfied.
Some studies (Swenerton et al, 1985; Van Houwelingen et al, 1989; Lund et al, 1990; Marsoni et al, 1990; Lund and Williamson, 1991; Hogberg et al, 1993; Kappen and Neijt, 1993; Warwick et al, 1995) assessed discrimination by creating risk groups (via indices) based on the same methodology we employed. Log-rank methodology was applied to assess differences in these groups. As the construction of the risk groups is arbitrary, it is difficult to compare them using measures of separation.

Transportability to other populations
The Edinburgh population is predominantly Caucasian and, given the discussion of race above, our model may not be suitable outside this type of population. To assess the transportability of our model requires validation in at least one independent population (Altman and Royston, 2000). Table 4 shows that the variables in our model are broadly consistent with most of the other published models. In future work we plan to externally validate our model and a possible index. To our knowledge, 2 groups (Lund et al, 1990;Carey et al, 1993) have attempted to externally validate a model or index. Lund et al (1990) used Danish trial data to validate a statistical model from Dutch trial patients containing FIGO stage, grade, performance status, ascites, and residual disease. They found that performance status and residual disease were the only factors in common, and suggested that these factors would be a starting point for a basic index. Carey et al (1993) presented a larger validation of models from Dembo et al (1982), but their outcome of interest was ovarian cancer mortality. Overall survival is a more reliable outcome.

Practicability of applying a model in a clinic
We have shown how a nomogram may be used to assess a woman's prognosis, and could potentially be used as the basis for making treatment decisions. However, any predictions would need to be presented with a measure of uncertainty, perhaps in the form of confidence limits. Our model, if and when validated, could benefit from being simplified and converted to a web-based medium for clinical utility.
In summary, we have constructed a prognostic model that is based on one of the largest analysable datasets in ovarian cancer. Our model is broadly consistent with other models in the literature, but it needs to be validated in an external dataset. In coming years the identification of prognostic molecular factors may change such models considerably. The production of valid models in the future will depend on the availability of large databases where both standard factors and molecular markers can be assessed. This could be done in the context of large (randomised) multicentre trials where treatment, follow-up and endpoints are predefined and measured according to identical criteria in all patients and, ideally, multiple laboratory measures are incorporated. We believe our model is one of the best available until such evidence is accumulated.