Completeness of case ascertainment and survival time error in English cancer registries: impact on 1-year survival estimates

Background: It has been suggested that cancer registries in England are too dependent on processing of information from death certificates, and consequently that cancer survival statistics reported for England are systematically biased and too low. Methods: We have linked routine cancer registration records for colorectal, lung, and breast cancer patients with information from the Hospital Episode Statistics (HES) database for the period 2001–2007. Based on record linkage with the HES database, records missing in the cancer register were identified, and dates of diagnosis were revised. The effects of those revisions on the estimated survival time and proportion of patients surviving for 1 year or more were studied. Cases that were absent in the cancer register and present in the HES data with a relevant diagnosis code and a relevant surgery code were used to estimate (a) the completeness of the cancer register. Differences in survival times calculated from the two data sources were used to estimate (b) the possible extent of error in the recorded survival time in the cancer register. Finally, we combined (a) and (b) to estimate (c) the resulting differences in 1-year cumulative survival estimates. Results: Completeness of case ascertainment in English cancer registries is high, around 98–99%. Using HES data added 1.9%, 0.4% and 2.0% to the number of colorectal, lung, and breast cancer registrations, respectively. Around 5–6% of rapidly fatal cancer registrations had survival time extended by more than a month, and almost 3% of rapidly fatal breast cancer records were extended by more than a year. The resulting impact on estimates of 1-year survival was small, amounting to 1.0, 0.8, and 0.4 percentage points for colorectal, lung, and breast cancer, respectively. Interpretation: English cancer registration data cannot be dismissed as unfit for the purpose of cancer survival analysis. 
However, investigators should retain a critical attitude to data quality and sources of error in international cancer survival studies.

It has been suggested that cancer registries in England are too dependent on the processing of information from death certificates (Bullard et al, 2000; Robinson et al, 2007; Beral and Peto, 2010; Møller et al, 2010). This could have two adverse effects: (a) case ascertainment would be incomplete, particularly for non-fatal cases; and (b) survival time would be too short, because hospital activity related to recurrent disease or end-of-life care would sometimes be recorded as the first known event and hence provide the date of diagnosis. A further consequence of these errors would be that (c) reported cancer survival statistics for England would be systematically biased and too low.
Cancer registries in England have recently linked routine cancer registration records with information from the Hospital Episode Statistics (HES) database (http://www.hesonline.nhs.uk). The HES database contains details of all in-patient and day-case admissions to National Health Service (NHS) hospitals in England.
The linked data set provides a new opportunity to evaluate the magnitude of errors (a), (b), and (c), defined above. Based on record linkage with the HES database, records missing in the cancer register were identified, and dates of diagnosis were revised.
The effects of those revisions on the estimated survival time and proportion of patients surviving for 1 year or more were studied. Cases that were absent in the cancer register and present in the HES data with a relevant diagnosis code and a relevant surgery code were used to estimate (a) the completeness of the cancer register. Differences in survival times calculated from the two data sources were used to estimate (b) the possible extent of error in the recorded survival time in the cancer register. Finally, we combined (a) and (b) to estimate (c) the resulting differences in 1-year cumulative survival estimates.
The analyses presented here were executed at the Thames Cancer Registry on behalf of cancer registries in England, in order to address the comments in a recent editorial by Beral and Peto (2010). The principal analysis and the form of reporting of findings were specified before we had any knowledge of the results of the investigation.

MATERIALS AND METHODS
We identified all cancer records in the HES-only side of the linked data set that in 2001–2007 had activity relating to colorectal cancer (ICD-10 C18–C21), lung cancer (C33–C34), or breast cancer (C50), and that had a surgery code indicating a relevant, non-diagnostic surgical procedure. By HES-only cases, we mean the cases identified by the record linkage that are not present in the cancer register but are present in the HES data with a relevant diagnosis code and a relevant surgery code. These cases would not be included in routine cancer survival analysis, and they represent good-prognosis cases that may have been missed in the primary case ascertainment in the cancer registries, and not subsequently identified through routine record linkage with death certificates.
The three cancer diagnosis groups were selected to represent the spectrum of fatality among different, common types of cancer. The HES-only records were considered as an indication of the possible magnitude of under-ascertainment of non-fatal cancer cases in the cancer registries.
We considered only surgically treated cases because the combination of diagnosis code and resection code in the HES record gives a high degree of certainty that the record represents a true record of cancer. The HES-only records without an indication of cancer treatment would not be considered as sufficient evidence to create a cancer registration, and would need to be verified against other clinical records. The large majority of such HES-only records are known to relate to cases where cancer might have been suspected but it was not subsequently confirmed (Brewster et al, 1997).
To give a measure of incompleteness, we compared the number of HES-only records (with surgical treatment) against the number of regular cancer registration records in the linked data set, by year (2001–2007) and cancer registry. We plotted the incompleteness measure for each cancer registry for each year.
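The incompleteness measure described above can be sketched as a simple ratio. This is a minimal illustration, assuming that incompleteness is the number of HES-only surgical cases expressed as a percentage of registered cases, and that completeness is the registered share of all identified cases. The function names and the counts are hypothetical and are not taken from the study.

```python
def incompleteness_pct(hes_only: int, registered: int) -> float:
    """HES-only surgical cases as a percentage of registered cases."""
    if registered <= 0:
        raise ValueError("registered must be positive")
    return 100.0 * hes_only / registered


def completeness_pct(hes_only: int, registered: int) -> float:
    """Registered cases as a percentage of all identified cases (assumed definition)."""
    total = registered + hes_only
    if total <= 0:
        raise ValueError("no cases supplied")
    return 100.0 * registered / total


# Hypothetical counts per (registry, year): (HES-only cases, registered cases)
counts = {
    ("Registry A", 2001): (40, 1000),
    ("Registry A", 2002): (30, 1000),
}

# One incompleteness value per registry and year, ready for plotting
by_stratum = {key: incompleteness_pct(h, r) for key, (h, r) in counts.items()}
```

Under these definitions an incompleteness of 4% corresponds to a completeness of just over 96%, which is why the two figures quoted in the text do not sum exactly to 100.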
To evaluate the possible magnitude of survival time error, we identified all cases in the cancer registry records with a recorded survival time of <1 year. These rapidly fatal cases were considered the ones most likely to be influenced by a systematic survival time error. Within the three groups of cancers, we searched the HES records for evidence of an earlier cancer diagnosis for these persons (with or without a record of surgery), and used the first matching HES record with a relevant diagnosis code. We computed the difference in survival time (days) using the two alternative dates of diagnosis. We described the distributions of the survival time difference, stratified by type of cancer and cancer registry.
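The survival time comparison can be illustrated as follows. This is a sketch under stated assumptions: the HES date is used only when it precedes the registry date of diagnosis, "more than 1 month" is taken as more than 30 days, and "more than 1 year" as more than 365 days. The function names and example dates are hypothetical.

```python
from datetime import date
from typing import Optional, Sequence


def survival_days(diagnosis: date, death: date) -> int:
    """Survival time in days from diagnosis to death."""
    return (death - diagnosis).days


def survival_difference_days(registry_dx: date, hes_dx: Optional[date], death: date) -> int:
    """Extra survival time (days) gained by using an earlier HES diagnosis date."""
    if hes_dx is None or hes_dx >= registry_dx:
        return 0  # no earlier HES evidence, so survival time is unchanged
    return survival_days(hes_dx, death) - survival_days(registry_dx, death)


def share_exceeding(differences: Sequence[int], threshold_days: int) -> float:
    """Percentage of cases whose survival difference exceeds the threshold."""
    if not differences:
        return 0.0
    return 100.0 * sum(d > threshold_days for d in differences) / len(differences)


# Hypothetical case: registry diagnosis 1 March 2005, first HES record 10 January 2005
diff = survival_difference_days(date(2005, 3, 1), date(2005, 1, 10), date(2005, 6, 1))
```

Applying `share_exceeding` to the differences within each cancer type and registry, with thresholds of 30 and 365 days, gives the kind of summary reported in the results.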
Finally, we evaluated the likely magnitude of the influence of incompleteness and survival time error on a commonly used outcome measure: the 1-year survival proportion. We computed … 3.0% …

RESULTS
Table 1 shows the estimated incompleteness of case ascertainment. The HES-only cases added 1.9% to the number of colorectal cancer registrations, 0.4% to lung cancer, and 2.0% to breast cancer. These effects were similar in males and females, slightly higher in the younger age groups, and declined over the period 2001–2007. There was some variation between cancer registries, with the highest incompleteness in the Thames Cancer Registry (4.1% in colorectal cancer, 0.5% in lung cancer, and 4.3% in breast cancer) and the lowest in the Trent and the South West registries. There was a general decrease in incompleteness over time in most cancer registries (Figure 1).
For breast cancer, 6.2% of cases had a survival time difference of more than 1 month and 2.7% differed by >1 year. There was variation in the distributions between the cancer registries, which persisted for differences of >1 year (Figure 2C). The proportion of cases with a survival difference of >1 year ranged from 0.8% in Northern and Yorkshire to 4.4% in Trent.
Table 3 shows the three alternative analyses of 1-year survival. The 1-year survival estimates increased when the HES-only cases and their respective survival times were considered in the analysis. The changes were small, amounting to 0.5, 0.2, and 0.2 percentage points for colorectal, lung, and breast cancer, respectively. With the further use of the HES-derived survival times for the cancer registration records, the 1-year survival estimates increased further but the changes remained small: 1.0, 0.8, and 0.4 percentage points, respectively.
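The three alternative analyses of 1-year survival can be sketched as crude proportions. This is an illustrative reconstruction, not the authors' stated method: it assumes A is the number of registered cases, B the number of HES-only cases, C and D the deaths within 1 year in each group, and E the registered cases that survive beyond 1 year once the date of diagnosis is revised. It ignores censoring and any adjustment for background mortality, and all counts are hypothetical.

```python
from typing import Tuple


def one_year_survival_pct(cases: int, deaths_within_year: int) -> float:
    """Crude percentage of cases surviving 1 year or more."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    if not 0 <= deaths_within_year <= cases:
        raise ValueError("deaths must lie between 0 and cases")
    return 100.0 * (cases - deaths_within_year) / cases


def three_analyses(a: int, b: int, c: int, d: int, e: int) -> Tuple[float, float, float]:
    """Return survival for: registry only, plus HES-only cases, plus revised dates."""
    registry_only = one_year_survival_pct(a, c)
    with_hes_only = one_year_survival_pct(a + b, c + d)
    # Revised dates move e registered cases from "died within 1 year" to "survived"
    with_revised_dates = one_year_survival_pct(a + b, c + d - e)
    return registry_only, with_hes_only, with_revised_dates


# Hypothetical counts, chosen only to show the direction of the changes
s1, s2, s3 = three_analyses(a=10000, b=200, c=3000, d=10, e=50)
```

Because HES-only cases are mostly good-prognosis cases and revised dates can only lengthen survival, each step raises the estimate, which matches the direction of the changes reported above.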

DISCUSSION
The main findings from this analysis are that completeness of case ascertainment in English cancer registries is high, possibly as much as 98–99%, when evaluated against independently recorded hospital episodes which included relevant cancer diagnosis and surgery codes. The analysis found evidence of the hypothesised survival time error. Around 5–6% of rapidly fatal (<1 year) cancer registrations had the survival time extended by more than a month.
There are important limitations of this analysis, and we do not propose that it gives a full and accurate estimation of completeness and survival time errors. The analysis uses a new source of data and a pre-specified analysis plan to indicate the possible magnitude and impact of the errors and biases proposed by Beral and Peto (2010) and previously investigated and discussed by ourselves (Bullard et al, 2000; Robinson et al, 2007; Møller et al, 2009, 2010). We decided a priori to consider only resected cases from HES as potentially missed cases in cancer registration; these would be representative of the non-fatal cancers that the registration process might have missed (Bullard et al, 2000). The most likely reason for any absence of surgical treatment of hospitalised cancer patients would be that they were too ill to be considered eligible for surgery. The exclusion of such patients as potential cases is not likely to result in artificial underestimation of survival, but rather the contrary.
A small proportion of cancer patients may have had their diagnosis and surgery services provided in private hospitals, particularly patients residing in the London area. Such patients are not all recorded in the cancer registries and treatment services provided on a private basis are not recorded in HES.
The analysis suggests a slightly lower completeness in the youngest age groups, that is, colorectal and lung cancer patients below 50 years and breast cancer patients below 35 years (Table 1). This observation is based on small numbers. The good prognosis of young patients may be a contributing factor to this.
The principal limitation of the study lies in the completeness of the record linkage with the HES data and the accuracy of the information therein. Unique person identifiers (NHS numbers) have come to be almost universally used in NHS hospitals only in the last few years, and this puts a limit on the period covered in the linked data set. The year-on-year improvement in the availability of NHS numbers and in the completeness of the record linkage is the most likely reason for the slightly lower estimated completeness in 2001 and 2002 for colorectal cancer and breast cancer (Table 1).
Even in the most recent period, the linkage algorithm used NHS number, sex, date of birth, postcode, and date of death, and it is known to be imperfect. Some of the apparent HES-only cases will in fact have a corresponding record in the cancer registry, and there may be duplication whereby more than one of the HES-only cases relates to a single person.
Additionally, there are known errors in the routine HES data (as in any administrative data set) and some cases will have been missed because they did not include the specific cancer diagnosis or a relevant surgery code. We are not able to determine the direction and magnitude of errors created by these imperfections, but it seems unlikely that our analysis is severely flawed or biased. We will continue to explore means of quality assurance and improvement of the cancer registry records. The new linked data set will gradually improve through quality assurance processes related to the continuous use of the data and its annual updating.
Taken at face value, the 1–4% incompleteness in the Thames Cancer Registry is about as we would expect from previous analyses (Bullard et al, 2000; Robinson et al, 2007) and a recent update thereof (unpublished data, available on request). It is reassuring that most registries seem to have even higher completeness than Thames. The analysis of survival time differences between HES and cancer registries serves as a sensitivity analysis of survival estimates derived from English cancer registry data, but it should not be inferred that the earlier diagnosis date from HES is the correct one, particularly when the difference is small. The date of diagnosis in cancer registration is not always the date of first hospital activity or first clinical diagnosis. In many cases, the date (often later) of the definitive histopathological diagnosis will prevail, in accordance with the international definition of date of diagnosis. The observed distribution of survival time differences in the North West cancer registry could be due to a more rigorous application of this rule, and does not necessarily point to a particular problem in the processing of death certificate information. The differences we have found between cancer registries will be explored by the registries and used in their continued quality assurance and improvement of the service.
In conclusion, we confirm the hypothesis (Beral and Peto, 2010), and our own expectations (Bullard et al, 2000; Robinson et al, 2007; Møller et al, 2010), that incompleteness of case ascertainment and survival time error are real phenomena which bias cancer survival estimates towards values that are too low. The error is very small compared with the observed differences between North West European countries and between socioeconomic groups in England (Møller et al, personal communication). Although the British situation, with immediate availability and processing of information from death certificates, entails a risk of dependence on this source of information, this is more desirable than the situation in several other European countries where death information can only be processed with technical difficulty and delay, or where it is considered sensitive and not available for cancer registration. The estimates of completeness in cancer registries in England are generally consistent with estimates from other national cancer registries that process information from death certificates in the primary case ascertainment, for example, Finland (Robinson et al, 2007) and Norway (Larsen et al, 2009).
In the mid-1990s, the Thames Cancer Registry had 15–20% of registrations based entirely on death certificates, and data of that kind would not at present be considered suitable for cancer survival analysis. This death certificate-only proportion has been gradually reduced to 1.6% in 2008. English cancer registration data can no longer be simply dismissed as unfit for purpose. It is worth noting that the errors we have discussed are not specific to the British situation but will exist in the same form or in similar forms in other countries as well. The best strategy is to be careful in the selection of comparison countries and to retain a critical (and self-critical) attitude to the international cancer survival and cancer mortality comparisons we perform (Møller et al, 2010; Morris et al, 2011).
Table notes: … Table 1 on completeness of case ascertainment; B: as reported in Table 1 on completeness of case ascertainment; C: these cases died within 1 year according to the cancer registry data; D: these HES-only cases died within 1 year according to the HES data; E: with date of diagnosis revised, some registry cases now survive longer than 1 year.