Cancer staging at diagnosis data comparisons in South Australia

Cancer stage at diagnosis is an important data gap for Australian population based cancer registries. This study aims to understand the quality and completeness of three different collections of cancer staging data. The South Australian Cancer Registry data collection for breast and colorectal cancer (CRC) cases diagnosed in 2011 was linked to Registry Derived Stage (RDS) data, pathology plus hospital metastasis codes (pathology stage), and the South Australian Clinical Cancer Registry Stage (SACCR stage). The agreement between staging systems was examined using kappa statistics. Kaplan-Meier curves and Cox regression were used to examine the difference in survival by staging method. Among 2,530 breast and CRC cases, 98.8% were stageable (n = 2,500) according to histology. Among stageable cases, 84.6% had RDS, 51.2% had pathology stage and 29.5% had SACCR stage. The kappa statistic for RDS and pathology stage was 0.930 for breast cancer and 0.973 for CRC; for RDS and SACCR stage it was 0.574 for breast cancer and 0.632 for CRC. The agreement between pathology stage and SACCR stage was 0.430 for breast cancer and 0.528 for CRC. The distribution of stage was similar across staging methods, although pathology stage identified more stage 4 cancers, and survival patterns were similar but not identical. The agreement was high between different staging systems. Pathology stage had a higher than expected stage 4 proportion. This study highlights an opportunity to collect stage information in a cost-effective manner, while collecting data that usefully represent stage at diagnosis across the population, for population based epidemiological analyses.

Cancer, as a collective, is now the leading cause of burden of disease in Australia, accounting for 19% of the total burden 1 . While rates of cancer diagnosis are declining, the number of diagnoses continues to rise and so there is an ongoing need to better understand the population dynamics of cancer over time for health systems. Cancer registries in Australia have been collecting data on the incidence and mortality of invasive cancer since as early as 1972, providing highly valuable data to inform health service planning, prevention policy, the evaluation of interventions and for research purposes 2 . Despite the value presented in population based cancer registries 3,4 , there is a clear need to expand them to include stage at diagnosis 5 . From an epidemiological perspective, such information allows for a more complete analysis of trends, and in particular allows for a more thorough understanding of potentially causal factors for underlying shifts in incidence and mortality over time. This information is invaluable to population health, including cancer screening, health care services and for comparisons with other jurisdictions 6 .
There are however challenges in collecting valid, reliable and complete data on stage at diagnosis at the population level 7,8 . Information in Australia is notified under legislation to population based cancer registries from a variety of sources, but these notifications generally do not include detailed clinical information on which to make an assessment of stage at diagnosis. Further, collecting stage at diagnosis as an additional data item in the setting of increasing cancer notifications represents a challenge in resource constrained environments. As such, it is important to evaluate the quality and completeness of stage data collected at the population level, in addition to considering the resource implications of the different staging collection systems. This evaluation will be necessary to understand the strengths and limitations of the staging data and determine how the data should be used in an ongoing way.
South Australia was a participant in the 'National Collection of Registry Derived Stage for 2011' project. This project, led by Cancer Australia, examined the feasibility of Australian population based cancer registries meeting epidemiological demands for comparability and completeness in collecting disease stage at diagnosis for all eligible breast, lung, colorectal, melanoma and prostate cancers diagnosed in 2011 across Australia 9,10 . These data present an opportunity to evaluate the completeness and quality of the systematically collected population based staging data against other state based cancer staging systems in South Australia to better understand the value of the data collected from population based staging systems going forward.

Methods
Data. There are four main data sources for this study: the South Australian Cancer Registry (SACR), Registry Derived Stage (RDS), pathology stage and the South Australian Clinical Cancer Registry (SACCR).
The SACR has been in operation since 1977, collecting incidence and mortality data on all invasive cancers 11 . The notification sources, under legislation, are pathology, hospital admissions, deaths from the state based Births, Deaths and Marriages registry, in addition to radiotherapy notifications.
The data collected under the 'National Collection of Registry Derived Stage for 2011' project, hereafter referred to as RDS at diagnosis, used an agreed set of coding rules for staging at the population level, endorsed by the Australasian Association of Cancer Registries. The data were collected for all cases of the five highest-incidence cancers (prostate, breast, lung, bowel, melanoma) in the SACR diagnosed in 2011. The sources of data used for this project were copies of all relevant pathology reports, summary stage data from radiotherapy notifications and summary information on metastatic disease supplied by hospital health information systems. RDS was assigned by coding staff experienced in staging data according to the AJCC 7 rules. The AJCC staging system is a classification system developed by the American Joint Committee on Cancer for describing the extent of disease progression in cancer patients 12 . Within the RDS system, a case was coded as "unknown stage" when the staging elements necessary for determining stage were incomplete. RDS was not derived for cases where the basis of diagnosis was solely clinical or the case was described as death certificate only, as there was no pathology in these instances. Ten percent of all RDS cases were double-coded in full by a second person to ensure reliability across coding staff. RDS was defined as the best estimate of summary TNM stage at the time of diagnosis from available data sources for use in population-based cancer registries 9 .
Data were extracted from the SACR for all eligible cases diagnosed in 2011, and pathology reports were collated for each case. Summary data from the SACR, including stage at radiotherapy, were collated into a purpose built database where RDS was entered by coding staff. Coding rules for each of the primary sites being staged were supplied to the coding staff, along with training and ongoing support in the use of the staging system. At the completion of the coding exercise, data were extracted into a file for analysis.
In April 2018, the cancer stage at diagnosis project was completed, and it found that staging completeness and comparability across Australia were high for each of the five highest-incidence cancers 9 . The current study used only the data for South Australian breast and colorectal cancer (CRC), because clinical cancer registry staged data were available as a comparison and because both breast cancer and CRC are screened for in the national population based screening program.
The second source of staging data was the TNM (AJCC 7) information as reported on the pathology report, where available. Given that pathology alone may miss metastatic disease, this information was combined with any notification of a metastatic cancer code in the hospital admissions data. Hospital admission records containing a cancer code in the ICD-10 range C78-C79 inclusive, within four months of the first diagnosis date of the cancer, were flagged as metastatic disease. Where a metastasis was notified in this way, the summary stage was recorded as stage 4. This data source is referred to as pathology stage.
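The rule for deriving pathology stage can be illustrated with a minimal sketch. It is not the registry's actual implementation: the function names, the record layout, the 122-day approximation of "four months" and the choice to apply the window on either side of the diagnosis date are all assumptions made for illustration.

```python
from datetime import date

# Assumption: "within four months" is approximated as 122 days, applied on
# either side of the diagnosis date. The source does not define the window
# more precisely than "four months".
WINDOW_DAYS = 122


def is_metastasis_code(icd10_code: str) -> bool:
    """Return True for ICD-10 codes in the range C78-C79 inclusive."""
    prefix = icd10_code.strip().upper()[:3]
    return prefix in ("C78", "C79")


def pathology_stage(report_stage, diagnosis_date, admissions):
    """Combine the pathology-report stage with hospital metastasis codes.

    report_stage: summary stage from the pathology report (1-4) or None.
    admissions: iterable of (admission_date, icd10_code) tuples
                (hypothetical layout for hospital admission records).
    """
    for admitted, code in admissions:
        days_from_diagnosis = abs((admitted - diagnosis_date).days)
        if is_metastasis_code(code) and days_from_diagnosis <= WINDOW_DAYS:
            return 4  # a flagged metastasis overrides the report stage
    return report_stage


if __name__ == "__main__":
    admissions = [(date(2011, 5, 1), "C78.7")]
    print(pathology_stage(2, date(2011, 3, 1), admissions))
```

Under these assumptions, a case whose pathology report gives stage 2 but which has a C78 or C79 admission code inside the window is recorded as stage 4, while an admission code outside the window leaves the report stage unchanged.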
The third source of staging data was the SACCR 13 . The SACCR is authorised under the Health Care Act 2008 14 , and covers a range of cancer types relating to admissions to one of four public hospitals in metropolitan Adelaide. The SACCR is supported by a central coordinating unit, and uses the full medical record as the source of information about the cancer case. The SACCR uses the AJCC 7 staging system for both breast and colorectal cancer. For each of these cancers, the SACCR has registered around one third of the total South Australian cases in 2011. This staging source is referred to as SACCR stage.
Data linkage. The RDS data and pathology stage were already linked to the SACR. The SACCR data were matched to the SACR using a common unique identification number. For cases that were not matched, patient's name, date of birth, sex, and cancer were used in a probabilistic linkage. After completing data linkage, data were de-identified and provided to a separate team for data analysis.
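The two-pass linkage described above can be sketched as follows. This is a simplified stand-in only: the field names, the dictionary record layout and the agreement threshold are hypothetical, and counting agreeing fields is a much cruder rule than the probabilistic linkage actually used, which the source does not describe in detail.

```python
# Hypothetical field names for the identifying variables named in the text.
FIELDS = ("name", "dob", "sex", "cancer")


def _norm(value):
    """Normalise a field value for comparison (case and whitespace)."""
    return str(value).strip().lower()


def link(saccr_record, sacr_records, min_agreement=4):
    """Return the SACR record matched to a SACCR record, or None.

    Pass 1: deterministic match on the common unique identifier ("uid").
    Pass 2: fallback that counts agreeing identifying fields; this is an
            illustrative simplification, not true probabilistic linkage.
    """
    uid = saccr_record.get("uid")
    if uid is not None:
        for rec in sacr_records:
            if rec.get("uid") == uid:
                return rec

    best, best_score = None, 0
    for rec in sacr_records:
        score = sum(
            rec.get(f) is not None
            and saccr_record.get(f) is not None
            and _norm(rec[f]) == _norm(saccr_record[f])
            for f in FIELDS
        )
        if score > best_score:
            best, best_score = rec, score
    return best if best_score >= min_agreement else None
```

A real probabilistic linkage would weight each field by how informative agreement and disagreement are, and would typically send borderline scores for clerical review rather than applying a single cut-off.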

Statistical method. A descriptive analysis was undertaken, comparing the demographics, staging information and completeness across the three staging systems (RDS, SACCR stage and pathology stage). The degree of agreement was evaluated using Cohen's kappa. This analysis used only the matched data set for each of breast and colorectal cancer. Kappa agreement was classified as follows: poor (<0.20), fair (0.20-0.40), moderate (0.40-0.60), good (0.60-0.80), and very good (0.80-1.00). Using all the available data (both matched and non-matched data), Kaplan-Meier survival estimates were used to examine whether there was any inconsistency in five year survival between the different staging systems. The censor date was the date of death or 31 December 2016, whichever came first. The cause-specific death information was taken from the SACR dataset, given that it was complete for all cases. The Cox proportional hazard model was used to examine cancer cause-specific mortality across the three staging methods. All analyses were performed using Stata SE 15.0 (Stata Corp, Texas).
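The agreement measure can be illustrated with a short sketch of Cohen's kappa and the classification bands given above. The analysis itself was run in Stata; this Python version is illustrative only. It assumes the unweighted form of kappa, since the source does not state whether weights were used, and it assigns a value on a shared band boundary (for example exactly 0.60) to the higher band, which is also an assumption.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two stage assignments on the same cases."""
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("ratings must be non-empty and of equal length")

    # Observed agreement: share of cases given the same stage by both systems.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of the marginal proportions, summed over stages.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def classify_kappa(kappa):
    """Map kappa to the bands used in this study (boundary handling assumed)."""
    if kappa < 0.20:
        return "poor"
    if kappa < 0.40:
        return "fair"
    if kappa < 0.60:
        return "moderate"
    if kappa < 0.80:
        return "good"
    return "very good"


if __name__ == "__main__":
    # Toy data: stages assigned to four cases by two hypothetical systems.
    system_a = [1, 1, 2, 2]
    system_b = [1, 1, 2, 1]
    k = cohens_kappa(system_a, system_b)
    print(round(k, 3), classify_kappa(k))
```

In the toy data, three of the four cases agree (observed agreement 0.75) and chance agreement is 0.50, so kappa is 0.50.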
All methods were performed in accordance with the relevant guidelines and regulations of Scientific Reports. The study is a retrospective data linkage project using existing data to evaluate registry data quality; there was no patient contact for the study, and therefore no patient consent process. The Human Research Ethics Committee, South Australian Department for Health and Ageing, approved the project in December 2017. The ethics committee waived the need for informed consent for this study as part of the study approval.

Results
Completeness of staging data for sources. There were 2,530 breast and colorectal cancer cases in the SACR 2011 data collection. Based on histology, 98.8% were stageable (n = 2,500). Among cases stageable on the basis of histology, 84.6% (n = 2,115) had RDS, 51.2% (n = 1,279) had pathology staging values, and 29.5% (n = 737) had SACCR stage values. With RDS, staging information was available for 88.1% of breast cancer cases and 81.3% of colorectal cancer cases. The proportions were similar for breast and colorectal cancer for pathology stage (51.5% vs 50.8%) and SACCR stage (29.6% vs 27.8%). The details of the demographics across the three data sources are presented in Table 1. The proportion of unknown or missing RDS was 11.9% for breast cancer (n = 143) and 18.5% for CRC (n = 242). The proportion of unknown or missing SACCR stage was 7.1% for breast cancer (n = 26) and 3.2% for CRC (n = 12). Among available cases with pathology stage values, there was no unknown or missing value for either breast cancer or CRC.
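The overall completeness percentages follow directly from the case counts quoted above. The short sketch below recomputes them; the counts are taken from the text, while the function and variable names are illustrative.

```python
def percent(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place, as reported in the text."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return round(100 * numerator / denominator, 1)


# Counts quoted in the Results for breast and colorectal cancer combined.
TOTAL_CASES = 2530
STAGEABLE = 2500
STAGED = {"RDS": 2115, "pathology stage": 1279, "SACCR stage": 737}

stageable_pct = percent(STAGEABLE, TOTAL_CASES)
completeness = {source: percent(n, STAGEABLE) for source, n in STAGED.items()}

if __name__ == "__main__":
    print("stageable:", stageable_pct)
    for source, pct in completeness.items():
        print(source, pct)
```

The denominator for each staging source is the number of cases stageable on histology, not the total number of registered cases.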
Agreement across three staging methods. There were no significant differences in the age and sex distribution for breast or colorectal cancer across the three staging methods (all p > 0.05). However, the SACCR staging data had a higher proportion of people in the lowest (most disadvantaged) quintile of SEIFA (27%) than the RDS data and the pathology stage data (17% and 15.7% respectively).
There were 619 breast and 652 CRC cases for which there was a matched cancer case with the RDS and pathology stage data. The agreement between RDS and pathology stage was very good, with a kappa statistic of 0.930 for breast cancer and 0.973 for CRC (Table 2). There were 331 breast cancer and 317 CRC cases for which there was a matched cancer case with the RDS and SACCR stage data. The agreement between RDS and SACCR stage was moderate for breast cancer (kappa = 0.574) and good for CRC (kappa = 0.632). There were 162 breast cancer and 205 CRC cases for which there was a matched cancer case with the pathology stage and SACCR data. The agreement between pathology stage and SACCR stage was moderate for both breast cancer (kappa = 0.430) and CRC (kappa = 0.528).
Staging distribution comparison across staging data sources. When examining all available data, that is, both data that were matched across the different staging systems and non-matched data, pathology stage had a significantly higher proportion of stage 4 cases than RDS (mean difference in proportion = 5.6%, 95% CI = 2.0-9.2%, p = 0.002) and SACCR stage (mean difference in proportion = 10.4%, 95% CI = 6.4-14.4%, p < 0.001) for breast cancer (Fig. 1).
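A difference in proportions with a 95% confidence interval, of the kind reported above, can be computed as in the following sketch. It uses the normal-approximation (Wald) interval for two independent proportions, which is an assumption: the source does not state which method was used. The counts in the usage example are hypothetical and are not the study data.

```python
import math


def proportion_difference_ci(x1, n1, x2, n2, z=1.96):
    """Difference p1 - p2 with a Wald confidence interval.

    x1, x2: number of stage 4 cases under each staging method.
    n1, n2: number of staged cases under each method.
    z:      normal quantile (1.96 for a 95% interval).
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    diff, lower, upper = proportion_difference_ci(30, 100, 20, 100)
    print(f"difference = {diff:.3f}, 95% CI = {lower:.3f} to {upper:.3f}")
```

Because the same cancer case can appear under more than one staging method, the two groups compared in the study are not strictly independent, so this interval is only an approximation of the reported analysis.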
Cancer cause-specific survival comparison across staging data sources. Using all available data, the distribution of survival by stage was similar across the three staging methods for breast cancer and colorectal cancer (Fig. 2). The one year survival for stage 4 breast and colorectal cancer cases was similar across the three staging methods: 70-87% for breast cancer and 64-67% for CRC. However, distinctly worse survival in stage 4 cases was observed using SACCR stage for both breast cancer and CRC. The five year survival for stage 4 breast cancer was 33.0% when using SACCR stage, but was 66.9% by RDS and 71.4% by pathology stage. The five year survival for stage 4 CRC was only 16.5% using SACCR stage, but was 29.6% by RDS and 29.4% by pathology stage.
The Cox proportional hazard models for breast cancer and CRC gave similar results across the three staging methods for both types of cancer (Table 3). For breast cancer, stage 4 cases had a significantly greater risk of dying from breast cancer than stage 1 or 2 cases by each of the three staging methods. For CRC, stage 3 or 4 cases had a significantly greater risk of dying from CRC than stage 1 or 2 cases.
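The Kaplan-Meier estimates behind the one year and five year survival figures in this subsection can be illustrated with a minimal product-limit estimator. The study's estimates were produced in Stata; the function name, the input layout and the toy data below are illustrative assumptions.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier (product-limit) survival estimate.

    times:  follow-up time for each case (e.g. years from diagnosis).
    events: True if the case died of the cancer at that time,
            False if follow-up was censored.
    Returns a list of (time, survival probability) at each death time.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have equal length")

    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all cases that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored cases both leave the risk set after time t.
        at_risk -= leaving
    return curve


if __name__ == "__main__":
    # Toy data: five cases, two of them censored.
    for t, s in kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False]):
        print(t, round(s, 3))
```

Censored cases contribute to the risk set up to their censoring time but do not reduce the survival estimate, which is how cases still alive at the censor date are handled.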

Discussion
This study found that staging completeness overall was very high for both breast and colorectal cancers at diagnosis in the RDS data, and lowest for the SACCR stage data, as the SACCR only collects public hospital data. The different staging systems had similar age and sex profiles, but the SACCR included more people from the most disadvantaged groups. Agreement was moderate to very good for both breast and colorectal cancer across the data sources when considering only the matched data, and while they each had a similar profile of stage at diagnosis across the population, the pathology stage had proportionately more stage 4 cases than the other two. This difference can be explained by the fact that the pathology stage data in its entirety provides a different stage profile than does the pathology data that is matched to the other data sources. The different staging systems were equally able to discriminate survival by stage, although the actual survival estimates differed from each other. It is likely, however, that despite the differences found between them, the staging systems still present a reasonable estimate of stage at diagnosis at the population level (for epidemiological analyses), and in particular, that the two population based systems (pathology stage and RDS) are a reasonable estimate of the true distribution of stage at diagnosis at the population level.
The completeness of staging data for the South Australian RDS was 99.6% for breast cancer and 98.1% for CRC, which compared well with the Australian national average of 94% for female breast cancer and 88% for CRC 10 . The percentage of stage 'unknown' was 11.9% for breast cancer and 18.5% for CRC, which is slightly higher than was reported in Canada for data from 2010, where 10% of cases had unknown stage 15 .
The cancer staging distributions by RDS for breast cancer and CRC are similar to Australian national figures, in that the majority of cancers were staged as 'early stage' 10 . [...] 2011 collection has a higher percentage of stage IV breast cancer (12.1% & 17.7% respectively) than SEER's. The outlier in the staging data was the higher proportion of stage 4 in the colorectal pathology data, despite the degree of agreement between this source and the other sources being high.
The contrast between the high degree of agreement on matched data and the differing stage at diagnosis profile across the population most likely reflects selection bias between cases that had a pathology stage and cases staged by the other systems. Pathology stage is a relatively small sub-set of the total number of stageable cases, and so likely represents a degree of selection bias, biasing the proportion of stage 4 cases upward. The SACCR data are only collected from some public hospitals, and so would not be representative of a whole of population stage at diagnosis. Given the RDS system had the highest level of completeness, and estimates a similar stage at diagnosis profile as in other jurisdictions, it should be considered the most valid of the three systems for population level estimates. Over time, given the high degree of agreement with the RDS system, as completeness of the pathology data increases with increased focus on the importance of structured pathology reporting, this bias may lessen 17 .
The survival curves by the different staging systems for breast cancer and CRC were similar although not the same. For both breast cancer and CRC, the survival curves by the three staging methods all seemed to be consistent with the clinical expectation: better survival for cases diagnosed at an early stage, with survival decreasing markedly with advancing stage. The SACCR stage pattern for both breast cancer and CRC is similar to those published in the United States for five-year survival 18,19 and the United Kingdom for one-year survival 20 . Data from the National Cancer Registration Services Cancer Analysis System in England in 2012 show that the one year relative survival for stage 4 cases was 66.4% for breast cancer and 42.5% for CRC 20 , lower than in the current study using SACCR stage (breast cancer 70.0% and CRC 66.5%). For distant cancer cases using SEER stage, the American Cancer Society reported a 5-year relative survival rate of 27% for breast cancer and 15% for CRC 18,19 , which was very close to the survival of stage 4 cases by the SACCR stage method in the current study (33.0% for breast cancer and 16.5% for CRC).
The strengths of this study include an opportunity to examine three different staging systems collected on the one group of cancer cases in a high quality, population based cancer registry. The data were coded by trained and experienced coding staff and a set of clear coding rules was applied across all systems. Limitations include that only one third of 2011 breast and colorectal cancer cases had SACCR stage values, and about 50% had pathology stage, which may have introduced selection bias into the findings presented here. The population of cancer cases did however appear to be similar across the three methods, albeit that there were more people from a higher level of disadvantage in the SACCR data than in the other two and that only three variables were available for demographic comparison. This study should be repeated over time to better understand how the data quality may change. In particular, pathology stage is likely to have a higher number of cases staged over time, and so it may be that the data quality and ascertainment are better for cases diagnosed in 2019 than for those diagnosed in 2011, the year this study was set.

Conclusions
In conclusion, the data from the two population-based staging systems compared favourably with the clinical based staging system across a number of metrics. This represents an opportunity to collect stage information in a cost-effective manner, while collecting data that usefully represent stage at diagnosis across the population, for population based epidemiological analyses.

Data availability
The datasets analysed for the current study are not publicly available due to the restriction term in the ethics approval. The de-identified data could be made available to the editorial team under a confidentiality agreement; however, for some variables, categories may be combined in order to reduce the risk of incidentally revealing patients' identities.

Table 3. Number of cause-specific deaths and hazard ratios by different staging methods. *p < 0.05; **p < 0.01; ***p < 0.001.