Main

Stage at diagnosis is a key predictor of cancer survival (Richards, 2009). Differences in stage are believed to be one of the main drivers of disparities in cancer survival between and within regions (Sant et al, 2003). England is known to lag behind in cancer survival in comparison to other comparably wealthy countries with a universal health system (Coleman et al, 2011). Part of this survival differential is presumably due to a poorer stage distribution of cancer cases in England (Sant et al, 2003; Walters et al, 2013b). In the past couple of decades, many resources have been invested in improving cancer outcomes through identifying and treating cancer at an earlier stage (Richards, 2009; Department of Health, 2011).

Research examining national and international temporal and geographical patterns in cancer outcomes is usually based on population-based cancer registry data, which have historically lacked information on stage. Further granularity of information is required to understand in depth the effect of stage on cancer outcomes at the population level and to monitor and evaluate cancer policy and changes in clinical practice. Recent efforts by Public Health England (PHE) and the National Cancer Registration Service have driven an improvement in availability of stage information for cancers diagnosed in England (McPhail et al, 2015). The national aim is for at least 70% of cancer patients to be staged at diagnosis (Health and Social Care Information Centre, 2015).

Clinical or surgical quality assurance programmes, also called clinical audits, have been developed as instruments to ensure clinical quality standards of health-care providers (van Gijn et al, 2010). Clinical audits contain detailed clinical data, including information on diagnostic investigations, stage at diagnosis, and treatment for cancer (van Gijn et al, 2012). Besides helping clinical specialists improve their practice, clinical audits offer a rich, complementary source of clinical data for population-based cancer research.

Comparability of stage information from different sources has been a controversial issue, especially when making international or temporal comparisons, as clinical protocols, data collection methods, coding practices, and tumour classification systems may vary between geographies and time periods (Walters et al, 2013a). Inconsistencies may also occur between different sources of information from the same country.

We describe an algorithm to derive stage at diagnosis from different sources, based on a series of hierarchical rules applied on both the data sources and the individual stage variables from the TNM classification (Sobin et al, 2009). This extends the algorithm proposed and used by Walters et al (2013a) for the International Cancer Benchmarking Partnership module 1 study. The algorithm is illustrated for colorectal and lung cancer.

Materials and Methods

Data sources

The National Cancer Registry data provides information on date of birth, sex, vital status, date of death, tumour site, and morphology (Office for National Statistics, 2015).

The Cancer Analysis System (CAS) is a national database administered by the National Cancer Intelligence Network of PHE. It combines the National Cancer Registry data with data from other sources (Health and Social Care Information Centre, 2015) and holds information on main tumour features, socio-demographic characteristics, stage, and treatment dates.

The National Clinical Audit Programme comprises multiple clinical audits to monitor and evaluate health-care practice on specific conditions, benchmark performance, and inform patients and the general public of current standards of care in different medical specialties (Healthcare Quality Improvement Partnership, 2015). Cancer clinical audits contain information on patient referral, diagnostic investigations, pretreatment staging, treatment, pathology evaluations, posttreatment follow-up, and outcomes. Information is collected at the hospital level and its accuracy and completeness should, in principle, be ensured by relevant clinicians before submission to the Audit (Scott et al, 2014). The National Bowel Cancer Audit Project (NBOCAP) was developed to collate detailed clinical bowel cancer data by the Association of Coloproctology of Great Britain in 2001 (Health and Social Care Information Centre, 2014). The Lung Cancer Audit Database (LUCADA) was developed by the Royal College of Physicians Intercollegiate Lung Cancer Group in 2002 and started collecting lung cancer data nationally in 2005 (Royal College of Physicians, 2015). Figure 1 summarises the sources of information for the stage algorithm.

Figure 1
figure 1

Sources of data for deriving stage for colorectal and lung cancer, England, 2008–2012. *Main tumour characteristics include ICD-10 topography codes, histology, behaviour, and date of diagnosis. Abbreviations: CAS=Cancer Analysis System; CT=computarised tomography scan; MRI=magnetic resonance imaging; ONS=Office for National Statistics Cancer Registration Dataset.

Data linkage

Individual colorectal and lung cancer records from the ONS National Cancer Registry data were linked to the CAS records of the same cancers diagnosed between 2008 and 2012. It followed a two-part strategy, linking records at the patient level using an eight-level hierarchy based on the availability of information on NHS number, date of birth, sex, and postcode and linking records at the tumour level by tumour site and diagnosis date. Of the 163 915 colorectal cancer cases (ICD-10 C18-C19) in the ONS National Cancer Registry data diagnosed in England during the study period, 158 953 (96.97%) linked to a CAS record and 121 707 (74.25%) linked to an NBOCAP record. For lung cancer (ICD-10 C33-C34), there were 168 158 tumours diagnosed in England during the study period. Of these, 167 236 (99.45%) linked to a CAS record and 131 540 (78.22%) linked to a LUCADA record.

The staging algorithm

The algorithm is based on rules of the Union for International Cancer Control TNM classification of malignant tumours. The TNM classification was developed in the 1950s as an international standard for classifying malignant tumours by anatomical extent (Sobin et al, 2009). It aims to provide an unambiguous grouping of cancer cases for clinicians to make standardised and consistent decisions for adequate disease management.

The anatomical extent of disease is based on the assessment of three components: the extent of the primary tumour (T), the presence and extent of metastases to regional lymph nodes (N), and the presence of distant metastases (M). For each tumour, two classifications are defined: a pretreatment clinical classification (c), drawn from physical examination, imaging tests, endoscopy, or biopsy; and a pathological classification (p), after histopathological assessment of the primary tumour, removal and assessment of lymph nodes, and microscopic evaluation of distant metastases.

The TNM classification goes through periodic prospective and retrospective evaluations that lead to the development and publication of improved editions. The Fifth and Sixth Editions were published in 1997 and 2002, respectively (Sobin and Wittekind, 1997, 2002), followed by the current Seventh Edition, in effect since 2010 (Sobin et al, 2009). Possibly the biggest change between the latest editions was the elimination of the category Mx, previously used to denote that distant metastases could not be assessed (Sobin et al, 2009). This category is now considered inappropriate as clinical assessment of metastases may be based solely on physical examination (cM). Pathological Mx (pMx) may be misinterpreted and overused by pathologists when they have access to histological material to assess pT and pN, but not for pM, a frequent situation after surgery for resection of the primary tumour (Sobin and Compton, 2010). The deletion of this category encourages the use of M0 when metastasis cannot be proven and should facilitate the completeness of stage grouping.

Hierarchies of data sources and stage variables

A hierarchy of data sources was established for different types of information to avoid inconsistencies, given that information could potentially come from a maximum of three data sets. The ONS National Cancer Registration data was our preferred data set for main person and basic tumour characteristics such as date of birth, vital status, deprivation quintile, topography, and morphology of the primary tumour. This decision was based on their established quality-control processes to verify this information and to be consistent with official national cancer statistics (Office for National Statistics, 2015). Clinical audits were the preferred source for detailed clinical data such as pretreatment diagnostic investigations, dates and results of medical interventions, and staging resulting from these. Staging information from CAS was used if missing or invalid from the clinical audit.

The algorithm can be divided into two parts for descriptive purposes. The first part entails deriving individual T, N, and M components from all available sources of pathological and clinical stage information. Once individual T, N, and M components have been ascertained, the second part of the algorithm applies TNM definitions for stage grouping, to obtain the overall grouped TNM stage (I, II, III, IV). Two different strategies for deriving TNM stage grouping are described depending on the acceptable level of missing information from individual T, N, and M components.

Deriving individual T, N, and M components

We used a set of rules to treat potentially discordant information from different sources to derive overall individual T, N, and M components from different types of variables in the data sets following TNM classification rules.

The pathological TNM classification uses information from clinical TNM and complements it using additional information from pathological evaluation (Sobin et al, 2009). Pathological TNM should therefore be the most complete source of staging information, at least for T and N. In our data sets, there was information for pathological and clinical individual T, N, and M components from the clinical audits and CAS plus staging information from additional variables (Table 1). We gave priority to the pathological variables over clinical ones for T and N, but cM was prioritised over pM (Walters et al, 2013a). Although distant metastases are not generally evaluated during surgery for resection of the primary tumour (Sobin and Compton, 2010; Walters et al, 2013a), pathological confirmation of metastases, from a biopsy, for example, was given priority over a negative or inconclusive result from a clinical/imaging test.

Table 1 Sources of valid T, N, and M components: completeness of variables and contribution to final staging

Our algorithm allowed results from medical tests and diagnostic procedures to inform individual clinical T, N, and M components when missing or with a value of zero. Similarly, records of the presence of metastases in specific organs or of regional lymph node involvement were used to inform cM or pN, respectively. Information in these additional variables was used as evidence of local, regional, and/or distant extension of disease when positive but did not rule out their presence. For example, if there was evidence of distant metastases from one of these additional variables, this replaced the value of cM to cM1; however, if there was no evidence of metastasis in that variable, it did not change the value of cM to cM0, allowing the algorithm to keep looking for information in subsequent variables.

In addition to the clinical and pathological T, N, and M components, CAS reports a third type of staging information that may come from either pathological or clinical data and may use the highest value of a particular component for a given tumour or be directly flagged by the registry. This ‘integrated’ stage information was used only when exhausting all other possible sources because its algorithm was not fully documented.

We used an additional step for determining the M component to account for the fact that, although the categories Mx and pM0 do not exist in the Seventh Edition of the TNM classification, their use is still common practice: If M was still missing after looking in all potential sources, and there was indirect evidence of a clinical examination, that is information on both clinical T and N, M was assumed to be M0. Once an individual overall T, N, or M component was populated at a specific step of the algorithm, there was generally no need to look further for information of that particular component in subsequent steps of lower priority.

Deriving the grouped TNM stage

After ascertaining individual – and unique – T, N, and M values to each tumour, the second part of the algorithm followed TNM classification definitions to categorise different combinations of T, N, and M values into TNM stage groupings. This part of the algorithm starts by examining M. Generally, for most cancer sites, including colorectal and lung, a positive M value effectively represents the maximum value of TNM stage grouping, stage IV, independently of the values of N and T. Similarly, once a positive M has been excluded, and there is a positive N, the algorithm assigns a TNM stage III to the tumour, independently of the value of T. The algorithm then evaluates subsequent subcategories in a descending order (stages II and I).

To manage the missing information within N and/or M, we applied two different strategies to derive overall stage based on the algorithm for deriving stage described by Walters et al (2013a). The most conservative of the two approaches, the restrictive strategy, is stricter in the sense that all three components need to be present to derive the grouped stage. In contrast, the non-restrictive strategy allows for the interpretation of missing information as an absence of metastases to the lymph nodes (N) or to distant organs (M). Additionally, after exhausting all possibilities of deriving the grouped TNM stage from individual T, N, and M components, the algorithm moves on to using the pathological and clinical summary stage information. The restrictive strategy differs in that we ignore the grouped stage variables, given that we cannot verify individual T, N, and M components from these.

The staging algorithm applied to colorectal cancer

Both CAS and NBOCAP use the Fifth Edition of the TNM classification (Sobin and Wittekind, 1997), following guidance from the Royal College of Pathologists (Health and Social Care Information Centre, 2014). The definition of node involvement changed in later editions, specifically in that evaluation of satellite mesenteric tumour deposits use a size criterion in the Fifth Edition, while the Sixth and Seventh Editions use a shape criterion to determine the presence of mesenteric lymph node involvement (Doyle and Bateman, 2012). Subdivisions of stage categories have been added, and definitions of T4a and T4b have been reversed in the Seventh Edition (Sobin et al, 2009). Except for the lymph node definition change, none of the changes affect definitions of overall stage grouping categories.

NBOCAP data allowed for single tumours to have several treatment records. These records may hold conflicting information on pathological T, N, and M components, presumably measured at different points in the treatment journey. Therefore, the first part of the algorithm applied to colorectal cancer was to establish a hierarchy of NBOCAP treatment records based on their closeness to diagnosis date. As a general rule, only records with treatment procedures dated within 30 days before or after the date of diagnosis were eligible to contribute with information on pathological T, N, and M components. This was to avoid assigning values of TNM associated with restaging and/or disease progression to what we define as stage at diagnosis. Of these records with procedures dated between the ±30-day window from diagnosis, the closest one to the date of diagnosis would be given priority over information contained in subsequent treatment records, assuming it contained a valid code for that variable. In cases where multiple records of one tumour had the same procedure dates, the one with lowest values of individual T, N, and M components would be given priority, following a general rule of the TNM classification (Sobin et al, 2009, p. 9). Information in subsequent treatment records would only be used if such information was missing in the previous one. Information on individual clinical T, N, and M components from NBOCAP was the same in all treatment records of any single tumour, as in CAS. Additional variables with information on colorectal cancer-specific staging are listed in Table 1. The full procedure to derive individual T, N, and M components of stage for colorectal cancer is detailed in Supplementary Appendix Figures S1–S3.

The second part of the algorithm for deriving overall stage grouping using the non-restrictive strategy used additional information from the colorectal cancer-specific Dukes classification. As the Dukes classification is not directly equivalent to specific combinations of individual T, N, and M components, TNM summary stage variables from CAS were given priority over Dukes staging, in the same order as individual T, N, and M variables (pathological, followed by clinical and integrated). The second part of the colorectal cancer stage algorithm is summarised in Figures 2 and 3.

Figure 2
figure 2

Deriving stage for colorectal cancer using the restrictive strategy, England, 2008–2012. Abbreviations: T=tumour; N=lymph nodes; M=distant metastases.

Figure 3
figure 3

Deriving stage for colorectal cancer using the non-restrictive strategy, England, 2008–2012. Abbreviations: CAS=Cancer Analysis System; cDukes=clinical Dukes staging; cStage=clinical TNM stage; iStage=integrated TNM stage; M=distant metastases; N=lymph nodes; NBOCAP=National Bowel Cancer Project; pDukes=pathological Dukes staging; pStage=pathological TNM stage; T=tumour.

The staging algorithm applied to lung cancer

The first part of the algorithm remains as described above, except that there were no additional variables to inform individual T, N, and M components in LUCADA (See Supplementary Appendix Figures S6–S8).

The main challenge in adapting the algorithm to lung cancer was the substantive modifications between the Sixth and Seventh Editions of the TNM classification (Goldstraw et al, 2007): definitions of some individual components of T and M as well as of some categories of the stage grouping have changed (Mirsadraee et al, 2012). We derived TNM stage grouping following definitions of the Sixth and Seventh Editions of the TNM classification separately. Most of these changes do not affect the overall TNM stage grouping, therefore we chose to apply definitions of the current Seventh Edition of TNM classification for the whole study period (see Supplementary Appendix Figures S4 and S5).

Statistical analyses

We estimated age-standardised 5-year net survival, stratified by stage, including a missing stage category, for patients diagnosed in England between 2008 and 2012 and followed up until end of 2013. Net survival represents survival with cancer as the only potential cause of death by factoring out mortality from other causes (expected mortality) (Pohar Perme et al, 2012). Within the relative survival setting in which causes of death are not available, the expected mortality was provided by life tables from the England general population, namely, life tables by age, sex, calendar year, and deprivation (London School of Hygiene & Tropical Medicine, 2015). Net survival was estimated with the non-parametric Pohar-Perme estimator (Pohar Perme et al, 2012) implemented in the Stata program stns (Clerc-Urmès et al, 2014). We used the complete approach for survival analysis, as used for national cancer survival statistics (Office for National Statistics, 2015). We used the International Cancer Survival Standard weights for age standardisation, which categorises age into five groups (15–44, 45–54, 55–64, 65–74, and 75–99 years of age) (Corazziari et al, 2004). We compared 5-year net survival between both versions of the staging algorithm (restrictive and non-restrictive) for each cancer.

Results

The proportion of cases with valid information on individual T and N components was comparable between cancer sites (Table 1). Completeness of valid information on the M component varied significantly between colorectal and lung cancer (41.2% vs 22.8% missing M component, respectively Table 1). Of the 163 915 primary malignant colorectal tumours, 92 778 (56.6%) had valid stage using the restrictive strategy and 137 429 (83.8%) using the non-restrictive strategy. Of the 168 158 primary lung cancer cases, 128 866 (76.6%) had stage information with the restrictive strategy, vs 135 666 (80.7%) with the non-restrictive strategy. Completeness of derived stage improved over time for both cancer sites, as did the difference in stage completeness between the restrictive and non-restrictive strategies (Table 2).

Table 2 Overall stage grouping by cancer, year of diagnosis, and staging strategy

Distribution of stage differed between strategies for colorectal cancer (Table 2). Assuming equivalence between values of zero and missing for N and M, as in the non-restrictive strategy, decreased dramatically the overall missingness of TNM stage grouping for colorectal cancer and affected the overall stage distribution. For instance, 25 431 (27.4% of data with observed stage) tumours were classified as stage III using the restrictive strategy and increased to 41 537 (30.2% of data with observed stage) using the non-restrictive strategy, mainly owing to the assumption of equivalence between missing and zero value of M in the first part of the algorithm. This difference was less pronounced for lung cancer, because of better completeness of the individual M component (Table 1). Lung cancer stage distribution was comparable between strategies (Table 2).

Using summary stage, variables in the non-restrictive strategy did not considerably improve the completeness of stage for either cancer site, indicating that most cases had fairly complete T, N, and M information before reaching this step or had all stage variables missing.

Age-standardised 1-year net survival for colorectal cancer was significantly lower using the non-restrictive strategy for all stages, particularly for the missing stage category. Differences in lung cancer survival between the two strategies were negligible (Table 3; Supplementary Appendix Figure S9). These figures reflect the differences between both strategies and how incomplete the individual stage components are.

Table 3 Age-standardised estimates of 1- and 5-year net survival by cancer, stage, and staging strategy

Discussion

This paper describes an algorithm to derive stage from multiple data sources. Recording of stage is now one of the Clinical Commissioning Group Outcome Indicators in England. However, it is rarely reported how this information is collected and then integrated into stage categories. We aim to adopt a standard approach to derive stage from multiple sources using a series of hierarchical rules. We have adapted it to specific cancer sites to illustrate its generalisability and highlight some data and cancer-specific issues.

In our example, the use of TNM Fifth Edition for colorectal cancer is justified to facilitate comparability of temporal trends (Royal College of Pathologists, 2014). There is a perceived increase in interobserver variability when assigning lymph node status using the shape criterion of the Seventh TNM Edition, rather than the size criterion of the Fifth Edition (Doyle and Bateman, 2012; Royal College of Pathologists, 2014). In England, the RPC recognises that some multidisciplinary teams – from which Clinical Audit stage data may be collected – use the Seventh Edition of TNM to stage colorectal cancers and that it might be requested in particular cases, such as those enrolled in clinical trials. There was poor individual information of the TNM edition used for staging colorectal cancer in the data sets we used. Given that there were some codes that are valid in TNM Seventh but not in TNM Fifth, we remain uncertain that all cases were staged using the Fifth Edition of TNM. There is conflicting evidence on the effect of using different editions of the TNM classification on the final staging (Nagtegaal et al, 2011; Doyle and Bateman, 2012). Nonetheless, comparing categories using different TNM editions may lead to stage migration, complicating comparisons of stage-specific outcomes. In contrast, the lung cancer data sets, in particular the Clinical Audit data, consistently reported an individual indicator of the edition of TNM used.

Distribution of colorectal cancer stage and stage-specific survival differed between strategies to define summary stage. Survival was lower for all stage categories using the non-restrictive strategy. Imputing all cases with missing M and/or N to a value of zero, as in the non-restrictive strategy, relies on very strong assumptions and may lead to misclassification, biased stage-specific survival estimates, and overly narrow variances. The missing stage categories contain a mixture of various stages, even though on average their prognosis is poorer than observed stage. The real stage distribution within the missing categories is different between strategies, as is their survival. The survival discrepancy between strategies was negligible for lung cancer. This is because there was more complete information on individual M component for lung (77.2%) than for colorectal cancer (58.8%). The restrictive approach is more conservative and keeps open, when necessary, the possibility of using specific approaches to deal with missing data, such as multiple imputation (White et al, 2011).

A particular limitation arises when applying the algorithm for staging tumours receiving neoadjuvant therapy. Pathological stage components are collected after neoadjuvant treatment, thus downgrading may occur. This issue may be addressed by making specific rules to deal with such tumours. This was not possible in our data given that information on neoadjuvant therapy is missing in the vast majority of cases from all available sources. Differences in aggressiveness of diagnostic investigation may also affect the comparability of stage-specific outcomes (Allemani et al, 2013).

We acknowledge potential limitations and have discussed the data issues and our assumptions. We encountered several issues in relation to coding of stage variables, inconsistencies in use of editions of TNM classification, conflicting stage information for single tumours, and a high proportion of missing data. We believe these may arise in other settings and data sets and have tried to address them in a transparent way, useful for other users of cancer staging information.

We have applied the algorithm to two cancer sites in a single country but aim for the hierarchical rules to be adaptable for other cancer sites and data sources in different countries, as the issue of inconsistently defined and reported stage categories is widespread in the current population-based cancer research (Ciccolallo et al, 2005; Walters et al, 2013a). The outcome will depend heavily on the quality of the specific data source but the general approach of prioritising information of highest quality, reporting sources of individual TNM variables, and reporting of assumptions when dealing with missing or inconsistent data is relevant to any cancer research using stage information. Descriptive results such as reported in Tables 1 and 2 are helpful in understanding the origins of summary stage and the reasons for shifts in stage distributions.

Validity of the information contained in cancer records remains a general issue. We believe it should be mandatory to have a relevant clinician at the health-care provider level ensuring that data collected are complete and truly reflect the information clinical decisions are based on. For each cancer case, it should be clear what classification was used to assign stage variables. As skilled clinicians are needed to collect and use stage information to make adequate medical decisions, there is also the need of people with standardised skills for recording and compiling of clinical information from medical records. The National Health System should make an effort to train and support such a workforce. Complete and accurate stage information is essential to assess cancer control policy and to understand inequalities in cancer management and cancer survival, at both national and international levels. We encourage cancer registries and health-care providers to clearly document the process for deriving stage grouping and reporting any data quality checks to validate this information. This information should be readily available for researchers.