Introduction

Global colorectal cancer (CRC) incidence is expected to increase in the coming decades1, which emphasizes the need to fill current knowledge gaps and improve clinical outcomes. In the current era of precision medicine, smaller treatment-eligible target populations are both an advance and a challenge in cancer research2,3. Because of the large number of CRC subgroups defined by clinical characteristics in combination with the many (low-frequency) molecular markers4,5,6, enrolling sufficiently large samples in studies evaluating the safety and efficacy of new therapeutic agents is a growing challenge. In addition, selective enrollment in most phase III randomized clinical trials (RCTs) may affect the generalizability of trial results and limit our understanding of a treatment’s “true” benefit-risk profile in the broader patient population. This is a constraint in clinical cancer research, given that international clinical guidelines are often based on results from strongly selected trial populations.

As advocated by both the research community and regulators such as the U.S. Food and Drug Administration (FDA) and the European Medicines Agency (EMA), the development of high-quality population-based studies in cancer patients that provide real-world data (RWD) is a major research priority to overcome challenges in research methodologies, complement RCT data, and ultimately improve patient outcomes7,8. A learning healthcare system approach, defined as a circular system in which “science, informatics, incentives, and culture are aligned for continuous improvement and innovation, with best practices seamlessly embedded in the delivery process and new knowledge captured as an integral by-product of the delivery experience”9, uses RWD to accelerate knowledge generation and its translation into clinical practice. RWD differ from trial-based evidence mainly in being population-level data that originate from sources outside the typical clinical research setting, such as electronic health records (EHRs) or cancer registries, with the potential to efficiently answer research questions relevant to the broader patient population10. To guarantee high-quality RWD, it is paramount to ensure the quality of the primary data (i.e. completeness and accuracy of EHRs), of the linkage between data sources, and of derived variables11,12. Altogether, a prospective “real-world” cohort requires longitudinal patient, treatment (sequences), and outcome data from an unselected and representative patient population.

Since 2013, the Prospective Dutch Colorectal Cancer (PLCRC) cohort has collected extensive longitudinal clinical data, together with blood, (tumor) tissue, and repeated patient-reported outcomes (PROs), in patients with stage I to IV CRC who are prospectively followed from primary diagnosis until death13. PLCRC serves as an infrastructure for a wide variety of research projects including etiological, biomarker, basic, (epi)genetic, and interventional [according to the Trials within Cohorts (TwiCs) design14], as well as health-care policy and cost-effectiveness studies. For results to be generalizable, and for cancer treatments to be evaluated accurately, it is important to obtain a cohort that consists of a demographically and clinically representative patient population. Therefore, the aims of this manuscript are to (1) describe developments towards a nationwide cohort, (2) provide baseline characteristics, including PROs, of the first 5722 participants, and (3) investigate to what extent PLCRC reflects the “real world”, over time and by tumor stage, by comparing PLCRC cohort participants with the general Dutch CRC population as registered in the Netherlands Cancer Registry (NCR).

Methods

PLCRC is an initiative coordinated by the Dutch Colorectal Cancer Group (DCCG) and is registered at Clinicaltrials.gov (NCT02070146). The ‘Strengthening the Reporting of Observational Studies in Epidemiology (STROBE)’ guidelines were taken into account when the cohort was designed15. Here we describe the cohort only briefly, since the detailed design has been published elsewhere13.

The PLCRC population

PLCRC consists of patients diagnosed with a malignancy of the colon and/or rectum (ICD-10, C18-20) in the Netherlands. Each patient who is ≥ 18 years of age and has histologically proven CRC, or a strong suspicion of CRC without pathological confirmation, is eligible. The informed consent procedure is preferably performed shortly after diagnosis and before treatment starts. However, patients can also be enrolled during treatment or follow-up. Consent for longitudinal clinical data collection is mandatory for participation. In addition, patients can choose to consent to other optional items as shown in Box 1. Patients enrolled before August 1, 2019 for whom complete clinical data of the initial diagnosis and treatment period were available (to ensure correct classification of tumor stage) were included in the analyses. For these patients, baseline demographic and clinical data were retrieved and reported, as well as self-reported physical activity, fatigue, quality of life, BMI, presence of chronic comorbidities, smoking behavior, alcohol consumption, education level, and living situation at baseline.

Box 1 Informed consent options and main objectives of PLCRC.

The Netherlands Cancer Registry (NCR)

The Netherlands Cancer Registry (NCR) contains an extensive set of clinical data—from diagnosis onwards—of individuals diagnosed with cancer in the Netherlands and has a national coverage of over 95%16. Clinical data of the complete treatment trajectory are retrieved from EHRs and entered into the NCR. Importantly, the completeness of the NCR therefore depends on the completeness of EHRs. Overall, the NCR’s high quality is assured by thorough training of data managers and computerized consistency checks. PLCRC’s informed consent allows for linkage with the NCR and thus ensures the availability of clinical data over the complete cancer trajectory.

For the current analysis, only data from the initial registration phase, i.e. at diagnosis, were used. We compared characteristics of PLCRC participants with those of the general Dutch CRC population (Dutch-ref) registered in the NCR and diagnosed between January 1, 2013 and December 31, 2017.

Statistical analysis

Descriptive statistics were used to describe baseline patient characteristics, including baseline PROs. Standardized differences (d) were calculated to quantify the magnitude of differences in patient characteristics between PLCRC participants and the Dutch-ref. Values greater than 0.20 indicate a large imbalance, while values between 0.10 and 0.20 indicate a small imbalance, and standardized differences less than 0.10 indicate a negligible imbalance17,18. Results are shown for the total group and stratified by tumor stage and time of enrollment. Two time-periods (enrolled between 2013–‘16 and 2017–August ’19) were evaluated to assess whether PLCRC participants became more representative of the Dutch-ref over time. Logistic regression models were used to investigate to what extent, based on the available a priori selected patient characteristics (i.e. age, sex, primary tumor location and tumor stage), cohort participation could be predicted19. Model performance was assessed based on calibration and discrimination20. Calibration—the goodness of fit—was evaluated using the Hosmer–Lemeshow test21. Discrimination refers to the ability to distinguish cohort participants from non-participants, and was quantified by the area under the receiver operating characteristic curve (AU-ROC)20. The AU-ROC ranges from 1, corresponding to perfect discrimination, to 0.5, corresponding to a model with no discrimination ability, here preferred and defined as cohort participation independent of patient characteristics (0.5 = random chance, 0.5–0.7 = poor, 0.7–0.8 = good, 0.8–1.0 = strong, 1.0 = perfect prediction)22. Statistical analyses were performed using STATA (Release 15, Stata Corp LLC, College Station, TX) and SPSS (version 25.0, IBM Corp, Armonk, NY).
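The imbalance thresholds described above can be illustrated with a minimal sketch. It assumes the commonly used definition of the standardized difference (the absolute difference divided by the square root of the average of the two group variances); the function names are hypothetical, and this is not the code used for the reported analyses, which were run in STATA and SPSS.

```python
from math import sqrt


def std_diff_binary(p1: float, p2: float) -> float:
    """Standardized difference between two proportions (each in 0..1).

    Assumed formula: |p1 - p2| / sqrt((p1(1-p1) + p2(1-p2)) / 2).
    """
    pooled_sd = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    return abs(p1 - p2) / pooled_sd


def std_diff_continuous(mean1: float, sd1: float,
                        mean2: float, sd2: float) -> float:
    """Standardized difference between two means.

    Assumed formula: |mean1 - mean2| / sqrt((sd1^2 + sd2^2) / 2).
    """
    pooled_sd = sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return abs(mean1 - mean2) / pooled_sd


def classify_imbalance(d: float) -> str:
    """Apply the thresholds stated in the text to a standardized difference."""
    d = abs(d)
    if d > 0.20:
        return "large"
    if d >= 0.10:
        return "small"
    return "negligible"


if __name__ == "__main__":
    # Proportion of males: 62% in PLCRC vs. 57% in the Dutch-ref
    # (rounded percentages as reported in the Results).
    d_sex = std_diff_binary(0.62, 0.57)
    print(f"d_sex ~ {d_sex:.2f} ({classify_imbalance(d_sex)})")
```

With the rounded percentages for sex, this sketch gives a value of roughly 0.10, close to the reported 0.11; exact agreement is not expected because the published values were computed from unrounded data.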

Ethics approval

The study protocol was approved by the Institutional Review Board of the University Medical Center Utrecht (The Netherlands). All procedures performed that involved human participants were in accordance with the institutional and/or national ethical standards and guidelines as well as with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Consent to participate

Informed consent was obtained from all individual participants included in the study.

Results

The flowchart for the selection of individuals for the current analyses is shown in Fig. 1. On August 1, 2019, a total of 5744 patients had been enrolled. A complete TNM tumor stage could not be retrieved for 22 patients, who were therefore excluded, leaving 5722 patients for the analyses.

Figure 1

Flow diagram of the selection of individuals for the current analyses. (a) Insufficient data in patient’s EHRs for the NCR to collect a complete TNM stage. (b) Missing due to time-lag in NCR clinical data collection. This number is higher than presented in Table 1, as for some cases sufficient staging information was available to classify into TNM stage.

Developments towards a nationwide cohort

PLCRC has continuously strengthened its infrastructure to improve the enrollment rate. In August 2019, patients were enrolled in 50 of 75 Dutch hospitals including 7 (of 8) academic hospitals, 22 (of 26) top clinical hospitals that focus on education and research, and 21 (of 41) regular hospitals (Fig. 2). This led to an improved annual enrollment rate, from 129 participants from 1 recruiting hospital in 2013 to 2136 from 50 recruiting hospitals in 2018 (note that there are approx. 14,000 incident cases annually). At enrollment, 100% of patients consented to the use of their clinical data obtained from the NCR, which was mandatory, 81% consented to receiving repeated PRO questionnaires, 83% to blood withdrawals, 95% to the use of tissue for scientific research, 83% to being contacted when relevant DNA abnormalities are found, and 78% to future research and trials according to the TwiCs design (Fig. 3). Of the patients who consented to receiving PRO questionnaires, 77% returned their baseline questionnaire, and completion rates remained above 60% in the first three years after enrollment (Fig. 4). Interestingly, patients who received paper-based questionnaires had consistently higher completion rates than those who received electronic questionnaires (85% vs. 72% at baseline, respectively).

Figure 2

PLCRC recruiting hospitals (academic, non-academic, and top clinical hospitals) over time. Note: The Netherlands has a total of 75 hospitals, of which 8 are academic hospitals and 26 are top clinical hospitals.

Figure 3

Baseline informed consent percentages per item. The use of clinical NCR data is 100% since this item is mandatory for participation. NCR: Netherlands Cancer Registry; PROs: patient-reported outcomes; TwiCs: Trials within Cohorts.

Figure 4

Completion rates of questionnaires until three years after enrollment. Overall completion rates are shown by the bars, and completion rates for electronic and paper-based questionnaires by the dashed lines. Time-points (T) are months since enrollment.

Baseline patient characteristics

The cohort contained 851 patients with stage I CRC, 1079 with stage II, 1960 with stage III, 946 with stage IV, and 886 patients for whom data on tumor stage are still being collected (Table 1). The median number of days from diagnosis until enrollment was 18 (IQR: 1 to 131) for the total cohort, and was similar for all stages except for stage IV patients, who were enrolled much later in the cancer trajectory (188 [IQR: 23 to 670] days). There were more males than females in all stages; 61% of the patients had a primary tumor located in the colon and 39% in the rectum. Of the 946 stage IV patients, 79% had synchronous liver, 22% lung, and 17% peritoneal metastases. Regarding molecular diagnostics at diagnosis, RAS status was available in 596 (10%) patients, BRAF in 570 (10%), and microsatellite instability (MSI) in 2600 (45%). In terms of physical and psychological wellbeing at enrollment (supplementary Table 1), patients reported impaired psychosocial functioning and high levels of fatigue, appetite loss, and diarrhea compared with reference populations (refs).

Table 1 Baseline descriptive demographic and clinical characteristics at PLCRC enrollment, stratified by tumor stage.

PLCRC versus the general Dutch CRC population

While the logistic regression model including age, sex, primary tumor location, and tumor stage overestimates the probability of participation (p < 0.001), the low discriminative power (AU-ROC 0.65, 95% CI: 0.64–0.65) indicates limited ability to distinguish cohort participants from the Dutch-ref, based on available data (full ROC curves in supplementary Fig. 1). This discrimination decreased over time from an AU-ROC of 0.70 (95% CI: 0.68–0.71) in PLCRC’s initial phase (2013–‘16) to 0.64 (95% CI: 0.63–0.64) in the most recent phase (2017–Aug ‘19).
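To make the interpretation of these AU-ROC values concrete, the sketch below computes the AU-ROC as the probability that a randomly chosen cohort participant receives a higher predicted probability of participation than a randomly chosen non-participant (ties count as one half). This is an illustration only: the function names and the toy scores are hypothetical, the handling of category boundaries is an assumption, and this is not the code used for the reported analyses.

```python
from itertools import product
from typing import Sequence


def auc_pairwise(participant_scores: Sequence[float],
                 reference_scores: Sequence[float]) -> float:
    """AU-ROC as the share of (participant, non-participant) pairs in which
    the participant has the higher predicted probability; ties count 0.5."""
    if not participant_scores or not reference_scores:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for p, r in product(participant_scores, reference_scores):
        if p > r:
            wins += 1.0
        elif p == r:
            wins += 0.5
    return wins / (len(participant_scores) * len(reference_scores))


def describe_discrimination(auc: float) -> str:
    """Label an AU-ROC using the categories given in the Methods.

    Boundary handling (0.7 -> "good", 0.8 -> "strong") is an assumption.
    """
    if auc < 0.7:
        return "poor"
    if auc < 0.8:
        return "good"
    return "strong"


if __name__ == "__main__":
    # Hypothetical predicted probabilities, not data from the cohort.
    participants = [0.2, 0.6]
    non_participants = [0.4, 0.8]
    auc = auc_pairwise(participants, non_participants)
    print(auc, describe_discrimination(auc))
```

In this framing, an AU-ROC close to 0.5 is the preferred outcome for a representative cohort: it means that age, sex, primary tumor location, and tumor stage carry little information about who participates.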

Between PLCRC participants (n = 4759) and the Dutch-ref (n = 72,685), large imbalances were found for age at diagnosis (64.9 years in PLCRC, 69.3 in the Dutch-ref, d_age 0.41), primary tumor location (43% rectum in PLCRC and 31% in the Dutch-ref, d_pr.tumor 0.24) and TNM stage (41% stage III in PLCRC and 30% in the Dutch-ref, d_tnm 0.24), a small imbalance in sex (62% male in PLCRC and 57% in the Dutch-ref, d_sex 0.11), and negligible imbalances in BMI at diagnosis (26.6 in PLCRC and 26.6 in the Dutch-ref, d_bmi 0.01) and in location of synchronous metastasis (15% liver in PLCRC and 15% in the Dutch-ref, d_meta between 0.01 and 0.08) (Table 2).

Table 2 Characteristics of PLCRC participants at diagnosis (2013-Aug’19), compared with the general Dutch CRC population (2013–’17), and stratified by time-period (2013–’16, and 2017–Aug’19).

When the two time-periods were compared with the Dutch-ref to study PLCRC’s evolution, a large imbalance remained for age at diagnosis and tumor stage (d_age from 0.45 to 0.40, d_tnm from 0.33 to 0.22). The distribution of sex, BMI at diagnosis, and primary tumor location improved to imbalances classified as small or negligible (d_sex from 0.17 to 0.09, d_bmi from 0.16 to 0.03, d_pr.tumor from 0.53 to 0.16). For location of synchronous metastases, e.g. liver metastasis, the imbalance compared with the Dutch-ref was negligible in both time periods (d_liver from 0.08 to 0.04).

Table 3 shows stratified analyses in which PLCRC participants were compared with the Dutch-ref by tumor stage. Age at diagnosis was lower for PLCRC participants in all stages, and discrepancies increased by stage (d_age from 0.28 in stage I to 0.55 in stage IV). For all disease stages, PLCRC contained relatively more patients with a primary tumor in the rectum and fewer patients with a primary tumor in the colon, compared with the Dutch-ref (d_pr.tumor between 0.14 and 0.25). The proportions of sex and BMI at diagnosis were comparable to the Dutch-ref in all stages (d_sex between 0.08 and 0.19, d_bmi between 0.01 and 0.08).

Table 3 Characteristics of the PLCRC participants at diagnosis (2013–Aug’19), compared with the general Dutch CRC population (2013–’17), stratified by tumor stage.

Discussion

Over the past six years, the increased number of PLCRC recruiting centers has resulted in a steep increase in participating patients, with excellent consent rates for PROs, blood and tissue biobanking, and participation in future research within PLCRC. Although we found an overall shift towards the Dutch-ref for patients enrolled between 2017 and August 2019, regular hospitals remain underrepresented among the recruiting centers, and PLCRC participants were still younger and more often had stage III disease than the total Dutch CRC population.

Besides common discrepancies such as performance status and number of comorbidities, clinical trial participants are notably younger than the real-world population, which for now might hamper the applicability of trial results in daily clinical practice. It was recently shown that phase III RCT patients were on average seven years younger than the general CRC population23. Similarly, patients within PLCRC are younger than the Dutch-ref, although this difference is only 4 years (mean age 65 years). Standardized differences for age increased by tumor stage, with stage IV patients showing a mean age comparable to that in phase III clinical trials in metastatic CRC24. This emphasizes the need to focus on the enrollment of (stage IV) patients who are diagnosed at an older age. Although important factors such as comorbidities and performance status are currently unknown, we believe that PLCRC has the potential to serve as a research platform that fulfills the current demand for RWD as advocated by regulators and the research community. An additional advantage of PLCRC is the large collection of biospecimens, which is intertwined with routine clinical care, and of longitudinal PROs from diagnosis onwards. Moreover, the incorporation of PROs that describe the impact of treatment on quality of life, daily activities and symptoms is increasingly recognized as an essential component of real-world evidence and has the potential to improve cancer care, shared decision making, and clinical outcomes25,26,27.

Dutch CRC guidelines recommend, in line with the European guidelines, determining mismatch repair status in stage II-IV tumors, and RAS and BRAF mutation status in tumors of patients with metastatic CRC, prior to the start of systemic treatment28,29,30. Although our percentages may be an underestimation, as mutation status could become available during NCR updates after the initial data registration, the amount of missing data on molecular diagnostics is noteworthy. A limitation that is currently inevitable within PLCRC is that completeness of the NCR depends on daily clinical practice. Contrary to the above-mentioned national guidelines, molecular markers are not routinely measured in all patients in the clinic. This means that PLCRC is currently missing opportunities to make optimal use of tumor mutation status for research purposes. Efforts are ongoing to perform retrospective molecular profiling within PLCRC to supplement existing molecular pathology data, with the aim of tailoring treatment options to the individual patient in the future. In addition to identifying predictors of treatment response and clinical outcomes, this will contribute to the development of a unique cohort that could provide “external” controls for future single-arm clinical trials in uncommon CRC subtypes with high unmet medical need31,32.

Given the large variety of available data, PLCRC will allow for comprehensive analyses on CRC. However, future improvements are required to optimize two fundamental elements of RWD sources: completeness of cases and completeness of clinical data. First, regarding completeness of cases, our experience is that over 90% of patients provide informed consent once the study aim is explained. Enhanced integration of research into daily clinical practice and the development of local infrastructures that lead to increased willingness and availability of personnel to inform the patient about PLCRC, especially in regular hospitals, are crucial to further improve the completeness of cases and create a true RWD cohort. Second, completeness of clinical data mainly depends on how well clinicians document clinical data in EHRs. Regardless of the list of items to be collected in the NCR, unmeasured or undocumented data will never become available to the research community. Moreover, EHR data are often unstructured and inconsistent due to large variation between clinicians and differences in EHR software systems. Bertagnolli and colleagues33 recently stated that the use of data obtained during routine clinical care as “real-world” data to fuel a learning healthcare system is still in its infancy. Before EHRs can be used to facilitate a learning health system, they must contain readily exchangeable and clinically meaningful structured data elements of adequate quality to draw valid inferences33. Therefore, we emphasize that nationwide harmonization and standardization of clinical data entries in EHRs, followed by implementation of electronic data-capture systems that enable real-time data transfer from EHRs to the NCR, will significantly enhance the completeness and quality of clinical data.

Future efforts should focus on reaching and enrolling older patients and on enhancing the involvement of gastroenterology departments to enroll patients with early-stage tumors. Moreover, stage IV patients in particular should be enrolled closer to diagnosis to standardize time points for PROs and avoid potential survivor bias. This can be achieved by an optimal research-focused infrastructure and implementation of research-specific consultations for all cancer patients shortly after diagnosis. During this consultation, the patient is informed about the specific components of PLCRC (Box 1), as well as about the main aim to optimally evaluate treatments, accelerate innovation, and learn from each individual patient. Such an infrastructure will also contribute to enrolling patients with the fewest hospital visits, e.g. patients with a polypectomy only, or those with extensively metastasized disease with rapid progression who receive best supportive care only. Besides the aforementioned suggestions, we need to create a societal change with respect to clinical research. All stakeholders should be aware that, in order to improve oncology practice, research needs to become an integrated part of clinical care and that contributing to clinical research should be self-evident. Lastly, PLCRC is a platform to centralize national CRC research in order to maximize its potential and minimize patient burden. Access to cohort resources for collaborative research projects may be requested through the Scientific Committee [https://plcrc.nl/for-international-visitors], which reviews all research projects for approval.

To conclude, PLCRC is establishing a unique and steeply growing national RWD cohort that allows for a wide range of research. Data from the general patient population enable a learning healthcare system that provides insight into the care and outcomes of patients who are usually underrepresented in RCTs, e.g. the very young, older patients, and those with multiple comorbidities. Comprehensive analyses within PLCRC are facilitated by the extensive amount of clinical data covering the complete treatment trajectory and by additional patient-reported outcomes. Further improvements in recruitment methodologies and multidisciplinary enrollment of patients will contribute to the aim of enrolling all newly diagnosed CRC patients in the Netherlands. This will continue to enhance PLCRC’s representation of the real world and its ability to improve both scientific research and daily clinical practice.