The Prospective Dutch Colorectal Cancer (PLCRC) cohort: real-world data facilitating research and clinical care

Real-world data (RWD) sources are important to advance clinical oncology research and evaluate treatments in daily practice. Since 2013, the Prospective Dutch Colorectal Cancer (PLCRC) cohort, linked to the Netherlands Cancer Registry, has served as an infrastructure for scientific research, collecting additional patient-reported outcomes (PRO) and biospecimens. Here we report on cohort developments and investigate to what extent PLCRC reflects the “real-world”. Clinical and demographic characteristics of PLCRC participants were compared with the general Dutch CRC population (n = 74,692, Dutch-ref). To study representativeness, standardized differences between PLCRC and Dutch-ref were calculated, and logistic regression models were evaluated on their ability to distinguish cohort participants from the Dutch-ref (AU-ROC 0.5 = preferred, implying participation independent of patient characteristics). Stratified analyses by stage and time-period (2013–2016 and 2017–Aug 2019) were performed to study the evolution towards RWD. By August 2019, 5744 patients had been enrolled. Enrollment increased steeply, from 129 participants (1 hospital) in 2013 to 2136 (50 of 75 Dutch hospitals) in 2018. The low AU-ROC (0.65, 95% CI: 0.64–0.65) indicates limited ability to distinguish cohort participants from the Dutch-ref. Characteristics that remained imbalanced in the period 2017–Aug’19 compared with the Dutch-ref were age (65.0 years in PLCRC, 69.3 in the Dutch-ref) and tumor stage (40% stage-III in PLCRC, 30% in the Dutch-ref). PLCRC is approaching representativeness of the Dutch CRC population and will ultimately meet the current demand for high-quality RWD. Efforts are ongoing to improve multidisciplinary recruitment, which will further enhance PLCRC’s representativeness and its contribution to a learning healthcare system.

The PLCRC population. PLCRC consists of patients diagnosed with a malignancy of the colon and/or rectum (ICD-10, C18-20) in the Netherlands. Every patient aged ≥ 18 years with histologically proven CRC, or with a strong suspicion of CRC without pathological confirmation, is eligible. The informed consent procedure is preferably performed shortly after diagnosis and before treatment starts. However, patients can also be enrolled during treatment or follow-up. Consent for longitudinal clinical data collection is mandatory for participation. In addition, patients can choose to consent to other optional items as shown in Box 1. Patients enrolled before August 1, 2019 for whom complete clinical data of the initial diagnosis and treatment period were available (to ascertain correct classification of tumor stage) were included in the analyses. For these patients, baseline demographic and clinical data were retrieved and reported, as well as self-reported physical activity, fatigue, quality of life, BMI, presence of chronic comorbidities, smoking behavior, alcohol consumption, education level, and living situation at baseline.

The Netherlands Cancer Registry (NCR). The Netherlands Cancer Registry (NCR) contains an extensive set of clinical data, from diagnosis onwards, of individuals diagnosed with cancer in the Netherlands and has a national coverage of over 95% 16 . Clinical data of the complete treatment trajectory are retrieved from EHRs and entered into the NCR. Importantly, the completeness of the NCR therefore depends on the completeness of EHRs. Overall, the NCR's high quality is assured by thorough training of data managers and computerized consistency checks. PLCRC's informed consent allows for linkage with the NCR and thus ensures the availability of clinical data over the complete cancer trajectory.
For the current analysis, only data of the initial data registration phase, i.e. at diagnosis, were used. We compared characteristics of PLCRC participants with the general Dutch CRC population (Dutch-ref) from the NCR with a date of incidence between January 1, 2013 and December 31, 2017.

Statistical analysis. Descriptive statistics were used to describe baseline patient characteristics, including baseline PROs. Standardized differences (d) were calculated to quantify the magnitude of differences in patient characteristics between PLCRC participants and the Dutch-ref. Values greater than 0.20 indicate a large imbalance, values between 0.10 and 0.20 indicate a small imbalance, and values less than 0.10 indicate a negligible imbalance 17,18 . Results are shown for the total group and stratified by tumor stage and time of enrollment. Two time-periods (enrolled between 2013-'16 and 2017-August '19) were evaluated to assess whether PLCRC participants became more representative of the Dutch-ref over time. Logistic regression models were used to investigate to what extent cohort participation could be predicted from the available a priori selected patient characteristics (i.e. age, sex, primary tumor location and tumor stage) 19 21 . Discrimination refers to the ability to distinguish cohort participants from nonparticipants, and was quantified by the area under the receiver operating characteristic curve (AU-ROC) 20 . The AU-ROC ranges from 1, corresponding to perfect discrimination, to 0.5, corresponding to a model with no discrimination ability. The latter is preferred here and is defined as cohort participation independent of patient characteristics (0.5 = random chance, 0.5-0.7 = poor, 0.7-0.8 = good, 0.8-1.0 = strong, 1.0 = perfect prediction) 22 .
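The two representativeness measures described above can be illustrated with a minimal sketch. It assumes the commonly used pooled-standard-deviation definition of the standardized difference and the rank-based (pairwise comparison) definition of the AU-ROC; the exact formulas and software used for the analysis are not stated in the text, and all function names and input values below are hypothetical illustrations rather than cohort data.

```python
import math
from statistics import mean, variance


def std_diff_continuous(x, y):
    """Standardized difference for a continuous variable (e.g. age).

    Assumed definition: |mean difference| divided by the square root of the
    average of the two sample variances.
    """
    pooled_sd = math.sqrt((variance(x) + variance(y)) / 2)
    return abs(mean(x) - mean(y)) / pooled_sd


def std_diff_binary(p1, p2):
    """Standardized difference for a binary variable given two proportions."""
    pooled_sd = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    return abs(p1 - p2) / pooled_sd


def au_roc(scores_participants, scores_reference):
    """AU-ROC as the probability that a randomly chosen participant has a
    higher predicted score than a randomly chosen reference patient
    (ties count as one half)."""
    wins = sum(
        (p > r) + 0.5 * (p == r)
        for p in scores_participants
        for r in scores_reference
    )
    return wins / (len(scores_participants) * len(scores_reference))


# Toy ages (hypothetical, not cohort data).
d_age_toy = std_diff_continuous([60, 65, 70], [64, 69, 74])

# Treating "stage III vs. other" as a binary variable with proportions
# 0.40 and 0.30 is a simplification for illustration only.
d_stage_toy = std_diff_binary(0.40, 0.30)  # approximately 0.21

# Hypothetical predicted participation probabilities from a fitted model.
auc_toy = au_roc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])

print(round(d_age_toy, 2), round(d_stage_toy, 2), round(auc_toy, 2))
```

An AU-ROC near 0.5 in such a calculation would mean that the predicted scores carry almost no information about who participated, which is the preferred outcome in this setting.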

Results
The flowchart for the selection of individuals for the current analyses is shown in Fig. 1 (Fig. 2). This led to an improved annual enrollment rate, from 129 participants from 1 recruiting hospital in 2013 to 2136 from 50 recruiting hospitals in 2018 (note that there are approx. 14,000 incident cases annually). At enrollment, 100% of patients consented to the use of their clinical data obtained from the NCR, which was mandatory, 81% consented to receive repeated PRO questionnaires, 83% to blood withdrawals, 95% to the use of tissue for scientific research, 83% to being contacted when relevant DNA abnormalities are found, and 78% to future research and trials according to the TwiCs design (Fig. 3). Of the patients who consented to receive questionnaires for PROs, 77% returned their baseline questionnaire, and completion rates remained above 60% in the first three years after enrollment (Fig. 4). Interestingly, patients who received paper-based questionnaires had consistently higher completion rates than those who received electronic questionnaires (85% vs. 72% at baseline, respectively).
Baseline patient characteristics. The

Table note: This number is higher than presented in Table 1, as for some cases sufficient staging information was available to classify into TNM stage.

Table 2.
When the two time-periods were compared with the Dutch-ref to study PLCRC's evolution, a large imbalance remained for age at diagnosis and tumor stage (d_age from 0.45 to 0.40, d_tnm from 0.33 to 0.22). The distribution of sex, BMI at diagnosis, and primary tumor location improved to imbalances classified as small or negligible (d_sex from 0.17 to 0.09, d_bmi from 0.16 to 0.03, d_pr.tumor from 0.53 to 0.16). For location of synchronous metastases, e.g. liver metastasis, the imbalance compared with the Dutch-ref was negligible in both time periods (d_liver from 0.08 to 0.04). Table 3 shows stratified analyses in which PLCRC participants were compared with the Dutch-ref by tumor stage. Age at diagnosis was lower for PLCRC participants in all stages, and discrepancies increased by stage (d_age from 0.28 in stage I to 0.55 in stage IV). For all disease stages, PLCRC contained relatively more patients with a primary tumor in the rectum and fewer patients with a primary tumor in the colon, compared with the Dutch-ref (d_pr.tumor between 0.14 and 0.25). The distributions of sex and BMI at diagnosis were comparable to the reference population in all stages (d_sex between 0.08 and 0.19, d_bmi between 0.01 and 0.08).
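As a reading aid, the time-period comparison can be checked against the imbalance thresholds given in the statistical analysis section. The sketch below uses only the standardized differences quoted in the paragraph above; the function name and the handling of values exactly at 0.10 or 0.20 are assumptions made for illustration.

```python
def classify_imbalance(d):
    """Map a standardized difference to the categories used in the text.

    Assumption: values of exactly 0.10 or 0.20 are counted as 'small'.
    """
    if d > 0.20:
        return "large"
    if d >= 0.10:
        return "small"
    return "negligible"


# Standardized differences versus the Dutch-ref as quoted in the text:
# (enrolled 2013-'16, enrolled 2017-August '19).
reported = {
    "age": (0.45, 0.40),
    "tnm": (0.33, 0.22),
    "sex": (0.17, 0.09),
    "bmi": (0.16, 0.03),
    "pr.tumor": (0.53, 0.16),
    "liver": (0.08, 0.04),
}

labels = {
    name: (classify_imbalance(early), classify_imbalance(late))
    for name, (early, late) in reported.items()
}

for name, (early, late) in labels.items():
    print(f"{name}: {early} -> {late}")
```

Under these thresholds, age and tumor stage stay in the large category in both periods, while sex, BMI and primary tumor location each move down by one category.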

Discussion
Over the past six years, the increased number of PLCRC recruiting centers has resulted in a steep increase in participating patients, with excellent consent rates for PROs, blood and tissue biobanking, and participation in future research within PLCRC. Although we found an overall shift towards the Dutch-ref for patients enrolled between 2017 and August 2019, regular hospitals remain underrepresented among the recruiting centers, and PLCRC participants were still younger and more often had stage III disease compared with the total Dutch CRC population.
Besides common discrepancies such as performance status and number of comorbidities, clinical trial participants are notably younger than the real-world population, which for now might hamper the applicability of trial results in daily clinical practice. It was recently shown that phase-III RCT patients were on average seven years younger than the general CRC population 23 24 . This emphasizes the need to focus on the enrollment of (stage IV) patients who are diagnosed at older age. Although important factors such as comorbidities and performance status are currently unknown, we believe that PLCRC has the potential to serve as a research platform that fulfills the current demand for RWD as advocated by regulators and the research community. An additional advantage of PLCRC is the large collection of biospecimens, which is intertwined with routine clinical care, and longitudinal PROs from diagnosis onwards. Moreover, the incorporation of PROs that describe the impact of treatment on quality of life, daily activities and symptoms is increasingly recognized as an essential component of real-world evidence and has the potential to improve cancer care, shared decision making, and clinical outcomes [25][26][27] . Dutch CRC guidelines recommend, in line with the European guidelines, determining both mismatch repair status in stage II-IV tumors, and RAS and BRAF mutation status in tumors of patients with metastatic CRC prior to the start of systemic treatment [28][29][30] . Although our percentages may be an underestimation, as mutation status could become available during NCR updates after the initial data registration, the amount of missing data on molecular diagnostics is noteworthy. A limitation that is currently inevitable within PLCRC is that completeness of the NCR depends on daily clinical practices. In contrast to the above-mentioned national guidelines, molecular markers are not routinely measured in all patients in the clinic.
This means that PLCRC is currently missing opportunities to optimally use tumor mutation status for research purposes. Efforts are ongoing to perform retrospective molecular profiling within PLCRC to supplement existing molecular pathology data, with the aim of tailoring treatment options to the individual patient in the future. Next to the identification of predictors for treatment response and clinical outcomes, this will also contribute to the development of a unique cohort that could provide "external" controls for future single-arm clinical trials in uncommon CRC subtypes with high unmet medical need 31,32 .
Given the large variety of available data, PLCRC will allow for comprehensive analyses on CRC. However, future improvements are required to optimize two fundamental elements of RWD sources: completeness of cases and completeness of clinical data. First, based on our experience, over 90% of patients provide informed consent once the study aim is explained. Enhanced integration of research into daily clinical practice and the development of local infrastructures that lead to increased willingness and availability of personnel to inform the patient about PLCRC, especially in regular hospitals, are crucial to further improve the completeness of cases and create a true RWD cohort. Second, completeness of clinical data mainly depends on how well clinicians document clinical data in EHRs. Regardless of the list of items to be collected in the NCR, unmeasured or undocumented data will never become available to the research community. Moreover, EHR data are often unstructured and inconsistent due to large variation between clinicians and differences in EHR software systems. Bertagnolli and colleagues 33 recently stated that the use of data obtained during routine clinical care as "real-world" data to fuel a learning healthcare system is currently still in its infancy. Before EHRs can be used to facilitate a learning health system, they must contain readily exchangeable and clinically meaningful structured data elements of adequate quality to draw valid inferences 33 . Therefore, we emphasize that nationwide harmonization and standardization of clinical data entries in EHRs, and subsequent implementation of electronic data-capture systems to enable real-time data transfer from EHRs to the NCR, will significantly enhance the completeness and quality of clinical data.
Future focus should be given to reaching and enrolling older patients and to enhancing the involvement of gastroenterology departments in enrolling patients with early-stage tumors. Moreover, stage IV patients in particular should be enrolled closer to diagnosis to standardize time points for PROs and avoid potential survivor bias. This can be achieved by an optimal research-focused infrastructure and implementation of research-specific consultations for all cancer patients shortly after diagnosis. During this consultation, the patient is informed about the specific components of PLCRC (Box 1), as well as about the main aim to optimally evaluate treatments, accelerate innovation, and learn from each individual patient. Such an infrastructure will also contribute to enrolling patients with the fewest hospital visits, e.g. patients with a polypectomy only, or extensively metastasized disease with rapid progression and best supportive care only. Besides the aforementioned suggestions, we need to create a societal change with respect to clinical research. All stakeholders should be aware that, in order to improve oncology practice, research needs to become an integrated part of clinical care and that contributions to clinical research are self-evident. Lastly, PLCRC is a platform to centralize national CRC research to maximize

Table 1. Baseline descriptive demographic and clinical characteristics at PLCRC enrollment, stratified by tumor stage. Descriptives are presented as count (%), mean (± SD), or median (IQR). (a) N = 22 patients with permanently unknown tumor stage are not included in the analysis. (b) In case of missing data, the descriptive statistics of complete cases are presented. (c) Self-reported. (d) PLCRC is a dynamic cohort with continuous new enrollment and data linkage. The high percentage of missing data is due to the time lag between enrollment and data linkage from the NCR to PLCRC, which is continuously updated.
To conclude, PLCRC is establishing a unique and steeply growing national RWD cohort that allows for a wide range of research. Data from the general patient population enable a learning healthcare system that provides insight into the care and outcomes of patients who are usually underrepresented in RCTs, e.g. the very young, older patients, and those with multiple comorbidities. Comprehensive analyses within PLCRC are facilitated by the extensive amount of clinical data covering the complete treatment trajectory and additional patient-reported outcomes. Further improvements in recruitment methodologies and multidisciplinary enrollment of patients will contribute to the aim of enrolling all newly diagnosed CRC patients in the Netherlands. This will continue to enhance PLCRC's representation of the real world and its ability to improve both scientific research and daily clinical practice.

www.nature.com/scientificreports/

Data availability
Access to cohort resources for collaborative research projects may be requested through the Scientific Committee [https://plcrc.nl/for-international-visitors], which reviews all research projects for approval.