Introduction

Clinical real-world data (RWD) has garnered increasing interest for its use alongside clinical trial protocols in generating evidence in medical research. Several studies have explored the utilization of RWD in conducting clinical trials1,2,3,4, providing external control groups for single-arm studies5,6, and complementing the control group in randomized controlled trials (RCTs)7,8,9. The rationale for using RWD to augment RCTs relies on the assumption that RWD and RCT datasets are comparable despite potential biases. However, data generation mechanisms differ substantially between RWD and RCT, and systematic differences are probable. As a result, verifying the compatibility between datasets becomes a crucial component of data preprocessing to ensure the study's feasibility.

Compared to clinical trial data, RWD's quality can greatly fluctuate depending on its purpose and the specific dataset. Characteristics such as accuracy, completeness, and sampling intervals can vary among covariates, patients, and healthcare providers10,11. Certain clinical measures, such as blood pressure and weight, are often available for research, but variables like outpatient medication exposure might need to be inferred indirectly from prescriptions and are potentially overestimated if prescriptions were left unused11. Choices in study design, like index date or lookback window, can influence prevalence and incidence estimates based on claims and electronic health record (EHR) databases12. These factors may not be as clearly defined in RWD as in RCTs, potentially affecting the temporal alignment of the datasets. There might also be other underlying time-related biases which complicate, for instance, the identification of event onset or exposure to treatment13. For example, a delay between disease initiation and detection can introduce temporal variation and bias into the data13.

Apart from inclusion and exclusion criteria, the discrepancy between datasets can be alleviated through the application of appropriate validation criteria for the relevance, reliability, and quality of the RWD14 and by particularly focusing on confounders related to exposure and outcome15. Several methods for selecting control patients have been suggested, such as cardinality matching of individuals15 based on propensity scoring16, which has become the preferred method for adjusting group differences and reducing confounding. However, selection criteria and computational advancements are merely parts of the solution and should not be relied upon unconditionally.

In this study, we compared baseline data from a completed clinical trial on chronic kidney disease outcomes in individuals with type 2 diabetes to that from electronic health records (EHRs) in order to characterize their similarities and differences. Our focus was on five common data types: demographics, diagnoses, medications, laboratory measurements, and vital signs. We evaluated temporal aspects such as extent and sampling density, along with data completeness or missingness. In addition to comparisons between whole datasets, through statistical and cluster analysis, we demonstrate a partial overlap of RWD and RCT data.

Methods

Data

We used RCT data from the completed FIDELIO-DKD trial (Bayer, NCT02540993), which studied the effect of finerenone on chronic kidney disease outcomes in adult patients (≥ 18 years) with type 2 diabetes17. Patients included in the trial had diabetes and fulfilled the following criteria for kidney disease at the time of randomization: a urinary albumin-to-creatinine ratio (UACR) of 30–300 mg/g, estimated glomerular filtration rate (eGFR) of 25–60 ml/min/1.73 m², or a UACR of 300–5000 mg/g and eGFR of 25–75 ml/min/1.73 m² along with diabetic retinopathy (see ref. 17 for further details). The data were pseudonymized, and we used only baseline data prior to randomization, including demographics, vital signs, diagnosis history, laboratory measurements, and concomitant medications. We obtained internal approval for secondary research use of the trial data.

To be included in the RWD set, patients needed to have chronic kidney disease as defined by ICD-10 codes N18, N19, or I12, or eGFR < 45 ml/min/1.73 m² at some time point in their EHRs. Thus, the patients had moderate to severe impairment in kidney function based on the Kidney Disease: Improving Global Outcomes (KDIGO) guidelines. UACR criteria were not used in the RWD set because UACR measurements were rarely available. Type 2 diabetes mellitus was also required, defined by diagnosis code E11, the use of diabetes medication (ATC class A10), or a glycated hemoglobin (H-HbA1c) measurement ≥ 48 mmol/mol.
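The RWD inclusion rule above can be written as a small predicate. This is a minimal sketch, not the authors' extraction code: the function names and the per-patient input layout (lists of codes and lab values) are assumptions made for illustration, and only the thresholds and code prefixes come from the text.

```python
# Hypothetical helper names; thresholds and code prefixes are taken from the text.
CKD_ICD10_PREFIXES = ("N18", "N19", "I12")
T2D_ICD10_PREFIX = "E11"
DIABETES_ATC_PREFIX = "A10"


def has_ckd(icd10_codes, egfr_values):
    """CKD by diagnosis code, or any eGFR below 45 ml/min/1.73 m²."""
    by_diagnosis = any(c.upper().startswith(CKD_ICD10_PREFIXES) for c in icd10_codes)
    by_egfr = any(value < 45 for value in egfr_values)
    return by_diagnosis or by_egfr


def has_t2d(icd10_codes, atc_codes, hba1c_values):
    """Type 2 diabetes by diagnosis, diabetes medication, or HbA1c >= 48 mmol/mol."""
    by_diagnosis = any(c.upper().startswith(T2D_ICD10_PREFIX) for c in icd10_codes)
    by_medication = any(c.upper().startswith(DIABETES_ATC_PREFIX) for c in atc_codes)
    by_hba1c = any(value >= 48 for value in hba1c_values)
    return by_diagnosis or by_medication or by_hba1c


def meets_rwd_inclusion(icd10_codes, atc_codes, egfr_values, hba1c_values):
    """Both conditions must hold at some point in the patient's records."""
    return has_ckd(icd10_codes, egfr_values) and has_t2d(
        icd10_codes, atc_codes, hba1c_values
    )
```

A patient with an N18 diagnosis and an A10 prescription qualifies without any diabetes diagnosis on record, which is one way a cohort member can lack an E11 code.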

The RWD set contained data from patients who were first diagnosed with either chronic kidney disease or type 2 diabetes mellitus as adults. We assessed this post hoc using data from medications, diagnoses, and laboratory tests. We extracted these data along with demographics and vital signs from electronic healthcare records of HUS Helsinki University Hospital, Finland, covering a ten-year period from 2012 to 2021.

The index date in the RCT data was the date of randomization, while in the RWD, it was the date when the chronic kidney disease inclusion criteria were met.

Data structuring

The RCT data were structured. RWD were also structured, with the exception of smoking status, some medications, and New York Heart Association (NYHA) classes. We extracted these data from clinical documents using text mining and added them to the dataset with corresponding timestamps.

Data harmonization

We used SNOMED coding to harmonize the different nomenclatures of RCT and RWD. The mapping was straightforward for demographics, medications, laboratory measurements, and vital signs. For diagnoses, we used the latest version of the mapping table from OHDSI Athena18; 95.9% of MedDRA codes in the RCT data and 94.1% of ICD-10 codes in the RWD were successfully mapped to SNOMED codes. Unmapped diagnosis codes, which were primarily in the Z category (factors affecting health status or contact with health services), were not considered critical and were excluded from subsequent analyses. For laboratory measurements, we performed standard unit conversions between RCT and RWD values.
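As an illustration of what a standard unit conversion can look like, the sketch below converts two laboratory quantities relevant to this cohort. The choice of these two analytes is an assumption: the text does not list which conversions were actually needed. The constants are the commonly used IFCC-to-NGSP master equation for HbA1c and the molar conversion factor for creatinine.

```python
def hba1c_mmol_mol_to_percent(ifcc_mmol_mol):
    """Convert HbA1c from IFCC units (mmol/mol) to NGSP units (%)."""
    return 0.09148 * ifcc_mmol_mol + 2.152


def creatinine_umol_l_to_mg_dl(umol_l):
    """Convert serum creatinine from µmol/L to mg/dL (1 mg/dL = 88.4 µmol/L)."""
    return umol_l / 88.4


# The 48 mmol/mol inclusion threshold corresponds to roughly 6.5% in NGSP units.
threshold_percent = hba1c_mmol_mol_to_percent(48)
```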

Analytics environment

We stored and processed the data on the HUS Acamedic cloud-based data analytics platform19. This platform enables high-performance scientific computing and can be scaled as necessary. The platform meets both European and national regulations (General Data Protection Regulation, Finlex 552/2019) for processing sensitive health and social data, has a valid security certification, and is supervised by the National Supervisory Authority for Welfare and Health (Valvira).

Statistical methods

We compared the prevalence of diagnoses and medications between RCT and RWD sets and reported counts and proportions by category. We compared the data longitudinality and sampling density aspects, namely the length of pre-index time, events per year, unique codes per patient, and interval between events, between RCT and RWD sets as continuous variables. These were reported with medians and interquartile ranges (IQR). We used chi-squared tests for group comparisons with categorical variables and the Kruskal–Wallis H test for continuous data. Moreover, we computed odds ratios to evaluate similarity of features within different clusters of patients. We performed the analyses with Python 3.8.10.0 using pandas (version 1.1.5)20 and SciPy (version 1.5.3)21 libraries.
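The tests named above map directly onto SciPy functions. The following is a minimal sketch under stated assumptions: the function name `compare_prevalence` is hypothetical, the example counts are back-calculated from the percentages reported for ATC class J (they are approximations, not values from the study tables), the number of simultaneous tests is set to 14 only because ATC has 14 anatomical main groups, and the continuous samples are synthetic.

```python
import numpy as np
from scipy.stats import chi2_contingency, kruskal


def compare_prevalence(rct_with, rct_total, rwd_with, rwd_total, n_tests=1):
    """Chi-squared test and odds ratio for one category, with Bonferroni adjustment."""
    table = np.array(
        [
            [rct_with, rct_total - rct_with],
            [rwd_with, rwd_total - rwd_with],
        ]
    )
    _, p_value, _, _ = chi2_contingency(table)
    # Odds of the category in RCT relative to RWD (undefined if a cell is zero).
    odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])
    p_adjusted = min(1.0, p_value * n_tests)
    return odds_ratio, p_adjusted


# Approximate counts reconstructed from 8.6% of 5,734 and 66.0% of 23,523.
odds_ratio, p_adjusted = compare_prevalence(493, 5734, 15525, 23523, n_tests=14)

# Kruskal-Wallis H test on synthetic pre-index times (years), for illustration only.
rng = np.random.default_rng(0)
rct_years = rng.gamma(shape=2.0, scale=4.0, size=500)
rwd_years = rng.gamma(shape=2.0, scale=2.5, size=500)
h_statistic, p_kruskal = kruskal(rct_years, rwd_years)
```

With two groups, the Kruskal-Wallis H test reduces to a rank-based comparison equivalent to the Mann-Whitney U test without continuity correction.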

Cluster analysis of diagnoses and medications

To further examine and visually represent the differences and overlaps between the RCT and RWD sets, we performed separate cluster analyses for medication and diagnosis datasets, which were formed by merging the RCT and RWD sets. These analyses were similar to those outlined in ref. 22. For this purpose, we mapped diagnoses in the RCT data to International Classification of Diseases version 10 codes (ICD-10) as detailed in the Results section. Following the method in ref. 22, we limited our analysis to ICD-10 diagnosis categories A to N with a precision of three characters. Hence, our focus was on disorders while excluding codes related to pregnancy, external causes, malformations, and contacts with health services. For medication data, we selected the first four characters of the Anatomical Therapeutic Chemical (ATC) codes. In both data sets, a specific code had to have at least 1% prevalence to be included in the analysis. Consequently, the data sets contained 65 covariates for diagnoses and 84 covariates for medications.
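The covariate construction described above (truncate codes, restrict the category range, binarize per patient, filter by prevalence) can be sketched with pandas. The function name, the column names `patient_id` and `code`, and the decision to compute prevalence over the merged set are assumptions for illustration; the text does not specify the data layout.

```python
import pandas as pd


def build_binary_features(events, all_patients, n_chars=3,
                          letter_range=("A", "N"), min_prevalence=0.01):
    """Return a patients x codes 0/1 matrix of sufficiently prevalent codes.

    events: DataFrame with hypothetical columns 'patient_id' and 'code'.
    all_patients: every patient ID, so patients without retained codes stay in.
    """
    truncated = events.assign(feature=events["code"].str.upper().str[:n_chars])
    if letter_range is not None:
        # Keep only codes whose first letter falls in the chosen chapter range.
        first_letter = truncated["feature"].str[0]
        truncated = truncated[first_letter.between(*letter_range)]
    # One row per (patient, code) pair so the cross-tabulation is binary.
    pairs = truncated[["patient_id", "feature"]].drop_duplicates()
    matrix = pd.crosstab(pairs["patient_id"], pairs["feature"])
    matrix = matrix.reindex(index=all_patients, fill_value=0)
    prevalence = matrix.mean(axis=0)
    return matrix.loc[:, prevalence >= min_prevalence]
```

For medication data, the same function could be called with `n_chars=4` and `letter_range=None`, since the text applies no chapter restriction to ATC codes.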

Following a two-step approach for clustering22, we first trained a variational autoencoder (VAE) model23,24 using Keras25 (version 2.3.1). This model projected the binary diagnosis or medication vectors into a two-dimensional latent space23,24. We then clustered the projected vectors using the HDBSCAN algorithm26.

For the encoder and decoder components of the VAE models, we utilized fully connected multilayer perceptrons (MLPs) with a single hidden layer, with either a hyperbolic tangent (tanh) or a rectified linear unit (ReLU)27 as the activation function. Since the purpose of the cluster analysis was to visually distinguish the data sets and their differences, we selected the activation function that produced visually distinct subgroups. The VAE model was trained using the evidence lower bound objective, which comprises a reconstruction loss and a regularizer on the latent space. We divided the data into a training set (90% of the data) and a validation set (10% of the data), with the validation data used to choose the number of gradient descent steps for training, which we treated as a hyperparameter. Finally, we mapped both sets to the latent space for the clustering step.
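To make the two terms of the evidence lower bound concrete, the sketch below evaluates the negative ELBO in NumPy for a batch of binary input vectors. It assumes a Bernoulli decoder, a diagonal Gaussian encoder, and a standard normal prior; these are common choices for binary data but are not stated in the text, and the function name is hypothetical. It computes the loss only and is not a training loop.

```python
import numpy as np


def negative_elbo(x, x_prob, z_mean, z_log_var, eps=1e-7):
    """Mean negative ELBO over a batch.

    x: binary inputs, shape (n, d).
    x_prob: decoder output probabilities, shape (n, d).
    z_mean, z_log_var: encoder outputs, shape (n, latent_dim).
    """
    x_prob = np.clip(x_prob, eps, 1.0 - eps)
    # Reconstruction loss: Bernoulli negative log-likelihood per sample.
    reconstruction = -np.sum(
        x * np.log(x_prob) + (1.0 - x) * np.log(1.0 - x_prob), axis=1
    )
    # Regularizer: KL divergence from N(z_mean, exp(z_log_var)) to N(0, I).
    kl = -0.5 * np.sum(1.0 + z_log_var - z_mean**2 - np.exp(z_log_var), axis=1)
    return float(np.mean(reconstruction + kl))
```

When the encoder output equals the prior (zero mean, zero log-variance), the regularizer vanishes and only the reconstruction term remains.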

We used the HDBSCAN algorithm26 (version 0.8.29) to extract clusters for subsequent description and interpretation of the identified subgroups. HDBSCAN is a density-based clustering algorithm that groups similar data points together while also identifying outliers. It constructs a hierarchy of clusters by considering local density variations and connectivity, allowing for the automatic determination of the number of clusters. We chose HDBSCAN due to its ability to automatically determine the number of clusters, handle varying cluster shapes and sizes, as well as noise. For diagnosis data, we used the HDBSCAN parameters min_cluster_size = 220 and min_samples = 1, and for medications data, we used min_cluster_size = 200 and min_samples = 5. To visualize the results of the cluster analysis, we used NumPy28 (version 1.21.6) to compute two-dimensional histograms, SciPy21 (version 1.5.3) for kernel density estimation, and Matplotlib29 (version 3.2.1) for plotting. To preserve patient anonymity, we avoided presenting the locations of individual data points and instead utilized histogram and density-based approaches for visualization.
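The privacy-preserving visualization described above can be sketched with NumPy and SciPy. The function names, the bin count, and the grid on which the density is evaluated are assumptions; only the use of two-dimensional histograms, Gaussian kernel density estimation, and the truncation of bin counts at 150 (stated in the figure captions) come from the text.

```python
import numpy as np
from scipy.stats import gaussian_kde


def truncated_histogram(latent, bins=50, max_count=150):
    """2D histogram of latent coordinates with bin counts capped at max_count."""
    counts, x_edges, y_edges = np.histogram2d(latent[:, 0], latent[:, 1], bins=bins)
    return np.minimum(counts, max_count), x_edges, y_edges


def kde_on_grid(latent, x_edges, y_edges):
    """Gaussian KDE of the points, evaluated at the histogram bin centers."""
    kde = gaussian_kde(latent.T)  # gaussian_kde expects shape (n_dims, n_points)
    x_centers = 0.5 * (x_edges[:-1] + x_edges[1:])
    y_centers = 0.5 * (y_edges[:-1] + y_edges[1:])
    grid_x, grid_y = np.meshgrid(x_centers, y_centers, indexing="ij")
    density = kde(np.vstack([grid_x.ravel(), grid_y.ravel()]))
    return density.reshape(grid_x.shape)
```

The resulting arrays can be passed to Matplotlib functions such as `pcolormesh` and `contour`. Note that `gaussian_kde` can fail when points are nearly identical, because the sample covariance becomes singular.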

Ethical aspects

According to Finnish legislation (Act on the Secondary Use of Health and Social Data (552/2019) by the Ministry of Social Affairs and Health), the approval of an ethical committee or informed consent is not required for non-interventional, observational retrospective registry studies. The study was conducted in accordance with the Declaration of Helsinki and the General Data Protection Regulation (GDPR). HUS Helsinki University Hospital approved the study (permission HUS/230/2022). The original RCT data collection was based on informed consent from patients participating in the FIDELIO-DKD clinical trial (ClinicalTrials.gov identifier NCT02540993).

Results

Qualitative comparisons

We analyzed the pre-index baseline data of 23,523 RWD and 5,734 RCT patients, all of whom had both chronic kidney disease and type 2 diabetes. Harmonizing to common nomenclature for both RCT and RWD sets was straightforward, but we noted considerable qualitative differences in data generation, temporality, and completeness, as summarized in Table 1. For instance, in the RCT data, diagnoses and concomitant medications based on case report forms (CRFs) collected by investigators spanned up to 50 years pre-index. Conversely, in the RWD, all diagnoses in EHRs with precise dates were available up to 10 years pre-index, constrained by the research permit. Laboratory measurements and vital signs were only available near index in RCT data and up to 10 years pre-index in RWD.

Table 1 Qualitative observations between the five medical domains in the RCT and RWD data sets.

Quantitative comparisons

Statistical analysis revealed notable differences in completeness, particularly in medications and diagnoses. After harmonizing the different nomenclatures of RCT and RWD, we used ATC and ICD-10 code classes for easier interpretation of the results. In concomitant medications (Fig. 1A), the largest difference in prevalence was observed in anti-infectives for systemic use (class J, RCT 8.6%, RWD 66.0%, P < 0.001). In diagnosis history (Fig. 1B), the largest difference in prevalence was seen in class R, which refers to symptoms, signs, and abnormal clinical and laboratory findings (RCT 14.3%, RWD 57.1%, P < 0.001). Although all patients in the RWD fulfilled the inclusion criteria, not all of them had both inclusion diagnoses of type 2 diabetes and chronic kidney disease, which belong to classes E (endocrine, nutritional, and metabolic diseases, RCT 100.0%, RWD 50.7%, P < 0.001) and N (diseases of the genitourinary system, RCT 100.0%, RWD 49.9%, P < 0.001), respectively.

Figure 1

Comparison of proportion (%) of RCT (N = 5 734) and RWD (N = 23 523) patients with different medication/diagnosis types recorded, each pair of RCT and RWD bars corresponding to an ATC class (A) or ICD-10 class (B). Value labels above bars correspond to P values calculated with chi-squared tests. Presented P values were adjusted with Bonferroni correction to address the issue of multiple comparisons.

As demographics, laboratory measurements, and vital signs in the RCT data were only available near the index time, we further analyzed the temporal characteristics of concomitant medications and diagnoses (Table 2). We defined an event as a patient encounter in either the outpatient or inpatient setting. The pre-index time (in years) from the first event to the index date was significantly longer in RCT data for both medications (median 7.7 vs. 4.9, P < 0.001) and diagnoses (median 20.3 vs. 5.2, P < 0.001). We also assessed the time interval between events and the sampling density, defined as the number of events per year. The number of events per year and the number of unique codes were significantly larger in RWD than in trial data for both medications and diagnoses. Consequently, the time interval between consecutive events was significantly shorter in RWD than in RCT for both medications and diagnoses. Additionally, we observed a significant difference between RWD and RCT cohorts in age at index (median 76 years in RWD, median 66 in RCT, P < 0.001) and in the male/female ratio (RWD 47/53, RCT 70/30, P < 0.001).
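The per-patient longitudinality and sampling density variables can be sketched as follows. The exact operational definitions are assumptions: here the pre-index time runs from the first recorded event to the index date, sampling density is the event count divided by that time, and the interval is the median gap between consecutive events. The function name and the use of a 365.25-day year are also illustrative choices.

```python
from datetime import date
from statistics import median


def longitudinality(event_dates, index_date):
    """Summarize one patient's pre-index events for one domain."""
    pre_index = sorted(d for d in event_dates if d <= index_date)
    if not pre_index:
        raise ValueError("no events on or before the index date")
    pre_index_years = (index_date - pre_index[0]).days / 365.25
    # Sampling density is undefined when the only event is on the index date.
    events_per_year = len(pre_index) / pre_index_years if pre_index_years > 0 else None
    gaps = [(later - earlier).days for earlier, later in zip(pre_index, pre_index[1:])]
    return {
        "pre_index_years": pre_index_years,
        "events_per_year": events_per_year,
        "median_interval_days": median(gaps) if gaps else None,
    }
```

Cohort-level medians and interquartile ranges would then be computed over these per-patient values.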

Table 2 Comparison of longitudinality and sampling density variables between concomitant medication and diagnosis domains for the RCT (N = 5 734) and RWD (N = 23 523) patients.

Cluster analysis of diagnoses and medications

To elucidate the overlap between RWD and RCT cohorts, we conducted cluster analyses of diagnoses and medications, integrating the RWD and RCT datasets for this purpose. Figures 2 and 3 graphically represent the outcomes of these analyses. In both the diagnosis and medication datasets, the RWD and RCT cohorts emerged as discernable subgroups in the latent space of the Variational Autoencoder (VAE) model. Eleven clusters were discerned from the diagnosis data, and seven from the medication data. For the diagnosis dataset, 5133 patients (18%) did not align with any cluster; a comparable figure for the medication dataset was 5487 patients (19%). Detailed overviews of each cluster can be found in Tables 3 and 4, and in the supplement.

Figure 2

Visualization of the results from the cluster analysis of the diagnosis data. (A) The two-dimensional histogram shows the density of the combined RCT and RWD data sets along the two dimensions of the learned latent representation. We truncated histogram bin counts to 150. The red and blue lines show the Gaussian kernel density estimates (KDE) of the RCT and RWD distributions, respectively. (B) Contour lines for the Gaussian KDEs fitted using the datapoints belonging to clusters 0 and 3–10, plotted at the 0.05 value of each probability density estimate. The datapoints in cluster 2 lie very near to each other, leading to kernel density estimation failing, which is why we visualized the mean of these data point locations instead.

Figure 3

Visualization of the results from the cluster analysis of the medications data. (A) The two-dimensional histogram shows the density of the combined RCT and RWD data sets along the two dimensions of the learned latent representation. We truncated histogram bin counts to 150. The red and blue lines show the Gaussian kernel density estimates (KDE) of the RCT and RWD distributions, respectively. (B) Contour lines for the Gaussian KDEs fitted using the datapoints belonging to each cluster, plotted at the 0.05 value of each probability density estimate.

Table 3 Results from clustering of diagnosis data.
Table 4 Results from clustering of medications data.

From the diagnosis dataset, three clusters (0, 1, and 3) mainly consisted of RCT patients, while eight clusters (2, 4–10) primarily consisted of RWD patients. Cluster 4 demonstrated the most substantial overlap, with 22% of its patients from the RCT cohort and 78% from the RWD cohort. Cluster 3 also exhibited significant overlap, with 89% of its data derived from the RCT cohort and 11% from the RWD cohort. This overlap signifies the presence of both RCT and RWD patients within a single cluster. Importantly, over half of both the RCT and RWD cohorts were consolidated within Cluster 4, whereas the proportions for other clusters were substantially lower. Clusters 2 and 5 were distinguished by a very low prevalence for all diagnoses utilized in VAE model training.
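Two different proportions appear in the paragraph above: the source mix within a cluster (for example, 22% RCT and 78% RWD in Cluster 4) and the share of each cohort that falls into a cluster (over half of both cohorts in Cluster 4). The sketch below computes both from cluster labels. The function name and output layout are hypothetical, and the noise label of -1 follows the convention of the hdbscan library for unclustered points.

```python
from collections import Counter


def cluster_composition(labels, sources, noise_label=-1):
    """Per-cluster size, within-cluster source mix, and share of each cohort."""
    cohort_sizes = Counter(sources)
    per_cluster = {}
    for label, source in zip(labels, sources):
        if label == noise_label:
            continue  # points not assigned to any cluster
        per_cluster.setdefault(label, Counter())[source] += 1
    summary = {}
    for label, counts in sorted(per_cluster.items()):
        size = sum(counts.values())
        summary[label] = {
            "size": size,
            # Fraction of the cluster's patients that come from each source.
            "mix": {s: counts[s] / size for s in cohort_sizes},
            # Fraction of each source's whole cohort that lies in this cluster.
            "cohort_share": {s: counts[s] / cohort_sizes[s] for s in cohort_sizes},
        }
    return summary
```

Because the RWD cohort is about four times larger than the RCT cohort, a cluster can be dominated by RWD patients in its mix and still contain the majority of the RCT cohort.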

Aside from these two clusters, those predominantly composed of RWD patients could be loosely divided into two groups: the larger Cluster 4 containing 17,383 patients, and a group comprising the adjacent Clusters 6–10. In Cluster 4, essential hypertension (I10) and non-insulin-dependent diabetes (E11) were the most prevalent diagnoses. Clusters 7–10 were characterized by prevalent diagnoses of pneumonia (J18) and either heart failure (I50), acute myocardial infarction (I21), or both. Cluster 6 differed substantially from Clusters 7–10, although it lay close to them in the latent space. The most common diagnoses for Cluster 6 were other diseases of the urinary system (N39) and other soft tissue disorders (M79).

The RCT-dominated clusters (0, 1, and 3) shared prevalent diagnosis codes for chronic renal failure (N18), non-insulin-dependent diabetes (E11), and essential hypertension (I10). Additional prevalent diagnoses in Cluster 0, but not in Cluster 1, included vascular dementia (F01) and depressive episode (F32). Cluster 1, but not Cluster 0, exhibited a higher prevalence of other specified diabetes mellitus (E13), unspecified diabetes mellitus (E14), and glomerular disorders in diseases classified elsewhere (N08).

We further inspected the similarity of diagnosis prevalences within Clusters 3 and 4 using odds ratios and chi-squared tests (Supplementary Material). In Cluster 3, the prevalences of chronic renal failure (N18) and non-insulin-dependent diabetes mellitus (E11) were balanced between RCT and RWD patients, with odds ratios of 1.0 and P = 1.0, in contrast to respective odds ratios of 5.7 and 1.9 in the original datasets. The same trend towards more balanced prevalences was observed in the diagnoses of Cluster 4 compared to the datasets without clustering.

From the medication dataset, two clusters (2 and 4) were mostly composed of RWD patients, and five (0, 1, 3, 5, and 6) were largely comprised of RCT patients. The most substantial overlap between RCT and RWD cohorts was observed in Cluster 0 (11% RWD, 89% RCT) and the largest Cluster 4 (5% RCT, 95% RWD). Over 70% of the RWD patients were included in Cluster 4, and the proportions for the remaining clusters were significantly smaller.

Conversely, RCT patients were more evenly distributed across the clusters. Clusters 3, 4, and 5 each comprised approximately 15% of the RCT patients, with the remaining clusters containing smaller proportions. Across all clusters, the most prevalent medication codes (prevalence > 60%) included at least one code pertaining to the cardiovascular system (ATC codes beginning with 'C') and at least one code related to diabetes medications (ATC codes beginning with 'A').

Contrary to other clusters, neither antithrombotic agents (B01A) nor opioids (N02A) and other analgesics and antipyretics (N02B) were prevalent in Clusters 0 and 3. Clusters 1, 5, and 6 all shared stomatological preparations (A01A) as their most frequent medication code, whereas this code was scarcely seen in Clusters 2 and 4. Clusters 2 and 4 were distinctive in that they had high prevalences of other beta-lactam antibacterials (J01D) and hypnotics and sedatives (N05C). In addition, other antianemic preparations (B03X) and calcium (A12A) were among the medications specific to Cluster 2.

Discussion

This study illustrates that real-world data (RWD) and randomized controlled trial (RCT) datasets, derived from patients with diabetic chronic kidney disease, share common characteristics but also exhibit substantial differences in terms of data generation, completeness, and temporal dynamics. These discrepancies have implications for study design validity and mandate careful examination when merging RWD and RCT data.

Data generation and longitudinality

The noted differences between RCT and RWD predominantly arise from their respective data generation processes and objectives. RCT data is prospectively collected following a specified study protocol, whereas RWD is extracted through queries from hospital data infrastructure. For instance, in RCT data, only the initial diagnosis date is recorded by the investigator on case report forms (CRFs), potentially leading to selection and recall bias. Conversely, in RWD, each inpatient and outpatient diagnosis is precisely dated; however, data from a single institution may not encompass the patient's entire medical history. Unlike RWD, RCT data aims to assess the efficacy and safety of an intervention. Therefore, data elements unrelated to the trial's exposure and outcome, such as anti-infective medications and symptom diagnoses in our study, may be underrepresented. Furthermore, specific elements of RCT data, such as laboratory measurements, may only be cross-sectional and timed near the index date. Conversely, electronic health record (EHR) data chronicles patient interaction with healthcare services and inherently contains longitudinal data without defined start or end dates, possibly spanning decades. In our study, the research permit limited RWD to a ten-year range. Thus, incomplete patient history, whether derived from CRFs or EHRs, introduces biases and differences between RCT and RWD data. Ideally, RCT baseline data should utilize RWD covering the complete patient history over the relevant time range.

Data density and completeness

Diagnoses and medications were sampled significantly more densely in our RWD set compared to RCT. Consequently, RWD offered a more accurate portrayal of the patients' state by capturing all pertinent data. Nevertheless, data completeness and accuracy can be limited if data is sourced solely from a single healthcare provider. Despite all patients meeting the inclusion criteria, some lacked records of chronic kidney disease or type 2 diabetes diagnosis in RWD data, unlike in the RCT data. These diagnoses may be partly recorded in primary healthcare data, which was not included in this study. Additionally, text mining was necessary to extract data from unstructured texts in RWD. Despite these limitations, our findings suggest that, in certain scenarios, RWD could supplement RCT data through EHR to electronic data capture (EHR2EDC) automation30.

The harmonization process resulted in a minor loss of diagnoses, with 94.1% of RWD and 95.9% of RCT diagnosis codes successfully mapped to SNOMED codes. The mapping process for the remaining data types was straightforward.

Cluster analysis

We applied cluster analysis to the combined RWD and RCT datasets to illustrate the heterogeneity of the study population, discover patient subgroups, and assess the overlap between the two datasets. The clustering rests on the idea that variational autoencoders can learn a nonlinear mapping between high-dimensional feature vectors representing patient characteristics and a low-dimensional latent space representation, in which similar feature vectors are located close to each other. Our analysis revealed that the datasets were largely distinct. Both the RCT and RWD sets comprised unique subgroups, with an overlap observed in only a few clusters. This was true for both the diagnosis and medication datasets. However, clustering enabled us to extract from the RCT and RWD datasets subgroups of patients who had more similar characteristics than the original datasets. In the diagnosis data, Cluster 3 and the largest cluster (Cluster 4) contained a substantial number of patients from both RWD and RCT sets. Hence, even though overlap was not present in all clusters, many RWD and RCT patients were grouped together in the cluster analysis. Clusters represented different patient characteristics and tended towards balanced prevalences of features between the RCT and RWD sets in a cluster-specific manner. Thus, clustering was useful in understanding and mitigating group differences by selecting subgroups of patients with similar characteristics. Due to the low availability of specific parameters in RWD, we did not apply the whole set of trial criteria, which is possibly reflected in the extent of overlap between the datasets. On the other hand, RWD contained a wider spectrum of recorded diagnoses than RCT. We conclude that many of the observed differences are due to the different data generation mechanisms of RWD and RCT data.

The cluster analyses underscore the challenges in finding overlaps between real-world and RCT data, emphasizing the necessity for advanced methods to identify matching external controls. In the cluster analyses, we chose input covariates using a prevalence threshold of 1%, leading to 65 and 84 input covariates for diagnoses and medications data, respectively. A higher threshold would yield fewer covariates that could potentially differentiate the real-world and RCT datasets. The covariate set chosen for aligning RCT and RWD can influence the outcome and therefore requires careful selection. In addition to clinical differences between groups, there could be disparities in data completeness, i.e., how well the selected covariates are captured in the different data sources. Importantly, the clustering results are influenced by the choices we made in the VAE model training and cluster analysis and represent one set of possible options. Furthermore, we conducted the cluster analyses without considering possible demographic differences between RCT and RWD, which could account for some of the identified differences. To assess the robustness of clustering, subsampling-based analysis as in ref. 22 would be preferable. However, cluster analysis proved beneficial in identifying patients with overlapping characteristics in RCT and RWD. Our results indicate the overlap and discrepancies in one trial and RWD pair but cannot be directly generalized. Nevertheless, similar observations are likely in any study that merges RCT and RWD sources. The clustering method could be used to identify the characteristics of the most common phenotype of patients in a selected therapeutic area of a healthcare provider and compare it to the characteristics of the patients in a global clinical trial. On the other hand, the method could be used to identify rarer patient phenotypes to enable the application of precision medicine and further expansion of the research to rare subpopulations of a disease.

Conclusion

In this study, we successfully elucidated the differences and demonstrated the feasibility of combining RCT and RWD, highlighting the potential for enriching RCT data using first-hand baseline information, filling missing data, and effectively mitigating discrepancies between datasets. RCT and RWD exhibit substantial differences in data longitudinality, completeness, sampling density, among other factors, all of which should be considered when designing studies that amalgamate data from these sources. Despite their inherent limitations, RWD sources could be used to enrich RCT datasets, for instance, to enhance the longitudinality and completeness of patient history. RCT and RWD sets were distinct and could form unique patient subgroups, which must be considered in studies merging RCT and RWD and in patient matching.