Main

Overly restrictive, and sometimes poorly justified1, eligibility criteria are a key barrier that leads to low enrolment in clinical trials2. For example, around 80% of patients with advanced non-small-cell lung cancer (aNSCLC) did not meet the criteria of the analysed trials3. As a result, 86% of clinical trials failed to complete their recruitment within the targeted time4. The US National Cancer Institute concluded that eligibility criteria arbitrarily eliminate patients and should be simplified and broadened5,6. The US Food and Drug Administration has also emphasized that certain populations are usually excluded from clinical trials without solid clinical justification. Restrictive trials do not fully capture the efficacy and safety of the drug in the populations that will use the drug after approval1.

There is therefore a great need to have faster trial accrual and better generalizability, with data-driven eligibility criteria7,8,9,10. However, how to broaden eligibility remains a major challenge. Even trials with similar mechanisms that target the same disease often use different eligibility criteria, possibly owing to legacy protocols. Some eligibility criteria are included to reduce the risk of severe toxicity-related adverse events, which is a critical consideration10. In an evaluation by the American Society of Clinical Oncology, 56% of surveyed clinicians agreed that some criteria are too stringent and harm the trial, but no agreement could be reached on the removal of specific criteria, given the available data9.

Data-driven algorithms combined with real-world data can potentially improve several aspects of clinical trials11,12,13. Artificial intelligence can screen patients that meet eligibility14,15,16, predict which patients are more likely to enrol in trials17,18 and extract features from electronic health records (EHRs)19,20,21. Several studies have introduced approaches to quantify the difference between the study samples of a clinical trial and the target population that can use the treatment22,23. Recent research also used EHR data to evaluate how different eligibility criteria can affect the number of adverse events associated with COVID-19 that are observed in the selected cohort24. Our study differs from these studies in that we focus on evaluating the effect of relaxing specific eligibility criteria on treatment efficacy and cohort size in a real-world population. The Flatiron Health database that we use has effectively been used to analyse outcomes of patients with lung cancer after immunotherapies25,26.

Overview of Trial Pathfinder

We developed Trial Pathfinder as a framework that integrates real-world data and systematically analyses the hazard ratio of the overall survival for cohorts that are defined by different eligibility criteria (Fig. 1). In the first step—trial emulation—we selected individuals in the real-world dataset who met the available eligibility criteria as originally published in the clinical trial protocol. The eligibility criteria were extracted from free text and encoded into programmatic logic statements (Methods). We assigned the selected patients to the treatment groups that were consistent with their treatment records in the Flatiron database. We used the inverse probability of treatment weighting to adjust for baseline confounding factors and to emulate randomization. We then performed survival analysis for the emulated trials using the hazard ratio of overall survival as the outcome. The Trial Pathfinder emulation framework makes it possible to systematically vary the eligibility criteria in silico and quantify how the hazard ratio of overall survival changes with different combinations of criteria.

Real-world data and trial emulation

This retrospective study used the Flatiron Health EHR-derived database (https://flatiron.com/real-world-evidence), which includes de-identified data from approximately 280 cancer clinics in the USA27. Longitudinal de-identified patient-level data included structured and unstructured data curated from the EHRs. We focused on analysing aNSCLC trials because they have the largest number of patients in the Flatiron Health database, comprising 61,094 patients with aNSCLC. Starting from all of the phase-III aNSCLC trials on ClinicalTrials.gov (queried on 8 November 2019), we filtered for trials that had available trial protocols and had at least 250 patients in each arm in the Flatiron Health dataset who matched the description of the patients in the trials. This resulted in 10 completed aNSCLC trials sponsored by diverse companies that we analysed using Trial Pathfinder (Extended Data Fig. 1 and Extended Data Table 1). Four trials are for first-line treatment and six are for second-line treatment.

Using the Flatiron Health data, we encoded commonly used eligibility criteria based on patient characteristics, diagnoses, laboratory values, biomarkers and previous treatments (Supplementary Table 1). There is substantial heterogeneity in which eligibility criteria were used in each aNSCLC trial, even though they all have the same mechanism of action as checkpoint inhibitors (Extended Data Fig. 2). For example, one trial excluded patients on the basis of albumin and lymphocyte levels, whereas the other nine trials did not. This motivated us to investigate the influence that each inclusion or exclusion criterion had on the real-world population.

Effects of the eligibility criteria

For each aNSCLC trial, we first selected all of the patients with aNSCLC in the Flatiron Health database who have taken the treatment or control drugs in the corresponding line of therapy. On average, 5,167 patients were identified for each trial (Table 1). The hazard ratio of the overall survival was estimated with propensity scores to control for differences between groups (Extended Data Fig. 3). This analysis corresponds to the hypothetical setting in which we fully relax the eligibility criteria.

We next emulated each aNSCLC trial using all of the original protocol criteria that can be encoded in the Flatiron database. The number of patients in the Flatiron database who met all of the eligibility criteria of the trial, along with their emulation hazard ratio of the overall survival, is shown in Table 1. The emulation results are broadly consistent with those of the original randomized trials. On average, only 30% of the patients in the Flatiron database who have taken the drugs tested in the trial actually met the trial eligibility criteria. Moreover, across the trials, the hazard ratio of the full patient population is comparable to, and sometimes smaller than, the hazard ratio of the subset of the patients who met the eligibility criteria (Supplementary Table 2). This suggests that many patients who were excluded by the restrictive eligibility criteria can also potentially benefit from the treatment in the trial.

The above findings motivated us to quantify how each inclusion/exclusion criterion affects the number of eligible patients and the trial outcome. The latter is particularly challenging because the effect of each inclusion/exclusion rule on the hazard ratio depends on which other inclusion/exclusion rules are used to select patients. To estimate this effect systematically, for each aNSCLC trial, we simulated thousands of synthetic cohorts using the Flatiron database under different random combinations of inclusion/exclusion criteria and estimated the hazard ratio of the overall survival for each cohort. We then used the Shapley value28, an attribution method used in artificial intelligence, to summarize the influence of each criterion. The Shapley value is a weighted average of the effect on the hazard ratio of adding each criterion to different sets of inclusion/exclusion rules (see Methods for details). In our setting, a Shapley value smaller than zero suggests that including the criterion improves the efficacy of the trial and decreases the hazard ratio.

Figure 2 shows the Shapley values for each eligibility criterion estimated with an efficient Monte Carlo algorithm (Methods and Extended Data Fig. 4). Shapley values close to zero (shown in white) correspond to eligibility criteria that had no effect on the hazard ratio of the overall survival. Criteria with beneficial effects (that is, including the criterion would decrease the hazard ratio of the overall survival on average) are shown in blue and detrimental effects (that is, including the criterion would increase the hazard ratio of the overall survival on average) are shown in red. Figure 2 also shows the decrease in the number of eligible patients when each criterion was applied (see Supplementary Tables 3, 4 for the exact numbers).

Our analysis suggests that several commonly used inclusion/exclusion criteria do not substantially affect the hazard ratio of the overall survival of a trial or potentially reduce the efficacy of the trial. These criteria include conditions analysed by laboratory tests (blood pressure, albumin levels, counts of lymphocytes or neutrophils, or alanine aminotransferase (ALT), alkaline phosphatase (ALP) and aspartate aminotransferase (AST) levels) and previous treatments (ALK, PDL1, EGFR and CYP3A4 therapies, systemic or antineoplastic therapies). These inclusion/exclusion criteria can be restrictive; for example, requiring the lymphocyte count to be greater than 500 per μl excludes 6.3% of the patients on average. Moreover, patients excluded by these criteria benefit from the treatments of the trial to an extent similar to that of patients who met the criteria, as reflected in a Shapley value close to zero (Fig. 2).

Relaxing trial eligibility criteria

The results above show that it is promising to explore the benefits and trade-offs of relaxing standard eligibility criteria. We investigated this by keeping for each trial only the subset of the criteria that Trial Pathfinder identified as decreasing the hazard ratio of the trial (that is, with a Shapley value less than zero) and relaxing the remaining restrictions. We denote this subset the ‘data-driven criteria’ (Supplementary Table 5). The set of data-driven criteria removes nine inclusion/exclusion rules on average. The hazard ratio of the overall survival had an average reduction of 0.05 compared with using the full eligibility criteria, and the number of eligible patients increased from 1,553 to 3,209 on average, a 107% increase (Table 1 and Extended Data Fig. 5).

Relaxing restrictive eligibility criteria has the important benefit of making clinical trials more inclusive for diverse populations (Supplementary Tables 6–8). The patients who would be excluded by the original trial criteria but would be eligible in the relaxed rules tend to include more women and more patients older than 75 years. Detailed comparisons of patient characteristics between the original trial cohort and our emulations are shown in Supplementary Tables 9–18.

We performed several analyses to support the robustness of our results. In addition to using overall survival as the end point, we repeated all of the analyses for each trial using progression-free survival (Extended Data Table 2). To assess the robustness of our findings in light of the recent shift towards immunotherapies, we ran an analysis in which the data-driven criteria were applied to patients who received treatment between February 2017 and February 2020 (Supplementary Table 19). To assess the representativeness of our findings, we stratified our patient populations on the basis of geographical regions in the USA and the types of insurance plan (Supplementary Tables 20–28). We also applied Trial Pathfinder to 9,439 patients with aNSCLC who received Foundation Medicine genomic tests (Supplementary Tables 29–31). The results of all of these analyses are consistent with our primary findings.

Our primary analyses focused on aNSCLC trials because this cancer type had the most patients in the Flatiron Health database. To investigate the generalizability of Trial Pathfinder to other types of cancer, we identified three additional trials in colorectal cancer, advanced melanoma and metastatic breast cancer with available trial protocols that can be encoded in the Flatiron database (Supplementary Table 32). In all three types of cancer, we found that the original trial criteria were overly restrictive. The data-driven criteria selected by Trial Pathfinder substantially increased the patient population (53% increase on average) while achieving a lower hazard ratio of the overall survival than the original trial criteria (a decrease of 0.13 in the hazard ratio of the overall survival on average) (Extended Data Table 3 and Supplementary Table 33).

Broadening the thresholds of laboratory tests

To more directly assess the effects on safety when broadening eligibility criteria, we analysed the follow-up and evaluation of toxicity for 22 completed Roche oncology trials, which combined comprised 11,602 patients. We found substantial heterogeneity in the eligibility criteria across these trials (Supplementary Table 34). Even trials that targeted the same cancer, in the same phase, and that involved treatments of similar mechanisms used a number of different thresholds of laboratory values to exclude patients. Across aNSCLC, advanced melanoma, metastatic breast cancer and follicular lymphoma, trials with more relaxed thresholds of laboratory values for eligibility did not have more treatment withdrawals due to adverse events than trials with more stringent eligibility thresholds (Supplementary Table 35). This supports our finding that we can potentially broaden several common laboratory-based eligibilities—levels of bilirubin, platelets, haemoglobin and ALP—to align with successful trials that already use these relaxed thresholds without increasing the toxicity risks for the patients.

We further support our findings by analysing abstracted toxicity data in a cohort of 1,000 patients with aNSCLC from the Flatiron database. No significant difference in baseline laboratory values at the start of treatment was found when comparing patients who reported toxicity-related adverse events during the course of treatment with patients who did not (Extended Data Fig. 6). This reinforces our finding that the broader eligibility thresholds for laboratory tests are feasible from a safety perspective. Furthermore, Extended Data Fig. 7 shows that relaxing the cut-off threshold for the levels of bilirubin, haemoglobin, platelets and ALP within the range of thresholds used in trials (Supplementary Table 35) does not significantly increase the hazard ratio of the overall survival in the Flatiron database and can make trials more inclusive (Supplementary Tables 36, 37).

Discussion

Overly restrictive eligibility criteria limit the access of patients to potentially beneficial treatments. Our findings suggest that it is particularly promising to standardize and potentially broaden several eligibility criteria based on cut-offs for bilirubin, platelets, haemoglobin and ALP values. Recent oncology trials often used different cut-off thresholds for these laboratory tests to exclude patients. We found that across different types of cancer, trials with more relaxed thresholds of laboratory values for eligibility did not have more treatment withdrawals due to adverse events compared with trials with more stringent eligibility thresholds. Together with our findings on the Flatiron data, this suggests that standardizing the eligibility criteria to align with successful trials within the same therapy class that used more relaxed laboratory thresholds could be a good approach to enhance inclusivity.

Because the real-world population can differ from the clinical trial samples, our study aim was not to replicate the original trial results using the Flatiron database. Instead, we investigated how varying the eligibility rules affects the proportion of the real-world population that would be eligible for the trial. Our data-driven evaluation of eligibility criteria should be interpreted as one factor among several that can assist clinical trial specialists in their designs. In each trial, there could be drug-specific nuances, and our hope is that by combining our recommendations with their expertise, trial designers can arrive at more-informed criteria. Currently, longitudinal real-world data with robust outcomes are more limited for diseases other than cancer, which can have more complex end points. There will be opportunities to extend this work outside of oncology as additional high-quality data become available.

Methods

Clinical trial curation

In this study, we focused on aNSCLC, because it is a prevalent cancer type and has the largest number of patients in the Flatiron Health database. We systematically identified all of the aNSCLC trials that were available for our analysis. A total of 3,684 interventional clinical trials of NSCLC were retrieved from the ClinicalTrials.gov website of the National Library of Medicine (queried on 8 November 2019). A systematic selection of trials was carried out using the following filters: (1) trials were interventional and had only two arms; (2) treatments consisted of drugs or biologicals only; (3) the drugs selected in each arm are recommended for aNSCLC as listed on the NIH website (https://www.cancer.gov/about-cancer/treatment/drugs/lung); (4) at least 250 patients in each arm were found in the Flatiron Health dataset who matched the description of the patients in the trials; (5) the trial was conducted in phase III; and (6) protocols were available. The final list of selected aNSCLC trials included FLAURA (ref. 29), LUX8 (ref. 30), Checkmate017 (ref. 31), Checkmate057 (ref. 32), Checkmate078 (ref. 33), Keynote010 (ref. 34), Keynote189 (ref. 35), Keynote407 (ref. 36), BEYOND (ref. 37) and OAK (ref. 38). Detailed information on these trials can be found in Extended Data Table 1. To ensure the completeness of the trial criteria, we extracted all of the eligibility rules directly from the original trial protocol documents rather than from ClinicalTrials.gov, and the programmatic encoding of the criteria was verified by a team of experienced oncology data scientists and clinical trial specialists. Additional information about the encoding of the criteria is provided in the Supplementary Methods and Supplementary Discussion. Trial Pathfinder is a flexible framework that can be applied to other clinical trials.

Flatiron Health dataset

The data that support the findings of this study were obtained from Flatiron Health, a nationwide EHR-derived de-identified database containing 219,312 patients with cancer with an average of 2.6 years of follow-up. The Flatiron data leveraged in this study (the February 2020 data cut) come from a combination of EHR-derived data and external commercial and US Social Security Death Index data. The Flatiron Health database is considered one of the industry’s leading research databases in oncology owing to the rigorous data curation and abstraction processes as well as publications in which their efforts to validate outcomes are demonstrated. In previous validation studies that compared the Flatiron mortality data with data from the gold-standard National Death Index, the sensitivity of mortality capture in a population of patients with aNSCLC was 91%, and the effect of the remaining missing deaths on survival analyses was minimal39,40. In addition to curation accuracy, the Flatiron data are harmonized and aggregated across approximately 280 cancer clinics across the country, which makes the data more representative than the EHRs of a single healthcare centre. The majority of patients in the database originate from community oncology settings; relative community/academic proportions may vary depending on the study cohort. Data provided to investigators were de-identified and subject to obligations to prevent re-identification and to protect the confidentiality of the patients. These de-identified data may be made available upon request, and are subject to a licence agreement with Flatiron Health; interested researchers can contact DataAccess@flatiron.com to determine licensing terms. Institutional Review Board approval with a waiver of informed consent was obtained before the study was conducted.

Flatiron Health takes a comprehensive approach to data curation, which involves the collection of both structured and unstructured data from the EHRs. Structured data points, such as laboratory test results, are harmonized across different EHRs and mapped into common terminologies. Unstructured data processing, such as data that come from clinician notes or biomarker reports, leverages technology-enabled abstraction. Through this process, qualified abstractors extract key data points from unstructured documents and are aided by software that facilitates this process through organization, searching and surfacing of key documents throughout the abstraction process. Flatiron’s network of abstractors includes certified tumour registrars, oncology nurses and oncology clinical researchers.

Patients in the Flatiron Health network were considered to be part of the aNSCLC real-world cohort if they were diagnosed with lung cancer (the ninth revision of the international classification of diseases (ICD-9) code 162.x; or the tenth revision of the international classification of diseases (ICD-10) code C34x or C39.9); had at least two documented clinical visits on or after 1 January 2011; had pathology consistent with NSCLC; and were diagnosed with stage IIIB, IIIC, IVA or IVB NSCLC on or after 1 January 2011, or diagnosed with early-stage NSCLC and subsequently developed recurrent or progressive disease on or after 1 January 2011. Patients were excluded if there was a lack of relevant unstructured documents in the Flatiron Health database for review by the abstraction team.

A catalogue of the criteria that it was possible to emulate using the Flatiron Health database can be found in Supplementary Table 1. There are some criteria for which Flatiron Health does not currently abstract information from EHRs—for example, reproductive health, some prior co-morbidities, some previous treatments, imaging procedures and results—and these were not included in the present study. For those criteria that are available in the database, we also evaluated the percentage of missing ECOG and laboratory value information for each patient at the start of the first or second line of therapy (Supplementary Table 38). To closely mirror the actual trial screenings, we considered clinical measurements taken within a window from 30 days before to 7 days after the start of the line of therapy40.
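The screening window described above can be illustrated with a short Python sketch. This is not the authors' code: the function name `baseline_value` and the tie-breaking rule (use the measurement closest to the start of the line of therapy) are assumptions, because the text specifies only the window itself.

```python
from datetime import date, timedelta


def baseline_value(measurements, line_start, days_before=30, days_after=7):
    """Return one baseline value for a patient and a single test.

    measurements: list of (date, value) pairs.
    Returns None when nothing falls inside the window (a missing value).
    """
    window_start = line_start - timedelta(days=days_before)
    window_end = line_start + timedelta(days=days_after)
    in_window = [(d, v) for d, v in measurements
                 if window_start <= d <= window_end]
    if not in_window:
        return None
    # Assumption: prefer the measurement nearest to the start of therapy.
    closest = min(in_window, key=lambda dv: abs((dv[0] - line_start).days))
    return closest[1]


# Hypothetical laboratory history for one patient.
start = date(2019, 3, 1)
labs = [
    (date(2019, 1, 15), 1.9),  # too early: outside the window
    (date(2019, 2, 20), 1.2),  # 9 days before the start
    (date(2019, 3, 5), 1.0),   # 4 days after the start
    (date(2019, 3, 20), 0.8),  # too late: outside the window
]
selected = baseline_value(labs, start)
```

A patient with no measurement inside the window yields `None`, which a downstream step can treat as a missing value.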

We further support our findings by analysing toxicity data for a real-world cohort of 1,000 patients with aNSCLC from the Flatiron database. These patients were randomly selected from the broader aNSCLC cohort based on receipt of anti-PD-1/PD-L1 therapy, and underwent additional data abstraction to determine the reasons for treatment discontinuation, including toxicity. In addition, we identified 22 Roche oncology trials with available clinical study reports, and extracted statistics from the study reports on the number of patients who withdrew from treatment owing to adverse events.

The Trial Pathfinder workflow

In the first step of Trial Pathfinder—trial emulation—we identified individuals in the real-world dataset who met the available eligibility criteria as originally published in the clinical trial protocol. The eligibility criteria were encoded as logic statements and were automatically applied by our workflow. More information on how the semi-structured free-text criteria in the clinical trial protocols were encoded into programmatic statements is provided in the Supplementary Methods. Patients with missing data points (for example, ECOG or laboratory values) in the corresponding criteria were not filtered by those criteria. We then assigned the selected patients to the treatment groups that were consistent with their treatment records in the database (for example, atezolizumab versus docetaxel). To emulate the randomization and blind assignment in the trials, we used inverse probability of treatment weighting (IPTW) to adjust for baseline confounding factors. Time zero was set to be the start of the corresponding line of therapy. Finally, we performed survival analysis for the emulated trials using the hazard ratio of the overall survival as the outcome. Each individual was followed until the occurrence of death or censored at the latest reported activity. Outcomes that occur after 27 months in the Flatiron database are considered censored in our analysis to match the original trial settings. The results are robust to the specific window lengths discussed here (Supplementary Table 39). The Trial Pathfinder open source code was written in Python version 3.6.
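As a rough illustration of how eligibility criteria might be encoded as logic statements, the Python sketch below applies a set of rules to patient records and lets patients with a missing value pass the corresponding rule, as described above. The helper names (`make_rule`, `select_cohort`), the field names and the ECOG and age thresholds are hypothetical; only the lymphocyte threshold of 500 per μl is taken from the text.

```python
def make_rule(field, predicate):
    """Build a criterion that does not filter patients whose field is missing."""
    def rule(patient):
        value = patient.get(field)
        if value is None:
            return True  # missing data: the patient is not filtered by this rule
        return predicate(value)
    return rule


# Hypothetical encoded criteria (field names and most thresholds are illustrative).
CRITERIA = {
    "ecog_0_or_1": make_rule("ecog", lambda v: v <= 1),
    "lymphocytes_gt_500": make_rule("lymphocytes_per_ul", lambda v: v > 500),
    "age_ge_18": make_rule("age", lambda v: v >= 18),
}


def select_cohort(patients, criteria_names):
    """Keep the patients who satisfy every named criterion."""
    rules = [CRITERIA[name] for name in criteria_names]
    return [p for p in patients if all(rule(p) for rule in rules)]


patients = [
    {"id": 1, "ecog": 0, "lymphocytes_per_ul": 1200, "age": 64},
    {"id": 2, "ecog": 2, "lymphocytes_per_ul": 900, "age": 70},
    {"id": 3, "ecog": None, "lymphocytes_per_ul": 400, "age": 58},
    {"id": 4, "age": 81},  # ECOG and lymphocyte count never recorded
]

full_protocol = select_cohort(patients, list(CRITERIA))
ecog_only = select_cohort(patients, ["ecog_0_or_1"])
fully_relaxed = select_cohort(patients, [])
```

Passing an empty list of criteria corresponds to the fully relaxed setting, and passing different subsets is what allows the criteria to be varied in silico.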

Trial Pathfinder trial emulation and survival analysis

To emulate the blind assignment and obtain unbiased estimates of treatment effects, we used IPTW to adjust for the baseline covariates. During the survival analysis, patient i is given the weight defined in equation (1), in which Zi is the indicator variable representing whether patient i is treated or not, with Zi = 1 indicating a treated case. The propensity score ei is defined in equation (2), in which Xi denotes the baseline covariates. We used a logistic regression model to estimate ei. In our experiments of aNSCLC, the covariates X were: age, gender, composite race or ethnicity, histology, smoking status, staging, ECOG and biomarker status, including ALK, EGFR, PDL1, ROS1, KRAS and BRAF. Adjustment by propensity score is effective in balancing all of the covariates between the synthetic treatment and control groups (Extended Data Fig. 3).

$${\omega }_{i}={Z}_{i}/{e}_{i}+(1-{Z}_{i})/(1-{e}_{i})$$
(1)
$${e}_{i}={\rm{\Pr }}({Z}_{i}=1|{X}_{i})$$
(2)
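Equation (1) can be checked on a toy example. In the Python sketch below the propensity scores are supplied directly, whereas the study estimated them with logistic regression; the data and the helper names `iptw_weights` and `weighted_mean` are illustrative assumptions. A single binary covariate is imbalanced between the arms before weighting and balanced afterwards.

```python
def iptw_weights(treated, propensity):
    """Equation (1): w_i = Z_i / e_i + (1 - Z_i) / (1 - e_i)."""
    weights = []
    for z, e in zip(treated, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(z / e + (1 - z) / (1 - e))
    return weights


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Toy data: patients with x = 1 are treated 75% of the time, x = 0 only 25%.
x = [1, 1, 1, 1, 0, 0, 0, 0]
treated = [1, 1, 1, 0, 1, 0, 0, 0]
propensity = [0.75] * 4 + [0.25] * 4  # assumed known here, not estimated

w = iptw_weights(treated, propensity)


def arm(values, flag):
    """Values of the patients whose treatment indicator equals `flag`."""
    return [v for v, z in zip(values, treated) if z == flag]


raw_gap = sum(arm(x, 1)) / 4 - sum(arm(x, 0)) / 4
weighted_gap = (weighted_mean(arm(x, 1), arm(w, 1))
                - weighted_mean(arm(x, 0), arm(w, 0)))
```

Here `raw_gap` is 0.5 (covariate means of 0.75 and 0.25 in the two arms), whereas `weighted_gap` is zero up to floating-point error, which is the kind of balance reported in Extended Data Fig. 3.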

We further performed survival analysis on the emulated trials. For each patient, the index date or time zero, resembling the randomization point in a clinical trial, was chosen to be the start date of the line of therapy of that trial (either first or second). This choice of time zero ensures that there is no immortal time bias41. Patients were followed until the occurrence of death, censoring those patients without a death event. The Cox proportional-hazards model was used to compute hazard ratios and confidence intervals of overall survival. Survival curves were estimated with the Kaplan–Meier method.
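The following Python sketch illustrates two pieces of this analysis under simplifying assumptions: censoring follow-up at the 27-month horizon mentioned in the workflow description, and a Kaplan–Meier estimate that optionally accepts per-patient weights such as IPTW weights. The Cox proportional-hazards fit used for the hazard ratios is not shown, and the function names `censor_at` and `kaplan_meier` are hypothetical.

```python
def censor_at(durations, events, horizon):
    """Treat any outcome after `horizon` as censored at `horizon`."""
    new_durations, new_events = [], []
    for d, e in zip(durations, events):
        if d > horizon:
            new_durations.append(horizon)
            new_events.append(0)
        else:
            new_durations.append(d)
            new_events.append(e)
    return new_durations, new_events


def kaplan_meier(durations, events, weights=None):
    """Return [(time, survival)] at each distinct event time.

    weights: optional per-patient weights (for example IPTW); defaults to 1.
    """
    if weights is None:
        weights = [1.0] * len(durations)
    event_times = sorted({d for d, e in zip(durations, events) if e == 1})
    survival, curve = 1.0, []
    for t in event_times:
        at_risk = sum(w for d, w in zip(durations, weights) if d >= t)
        deaths = sum(w for d, e, w in zip(durations, events, weights)
                     if d == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy follow-up in months; event = 1 is death, 0 is censored.
months = [5, 8, 12, 30, 40]
died = [1, 0, 1, 1, 1]

months_27, died_27 = censor_at(months, died, horizon=27)
curve = kaplan_meier(months_27, died_27)
```

In this toy cohort the two deaths at 30 and 40 months become censored observations at 27 months, so the curve only drops at months 5 and 12.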

Eligibility criteria evaluation with Shapley values

To evaluate the influence of an individual criterion we used the Shapley value, which is the average expected marginal contribution of adding one criterion to the hazard ratio after all possible combinations of criteria have been considered. The Shapley value has recently been proposed in machine learning as a principled approach to quantify the contribution of individual features and data28. The definition of the Shapley value of the ith criterion is given in equation (3), in which n is the total number of criteria and HR(S) indicates the hazard ratio computed when the criteria subset S is used to select patients. The sum in equation (3) is taken over all possible subsets S of the n original criteria (denoted as N for short) that did not contain i.

$$\text{Shapley value of the }i\text{th criterion}=\sum_{S\subseteq N\backslash \{i\}}\frac{|S|!\,(n-|S|-1)!}{n!}\left(\mathrm{HR}(S\cup \{i\})-\mathrm{HR}(S)\right)$$
(3)

The Shapley value of the ith criterion is a weighted average of the effect of adding this criterion to different subsets of inclusion/exclusion criteria. The weights normalize for the number of possible sets that have the same cardinality and are required to satisfy the Shapley attribution properties.
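For a handful of criteria, equation (3) can be evaluated exactly. The Python sketch below does so for a made-up hazard-ratio function over three hypothetical criteria A, B and C; in the actual workflow HR(S) would come from emulating the trial with the criteria subset S, which is not reproduced here.

```python
from itertools import combinations
from math import factorial


def exact_shapley(criteria, hazard_ratio):
    """Evaluate equation (3) by enumerating every subset S that excludes i."""
    n = len(criteria)
    values = {}
    for i in criteria:
        others = [c for c in criteria if c != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (hazard_ratio(s | {i}) - hazard_ratio(s))
        values[i] = total
    return values


def toy_hr(subset):
    """Made-up hazard ratio: A helps, B hurts, C helps only together with A."""
    hr = 0.80
    if "A" in subset:
        hr -= 0.05
    if "B" in subset:
        hr += 0.02
    if "A" in subset and "C" in subset:
        hr -= 0.04
    return hr


shapley = exact_shapley(["A", "B", "C"], toy_hr)
```

In this toy setting the interaction term of -0.04 is split equally between A and C, giving values of -0.07, 0.02 and -0.02 for A, B and C. The three values sum to HR(N) - HR(∅) = -0.07, one of the attribution properties that the weights are required to satisfy.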

Exhaustively computing the hazard ratios of overall survival for all possible subsets of criteria (on the order of 2^n subsets for n criteria) was computationally prohibitive. We therefore estimated the Shapley value by Monte Carlo sampling of subsets of criteria S, which gives an unbiased estimate of the Shapley value. Following the previously proposed algorithm42, we stopped sampling when the Shapley estimate had converged (that is, when the standard error of the Monte Carlo mean was less than 0.001). In practice, convergence happened after a hundred iterations for each criterion, so a few thousand Monte Carlo samples in total are sufficient for a trial with tens of criteria to evaluate. This makes Trial Pathfinder computationally efficient (Extended Data Fig. 4): one trial needs around half an hour to run on a single CPU. For each trial, we averaged its results evaluating on a different criteria set from the trials in the same line of therapy (either first or second). A Shapley value larger than zero indicates that the contribution of that criterion is to increase the hazard ratio on average. Conversely, a negative Shapley value means that the contribution of that criterion is to decrease the hazard ratio on average. Finally, Shapley values that are close to zero correspond to a criterion that does not affect the hazard ratio.
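A Monte Carlo estimate with the stopping rule described above can be sketched in Python as follows. The sketch uses a generic permutation-sampling estimator, which is an assumption and may differ in detail from the algorithm of ref. 42; the hazard-ratio function is a made-up stand-in for a full trial emulation, and the parameters `min_iter`, `max_iter` and `seed` are hypothetical.

```python
import random
from math import sqrt


def monte_carlo_shapley(criteria, hazard_ratio, tol=0.001,
                        min_iter=30, max_iter=10000, seed=0):
    """Estimate Shapley values by averaging marginal effects over random orders."""
    rng = random.Random(seed)
    criteria = list(criteria)
    sums = {c: 0.0 for c in criteria}
    sq_sums = {c: 0.0 for c in criteria}
    for iteration in range(1, max_iter + 1):
        order = criteria[:]
        rng.shuffle(order)
        included = frozenset()
        previous_hr = hazard_ratio(included)
        for c in order:
            # Marginal effect of adding c to the criteria that precede it.
            included = included | {c}
            current_hr = hazard_ratio(included)
            delta = current_hr - previous_hr
            sums[c] += delta
            sq_sums[c] += delta * delta
            previous_hr = current_hr
        if iteration >= min_iter:
            # Stop once every standard error of the mean is below tol.
            converged = True
            for c in criteria:
                mean = sums[c] / iteration
                variance = max(sq_sums[c] / iteration - mean * mean, 0.0)
                variance *= iteration / (iteration - 1)
                if sqrt(variance / iteration) >= tol:
                    converged = False
                    break
            if converged:
                break
    return {c: sums[c] / iteration for c in criteria}, iteration


def toy_hr(subset):
    """Made-up hazard ratio: A helps, B hurts, C helps only together with A."""
    hr = 0.80
    if "A" in subset:
        hr -= 0.05
    if "B" in subset:
        hr += 0.02
    if "A" in subset and "C" in subset:
        hr -= 0.04
    return hr


estimates, n_iter = monte_carlo_shapley(["A", "B", "C"], toy_hr)
```

For this toy function the exact Shapley values are -0.07, 0.02 and -0.02 for A, B and C, and the estimates should agree with them to within a few multiples of the 0.001 tolerance. Each sampled ordering costs one hazard-ratio evaluation per criterion, so the total cost grows with the number of orderings needed for convergence rather than with the number of subsets.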

Trial Pathfinder reports the subset of criteria used by the original trial that have a Shapley value smaller than 0 as data-driven criteria. Once the data-driven subset of criteria was selected, Trial Pathfinder computed the number of eligible patients and the hazard ratio of the overall survival between the synthetic treatment and control arms.