A machine learning framework supporting prospective clinical decisions applied to risk prediction in oncology

We present a general framework for developing a machine learning (ML) tool that supports clinician assessment of patient risk using electronic health record-derived real-world data, and apply the framework to a quality improvement use case in an oncology setting: identifying patients at risk for a near-term (60-day) emergency department (ED) visit who could potentially be eligible for a home-based acute care program. Framework steps include defining clinical quality improvement goals, model development and validation, bias assessment, retrospective and prospective validation, and deployment in clinical workflow. In the retrospective analysis for the use case, 8% of patient encounters were associated with a high risk (pre-defined as predicted probability ≥20%) of a near-term ED visit by the patient. Positive predictive value (PPV) and negative predictive value (NPV) for future ED events were 26% and 91%, respectively. The odds ratio (OR) of an ED visit (high- vs. low-risk) was 3.5 (95% CI: 3.4–3.5). The model appeared to be calibrated across racial, gender, and ethnic groups. In the prospective analysis, 10% of patients were classified as high risk, 76% of whom were confirmed by clinicians as eligible for home-based acute care. PPV and NPV for future ED events were 22% and 95%, respectively. The OR of an ED visit (high- vs. low-risk) was 5.4 (95% CI: 2.6–11.0). The proposed framework for an ML-based tool that supports clinician assessment of patient risk is a stepwise development approach; we successfully applied it to an ED visit risk prediction use case.

Features (covariates) used by the model were selected through a collaborative, iterative process involving clinical experts and data scientists. An initial list of potential features was informed by clinician input and a review of the existing literature; data scientists, working with clinical experts, then evaluated which of these features were accessible in structured electronic health record (EHR) data and had acceptable completeness and prevalence.
All features used in the model were structured as binary variables. A list of features can be found in Supplemental Table 3. We included both time-varying and non-time-varying features as inputs to the model to accommodate the longitudinal nature of patient health and potential deterioration. This enables the model to learn relationships not only between clinical concepts and ED visits but also to account for the recency of those concepts relative to ED visits. Some features, such as race, ethnicity, and sex, were assumed not to change across all visits for a given patient. Others, such as diagnoses, were considered present for all visits on or after the appearance of the relevant ICD code. Features related to past visits, medications, or changes in vitals used clinically relevant time windows relative to the visit of interest. Examples of time-window features include whether the patient received an antiemetic order in the past 30 days and whether the patient had lost more than 5 pounds in the past 30 days.
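As a concrete illustration, look-back features of this kind can be computed with a simple date-window filter per encounter. The sketch below is ours, not the authors' pipeline; the table layout and column names (patient_id, visit_date, date, drug_class, weight_lb) are assumptions.

```python
import pandas as pd

def add_time_window_features(visits: pd.DataFrame,
                             orders: pd.DataFrame,
                             weights: pd.DataFrame) -> pd.DataFrame:
    """Attach binary 30-day look-back features to each prediction encounter."""
    out = visits.copy()

    def in_window(events: pd.DataFrame, row) -> pd.DataFrame:
        # Events for this patient in the 30 days before the visit of interest.
        start = row["visit_date"] - pd.Timedelta(days=30)
        return events[(events["patient_id"] == row["patient_id"])
                      & (events["date"] >= start)
                      & (events["date"] < row["visit_date"])]

    antiemetic, weight_loss = [], []
    for _, row in out.iterrows():
        recent_orders = in_window(orders, row)
        antiemetic.append(int((recent_orders["drug_class"] == "antiemetic").any()))

        recent_weights = in_window(weights, row).sort_values("date")
        if len(recent_weights) >= 2:
            w = recent_weights["weight_lb"]
            weight_loss.append(int(w.iloc[0] - w.iloc[-1] > 5))  # >5 lb net loss
        else:
            weight_loss.append(0)  # too few measurements to assess weight change

    out["antiemetic_order_30d"] = antiemetic
    out["weight_loss_gt5lb_30d"] = weight_loss
    return out
```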
The ML model was an L2-regularized logistic regression. We used k-fold cross-validation with 5 folds, selecting the level of regularization that maximized AUC.
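This selection procedure maps directly onto scikit-learn's LogisticRegressionCV, shown below as a minimal sketch with synthetic stand-in data; the paper does not publish its training code, so the variable names, grid size, and solver settings are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in data: 1,000 encounters, 20 binary features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 20))
y = rng.integers(0, 2, size=1000)  # stand-in 60-day ED-visit labels

# L2-regularized logistic regression; 5-fold CV picks the regularization
# strength (C) that maximizes fold-averaged AUC.
model = LogisticRegressionCV(
    Cs=10,              # grid of 10 inverse-regularization strengths (assumed grid size)
    cv=5,
    penalty="l2",
    scoring="roc_auc",
    solver="lbfgs",
    max_iter=1000,
)
model.fit(X, y)
risk_scores = model.predict_proba(X)[:, 1]  # predicted probability of an ED visit
```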

Calibration factor definition
We used the calibration factor as a summary statistic to evaluate calibration within each group of interest. The calibration factor was defined as follows. For each recorded visit i within a group of interest, the model calculates a risk score r_i between 0 and 1. We then observe an outcome o_i, either ED utilization or no ED utilization, encoded as 1 or 0, respectively. The calibration factor is then

cf = mean(r_i) − mean(o_i),

calculated over all visits and outcomes in that group. A calibration factor greater than 0 indicates that the model is over-predicting risk within the group, while a calibration factor less than 0 indicates that the model is under-predicting risk within the group.
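In code, the per-group calibration factor reduces to a difference of group means. Below is a minimal sketch with illustrative column names and toy values, not the authors' analysis code.

```python
import numpy as np
import pandas as pd

def calibration_factor(risk_scores, outcomes):
    """cf = mean(predicted risk) - mean(observed outcome); cf > 0 means the
    model over-predicts risk in the group, cf < 0 means it under-predicts."""
    return float(np.mean(risk_scores) - np.mean(outcomes))

# Per-group calibration across, e.g., racial, gender, or ethnic groups.
df = pd.DataFrame({
    "group":    ["A", "A", "B", "B"],
    "risk":     [0.30, 0.10, 0.25, 0.15],  # model risk scores r_i
    "ed_visit": [0,    0,    1,    0],     # observed outcomes o_i
})
means = df.groupby("group")[["risk", "ed_visit"]].mean()
cf_by_group = means["risk"] - means["ed_visit"]
print(cf_by_group)  # A: 0.20 (over-predicting); B: -0.30 (under-predicting)
```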

MI-CLAIM Checklist.
After the framework in this study was developed and the use cases were designed and implemented, an external group of leaders in the healthcare machine learning community published a guide for transparent reporting, the Minimum Information about Clinical Artificial Intelligence Modeling (MI-CLAIM) checklist. We have included the completed checklist (Supplemental Table 1), with our post-hoc assessment of the manuscript, as a supplement to promote transparency.

Supplemental table: model evaluation metrics (excerpt)
OR of ED visit (high-risk vs. low-risk patients): 3.5 (95% CI: 3.4–3.5) and 3.7 (95% CI: 3.7–3.8) for the two models, respectively.
Note: Prospective evaluation metrics are at the patient level; retrospective evaluation metrics for both models are calculated at the encounter level.
Abbreviations: ED, emergency department; AUC, area under the (receiver operating characteristic) curve; NPV, negative predictive value; PPV, positive predictive value.

Study design (Part 1). Completed: page number / Notes if not completed
- The clinical problem in which the model will be employed is clearly detailed in the paper. Yes; pg. 13
- The research question is clearly stated. Yes; pg. 12-13
- The characteristics of the cohorts (training and test sets) are detailed in the text. Yes; pg. 14-16
- The cohorts (training and test sets) are shown to be representative of real-world clinical settings. Yes; pg. 14-16
- The state-of-the-art solution used as a baseline for comparison has been identified and detailed. Other leading modeling approaches were considered at the time of model building; some results are in Supplemental Table 1.

Data and optimization (Parts 2, 3). Completed: page number / Notes if not completed

- The origin of the data is described and the original format is detailed in the paper. Yes; pg. 15, 16, 18, 19
- Transformations of the data before it is applied to the proposed model are described. Yes; pg. 16. Due to space constraints and IP protection, a full provenance of the data is omitted from the manuscript; the most relevant methods are described on pg. 16.
- The independence between training and test sets has been proven in the paper. Yes; pg. 17
- Details on the models that were evaluated and the code developed to select the best model are provided. Yes; pg. 14-17. Note: The de-identified data that support the findings of this study may be made available upon request and are subject to a license agreement; interested researchers should contact DataAccess@flatiron.com to determine licensing terms.
- Is the input data type structured or unstructured? Structured.

Model performance (Part 4). Completed: page number / Notes if not completed
- The primary metric selected to evaluate algorithm performance (e.g., AUC, F-score, etc.), including the justification for selection, has been clearly stated. Yes; pg. 14-19
- The primary metric selected to evaluate the clinical utility of the model (e.g., PPV, NNT, etc.), including the justification for selection, has been clearly stated. Yes; pg. 17-19
- The performance comparison between baseline and proposed model is presented with the appropriate statistical significance.

Prospective study time period: 01-04-2020 to 02-07-2020

Exclusion criteria:
a. Insufficient structured oncology data: first-ever oncology-related visit at the University of Utah is less than 90 days prior to the prediction encounter at the University of Utah (i.e., patients newly diagnosed with a cancer ICD code within 90 days prior to the prediction encounter will be excluded)
b. Inactive patient: has fewer than two oncology-related visits in the 90 days prior to the prediction encounter(s) at the University of Utah
c. Ineligible zip code for enrollment at Huntsman at Home: zip code is not within the 20-mile radius of the University of Utah (eligible zip code list provided by the Huntsman at Home team)

Abbreviations: ICD, International Classification of Diseases.
a The inclusion/exclusion criteria were the same for model training and the retrospective and prospective studies (with the exception of the eligibility time period).
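As an illustration only, the three exclusion criteria above can be expressed as row filters over an encounter-level table. The sketch below is our own; the column names and the eligible-zip-code list are hypothetical stand-ins for the actual data.

```python
import pandas as pd

def apply_exclusions(encounters: pd.DataFrame, eligible_zips: set) -> pd.DataFrame:
    """Drop encounters meeting exclusion criteria a-c."""
    days_of_history = (
        encounters["encounter_date"] - encounters["first_oncology_visit_date"]
    ).dt.days
    keep = (
        (days_of_history >= 90)                          # a. sufficient structured oncology data
        & (encounters["oncology_visits_past_90d"] >= 2)  # b. active patient
        & (encounters["zip_code"].isin(eligible_zips))   # c. zip code eligible for Huntsman at Home
    )
    return encounters[keep]
```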