Patient-specific COVID-19 resource utilization prediction using a fusion AI model

The strain on healthcare resources brought forth by the recent COVID-19 pandemic has highlighted the need for efficient resource planning and allocation through the prediction of future consumption. Machine learning can predict resource utilization, such as the need for hospitalization, based on past medical data stored in electronic medical records (EMR). We conducted this study on 3194 patients (46% male, mean age 56.7 (±16.8) years, 56% African American, 7% Hispanic) flagged as COVID-19 positive cases in 12 centers under the Emory Healthcare network from February 2020 to September 2020, to assess whether a COVID-19 positive patient's need for hospitalization can be predicted at the time of the RT-PCR test using the EMR data prior to the test. Five main modalities of EMR, i.e., demographics, medication, past medical procedures, comorbidities, and laboratory results, were used as features for predictive modeling, both individually and fused together using late, middle, and early fusion. Models were evaluated in terms of precision, recall, and F1-score (within 95% confidence intervals). The early fusion model is the most effective predictor, with an overall F1-score of 84% [CI 82.1–86.1]. The predictive performance of the model drops by 6% when using recent clinical data while omitting the long-term medical history. Feature importance analysis indicates that a history of cardiovascular disease, emergency room visits in the year prior to testing, and demographic factors are predictive of the disease trajectory. We conclude that fusion modeling using medical history and current treatment data can forecast the need for hospitalization for patients infected with COVID-19 at the time of the RT-PCR test.


INTRODUCTION
Multiple waves of SARS-CoV-2 virus infections threaten to overwhelm the healthcare system1. A third of all hospitalized COVID-19 patients require admission and management in an intensive care unit (ICU)2 to manage complications like acute respiratory distress syndrome (ARDS), secondary sepsis, and multi-organ failure3. Predictors of poor outcome and need for assisted ventilation include clinical and laboratory markers like D-dimer levels and SOFA score, and demographic features such as older age and ethnicity3. Currently, there is no quantitative criterion that combines clinical and laboratory-based markers to predict the likely level of care required for a given patient at the time of COVID-19 testing. Such a predictive model would allow resource planning by understanding potential hospitalization requirements, especially as testing is distributed out of hospitals.
Much of the literature regarding predictive modeling for COVID-19 patients deals with either mortality prediction4–6 or analysis of risk factors for mortality7. Instead, our work focuses on predicting the probability of future hospitalization at the time of COVID-19 testing (Fig. 1a). This is in contrast to several recent papers that focus on critical event prediction, such as ICU admission8 and mechanical ventilation9, at the time of presentation to the emergency department. A recently published systematic review of COVID-19-related prediction models10 mentions only three studies related to hospitalization risk prediction (see Supplementary Note 1 for detailed limitations of previously published studies). The major limitation of the existing work, including the studies mentioned by Wynants et al.10, is the use of a narrow feature selection based on expert opinion or published literature5,6,8,9,11–13. We overcome this limitation by training multiple machine learning architectures, including a multi-branched deep dense network, for the targeted prediction task, using all the data captured in the electronic medical record (EMR) prior to COVID-19 infection. We use an interval-based feature representation for medications, comorbidities, past procedures, and laboratory results to ensure that information collected at different time intervals is given due importance by our predictive models. Rather than pre-selecting features, we include as many EMR variables as possible, filtering features with automatic methods while relying on experts to provide an intuitive representation or group structure for large feature sets. We evaluate the predictive performance of each part of the EMR data (demographic information, medication, past procedures, comorbidities, and laboratory results) as well as multiple fusion models that integrate the feature space14.

Performance of fusion models
Table 1 reports the class-wise and aggregated (weighted average) precision, recall, and F-score15, as well as the 95% confidence interval, for distinguishing between hospitalization and self-isolation on a held-out set of 569 unique patients. We compare the performance of our fusion models against the performance of individual source classifiers. Results demonstrate that fusing multiple data sources from the EMR increases performance beyond that of any individual source. Early fusion is the best performing model, with an overall F1-score of 84% [CI 82.1–86.1] and an F1-score of 85% for classifying patients who will need hospitalization within 7 days of RT-PCR testing. The late fusion (83% F1-score) and middle fusion (82% F1-score) models also come very close to the performance of the early fusion model.
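To illustrate how the fusion strategies differ structurally, the sketch below contrasts early fusion (concatenating per-source feature vectors before a single classifier) with late fusion (independent per-source classifiers whose probabilities are combined by a meta-learner). The feature names, linear scorers, and weights are hypothetical toy stand-ins; the paper's actual models are XGBoost and a multi-branched deep dense network.

```python
import math

# Toy EMR "sources": each returns a feature vector for one patient.
# Feature names are hypothetical examples, not the paper's feature set.
def demographics_features(p):
    return [p["age"] / 100.0, 1.0 if p["male"] else 0.0]

def lab_features(p):
    return [p["rbc_abnormal"], p["d_dimer_abnormal"]]

def early_fusion_features(p):
    # Early fusion: concatenate every source's features into one vector,
    # then train a single classifier on the combined representation.
    return demographics_features(p) + lab_features(p)

def linear_scorer(weights):
    # Stand-in for a trained per-source classifier:
    # logistic squash over a dot product.
    def predict_proba(x):
        return 1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(weights, x))))
    return predict_proba

def late_fusion_proba(p, source_models, meta_weights):
    # Late fusion: each source model emits a probability independently;
    # a meta-learner combines the per-source probabilities into a final score.
    probs = [model(feats(p)) for feats, model in source_models]
    return 1.0 / (1.0 + math.exp(-sum(w * q for w, q in zip(meta_weights, probs))))
```

Middle fusion sits between the two: intermediate learned representations from each source branch (rather than raw features or final probabilities) are concatenated before the final layers.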
Our EMR dataset is divided into a 'current' interval (the 15 days before the COVID-19 test) and a 'history' interval (data from 1 year before the test, excluding the 15 days of the current interval). Information from the 'history' interval is evidently crucial for future hospitalization prediction, as the performance of fusion models without the 'history' interval drops by an average of 6 F1-score points (±1 std) relative to models with both 'current' and 'history' intervals.
The receiver operating characteristic (ROC) and precision-recall (PR) curves are shown in Fig. 2a. Early (AUROC 0.91 & AUPRC 0.90), late (AUROC 0.88 & AUPRC 0.87), and middle (AUROC 0.87 & AUPRC 0.87) fusion achieve much higher area under the ROC curve (AUROC) and area under the precision-recall curve (AUPRC) than the individual source classifiers. Interestingly, models trained on comorbidities coded as ICD9/10 and on procedures performed on the patients also present high performance.
We also performed calibration analysis of the three fusion models. Figure 2d shows calibration curves along with Brier scores for each model after calibration through isotonic regression. The early fusion model not only performs the best but is also the most reliable model, with the lowest Brier score after calibration. While the calibrated middle fusion model tends to underestimate the positive class (risk of hospitalization), the late fusion model swings between over- and underestimation, with strong overestimation in the upper quadrant.
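Isotonic calibration fits a nondecreasing map from classifier score to empirical probability, classically via the pool-adjacent-violators (PAV) algorithm, and the Brier score is the mean squared error between predicted probabilities and binary outcomes. A minimal sketch of both on toy data (not the paper's implementation, which would typically use a library routine such as scikit-learn's):

```python
def pav_calibrate(scores, labels):
    # Pool-adjacent-violators: sort by score, then repeatedly merge adjacent
    # blocks whose means violate monotonicity; each example's calibrated
    # probability is its block's mean label.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block is [label_sum, count]
    for i in order:
        blocks.append([float(labels[i]), 1])
        # merge while the previous block's mean exceeds the last block's mean
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    calibrated = [0.0] * len(scores)
    for pos, i in enumerate(order):
        calibrated[i] = fitted[pos]
    return calibrated

def brier_score(labels, probs):
    # Mean squared difference between predicted probability and outcome;
    # lower is better, 0 is perfect.
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)
```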
We present the performance of the early fusion model stratified by race and ethnicity, gender, and age in Fig. 3a–c, respectively. In terms of race and ethnicity, the model performs equally well for all patients, with a small drop in performance for the Hispanic population that is likely an artifact of the smaller number of evaluation samples (see Supplementary Note 4 for details). A similar performance drop is observed for male patients. In terms of age, our model achieves balanced performance across most age ranges, except for the less-than-30-years category, where the model performs better. The generally healthier disposition of these patients may account for this difference.

Feature importance
We investigated the interpretability of our best performing models, i.e., the early and late fusion models, in terms of the feature importance assigned to input features. The top features are shown as bar plots in Fig. 2b (early fusion) and Fig. 2c (late fusion), where we used 10-fold cross-validation to compute average feature weights; standard deviation is shown as error bars. In the early fusion model, abnormal red blood cell counts, the D-dimer test, a history of hypertensive disease, and previous emergency room encounters are most informative for predicting hospitalization for patients with COVID-19. Demographic factors such as race and ethnicity (Black and Hispanic) as well as being male have high importance in prediction. Following a similar trend to the early fusion model, individual predictions using CPT and ICD data had higher weights in the late fusion meta-learner. Feature importance for the individual source models is presented in Supplementary Note 4 and is consistent with the literature3,16–18: (1) comorbidities related to the lungs and urinary system seem to be important for the classifier based on comorbidities, and (2) treatment of thyroid-related diseases is given the highest importance by the medications-based classifier.
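The averaging behind those bar plots can be sketched as follows: importance vectors from models refit on each of the 10 folds are averaged per feature, with the standard deviation across folds giving the error bars. The fold values and feature names below are toy examples, not the study's numbers.

```python
from statistics import mean, stdev

def summarize_importances(fold_importances, feature_names):
    # fold_importances: one importance vector per CV fold (e.g., from a
    # gradient-boosted model's per-feature importance attribute, refit per fold).
    # Returns {feature: (mean, std)} -> bar height and error-bar length.
    summary = {}
    for name, vals in zip(feature_names, zip(*fold_importances)):
        summary[name] = (mean(vals), stdev(vals))
    return summary
```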

DISCUSSION
In this study, we developed a multimodal fusion AI model from demographics, medications, laboratory tests, CPT, and ICD codes documented in the EMR to predict the severity of COVID-19 at the time of testing, and whether a COVID-19 patient will need hospitalization within 7 days of the RT-PCR test. This is in contrast to existing COVID-19 prediction models that employ medical information at the time of presentation to the hospital and predict an event between 24 h and 7 days into the future5,6,8,9,19,20. Our models rely on patients' health records from the year prior to testing. This enables our model to provide input to a dashboard that forecasts the utilization of hospital and ICU beds at the time of COVID-19 testing. As national efforts for testing scale up, such a model can be used to further assign patients the level of monitoring they will need based on their risk of disease progression. As mentioned by Wynants et al.10, predictive models should serve a clinical need and use a representative patient set. We have been careful to achieve both goals: we used RT-PCR testing as a criterion to select a representative set of COVID-19 patients, and our model serves the clinical need of healthcare resource demand projection.
From a technical perspective, existing predictive models include logistic regression4,12,21, Lasso13,19, XGBoost5, Random Forest6,8, convolutional neural networks22, and semantic word embedding models20,22. We experimented with various classification models and found XGBoost and a multi-branched deep dense network to be the most suitable. The technical novelty lies in a thorough exploration of vast and heterogeneous feature spaces, the handling of information collected over long time periods, and their intuitive fusion with minimal expert supervision.
A review of feature importance provides insight for future research and feedback from the community on the significance of various predictors of the COVID-19 disease trajectory. For example, several papers have been published on disparate outcomes based on race and ethnicity, with more deaths observed in Black and Hispanic patients23,24. When only demographics are used in the model, the F1-score is lower (69% versus 84% for the early fusion model), which could potentially be explained by other factors. Inflammatory marker laboratory levels like procalcitonin, ferritin, and lactate, noted to be important for COVID-19 care, are not routinely collected in care and hence are not represented among the top laboratory markers in our patient cohort. Our models show that the immediate pre-testing period is an important predictor of COVID-19 severity and the need for hospitalization, especially when patients have recently started anticoagulation, thyroid, or respiratory medications. Moreover, the complete blood count has the highest feature importance; to our knowledge, the complete blood count has not previously been linked to the COVID-19 disease course.
Our study has important limitations. First, the models were trained on a population of patients who were cared for in a highly integrated academic healthcare system with a 56.4% African American and 2% Asian population; the models may not perform well in a different patient demographic or health system. Second, the number of patients for training and validation is limited, given that we only consider patients with RT-PCR tests before September 2020. The limited number of patients with sparse data makes the modeling problem challenging. Even though early fusion results in the best prediction, statistical metrics (precision, F1-score) indicate that the late and middle fusion results are very similar (p < 0.05, see Supplementary Note 4). We believe that middle fusion with consistent backpropagation may generate the optimal result with larger training data.

Cohort description
With the approval of the Emory Institutional Review Board (IRB), we collected all the EMR data from all patients flagged as COVID-19 positive (ICD-10 diagnosis code U07.1, plus codes for symptoms or notes in the record) in 12 different facilities in Emory University Healthcare (EUH). Since only deidentified data were used, the IRB waived the requirement of informed consent by the patients. Between January and September 2020, there were 3194 such patients. We collected PCR testing information available from all EUH facilities and found that 3120 of 3194 patients had at least one positive PCR test for COVID-19; the remaining patients either had no positive test or had missing test results. We collected all hospitalization (admission/discharge) data for COVID-19 positive patients from January 2020. We carefully examined the data to identify patients who were admitted to the hospital after testing positive for COVID-19, while excluding hospitalization unrelated to COVID-19 (i.e., hospitalization more than 7 days after RT-PCR testing). Figure 1a shows the overall architecture of our model, including possible outcomes. Figure 1b shows the inclusion and exclusion criteria for selecting patients who were hospitalized or not hospitalized after COVID-19 testing. We found 1504 patients who were hospitalized with a COVID-19 diagnosis and 1340 patients who were not hospitalized. The rest had irreconcilable hospitalization records: such hospitalization may be unrelated to COVID-19, or the patient self-quarantined early after testing but later had to be admitted to hospital (more than 7 days after testing), indicating progression of the disease. Of the 1504 patients admitted to the hospital, 365 patients were later admitted to the ICU while the remainder stayed in a regular inpatient ward. Table 2 highlights the overall characteristics of our patient population, including comorbidities, and Fig. 4a–c shows the common comorbidities in our patient population for different age groups and the correlation between race, ethnicity, and comorbidities.

Table 1. Performance of binary classification models with hospitalization and non-hospitalization as the two targets, in terms of class-wise and aggregated (weighted average) precision, recall, and F-score; C.I. (95% confidence) was computed using bootstrapping over 1000 iterations with random samples.
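The bootstrapped confidence intervals reported in Table 1 can be sketched as follows: resample the held-out predictions with replacement, recompute the metric each time, and take the 2.5th and 97.5th percentiles of the resulting distribution. The F1 implementation and labels below are illustrative, not the paper's code.

```python
import random

def f1_score(y_true, y_pred):
    # Binary F1 for the positive (hospitalization) class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample (y_true, y_pred) pairs with replacement
    # n_boot times and return the (alpha/2, 1 - alpha/2) percentiles.
    rng = random.Random(seed)
    n = len(y_true)
    vals = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        vals.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    vals.sort()
    lo = vals[int((alpha / 2) * n_boot)]
    hi = vals[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```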
We aim to develop an AI model to help plan healthcare resource needs for each COVID-19 patient by predicting the need for hospitalization at the time the patient takes an RT-PCR test (Fig. 1a). Our predictive models employ retrospective EMR data prior to COVID-19 testing, including diagnoses, prescribed medications, laboratory test results, and demographics collected over the year prior to the test. In our dataset, such information is available from January 2019 to September 2020. Our study complies with the TRIPOD25 guidelines for reporting. The cohort and models are described in the following sub-sections; performance is reported in the Results section, and the interpretation of results and the limitations of our approach are detailed in the "Discussion" section.

Handling temporal EMR data
Since the clinical encounter data were generated over a period of more than a year, it is important for the model to be able to differentiate between, and put justifiable emphasis on, recent versus historical medical information. However, the COVID-19 pandemic resulted in a scenario where patients may have their first healthcare encounter due to the infection, with very little past medical history. The resulting EMR data are therefore very sparse, and a finer time-interval division results in a prohibitively large fraction of missing data values. To handle such missing data while still achieving temporal distinction between information, we divide the EMR data for each patient into two intervals, i.e., current and history (Fig. 1a). The current interval includes all information collected between 24 h and 15 days before the RT-PCR test. The history interval includes all information collected prior to the current interval. We experimented with several temporal data splitting schemes, including weekly, monthly, and quarterly splits; the sparsity of the data renders most of these splits suboptimal for modeling. We observed that the above-mentioned scheme of current and history intervals suffices for distinguishing EMR information on the temporal axis for the given problem while avoiding insurmountable data sparsity.
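A minimal sketch of this two-interval representation, assuming EMR events arrive as (date, code) pairs and that per-code counts serve as the aggregated feature (the study's actual feature encoding may differ):

```python
from datetime import date
from collections import Counter

def interval_features(events, test_date):
    # Bucket each coded EMR event into a 'current' bin (15 days up to 24 h
    # before the RT-PCR test) or a 'history' bin (the rest of the year before
    # the test); events outside the one-year lookback are ignored.
    # Event codes here are hypothetical examples.
    current, history = Counter(), Counter()
    for event_date, code in events:
        days_before = (test_date - event_date).days
        if 1 <= days_before <= 15:       # current interval
            current[code] += 1
        elif 15 < days_before <= 365:    # history interval
            history[code] += 1
    return current, history
```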

Fig. 1
Fig. 1 Study design. a Proposed AI model decision point, showing the prediction for two patients with distinct outcomes. b CONSORT diagram of the cohort selection process, including decision nodes and the number of excluded cases.

Fig. 2
Fig. 2 Statistical analysis of the models. a PR (left) and ROC (right) curves for models distinguishing between self-isolation and hospitalization outcomes. Each colored line represents a separate model, and the color scheme is consistent between the PR and ROC curves. Feature importance from (b) the early fusion model, showing the importance of the top 25 individual EMR data components, and (c) the late fusion model, showing the importance of the individual EMR data sources. The standard deviation bars (red) are generated via 10-fold cross-validation on the training data. d Calibration curves for the early, late, and middle fusion models, along with Brier scores for each calibrated model.

Fig. 3
Fig. 3 Performance stratification of the best performing model, based on early fusion. a Stratification based on race and ethnicity, b stratification based on gender, c stratification based on age.

Fig. 4
Fig. 4 Patient characteristics as heatmaps. Heatmaps of a common comorbidities in our patient population according to different age groups, b the relation between race and comorbidities, c the relation between ethnic group and comorbidities. Values are represented as percentages, and darker colors represent higher values.