Machine learning risk estimation and prediction of death in continuing care facilities using administrative data

In this study, we aimed to identify the factors that were associated with mortality among continuing care residents in Alberta, during the coronavirus disease 2019 (COVID-19) pandemic. We achieved this by leveraging and linking various administrative datasets together. Then, we examined pre-processing methods in terms of prediction performance. Finally, we developed several machine learning models and compared the results of these models in terms of performance. We conducted a retrospective cohort study of all continuing care residents in Alberta, Canada, from March 1, 2020, to March 31, 2021. We used a univariable and a multivariable logistic regression (LR) model to identify predictive factors of 60-day all-cause mortality by estimating odds ratios (ORs) with a 95% confidence interval. To determine the best sensitivity–specificity cut-off point, the Youden index was employed. We developed several machine learning models to determine the best model regarding performance. In this cohort study, increased age, male sex, symptoms, previous admissions, and some specific comorbidities were associated with increased mortality. Machine learning and pre-processing approaches offer a potentially valuable method for improving risk prediction for mortality, but more work is needed to show improvement beyond standard risk factors.


Variables of interest
In alignment with a previous study [33], comorbidities were assessed based on historical data from the two years preceding the index date. We also completed a one-year lookback before episodes of infection for the number of admissions, procedures, and special care unit (SCU) visits to establish a baseline of healthcare utilization [33].
The primary outcome in our study was all-cause mortality within 60 days of a resident's first COVID-19 polymerase chain reaction (PCR) positive or negative test result. According to an early study [34], there is a time lag of two to eight weeks (up to about 60 days) between COVID-19 cases (index date) and death. We used all-cause mortality data [20] within 60 days of a positive or negative COVID-19 test, as this information was available in vital statistics.
The covariates were selected based on clinical expertise and on previous literature examining the association of age and the Elixhauser comorbidity index [35], as well as prior admissions [36], with severe COVID-19 outcomes. The selected covariates therefore encompassed age, demographic characteristics (sex, LTC vs DSL resident, and specimen collection location), comorbidities (based on the Elixhauser index), the number of previous procedures (inpatient and outpatient), the number of previous admissions (hospital and ICU), symptomatic status, the year and month of specimen collection, and the result of the PCR test.

Statistical analysis
Missing values were addressed by imputation from other relevant columns, as illustrated in the flowchart in Appendix 3. For example, missing values in the "Symptomatic during collection" feature were imputed using the corresponding test results. We also addressed outliers within the dataset by removing a small number of records (one record from a 17-year-old individual and five with unidentified sex), which together constituted only 0.02% of all records.
The 17-year-old individual was removed so that the data accurately represent the target population, which includes individuals older than 18 years of age. Individuals of unidentified sex were excluded to address potential errors in data collection or recording and to improve the overall quality and accuracy of the dataset. These steps are illustrated in the flowchart in Appendix 3. Descriptive statistics were used to describe the characteristics of the cohort: frequencies and percentages for categorical variables, means with standard deviations (SD) for normally distributed continuous variables, and medians with interquartile ranges (IQR) for skewed continuous variables.
We aimed to directly estimate the odds [37] of mortality. Given a binary outcome variable, we chose standard logistic regression (LR) [38] over modified Poisson regression, as LR is more suitable for estimating odds in binary outcomes. Univariable LR was used to identify individual predictive factors of 60-day mortality by estimating odds ratios (ORs) [39] with 95% confidence intervals (CIs). A multivariable LR was applied to examine the joint association of all risk factors with 60-day mortality, reported as adjusted ORs (aORs) with 95% CIs.
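To make the odds ratio and its confidence interval concrete, the following is a minimal sketch for the simplest case, a single binary risk factor, where the univariable LR estimate coincides with the odds ratio of a 2x2 table and the Wald interval is built on the log-odds scale. The function name and the counts are hypothetical and are not taken from the study data.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: exposed, died        b: exposed, survived
    c: unexposed, died      d: unexposed, survived
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only.
or_value, ci_low, ci_high = odds_ratio_ci(30, 70, 20, 80)
print(f"OR = {or_value:.2f}, 95% CI {ci_low:.2f}-{ci_high:.2f}")
```

An interval that excludes 1 corresponds to a statistically significant association at the matching alpha level. Adjusted ORs from the multivariable model require fitting the full regression and are not reproduced by this sketch.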
To validate the predictive models, the data were randomly split into a training set (90% of the sample) and a test set (10% of the sample) [40]. The forward selection method was used to build the predictive models by iteratively adding features (predictor variables) [41]. We used sensitivity as the measure for deciding which predictors to include in the model.
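The random split and the forward selection loop described above can be sketched as follows. This is an illustration under stated assumptions, not the study's code: `score` is a hypothetical callable that would fit a model on the given features and return its sensitivity, and here it is replaced by a toy function so the example runs on its own.

```python
import random


def train_test_split(rows, test_fraction=0.10, seed=0):
    """Randomly split rows into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def forward_selection(candidates, score):
    """Greedy forward selection: add one feature at a time while the score improves."""
    selected, best = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        # Score every model that adds exactly one remaining feature.
        trials = [(score(selected + [f]), f) for f in remaining]
        trial_best, feature = max(trials, key=lambda t: t[0])
        if trial_best <= best:
            break  # no remaining feature improves the metric
        selected.append(feature)
        remaining.remove(feature)
        best = trial_best
    return selected, best


# Toy stand-in for "fit the model and return sensitivity" (hypothetical values).
toy_gain = {"age": 0.30, "sex": 0.10, "noise": -0.05}


def toy_score(features):
    return sum(toy_gain[f] for f in features)


train, test = train_test_split(range(100))
selected, best = forward_selection(toy_gain, toy_score)
print(len(train), len(test), selected)
```

In practice the score would be computed on held-out data within the training set, so that the test set stays untouched until the final report.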
While useful for evaluating standard LR models, the AUC, sensitivity, and specificity values do not explicitly identify the best cut points [42]. The Youden index (J) method was proposed to identify ideal cut points [43]. Youden's index is the difference between the true positive rate and the false positive rate, evaluated across all potential cut-point values; the optimal cut point is the one that maximizes this difference [44, 45].
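A minimal sketch of the Youden index search is given below. It scans every observed predicted probability as a candidate threshold and keeps the one that maximizes J = sensitivity + specificity - 1, which equals the true positive rate minus the false positive rate. The labels and probabilities are made up for illustration.

```python
def youden_cutoff(y_true, y_prob):
    """Return (threshold, J) for the threshold that maximizes Youden's J."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best_threshold, best_j = None, float("-inf")
    for threshold in sorted(set(y_prob)):
        # A case is predicted positive when its probability reaches the threshold.
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
        j = tp / positives - fp / negatives  # TPR - FPR
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Hypothetical labels (1 = died) and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.02, 0.05, 0.07, 0.10, 0.12, 0.30, 0.40, 0.90]
threshold, j = youden_cutoff(y_true, y_prob)
print(threshold, round(j, 3))
```

With a rare outcome, predicted probabilities cluster well below 0.5, so the J-optimal threshold is typically much lower than the default of 0.5.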
In this study, the pre-processing techniques included random over-sampling to balance the classes and power transformation (PT) to normalize the data [17, 46-48]. We employed power transformation to address non-normality and skewness in the data distribution, which normalized the data [49]. This transformation improved model performance in our analysis.
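The text does not state which power transformation was used. As an illustration only, the sketch below implements the Yeo-Johnson transform, one common choice that also handles zero and negative values; treating it as the study's method is an assumption. The parameter `lam` would normally be estimated from the training data (for example by maximum likelihood), which is not shown here.

```python
import math


def yeo_johnson(x, lam):
    """Yeo-Johnson power transform of a single value x with parameter lam."""
    if x >= 0:
        if abs(lam) < 1e-12:
            return math.log(x + 1.0)
        return ((x + 1.0) ** lam - 1.0) / lam
    # Negative values use the mirrored form with exponent (2 - lam).
    if abs(lam - 2.0) < 1e-12:
        return -math.log(-x + 1.0)
    return -(((-x + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


# Hypothetical right-skewed counts (e.g. number of previous admissions).
counts = [0, 0, 1, 1, 2, 3, 8, 20]
transformed = [yeo_johnson(c, 0.0) for c in counts]  # lam = 0 is log(x + 1)
print([round(v, 2) for v in transformed])
```

With `lam = 1` the transform is the identity, and values of `lam` below 1 compress the right tail, which is what reduces skewness in count-like features.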
Over-sampling techniques have been reported to outperform under-sampling methods because they address data imbalance without losing valuable information, ultimately leading to improved model performance [50]. The random over-sampling technique (ROTE) and the synthetic minority over-sampling technique (SMOTE) [51, 52] were therefore applied separately to the training set. The test set was kept untouched for the final performance report. Four LR models were evaluated: initial data with a 0.5 threshold (model 1), initial data with a 0.083 threshold (model 2), SMOTE with a 0.46 threshold (model 3), and ROTE with a 0.42 threshold (model 4).
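Random over-sampling, as described above, can be sketched in a few lines: minority-class rows are duplicated at random until every class is as large as the majority class. The function name and the toy data are hypothetical. SMOTE differs in that it creates new synthetic minority rows by interpolating between neighbouring minority rows rather than copying existing ones, and it is not shown here.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Sample with replacement from the existing rows of this class.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical training set with a 9-to-1 class ratio (0 = survived, 1 = died).
rows = [[age] for age in (70, 72, 75, 78, 80, 81, 84, 86, 88, 93)]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

The resampling is applied to the training set only, so that the test set still reflects the original class ratio.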

Model development
Along with models 1-4 (LR models with pre-processing done), ML models, including random forest (RF), support vector machine (SVM), XGBoost, and artificial neural network (ANN), were developed and tested to predict the 60-day incidence of mortality [13-15, 18]. An alpha level of 0.05 (two-tailed) was used to assess statistical significance. We used the Advanced Research Computing (ARC) cluster at the University of Calgary [53] to submit several jobs in parallel. Each job had a unique input for hyperparameter optimization. About 200,000 experiments were performed to find the best hyperparameters for all the ML models. The best results obtained with these hyperparameters are presented in Appendix 4. These models were examined using normalized data and balanced classes. Data analysis and model development were performed in the Python programming language (version 3.9); the packages used are listed in Appendix 5. The Bourne-again shell (Bash), a popular scripting language for coordinating shell tasks, was used to manage the resources in ARC [54].
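One way to give each parallel job a unique input is to enumerate a hyperparameter grid and deal the combinations out across jobs. The sketch below illustrates that idea only; the search space, its parameter names, and the number of jobs are hypothetical, and the actual search strategy and job layout used on ARC are not described in the text.

```python
import itertools


def build_grid(space):
    """Expand a dict of value lists into a list of hyperparameter dicts."""
    keys = sorted(space)
    combos = itertools.product(*(space[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]


def split_into_jobs(items, n_jobs):
    """Deal items round-robin into n_jobs lists, one list per cluster job."""
    return [items[i::n_jobs] for i in range(n_jobs)]


# Hypothetical search space for a neural network.
space = {
    "hidden_layers": [1, 2, 3],
    "learning_rate": [0.001, 0.01],
    "units": [16, 32],
}
grid = build_grid(space)
jobs = split_into_jobs(grid, n_jobs=5)
print(len(grid), [len(job) for job in jobs])
```

Each job's list could then be written to its own input file and passed to a separately submitted script.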
Several models were examined in this study. The first was the RF model [55], an ML algorithm that works by majority vote over several decision trees, each built on randomly selected variables in a dataset [56]. Second, SVM classifies by using a multidimensional hyperplane (to maximize the margin between the classes) and a nonlinear function (kernel) [57]. Third, XGBoost was appealing because, compared with the other classifiers, its default options had strong average performance [58]. XGBoost is a decision-tree ensemble algorithm that uses a gradient-boosting framework. Finally, ANN is becoming a common ML model in the field of health care [59-64]. This model has at least three layers: input (receives the features), hidden (extracts the patterns based on weights), and output (presents the output) [59]. Each layer has nodes, or neurons, that are connected to units in adjacent layers by a set of adjustable weights [62]. Like LR, ANN models use activation functions (such as the sigmoid) to produce the output [65].
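The layered structure of an ANN, and its relationship to LR, can be illustrated with a forward pass written out by hand. This is a sketch with made-up weights, not the architecture from Appendix 11: each layer computes a weighted sum of its inputs plus a bias and applies the sigmoid. A network with no hidden layer and a single output node reduces to the LR prediction formula.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer with sigmoid activation.

    weights holds one row of input weights per node in the layer.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(features, layers):
    """Pass the features through each (weights, biases) layer in turn."""
    activations = features
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs -> 2 hidden nodes -> 1 output node.
layers = [
    ([[0.5, -0.2], [0.1, 0.4]], [0.0, -0.1]),  # hidden layer
    ([[1.2, -0.7]], [0.05]),                   # output layer
]
risk = forward([0.8, 0.3], layers)[0]
print(round(risk, 3))
```

Training adjusts the weights and biases to reduce prediction error; only the prediction step is shown here.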

Results
Among all LTC residents who were tested for COVID-
The horizontal bar charts in Appendix 8 illustrate the multivariable associations between clinical risk factors and 60-day all-cause mortality in continuing-care residents in Alberta. In our univariable and multivariable analyses between resident characteristics and 60-day mortality, the following associations were observed (Table 2).
(1) Sociodemographic characteristics: In both analyses, older age was associated with a higher probability of 60-day mortality. The odds of death increased by a factor of 1.05 per one-year increase in age [95% CI 1.04-1.05] (p < 0.01). Concerning other demographic risk factors, men had a higher risk than women for  2). The OR and aORs of the most prevalent comorbidities are listed in Appendix 9.
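As a worked illustration of the per-year odds ratio for age reported above, and assuming the log-odds are linear in age (the usual reading of a coefficient for a continuous predictor in LR), the odds ratio for a k-year age difference is the per-year OR raised to the power k. The symbol \beta_{\text{age}} is introduced here for illustration. Because the reported values are rounded to two decimals, the ten-year figures are approximate:

```latex
\[
\mathrm{OR}_{k} = e^{k\,\beta_{\text{age}}} = \left(\mathrm{OR}_{1}\right)^{k},
\qquad
\mathrm{OR}_{10} \approx 1.05^{10} \approx 1.63,
\qquad
95\%\ \text{CI} \approx \left[\,1.04^{10},\ 1.05^{10}\,\right] \approx [\,1.48,\ 1.63\,]
\]
```

The upper bound coincides with the point estimate only because of rounding in the reported per-year values.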

Model performance
The univariable comparison in Appendix 10 shows that the ORs for the individual features of the initial data (without balancing) and of the data with balanced classes (using ROTE and SMOTE) were almost the same. After examining the standard LR models (models 1-4), we found the cut-off point to be the critical weakness of the predictive analytics [66]. Identifying the ideal sensitivity-specificity cut-off point using Youden's index improved performance: the sensitivity (from 6 to 77%) and AUC (from 53 to 71%) of the LR model increased when the optimal cut-point value was used (Table 4). Model 4 (LR model with the PT technique [67] and ROTE) achieved slightly better sensitivity (a 1% increase) than model 2 (LR model without pre-processing techniques).
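The text does not say how the AUC values in Table 4 were computed. An AUC computed from predicted probabilities does not depend on the cut-off, so one plausible reading, and it is only an assumption, is that the AUC was computed from thresholded (0/1) predictions. In that case the ROC curve has a single interior point and the area reduces to the mean of sensitivity and specificity, which does change with the threshold. The sketch below shows this on made-up data.

```python
def threshold_metrics(y_true, y_prob, threshold):
    """Sensitivity, specificity and single-threshold AUC for 0/1 predictions.

    Assumes both classes are present in y_true.
    """
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Area under the ROC polyline (0,0) -> (FPR, TPR) -> (1,1).
    auc_single_point = (sensitivity + specificity) / 2
    return sensitivity, specificity, auc_single_point


# Hypothetical labels (1 = died) and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.02, 0.05, 0.07, 0.10, 0.12, 0.30, 0.40, 0.90]
for threshold in (0.5, 0.12):
    sens, spec, auc = threshold_metrics(y_true, y_prob, threshold)
    print(threshold, round(sens, 2), round(spec, 2), round(auc, 2))
```

Under this reading, lowering the threshold trades some specificity for a large gain in sensitivity, and the single-threshold AUC rises as long as the gain outweighs the loss.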
Overall, using optimal cut-off points together with pre-processing methods improved the sensitivity (from 6 to 78%) and AUC (from 53 to 71%) of the LR model (Table 4). We also examined four different ML models: RF, SVM, XGBoost, and ANN. Among these, an ANN model with three hidden layers (described in Appendix 11) achieved the best performance, with a sensitivity of 82% and an AUC of 73%. Appendix 12 presents horizontal bar charts of the sensitivity and AUC values of all the models discussed in our study.

Discussion
The primary outcome variable in this study was mortality within 60 days of a resident's first COVID-19 test. We aimed to identify factors associated with mortality among continuing care residents in Alberta during the COVID-19 pandemic and develop machine learning models to improve risk prediction for mortality. In this study, we employed a diverse range of databases, including EDW, ACCIS, NACRS, Claims, DAD, and AVS.
We identified several characteristics associated with mortality, including advanced age, male sex, metastatic cancer, chronic liver disease, having symptoms, and previous SCU and hospital admissions. Combining these factors gave better risk estimation in this population than considering them individually. Our findings were aligned with previous research, as increased age, male sex, previous intensive care unit (ICU) admissions, and chronic conditions such as cancer and chronic liver disease have already been recognized as risk factors for mortality in ill patients with COVID-19.
The outcome variable of a study conducted by Panagiotou et al. [9] was death due to any cause within 30 days of a resident's first positive COVID-19 polymerase chain reaction test result. The study leveraged unique electronic medical record data and other clinical data from a large multistate sample of US nursing homes. The study found that increased age, male sex, impaired cognitive and physical function, diabetes, chronic kidney disease, fever, shortness of breath, tachycardia, and hypoxia were independently associated with mortality in US nursing home residents with COVID-19. Understanding these risk factors can aid in the development of clinical prediction models of mortality in this population. Both our study and Panagiotou et al. [9] identified increased age and male sex as risk factors for mortality in COVID-19 patients. Both studies also highlighted previous ICU admissions and chronic conditions (e.g., cancer and chronic liver disease) as risk factors. However, Panagiotou et al. [9] focused on US nursing home residents, while our study may have a broader population.
The primary outcome variable was 28-day in-hospital mortality in a study done by Gupta et al. [68]. Other outcome variables included discharge from the hospital and remaining hospitalized at the end of the study follow-up. The study assessed 2215 adults with laboratory-confirmed COVID-19 who were admitted to intensive care units (ICUs) at 65 hospitals across the US from March 4 to April 4, 2020. The study identified demographic, clinical, and hospital-level risk factors that may be associated with death in critically ill patients with COVID-19. Factors independently associated with death included older age, male sex, higher body mass index, coronary artery disease, active cancer, and the presence of hypoxemia, liver dysfunction, and kidney dysfunction at ICU admission. Patients admitted to hospitals with fewer ICU beds had a higher risk of death. Hospitals varied considerably in the risk-adjusted proportion of patients who died and in the percentage of patients who received hydroxychloroquine, tocilizumab, and other treatments and supportive therapies. Our study and Gupta et al. [68] both found older age and male sex to be risk factors for mortality in COVID-19 patients. However, Gupta et al. [68] focused on ICU-admitted patients and identified additional risk factors, including higher body mass index, coronary artery disease, active cancer, and specific organ dysfunctions.
The outcome variables of the study performed by Kuderer et al. [70] were severe COVID-19 illness, hospitalization, admission to the ICU, mechanical ventilation, and death. The data sources were electronic medical records and patient-reported outcomes. The key findings of the study were that patients with cancer who contracted COVID-19 had a higher risk of severe illness, hospitalization, admission to the ICU, mechanical ventilation, and death compared to the general population. Additionally, patients with active cancer and those receiving cancer treatment had a higher risk of these outcomes compared to those with a history of cancer or those not receiving treatment. Our study, similar to Kuderer et al. [70], revealed that cancer patients with COVID-19 faced an elevated risk of death compared to the general population. Yet, Kuderer et al. [70] concentrated on COVID-19 outcomes in cancer patients, whereas our study encompassed a more diverse patient population.
Williamson et al. [71] examined factors associated with COVID-19-related death. The study analyzed primary care records of 17,278,392 adults linked to 10,926 COVID-19-related deaths. Key findings showed associations between COVID-19-related death and male gender, greater age, deprivation, diabetes, severe asthma, and other medical conditions. Black and South Asian individuals had a higher risk, even after adjusting for other factors. The study provided valuable insights from one of the largest cohort studies on this topic. Our study, like Williamson et al. [71], observed associations between COVID-19-related death and male gender, greater age, and specific medical conditions such as diabetes and severe asthma. However, while Williamson et al. [71] analyzed primary care records, our study potentially utilized different data sources.
Grasselli et al. [69] evaluated independent risk factors associated with mortality of COVID-19 patients treated in ICUs in Lombardy, Italy. The study included 3988 critically ill patients with laboratory-confirmed COVID-19 referred for ICU admission from February 20 to April 22, 2020. Key findings revealed that older age, male sex, high fraction of inspired oxygen, high positive end-expiratory pressure or low PaO2:FiO2 ratio on ICU admission, and history of chronic obstructive pulmonary disease, hypercholesterolemia, and type 2 diabetes were independently associated with mortality. Our study, along with Grasselli et al. [69], identified older age and male sex as risk factors for mortality in COVID-19 patients. However, Grasselli et al. [69] focused on patients treated in ICUs, whereas our study may have a broader population.
We also conducted gender-specific analyses to investigate the adjusted associations between clinical risk factors and 60-day all-cause mortality. While the overall results were similar for both genders, being symptomatic showed a slightly stronger association in females than in males. Notably, the aORs relating a positive COVID-19 test result to 60-day all-cause mortality were higher in males than in females. Additionally, while diabetes was significantly associated with mortality in males, no significant association was found in females. These findings emphasize the importance of gender-specific analyses to better understand the impact of these clinical risk factors on mortality outcomes.
A further finding of this study was that mortality rates were higher among LTC residents with COVID-19 than among DSL residents, even though DSL and LTC residents are almost the same in terms of characteristics such as age, sex, and comorbidities. Our study is the first in our jurisdiction to compare rates of mortality between LTC and DSL using administrative data. We also found that the choice of cut-off point matters for predictive modeling: selecting it with the Youden index improved the sensitivity and AUC, which is aligned with previous studies [72].
In investigating the pre-processing methods with administrative data and the LR model (to identify individuals at risk of 60-day mortality), we found that normalization (power transformation) [73] and class balancing (over-sampling techniques) [74] improved the sensitivity and AUC, which is aligned with previous studies [17, 47, 51, 52, 75]. According to these studies, a large degree of class imbalance (in our dataset the "survivor" class outweighed the "death" class 9 to 1) lowers the sensitivity, and therefore strategies to improve the model should be considered [17]. In most of these studies, over-sampling techniques have been proposed to solve the imbalanced class issue. In this study, the ANN model achieved the best performance in terms of sensitivity and AUC, which aligns with previous studies. Sanderson et al. [15] reported that the ANN model could outperform LR in both sensitivity (by 4%) and AUC (by 2%). The authors stated that ANN models can learn the non-linear relationships between predictors and outcomes and that they scale well to large datasets. Furthermore, the ANN model was advocated by other studies as well [59-64]. In our study, the ANN model outperformed the standard LR (model 4) in sensitivity and AUC by 4% and 2%, respectively. ANN may therefore not offer a significant improvement over LR. In many cases, LR's clearer interpretation and similar performance make it the preferred choice. However, in scenarios where enhanced sensitivity is needed, such as with large datasets or specific organizational requirements, ANN may be a more suitable option.
We had several limitations that should be acknowledged. First, the administrative data used was not collected for this specific purpose. However, by using existing administrative data, we were able to obtain results that would have otherwise been restricted by the cost of primary data collection [76]. By using secondary data sources, certain variables were restricted from use or pre-defined as part of the Alberta COVID-19 Analytics and Research Database. For example, some variables were already coded, and data linkage had been done already. The residual and unmeasured confounding in terms of facility-level and patient-level characteristics limited us from further exploration into the difference in the risk of death between LTC and DSL residents.
Second, the cohort was defined based on continuing care residents being tested for COVID-19, which means that those not tested were not included, potentially introducing selection bias. However, we believe this was somewhat mitigated, as we captured most residents in continuing care based on previous estimates of the number of residents in continuing care [77]. Also, we had to rely on all-cause mortality data, as we did not have access to the specific data required for calculating cause-specific mortality [78]. The vital statistics data lacked cause-of-death details, preventing us from categorizing deaths as COVID-related or not. Although we acknowledge that cause-specific mortality would provide a more comprehensive understanding, we believe that all-cause mortality can still be a suitable proxy for deaths related to or indirectly caused by COVID-19.
Third, based on COVID-19 variants in Canada [79], the Delta and Beta variants of COVID-19 emerged one month before and one month after November 2020, respectively. When considering the year and month of the specimen collection in our cohort, the odds of death in November 2020 and in December 2020 were highest. Yet, we did not have enough data to know in which month the patients died.
Lastly, the Youden index, which was used in this study, is a measure of the overall performance of a model and is not directly related to calibration. We opted to use the cut-off point rather than calibration due to its simplicity and ease of interpretation. Calibration, on the other hand, involves fine-tuning the model probabilities to match the observed outcomes, which can be more complex and computationally demanding. Investigating calibration in different risk strata would provide additional information about the performance of the model. In terms of ML algorithms, the interplay between overfitting and underfitting was a challenge, as the success of these algorithms depends on selecting the parameters according to the number of observations and features [80].
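The calibration check by risk strata mentioned above was not part of this study. As a sketch of what such a check could look like, the function below sorts individuals by predicted risk, splits them into equal-size strata, and compares the mean predicted risk with the observed event rate in each stratum. The function name and data are hypothetical.

```python
def calibration_by_strata(y_true, y_prob, n_strata=10):
    """Return (mean predicted risk, observed event rate, size) per risk stratum."""
    pairs = sorted(zip(y_prob, y_true))  # lowest predicted risk first
    n = len(pairs)
    table = []
    for s in range(n_strata):
        group = pairs[s * n // n_strata:(s + 1) * n // n_strata]
        if not group:
            continue  # fewer observations than strata
        mean_predicted = sum(p for p, _ in group) / len(group)
        observed_rate = sum(y for _, y in group) / len(group)
        table.append((mean_predicted, observed_rate, len(group)))
    return table


# Hypothetical predictions: a low-risk and a high-risk group of four each.
y_prob = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
for predicted, observed, size in calibration_by_strata(y_true, y_prob, n_strata=2):
    print(round(predicted, 2), round(observed, 2), size)
```

A well-calibrated model shows predicted and observed values close to each other in every stratum. Note that over-sampling the training data shifts predicted probabilities upward, so calibration would need to be assessed against the original class ratio.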
In conclusion, in this cohort study of Alberta continuing care residents tested for COVID-19, the all-cause 60-day mortality rate was 10.12%. COVID-19 test results and other characteristics, including metastatic cancer, chronic liver disease, advancing age, and male sex, were associated with an increased risk of death. LTC residents were also at higher risk of death than DSL residents. We examined the pre-processing methods and ML models for predicting mortality and found that the combination of normalization, random over-sampling, and three-layer neural network classification provided superior prediction to other ML models. Our findings can enhance treatment decisions in healthcare. Further exploration is needed to harness the potential of ANN models for improved outcomes. ANNs offer data-driven insights that could support more precise diagnoses, optimized treatment plans, and better patient outcomes. Rigorous validation and collaboration between healthcare professionals and AI experts are crucial for safe and effective integration in clinical practice.

Table 1 .
The horizontal bar charts in Appendix 7 present the characteristics of continuing-care residents in Alberta who died.For 2,560 residents who died within 60 days, the number of

Table 1.
Characteristics of continuing care residents in Alberta who tested positive or negative for COVID-19 between March 1, 2020, and March 31, 2021. LTC long-term care, DSL designated support living, ED emergency department, H hospital.

Table 3.
Associations between clinical risk factors and 60-day cause-specific mortality in continuing care residents in Alberta who had a first positive or first negative COVID-19 test result between March 1, 2020, and March 31, 2021. LTC long-term care, DSL designated support living, ED emergency department, H hospital, Y year, Proc procedures.

Table 4.
Performance metrics of all the predictive models. Significant values are in bold. LR logistic regression, PT power transformation for normalizing the data, SMOTE synthetic minority over-sampling technique, ROTE random over-sampling technique, J Youden index, RF random forest, SVM support vector machine, ANN artificial neural network.