Introduction

Worldwide, healthcare organizations (HCOs) are facing challenges due to an increased demand for care from an aging population and increased multi-morbidity1,2,3. This problem is exacerbated in the long term by a growing shortage of healthcare professionals4,5 that cannot be solved by increasing healthcare budgets6,7. The need to improve healthcare systems through innovative strategies is high, and health policymakers are proposing both cost-containment and production improvement policies8,9. While cost-containment strategies ensure short-term savings, they cannot guarantee long-term results8. Therefore, production improvement strategies that utilize capacity efficiently are key to long-term solutions10,11,12,13.

Effective capacity management strategies are based on understanding the barriers to streamlining hospital patient flow and associated root causes as described in several systematic reviews14,15,16,17. The most recent review by Ahlin et al.18 went a step further. First, it explored which factors are preventing swift patient throughput at hospitals and second, it synthesized these factors into main barriers and underlying root causes. The main barriers were long lead times and inefficient coordination during the patient transfer process, caused by inadequate staffing, lack of standards, insufficient operational planning, and a lack of IT support.

Capacity management strategies can benefit from digital innovations19. Artificial intelligence (AI) can power digital medicine clinically via better disease surveillance, improved diagnosis, and novel treatments20, as well as operationally via improved capacity utilization. The Catharina Hospital in Eindhoven, the Netherlands, is a pioneer in the efficient use of capacity resources and value-based healthcare21,22,23. The hospital provides outpatient and inpatient services for up to 415,000 patients annually with a workforce of ~400 physicians (attendings and residents) and 1250 nurses. It comprises 400 beds and 20 operating rooms, with more than 16,000 surgeries performed annually. The Catharina Hospital specializes in cardio-vascular and oncology care24.

An overview of production planning, main capacity challenges, and key performance indicators (KPIs) is depicted in Fig. 1. Given the challenges hospitals are facing, efficient planning and utilization of capacity resources such as operating rooms (ORs), the intensive care unit (ICU), the post anesthesia care unit (PACU), and the general ward is imperative. To ensure the timely availability of these resources for patients and staff, surgery and bed planners use predictions of surgery duration, post-OR bed type (ICU/PACU/general ward), and hospital length-of-stay.

Fig. 1: Catharina’s capacity management in surgery planning.

Provides an overview of Catharina’s production planning, main capacity challenges, and key performance indicators (KPIs).

This study focuses on predictive models for cardio-thoracic surgery duration as a stepping stone towards data-driven capacity management. The current model for estimating the surgery duration of a patient with a surgical procedure X consists of two steps: (1) the surgeon’s average time for procedure X over the last 10 patients and (2) a manual correction to account for the patient’s specific characteristics, if needed. While this model is simple to understand, the discrepancy (delta) between the actual and planned surgery duration can be substantial and cause suboptimal surgery scheduling. This leads to inefficient OR utilization, surgery rescheduling, long waiting lists, staff overtime, and high workload.

The study objectives were (i) to evaluate the performance of the current surgery duration model used in clinical practice, (ii) to develop and validate an enhanced predictive model, and (iii) to gain insight into which patient and surgery characteristics are key features in the model development.

Results

Surgery inclusion criteria

The surgeries included in and excluded from the analysis are illustrated in Fig. 2. In summary, 2363 cardio-thoracic surgeries were performed in the Catharina Hospital on 2144 patients older than 18 years in the period Dec 2018 to Feb 2020, prior to the COVID-19 pandemic. We excluded 69 (out of 2363; 3%) surgeries according to the exclusion criteria described in the Methods section and Fig. 2. Hence, 2294 (97%) surgeries, performed on 2098 patients, were included in the analyses. Of these patients, 165 (out of 2098; 8%) had multiple surgeries, e.g., CABG followed by a resternotomy, both included in the analysis. These 165 patients comprised 44 patients with at least two elective surgeries, 38 patients with at least two acute surgeries, and 83 patients with one elective and one acute surgery. We refer to the total set as the overall cohort. Most of the surgeries were elective: 1925 (out of 2294; 84%) vs. 369 (out of 2294; 16%) acute.

Fig. 2: Surgery flowchart.

Summarizes the included and excluded elective and acute surgeries used to develop and evaluate the predictive models of surgery duration—training, test, and validation sets.

Surgery and patient characteristics

The characteristics of the three surgery cohorts (elective, acute, and overall) are summarized in Tables 1 and 2. There were statistically significant differences between the elective and acute cohorts in all characteristics except gender and type of surgeon. The patient characteristics with the highest prevalence in the overall surgery cohort were the age category 60–74 (54%), male gender (75%), overweight (39%), ASA (American Society of Anesthesiologists) score = 3 (44%), medications category either 1–5 (27%) or 6–10 (26%), and normal creatinine level (44%). The most common surgery procedure in the overall cohort was Coronary Artery Bypass Graft (CABG, 49%), followed by Aortic Valve Replacement (AVR, 14%). Only 17% of overall surgeries had at least two procedures performed during the same surgery. Nearly two-thirds of overall surgeries were performed by attending physicians. The three post-OR bed types (ICU, PACU, general ward) had similar utilization. The target variable, surgery duration, had an average value of 3.5 h in the overall cohort.

Table 1 Patient characteristics by surgery cohort.
Table 2 Surgery characteristics by surgery cohort.

A similar analysis was performed on the training, test, and validation sets. No statistically significant differences were found between their characteristics, confirming that the randomization ensured set similarity.

Evaluation of the current model of surgery duration

We evaluated the performance of the current model of surgery duration using the root mean square error (RMSE) and the mean absolute error (MAE) for both elective surgeries (RMSEelective = 0.99 and MAEelective = 0.71) and acute surgeries (RMSEacute = 1.87 and MAEacute = 1.22), see Table 4. While RMSE and MAE are common error metrics for regression models, they do not provide any scheduling insights into OR utilization. Therefore, we grouped the surgeries by the difference between real and planned surgery duration into categories that are meaningful for OR utilization, namely surgeries “on time”, “behind schedule”, and “ahead of schedule”, and refer to these as customized errors, see Table 5 and Fig. 3. The analyses showed that in 43% of all elective and 19% of all acute surgeries, the average surgical procedure duration (step 1 of the current model) was manually corrected by the surgeons (step 2 of the current model). After the correction, see Fig. 3b, e, (1) 37% of all elective surgeries and 60% of all acute surgeries were “behind schedule”, (2) 33% of all elective surgeries and 30% of all acute surgeries were “on time”, and (3) 30% of all elective surgeries and 10% of all acute surgeries were “ahead of schedule”. These discrepancies were the rationale for developing improved ML predictive models that leverage additional patient and surgery characteristics.

Fig. 3: Performance of the current and the ensemble models of surgery duration—customized errors on the validation set.

Visualizes the delta between real and planned elective (a–c) and acute (d–f) surgery duration: (a, d) current model without surgeon correction; (b, e) current model with surgeon correction; (c, f) new ensemble models.

ML predictive models of surgery duration development

Table 3 provides an overview of the patient and surgery characteristics used to develop the ML predictive models for surgery duration. The features were extracted from the surgery request form and the pre-operative screening recorded in the EHR prior to the patients’ surgery. The feature selection had two steps based on univariate and multivariate analyses, respectively, as indicated in Table 3.

Table 3 Features based on patient and surgery characteristics.

The univariate analysis revealed that gender and BMI are not statistically significant predictors (p > 0.05). However, BMI is a significant factor in the clinical evaluation by the anesthesiologist/surgeon before and during the pre-operative screening and carries a different weight for elective than for acute surgeries. The remaining features went through the multivariate analysis using the Boruta algorithm25. It clustered the features into important, tentative, and unimportant, see Fig. 4, where each boxplot corresponds to a category of a feature from Table 3. For example, five of the boxplots in Fig. 4 represent the importance of ASA score categories 1, 2, 3, 4, and 5 as features. Examples of unimportant features were the surgeon type (attending vs. resident) and age. The important and tentative features were selected to develop the predictive models using three ML techniques: linear regression (LM), random forest (RF), and extreme gradient boosting (XGBoost; abbreviated as GB). We trained the models on the three surgery cohorts: elective, acute, and overall surgeries. In the remainder of the paper, we omit the models trained on the overall surgeries since they were outperformed by the dedicated models trained on elective-only and acute-only surgeries. The outperformance is explained by the statistically significant differences between the elective and acute cohorts shown in Tables 1 and 2.

Fig. 4: Feature selection.

Depicts the output of Boruta algorithm—features on the x axis clustered into important (green), tentative (yellow), and unimportant (red) according to their importance on the y axis.

Figure 5a, b illustrates the importance of the top 20 features of the RF models. Note that all features are defined during pre-operative screening. The three most important predictors for the duration of elective surgery were (1) the anesthesiologist’s estimate of post-OR bed type being ICU rather than PACU or general ward, (2) multiple procedures being performed during the surgery, e.g., CABG and AVR, and (3) the ASA score26 being at least 2. The three most important predictors for the duration of acute surgery were (1) the surgery procedure being in the vascular cluster compared to other procedure clusters, (2) the surgery procedure being in the CABG cluster, and (3) the ASA score being at least 4. The top 20 predictors of the LM and GB models were similar, although their ranking was different. More insights into the feature importance are provided in the Discussion section.

Fig. 5: Feature importance.

Visualizes the feature importance of the RF models for (a) elective and (b) acute cardio-thoracic surgery.

Evaluation of the ML models of surgery duration

Table 4 shows the predictive models’ performance in terms of RMSE and MAE on the validation data sets and the % error reduction compared to the current model. The current model includes both the average procedure time estimate and the correction by a surgeon on top of it. The GB model for elective cardio-thoracic surgery duration showed the best RMSE and MAE error reduction compared to the current model: −19% for RMSE (from 0.99 to 0.80, p=0.002) and −14% for MAE (from 0.71 to 0.61, p=0.005). The GB model for acute cardio-thoracic surgery duration also showed the best RMSE error reduction compared to the current model: −52% (from 1.87 to 0.89, p<0.001). However, the RF model for acute cardio-thoracic surgery duration was slightly better than the GB model in reducing the MAE error: −50% (from 1.22 to 0.61, p<0.001).
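The percentage reductions quoted above follow from the relative change of each error metric with respect to the current model. The following minimal Python sketch recomputes them from the values given in the text; the function name `relative_change_pct` is introduced here for illustration only.

```python
def relative_change_pct(current, new):
    """Percentage change of an error metric relative to the current model."""
    return 100 * (new - current) / current


# Validation-set errors (in hours) quoted in the text: (current model, ML model)
reductions = {
    "elective RMSE (GB)": relative_change_pct(0.99, 0.80),
    "elective MAE (GB)": relative_change_pct(0.71, 0.61),
    "acute RMSE (GB)": relative_change_pct(1.87, 0.89),
    "acute MAE (RF)": relative_change_pct(1.22, 0.61),
}
for name, value in reductions.items():
    print(f"{name}: {value:.0f}%")
```

Rounded to whole percentages, the four values reproduce the −19%, −14%, −52%, and −50% reported above.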

Table 4 Model performances—RMSE, MAE errors on the validation set.

Table 5 shows the predictive models’ performance in terms of customized errors on the validation data sets.

Table 5 Model performances—customized errors on the validation set.

Summing up, all three models LM, RF, and GB had a statistically significant error reduction compared to the current model (see Tables 4 and 5) with a slight outperformance of the GB/RF models compared to the LM model. To further improve the model performance, we created multiple ensemble models using the LM, RF, and GB predictions on the test data sets.

The best predictive model of elective cardio-thoracic surgery duration, according to the customized errors, was an ensemble model: LM + RF + GB predictions stacked by LM (see Table 5). It reduced the share of surgeries “behind schedule” by 9 percentage points (from 37 to 28%) and increased the share of surgeries “on time” by 5 percentage points (from 33 to 38%), see also Fig. 3b, c. The share of surgeries “ahead of schedule” increased by 4 percentage points (from 30 to 34%).

The best predictive model of acute cardio-thoracic surgery duration, according to the customized errors, was an ensemble model, namely LM + GB predictions stacked by RF (see Table 5). It reduced the share of surgeries “behind schedule” by 28 percentage points (from 60 to 32%) and increased the share of surgeries “on time” by 15 percentage points (from 30 to 45%), see also Fig. 3e, f. The trade-off was that the share of surgeries “ahead of schedule” increased by 13 percentage points (from 10 to 23%).

Discussion

The analyses revealed three key findings. First, the discrepancy between the real and planned surgery durations in current clinical practice is substantial. In cardio-thoracic surgery, 37% of all elective surgeries and 60% of all acute surgeries were “behind schedule”, see Fig. 3b, e. Similar percentages were reported by Rozario27 at the aggregated OR level: ORs ran overtime 48% of the time. In our study, as in that of Rozario27, the current planned surgery duration is the surgeon’s average procedure time over the last 10 patients. In our study, 43% of the elective and 19% of the acute average procedure times were corrected manually by the surgeon. Whereas the corrections for elective surgeries reduced “behind schedule” surgeries from 47 to 37%, see Fig. 3a, b, the reduction for acute surgeries was minimal, from 61 to 60%, see Fig. 3d, e. The current model is surgeon- and procedure-specific, with an optional manual patient-specific correction that cannot completely resolve the discrepancy. Furthermore, the last 10 patients are not representative of the next individual patient to be planned. An average over only 10 patients can be inaccurate because of the small sample size and skew from outliers. In contrast, the ML models we developed included (1) 15 months of data cleaned of extreme outliers, so that the outliers’ effect is minimal, and (2) more patient and surgery characteristics than the surgeon and the procedure alone. Both contributed to reducing the discrepancy between the real and planned surgery durations.

The second key finding is that the features quantifying and qualifying the surgery and patient complexity are the most important features of our predictive models (see Fig. 5). For example, the type and number of procedures during the surgery, post-OR bed type (ICU/PACU/general ward), ASA score, number of home medications, and renal function are among the most important features. The ASA score characterizes patient operative risk on a scale of 1–5, where 1 is a normal health condition and 5 is moribund26. It has already been shown that ASA is a predictor of medical complications and post-surgery mortality28. Our study showed that ASA is an important predictor of surgery duration as well. The age, BMI, and surgeon type (attendings vs. residents) features correlate with the aforementioned important features, which explains their exclusion by the uni-/multivariate analyses. For example, the patient’s age and BMI could implicitly influence the anesthesiologist’s estimate of ASA score and post-OR bed type. Note that most complex surgery procedures were performed by attendings. The surgical residents’ years of experience might affect OR times, but we did not have the data to account for this. Gender was pointed out as an unimportant feature for cardio-thoracic surgery duration by the Catharina anesthesiologists and indeed was not statistically significant in the univariate analysis.

The third key finding is that the ensemble model outperformed the LM, RF, and GB ML models with respect to reducing the surgeries “behind schedule” and increasing the surgeries “on time”. Since our data have complex underlying patterns, we stacked the linear (LM) and non-linear (RF and GB) model predictions into an ensemble model to improve performance. RF and GB represent the bagging and boosting ML techniques, respectively. Bagging is a variance reduction technique, whereas boosting is a bias reduction technique, and ensembling them improved accuracy while keeping variance and bias low29.

The novel aspects of the models described in this paper are (1) inclusion of patient-/surgery-complexity characteristics in addition to the surgeon and surgery procedure; (2) use of features available in the scheduling phase of the surgery, prior to patient’s hospitalization, proving feasibility of good predictions without vital signs, lab results, and other monitoring data; (3) ensemble of linear and non-linear ML algorithms for best performance.

The OR is the major cost and revenue center for most hospitals, and effective use of capacity resources can provide significant benefits, as summarized in several papers30,31,32,33. Parameters like surgery duration, post-OR bed type, and length-of-stay are essential for surgery planners and have been target variables of recent publications34,35,36. The models of elective and acute surgery duration we developed support surgery planners in different ways. The former is used in scheduling elective patients several days or weeks ahead of the patients’ hospitalization. The latter is used when elective patients must be rescheduled because of an acute patient with a high medical urgency. The rescheduling can result in (1) exceeding the target surgery date of elective patients and (2) inefficient OR utilization. Hence, surgery planners need to decide whether any elective patient can be safely rescheduled and, if so, which OR with an elective surgery is the best to reschedule. These decisions are supported by the predictive models of acute surgery duration. In summary, predictive models of elective and acute surgery duration facilitate complex patient scheduling with OR and ward occupancy rates close to the hospital’s KPIs. These models can also enhance decision-making processes elsewhere in the end-to-end chain of in-/out-patient services, such as planning for intakes and patient preparation at the ward, or estimation of ICU bed capacity, by predicting the patients’ OR in- and out-flows more accurately. It is worth mentioning that the surgery duration predictions need to be combined with the OR cleaning time, which depends on hospital-specific processes and can be derived from historical data.

Recently, there has been a substantial increase in AI research in medicine37,38,39,40. Healthcare professionals are reported to be most comfortable using AI for workflow tasks such as staffing and patient scheduling (64%), followed by clinical tasks such as flagging anomalies (59%), treatment plan recommendation (47%), and diagnosis (47%). The models described in this paper belong to the first group and may facilitate broader AI adoption by generating data-driven insights that show a positive impact on operational efficiency and capital investments.

Future work will investigate (1) the impact of the ML models on optimizing the surgery schedule and how it translates into more effective OR utilization and bed occupancy in the ICU and general ward, and (2) new predictive models for patients at high risk of complications that can generate meaningful alerts for long-lasting surgeries.

This study has some limitations. First, we analyzed only 2294 surgeries performed prior to the COVID-19 pandemic, as shown in Fig. 2. The pandemic affected not only surgery volume but also surgery times and complexity. Patients’ complications due to postponed surgeries may lead to longer surgery duration. A return to a “normal” OR schedule was seen only in mid-2022, which is why we could not use larger and more recent data for the analyses. Second, the need to tailor predictive models at different levels (per hospital, per medical specialty, per surgeon, etc.) limits the models’ scalability. While the model development methodology is reproducible, the model implementation is hospital-specific. Third, the new ML models are prone to drift due to a change in the relationship between the target and independent variables, missing input variables, or any other disruption. Changes over time in the patient population, surgeons’ skills, surgeons’ fatigue, or surgical procedures are usually the root cause of model drift, leading to poor performance. In that case, either retraining or adding new features that account for the disruptions is needed as part of the ML model lifecycle. In contrast, the current model based on average surgery duration can quickly pick up on changes in the data and might be used as a backup model. Despite these limitations, this study provides valuable insights into the shortcomings of the current models of surgery duration and how they can be overcome by leveraging advanced ML models.

In conclusion, ML technologies based on specific individual patient and surgery characteristics are a foundation for improved predictive models of cardio-thoracic surgery duration. These models are a stepping stone towards data-driven capacity management. Surgery planners could benefit from these predictive models as a patient flow AI decision support tool to create an optimized surgery schedule for optimal OR utilization, which is a prerequisite for effective capacity management.

Methods

The methods used in the study are summarized according to the guidelines by Luo et al.41 for developing and reporting ML predictive models in biomedical research.

Design

This was a retrospective predictive modeling study of surgery duration to estimate the OR time required for a patient’s cardio-thoracic surgery. The OR time was defined as the difference between the patient’s departure and arrival times in the OR, excluding the OR cleaning time. The study was approved by both the Institutional Review Board (IRB) of the Catharina Hospital (nWMO-2020.165) and the Internal Committee for Biomedical Experiments (ICBE-S-000239) of Philips. Individual patient consent was waived due to the retrospective study design, in accordance with the IRB rules of the Catharina Hospital.

Study cohorts

In this study, we included cardio-thoracic surgeries performed in the Catharina Hospital during the period Dec 2018 to Feb 2020 on patients older than 18 years. The age restriction was related to the privacy requirement of the study protocol. We selected a 15-month study period prior to COVID-19 to avoid the pandemic’s impact on surgery volume, surgery times, and medical urgency. Our focus on cardio-thoracic surgeries was driven not only by the main KPI of the Catharina Hospital depicted in Fig. 1, but also by an additional Dutch healthcare regulation: “the time on the waiting list for cardio-thoracic patients should be less than 7 weeks to prevent severe medical complications”42. This rule emphasizes the importance of tools for effective surgery scheduling, rescheduling, and OR utilization, which are the focus of this paper. We excluded surgeries according to the following criteria: (1) surgery duration longer than ave(cardio-thoracic surgery time) + 3*sd(cardio-thoracic surgery time) ≈ 8 h, which excluded 1% of all surgeries, (2) patient mortality during surgery, and (3) surgery performed by a surgeon with fewer than 10 surgeries in the study period. The rationale behind the third exclusion criterion was twofold. First, the surgery duration estimate by the current model was too unreliable to compare with. Second, the number of surgeries per surgeon was too low for meaningful data analysis. In summary, the three exclusion criteria were selected to clean the surgery data set of outliers related to low surgery frequency per surgeon and long/short surgery duration, e.g., long duration due to complications and short duration due to mortality, that cannot be planned. The resulting surgery data set comprised both elective and acute surgeries. According to the study definition, acute surgeries were those performed within 24 h of patient admission. Three study cohorts were analyzed: overall, elective, and acute surgeries.
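The three exclusion criteria can be illustrated with a minimal Python sketch. The record layout (the keys `surgeon`, `duration_h`, and `died_during_surgery`) is hypothetical, and two details are assumptions not stated in the text: the sample standard deviation is used, and the per-surgeon count is taken before any other exclusion is applied.

```python
from collections import Counter
from statistics import mean, stdev

MIN_SURGERIES_PER_SURGEON = 10  # threshold stated in the text


def apply_exclusions(surgeries):
    """Return the surgeries that pass the three exclusion criteria.

    Each surgery is a dict with the hypothetical keys
    'surgeon', 'duration_h', and 'died_during_surgery'.
    """
    durations = [s["duration_h"] for s in surgeries]
    # Criterion 1: duration above mean + 3 * SD (sample SD is an assumption).
    cutoff = mean(durations) + 3 * stdev(durations)
    # Criterion 3 uses counts over the full period (assumed: before exclusions).
    per_surgeon = Counter(s["surgeon"] for s in surgeries)
    return [
        s for s in surgeries
        if s["duration_h"] <= cutoff
        and not s["died_during_surgery"]  # criterion 2
        and per_surgeon[s["surgeon"]] >= MIN_SURGERIES_PER_SURGEON
    ]
```

Applied to a list of such records, the function returns only the surgeries that would enter the analysis under these assumptions.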

Data sources and preprocessing

The primary data source for this study was the electronic health record (EHR) data repository of the Catharina Hospital. The data contain patient demographics, patient and surgery characteristics collected at the pre-operative screening, and real surgery duration recorded during inpatient encounters. Data over the period Dec 2018—Feb 2020 were extracted from the EHR using Microsoft SQL Server Management Studio (SSMS) 2018. Data management and deidentification were achieved through SSMS and pseudo-coding. All data were deidentified before analyses.

The data preprocessing included both data cleaning and data transformations. The data cleaning comprised removing duplicates, correcting out-of-range variables, and imputing missing values of the following categorical variables: BMI, ASA, Medications, and Creatinine (see Table 1). The imputation strategy was based on replacing missing values with “the most frequent category” or a newly created “unknown” category. The first data transformation consisted of converting discrete variables into categorical variables, guided by both medical and statistical rationale. For example, we used well-established medical categories for BMI, ASA26, and Creatinine43, and underlying statistical distributions for age and number of home medications, see Table 1. The second data transformation consisted of clustering the surgical procedures. Due to the proprietary procedure codes used by the Catharina Hospital, we were not able to use the Clinical Classifications Software Refined (CCSR)44, which is based on standardized procedure coding systems like ICD-10-PCS. Therefore, hierarchical (data-driven) and medical (clinician’s expertise-driven) clustering was performed to group the procedure codes into categories meaningful for data analysis. The third data transformation was one-hot encoding of the categorical variables. It was necessary for linear ML models, which cannot take categorical input directly, in contrast to decision tree ML models.
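The imputation, categorization, and one-hot encoding steps described above can be sketched in a few lines of Python. This is a simplified illustration rather than the study’s pipeline: the function names are hypothetical, missing values are assumed to be encoded as `None`, and the BMI cut-offs shown are the standard WHO adult classes, which may differ from the categories listed in Table 1.

```python
from collections import Counter

UNKNOWN = "unknown"


def impute_categorical(values, strategy="most_frequent"):
    """Replace None with the most frequent category or an 'unknown' category."""
    if strategy == "most_frequent":
        fill = Counter(v for v in values if v is not None).most_common(1)[0][0]
    else:
        fill = UNKNOWN
    return [fill if v is None else v for v in values]


def bmi_category(bmi):
    """Standard WHO adult BMI classes (assumed; see Table 1 for the study's own)."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


def one_hot(values):
    """Return one 0/1 indicator column per category, keyed by category name."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}
```

For example, `one_hot(impute_categorical(["3", None, "3", "2"]))` yields one indicator column for ASA category "2" and one for category "3", with the missing entry assigned to the more frequent category "3".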

Predictive models development and evaluation

The current model deployed in the Catharina hospital uses the surgeon’s average procedure time of the last 10 patients as a prediction of surgery duration for the next patient having the same procedure. Further, the average surgery duration is sometimes manually corrected by the surgeons. The final estimate is referred to in this paper as a planned surgery duration by the current model used in clinical practice.
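To make the two steps of the current model concrete, the following minimal Python sketch keeps a rolling window of the last 10 real durations per surgeon and procedure. The class and method names are hypothetical, and treating the manual correction as an additive offset in hours is an assumption; the text does not specify how surgeons apply their correction.

```python
from collections import defaultdict, deque
from statistics import mean

WINDOW = 10  # "last 10 patients", as stated in the text


class RollingAverageEstimator:
    """Surgeon- and procedure-specific rolling mean of past durations (hours)."""

    def __init__(self, window=WINDOW):
        # One bounded history per (surgeon, procedure) pair.
        self._history = defaultdict(lambda: deque(maxlen=window))

    def record(self, surgeon, procedure, duration_h):
        """Store the real duration once a surgery has finished."""
        self._history[(surgeon, procedure)].append(duration_h)

    def planned_duration(self, surgeon, procedure, manual_correction_h=0.0):
        past = self._history[(surgeon, procedure)]
        if not past:
            return None  # no history yet; the text does not cover this case
        # Step 1: average of the most recent surgeries.
        # Step 2: optional manual correction (assumed additive, in hours).
        return mean(past) + manual_correction_h
```

After recording durations of 3.0, 3.5, and 4.0 h for one surgeon and procedure, `planned_duration(..., manual_correction_h=0.5)` returns 4.0 h: the 3.5 h average plus the correction.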

The development of the new ML predictive models went through the following steps: (1) data splitting, (2) feature selection, (3) model training, (4) model testing and evaluation, and (5) model ensembling.

In the first step, each of the three cohorts described above was randomly split into a training set (50%), a test set (20%), and a validation set (30%), which is common practice in data analysis. The splitting was “surgery”-aware but not “patient”-aware, i.e., the three sets were mutually exclusive with respect to surgeries but not with respect to patients with multiple surgeries. This splitting strategy might induce data leakage; however, the risk was minimal because patients with multiple surgeries (2% in the elective and 12% in the acute cohort) had different procedures performed by different surgeons. The training sets were used to develop predictive models, including feature selection and hyperparameter tuning based on a Random Grid Search K-fold Cross-Validation. We selected K = 10 for elective surgeries and K = 5 for acute surgeries since the latter data set was relatively small. Examples of RF hyperparameters tuned were the number of decision trees in the forest, the maximum depth of the tree, the maximum number of features to consider at each split, and the error measure to split on. The test sets were used to perform model stacking, i.e., to fit ensemble models that combine the predictions of the different ML models developed on the training sets. In this way, the test sets became training sets for the ensemble models, which is why validation sets were needed to report model performance. The validation sets were therefore held out during the development of the ML models and were only used to evaluate the performance of the current model as well as all new predictive models, as reported in Tables 4, 5 and Fig. 3.
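A surgery-level random split into 50% training, 20% test, and 30% validation data can be sketched as follows. The function name, the fixed seed, and the rounding of set sizes are illustrative assumptions. Like the split described above, the sketch operates on surgery identifiers and is therefore not patient-aware.

```python
import random


def split_surgeries(surgery_ids, seed=0, train_frac=0.5, test_frac=0.2):
    """Shuffle surgery IDs and cut them into training/test/validation lists."""
    ids = list(surgery_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_train = round(train_frac * len(ids))
    n_test = round(test_frac * len(ids))
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    validation = ids[n_train + n_test:]  # the remaining ~30%
    return train, test, validation
```

A patient-aware variant would shuffle patient identifiers instead and assign all surgeries of a patient to the same set.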

In the second step, features were extracted from the EHR data related to patient cardio-thoracic surgeries. All these features were available at the point of surgery scheduling prior to patients’ hospitalization; lab results and monitoring data such as vital signs and ECG were therefore not available. The latter are expected to be strong predictors of real-time changes in surgery duration due to intra-operative complications, which is a different use case than the patient’s surgery scheduling. We used a two-step process for feature selection. First, we performed univariate inferential analysis to investigate the predictive power of each feature and dropped those that were not statistically significant predictors (p > 0.05). Second, we performed multivariate inferential analysis using the Boruta algorithm25, which is based on the RF ML technique and clusters features into important, tentative, and unimportant.
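The first, univariate step of the feature selection can be illustrated with the sketch below. The text does not name the univariate test, so one-way ANOVA across the categories of each feature (via `scipy.stats.f_oneway`) is an assumed choice, and the function name and data layout are hypothetical. The second, Boruta-based step is not reproduced here.

```python
from collections import defaultdict

from scipy.stats import f_oneway

ALPHA = 0.05  # features with p > 0.05 are dropped, as in the text


def univariate_filter(features, durations, alpha=ALPHA):
    """Keep categorical features whose categories differ in mean duration.

    `features` maps a feature name to a list of category labels aligned
    with `durations` (hours). One-way ANOVA is an assumed choice of test.
    """
    kept = {}
    for name, labels in features.items():
        groups = defaultdict(list)
        for label, duration in zip(labels, durations):
            groups[label].append(duration)
        if len(groups) < 2:
            continue  # a constant feature cannot be tested
        p_value = f_oneway(*groups.values()).pvalue
        if p_value <= alpha:
            kept[name] = p_value
    return kept
```

The function returns the retained feature names together with their p-values, which could then be passed on to a multivariate selection step.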

In the third step, we trained several predictive models using the features selected in the second step and multiple ML techniques—linear regression (LM), random forest (RF), and extreme gradient boosting (XGBoost; abbreviated as GB in this paper). RF and GB are non-linear models, very popular as algorithms of choice for many winning teams of machine learning competitions45.

In the fourth step, we used two well-known error metrics, RMSE and MAE, for the performance evaluation of regression models. Both metrics quantify the delta (real − planned) in surgery duration, where planned refers to the surgery duration in hours predicted by the models, and real refers to the surgery duration in hours recorded in the Catharina EHR. In addition to RMSE and MAE, we defined customized error metrics in terms of surgeries “on time”, “behind schedule”, and “ahead of schedule”. The categories “ahead of schedule” and “behind schedule” consist of all surgeries that finished more than t min earlier or later, respectively, than their planned time. The category “on time” comprises all surgeries within the time range [−t, t] of their planned time. In our analysis we chose t = 10% * ave(cardio-thoracic surgery time) = 10% * (3 h 30 min) ≈ 20 min. Furthermore, the category “ahead of schedule” had two subcategories with time ranges of “more than 60 min” and “60–20 min” ahead of schedule, respectively. Similarly, the category “behind schedule” had three subcategories with time ranges of “20–60 min”, “60–120 min”, and “more than 120 min” behind schedule, respectively. The time cut-offs of the subcategories corresponded to approximately 30% and 60% of ave(cardio-thoracic surgery duration). We evaluated RMSE, MAE, and the customized errors for the current model as well as the new ML predictive models of surgery duration.
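The error metrics defined above can be expressed compactly in Python. The sketch below is illustrative: the function names are hypothetical, durations are given in hours, and a surgery whose delta equals exactly ±t is counted as “on time”, which is one possible reading of the boundary case.

```python
import math

T_MIN = 20  # "on time" tolerance in minutes, as chosen in the text


def rmse(real_h, planned_h):
    """Root mean square error of real vs. planned durations (hours)."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(real_h, planned_h)) / len(real_h))


def mae(real_h, planned_h):
    """Mean absolute error of real vs. planned durations (hours)."""
    return sum(abs(r - p) for r, p in zip(real_h, planned_h)) / len(real_h)


def schedule_category(real_h, planned_h, t_min=T_MIN):
    """Classify one surgery by delta = real - planned, converted to minutes."""
    delta_min = (real_h - planned_h) * 60
    if delta_min > t_min:
        return "behind schedule"  # surgery ran longer than planned
    if delta_min < -t_min:
        return "ahead of schedule"  # surgery finished earlier than planned
    return "on time"  # within [-t, t], boundaries included (assumption)


def customized_errors(real_h, planned_h):
    """Share of surgeries in each category, as fractions that sum to 1."""
    counts = {"behind schedule": 0, "on time": 0, "ahead of schedule": 0}
    for r, p in zip(real_h, planned_h):
        counts[schedule_category(r, p)] += 1
    return {k: v / len(real_h) for k, v in counts.items()}
```

The subcategories (e.g., “60–120 min” behind schedule) could be added in the same way by introducing further thresholds on `delta_min`.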

In the fifth step, we performed model stacking, i.e., the predictions on the test set made by the different models trained in the third step were used as features to fit a new model that we referred to as an ensemble model.
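The stacking procedure can be sketched with scikit-learn as follows. This is a simplified illustration under several assumptions: scikit-learn’s `GradientBoostingRegressor` stands in for XGBoost, default hyperparameters replace the tuned ones, a linear meta-model is used (as in the best elective ensemble), and the function names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression


def fit_stacked_ensemble(X_train, y_train, X_test, y_test, seed=0):
    """Fit base models on the training set and a linear meta-model on the test set."""
    base_models = [
        LinearRegression(),
        RandomForestRegressor(n_estimators=100, random_state=seed),
        # Stand-in for XGBoost, which the study used for its GB model.
        GradientBoostingRegressor(random_state=seed),
    ]
    for model in base_models:
        model.fit(X_train, y_train)
    # Base-model predictions on the test set become the meta-model's features.
    meta_features = np.column_stack([m.predict(X_test) for m in base_models])
    meta_model = LinearRegression().fit(meta_features, y_test)
    return base_models, meta_model


def predict_stacked(base_models, meta_model, X):
    """Predict with the ensemble: base predictions first, then the meta-model."""
    meta_features = np.column_stack([m.predict(X) for m in base_models])
    return meta_model.predict(meta_features)
```

Because the meta-model is fitted on the test set, the ensemble’s performance has to be reported on a separate, held-out validation set, for example via `predict_stacked(base_models, meta_model, X_validation)`.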

Statistical analysis

The characteristics of patients and surgeries by study cohort (elective vs. acute vs. overall) and by model development cohort (training, test, and validation sets) were summarized as means and standard deviations (SDs), or as frequencies and percentages. Comparisons of normally and non-normally distributed continuous variables between cohorts were conducted using Student t tests and Mann–Whitney U tests, respectively. For categorical variables, Pearson Chi-square tests were used to examine the association with the cohorts. The significance level was set to 0.05. All data analyses were performed using the statistical software R46, version 4.2.1.
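The study’s analyses were run in R; the cohort comparisons can nevertheless be illustrated with an equivalent Python sketch based on SciPy. The function names are hypothetical, and the Shapiro–Wilk test as the normality check is an assumption, since the text does not state how normality was assessed.

```python
from scipy.stats import chi2_contingency, mannwhitneyu, shapiro, ttest_ind

ALPHA = 0.05  # significance level stated in the text


def compare_continuous(group_a, group_b, alpha=ALPHA):
    """Student t test if both groups look normal, otherwise Mann-Whitney U."""
    # Shapiro-Wilk is an assumed normality check; the text does not name one.
    normal = shapiro(group_a).pvalue > alpha and shapiro(group_b).pvalue > alpha
    if normal:
        return "t-test", ttest_ind(group_a, group_b).pvalue
    return "mann-whitney", mannwhitneyu(group_a, group_b, alternative="two-sided").pvalue


def compare_categorical(contingency_table):
    """Pearson chi-square test on counts (rows: cohorts, columns: categories)."""
    # correction=False disables the Yates continuity correction for 2x2 tables.
    chi2, p_value, dof, expected = chi2_contingency(contingency_table, correction=False)
    return p_value
```

For instance, `compare_categorical([[90, 10], [10, 90]])` returns a p-value far below 0.05, indicating that the category distribution differs between the two cohorts.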

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.