Development and validation of a reinforcement learning model for ventilation control during emergence from general anesthesia

Lee, Hyeonhoon; Yoon, Hyun-Kyu; Kim, Jaewon; Park, Ji Soo; Koo, Chang-Hoon; Won, Dongwook; Lee, Hyung-Chul

doi:10.1038/s41746-023-00893-w

Article
Open access
Published: 14 August 2023

Hyeonhoon Lee ORCID: orcid.org/0000-0002-9426-823X^1,2^na1,
Hyun-Kyu Yoon³^na1,
Jaewon Kim⁴,
Ji Soo Park⁵,
Chang-Hoon Koo⁶,
Dongwook Won⁷ &
…
Hyung-Chul Lee ORCID: orcid.org/0000-0003-0048-7958³

npj Digital Medicine volume 6, Article number: 145 (2023)

Abstract

Ventilation should be assisted without asynchrony or cardiorespiratory instability during anesthesia emergence until sufficient spontaneous ventilation is recovered. In this multicenter cohort study, we develop and validate a reinforcement learning-based Artificial Intelligence model for Ventilation control during Emergence (AIVE) from general anesthesia. Ventilatory and hemodynamic parameters from 14,306 surgical cases at an academic hospital between 2016 and 2019 are used for training and internal testing of the model. The model’s performance is also evaluated on the external validation cohort, which includes 406 cases from another academic hospital in 2022. The estimated reward of the model’s policy is higher than that of the clinicians’ policy in the internal (0.185, the 95% lower bound for best AIVE policy vs. −0.406, the 95% upper bound for clinicians’ policy) and external validation (0.506, the 95% lower bound for best AIVE policy vs. 0.154, the 95% upper bound for clinicians’ policy). Cardiorespiratory instability is minimized as the clinicians’ ventilation matches the model’s ventilation. Regarding feature importance, airway pressure is the most critical factor for ventilation control. In conclusion, the AIVE model achieves higher estimated rewards with fewer complications than clinicians’ ventilation control policy during anesthesia emergence.

Introduction

The emergence from general anesthesia is dynamic, and various physiologic responses can occur during this phase¹. Restoration of spontaneous breathing is one of the first physiological signs that appear during the emergence from general anesthesia². When the patient’s spontaneous breathing begins to recover, anesthesiologists switch off the mechanical ventilator and assist the patient with manual ventilation at optimal timing to avoid complications, such as cardiorespiratory instability or patient-ventilator asynchrony.

Since most of the ventilation control during emergence is performed by human clinicians, human factors can affect the risk of emergence from anesthesia³. In particular, anesthesiologists may be unassisted at the end of surgery and distracted by high task load and fatigue. In general, the situation during emergence from anesthesia is less controlled than that at induction.

Artificial intelligence algorithms can assist human clinicians in various medical fields^4,5. Among the artificial intelligence algorithms, the reinforcement learning algorithms can find the optimal policy by maximizing the cumulative expected reward⁶. This is similar to the decision-making process of a clinician whose goal is improving the clinical outcome through appropriate intervention⁷. In previous studies, reinforcement learning algorithms have been used for various medical problems⁸, such as drug administration during general anesthesia⁹, hypotension treatment¹⁰, and ventilation settings in the intensive care unit¹¹.

In this study, we aim to develop and validate the reinforcement learning-based Artificial Intelligence model for Ventilation during Emergence (AIVE) from general anesthesia, which controls ventilation during emergence while preventing hemodynamic and ventilatory complications. We hypothesize that, compared with the clinicians’ policy, AIVE’s policy would achieve higher estimated rewards defined by the clinical outcomes.

Results

Dataset construction

Among the 31,071 cases from the derivation cohort, 14,306 cases (6,763,535 one-second time points) were included for model development and internal validation (Fig. 1). From the derivation cohort, 2146 cases (15%) were randomly selected for internal validation. The remaining cases (85%) were used for model training and hyperparameter tuning. External validation was performed using 406 cases (162,656 one-second time points) from the independent dataset from the external validation cohort. The demographic data and perioperative features of the analyzed cases are presented in Table 1.

Table 1 Demographic, anesthetic, and surgical characteristics of the study population.

Performance evaluation

Three hundred models were built from the training set to compare the estimated rewards of AIVE’s policy with those of the clinicians’ policy. The whole learning scheme was consistent for each model. The model’s estimated rewards were significantly higher than the clinicians’ rewards in the internal validation set (0.185, the 95% lower bound for best AIVE policy vs. −0.406, the 95% upper bound for clinicians’ policy) and the external validation set (0.506, the 95% lower bound for best AIVE policy vs. 0.154, the 95% upper bound for clinicians’ policy). As shown in Fig. 2, the 95% lower bound of the estimated rewards of the AIVE’s policy consistently exceeded the 95% upper bound of the estimated rewards of the clinicians’ policy in the internal validation and external validation sets, suggesting that a sufficient number of models were developed.

**Fig. 2: Performances of the AIVE’s and clinicians’ policies.**

The distribution of discrepancy between the AIVE’s and clinicians’ policies is presented in Fig. 3. In most cases, the time discrepancy between the two policies for suggesting ventilation was within 2 min in the internal validation set and a minute in the external validation set, indicating that the AIVE’s policy could be developed from the suboptimal clinicians’ policy.

**Fig. 3: The distribution of discrepancy between the AIVE’s and clinicians’ policies.**

Outcome differences

Ventilation under the clinicians’ policy that did not match the AIVE’s policy during the emergence process was associated with increasing cardiorespiratory instability in a time-dependent manner (Fig. 4 and Table 2). There was a significant positive correlation between mismatched ventilation and increased cardiorespiratory instability in the internal and external test sets (all P < 0.001). For the secondary outcomes, the correlations of the policy discrepancy with the cardiorespiratory parameters, including peripheral oxygen saturation (SpO₂), heart rate (HR), systolic blood pressure (SBP), peak inspiratory pressure (PIP), and end-tidal carbon dioxide concentration (E_TCO₂), in the internal and external test sets are presented in Figs. 5 and 6, Table 2, and Supplementary Figs. 1 and 2. Significant positive correlations were observed between all cardiorespiratory parameters and the time of policy discrepancy in the internal test set, consistent with the primary outcome. In the external test set, most cardiorespiratory parameters, except for HR, showed significant positive correlations with the time of policy discrepancy.

Regarding the postoperative secondary outcomes, significant positive correlations were observed between the policy discrepancy and the clinical outcomes (length of hospital stay and length of post-anesthesia care unit [PACU] stay) as well as cardiorespiratory parameters within 48 h after surgery (HR and respiratory rate [RR]). However, the correlation with the length of hospital stay was not statistically significant in the external test set. Notably, SpO₂ showed a significant negative correlation in the internal test set, suggesting that policy discrepancy could potentially lead to lower SpO₂ levels after surgery. Among the patients who underwent chest X-rays (52.3% of the total) within 48 h after surgery, no significant correlations were found between the policy discrepancy and the incidence of atelectasis and pulmonary edema based on the X-ray results (Table 2). In addition, there were no significant correlations among the patients who had the arterial blood gas analysis after surgery (20.0% of the total).

The results of subgroup analyses regarding the primary and secondary outcomes based on age, sex, and type of surgery are presented in Table 3 and Supplementary Tables 2 and 3. In all age and sex subgroups, substantial positive correlations were observed between the policy discrepancy and the cardiorespiratory instability in the internal test set. Regarding the surgical type, considerable positive correlations were detected between the policy discrepancy and cardiorespiratory instability among patients undergoing general, urological, orthopedic, gynecological, and neurosurgery.

**Fig. 4: The changes in cardiorespiratory instability depend on the degree of time discrepancy between the AIVE’s and clinicians’ policies.**

Table 2 The correlation between the policy discrepancy and the primary and secondary outcomes in the internal and external test set.

**Fig. 5: The changes in cardiorespiratory parameters depend on the degree of time discrepancy between the AIVE’s and clinicians’ policies in the internal test set.**

**Fig. 6: The changes in cardiorespiratory parameters depend on the degree of time discrepancy between the AIVE’s and clinicians’ policies in the external test dataset.**

Table 3 Subgroup analysis for the cardiorespiratory instability in the internal and external test set.

Visualization of representative cases for comparison of policies

Figure 7 shows two representative cases illustrating how cardiorespiratory parameters change with the discrepancy between the AIVE’s and the clinicians’ policies. Cardiorespiratory parameters were maintained during the emergence from general anesthesia when the clinicians’ ventilation control was consistent with the AIVE’s policy. However, cardiorespiratory parameters worsened when the clinicians’ actual control was discrepant with the AIVE’s policy. AIVE suggested controlling mechanical or manual ventilation based on the patient’s status to prevent excessive changes in the cardiorespiratory parameters.

Feature importance

The SHapley Additive exPlanations (SHAP) method was used to present the degree of importance of each feature for the AIVE’s and clinicians’ policies, respectively. The most important feature for controlling ventilation in both policies was the decreased airway pressure (AWP) (Fig. 8). However, unlike clinicians who usually focused only on the level of AWP, AIVE comprehensively considered other parameters, such as cumulative apnea time and spontaneous breathing.

Discussion

The present study developed and externally validated a reinforcement learning model that controls ventilation during the emergence from general anesthesia. AIVE’s policy showed higher estimated rewards than the clinicians’ policy, indicating that the actions suggested by AIVE could be superior to those suggested by clinicians for maintaining cardiorespiratory stability, adequate oxygenation, and decarboxylation during the emergence from general anesthesia. As the discrepancy increased between the AIVE’s and clinicians’ ventilation, worse outcomes were observed.

The advantages of the architecture and learning method of AIVE might explain why the estimated reward of AIVE’s policy was higher than that of the clinicians’ policy. First, the neural network architecture used to build AIVE can continuously process the complex relationship between the patient’s status and the optimal action, including the dose of anesthetic drugs, hemodynamic or respiratory status, or other features, at every second¹². Clinicians’ actual practice might be suboptimal because interpreting the tremendous amount of data from various monitoring devices in real time is difficult. Second, the reinforcement learning algorithm helps AIVE find an optimal policy from our historical data to maximize a cumulative reward¹³. AIVE decides the action for the current patient status considering the future cardiorespiratory changes until complete recovery from general anesthesia. Training the reinforcement learning model on real-world clinical data helps it find the best policy efficiently.

In the external validation, the estimated reward of AIVE was even higher than that in the internal testing dataset. This may be explained by the differences between the internal and external datasets, as the external dataset included more cases with a shorter duration of anesthesia and younger patients than the internal dataset. These differences between the datasets may have influenced the discrepancy between the AIVE’s and clinicians’ policies. The small number of cases in the external validation dataset may also explain why the relationship between the discrepancy and some variables (SpO₂) was not definitive.

To the best of our knowledge, this is the first study to develop and validate a reinforcement learning model that suggests the optimal timing of controlling ventilation during anesthesia emergence in surgical patients. Previous studies have developed offline reinforcement learning models to solve complicated medical problems^{11,14,15,16,17}. One study developed a reinforcement learning model to recommend various interventions, such as administering intravenous fluid and medications, to treat patients with sepsis in the intensive care unit (ICU)¹⁴. Prasad et al.¹⁵ reported a reinforcement learning model for weaning from mechanical ventilation using fitted-Q iteration and the Medical Information Mart for Intensive Care (MIMIC)-III database. Another study developed an inverse reinforcement learning model for discontinuing mechanical ventilation and sedative dosing in critically ill patients¹⁶. A reward function that can be inferred by inverse reinforcement learning was designed in that study. A recent study used data from the MIMIC-III database to develop a reinforcement learning model that suggests an optimized regimen of ventilator settings, including tidal volume, fraction of inspired oxygen (F_IO₂), and positive end-expiratory pressure (PEEP), and externally validated the model using another open ICU dataset¹¹. Another reinforcement learning model for guiding adequate electrolyte replacement was developed using electronic health records¹⁷.

The strength of this study is that the reinforcement learning model was developed based on real-world data from actual clinical practice, consisting of high-resolution intraoperative biosignals. Therefore, the reinforcement learning model would better reflect the clinical situation than a model based on a well-refined open dataset. In addition, intraoperative biosignals were obtained from various monitoring devices generally used in the operating room, providing the possibility of application in different clinical environments. The reinforcement learning agent learned the optimal policy using only information about the cardiorespiratory status of the patient during emergence from general anesthesia rather than clinical information. The data used for model development did not require assessment or judgment by clinicians and could be obtained from most hospitals. The reinforcement learning model may develop into a fully automated data-driven clinical decision support system and may facilitate an individualized strategy for controlling ventilation during the emergence from general anesthesia in surgical patients.

Despite these strengths, some important considerations must be taken into account when deploying our offline reinforcement learning model to real-time settings¹⁸. First, although the AIVE was designed to propose actions based on biosignals every second, there can be delays in monitoring parameters, communications, and delivering actions. Therefore, comprehensive real-time simulations should be conducted before clinical implementation to ensure the AIVE’s stability. Second, although the AIVE was validated in an external dataset, it was developed using a dataset from a single center, which could lead to distributional shifts when the model is deployed in different settings. These shifts could result in the model suggesting suboptimal decisions. It is crucial to consider these potential instabilities and biases before running our model in real-world clinical settings.

This study has some limitations. First, bias associated with the retrospective nature of the study would have affected the results. Second, we excluded cardiac and pediatric patients, as well as patients who underwent thoracic surgery requiring one-lung ventilation; therefore, the model’s performance cannot be generalized to these populations. Third, although we externally validated the model’s performance using a different dataset from an independent hospital, the sample size of the external validation dataset may be relatively small compared with that of the derivation dataset. Therefore, our results must be interpreted cautiously. However, despite the relatively small external validation dataset, the reinforcement learning model policy performed better than the clinicians’ policy in the external validation dataset and in our hospital data. Fourth, our study focused on immediate clinical outcomes in the operating room and PACU. Future research should explore the model’s benefits for long-term and relevant outcomes like delayed emergence and emergence delirium. Fifth, we did not specifically record the precise expertise level of attending anesthesiologists and trainee grades for extubation due to the retrospective nature of the study. However, all extubation processes were performed either by attending anesthesiologists or trainees under the direct supervision of attending anesthesiologists. Due to the retrospective nature of the study, clinical care for the emergence and after emergence cannot be strictly controlled, which might have caused some biases. Sixth, patient comorbidity data was not collected in this study, and there were some missing values, such as arterial blood gas analysis and chest X-ray results, which limited the evaluation of its impact on our results. Future studies should address this aspect to provide clarity. Seventh, the start of anesthesia emergence was defined as when the actual F_IO₂ exceeded 70% for automatic detection. 
However, this led to the exclusion of about 4.7% of patients, introducing potential biases. Eighth, only patients who received volume-controlled ventilation were included in our study. Excluding pressure-controlled or pressure-support ventilation may limit the model’s generalizability. Last, we confined the emergence duration to between 2 and 20 min, excluding 3.9% of patients, which could introduce bias. Future studies may consider employing recent reinforcement learning models capable of stable training with either shorter or longer trajectories.

In conclusion, we developed and validated a reinforcement learning model for the optimal timing of controlling ventilation using intraoperative biosignals during emergence from general anesthesia in surgical patients. A greater discrepancy between the reinforcement learning model’s policy and the clinicians’ policy was associated with greater cardiorespiratory instability, indicating that the reinforcement learning model may have the potential to act as a clinical decision-making support tool. Prospective validation studies are warranted to confirm our results.

Methods

Study design

All data for model development was retrieved from the prospective registry containing the vital signs of surgical patients at the Seoul National University Hospital (SNUH). This prospective registry was approved by the Institutional Review Board (IRB) of SNUH (Approval number: 1408-101-605) and registered at ClinicalTrials.gov (NCT02914444). The IRB also approved the retrospective analysis of the data from this prospective registry (Approval number: 2205-061-1322). The IRB approved the data extraction and analysis for external validation at Seoul National University Bundang Hospital (SNUBH, Approval number: 2207-768-405). The IRBs waived the requirement of written informed consent due to the retrospective nature of this study and the anonymity of the data.

Data collection

From the registry data, all general anesthesia cases from the derivation cohort (SNUH) between August 2016 and November 2019 were included for model development and internal validation. Cases from the external validation cohort (SNUBH) between January 2022 and June 2022 were included for external validation. Most general anesthesia cases were managed by attending anesthesiologists with several years of experience, and trainees were supervised throughout the process. Cases with the following features were excluded: (1) patient age <18 years, (2) cases in which pressure-controlled ventilation was used, (3) procedures that were not performed under general anesthesia, (4) cases in which the laryngeal mask airway was used rather than an endotracheal tube, (5) cases in which one-lung ventilation was performed using a double-lumen tube, (6) cases that had no tracks regarding critical input variables in the intraoperative biosignals data, (7) cases in which tracheal extubation was not performed at the end of surgery in the operating room, (8) cases in which F_IO₂ was not increased before the patient recovered spontaneous breathing, (9) cases in which the duration of emergence was less than 2 min or greater than 20 min, and (10) cases that had no tracks to evaluate the primary outcome.

The intraoperative biosignal data used in the study were collected by a free biosignal collection program (Vital Recorder, ver.1.9.9, accessible at https://vitaldb.net, Seoul, Republic of Korea)¹⁹. SpO₂, HR, and SBP were measured using a patient monitor (Solar™ 8000 M, GE Healthcare, Wauwatosa, WI, USA). The indices related to the processed electroencephalogram, such as the bispectral index, electromyogram, and spectral edge frequency, were collected using the brain monitor (BIS Vista™, Medtronic, Dublin, Ireland). In addition, data regarding mechanical ventilation, such as AWP, E_TCO₂ level, respiratory compliance, anesthetic agents, PIP, RR, PEEP, and tidal volume, were collected from the anesthesia ventilators (Primus, Dräger, Lübeck, Germany). Among the variables of mechanical ventilation, waveform data, including AWP and E_TCO₂ from the anesthesia ventilators, were sampled at a rate of 62.5 Hz, whereas the other variables were sampled at a rate of 0.14 Hz. Cardiorespiratory variables from the patient monitor were sampled at 2 Hz. To handle these time-varying variables, we either up-sampled them using linear interpolation followed by forward and backward filling or down-sampled them, bringing all variables to 10 Hz. We then adopted the maximum value within each one-second time window for model development.
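The resampling step described above can be illustrated with a minimal sketch. It assumes NumPy, uses hypothetical function names (`resample_to_10hz`, `one_second_max`) and a made-up 2 Hz heart-rate track, and treats down-sampling as simply evaluating the signal on the 10 Hz grid; the study's actual preprocessing code is not given in the text and may differ.

```python
import numpy as np

def resample_to_10hz(t, x, duration_s):
    """Linearly interpolate samples (t in seconds, x values) onto a 10 Hz grid.

    np.interp holds the first/last sample constant outside the observed
    range, which stands in for backward/forward filling at the edges.
    """
    grid = np.arange(0, duration_s * 10) / 10.0
    return grid, np.interp(grid, t, x)

def one_second_max(x_10hz):
    """Maximum of a 10 Hz signal within consecutive one-second windows."""
    n_sec = len(x_10hz) // 10
    return x_10hz[: n_sec * 10].reshape(n_sec, 10).max(axis=1)

# Hypothetical 2 Hz heart-rate track covering 3 s (illustrative values only)
t = np.arange(0, 3, 0.5)                      # 0.0, 0.5, ..., 2.5
hr = np.array([60.0, 62.0, 64.0, 63.0, 61.0, 60.0])
grid, hr_10hz = resample_to_10hz(t, hr, duration_s=3)
hr_per_second = one_second_max(hr_10hz)       # one value per second
```

Taking the window maximum rather than the mean keeps short peaks, such as a brief rise in airway pressure, visible at one-second resolution.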

Anesthesia management

The patients received balanced anesthesia using sevoflurane inhalation and a target-controlled remifentanil infusion or total intravenous anesthesia. For those who received balanced anesthesia, propofol was used for anesthesia induction with a bolus dose of 1.0–2.0 mg/kg, and anesthesia was maintained with sevoflurane and the effect-site target-controlled infusion of remifentanil. The sevoflurane concentration was usually maintained at 0.6–0.8 minimum alveolar concentration, while the target-controlled infusion of remifentanil was usually maintained at 1–4 ng/ml based on hemodynamic changes. In cases of total intravenous anesthesia, target-controlled infusions of propofol and remifentanil were used. Propofol concentrations were usually adjusted to maintain a bispectral index of 40–60, and remifentanil was maintained at 1–4 ng/ml based on hemodynamic changes. An infusion pump (Orchestra®, Base Primea with module DPS, Fresenius Kabi AG, Bad Homburg, Germany) was used for target-controlled infusion of remifentanil or propofol. At the end of the surgery, all anesthetics were discontinued, and anesthesia emergence and extubation were performed at the discretion of attending anesthesiologists after administering a reversal agent for the neuromuscular blocking agent. According to the institution’s policy, the emergence process was carried out by attending anesthesiologists or trainees under the direct supervision of attending anesthesiologists.

Outcome measurements

The primary outcome of the study was the time duration (in seconds) of cardiorespiratory instability during anesthesia emergence, which was defined by a composite outcome based on a combination of the following parameters: SpO₂, HR, and SBP. The duration of cardiorespiratory instability was quantified as the combined time during which any of the following conditions held: SpO₂ below 95%, or HR or SBP changing by more than 20% from their baseline values. The secondary outcomes included the time duration of each of the following variables: SpO₂ (<95%), HR (>20% changes from the baseline), SBP (>20% changes from the baseline), PIP (>20% changes from the baseline), and apnea time (E_TCO₂ <2 mmHg). We additionally included the following postoperative outcomes as secondary outcomes: cardiorespiratory parameters (SBP, HR, RR, and SpO₂), clinical outcomes (the length of hospital stay, length of PACU stay, and postoperative 30-day in-hospital mortality), arterial blood gas analysis (partial pressures of oxygen [PaO₂] and carbon dioxide [PaCO₂]), and chest X-ray results within 48 h after surgery. The specific threshold values for each parameter are presented in Supplementary Table 1.
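As an illustration of the composite primary outcome, the sketch below counts the seconds in which at least one criterion is met. It assumes one-second samples, interprets "changes" as deviations in either direction from baseline, and counts overlapping violations once; the function name and the example values are hypothetical, and the exact thresholds used in the study are those of Supplementary Table 1.

```python
import numpy as np

def instability_seconds(spo2, hr, sbp, hr_base, sbp_base):
    """Count one-second samples where any instability criterion is met."""
    spo2, hr, sbp = map(np.asarray, (spo2, hr, sbp))
    low_spo2 = spo2 < 95                                # SpO2 below 95%
    hr_dev = np.abs(hr - hr_base) > 0.2 * hr_base       # >20% from baseline
    sbp_dev = np.abs(sbp - sbp_base) > 0.2 * sbp_base   # >20% from baseline
    # A second counts once even if several criteria are violated together
    return int(np.sum(low_spo2 | hr_dev | sbp_dev))

# Illustrative five-second trace (not study data)
spo2 = [99, 96, 94, 93, 97]
hr = [70, 72, 80, 90, 71]
sbp = [120, 118, 125, 150, 121]
seconds = instability_seconds(spo2, hr, sbp, hr_base=70, sbp_base=120)
```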

The SHAP method, which is based on game theory and provides importance scores for each feature, has been used in the medical research field to present the interpretability of models²⁰. Our study also used this method to present how each feature in the state space was attributed to each policy, with 500 weak learners applied in the internal test dataset.

Markov decision process

The problem of optimal ventilation control during anesthesia emergence can be formulated as a Markov decision process (MDP) with state space $S\subseteq \mathbb{R}^{n}$, which consists of the collected features, and action space $A\subset \mathbb{R}$, which consists of mechanical or manual ventilation on ($a^{\mathrm{vent\,on}}:=1$) or off ($a^{\mathrm{vent\,off}}:=0$). The reward $R:S\times A\to \mathbb{R}$ depends on the current 2-tuple of state and action. Given a state $s\in S$, the policy is defined as a probability distribution over the action space $A$, $\pi(\cdot \mid s)\in \Delta A$, and $\rho \in \Delta S$ is the distribution of the initial state $s_{0}$. With transition probability $P(s_{t+1}\mid s_{t},a_{t})$, the probability of a $T$-step trajectory $\tau$ under a policy $\pi_{\theta}$ is defined as follows:

$$P(\tau \mid \pi_{\theta}) \triangleq \rho(s_{0}) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_{t}, a_{t})\, \pi_{\theta}(a_{t} \mid s_{t})$$

(1)

With the discounting factor $\gamma \in [0,1)$ for future rewards, our ventilation decision problem can be formulated as an MDP with the following value function:

$$V_{\pi}(s) = \mathbb{E}\left[R_{t} \mid s_{0}=s\right] = \mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^{t} r_{t} \,\middle|\, s_{0}=s\right]$$

(2)

Moreover, the action-value function (known as the Q-function), $Q_{\pi}:S\times A\to \mathbb{R}$, can be defined as follows:

$$Q_{\pi}(s,a) = \mathbb{E}\left[R_{t} \mid (s_{0},a_{0})=(s,a)\right] = \mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^{t} r_{t} \,\middle|\, (s_{0},a_{0})=(s,a)\right]$$

(3)
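Both the value function and the Q-function are expectations of the same quantity, the discounted sum of rewards along a trajectory. The following minimal sketch computes that sum for a single trajectory; the reward sequence and the discount factor are hypothetical values chosen for illustration, not those used in the study.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite trajectory (t = 0 .. T-1)."""
    total = 0.0
    # Accumulate backwards so each step applies one more factor of gamma
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Hypothetical three-step reward sequence and discount factor
g = discounted_return([-1.0, 0.0, -2.0], gamma=0.5)
# -1.0 + 0.5 * 0.0 + 0.25 * (-2.0) = -1.5
```

Averaging this quantity over many trajectories that start in the same state (or the same state-action pair) gives a Monte Carlo estimate of the corresponding value.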

Reinforcement learning model

Offline reinforcement learning has emerged as an alternative to the typical online setting for reinforcement learning algorithms, as it can use a fully fixed dataset of trajectories without any further interactions with the environment²¹. This offline setting is suitable for the medical field as it enables using existing datasets generated by clinicians’ decision-making for real-world patients. Moreover, offline reinforcement learning does not pose any risk to the patients. However, recent studies have shown that conventional reinforcement learning algorithms yield poor performance in offline settings due to extrapolation errors, in which values are estimated for state-action pairs that are not included in the existing dataset^22,23. Therefore, we adopted conservative Q-learning, which learns a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value, reducing overestimation for out-of-distribution actions²⁴. This algorithm has yielded better performance than conventional reinforcement learning algorithms and has been applied in a few medical tasks, including optimizing mechanical ventilation control or sepsis treatment strategy for intensive care unit patients^25,26. Each definition used in our reinforcement learning model is described in the following sections.
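To make the idea of a conservative Q-function concrete, the sketch below shows one commonly used form of the conservative Q-learning objective for discrete actions: a Bellman error term plus a penalty that lowers Q-values of actions not taken in the logged data. This is an assumption for illustration only; the text does not specify which variant, network, or weighting (`alpha` here is a placeholder) was used for AIVE.

```python
import numpy as np

def cql_loss(q_values, actions, td_targets, alpha=1.0):
    """Conservative Q-learning loss for a batch with discrete actions.

    q_values:   (batch, n_actions) current Q estimates
    actions:    (batch,) indices of the actions taken in the logged data
    td_targets: (batch,) Bellman targets, e.g. r + gamma * max_a' Q_target(s', a')
    """
    q_values = np.asarray(q_values, dtype=float)
    td_targets = np.asarray(td_targets, dtype=float)
    idx = np.arange(len(actions))
    q_data = q_values[idx, actions]
    # Numerically stable log-sum-exp over the action dimension
    m = q_values.max(axis=1)
    lse = m + np.log(np.exp(q_values - m[:, None]).sum(axis=1))
    conservative = np.mean(lse - q_data)   # penalizes high Q on unseen actions
    bellman = 0.5 * np.mean((q_data - td_targets) ** 2)
    return alpha * conservative + bellman

# Two actions (ventilation off = 0, on = 1), a single illustrative sample
loss = cql_loss(q_values=[[0.0, 0.0]], actions=[1], td_targets=[0.0])
```

The penalty term is zero only when the logged action dominates the others, which is what discourages the learned policy from preferring actions the clinicians never took in similar states.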

The T-step trajectory was defined by the emergence duration, that is, the time from the beginning of waking the patient from general anesthesia to the end of the E_TCO₂ monitoring. The start of emergence was defined as increasing F_IO₂ to 70% or higher. The minimum and maximum bounds of the length of the T-step trajectory were determined considering both the training stability of the reinforcement learning model and the usual length of the anesthesia recovery. We categorized the patients’ ventilation status (ventilation-dependent or ventilation-independent) and extubation status (intubated or extubated) into two states each. The ventilation-dependent status indicates that the lungs were mechanically ventilated by the anesthesia ventilator or manually ventilated by clinicians. In contrast, the ventilation-independent status indicates that the patient had spontaneous breathing, not requiring ventilation support by the anesthesia ventilator or clinicians. Lastly, intubated or extubated status represented whether the endotracheal tube was present in the trachea.

The state space consists of the following 10 features at time t: effect-site concentrations of propofol and remifentanil; end-expiratory pressure of sevoflurane; PIP; tidal volume; the moving-averaged AWP and E_TCO₂ within 6 s; HR; SpO₂; SBP; the presence of spontaneous breathing; the cumulative apnea time, defined as the time with E_TCO₂ <2 mmHg after turning the mechanical ventilator off; and the current ventilation and extubation status (ventilation-dependent/independent and intubated/extubated status). The presence of spontaneous breathing was detected through abrupt changes in airway pressure. These changes indicate sudden increases (at least 5 cmH₂O higher than the previous maximum AWP within 15–30 s) or sudden decreases in AWP (at least 3 cmH₂O lower than the PEEP setting) due to the patient-ventilator asynchrony caused by spontaneous breathing. The earliest time point that met the criteria was defined as the moment when spontaneous breathing returned. Furthermore, apnea time was accumulated while E_TCO₂ was lower than 2 mmHg. The action variable was selected at each time $t$ from two discrete action candidates: ventilation (a^{vent on}) or non-ventilation (a^{vent off}).
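Two of the derived state features, the 6-s moving average and the cumulative apnea time, can be sketched as follows. The function names are hypothetical, the inputs are assumed to be one-second samples, and the rule that the apnea counter resets when E_TCO₂ recovers or ventilation is switched on is an assumption, since the reset behavior is not stated in the text.

```python
import numpy as np

def moving_average(x, window=6):
    """Trailing mean over the last `window` one-second samples."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = x[max(0, i - window + 1): i + 1].mean()
    return out

def cumulative_apnea(etco2, vent_on, threshold=2.0):
    """Seconds of ongoing apnea (ETCO2 < threshold) while ventilation is off.

    Assumption: the counter resets whenever ETCO2 recovers or ventilation
    is switched on; the source does not state the reset rule.
    """
    count, out = 0, []
    for e, on in zip(etco2, vent_on):
        count = count + 1 if (e < threshold and not on) else 0
        out.append(count)
    return np.array(out)

# Illustrative traces (not study data)
etco2 = [35, 1, 1, 1, 30, 1]
vent_on = [1, 0, 0, 0, 0, 0]
apnea = cumulative_apnea(etco2, vent_on)
awp_avg = moving_average([10, 12, 14], window=2)
```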

Reward function

The AIVE was designed to maintain cardiorespiratory stability, adequate oxygenation, and decarboxylation during emergence from general anesthesia. Therefore, the reward $r_t$ was defined by the penalties from cardiorespiratory parameters ($r^{\mathrm{CR}}$) in the next time step. Specifically, $r^{\mathrm{CR}}$ consists of penalties for oxygen saturation below 97% ($v^{\mathrm{SPO2}}$), cumulated apnea time over 6 s ($v^{\mathrm{apnea}}$), and HR ($v^{\mathrm{HR}}$), SBP ($v^{\mathrm{SBP}}$), and PIP ($v^{\mathrm{PIP}}$) showing a 20% increase from the baseline condition, defined as the values averaged over 10 s when F_IO₂ begins to rise. Lastly, the reward system was balanced using four constants ($\alpha_k, k \in \{1,2,3,4\}$) chosen through anesthesiology experts’ knowledge, and the lower bound of $r_t$ was set at −20, which was the first quantile of elements in reward space, as shown below:

$$r_t = \max\left(-20,\; -r_t^{\mathrm{CR}}\right) \tag{4}$$

$$r_t^{\mathrm{CR}}(s,a) := v^{\mathrm{apnea}} + \alpha_1 \cdot v^{\mathrm{SPO2}} + \alpha_2 \cdot v^{\mathrm{HR}} + \alpha_3 \cdot v^{\mathrm{SBP}} + \alpha_4 \cdot v^{\mathrm{PIP}} \quad \text{for all } (s,a) \in S \times A \tag{5}$$
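Equations (4) and (5) translate directly into a few lines of Python. The sketch below is illustrative only: the weights α₁ to α₄ are not reported in this section, so the default `alphas` of 1.0 are placeholders, and the penalty terms are assumed to be non-negative magnitudes computed upstream.

```python
from typing import Sequence


def reward(
    v_apnea: float,
    v_spo2: float,
    v_hr: float,
    v_sbp: float,
    v_pip: float,
    alphas: Sequence[float] = (1.0, 1.0, 1.0, 1.0),  # placeholder weights
    floor: float = -20.0,
) -> float:
    """Reward r_t = max(-20, -r_CR), following Eqs. (4) and (5)."""
    a1, a2, a3, a4 = alphas
    # Eq. (5): weighted sum of cardiorespiratory penalties.
    r_cr = v_apnea + a1 * v_spo2 + a2 * v_hr + a3 * v_sbp + a4 * v_pip
    # Eq. (4): negate the penalty and clip it at the lower bound.
    return max(floor, -r_cr)
```

With the placeholder weights, a step with `v_apnea = 2` and `v_spo2 = 1` yields a reward of −3, and any total penalty above 20 is clipped to −20.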

Performance evaluation

To compare the performance of the AIVE’s policy with that of the clinicians’ policy, we adopted the fitted-Q-evaluation (FQE) method with bootstrapping to provide a confidence interval for each policy across 300 different models^27,28. The derivation cohort dataset was randomly divided into a training set (85%) and a testing set (15%) for each model. Three hundred models were built from different random splits of the training dataset (82.3%) and evaluated on the remaining validation set (17.7%); the learning scheme was identical for every model. All training was conducted on an NVIDIA RTX A6000 GPU.
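The core idea of FQE can be shown with a tabular Python sketch. It is a deliberate simplification and not the method as implemented in the study, which this section does not detail: the table lookup stands in for whatever function approximator was used, and the function name, the transition tuple layout `(s, a, r, s_next, done)`, the discount factor `gamma`, and the iteration count are assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Tuple

Transition = Tuple[Hashable, Hashable, float, Hashable, bool]


def fitted_q_evaluation(
    transitions: Iterable[Transition],
    policy: Callable[[Hashable], Hashable],
    gamma: float = 0.99,
    n_iters: int = 100,
) -> Dict[Tuple[Hashable, Hashable], float]:
    """Estimate Q^pi for a fixed target policy from logged transitions."""
    data = list(transitions)
    q: Dict[Tuple[Hashable, Hashable], float] = defaultdict(float)
    for _ in range(n_iters):
        sums: Dict[Tuple[Hashable, Hashable], float] = defaultdict(float)
        counts: Dict[Tuple[Hashable, Hashable], int] = defaultdict(int)
        for s, a, r, s_next, done in data:
            # Bootstrapped target uses the action the *evaluated* policy
            # would take in the next state, not the logged action.
            target = r if done else r + gamma * q[(s_next, policy(s_next))]
            sums[(s, a)] += target
            counts[(s, a)] += 1
        # "Fit" step: in the tabular case, regression is a per-pair mean.
        q = defaultdict(float, {k: sums[k] / counts[k] for k in sums})
    return q
```

The value of a policy is then the average of `q[(s0, policy(s0))]` over initial states. Resampling whole trajectories with replacement and repeating the procedure gives a bootstrap distribution from which confidence bounds can be read.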

Using the FQE method, we compared the 95% lower bound of the estimated return of the AIVE’s policy with the 95% upper bound of the clinicians’ rewards to evaluate our new policy conservatively, as suggested by previous RL studies^11,14,29. Finally, the model that maximized the 95% lower bound of the AIVE’s policy was selected for further outcome measurement.
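The conservative comparison and model selection can be sketched as follows. This is an illustration under assumptions, not the study's code: the bounds are taken as the 5th and 95th percentiles of bootstrap estimates (the text does not say whether the bounds are one-sided or two-sided), and the names `percentile` and `select_model` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile, with q in [0, 100]."""
    xs = sorted(values)
    if not xs:
        raise ValueError("values must be non-empty")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def select_model(
    model_estimates: Dict[str, List[float]],
    clinician_estimates: List[float],
    lower_q: float = 5.0,   # assumed percentile for the lower bound
    upper_q: float = 95.0,  # assumed percentile for the upper bound
) -> Tuple[str, float, float, bool]:
    """Pick the model with the highest lower bound and compare it
    against the upper bound of the clinicians' policy."""
    lower = {name: percentile(v, lower_q) for name, v in model_estimates.items()}
    best = max(lower, key=lower.get)
    clinician_upper = percentile(clinician_estimates, upper_q)
    return best, lower[best], clinician_upper, lower[best] > clinician_upper
```

Comparing a lower bound against an upper bound is stricter than comparing means: the learned policy is preferred only when even its pessimistic estimate exceeds the optimistic estimate for clinicians.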

Statistical analysis

Python 3.8.0 (Python Software Foundation, Wilmington, DE, USA) was used for signal preprocessing, model development and validation, statistical testing, and visualization. Statistical analyses of primary and secondary outcomes were conducted using Kendall’s rank correlation for continuous outcomes and the point-biserial correlation for categorical outcomes. All statistics for continuous variables were reported with point estimates and 95% confidence intervals, and those for categorical variables were reported with counts (frequencies) or proportions. The original significance level was set at 0.05. The Bonferroni correction was applied to account for multiple comparisons across one primary and 24 secondary outcomes. Therefore, a P-value <0.002 (0.05/25) was considered statistically significant.
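The corrected threshold follows from dividing the original significance level by the number of outcomes tested. A short Python check of the arithmetic, with a hypothetical helper name:

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# One primary plus 24 secondary outcomes gives 25 comparisons.
threshold = bonferroni_threshold(0.05, 1 + 24)
```

Here `threshold` evaluates to 0.002 (up to floating-point rounding), matching the cutoff stated in the text.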

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The public dataset to run the code for this study is available at https://vitaldb.net/. The data supporting this study’s findings are also available from the corresponding author upon reasonable request.

Code availability

The code to generate the result of this study can be accessed at https://github.com/HyeonhoonLee/AIVE.

References

1. Cascella, M., Bimonte, S. & Muzio, M. R. Towards a better understanding of anesthesia emergence mechanisms: research and clinical implications. World J. Methodol. 8, 9–16 (2018).
2. Brown, E. N., Lydic, R. & Schiff, N. D. General anesthesia, sleep, and coma. N. Engl. J. Med. 363, 2638–2650 (2010).
3. Benham-Hermetz, J. & Mitchell, V. Safe tracheal extubation after general anaesthesia. BJA Educ. 21, 446–454 (2021).
4. Lavin, A. et al. Technology readiness levels for machine learning systems. Nat. Commun. 13, 6039 (2022).
5. Yuba, M. & Iwasaki, K. Systematic analysis of the test design and performance of AI/ML-based medical devices approved for triage/detection/diagnosis in the USA and Japan. Sci. Rep. 12, 16874 (2022).
6. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
7. Bennett, C. C. & Hauser, K. Artificial intelligence framework for simulating clinical decision-making: a Markov decision process approach. Artif. Intell. Med. 57, 9–19 (2013).
8. Chang, H., Yu, J. Y., Yoon, S., Kim, T. & Cha, W. C. Machine learning-based suggestion for critical interventions in the management of potentially severe conditioned patients in emergency department triage. Sci. Rep. 12, 10537 (2022).
9. Schamberg, G., Badgeley, M., Meschede-Krasa, B., Kwon, O. & Brown, E. N. Continuous action deep reinforcement learning for propofol dosing during general anesthesia. Artif. Intell. Med. 123, 102227 (2022).
10. Zhang, K. et al. An interpretable RL framework for pre-deployment modeling in ICU hypotension management. NPJ Digit. Med. 5, 173 (2022).
11. Peine, A. et al. Development and validation of a reinforcement learning algorithm to dynamically optimize mechanical ventilation in critical care. NPJ Digit. Med. 4, 1–12 (2021).
12. Liu, N. et al. Learning the dynamic treatment regimes from medical registry data through deep Q-network. Sci. Rep. 9, 1495 (2019).
13. Liu, M., Shen, X. & Pan, W. Deep reinforcement learning for personalized treatment recommendation. Stat. Med. 41, 4034–4056 (2022).
14. Saria, S. Individualized sepsis treatment using reinforcement learning. Nat. Med. 24, 1641–1642 (2018).
15. Prasad, N., Cheng, L.-F., Chivers, C., Draugelis, M. & Engelhardt, B. E. A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. CoRR. Preprint at https://arxiv.org/abs/1704.06300 (2017).
16. Yu, C., Liu, J. & Zhao, H. Inverse reinforcement learning for intelligent mechanical ventilation and sedative dosing in intensive care units. BMC Med. Inform. Decis. Mak. 19, 57 (2019).
17. Prasad, N. et al. Guiding efficient, effective, and patient-oriented electrolyte replacement in critical care: an artificial intelligence reinforcement learning approach. J. Pers. Med. 12, 661 (2022).
18. Nath, S. et al. Reinforcement learning in ophthalmology: potential applications and challenges to implementation. Lancet Digit. Health 4, e692–e697 (2022).
19. Lee, H. C. & Jung, C. W. Vital Recorder-a free research tool for automatic recording of high-resolution time-synchronised physiological data from multiple anaesthesia devices. Sci. Rep. 8, 1527 (2018).
20. Lundberg, S. M. et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat. Biomed. Eng. 2, 749–760 (2018).
21. Levine, S., Kumar, A., Tucker, G. & Fu, J. Offline reinforcement learning: tutorial, review, and perspectives on open problems. In Conference on Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2005.01643 (2020).
22. Fujimoto, S., Meger, D. & Precup, D. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning 2052–2062 (PMLR, 2019).
23. Agarwal, R., Schuurmans, D. & Norouzi, M. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning 104–114 (PMLR, 2020).
24. Kumar, A., Zhou, A., Tucker, G. & Levine, S. Conservative Q-learning for offline reinforcement learning. Adv. Neural Inf. Process. Syst. 33, 1179–1191 (2020).
25. Kondrup, F. et al. Towards safe mechanical ventilation treatment using deep offline reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence 37, 15696–15702 (2023).
26. Kaushik, P., Kummetha, S., Moodley, P. & Bapi, R. S. A conservative Q-learning approach for handling distribution shift in sepsis treatment strategies. In Bridging the Gap: from Machine Learning Research to Clinical Practice Workshop at the 35th Conference on Neural Information Processing Systems (NeurIPS). Preprint at https://arxiv.org/abs/2203.13884 (Sydney, Australia, 2021).
27. Fu, J. et al. Benchmarks for deep off-policy evaluation. In International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=kWSeGEeHvF8 (2021).
28. Hao, B. et al. Bootstrapping fitted Q-evaluation for off-policy inference. In International Conference on Machine Learning 4074–4084 (PMLR, 2021).
29. Tang, S. & Wiens, J. Model selection for offline reinforcement learning: practical considerations for healthcare settings. In Machine Learning for Healthcare Conference 2–35 (PMLR, 2021).

Acknowledgements

This work was supported by a grant from the MD-PhD/Medical Scientist Training Program through the Korea Health Industry Development Institute (KHIDI); the National Research Foundation of Korea (NRF) grant, funded by the Ministry of Science and ICT, Republic of Korea (NRF-2020R1C1C1014905); the NRF grant funded by the MSIT, Republic of Korea (No. 2022R1C1C1012753); and the Korea Health Technology R&D Project through the KHIDI, funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI21C1074).

Author information

These authors contributed equally: Hyeonhoon Lee, Hyun-Kyu Yoon, Jaewon Kim.

Authors and Affiliations

Department of Anesthesiology and Pain Medicine, Seoul National University Hospital, Seoul, Republic of Korea
Hyeonhoon Lee
Biomedical Research Institute, Seoul National University Hospital, Seoul, Republic of Korea
Hyeonhoon Lee
Department of Anesthesiology and Pain Medicine, Seoul National University College of Medicine, Seoul National University Hospital, Seoul, Republic of Korea
Hyun-Kyu Yoon & Hyung-Chul Lee
Center for Digital Health, Medical Science Research Institute, Kyung Hee University Medical Center, Kyung Hee University College of Medicine, Seoul, Republic of Korea
Jaewon Kim
Department of Pediatrics, Seoul National University College of Medicine, Seoul National University Hospital, Seoul, Republic of Korea
Ji Soo Park
Department of Anesthesiology and Pain Medicine, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
Chang-Hoon Koo
Department of Anesthesiology and Pain Medicine, SMG-SNU Boramae Medical Center, Seoul National University College of Medicine, Seoul, Republic of Korea
Dongwook Won

Contributions

H.L., H.K.Y., and J.K. contributed equally to this work as co-first authors. H.L., J.K., H.K.Y., H.C.L., J.P., C.H.K., and D.W. contributed substantially to the study conception and design, data acquisition, and data analysis. H.L., J.K., H.K.Y., and H.C.L. participated in drafting the article or revising it critically for important intellectual content. All authors gave final approval of the version to be published.

Corresponding author

Correspondence to Hyung-Chul Lee.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Lee, H., Yoon, HK., Kim, J. et al. Development and validation of a reinforcement learning model for ventilation control during emergence from general anesthesia. npj Digit. Med. 6, 145 (2023). https://doi.org/10.1038/s41746-023-00893-w

Received: 04 February 2023
Accepted: 03 August 2023
Published: 14 August 2023
DOI: https://doi.org/10.1038/s41746-023-00893-w
