Introduction

Sepsis is a life-threatening syndrome triggered by the body's chain reaction to an infection. Without timely intervention and treatment, it can lead to tissue damage, metabolic dysfunction, and acute organ failure1. Almost any infection, including COVID-19, can lead to sepsis. Approximately 30% of patients diagnosed with severe sepsis do not survive2. According to the international sepsis guidelines, fluid resuscitation and vasopressors are often administered to contain the infection, and their doses should be adjusted according to dynamic measurements of disease progression3,4. However, determining the optimal choice and dose of fluid and vasopressor therapy remains challenging in clinical practice, particularly given individual differences among patients, and tools for personalized real-time decision support on sepsis treatment are lacking. Recent advances in electronic medical records (EMR)5 have provided an unprecedented opportunity to capture the evolution of patient health status and design cost-effective treatment plans6. Consequently, data-driven and artificial intelligence (AI) approaches, including supervised learning (SL)7,8,9 and reinforcement learning (RL)10,11,12, have been widely explored to assist clinical decision making13.

Dynamic treatment decision making for sepsis is naturally a Markov Decision Process (MDP)14. Komorowski et al. developed an RL approach based on the SARSA (State-Action-Reward-State-Action) algorithm15 to provide personalized treatment decisions for adult sepsis patients in the intensive care unit (ICU)16. Here, the action is the dose of intravenous fluids and vasopressors, and the dynamic patient health status is the state, which can be inferred from the physiological data. Yet, this method is limited to a discrete state space and is not amenable to continuously evolving physiological status17. To address this limitation and avoid the curse of dimensionality in Q learning, approximation of the Q value has been extensively investigated in value-based deep reinforcement learning (DRL) algorithms, such as Deep Q-Network (DQN)18, Double Deep Q-Network (DDQN)19, Dueling Deep Q-Network (Dueling DQN)20 and Dueling Double Deep Q-Network (D3QN)17. The Q value function quantifies the value of performing a given action in a given state. Its value for the next state after taking an action is denoted as the target Q value, and an accurate estimation of the target Q value is crucial to policy improvement. However, if the estimation of the target Q value is inaccurate, overestimation or underestimation is likely to occur. For instance, the Dueling DQN structure follows the maximum target values and uses the same parameters in the main and target networks to both select and evaluate an action, which tends to cause overestimation21. The D3QN structure possesses two neural networks with two separate sets of weights: the main network selects the optimal action, and then the target network computes the corresponding Q value for that action. We note that the D3QN structure often selects sub-optimal actions for the target network, which tends to underestimate the target Q value22,23.

In spite of the recent leap forward in AI-boosted smart healthcare, it remains challenging for AI to outperform experienced clinicians in the diagnosis and treatment of a variety of diseases, including sepsis. Indeed, AI-derived systems cannot replace the physician in the clinical management of sepsis. Blind trust in AI algorithms for healthcare decision making without clinician supervision has led to increased medical risks and safety issues24,25. Notably, AI and data-driven models suffer from biases in the data and in model building, and may consequently produce treatment recommendations that run counter to clinical practice. To this end, hybrid systems of SL and RL that capitalize on the availability of large-scale EMR have been proposed and are capable of providing reliable medical recommendations26. Nevertheless, the use of SL not only increases the computational complexity but also limits the self-adaptiveness of the RL agent in pursuing long-term reward27. In addition, most existing studies on sepsis treatment hinge on a large number of features extracted from EMR, including blood glucose and white blood cell count, which can further impinge on the performance and interpretability of AI models. Thus, eliminating redundancy and singling out the most representative features are vital for the RL agent to form precise perceptions. Therefore, we aim to integrate a DRL model with human expertise and identify a critical subset of clinical features in sepsis toward reliable and more clinically interpretable decision making for sepsis treatment.

More specifically, we propose a Weighted Dueling Double Deep Q-Network with embedded human Expertise (WD3QNE) to aid real-time sepsis treatment. The algorithm architecture, comprising feature selection, trajectory construction, and agent model training, is shown in Fig. 1. The innovations are as follows: (1) We develop a novel target Q value function with adaptive dynamic weight, which improves the accuracy of target Q value estimation and results in a higher-precision reinforcement learning model. The method makes a trade-off between Dueling DQN overestimation and D3QN underestimation in value estimation. It is worth noting that this method can be easily generalized to other value-based DRL methods. (2) An AI platform is constructed that integrates human expertise with the DRL model: the human expertise provides guidance for AI and ensures higher efficiency and reliability in sepsis treatment. We offer novel insights for incorporating human expertise into DRL. (3) The important features of clinical relevance for septic patients are selected by a random forest algorithm. We eliminate statistical redundancy among commonly used clinical and biological features and enhance the clinical interpretability of DRL. We also compare WD3QNE with other widely used value-based DRL methods, including DQN, DDQN, D3QN, and WD3QN, in terms of expected return, survival rate and action distribution for sepsis treatment using the MIMIC-III dataset28. We demonstrate that the WD3QNE policy outperforms human clinicians and other value-based DRL methods and achieves the highest survival rate. We further compare the drug intervention distributions of a pure AI and an AI with embedded human expertise. In addition, to explore the generality of the proposed target Q value function with adaptive dynamic weight, we use the OpenAI Gym LunarLander-v2 environment29 to validate our model's performance (see Supplementary Note 1).

Fig. 1: Architecture of WD3QNE algorithm.
figure 1

a The dynamic treatment process of the WD3QNE agent for sepsis. The continuous state space and discrete action space are then constructed. The DRL agent takes actions based on the current state and clinician expertise. b WD3QNE algorithm structure.

Results

Q value function

In this study, we utilize a dueling deep neural network framework for both the main network and the target network to approximate the Q function, \(Q\left( {S_t,a_t} \right)\). The dueling network structure has two streams that separate the estimate of the state value V and the advantage A of each action30, so that \(Q\left( {S_t,a_t} \right) = V\left( {S_t} \right) + A\left( {S_t,a_t} \right)\). The dueling architecture learns which states are (or are not) valuable without having to learn the value of each action in each state. Here, the state St represents the health status of the patient and the action at is the prescribed dose of intravenous fluids and vasopressors at time t. The agent takes an action at in the current state St and transitions to the next state St+1. In Dueling DQN, the target Q values are derived from the target network under the state \(S_{t + 1}\) at time t + 1. The maximum Q value is then selected, which tends to result in overestimation30. In D3QN, the action is first determined by the main network, and then the target Q value is obtained from the target network, which tends to result in underestimation31. Hence, the target Q value estimation in both Dueling DQN and D3QN can be inaccurate, largely owing to the uncertainty of the action in the next state. To obtain a more accurate target Q value estimate, we design a novel target Q value function with adaptive dynamic weight p (Eq. (1)) that realizes a trade-off between Dueling DQN and D3QN to derive the optimal policy:

$$Q\left(S_{t+1},a_{t+1}\right) = p \times \max_{a_{t+1}} Q\left(S_{t+1},a_{t+1};\omega^-\right) + \left(1-p\right) \times Q\left(S_{t+1},\mathop{\mathrm{argmax}}\limits_{a_{t+1}} Q\left(S_{t+1},a_{t+1};\omega\right);\omega^-\right)$$
(1)

where \(\omega\) are the parameters of the main network and \(\omega^-\) are the parameters of the target network. The adaptive dynamic weight p (Eq. (2)) is calculated as:

$$p = \frac{{\varphi _{a_{t + 1}}}}{{\varphi _{a_{t + 1}} + \sigma _{a_{t + 1}}}}$$
(2)

Here, \(\varphi _{a_{t + 1}}\) is the maximum target Q value divided by the summation of the target Q value under all possible actions (Eq. (3)), and the target Q values are obtained from the Dueling DQN method. Similarly, the dynamic parameter \(\sigma _{a_{t + 1}}\) is obtained from the D3QN method (Eq. (4)).

$$\varphi_{a_{t+1}} = \frac{\max\limits_{a_{t+1}} Q\left(S_{t+1},a_{t+1};\omega^-\right)}{\sum_{a_{t+1}} Q\left(S_{t+1},a_{t+1};\omega^-\right)}$$
(3)
$$\sigma_{a_{t+1}} = \frac{Q\left(S_{t+1},\mathop{\mathrm{argmax}}\limits_{a_{t+1}} Q\left(S_{t+1},a_{t+1};\omega\right);\omega^-\right)}{\sum_{a_{t+1}} Q\left(S_{t+1},\mathop{\mathrm{argmax}}\limits_{a_{t+1}} Q\left(S_{t+1},a_{t+1};\omega\right);\omega^-\right)}$$
(4)

We use the adaptive dynamic weight to balance the target Q value estimates of the two methods, so that the approximated target Q value is closer to an unbiased estimate. The Q value function \(Q\left( {S_t,a_t} \right)\) is the expected cumulative reward of taking action at in state St and following the policy thereafter. To estimate the Q value of the current state St, we add the reward for performing action at to the discounted target Q value \(Q\left( {S_{t + 1},a_{t + 1}} \right)\). Finally, the Q value function (Eq. (5)) is obtained.

$$Q\left( {S_t,a_t} \right) = r + \gamma Q\left( {S_{t + 1},a_{t + 1}} \right)$$
(5)

where r is the reward after performing an action in the state St (see Reward function), and γ is the discount factor.
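
To make the target computation concrete, the following PyTorch-style sketch shows one way Eqs. (1)-(5) could be evaluated for a batch of transitions. It is an illustrative sketch, not the authors' implementation: the tensor names, the terminal-state (`done`) masking, and the shared normalization used for Eqs. (3) and (4) are our assumptions.

```python
import torch

def weighted_target(q_next_main, q_next_target, reward, done, gamma=0.99):
    """Sketch of Eqs. (1)-(5) for a batch of transitions.

    q_next_main   : Q(S_{t+1}, .; w)   from the main network,   shape [batch, n_actions]
    q_next_target : Q(S_{t+1}, .; w^-) from the target network, shape [batch, n_actions]
    reward, done  : shape [batch]
    """
    # Dueling DQN-style target: max over the target network (tends to overestimate)
    dueling_q, _ = q_next_target.max(dim=1)

    # D3QN-style target: action chosen by the main network, evaluated by the
    # target network (tends to underestimate)
    greedy_a = q_next_main.argmax(dim=1, keepdim=True)
    d3qn_q = q_next_target.gather(1, greedy_a).squeeze(1)

    # Adaptive dynamic weight p (Eqs. (2)-(4)); both terms are read here as being
    # normalized by the sum of target-network Q values over all next actions
    denom = q_next_target.sum(dim=1)
    phi = dueling_q / denom        # Eq. (3)
    sigma = d3qn_q / denom         # Eq. (4), under the normalization assumed above
    p = phi / (phi + sigma)        # Eq. (2)

    # Weighted target (Eq. (1)) bootstrapped into the Q target (Eq. (5));
    # masking terminal states with `done` is a standard detail assumed here
    next_q = p * dueling_q + (1.0 - p) * d3qn_q
    return reward + gamma * (1.0 - done) * next_q
```

Note that when the two estimates coincide, φ = σ and p reduces to 0.5, so the weight only departs from an even blend when the two target estimates disagree.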

Additionally, the personalized treatment of sepsis is a complex puzzle for clinical management14, and it is crucial to ensure the reliability and safety of therapeutic interventions under personalized treatment planning. Nonetheless, a DRL agent only interacts with the environment to seek the optimal actions with high reward, regardless of the potential risks. It has been noted that certain AI-induced actions can carry high risk and lead to contentious medical solutions26, which has significantly stymied the broad adoption of AI in healthcare management. On the other hand, human experts maintain an edge over AI in abstract reasoning under ambiguous conditions. Thus, keeping humans in the loop for critical decision making has been emphasized in a host of industry domains. Here, we guide the DRL agent to perform actions by incorporating human expertise. Raghu et al. found that for sepsis patients with mild symptoms, the more similar a pure AI policy is to a clinician's policy, the greater the patient's survival rate. Thus, human clinicians are more reliable than pure RL agents in this scenario, which is partially owing to the fact that human clinicians are more cautious about other issues including individualized health status and drug interactions. Interestingly, such disparity does not exist for patients with severe symptoms. For patients with severe symptoms, the optimal treatment strategy is still in its infancy31, and little human expertise is available for comparison or to guide the AI. Komorowski et al. analyzed the drug dose distribution and found that the AI policy tended to give high doses of the vasopressor32. Particularly, in the latest guideline on sepsis management, an initial target value of 65 mm Hg for the mean arterial pressure (MAP) has been suggested in lieu of the previously recommended 72.6 mm Hg3. That said, a high dose of vasopressors is no longer favored in the initial stage. Furthermore, Raghu et al. divided the Sequential Organ Failure Assessment (SOFA) scores into three levels (<5, 5–15, and >15) to evaluate model performance for different severity subcohorts17. Here, we employ human clinician expertise at the lowest SOFA level, together with the patient outcome, to estimate the target Q value function and guide the agent. We propose the Q value function of clinician expertise (Eq. (6)):

$$Q^{clin}\left( {S_t,a_t^{clin}} \right) = r + \gamma Q^{clin}\left( {S_{t + 1},a_{t + 1}^{clin};\omega ^ - } \right)$$
(6)

Accordingly, if SOFA is < 5, we use the Q value function of clinician expertise, otherwise the novel Q value function is leveraged. The Q value function of WD3QNE algorithm is given by:

$$Q^{WD3QNE} = \begin{cases} Q^{clin}\left(S_t,a_t^{clin}\right) & \text{if } SOFA < 5 \\ Q\left(S_t,a_t\right) & \text{otherwise} \end{cases}$$
(7)
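
As a minimal sketch, the per-sample switch of Eq. (7) can be written with a single `torch.where`; the tensor names, and the assumption that the clinician target is built from the action recorded in the trajectory (\(a_t^{clin}\)), are illustrative rather than taken from the authors' code.

```python
import torch

def wd3qne_target(sofa, clin_target, weighted_target, threshold=5.0):
    """Eq. (7): use the clinician-expertise target (Eq. (6)) for mild sepsis
    (SOFA < 5) and the adaptively weighted target (Eq. (5)) otherwise."""
    return torch.where(sofa < threshold, clin_target, weighted_target)
```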

Survival rate and safety rate

We first calculate the expected return based on the double robust off-policy value evaluation using the MIMIC-III dataset28. We choose several value-based DRL algorithms for comparison with our WD3QNE: DQN22 combines Q learning with a deep neural network; DDQN23 is a variant of deep Q learning with two neural networks, a main network and a target network; D3QN31 is DDQN combined with Dueling DQN; and the Weighted Dueling Double Deep Q-Network (WD3QN) introduces the target Q value function with adaptive dynamic weight into D3QN but does not use human expertise. Compared to the other value-based DRL methods, WD3QN additionally adopts the weighted target Q value function to revise the Q value function. We divide the MIMIC-III dataset into a training set (80%), a validation set (10%) and a test set (10%). Each algorithm is run 30 times.

We obtain the survival rate according to the return value (see Methods). The expected return and the survival rate on the test dataset are shown in Table 1. The results show that the AI policy has a higher survival rate than the human clinician's policy. The feature selection process improves the performance of the algorithms: all models with feature selection (37 features) achieve better performance than the same models without feature selection (45 features). With 37 features, it is noteworthy that WD3QNE obtains the highest survival rate of 97.81% with the lowest standard deviation of 0.0012. The survival rate of the human clinician's policy is 83.26% with an expected return of 14.11, the survival rate of the D3QN policy is 96.48% with an expected return of 22.27, and the survival rate of the WD3QN policy is 97.49% with an expected return of 23.08. Overall, compared to the human clinicians, WD3QNE improves the survival rate by 17.5%. The WD3QNE survival rate is also an improvement of 1.38% over D3QN and 0.32% over WD3QN.

Table 1 Off-policy evaluation performance of baselines in the test set.

Furthermore, in Fig. 2, we present the expected return of the different algorithms at each learning epoch on the validation set. The WD3QNE expected return converges and stabilizes around a value of 24. Our proposed method outperforms the other baseline methods. It is noteworthy that WD3QNE with human expertise achieves better performance than WD3QN without human expertise. Additionally, although the DQN algorithm converges fastest in the early period, it converges to a local optimum.

Fig. 2: Expected return of different algorithms at each learning epoch.
figure 2

The value-based DRL algorithms are run for 100 epochs on the validation set with feature selection (37 observation features) and without feature selection (45 observation features). "37" denotes the 37 observation features selected with the random forest algorithm; "45" denotes all 45 observation features. Although the DQN algorithm converges fast in the beginning, it exhibits premature convergence.

Action distribution

For further interpretation, we examine the optimal policies derived from three representative methods (human clinician, D3QN and WD3QNE). The action distribution of the clinician policy is taken from the MIMIC-III set. As shown in Fig. 3, the clinician uses low doses of vasopressors, while D3QN uses higher doses of vasopressors than the clinician. The AI policy is clearly very different from that of the clinician. For mild sepsis, when we introduce human expertise to the AI agent, the WD3QNE policy uses lower doses of vasopressors than the pure AI policy. Although vasopressors are commonly used in the ICU to increase MAP, most sepsis patients do not need high doses of vasopressors3. The WD3QNE model provides personalized treatment decisions based on the patient's dynamic response.

Fig. 3: Action distribution for the test set.
figure 3

a Action distribution of the human clinician policy. b Action distribution of the D3QN policy with 37 observation features. c Action distribution of the WD3QNE policy with 37 observation features. We aggregate all actions selected over all timesteps into the five dose bins for each medication, where 0 denotes no drug given and the remaining bins are per-drug quartiles. Action counts represent how many times each drug dose was used. The human clinician policy tends to use low doses of vasopressors. The pure AI policy (D3QN) tends to use high doses of vasopressors. The AI policy with embedded human expertise (WD3QNE) tends to use lower doses of vasopressors than D3QN and higher doses than the clinician.

Sensitivity analysis

To analyze the effect of coding patient information as discrete time series at different temporal resolutions, we perform a sensitivity analysis over binning intervals of 1 h, 2 h, 4 h, 6 h, and 8 h. As shown in Table 2, the training set contains a maximum of 1,104,929 records and a minimum of 138,116 records, so we set the patient batch size in training to 32 and 256, respectively. Specifically, to ensure a fair comparison on the test set, we include 100 patients in the test data, each with five samples corresponding to the different binning intervals; the test set has 9768 records. The performance results of the different binning intervals at each learning epoch are shown in Fig. 4. Smaller intervals achieve larger expected returns and faster convergence on the test set, as they capture finer state changes. However, the 1 h and 2 h intervals have many missing values and can therefore result in overfitting. Overall, more frequent state data improve the model when missing values are few.

Table 2 Records of different binning intervals.
Fig. 4: Performance results of different binning intervals at each learning epoch.
figure 4

The patient trajectories are discretized into different binning intervals: 1 h, 2 h, 4 h, 6 h, and 8 h. a The loss value of the different binning intervals in training. b The expected return of the different binning intervals on the test set.

External validation

We conduct an external validation using the eICU Research Institute Database (eRI) from Philips33. A total of 1500 sepsis patients with fewer missing values are selected. Following the methods suggested by Komorowski32, hospital mortality is considered the final outcome in this cohort. We extract 24,279 records, which span the interval from 36 h preceding to 72 h after the estimated onset of sepsis. The off-policy evaluation performance on the eRI dataset is shown in Table 3. The results show that the survival rate of the WD3QNE policy is 95.83% with an expected return of 21.81. WD3QNE with human expertise achieves better performance than the other algorithms.

Table 3 Off-policy evaluation performance of baselines in the eRI set.

Bellman error tracking

To gain insight into how the algorithm behaves, we track the Bellman error throughout training in Fig. 5. The Bellman error gradually decreases over the iterations and finally stabilizes. We also find that p varies between 0 and 1 and eventually settles close to 0.5. The WD3QN target Q value lies between the target Q values of Dueling DQN and D3QN.

Fig. 5: Bellman error as a function of epochs.
figure 5

Visualization of Bellman error evolution. The WD3QN is shown in red. The Dueling DQN is shown in blue while the D3QN is shown in orange.

Discussion

We proposed the WD3QNE algorithm with a novel value function and the integration of human clinician expertise for septic patients in the ICU. As shown in Table 1 and Fig. 2, the WD3QNE algorithm outperforms the conventional DRL approaches. Compared to the human clinician and the pure AI algorithms, our model learns an optimal policy and yields a reliable treatment action distribution (Fig. 3).

To address the bias issue in Q value estimation, we design a target Q value function with adaptive dynamic weight (WD3QN) within the WD3QNE algorithm. In Table 1, we demonstrate that the WD3QN method achieves a higher survival rate (97.49%) than D3QN (96.48%), along with significantly lower variability. This is attributed to the fact that our target Q value function finds a trade-off between the Dueling DQN overestimation and the D3QN underestimation. Compared with other DRL methods22,23,28, this Q value function estimates the Q value of the next state more precisely without incurring additional algorithmic complexity. We further note that the WD3QN framework can easily be adapted to other problems. As demonstrated in Supplementary Note 1, it achieves high performance with respect to both optimization results and operation time.

In sepsis treatment, AI can be particularly beneficial in assisting decision-making processes. However, DRL agents tend to seek maximal rewards via aggressive strategies, incurring extra risks for patients in clinical practice. Moreover, it remains controversial, ethically and practically, to what extent we should follow guidance from AI, particularly when the AI solution deviates substantially from the human clinician's policy. As noted above, it has been recognized that AI policies prescribe overdoses of vasopressors in sepsis treatment17. In such cases, some may argue that it is necessary to bring in the expertise of human clinicians to help AI make comprehensive judgments for risk control. Human expertise tends to avoid so-called "common-sense mistakes" such as overdosing and to be more cautious about other issues, including side effects and drug interactions, which are vital for elevating the survival rate. However, such abstract knowledge or expertise is largely missing in existing AI approaches. In contrast to SL deep neural network models27,34, the focus of our study is to set human expertise as constraints on the feasible solutions of the AI algorithm. We show that with such constraints the proposed WD3QNE model prescribes reasonable doses of vasopressors (lower than those from alternative models without human expertise) for mild sepsis, as shown in Fig. 3. The algorithm with human expertise converges faster than approaches without such expertise, owing to the reduced search space of the optimization problem. Overall, the DRL algorithm with clinician expertise achieves excellent performance by integrating the advantages of both AI and human clinicians, thus optimizing the allocation of medical resources as well as advancing AI technology in medical applications.

In addition, the high-dimensional ICU data on sepsis and septic shock are challenging for DRL to handle. Komorowski et al. used a random forest35 classification model to rank the importance of the features. Their studies suggested the presence of some redundant features, so eliminating those features is an important step in training our algorithm to learn disease patterns. Therefore, we used a random forest as a feature selection method to narrow the feature set from the original 45 features, as suggested by Raghu et al.31. In total, 37 features were selected as the final subset to achieve the best performance, as shown in Table 1 and Fig. 2.

One limitation of this study is that our reward function only considers the survival outcome and the SOFA score. Intermediate rewards and final rewards are essential components of an RL algorithm, and we would like the reward function to accurately capture both the changes in and the clinical significance of patients' organ function. Therefore, in future studies, we will collect further treatment data on sepsis and expert advice to design a better reward structure. Additionally, an agent that must collect large amounts of data from the real world or a simulator suffers from low sample efficiency and can exhibit unexpected behavior. We instead use historical data to learn the rules and teach the RL agent to complete the tasks. Usually, the historical data are time-series trajectories of human behavior, and the RL agent learns the optimal policy at a state from different clinicians and extracts implicit knowledge from a large offline dataset. Because the agent does not explore, it can incur extrapolation errors from out-of-distribution actions. A large amount of offline data on human behavior, or a suitable regularization term for offline RL, is therefore needed, and we will investigate offline RL algorithms for the sepsis treatment problem in future work. In the actual treatment process, doctors should formulate treatment schedules according to the physical and emotional needs of patients. AI-integrated treatment methods should thus also consider the personalized needs of the doctor or the patient, such as minimal cost, minimal side effects and minimal ICU stay. Moreover, as we have demonstrated in this study, the inclusion of human expertise in the case of SOFA < 5 improves the survival rate. A full-scale integration of human expertise (e.g., drug interactions, side effects and common sense) into the decision-making loop will be further investigated.

Methods

Dataset

We use the dataset from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC)-III v1.4 database28, which is a de-identified database of 61,532 admissions to the intensive care unit from 2001 to 2012 from the Beth Israel Deaconess Medical Center in Boston, Massachusetts, USA.

We exclude patients whose treatment was withdrawn or who had missing records over 24 h. We select 276,232 records from 17,083 adults with SOFA greater than or equal to 2 according to the latest sepsis definition, Sepsis-3.036. As in previous studies, we use 45 physiological feature variables, including demographics, vital signs and lab values, shown in Tables 4 and 5. In training, patient data and interventions are recorded every 4 h. We use 80 h of patient records, from up to 24 h preceding until 56 h following the estimated onset of sepsis, so the period T = 20. In the external validation, the cohort is monitored every hour and the maximum period is T = 80. The outcome is 90-day mortality37.

Table 4 Demographics of sepsis cohort.
Table 5 WD3QNE Algorithm.

Feature selection

Because the ICU observation data contain many redundant features, the algorithm is likely to suffer from the curse of dimensionality38. Feature selection is a vital dimension-reduction method for high-dimensional data39. Here, we employ a random forest model to select the vital features35. The random forest is an ensemble classifier composed of multiple decision trees. We randomly sample data by bootstrapping with replacement, construct a decision tree for each bootstrap sample using the random splitting technique, and obtain the final prediction by voting. The random forest has a high tolerance for outliers and noise. More importantly, it provides an importance measure for each variable. We employ the Out-Of-Bag (OOB) error40 to measure the importance Gi of feature Xi (Eq. (8)):

$$G_i = \frac{1}{B}\mathop {\sum }\limits_{j = 1}^B \left| {D_j - D_{ji}} \right|$$
(8)

where B indicates the number of bootstrap samples, \(i = 1,2, \ldots ,N\) indexes the ith feature, Dj indicates the number of correct OOB classifications for the jth bootstrap sample, and Dji indicates the number of correct OOB classifications after perturbing feature Xi.

First, the importance scores are used to rank all features. Second, a sequential backward search is employed: the classification accuracy (Acc) is calculated with death as the label, and at each step the feature with the lowest importance score is removed from the feature set. Finally, we obtain the 37 observation features with the highest classification accuracy, which are used as the input of feature perception (see Fig. 6 for the ranking).
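
The following scikit-learn sketch illustrates the ranking plus sequential-backward-search procedure described above. It is an assumption-laden illustration rather than the authors' code: permutation importance stands in for the OOB perturbation measure of Eq. (8), cross-validated accuracy stands in for Acc, and all names and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score

def backward_feature_selection(X, y, feature_names):
    """Rank features with a random forest, then repeatedly drop the least
    important feature and keep the subset with the best accuracy (death label)."""
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X, y)
    # Permutation importance as a stand-in for the OOB perturbation measure (Eq. (8))
    imp = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
    order = np.argsort(imp.importances_mean)          # least important first

    remaining = list(range(X.shape[1]))
    best_subset, best_acc = list(remaining), 0.0
    for idx in order[:-1]:                            # keep at least one feature
        acc = cross_val_score(rf, X[:, remaining], y, cv=5).mean()
        if acc > best_acc:
            best_acc, best_subset = acc, list(remaining)
        remaining.remove(idx)                         # drop the least important feature
    return [feature_names[i] for i in best_subset], best_acc
```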

Fig. 6: Feature importance score.
figure 6

We calculate the classification accuracy with death as the label for different numbers of features. The 37 features (variables) selected with the highest accuracy are displayed. The glossary of vital signs and lab values is provided in Table 4.

State space and action space

A continuous state space is more sensitive to the subtle changes embedded in physiological data41. After feature selection, the state space consists of the 37 observed features, comprising vital signs and personal information. We combine the observed states, excluding the SOFA score, into a state vector used as the network input; the SOFA score is used as an intermediate reward in training. Additionally, we use a combination of intravenous (IV) fluid and vasopressor (VP) doses as the intervention action space for sepsis, with an action taken every four hours. We define a 5 × 5 action space17 for IV and VP: zero dose is assigned to bin 0, and the nonzero doses of each drug are discretized into per-drug quartiles, so that each drug at every timestep is converted into an integer representing its quartile bin.
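
For illustration, the 5 × 5 discretization could be implemented with pandas quartile binning as below; the column names and the flattening of the two per-drug bins into a single integer action are our assumptions, not details given in the paper.

```python
import numpy as np
import pandas as pd

def discretize_actions(df, iv_col="iv_fluid_4h", vp_col="vasopressor_4h"):
    """Map each 4-hour IV-fluid and vasopressor dose to one of 5 bins
    (bin 0 for zero dose, bins 1-4 for the quartiles of nonzero doses) and
    flatten the two bins into a single action index in [0, 24]."""
    def dose_to_bin(doses):
        bins = np.zeros(len(doses), dtype=int)
        nonzero = doses > 0
        # quartile bins are computed on the nonzero doses only
        bins[nonzero] = pd.qcut(doses[nonzero], 4, labels=False, duplicates="drop") + 1
        return bins

    iv_bin = dose_to_bin(df[iv_col].to_numpy())
    vp_bin = dose_to_bin(df[vp_col].to_numpy())
    return 5 * iv_bin + vp_bin
```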

Reward function

The reward function provides a flexible indicator to promote or punish specific actions of the agent. The agent performs an action at in state St to reach the next state St+1 and receives the reward r. We evaluate agents by associating the reward with the target Q value function. Because the ultimate concern is the patient's survival, the final reward is observed only after a long sequence of decisions; we therefore apply intermediate rewards and final rewards in the form of the SOFA score change and 90-day survival, respectively26. SOFA represents the evidence of organ dysfunction and has been recommended by experts as a screening tool for sepsis36. We define the reward r as (Eq. (9)):

$$r = \begin{cases} \beta_s \times \left(SOFA_t - SOFA_{t+1}\right) & t < T \\ \delta \times \beta_T & t = T \end{cases}$$
(9)

where δ is 1 if the patient survives and −1 if the patient dies, βT is the final reward value (24), and βs is an intermediate reward parameter.
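
As a sketch, Eq. (9) can be transcribed directly as below; the default parameter values follow the Off-policy evaluation subsection (βs = 0.6, βT = 24), while the function signature itself is illustrative.

```python
def reward(sofa_t, sofa_next, t, T, survived=None, beta_s=0.6, beta_T=24.0):
    """Eq. (9): intermediate reward from the SOFA change for t < T, and a
    terminal reward of +/- beta_T at t = T depending on 90-day survival."""
    if t < T:
        return beta_s * (sofa_t - sofa_next)
    return (1.0 if survived else -1.0) * beta_T
```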

Dueling net architecture

Our paper adopts an off-policy reinforcement learning method based on the value function. DQN, DDQN, and D3QN are popular methods for off-policy learning17,18,19, whose goal is to maximize the expected return. The value function and state-action value function are defined as \(V^\pi (S_t) = {\Bbb E}[Q\left( {S_t,a_t} \right);\pi ]\) and \(Q^\pi (S_t,a_t) = {\Bbb E}[r|S_t = S,a_t = a;\pi ]\). For DQN, for instance, the optimal Q function satisfies the Bellman equation: \(Q^ \ast (S_t,a_t) = r + \gamma {\Bbb E}[\mathop {{\max }}\limits_{a_{t + 1}} Q\left( {S_{t + 1},a_{t + 1};\omega ^ - } \right)]\). In the sepsis environment, treatment effects depend on both the patient's observed state and the clinician's intervention action. Thus, we use the dueling net architecture30, which maintains separate value and advantage functions: \(Q^\pi \left( {S_t,a_t} \right) = V^\pi \left( {S_t} \right) + [A^\pi \left( {S_t,a_t} \right) - \frac{1}{{|A|}}\mathop {\sum }\nolimits_{a_t^\prime } A^\pi \left( {S_t,a_t^\prime } \right)]\). V is the value of the patient state and A is the advantage of the prescription under the specific policy π.

Deep neural network

The neural network includes an input layer, a 256-unit fully connected hidden layer, a 128-unit fully connected hidden layer, the two-stream (value and advantage) layer, and an output layer. All hidden layers are activated by rectified linear units (ReLUs). The dueling neural network parameters are updated by gradient descent according to the Q value function shown in Eq. (7). We use the Huber loss function21, which combines the mean squared error and the absolute error by treating the error in segments: for errors between −1 and 1 it uses the mean squared error (MSE), and otherwise it uses the absolute error. Putting all the aforementioned components together, the WD3QNE algorithm is provided in pseudocode (WD3QNE Algorithm, Table 5).
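
A PyTorch sketch of the network and update step described above is given below: a 256-unit and a 128-unit ReLU layer, value and advantage streams combined by mean-subtracted aggregation, and a Huber-loss gradient step toward the WD3QNE target of Eq. (7). The class and argument names are illustrative, and `smooth_l1_loss` is used here as the Huber loss.

```python
import torch.nn as nn
import torch.nn.functional as F

class DuelingQNet(nn.Module):
    """Dueling architecture: shared 256/128 ReLU trunk, then separate value
    and advantage streams combined by mean-subtracted aggregation."""
    def __init__(self, state_dim=37, n_actions=25):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.value = nn.Linear(128, 1)              # V(s)
        self.advantage = nn.Linear(128, n_actions)  # A(s, a)

    def forward(self, state):
        h = self.trunk(state)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)  # Q(s, a)

def train_step(main_net, optimizer, states, actions, targets):
    """One gradient-descent update: Huber loss between Q(s, a; w) and the
    (detached) WD3QNE target of Eq. (7)."""
    q_sa = main_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q_sa, targets.detach())  # Huber loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```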

Off-policy evaluation

In the experiments, we use the intermediate reward parameter βs = 0.6 and the terminal reward parameter βT = 24, following the settings in existing works26. Specifically, the terminal reward is 24 if the patient survives and −24 otherwise. The Q learning rate is 0.0001. We use the Python 3.8 environment and the PyTorch framework. All computations were performed on a PC equipped with a 3.30 GHz Intel Core i7-11370H CPU and 16 GB of RAM.

In model evaluation, the value of a newly learned AI policy is evaluated using trajectories of health status generated by another policy (the human clinicians). We employ off-policy evaluation to assess the performance of each algorithm. In this paper, we use the double robust off-policy value evaluation42, which computes an unbiased estimate of the evaluated policy's value over each trajectory (Eq. (10)) by combining importance sampling (IS) with an approximate Markov decision process model.

$$V_{t + 1} = \hat V\left( {S_t} \right) + \rho _t\left( {r + \gamma V_t - \hat Q_{\left( {S_t,a_t} \right)}} \right)$$
(10)

where ρt denotes the importance ratio between the AI policy π1 and the clinician policy π0: \(\rho = \pi _1/\pi _0\). \(\hat V\left( {S_t} \right)\) is the estimated state value, and \(\hat Q_{(S_t,a_t)}\) is the estimated expected return of taking action at in state St. Jiang et al. showed that the reward r and the importance ratio ρ are independent42; hence an unbiased estimate of the expected return V is obtained.
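
The doubly robust recursion of Eq. (10) can be applied backward over each patient trajectory, as sketched below; the model-based estimates \(\hat V\) and \(\hat Q\) and the two policy probability functions are assumed to be supplied by the caller, and the names are illustrative.

```python
def doubly_robust_value(trajectory, v_hat, q_hat, pi_ai, pi_clin, gamma=0.99):
    """Backward recursion of Eq. (10) over one trajectory.

    trajectory : list of (state, action, reward) tuples in time order
    v_hat(s), q_hat(s, a)      : model-based value estimates
    pi_ai(a, s), pi_clin(a, s) : action probabilities of the AI and clinician policies
    """
    v = 0.0  # value beyond the final step
    for state, action, reward in reversed(trajectory):
        rho = pi_ai(action, state) / pi_clin(action, state)   # importance ratio
        v = v_hat(state) + rho * (reward + gamma * v - q_hat(state, action))
    return v  # doubly robust estimate of the trajectory's expected return
```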

To further evaluate the policy survival rate, we apply an on-policy SARSA reinforcement learning algorithm, \(Q\left( {S_t,a_t} \right) \leftarrow Q\left( {S_t,a_t} \right) + \alpha \left( {r + \gamma Q\left( {S_{t + 1},a_{t + 1}} \right) - Q\left( {S_t,a_t} \right)} \right)\), to establish the relationship between expected return and survival rate29. First, the expected return value V is calculated. Then, we calculate the average survival rate based on the return value. The survival rate formula26 (Eq. (11)) is:

$$S\left( {Q_i} \right) = \frac{{sur_{V_i}}}{{tal_{V_i}}}$$
(11)

where \(sur_{V_i}\) is the number of survivors and \(tal_{V_i}\) is the total number of patients whose expected return falls in Vi, where Vi is an integer bin of V and \(V_t \in V_i\). The relationship between expected return and survival rate is shown in Fig. 7. The survival rate is positively correlated with the expected return for both 45 and 37 observation features: the survival rate increases as the expected return increases.
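
As a sketch of Eq. (11), expected returns can be grouped into integer bins and the empirical fraction of survivors computed per bin; the variable names and the use of the floor function for binning are illustrative assumptions.

```python
import numpy as np

def survival_by_return(expected_returns, survived):
    """Eq. (11): group patients by the integer bin V_i of their expected return
    and compute the fraction of survivors sur_Vi / tal_Vi in each bin."""
    bins = np.floor(np.asarray(expected_returns, dtype=float)).astype(int)
    survived = np.asarray(survived, dtype=float)  # 1 = survived, 0 = died
    return {int(b): survived[bins == b].mean() for b in np.unique(bins)}
```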

Fig. 7: The relationship between expected return and survival rate.
figure 7

a The relationship between expected return and survival rate for 45 observation features. b The relationship between expected return and survival rate for 37 observation features. The relationship is learned from observational data and the actions taken by actual clinicians in the MIMIC-III dataset.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.