Exploring optimal control of epidemic spread using reinforcement learning

Pandemic defines the global outbreak of a disease with a high transmission rate. The impact of a pandemic can be lessened by restricting the movement of the masses; however, one of its concomitant circumstances is an economic crisis. In this article, we demonstrate what actions an agent (trained using reinforcement learning) may take in different possible scenarios of a pandemic, depending on the spread of disease and economic factors. To train the agent, we design a virtual pandemic scenario closely related to the present COVID-19 crisis. Then, we apply reinforcement learning, a branch of artificial intelligence that deals with how an individual (human/machine) should interact with an environment (real/virtual) to achieve a cherished goal. Finally, we demonstrate what optimal actions the agent performs to reduce the spread of disease while considering economic factors. In our experiment, we let the agent find an optimal solution without providing any prior knowledge. After training, we observed that the agent places a long lockdown to reduce the first surge of the disease. Furthermore, the agent places a combination of cyclic lockdowns and short lockdowns to halt the resurgence of the disease. Analyzing the agent's actions, we discover that the agent decides movement restrictions not only based on the size of the infectious population but also considering the reproduction rate of the disease. The estimation and policy of the agent may improve the human strategy of placing lockdowns, so that an economic crisis may be avoided while an infectious disease is mitigated.

During a pandemic, the foremost intention is to produce a vaccine that provides immunity against the particular infectious disease. However, an effective vaccine may take years to develop, depending on the disease and certain other criteria. While a vaccine is being investigated, the losses of a pandemic must be controlled via proper clinical support and by reducing the spread of the disease. Nevertheless, assuring proper clinical care is not possible in a pandemic situation, due to the large number of infections relative to the limited available clinical support. Therefore, lessening the spread of a disease is the first and foremost effort to overcome the devastation of a pandemic disaster.
Pandemics are often caused by diseases that transmit through close person-to-person contact 1 . Recent pandemics have been caused by influenza strains such as swine flu 2 and by coronaviruses 3,4 . Different intervention means are proven to reduce the devastation of a pandemic outbreak 5 . However, these interventions often cause an economic breakdown, and it is not possible to reduce the impact of a pandemic without one 6 . Therefore, a pandemic situation raises the challenge of balancing the viral spread against a steady economy.
Due to the current COVID-19 pandemic, researchers have been investigating various strategies to reduce the pandemic's desolation while striving for economic balance. Through several research endeavors, various lockdown strategies have been proposed, such as age-based lockdown 7 , n-work-m-lockdown 8 , and so on. However, age-based lockdown should not be applied for a disease that is critical for all ages. Also, repeated n-work-m-lockdown (n days without lockdown followed by m days of lockdown) strategies may not ameliorate critical pandemic situations. The current challenge of a pandemic situation raises questions such as: (a) is placing a long lockdown the only way to mitigate a pandemic?, (b) should we place lockdowns while the pandemic situation does not ameliorate?, (c) how should the resurgence of the pandemic be handled?, and (d) while mitigating a pandemic, how can we also balance the economic circumstances? In our research endeavor, we attempt to resolve these concerns by combining reinforcement learning with virtual-environment-based epidemic analyses.
In mathematics and computer science, the challenge of maximizing one constraint (the economic balance) while minimizing another factor (the spread of disease) is referred to as an optimization problem. Our contributions are as follows:
1. We implement a virtual environment that simulates a pandemic situation while also considering economic circumstances.
2. We illustrate the consequences of placing no lockdown, maintaining social distancing, and placing a lockdown. The consequences are derived based on population deaths and economic situations.
3. We investigate optimal strategies to reduce the spread of disease using reinforcement learning. Furthermore, we perform extensive analysis and present the reasoning behind the agent's actions.
The rest of the paper is organized as follows: in "Methods", the mechanism of the virtual environment is disclosed, and the neural network architecture of the agent is defined. In "Results", we evaluate the virtual environment and search for an optimal agent. Then, we explore various control sequences to reduce the disease's spread and focus our effort on finding and analyzing the optimal control sequence generated by the agent. Finally, "Discussion" concludes the paper.

Methods
To study epidemiology, various compartmental models have been implemented 19 . Compartmental models define a simple mathematical foundation that projects the spread of an infectious disease. Furthermore, different mathematical models have been presented to illustrate the relationship between population heterogeneity and the present pandemic crisis 20 . These compartmental models are mostly generated using ordinary differential equations (ODE) 21 .
Although ODE and other mathematical methods are sufficient for modeling an infectious disease, we argue that they are not suitable for training an RL agent, as they lack randomness in infection, recovery, and death. Mathematical models also do not include any super-spreaders 22 . Randomness is required so that RL agents do not overfit to a certain set of initial parameters, and it generates more uncertainty. The randomness can be considered comparable to the data augmentation process in deep learning: data augmentation often helps DNN models avoid overfitting and achieve better generalization on unseen data 23 . Moreover, RL environments must be dynamic. General ODE models are static, and transforming them into a dynamic state may require additional parameters 24 , which often becomes complex. Therefore, we avoid implementing traditional ODE models and instead implement a virtual environment that mimics the disease's transmission.
The virtual environment is used to generate states and outcomes based on particular actions. The virtual environment is designed based on the SEIR (Susceptible-Exposed-Infectious-Recovered) compartmental model. Figure 1 depicts the different stages of the SEIR compartmental model:

Susceptible
People who are currently not infected by the disease, but have a possibility of being infected.

Exposed
People who are infected by the disease, but are not yet able to infect someone else.

Infectious
People who are infected by the disease, and can infect someone else.

Recovered
People who are cured of the disease and have no chance of being infected again.

Due to the randomness in the various transitions, implementing a virtual compartmental model makes the problem more challenging. The virtual environment is designed as a 2D grid in which the population can randomly move. Each day, the population performs a fixed number of random moves. In Fig. 2, an info-graphic representation of the environment and the training process is illustrated. In the "Results" section, a broad discussion is presented to substantiate the virtual environment.
Transmission stages. In the environment, susceptible individuals become infected if they come into close contact with an infectious person. Newly infected individuals first enter the exposed stage. After 1-2 days, individuals in the exposed stage transition to the infectious stage, in which they can transmit the disease. Infectious individuals either recover after 21-27 days or may lose their lives; the environment is configured so that around 80% of the infected population survives. (Fig. 2 illustrates the overall dynamics of the virtual environment, the environment features, and the agent's possible actions. Notably, a level-1 movement restriction corresponds to maintaining social distancing, and a level-2 movement restriction corresponds to a nationwide lockdown.)
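The transmission stages above can be sketched as a per-individual state machine. This is a minimal illustration, not the paper's supplementary code: the class name, stage constants, and method names are our own assumptions, while the 1-2 day exposed window, the 21-27 day infectious window, and the roughly 80% survival rate come from the text.

```python
import random

SUSCEPTIBLE, EXPOSED, INFECTIOUS, RECOVERED, DEAD = range(5)

class Individual:
    """One member of the simulated population (illustrative names only)."""
    def __init__(self, rng):
        self.rng = rng
        self.state = SUSCEPTIBLE
        self.days_left = 0

    def infect(self):
        # Close contact with an infectious person: enter the exposed
        # stage for a random 1-2 days, as described in the text.
        if self.state == SUSCEPTIBLE:
            self.state = EXPOSED
            self.days_left = self.rng.randint(1, 2)

    def step_day(self):
        # Advance this individual's stage timers by one simulated day.
        if self.state == EXPOSED:
            self.days_left -= 1
            if self.days_left == 0:
                self.state = INFECTIOUS
                self.days_left = self.rng.randint(21, 27)
        elif self.state == INFECTIOUS:
            self.days_left -= 1
            if self.days_left == 0:
                # Roughly 80% of infected individuals survive.
                self.state = RECOVERED if self.rng.random() < 0.8 else DEAD
```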

Movement restrictions.
In the virtual environment, the disease's spread can be mitigated by reducing the population's movements. There are three movement restrictions in the environmental setup: level-0, level-1, and level-2. In level-0, no movement restrictions are enforced, and the population makes the maximum movements.
In level-1, the movement of individuals is restricted by 25%; in general, maintaining social distancing and avoiding unnecessary outings is equivalent to a level-1 restriction 25 . In level-2, movement is reduced by 75%, similar to a lockdown state 26 . The DRL agent decides these movement restrictions. However, although movement restrictions reduce the spread of the disease, they can cause an economic collapse.
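The three restriction levels can be captured as a small lookup from level to movement reduction. This is a minimal sketch: the 15-step default daily movement is taken from the "Results" section, and the integer truncation of partial steps is our own assumption.

```python
# Movement reduction per restriction level, as described in the text:
# level-0: no restriction, level-1: 25% reduction (social distancing),
# level-2: 75% reduction (lockdown).
MOVEMENT_REDUCTION = {0: 0.00, 1: 0.25, 2: 0.75}

def daily_moves(level, max_moves=15):
    """Number of random moves an individual makes under a restriction level."""
    return int(max_moves * (1.0 - MOVEMENT_REDUCTION[level]))
```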
State generation. In RL, a state is an observation that passes estimable information to the agent. By analyzing this information, an agent makes an optimal move based on its policy. States can be either finite or infinite. In the virtual environment setup, relevant information about the spread of the disease is passed through the state: seven parameters form a state of the environment, illustrated in Fig. 2. Active cases represent the number of people who are in the infectious stage. Newly infected refers to the number of people who have shifted into the infectious stage on a particular day. Cured cases and death cases give the numbers of people who have been cured and who have died since the pandemic's start, respectively. The reproduction rate represents the average number of people being infected by the current infectious population. The economy illustrates the daily economic contribution of the population. Along with these, the current movement restriction is also passed as a state parameter.
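The seven state parameters can be grouped into one observation record. This is a sketch only; the field names are our own, since the paper lists the quantities but not its data layout.

```python
from dataclasses import dataclass, astuple

@dataclass
class EnvState:
    """The seven observation parameters passed to the agent each day
    (field names are our own assumption)."""
    active_cases: float        # population currently in the infectious stage
    newly_infected: float      # shifted into the infectious stage today
    cured_cases: float         # cumulative recoveries since the start
    death_cases: float         # cumulative deaths since the start
    reproduction_rate: float   # average new infections per infectious person
    economy: float             # daily economic contribution of the population
    restriction_level: int     # current movement restriction (0, 1, or 2)

def to_vector(state):
    # Flatten the observation into the numeric vector fed to the network.
    return [float(x) for x in astuple(state)]
```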
Economic setup. In the virtual environment, each individual contributes to the economy through movement. Therefore, if a movement restriction is placed, it impacts the economy as well. Each individual contributes a value in [0.8, 1] by moving. People who did not survive cannot make any further contributions to the economy and no longer exist in the environment; therefore, an increasing death count also has a negative impact on the economy. The infectious population cannot contribute to the economy either, so a high number of active cases also has a negative influence on the economy.
Virtual environment workflow. The environment's workflow is illustrated in Fig. 3, and the Algorithm contains the pseudocode of the corresponding virtual environment. In the beginning, the virtual environment includes a susceptible and an infectious population. Every day, each individual contributes to the economy by making some random movements; each individual's economic contribution is a random value on the scale [0.8, 1]. A susceptible individual enters the exposed state if he/she collides with an infectious individual. The agent is not given any information related to the exposed state. This is theoretically valid, as an individual does not show any health issues in the exposed state, and there is no scientific process to verify that a person is in an exposed condition. The number of days spent in the exposed state is random per capita, limited to 1-2 days after the collision; then the individual enters the infectious state. Persons in the exposed and infectious states still perform random movements in the environment. However, a contagious individual cannot contribute to the economy: individuals in the infectious state can be considered patients who, due to illness, cannot work yet still come into contact with other individuals. After 21-27 days, an infectious individual may recover; however, roughly 20% of the overall infectious population dies and is removed from the environment. The recovered individuals continue to make arbitrary movements and contribute to the economy, and they do not get infected a second time.
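The economic rules of the workflow (susceptible, exposed, and recovered individuals contribute a random value in [0.8, 1]; infectious individuals and non-survivors contribute nothing) can be sketched as follows. The stage-name strings and the function signature are our own assumptions, not the paper's pseudocode.

```python
import random

def daily_economy(population, rng):
    """One day's economic output, following the rules in the text.

    `population` maps individual ids to stage names; only individuals
    who can move and work (not infectious, not deceased) contribute a
    random value in [0.8, 1].
    """
    total = 0.0
    for stage in population.values():
        if stage in ("susceptible", "exposed", "recovered"):
            total += rng.uniform(0.8, 1.0)
    return total
```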
Reward function. In DRL, an action is encouraged or discouraged by a reward function. A reward function encourages an agent to be in a particular state/situation by giving it a high reward for that situation; on the contrary, a specific action or situation is discouraged by giving the agent a low reward. An agent tries to build a policy under which it avoids the discouraged situations. By designing a proper reward function, it is possible to generate an agent that pursues the human-desired situation. For the current environment, the reward function (Eq. 1) combines three parameters of the environment: the current economy ratio E_t, the current cumulative death ratio D_t, and the current percentage of active cases. Due to the three types of movement restrictions, the economy ratio separates into three levels: because of the direct relationship between movement restriction and economy, levels 0, 1, and 2 result in values of E_t approximately close to 1, 0.75, and 0.25, respectively. However, this can be altered by a high death count and by randomness. Setting aside the D_t parameter, the correlation between the economic levels and the active cases can be examined; Fig. 5 illustrates such a situation. The graph shows that while active cases are low, the reward prioritizes the higher economic stages, and a further increase in active cases lessens the reward of the higher economic stages. By setting r = 8, the rewards of the different economic stages become almost identical (the absolute difference is less than 0.001) after crossing 0.82% active cases. This boundary is thought of as a critical point, after which the economy no longer matters; beyond it, the goal becomes to lessen the surge of the disease.
The threshold can be instantiated as the percentage of the population for which proper medical treatment can be guaranteed. Properly selecting the threshold value may reduce death tolls (through adequate medical care) in a real-world scenario; this percentage often varies across geographical areas. Furthermore, by including D_t in the reward function, the agent is also encouraged to reduce the death ratio. Figure 4 illustrates the reward function's relation to the active-case percentage and the death ratio in the three possible economic stages. The impact of deaths in the reward function is tuned using the parameter s, and s = 5 is set to prioritize the negative impact of deaths.
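The paper's exact reward equation (Eq. 1) is not reproduced in this extract, but its described behavior — a reward proportional to the economy ratio E_t, decayed by the active-case percentage at a rate controlled by r, and penalized by the death ratio D_t at a scale controlled by s — can be illustrated with the following assumed shape. This is only a sketch consistent with the text, not the published formula.

```python
import math

def reward(economy_ratio, active_pct, death_ratio, r=8.0, s=5.0):
    """Illustrative reward shape only (NOT the paper's Eq. 1).

    Rewards the economy ratio E_t, discounted exponentially by the
    active-case percentage (rate r), and penalises the cumulative
    death ratio D_t (scale s).
    """
    return economy_ratio * math.exp(-r * active_pct) - s * death_ratio
```

With this shape and r = 8, the rewards of the three economic stages (E_t ≈ 1, 0.75, 0.25) differ by less than 0.001 once active cases pass roughly 0.82%, matching the critical point described in the text.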
Both r and s are tuning parameters of the reward function. Increasing the value of r reduces the reward threshold (described in Fig. 5), whereas the value of s defines the significance of death: a higher value of s influences the agent to heavily reduce the death ratio while ignoring the economic balance.

The environment forms a Markov decision process (MDP) with a transition probability τ(s′|s, α) of choosing an action α, given an environment state s, and reaching a new state s′. The DRL agent acquires a policy π through bootstrapping. Through this policy, the agent performs an optimal action α_i for a given state s, represented as π(α_i|s). The optimal action is chosen based on the state-value function V^π(s), which accumulates the future state rewards R, discounted by γ at each step:

V^π(s) = E_π[ Σ_{k=0}^{∞} γ^{t+k} R_{t+k} | s_0 = s ]

An optimal policy π* attains the best possible state-value function, defined as

(4) V*(s) = max_π V^π(s), ∀ s ∈ S

Figure 4. A heatmap representation of the reward function. The horizontal axis represents the percentage of active cases; the vertical axis represents the cumulative death percentage. From left to right, the three heatmaps illustrate the reward distribution under level-0, level-1, and level-2 movement restrictions, respectively. Under the three restriction levels 0, 1, and 2, the value of E_t is expected to be approximately 1, 0.75, and 0.25, respectively.

As the transition function of the MDP, τ(s′|s, α), is unknown, a state-action function Q^π(s, a) is generated. The state-action function mimics the state-value function V^π(s) while also identifying the best action α: it greedily chooses the action for which it gains the maximum state value.
The Q^π(s, a) function is implemented as the DRL agent. In the experiment, we study memory-based DRL agents, since a memory-based agent perceives further possibilities, takes better decisions, and acquires better rewards 27 . Among the different memory sizes, we found that the DRL agent takes better actions with a minimal memory of 30 days; we further investigate the optimal memory length in the "Results" section. The agent is implemented using three bidirectional Long Short-Term Memory (LSTM) layers. Bidirectional LSTMs perform optimally when both forward and backward relationships exist in a portion of data 28 . In the case of this epidemic data, using bidirectional LSTMs provides the following benefits: (a) selecting an optimal action based on previous data, and (b) estimating the influence of selecting a particular action. The three bidirectional LSTM layers are followed by four dense layers. Figure 6 depicts the memory-based DRL agent architecture.
The DDQN method is used to train the agent. The DDQN architecture uses an online agent and a target agent; traditionally, both agents contain the same network structure, and the traditional DDQN training process is used to train the architectures 29 . The agent is trained over 7000 episodes, without any prior knowledge or human interpolation. Random actions are taken during the training episodes to explore the environment suitably. The training starts with a random-action ratio of ε = 1, which is continuously decayed as ε = max(ε − ε/6000, 0.1). The discount value γ is set to 0.9 to propagate future rewards back to any particular state. Mean squared error (MSE) is used to calculate the loss between the agent's predicted rewards and the rewards generated by the reward function.
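The exploration schedule can be written out directly from the decay rule above; applying the decay once per episode is our reading of the text.

```python
def epsilon_schedule(episodes=7000, eps=1.0, floor=0.1):
    """Exploration schedule used during training: start fully random
    (eps = 1) and decay each episode as eps = max(eps - eps/6000, 0.1).
    The per-episode application of the decay is an assumption."""
    history = []
    for _ in range(episodes):
        history.append(eps)
        eps = max(eps - eps / 6000.0, floor)
    return history
```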

Results
The overall implementation is conducted using Python 30 , Keras 31 , and TensorFlow 32 ; Matplotlib 33 is used for graphical representations. The experiments are conducted in a virtual environment implemented with an algorithm of quadratic time complexity, which is provided as Supplementary file 1. Therefore, we experiment with a limited population of 10,000 and a default daily movement of 15 steps. In this section, we first evaluate the virtual environment by comparing it with the ODE model. Then we compare the agents' performance over different memory lengths and identify an optimal agent. Further, we explore the agent's decisions and examine the strategy behind them.
Virtual environment evaluation. Our investigation found that the spread of the disease in the environment differs based on the population's density. In Fig. 7, we illustrate distinguishable waves of active cases over different population densities. With a higher population density, the probability of contact between two persons increases; therefore, the rate of spread of a disease depends on the density of the population. On the contrary, in the environment, the disease's reproduction rate is not dependent on the population density. In Table 1, the mean and median reproduction rates are reported, tested over different population densities.
The increase in density does not alter the reproduction rate of the environment. The virtual environment possesses nonlinearity in the reproduction number; Hu et al. verified that nonlinearity can cause reproduction rates to saturate after a particular increase in density 34 . Yet, the surge of active cases tends to rise with increasing population density. This scenario can be captured by Eq. (6): higher density causes a higher initial wave of active cases, mostly driven by super-spreaders, and the new infections caused by the raised active cases result in an exponential increase.
The mean and median of the virtual environment's reproduction rate closely match the reproduction rate estimated for China. To approximate the reproduction rate of COVID-19, Liu et al. evaluated multiple reports from different provinces of China (including Wuhan and Hubei) and overseas 35 . The report concludes that the mean R_0 of COVID-19 is approximately 3.28, with a median of 2.79. Compared to these R_0 values, the virtual environment closely mimics the COVID-19 situation at densities above 0.01. Therefore, it can be confirmed that the virtual environment approximately mimics the COVID-19 situation. However, at a population density of 0.01 the disease does not spread properly, while densities of 0.04, 0.1, and 0.2 spread the disease excessively. Therefore, we conduct our experiments at population densities of 0.02 and 0.03.
In contrast to ODE models, virtual environments are rarely used to study epidemiology. ODE-based compartmental models are scientifically accepted and often used in epidemiology; therefore, we compare the virtual environment with an ODE model to verify its correctness. In the comparison, we omit the exposed state, because the RL agent does not perceive the exposed population data: the agent is only reported data related to the susceptible, infectious, recovered, and death cases. These circumstances are also illustrated in Fig. 3.
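The SEIR ODE model used for this comparison can be sketched with simple Euler integration, together with the herd-immunity threshold 1 − 1/R_0 used later in the analysis. The rate values in the test below are illustrative assumptions, not the paper's fitted parameters.

```python
def simulate_seir(beta, sigma, gamma, days, n=10000, i0=10, dt=1.0):
    """Euler integration of a standard SEIR system (illustrative
    parameterization; the paper's exact equations and rates are not
    reproduced in this extract):
        dS/dt = -beta*S*I/N          dE/dt = beta*S*I/N - sigma*E
        dI/dt = sigma*E - gamma*I    dR/dt = gamma*I
    Returns the final (S, E, I, R) compartment sizes.
    """
    s, e, i, r = n - i0, 0.0, float(i0), 0.0
    for _ in range(int(days / dt)):
        ds = -beta * s * i / n
        de = beta * s * i / n - sigma * e
        di = sigma * e - gamma * i
        dr = gamma * i
        s += dt * ds
        e += dt * de
        i += dt * di
        r += dt * dr
    return s, e, i, r

def herd_immunity_threshold(r0):
    """Minimum infected fraction for herd immunity, 1 - 1/R0 (from the text)."""
    return 1.0 - 1.0 / r0
```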
In Fig. 8, we illustrate a comparison of the virtual environment with a general SEIR ODE model. Three virtual environments are reported, with densities of 0.01, 0.02, and 0.03. The ODE model is implemented with the standard SEIR equations, with rates matched to the reproduction values reported in Table 1. The flow of the variables reported by the virtual environment closely mimics the ODE model, although slight deviations can be noticed due to the variation in R_0. The ODE model illustrates herd immunity 36 , under which around 8% of the population never becomes infected. The minimum portion of infections required to achieve herd immunity can be calculated as 1 − 1/R_0. For the calculated ODE, the minimum value is 88.14% (considering the deaths). Considering the mean R_0 of the virtual environments, the 0.02 and 0.03 density environments require a minimum of 87.75% and 90% of the population to be infected, respectively. Measuring the flow of the variables and the herd immunity, it can be verified that the virtual environment behaves similarly to ODE models.

Agent comparison. The memory-based agent is trained in the virtual environment with a random initialization of infections: for each episode, an infectious population on a scale of [1, 20]% is randomly initialized. There is also no fixed day limit at which the disease of the virtual environment fully mitigates; therefore, each episode is kept running until the disease fully mitigates (zero active cases and an empty exposed state). We initially trained the agent with a 30-day memory, which took nearly 10 days to complete the training of 7000 episodes. Apart from the 30-day memory agent, we further trained agents with different memory lengths: 7, 15, 45, and 60 days.
Each of the agents is named after its memory length (e.g., M45 refers to the agent with 45 days of memory). To reduce the computational cost, we initialized the M7, M15, M45, and M60 models with the pre-trained weights of the M30 model. Figure 9 presents a comparison of loss and reward values; each agent was evaluated every 250 episodes, where a single evaluation is the mean of ten runs, and every model is tested on the same environment scenario. As the figure depicts, the M30 model requires around 6000 episodes to achieve a better reward, whereas, after some fluctuation in the loss value, the pre-initialized models (M7, M15, M45, M60) converge to an optimal reward within 3000 episodes. The graph shows that agent M15 produces a higher score per play, followed by agents M45 and M30. As the reward function (Eq. 1) is formulated by aggregating the target objectives (death, economy, and active cases), we regard an agent as optimal if it achieves a higher aggregated reward. Agents M7 and M15 reach their optimal state at episode 2250; in comparison, agents M45 and M60 reach their optimal state comparatively earlier, at episode 1750, while agent M30 reaches its optimal state at episode 6000. We further investigate which agent not only secures higher rewards but also reduces infections and gains better economic profit. Further comparisons are made with the weights at which each agent acquired its maximum reward (illustrated in Fig. 9), and the environment scenario used to evaluate the agents is the same for all evaluations (Figs. 9, 10, 11, 12, 13, 14).
Figure 10 presents an investigation of the actions performed by the agents. The best-performing agent, M15, mostly applies level-0 and level-2 restrictions; the second-best agent, M45, applies all three restriction levels; and the third-best agent, M30, mainly applies level-0 and level-2 restrictions. Figure 11 illustrates the deaths and infections that occurred under each agent's control. Agent M30 achieves minimal infections and deaths due to its strict lockdown policy; agents M15 and M45 place second and third in the comparison, respectively.
Further, Fig. 12 illustrates the economic situation of the environment. We demonstrate the financial gains in two different aspects: the per-day economy gain (total economy divided by the number of actions/days) and the entire economy gain per episode. Although agent M30 mostly places level-2 restrictions, it achieves better per-day financial profits than agents M15 and M45. In contrast, agent M15 gains better economic profit per episode: it performs more extended action sequences and keeps the disease propagating for a longer time, and over these longer runs it receives a better cumulative reward than most of the agents.
Moreover, we validate this assumption by closely re-investigating the agents' best rewards; Fig. 13 illustrates the scenario. The graph validates that agent M30 receives a better average reward per day. As agent M30 also mitigates the disease faster, it receives better economic advantages than any other model. Overall, we can conclude that agent M30 poses several advantages:
• Agent M30 quickly mitigates the disease.
• It ensures a minimal spread of the disease, which also causes fewer deaths.
• It achieves a better economic balance than any other agent, comparatively better than the average economy of agents M15 and M45. Because the agent quickly mitigates the disease, it does not need to place lockdowns in the future; therefore, it has better economic advantages.

Figure 12. The graphs depict the economic benefits secured while the agents performed. The left graph presents the average economy gain per day, and the right illustrates the total economic gain per episode. The difference between average and total gain is due to the dynamic episode length: the environment is kept running until the disease fully mitigates, so agents that take longer to mitigate the disease get more time to make an economic profit.

Table 2 further aggregates the findings of the overall agent comparison. It validates that agent M30 performs optimally and guarantees minimal infection with higher economic benefits. Agent M30 also has some architectural benefits compared to the other agents. The virtual environment is initialized so that a cycle of the disease's propagation can be observed in approximately 30 days. Therefore, agent M30 forms a better perception, because it can observe the full result of an action within about 30 days. Consequently, due to this advantage, we assume that agent M30 can better locate the optimum in the state space generated by the environment. On the contrary, agent M15 could not capture the disease's propagation cycle and greedily converges to a state where it can achieve higher rewards by keeping the environment active for a longer time. Although agents M45 and M60 receive the same scenario as agent M30, they fail to establish appropriate reasoning to apply restrictions.
Agents M45 and M60 cannot appropriately target disease propagation from the input sequences due to the increased input length. As a result, agent M45 also converges to a state where it can gain better rewards in total. In Fig. 14, the transition probability of agent M30 is presented. As illustrated in Fig. 10, the agent does not apply level-1 restrictions and mostly performs level-2 restrictions; yet it achieves a comparatively better economic balance. In the next section, we illustrate the decisions that agent M30 takes to mitigate the disease.

Evaluation of different control sequences. Figure 16 presents a datasheet of the virtual environment simulation, and Fig. 15 represents the initial positioning of the infectious population over the environment. The datasheet is separated into four individual graphs. In this simulation, no lockdown is placed (level-0 restriction). The graph indicates a rise in active cases that simultaneously infects 20% of the population. Without any lockdown, the disease affects more than 80% of the population, among which around 20% lose their lives. Due to this huge decrease in population, an impact is also measured in the economic state of the environment: as the non-survivors cannot contribute, the economic ratio falls to around 0.20. Therefore, considering the economy, it can be determined that placing no lockdown in a pandemic situation is not a good solution. The reproduction rate of the disease mostly stays in a close interval of 2-5; however, a surge in the reproduction rate is reported after 160 days of the pandemic, due to super-spreaders.
The effect of social-distancing (level-1 restriction) is presented in Fig. 17. By maintaining social-distancing, the spread of the disease can be reduced by around 20%, with around 10% fewer deaths. The surge of active cases is also reduced by around 10%. However, social-distancing decreases the economic ratio by around 0.2. The impact of lockdown (level-2 restriction) is presented in Fig. 18. The illustration shows that placing a lockdown heavily decreases the spread of the disease; on the contrary, it also causes the economy to collapse. The simulation further points out that the spread of the disease can be fully halted by placing a 63-day lockdown. However, in a real-world scenario, complete elimination of a disease through lockdown is nearly impossible. Figure 19 illustrates the restrictions that the agent places in the virtual environment of population density 0.02. The initial state of the environment starts with a devastating pandemic situation in which the disease infects almost 1% of the population. Therefore, the agent places multiple 30-40-day lockdown segments to reduce the spread of the disease. The agent then removes the restrictions and stabilizes the economy. However, multiple smaller peaks of active cases appear in an approximately 100-day cycle. The agent reduces the spread of the disease by performing two types of actions. First, the agent activates a cyclic lockdown to level the spread of the virus while keeping the economy as steady as possible. The cyclic lockdown is then followed by a 10-20-day lockdown. Further analysis of the reproduction rates of the environment shows that this combination optimally reduces the reproduction rate below 1, which halts the spread of the disease.
In Figs. 20 and 21, the action sequences of the agent are illustrated for environments of population density 0.01 and 0.03, respectively. In both cases, the agent follows a cyclic lockdown if the situation is less severe; otherwise, it places a full lockdown. Furthermore, by closely evaluating the reproduction rate and active cases of the environment, a pattern in the lockdown placement can be observed.
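The severity-based choice described above, a full lockdown for severe outbreaks and a cyclic lockdown for milder resurgences, can be sketched as a simple decision rule. The thresholds and cycle length below are illustrative assumptions, not values learned by the paper's agent.

```python
# Sketch of the action pattern observed in Figs. 20-21: strict lockdown
# for severe outbreaks, cyclic lockdown for milder resurgences.
# severe_ratio and cycle are illustrative assumptions only.

FULL_LOCKDOWN = 2      # level-2 restriction
SOCIAL_DISTANCING = 1  # level-1 restriction
NO_RESTRICTION = 0     # level-0 restriction

def choose_action(active_ratio, reproduction_rate, day,
                  severe_ratio=0.01, cycle=3):
    """Pick a restriction level from the observed outbreak severity."""
    if active_ratio >= severe_ratio and reproduction_rate > 1.0:
        return FULL_LOCKDOWN  # severe outbreak: strict lockdown
    if reproduction_rate > 1.0:
        # milder resurgence: cyclic lockdown, alternating every `cycle` days
        return FULL_LOCKDOWN if (day // cycle) % 2 == 0 else NO_RESTRICTION
    return NO_RESTRICTION     # outbreak under control: lift restrictions
```

For example, a high active-case ratio with a reproduction rate above 1 yields a strict lockdown, while a low ratio with the same reproduction rate yields an alternating on/off pattern.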
The agent places lockdowns based on the active cases and the reproduction rate. However, it can be observed that the agent sometimes avoids placing a lockdown even when the reproduction rate is high: it only places a lockdown when both the active cases and the reproduction rate are high, and it removes the lockdown when the reproduction rate falls below 1. To discover the reason for this behavior, let us consider the following formula:

δIncrease_disease = ActiveCases × R0   (6)

The equation formulates the possible number of people who may become infected on the next day. The reproduction rate R0 represents the average number of new infections caused by an infectious person, and ActiveCases indirectly represents the number of infected persons on a single day. Therefore, the increase in infectious cases can generally be formulated using Eq. (6). The agent places strict lockdown actions when the value of Eq. (6) becomes too high; for minor cases, it follows a cyclic lockdown phase. This optimally keeps the spread of the disease below a particular percentage.
In Fig. 22, we further compare the agent's policy with the traditional n-work-m-lockdown policy. From the comparison, it can be justified that only maintaining the n-work-m-lockdown policy is not an optimal solution to mitigate a pandemic. Adding 40 days of full lockdown before following the n-work-m-lockdown policy reduces the first surge of the disease. However, the n-work-m-lockdown policy alone does not properly control the spread of the disease, and a resurgence is therefore observed. From this general comparison, it can be validated that an agent can optimally control a pandemic crisis if a proper training method is implemented.
Figure 19. The graphs represent the movement restrictions provided by the agent. The red regions of the graph denote the days when a lockdown is placed; the green regions denote the days when no lockdown is placed. In the early stage of the environment, the agent places multiple 20-40-day lockdowns to reduce the spread of the disease. In the later stage, to control the resurgence of the disease, the agent performs a cyclic lockdown (1-3-day cycle) followed by a 10-15-day lockdown to reduce the future spread of the virus. The agent mostly follows this pattern when both the active-case percentage and the reproduction rate are high.
The other environmental parameters are kept unchanged. The graph resembles the similar action pattern of the agent observed in the 0.02 population density environment. However, due to the increased population density, the spread of the disease is also increased; therefore, the agent mostly places strict lockdowns instead of cyclic lockdowns.
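The schedule comparison above can be mimicked on the same kind of toy SIR model, running a 7-work-7-lockdown schedule with and without a 40-day full lockdown in front. The transmission rates and the assumed effect of lockdown are illustrative, not the paper's simulator, so the resulting percentages will not match Fig. 22 exactly; only the qualitative ordering is of interest.

```python
# Rough comparison of lockdown schedules on a toy SIR model, in the
# spirit of Fig. 22. beta_open, beta_lock, and gamma are illustrative
# assumptions; a schedule is a function day -> bool (True = lockdown).

def total_infected(schedule, days=365, population=10_000,
                   beta_open=0.3, beta_lock=0.05, gamma=0.1):
    s, i, r = float(population - 10), 10.0, 0.0
    for day in range(days):
        beta = beta_lock if schedule(day) else beta_open
        new_inf = beta * s * i / population
        rec = gamma * i
        s, i, r = s - new_inf, i + new_inf - rec, r + rec
    return (population - s) / population  # cumulative infected fraction

def work7_lock7(day):
    # plain 7-work-7-lockdown cycle from day 0
    return (day // 7) % 2 == 1

def full40_then_cycle(day):
    # 40-day full lockdown first, then the same 7-7 cycle
    return day < 40 or ((day - 40) // 7) % 2 == 1

immediate = total_infected(work7_lock7)
delayed = total_infected(full40_then_cycle)
```

Under these assumptions the initial 40-day lockdown suppresses the first surge, delaying and slightly reducing the cumulative infections, in line with the qualitative claim of the comparison.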

Discussion
The paper motivates readers toward the achievements and advancements of reinforcement learning through its application to controlling a pandemic crisis. We introduce a virtual environment that closely resembles a pandemic situation and diligently investigate new tactics to mitigate disease by applying reinforcement learning. We then perform a careful analysis of the impact of lockdown, social-distancing, and agent-based solutions in preventing the spread of disease. We find our proposed scheme convincing in achieving optimal decisions that balance the pandemic and economic situations. We strongly believe that the contribution of this research endeavor will unite epidemic study with reinforcement learning, and may help the human race defend against pandemic crises.
Received: 22 July 2020; Accepted: 2 December 2020
Figure 22. The graph presents a comparison of the agent's policy with the traditional n-work-m-lockdown policy. The comparison is performed on a population of 10,000 with a density of 0.02. By only maintaining a 7-work-7-lockdown policy, the rapid spread of the virus cannot be halted, and therefore a total of 34.5% of the population gets infected. If the 7-work-7-lockdown policy is applied after a full lockdown of 40 days, the overall infection decreases to 11.5%. However, the agent-generated policy mostly flattens the curve.