## Introduction

Single-layer graphene, as an iconic two-dimensional (2D) material, has drawn much scientific attention in recent decades. Because of its ultrathin thickness and outstanding mechanical properties, graphene with artificial pores has been demonstrated to have great potential in many engineering applications, such as effective hydrogen gas separators1,2,3, next-generation energy storage or supercapacitor building4,5, and high-resolution DNA sequencing6,7,8. Given the potentially imminent global water scarcity crisis, another important application for nanoporous graphene is energy-efficient water desalination9,10. Equipped with nanoporous 2D material membranes like graphene, the reverse osmosis (RO) water desalination process can expect an improvement of 2–3 orders of magnitude in water flux compared with traditional polymeric membranes9,10,11,12,13,14. In RO, the geometry of nanopores in 2D materials plays a determinant role in water desalination performance9,11. In general, a large pore that allows high water flux is likely to perform poorly in rejecting ions; a small pore that rejects 100% of undesired ions, on the other hand, usually has limited water flux. Thus, an optimal nanopore for water desalination is expected to allow as high a water flux as possible while maintaining a high ion rejection rate. However, finding the optimal nanopore geometry on graphene can be challenging due to the high computational and experimental cost of an exhaustive search: there are countless possible shapes for a pore on a 4 nm × 4 nm graphene membrane, but evaluating the water flux and ion rejection of a single pore using a 10 ns molecular dynamics (MD) simulation takes roughly 36 h on a 56-core CPU cluster. Given this time benchmark, evaluating the water desalination performance of 1000 graphene nanopores can take more than 4 years.
Therefore, to discover the optimal graphene nanopore for water desalination, an efficient nanopore screening method with a fast nanopore water desalination performance predictor (performance predictor for short) is needed. Inspired by the recent success of deep learning15 and reinforcement learning (RL)16, we create an AI framework that combines a state-of-the-art deep reinforcement learning (DRL) algorithm with a convolutional neural network (CNN) to solve this challenge.

The main idea of RL17 is to train an agent to find an optimal policy that maximizes the expected future return by actively interacting with the environment to achieve a goal. Recently, DRL16,18, which models the RL agent with artificial neural networks, has proven to be an efficient tool in material-related engineering fields, such as material design19,20,21 and molecule optimization22. In this work, we designed and implemented an artificial intelligence framework built around DRL, which is capable of creating a nanopore on a single-layer graphene membrane to reach optimal water desalination performance. Through a series of decisions on whether or not to remove a carbon atom and which atom to remove, the DRL agent can eventually create a pore that allows the highest water flux while maintaining the ion rejection rate above an acceptable threshold. Such precisely controlled atom-by-atom nanopore synthesis can be conducted by electrochemical reaction23,24. Perforation technologies also offer opportunities to experimentally control the formation of pores, gaps, and bridges with nanometer dimensions on 2D materials such as graphene25,26,27,28. During training, the DRL agent learns from feedback based on the water desalination performance (e.g., a reward for high water flux and a penalty for low ion rejection). However, conventional methods to calculate desalination performance, like MD simulation, are too time-consuming to be used inside our DRL model. To evaluate DRL-designed nanopores quickly and accurately, we implemented a CNN-based29,30,31,32 model that uses the geometry of the porous graphene membrane to directly predict the water flux and ion rejection rate under a given external pressure. To this end, a ResNet (ref. 32) model is trained on the dataset we collected through MD simulations of water desalination using various graphene nanopores.
With the CNN-accelerated desalination performance prediction, the DRL model can rapidly discover the optimal graphene nanopore for water desalination. MD simulations on top-performing DRL-created graphene nanopores prove that they have higher water flux while maintaining a similar ion rejection rate compared to the circular nanopores. Further investigation of molecular trajectories reveals the reason that DRL-created nanopores outperform the conventional circular nanopores and provides insights for energy-efficient water desalination. Lastly, our AI-driven framework can be potentially applied to various application areas33 of 2D materials besides water desalination, such as gas permeation and separation, battery and supercapacitor applications, and biomolecular translocation34,35.

## Results

### AI framework

The AI framework for designing nanopores for efficient water desalination (Fig. 1) consists of a DRL agent and a CNN-based performance predictor network. At each timestep, the DRL agent generates an updated nanopore by removing at most one atom from the graphene, and the CNN-based performance predictor network predicts the water flux/ion rejection rate of the nanopore, so that the DRL agent gets instantaneous feedback on its action. Given the featurized information of the nanoporous graphene sheet (Morgan fingerprint, Cartesian coordinates of each atom, and geometrical features of the graphene membrane from the CNN model) and the predicted water flux and ion rejection, the DRL agent (details shown in Supplementary Fig. 1) was trained to create a pore on the graphene sheet with the goal of maximizing its performance in the water desalination process. The dataset used to train the CNN performance predictor was generated by MD simulations of various graphene nanopores for water desalination.

### Graphene nanopore dataset

We consider the graphene nanopore system illustrated in Fig. 2a, which consists of four sections: a graphene piston that applies constant external pressure; a saline water section containing potassium chloride as the solute; a single-layer graphene membrane with a pore, whose geometry differs between simulations; and a freshwater section that functions as a reservoir of filtered water. The molarity of the saline water in this work is ~2.28 M, which is higher than that of normal seawater for the sake of computational efficiency. The dimensions of the simulation box are approximately 4 nm × 4 nm × 13 nm in the x, y, and z directions, respectively. Periodic boundary conditions were applied in all three dimensions.

The two major performance indicators of a membrane in water desalination, water flux and ion rejection rate, were calculated by post-processing the MD simulation trajectories. The water flux of each membrane was taken as the slope of the least-squares regression line fitted to the curve of filtered water molecules versus simulation time (Fig. 2b). The ion rejection rate of each membrane was calculated as the fraction of ions that did not pass through the membrane, i.e., one minus the number of ions in the freshwater section divided by the total number of ions.
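The two post-processing steps can be illustrated with a minimal sketch. It assumes the trajectory has already been reduced to a list of timestamps and a cumulative count of filtered water molecules, and it takes ion rejection to be the fraction of ions that did not reach the freshwater section. The function names and the sample numbers are hypothetical and are not taken from the simulations.

```python
def water_flux(times_ns, filtered_counts):
    """Least-squares slope of filtered-water count vs. time (molecules per ns)."""
    n = len(times_ns)
    mean_t = sum(times_ns) / n
    mean_c = sum(filtered_counts) / n
    cov = sum((t - mean_t) * (c - mean_c)
              for t, c in zip(times_ns, filtered_counts))
    var = sum((t - mean_t) ** 2 for t in times_ns)
    return cov / var


def ion_rejection(ions_in_freshwater, total_ions):
    """Fraction of ions that did not cross the membrane (assumed definition)."""
    return 1.0 - ions_in_freshwater / total_ions


# Hypothetical, made-up trajectory summary: one sample per ns.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
counts = [0, 41, 79, 121, 160]

flux = water_flux(times, counts)                               # molecules per ns
rejection = ion_rejection(ions_in_freshwater=2, total_ions=50)
```

Fitting a slope rather than dividing the final count by the total time makes the estimate less sensitive to fluctuations in any single frame.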

A total of 185 different porous graphene membranes were simulated. Since the reward of the DRL agent in our model was calculated from the water flux/ion rejection predictions of the performance predictor (Eqs. (1) and (2); Supplementary Fig. 2), highly accurate predictions were needed to ensure the quality of DRL training, and a much larger training dataset was necessary to optimize the CNN model. We therefore used data augmentation36,37 to substantially increase the size of the dataset. Given that the water desalination performance of a graphene pore depends on its size and geometry, we could assume that a flipped or translated pore on the same graphene membrane would show water flux/ion rejection rate identical to that of the original pore (confirmed by MD simulations in Supplementary Fig. 3). Therefore, copies of the original pores were created by flipping them along the x- or y-axis and/or translating them by −4 to 4 Å in the x and y directions (Fig. 2c). The water desalination performance assigned to each copy was drawn from a normal distribution (μ = original pore performance, σ = 1% of original pore performance). In order to improve the CNN’s prediction accuracy on pores created by the DRL agent, we augmented DRL-generated pores 32 times. Among the other pores, those with zero water flux (too small to allow water transport) were augmented 6 times, and the rest were augmented 24 times. The final dataset used for CNN training contains 3937 samples (Fig. 2d). A reverse sigmoid function was fitted to the distribution of samples to show the general relationship between water flux and ion rejection rate.
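The augmentation described above can be sketched in plain Python. This is an illustration rather than the ASE-based code used in the study: the 40 Å box size, the periodic wrapping of translated atoms, and the names `BOX` and `augment` are assumptions made for the example.

```python
import random

BOX = 40.0  # assumed membrane side length in Å (4 nm), periodic in x and y


def augment(coords, label, mirror_x=False, mirror_y=False,
            shift=(0.0, 0.0), rng=random):
    """Return a mirrored/translated copy of a pore and a noisy label.

    coords: list of (x, y) atom positions in Å.
    label:  water flux or ion rejection of the original pore.
    """
    new_coords = []
    for x, y in coords:
        if mirror_x:               # reflect x about the box: x -> BOX - x
            x = BOX - x
        if mirror_y:               # reflect y about the box: y -> BOX - y
            y = BOX - y
        x = (x + shift[0]) % BOX   # translate with periodic wrapping
        y = (y + shift[1]) % BOX
        new_coords.append((x, y))
    # Label noise: normal distribution with sigma = 1% of the original value.
    noisy_label = rng.gauss(label, 0.01 * abs(label))
    return new_coords, noisy_label


rng = random.Random(0)
pore = [(10.0, 20.0), (38.0, 5.0)]     # hypothetical atom positions
copy, flux_label = augment(pore, 40.0, mirror_x=True,
                           shift=(4.0, -4.0), rng=rng)
```

In practice the shift would be drawn at random from the −4 to 4 Å range; it is fixed here so the result is easy to check by hand.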

### Water desalination performance prediction

To facilitate the efficient estimation of water desalination performance in our AI-driven framework, a CNN model was trained to make an instantaneous prediction of water flux and ion rejection rates given a specific graphene nanopore. CNN is widely known as a universal feature extractor. Given that the water desalination performance of a graphene nanopore depends on its geometrical features, CNN can be the most suitable model to recognize geometrical features and make predictions based on them. The CNN models were implemented based on VGG31 and ResNet32, and a multi-layer perceptron (MLP) was built on top of the convolutional layers to project the CNN-extracted features to the predicted water desalination performance (i.e., flux and ion rejection rate).

We compared the performance of the CNN-based deep learning models with that of XGBoost38, a widely used shallow machine learning model, which was also trained to predict the water flux/ion rejection rate. The advantage of the XGBoost model is that it requires much less training time than a CNN. Before training the XGBoost model, the graphene membrane was featurized into a one-hot-encoded Morgan fingerprint39 vector of dimension 1024 using the RDKit package40, with a cutoff distance of 5 Å. The Morgan fingerprint vector was then fed into the XGBoost regression model as input. A random search was conducted on the hyperparameter grid (Supplementary Tables 2 and 3) for model optimization.

The mean squared error (MSE) and coefficient of determination (R2) are used as metrics to evaluate the performance predictions of the models. The water flux and ion rejection labels are standardized before being fed into the property prediction models, so the tabulated metrics are based on standardized water flux or ion rejection rate (Table 1). Since the accuracy of the performance predictor directly influences how accurately the DRL agent is rewarded or penalized during training, the model with the lowest MSE and highest R2 was chosen for reward estimation. The ResNet (ref. 32) models significantly outperformed the other models on both metrics, and the fine-tuned ResNet50 model reaches the highest accuracy in predicting both water flux and ion rejection rate. Therefore, a ResNet50 (retrained using the whole dataset) is used to predict the water desalination performance of various graphene nanopores to accelerate the DRL training.
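For reference, the label standardization and the two metrics can be written out in a few lines. This is a generic sketch of the standard definitions, not the evaluation code behind Table 1. The helper names are hypothetical, and standardization here uses the population standard deviation.

```python
from statistics import mean, pstdev


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def mse(y_true, y_pred):
    """Mean squared error between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mu = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mu) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


labels = standardize([10.0, 20.0, 30.0, 40.0])   # hypothetical flux labels
perfect = list(labels)                           # a perfect predictor
baseline = [mean(labels)] * len(labels)          # always predicts the mean
```

On standardized labels, a predictor that always outputs the mean has an MSE of about 1 and an R2 of 0, which gives a useful reference point when reading the tabulated values.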

### DRL for discovering the optimal graphene nanopores

Our goal was to design the optimal geometry of a graphene nanopore for energy-efficient water desalination, which simultaneously demanded high flux and high ion rejection under a given external pressure. In order to optimize the nanopore, an agent was expected to remove atoms sequentially until the desired pore geometry was developed. To this end, the agent was set to interact with graphene nanopores in a sequence of actions $${a}_{t}$$, states $${s}_{t}$$, and rewards $${r}_{t}$$ within an episode of length T. The goal of the agent was to select actions that maximize the future discounted return $${R}_{t}=\mathop{\sum }\nolimits_{t^{\prime} = t}^{T}{\gamma }^{t^{\prime} -t}{r}_{t^{\prime} }$$ in the finite Markov decision process (MDP) setting. In our case, we set the discount factor γ to be 1.
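As a small worked example, the discounted return from a given timestep can be accumulated backwards over the remaining rewards of an episode. The reward values below are invented for illustration. With γ = 1, as used here, the return is simply the sum of the remaining rewards.

```python
def discounted_return(rewards, gamma=1.0):
    """Return sum_k gamma**k * rewards[k], accumulated from the last step back."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical per-step rewards for the remainder of an episode.
episode_rewards = [0.05, 0.10, -1.00]

undiscounted = discounted_return(episode_rewards, gamma=1.0)
discounted = discounted_return(episode_rewards, gamma=0.5)
```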

At timestep t, given the graphene nanopore $${G}_{t}$$, the agent observed the state $${s}_{t}$$, which was composed of the Morgan fingerprint39, the coordinates of all the atoms, and the CNN-extracted graphene geometrical features. The graphene geometry $${g}_{t}^{\prime}$$ was fed into the flux predictor and the ion rejection predictor. The geometrical features were the concatenation of the last-layer activations before the output of the two performance predictors. Once an atom was removed, its coordinates were set to the origin, since the MLP required a fixed input dimension. The predicted flux $${f}_{t}$$ and ion rejection $${i}_{t}$$ were used to compute the reward signal $${r}_{t}$$ for the agent, as given in Eqs. (1) and (2):

$$\sigma (x)=A+\frac{K-A}{{(C+Q{e}^{-Bx})}^{\frac{1}{\nu }}},$$
(1)
$${r}_{t}=\alpha {f}_{t}+\sigma ({i}_{t})-\sigma (1),$$
(2)

where $$\sigma (\cdot )$$ is the generalized logistic function41 and α is the coefficient of the flux term. In our setting, α was set to 0.01, and A = −15, K = 0, B = 13, Q = 100, ν = 0.01, C = 1 for the logistic function. The linear flux term encouraged the agent to expand nanopores, which would allow higher water flux. Since a low ion rejection rate was not favored in water desalination, the generalized logistic function was used to penalize low ion rejection. When $${i}_{t}$$ was high, $$\sigma ({i}_{t})$$ was close to zero, allowing the nanopore to grow. However, when $${i}_{t}$$ was low, $$\sigma ({i}_{t})$$ heavily penalized the agent by outputting a large negative value (Supplementary Fig. 2). In addition, an extra reward of 0.05 was given to the agent whenever it chose to remove an atom at timestep t, to encourage pore growth at an early stage. Given state $${s}_{t}$$ and reward $${r}_{t}$$, the agent chose the action $${a}_{t}$$ for the next step. However, because the action space was high-dimensional (any atom in the graphene fragment could be removed), it was computationally expensive for the agent to explore the possible actions thoroughly and learn an optimal design. Therefore, only a subset of M atoms was selected as candidates $${c}_{t}$$. Atoms on the edge of the pore were ranked by their proximity to the pore center; if the number of edge atoms exceeded M, only the M atoms closest to the pore center were selected. When the number of edge atoms was less than M, the non-edge atoms closest to the pore center were added as candidates to keep the size of $${c}_{t}$$ fixed. Given the state, reward, and candidates, the agent learned to pick the action that maximized future rewards.
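The reward of Eqs. (1) and (2) and the candidate selection rule can be sketched as follows, using the constants quoted above. The function names, the `(x, y, is_edge)` atom representation, and the assumption that edge atoms are already flagged are illustrative choices, not details taken from the implementation. The logistic term is only evaluated for ion rejection values in [0, 1]; far outside that range the power of 1/ν = 100 can overflow.

```python
import math

# Constants quoted in the text for Eqs. (1) and (2).
A, K, B, Q, NU, C = -15.0, 0.0, 13.0, 100.0, 0.01, 1.0
ALPHA = 0.01           # coefficient of the flux term
REMOVAL_BONUS = 0.05   # extra reward when an atom is removed


def generalized_logistic(x):
    """Eq. (1); x is an ion rejection rate, assumed to lie in [0, 1]."""
    return A + (K - A) / (C + Q * math.exp(-B * x)) ** (1.0 / NU)


def reward(flux, ion_rejection, removed_atom):
    """Eq. (2) plus the bonus for removing an atom."""
    r = (ALPHA * flux
         + generalized_logistic(ion_rejection)
         - generalized_logistic(1.0))
    if removed_atom:
        r += REMOVAL_BONUS
    return r


def select_candidates(atoms, center, m):
    """Pick m candidate atom indices: edge atoms first, nearest to the center.

    atoms: list of (x, y, is_edge) tuples (hypothetical representation).
    """
    def dist(i):
        return math.hypot(atoms[i][0] - center[0], atoms[i][1] - center[1])

    edge = sorted((i for i, a in enumerate(atoms) if a[2]), key=dist)
    if len(edge) >= m:
        return edge[:m]
    # Too few edge atoms: pad with the non-edge atoms closest to the center.
    rest = sorted((i for i, a in enumerate(atoms) if not a[2]), key=dist)
    return edge + rest[:m - len(edge)]


atoms = [(1.0, 0.0, True), (3.0, 0.0, True),
         (2.0, 0.0, False), (5.0, 0.0, False)]
candidates = select_candidates(atoms, center=(0.0, 0.0), m=3)
```

With these constants the logistic term at perfect rejection is slightly below zero, which is why Eq. (2) subtracts σ(1): a pore that rejects every ion then receives no penalty at all.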

We optimized the DRL agent via deep Q-learning16 with experience replay, using 10 random seeds to generate various graphene nanopores. In the plots of the DRL training processes with different random seeds (Fig. 3), the red curves indicate mean values and the blue shading represents standard deviations. The accumulated reward per episode increases as the DRL agent is trained (Fig. 3a). Initially, the policy is noisy and the accumulated rewards are low because the DRL agent has not yet learned to stop expanding the pore before receiving an enormous penalty for a low ion rejection rate. During training, the DRL agent gradually learns a stable policy by maximizing the rewards (balancing the trade-off between water flux and ion rejection rate). The performance of the DRL agent after 2000 episodes of training is shown in Fig. 3b–e. The DRL agent generates nanopores that bring a positive reward at each timestep, and it also learns automatically to stop enlarging the nanopore to avoid a low ion rejection rate (Fig. 3b, c). For example, the evolution of a DRL-created pore (Fig. 3f, animated in Supplementary Movie) shows that the agent stops removing atoms from the edge of the graphene nanopore after the 50th timestep, because it determines that the higher water flux reward brought by removing further atoms is not worth the penalty for a low ion rejection rate. Based on the predictions of the performance predictor, the DRL-created graphene nanopores have an average water flux of ~40 # ns−1 and an average ion rejection rate of ~96% (Fig. 3d, e).

### Investigation on DRL-created graphene nanopores

The collection of both DRL-created graphene nanopores (7999 samples) and nanopores in the training dataset (3937 samples) is visualized using the t-SNE42 algorithm (Fig. 4). t-SNE maps the high-dimensional features (1000 dimensions) extracted from the trained CNN models to a low-dimensional domain while preserving the similarity between data points as relative distance in 2D. In other words, CNN features that are more similar to each other have a higher tendency to be clustered together. In this work, using CNN-extracted features from each graphene membrane, t-SNE successfully clusters samples with similar water flux or ion rejection. Also, as illustrated in Fig. 4, graphene membranes with different nanoporous structures are far from each other in the plot, while those with similar structures appear close together. The results indicate that our CNN model successfully learns to extract features that strongly correlate the water desalination performance (i.e., water flux and ion rejection) with the geometry of the nanopores.

The water desalination performances of all nanopores, including DRL-created ones and those in the training dataset, are compared in Fig. 5a. A comparison of the permeation rates of the nanopores, i.e., the water flux normalized by the external pressure, is shown in Supplementary Fig. 4. It is worth noting that generating 7999 nanopores using DRL and predicting their water flux/ion rejection rate takes less than a single week; by contrast, evaluating the performance of the same number of nanopores using MD simulation would take ~33 years (36 h per sample on average, using one 56-core CPU node). Among the nanopores with the same level of ion rejection rate, some nanopores discovered by DRL allow much higher water flux. One common feature shared by those high-performance nanopores is a semi-oval geometry with rough edges. We set a 90% ion rejection rate as the threshold to determine whether a nanopore can effectively reject ions. The water flux histogram (Fig. 5b) shows that, given a baseline ion rejection rate of 90%, DRL can extrapolate from the training dataset and discover graphene nanopores that generally allow higher water flux.
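The time estimates quoted in the text follow from simple arithmetic, assuming the simulations run one after another on a single node at 36 h each. The helper name below is hypothetical.

```python
HOURS_PER_SIMULATION = 36      # average wall time per 10 ns MD run (from the text)
HOURS_PER_YEAR = 24 * 365


def md_years(n_pores, hours_per_sim=HOURS_PER_SIMULATION):
    """Years needed to simulate n_pores sequentially on one node."""
    return n_pores * hours_per_sim / HOURS_PER_YEAR


years_for_drl_pores = md_years(7999)   # the pores explored by DRL
years_for_1000_pores = md_years(1000)  # the benchmark from the Introduction
```

Running on several nodes in parallel would divide these figures accordingly, but the per-pore cost would remain far above that of a single CNN forward pass.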

Further MD simulations are conducted on DRL-created graphene nanopores with high predicted performance to evaluate how DRL helps in discovering the optimal graphene nanopore for water desalination (simulation process recorded in Supplementary Movie). Although DRL-created pores generally have lower water flux than circular pores with the same area, they have a much higher ion rejection rate (Fig. 5c; the 90% ion rejection threshold is marked by a red dashed line). For example, when the pore area is 113 Å2, the DRL-created nanopore maintains an ion rejection rate above 90%, while the circular pore rejects only approximately 65% of ions. A pore with high water flux but a very low ion rejection rate is not desirable in water desalination applications. Moreover, the comparison between the 113 Å2 DRL-created nanopore and the 88 Å2 circular pore shows that the DRL-created pore can reject more ions at the same water flux: both have a water flux of approximately 125 # ns−1, while the 113 Å2 DRL-created pore rejects approximately 7% more ions. The comparison between simulation results shows that DRL tends to prioritize the ion rejection rate over water flux, which makes it capable of maximizing the water flux of nanopores while maintaining a valid ion rejection rate. Nanopores with a larger area result in higher pore density on the graphene membrane. The pore density of the graphene membranes with the above-mentioned nanopores is tabulated in Supplementary Table 4. In real-world experiments or applications, the graphene nanopores can be stabilized by passivating the edge of the pore, for example with hydrogen43.

To gain a deeper understanding of the reason behind the high ion rejection rate of DRL-created pores, the distributions of water molecules and ions inside the 113 Å2 DRL-created pore and the 88 Å2 circular pore have been visualized (Fig. 5d). From the ion distribution (marked by red dots), we can observe that ions traverse the circular pore evenly through the entire central area of the pore. The distributions of water molecules (marked in aqua blue) and ions in the circular pore are homogeneous. However, the corners inside the DRL-created nanopore are small enough to block the passage of ions while being large enough to accommodate the transport of water molecules. Given that ions are covered by a hydration shell during transport through the nanopore, it can be seen that the ion-free zones (corners) inside the DRL-created nanopore obstruct the passage of hydrated ions by a steric effect (Fig. 5e). The perimeter/area ratio can be used as a shape parameter to quantitatively evaluate the influence of geometry on the water desalination performance of nanopores. The comparison of the perimeter/area ratios of DRL-created and circular pores (Supplementary Fig. 6) shows that, owing to their rough edges, DRL-created pores generally have a higher perimeter/area ratio (Supplementary Table 4). A higher perimeter/area ratio enables DRL-created pores to achieve a higher ion rejection rate than circular pores with similar water flux or permeation rate. This is the reason why the high-performance nanopores (zoom-in of Fig. 5a; more high-performance DRL-created pores shown in Supplementary Fig. 7) all have rough edges. By discovering and utilizing this special geometry, DRL identifies nanopores that can reject most ions while allowing high water transport.
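The role of the perimeter/area ratio as a shape parameter can be illustrated with a small geometric sketch. The study computes area and perimeter with computer vision methods (Supplementary Fig. 5); the version below instead assumes the pore outline is available as a polygon and uses the shoelace formula. The cross-shaped outline is a made-up stand-in for a pore with corners, not a pore from the dataset.

```python
import math


def polygon_area_perimeter(vertices):
    """Area (shoelace formula) and perimeter of a simple polygon."""
    twice_area = 0.0
    perimeter = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(twice_area) / 2.0, perimeter


def circle_perimeter_area_ratio(area):
    """Perimeter/area of a circle with the given area (equals 2 / radius)."""
    return 2.0 * math.sqrt(math.pi * area) / area


# Hypothetical cross-shaped outline with four narrow corners (arbitrary units).
cross = [(1, 0), (2, 0), (2, 1), (3, 1), (3, 2), (2, 2),
         (2, 3), (1, 3), (1, 2), (0, 2), (0, 1), (1, 1)]

area, perimeter = polygon_area_perimeter(cross)
rough_ratio = perimeter / area
circle_ratio = circle_perimeter_area_ratio(area)
```

A circle has the smallest perimeter of any shape with a given area, so any outline with corners or rough edges gives a larger ratio than the circular pore of equal area.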

## Discussion

In this work, we propose an AI framework that combines DRL and a CNN performance predictor to discover the optimal graphene nanopore for water desalination. The DRL agent takes the current graphene geometrical features and the candidate atoms as inputs to determine which atom to remove at each timestep. Trained with the DQN algorithm, the agent learns to generate nanopores that allow high water flux while maintaining high ion rejection. ResNet50, a widely used CNN model, is trained on a graphene nanopore dataset to instantly predict the water flux and ion rejection rate under a given pressure. Such prediction by the ResNet50 enables real-time interaction between the DRL agent and the graphene nanopores, as well as online optimization of the DRL agent. CNN-accelerated DRL training significantly expedites the exploration of graphene nanopores: 7999 different nanopores are created and evaluated for water desalination performance during 1 week of DRL training. Evaluating the same number of graphene nanopores using MD simulation would take approximately 33 years with a 56-core CPU cluster. When we set the baseline ion rejection rate to 90%, DRL shows the capability of extrapolating from the existing training dataset to discover nanopores with higher water flux. Further MD simulations confirm that DRL-created nanopores outperform circular nanopores in terms of ion rejection rate when they have approximately the same water flux. The better water desalination performance of DRL-created pores can be attributed to DRL’s utilization of rough edges and small corners, which increase the perimeter/area ratio of pores and block ions with their hydration shells. In conclusion, DRL shows the capability of discovering optimal graphene nanopores for water desalination. Moreover, with only minor modifications, this framework can be directly extended to many other fields concerning nanomaterial design.
With a well-trained machine learning property predictor, the DRL can automatically learn to discover the optimal material structure effectively and efficiently.

## Methods

### MD simulations

MD simulations were conducted using the LAMMPS package44, where the simulated porous graphene membranes were either created using Visual Molecular Dynamics45 or automatically generated by the DRL agent (samples from the early stage of training). All water molecules in this work were simulated using the SPC/E model46, with the SHAKE47 algorithm constraining the bond lengths and angles. Lennard–Jones (LJ) potentials (Supplementary Table 1) along with long-range Coulombic electrostatic potentials were adopted as interatomic potentials in the MD simulation. The cutoff for the interatomic potentials was set to 12 Å. Lorentz–Berthelot rules were employed to calculate the LJ potentials between different kinds of atoms. A particle-particle particle-mesh (PPPM) Ewald solver48 with a root-mean-squared error of 0.005 was used for long-range Coulombic potential correction. The porous graphene membrane and the piston were each treated as a single entity during the simulation (internal interatomic potentials were not calculated) in order to reduce the computational cost.

In the first stage of each individual simulation, the internal energy of the system was minimized for 1000 iterations. The system then ran for 5 ps under the NPT (isothermal–isobaric) ensemble at 300 K after the velocities of the molecules were initialized from a Gaussian distribution. After equilibration under the NPT ensemble, the system was switched to the NVT (canonical) ensemble and run for another 10 ns. The temperature was maintained at 300 K by a Nosé–Hoover thermostat49,50 with a time constant of 0.5 ps. At this stage, a constant external pressure of 100 MPa in the z-direction was applied to the saline water by the piston to mimic the RO process in water desalination. Since the relationship between water flux and external pressure in the RO process is generally linear9,11,12,13, the performance of pores under 100 MPa can be extrapolated to lower pressures. Therefore, we chose to run simulations under 100 MPa external pressure to rapidly collect meaningful data. Molecular trajectories of each simulation were collected every 5 ps for data processing. Data augmentation was conducted using the Atomic Simulation Environment (ASE) package51. The area and perimeter of the graphene nanopores were calculated using computer vision methods (details in Supplementary Fig. 5).

### CNN water desalination performance predictor

There were two steps in the CNN modeling, including extracting features from the geometry of graphene nanopore and making predictions through an MLP regression model. First of all, the geometrical features of a graphene nanopore were extracted to a 380 × 380 pixels representation. Color was applied on top of each atom, and all geometrical features were resized to the dimension of 224 × 224 pixels. The processed geometrical features were then fed into a CNN. Multiple CNN models, including ResNet18, ResNet50 (ref. 32), and VGG16 (ref. 31) with batch normalization, were benchmarked based on the MSE and R2 of their resulting water flux/ion rejection rate predictions. An extracted feature vector with the dimension of 1000 was output from the CNN model. Finally, given the feature vector, the MLP was able to make predictions of flux and ion rejection rates. The MLP used in this work consisted of two layers with 256 and 64 neurons in the first and second layers, respectively. A residual block32 and ReLU52 activation function were added after each layer of MLP. Two CNN models were trained: one for the prediction of water flux and the other for ion rejection rate.

The CNN models, including VGG31 and ResNet (ref. 32), were implemented with the PyTorch library53 and pre-trained on the ImageNet dataset54 to learn a robust CNN feature extractor. A randomly initialized MLP was built on top of the convolutional layers to project the CNN-extracted features to the predicted water desalination performance (i.e., flux and ion rejection rate). In training the deep learning models on our graphene dataset, we used the gradient-based Adam optimizer55 with learning rates of 0.0001 and 0.001 for the pre-trained convolutional layers and the MLPs, respectively. The whole graphene dataset was split into a training set and a test set with a ratio of 4:1, and the models were trained only on the training set for 600 epochs and evaluated on the test set. The model that reached the best performance (i.e., lowest MSE in predicting the flux/ion rejection rate) on the test set was selected as the water desalination performance predictor in the DRL framework. These training strategies preserved the robust and informative feature extractors of the pre-trained CNN models and prevented the model from overfitting the graphene dataset.

### DRL agent

To train the agent, deep Q-learning16 with experience replay was implemented. Our task only considered a deterministic environment, namely, given the pair (s, c) and the action a, the pair $$(s^{\prime} ,c^{\prime} )$$ at the next timestep was fully determined. Based on the Bellman equation17, the optimal action-value function Q*(s, c) in the deterministic environment was defined as

$${Q}^{* }(s,c)=r+\gamma \mathop{\max }\limits_{c^{\prime} }{Q}^{* }(s^{\prime} ,c^{\prime} )$$
(3)

To model the Q function, two fully connected networks with identical architecture were built: the Q-network, parameterized by θ, and the target network, parameterized by $$\theta ^{\prime}$$. During training, only the parameters θ of the Q-network were updated through backpropagation from the loss function. The parameters $$\theta ^{\prime}$$ of the target network were overwritten with θ every 10 steps and kept fixed otherwise. The input to the network was the pair of graphene state and action candidates, (s, c), and the output was the Q values of all the actions in the candidate set. The agent then picked the action with the highest Q value. In addition, the agent’s experience $$(s,c,r,s^{\prime} )$$ in each episode was stored in a replay buffer $${\mathcal{D}}$$16, so that the experience could be reused to update the network parameters multiple times. During training, a mini-batch of samples was drawn uniformly at random from the replay buffer, $$(s,c,a,r,s^{\prime} ) \sim U({\mathcal{D}})$$. The loss function (Eq. (4)) measured the difference between the target Q value $${Q}^{* }(s^{\prime} ,c^{\prime} ;{\theta }_{i}^{\prime})$$ and the prediction of the current Q-network $$Q(s,c;{\theta }_{i})$$:

$${L}_{i}({\theta }_{i})={{\mathbb{E}}}_{(s,c,r,s^{\prime} ) \sim U({\mathcal{D}})}\left[{\left(r+\gamma \mathop{\max }\limits_{a^{\prime} }Q(s^{\prime} ,c^{\prime} ;{\theta }_{i}^{\prime})-Q(s,c;{\theta }_{i})\right)}^{2}\right]$$
(4)

In our setting, we use an Adam optimizer55 with learning rate 0.001. The replay buffer is of capacity 10,000 and batch size is set to 128.
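The replay buffer and the loss of Eq. (4) can be sketched without any deep learning library by treating the two networks as plain callables that return one Q value per candidate. This is a simplified illustration under several assumptions: the transition layout (including the stored next candidates and a `done` flag for terminal steps), the class and function names, and the toy Q functions are all hypothetical, and the periodic copy of θ into the target network is not shown.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity=10_000):
        self.data = deque(maxlen=capacity)

    def push(self, transition):
        self.data.append(transition)

    def sample(self, batch_size=128, rng=random):
        """Draw a mini-batch uniformly at random, without replacement."""
        k = min(batch_size, len(self.data))
        return rng.sample(list(self.data), k)


def td_loss(batch, q_net, target_net, gamma=1.0):
    """Mean squared difference between the bootstrapped target and Q(s, c)[a].

    q_net(s, c) and target_net(s, c) return a list of Q values,
    one per candidate action.
    """
    total = 0.0
    for s, c, a, r, s_next, c_next, done in batch:
        target = r if done else r + gamma * max(target_net(s_next, c_next))
        total += (target - q_net(s, c)[a]) ** 2
    return total / len(batch)


# Toy stand-ins for the Q-network and the target network.
q_net = lambda s, c: [0.0, 1.0]
target_net = lambda s, c: [2.0, 3.0]

buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(("s", ("c",), 1, 0.5, "s2", ("c2",), False))

loss = td_loss(buffer.sample(2, rng=random.Random(0)), q_net, target_net)
```

Keeping the target network fixed between synchronizations means the regression target does not shift with every gradient step, which is the usual motivation for using two networks in deep Q-learning.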