Introduction

Rapid development of technology based on advanced materials requires us to considerably shorten the existing ~20-year materials development timeline1. This long timeline stems both from the empirical discovery of promising materials and from the trial-and-error approach to identifying scalable synthesis routes for these material candidates. Over the last decade, we have made considerable progress in addressing the first of these challenges by using data-driven materials science to perform large-scale materials screening for improved properties. The exponential growth in available computing power and the increased efficiency of ab initio and machine learning (ML) driven materials simulation software have enabled high-throughput simulations of tens of thousands of materials from multiple material classes2,3,4,5. These high-throughput simulations and the resulting rich databases are increasingly being mined and analyzed using emerging ML techniques to identify promising material compositions and phases6,7,8,9,10. These strategies have been successfully employed to identify ultrahard materials, ternary nitride compositions, battery materials, polymers11, organic solar cells12, OLEDs13, and thermoelectrics14,15,16.

The identification of advanced materials is only one of the steps necessary to reduce the time to deployment of advanced materials17. An equally important component of this paradigm is the corresponding ability to synthesize these promising materials and compositions. However, techniques for experimental synthesis of materials have not kept pace with advances in computational materials screening17,18. As a result, materials synthesis is largely dominated by individual groups that can identify synthesis strategies for advanced materials based on empirical insights and materials intuition. Several strategies have been attempted to identify and optimize synthesis routes prior to actual synthesis. The first, common in chemical and biological synthesis of small molecules, uses high-throughput experimentation to screen for optimal synthesis precursors19,20,21,22. The effectiveness of such strategies is limited: an exhaustive search of synthesis strategies is prohibitively expensive and inefficient in terms of time and reagents, whereas a narrow search scheme that varies only a single synthesis parameter at a time will likely miss several promising synthesis strategies.

In contrast to the relatively widespread use of automated algorithms to optimize chemical reactions of molecular and organic systems23, synthesis planning for bulk inorganic materials is still in its infancy24,25. Non-solution-based synthesis of quantum materials involves more complicated time-correlations between synthesis parameters, which are not amenable to experimental high-throughput synthesis26. It also requires considerably more refined models than previous efforts, which considered only the combination of reactants to predict the outcome of chemical reactions27,28. Therefore, there are efforts to text-mine published synthesis profiles from the literature, including common solvent concentrations, heating temperatures, processing times, and precursors used, to extract common rules-of-thumb and identify synthesis schedules for materials29,30,31. However, even these emerging ML techniques are limited by the scarcity of data on existing schedules and synthesized materials, and their extension to potentially unknown materials is therefore problematic30. Finally, the identification of a synthesis schedule is the optimization of a time sequence of multiple synthesis parameters, which requires a new class of ML techniques. This problem is well-suited for Reinforcement Learning (RL), a branch of machine learning in which the goal of the RL agent is to design an optimal policy for problems that involve sequential decision making in an environment with thousands of tunable parameters and a huge search space32,33. Owing to this flexibility and its ability to handle complex tasks involving non-trivial decision making and planning under uncertainties imposed by the surrounding environment, RL has been used in robotics, in self-driving cars, and in the materials science domain for problems such as designing drug molecules with desired properties, predicting reaction pathways, and constructing optimal conditions for chemical reactions19,34,35,36,37,38,39,40,41.

In this work, we describe a model-based offline reinforcement learning scheme to optimize synthesis routes for a prototypical member of the family of 2D quantum materials, MoS2, via Chemical Vapor Deposition (CVD). CVD, a popular scalable technique for the synthesis of 2D materials42, has numerous time-dependent parameters such as temperature, flow rates, concentration of gaseous reactants, and type of reaction precursors, dopants and substrates (together referred to as the synthesis profile) that need to be optimized for the synthesis of advanced materials. Recent computational studies have identified several mechanistic details of the synthesis process43,44, but there are no comprehensive rules for designing synthesis strategies for a given material. We use RL specifically to (1) identify synthesis profiles that result in material structures that optimize a desired property (in our case, the phase fraction of the semiconducting crystalline phase of MoS2) in the shortest possible time and (2) understand the trends and time-correlations in the synthesis parameters that are most important in realizing materials with desired properties. These trends and time-correlations effectively provide information about the mechanism of the synthesis process. Experimental synthesis by CVD is time-consuming and not amenable to high-throughput synthesis, and is therefore incapable of generating the significant amount of multi-profile synthesis data required for RL training. Therefore, we train our RL workflow on data from simulated CVD performed using reactive molecular dynamics (RMD) simulations, which were previously shown to accurately reflect the potential energy surface of the reacting system as well as capture important mechanisms involved in the CVD synthesis of MoS2 from MoO3, including MoO3 self-reduction, oxygen-vacancy-enhanced sulfidation, SO/SO2 formation, and void formation and closure, identified in previous studies44,45,46,47,48,49.

Below, we describe results from the molecular dynamics simulation of CVD, followed by a representation of the dynamics of this CVD environment as a probability density function using a probabilistic deep generative model called a Neural Autoregressive Density Estimator (NADE-CVD), and model-based offline reinforcement learning to identify optimal synthesis strategies. We conclude with a discussion of the applicability of RL + NADE-CVD models for the prediction of long-time material synthesis.

Results

Reactive MD for chemical vapor deposition

We perform RMD simulations of the multi-step reaction of a MoO3 crystal with a sulfidizing atmosphere containing H2S, S2 and H2 molecules. Each RMD simulation models a 20-ns long synthesis schedule, divided into 20 steps, each 1 ns long. At the beginning of each step, the gaseous atmosphere from the previous step is purged and replaced with a predefined number of H2S, S2 and H2 molecules. These changes in RMD parameters reflect the time-dependent changes in synthesis conditions during experimental synthesis. The sulfidizing environment is then made to react with the partially sulfidized MoOxSy structure from the end of the previous step at a predefined temperature for 1 ns. Each step is characterized by four variables, the system temperature and the numbers of S2, H2S and H2 molecules in the reacting environment, denoted as the quartet \(\left( {T,n_{H_2},n_{S_2},n_{H_2S}} \right)\). While the initial structure for each RMD simulation at t = 0 ns is a pristine MoO3 slab, the final output structure (MoS2 + MoO3-x) is a non-trivial function of its synthesis schedule, defined by 20 such quartets as shown in Fig. 1.
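For concreteness, one convenient encoding of such a schedule is a 20 × 4 array of quartets, as sketched below in Python (a minimal illustration; the variable names and value ranges are our assumptions, not those of the actual RMD workflow):

```python
import numpy as np

N_STEPS = 20  # 20 steps of 1 ns each

def random_schedule(rng):
    """Sample one 20-step synthesis schedule.

    Each row is a quartet (T, n_H2, n_S2, n_H2S): the temperature and the
    numbers of H2, S2 and H2S molecules injected at the start of that
    1-ns step. The value ranges here are illustrative only.
    """
    T = rng.uniform(1000.0, 4000.0, N_STEPS)        # temperature (K)
    n_h2 = rng.integers(0, 200, N_STEPS)            # H2 molecules
    n_s2 = rng.integers(0, 200, N_STEPS)            # S2 molecules
    n_h2s = rng.integers(0, 200, N_STEPS)           # H2S molecules
    return np.column_stack([T, n_h2, n_s2, n_h2s])  # shape (20, 4)

schedule = random_schedule(np.random.default_rng(0))
print(schedule.shape)  # (20, 4): one quartet per 1-ns step
```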

Fig. 1: Reactive MD for computational synthesis.

a Schematic of the RMD simulation of a single 20-ns long synthesis schedule. The initial MoO3 slab at t = 0 ns reacts with a time-varying sulfidizing environment to generate a final structure composed of MoS2 and MoO3-x at t = 20 ns. b Snapshot of the RMD simulation cell for MoS2 synthesis. The sulfidizing environment containing S2, H2 and H2S gases reacts with the MoOxSy slab in the middle of the simulation cell (black lines).

NADE for predicting output of synthesis schedules

RMD simulations can generate output structures for thousands of simulated synthesis schedules, overcoming the data scarcity common to experiments. RL-based optimization of synthesis schedules consists of successive stages of policy generation by the RL agent and policy evaluation by the environment. However, using RMD simulations directly as the policy-evaluation environment is infeasibly time-consuming, since direct evaluation of a single synthesis profile by RMD takes approximately 2 days of computing. To overcome this problem, we construct a probabilistic representation of the CVD synthesis of MoS2 as a Bayesian Network (BN), which encodes a functional relationship between the synthesis conditions and generated output structures and can therefore predict output structures for an arbitrary input condition in a fraction of the time required by RMD simulations. The BN consists of two sets of random variables: (a) the unobserved variables Z, given by the time-dependent phase fractions of the 2H and 1T phases and defects in the MoOxSy surface, and (b) the observed variables X, given by the user-defined synthesis conditions, namely the temperature and gas concentrations (Fig. 2a, b)50. Each node in the BN represents either the synthesis condition at time t, Xt, or the distribution of different phases on the MoOxSy surface, Zt. Together, the BN represents the joint distribution of X and Z as P(X, Z). Since Z1 (the initial structure, pristine MoO3) and X (the synthesis conditions) are known, we can convert P(X, Z) into a conditional distribution \(P\left( {Z_{2:T}|X,Z_1} \right)\) using the chain rule. Further, using conditional independence between BN variables, \(P\left( {Z_{2:T}|X,Z_1} \right)\) can be simplified into an autoregressive probability density function, where each Zt+1 depends only on the simulation history of observed and unobserved variables up to time t (Fig. 2b).

$$P\left( {Z_{2:T}|X,Z_1} \right) = P\left( {Z_2|Z_1,X_1} \right) \cdots P\left( {Z_{t + 1}|Z_{1:t},X_{1:t}} \right) \cdots P\left( {Z_T|Z_{1:T - 1},X_{1:T - 1}} \right)$$
(1)
Fig. 2: NADE model of computational synthesis of MoS2.

a Each 1-ns step of the RMD simulation is characterized by an input vector \(X_i\) describing the synthesis conditions and the distribution of phases in the resulting structure, \(Z_i\). b Bayesian Network representation of CVD synthesis of MoS2 over \(T_{max}\) = 20 ns. The green and blue nodes are the synthesis conditions as observed variables (\(X_n\)), whereas the orange nodes are unobserved variables (\(Z_n\)), which represent the phase fractions of 2H, 1T and defects in the MoOxSy surface as a function of time. c Schematic of the NADE-CVD, composed of two multi-layer perceptrons FMLP as encoder and decoder networks and an intermediate recurrent neural network block, FRNN. d Test accuracy of NADE-CVD, with a mean absolute error <0.1 phase fraction.

In the BN, each of these conditional probabilities, \(P\left( {Z_{t + 1}|Z_{1:t},X_{1:t}} \right)\), is modeled as a multivariate Gaussian distribution \({\cal{N}}\left( {Z_{t + 1}|\mu _{t + 1},\sigma _{t + 1}} \right)\), whose mean \(\mu _{t + 1} = \left\{ {\mu _{t + 1}^{2{\mathrm{H}}},\mu _{t + 1}^{1{\mathrm{T}}},\mu _{t + 1}^{{\mathrm{defect}}}} \right\}\) and variance \(\sigma _{t + 1} = \left\{ {\sigma _{t + 1}^{2{\mathrm{H}}},\sigma _{t + 1}^{1{\mathrm{T}}},\sigma _{t + 1}^{{\mathrm{defect}}}} \right\}\) are functions of the simulation history, \(\left( {Z_{1:t},X_{1:t}} \right)\).

To learn the BN representation of the CVD process and capture the conditional distribution \(P\left( {Z|X,Z_1} \right)\) compactly, we have developed a deep generative model architecture called a Neural Autoregressive Density Estimator (NADE-CVD; Fig. 2c), which consists of an encoder, a decoder and a recurrent neural network (RNN)51,52,53,54. The output of the NADE-CVD at time step t + 1 is \(\mu _{t + 1}\) and \(\sigma _{t + 1}\) for the three phases in the MoOxSy surface, which are functions of the simulation history encoded by the RNN cell as \(h_t\), where \(h_t\) is a function of \(h_{t - 1}\) and the synthesis condition \(\left( {Z_t,X_t} \right)\) at time t. Parameters of the NADE-CVD model are learned via maximum likelihood estimation on a training dataset of 10,000 RMD simulations of CVD under different synthesis conditions. The prediction error of the trained NADE-CVD model on test data (Fig. 2d) shows an RMSE of merely 3.5 atoms and a maximum prediction error on any phase of ≤30 atoms. The architecture of the NADE-CVD model is described in the Methods section and details about model training are provided in the Supplementary Methods.
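A minimal PyTorch sketch of this architecture is shown below. The layer widths follow the Methods section (encoder 7 → 24 → 48 → 72, a 128-unit LSTM, decoder 128 → 72 → 24), but the choice of nonlinearities and the log-variance parameterization of σ are our assumptions:

```python
import torch
import torch.nn as nn

class NADECVD(nn.Module):
    """Sketch of NADE-CVD: encoder MLP -> LSTM -> decoder MLP."""
    def __init__(self):
        super().__init__()
        # X_t (4 synthesis variables) + Z_t (3 phase fractions) = 7 inputs
        self.encoder = nn.Sequential(
            nn.Linear(7, 24), nn.ReLU(),
            nn.Linear(24, 48), nn.ReLU(),
            nn.Linear(48, 72), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=72, hidden_size=128, batch_first=True)
        self.decoder = nn.Sequential(
            nn.Linear(128, 72), nn.ReLU(),
            nn.Linear(72, 24), nn.ReLU(),
        )
        self.mu_head = nn.Linear(24, 3)      # mean for (2H, 1T, defect)
        self.logvar_head = nn.Linear(24, 3)  # log-variance head (assumed form)

    def forward(self, x, z):
        # x: (batch, t, 4) synthesis conditions; z: (batch, t, 3) phase fractions
        e = self.encoder(torch.cat([x, z], dim=-1))  # e_t, shape (batch, t, 72)
        h, _ = self.lstm(e)                          # h_t encodes the history
        d = self.decoder(h[:, -1])                   # use the last hidden state
        return self.mu_head(d), self.logvar_head(d)  # mu_{t+1}, log sigma^2_{t+1}
```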

Offline model-based RL for optimal synthesis schedules

The NADE-CVD model accurately approximates a computationally expensive RMD simulation and provides a fast, probabilistic evaluation of the output structure from a given synthesis schedule. However, on its own, this model cannot be used to achieve the goal of predictive synthesis, which is to identify the most likely synthesis schedules that yield a material with optimal properties (such as high crystallinity, phase purity or hardness). For MoS2 synthesis, one example of a design goal is to determine synthesis schedules that yield high-quality MoS2 (i.e., the largest phase fraction of the semiconducting 2H phase in the final product) in the shortest possible time. In other words, we wish to perform the non-trivial optimization of \(X_{1:t}\) to maximize the value of \(\mathop {\sum }\nolimits_t Z_{1:t}\) (see Supplementary Methods). Mathematically, this can be written as

$$\arg \mathop {\max }\limits_{X_{1:t}} \mathop {\sum }\nolimits_t Z_{1:t}\quad {\mathrm{where}}\quad \left( {Z_{1:t},X_{1:t}} \right) \sim P\left( {Z_{1:t},X_{1:t}} \right) = P\left( {Z_{1:t}|X_{1:t}} \right)P\left( {X_{1:t}} \right)$$
(2)

For this purpose, we construct a model-based offline reinforcement learning (RL) scheme, in which the agent does not have access to the environment (the RMD simulation) during training and instead learns the optimal policy from randomly sampled, suboptimal offline data from the environment55,56,57,58,59. Here, the offline RL workflow consists of an RL agent coupled to the NADE-CVD trained on offline RMD data as discussed in the previous section (Fig. 3a). The RL agent \(\left( {\pi _\theta } \right)\) is a multi-layer perceptron whose input state \(\left( {s_t} \right)\) at time t is a 128-dimensional embedding vector of the entire simulation history up to t, \(\left( {Z_{1:t},X_{1:t}} \right)\). At each time step t, the RL agent takes an action, \(a_t\), which is the change in synthesis condition (i.e., reaction temperature and gas concentrations) at t, \(a_t = {\Delta}X = \left\{ {{\Delta}T,{\Delta}S_2,{\Delta}H_2,{\Delta}H_2S} \right\}\). The synthesis condition for the next nanosecond of the simulation is defined as \(X_{t + 1} = X_t + a_t\). The action \(\left( {a_t} \right)\) to take at \(s_t\) is modeled using a Gaussian distribution \(\left( {a_t\sim {\cal{N}}\left( {\mu \left( {s_t} \right),\sigma ^2} \right)} \right)\), whose state-dependent mean \(\mu \left( {s_t} \right)\) is the output of the RL agent, \(\mu \left( {s_t} \right) = \pi _\theta (s_t)\). The variance \(\sigma ^2\) is assumed to be constant and is tuned as a hyperparameter of the RL scheme. The RL scheme thus designs a 20-ns synthesis schedule \(\left( \tau \right)\) starting from an arbitrary synthesis condition, \(\left\{ {T^0,S_2^0,H_2^0,H_2S^0} \right\}\), such that the action proposed at each timestep t serves to convert the initial MoO3 crystal into the 2H-MoS2 structure as quickly as possible.
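The action-sampling step of this agent can be sketched as follows (the network widths match those listed in Methods; the use of torch.distributions and the joint mean/value heads are our assumptions about implementation details):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Sketch of the RL agent: maps the 128-dim state embedding s_t to the
    action mean mu(s_t) and a value estimate V(s_t)."""
    def __init__(self, sigma2=5.0):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(128, 72), nn.ReLU(),
            nn.Linear(72, 24), nn.ReLU(),
        )
        self.mu_head = nn.Linear(24, 4)  # (dT, dn_S2, dn_H2, dn_H2S)
        self.v_head = nn.Linear(24, 1)   # baseline V(s_t)
        self.sigma = sigma2 ** 0.5       # fixed std; sigma^2 is a hyperparameter

    def act(self, s_t):
        body = self.body(s_t)
        mu, v = self.mu_head(body), self.v_head(body)
        dist = torch.distributions.Normal(mu, self.sigma)
        a_t = dist.sample()              # a_t ~ N(mu(s_t), sigma^2)
        return a_t, dist.log_prob(a_t).sum(-1), v

# The sampled action updates the synthesis condition: X_{t+1} = X_t + a_t
```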

Fig. 3: Reinforcement Learning model for synthesis schedule design.

a Schematic of the RL-NADE model for optimizing schedules for MoS2 synthesis. b Comparison of structures generated by the RL-designed schedules against randomly generated schedules demonstrates that the RL-NADE model consistently identifies CVD synthesis schedules that generate highly crystalline products. c Validation of a promising RL-generated schedule using RMD simulations.

During training, the RL agent learns the policy for designing the optimal synthesis condition via a policy gradient algorithm informed by the NADE-CVD model33,60,61,62. At each time step t in an episode, the RL agent receives an input state \(s_t\) and proposes an action \(a_t\) that determines the synthesis condition at the next time step, \(X_{t + 1}\). Using this, the NADE-CVD predicts the distribution of the various phases in the synthesized product, \(Z_{t + 1}\). The NADE-CVD model also returns to the RL agent a reward \(\left( {r_t} \right)\) proportional to the concentration of the 2H phase, \(Z_{t + 1}[n_{2H}]\), and a new state \(s_{t + 1}\). The goal of the RL agent during training is to use these reward signals to adjust its policy parameters \(\left( {\pi _\theta } \right)\) so as to maximize its total reward, i.e., to produce a 2H-rich MoS2 structure in minimum time.

$${\mathrm{Objective\!:}}\;\arg \mathop {\max }\limits_\theta {\Bbb E}_{\tau \sim \pi _\theta }\left[ {\mathop {\sum}\limits_{t = 1}^{T} r_t\left( {s_t,a_t} \right)} \right]\quad {\mathrm{where}}\quad r_t\left( {s_t,a_t} \right) = \begin{cases} 0.0 & {\mathrm{if}}\;Z_{t + 1}\left[ {n_{2H}} \right] < 0.4 \\ 0.2\,Z_{t + 1} & {\mathrm{if}}\;Z_{t + 1}\left[ {n_{2H}} \right] \ge 0.4 \end{cases}$$
(3)
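A direct transcription of this reward into code might look like the following; we read the term 0.2 Z_{t+1} as proportional to the 2H component, following the text above, and treating Z_{t+1} as a mapping from phase names to phase fractions is our assumption:

```python
def reward_2h(z_next):
    """Reward of Eq. (3): zero until the 2H phase fraction reaches 0.4,
    then proportional to it."""
    if z_next["2H"] < 0.4:
        return 0.0
    return 0.2 * z_next["2H"]
```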

The details of the network architecture and the policy gradient algorithm are given in the Methods section, and the training of the RL agent is described in the Supplementary Methods.

The efficiency of the trained RL agent in identifying promising synthesis schedules is demonstrated in Fig. 3b, which compares the 2H phase fraction of the structures resulting from 3200 synthesis schedules generated by the RL agent against 3200 randomly generated schedules similar to those used for training the NADE-CVD. The RL agent consistently identifies schedules that result in highly crystalline and phase-pure products, while the randomly generated schedules overwhelmingly yield poorly sulfidized and/or poorly crystalline products. This shows that the offline RL agent is able to learn a superior policy from the sub-optimal random RMD simulations used in its training. From a probabilistic viewpoint, the RL agent constructs a probability distribution function (pdf) of \(X_{1:t}\) that places most of its probability mass on regions of \(X_{1:t}\) that maximize \({\sum }Z_{1:t}\). Figure 3c shows the validation of one RL-predicted synthesis schedule by a subsequent RMD simulation; the observed time-dependent phase fraction tracks the RL-NADE prediction closely.

Optimal synthesis schedules and mechanistic insights from RL

The RL agent is trained to learn policies that generate time-dependent temperatures and concentrations of H2S, S2 and H2 molecules to synthesize 2H-rich MoS2 structures in the least time. Closer inspection of these RL-designed policies provides mechanistic insight into CVD synthesis and the effect of variations in temperature and gas concentration on the quality of the synthesized product. Figure 4 shows that the RL agent has learned to generate a two-part temperature profile consisting of an early high-temperature (>3000 K) phase spanning the first 7–10 ns, followed by annealing to an intermediate temperature (~2000 K) for the remainder of the synthesis profile. This two-part synthesis profile identified by the RL policy is consistent with experiments and atomistic simulations: a high temperature (>3000 K) is necessary for both the reduction of the MoO3 surface and its sulfidation, whereas the subsequent lower temperature (~2000 K) is necessary to enable crystallization into the 2H structure while continuing to promote residual sulfidation. Consistent with previous reactive and quantum molecular dynamics simulations of material synthesis, a significantly elevated temperature is necessary to observe reaction events within the limited time domain accessible to atomistic simulations44,45,46,49. We observe that the RL agent maintains this two-stage synthesis profile even if the provided initial temperature at t = 0 ns is low, by quickly ramping the synthesis temperature up to the high-temperature regime (>3000 K). The RL agent is also able to predict non-trivial mechanistic details about phase evolution, including the observation that the nucleation of the 1T phase precedes the nucleation of the 2H crystal structure (Fig. 4a, b). Similar trends were observed in previous mechanistic studies of MoS2 synthesis44.

Fig. 4: Effect of synthesis conditions on products.

a A generated synthesis profile starting from low temperature and low gas concentrations. The RL model quickly ramps up the temperature over the first 7 ns to promote reduction and sulfidation and then lowers it to intermediate values to promote crystallization. This profile generates a significant phase fraction of 2H starting from 10 ns. b A generated synthesis profile starting from high temperature and high S2 concentration. The RL-NADE model retains the high temperature during the early stages of synthesis and slowly anneals the system to intermediate temperatures after 10 ns. This schedule promotes relatively late crystallization and 2H phase formation. c Synthesis profiles with initially low S2 concentrations yield a significantly higher phase fraction of 2H in the final product compared to profiles containing higher S2 concentrations at t = 0 ns. d, e Synthesis schedules are relatively insensitive to the initial concentrations of the reducing species, H2S and H2.

Another important phenomenon identified by the RL agent is the effect of gas concentrations on the quality of the final product (Fig. 4b). To analyze the effect of the initial gas concentrations, we compute the probability distribution of the 2H phase in MoS2 over the last 10 ns of the simulation for the synthesis conditions proposed by the RL agent under different initial gas concentrations but with similar temperature profiles. The mean of this pdf, \(\mu _{2H} = {\Bbb E}_{\tau \sim \pi _\theta }\left[ {\frac{1}{{10}}\mathop {\sum}\nolimits_{t = 11}^{t = 20} {Z_t\left[ {n_{2H}} \right]} } \right]\), is the expected fraction of the 2H phase over the last 10 ns of the synthesis simulation; a higher value of \(\mu _{2H}\) indicates a greater extent of sulfidation as well as a shorter time required to generate 2H phases. The RL agent is found to promote synthesis profiles that have low concentrations of gas molecules (particularly non-reducing S2 molecules) at the early stages (0–3 ns) of the synthesis, when the temperature is high. This partially evacuated synthesis atmosphere promotes the evolution of oxygen from, and self-reduction of, the MoO3 surface. This can be clearly observed by comparing the histogram of 2H phase fractions in structures generated by synthesis profiles with low initial (i.e., t = 0 ns) concentrations of S2 molecules against those with higher concentrations of S2 molecules (Fig. 4c). Profiles with low initial S2 concentrations enable greater self-reduction of the MoO3 surface, resulting in a significantly higher 2H phase fraction in the synthesized product at t = 10–20 ns. H2S and H2 molecules, which are more reducing than S2, do not meaningfully affect the MoO3 self-reduction rate, and the 2H phase fraction in the final MoOxSy product is largely independent of the initial H2S and H2 concentrations (Fig. 4d, e).
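As a concrete illustration, \(\mu _{2H}\) can be estimated from a batch of NADE-CVD rollouts by a simple Monte Carlo average (the array layout below is our assumption):

```python
import numpy as np

def mu_2h(z_2h_traj):
    """Monte Carlo estimate of mu_2H: the 2H phase fraction averaged over
    the last 10 ns (steps 11-20) and over all sampled trajectories.

    z_2h_traj: array of shape (n_traj, 20), the 2H fraction per 1-ns step.
    """
    return float(z_2h_traj[:, 10:20].mean())
```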

Multi-task RL-CVD: schedules for heterostructure synthesis

The outputs of the NADE-CVD model, \(\mu _{t + 1}\) and \(\sigma _{t + 1}\), are functions only of the simulation history up to time t. Similarly, each action \(a_t\) taken by the RL agent is a function only of the input state \(s_t\), which is an encoded representation of the simulation history up to time t. Hence, we can use RL + NADE-CVD to design policies for synthesis over time scales significantly longer than the 20-ns RMD simulation trajectories used for NADE-CVD training. Figure 5 shows a policy proposed by the RL + NADE-CVD model for a 30-ns simulation. This extended synthesis profile retains the design principles, such as a two-phase temperature cycle and low initial gas-phase concentrations, that were learned from the 20-ns trajectories. Further, the longer synthesis schedule also allows the RL agent to uncover additional synthesis design rules for improving the 2H phase fraction. The RL profile in Fig. 5 includes a heating-cooling cycle between 15–30 ns that has previously been shown to improve the crystallinity and 2H phase fraction in the synthesized material44.

Fig. 5: Extensions of RL + NADE-CVD Method.

a, b A 30-ns long synthesis profile predicted by RL + NADE-CVD retains the design principles of a two-phase temperature cycle and low initial gas-phase concentrations learned from the 20-ns RMD trajectories. In addition, the 30-ns profile also includes a temperature annealing step between 15–30 ns (arrows) that improves the 2H phase fraction beyond 60%. c RL + NADE-CVD generated synthesis schedule for optimizing the 1T phase fraction. d The output structure from an RMD simulation of the 1T-optimized synthesis schedule reveals a heterostructure containing a 1T-rich region embedded in the 2H phase. e, f The robustness of RL-generated profiles against system-size scaling is validated by the identical fractions of 2H and 1T phases in laterally small and laterally large systems simulated with RMD using the same profile.

The RL agent learns promising synthesis profiles by adjusting its policy parameters \(\left( {\pi _\theta } \right)\) to maximize a pre-defined reward function that corresponds to the material to be synthesized. Therefore, the RL agent can optimize synthesis schedules for other material structures, including multi-phase heterostructures, simply by constructing the corresponding reward functions. The following reward function, \(r_t\left( {s_t,a_t} \right)\), maximizes the phase fraction of the 1T crystal structure over the 20-ns simulation.

$${\mathrm{Objective\!:}}\;\arg \mathop {\max }\limits_\theta {\Bbb E}_{\tau \sim \pi _\theta }\left[ {\mathop {\sum}\limits_{t = 1}^{t = 20} r\left( {s_t,a_t} \right)} \right]\quad {\mathrm{where}}\quad r_t\left( {s_t,a_t} \right) = \begin{cases} 0.0 & {\mathrm{if}}\;Z_{t + 1}\left[ {n_{1T}} \right] < 0.17 \\ 0.35\,Z_{t + 1} & {\mathrm{if}}\;Z_{t + 1}\left[ {n_{1T}} \right] \ge 0.17 \end{cases}$$
(4)

Figure 5c shows an RL-generated schedule to synthesize 1T-rich structures. The temperature profile is largely consistent with those observed for 2H-maximized synthesis schedules. The RL-generated gas-phase concentrations optimized for 1T synthesis maximize the H2 and H2S concentrations while minimizing the S2 concentration. This is consistent with experimental observations, where reducing environments were observed to produce higher 1T phase fractions63. It is in contrast to schedules optimized for 2H MoS2, where the concentrations of all three gaseous species show correlated variations (Fig. 4a, b). Figure 5d shows a MoS2 2H-1T heterostructure configuration generated at the end of MD simulations following the RL-generated synthesis schedule. The synthesized heterostructure consists of an island of 1T-MoS2 embedded in the 2H-MoS2 matrix with an atomically sharp interface between the two phases. We note that the same RMD data is used to train the CVD dynamics (NADE) model, after which the RL agent is trained for two different objectives (2H or 1T maximization) by simply modifying the reward function. This demonstrates the capability of model-based offline RL to learn policies for multiple tasks/objectives without generating additional data, as illustrated in the sketch below.
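In code, retargeting the agent from 2H to 1T amounts to swapping in the reward of Eq. (4); nothing else in the workflow changes (the same caveats apply as for the 2H reward sketch above):

```python
def reward_1t(z_next):
    """Reward of Eq. (4): zero until the 1T phase fraction reaches 0.17,
    then proportional to it."""
    if z_next["1T"] < 0.17:
        return 0.0
    return 0.35 * z_next["1T"]
```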

Finally, RL-predicted synthesis schedules are also extremely robust with respect to system-size scaling. Figure 5e shows the validation of a single RL-generated profile using RMD simulations on systems of two different sizes, 51 Å × 49 Å and 100 Å × 100 Å. Figure 5f shows that the observed fractions of 2H and 1T phases in RMD simulations of both the small and large systems are consistent with each other over the entire 20-ns simulation range. Further, these phase fractions are also quantitatively consistent with the values predicted by the NADE model used in the RL optimization loop (see Supplementary Figures 4 and 5 and the Supplementary Discussion on the accuracy and scale-independence of NADE-CVD predictions). This capability to optimize synthesis schedules independent of system size is useful for extending this approach to experimental synthesis.

Discussion

We have developed a machine learning scheme based on offline reinforcement learning for the predictive design of time-dependent reaction conditions for material synthesis. The scheme integrates a reinforcement learning agent with a deep generative model of chemical reactions to predict and design optimal conditions for the rapid synthesis of two-dimensional MoS2 monolayers using chemical vapor deposition. This model was trained on thousands of computational synthesis simulations at different reaction conditions performed using reactive molecular dynamics. The model successfully learned the dynamics of material synthesis during simulated chemical vapor deposition and was able to accurately predict synthesis schedules that generate a variety of MoS2 structures such as 2H-MoS2, 1T-MoS2 and 2H-1T in-plane heterostructures. Beyond mere synthesis design, the model is also useful for mechanistic understanding of the synthesis process; it helped identify distinct temperature regimes that promote sulfidation and crystallization, and the impact of a reducing environment on the phase purity of the synthesis product. We also demonstrate how the reinforcement learning scheme can be extended to predict the outcome of material synthesis over longer time-scales and for system sizes larger than those used for training. This flexibility makes the offline reinforcement learning based design scheme suitable for optimizing the experimental synthesis of a wide variety of nanomaterials, where the agent does not have to interact directly with the environment during training and can still learn an optimal policy from randomly collected data from the environment.

Methods

Molecular dynamics simulation

All 10,000 RMD simulations were performed using the RXMD molecular dynamics engine64,65 with the reactive forcefield originally developed by Hong et al.45, which is optimized for reacting Mo-O-S-H systems. RMD computational synthesis simulations were performed on a 51 Å × 49 Å × 94 Å simulation cell containing a 1200-atom MoO3 slab at z = 47 Å surrounded by a reacting atmosphere containing H2, S2 and H2S molecules. During the RMD simulations, a one-dimensional harmonic potential with a spring constant of 75.0 kcal/mol is applied to each Mo atom along the z-axis (i.e., normal to the slab surface) to keep the atoms in a two-dimensional plane at elevated temperatures. For each nanosecond of the computational synthesis simulation, the system temperature is maintained at the value specified in the synthesis profile by scaling the velocities of the atoms. MD trajectories are integrated with a timestep of 1 femtosecond and charge equilibration is performed every 10 timesteps66.

NADE-CVD

The NADE-CVD consists of an encoder, an LSTM block and a decoder (Fig. 2c). The encoder transforms \(\left( {X_t,Z_t} \right)\) into a 72-dimensional vector, \(e_t = F_{{\mathrm{encoder}}}\left( {X_t,Z_t} \right)\). The LSTM layer then constructs an embedding of the simulation history up to time t as \(h_t = F_{{\mathrm{LSTM}}}\left( {h_{t - 1},e_t} \right)\), where \(h_t\) is a 128-dimensional vector. The decoder then uses \(h_t\) to predict the mean and variance of the various phases in the MoOxSy surface as \(\mu _{t + 1},\sigma _{t + 1} = F_{{\mathrm{decoder}}}\left( {h_t} \right)\). The encoder and decoder are fully connected neural networks with layer dimensions 7 × 24, 24 × 48, 48 × 72 and 128 × 72, 72 × 24, 24 × 3, respectively. The parameters of the NADE-CVD \(\left( {\mathrm{{\Theta}}} \right)\) are learned via maximum likelihood estimation (MLE) of the following likelihood function

$$L\left( {{\Theta};D} \right) = \mathop {\prod }\limits_{j = 1}^{j = m} P_{\Theta}\left( {Z^j,X^j} \right) = \mathop {\prod }\limits_{j = 1}^{j = m} \mathop {\prod }\limits_{t = 2}^{t = n} P_{\Theta}\left( {Z_t^j|Z_{1:t - 1}^j,X_{1:t - 1}^j} \right)$$
(5)

Here, \(D = \left\{ {\left( {X_{1:n}^1,Z_{1:n}^1} \right),\left( {X_{1:n}^2,Z_{1:n}^2} \right), \ldots ,\left( {X_{1:n}^m,Z_{1:n}^m} \right)} \right\}\) is the training dataset of m RMD simulation trajectories. Further details, such as the log-likelihood of the training data during training and the evaluation of NADE-CVD on test data, are given in the Supplementary Methods.
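Training then amounts to minimizing the Gaussian negative log-likelihood implied by Eq. (5) over the RMD dataset. A minimal sketch follows (the optimizer, learning rate and batching are our assumptions; NADECVD refers to the architecture sketched earlier):

```python
import torch

def gaussian_nll(mu, logvar, z_true):
    """Per-step negative log-likelihood of a diagonal Gaussian (Eq. (5)),
    dropping the additive constant. All tensors have shape (batch, 3)."""
    return 0.5 * (logvar + (z_true - mu) ** 2 / logvar.exp()).sum(-1).mean()

model = NADECVD()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x, z):
    """One MLE step on a batch of trajectories.
    x: (batch, 20, 4) synthesis conditions; z: (batch, 20, 3) phase fractions."""
    loss = 0.0
    for t in range(1, 20):                   # predict Z_{t+1} from the history
        mu, logvar = model(x[:, :t], z[:, :t])
        loss = loss + gaussian_nll(mu, logvar, z[:, t])
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)
```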

RL agent architecture and policy gradient

The RL agent, \(\pi _\theta\), is constructed using a fully connected neural network with tunable parameters θ. It consists of an input layer of 128 nodes, followed by two hidden layers with 72 and 24 nodes, and an output layer. The input \(s_t\) to \(\pi _\theta\) is the embedding of the simulation history, \(\left( {X_{1:t},Z_{1:t}} \right)\), generated by the NADE-CVD, \(h_t\). The output of the RL agent is the mean \(\mu \left( {s_t} \right)\) of the action \(a_t\) and the value function \(V\left( {s_t} \right)\) associated with \(s_t\). The hyperparameter \(\sigma ^2\), the variance of the Gaussian distribution of actions \(a_t\), is taken as 5. During training, the RL agent learns the optimal policy that maximizes the total expected reward \({\Bbb E}\) (Eq. 3) using the policy gradient algorithm, by taking the derivative of \({\Bbb E}\) with respect to its parameters θ, \(\nabla {\Bbb E} = \frac{{\partial {\Bbb E}_{\tau \sim \pi _\theta }\left[ {\mathop {\sum }\nolimits_{t = 1}^T r\left( {s_t,a_t} \right)} \right]}}{{\partial \theta }}\), where the trajectory \(\tau = \left\{ {s_1,a_1,s_2,a_2, \ldots ,s_T,a_T} \right\}\). This derivative reduces to the following objective function, which is optimized via gradient ascent.

$$\nabla _\theta {\Bbb E} = {\Bbb E}_{\tau \sim \pi _\theta }\left[ {\mathop {\sum}\limits_{t = 1}^{T_{max}} {\nabla _\theta \log \pi _\theta \left( {s_t,a_t} \right)\left( {G_t - V(s_t)} \right)} } \right]\quad {\mathrm{where}}\quad G_t = \mathop {\sum}\limits_{t' = 1}^{t} r_{t'}$$
(6)

Here, the value function \(V\left( {s_t} \right)\) is used as a variance-reduction baseline in the Monte Carlo estimate of \(\nabla _\theta {\Bbb E}\). Details of the above derivation and the policy gradient algorithm are given in the Supplementary Methods.
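For completeness, a minimal REINFORCE-with-baseline update corresponding to Eq. (6) is sketched below (the cumulative-reward form of \(G_t\) follows Eq. (6) as written; the value-loss term and optimizer are standard additions that we assume here):

```python
import torch

def policy_gradient_step(opt, log_probs, values, rewards):
    """One policy gradient update per Eq. (6) for a single episode.

    log_probs, values, rewards: lists of per-step tensors collected while
    rolling out the policy through the NADE-CVD environment.
    """
    returns = torch.cumsum(torch.stack(rewards), dim=0)  # G_t, reward up to t
    values = torch.stack(values).squeeze(-1)             # baseline V(s_t)
    log_probs = torch.stack(log_probs)
    advantage = (returns - values).detach()              # variance reduction
    pg_loss = -(log_probs * advantage).sum()             # ascent on E[sum r]
    v_loss = ((returns.detach() - values) ** 2).mean()   # fit the baseline
    opt.zero_grad()
    (pg_loss + v_loss).backward()
    opt.step()
    return float(pg_loss)
```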