Introduction

For a given target molecule, retrosynthetic planning aims to identify a feasible and cost-effective synthetic route from the enormous chemical reaction space, using known or commercially available building block molecules1. It is one of the fundamental problems in synthetic chemistry and plays a critical role in a wide range of applications, such as drug design2,3 and material science4. The expenditure required to bring a single drug to market exceeds 2 to 3 billion dollars, which is primarily allocated toward the preceding discovery phase and clinical trials5. Using the data-driven retrosynthetic tools during the discovery phase to accelerate and reduce failures in the synthesis of new drug molecules has attracted significant interest in recent years6.

Considering that the synthesis of most molecules typically requires multiple chemical reactions and the number of possible chemical transformations is vast, retrosynthesis planning is a challenging task even for experienced chemists. The difficulty of synthesizing a molecule is highly influenced by its structure complexity and available building blocks. In some cases, the identification of a viable synthetic pathway requires weeks of intense effort by human experts. Computer-assisted synthesis planning could assist chemists in expediting the identification of high-quality synthetic pathways7, which commonly employs a search algorithm guided by a single-step retrosynthetic model. With the development of artificial intelligence technology, machine learning-based algorithms have demonstrated remarkable performance, learning to generate recommendations automatically. These algorithms usually perform the single-step retrosynthesis as a classification task based on reaction templates8,9,10,11. Another alternative approach is to leverage sequence-to-sequence models in natural language processing, treating the one-step retrosynthetic task as a translation problem between products and reactants12,13,14,15,16,17. Recently, semi-template-based methods are proposed to predict the reaction center dictating a reaction firstly via graph neural networks, and then translate the resulting intermediate synthons into reactants via translation models18,19,20. Single-step models predict the most promising chemical reactions that can directly synthesize the target molecule. Although a greedy strategy or other simple exploration strategies can be employed to generate a complete synthetic route21,22,23, search algorithms are usually employed to extent the single-step model to full route design, providing higher-quality solutions with improved efficiency. Due to the large number of possible precursors at each step, the single-step model serves as a guiding search strategy to prevent combinatorial explosion. Figure 1 illustrates how the traditional chemical synthetic pathway is transformed into the search tree. The nodes in the search tree represent the synthetic position, containing all molecules required to synthesize the target molecule in the root state. The target molecule can be synthesized if all molecules within a state are available building blocks. The edges in the search tree correspond to chemical reactions that induce state transitions between the two connected nodes.

Fig. 1: Translation of the chemical retrosynthetic route representation to the search tree representation.
figure 1

a Is the chemical representation of the synthesis plan, and b is the search tree representation. In the search tree, the states of the nodes encompass a set of molecules essential for the synthesis of the target molecule, including all unsynthesized intermediate molecules as well as the building blocks. The edges represent chemical reactions applied to the parent node.

An excellent searching strategy can reveal meaningful retrosynthesis efficiently from the large chemical space. Traditional heuristic methods, such as depth-first, have poor performance and the quality of their suggested pathway is not guaranteed. Inspired by the excellent performance of deep reinforcement learning in mastering combinatorial games24, Segler et al.25 proposed employing neural network-guided MCTS to do retrosynthetic planning, which almost reached the level of the literature routes in a double-blind AB test. There exists a trade-off between exploitation and exploration in MCTS due to the traversing tree policy24. The selected leaf node is evaluated for its expected subsequent cost, which is propagated back to all nodes on the traversed path to influence the subsequent search process. EG-MCTS26 and GRASP27 are also built on MCTS. Depth-first proof-number search (DFPN) has been utilized to address retrosynthetic planning28,29. Besides, A*-like search algorithms, including Retro*30, Retro*+31, GNNRetro32, RetroGraph33 and PDVN34, have demonstrated promising results.Several widely used retrosynthetic planning software tools are publicly available, such as AiZynthFinder35 and SynRoute36. DFPN is considered inferior to the other two search algorithms. MCTS and Retro* have demonstrated comparable search speeds and capabilities in route discovery. When comes to the route quality and diversity, MCTS outperforms Retro* in the 10,000 test molecules in PaRoutes37,38. However, the routes with a depth of more than 10 reactions have been excluded in PaRoutes37, and MCTS exhibits poor performance on these discarded molecules with longer paths, as will be discussed later. This selection process may introduce a bias in favor of the MCTS search. Actually, 15.3% of the routes in the USPTO test benchmark30 exceed a length of 10. Despite the optimality of A* search being theoretically guaranteed39, obtaining an optimal solution more efficiently remains a challenging problem in practice. Compared to MCTS, the search process of A* lacks exploratory behavior, and the estimated values are not updated once determined as more information is available40.

Both A* search and MCTS used in retrosynthetic planning algorithms are heavily dependent on the quality of the heuristic guiding functions, including the single-step expansion policy and the cost estimator. Guided by biased heuristic functions, A* search may spend a considerable amount of effort expanding states on a non-optimal branch due to the lack of exploration, which significantly reduces the efficiency of the search process, as displayed in Fig. 2a. The search process of MCTS relies on a balance between exploitation and exploration. Insufficient exploration may lead MCTS to exhibit behavior similar to that of A* search. If exploration is over-emphasized, a significant amount of effort will be spent to explore unnecessary branches compulsively. Pruning can be employed during the search process to reduce those meaningless explorations, but it may also result in the exclusion of the optimal solution due to insufficient simulations in the early search stages. Taking retrosynthetic planning results of the USPTO test set as an example, feasible synthetic pathways are provided alongside the target molecules30, which are obtained by exhaustive search with reactions in the USPTO database. As shown in Fig. 2c, considering the failed molecules by Retro*+31, which is an A*-like search algorithm, 78.95% of them can be synthesized within 7 steps in the USPTO benchmark, and 89.47% can be synthesized within 9 steps. This observation suggests that A* search may delve too deeply into unproductive branches, and fail to find the solution under a limited search time. In contrast, 45.0% failure cases of MCTS require at least 8 steps, and the failure rate of MCTS for molecules that require more than 8 synthetic steps by exhaustive search is 84.8%. The compulsive exploration of MCTS prevents it from proceeding deeper toward the optimal solution. Compared to the A* search, MCTS encounters difficulties in synthesizing complex molecules with longer routes within the same search time.

Fig. 2: Demonstration of the results by methods of insufficient or excessive exploration under the limited search time.
figure 2

a The first failure case expands states on a non-optimal branch due to the lack of exploration; b The second failure case compulsively explores unnecessary branches and is unable to delve deeper toward the optimal solution; c The cumulative proportions of lengths of synthetic routes required by the exhaustive search for molecules that are failed to be synthesized by the search algorithm; d The failure rates of molecules with different path lengths to be synthesized by the exhaustive search (The source data for c and d is provided in Supplementary Data 1 and 2). Retro*+ and MCTS employ the same expansion function and value estimator. (MCTS does not utilize the pruning technique).

In general, A* search is guaranteed to find the optimal solution, but the efficiency of A* in problem-solving is compromised due to the deficiency in exploratory. Although MCTS is capable of exploration, compulsive exploration of meaningless states also decreases the efficiency of the search process. To address the aforementioned limitations, the MCTS exploration enhanced A* search algorithm (MEEA*) is proposed to incorporate the exploration capacity of MCTS into A* search. Experiments are conducted on ten organic molecule datasets, including the widely used USPTO benchmark30. To the best of our knowledge, MEEA* achieves a 100.0% success rate on the USPTO benchmark for the first time, and improves the success rate on all 11,310 test molecules of the ten datasets from 60.50% to 65.14%, demonstrating the effectiveness of the proposed search algorithm. What’s more, path consistency (PC) constraint is considered to improve the generalization ability of the cost estimator. The PC-enhanced version of MEEA* significantly improves the overall synthesis success rate to 76.27%. As is known, biosynthetic natural products (NPs) and the aforementioned tested organic molecules are derived from two distinct structural spaces, and thus the NP synthesis may be beyond the capabilities of the existing methods for the above organic molecule benchmarks25,41. Here, MEEA* is easily extended to work on NP synthesis by adopting the single-step expansion policy from BioNavi-NP41 and obtains the NP synthesis success rate at 97.68%, which is much higher than 90.2% by the state-of-the-art BioNavi-NP.

Results

MEEA* search algorithm

MCTS comprises two crucial operations: the pUCT tree policy42 for selecting the expanded leaf node, which enables a balance between exploration and exploitation, and the revision of state values during the search process, which influences the selection of nodes to be expanded in subsequent iterations. Correspondingly, the node with the smallest f-value is expanded in the A* search algorithm, in which f value is the summation of g value, the accumulated cost from the initial state s0 to st, and h value, the expected cost from st to the preferred goal sG,

$$f({s}_{t})=g({s}_{t})+h({s}_{t}).$$
(1)

MEEA* search algorithm is proposed by integrating the MCTS with A* search. As depicted in Fig. 3a, the search process of our MEEA* consists of three steps:

  • Simulation: A set of candidate nodes is collected by performing KMCTS MCTS simulations without node expansion. The pUCT tree policy is employed to traverse to leaf nodes, thereby introducing exploratory information into the candidate set. f values of the traversed nodes are estimated using the cost estimator. The evaluation of the leaf node is updated to all nodes in the path from the root sr to the leaf sl in a backward pass.

  • Selection: The node s with the smallest f value in the candidate set is identified as the next node in the search tree to be expanded.

  • Expansion: Integrate the children of s into the search tree provided by the single-step retrosynthetic model.

Fig. 3: The overall framework of MEEA* retrosynthetic planning search algorithm.
figure 3

a The search process of our MEEA* algorithm includes three steps. (1) Simulation: conduct KMCTS iterations to generate the candidate set from the open set; (2) Selection: select the state with the minimum f value from the candidate set; (3) Expansion: expand the selected state using the single-step retrosynthetic model \({{{{{{{\mathcal{B}}}}}}}}\). b Evaluation of states during the Simulation step of MEEA*. g(s) is the summation of reaction costs along the traversal path, and h(s) is estimated by the value network, considering all molecules in the state. c Single-step retrosynthetic model \({{{{{{{\mathcal{B}}}}}}}}\) used in the Expansion step. The first non-building block molecule is selected for expansion. The top k best reaction templates are obtained based on the priors provided by the policy network, which are applied to the expanded molecule to generate its possible precursors.

MEEA* is a heuristic search algorithm. As displayed in Fig. 3b, c, a single-step retrosynthetic model is employed to provide possible synthetic reactions. The policy network takes as input the Morgan fingerprint of the first non-building block molecule in the expanded state. The top k reaction templates are utilized to produce potential precursors of the target molecule. According to Eq. (1), the f value of state s is evaluated by calculating g(s) based on the traversal path and using a value neural network as a cost estimator to predict h(s). The details of MEEA* are presented in the Method section. Compared to the A* search, the existence of a candidate set injects the exploratory capability of MCTS into MEEA*. The simulation times KMCTS is a hyperparameter balancing the influence of A* search and MCTS on MEEA*. Besides, MEEA* also preserves the optimality of A* search, because all nodes expanded by A* search are also expanded by MEEA* under the consistency assumption of the heuristic value h predicted by the neural network39. The detailed proof is provided in Supplementary Note 1.

Path consistency enhanced cost estimator

Reliable guiding functions benefit the search process of MEEA* greatly. From a chemical perspective, the synthetic difficulty of a molecule is related to its structural complexity. The high structural diversity and sparse training data present a challenge in accurately estimating their synthetic cost. What’s more, the synthetic cost is also influenced by the available building blocks. Complex molecules can be synthesized expeditiously in a few steps if precursors are readily available25. The synthetic cost of molecules with similar structures can vary considerably, due to differences in the availability of their precursors. Therefore, the generalization ability of the cost estimator is critical to the success of the retrosynthesis. As shown in Fig. 4a, Retro*30 exhibits a substantial gap between its testing and training error of their cost estimator, indicating a severe overfitting problem.

Fig. 4: MEEA* achieves superior performance compared with A* search and MCTS.
figure 4

a The learning curve of the cost estimator in Retro* illustrates the overfitting problem; b Success rate on USPTO benchmark with different iterations. Guided by the same heuristic function, Retro*+ also achieves 100% success rate after enough iterations, but MEEA* is much more efficient; c Success rate on Natural Products (The source data for ac is provided in Supplementary Data 35); d, e Synthetic route for molecule COc1ccc2nccc(C(N)CC[C@@H]3CCN(CCSc4cccs4)C[C@@H]3C(=O)O)c2c1. MEEA* provides a shorter reaction pathway than the exhaustive search. Retro*+ fails to provide a solution, although the target can be synthesized in three steps. MCTS yields the same solution with MEEA*, but MCTS requires 35 expansions while MEEA* only requires 11 expansions. fh Synthetic route for molecule COCCCc1cc(CN(C(=O)[C@H]2CN(C(=O)OC(C)(C)C)CC[C@@H]2c2ccc(OCCOc3c(Cl) cc(C)cc3Cl)cc2)C2CC2)cc(OC[C@@H]2C[C@H]2C(=O)OCC(=O)N(C)C)c1. Reaction pathways provided by the exhaustive search, Retro*+, and MEEA* with lengths of 16, 12, and 11 respectively. MCTS fails to provide a solution. (MEEA*, MCTS, and Retro*+ are guided with the same single-step expansion policy and cost estimator).

In this paper, MEEA*-PC is proposed to alleviate this problem. Path consistency (PC), which is f values on one optimal path should be identical, is suggested to be used as a constraint to improve the learning efficiency40. PCZero43 incorporated PC into MCTS, and experiment results imply that PC sacrifices accuracy on the training set slightly to improve the generalization ability on the testing set greatly. In this paper, PC is utilized to train a more reliable cost estimator to assist MEEA* search by considering PC as a regularization term \({{{{{{{{\mathcal{L}}}}}}}}}_{PC}\) of the regular loss function \({{{{{{{{\mathcal{L}}}}}}}}}_{RL}\), i.e.,

$${{{{{{{\mathcal{L}}}}}}}}(\theta )={{{{{{{{\mathcal{L}}}}}}}}}_{RL}(\theta )+\lambda {{{{{{{{\mathcal{L}}}}}}}}}_{PC}(\theta ),$$
(2)

where λ is a hyperparameter. PC is realized by minimizing the deviation of the estimated f values to the mean f value of all nodes in the synthetic route. To train the PC-enhanced cost estimator via reinforcement learning, we construct a training set that consists of the synthesis pathways of molecules in the USPTO training set. The synthesis ways are identified by MEEA* with the heuristic functions provided in Retro*+31.

Retrosynthetic evaluation of MEEA* search algorithm

Experiments are conducted to validate the effectiveness of our MEEA* search algorithm. Following the literature26,27,30,31,33, commercially available molecules in eMolecules are used as the building block set. Firstly, experiments are conducted on the widely used USPTO benchmark, containing 190 molecules. Following previous works26,27,30,31,33, all algorithms are limited to a maximum of 500 single-step model calls, or 10 min of real-time. Details of benchmark algorithms are provided in Supplementary Note 2 and Supplementary Table 1. It needs to be noted that although MEEA* employs MCTS to conduct lookahead search, MCTS simulation in MEEA* does not call single-step models. The invocation of the one-step retrosynthetic model accounts for the majority of the runtime. Average times are evaluated by the number of calls to the single-step expansion policy to measure the efficiency of algorithms. For molecules that appeared in the USPTO dataset, an exhaustive search is employed to collect all possible pathways for synthesizing the target molecule utilizing chemical reactions from the USPTO dataset, and the shortest route is selected as the reference synthetic route. The performance of an exhaustive search is used as a benchmark for evaluation. As presented in Table 1, MEEA* achieves a remarkable 100.0% success rate, outperforming the existing A*-based and MCTS-based algorithms. Additionally, the synthesized routes demonstrate superior quality compared to the existing algorithms, in terms of both route length and synthesis cost. The substantial performance improvement highlights the effectiveness of our MEEA*. Although MEEA* requires the use of MCTS as a pre-search to determine the candidate set, the execution time spent for each expansion slightly increases. However, because of MEEA*’s ability to find solutions with fewer expansions, the overall runtime is actually shorter. Taking the USPTO test benchmark as an example, Retro*+ spends 6578 s while MEEA* only spends 3416 s. RetroGraph33 achieves the shortest search time by adopting batch processing, which enables the expansion of nodes to be shared during the synthesis of multiple molecules. Theoretically, Retro*+ can also identify synthesis pathways for all molecules in the USPTO benchmark since it is guaranteed to find the optimal solution if the search resource is unlimited. As shown in Fig. 4b, the success rate of Retro*+ achieves 100.0% when the number of one-step model calls is 2800, which is more than five times the number by MEEA*. There is a significant improvement in search efficiency achieved by MEEA*. The success rate on the USPTO benchmark with different single-step model calls is displayed in Supplementary Note 5 and Supplementary Table 3.

Table 1 Performance summary on USPTO benchmark with a hard time limit of 500 single-step model calls, or 10 min

The USPTO benchmark is widely utilized to evaluate retrosynthetic planning algorithms. However, molecules in the test benchmark need to satisfy two conditions. A synthetic route with reactions in the USPTO database is available, and each reaction in the synthetic route must rank within the top 50 predictions of the single-step expansion model of Retro*30. These two screening criteria have intentionally reduced the difficulty of this benchmark. The post-filtered test dataset exhibits inherent biases, thereby inadequately representing the real distribution of molecules. Therefore, nine additional datasets are included to ensure the authenticity of the evaluations, which also have strong practical significance. The distribution of molecules in the USPTO benchmark and the other nine molecules is presented in Supplementary Note 10 and Supplementary Fig. 3. More detailed information is listed in the Method. The USPTO benchmark requires that each molecule in the test set possesses at least one synthetic pathway, and the pathway is constructed by the reactions ranking within the top 50 predictions of the single-step expansion model. The molecules in the nine additional datasets are not subject to these restrictions, and may not have feasible synthesis routes among the top 50 chemical reactions, which renders them more representative of real-world molecule distributions and raises higher requirements for the single-step expansion policy. As shown in Table 2, MEEA* outperforms Retro*+ and EG-MCTS in all ten datasets with 65.14% overall success rates. This clearly demonstrates the effectiveness of MEEA* on a large-scale test set. The results of RetroGraph33 and GRASP27 are not presented as their codes are not publicly available. On the other hand, MEEA* focuses on improving the search algorithm. While Retro*+ and MEEA* utilize the same heuristic function, the significant performance improvement provides compelling evidence of the effectiveness of our search algorithm. Investigation on the hyperparameter KMCTS is displayed in Supplementary Note 8 and Supplementary Fig. 1.

Table 2 The success rate comparison tested in 10 datasets

Trade-off between quality and diversity

Existing works usually conducted experiments by limiting the number of calls of the single-step model26,27,30,31,33. Varying the single-step model can result in uneven resource allocation across different models, and limiting search with wall-clock time is recommended37,38. Besides the success rate, the diversity of the generated routes is highly recommended as a criterion37,38. For a given molecule, multiple diverse synthetic pathways are desired to be provided to human experts for further selection. The generated routes are clustered into different groups and the number of clusters is regarded as a metric of diversity37. Tree edit distance (TED) is calculated to measure the similarity between the algorithm-generated synthetic pathway and the referenced pathway, and the leaves overlap is computed as the average ratio of the building blocks present in both the generated route and the reference route. The synthetic routes provided in the USPTO benchmark are used as the referenced routes to compute TED and the leaves overlap. During the experiment, at most 25 synthetic routes are generated to compute the number of clusters, considering the computational time. The TED value and the leaves overlap are calculated based on the first generated route. A* search and MCTS are included for comparisons, and they are implemented by setting KMCTS →  or KMCTS = 1 in MEEA*. Unsolved molecules are excluded from the calculation of the evaluation metrics.

The results of the success rates and the first solution time are reported in Table 3, where the first solution time means the time used to find the first feasible solution. MCTS typically takes the longest time to find the first feasible solution, due to its compulsory exploration to unnecessary nodes. MEEA* not only solves all molecules with 100.0% success rate, but also only requires 19.93 seconds on average, significantly less than the 55.33s for MCTS and 22.99s for A*. The results again demonstrate the high efficiency of the proposed MEEA*. It is observed that the above results are basically consistent with the ones by limiting the number of calls to the single-step retrosynthetic model. The reasons are as follows. The invocation of the single-step retrosynthetic model accounts for the majority of the runtime. When the single-step model is kept fixed, capping the number of model calls is a reliable setting to evaluate the performances of retrosynthetic planning methods. Since the neural network architecture of the expansion policy of MEEA* remains the same as that of Retro* and Retro*+, the computational resources consumed by each single-step model invocation are basically identical. Moreover, considering the small variations in implementing various search algorithms, there exist only small differences between the runtime and the number of calls to the single-step retrosynthetic model. The success rate with different real search time is displayed in Supplementary Note 9, including both Supplementary Table 6 and Supplementary Fig. 2.

Table 3 The results by imposing a time limits of 10 min on every run on USPTO benchmark

According to results in Table 3, the routes provided by MCTS have greater diversity with larger cluster numbers. Conversely, the routes generated by A* are more closely aligned with the routes of exhaustive search with a smaller TED and a higher proportion of overlapping leaves. This result implies a deficiency in exploratory because the routes of exhaustive search are used to train the heuristic guidance. MEEA* is a hybrid algorithm that combines the characteristics of A* and MCTS, and it achieves a good balance between route quality and diversity.

Evaluation of path consistency enhanced MEEA*

As demonstrated in the previous sections, MEEA* borrows the single-step expansion policy and cost estimator from Retro*+, and achieves better performance than Retro*+. Here, the performance can be further improved by considering a better policy or cost estimator as follows. On the one hand, the cost estimator from Retro*+ predicts the synthetic cost of only one molecule at a time and calculates the total cost of the current state as the summation of predicted costs for all molecules related to the state. A cost estimation network is proposed. It is capable of taking all molecules within the current state jointly as input and computing the state cost in a direct and more accurate manner. The cost network is trained by minimizing the mean square error of cost prediction, and PC is employed to improve the generalization capability via Eq. (2). Expansion policy is also updated on the newly generated data by MEEA*

With the guidance of the updated policy and the cost estimator trained under the PC constraint, the overall success rate of MEEA* has significantly improved from 65.14% to 76.27%. Note that, only the success rate of USPTO decreased, while the performance improved across all other nine datasets. This observation is possibly due to the fact that there exists a distribution bias in the USPTO test benchmark due to expansion policy-based molecule filtering. MEEA* is a search algorithm, which can be combined with different single-step expansion models. Employing a better single-step expansion model can also enhance the performance of MEEA*. More experimental results are presented in Supplementary Note 7 and Supplementary Table 5.

Synthesis of natural products

Natural products (NPs) play a significant role in drug discovery since 60% of FDA-approved small molecule drugs are either NPs or their derivatives. However, synthetic pathways of over 90% of NPs are unknown. Biosynthetic NPs and the aforementioned tested organic molecules are derived from two distinct structural spaces41. Most existing organic retrosynthesis tools cannot be directly applied to the synthesis of NPs25. Recently, BioNavi-NP41 specifically built an expansion policy for natural products, which was combined with Retro*30 to trace NPs back to biologically plausible building blocks. When limiting the search algorithm to 100 single-step model calls and 50 expansions, the success rate is improved from 88.66% to 97.68% by replacing Retro* with our MEEA*-PC. As shown in Fig. 4c, if increasing the iterations to 2000, the success rate of Retro* stays at 93.04%, which remains inferior to our MEEA*. The search space for both BioNavi-NP and MEEA* is identical as the same expansion policy is employed. BioNavi-NP may be trapped in the non-optimal branches due to the lack of enough exploration. The number of search iterations required to overcome this limitation in BioNavi-NP will significantly surpass that of MEEA*. Overall, MEEA* has achieved significant performance improvements in both general organic molecules and natural molecules. An example of the synthesis of a natural product is given in Supplementary Note 11 and Supplementary Fig. 4.

Case study

Figure 4d, e displays the synthetic pathway of a molecule that can be synthesized with four chemical reactions by the exhaustive search. Both MEEA* and MCTS have discovered the same synthetic pathway, which exhibits a shorter route length. MCTS requires 35 single-step model calls while MEEA* only requires 11 calls, indicating that the compulsory exploration of MCTS results in lower efficiency compared to MEEA*. Guided by the same heuristic functions, Retro*+ is trapped in the first failure case depicted in Fig. 2a and fails to identify the synthetic route, although the target molecule can be synthesized with four steps.

Figure 4f–h illustrates the synthesis planning of a complex molecule. The exhaustive search, Retro*+, and MEEA* provide reaction pathways with lengths of 16, 12, and 11, respectively. MCTS fails to find a solution within the given time limit. The compulsory exploration in MCTS prevents it from effectively delving into deeper regions in the search tree within the given time limit. The search process is trapped in the second failure case displayed in Fig. 2b. Furthermore, due to the exploratory property, MEEA* requires a slightly higher number of single-step model calls for finding the synthetic pathway than Retro*+, but it successfully identifies a shorter synthetic route. The two retrosynthesis cases above have demonstrated the superiority of MEEA* from different perspectives.

Besides, MEEA* is applied to complex drug molecules to illustrate its practical significance, including Paxlovid, Fostemsavir, Enarodustat, Pacritinib, and Oteseconazole. MEEA* has successfully identified the synthesis pathway for the above drugs, and the synthetic plans are illustrated in Supplementary Note 12 and Supplementary Fig. 5.

Discussion

In this paper, we proposed an efficient retrosynthetic planning algorithm MEEA*, incorporating the exploratory behavior of MCTS into A* search by providing a lookahead search. The quality of the guiding function is crucial for the success of heuristic search. A* search is prone to being misled into non-optimal branches due to the lack of exploration ability. The compulsive exploration in MCTS limits the exploration depth within a finite iteration, rendering it less effective than A* in synthesizing complex molecules. Compared to A*, MEEA* escapes local optimal branches by incorporating exploration. In contrast to MCTS, MEEA* prevents exhaustive exploration of all branches by utilizing A*’s efficient node expansion criteria. To the best of our knowledge, MEEA* achieves 100.0% success rate on the widely used USPTO benchmark for the first time. Guided with the same heuristic functions with Retro*+, MEEA* improves the success rate on all 11,310 molecules from 60.50% to 65.14%. Path consistency is considered to improve the generalization capacity of the cost estimator and significantly improves the overall synthesis success rate to 76.27%. Experiments on natural products are conducted. Guided with the single-step expansion policy provided by BioNavi-NP, the success rate of synthesizing NPs has been increased to 97.68% from BioNavi-NP’s 90.2%. In summary, MEEA* has achieved significant performance improvements in both general organic molecules and natural molecules. Besides, we have provided a theoretical guarantee that MEEA* can definitely find the optimal solution.

There still remain certain challenges for the refinement of the MEEA*. For example, the chemical reactions suggested by the single-step retrosynthesis model may present significant barriers to implementation in practice. Our work mainly focuses on the multi-step synthesis search algorithm, and the single-step model primarily draws from previous works. It is recommended to utilize a more reliable single-step retrosynthesis model or to implement additional filtering of the generated potential chemical reactions. Additionally, MEEA* integrates the exploratory nature of MCTS with A* search. It is critical to strike a balance between the influence of MCTS and A* by selecting an appropriate value for KMCTS. The necessity of exploration in MEEA* is dependent upon the reliability of the heuristic function. If the heuristic guidance is highly reliable, exploration will be necessary because the optimal solution can be identified through a best-first search directly. In such cases, MEEA* should use a larger value of KMCTS to prioritize exploitation and move closer to A* search. When the heuristic functions are unreliable, the importance of exploration becomes more crucial, MEEA* should choose a smaller value of KMCTS to better leverage the power of MCTS. What’s more, molecules are represented by one-dimensional vectors, which may not capture sufficient stereochemical information. Graph neural networks offer a promising approach to leveraging more comprehensive molecular information.

Methods

Markov decision process for retrosynthesis planning

The Markov decision process (MDP) is a mathematical framework used to model sequential decision-making problems for an agent operating within an environment. It is described with state space \({{{{{{{\mathcal{S}}}}}}}}\), action space \({{{{{{{\mathcal{A}}}}}}}}\), transition function \({{{{{{{\mathcal{T}}}}}}}}\) and cost \({{{{{{{\mathcal{C}}}}}}}}\). The objective is to identify a pathway for synthesizing the target molecule m0 using building blocks from \({{{{{{{\mathcal{I}}}}}}}}\). In this paper, we formulate a retrosynthesis planning problem as an MDP as follows:

  • State space \({{{{{{{\mathcal{S}}}}}}}}\): A state \(s\in {{{{{{{\mathcal{S}}}}}}}}\) is defined as a set of molecules {mi, mj,   } required for the synthesis of the target molecule. The initial state is s0 = {m0}, consisting only of the target molecule. The retrosynthesis of m0 is considered successful if a state \({s}_{t}\subseteq {{{{{{{\mathcal{I}}}}}}}}\) is identified. Molecules in a state are sorted based on their SMILES representation.

  • Action space \({{{{{{{\mathcal{A}}}}}}}}\): An action \(a\in {{{{{{{\mathcal{A}}}}}}}}\) represents a potential chemical reaction that can be utilized to decompose a molecule into its constituent reactants. The action space is vast due to the multitude of chemical reactions available for synthesizing a molecule. The single-step retrosynthetic model \({{{{{{{\mathcal{B}}}}}}}}\) selects the k most promising chemical reactions as legal actions i.e., \(B(m)={\left\{{a}_{i},\pi ({a}_{i}| m)\right\}}_{i = 1}^{k}\), where π(aim) is the associated probability of taking action ai for molecule m. The remaining reactions are discarded to narrow down the width of the search tree.

  • Transition model \({{{{{{{\mathcal{T}}}}}}}}\): When taking action at in state st, \({{{{{{{\mathcal{T}}}}}}}}({s}_{t},{a}_{t})\) determines the reaching state st+1. It replaces the product molecule associated with action at in state st with its corresponding reactants, yielding the subsequent state st+1. Actions are always applied to the first molecule in the state. For example, suppose that the current state is st = {m1, m2}, and the action at is the reaction m3 + m4 → m1. The resulting state \({s}_{t+1}={{{{{{{\mathcal{T}}}}}}}}({s}_{t},{a}_{t})=\{{m}_{2},{m}_{3},{m}_{4}\}\).

  • Cost \({{{{{{{\mathcal{C}}}}}}}}\): Reaction cost \({{{{{{{\mathcal{C}}}}}}}}(s,a,s^{\prime} )\) returns the cost when transitioning to \(s{\prime}\) via action a in state s. In practice, \({{{{{{{\mathcal{C}}}}}}}}\) is defined as \(-\log {p}_{r}(a| m)\), where pr(sm) is a predictive model trained on a real reaction dataset to estimate the probability of selecting reaction a for molecule m in the physical world.

MEEA* search

MEEA* search is a hybrid algorithm that introduces the exploration of MCTS into A* search. MCTS follows iteratively four steps:

  • Selection: Traverse from the root node to a leaf node with an exploratory tree policy.

  • Expansion: Incorporate promising children into the search tree based on the single-step retrosynthetic expansion policy.

  • Evaluation: Score the leaf by either fast rollout or heuristic value models.

  • Update: Back up the simulation results to all states on the traversed path.

A* search maintains two state sets: the OPEN set for generated but unexplored nodes and the CLOSED set for explored nodes. Initially, the CLOSED set is empty and the OPEN set contains only the root state. The search process iteratively executes the following steps:

  • Selection: If the OPEN set is empty, terminate the search with failure. Otherwise, identify the state s with the smallest f value in the OPEN set.

  • Expansion: Generate the child states of s based on the expansion policy. If sG is generated, terminate with success. Otherwise, mark state s as CLOSED and add generated child states to the OPEN set.

Besides the OPEN set and CLOSED set, MEEA* maintains an additional candidate set, which is collected with the exploratory MCTS. Each edge in the search tree stores the visit count N(s, a), the prior probability π(sa) received from the expansion policy network, and the entire synthetic cost Q(s, a). The transition function \(s^{\prime} ={{{{{{{\mathcal{T}}}}}}}}(s,a)\) is deterministic. The search process of our MEEA* consists of three steps: simulation, selection, and expansion. In the first phase of MEEA*, simulations are conducted for KMCTS iterations without node expansion to collect a candidate set. Each simulation consists of three steps.

  • Traverse from the root to a leaf node according to a tree policy. In order to further encourage exploration, a variant of pUCT is adopted, and a uniform distribution is employed as the prior policy. The next action is selected via:

    $${a}^{* }=\arg \mathop{\max }\limits_{a\in {{{{{{{\mathcal{A}}}}}}}}}\left\{-Q\left(s,a\right)+{c}_{puct}\frac{1}{| {{{{{{{\mathcal{A}}}}}}}}| }\frac{\sqrt{{\sum }_{b\in {{{{{{{\mathcal{A}}}}}}}}}N(s,b)}}{1+N(s,a)}\right\},$$
    (3)

    where cpuct is a hyperparameter to control the exploration level of the simulation. The tree policy is applied until a leaf node sl is found.

  • Put the leaf node sl into the candidate set and calculate the f value of sl, which is the summation of g(sl) and h(sl) according to Eq. (1). g(sl) is the total cost of all reactions from s0 to sl, and the subsequent accumulated cost h(sl) is predicted by the heuristic cost estimator.

  • Update the stored statistics for nodes in the path from the root s0 to the leaf sl.

    $$Q(s,a)\leftarrow \frac{N(s,a)\times Q(s,a)+f({s}_{l})}{N(s,a)+1},N(s,a)\leftarrow N(s,a)+1$$
    (4)

After KMCTS simulations, the candidate set with exploratory is determined. During the selection step, the candidate node with the lowest f value is chosen from the candidate set to be expanded in the next step. Single-step expansion model \({{{{{{{\mathcal{B}}}}}}}}\) provides top 50 successor states, which are directly added to the search tree. If a newly added state contains exclusively building blocks, the synthetic pathway for the target molecule is successfully identified. The above three steps are iteratively performed until the iteration step budget has been exhausted. The simulation times KMCTS is a hyperparameter balancing the influence of A* search and MCTS on MEEA*. When KMCTS is relatively small, MEEA* approaches MCTS with increased exploration and degenerates into pure MCTS when KMCTS = 1. In this case, only one candidate node is available for expansion, which is determined by the exploratory tree policy of MCTS. When KMCTS grows, more leaf nodes are included in the candidate set. If KMCTS is sufficiently large, all opening nodes will be chosen as candidate nodes because of the exploratory nature of MCTS. Consequently, MEEA* is degraded to A* search, where the node with the lowest f-value among all open nodes is selected for expansion. Single-step model \({{{{{{{\mathcal{B}}}}}}}}\) and cost estimator are borrowed from Retro*+ to assist the search process of our MEEA*. The parameter KMCTS and cpuct is set to 100 and 4.0 respectively. Experiments are conducted on Tesla V100 GPUs and Intel(R) Xeon(R) Gold 6238R CPU with 512G memory. There is a wide range for KMCTS to make MEEA* perform well.

Synthesis pathway preparation

MEEA* is a heuristic search algorithm that requires an expansion policy and cost estimator to assist the search process. Although the guiding function can be provided by previous works, it is still necessary to train our own networks for better performance. The USPTO training set provided by Retro*30 does not include synthetic pathways for molecules, so we are required to generate these routes by ourselves. One choice is to employ an exhaustive search to identify the shortest synthetic pathway of the given molecule using the chemical reactions available in the USPTO dataset, the same as Retro*30. This strategy is effective in training a model with relatively high performance from scratch, but limits the performance of the model, making it challenging to surpass the capabilities represented by the training set42. The other choice is to update the heuristic model with reinforcement learning, collecting synthetic routes with MEEA* to further update its heuristic guidance iteratively. Retro*+31 has illustrated the effectiveness of this self-improved learning process.

MEEA* search is utilized to identify the synthetic pathways for molecules in the USPTO training set. The expansion policy and cost estimator from Retro*+31 are employed as the initial heuristic functions to assist MEEA* search. There are 299202 molecules in the USPTO training set, and MEEA* has successfully identified synthetic pathways for 299024 of them. The chemical reactions that appear in the collected synthetic pathways will be used to update the expansion policy. The synthetic cost is calculated as \(c(a)=\log {p}_{r}(a| m)\), prepared for the training of the cost estimator. pr is the expansion policy used by Retro*30, which is trained using actual chemical reactions and can be used to measure the likelihood of a reaction taking place. It needs to be noted that the updated policy and newly trained value function are exclusively employed in MEEA*-PC, and additional ablation studies are provided in the Supplementary Note 6 and Supplementary Table 4 for further analysis.

Single-step retrosynthetic model

In this paper, a template-based single-step retrosynthesis model is adopted by treating the problem as a classification task based on reaction templates. The identical network architecture as Retro*+ is employed, which is a neural network comprising a single hidden layer with 512 dimensions. The network parameters are updated using the cross-entropy loss, with Retro*+’s model serving as the initial model for the training process. We utilize the Adam optimizer with a learning rate of 0.001, and the batch size is set to 1024. The architecture of the network is presented in Supplementary Note 3.

Cost estimator and path consistency

The cost estimator is employed to predict the cost of synthesizing all molecules in the state s = {mi, mj,   }. However, previous cost estimator algorithms, such as Retro* and Retro*+, were limited to predicting the synthetic cost of one molecule at a time. The total synthetic cost for state s is calculated as the summation of independent predictions of synthetic costs for all molecules within s,

$$h(s)=\mathop{\sum}\limits_{m\in s}h(m),$$
(5)

where h is a cost estimation function. In this paper, we have designed a network to estimate the synthetic cost jointly for all molecules within a given state s, as illustrated in Eq. (6). This computational approach provides greater flexibility than the method in Eq. (5), allowing the model to utilize inter-molecular information to improve the prediction performance,

$$h(s)=h({m}_{i},{m}_{j},\cdots ).$$
(6)

Multiple molecules are fed into the same network to extract their feature embeddings. The global embedding eg is computed as the sum of all molecules’ embedding \({e}_{i}^{v}\), which can be viewed as a representation of the entire state. Global embedding eg is further fed into a fully connected layer to obtain the cost estimation of the state directly.

Through analysis of the collected synthetic pathways, the states of each step in the synthesis process, as well as the associated real synthetic cost c(s), are obtained and utilized as training data for the cost estimator. The mean square error is utilized as the loss function, as shown in Eq. (7). The Adam optimizer with a learning rate of 0.001 is employed to update the parameters, and the batch size is set to 256.

$${{{{{{{{\mathcal{L}}}}}}}}}_{RL}(s)={(h(s)-c(s))}^{2}$$
(7)

After obtaining a function model that can directly evaluate the entire state, we consider incorporating the path consistency constraint into the model’s learning process to improve its generalization ability. Path consistency is defined as f values on one optimal path should be identical. For a collected synthetic pathway {s0, s1,   , sN}, PC constraint is evaluated by the deviation of the state evaluation f(st) from the learning target, which is the average state evaluation along the estimated optimal path.

$${{{{{{{{\mathcal{L}}}}}}}}}_{PC}(s)={\left(f(s)-\bar{f}(s)\right)}^{2},\bar{f}(s)=\frac{1}{N+1}\mathop{\sum }\limits_{i=0}^{N}f({s}_{i})$$
(8)

As illustrated in Eq. (2), the PC loss is used as a regularization term for the reinforcement learning loss, which is adjusted by the hyperparameter λ. During training, λ is set to 5.5.

Test dataset

The United States Patent Office (USPTO) dataset is a publicly available reaction dataset, which consists of approximately 3.8 million reactions collected from patents granted between 1976 and September 2016. It is widely used for machine learning applications because of its size, diversity, and accessibility. USPTO dataset is prepossessed and split into train/validation/test sets by Retro*30 as a benchmark dataset for retrosynthesis planning problems. For each molecule in the USPTO dataset, a brute-force search is conducted to determine if it can be synthesized using reactions within the USPTO training data. For each synthesizable molecule, the shortest-possible synthesis pathways are collected to train a single-step expansion policy, which is used to provide the most potential reactions for multi-step retrosynthesis planning algorithms. Only molecules, for which the reactions in their synthesis pathways are all covered by the top-50 predictions by the single-step expansion policy, are preserved in the test benchmark. As a result, the distribution of molecules in the USPTO test benchmark differs from the distribution of molecules in real life.

To enable a more rigorous evaluation of the algorithm’s performance in real-world scenarios, nine additional real-world molecular datasets are also utilized as the test set. logS44 logP45, and Toxicity LD5046 are used to predict the solubility, hydrophobic property, and toxicity of molecules separately. The Ames dataset47 is a standardized collection of data used to assess the genetic toxicity of bacteria, i.e., whether it induces mutations. Human blood-brain barrier penetration (BBBP)48 is a crucial property in the process of drug design. Dipeptidyl peptidase-4 (DPP4) inhibitors49 are an essential drug in the treatment of type-2 diabetes mellitus. β-secretase 1 (BACE-1) inhibitors50 are a class of drugs that target the BACE-1 enzyme, which plays a key role in the production of β-amyloid peptides in Alzheimer’s disease. Inhibition of protein-protein interactions in SVS51 is of great interest in drug design and discovery, because dysfunction of PPIs can lead to various diseases, including immunodeficiency, autoimmune disorders, allergy, drug addiction, and cancer. ClinTox52 compromises FDA-approved drugs and compounds that have undergone clinical trials but failed due to toxicity-related issues. Those drug molecules in ClinTox are selected by experts for clinical trials in the early stages of drug development, which is suitable to evaluate the efficacy of retrosynthesis planning algorithms in drug design. More information about the test dataset is summarized in Supplementary Note 4 and Supplementary Table 2. To increase the challenge of the test, datasets are preprocessed as follows:

  • Remove molecules present in either the USPTO database or the building block set.

  • Remove molecules solved by a heuristic BFS planning algorithm within 50 steps.

  • Remove molecules that can be solved by Retro* in one step.

Although biosynthetic natural products and chemically synthesized compounds share certain similarities, they are fundamentally distinct in terms of structural space and reaction types41. NPs exhibit a greater diversity and more complex structures, and existing organic retrosynthesis tools cannot be directly used for biosynthesis prediction. What’s more, NPs can be synthesized from a dozen simple building blocks, in contrast to millions of building blocks required in the preceding organic dataset. 388 molecules collected by BioNavi-NP are used to test the performance in the synthesis of natural products.