As more data are introduced into models of chemical reactivity, the mechanistic component can be reduced until ‘big data’ applications are reached. These methods no longer depend on underlying mechanistic hypotheses, potentially learning them implicitly through extensive data training. Reactivity models often focus on reaction barriers, but can also be trained to directly predict lab-relevant properties, such as yields or conditions. Calculations with a quantum-mechanical component are still preferred for quantitative predictions of reactivity. Although big data applications tend to be more qualitative, they have the advantage of being broadly applicable to different kinds of reactions. There is a continuum of methods in between these extremes, such as methods that use quantum-derived data or descriptors in machine learning models. Here, we present an overview of recent machine learning applications in the field of chemical reactivity from a mechanistic perspective. Starting with a summary of how reactivity questions are addressed by quantum-mechanical methods, we discuss methods that augment or replace quantum-based modelling with faster alternatives relying on machine learning.
K.J. is a fellow of the AstraZeneca Postdoc Programme.
The authors declare no competing interests.
Peer review information
Nature Reviews Chemistry thanks the anonymous reviewers for their contribution to the peer review of this work.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Daylight Chemical Information Systems: Fingerprints: https://www.daylight.com/dayhtml/doc/theory/theory.finger.html
Daylight Chemical Information Systems: SMILES: https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
Daylight Chemical Information Systems: SMARTS: https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
Dragon Descriptors: https://chm.kode-solutions.net/products_dragon_descriptors.php
IUPAC InChI: http://www.iupac.org/inchi/
Lowe, D. M. Patent Reaction Extractor: https://github.com/dan2097/patent-reaction-extraction
Open Reaction Database: https://ord-schema.readthedocs.io/en/latest/
RDKit: Open-Source Cheminformatics Software: https://www.rdkit.org/
- Density functional theory
(DFT). A quantum-mechanical method based on electron density for simulating molecules and reactions.
- Descriptors
Also referred to as features. The properties used to train a machine learning model.
- Semiempirical QM methods
Use the same algorithms as wave function and density functional theory methods, but with approximated values for matrix elements.
- Domains of applicability
The regions of chemical space within which a model can reliably make predictions.
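One simple way to make this idea operational, among several, is a distance cut-off in descriptor space: a query is treated as inside the domain if it lies close enough to at least one training example. The sketch below assumes Euclidean distance on hypothetical two-dimensional descriptor vectors; the function names, the training points and the threshold are illustrative assumptions, not a prescribed definition.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_distance(query, training_set):
    """Distance from the query to its closest training example."""
    return min(euclidean(query, t) for t in training_set)

def in_domain(query, training_set, threshold):
    """True if the query lies within `threshold` of some training point."""
    return nearest_distance(query, training_set) <= threshold

# Hypothetical descriptor vectors of the training molecules (assumption).
TRAINING = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

if __name__ == "__main__":
    print(in_domain((0.5, 0.5), TRAINING, threshold=1.0))  # near the training data
    print(in_domain((5.0, 5.0), TRAINING, threshold=1.0))  # far from the training data
```

The cut-off is a modelling choice: a tighter threshold gives more conservative predictions at the cost of covering less chemical space.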
- Gaussian process regression
Machine learning algorithm in which the data points are assumed to be the means of Gaussian distributions. Delivers both predicted means and variance.
- Extra tree regressor model
Machine learning algorithm similar to random forest. Owing to differences in implementation, this method is usually faster than a random forest.
- Random forest
Machine learning algorithm that builds an ensemble of decision trees and predicts the value of a new example by taking into consideration the prediction from each decision tree in the ensemble.
- Sterimol parameters
A set of parameters that describes the steric effects of substituents.
- Gradient boosting decision tree model
Machine learning algorithm that is based on decision trees (see ‘random forest’). The model is built stepwise: each new tree is fitted to the errors of the current ensemble, and its contribution is scaled by a learning rate. This approach has been shown to mitigate overfitting problems.
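The stepwise construction can be sketched for one-dimensional regression, assuming squared-error loss and one-split trees (decision stumps) as the base learner. The toy data, function names and default settings are illustrative assumptions and do not correspond to any particular library.

```python
def fit_stump(xs, targets):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then add stumps fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Only a fraction (the learning rate) of each correction is applied.
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return {"base": base, "rate": learning_rate, "stumps": stumps}

def predict_boosted(model, x):
    correction = sum(predict_stump(s, x) for s in model["stumps"])
    return model["base"] + model["rate"] * correction

def mean_squared_error(model, xs, ys):
    return sum((predict_boosted(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data with a step between x = 3 and x = 4 (assumption).
XS = [1, 2, 3, 4, 5, 6]
YS = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

if __name__ == "__main__":
    for rounds in (5, 50):
        model = fit_boosted(XS, YS, n_rounds=rounds)
        print(rounds, mean_squared_error(model, XS, YS))
```

On training data the error keeps shrinking as rounds are added, so in practice the number of rounds and the learning rate are chosen on held-out data.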
- Receiver operating characteristic
(ROC). Curve of true positive rate versus the false positive rate of a machine learning classification algorithm. The area under the ROC curve is often used as a performance metric.
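As a minimal sketch of how this metric is obtained, the curve can be traced by lowering the decision threshold through the ranked scores and the area computed with the trapezoidal rule. The function names and the toy labels and scores are illustrative assumptions; the code expects binary labels (1 for positive, 0 for negative) with both classes present.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Examples with identical scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points

def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical classifier output (assumption): 1 = positive, 0 = negative.
LABELS = [1, 1, 0, 1, 0, 0]
SCORES = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

if __name__ == "__main__":
    print(auc(roc_points(LABELS, SCORES)))
```

For this toy set, eight of the nine positive-negative pairs are ranked correctly, which is why the area equals 8/9; a random ranking would give about 0.5.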
- Support vector machine
(SVM). A machine learning algorithm based on the idea that data points are divided by a hyperplane. The model tries to define the form of the hyperplane so as to maximize the separation between dissimilar data points.
- Deep feed-forward neural network models
A feed-forward neural network, also called a multilayer perceptron, is one of the basic architectures in machine learning, in which the input nodes connect to hidden layers of nodes, which, in turn, connect to the output nodes. A neural network is feed-forward when no output information is channelled back into the model, as opposed to recurrent networks.
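The forward flow of information described here can be sketched as a plain-Python forward pass. The network size, the ReLU nonlinearity on the hidden layer and the hand-picked weights are illustrative assumptions; a real model would learn its weights from data.

```python
def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer; weights[j][i] links input i to node j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    """Propagate inputs layer by layer; nothing is fed back into the model."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:  # nonlinearity on hidden layers only
            activations = relu(activations)
    return activations

# Hypothetical parameters (assumption): 2 inputs -> 2 hidden nodes -> 1 output.
LAYERS = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),  # hidden layer
    ([[2.0, 1.0]], [0.1]),                     # output layer
]

if __name__ == "__main__":
    print(forward([3.0, 1.0], LAYERS))
    print(forward([1.0, 3.0], LAYERS))
```

Adding more `(weights, biases)` pairs to `LAYERS` deepens the network without changing `forward`.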
- Molecular fingerprints
Molecular representations derived from the molecular connectivity.
- Representations
Machine-readable descriptions of a molecule as, for example, a string of characters, a vector or a graph.
- Atom mapping
Refers to the labelling of atoms in the reactants and the corresponding atoms in the products in a reaction SMARTS.
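As an illustration of what these labels provide, the sketch below reads the map numbers from an atom-mapped reaction string in Daylight-style notation and pairs each reactant atom with its counterpart in the products. The regular expression is a deliberate simplification that only handles bracketed atoms carrying a map number; it is not a general parser, and the example reaction and function names are assumptions chosen for illustration.

```python
import re

# Matches bracket atoms with a map number, e.g. "[CH3:1]" -> ("CH3", "1").
MAPPED_ATOM = re.compile(r"\[([^\]:]+):(\d+)\]")

def mapped_atoms(side):
    """Return {map number: atom description} for one side of a reaction."""
    return {int(number): atom for atom, number in MAPPED_ATOM.findall(side)}

def compare_sides(reaction):
    """Split 'reactants>>products' and pair up atoms by map number."""
    reactants, products = reaction.split(">>")
    before = mapped_atoms(reactants)
    after = mapped_atoms(products)
    return {n: (before[n], after[n]) for n in sorted(before) if n in after}

def changed_atoms(reaction):
    """Map numbers whose atom description differs between the two sides."""
    return [n for n, (a, b) in compare_sides(reaction).items() if a != b]

# Hypothetical example (assumption): substitution of bromide by hydroxide.
REACTION = "[CH3:1][Br:2].[OH-:3]>>[CH3:1][OH:3].[Br-:2]"

if __name__ == "__main__":
    print(compare_sides(REACTION))
    print(changed_atoms(REACTION))
```

Comparing atom descriptions only detects changes in charge or hydrogen count; identifying which bonds form or break additionally requires the connectivity on each side.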
- Deep learning
The field of machine learning that uses neural networks with many hidden layers.
- Reaction templates
Patterns describing a chemical reaction, often represented by reaction SMARTS.
- SMARTS
A string representation of a molecular pattern, based on the simplified molecular input line entry system (SMILES). SMARTS are used to define a substructure of a molecule. For example, ethanol could be represented using the SMILES string CCO. To define the alcohol functional group, one uses SMARTS [#6][OX2H], in which each atomic position is enclosed in square brackets and encodes which atom types are allowed at this position.
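To make the example concrete, the two atom primitives in [#6][OX2H] can be spelled out as plain predicates. The sketch below does not use a cheminformatics toolkit: it matches the pattern against hand-built molecular graphs, reading X2 as two total connections (hydrogens included) and H as exactly one attached hydrogen. The dictionaries, field names and helper functions are illustrative assumptions, and aromaticity is ignored because neither toy molecule is aromatic. In practice, a toolkit such as RDKit would handle SMARTS parsing and matching.

```python
# Hand-built graphs: atomic number (z), attached hydrogens (h) and
# heavy-atom neighbours for each atom index.
ETHANOL = {  # SMILES: CCO
    0: {"z": 6, "h": 3, "neighbors": [1]},
    1: {"z": 6, "h": 2, "neighbors": [0, 2]},
    2: {"z": 8, "h": 1, "neighbors": [1]},
}
DIMETHYL_ETHER = {  # SMILES: COC
    0: {"z": 6, "h": 3, "neighbors": [1]},
    1: {"z": 8, "h": 0, "neighbors": [0, 2]},
    2: {"z": 6, "h": 3, "neighbors": [1]},
}

def is_carbon(atom):
    """[#6]: any atom with atomic number 6."""
    return atom["z"] == 6

def is_hydroxyl_oxygen(atom):
    """[OX2H]: oxygen with two total connections, exactly one of them a hydrogen."""
    total_connections = len(atom["neighbors"]) + atom["h"]
    return atom["z"] == 8 and total_connections == 2 and atom["h"] == 1

def match_alcohol(molecule):
    """Return (carbon index, oxygen index) pairs matching [#6][OX2H]."""
    matches = []
    for index, atom in molecule.items():
        if not is_carbon(atom):
            continue
        for neighbor in atom["neighbors"]:
            if is_hydroxyl_oxygen(molecule[neighbor]):
                matches.append((index, neighbor))
    return matches

if __name__ == "__main__":
    print(match_alcohol(ETHANOL))         # the C-O(H) pair of the alcohol
    print(match_alcohol(DIMETHYL_ETHER))  # no hydroxyl group, so no match
```

The ether is rejected because its oxygen carries no hydrogen, which is exactly the distinction the H primitive in the pattern encodes.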
- Negative reactions
Reactions that give a low or zero yield. These are important for machine learning because the model needs to learn that not all input leads to a product.
- Graph convolutional networks
Neural networks that operate on a graph and use convolution to create their own features for learning.
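A single convolution step on a molecular graph can be sketched as follows: every atom combines its own feature vector with the sum of its neighbours' vectors, and a readout sums the atom vectors into one molecule-level vector. This is a simplified illustration rather than a specific published architecture; the one-hot atom features, the identity weight matrices (which a real network would learn) and the function names are assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def graph_convolution(features, neighbors, w_self, w_neigh):
    """One message-passing step over all atoms of a molecular graph."""
    updated = []
    for atom, own in enumerate(features):
        # Aggregate: sum the feature vectors of the bonded atoms.
        message = [sum(features[n][k] for n in neighbors[atom])
                   for k in range(len(own))]
        combined = [s + m for s, m in
                    zip(matvec(w_self, own), matvec(w_neigh, message))]
        updated.append([max(0.0, c) for c in combined])  # ReLU nonlinearity
    return updated

def readout(features):
    """Sum atom vectors into one vector; independent of atom ordering."""
    return [sum(column) for column in zip(*features)]

# Ethanol (CCO) heavy atoms with one-hot features [is carbon, is oxygen].
FEATURES = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
NEIGHBORS = [[1], [0, 2], [1]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]  # untrained placeholder weights (assumption)

if __name__ == "__main__":
    hidden = graph_convolution(FEATURES, NEIGHBORS, IDENTITY, IDENTITY)
    print(hidden)
    print(readout(hidden))
```

After one step the two carbons, which started with identical features, are already distinguishable because only one of them is bonded to oxygen; this is the sense in which the network builds its own features from the graph.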
About this article
Cite this article
Jorner, K., Tomberg, A., Bauer, C. et al. Organic reactivity from mechanism to machine learning. Nat Rev Chem 5, 240–255 (2021). https://doi.org/10.1038/s41570-021-00260-x