Abstract
Deep generative models are powerful tools for the exploration of chemical space, enabling the on-demand generation of molecules with desired physical, chemical or biological properties. However, these models are typically thought to require training datasets comprising hundreds of thousands, or even millions, of molecules. This perception limits the application of deep generative models in regions of chemical space populated by a relatively small number of examples. Here, we systematically evaluate and optimize generative models of molecules based on recurrent neural networks in low-data settings. We find that robust models can be learned from far fewer examples than has been widely assumed. We identify strategies that further reduce the number of molecules required to learn a model of equivalent quality, notably including data augmentation by non-canonical SMILES enumeration, and demonstrate the application of these principles by learning models of bacterial, plant and fungal metabolomes. The structure of our experiments also allows us to benchmark the metrics used to evaluate generative models themselves. We find that many of the most widely used metrics in the field fail to capture model quality, but we identify a subset of well-behaved metrics that provide a sound basis for model development. Collectively, our work provides a foundation for directly learning generative models in sparsely populated regions of chemical space.
Code availability
Code used to train and evaluate chemical language models is available from GitHub at http://github.com/skinnider/low-data-generative-models (ref. 81).
References
Bohacek, R. S., McMartin, C. & Guida, W. C. The art and practice of structure-based drug design: a molecular modeling perspective. Med. Res. Rev. 16, 3–50 (1996).
Virshup, A. M., Contreras-García, J., Wipf, P., Yang, W. & Beratan, D. N. Stochastic voyages into uncharted chemical space produce a representative library of all possible drug-like compounds. J. Am. Chem. Soc. 135, 7296–7303 (2013).
van Deursen, R. & Reymond, J.-L. Chemical space travel. ChemMedChem 2, 636–640 (2007).
Lameijer, E.-W., Kok, J. N., Bäck, T. & Ijzerman, A. P. The molecule evoluator. An interactive evolutionary algorithm for the design of drug-like molecules. J. Chem. Inf. Model. 46, 545–552 (2006).
Pollock, S. N., Coutsias, E. A., Wester, M. J. & Oprea, T. I. Scaffold topologies. 1. Exhaustive enumeration up to eight rings. J. Chem. Inf. Model. 48, 1304–1310 (2008).
Fink, T. & Reymond, J.-L. Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes and drug discovery. J. Chem. Inf. Model. 47, 342–353 (2007).
Blum, L. C. & Reymond, J.-L. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc. 131, 8732–8733 (2009).
Ruddigkeit, L., van Deursen, R., Blum, L. C. & Reymond, J.-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model. 52, 2864–2875 (2012).
Elton, D. C., Boukouvalas, Z., Fuge, M. D. & Chung, P. W. Deep learning for molecular design - a review of the state of the art. Mol. Syst. Des. Eng. 4, 828–849 (2019).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model. 28, 31–36 (1988).
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
Segler, M. H. S., Kogej, T., Tyrchan, C. & Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 4, 120–131 (2018).
Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminform. 9, 48 (2017).
Arús-Pous, J. et al. Exploring the GDB-13 chemical space using deep generative models. J. Cheminform. 11, 20 (2019).
Merk, D., Friedrich, L., Grisoni, F. & Schneider, G. De novo design of bioactive small molecules by artificial intelligence. Mol. Inform. 37, 1700153 (2018).
Moret, M., Friedrich, L., Grisoni, F., Merk, D. & Schneider, G. Generative molecular design in low data regimes. Nat. Mach. Intell. 2, 171–180 (2020).
Popova, M., Isayev, O. & Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv. 4, eaap7885 (2018).
Kotsias, P.-C. et al. Direct steering of de novo molecular generation with descriptor conditional recurrent neural networks. Nat. Mach. Intell. 2, 254–265 (2020).
Li, Y., Zhang, L. & Liu, Z. Multi-objective de novo drug design with conditional graph generative model. J. Cheminform. 10, 33 (2018).
Zhou, Z., Kearnes, S., Li, L., Zare, R. N. & Riley, P. Optimization of molecules via deep reinforcement learning. Sci. Rep. 9, 10752 (2019).
Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In Proc. 35th International Conference on Machine Learning Vol. 80 (eds Dy, J. & Krause, A.) 2323–2332 (PMLR, 2018).
Brown, N., Fiscato, M., Segler, M. H. S. & Vaucher, A. C. Guacamol: benchmarking models for de novo molecular design. J. Chem. Inf. Model. 59, 1096–1108 (2019).
Polykovskiy, D. et al. Molecular sets (MOSES): a benchmarking platform for molecular generation models. Front. Pharmacol. 11, 565644 (2020).
Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: generative models for matter engineering. Science 361, 360–365 (2018).
Ståhl, N., Falkman, G., Karlsson, A., Mathiason, G. & Boström, J. Deep reinforcement learning for multiparameter optimization in de novo drug design. J. Chem. Inf. Model. 59, 3166–3176 (2019).
Liu, X., Ye, K., van Vlijmen, H. W. T., IJzerman, A. P. & van Westen, G. J. P. An exploration strategy improves the diversity of de novo ligands using deep reinforcement learning: a case for the adenosine A2A receptor. J. Cheminform. 11, 35 (2019).
Neil, D. et al. Exploring deep recurrent models with reinforcement learning for molecule design. In Proc. 6th International Conference on Learning Representations (ICLR, 2018).
Amabilino, S., Pogány, P., Pickett, S. D. & Green, D. V. S. Guidelines for recurrent neural network transfer learning-based molecular generation of focused libraries. J. Chem. Inf. Model. 60, 5699–5713 (2020).
Gupta, A. et al. Generative recurrent networks for de novo drug design. Mol. Inform. 37, 1700111 (2018).
Awale, M., Sirockin, F., Stiefl, N. & Reymond, J.-L. Drug analogs from fragment-based long short-term memory generative neural networks. J. Chem. Inf. Model. 59, 1347–1356 (2019).
Merk, D., Grisoni, F., Friedrich, L. & Schneider, G. Tuning artificial intelligence on the de novo design of natural-product-inspired retinoid X receptor modulators. Commun. Chem. 1, 68 (2018).
Renz, P., Van Rompaey, D., Wegner, J. K., Hochreiter, S. & Klambauer, G. On failure modes in molecule generation and optimization. Drug Discov. Today Technol. 32–33, 55–63 (2019).
Arús-Pous, J. et al. Randomized SMILES strings improve the quality of molecular generative models. J. Cheminform. 11, 71 (2019).
Irwin, J. J. & Shoichet, B. K. ZINC—a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 45, 177–182 (2005).
Benhenda, M. Can AI reproduce observed chemical diversity? Preprint at bioRxiv https://doi.org/10.1101/292177 (2018).
Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S. & Klambauer, G. Fréchet ChemNet distance: a metric for generative models for molecules in drug discovery. J. Chem. Inf. Model. 58, 1736–1741 (2018).
van Deursen, R., Ertl, P., Tetko, I. V. & Godin, G. GEN: highly efficient SMILES explorer using autodidactic generative examination networks. J. Cheminform. 12, 22 (2020).
Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2019).
Sorokina, M. & Steinbeck, C. Review on natural products databases: where to find data in 2020. J. Cheminform. 12, 20 (2020).
Sanchez-Lengeling, B., Outeiral, C., Guimaraes, G. L. & Aspuru-Guzik, A. Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry (ORGANIC). Preprint at https://doi.org/10.26434/chemrxiv.5309668.v3 (2017).
O’Boyle, N. & Dalke, A. DeepSMILES: an adaptation of SMILES for use in machine-learning of chemical structures. Preprint at https://doi.org/10.26434/chemrxiv.7097960 (2018).
Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach. Learn. Sci. Technol. 1, 045024 (2020).
Kusner, M. J., Paige, B. & Hernandez-Lobato, J. M. Grammar variational autoencoder. Preprint at https://arxiv.org/pdf/1703.01925.pdf (2017).
Dai, H., Tian, Y., Dai, B., Skiena, S. & Song, L. Syntax-directed variational autoencoder for structured data. Preprint at https://arxiv.org/pdf/1802.08786.pdf (2018).
Bjerrum, E. J. SMILES enumeration as data augmentation for neural network modeling of molecules. Preprint at https://arxiv.org/pdf/1703.07076.pdf (2017).
Bjerrum, E. J. & Sattarov, B. Improving chemical autoencoder latent space and molecular de novo generation diversity with heteroencoders. Biomolecules 8, 131 (2018).
Winter, R., Montanari, F., Noé, F. & Clevert, D.-A. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem. Sci. 10, 1692–1701 (2019).
Zhang, Q. et al. Structural investigation of ribosomally synthesized natural products by hypothetical structure enumeration and evaluation using tandem MS. Proc. Natl Acad. Sci. USA 111, 12031–12036 (2014).
Johnston, C. W. et al. An automated genomes-to-natural products platform (GNP) for the discovery of modular natural products. Nat. Commun. 6, 8421 (2015).
Zheng, S. et al. QBMG: quasi-biogenic molecule generator with deep recurrent neural network. J. Cheminform. 11, 5 (2019).
da Silva, R. R., Dorrestein, P. C. & Quinn, R. A. Illuminating the dark matter in metabolomics. Proc. Natl Acad. Sci. USA 112, 12549–12550 (2015).
Vanhaelen, Q., Lin, Y.-C. & Zhavoronkov, A. The advent of generative chemistry. ACS Med. Chem. Lett. 11, 1496–1505 (2020).
Coley, C. W., Eyke, N. S. & Jensen, K. F. Autonomous discovery in the chemical sciences part II: outlook. Angew. Chem. Int. Ed. 59, 23414–23436 (2020).
Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).
Kadurin, A., Nikolenko, S., Khrabrov, K., Aliper, A. & Zhavoronkov, A. druGAN: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico. Mol. Pharm. 14, 3098–3104 (2017).
Samanta, B. et al. NEVAE: a deep generative model for molecular graphs. J. Mach. Learn. Res. 21, 1–33 (2020).
Mercado, R. et al. Practical notes on building molecular graph generative models. Appl. AI Lett. https://doi.org/10.1002/ail2.18 (2020).
De Cao, N. & Kipf, T. MolGAN: an implicit generative model for small molecular graphs. Preprint at https://arxiv.org/pdf/1805.11973.pdf (2018).
Jaeger, S., Fulle, S. & Turk, S. Mol2vec: unsupervised machine learning approach with chemical intuition. J. Chem. Inf. Model. 58, 27–35 (2018).
O’Boyle, N. & Dalke, A. DeepSMILES: an adaptation of SMILES for use in machine-learning of chemical structures. Preprint at https://doi.org/10.26434/chemrxiv.7097960.v1 (2018).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
O’Boyle, N. M. & Sayle, R. A. Comparing structural fingerprints using a literature-based similarity benchmark. J. Cheminform. 8, 36 (2016).
Skinnider, M. A., Dejong, C. A., Franczak, B. C., McNicholas, P. D. & Magarvey, N. A. Comparative analysis of chemical similarity methods for modular natural products with a hypothetical structure enumeration algorithm. J. Cheminform. 9, 46 (2017).
Smith, S. L., Kindermans, P.-J. & Le, Q. V. Don’t decay the learning rate, increase the batch size. Preprint at https://arxiv.org/pdf/1711.00489.pdf (2017).
Bertz, S. H. The first general index of molecular complexity. J. Am. Chem. Soc. 103, 3599–3601 (1981).
Wildman, S. A. & Crippen, G. M. Prediction of physicochemical parameters by atomic contributions. J. Chem. Inf. Comput. Sci. 39, 868–873 (1999).
Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996).
Ertl, P., Roggo, S. & Schuffenhauer, A. Natural product-likeness score and its application for prioritization of compound libraries. J. Chem. Inf. Model. 48, 68–74 (2008).
Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90–98 (2012).
Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 1, 8 (2009).
Ertl, P., Rohde, B. & Selzer, P. Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. J. Med. Chem. 43, 3714–3717 (2000).
Sajed, T. et al. ECMDB 2.0: a richer resource for understanding the biochemistry of E. coli. Nucleic Acids Res. 44, D495–D501 (2016).
Huang, W. et al. PAMDB: a comprehensive Pseudomonas aeruginosa metabolome database. Nucleic Acids Res. 46, D575–D580 (2018).
Moumbock, A. F. A. et al. StreptomeDB 3.0: an updated compendium of streptomycetes natural products. Nucleic Acids Res. 49, D600–D604 (2020).
Zeng, X. et al. NPASS: natural product activity and species source database for natural product research, discovery and tool development. Nucleic Acids Res. 46, D1217–D1222 (2018).
Karp, P. D. et al. The BioCyc collection of microbial genomes and metabolic pathways. Brief. Bioinform. 20, 1085–1093 (2019).
Neveu, V. et al. Phenol-Explorer: an online comprehensive database on polyphenol contents in foods. Database (Oxford) 2010, bap024 (2010).
Ramirez-Gaona, M. et al. YMDB 2.0: a significantly expanded version of the yeast metabolome database. Nucleic Acids Res. 45, D440–D445 (2017).
McInnes, L., Healy, J. & Melville, J. UMAP: Uniform Manifold Approximation and Projection for dimension reduction. Preprint at https://arxiv.org/pdf/1802.03426.pdf (2018).
Molecules used to train generative models (Zenodo, 2021); https://doi.org/10.5281/zenodo.4641960
Python source code used to train and evaluate generative models of molecules (Zenodo, 2021); https://doi.org/10.5281/zenodo.4642099
Acknowledgements
This work was supported by funding from Genome Canada, Genome British Columbia and Genome Alberta (project nos. 284MBO and 264PRO). Computational resources were provided by WestGrid, Compute Canada and Advanced Research Computing at the University of British Columbia. M.A.S. acknowledges support from a CIHR Vanier Canada Graduate Scholarship, a Roman M. Babicki Fellowship in Medical Research, a Borealis AI Graduate Fellowship, a Walter C. Sumner Memorial Fellowship and a Vancouver Coastal Health–CIHR–UBC MD/PhD Studentship. We thank J. Liigand and F. Wang for helpful discussions.
Author information
Authors and Affiliations
Contributions
M.A.S., D.S.W. and L.J.F. designed experiments. M.A.S. and R.G.S. performed experiments. M.A.S. wrote the manuscript. All authors edited the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Machine Intelligence thanks Sebastian Raschka and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Evaluating low-data generative models of purchasable chemical space.
a, Schematic overview of the ‘% valid’, ‘% unique’, and ‘% novel’ metrics. b, Values of the five top-performing metrics with the strongest correlations (ρ ≥ 0.82) to training dataset size for n = 110 generative models trained on varying numbers of molecules from the ZINC database. c, Values of five exemplary metrics with moderate to weak correlations (0.48 ≤ ρ ≤ 0.73) to training dataset size for n = 110 generative models trained on varying numbers of molecules from the ZINC database. d, Values of five exemplary metrics with little or no correlation (ρ ≤ 0.36) to training dataset size for n = 110 generative models trained on varying numbers of molecules from the ZINC database.
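The '% valid', '% unique' and '% novel' metrics summarized in panel a can be computed directly from generated and training SMILES. The following is a minimal sketch (not the published implementation), assuming both sets are available as lists of SMILES strings; function and variable names are illustrative.

```python
from rdkit import Chem

def evaluate_samples(sampled_smiles, train_smiles):
    # % valid: fraction of sampled strings that parse to a molecule
    mols = [Chem.MolFromSmiles(s) for s in sampled_smiles]
    valid = [m for m in mols if m is not None]
    pct_valid = len(valid) / len(sampled_smiles)

    # canonicalize valid molecules so duplicates map to the same string
    canonical = [Chem.MolToSmiles(m) for m in valid]
    unique = set(canonical)

    # % unique: fraction of valid molecules that are distinct
    pct_unique = len(unique) / len(canonical) if canonical else 0.0

    # % novel: fraction of unique molecules absent from the training set
    train_canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in train_smiles}
    novel = [s for s in unique if s not in train_canonical]
    pct_novel = len(novel) / len(unique) if unique else 0.0
    return pct_valid, pct_unique, pct_novel
```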
Extended Data Fig. 2 Evaluating low-data generative models of divergent chemical spaces.
a, Values of the five top-performing metrics with the strongest correlations (average rank correlation ≥ 0.80) to training dataset size for n = 440 generative models trained on varying numbers of molecules from the ChEMBL, COCONUT, GDB, or ZINC databases. Points and error bars show the mean and standard deviation, respectively, of ten independent replicates. b, Values of five exemplary metrics with moderate to weak correlations to training dataset size for n = 440 generative models trained on varying numbers of molecules from the ChEMBL, COCONUT, GDB, or ZINC databases. c, Values of five exemplary metrics with little or no correlation to training dataset size for n = 440 generative models trained on varying numbers of molecules from the ChEMBL, COCONUT, GDB, or ZINC databases. d, PC1 scores for n = 440 chemical language models trained on varying numbers of molecules sampled from the ChEMBL, COCONUT, GDB, or ZINC databases. Inset text shows the Spearman correlation. e, Factor loadings onto the first principal component in a PCA of n = 440 chemical language models trained on varying numbers of molecules sampled from the ChEMBL, COCONUT, GDB, or ZINC databases.
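Panels d and e summarize many individual metrics as scores and loadings on the first principal component. A minimal sketch of this summarization is shown below, assuming a models-by-metrics matrix with one row per trained model; the scaling choice is an assumption, not necessarily the authors' preprocessing.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pc1_summary(metric_matrix):
    # z-score each metric so that differences in scale do not dominate
    scaled = StandardScaler().fit_transform(metric_matrix)
    pca = PCA(n_components=2)
    scores = pca.fit_transform(scaled)
    loadings = pca.components_[0]   # contribution of each metric to PC1
    return scores[:, 0], loadings
```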
Extended Data Fig. 3 Robustness of principal component analysis for the evaluation of chemical generative models.
a, PCA of top-performing metrics, top, and PC1 scores, bottom, for chemical language models trained on varying numbers of molecules sampled from the ChEMBL, COCONUT, GDB, and ZINC databases, with PCA performed separately for each database. Bottom, inset text shows the Spearman correlation. b, PCA of top-performing metrics for chemical language models trained on varying numbers of molecules sampled from three of four databases, colored by the size of the training dataset, top, or the chemical database on which the generative models were trained, middle. Bottom, PC1 scores for models trained on the withheld database, projected onto the coordinate basis of the other three databases. Inset text shows the Spearman correlation.
Extended Data Fig. 4 Learning chemical language models from less than 1,000 examples.
a, Proportion of valid SMILES generated by chemical language models trained on samples of between 200 and 1,000 molecules from one of four chemical databases. b, Fréchet ChemNet distance of chemical language models trained on samples of between 200 and 1,000 molecules from one of four chemical databases. c, PC1 scores of chemical language models trained on samples of between 200 and 1,000 molecules from one of four chemical databases.
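The Fréchet ChemNet distance in panel b compares Gaussians fitted to ChemNet activations of generated and reference molecules. The sketch below shows only the underlying Fréchet distance computation, assuming the activation matrices have already been obtained (for example, with the fcd package); it is illustrative, not the authors' code.

```python
import numpy as np
from scipy import linalg

def frechet_distance(act_generated, act_reference):
    # fit a Gaussian (mean, covariance) to each set of activations
    mu1, mu2 = act_generated.mean(axis=0), act_reference.mean(axis=0)
    cov1 = np.cov(act_generated, rowvar=False)
    cov2 = np.cov(act_reference, rowvar=False)

    # Fréchet distance between the two Gaussians
    covmean = linalg.sqrtm(cov1.dot(cov2))
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # discard small numerical imaginary parts
    diff = mu1 - mu2
    return diff.dot(diff) + np.trace(cov1 + cov2 - 2.0 * covmean)
```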
Extended Data Fig. 5 Training dataset size requirements in different chemical spaces.
Mean difference in PC1 scores between chemical language models trained on varying numbers of molecules sampled from each pair of chemical structure databases. Dark squares indicate pairs without statistically significant differences (uncorrected p > 0.05, two-sided t-test).
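The pairwise comparison in this figure can be reproduced with a standard two-sided t-test on PC1 scores; the sketch below is illustrative, with placeholder variable names for two databases.

```python
from scipy import stats

# pc1_scores_a and pc1_scores_b: arrays of PC1 scores for models trained
# on two different chemical structure databases (hypothetical names)
t_stat, p_value = stats.ttest_ind(pc1_scores_a, pc1_scores_b)
significant = p_value <= 0.05   # uncorrected, two-sided
```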
Extended Data Fig. 6 Low-data generative models of diverse and homogeneous molecules from the ChEMBL and ZINC databases.
a, PCA of top-performing metrics for molecules generated by chemical language models trained on varying numbers of more or less diverse molecules from the GDB, ChEMBL, and ZINC databases, colored by the size of the training dataset. b, As in a, but colored by the chemical database on which the generative models were trained. c, As in a, but colored by the diversity (minimum Tanimoto coefficient to a randomly selected ‘founder’ molecule). d-i, Performance of chemical language models trained on samples of molecules from the ChEMBL (d-f) and ZINC (g-i) databases with a minimum Tanimoto coefficient (Tc) to a randomly selected ‘founder’ molecule. d, Proportion of valid SMILES generated by chemical language models trained on varying numbers of more or less diverse molecules from the ChEMBL database. e, Fréchet ChemNet distances of chemical language models trained on varying numbers of more or less diverse molecules from the ChEMBL database. f, PC1 scores of chemical language models trained on varying numbers of more or less diverse molecules from the ChEMBL database. g, Proportion of valid SMILES generated by chemical language models trained on varying numbers of more or less diverse molecules from the ZINC database. h, Fréchet ChemNet distances of chemical language models trained on varying numbers of more or less diverse molecules from the ZINC database. i, PC1 scores of chemical language models trained on varying numbers of more or less diverse molecules from the ZINC database.
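More or less diverse training sets are defined here by a minimum Tanimoto coefficient to a randomly selected 'founder' molecule. A minimal sketch of that filtering step is shown below; the fingerprint type and parameters are assumptions for illustration.

```python
import random
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def founder_subset(smiles_list, min_tc=0.2, seed=0):
    random.seed(seed)
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    mols = [m for m in mols if m is not None]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024) for m in mols]

    # pick a random 'founder' and keep molecules at least min_tc similar to it
    founder = random.randrange(len(fps))
    return [Chem.MolToSmiles(m)
            for m, fp in zip(mols, fps)
            if DataStructs.TanimotoSimilarity(fps[founder], fp) >= min_tc]
```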
Extended Data Fig. 7 Evaluating alternative molecular representations for low-data generative models in distinct chemical spaces.
a, Proportion of valid SMILES generated by chemical language models trained on one of three string representations of molecules from the ChEMBL, COCONUT, and GDB databases. b, PCA of top-performing metrics for molecules generated by n = 1,320 chemical language models trained on one of three string representations of molecules from the ChEMBL, COCONUT, and GDB databases, colored by the size of the training dataset. c, As in b, but colored by the chemical database on which the generative models were trained. d, As in b, but colored by molecular representation. e, PC1 scores of chemical language models trained on one of three string representations of molecules from the ChEMBL, COCONUT, and GDB databases. f, Fréchet ChemNet distances of chemical language models trained on one of three string representations of molecules from the ChEMBL, COCONUT, and GDB databases. g, Mean difference in PC1 scores between chemical language models trained on varying numbers of molecules sampled from the ChEMBL, COCONUT, and GDB databases, represented either as DeepSMILES or SELFIES, y-axis, or SMILES, x-axis. Dark squares indicate pairs without statistically significant differences (uncorrected p > 0.05, two-sided t-test).
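The three string representations compared in this figure can be interconverted from SMILES; the sketch below assumes the selfies and deepsmiles Python packages are installed and uses aspirin as an example input.

```python
import selfies
import deepsmiles

smiles = "CC(=O)Oc1ccccc1C(=O)O"               # aspirin, example input

selfies_string = selfies.encoder(smiles)        # SELFIES representation
converter = deepsmiles.Converter(rings=True, branches=True)
deepsmiles_string = converter.encode(smiles)    # DeepSMILES representation

# both representations can be decoded back to SMILES for evaluation
roundtrip = selfies.decoder(selfies_string)
```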
Extended Data Fig. 8 Data augmentation by non-canonical SMILES enumeration.
a, Proportion of valid SMILES generated by chemical language models trained on molecules from the ChEMBL, COCONUT, and GDB databases after varying degrees of non-canonical SMILES enumeration. b, Data as in a and Fig. 3i, but showing the relationship between the size of the training dataset and the proportion of valid SMILES generated by models for each degree of non-canonical SMILES enumeration separately. c, PCA of top-performing metrics for molecules generated by n = 1,760 chemical language models trained on molecules from the ChEMBL, COCONUT, and GDB databases after varying degrees of non-canonical SMILES enumeration, colored by the size of the training dataset. d, As in c, but colored by the chemical database on which the generative models were trained. e, As in c, but colored by the amount of SMILES enumeration. f, PC1 scores of chemical language models trained on molecules from the ChEMBL, COCONUT, and GDB databases after varying degrees of non-canonical SMILES enumeration. g, Mean difference in PC1 scores between chemical language models trained on molecules from the ChEMBL, COCONUT, and GDB databases represented as canonical SMILES, x-axis, or non-canonical SMILES after varying degrees of data augmentation, y-axis. Dark squares indicate pairs without statistically significant differences (uncorrected p > 0.05, two-sided t-test).
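Non-canonical SMILES enumeration writes each training molecule out several times with a randomized atom ordering. A minimal sketch of this augmentation step using RDKit is shown below; the augmentation factor and retry limit are illustrative choices.

```python
from rdkit import Chem

def enumerate_smiles(smiles, n_variants=10):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    variants, attempts = set(), 0
    # small molecules may admit fewer distinct strings, so bound the attempts
    while len(variants) < n_variants and attempts < 10 * n_variants:
        variants.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
        attempts += 1
    return sorted(variants)
```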
Extended Data Fig. 9 Hyperparameter tuning in the ChEMBL database.
a, PCA of top-performing metrics for molecules generated by n = 1,210 chemical language models, trained on varying numbers of molecules from the ChEMBL database with varying model hyperparameters, colored by the size of the training dataset. b, Mean PC1 scores of chemical language models as a function of the total number of neurons in the model. Solid lines show local polynomial regression. c, Mean PC1 scores for molecules trained on the ChEMBL database, as a function of both the number of molecules in the training dataset, x-axis, and varying hyperparameters, y-axis. The mean of five independent replicates is shown. d, Proportion of n = 110 chemical language models with varying hyperparameters, trained on the number of molecules shown on the y-axis, that outperformed a model without hyperparameter tuning trained on the number of molecules shown on the x-axis.
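The hyperparameters varied here (embedding size, hidden size, number of layers, GRU versus LSTM) correspond to the components of an RNN chemical language model. The sketch below shows one such model in PyTorch with illustrative default dimensions, not the authors' tuned values.

```python
import torch.nn as nn

class SmilesRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=512,
                 n_layers=3, rnn_type="GRU", dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        rnn_cls = nn.GRU if rnn_type == "GRU" else nn.LSTM
        self.rnn = rnn_cls(embedding_dim, hidden_dim, num_layers=n_layers,
                           batch_first=True, dropout=dropout)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        # tokens: (batch, sequence length) integer indices of SMILES tokens
        embedded = self.embedding(tokens)
        out, hidden = self.rnn(embedded, hidden)
        return self.output(out), hidden   # next-token logits at each step

# training minimizes cross-entropy between predicted and observed next tokens
model = SmilesRNN(vocab_size=64)
loss_fn = nn.CrossEntropyLoss()
```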
Extended Data Fig. 10 Optimizing generative models of bacterial, fungal, and plant metabolomes.
a, PCA of top-performing metrics for molecules generated by n = 48 chemical language models, trained on bacterial, fungal, or plant metabolomes with varying inputs and hyperparameters, colored by the target metabolome. b, As in a, but colored by the molecular representation and data augmentation strategy. c, As in a, but colored by the RNN architecture. d, Proportion of valid molecules produced by generative models of metabolomes trained with different molecular representations (SMILES, DeepSMILES, or SELFIES), data augmentation strategies (non-canonical SMILES enumeration with an augmentation factor of between 2x and 30x), and RNN architectures (GRU or LSTM). e, As in d, but showing the Fréchet ChemNet distance between generated and real metabolites. f, As in d, but showing the Jensen-Shannon distance of the proportion of stereocenters between generated and real metabolites.
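The stereocenter comparison in panel f is a Jensen-Shannon distance between the distributions of stereocenter counts in generated and real metabolites. A minimal sketch is shown below, assuming both sets are available as SMILES lists; the binning is an illustrative choice.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from rdkit import Chem

def stereocenter_distribution(smiles_list, max_centers=10):
    counts = np.zeros(max_centers + 1)
    for s in smiles_list:
        mol = Chem.MolFromSmiles(s)
        if mol is None:
            continue
        n = len(Chem.FindMolChiralCenters(mol, includeUnassigned=True))
        counts[min(n, max_centers)] += 1
    return counts / counts.sum()

def stereocenter_jsd(generated_smiles, real_smiles):
    p = stereocenter_distribution(generated_smiles)
    q = stereocenter_distribution(real_smiles)
    return jensenshannon(p, q)
```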
Supplementary information
Supplementary Information
Supplementary Figs. 1–5.
Supplementary Data 1
Metrics for all 8,447 models discussed in this study.
About this article
Cite this article
Skinnider, M.A., Stacey, R.G., Wishart, D.S. et al. Chemical language models enable navigation in sparsely populated chemical space. Nat Mach Intell 3, 759–770 (2021). https://doi.org/10.1038/s42256-021-00368-1