
Chemical language models enable navigation in sparsely populated chemical space

A preprint version of the article is available at ChemRxiv.

Abstract

Deep generative models are powerful tools for the exploration of chemical space, enabling the on-demand generation of molecules with desired physical, chemical or biological properties. However, these models are typically thought to require training datasets comprising hundreds of thousands, or even millions, of molecules. This perception limits the application of deep generative models in regions of chemical space populated by a relatively small number of examples. Here, we systematically evaluate and optimize generative models of molecules based on recurrent neural networks in low-data settings. We find that robust models can be learned from far fewer examples than has been widely assumed. We identify strategies that further reduce the number of molecules required to learn a model of equivalent quality, notably including data augmentation by non-canonical SMILES enumeration, and demonstrate the application of these principles by learning models of bacterial, plant and fungal metabolomes. The structure of our experiments also allows us to benchmark the metrics used to evaluate generative models themselves. We find that many of the most widely used metrics in the field fail to capture model quality, but we identify a subset of well-behaved metrics that provide a sound basis for model development. Collectively, our work provides a foundation for directly learning generative models in sparsely populated regions of chemical space.


Fig. 1: Learning generative models of molecules from limited training examples.
Fig. 2: Low-data generative models of distinct chemical spaces.
Fig. 3: Low-data generative models of diverse and homogeneous molecules.
Fig. 4: Alternative molecular representations for low-data generative models.
Fig. 5: Data, not architecture, dictates the performance of low-data generative models.
Fig. 6: Low-data generative models of bacterial, fungal and plant metabolomes.

Data availability

Input datasets used to train chemical language models are available from Zenodo (ref. 80). Calculated metrics for all 8,447 models discussed in this study are provided as Supplementary Data 1.

Code availability

Code used to train and evaluate chemical language models is available from GitHub at http://github.com/skinnider/low-data-generative-models (ref. 81).

References

  1. Bohacek, R. S., McMartin, C. & Guida, W. C. The art and practice of structure-based drug design: a molecular modeling perspective. Med. Res. Rev. 16, 3–50 (1996).

  2. Virshup, A. M., Contreras-García, J., Wipf, P., Yang, W. & Beratan, D. N. Stochastic voyages into uncharted chemical space produce a representative library of all possible drug-like compounds. J. Am. Chem. Soc. 135, 7296–7303 (2013).

  3. van Deursen, R. & Reymond, J.-L. Chemical space travel. ChemMedChem 2, 636–640 (2007).

  4. Lameijer, E.-W., Kok, J. N., Bäck, T. & Ijzerman, A. P. The molecule evoluator. An interactive evolutionary algorithm for the design of drug-like molecules. J. Chem. Inf. Model. 46, 545–552 (2006).

  5. Pollock, S. N., Coutsias, E. A., Wester, M. J. & Oprea, T. I. Scaffold topologies. 1. Exhaustive enumeration up to eight rings. J. Chem. Inf. Model. 48, 1304–1310 (2008).

  6. Fink, T. & Reymond, J.-L. Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes and drug discovery. J. Chem. Inf. Model. 47, 342–353 (2007).

  7. Blum, L. C. & Reymond, J.-L. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc. 131, 8732–8733 (2009).

  8. Ruddigkeit, L., van Deursen, R., Blum, L. C. & Reymond, J.-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model. 52, 2864–2875 (2012).

  9. Elton, D. C., Boukouvalas, Z., Fuge, M. D. & Chung, P. W. Deep learning for molecular design - a review of the state of the art. Mol. Syst. Des. Eng. 4, 828–849 (2019).

  10. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model. 28, 31–36 (1988).

  11. Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).

  12. Segler, M. H. S., Kogej, T., Tyrchan, C. & Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 4, 120–131 (2018).

  13. Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminform. 9, 48 (2017).

  14. Arús-Pous, J. et al. Exploring the GDB-13 chemical space using deep generative models. J. Cheminform. 11, 20 (2019).

  15. Merk, D., Friedrich, L., Grisoni, F. & Schneider, G. De novo design of bioactive small molecules by artificial intelligence. Mol. Inform. 37, 1700153 (2018).

  16. Moret, M., Friedrich, L., Grisoni, F., Merk, D. & Schneider, G. Generative molecular design in low data regimes. Nat. Mach. Intell. 2, 171–180 (2020).

  17. Popova, M., Isayev, O. & Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv. 4, eaap7885 (2018).

  18. Kotsias, P.-C. et al. Direct steering of de novo molecular generation with descriptor conditional recurrent neural networks. Nat. Mach. Intell. 2, 254–265 (2020).

  19. Li, Y., Zhang, L. & Liu, Z. Multi-objective de novo drug design with conditional graph generative model. J. Cheminform. 10, 33 (2018).

  20. Zhou, Z., Kearnes, S., Li, L., Zare, R. N. & Riley, P. Optimization of molecules via deep reinforcement learning. Sci. Rep. 9, 10752 (2019).

  21. Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In Proc. 35th International Conference on Machine Learning Vol. 80 (eds Dy, J. & Krause, A.) 2323–2332 (PMLR, 2018).

  22. Brown, N., Fiscato, M., Segler, M. H. S. & Vaucher, A. C. GuacaMol: benchmarking models for de novo molecular design. J. Chem. Inf. Model. 59, 1096–1108 (2019).

  23. Polykovskiy, D. et al. Molecular sets (MOSES): a benchmarking platform for molecular generation models. Front. Pharmacol. 11, 565644 (2020).

  24. Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: generative models for matter engineering. Science 361, 360–365 (2018).

  25. Ståhl, N., Falkman, G., Karlsson, A., Mathiason, G. & Boström, J. Deep reinforcement learning for multiparameter optimization in de novo drug design. J. Chem. Inf. Model. 59, 3166–3176 (2019).

  26. Liu, X., Ye, K., van Vlijmen, H. W. T., IJzerman, A. P. & van Westen, G. J. P. An exploration strategy improves the diversity of de novo ligands using deep reinforcement learning: a case for the adenosine A2A receptor. J. Cheminform. 11, 35 (2019).

  27. Neil, D. et al. Exploring deep recurrent models with reinforcement learning for molecule design. In Proc. 6th International Conference on Learning Representations (ICLR, 2018).

  28. Amabilino, S., Pogány, P., Pickett, S. D. & Green, D. V. S. Guidelines for recurrent neural network transfer learning-based molecular generation of focused libraries. J. Chem. Inf. Model. 60, 5699–5713 (2020).

  29. Gupta, A. et al. Generative recurrent networks for de novo drug design. Mol. Inform. 37, 1700111 (2018).

  30. Awale, M., Sirockin, F., Stiefl, N. & Reymond, J.-L. Drug analogs from fragment-based long short-term memory generative neural networks. J. Chem. Inf. Model. 59, 1347–1356 (2019).

  31. Merk, D., Grisoni, F., Friedrich, L. & Schneider, G. Tuning artificial intelligence on the de novo design of natural-product-inspired retinoid X receptor modulators. Commun. Chem. 1, 68 (2018).

  32. Renz, P., Van Rompaey, D., Wegner, J. K., Hochreiter, S. & Klambauer, G. On failure modes in molecule generation and optimization. Drug Discov. Today Technol. 32–33, 55–63 (2019).

  33. Arús-Pous, J. et al. Randomized SMILES strings improve the quality of molecular generative models. J. Cheminform. 11, 71 (2019).

  34. Irwin, J. J. & Shoichet, B. K. ZINC—a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 45, 177–182 (2005).

  35. Benhenda, M. Can AI reproduce observed chemical diversity? Preprint at bioRxiv https://doi.org/10.1101/292177 (2018).

  36. Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S. & Klambauer, G. Fréchet ChemNet distance: a metric for generative models for molecules in drug discovery. J. Chem. Inf. Model. 58, 1736–1741 (2018).

  37. van Deursen, R., Ertl, P., Tetko, I. V. & Godin, G. GEN: highly efficient SMILES explorer using autodidactic generative examination networks. J. Cheminform. 12, 22 (2020).

  38. Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2019).

  39. Sorokina, M. & Steinbeck, C. Review on natural products databases: where to find data in 2020. J. Cheminform. 12, 20 (2020).

  40. Sanchez-Lengeling, B., Outeiral, C., Guimaraes, G. L. & Aspuru-Guzik, A. Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry (ORGANIC). Preprint at https://doi.org/10.26434/chemrxiv.5309668.v3 (2017).

  41. O’Boyle, N. & Dalke, A. DeepSMILES: an adaptation of SMILES for use in machine-learning of chemical structures. Preprint at https://doi.org/10.26434/chemrxiv.7097960 (2018).

  42. Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach. Learn. Sci. Technol. 1, 045024 (2020).

  43. Kusner, M. J., Paige, B. & Hernandez-Lobato, J. M. Grammar variational autoencoder. Preprint at https://arxiv.org/pdf/1703.01925.pdf (2017).

  44. Dai, H., Tian, Y., Dai, B., Skiena, S. & Song, L. Syntax-directed variational autoencoder for structured data. Preprint at https://arxiv.org/pdf/1802.08786.pdf (2018).

  45. Bjerrum, E. J. SMILES enumeration as data augmentation for neural network modeling of molecules. Preprint at https://arxiv.org/pdf/1703.07076.pdf (2017).

  46. Bjerrum, E. J. & Sattarov, B. Improving chemical autoencoder latent space and molecular de novo generation diversity with heteroencoders. Biomolecules 8, 131 (2018).

  47. Winter, R., Montanari, F., Noé, F. & Clevert, D.-A. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem. Sci. 10, 1692–1701 (2019).

  48. Zhang, Q. et al. Structural investigation of ribosomally synthesized natural products by hypothetical structure enumeration and evaluation using tandem MS. Proc. Natl Acad. Sci. USA 111, 12031–12036 (2014).

  49. Johnston, C. W. et al. An automated genomes-to-natural products platform (GNP) for the discovery of modular natural products. Nat. Commun. 6, 8421 (2015).

  50. Zheng, S. et al. QBMG: quasi-biogenic molecule generator with deep recurrent neural network. J. Cheminform. 11, 5 (2019).

  51. da Silva, R. R., Dorrestein, P. C. & Quinn, R. A. Illuminating the dark matter in metabolomics. Proc. Natl Acad. Sci. USA 112, 12549–12550 (2015).

  52. Vanhaelen, Q., Lin, Y.-C. & Zhavoronkov, A. The advent of generative chemistry. ACS Med. Chem. Lett. 11, 1496–1505 (2020).

  53. Coley, C. W., Eyke, N. S. & Jensen, K. F. Autonomous discovery in the chemical sciences part II: outlook. Angew. Chem. Int. Ed. 59, 23414–23436 (2020).

  54. Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).

  55. Kadurin, A., Nikolenko, S., Khrabrov, K., Aliper, A. & Zhavoronkov, A. druGAN: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico. Mol. Pharm. 14, 3098–3104 (2017).

  56. Samanta, B. et al. NEVAE: a deep generative model for molecular graphs. J. Mach. Learn. Res. 21, 1–33 (2020).

  57. Mercado, R. et al. Practical notes on building molecular graph generative models. Appl. AI Lett. https://doi.org/10.1002/ail2.18 (2020).

  58. De Cao, N. & Kipf, T. MolGAN: an implicit generative model for small molecular graphs. Preprint at https://arxiv.org/pdf/1805.11973.pdf (2018).

  59. Jaeger, S., Fulle, S. & Turk, S. Mol2vec: unsupervised machine learning approach with chemical intuition. J. Chem. Inf. Model. 58, 27–35 (2018).

  60. O’Boyle, N. & Dalke, A. DeepSMILES: an adaptation of SMILES for use in machine-learning of chemical structures. Preprint at https://doi.org/10.26434/chemrxiv.7097960.v1 (2018).

  61. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).

  62. O’Boyle, N. M. & Sayle, R. A. Comparing structural fingerprints using a literature-based similarity benchmark. J. Cheminform. 8, 36 (2016).

  63. Skinnider, M. A., Dejong, C. A., Franczak, B. C., McNicholas, P. D. & Magarvey, N. A. Comparative analysis of chemical similarity methods for modular natural products with a hypothetical structure enumeration algorithm. J. Cheminform. 9, 46 (2017).

  64. Smith, S. L., Kindermans, P.-J. & Le, Q. V. Don’t decay the learning rate, increase the batch size. Preprint at https://arxiv.org/pdf/1711.00489.pdf (2017).

  65. Bertz, S. H. The first general index of molecular complexity. J. Am. Chem. Soc. 103, 3599–3601 (1981).

  66. Wildman, S. A. & Crippen, G. M. Prediction of physicochemical parameters by atomic contributions. J. Chem. Inf. Comput. Sci. 39, 868–873 (1999).

  67. Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996).

  68. Ertl, P., Roggo, S. & Schuffenhauer, A. Natural product-likeness score and its application for prioritization of compound libraries. J. Chem. Inf. Model. 48, 68–74 (2008).

  69. Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90–98 (2012).

  70. Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 1, 8 (2009).

  71. Ertl, P., Rohde, B. & Selzer, P. Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. J. Med. Chem. 43, 3714–3717 (2000).

  72. Sajed, T. et al. ECMDB 2.0: a richer resource for understanding the biochemistry of E. coli. Nucleic Acids Res. 44, D495–D501 (2016).

  73. Huang, W. et al. PAMDB: a comprehensive Pseudomonas aeruginosa metabolome database. Nucleic Acids Res. 46, D575–D580 (2018).

  74. Moumbock, A. F. A. et al. StreptomeDB 3.0: an updated compendium of streptomycetes natural products. Nucleic Acids Res. 49, D600–D604 (2020).

  75. Zeng, X. et al. NPASS: natural product activity and species source database for natural product research, discovery and tool development. Nucleic Acids Res. 46, D1217–D1222 (2018).

  76. Karp, P. D. et al. The BioCyc collection of microbial genomes and metabolic pathways. Brief. Bioinform. 20, 1085–1093 (2019).

  77. Neveu, V. et al. Phenol-Explorer: an online comprehensive database on polyphenol contents in foods. Database (Oxford) 2010, bap024 (2010).

  78. Ramirez-Gaona, M. et al. YMDB 2.0: a significantly expanded version of the yeast metabolome database. Nucleic Acids Res. 45, D440–D445 (2017).

  79. McInnes, L., Healy, J. & Melville, J. UMAP: Uniform Manifold Approximation and Projection for dimension reduction. Preprint at https://arxiv.org/pdf/1802.03426.pdf (2018).

  80. Molecules used to train generative models (Zenodo, 2021); https://doi.org/10.5281/zenodo.4641960

  81. Python source code used to train and evaluate generative models of molecules (Zenodo, 2021); https://doi.org/10.5281/zenodo.4642099


Acknowledgements

This work was supported by funding from Genome Canada, Genome British Columbia and Genome Alberta (project nos. 284MBO and 264PRO). Computational resources were provided by WestGrid, Compute Canada and Advanced Research Computing at the University of British Columbia. M.A.S. acknowledges support from a CIHR Vanier Canada Graduate Scholarship, a Roman M. Babicki Fellowship in Medical Research, a Borealis AI Graduate Fellowship, a Walter C. Sumner Memorial Fellowship and a Vancouver Coastal Health–CIHR–UBC MD/PhD Studentship. We thank J. Liigand and F. Wang for helpful discussions.

Author information

Contributions

M.A.S., D.S.W. and L.J.F. designed experiments. M.A.S. and R.G.S. performed experiments. M.A.S. wrote the manuscript. All authors edited the manuscript.

Corresponding authors

Correspondence to Michael A. Skinnider or Leonard J. Foster.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Machine Intelligence thanks Sebastian Raschka and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Evaluating low-data generative models of purchasable chemical space.

a, Schematic overview of the ‘% valid’, ‘% unique’, and ‘% novel’ metrics. b, Values of the five top-performing metrics with the strongest correlations (ρ ≥ 0.82) to training dataset size for n = 110 generative models trained on varying numbers of molecules from the ZINC database. c, Values of five exemplary metrics with moderate to weak correlations (0.48 ≤ ρ ≤ 0.73) to training dataset size for n = 110 generative models trained on varying numbers of molecules from the ZINC database. d, Values of five exemplary metrics with little or no correlation (ρ ≤ 0.36) to training dataset size for n = 110 generative models trained on varying numbers of molecules from the ZINC database.
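The '% valid', '% unique' and '% novel' metrics summarized in panel a reduce to a few set operations over sampled SMILES strings. The following is a minimal illustrative sketch, not the authors' code: the validity test is delegated to a caller-supplied `is_valid` function (with RDKit this would typically be `lambda s: Chem.MolFromSmiles(s) is not None`), and the denominators follow one common convention (% unique over valid samples, % novel over unique structures), which is an assumption about the paper's exact definitions.

```python
def evaluate_samples(generated, training_set, is_valid):
    """Compute '% valid', '% unique' and '% novel' for a list of
    generated SMILES strings.

    is_valid: callable returning True if a SMILES string parses; a
    cheminformatics toolkit (e.g. RDKit) would supply this in practice.
    """
    if not generated:
        return {"pct_valid": 0.0, "pct_unique": 0.0, "pct_novel": 0.0}
    valid = [smi for smi in generated if is_valid(smi)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        # valid structures as a fraction of all sampled strings
        "pct_valid": 100.0 * len(valid) / len(generated),
        # distinct structures as a fraction of the valid samples
        "pct_unique": 100.0 * len(unique) / len(valid) if valid else 0.0,
        # distinct structures not seen during training
        "pct_novel": 100.0 * len(novel) / len(unique) if unique else 0.0,
    }
```

In a real evaluation the strings would also be canonicalized before the set comparisons, so that two different SMILES spellings of the same molecule are not counted as distinct.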

Extended Data Fig. 2 Evaluating low-data generative models of divergent chemical spaces.

a, Values of the five top-performing metrics with the strongest correlations (average rank correlation ≥ 0.80) to training dataset size for n = 440 generative models trained on varying numbers of molecules from the ChEMBL, COCONUT, GDB, or ZINC databases. Points and error bars show the mean and standard deviation, respectively, of ten independent replicates. b, Values of five exemplary metrics with moderate to weak correlations to training dataset size for n = 440 generative models trained on varying numbers of molecules from the ChEMBL, COCONUT, GDB, or ZINC databases. c, Values of five exemplary metrics with little or no correlation to training dataset size for n = 440 generative models trained on varying numbers of molecules from the ChEMBL, COCONUT, GDB, or ZINC databases. d, PC1 scores for n = 440 chemical language models trained on varying numbers of molecules sampled from the ChEMBL, COCONUT, GDB, or ZINC databases. Inset text shows the Spearman correlation. e, Factor loadings onto the first principal component in a PCA of n = 440 chemical language models trained on varying numbers of molecules sampled from the ChEMBL, COCONUT, GDB, or ZINC databases.
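The PC1 scores reported in panels d and e come from a principal component analysis of the per-model metric matrix, with rank correlations against training dataset size. As a sketch of that evaluation step (an assumed reimplementation, not the authors' pipeline), both can be computed with NumPy alone:

```python
import numpy as np

def pc1_scores(metrics):
    """Project a (models x metrics) matrix onto its first principal
    component after z-scoring each metric column, via SVD.
    Assumes no metric column is constant (non-zero std)."""
    X = np.asarray(metrics, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[0]  # scores on PC1 (sign is arbitrary)

def spearman(x, y):
    """Spearman rank correlation; ties are not handled in this sketch."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

Because the sign of a principal component is arbitrary, comparisons across PCAs (as in Extended Data Fig. 3) require fixing the orientation, for example so that PC1 correlates positively with training dataset size.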

Extended Data Fig. 3 Robustness of principal component analysis for the evaluation of chemical generative models.

a, PCA of top-performing metrics, top, and PC1 scores, bottom, for chemical language models trained on varying numbers of molecules sampled from the ChEMBL, COCONUT, GDB, and ZINC database, with PCA performed separately for each database. Bottom, inset text shows the Spearman correlation. b, PCA of top-performing metrics for chemical language models trained on varying numbers of molecules sampled from three of four databases, colored by the size of the training dataset, top, or the chemical database on which the generative models were trained, middle. Bottom, PC1 scores for models trained on the withheld database, projected onto the coordinate basis of the other three databases. Inset text shows the Spearman correlation.

Extended Data Fig. 4 Learning chemical language models from less than 1,000 examples.

a, Proportion of valid SMILES generated by chemical language models trained on samples of between 200 and 1,000 molecules from one of four chemical databases. b, Fréchet ChemNet distance of chemical language models trained on samples of between 200 and 1,000 molecules from one of four chemical databases. c, PC1 scores of chemical language models trained on samples of between 200 and 1,000 molecules from one of four chemical databases.

Extended Data Fig. 5 Training dataset size requirements in different chemical spaces.

Mean difference in PC1 scores between chemical language models trained on varying numbers of molecules sampled from each pair of chemical structure databases. Dark squares indicate pairs without statistically significant differences (uncorrected p > 0.05, two-sided t-test).

Extended Data Fig. 6 Low-data generative models of diverse and homogeneous molecules from the ChEMBL and ZINC databases.

a, PCA of top-performing metrics for molecules generated by chemical language models trained on varying numbers of more or less diverse molecules from the GDB, ChEMBL, and ZINC databases, colored by the size of the training dataset. b, As in a, but colored by the chemical database on which the generative models were trained. c, As in a, but colored by the diversity (minimum Tanimoto coefficient to a randomly selected ‘founder’ molecule). d-i, Performance of chemical language models trained on samples of molecules from the ChEMBL (d-f) and ZINC (g-i) databases with a minimum Tanimoto coefficient (Tc) to a randomly selected ‘founder’ molecule. d, Proportion of valid SMILES generated by chemical language models trained on varying numbers of more or less diverse molecules from the ChEMBL database. e, Fréchet ChemNet distances of chemical language models trained on varying numbers of more or less diverse molecules from the ChEMBL database. f, PC1 scores of chemical language models trained on varying numbers of more or less diverse molecules from the ChEMBL database. g, Proportion of valid SMILES generated by chemical language models trained on varying numbers of more or less diverse molecules from the ZINC database. h, Fréchet ChemNet distances of chemical language models trained on varying numbers of more or less diverse molecules from the ZINC database. i, PC1 scores of chemical language models trained on varying numbers of more or less diverse molecules from the ZINC database.

Extended Data Fig. 7 Evaluating alternative molecular representations for low-data generative models in distinct chemical spaces.

a, Proportion of valid SMILES generated by chemical language models trained on one of three string representations of molecules from the ChEMBL, COCONUT, and GDB databases. b, PCA of top-performing metrics for molecules generated by n = 1,320 chemical language models trained on one of three string representations of molecules from the ChEMBL, COCONUT, and GDB databases, colored by the size of the training dataset. c, As in b, but colored by the chemical database on which the generative models were trained. d, As in b, but colored by molecular representation. e, PC1 scores of chemical language models trained on one of three string representations of molecules from the ChEMBL, COCONUT, and GDB databases. f, Fréchet ChemNet distances of chemical language models trained on one of three string representations of molecules from the ChEMBL, COCONUT, and GDB databases. g, Mean difference in PC1 scores between chemical language models trained on varying numbers of molecules sampled from the ChEMBL, COCONUT, and GDB databases, represented either as DeepSMILES or SELFIES, y-axis, or SMILES, x-axis. Dark squares indicate pairs without statistically significant differences (uncorrected p > 0.05, two-sided t-test).

Extended Data Fig. 8 Data augmentation by non-canonical SMILES enumeration.

a, Proportion of valid SMILES generated by chemical language models trained on molecules from the ChEMBL, COCONUT, and GDB databases after varying degrees of non-canonical SMILES enumeration. b, Data as in a and Fig. 3i, but showing the relationship between the size of the training dataset and the proportion of valid SMILES generated by models for each degree of non-canonical SMILES enumeration separately. c, PCA of top-performing metrics for molecules generated by n = 1,760 chemical language models trained on molecules from the ChEMBL, COCONUT, and GDB databases after varying degrees of non-canonical SMILES enumeration, colored by the size of the training dataset. d, As in c, but colored by the chemical database on which the generative models were trained. e, As in c, but colored by the amount of SMILES enumeration. f, PC1 scores of chemical language models trained on molecules from the ChEMBL, COCONUT, and GDB databases after varying degrees of non-canonical SMILES enumeration. g, Mean difference in PC1 scores between chemical language models trained on molecules from the ChEMBL, COCONUT, and GDB databases represented as canonical SMILES, x-axis, or non-canonical SMILES after varying degrees of data augmentation, y-axis. Dark squares indicate pairs without statistically significant differences (uncorrected p > 0.05, two-sided t-test).
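Non-canonical SMILES enumeration, the augmentation strategy evaluated here, writes each training molecule as several randomized SMILES strings. A hedged sketch using RDKit's `MolToSmiles` with `doRandom=True` (an illustrative helper, not the authors' implementation; the deduplication loop and its retry budget are assumptions):

```python
from rdkit import Chem

def enumerate_smiles(smiles, n_variants=10):
    """Return up to n_variants distinct non-canonical SMILES for one
    molecule by randomizing the atom traversal order on output."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"unparseable SMILES: {smiles}")
    variants = set()
    # doRandom=True asks RDKit for a random (non-canonical) output order;
    # small molecules may admit fewer than n_variants distinct strings
    for _ in range(10 * n_variants):
        variants.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
        if len(variants) >= n_variants:
            break
    return sorted(variants)
```

Every enumerated string still canonicalizes back to the same molecule, which is what makes this a label-preserving augmentation for a chemical language model.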

Extended Data Fig. 9 Hyperparameter tuning in the ChEMBL database.

a, PCA of top-performing metrics for molecules generated by n = 1,210 chemical language models, trained on varying numbers of molecules from the ChEMBL database with varying model hyperparameters, colored by the size of the training dataset. b, Mean PC1 scores of chemical language models as a function of the total number of neurons in the model. Solid lines show local polynomial regression. c, Mean PC1 scores for molecules trained on the ChEMBL database, as a function of both the number of molecules in the training dataset, x-axis, and varying hyperparameters, y-axis. The mean of five independent replicates is shown. d, Proportion of n = 110 chemical language models with varying hyperparameters, trained on the number of molecules shown on the y-axis, that outperformed a model without hyperparameter tuning trained on the number of molecules shown on the x-axis.

Extended Data Fig. 10 Optimizing generative models of bacterial, fungal, and plant metabolomes.

a, PCA of top-performing metrics for molecules generated by n = 48 chemical language models, trained on bacterial, fungal, or plant metabolomes with varying inputs and hyperparameters, colored by the target metabolome. b, As in a, but colored by the molecular representation and data augmentation strategy. c, As in a, but colored by the RNN architecture. d, Proportion of valid molecules produced by generative models of metabolomes trained with different molecular representations (SMILES, DeepSMILES, or SELFIES), data augmentation strategies (non-canonical SMILES enumeration with an augmentation factor of between 2x and 30x), and RNN architectures (GRU or LSTM). e, As in d, but showing the Fréchet ChemNet distance between generated and real metabolites. f, As in d, but showing the Jensen-Shannon distance of the proportion of stereocenters between generated and real metabolites.
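The Jensen-Shannon distance in panel f compares a property distribution (here, the proportion of stereocenters) between generated and real metabolites. A NumPy sketch of the metric, using log base 2 so the distance is bounded by 1 (the base actually used in the paper is an assumption):

```python
import numpy as np

def jensen_shannon_distance(p, q, base=2.0):
    """Jensen-Shannon distance (square root of the JS divergence)
    between two discrete distributions, e.g. histograms of the
    fraction of stereocenters in generated vs. real molecules."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        mask = a > 0  # 0 * log(0) is taken as 0
        return np.sum(a[mask] * (np.log(a[mask] / b[mask]) / np.log(base)))

    return float(np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m)))
```

The distance is 0 for identical distributions and, with base 2, reaches 1 for distributions with disjoint support, which makes it convenient for comparing models across metabolomes.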

Supplementary information

Supplementary Information

Supplementary Figs. 1–5.

Supplementary Data 1

Metrics for all 8,447 models discussed in this study.

About this article

Cite this article

Skinnider, M. A., Stacey, R. G., Wishart, D. S. et al. Chemical language models enable navigation in sparsely populated chemical space. Nat. Mach. Intell. 3, 759–770 (2021). https://doi.org/10.1038/s42256-021-00368-1
