
Chemical language models enable navigation in sparsely populated chemical space

A preprint version of the article is available at ChemRxiv.

Abstract

Deep generative models are powerful tools for the exploration of chemical space, enabling the on-demand generation of molecules with desired physical, chemical or biological properties. However, these models are typically thought to require training datasets comprising hundreds of thousands, or even millions, of molecules. This perception limits the application of deep generative models in regions of chemical space populated by a relatively small number of examples. Here, we systematically evaluate and optimize generative models of molecules based on recurrent neural networks in low-data settings. We find that robust models can be learned from far fewer examples than has been widely assumed. We identify strategies that further reduce the number of molecules required to learn a model of equivalent quality, notably including data augmentation by non-canonical SMILES enumeration, and demonstrate the application of these principles by learning models of bacterial, plant and fungal metabolomes. The structure of our experiments also allows us to benchmark the metrics used to evaluate generative models themselves. We find that many of the most widely used metrics in the field fail to capture model quality, but we identify a subset of well-behaved metrics that provide a sound basis for model development. Collectively, our work provides a foundation for directly learning generative models in sparsely populated regions of chemical space.
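To make the data-augmentation strategy concrete: non-canonical SMILES enumeration writes the same molecule as many different strings by randomizing the atom ordering used when generating the SMILES. Below is a minimal RDKit sketch of the general technique; it is an illustration, not the authors' implementation, and the helper name and oversampling factor are ours.

```python
# Minimal sketch of data augmentation by non-canonical SMILES enumeration.
# Illustrative only, not the authors' code. Requires RDKit.
from rdkit import Chem

def enumerate_smiles(smiles, n_variants=10):
    """Return up to n_variants distinct non-canonical SMILES for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    variants = set()
    for _ in range(n_variants * 5):  # oversample; random strings can collide
        variants.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
        if len(variants) >= n_variants:
            break
    return sorted(variants)

# Example: augment a small training set tenfold before fitting a language model.
training_set = ["CCO", "c1ccccc1O"]
augmented = [s for smi in training_set for s in enumerate_smiles(smi, 10)]
```

Every variant encodes the same molecule, so training on the enumerated strings exposes the model to many string views of each structure, consistent with the finding that enumeration reduces the number of unique molecules required to learn a model of equivalent quality.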


Fig. 1: Learning generative models of molecules from limited training examples.
Fig. 2: Low-data generative models of distinct chemical spaces.
Fig. 3: Low-data generative models of diverse and homogeneous molecules.
Fig. 4: Alternative molecular representations for low-data generative models.
Fig. 5: Data, not architecture, dictates the performance of low-data generative models.
Fig. 6: Low-data generative models of bacterial, fungal and plant metabolomes.

Data availability

Input datasets used to train chemical language models are available from Zenodo (ref. 80). Calculated metrics for all 8,447 models discussed in this study are provided as Supplementary Data 1.

Code availability

Code used to train and evaluate chemical language models is available from GitHub at http://github.com/skinnider/low-data-generative-models (ref. 81).

References

  1. Bohacek, R. S., McMartin, C. & Guida, W. C. The art and practice of structure-based drug design: a molecular modeling perspective. Med. Res. Rev. 16, 3–50 (1996).

  2. Virshup, A. M., Contreras-García, J., Wipf, P., Yang, W. & Beratan, D. N. Stochastic voyages into uncharted chemical space produce a representative library of all possible drug-like compounds. J. Am. Chem. Soc. 135, 7296–7303 (2013).

  3. van Deursen, R. & Reymond, J.-L. Chemical space travel. ChemMedChem 2, 636–640 (2007).

  4. Lameijer, E.-W., Kok, J. N., Bäck, T. & Ijzerman, A. P. The molecule evoluator. An interactive evolutionary algorithm for the design of drug-like molecules. J. Chem. Inf. Model. 46, 545–552 (2006).

  5. Pollock, S. N., Coutsias, E. A., Wester, M. J. & Oprea, T. I. Scaffold topologies. 1. Exhaustive enumeration up to eight rings. J. Chem. Inf. Model. 48, 1304–1310 (2008).

  6. Fink, T. & Reymond, J.-L. Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes and drug discovery. J. Chem. Inf. Model. 47, 342–353 (2007).

  7. Blum, L. C. & Reymond, J.-L. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc. 131, 8732–8733 (2009).

  8. Ruddigkeit, L., van Deursen, R., Blum, L. C. & Reymond, J.-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model. 52, 2864–2875 (2012).

  9. Elton, D. C., Boukouvalas, Z., Fuge, M. D. & Chung, P. W. Deep learning for molecular design – a review of the state of the art. Mol. Syst. Des. Eng. 4, 828–849 (2019).

  10. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model. 28, 31–36 (1988).

  11. Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).

  12. Segler, M. H. S., Kogej, T., Tyrchan, C. & Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 4, 120–131 (2018).

  13. Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminform. 9, 48 (2017).

  14. Arús-Pous, J. et al. Exploring the GDB-13 chemical space using deep generative models. J. Cheminform. 11, 20 (2019).

  15. Merk, D., Friedrich, L., Grisoni, F. & Schneider, G. De novo design of bioactive small molecules by artificial intelligence. Mol. Inform. 37, 1700153 (2018).

  16. Moret, M., Friedrich, L., Grisoni, F., Merk, D. & Schneider, G. Generative molecular design in low data regimes. Nat. Mach. Intell. 2, 171–180 (2020).

  17. Popova, M., Isayev, O. & Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv. 4, eaap7885 (2018).

  18. Kotsias, P.-C. et al. Direct steering of de novo molecular generation with descriptor conditional recurrent neural networks. Nat. Mach. Intell. 2, 254–265 (2020).

  19. Li, Y., Zhang, L. & Liu, Z. Multi-objective de novo drug design with conditional graph generative model. J. Cheminform. 10, 33 (2018).

  20. Zhou, Z., Kearnes, S., Li, L., Zare, R. N. & Riley, P. Optimization of molecules via deep reinforcement learning. Sci. Rep. 9, 10752 (2019).

  21. Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In Proc. 35th International Conference on Machine Learning Vol. 80 (eds Dy, J. & Krause, A.) 2323–2332 (PMLR, 2018).

  22. Brown, N., Fiscato, M., Segler, M. H. S. & Vaucher, A. C. GuacaMol: benchmarking models for de novo molecular design. J. Chem. Inf. Model. 59, 1096–1108 (2019).

  23. Polykovskiy, D. et al. Molecular sets (MOSES): a benchmarking platform for molecular generation models. Front. Pharmacol. 11, 565644 (2020).

  24. Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: generative models for matter engineering. Science 361, 360–365 (2018).

  25. Ståhl, N., Falkman, G., Karlsson, A., Mathiason, G. & Boström, J. Deep reinforcement learning for multiparameter optimization in de novo drug design. J. Chem. Inf. Model. 59, 3166–3176 (2019).

  26. Liu, X., Ye, K., van Vlijmen, H. W. T., IJzerman, A. P. & van Westen, G. J. P. An exploration strategy improves the diversity of de novo ligands using deep reinforcement learning: a case for the adenosine A2A receptor. J. Cheminform. 11, 35 (2019).

  27. Neil, D. et al. Exploring deep recurrent models with reinforcement learning for molecule design. In Proc. 6th International Conference on Learning Representations (ICLR, 2018).

  28. Amabilino, S., Pogány, P., Pickett, S. D. & Green, D. V. S. Guidelines for recurrent neural network transfer learning-based molecular generation of focused libraries. J. Chem. Inf. Model. 60, 5699–5713 (2020).

  29. Gupta, A. et al. Generative recurrent networks for de novo drug design. Mol. Inform. 37, 1700111 (2018).

  30. Awale, M., Sirockin, F., Stiefl, N. & Reymond, J.-L. Drug analogs from fragment-based long short-term memory generative neural networks. J. Chem. Inf. Model. 59, 1347–1356 (2019).

  31. Merk, D., Grisoni, F., Friedrich, L. & Schneider, G. Tuning artificial intelligence on the de novo design of natural-product-inspired retinoid X receptor modulators. Commun. Chem. 1, 68 (2018).

  32. Renz, P., Van Rompaey, D., Wegner, J. K., Hochreiter, S. & Klambauer, G. On failure modes in molecule generation and optimization. Drug Discov. Today Technol. 32–33, 55–63 (2019).

  33. Arús-Pous, J. et al. Randomized SMILES strings improve the quality of molecular generative models. J. Cheminform. 11, 71 (2019).

  34. Irwin, J. J. & Shoichet, B. K. ZINC—a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 45, 177–182 (2005).

  35. Benhenda, M. Can AI reproduce observed chemical diversity? Preprint at bioRxiv https://doi.org/10.1101/292177 (2018).

  36. Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S. & Klambauer, G. Fréchet ChemNet distance: a metric for generative models for molecules in drug discovery. J. Chem. Inf. Model. 58, 1736–1741 (2018).

  37. van Deursen, R., Ertl, P., Tetko, I. V. & Godin, G. GEN: highly efficient SMILES explorer using autodidactic generative examination networks. J. Cheminform. 12, 22 (2020).

  38. Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2019).

  39. Sorokina, M. & Steinbeck, C. Review on natural products databases: where to find data in 2020. J. Cheminform. 12, 20 (2020).

  40. Sanchez-Lengeling, B., Outeiral, C., Guimaraes, G. L. & Aspuru-Guzik, A. Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry (ORGANIC). Preprint at https://doi.org/10.26434/chemrxiv.5309668.v3 (2017).

  41. O’Boyle, N. & Dalke, A. DeepSMILES: an adaptation of SMILES for use in machine-learning of chemical structures. Preprint at https://doi.org/10.26434/chemrxiv.7097960 (2018).

  42. Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach. Learn. Sci. Technol. 1, 045024 (2020).

  43. Kusner, M. J., Paige, B. & Hernandez-Lobato, J. M. Grammar variational autoencoder. Preprint at https://arxiv.org/pdf/1703.01925.pdf (2017).

  44. Dai, H., Tian, Y., Dai, B., Skiena, S. & Song, L. Syntax-directed variational autoencoder for structured data. Preprint at https://arxiv.org/pdf/1802.08786.pdf (2018).

  45. Bjerrum, E. J. SMILES enumeration as data augmentation for neural network modeling of molecules. Preprint at https://arxiv.org/pdf/1703.07076.pdf (2017).

  46. Bjerrum, E. J. & Sattarov, B. Improving chemical autoencoder latent space and molecular de novo generation diversity with heteroencoders. Biomolecules 8, 131 (2018).

  47. Winter, R., Montanari, F., Noé, F. & Clevert, D.-A. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem. Sci. 10, 1692–1701 (2019).

  48. Zhang, Q. et al. Structural investigation of ribosomally synthesized natural products by hypothetical structure enumeration and evaluation using tandem MS. Proc. Natl Acad. Sci. USA 111, 12031–12036 (2014).

  49. Johnston, C. W. et al. An automated genomes-to-natural products platform (GNP) for the discovery of modular natural products. Nat. Commun. 6, 8421 (2015).

  50. Zheng, S. et al. QBMG: quasi-biogenic molecule generator with deep recurrent neural network. J. Cheminform. 11, 5 (2019).

  51. da Silva, R. R., Dorrestein, P. C. & Quinn, R. A. Illuminating the dark matter in metabolomics. Proc. Natl Acad. Sci. USA 112, 12549–12550 (2015).

  52. Vanhaelen, Q., Lin, Y.-C. & Zhavoronkov, A. The advent of generative chemistry. ACS Med. Chem. Lett. 11, 1496–1505 (2020).

  53. Coley, C. W., Eyke, N. S. & Jensen, K. F. Autonomous discovery in the chemical sciences part II: outlook. Angew. Chem. Int. Ed. 59, 23414–23436 (2020).

  54. Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).

  55. Kadurin, A., Nikolenko, S., Khrabrov, K., Aliper, A. & Zhavoronkov, A. druGAN: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico. Mol. Pharm. 14, 3098–3104 (2017).

  56. Samanta, B. et al. NEVAE: a deep generative model for molecular graphs. J. Mach. Learn. Res. 21, 1–33 (2020).

  57. Mercado, R. et al. Practical notes on building molecular graph generative models. Appl. AI Lett. https://doi.org/10.1002/ail2.18 (2020).

  58. De Cao, N. & Kipf, T. MolGAN: an implicit generative model for small molecular graphs. Preprint at https://arxiv.org/pdf/1805.11973.pdf (2018).

  59. Jaeger, S., Fulle, S. & Turk, S. Mol2vec: unsupervised machine learning approach with chemical intuition. J. Chem. Inf. Model. 58, 27–35 (2018).

  60. O’Boyle, N. & Dalke, A. DeepSMILES: an adaptation of SMILES for use in machine-learning of chemical structures. Preprint at https://doi.org/10.26434/chemrxiv.7097960.v1 (2018).

  61. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).

  62. O’Boyle, N. M. & Sayle, R. A. Comparing structural fingerprints using a literature-based similarity benchmark. J. Cheminform. 8, 36 (2016).

  63. Skinnider, M. A., Dejong, C. A., Franczak, B. C., McNicholas, P. D. & Magarvey, N. A. Comparative analysis of chemical similarity methods for modular natural products with a hypothetical structure enumeration algorithm. J. Cheminform. 9, 46 (2017).

  64. Smith, S. L., Kindermans, P.-J. & Le, Q. V. Don’t decay the learning rate, increase the batch size. Preprint at https://arxiv.org/pdf/1711.00489.pdf (2017).

  65. Bertz, S. H. The first general index of molecular complexity. J. Am. Chem. Soc. 103, 3599–3601 (1981).

  66. Wildman, S. A. & Crippen, G. M. Prediction of physicochemical parameters by atomic contributions. J. Chem. Inf. Comput. Sci. 39, 868–873 (1999).

  67. Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996).

  68. Ertl, P., Roggo, S. & Schuffenhauer, A. Natural product-likeness score and its application for prioritization of compound libraries. J. Chem. Inf. Model. 48, 68–74 (2008).

  69. Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90–98 (2012).

  70. Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 1, 8 (2009).

  71. Ertl, P., Rohde, B. & Selzer, P. Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. J. Med. Chem. 43, 3714–3717 (2000).

  72. Sajed, T. et al. ECMDB 2.0: a richer resource for understanding the biochemistry of E. coli. Nucleic Acids Res. 44, D495–D501 (2016).

  73. Huang, W. et al. PAMDB: a comprehensive Pseudomonas aeruginosa metabolome database. Nucleic Acids Res. 46, D575–D580 (2018).

  74. Moumbock, A. F. A. et al. StreptomeDB 3.0: an updated compendium of streptomycetes natural products. Nucleic Acids Res. 49, D600–D604 (2020).

  75. Zeng, X. et al. NPASS: natural product activity and species source database for natural product research, discovery and tool development. Nucleic Acids Res. 46, D1217–D1222 (2018).

  76. Karp, P. D. et al. The BioCyc collection of microbial genomes and metabolic pathways. Brief. Bioinform. 20, 1085–1093 (2019).

  77. Neveu, V. et al. Phenol-Explorer: an online comprehensive database on polyphenol contents in foods. Database (Oxford) 2010, bap024 (2010).

  78. Ramirez-Gaona, M. et al. YMDB 2.0: a significantly expanded version of the yeast metabolome database. Nucleic Acids Res. 45, D440–D445 (2017).

  79. McInnes, L., Healy, J. & Melville, J. UMAP: Uniform Manifold Approximation and Projection for dimension reduction. Preprint at https://arxiv.org/pdf/1802.03426.pdf (2018).

  80. Molecules used to train generative models (Zenodo, 2021); https://doi.org/10.5281/zenodo.4641960

  81. Python source code used to train and evaluate generative models of molecules (Zenodo, 2021); https://doi.org/10.5281/zenodo.4642099

Download references

Acknowledgements

This work was supported by funding from Genome Canada, Genome British Columbia and Genome Alberta (project nos. 284MBO and 264PRO). Computational resources were provided by WestGrid, Compute Canada and Advanced Research Computing at the University of British Columbia. M.A.S. acknowledges support from a CIHR Vanier Canada Graduate Scholarship, a Roman M. Babicki Fellowship in Medical Research, a Borealis AI Graduate Fellowship, a Walter C. Sumner Memorial Fellowship and a Vancouver Coastal Health–CIHR–UBC MD/PhD Studentship. We thank J. Liigand and F. Wang for helpful discussions.

Author information

Authors and Affiliations

Authors

Contributions

M.A.S., D.S.W. and L.J.F. designed experiments. M.A.S. and R.G.S. performed experiments. M.A.S. wrote the manuscript. All authors edited the manuscript.

Corresponding authors

Correspondence to Michael A. Skinnider or Leonard J. Foster.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Machine Intelligence thanks Sebastian Raschka and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Evaluating low-data generative models of purchasable chemical space.

a, Schematic overview of the ‘% valid’, ‘% unique’, and ‘% novel’ metrics. b, Values of the five top-performing metrics with the strongest correlations (ρ ≥ 0.82) to training dataset size for n = 110 generative models trained on varying numbers of molecules from the ZINC database. c, Values of five exemplary metrics with moderate to weak correlations (0.48 ≤ ρ ≤ 0.73) to training dataset size for n = 110 generative models trained on varying numbers of molecules from the ZINC database. d, Values of five exemplary metrics with little or no correlation (ρ ≤ 0.36) to training dataset size for n = 110 generative models trained on varying numbers of molecules from the ZINC database.
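For reference, the '% valid', '% unique' and '% novel' metrics schematized in panel a are typically computed by canonicalizing sampled SMILES and comparing them against the training set. The sketch below is a generic reimplementation under that assumption, not the authors' code:

```python
# Generic sketch of the % valid / % unique / % novel metrics.
# Not the authors' code; uses canonical SMILES as molecular identity.
from rdkit import Chem

def canonicalize(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def evaluate_samples(sampled, training):
    valid = [c for c in (canonicalize(s) for s in sampled) if c is not None]
    train = {c for c in (canonicalize(s) for s in training) if c is not None}
    unique = set(valid)           # distinct valid molecules
    novel = unique - train        # distinct molecules absent from training data
    return {
        "% valid": 100 * len(valid) / len(sampled),
        "% unique": 100 * len(unique) / max(len(valid), 1),
        "% novel": 100 * len(novel) / max(len(unique), 1),
    }
```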

Extended Data Fig. 2 Evaluating low-data generative models of divergent chemical spaces.

a, Values of the five top-performing metrics with the strongest correlations (average rank correlation ≥ 0.80) to training dataset size for n = 440 generative models trained on varying numbers of molecules from the ChEMBL, COCONUT, GDB, or ZINC databases. Points and error bars show the mean and standard deviation, respectively, of ten independent replicates. b, Values of five exemplary metrics with moderate to weak correlations to training dataset size for n = 440 generative models trained on varying numbers of molecules from the ChEMBL, COCONUT, GDB, or ZINC databases. c, Values of five exemplary metrics with little or no correlation to training dataset size for n = 440 generative models trained on varying numbers of molecules from the ChEMBL, COCONUT, GDB, or ZINC databases. d, PC1 scores for n = 440 chemical language models trained on varying numbers of molecules sampled from the ChEMBL, COCONUT, GDB, or ZINC databases. Inset text shows the Spearman correlation. e, Factor loadings onto the first principal component in a PCA of n = 440 chemical language models trained on varying numbers of molecules sampled from the ChEMBL, COCONUT, GDB, or ZINC databases.

Extended Data Fig. 3 Robustness of principal component analysis for the evaluation of chemical generative models.

a, PCA of top-performing metrics, top, and PC1 scores, bottom, for chemical language models trained on varying numbers of molecules sampled from the ChEMBL, COCONUT, GDB, and ZINC database, with PCA performed separately for each database. Bottom, inset text shows the Spearman correlation. b, PCA of top-performing metrics for chemical language models trained on varying numbers of molecules sampled from three of four databases, colored by the size of the training dataset, top, or the chemical database on which the generative models were trained, middle. Bottom, PC1 scores for models trained on the withheld database, projected onto the coordinate basis of the other three databases. Inset text shows the Spearman correlation.
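The leave-one-database-out analysis in panel b fits a PCA on the metrics of models trained on three databases and projects the withheld database's models onto the resulting PC1 basis. A minimal scikit-learn sketch of that projection follows; the array layout and variable names are assumptions, not the authors' code:

```python
# Sketch of the leave-one-database-out PCA projection. Assumes a
# (n_models, n_metrics) array plus per-model database labels and
# training-set sizes; variable names are hypothetical.
import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def project_withheld(metrics, databases, train_sizes, withheld):
    held_out = databases == withheld
    # Fit scaler and PCA on the other three databases only.
    scaler = StandardScaler().fit(metrics[~held_out])
    pca = PCA(n_components=1).fit(scaler.transform(metrics[~held_out]))
    # Project the withheld database onto that coordinate basis.
    pc1 = pca.transform(scaler.transform(metrics[held_out]))[:, 0]
    rho, _ = spearmanr(pc1, train_sizes[held_out])  # does PC1 track dataset size?
    return pc1, rho
```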

Extended Data Fig. 4 Learning chemical language models from less than 1,000 examples.

a, Proportion of valid SMILES generated by chemical language models trained on samples of between 200 and 1,000 molecules from one of four chemical databases. b, Fréchet ChemNet distance of chemical language models trained on samples of between 200 and 1,000 molecules from one of four chemical databases. c, PC1 scores of chemical language models trained on samples of between 200 and 1,000 molecules from one of four chemical databases.

Extended Data Fig. 5 Training dataset size requirements in different chemical spaces.

Mean difference in PC1 scores between chemical language models trained on varying numbers of molecules sampled from each pair of chemical structure databases. Dark squares indicate pairs without statistically significant differences (uncorrected p > 0.05, two-sided t-test).

Extended Data Fig. 6 Low-data generative models of diverse and homogeneous molecules from the ChEMBL and ZINC databases.

a, PCA of top-performing metrics for molecules generated by chemical language models trained on varying numbers of more or less diverse molecules from the GDB, ChEMBL, and ZINC databases, colored by the size of the training dataset. b, As in a, but colored by the chemical database on which the generative models were trained. c, As in a, but colored by the diversity (minimum Tanimoto coefficient to a randomly selected ‘founder’ molecule). d-i, Performance of chemical language models trained on samples of molecules from the ChEMBL (d-f) and ZINC (g-i) databases with a minimum Tanimoto coefficient (Tc) to a randomly selected ‘founder’ molecule. d, Proportion of valid SMILES generated by chemical language models trained on varying numbers of more or less diverse molecules from the ChEMBL database. e, Fréchet ChemNet distances of chemical language models trained on varying numbers of more or less diverse molecules from the ChEMBL database. f, PC1 scores of chemical language models trained on varying numbers of more or less diverse molecules from the ChEMBL database. g, Proportion of valid SMILES generated by chemical language models trained on varying numbers of more or less diverse molecules from the ZINC database. h, Fréchet ChemNet distances of chemical language models trained on varying numbers of more or less diverse molecules from the ZINC database. i, PC1 scores of chemical language models trained on varying numbers of more or less diverse molecules from the ZINC database.
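The diversity manipulation used throughout this figure selects training molecules by their minimum Tanimoto coefficient (Tc) to a randomly chosen 'founder'. A sketch of such a founder-based filter using RDKit Morgan fingerprints is given below; the fingerprint parameters and threshold are illustrative assumptions, not the paper's exact settings:

```python
# Sketch of founder-based diversity filtering with Morgan (ECFP-like)
# fingerprints. Radius, bit size, and min_tc are illustrative assumptions.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def _fp(mol):
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)

def founder_subset(smiles_list, founder_smiles, min_tc=0.3):
    """Keep molecules whose Tanimoto similarity to the founder is >= min_tc."""
    founder = _fp(Chem.MolFromSmiles(founder_smiles))
    keep = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        if DataStructs.TanimotoSimilarity(founder, _fp(mol)) >= min_tc:
            keep.append(smi)
    return keep
```

Raising min_tc yields a more homogeneous training set centred on the founder; lowering it yields a more diverse one.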

Extended Data Fig. 7 Evaluating alternative molecular representations for low-data generative models in distinct chemical spaces.

a, Proportion of valid SMILES generated by chemical language models trained on one of three string representations of molecules from the ChEMBL, COCONUT, and GDB databases. b, PCA of top-performing metrics for molecules generated by n = 1,320 chemical language models trained on one of three string representations of molecules from the ChEMBL, COCONUT, and GDB databases, colored by the size of the training dataset. c, As in b, but colored by the chemical database on which the generative models were trained. d, As in b, but colored by molecular representation. e, PC1 scores of chemical language models trained on one of three string representations of molecules from the ChEMBL, COCONUT, and GDB databases. f, Fréchet ChemNet distances of chemical language models trained on one of three string representations of molecules from the ChEMBL, COCONUT, and GDB databases. g, Mean difference in PC1 scores between chemical language models trained on varying numbers of molecules sampled from the ChEMBL, COCONUT, and GDB databases, represented either as DeepSMILES or SELFIES, y-axis, or SMILES, x-axis. Dark squares indicate pairs without statistically significant differences (uncorrected p > 0.05, two-sided t-test).

Extended Data Fig. 8 Data augmentation by non-canonical SMILES enumeration.

a, Proportion of valid SMILES generated by chemical language models trained on molecules from the ChEMBL, COCONUT, and GDB databases after varying degrees of non-canonical SMILES enumeration. b, Data as in a and Fig. 3i, but showing the relationship between the size of the training dataset and the proportion of valid SMILES generated by models for each degree of non-canonical SMILES enumeration separately. c, PCA of top-performing metrics for molecules generated by n = 1,760 chemical language models trained on molecules from the ChEMBL, COCONUT, and GDB databases after varying degrees of non-canonical SMILES enumeration, colored by the size of the training dataset. d, As in c, but colored by the chemical database on which the generative models were trained. e, As in c, but colored by the amount of SMILES enumeration. f, PC1 scores of chemical language models trained on molecules from the ChEMBL, COCONUT, and GDB databases after varying degrees of non-canonical SMILES enumeration. g, Mean difference in PC1 scores between chemical language models trained on molecules from the ChEMBL, COCONUT, and GDB databases represented as canonical SMILES, x-axis, or non-canonical SMILES after varying degrees of data augmentation, y-axis. Dark squares indicate pairs without statistically significant differences (uncorrected p > 0.05, two-sided t-test).

Extended Data Fig. 9 Hyperparameter tuning in the ChEMBL database.

a, PCA of top-performing metrics for molecules generated by n = 1,210 chemical language models, trained on varying numbers of molecules from the ChEMBL database with varying model hyperparameters, colored by the size of the training dataset. b, Mean PC1 scores of chemical language models as a function of the total number of neurons in the model. Solid lines show local polynomial regression. c, Mean PC1 scores for models trained on the ChEMBL database, as a function of both the number of molecules in the training dataset, x-axis, and varying hyperparameters, y-axis. The mean of five independent replicates is shown. d, Proportion of n = 110 chemical language models with varying hyperparameters, trained on the number of molecules shown on the y-axis, that outperformed a model without hyperparameter tuning trained on the number of molecules shown on the x-axis.

Extended Data Fig. 10 Optimizing generative models of bacterial, fungal, and plant metabolomes.

a, PCA of top-performing metrics for molecules generated by n = 48 chemical language models, trained on bacterial, fungal, or plant metabolomes with varying inputs and hyperparameters, colored by the target metabolome. b, As in a, but colored by the molecular representation and data augmentation strategy. c, As in a, but colored by the RNN architecture. d, Proportion of valid molecules produced by generative models of metabolomes trained with different molecular representations (SMILES, DeepSMILES, or SELFIES), data augmentation strategies (non-canonical SMILES enumeration with an augmentation factor of between 2x and 30x), and RNN architectures (GRU or LSTM). e, As in d, but showing the Fréchet ChemNet distance between generated and real metabolites. f, As in d, but showing the Jensen-Shannon distance of the proportion of stereocenters between generated and real metabolites.
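Panel f compares generated and real metabolites by the Jensen-Shannon distance between their distributions of stereocenter counts. A minimal sketch of that comparison follows; the binning choice is an assumption, and SciPy's jensenshannon normalizes the histograms internally:

```python
# Sketch of the Jensen-Shannon distance between stereocenter-count
# distributions of generated vs. real molecules. Binning is an assumption.
import numpy as np
from scipy.spatial.distance import jensenshannon
from rdkit import Chem

def stereocenter_counts(smiles_list):
    counts = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            counts.append(len(Chem.FindMolChiralCenters(mol, includeUnassigned=True)))
    return np.array(counts)

def stereo_jsd(generated, reference, max_centers=10):
    bins = np.arange(max_centers + 2)   # one bin per stereocenter count
    p, _ = np.histogram(stereocenter_counts(generated), bins=bins)
    q, _ = np.histogram(stereocenter_counts(reference), bins=bins)
    return jensenshannon(p, q)          # 0 means identical distributions
```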

Supplementary information

Supplementary Information

Supplementary Figs. 1–5.

Supplementary Data 1

Metrics for all 8,447 models discussed in this study.

About this article

Cite this article

Skinnider, M.A., Stacey, R.G., Wishart, D.S. et al. Chemical language models enable navigation in sparsely populated chemical space. Nat Mach Intell 3, 759–770 (2021). https://doi.org/10.1038/s42256-021-00368-1
