Deep learning has acquired considerable momentum over the past couple of years in the domain of de novo drug design. Here, we propose a simple approach to the task of focused molecular generation for drug design purposes by constructing a conditional recurrent neural network (cRNN). We aggregate selected molecular descriptors and transform them into the initial memory state of the network before starting the generation of alphanumeric strings that describe molecules. We thus tackle the inverse design problem directly, as the cRNNs may generate molecules near the specified conditions. Moreover, we exemplify a novel way of assessing the focus of the conditional output of such a model using negative log-likelihood plots. The output is more focused than that of traditional unbiased RNNs, yet less focused than that of autoencoders, so the method offers an output specificity intermediate between those of well-established methods. Conceptually, our architecture shows promise for the generalized problem of steering sequential data generation with recurrent neural networks.
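The conditioning idea described above can be illustrated with a small, untrained toy model. This is only a sketch under stated assumptions, not the authors' implementation: it uses a plain tanh recurrent cell instead of the network used in the paper, random weights, a hypothetical eight-token vocabulary, and made-up scaled values for six descriptors. The names `TinyConditionalRNN`, `initial_state` and `nll` are invented for illustration. It shows how a condition vector can be mapped to the initial state and how the negative log-likelihood (NLL) of one string changes with the conditions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TinyConditionalRNN:
    """Untrained toy model: weights are random, so outputs carry no chemistry."""

    def __init__(self, n_cond, n_hidden, vocab, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.vocab = vocab
        self.W_cond = rand(n_hidden, n_cond)      # conditions -> initial state
        self.W_in = rand(n_hidden, len(vocab))    # one-hot token -> hidden
        self.W_rec = rand(n_hidden, n_hidden)     # hidden -> hidden
        self.W_out = rand(len(vocab), n_hidden)   # hidden -> token logits

    def initial_state(self, conditions):
        """Map the descriptor vector to the initial hidden state."""
        return [math.tanh(sum(w * c for w, c in zip(row, conditions)))
                for row in self.W_cond]

    def step(self, h, token):
        """Consume one token; return the new state and next-token probabilities."""
        idx = self.vocab.index(token)
        h_new = [math.tanh(self.W_in[i][idx]
                           + sum(w * hj for w, hj in zip(self.W_rec[i], h)))
                 for i in range(len(h))]
        logits = [sum(w * hj for w, hj in zip(row, h_new)) for row in self.W_out]
        return h_new, softmax(logits)

    def nll(self, tokens, conditions):
        """Sum of -log p(next token) over the string, given the conditions."""
        h = self.initial_state(conditions)
        total = 0.0
        for current, target in zip(tokens[:-1], tokens[1:]):
            h, probs = self.step(h, current)
            total -= math.log(probs[self.vocab.index(target)])
        return total


# Hypothetical vocabulary with start ("^") and end ("$") tokens.
vocab = ["^", "C", "O", "N", "(", ")", "=", "$"]
model = TinyConditionalRNN(n_cond=6, n_hidden=8, vocab=vocab)

# Ethanol as a token sequence; the condition values are invented, scaled
# stand-ins for six descriptors (for example logP, TPSA, MW, QED, HBA, HBD).
tokens = ["^", "C", "C", "O", "$"]
cond_a = [0.1, 0.2, 0.1, 0.4, 0.1, 0.1]
cond_b = [0.9, 0.8, 0.9, 0.7, 0.6, 0.5]

nll_a = model.nll(tokens, cond_a)
nll_b = model.nll(tokens, cond_b)
print(f"NLL under condition A: {nll_a:.3f}")
print(f"NLL under condition B: {nll_b:.3f}")
```

In a trained model, a lower NLL for a string under one set of conditions than under another would suggest that the conditions steer the output distribution toward that string; plotting such values over many strings is one way to picture how focused the conditional output is.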
The curated datasets used to train all models are available at https://github.com/pcko1/Deep-Drug-Coder/tree/master/datasets.
The Python code and the trained neural networks used in this work are available under the MIT licence (ref. 57) in the Deep Drug Coder (DDC) GitHub repository at https://github.com/pcko1/Deep-Drug-Coder and at https://doi.org/10.5281/zenodo.3739063. The repository also includes an optional encoding network to constitute a molecular heteroencoder.
Lopyrev, K. Generating news headlines with recurrent neural networks. Preprint at https://arxiv.org/pdf/1512.01712.pdf (2015).
Briot, J.-P., Hadjeres, G. & Pachet, F.-D. Deep Learning Techniques for Music Generation (Springer, 2020).
Wang, Z. et al. Chinese poetry generation with planning based neural network. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics 1051–1060 (COLING 2016 Organizing Committee, 2016).
Elgammal, A., Liu, B., Elhoseiny, M. & Mazzone, M. CAN: Creative adversarial networks, generating ‘art’ by learning about styles and deviating from style norms. Preprint at https://arxiv.org/abs/1706.07068 (2017).
Segler, M. H. S., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).
Ronneberger, O., Fischer, P. & Brox, T. U-Net: convolutional networks for biomedical image segmentation. In Proceedings of Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015 (eds Navab, N., Hornegger, J., Wells, W. M. & Frangi, A. F.) 234–241 (Springer, 2015).
Chen, H., Engkvist, O., Wang, Y., Olivecrona, M. & Blaschke, T. The rise of deep learning in drug discovery. Drug Discov. Today 23, 1241–1250 (2018).
Xu, Y. et al. Deep learning for molecular generation. Future Med. Chem. 11, 567–597 (2019).
Elton, D. C., Boukouvalas, Z., Fuge, M. D. & Chung, P. W. Deep learning for molecular design: a review of the state of the art. Mol. Syst. Des. Eng. 4, 828–849 (2019).
Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 18, 463–477 (2019).
Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: generative models for matter engineering. Science 361, 360–365 (2018).
Weininger, D. SMILES, a chemical language and information system. 1: Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Schwalbe-Koda, D. & Gómez-Bombarelli, R. Generative models for automatic chemical design. Preprint at https://arxiv.org/pdf/1907.01632.pdf (2019).
Lipton, Z. C., Berkowitz, J. & Elkan, C. A critical review of recurrent neural networks for sequence learning. Preprint at https://arxiv.org/pdf/1506.00019.pdf (2015).
Arús-Pous, J. et al. Exploring the GDB-13 chemical space using deep generative models. J. Cheminform. 11, 20 (2019).
Arús-Pous, J. et al. Randomized SMILES strings improve the quality of molecular generative models. J. Cheminform. 11, 71 (2019).
Segler, M. H. S., Kogej, T., Tyrchan, C. & Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 4, 120–131 (2018).
Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminform. 9, 48 (2017).
Zhou, Z., Kearnes, S., Li, L., Zare, R. N. & Riley, P. Optimization of molecules via deep reinforcement learning. Sci. Rep. 9, 10752 (2019).
Popova, M., Isayev, O. & Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv. 4, eaap7885 (2018).
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
Polykovskiy, D., Artamonov, A., Veselov, M., Kadurin, A. & Nikolenko, S. Molecular Sets (MOSES): a benchmarking platform for molecular generation models. Preprint at https://arxiv.org/pdf/1811.12823.pdf (2019).
Brown, N., Fiscato, M., Segler, M. H. S. & Vaucher, A. C. GuacaMol: benchmarking models for de novo molecular design. J. Chem. Inf. Model. 59, 1096–1108 (2019).
Bjerrum, E. J. SMILES enumeration as data augmentation for neural network modeling of molecules. Preprint at https://arxiv.org/pdf/1703.07076.pdf (2017).
Bjerrum, E. J. & Sattarov, B. Improving chemical autoencoder latent space and molecular de novo generation diversity with heteroencoders. Biomolecules 8, 131 (2018).
Winter, R., Montanari, F., Noé, F. & Clevert, D. A. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem. Sci. 10, 1692–1701 (2019).
Blaschke, T., Olivecrona, M., Engkvist, O., Bajorath, J. & Chen, H. Application of generative autoencoder in de novo molecular design. Mol. Inform. 37, 1–11 (2018).
Winter, R. et al. Efficient multi-objective molecular optimization in a continuous latent space. Chem. Sci. 10, 8016–8024 (2019).
Prykhodko, O. et al. A de novo molecular generation method using latent vector based generative adversarial network. J. Cheminform. 11, 74 (2019).
Lim, J., Ryu, S., Kim, J. W. & Kim, W. Y. Molecular generative model based on conditional variational autoencoder for de novo molecular design. J. Cheminform. 10, 31 (2018).
Jin, W., Barzilay, R. & Jaakkola, T. S. Multi-resolution autoregressive graph-to-graph translation for molecules. Preprint at https://chemrxiv.org/articles/Multi-Resolution_Autoregressive_Graph-to-Graph_Translation_for_Molecules/8266745/1 (2019).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S. & Klambauer, G. Fréchet ChemNet distance: a metric for generative models for molecules in drug discovery. J. Chem. Inf. Model. 58, 1736–1741 (2018).
Duvenaud, D. K. et al. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems Vol. 28 (eds Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M. & Garnett, R.) 2224–2232 (Curran Associates, 2015).
Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90–98 (2012).
Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining 226–231 (AAAI Press, 1996).
Škrlj, B., Džeroski, S., Lavrač, N. & Petkovič, M. Feature importance estimation with self-attention networks. Preprint at https://arxiv.org/pdf/2002.04464.pdf (2020).
Olden, J. D., Joy, M. K. & Death, R. G. An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecol. Model. 178, 389–397 (2004).
Hung, L. & Chung, H. Decoupled control using neural network-based sliding-mode controller for nonlinear systems. Expert Syst. Appl. 32, 1168–1182 (2007).
Gaulton, A. et al. The ChEMBL database in 2017. Nucleic Acids Res. 45, D945–D954 (2017).
Sun, J. et al. ExCAPE-DB: an integrated large scale dataset facilitating big data analysis in chemogenomics. J. Cheminform. 9, 41 (2017).
Swain, M. MolVS: Molecule Validation and Standardization v0.1.1 (2019); https://molvs.readthedocs.io/en/latest/
Sun, J. et al. ExCAPEDB (2019); https://solr.ideaconsult.net/search/excape/
Landrum, G. et al. RDKit: Open-Source Cheminformatics Software (2019); https://www.rdkit.org/
Butina, D. Unsupervised data base clustering based on Daylight’s fingerprint and Tanimoto similarity: a fast and automated way to cluster small and large data sets. J. Chem. Inf. Comput. Sci. 39, 747–750 (1999).
Bjerrum, E. J. Molvecgen: Molecular Vectorization and Batch Generation (2019); https://github.com/EBjerrum/molvecgen
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
O’Boyle, N. M. & Sayle, R. A. Comparing structural fingerprints using a literature-based similarity benchmark. J. Cheminform. 8, 36 (2016).
Probst, D. & Reymond, J. L. A probabilistic molecular fingerprint for big data settings. J. Cheminform. 10, 66 (2018).
Chollet, F. Keras (2019); https://keras.io/
Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. Preprint at https://arxiv.org/pdf/1603.04467.pdf (2016).
Arora, R., Basu, A., Mianjy, P. & Mukherjee, A. Understanding deep neural networks with rectified linear units. Preprint at https://arxiv.org/pdf/1611.01491.pdf (2016).
Williams, R. J. & Zipser, D. A learning algorithm for continually running fully recurrent neural networks. Neural Comput. 1, 270–280 (1989).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings (eds Bengio, Y. & LeCun, Y.) (2015).
Wildman, S. A. & Crippen, G. M. Prediction of physicochemical parameters by atomic contributions. J. Chem. Inf. Comput. Sci. 39, 868–873 (1999).
Tan, C. et al. A survey on deep transfer learning. In Artificial Neural Networks and Machine Learning – ICANN 2018 (eds Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L. & Maglogiannis, I.) 270–279 (Springer, 2018).
MIT Licence; https://opensource.org/licenses/MIT
We thank the entire MolecularAI team at AstraZeneca for their invaluable input and the fruitful discussions held during development of the present work. J.A.-P. is supported financially by the European Union’s Horizon 2020 research and innovation programme under a Marie Skłodowska-Curie grant (agreement no. 676434, ‘Big Data in Chemistry’, ‘BIGCHEM’; http://bigchem.eu).
The authors declare no competing interests.
a, Wildman-Crippen coefficient (logP), b, topological polar surface area (TPSA), c, molecular weight (MW), d, drug-likeness score (QED), e, number of hydrogen bond acceptors (HBA) and f, number of hydrogen bond donors (HBD) with respect to the complete ChEMBL25 and DRD2 datasets before splitting. Panels a–d show the continuous histogram density as estimated by the kdeplot method of the seaborn Python library using default parameters.
a, Distribution of pairwise Tanimoto similarity of uniquely generated Murcko scaffolds to the seeding Murcko scaffold. The physchem-based (PCB) model generates SMILES that correspond to new scaffolds whereas the fingerprint-based (FPB) model generates scaffolds that are more similar or even identical to the seeding scaffold. b, Predicted active probability of all unique structures behind all generated SMILES strings per model. Both models generate SMILES that are predicted to be active with similar probability distributions.
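The Tanimoto similarity used in panel a can be sketched as follows. This is a minimal illustration that treats each fingerprint as a set of "on" bit indices; the bit sets below are invented, and in practice they would come from a cheminformatics toolkit applied to the Murcko scaffolds. The function name `tanimoto` and the handling of two empty fingerprints are choices made for this sketch only.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / union


# Invented bit sets standing in for scaffold fingerprints.
seed_scaffold = {1, 5, 9, 12}
generated_scaffold = {1, 5, 7}

# 2 shared bits out of 5 distinct bits overall.
similarity = tanimoto(seed_scaffold, generated_scaffold)
print(similarity)

# An identical scaffold gives the maximum similarity of 1.0.
identical = tanimoto(seed_scaffold, seed_scaffold)
print(identical)
```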
Extended Data Fig. 3 Novelty of uniquely generated underlying molecules with respect to different datasets.
Novelty is assessed with respect to the train and test ChEMBL datasets using the physchem-based (PCB) and fingerprint-based (FPB) models. The first element of every pair on the x-axis corresponds to the dataset the conditions were drawn from. The second element represents the dataset with respect to which novelty was calculated. For any model the difference between datasets is insignificant, reflecting a consistent generation of novel compounds regardless of the seeding conditions. The numbers correspond to the fraction of valid unique novel molecules out of 25,600 generated SMILES strings.
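The fraction reported in this figure can be sketched as below, assuming the denominator is the number of sampled SMILES strings, as the caption states. The helper `canonicalize` is hypothetical: it stands in for a toolkit routine that returns a canonical SMILES for a valid string and `None` for an invalid one, and is replaced here by a small lookup table so that the example runs without external libraries.

```python
def novelty_fraction(generated, reference, n_sampled, canonicalize):
    """Fraction of valid, unique, novel molecules among n_sampled strings.

    generated:    iterable of sampled SMILES strings
    reference:    set of canonical SMILES in the comparison dataset
    canonicalize: hypothetical callable returning a canonical SMILES or None
    """
    unique_valid = set()
    for smiles in generated:
        canonical = canonicalize(smiles)
        if canonical is not None:          # keep valid strings only
            unique_valid.add(canonical)    # the set collapses duplicates
    novel = unique_valid - set(reference)  # drop molecules already known
    return len(novel) / n_sampled


# Toy stand-in for a real canonicalizer: "OCC" and "CCO" are the same
# molecule, and "C(" is an invalid string.
toy_table = {"CCO": "CCO", "OCC": "CCO", "C1CC1": "C1CC1", "CCN": "CCN"}

def toy_canonicalize(smiles):
    return toy_table.get(smiles)

sampled = ["CCO", "OCC", "C1CC1", "C(", "CCN"]
training_set = {"CCO"}

fraction = novelty_fraction(sampled, training_set, len(sampled), toy_canonicalize)
print(fraction)
```

Here three valid unique molecules are found, one of which is already in the reference set, so two of the five sampled strings count as valid, unique and novel.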
Extended Data Fig. 4 Optimization of properties individually in every direction with the physchem-based model.
The pattern of the molecular properties of the generated valid SMILES (blue dots) seems to follow the set conditions (red lines). The length of a step represents the number of valid SMILES for that setpoint out of 256 sampled SMILES strings. Low molecular weight or high QED setpoints lead to unstable generation of valid SMILES for the given condition. QED displays the largest deviations from the seed conditions and is the hardest property to control as the formula contains a weighted sum of the other five properties. The area annotated by arrows refers to an input combination with a high QED target that caused the output to collapse with respect to the rate of valid SMILES and the fulfillment of the specified conditions. The exact percentage of unique molecules stemming from all valid SMILES sampled at each step is shown in Supplementary Fig. 12.
Cite this article
Kotsias, P.-C., Arús-Pous, J., Chen, H. et al. Direct steering of de novo molecular generation with descriptor conditional recurrent neural networks. Nat. Mach. Intell. 2, 254–265 (2020). https://doi.org/10.1038/s42256-020-0174-5