Direct steering of de novo molecular generation with descriptor conditional recurrent neural networks

Matters Arising to this article was published on 10 December 2020


Deep learning has acquired considerable momentum over the past couple of years in the domain of de novo drug design. Here, we propose a simple approach to the task of focused molecular generation for drug design purposes by constructing a conditional recurrent neural network (cRNN). We aggregate selected molecular descriptors and transform them into the initial memory state of the network before starting the generation of alphanumeric strings that describe molecules. We thus tackle the inverse design problem directly, as the cRNNs may generate molecules near the specified conditions. Moreover, we exemplify a novel way of assessing the focus of the conditional output of such a model using negative log-likelihood plots. The output is more focused than traditional unbiased RNNs, yet less focused than autoencoders, thus representing a novel method with intermediate output specificity between well-established methods. Conceptually, our architecture shows promise for the generalized problem of steering of sequential data generation with recurrent neural networks.
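The conditioning idea described above can be illustrated with a toy model. The following is a minimal sketch and not the authors' implementation: it uses a plain tanh RNN with random, untrained weights instead of the trained networks from the paper, a tiny made-up SMILES alphabet, and hypothetical names (`TinyConditionalRNN`, `initial_state`, `nll`, `sample`). It only shows how a descriptor vector could be mapped to the initial hidden state, and how the negative log-likelihood (NLL) of a known string could then be computed under given conditions.

```python
import math
import random

# Toy alphabet (an assumption for illustration): "^" = start, "$" = end.
VOCAB = ["^", "$", "C", "N", "O", "(", ")", "=", "1"]
TOK = {ch: i for i, ch in enumerate(VOCAB)}


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rand_matrix(rows, cols, rng, scale=0.3):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TinyConditionalRNN:
    """Vanilla RNN whose initial hidden state is computed from descriptors."""

    def __init__(self, n_cond, n_hidden=8, seed=0):
        rng = random.Random(seed)
        n_vocab = len(VOCAB)
        self.w_cond = rand_matrix(n_hidden, n_cond, rng)    # descriptors -> h0
        self.w_in = rand_matrix(n_hidden, n_vocab, rng)     # one-hot token -> hidden
        self.w_rec = rand_matrix(n_hidden, n_hidden, rng)   # hidden -> hidden
        self.w_out = rand_matrix(n_vocab, n_hidden, rng)    # hidden -> logits

    def initial_state(self, conditions):
        """Transform the (scaled) descriptor vector into the initial state."""
        return [math.tanh(a) for a in matvec(self.w_cond, conditions)]

    def step(self, h, token):
        """Consume one token; return the new state and next-token probabilities."""
        x = [1.0 if i == TOK[token] else 0.0 for i in range(len(VOCAB))]
        pre = [a + b for a, b in zip(matvec(self.w_in, x), matvec(self.w_rec, h))]
        h = [math.tanh(p) for p in pre]
        logits = matvec(self.w_out, h)
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]  # numerically stable softmax
        total = sum(exps)
        return h, [e / total for e in exps]

    def nll(self, smiles, conditions):
        """Negative log-likelihood of a string under the given conditions."""
        h = self.initial_state(conditions)
        seq = "^" + smiles + "$"
        total = 0.0
        for cur, nxt in zip(seq, seq[1:]):
            h, probs = self.step(h, cur)
            total -= math.log(probs[TOK[nxt]])
        return total

    def sample(self, conditions, rng, max_len=20):
        """Sample a string token by token, starting from the conditioned state."""
        h = self.initial_state(conditions)
        token, out = "^", []
        for _ in range(max_len):
            h, probs = self.step(h, token)
            # Never emit the start token in the middle of a sequence.
            weights = [0.0 if t == "^" else p for t, p in zip(VOCAB, probs)]
            token = rng.choices(VOCAB, weights=weights)[0]
            if token == "$":
                break
            out.append(token)
        return "".join(out)
```

With trained weights, a lower NLL for a molecule under its own descriptors than under unrelated ones would indicate a focused conditional output. With the random weights used here, the numbers carry no chemical meaning.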

Fig. 1: cRNN models based on different conditions.
Fig. 2: NLL of sampling known molecules.
Fig. 3: Unique structures corresponding to generated SMILES strings from two different known active seeds.
Fig. 4: Property satisfaction with the PCB model.
Fig. 5: Exclusivity of sampling.

Data availability

The curated datasets used to train all models are available at

Code availability

The Python code and the trained neural networks used in this work are available under the MIT licence (ref. 57) in the Deep Drug Coder (DDC) GitHub repository, which also includes an optional encoding network to constitute a molecular heteroencoder.


Acknowledgements

We thank the entire MolecularAI team at AstraZeneca for their invaluable input and the fruitful discussions held during development of the present work. J.A.-P. is supported financially by the European Union’s Horizon 2020 research and innovation programme under a Marie Skłodowska-Curie grant (agreement no. 676434, ‘Big Data in Chemistry’, ‘BIGCHEM’).

Author contributions

P.-C.K. and E.J.B. planned the project and jointly performed analysis of the results. P.-C.K. developed the necessary code. E.J.B. supervised the overall project. J.A.-P. assisted with the preprocessing of the datasets. J.A.-P., H.C., O.E. and C.T. provided valuable feedback on the methods used, the experimental set-up and the results at every stage. P.-C.K. wrote the manuscript and all authors reviewed it.

Correspondence to Esben Jannik Bjerrum.

The authors declare no competing interests.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Distribution of physicochemical properties of datasets.

a, Wildman–Crippen coefficient (logP); b, topological polar surface area (TPSA); c, molecular weight (MW); d, drug-likeness score (QED); e, number of hydrogen bond acceptors (HBA); and f, number of hydrogen bond donors (HBD) for the complete ChEMBL25 and DRD2 datasets before splitting. Panels a–d show the continuous density estimated by the kdeplot method of the seaborn Python library using default parameters.
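For readers unfamiliar with kernel density estimates, the sketch below shows what such a curve computes for a one-dimensional property such as logP. It is a standard-library illustration and not seaborn's code. The function name `gaussian_kde` and the sample values are hypothetical, and the Scott's-rule bandwidth is an assumption about the default rather than something stated in the caption.

```python
import math
import statistics


def gaussian_kde(samples, bandwidth=None):
    """Return a function estimating the density of 1-D samples.

    Each sample contributes a Gaussian bump; the estimate is their average.
    The bandwidth defaults to Scott's rule (sample std * n**(-1/5)),
    which is an assumption made for this sketch.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    if bandwidth is None:
        bandwidth = statistics.stdev(samples) * n ** (-1 / 5)
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up logP values, for illustration only.
logp_values = [1.2, 2.5, 3.1, 3.3, 4.8, 2.0]
estimate = gaussian_kde(logp_values)
```

Evaluating `estimate` on a grid of logP values and plotting the result gives a smooth curve of the same kind as the density panels, and the area under it is 1 by construction.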

Extended Data Fig. 2 Tanimoto similarity and predicted activity of generated structures.

a, Distribution of pairwise Tanimoto similarity of uniquely generated Murcko scaffolds to the seeding Murcko scaffold. The physchem-based (PCB) model generates SMILES that correspond to new scaffolds whereas the fingerprint-based (FPB) model generates scaffolds that are more similar or even identical to the seeding scaffold. b, Predicted active probability of all unique structures behind all generated SMILES strings per model. Both models generate SMILES that are predicted to be active with similar probability distributions.
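The Tanimoto similarity used in panel a can be written compactly for fingerprints represented as sets of "on" bit indices. This is a generic sketch of the coefficient itself. The function name and the example bit sets are hypothetical, and how the scaffolds were fingerprinted in the paper is not shown here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) coefficient between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Made-up fingerprints of a seed scaffold and a generated scaffold.
seed_bits = {1, 5, 9, 12}
generated_bits = {1, 5, 7, 12, 20}
similarity = tanimoto(seed_bits, generated_bits)  # 3 shared bits / 6 distinct bits
```

A value of 1.0 means the two fingerprints are identical, which corresponds to the case where a model reproduces the seeding scaffold.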

Extended Data Fig. 3 Novelty of uniquely generated underlying molecules with respect to different datasets.

Novelty is assessed with respect to the train and test ChEMBL datasets using the physchem-based (PCB) and fingerprint-based (FPB) models. The first element of every pair on the x-axis corresponds to the dataset the conditions were drawn from. The second element represents the dataset with respect to which novelty was calculated. For any model the difference between datasets is insignificant, reflecting a consistent generation of novel compounds regardless of the seeding conditions. The numbers correspond to the fraction of valid unique novel molecules out of 25,600 generated SMILES strings.
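The reported fraction can be read as the following calculation, sketched under stated assumptions. The function `novelty_fraction` and the stand-in `toy_canonicalize` are hypothetical. The stand-in only checks parenthesis balance and is not a real validity check; in practice a cheminformatics toolkit would parse and canonicalize each SMILES string. The reference set is assumed to hold canonical SMILES already.

```python
def novelty_fraction(generated, reference, canonicalize):
    """Fraction of valid, unique, novel molecules among generated strings.

    `canonicalize` maps a SMILES string to a canonical form, or returns
    None when the string is invalid. The denominator is the total number
    of generated strings, matching the caption's definition.
    """
    if not generated:
        raise ValueError("no generated strings")
    unique_valid = {c for c in map(canonicalize, generated) if c is not None}
    novel = unique_valid - set(reference)
    return len(novel) / len(generated)


def toy_canonicalize(smiles):
    """Stand-in for a real parser: rejects unbalanced parentheses only."""
    if smiles.count("(") != smiles.count(")"):
        return None
    return smiles


generated = ["CCO", "CCO", "CC(", "CCN", "c1ccccc1"]
training_set = {"CCO"}
fraction = novelty_fraction(generated, training_set, toy_canonicalize)
```

Here one string is invalid, one is a duplicate and one is already in the reference set, so two of the five generated strings count as novel.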

Extended Data Fig. 4 Optimization of properties individually in every direction with the physchem-based model.

The pattern of the molecular properties of the generated valid SMILES (blue dots) seems to follow the set conditions (red lines). The length of a step represents the number of valid SMILES for that setpoint out of 256 sampled SMILES strings. Low molecular weight or high QED setpoints lead to unstable generation of valid SMILES for the given condition. QED displays the largest deviations from the seed conditions and is the hardest property to control as the formula contains a weighted sum of the other five properties. The area annotated by arrows refers to an input combination with a high QED target that caused the output to collapse with respect to the rate of valid SMILES and the fulfillment of the specified conditions. The exact percentage of unique molecules stemming from all valid SMILES sampled at each step is shown in Supplementary Fig. 12.

Supplementary information

Likelihood of sampling of canonical SMILES and Figs. 1–12.

Kotsias, P.-C., Arús-Pous, J., Chen, H. et al. Direct steering of de novo molecular generation with descriptor conditional recurrent neural networks. Nat Mach Intell 2, 254–265 (2020).

