Language models can learn complex molecular distributions

Deep generative models of molecules have grown immensely in popularity; trained on relevant datasets, these models are used to search through chemical space. The downstream utility of generative models for the inverse design of novel functional compounds depends on their ability to learn a training distribution of molecules. The simplest example is a language model that takes the form of a recurrent neural network and generates molecules using a string representation. Since their initial use, subsequent work has shown that language models are very capable; in particular, recent research has demonstrated their utility in the low-data regime. In this work, we investigate the capacity of simple language models to learn more complex distributions of molecules. For this purpose, we introduce several challenging generative modeling tasks by compiling larger, more complex distributions of molecules, and we evaluate the ability of language models on each task. The results demonstrate that language models are powerful generative models, capable of adeptly learning complex molecular distributions. Language models can accurately generate distributions of the highest-scoring penalized LogP molecules in ZINC15, multi-modal molecular distributions, and the largest molecules in PubChem. The results highlight the limitations of some of the most popular and recent graph generative models, many of which cannot scale to these molecular distributions.

The efficient exploration of chemical space is one of the most important objectives in all of science, with numerous applications in therapeutics and materials discovery. However, exploration efforts have only probed a very small subset of the synthetically accessible chemical space [1]; developing new tools is therefore essential. It is possible that the rise of artificial intelligence will provide the methods to unlock the mysteries of the chemical universe, given its success in other challenging scientific questions such as protein structure prediction [2].
Very recently, deep generative models have emerged as one of the most promising tools for this immense challenge [3]. These models are trained on relevant subsets of chemical space and can generate novel molecules similar to their training data. Their ability to learn the training distribution and generate valid, similar molecules is important for success in downstream applications such as the inverse design of functional compounds.
The first models involved re-purposing recurrent neural networks (RNNs) [4] to generate molecules as SMILES strings [5]. These language models can be used to generate molecular libraries for drug discovery [6] or built into variational autoencoders (VAEs) [3,7], where Bayesian optimization can be used to search through the model's latent space for drug-like molecules. However, there are problems with language models that make it challenging to train models capable of generating valid molecules. The largest issue is the brittleness of the SMILES string representation that is widely used in these models. If a single character is misplaced or erroneously generated, the produced SMILES string will correspond to an invalid molecule. This makes it difficult to train and apply language models for practical applications; in spite of this, researchers have attempted to use them for ligand-based de novo design [8].
Other models that use graph representations do not have this issue and can be directly constrained to enforce valency restrictions. These graph generative models sequentially construct molecules as a series of probabilistic decisions [9][10][11][12][13][14] using graph neural networks [15,16]. These models have been shown to be more capable at modeling distributions and useful in the inverse design of molecules with specific properties [11,12]. Most importantly, they will always generate valid molecules, in contrast to SMILES language models and other models that generate whole molecules in one shot [17][18][19][20].
Recently, more robust molecular string representations have been proposed [21][22][23]. In particular, self-referencing embedded strings (SELFIES) [24] distinguish themselves because every SELFIES string corresponds to a valid molecule. Additionally, with improved training methods, language models have achieved promising results [10,25], particularly in the low-data regime [26,27]. Given these developments, in order to test the ability of language models, we formulate a series of difficult generative modeling tasks by constructing training sets of molecules that are especially challenging to learn. We train language models on all tasks and provide baseline comparisons with graph models. The results demonstrate that language models are powerful generative models and can learn very complex molecular distributions that some popular graph models are less able, or completely unable, to learn.
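The robustness difference between SMILES and SELFIES can be illustrated with the open-source `selfies` Python package. The snippet below is a minimal sketch, assuming `rdkit` and `selfies` are installed; the specific molecule and the token perturbation are illustrative examples, not experiments from this work.

```python
import selfies as sf
from rdkit import Chem

# Round-trip a SMILES string through SELFIES.
smiles = "CC(=O)Oc1ccccc1C(=O)O"                 # aspirin
selfies_str = sf.encoder(smiles)                 # tokenized SELFIES string
decoded = sf.decoder(selfies_str)                # back to a SMILES string
assert Chem.MolFromSmiles(decoded) is not None   # decodes to a valid molecule

# Corrupting a SMILES string easily yields an invalid molecule ...
assert Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O") is None
# ... whereas any sequence of SELFIES tokens still decodes to some valid molecule.
tokens = list(sf.split_selfies(selfies_str))
shuffled = "".join(reversed(tokens))
assert Chem.MolFromSmiles(sf.decoder(shuffled)) is not None
```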

I. RESULTS
We define three tasks, which involve generating: 1) distributions of molecules with high penalized LogP scores [3] (FIG. 1 a,d), 2) multi-modal distributions of molecules (FIG. 1 b,e), and 3) the largest molecules in PubChem (FIG. 1 c,f). Each generative modeling task is defined by learning to generate from the distribution of molecules in a dataset. We build three datasets using relevant subsets of larger databases.
For each task we assess performance by plotting the distribution of the training molecules' properties together with the distributions learned by the language models and graph models. We plot a histogram of the training molecules and fit a Gaussian kernel density estimate (KDE) to it by tuning its bandwidth parameter. We then plot KDEs of the molecular properties of the molecules generated by each model, using the same bandwidth parameter.
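The following sketch shows one way this plotting protocol could be implemented with `scipy` and `matplotlib`, assuming the property values have already been computed into arrays; the bandwidth value is a hypothetical placeholder for the tuned parameter, and note that `scipy` treats it as a scaling factor rather than an absolute bandwidth.

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

def plot_property_distributions(train_vals, model_vals, bandwidth=0.25, name="penalized LogP"):
    """Histogram of the training property values plus KDEs for every model,
    all fit with the same hand-tuned bandwidth so the curves are comparable.
    `model_vals` maps a model name to an array of property values."""
    train_vals = np.asarray(train_vals)
    grid = np.linspace(train_vals.min(), train_vals.max(), 500)
    plt.hist(train_vals, bins=60, density=True, alpha=0.3, label="TRAIN")
    # Fit the KDE to the training data with the chosen bandwidth factor.
    train_kde = gaussian_kde(train_vals, bw_method=bandwidth)
    plt.plot(grid, train_kde(grid), label="TRAIN KDE")
    # Reuse the same bandwidth for every model's generated molecules.
    for model_name, vals in model_vals.items():
        kde = gaussian_kde(np.asarray(vals), bw_method=bandwidth)
        plt.plot(grid, kde(grid), label=model_name)
    plt.xlabel(name)
    plt.ylabel("density")
    plt.legend()
```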
From each model we initially generate 10K (ten thousand) molecules, compute their properties, and use them to produce all plots and metrics. Furthermore, for a fair comparison of learned distributions, we use the same number of generated molecules from all models after removing duplicates and training molecules.
For quantitative evaluation of a model's ability to learn its training distribution, we compute the Wasserstein distance between the property values of generated molecules and training molecules. We also compute the Wasserstein distance between different samples of training molecules in order to establish an optimal baseline, which serves as an oracle for comparison.
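A minimal sketch of this evaluation using `scipy.stats.wasserstein_distance` is shown below; the oracle is estimated here by comparing random halves of the training property values, which is one reasonable reading of "different samples of training molecules" rather than the exact protocol used in this work.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def property_wasserstein(train_vals, generated_vals):
    """1D Wasserstein distance between a property computed on training
    and generated molecules (smaller is better)."""
    return wasserstein_distance(np.asarray(train_vals), np.asarray(generated_vals))

def oracle_baseline(train_vals, n_splits=10, seed=0):
    """Estimate the best achievable distance by comparing disjoint random
    halves of the training data with each other."""
    train_vals = np.asarray(train_vals)
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(n_splits):
        idx = rng.permutation(len(train_vals))
        half = len(train_vals) // 2
        dists.append(wasserstein_distance(train_vals[idx[:half]], train_vals[idx[half:]]))
    return float(np.mean(dists))
```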
For models, our main consideration is a chemical language model using a recurrent neural network with long short-term memory (LSTM) [34], trained on either SMILES (SM-RNN) or SELFIES (SF-RNN). We also train two state-of-the-art graph generative models: the junction tree variational autoencoder (JTVAE) [11] and the constrained graph variational autoencoder (CGVAE) [10]. Each model is trained on a different molecular representation: SMILES or SELFIES strings, or trees and graphs (FIG. 1 g).
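For concreteness, a minimal PyTorch sketch of the kind of LSTM language model used for the SM-RNN and SF-RNN is given below; the layer sizes are placeholders (the per-task values appear in Methods), and training with teacher forcing and a cross-entropy loss on the next-token logits is omitted.

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    """Minimal autoregressive language model over SMILES or SELFIES tokens."""

    def __init__(self, vocab_size, hidden_size=512, num_layers=2, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers,
                            batch_first=True, dropout=dropout)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) integer token ids
        h, state = self.lstm(self.embed(tokens), state)
        return self.out(h), state  # next-token logits at every position

    @torch.no_grad()
    def sample(self, start_id, end_id, max_len=200, device="cpu"):
        """Generate one token sequence by sampling from the model."""
        token = torch.tensor([[start_id]], device=device)
        state, out = None, []
        for _ in range(max_len):
            logits, state = self.forward(token, state)
            probs = torch.softmax(logits[0, -1], dim=-1)
            token = torch.multinomial(probs, 1).view(1, 1)
            if token.item() == end_id:
                break
            out.append(token.item())
        return out
```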

A. Penalized LogP Task
For the first task, we consider one of the most widely used benchmarks for searching chemical space, the penalized LogP task: finding molecules with high LogP [35] penalized by synthesizability [30] and unrealistic rings. We turn this task into a generative modeling one, where the goal is to learn distributions of molecules with high penalized LogP scores. Finding a single molecule with a good score (above 3.0) can be challenging, but learning to directly generate from this part of chemical space, so that every molecule produced by the model has high penalized LogP, adds another layer of complexity to the standard task. For this we build a training dataset by screening the ZINC15 database [36] for molecules with penalized LogP values exceeding 4.0. Most machine learning approaches only find a handful of molecules in this range; for example, JTVAE [11] found 22 in total over all of its attempts. After screening, the top-scoring molecules in ZINC amounted to roughly 160K molecules, which form the training data for this task. The training distribution is therefore extremely spiked, with most density falling around 4.0-4.5 penalized LogP, as seen in FIG. 1 a, and most training molecules resembling the examples shown in FIG. 1 d. However, some of the training molecules, around 10%, have even higher penalized LogP scores, adding a subtle tail to the distribution.
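A sketch of how this screening step could be implemented with RDKit is shown below. The SA scorer is the one shipped in RDKit's Contrib directory, and the penalized LogP is the commonly used unnormalized form (LogP minus SA score minus a penalty for rings larger than six atoms); whether and how the score is normalized varies between papers and is an assumption here.

```python
import os
import sys
from rdkit import Chem, RDConfig
from rdkit.Chem import Descriptors

# The SA score implementation ships with RDKit under Contrib/SA_Score.
sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer

def penalized_logp(mol):
    """Unnormalized penalized LogP: octanol-water LogP minus SA score
    minus a penalty for rings larger than six atoms (a common convention)."""
    log_p = Descriptors.MolLogP(mol)
    sa = sascorer.calculateScore(mol)
    largest_ring = max((len(r) for r in mol.GetRingInfo().AtomRings()), default=0)
    ring_penalty = max(largest_ring - 6, 0)
    return log_p - sa - ring_penalty

def screen(smiles_iter, threshold=4.0):
    """Yield SMILES whose penalized LogP exceeds the threshold."""
    for smi in smiles_iter:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None and penalized_logp(mol) > threshold:
            yield smi
```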
The results of training all models are shown in FIG. 2 and FIG. 3. The language models perform significantly better than the graph models, and overall they learn distributions of molecules with high penalized LogP scores more faithfully than the graph models.

B. Multi-distribution Task
For the next task, we created a dataset by combining subsets of: 1) GDB13 [37] molecules with molecular weight (MW) ≤ 185, 2) ZINC [3,36] molecules with 185 ≤ MW ≤ 425, 3) Harvard Clean Energy Project (CEP) [38] molecules with 460 ≤ MW ≤ 600, and 4) molecules from the POLYMERS dataset [39].

In the multi-distribution task, both RNN models capture the data distribution quite well and learn every mode in the training distribution (FIG. 4 a), with comparable quality. On the other hand, JTVAE entirely misses the first mode from GDB13 and poorly learns ZINC and CEP, while CGVAE learns GDB13 but underestimates ZINC and entirely misses the mode from CEP. Further evidence that the RNN models learn the training distribution more closely appears in FIG. 4 c, where CGVAE and JTVAE barely distinguish the main modes. Additionally, the RNN models generate molecules that better resemble the training data (supplementary FIG. II). Despite this, all models except CGVAE capture the training distributions of QED, SA score and Bertz complexity (FIG. 4 b). Lastly, in TABLE II the RNN trained on SMILES has the lowest Wasserstein metrics, followed by the SELFIES RNN, then JTVAE and CGVAE.
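A sketch of how the molecular-weight screening behind this dataset could be implemented with RDKit follows; the function names are hypothetical, and the POLYMERS subset is passed through unchanged because its weight range is not restated here.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def mw_filter(smiles_list, low, high):
    """Keep SMILES whose molecular weight lies in [low, high]."""
    kept = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None and low <= Descriptors.MolWt(mol) <= high:
            kept.append(smi)
    return kept

def build_multi_distribution_set(gdb13, zinc, cep, polymers):
    """Concatenate MW-screened subsets of the four source databases.
    The POLYMERS subset is added as-is since its MW range is not given here."""
    return (mw_filter(gdb13, 0, 185)
            + mw_filter(zinc, 185, 425)
            + mw_filter(cep, 460, 600)
            + list(polymers))
```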

C. Large Scale Task
The last generative modeling task tests the ability of deep generative models to learn large molecules: the largest molecules relevant to molecular generative models that use SMILES/SELFIES string representations or graphs. For this we turn to PubChem [40] and screen for the largest molecules, those with more than 100 heavy atoms, producing 300K molecules. These are molecules of various kinds: small bio-molecules, photovoltaics and others (FIG. 1 f). They also span a wide range of molecular weights, from 1500 to 5000, though most lie at the lower end of this range.

This task was the most challenging for the graph models; both failed to train and were entirely incapable of learning the training data. In particular, JTVAE's tree decomposition algorithm applied to the training data produced a fixed vocabulary of ∼11,000 substructures. However, both RNN models were able to learn to generate molecules as large and as varied as the training data. The training molecules correspond to very long SMILES and SELFIES string representations; in this case, SELFIES provided an additional advantage, and the SELFIES RNN could match the data distribution more closely (FIG. 5 a). In particular, learning to generate valid molecules is substantially more difficult with the SMILES grammar, as there are many more characters to generate for these molecules and a higher probability that the model will make a mistake and produce an invalid string. In contrast, a generated SELFIES string will never be invalid. Interestingly, even when the RNN models generated molecules that were out of distribution and substantially smaller than the training molecules, they still had similar substructures and resemblance to the training molecules (FIG. 5 a). In addition, the training molecules appear to be divided into two modes of molecules with lower and higher LogP values (FIG. 5 b), with biomolecules defining the lower mode and molecules with more rings and longer carbon chains defining the higher LogP mode (more example molecules can be seen in supplementary FIG. S7). The RNN models were both able to learn the bi-modal nature of the training distribution.
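For reference, the PubChem screening step described at the start of this task can be sketched with RDKit as follows; the function name and threshold argument are illustrative.

```python
from rdkit import Chem

def large_molecule_screen(smiles_iter, min_heavy_atoms=100):
    """Yield SMILES of molecules with more than `min_heavy_atoms` heavy atoms."""
    for smi in smiles_iter:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None and mol.GetNumHeavyAtoms() > min_heavy_atoms:
            yield smi
```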
The training data contains a wide variety of molecules and substructures; as shown in FIG. 6 a, the RNN models adequately learn the distribution of substructures arising in the training molecules, specifically the per-molecule distributions of the number of fragments, single- and double-atom fragments, and single- and multi-ring fragments. Even as the training molecules get larger and occur less frequently, both RNN models still learn to generate them (FIG. 5 a, molecular weight > 3000). To demonstrate the size of the molecules being generated, a few examples from both RNN models and the training data are shown, taken from the main mode of the distribution (FIG. 6 b) and from the largest training molecules with molecular weight greater than 4000 (FIG. 6 c).
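One plausible way to compute the per-molecule fragment counts shown in FIG. 6 a is sketched below, reading "fragments" as the disconnected components returned by RDKit's `Chem.GetMolFrags`; this interpretation is an assumption, not the paper's own implementation.

```python
from rdkit import Chem

def fragment_stats(mol):
    """Per-molecule fragment counts, treating fragments as the disconnected
    components of the molecule (an assumed definition)."""
    frags = Chem.GetMolFrags(mol, asMols=True)
    stats = {"fragments": len(frags), "single_atom": 0, "double_atom": 0,
             "single_ring": 0, "multi_ring": 0}
    for frag in frags:
        n_atoms = frag.GetNumHeavyAtoms()
        n_rings = frag.GetRingInfo().NumRings()
        if n_atoms == 1:
            stats["single_atom"] += 1
        elif n_atoms == 2:
            stats["double_atom"] += 1
        if n_rings == 1:
            stats["single_ring"] += 1
        elif n_rings > 1:
            stats["multi_ring"] += 1
    return stats
```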

D. Metrics
We also evaluate the models on standard metrics from the literature: validity, uniqueness and novelty. Using the same 10K molecules generated by each model for each task, we compute the following statistics, defined in [17], and report them in TABLE III: 1) validity: the ratio between the number of valid molecules and the number of generated molecules, 2) uniqueness: the ratio between the number of unique molecules (that are not duplicates) and the number of valid molecules, 3) novelty: the ratio between the number of unique molecules that are not in the training data and the total number of unique molecules. In the first two tasks (TABLE III), JTVAE and CGVAE have better metrics, with very high validity, uniqueness and novelty (all close to 1); here the SMILES and SELFIES RNNs perform worse, although the SELFIES RNN is close to their performance. The SMILES RNN has the worst metrics due to its brittle grammar, but it is not substantially worse than the other models.
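These three ratios can be computed directly from canonical SMILES, as in the sketch below; it assumes the generated and training molecules are available as SMILES strings and that the training SMILES are all valid.

```python
from rdkit import Chem

def standard_metrics(generated_smiles, train_smiles):
    """Validity, uniqueness and novelty as defined in the text."""
    canon = []
    for smi in generated_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            canon.append(Chem.MolToSmiles(mol))  # canonical form of each valid molecule
    validity = len(canon) / len(generated_smiles)
    unique = set(canon)
    uniqueness = len(unique) / len(canon) if canon else 0.0
    # Assumes the training SMILES parse; canonicalize them for a fair comparison.
    train_canon = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in train_smiles}
    novelty = len(unique - train_canon) / len(unique) if unique else 0.0
    return {"validity": validity, "uniqueness": uniqueness, "novelty": novelty}
```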

II. DISCUSSION
In this work, in an effort to test the ability of chemical language models, we introduce three complex generative modeling tasks for deep generative models of molecules. Language models and graph baselines perform each task, which entails learning to generate molecules from a dataset that is challenging to learn. The results demonstrate that language models are very powerful, flexible models that can learn a variety of very different complex distributions, while the popular graph baselines are much less capable.

SELFIES improves performance and decreases memorization. While the SMILES RNN still achieves impressive performance in all tasks, we notice that the SELFIES representation seems to improve the performance of language models in every task, particularly in the large scale task, and consistently yields better standard metrics of validity, uniqueness and novelty. Additionally, the SELFIES grammar may be more difficult to overfit to, which is the likely explanation for why the SELFIES RNN has better standard metrics but worse Wasserstein metrics than the SMILES RNN. The SMILES RNN, with lower novelty scores and Wasserstein metrics that are very close to the TRAIN oracle, may be memorizing more of the training data than the SELFIES RNN. In future work, it would be valuable to ascertain whether one should ever use SMILES over SELFIES.

Graph generative models are not flexible. The results show that the most popular graph generative models, JTVAE and CGVAE, had trouble learning the distributions in each task. For the penalized LogP task, the difference between a molecule that scores 2 and one that scores 4 can often be very subtle; sometimes changing a single carbon or other atom can cause a large drop in score, which likely explains why CGVAE severely misfit the main training mode. For the multi-distribution task, JTVAE and CGVAE's difficulties are clear but understandable. JTVAE has to learn a wide range of tree types, many of which have no large substructures like rings (the GDB13 molecules) while others consist almost entirely of rings (some of the CEP and POLYMERS molecules). CGVAE has to learn a wide range of very different generation traces, which is difficult, especially since it only uses one sampled trace during learning. For the same reasons, these models were incapable of training on the largest molecules in PubChem.
Based on the experiments conducted, language models are very powerful generative models for learning complex molecular distributions and should see even more widespread use. However, there is still room to improve these models, especially in combating over-fitting, as our results give evidence that language models can suffer from it. In addition, we hope that the molecular modeling tasks and datasets introduced here can motivate new generative models that address the difficulties the graph baselines experienced. Future work will explore how capable chemical language models are in learning larger and larger snapshots of chemical space.

III. METHODS
Model details and hyper-parameters: For each task we give the model and training hyper-parameters of the best model found during hyper-parameter optimization; for all models we used random search over 100 randomly sampled hyper-parameter sets.

Penalized LogP Task. For the SM-RNN we used an LSTM with 2 hidden layers of 400 units, dropout in the last layer with prob = 0.2, and a learning rate of 0.0001. For the SF-RNN we used an LSTM with 2 hidden layers of 600 units, dropout in the last layer with prob = 0.4, and a learning rate of 0.0002. The CGVAE used 8 propagation layers and a hidden layer size of 100, with the KL term annealed to 0.1 and a learning rate of 0.0015. The JTVAE used a learning rate of 0.001 and 3 GNN layers with a hidden size of 356.
Multi-Distribution Task. For the SM-RNN we used an LSTM with 3 hidden layers of 512 units, dropout in the last layer with prob = 0.5, and a learning rate of 0.0001. For the SF-RNN we used an LSTM with 2 hidden layers of 500 units, dropout in the last layer with prob = 0.2, and a learning rate of 0.0003. The CGVAE used 8 propagation layers and a hidden layer size of 100, with the KL term annealed to 0.1 and a learning rate of 0.001. The JTVAE used a learning rate of 0.0001 and 3 GNN layers with a hidden size of 356.
Large-Scale Task. For the SM-RNN we used an LSTM with 2 hidden layers of 512 units, dropout in the last layer with prob = 0.25, and a learning rate of 0.001. For the SF-RNN we used an LSTM with 2 hidden layers of 800 units, dropout in the last layer with prob = 0.4, and a learning rate of 0.0001.

FIG. 1. a-c The molecular distributions defining the three complex molecular generative modeling tasks. a The distribution of penalized LogP vs SA score from the training data in the penalized LogP task. b The four modes of differently weighted molecules in the training data of the multi-distribution task. c The large scale task's molecular weight training distribution. d-f Examples of molecules plotted using RDKit [28] from the training data in each of the generative modeling tasks: d the penalized LogP task, e the multi-distribution task, f the large scale task. g The main models considered and the representations they learn on: 1) SM-RNN trained on SMILES strings, 2) SF-RNN trained on SELFIES, 3) JTVAE using a graph and tree representation, and 4) CGVAE using a graph representation.

FIG. 2. Penalized LogP Task. a The distributions of the penalized LogP scores of molecules from the training data (TRAIN), the SM-RNN trained on SMILES, the SF-RNN trained on SELFIES, and the graph models CGVAE and JTVAE. For the graph models we display 3 molecules from the out-of-distribution mode with penalized LogP score ∈ [1.5, 2.5], as well as 3 molecules with penalized LogP score in the main mode [4.0, 4.5] from all models. b Distribution plots for all models and training data of the molecular properties QED, LogP and SA score.
FIG. 3. Penalized LogP Task. a 2D histograms of penalized LogP and SA score for molecules generated by the models or from the training data that have penalized LogP ≥ 6.0. b Histograms of penalized LogP, atom count, ring count and length of the largest carbon chain (all per molecule) for molecules generated by all models or from the training data that have penalized LogP ≥ 6.0. c 10 molecules generated by the models or from the training data that have penalized LogP ≥ 6.0.

FIG. 4. Multi-distribution Task. a The histogram and KDE of the molecular weight of training molecules along with KDEs of the molecular weight of molecules generated by all models. Three training molecules from each mode are shown. b The histogram and KDE of QED, LogP and SA scores of training molecules along with KDEs of molecules generated by all models. c 2D histograms of molecular weight and SA score of training molecules and molecules generated by all models.

FIG. 5. Large Scale Task. a The histogram and KDE of the molecular weight of training molecules along with the KDEs of the molecular weight of molecules generated by the RNNs. Two molecules generated by the RNNs with lower molecular weight than most training molecules are shown on the left of the plot. In addition, two training molecules from each of three different regions of the molecular weight distribution are displayed. b The histogram and KDE of LogP of training molecules along with the KDEs of LogP of molecules generated by the RNNs. On either side of the plot, for each mode in the LogP distribution, we display two molecules from the training data and molecules generated by the RNNs.

FIG. 6. Large Scale Task. a Histograms of the number of fragments, single-atom fragments, double-atom fragments, single-ring fragments and multi-ring fragments (all per molecule) for molecules generated by the RNN models or from the training data. b Five molecules generated by the RNN models or from the training data with 1500 ≤ molecular weight ≤ 2000. c Four molecules generated by the RNN models or from the training data with molecular weight ≥ 4000.

FIG. S1. Penalized LogP Task. a-e Training molecules with different properties.

FIG. S2. Penalized LogP Task. a-b Generated molecules with penalized LogP < 4 from the graph generative models.

FIG. S7. Large Scale Task. a-b Training molecules from each LogP mode.

TABLE I. Penalized LogP Task. Wasserstein metrics for LogP, SA, QED, MW, BT and NP between molecules from the training data and molecules generated by the models. TRAIN is an oracle baseline; values closer to it are better.
and fewer rings (FIG. 3 b); the RNNs are capable of picking up on this. This is very apparent in the samples shown from the baselines and training data in FIG. 3 c: the RNNs produce similar molecules, while CGVAE and JTVAE generate molecules with many rings.

TABLE II. Multi-distribution Task. Wasserstein metrics of LogP, SA, QED, MW, BCT and NP for training and model-generated molecules. TRAIN is an oracle baseline.

TABLE III. Standard Metrics. Validity, uniqueness and novelty of molecules generated by all models in every task. Closer to 1.0 indicates better performance.