Main

Over the past century, more than 100 million small molecules have been synthesized in the search for new drugs and materials1. These efforts have explored only an infinitesimal subset of chemical space, the size of which is estimated at over 10^60 molecules2. Yet, remarkably and often serendipitously, this limited exploration of chemical space has led to the discovery of numerous molecules that can modulate biological processes. That our extremely limited exploration of chemical space has already yielded so many medically or industrially valuable compounds suggests that more efficient approaches to chemical space exploration could help address many of the most pressing challenges facing humanity.

Chemical space is so large that its exhaustive enumeration is essentially impossible. Instead, searches for bioactive molecules generally focus on particular subsets of chemical space3,4. Historically, these subsets were defined primarily by rule-based approaches, in which new molecules were generated by iterative application of predefined chemical transformations to a ‘starter’ population5,6,7,8,9,10,11,12. More recently, generative models based on deep neural networks have emerged as a powerful framework for chemical space exploration13,14,15,16. Given a set of molecules as input, these models are able to learn the chemistries implicitly embedded within this training set, and then leverage this understanding to sample unseen molecules from the same areas of chemical space.

Initial demonstrations that deep generative models could design novel molecules with desired physicochemical or biological properties17,18,19,20,21,22,23,24 have triggered the development of myriad approaches to molecule generation. These methods differ not only in the architectures of the underlying neural networks, but also in the conceptual frameworks they use to represent molecules: for instance, as chemical graphs25,26, as combinations of substructures27 or as three-dimensional objects28,29. Thus far, however, these approaches have not consistently surpassed the empirical state of the art established by the earliest deep neural network approaches based on chemical language models30,31. These models represent molecules as strings of text (commonly using the SMILES format32; Fig. 1a), and adapt neural network architectures from the field of natural language processing to learn the statistical properties of these strings and generate new ones.

Fig. 1: Language models that can generate invalid outputs outperform models that cannot.
figure 1

a, Schematic overview of chemical space exploration with chemical language models. Language models are trained on a set of chemical structures represented as strings (for example, in SMILES or SELFIES formats). Sampling new strings from the trained model enables generation of novel molecules from the same chemical space as the training set. b, Illustration of invalid SMILES. A single character substitution in the SMILES string for caffeine, top, creates a syntactically invalid SMILES string that does not correspond to any chemical structure, bottom. c, Experimental framework to benchmark language models trained on SMILES versus SELFIES. CLM, chemical language model. d, Proportion of valid molecules generated by language models trained on SMILES versus SELFIES representations (n = 10 each; P = 2.0 × 10^-10, paired t-test). e, Fréchet ChemNet distance between generated and training molecules for language models trained on SMILES versus SELFIES representations (lower is better; n = 10 each; P = 1.1 × 10^-9, paired t-test). f, Relationship between the proportion of valid SMILES generated by chemical language models, and the difference in Fréchet ChemNet distance (FCD) between each model and an equivalent model trained on SELFIES representations of the same training set. Inset text shows the Pearson correlation coefficient and P value. The line and shaded area show linear regression and 95% confidence interval, respectively.

Like a human language, the SMILES syntax imposes strict rules on which strings are syntactically valid. This means that chemical language models can generate SMILES that do not correspond to any valid chemical structure (Fig. 1b). The generation of invalid SMILES is widely perceived to be an important shortcoming of chemical language models. This perception has motivated an enormous amount of work to encourage the generation of valid molecules, whether by developing alternative textual representations of molecules33,34,35, developing methods that generate valid SMILES by design36,37, or developing methods to correct invalid SMILES post hoc38,39,40. The generation of invalid SMILES is also frequently cited as a motivation to eschew the language modelling framework and develop models that generate chemical graphs directly25,27,41,42,43,44,45,46,47, and is used in benchmark suites to quantify the performance of generative models48,49.

That the generation of invalid SMILES is so widely perceived to be a limitation of chemical language models might be seen as surprising. Removing invalid SMILES from the output of a chemical language model is a simple post hoc processing step that does not carry substantial computational cost. Moreover, despite the assumption that generating invalid SMILES is a shortcoming, several benchmarks have identified that language models trained on SMILES outperform those trained on SELFIES (SELF-referencIng Embedded Strings)34, a textual representation that produces 100% valid output by design, as well as models that generate chemical graphs directly50,51,52. These observations raise the possibility that the ability to generate invalid SMILES is actually a desirable property for a generative model: in other words, that generating invalid SMILES is a feature, not a bug.

In this study, I set out to empirically test the possibility that invalid SMILES are beneficial, rather than harmful, to chemical language models. I show that invalid SMILES are sampled with significantly lower likelihoods than valid SMILES, suggesting that filtering invalid SMILES provides an intrinsic mechanism to identify and remove low-quality samples from the model output. I then exploit the design of the SELFIES language by removing the valency constraints that ensure valid molecule generation and obtain causal evidence that generating invalid outputs improves the performance of chemical language models. I elucidate the mechanism by which imposing valency constraints impairs distribution learning, and show that these constraints bias chemical space exploration towards molecules with specific structural properties and impair generalization to unseen chemical space. Finally, I show that language models can correctly elucidate complex chemical structures from minimal analytical data, and that models capable of generating invalid outputs outperform models that cannot on this task.

Results

Models that generate invalid outputs outperform models that do not

Previous benchmarks suggested that chemical language models trained on SMILES could outperform those trained on SELFIES, a format in which every string corresponds to a valid molecule by design. Specifically, these benchmarks showed that language models trained on SMILES strings generated unseen molecules whose physicochemical properties better matched those of the molecules in the training set50,51. I initially set out to reproduce this observation. I trained chemical language models on random samples of molecules from the ChEMBL database53, providing either SMILES or SELFIES representations of the same molecules as input. The trained models were then used to sample new molecules from the same chemical space as the training set, and model performance was evaluated by calculating metrics that captured the similarity between generated molecules and the training set (Fig. 1c).

As expected, models trained on SELFIES strings produced valid molecules at a rate of 100%, compared to an average of 90.2% for models trained on SMILES (Fig. 1d). Nonetheless, models trained on SMILES generated novel molecules that matched the training set significantly better than models trained on SELFIES, as quantified by the Fréchet ChemNet distance (Fig. 1e). This conclusion was unchanged when using other metrics to quantify performance, such as the Murcko scaffold similarity between the training and generated molecules54, and was recapitulated when integrating multiple metrics into a single measure of model performance using principal component analysis (PCA), as previously described50 (Extended Data Fig. 1a–d).

The superior performance of models trained on SMILES was robust both to the data used to train the chemical language models, and to the architecture of the models themselves. I reproduced this result when (1) training models on smaller or larger samples of molecules from ChEMBL; (2) training models on molecules from a different chemical database, GDB-13 (ref. 55); (3) training models on more or less chemically diverse training sets; (4) performing data augmentation by SMILES or SELFIES enumeration56,57; or (5) using a language model based on the transformer architecture58 instead of one based on long short-term memory (LSTM) networks (Extended Data Fig. 1e–s).

Language models trained on SELFIES typically generated novel molecules at a higher rate than models trained on SMILES, but models trained on either representation were able to achieve a very high rate of novelty (>99%) except when deliberately constructing training sets with a low degree of chemical diversity (Extended Data Fig. 2).

Together, these results demonstrate that language models trained on SMILES robustly outperformed those trained on SELFIES. Moreover, across all models tested, I found that the magnitude of this difference in performance was strongly and negatively correlated with the proportion of valid SMILES (Fig. 1f): in other words, models trained on SMILES performed proportionately better when generating more invalid outputs.

Invalid SMILES are low-likelihood samples

These findings expose an apparent contradiction. The presence of invalid outputs is widely perceived to be a central shortcoming of generative models based on SMILES strings. However, models that can generate invalid outputs robustly outperformed models that—by design—can only generate valid outputs.

I sought to identify the mechanisms underlying this contradiction. One potential explanation is that invalid SMILES represent low-likelihood samples from the language model. Removing invalid SMILES would, therefore, function as a mechanism to filter low-quality samples from the model output. Notably, this hypothesis is consistent with the observed anticorrelation between invalid SMILES generation and model performance (Fig. 1f): filtering out a larger number of low-quality samples would be expected to result in proportionately better performance.

If this hypothesis were correct, one would expect that invalid SMILES are sampled with larger losses than valid SMILES from the same model. This was, indeed, found to be the case (Fig. 2a,b). Moreover, this difference was not limited to a single subtype of invalid SMILES: all major categories of invalid SMILES38 were sampled with higher losses than their valid counterparts (Fig. 2c–e). These differences were mediated, in part, by the increased lengths of invalid SMILES, but persisted when comparing the average losses with which individual tokens were sampled within valid versus invalid SMILES (Extended Data Fig. 3a–d). Conversely, SMILES that were sampled with smaller losses were more likely to be valid (Fig. 2f). These findings were robust to varying the size and composition of the training dataset or the architecture of the language model (Extended Data Fig. 3e–m).

Fig. 2: Invalid SMILES are low-likelihood samples from chemical language models.
figure 2

a, Losses of valid versus invalid SMILES sampled from a representative chemical language model (n = 10^7 SMILES; P < 10^-15, two-sided t-test). b, Effect sizes (Cohen’s d) comparing the losses of valid versus invalid SMILES sampled from n = 10 chemical language models, demonstrating consistent effects (P = 1.5 × 10^-13, one-sample t-test). c, Losses of valid SMILES versus invalid SMILES sampled from a representative chemical language model, classified into six different categories based on RDKit error messages38 (n = 10^7 SMILES; all P < 10^-15, two-sided t-test). d, Effect sizes (Cohen’s d) comparing the losses of valid SMILES versus six different categories of invalid SMILES across n = 10 chemical language models, demonstrating consistent effects (all P ≤ 1.4 × 10^-10, one-sample t-test). e, Frequencies of each invalid SMILES error type, shown as the mean proportion of all generated SMILES across ten chemical language models. f, Proportion of valid SMILES within each decile of loss in samples of 500,000 strings from ten chemical language models (P < 10^-15, two-sided Jonckheere–Terpstra test).

Invalid outputs improve performance

These results establish that invalid SMILES are enriched among low-likelihood samples from chemical language models. This finding suggests that removing invalid SMILES has the effect of filtering low-quality samples from the model output, which in turn would be expected to improve performance on distribution-learning metrics such as the Fréchet ChemNet distance. However, these data provide correlative rather than causal evidence for the notion that the ability to generate (and then discard) invalid outputs improves model performance.

To obtain such evidence, I took advantage of the design of SELFIES themselves. Within the SELFIES library, the generation of chemically valid graphs is enforced by a set of constraints on the valence of each atom: for example, the specification that a carbon atom cannot participate in more than four covalent bonds34,59. These valency constraints provide a natural mechanism to test the relationship between output validity and model performance. I modified the default valency constraints within the SELFIES library to allow pentavalent carbons, a modification I refer to as ‘Texas SELFIES’60. Under these modified constraints, language models trained on SELFIES can generate chemically invalid outputs. Remarkably, however, this relaxation of the constraints significantly improved performance: decoding Texas SELFIES yielded samples of novel molecules that were more similar to the training set than those decoded with the default, chemically valid constraints (Fig. 3a and Extended Data Fig. 4a–d).

Fig. 3: Invalid outputs improve the performance of chemical language models.
figure 3

a, Fréchet ChemNet distance between training and generated molecules for language models trained on SMILES or SELFIES representations, with SELFIES valency constraints modified to allow pentavalent carbons (‘Texas SELFIES’) or removed entirely (‘unconstrained SELFIES’; n = 10 each; both P ≤ 3.0 × 10^-5 compared with default valency constraints, paired t-test). b, Losses of valid versus invalid SELFIES sampled from a representative chemical language model with valency constraints disabled when parsing generated SELFIES (n = 10^7 SELFIES; P < 10^-15, t-test). c, Effect sizes (Cohen’s d) comparing the losses of valid versus invalid SELFIES sampled from n = 10 chemical language models, demonstrating consistent effects (P = 5.6 × 10^-14, two-sided one-sample t-test). d, Relationship between the proportion of valid SELFIES generated with valency constraints disabled, and the difference in Fréchet ChemNet distance when parsing generated SELFIES with or without valency constraints. Inset text shows the Pearson correlation coefficient and P value. The line and shaded area show linear regression and 95% confidence interval, respectively.

Next, I tested the effect of removing valency constraints entirely (‘unconstrained SELFIES’), and found that this further improved performance (Fig. 3a and Extended Data Fig. 4e–h). Invalid SELFIES were sampled with larger losses than their valid counterparts (Fig. 3b,c), corroborating the trends observed for invalid SMILES, and supporting the notion that removing valency constraints provided a mechanism to filter low-quality samples from the model output.

The superior performance of unconstrained SELFIES was robust to variations in the training dataset or model architecture (Extended Data Fig. 4j,l,m,p,r). Moreover, I identified a significant correlation between the proportion of invalid SELFIES generated and the improvement in performance after removing valency constraints (Fig. 3d): in other words, models performed proportionately better when generating more invalid SELFIES.

Whereas removing valency constraints improved the performance of language models trained on SELFIES, these were generally still outperformed by models trained on SMILES, pointing to residual differences in performance as a function of molecular representation.

These results provide causal evidence that allowing chemical language models to produce invalid outputs improves their performance.

Enforcing valid outputs biases chemical space exploration

I sought to clarify the mechanisms by which the ability to produce invalid outputs improved the performance of chemical language models. I hypothesized that these differences in performance reflected differences in the chemical space explored by models trained on SMILES versus SELFIES. To address this possibility, I computed a series of properties for each generated molecule, and compared the resulting property distributions to those of the training set. By far the largest difference between models trained on SMILES versus SELFIES in this analysis involved their propensity to generate cyclic molecules. Molecules generated as SELFIES were markedly depleted for aromatic rings (Fig. 4a,b) and enriched for aliphatic rings (Fig. 4c,d), relative both to the training set and to molecules generated as SMILES. Smaller but statistically significant differences were observed for a range of other structural properties, reflecting pervasive differences in the chemical space explored by generative models trained on SMILES versus SELFIES (Fig. 4e and Extended Data Fig. 5a–t).

Fig. 4: Enforcing valid outputs biases chemical space exploration.
figure 4

a, Number of aromatic rings in generated molecules sampled from representative chemical language models trained on the same molecules in SMILES versus SELFIES format, and in the training set molecules themselves. b, Effect sizes (Cohen’s d) comparing the number of aromatic rings in generated molecules from chemical language models trained on SMILES versus SELFIES to the molecules in the training set (n = 10 each; P = 1.2 × 10^-11, paired t-test). c, As in a, but showing aliphatic rings. d, As in b, but showing aliphatic rings (P = 1.2 × 10^-7, paired t-test). e, Volcano plot showing differences in structural properties between molecules generated as SMILES versus SELFIES (statistical significance versus mean difference in effect size, paired t-test). Dotted line shows P = 0.05. f, Effect sizes (Cohen’s d) comparing the number of aromatic rings in generated molecules from chemical language models to the molecules in the training set, shown separately for valid versus invalid SELFIES when parsing generated SELFIES without valency constraints (n = 10 each; P = 7.6 × 10^-11, paired t-test). g, As in f, but showing aliphatic rings (P = 2.6 × 10^-12, paired t-test). h, Differences in structural properties (mean effect sizes) are correlated between molecules generated as SMILES versus SELFIES (x-axis) and valid versus invalid SELFIES when parsing generated SELFIES without valency constraints (y-axis). Inset text shows the Pearson correlation coefficient and P value. The line and shaded area show linear regression and 95% confidence interval, respectively.

Together, these experiments identified significant differences in the chemical space explored by language models trained on SMILES versus SELFIES. To establish whether a causal relationship existed, I compared the SELFIES that could be successfully parsed without chemical valency constraints to those that required the imposition of these constraints in order to produce a valid chemical graph. This comparison allowed me to directly assess how removing invalid SELFIES influenced the distributions of structural properties among the generated molecules. Remarkably, I observed that the most profound differences between valid and invalid SELFIES again involved their propensity to contain aromatic and aliphatic rings. Invalid SELFIES were significantly depleted for aromatic rings, and enriched for aliphatic rings, relative to both the training set and to valid SELFIES (Fig. 4f,g). Conversely, disabling the valency constraints, and allowing the model to generate invalid SELFIES, reversed the structural differences between molecules generated as SMILES versus SELFIES.

This reversal led me to ask whether other structural differences between molecules generated as SMILES versus SELFIES were also reversed when disabling valency constraints. Indeed, I observed that the differences in structural properties between SMILES and SELFIES were strongly and significantly correlated to those between valid and invalid SELFIES (Fig. 4h and Extended Data Fig. 6a–w). Thus, the structural differences between molecules generated as SMILES versus SELFIES can be attributed at least in part to the correction of invalid outputs.

Together, these experiments expose the mechanism underlying differences in performance between generative models trained on SMILES versus SELFIES. The imposition of valency constraints in SELFIES prevents the generation of invalid outputs, but results in an overrepresentation of aliphatic rings and an underrepresentation of aromatic rings in the resulting molecules. These systematic biases in the chemical composition of the generated molecules are reflected in poor performance on distribution-learning metrics, such as the Fréchet ChemNet distance. Removing valency constraints, and allowing the model to generate invalid outputs, corrects these biases and improves performance.

Structural biases limit generalization

An ideal generative model would sample evenly from the chemical space surrounding the molecules in the training set. The observation of structural biases in the outputs of language models trained on SELFIES is at odds with this goal. I therefore sought to test whether, in addition to introducing biases in the chemical space explored by generative models, the choice of representation would also constrain their capacity for generalization.

To test this hypothesis, I made use of an exhaustively explored chemical space: that of the GDB-13 database, which enumerates all ~975 million drug-like molecules containing up to 13 heavy atoms. Following an experimental design proposed previously, I trained chemical language models on small samples from GDB-13, using either SMILES or SELFIES to represent these molecules61. I then drew samples of 100 million strings from each language model, and calculated the total proportion of GDB-13 that was correctly reproduced within the language model output.

Language models trained on SELFIES generated significantly more valid molecules than those trained on SMILES, as expected (Fig. 5a). However, a substantial fraction of the molecules generated as SELFIES were outside the chemical space defined by the GDB-13 database (Fig. 5b). Consequently, despite generating fewer molecules overall, models trained on SMILES explored a significantly larger proportion of the GDB-13 chemical space than models trained on SELFIES (Fig. 5c,d). That models trained on SELFIES showed a greater propensity to explore outside the chemical space of GDB-13, but a lower coverage of GDB-13 itself, can be rationalized on the basis that models trained on SELFIES show a diminished capacity to generalize from the chemical space of the training set.

Fig. 5: Structural biases limit generalization to unseen chemical space.
figure 5

a, Numbers of valid, novel molecules within samples of 100 million strings from chemical language models trained on a subset of 1 million molecules from the GDB-13 database, represented as SMILES versus SELFIES (n = 10 each; P = 1.9 × 10^-10, paired t-test). b, As in a, but showing the number of sampled molecules outside the chemical space of the GDB-13 database within samples of 100 million strings (n = 10 each; P = 8.5 × 10^-7). c, As in a, but showing the number of molecules from the full GDB-13 database reproduced within samples of 100 million strings (n = 10 each; P = 1.1 × 10^-7). d, Saturation curve showing the proportion of the full GDB-13 database reproduced after sampling a given number of valid molecules from chemical language models trained on SMILES versus SELFIES.

Invalid outputs improve structure elucidation

To explore the implications of these findings further, I applied chemical language models to a task in which efficient navigation of unknown chemical space is of central importance: namely, structure elucidation of complex natural products. Recent work has shown that chemical language models can generate novel molecules that match experimentally measured properties. One particularly exciting observation is that language models can not only generate plausible chemical structures, but even prioritize the most likely ones on the basis of as little experimental data as an accurate mass measurement62. However, thus far this possibility has only been demonstrated for a subset of drug-like molecules, and it remains unclear whether the same approach could be applied to structure elucidation of more complex molecules. In Supplementary Note 1 and Extended Data Figs. 7–9, I show that language models can contribute to the structure elucidation of a range of complex small molecules including natural products, environmental pollutants, and food-derived compounds, and that the ability to generate invalid outputs improves performance on these tasks.

Discarding invalid SMILES is fast and easy

One potential criticism of generating (and then discarding) invalid outputs is that the process of parsing every sample from the model to establish its validity necessarily requires additional computational resources63. However, filtering invalid SMILES is a lightweight post-processing step that does not substantially increase the computational requirements of a chemical language model. Parsing 1 million SMILES can be achieved with the RDKit in an average of 7.5 minutes on a single CPU, and determining the validity of a SMILES string requires just a single line of code (Extended Data Fig. 10).
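As an illustration, a minimal sketch of this check with the RDKit is shown below; the caffeine SMILES and its corrupted variant are included purely as examples of a valid and an invalid string.

```python
# Minimal sketch of post hoc validity filtering with the RDKit: a generated
# SMILES is kept only if it parses to a molecule object.
from rdkit import Chem

def is_valid(smiles: str) -> bool:
    """Return True if the SMILES string parses to a valid RDKit molecule."""
    return Chem.MolFromSmiles(smiles) is not None

sampled = [
    "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",   # caffeine (valid)
    "CN1C=NC2=C1C(=O)N(C)C(=O)N2C1",  # corrupted variant with an unclosed ring (invalid)
]
valid = [s for s in sampled if is_valid(s)]  # retains only the first string
```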

Discussion

That chemical language models trained on SMILES strings can produce invalid outputs is widely (if not universally) perceived to be an important deficiency of these models. This perception has motivated a remarkably broad spectrum of work in the field of chemical artificial intelligence, including the development of alternative molecular representations, mechanisms that encourage generation of valid outputs, approaches to correct invalid outputs post hoc, and models that generate chemical graphs directly. Here I provide direct and causal evidence that the ability to produce invalid outputs is not harmful but is instead beneficial to chemical language models, and elucidate the mechanisms underlying this effect. I show that language models trained on SMILES, a representation that can lead to both syntactically and semantically invalid outputs, outperform models trained on SELFIES, a representation that enforces the generation of valid outputs by design (Fig. 1). Invalid SMILES are sampled with significantly lower likelihoods than valid SMILES, implying that filtering invalid SMILES preferentially removes low-quality samples from the model output (Fig. 2). I leverage the design of the SELFIES representation by removing the valency constraints that enforce valid molecule generation, allowing me to show causally that generating (and then removing) invalid outputs improves language model performance (Fig. 3). I further show that the imposition of valency constraints results in biased exploration of chemical space, reflected in an overrepresentation of aliphatic rings and an underrepresentation of aromatic rings in the generated molecules (Fig. 4), and that these biases in turn impair generalization to unseen chemical space (Fig. 5). Finally, I apply chemical language models to structure elucidation of natural products, and show that (1) language models can develop remarkably accurate hypotheses about unknown chemical structures from minimal analytical data, and (2) models capable of generating invalid outputs significantly outperform models that cannot on this task (Extended Data Fig. 7).

Collectively, these results challenge the often-voiced assumption that invalid SMILES are a problem that must be addressed by developing new computational approaches. They suggest that further efforts to enforce the generation of valid molecules are unlikely to improve model performance. Instead, these results advocate for a more widespread recognition that removing invalid outputs is a simple and computationally efficient post-processing step that does not necessarily reflect a fundamental flaw in the underlying model. More broadly, these results support a redirection of efforts towards improving the performance of generative models of molecules through directions other than maximizing output validity. Indeed, several recent studies have highlighted opportunities to improve molecule generation despite the generation of invalid SMILES64,65,66,67.

That language models trained on SMILES outperformed those trained on SELFIES on distribution-learning metrics does not imply the latter should never be preferred. A number of recent works have presented computational approaches in which the robustness of the SELFIES representation has been a central consideration, including applications to model interpretability68,69 and inverse design70,71. In other words, while I demonstrate that the ability to generate invalid outputs is beneficial to chemical language models in general, there are specific scenarios in which validity is a more important consideration.

I found that removing the valency constraints that enforce valid molecule generation in SELFIES greatly reduced the difference in performance between models trained on SMILES versus SELFIES, but did not abolish it entirely (Fig. 3a). This observation suggests that there are residual differences between the two representations that go beyond the presence of invalid outputs and reflect deeper aspects of how they represent chemical structures. Elucidating the mechanisms underlying these differences will be an important direction for future work.

The application of chemical language models to structure elucidation of natural products indicates that these models can develop remarkably accurate hypotheses about complex and unseen chemical structures from as little data as an accurate mass measurement. Structures proposed by the chemical language model were more similar to the true molecule than those obtained by searching in PubChem, or even by searching in the natural product-like chemical space of the training set itself. This latter observation emphasizes the degree to which the model has learned to extrapolate beyond the training set and into unseen chemical space. Of course, the evaluation scenario here is by design unrealistic, and I do not mean to suggest that it is possible to perform complete structure elucidation of complex natural products from an accurate mass alone (not least because it is theoretically impossible to discriminate between different structures with the same molecular formula using only MS1 information). Instead, my intention in this experiment is to highlight that given minimal analytical data, language models can develop remarkably good hypotheses—sometimes even better than those based on much richer analytical data. This is exciting because, to my knowledge, these structural hypotheses represent a new source of information that is not currently used by any methods for structure elucidation of unknown molecules72. Integrating the novel chemical structures prioritized by chemical language models with computational approaches that leverage MS/MS or retention time information could provide a powerful mechanism to accelerate structure elucidation of unknown molecules in biological systems.

Methods

Datasets

My experiments initially focused on training chemical language models on random samples of molecules from the ChEMBL database53. ChEMBL (version 28) was obtained from ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_28_chemreps.txt.gz. Duplicate SMILES and SMILES that could not be parsed by the RDKit were removed. Salts and solvents were removed by splitting molecules into fragments and retaining only the heaviest fragment containing at least three heavy atoms, using code adapted from the Mol2vec package73. Charged molecules were neutralized using code adapted from the RDKit Cookbook. Molecules with atoms other than Br, C, Cl, F, H, I, N, O, P or S were removed, and molecules were converted to their canonical SMILES representations.
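The sketch below is an illustrative approximation of this preprocessing; it is not the exact pipeline, and the charge-neutralization step adapted from the RDKit Cookbook is omitted for brevity.

```python
# Approximate preprocessing sketch: deduplicate, remove unparseable SMILES,
# keep the heaviest fragment with at least three heavy atoms, restrict to the
# allowed elements and emit canonical SMILES. Charge neutralization is omitted.
from rdkit import Chem
from rdkit.Chem import Descriptors

ALLOWED_ELEMENTS = {"Br", "C", "Cl", "F", "H", "I", "N", "O", "P", "S"}

def keep_heaviest_fragment(mol):
    """Split salts/solvents and keep the heaviest fragment with >= 3 heavy atoms."""
    frags = [f for f in Chem.GetMolFrags(mol, asMols=True) if f.GetNumHeavyAtoms() >= 3]
    return max(frags, key=Descriptors.MolWt) if frags else None

def preprocess(smiles_list):
    seen, out = set(), []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        mol = keep_heaviest_fragment(mol)
        if mol is None:
            continue
        if any(atom.GetSymbol() not in ALLOWED_ELEMENTS for atom in mol.GetAtoms()):
            continue
        canonical = Chem.MolToSmiles(mol)
        if canonical not in seen:  # deduplicate on canonical SMILES
            seen.add(canonical)
            out.append(canonical)
    return out
```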

Random samples of between 30,000 and 300,000 molecules were then drawn from the preprocessed SMILES. In most experiments, molecules were sampled randomly to achieve uniform coverage of ChEMBL chemical space. Separately, the effect of the chemical diversity of the training set on model performance was assessed by sampling training sets of molecules with decreasing chemical diversity50. This was achieved by selecting a ‘seed’ molecule at random from ChEMBL and then computing the Tanimoto coefficient (Tc) between the seed molecule and the remainder of the database. The database was then filtered to retain only molecules with a Tc greater than some target minimum value. A minimum Tc of zero reflects random selection of molecules across the entire database, whereas increasing the minimum Tc selects molecules that are increasingly similar to the seed molecule (that is, decreasing diversity). The Tanimoto coefficient was calculated on Morgan fingerprints74 with a radius of 3, folded to 1,024 bits.
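A sketch of this diversity-constrained sampling, assuming a list of preprocessed canonical SMILES, is shown below; the function and variable names are illustrative.

```python
# Diversity-constrained sampling sketch: choose a random seed molecule, keep
# only molecules whose Tanimoto similarity to the seed (Morgan fingerprints,
# radius 3, folded to 1,024 bits) meets a minimum value, then draw the sample.
import random
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=1024)

def sample_by_diversity(smiles_list, min_tc=0.0, n=100_000, seed=0):
    rng = random.Random(seed)
    seed_fp = morgan_fingerprint(rng.choice(smiles_list))
    eligible = [
        smi for smi in smiles_list
        if DataStructs.TanimotoSimilarity(seed_fp, morgan_fingerprint(smi)) >= min_tc
    ]
    return rng.sample(eligible, min(n, len(eligible)))
```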

Past studies reported that data augmentation by non-canonical SMILES enumeration56 could significantly improve the performance of chemical language models62,75,76. I therefore also tested the effect of SMILES enumeration, which was performed using the SmilesEnumerator class available from http://github.com/EBjerrum/SMILES-enumeration, with augmentation factors of 10× or 30×. SELFIES enumeration was performed by first enumerating non-canonical SMILES and then converting these to SELFIES, which I verified produced multiple SELFIES representing the same molecules. Finally, conversion from SMILES to SELFIES was performed using the SELFIES package (version 2.1.1).
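For illustration, the sketch below performs SMILES and SELFIES enumeration using the RDKit's randomized SMILES output (doRandom=True) in place of the SmilesEnumerator class cited above; this is an assumed equivalent rather than the code actually used.

```python
# Data augmentation sketch: generate multiple non-canonical SMILES per molecule
# and, for SELFIES enumeration, convert each non-canonical SMILES to SELFIES.
from rdkit import Chem
import selfies as sf

def enumerate_smiles(smiles, n_augment=10):
    """Return up to n_augment distinct non-canonical SMILES for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    variants = set()
    for _ in range(100 * n_augment):  # cap attempts for small molecules
        variants.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
        if len(variants) >= n_augment:
            break
    return sorted(variants)

def enumerate_selfies(smiles, n_augment=10):
    """SELFIES enumeration: enumerate non-canonical SMILES, then encode each."""
    return [sf.encoder(s) for s in enumerate_smiles(smiles, n_augment)]
```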

To verify that the results were not specific to the chemical space populated by molecules from ChEMBL, I also trained chemical language models on the GDB-13 database55. The GDB-13 database was obtained from Zenodo (https://doi.org/10.5281/zenodo.5172018) and underwent preprocessing identical to that described above for ChEMBL.

For each set of parameters tested (for example, training dataset size, degree of data augmentation, or input database), ten independent training datasets were created with random seeds in order to evaluate variability in model performance and enable statistical comparisons. In total, I trained 180 models (90 on SMILES and 90 on SELFIES) that differed according to the random seed, the size of the training dataset (30,000, 100,000 or 300,000 molecules), the chemical diversity (Tc ≥ 0.0, 0.05, 0.10, or 0.15), the degree of SMILES/SELFIES augmentation (canonical, 10×, 30×) and the input database (ChEMBL versus GDB-13). Models shown in the main text were trained on 100,000 molecules from ChEMBL, without data augmentation or chemical diversity filters.

Chemical language models

For most experiments, I trained chemical language models based on LSTMs, which have been widely adopted in the field and have been subject to extensive experimental validation. LSTMs were implemented in PyTorch, adapting code from the REINVENT package77. Briefly, each SMILES was converted into a sequence of tokens by splitting the SMILES string into its constituent characters, except for atomic symbols composed of two characters (Br, Cl) and environments within square brackets, such as [nH]. SELFIES were tokenized using the split_selfies function from the selfies package. The vocabulary of the RNN then consisted of all unique tokens detected in the training data, as well as start-of-string and end-of-string characters and a padding token. The architecture of the language models consisted of a three-layer LSTM with a hidden layer of 1,024 dimensions, an embedding layer of 128 dimensions, and a linear decoder layer. Models were trained to minimize the cross-entropy loss of next-token prediction using the Adam optimizer with β1 = 0.9 and β2 = 0.999, with a batch size of 64 and a learning rate of 0.001. Ten percent of the molecules in the training set were reserved as a validation set and used to perform early stopping with a patience of 50,000 minibatches. A total of 500,000 strings were sampled from each trained model after completion of model training.
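A minimal PyTorch sketch of this architecture is shown below; it approximates, rather than reproduces, the REINVENT-derived implementation, and tokenization, padding, validation and early stopping are omitted.

```python
# Sketch of the LSTM chemical language model: 128-dimensional embedding, three
# LSTM layers with 1,024 hidden units, and a linear decoder over the vocabulary,
# trained with cross-entropy next-token prediction and the Adam optimizer.
import torch
import torch.nn as nn

class ChemicalLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=1024, n_layers=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        # tokens: (batch, sequence length) integer token indices
        output, hidden = self.lstm(self.embedding(tokens), hidden)
        return self.decoder(output), hidden

vocab_size = 64  # placeholder; determined by the tokens in the training data
model = ChemicalLSTM(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a placeholder batch of 64 token sequences:
# predict each token from the preceding tokens.
batch = torch.randint(0, vocab_size, (64, 100))
logits, _ = model(batch[:, :-1])
loss = criterion(logits.reshape(-1, vocab_size), batch[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
```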

To confirm that these results were robust to the specific architecture of the chemical language models, I also trained a series of models based on the generative pretrained transformer (GPT) architecture78. Models were implemented in PyTorch, adapting code and hyperparameters from MolGPT79. The architecture consisted of an embedding layer of 256 dimensions, which was concatenated with a learned positional encoding and passed through eight transformer blocks. Each block comprised eight masked self-attention heads and a feed-forward network with a hidden layer of 1,024 dimensions using GELU activation, both preceded by layer normalization. Finally, the outputs of the transformer blocks were passed through a single linear decoder layer with layer normalization. The transformer models were trained as described above for language models based on LSTMs, except that the learning rate was set to 0.0005.
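A compact sketch of this transformer model is shown below. Positional information is added to the token embeddings here, as in standard GPT implementations, whereas the description above refers to concatenation with a learned positional encoding; other details of the MolGPT-derived code may likewise differ.

```python
# Sketch of the GPT-style model: 256-dimensional token and learned positional
# embeddings, eight pre-norm transformer blocks with eight masked self-attention
# heads and a GELU feed-forward layer of 1,024 dimensions, followed by layer
# normalization and a linear decoder.
import torch
import torch.nn as nn

class ChemicalGPT(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=8, n_layers=8,
                 d_ff=1024, max_len=256):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_len, d_model)
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff, activation="gelu",
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.norm = nn.LayerNorm(d_model)
        self.decoder = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, sequence length); a causal mask enforces masked self-attention
        seq_len = tokens.size(1)
        positions = torch.arange(seq_len, device=tokens.device)
        x = self.token_embedding(tokens) + self.position_embedding(positions)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tokens.device),
            diagonal=1)
        x = self.blocks(x, mask=causal_mask)
        return self.decoder(self.norm(x))
```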

Evaluation of model performance

I evaluated the performance of chemical language models trained on SMILES versus SELFIES by drawing samples of 500,000 SMILES or SELFIES strings from the trained models, and then quantifying the similarity between the generated molecules and the molecules in the training set. I used multiple complementary metrics to evaluate similarity. As the primary measure of model performance, I measured the chemical similarity between the generated molecules and the training set as quantified by the Fréchet ChemNet distance80. This metric was previously found to be among the most reliable for evaluating chemical language models50 and is included in multiple benchmark suites48,49. As secondary measures of model performance, I also calculated a series of other metrics that were found to be similarly reliable, including the Jensen–Shannon distances between the generated and training molecules with respect to their Murcko scaffold compositions54, natural product-likeness scores81 and the fraction of atoms in each molecule that are stereocentres. Generated molecules that appeared in the training set were filtered out before calculating any of the above metrics, to ensure that models were not rewarded for simply reproducing the training molecules. I additionally integrated these metrics into a single measure of model performance using PCA, as previously demonstrated in a study in which chemical language models were trained on datasets of between 1,000 and 500,000 molecules50. PCA was shown to recover a first principal component that correlated strongly with the size of the training dataset, and thus also the quality of the learned model, while accounting for the covariance between the underlying evaluation metrics. PCA was performed on the reference dataset collected in the prior study using the ‘CLMeval’ R package (https://github.com/skinnider/CLMeval), and evaluation metrics for models analysed in the present study were projected onto this PC space. Finally, I confirmed that models trained on SELFIES generated 100% valid molecules whereas models trained on SMILES did not. Valid molecules were defined as those that could be parsed by RDKit, using the Chem.MolFromSmiles function.
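As one concrete example of these secondary metrics, the sketch below computes the Jensen–Shannon distance between the Murcko scaffold compositions of the generated and training molecules; the Fréchet ChemNet distance itself is computed with its published implementation and is not reproduced here.

```python
# Jensen-Shannon distance between the Murcko scaffold compositions of two sets
# of valid, canonical SMILES.
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_js_distance(generated_smiles, training_smiles):
    generated = Counter(MurckoScaffold.MurckoScaffoldSmiles(smiles=s) for s in generated_smiles)
    training = Counter(MurckoScaffold.MurckoScaffoldSmiles(smiles=s) for s in training_smiles)
    scaffolds = sorted(set(generated) | set(training))
    p = np.array([generated[s] for s in scaffolds], dtype=float)
    q = np.array([training[s] for s in scaffolds], dtype=float)
    return jensenshannon(p / p.sum(), q / q.sum())
```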

Analysis of invalid SMILES

To investigate the properties of invalid SMILES, I drew samples of 10 million SMILES strings from trained chemical language models, and identified invalid SMILES as those that could not be parsed by RDKit. I then compared the losses with which valid versus invalid SMILES were sampled from the chemical language model by calculating Cohen’s d, as implemented in the R package effsize. Separately, I divided SMILES into ten bins according to their losses and calculated the proportion of valid SMILES in each bin, testing for the presence of a trend with the Jonckheere–Terpstra test as implemented in the R package clinfun. Last, for each invalid SMILES, the parsing errors returned by RDKit were classified into six different error types using code developed as part of the UnCorrupt SMILES approach38 (available from GitHub at https://github.com/LindeSchoenmaker/SMILES-corrector), and the losses for invalid SMILES from each error type were compared to those of valid SMILES with Cohen’s d.
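The core of this analysis can be sketched as follows, assuming per-string losses from the language model are already available; the Cohen's d and decile calculations shown here approximate the R-based analysis described above.

```python
# Compare the losses of valid versus invalid SMILES (Cohen's d) and compute the
# proportion of valid SMILES within each decile of loss.
import numpy as np
from rdkit import Chem

def cohens_d(x, y):
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) /
                        (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / pooled_sd

def loss_analysis(smiles, losses):
    losses = np.asarray(losses, dtype=float)
    valid = np.array([Chem.MolFromSmiles(s) is not None for s in smiles])
    effect_size = cohens_d(losses[~valid], losses[valid])  # invalid versus valid
    deciles = np.digitize(losses, np.quantile(losses, np.linspace(0.1, 0.9, 9)))
    valid_by_decile = [valid[deciles == i].mean() for i in range(10)]
    return effect_size, valid_by_decile
```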

Removal of SELFIES valency constraints

To causally test the relationship between the generation of invalid output and model performance, I leveraged the design of the SELFIES library by modifying the valency constraints that ensure semantically correct output. I initially modified the default constraints (encoded in the _DEFAULT_CONSTRAINTS dictionary) by allowing carbon atoms to participate in five bonds (‘Texas SELFIES’). I then parsed generated SELFIES under this modified set of constraints, removed outputs that could not be parsed by the RDKit, and re-calculated the Fréchet ChemNet distance and other distribution-learning metrics. I also tested the effect of removing valency constraints entirely by setting all values in the _DEFAULT_CONSTRAINTS dictionary to 999 (‘unconstrained SELFIES’), following a previous suggestion59, after which I removed invalid SELFIES and re-calculated the distribution-learning metrics.
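The sketch below illustrates these manipulations through the public set_semantic_constraints API of the selfies package (version 2.x), which is intended to achieve the same effect as editing the _DEFAULT_CONSTRAINTS dictionary directly.

```python
# Decode generated SELFIES under relaxed valency constraints ('Texas SELFIES':
# pentavalent carbon; 'unconstrained SELFIES': all constraints set to 999),
# then discard decoded strings that the RDKit cannot parse.
import selfies as sf
from rdkit import Chem

def decode_with_constraints(selfies_strings, carbon_valence=None, unconstrained=False):
    constraints = sf.get_semantic_constraints()
    if unconstrained:
        constraints = {atom: 999 for atom in constraints}
    elif carbon_valence is not None:
        constraints["C"] = carbon_valence  # for example, 5 for 'Texas SELFIES'
    sf.set_semantic_constraints(constraints)
    decoded = [sf.decoder(s) for s in selfies_strings]
    sf.set_semantic_constraints()  # restore the default constraints
    # Invalid outputs are removed before computing distribution-learning metrics
    return [s for s in decoded if Chem.MolFromSmiles(s) is not None]
```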

Properties of generated molecules

To understand why the generation of invalid outputs improved performance on distribution-learning metrics, I calculated a series of structural properties for molecules generated as SMILES versus SELFIES using the RDKit. I calculated the same properties for molecules generated as SELFIES that could be parsed without valency constraints, versus molecules parsed under the default constraints. The list of properties examined was as follows:

  1. The molecular weight of each molecule.
  2. The computed octanol–water partition coefficient82 of each molecule.
  3. The topological complexity83 of each molecule.
  4. The topological polar surface area84 of each molecule.
  5. The proportion of carbon atoms in each molecule that were sp3 hybridized.
  6. The proportion of rotatable bonds in each molecule.
  7. The proportion of atoms in each molecule that were stereocentres.
  8. The fraction of heteroatoms in each molecule.
  9. The number of aliphatic rings in each molecule.
  10. The number of aromatic rings in each molecule.
  11. The total number of rings in each molecule.
  12. The number of hydrogen donors in each molecule.
  13. The number of hydrogen acceptors in each molecule.

For each of these properties, I compared the distributions of values for generated molecules versus the training set using Cohen’s d. I then tested for statistically significant differences using a paired t-test. Separately, I computed the Pearson correlation between the mean difference in effect sizes in comparisons of (1) molecules generated as SELFIES versus SMILES and (2) molecules generated as valid versus invalid SELFIES.
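These properties can be approximated with standard RDKit descriptors, as sketched below; the specific descriptor choices (for example, BertzCT for topological complexity and MolLogP for the computed octanol–water partition coefficient) are assumptions rather than a record of the exact functions used.

```python
# Approximate the structural properties listed above for a single valid SMILES.
from rdkit import Chem
from rdkit.Chem import Descriptors

def structural_properties(smiles):
    mol = Chem.MolFromSmiles(smiles)
    n_heavy = mol.GetNumAtoms()  # implicit hydrogens are not counted
    stereocentres = len(Chem.FindMolChiralCenters(mol, includeUnassigned=True))
    return {
        "molecular_weight": Descriptors.MolWt(mol),
        "logP": Descriptors.MolLogP(mol),
        "topological_complexity": Descriptors.BertzCT(mol),
        "TPSA": Descriptors.TPSA(mol),
        "fraction_sp3_carbons": Descriptors.FractionCSP3(mol),
        "fraction_rotatable_bonds": Descriptors.NumRotatableBonds(mol) / max(mol.GetNumBonds(), 1),
        "fraction_stereocentres": stereocentres / n_heavy,
        "fraction_heteroatoms": sum(a.GetAtomicNum() not in (1, 6) for a in mol.GetAtoms()) / n_heavy,
        "aliphatic_rings": Descriptors.NumAliphaticRings(mol),
        "aromatic_rings": Descriptors.NumAromaticRings(mol),
        "total_rings": Descriptors.RingCount(mol),
        "hydrogen_bond_donors": Descriptors.NumHDonors(mol),
        "hydrogen_bond_acceptors": Descriptors.NumHAcceptors(mol),
    }
```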

Generalization to unseen chemical space

To evaluate the ability of chemical language models to generalize to unseen chemical space surrounding the training set, I adapted an experimental design proposed previously61. I trained chemical language models on samples of 100,000 molecules from the GDB-13 database, represented either as SMILES or SELFIES, and then drew samples of 100 million strings from each trained model. I then intersected these samples with the remainder of the GDB-13 database by comparison of canonical SMILES. For each language model, I computed (1) the number of novel and unique molecules; (2) the number of generated molecules in GDB-13 (excluding the training set); and (3) the number of generated molecules not in either GDB-13 or the training set. The experiment was repeated ten times with different training sets to gauge variability and enable statistical comparison.
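Given canonical SMILES for the model samples, the training set and the full GDB-13 database, the coverage calculation reduces to set operations, as sketched below.

```python
# Coverage of GDB-13 chemical space by a sample of generated molecules,
# expressed as canonical SMILES.
def gdb13_coverage(sampled_smiles, gdb13_smiles, training_smiles):
    training = set(training_smiles)
    gdb13 = set(gdb13_smiles) - training        # remainder of GDB-13
    novel = set(sampled_smiles) - training      # novel, unique molecules
    reproduced = novel & gdb13                  # molecules reproduced from GDB-13
    outside = novel - gdb13                     # molecules outside GDB-13
    return {
        "novel_unique": len(novel),
        "reproduced_from_gdb13": len(reproduced),
        "outside_gdb13": len(outside),
        "fraction_of_gdb13": len(reproduced) / len(gdb13),
    }
```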

Structure elucidation with chemical language models

Previous work had reported that chemical language models could contribute to structure elucidation of unknown small molecules using mass spectrometry, given minimal analytical data as input62. Specifically, it was found that when drawing a very large sample of molecules from a trained language model, the frequency with which any given unique molecule appeared in this output provided a model-intrinsic measure that could be used to develop structural hypotheses from an accurate mass measurement. These hypotheses could then be further refined by integrating the sampling frequency with MS/MS data, using existing tools for MS/MS interpretation85. Here, I tested (1) whether this principle would apply to other classes of small molecules, including complex natural products and (2) whether the choice of representation would influence the accuracy of structure elucidation using this approach.

I evaluated the performance of chemical language models on this task using four databases representing three different categories of small molecules. These included natural products from the LOTUS86 and COCONUT87 databases, food-derived compounds from the FooDB database, and environmental compounds from the NORMAN suspect list88. All four databases were preprocessed as described above for ChEMBL, and then split at random (that is, without scaffold splitting) into ten folds. For each test fold, a language model was trained on the 90% of molecules comprising the training set, after which a total of 100 million strings were sampled from the trained model. The sampled molecules were then parsed with the RDKit, invalid outputs were discarded, and the frequency with which each canonical SMILES appeared in the model output was tabulated. Then, for each molecule in the held-out test set, the model output was searched with the exact mass of this query molecule (plus or minus a 10 part per million window) and the generated molecules were sorted in descending order by their sampling frequencies to provide a ranked list of structural hypotheses. A window of 10 ppm was used to allow for differences between the measured (experimental) and theoretical mass; the accuracy of the experimental measurements is usually expressed as a mass error in parts per million as

$$\frac{\mathrm{mass}_{\mathrm{measured}}-\mathrm{mass}_{\mathrm{theoretical}}}{\mathrm{mass}_{\mathrm{theoretical}}}\times 10^{6}$$

The top-k accuracy was then calculated as the proportion of held-out molecules for which the correct chemical structure appeared within the k top-ranked outputs from the language model when ordered by sampling frequency (in the case of ties, molecules were ordered at random). A similar evaluation was carried out at the level of molecular formulas, whereby the top-k accuracy was calculated as the proportion of held-out molecules for which the correct formula appeared within the k top-ranked model outputs. In addition, I calculated the Tanimoto coefficient between each held-out molecule and the top-ranked chemical structure proposed by the language model, using Morgan fingerprints with a radius of 3 as above. The entire process was repeated in ten-fold cross-validation.
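A simplified sketch of this ranking and evaluation procedure is shown below, assuming a dictionary mapping each generated canonical SMILES to its sampling frequency; random tie-breaking and the formula-level evaluation are omitted.

```python
# Retrieve generated molecules whose exact mass falls within a 10 ppm window of
# the query mass, rank them by sampling frequency, and compute top-k accuracy.
from rdkit import Chem
from rdkit.Chem.Descriptors import ExactMolWt

def rank_candidates(query_mass, sampling_frequencies, ppm=10):
    tolerance = query_mass * ppm / 1e6
    hits = [
        (smiles, freq) for smiles, freq in sampling_frequencies.items()
        if abs(ExactMolWt(Chem.MolFromSmiles(smiles)) - query_mass) <= tolerance
    ]
    return [smiles for smiles, _ in sorted(hits, key=lambda hit: -hit[1])]

def top_k_accuracy(true_smiles, ranked_lists, k=1):
    """Proportion of held-out molecules whose true structure is within the top k."""
    hits = sum(truth in ranked[:k] for truth, ranked in zip(true_smiles, ranked_lists))
    return hits / len(true_smiles)
```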

As a baseline, I compared this language model-directed approach to searching by exact mass in PubChem, mimicking one approach to assigning chemical structures or molecular formulas based on MS1 measurements. I also compared the language model approach to searching by exact mass in the training set itself, recognizing that this would by definition lead to a top-k accuracy of 0 for any value of k, but with the goal of comparing the Tanimoto coefficients between the true molecule and structures prioritized by the language model versus molecules from the training set as a means to assess generalization beyond the training set. In both baselines, the same 10 ppm error window was used, within which molecules were ordered at random.

Finally, I repeated the procedures above with language models trained on SELFIES representations of the same databases, using identical folds. In addition, I parsed molecules generated as SELFIES without valency constraints as described above, discarded invalid outputs, and then re-calculated the sampling frequency of each generated molecule after canonicalization.

CASMI 2022

To further place the performance of chemical language models in context, I benchmarked the language model trained on the LOTUS database against 19 submissions to the CASMI 2022 competition (two submissions, Nikolic_KUCA and Nikolic_POSO, were excluded because these represented manual rather than computational approaches, per the organizers of the competition). In this competition, 500 commercially available compounds were profiled by mass spectrometry; four compounds (identifiers 81, 282, 432, 476) were subsequently excluded from the competition. Entrants were provided with the accurate m/z and retention times for each compound, and an accompanying mzML file containing MS/MS data. I tested the performance of the chemical language model given only the accurate m/z value as input. To simulate de novo elucidation of unknown molecules, I averaged sampling frequencies across all 10 cross-validation folds, removing training set molecules from the generative model output for each fold. This procedure ensured that sampling frequencies could be generated for known natural products without data leakage. For each accurate m/z value, I considered multiple potential adducts ([M + H]+, [M + NH4]+ and [M + Na]+ in the positive mode and [M – H]–, [M + Cl]– and [M + FA – H]– in the negative mode) and retrieved sampled molecules with the corresponding exact masses, plus or minus a 10 ppm error window as above. Generated molecules were then sorted in descending order by sampling frequency across all potential adduct types to produce a ranked list of hypotheses for each target m/z value.
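For illustration, the adduct handling can be sketched as below; the monoisotopic mass shifts are approximate values included for completeness rather than values taken from the manuscript.

```python
# Convert an observed m/z value into candidate neutral monoisotopic masses for
# several common adducts (approximate mass shifts, in Da).
ADDUCT_SHIFTS = {
    "[M+H]+":    1.0073,   # positive mode
    "[M+NH4]+": 18.0338,
    "[M+Na]+":  22.9892,
    "[M-H]-":   -1.0073,   # negative mode
    "[M+Cl]-":  34.9689,
    "[M+FA-H]-": 44.9982,
}

def candidate_neutral_masses(mz, positive_mode=True):
    sign = "+" if positive_mode else "-"
    return {
        adduct: mz - shift
        for adduct, shift in ADDUCT_SHIFTS.items()
        if adduct.endswith(sign)
    }
```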

Visualization

Throughout the manuscript, the box plots show the median (horizontal line), interquartile range (hinges) and smallest and largest values no more than 1.5 times the interquartile range (whiskers), and the error bars show the standard deviation.