Inverse design of 3d molecular structures with conditional generative neural networks

The rational design of molecules with desired properties is a long-standing challenge in chemistry. Generative neural networks have emerged as a powerful approach to sample novel molecules from a learned distribution. Here, we propose a conditional generative neural network for 3d molecular structures with specified chemical and structural properties. This approach is agnostic to chemical bonding and enables targeted sampling of novel molecules from conditional distributions, even in domains where reference calculations are sparse. We demonstrate the utility of our method for inverse design by generating molecules with specified motifs or composition, discovering particularly stable molecules, and jointly targeting multiple electronic properties beyond the training regime.


I. INTRODUCTION
Identifying chemical compounds with particular properties is a critical task in many applications, ranging from drug design [1][2][3] over catalysis [4] to energy materials [5][6][7][8].As an exhaustive exploration of the vast chemical compound space is infeasible, progress in these areas can benefit substantially from inverse design methods.In recent years, machine learning (ML) has been used to accelerate the exploration of chemical compound space [9][10][11][12][13][14][15].A plethora of methods accurately predict chemical properties and potential energy surfaces of 3d structures at low computational cost [16][17][18][19][20][21][22][23][24][25][26][27].Here, the number of reference calculations required for training ML models depends on the size of the domain to be explored.Thus, naive exploration schemes may still require a prohibitive number of electronic structure calculations.Instead, chemical space has to be be navigated in a guided way with fast and accurate methods to distill promising molecules.
This gives rise to the idea of inverse molecular design [28], where the structure-property relationship is reversed.Here, the challenge is to directly construct molecular structures corresponding to a given a set of properties.Generative ML models have recently gained traction as a powerful, data-driven approach to inverse design as they enable sampling from a learned distribution of molecular configurations [29].By appropriately restricting the distributions, they allow to obtain sets of candidate structures with desirable characteristics for further evaluation.These methods typically represent molecules as graphs or SMILES strings [30,31], which lack information about the three-dimensional structure of a molecule.Therefore, the same molecular graph can rep-resent various spatial conformations that differ in their respective properties, e.g.due to intramolecular interactions (hydrogen bonds, long-range interactions) or different orientations of structural motifs (rotamers, stereoisomers).Beyond that, connectivity-based representations are problematic in chemical systems where bonding is ambiguous, e.g. in transition metal complexes, conjugated systems or metals.Relying on these abstract representations is ultimately a limiting factor when exploring chemical space.
Recently, generative models that enable sampling of 3d molecular configurations have been proposed.This includes specifically designed approaches to translate given molecular graphs to 3d conformations [32][33][34][35][36][37][38], map from coarse-grained to fine-grained structures [39], sample unbiased equilibrium configurations of a given system [40,41], or focus on protein folding [42][43][44][45][46].In contrast, other models aim at sampling directly from distributions of 3d molecules with arbitrary composition [47][48][49][50][51][52][53][54][55][56], making them suitable for general inverse design settings.These models need to be biased towards structures with properties of interest, e.g. using reinforcement learning [51,52,56], fine-tuning on a biased data set [48], or other heuristics [54].Some of us have previously proposed G-SchNet [48], an auto-regressive deep neural network that generates diverse, small organic molecules by placing atom after atom in Euclidean space.It has been applied in the 3D-Scaffold framework to build molecules around a functional group associated with properties of interest in order to discover novel drug candidates [54].Such an approach requires prior knowledge about the relationship between functional groups and target properties and might prevent the model from unfolding its potential by limiting sampling to very specific molecules.G-SchNet has been biased by fine-tuning on a fraction of the training data set containing all molecules with small HOMO-LUMO gap [48].For this, a sufficient amount of training examples in the target space is required.However, the most interesting regions for exploration are often those where reference calculations are sparse.
In this work, we propose conditional G-SchNet (cG-SchNet), a conditional generative neural network for the inverse design of molecules.Building on G-SchNet, the model learns conditional distributions depending on structural or chemical properties allowing us to sample corresponding 3d molecular structures.Our architecture is designed to generate molecules of arbitrary size and does not require the specification of a target composition.Consequently, it learns the relationship between the composition of molecules and their physical properties in order to sample candidates exhibiting given target properties, e.g.preferring smaller structures when targeting small polarizabilities.Previously proposed methods have been biased towards one particular set of target property values at a time by adjusting the training objective or data [48,51].In contrast, our conditional approach permits searching for molecules with any desirable set of target property values after training is completed.It is able to jointly target multiple properties without the need to retrain or otherwise indirectly constrain the sampling process.This provides the foundation for the model to leverage the full information of the training data resulting in increased generalization and data efficiency.We demonstrate that cG-SchNet enables the exploration of sparsely populated regions that are hardly accessible with unconditional models.To this end, we conduct extensive experiments with diverse conditioning targets including chemical properties, atomic compositions and molecular fingerprints.In this way, we generate novel molecules with predefined structural motifs, isomers of a given composition that exhibit specific chemical properties, and novel configurations that jointly optimize HOMO-LUMO gap and energy.This demonstrates that our model enables flexible, guided exploration of chemical compound space.

Targeted 3d molecule generation with cG-SchNet
We represent molecules as tuples of atom positions R ≤n = (r 1 , ..., r n ) with r i ∈ R 3 and corresponding atom types Z ≤n = (Z 1 , ..., Z n ) with Z i ∈ N. cG-SchNet assembles these structures from sequences of atoms that are placed step by step in order to build the molecule in an autoregressive manner, where the placement of the next atom depends on the preceding atoms (Fig. 1a and  c).In contrast to G-SchNet [48], which learns an unconditional distribution over molecules, cG-SchNet samples from target-dependent conditional probability distributions of 3d molecular structures (Fig. 1b).
Given a tuple of k conditions Λ = (λ 1 , ..., λ k ), cG-SchNet learns a factorization of the conditional distribution of molecules, i.e. the joint distribution of atom positions and atom types conditioned on the target properties: In fact, we can split up the joint probability of the next type and the next position into the probability of the next type and the probability of the next position given the associated next type: This allows to predict the next type before the next position.We approximate the distribution over the absolute position from distributions over distances to already placed atoms which guarantees that it is equivariant with respect to translation and rotation of the input.Here α is the normalization constant and r ij = ||r i − r j || is the distance between the new atom i and a previously placed atom j.This approximation has previously been shown to accurately reproduce a distribution of molecular structures [48].Fig. 2 shows a schematic depiction of the cG-SchNet architecture.The conditions λ 1 , ..., λ k are each embedded into a latent vector space and concatenated, followed by a fully connected layer.In principle, any combination of properties can be used as conditions with our architecture with a suitable embedding network.In this work, we use three scalar-valued electronic properties such as the isotropic polarizability, vector-valued molecular fingerprints, and the atomic composition of molecules.Vectorvalued properties are directly processed by the network while scalar-valued targets are first expanded in a Gaussian basis.To target an atomic composition, learnable atom type embeddings are weighted by occurrence.The embedding procedure is described in detail in the Methods section.
In order to localize the atom placement and stabilize the generation procedure, cG-SchNet makes use of the same two auxiliary tokens as in the unconditional setting, namely the origin and the focus token [48].Auxiliary tokens are treated like regular atoms by the model, i.e. they possess positions and token types, which are contained in the tuples of atom positions and atom types serving as input at each step.The origin token marks the center of mass of molecules and allows the architecture to steer the growth from inside to outside.The focus token localizes the prediction of the next position in order to assure scalability and allows to break symmetries of partial structures.This avoids artifacts in the reconstruction of the positional distribution (Eq.3) as reported by Gebauer et al. [48].At each step, the focus token is randomly assigned to a previously placed atom.The position of the next atom is required to be close to this focus.In this way, we can use a small grid localized on the focus that does not grow with the number of atoms when predicting the distribution of the next position.
We train cG-SchNet on a set of molecular structures, where the values of properties used as conditions are known for each molecule.Given the conditions and the partial molecular structure at each step, cG-SchNet predicts a discrete distribution for the type of the next atom.As part of this, a stop type may be predicted that allows the model to control the termination of the sampling procedure and therefore generate molecules with variable size and composition.After sampling a type, cG-SchNet predicts distributions for the distance between the atom to be placed and each preceding atom and auxiliary token.The schematic depiction of the atom placement loop in Fig. 1c includes the auxiliary tokens, the model predictions, and the reconstruction of the localized 3d grid distribution.During training, we minimize the crossentropy loss between the predicted distributions and the ground-truth distributions known from the reference calculations.For further details on the model architecture and training procedure, refer to the Methods section.

Generating molecules with specified motifs
In many applications, it is advantageous for molecules to possess specific functional groups or structural motifs.These can be correlated with desirable chemical properties, e.g.polar groups that increase solubil-ity, or with improved synthetic accessibility.In order to sample molecules with specific motifs, we condition cG-SchNet on a path-based, 1024 bits long fingerprint that checks molecular graphs for all linear segments of up to seven atoms [58] (Supplementary Methods X C).The model is trained on a randomly selected subset of 55k molecules from the QM9 dataset consisting of ∼134k organic molecules with up to nine heavy atoms from carbon, nitrogen, oxygen, and fluorine [59][60][61].We condition the sampling on fingerprints of unseen molecules, i.e. structures not used during training.Fig. 3a shows results for four examples.We observe that the generated molecules have higher similarity with the target fingerprints than the training data.Furthermore, structures with high target similarity are also sampled with higher probability, as can be seen from the increased similarity score of generated duplicates.In the last column of Fig. 3a, we show sampled molecules with high similarity to each target and see that in each case various structures with perfectly matching fingerprint were found.For reference, we also show the most similar molecule in the training set.Overall, we see that the conditional sampling with cG-SchNet is sensitive to the target fingerprint and allows for generation of molecules with desired structural motifs.Although there are no molecules with FIG. 4. Discovery of low-energy isomers for an unseen composition.We sample C7O2H10 isomers with cG-SchNet conditioned on atomic composition and relative atomic energy (see text for details), where the training data set was restricted to contain no C7O2H10 conformations.a: The distribution of the relative atomic energy for C7O2H10 isomers in the test set (orange) and for three sets of isomers generated with cG-SchNet (blue curves) when targeting the composition C7O2H10 and three different relative atomic energy values as marked with color-matching dots on the x-axis.The generated isomer closest to the respective target is depicted above each curve.b: The absolute number of C7O2H10 isomers in the test set (red dotted line) for increasing relative atomic energy thresholds.The black solid line shows how many of these were generated by cG-SchNet (target energy -0.1 eV).c: Bar plot of the absolute number of C7O2H10 isomers with relative atomic energy ≤ 0.05 eV in the test set (orange) and generated by cG-SchNet (target energy -0.1 eV, purple).The bar for generated molecules is divided into isomers that can be found in the test set (unseen isomers), isomers that have different stereochemistry but share the same bonding pattern as test set structures (novel stereoisomers), and novel constitutional isomers that are not in QM9 (novel isomers).d: Relaxed example low-energy isomers generated by cG-SchNet (target energy -0.1 eV, blue dots) and structures from the test set (orange dots) along with their relative atomic energy.
the same fingerprint in the training data for three of the four fingerprint targets, the ML model successfully generates perfectly matching molecules, demonstrating its ability to generalize and explore unseen regions of chemical compound space.

Generalization of condition-structure relationship across compositions
For inverse design tasks, integrating information gained from different structures and properties is vital to obtain previously unknown candidates with desired properties.In this experiment, we target C 7 N 1 O 1 H 11 isomers with HOMO-LUMO gap values outside the range observed during training.To this end, the model has to learn from other compositions how molecules with particularly high or low HOMO-LUMO gap are structured, and transfer this knowledge to the target composition.There are 5859 C 7 N 1 O 1 H 11 isomers in QM9, where 997 have a HOMO-LUMO gap smaller than 6 eV, 1612 have a HOMO-LUMO gap larger than 8 eV, and 3250 lie in between these two values.We restrict the training data consisting of 55k molecules from QM9 to contain no C 7 N 1 O 1 H 11 isomers with HOMO-LUMO gap values outside the intermediate range (Fig. 3b).Thus, the model can only learn to generate molecules with gaps outside this range from compositions other than Fig. 3b shows examples of generated C 7 N 1 O 1 H 11 isomers for two target values as well as the respective HOMO-LUMO gap distributions.In both cases, the majority of generated isomers exhibit gap values close to the respective target (± 1 eV), i.e. outside of the range observed for these isomers by the model during training.This demonstrates that cG-SchNet is able to transfer knowledge about the relationship between structural patterns and HOMO-LUMO gaps learned from molecules of other compositions to generate unseen C 7 N 1 O 1 H 11 isomers with outlying gap values upon request.

Discovery of low-energy conformations
The ability to sample molecules that exhibit property values which are missing in the training data is a prerequisite for the targeted exploration of chemical space.A generative model needs to fill the sparsely sampled regions of the space, effectively enhancing the available data with novel structures that show property values of interest.We study this by training cG-SchNet on a randomly sampled set of 55k QM9 molecules and query our model to sample low-energy C 7 O 2 H 10 isomers -the most common composition in QM9.We exclude these isomers from the training data, i.e. our model has to generalize beyond the seen compositions.The identification of low-energy conformations is desirable in many practical applications, since they tend to be more stable.However, the energy of molecules is largely determined by their size and composition.Since we are mainly interested in the energy contribution of the spatial arrangement sampled by the model, we require a normalized energy value.To this end, we define the relative atomic energy, which indicates whether the internal energy per atom is relatively high or low compared to other molecules of the same composition in the data set (see Supplementary Methods X B for details).Negative values indicate comparatively low energy, and thus higher stability than the average structure of this composition.Note that a similarly normalized energy has been defined by Zubatyuk et al. [62] for their neural network potential.Using the relative atomic energy allows cG-SchNet to learn the influence of the spatial arrangement of atoms on the energy and transfer this knowledge to the unseen target composition.Examples of generated C 7 O 2 H 10 isomers with low, intermediate, and high relative atomic energy are shown in Fig. 4a.We observe that conformations with highly strained, small rings exhibit increased relative atomic energy values.
Fig. 4a shows that the trained model generalizes from the training data to sample C 7 O 2 H 10 isomers capturing the whole range of relative atomic energies exhibited by the QM9 test structures.We focus on stable, low-energy isomers for our analysis in the following.We sample 100k molecules with the trained cG-SchNet conditioned on the composition C 7 O 2 H 10 and a relative atomic energy value of -0.1 eV, i.e. close to the lowest energies occurring for these isomers in QM9.The generated molecules are filtered for valid and unique C 7 O 2 H 10 isomers, relaxed using density functional theory (DFT), and then matched with the test data structures.169 of the 200 isomers with the lowest relative atomic energy in the test set have been recovered by the model as well as 67% of the 1k isomers with relative atomic energy lower than −0.05 eV (Fig. 4b).Beyond that, cG-SchNet has generated 416 novel isomers as well as 243 novel stereoisomers that share the same bonding pattern as a test structure but show different stereochemistry (Fig. 4c).We found 32% more unique C 7 O 2 H 10 isomers with relative atomic energy lower than −0.05 eV with our model than already contained in QM9.Example isomers are depicted in Fig. 4d.For reference, we show additional, randomly selected generated novel isomers along with their most similar counterparts from QM9 in Supplementary Fig. S1 and depict how atoms in these structures moved during relaxation in Supplementary Fig. S4.Furthermore, we examine different conformations found for the five most often generated isomers in Supplementary Fig. S3.
The generated molecules include structures and motifs that are sparse or not included in the QM9 benchmark data set, which has previously been reported to suffer from decreased chemical diversity compared to real world data sets [63].For instance, there are no C 7 O 2 H 10 isomers with carboxylic acid groups in QM9, while twelve of the generated novel low-energy isomers possess this functional group (e.g., Fig. 4d, top right and Supplementary Fig. S2).Carboxylic acid groups are a common motif of organic compounds and feature prominently in fats and amino acids.While they are only contained in a few hundred molecules in QM9, cG-SchNet has learned to transfer this group to molecules of the targeted composition.Moreover, the model has discovered several acyclic C 7 O 2 H 10 isomers exhibiting a significantly lower relative atomic energy than those in QM9 (examples in Fig. 4d, bottom row).As cG-SchNet generalizes beyond the chemical diversity of QM9, this demonstrates that it can be employed to systematically enhance a database of molecular structures.
Targeting multiple properties: Discovery of low-energy structures with small HOMO-LUMO gap For most applications, the search for suitable molecules is guided by multiple properties of interest.Therefore, a method for exploration needs to allow for the specification of several conditions at the same time.Here we demonstrate this ability by targeting HOMO-LUMO gap as well as relative atomic energy, i.e. two complex electronic properties at the same time.A particular challenging task is to find molecules with extreme property values, as those are often located at the sparsely populated borders of the training distribution.In previous work, we have biased an unconditioned G-SchNet in order to sample molecules with small HOMO-LUMO gap [48].The model was fine-tuned with all ∼3.8k available molecules from QM9 with HOMO-LUMO gap smaller than 4.5 eV, a small fraction of the whole QM9 data set with ∼130k molecules.In the following, we demonstrate that improved results can be achieved with the cG-SchNet architecture while using less training samples from the target region.We further condition the sampling to particularly stable, low-energy conformations.In a fine-tuning approach, this would limit the training data to only a few molecules that are both stable and exhibit small gaps.In contrast, the conditioned model is able to learn also from reference calculations where only one of the desired properties is present.
We condition cG-SchNet on the HOMO-LUMO gap as well as the relative atomic energy and train it on 55k randomly selected QM9 molecules, where only ∼1.6k of the ∼3.8k molecules with HOMO-LUMO gap smaller than 4.5 eV are contained.Then, we sample the same number of molecules as for the biased model [48] (20k) with the trained cG-SchNet using a HOMO-LUMO gap value of 4.0 eV and relative atomic energy of −0.2 eV as conditions.The generated conformations are filtered for valid and unique molecules, relaxed using DFT, and then matched with the training data structures.
Fig. 5 compares the sets of generated, unique, unseen molecules with HOMO-LUMO gap smaller than 4.5 eV obtained for the cG-SchNet and biased G-SchNet.For biased G-SchNet, we use the previously published [48] data set of generated molecules with low HOMO-LUMO gap and remove all structures with HOMO-LUMO gap larger than 4.5 eV.Since the energy range has not been restricted for the biased G-SchNet, it samples structures that capture the whole space spanned by the training data, i.e. also less stable molecules with higher relative atomic energy.The molecules generated with cG-SchNet, in contrast, are mostly structures with low relative atomic energy (Fig. 5a).Considering the total amount of unseen molecules with small gaps found by both models, we observe that cG-SchNet samples a significantly larger number of structures from the low-energy domain than the biased G-SchNet.It similarly surpasses the number of molecules from this domain in the training set, showcasing an excellent generalization performance (see Fig. 5b).
The statistics about the average atom, bond, and ring count of generated molecules depicted in Fig. 5c reveal further insights about the structural traits and differences of molecules with low HOMO-LUMO gap in the two sets.The molecules found with cG-SchNet contain more double bonds and a larger number of rings, mainly consisting of five or six atoms.This indicates a prevalence of aromatic rings and conjugated systems with alternating double and single bonds, which are important motifs in organic semiconductors.The same patterns can be found for molecules from biased G-SchNet, however, there is an increased number of nitrogen and oxygen atoms stemming from less stable motifs such as rings dominated by nitrogen.An example of this is the molecule with the highest energy depicted in Fig. 5a).Furthermore, the molecules of biased G-SchNet tend to contain highly strained small cycles of three or four atoms.cG-SchNet successfully averts these undesirable motifs when sampling molecules with a low relative atomic energy target.
We conclude that cG-SchNet has learned to build stable molecules with low HOMO-LUMO gap even though it has seen less than half of the structures that the biased model was fine-tuned on.More importantly, the training data contains only very few (∼200) structures close to the target conditions at the border of the QM9 distribution, i.e. with HOMO-LUMO gap smaller than 4.5 eV and relative atomic energy smaller than −0.1 eV.However, our model is able to leverage information even from structures where one of the properties is outside the targeted range.Consequently, it is able to sample a significantly higher number of unseen molecules from the target domain than there are structures in the training data that fulfill both targets.In this way, multiple properties can be targeted at once in order to efficiently explore chemical compound space.
The efficiency of cG-SchNet in finding molecular structures close to the target conditions is particularly evident compared to exhaustive enumeration of graphs with subsequent relaxation using DFT.In both cases, the relaxation required to obtain equilibrium coordinates and the physical properties is the computational bottleneck and takes more than 15 minutes per structure for the molecules generated in this experiment.Furthermore, the calculation of the internal energy at zero Kelvin (U0) requires additional 40 minutes per molecule.In contrast, the generation with cG-SchNet takes only 9 milliseconds per structure on a Nvidia A100 GPU when sampling in batches of 1250.The training time of about 40 hours is negligible, as it corresponds to the relaxation and calculation of U0 of only 44 structures.Thus, the efficiency is determined by the number of molecules that need to be relaxed for each method.The QM9 data set was assembled by relaxing structures from the GDB enumeration [61] of graphs for small organic compounds.Of the ∼78k molecules that we did not use for training, 354 molecules are close to the target region.Relaxing only the 5283 structures proposed by cG-SchNet, i.e. less than 10% of the computations performed by screening all graphs, we can already recover 46% of these structures.Additionally, the model has unveiled valid molecules close to the target that are not contained in the data set.More than 380 of these are larger than QM9 structures and thus not covered.However, 253 smaller structures were missed by the enumeration method.This is, again, in line with findings by Glavatskikh et al. [63] that even for these small compounds the graph-based sampling does not cover all structures of interest.Furthermore, the conditional model is not restricted to the space of low energy / low gap molecules, but can also sample low energy / high gap structures or any other combination of interest.Thus, the efficiency of the generative model becomes even more pronounced when there are multiple sets of desirable target values.Fig. 1b depicts an example where cG-SchNet has been trained on the isotropic polarizability as condition.Here, the same model is employed to sample molecules for five different target values.Again, cG-SchNet is able to generalize to isotropic polarizabilities beyond the values present in the training data.

III. DISCUSSION
cG-SchNet enables the targeted discovery of 3d molecular structures conditioned on arbitrary combinations of multiple structural and chemical properties.The neural network captures global and local symmetries of molec-ular structures by design, enabling it to learn complex relationships between chemical properties and 3d structures.This makes it possible to generalize to unseen conditions and structures, as we have thoroughly evaluated in a line of experiments where we target property values not included in the training data.In contrast to previous approaches, the model does not require targetspecific biasing procedures.Instead, the explicit conditioning enables cG-SchNet to learn efficiently from all available reference calculations.Desirable values of multiple properties can be targeted simultaneously to sample from specific conditional distributions.In this way, cG-SchNet generates novel 3d candidate molecules that exhibit the target properties with high probability and thus are perfectly suited for further filtering and evaluation using ML force fields.
Further work is required to apply the cG-SchNet architecture to the exploration of significantly larger systems and a more diverse set of atom types.Although an unconditional G-SchNet has been trained on druglike molecules with 50+ atoms in the 3D-Scaffold framework [54], adjustments will be necessary to ensure scalability to materials.In the current implementation, we employ all preceding atoms to predict the type and reconstruct the positional distribution of the next atom.Here, a cutoff or other heuristics to limit the number of considered atoms will need to be introduced, together with corrections for long-range interactions.While the small organic compounds considered in this work are well represented by QM9, the model might benefit from enhancing the training data using representative building blocks such as "amons" [64] or other fragmentation methods [65,66].This becomes increasingly important when tackling larger molecules where reference data is hard to obtain.Furthermore, additional adaptations are necessary to explore systems with periodic boundary conditions.In cases where not all targeted properties can be fulfilled simultaneously, finding suitable molecules becomes harder, if not impossible.Therefore, another important extension is to explicitly define a trade-off between multiple conditions or to sample along a Pareto front.
We have applied cG-SchNet to sample particularly stable, low-energy C 7 O 2 H 10 isomers.In this process, we have discovered molecules and motifs that are absent from the QM9 database, such as isomers with carboxylic acid groups.Furthermore, we have sampled more than 800 low-energy molecules with HOMO-LUMO gaps smaller than 4.5 eV from a domain that is only sparsely represented in the training data.Although the exploration of such small molecules with exhaustive sampling of molecular graphs and subsequent evaluation with DFT is computationally feasible, our model considerably accelerates the process by providing reasonable candidate structures.cG-SchNet thus also enables the dataefficient, systematic improvement of chemical databases, which is particularly valuable considering the computational cost and unfavourable scaling of electronic struc-ture calculations.This paves the way for ML-driven, targeted exploration of chemical compound space and opens avenues for further development towards generative models for larger and more general atomistic systems.

A. Training Data
For each training run, 55k reference structures are randomly sampled from the QM9 data set [59][60][61], a collection of 133,885 molecules with up to nine heavy atoms from carbon, nitrogen, oxygen, and fluorine.We removed 915 molecules from the training pool which are deemed invalid by our validation procedure that checks the valency and connectedness of generated structures (see Section IV F).For some runs, limited subsets of the training data pool are used, as described in the results (e.g.without C 7 O 2 H 10 isomers).We train the neural network using 50k randomly sampled molecules and employ the remaining 5k for validation (see Section IV D).All molecules shown in figures have been rendered with the 3d visualization package Mayavi [67].

B. Details on the neural network architecture
In the following, we describe the cG-SchNet architecture as depicted in Figure 2 While this example shows a succession of two linear layers, the notation covers any number of successive linear layers with intermediate shifted softplus activations in the following.The number of layers and neurons as well as all other hyper-parameter choices for our neural network architecture are given in Supplementary Table S1.
The inputs to cG-SchNet when placing atom i is a partial molecule consisting of i − 1 atoms including two auxiliary tokens (focus and origin) and k target properties Λ = (λ 1 , ..., λ k ).The atoms and tokens are given as tuples of positions R ≤i−1 = (r 1 , . . ., r i−1 ) with r j ∈ R 3 and types Z ≤i−1 = (Z 1 , ..., Z i−1 ) with Z j ∈ N. The first two entries correspond to the auxiliary tokens, which are treated like ordinary atoms by the neural network.Thus, whenever we refer to atoms in the following, this also encompasses the tokens.Note that tokens do not influence the sampling probability of a molecule in Eq. 1, since they are placed with probability p(R ≤2 , Z ≤2 |Λ) = 1.
We employ SchNet [21,57] to extract atom-wise features X ≤i−1 = (x 1 , . . ., x i−1 ) that are invariant to rotation and translation.We use the SchNet representation network as implemented in the SchNetPack software package [68] with F = 128 features per atom and 9 interaction blocks.
Additionally, we construct a vector y ∈ R D of conditional features from the list of target properties.To this end, each target property is first mapped into vector space using an individual embedding network that depends on the form of the specific property.In this work, we employ different embedding networks for scalar-valued properties, vector-valued properties, and the atomic composition.Scalar-valued properties are processed by an MLP after applying a Gaussian radial basis function expansion where the minimum λ min and maximum λ max property values and the grid spacing ∆ω are hyper-parameters chosen per target property.Vector-valued properties such as molecular fingerprints are directly processed by an MLP: For the atomic composition, we use two embedding blocks.While the number of atoms is embedded as a scalar property, we map atom types to learnable embeddings g comp Z ∈ R G .These vectors are weighted by the fraction of the corresponding atom type in the target atomic composition, concatenated, and processed by an MLP.For example, the atomic composition of hydrocarbons would be encoded as: where "⊕" is the concatenation of two vectors and n H and n C is the fraction of hydrogen and carbon atoms in the target atomic composition, respectively.Finally, the property feature vectors f λ1 , . . ., f λ k are aggregated by an MLP to obtain the combined conditional features y.
Given the conditional features y representing the target properties and the atom-wise features X ≤i−1 describing the partial molecule, the cG-SchNet architecture predicts distributions for the type of the next atom and its pairwise distances to all preceding atoms with two output networks.Let Z all ⊂ N be the set of all atom types in the training data including an additional stop marker type.The type prediction network first computes atom-wise, |Z all |-sized vectors containing a scalar score for each atom type.Let s [z] j be the score of type z ∈ Z all predicted for preceding atom j.Then, the probability for the next atom being of type z is obtained by taking the softmax over all types and averaging the atom-wise predictions: The distance distributions are discretized on a grid with L bins, each covering a span of ∆µ.The bin of a distance d ∈ R + is given by b : R + → {1, . . ., L} b(d) = Given the type Z i of the next atom, the distance prediction network computes scores for each preceding atom and distance bin where " " is the Hadamard product and g next Z ∈ R F is a learnable atom type embedding.The probability of any distance between the new atom and a preceding atom is obtained by applying a softmax over all bins where u is the score of bin b(d) predicted for preceding atom j.

C. Sampling atom placement sequences for training
The number of sequences in which a molecule can be built by placing n atoms grows factorially with n.During training, we randomly sample a new atom placement sequence for every training molecule in each epoch.However, we use the focus and origin tokens to constrain how molecules are build by cG-SchNet and thus significantly reduce the number of possible sequences.Our approach ensures that molecules tend to grow outwards starting from the center of mass and that each new atom is placed close to one of the already placed atoms.For the first atom placement step, we set the positions of the focus and origin tokens to the center of mass of the training molecule and choose the atom closest to it as the first atom to be placed.If multiple atoms are equally close, one of them is randomly chosen as the first atom.
Afterwards, each atom placement step follows the same procedure.One of the already placed atoms (excluding tokens) is chosen as focus, i.e. the position of the focus token is set to the position of the chosen atom.Then, from all unplaced atoms, we select the neighbor of the focus that is closest to the center of mass as next atom.If there are no neighbors of the focus among the unplaced atoms, we insert a step where the type prediction network shall predict the stop marker type.In this way, the focus atom is marked as finished before randomly choosing a new focus and proceeding with the next atom placement step.Marked atoms cannot be chosen as focus anymore and the atom placement sequence is complete when all placed atoms are marked as finished.Thus, the sequence ends up with 2n steps, as each atom needs to be placed and furthermore marked as finished.
For our experiments, we consider atoms sharing a bond as neighbors.However, note that bonding information is not necessarily required as neighborhood can also be defined by a radial cutoff of e.g. 3 Å centered on the focus atom.For each atom placement, we minimize the crossentropy between the distributions predicted by the model given Z ≤i−1 , R ≤i−1 , and Λ and the distributions obtained from the ground truth next type Z next and position r next .The ground truth distribution of the next type is a one-hot encoding of Z next , thus the cross-entropy loss for the type distributions is The average cross-entropy loss for the distance distributions is with model predictions and Gaussian expanded ground truth distance where L is the number of bins of the distance probability grid with spacing ∆µ.The width of the Gaussian expansion can be tuned with γ, which we set to 10 ∆µ in our experiments.
The loss for a mini-batch C is the average type and distance loss of all atom placement steps of all M molecules in the mini-batch: where |A m | is the number of steps in sequence A m and The indicator function δ is zero for steps where the type to predict is the stop marker, since no position is predicted in these steps.
The neural networks were trained with stochastic gradient descent using the ADAM optimizer [69].We start with a learning rate η = 10 −4 which is reduced using a decay factor of 0.5 after 10 epochs without improvement of the validation loss.The training is stopped at η ≤ 10 −6 .We use mini-batches of 5 molecules and the model with lowest validation error is selected for generation.

E. Conditional generation of molecules
For the generation of molecules, conditions need to be specified covering all target properties the model was trained on, e.g. the atomic composition and the relative atomic energy.The generation is an iterative process where the type and position of each atom are sampled sequentially using the distributions predicted by cG-SchNet.Generating a molecule with n atoms takes 2n steps, as each atom needs to be placed and furthermore marked as finished in order to terminate the generation process.
At each step, we want to sample the type Z next ∈ Z all ⊂ N and position r next ∈ G ⊂ R 3 of the next atom given the types and positions of already placed atoms (including the two tokens) and the conditions.Here, Z all is the set of all atom types in the training data including an additional stop marker type and G is a grid of candidate positions in 3d space (see Supplementary Methods X A).An unfinished atom is randomly chosen as focus at the start of each step, i.e. the position of the focus token is aligned with the position of the chosen atom.Then, we predict the distribution of the type of the next atom with the model (see Eq. 11) to sample the next type If the next type is the stop marker, we mark the currently focused atom as finished and proceed with the next step by choosing a new focus without sampling a position.Otherwise, we proceed to predict the distance distributions between placed atoms and the next atom with the model (see Eq. 14).Since cG-SchNet is trained to place atoms in close proximity to the focused atom, we align the local grid of candidate positions with the focus at each step regardless of the number of atoms in the unfinished molecule.Then, the distance probabilities are aggregated to compute the distribution over 3d candidate positions in the proximity of the focus.The position of the next atom is drawn accordingly with where α is the normalization constant and r focus is the position of the focus token.At the very first atom placement step, we center the focus and grid on the origin token, while for the remaining steps, only atoms will be focused.
The generation process terminates when all regular atoms have been marked as finished.In this work, we limit the model to a maximum number of 35 atoms.If the model attempts to place more atoms, the generation terminates and the molecule is marked as invalid.

F. Checking validity and uniqueness of generated molecules
We use Open Babel [58] to assess the validity of generated molecules.Open Babel assigns bonds and bond orders between atoms to translate the generated 3d representation of atom positions and types into a molecular graph.We check if the valence constraints hold for all atoms in the molecular graph and mark the molecule as invalid if not.Furthermore, the generated structure is considered invalid if it consists of multiple disconnected graphs.We found that Open Babel may struggle to assign correct bond orders even for training molecules if they contain aromatic sub-structures made of nitrogen and carbon.Thus, we use the same custom heuristic as in previous G-SchNet work [48] that catches these cases and checks whether a correct bond order can be found.The corresponding code is available in the Supplementary Material.
The uniqueness of generated molecules is checked using their canonical SMILES [30] string representation obtained from the molecular graph with Open Babel.If two molecules share the same string, they are considered to be equal, i.e. non-unique.Furthermore, we check the canonical SMILES string of mirror-images of generated structures, which means that mirror-image stereoisomers (enantiomers) are considered to be the same molecule in our statistics.In case of duplicates, we keep the molecule sampled first, with the exception of the search for C 7 O 2 H 10 isomers, where we keep the structure with the lowest predicted relative atomic energy.Molecules from the training and test data are matched with generated structures in the same way, using their canonical SMILES representations obtained with Open Babel and the custom heuristic for bond order assignment.In general, we use isomeric SMILES strings that encode information about the stereochemistry of 3d structures.Only in the search for C 7 O 2 H 10 isomers, we also compare nonisomeric canonical SMILES obtained with RDKit [70] in order to identify novel stereoisomers, i.e. structures that share the same non-isomeric SMILES representation but differ in the isomeric variant.

G. Prediction of property values of generated molecules
We use pretrained SchNet [21,57] models from SchNet-Pack [68] to predict the HOMO-LUMO gap, isotropic polarizability, and internal energy at zero Kelvin of generated molecules.The reported mean absolute error (MAE) of these models is 0.074 eV, 0.124 Bohr 3 , and 0.012 eV, respectively.The predicted values are used to plot the distributions of the respective property in Fig. 1b, Fig. 3b, and Fig. 4a.We relax generated molecules for every experiment in order to assess how close they are to equilibrium configurations and to calculate the MAE between predictions for generated, unrelaxed structures and the computed ground-truth property value of the relaxed structure.The relaxation procedure is described in Supplementary Methods X D), where furthermore a table with the results can be found (Supplementary Table S2).For the statistics depicted in Fig. 4b-d and Fig. 5, we use the property values computed during relaxation instead of predictions from SchNet models.
The set of molecules with small HOMO-LUMO gap generated by biased G-SchNet is available at http://quantum-machine. org/datasets.

VI. CODE AVAILABILITY
The code for cG-SchNet is available at www.github.com/atomistic-machine-learning/cG-SchNet.This includes the routines for training and deploying the model, for filtering generated structures, all hyperparameter settings used in our experiments, and the splits of the data employed to train the reported models.

VII. AUTHOR CONTRIBUTIONS
NWAG developed the method and carried out the experiments.MG carried out the reference computations and simulations.SSPH trained the neural networks for predictions of molecular properties.NWAG, MG, KRM and KTS designed the experiments and analyses.NWAG, MG, and KTS wrote the paper.All authors discussed results and contributed to the final version of the manuscript.

X. SUPPLEMENTARY INFORMATION A. 3d grid for molecule generation
We use a grid of candidate positions G ⊂ R 3 , with a spacing of 0.05 Å.The extent of the grid is limited by a minimum distance d min and a maximum distance d max : The limits should be chosen according to the minimum and maximum distances between atoms in the training set that are considered to be neighbors when building atom placement sequences.For our experiments with QM9, we choose d min = 0.9 Å and d max = 1.7 Å.Furthermore, as in previous work with G-SchNet [48], we utilize a temperature parameter T to control the randomness when sampling from candidate positions: with Increasing T will increase randomness by smoothing the grid distribution.For sampling, we stick with T =0.1 in this work, which was found to result in accurate yet diverse sets of generated molecules [48].
The very first atom is placed solely based on the predicted distance to the origin token, i.e. the center of mass of the structure about to be generated.Naturally, this distance is not restricted by the same limits as neighboring atoms and thus, for this particular step, we employ a special grid G 1 ⊂ R 3 that covers larger distances: The maximum distance covered by the grid has been chosen to match with the maximum distance covered in the discretized distance distributions predicted by the model.Due to symmetry, the grid only needs to extend into one direction.Furthermore, the distribution is not smoothed during generation, i.e. we always set T =1.0 when sampling the first atom.

B. Calculation of relative atomic energy
We define a relative atomic energy that describes whether the energy per atom of a 3d conformation is comparatively high or low with respect to other structures in the data set that share the same atomic composition: Here, E(R ≤n , Z ≤n ) is the internal energy per atom at zero Kelvin of a molecular structure and ÊZ (Z ≤n ) is the expected internal energy per atom of molecules with the same composition in the training data set.A similarly normalized energy has been defined by Zubatyuk et al. [62] for their neural network potential AIMNet.Analogous to their procedure, we predict ÊZ (Z ≤n ) from the atomic composition with a linear regression model.The model maps from atomic concentration, i.e. the atomic composition divided by the total number of atoms in the system, to the internal energy per atom at zero Kelvin.In this way, we can compute the relative atomic energy even for structures with compositions that are not included in the training data and treat molecules of different size and composition in a comparable and normalized manner.This allows our model to learn a relation between 3d conformations and their energy that can be transferred across compositions, as can be seen in our experiments where we sample low-energy C 7 O 2 H 10 isomers with a model that was trained solely on other compositions (see Figure 4 in the paper).
The internal energy of training structures is provided in QM9 as property "U0".For unrelaxed, generated structures we predict the internal energy with a SchNet model trained on QM9 as explained in the Methods (section IV G).For relaxed, generated molecules we use the internal energy calculated with the ORCA quantum chemistry package [71] (Supplementary Methods X D).Although we relax structures at the same level of theory as the training data , the internal energies obtained with ORCA have a systematic offset compared to the calculations used in QM9.Thus, we estimate this offset and add it to the calculated internal energy.For the relaxed low-energy C 7 O 2 H 10 isomers (results in Fig. 4b-d), we re-compute the internal energy of all C 7 O 2 H 10 isomers in QM9 with ORCA and take the average difference between the reported internal energies and the re-computed values to estimate the offset (∼ −0.0064 eV per atom).For the relaxed low-energy molecules with small HOMO-LUMO gap (results in Fig. 5) we re-compute the internal energy of 1000 randomly sampled structures from QM9 with ORCA and fit a linear regression model to predict the difference between reported internal energies and re-computed values from the atomic composition.This allows to estimate the offset between internal energies from ORCA and the training data for relaxed, generated molecules of arbitrary composition.

C. Calculation of fingerprints
We obtain 1024 bits long binary fingerprints that capture the presence of linear fragments with up to seven atoms with Open Babel [58].We use version 2.4.1 of Open Babel, where the employed fingerprint is called "FP2" and corresponds to the default choice.Fingerprints are calculated after the SMILES representation of 3d structures are obtained as described in the Methods (section IV F).

D. Relaxation of generated structures with density functional theory
All electronic structure computations were carried out with the ORCA quantum chemistry package [71].SCF convergence was set to tight and integration grid levels of 4 and 5 were employed during SCF iterations and the final computation of properties, respectively.
The zero point vibrational energies required for the computation of the internal energies were obtained by normal mode analysis performed on the fully relaxed structures using the B3LYP/6-31G(2df,2p) level of theory.to the same conformation.However, we see that cG-SchNet is capable of sampling multiple conformations whenever there are degrees of freedom, e.g. in isomer number one and isomer number four.Our analysis suggests the path for a possible future adaptation and application of cG-SchNet that is particularly tailored to generative models for 3d molecules, i.e. the targeted generation of conformations for a given (possibly isomeric) graph.To this end, a proper embedding of the molecular graph and several target properties would need to be provided as conditions.FIG.S4.Comparison of generated novel C7O2H10 isomers before and after relaxation.We show novel, low-energy C7O2H10 structures as generated by our model and the corresponding closest equilibrium conformation found by relaxation with DFT (orange structures).The root-mean-square deviation between atom positions before and after relaxation is noted below each molecule (in Å).We show the same structures as in Fig. S1.TABLE S2.Relaxation results.Results for relaxation of the 100 generated unique unseen molecules closest to the respective target electronic property values.We show the properties on which the respective model was conditioned, the targeted property values, the validity of the relaxed molecules, the median root-mean-square deviation (RMSD) between atom positions before and after relaxation for valid molecules, and the mean absolute error (MAE) between the property values before and after relaxation for valid molecules (i.e.how much the calculated property values of relaxed molecules deviate from the predicted property values of the generated molecules).For the C7O2H10 isomers sampled with relative atomic energy target -0.1 eV and molecules sampled while targeting HOMO-LUMO gap and relative atomic energy simultaneously, the statistics are calculated from all generated unique unseen molecules instead of the 100 closest (since we relaxed all of them for our analyses in Fig. 4 and Fig. 5).

FIG. 1 .
FIG. 1. Molecule generation with cG-SchNet.a: Factorization of the conditional joint probability of atom positions and types into a chain of probabilities for placing single atoms one after another.b: Results of sampling molecules from targetdependent conditional probability distributions.Distributions of the isotropic polarizability of training structures (orange) and five sets of molecules generated by the same cG-SchNet model (blue curves) conditioned on five different isotropic polarizability target values (color-matching dots above the x-axis).The generated molecule closest to the corresponding target value and not contained in the training data (unseen) is shown above each curve.c: Schematic depiction of the atom placement loop.For visualization purposes, we show a planar molecule and a 2d slice of the actual 3d grid distributions in steps 4, 5, and 6.

FIG. 2 .
FIG. 2. Schematic depiction of the cG-SchNet architecture with inputs and outputs."⊕" represents concatenation and " " represents the Hadamard product.Left: Atom-wise feature vectors representing an unfinished molecule are extracted with SchNet [57] and conditions are individually embedded and then concatenated to extract the conditional features vector.The exact embedding depends on the type of the condition (e.g.scalar or vector-valued).Middle: The distribution for the type of the next atom is predicted from the extracted feature vectors.Right: Based on the extracted feature vectors and the sampled type of the next atom, distributions for the pairwise distances between the next atom and every atom/token in the unfinished molecule are predicted.See Methods for details on the building blocks.

FIG. 3 .
FIG. 3. Targeted exploration of chemical space with cG-SchNet.a: Generation of molecules with desired motifs by conditioning cG-SchNet on simple path-based fingerprints.First column: Four different target fingerprints of structures from the test set.For each, we conditionally sample 20k molecules with cG-SchNet.Second column: Average Tanimoto similarity of the respective target to training structures (brown) and to generated molecules without duplicates (blue) and with duplicates (grey).The amount of generated structures is noted next to the dots.Third column: Most similar training molecule.Fourth column: Three generated unseen examples with high similarity to the target.The Tanimoto similarity to the target structure is noted to the bottom-right of depicted molecules.b: Generation of C7N1O1H11 isomers with HOMO-LUMO gap targets outside the training data range by conditioning cG-SchNet on atomic composition and HOMO-LUMO gap.The training data set of 55k QM9 molecules is restricted to not contain any C7N1O1H11 isomers with gap < 6 eV or gap > 8 eV.The graph shows the distribution of the gap for the C7N1O1H11 isomers in QM9 (brown), the isomers in the restricted training data set (orange), and the two sets of isomers generated with cG-SchNet (blue curves) when targeting the composition C7N1O1H11 and two gap values outside the training data range (color-matching dots on the x-axis).For each target value, the two generated isomers closest to it are depicted.

FIG. 5 .
FIG. 5. Discovery of low-energy structures with small HOMO-LUMO gap.We compare cG-SchNet to the previous, biased G-SchNet approach [48].a: The joint distributions of relative atomic energy and HOMO-LUMO gap for QM9 (left) and for unique, unseen molecules with gap ≤ 4.5 eV generated with cG-SchNet (middle) and with biased G-SchNet (right).Biased G-SchNet was fine-tuned on all molecules in QM9 below a gap threshold of 4.5 eV (red, dotted line).The conditions used for generation with cG-SchNet are marked with a blue cross.The depicted molecules are generated examples with a gap of 4 eV and different relative atomic energy values (black, dotted lines).More examples as well as the distributions close to the conditioning target for cG-SchNet and the training data can be found in Supplementary Fig. S5.b: The absolute number of unique, unseen molecules with gap ≤ 4.5 eV generated by cG-SchNet (black) and biased G-SchNet (red) for increasing relative atomic energy thresholds.For reference, we also show the amount of structures with low gap included in the training set of cG-SchNet (blue dotted line).c: The average number of atoms of different types (left), bonds of different orders (middle), and rings of different sizes (right) in unique, unseen molecules with gap ≤ 4.5 eV generated by each model.

D.
Neural network training We use mini-batches with M molecules for training.Each mini-batch contains one atom placement sequence per molecule, randomly sampled in each epoch as explained in Section IV C. Each step of the atom placement sequence a ∈ A m consists of types Z ≤i−1 and positions R ≤i−1 of already placed atoms and the two auxiliary tokens, of the values Λ of molecule m for the target properties of the model, and of the type Z next and position r next of the next atom.

FIG. S1 .
FIG. S1.Generated novel C7O2H10 isomers vs. most similar isomers in QM9.Pairs of generated, novel, low-energy C7O2H10 isomers (left) and the corresponding most similar C7O2H10 isomer in QM9 (right) according the Tanimoto similarity of path-based fingerprints (noted below each pair).In the first row, we show pairs corresponding to the novel structures depicted in the first row of Fig. 4d.The remaining structures are uniformly randomly selected from all novel isomers with relative atomic energy ≤ -0.05 eV generated by cG-SchNet (target energy -0.1 eV).