## Introduction

The three-dimensional (3-D) coordinates of atoms in a molecule are commonly referred to as the molecule’s geometry or conformation. The task, known as conformation generation, of predicting possible valid coordinates of a molecule, is important for determining a molecule’s chemical and physical properties1. Conformation generation is also a vital part of applications such as generating 3-D quantitative structure-activity relationships (QSAR), structure-based virtual screening and pharmacophore modeling2. Conformations can be determined in a physical setting using instrumental techniques such as X-ray crystallography as well as using experimental techniques. However, these methods are typically time-consuming and costly.

A number of computational methods have been developed for conformation generation over the past few decades2. Typically this problem is approached by using a force field energy function to calculate a molecule’s energy, and then minimizing this energy with respect to the molecule’s coordinates. This hand-designed energy function yields an approximation of the molecule’s true potential energy observed in nature based on the molecule’s atoms, bonds and coordinates. The minimum of this energy function corresponds to the molecule’s most stable configuration. Although this approach has been commonly used to generate a geometrically diverse set of conformations with certain conformations being similar to the lowest-energy conformations, it has been shown that molecule force field energy functions are often a crude approximation of actual molecular energy3.

In this paper, we propose a deep generative graph neural network that learns the energy function from data in an end-to-end fashion by generating molecular conformations that are energetically favorable and more likely to be observed experimentally4. This is done by maximizing the likelihood of the reference conformations of the molecules in the dataset. We evaluate and compare our method with conventional molecular force field methods on three databases of small molecules by calculating the root-mean-square deviation (RMSD) between generated and reference conformations. We show that conformations generated by our model are on average far more likely to be close to the reference conformation compared to those generated by conventional force field methods i.e. the variance of the RMSD between generated and reference conformations is lower for our method. Despite having lower variance, we show that our method does not generate geometrically similar conformations. We also show that our approach is computationally faster than force field methods.

A disadvantage of our model is that in general for a given molecule, the best conformation generated by our model lies further away from the reference conformation compared to the best conformation generated by force field methods. We show that for the QM9 small molecule dataset, the best of both methods can be combined by using the conformations generated by the deep generative graph neural network as an initialization to the force field method.

## Conformation Generation

We consider a molecule as an undirected, complete graph G = (V, E), where V is a set of vertices corresponding to atoms, and E is a set of edges representing the interactions between pairs of atoms from V. Each atom is represented as a vector vi$${{\mathbb{R}}}^{{d}_{v}}$$ of node features, and the edge between the i-th and j-th atoms is represented as a vector eij$${{\mathbb{R}}}^{{d}_{e}}$$ of edge features. There are M vertices and M(M − 1)/2 edges. We define a plausible conformation as one that may correspond to a stable configuration of a molecule. Given the graph of a molecule, the task of molecular geometry prediction is the generation of a set of plausible conformations Xa = $$({x}_{1}^{a},\ldots ,{x}_{M}^{a})$$, where $${x}_{i}^{a}\in {{\mathbb{R}}}^{3}$$ is a vector of the 3-D coordinates of the i-th atom in the a-th conformation.

Molecules can transition between conformations and end up in different local minima based on the stability of the respective conformations and environmental conditions. As a result, there is more than one plausible conformation associated with each molecule; it is hence natural to formulate conformation generation as finding (local) minima of an energy function $${\mathcal F} (X,G)$$ defined on a pair of molecule graph and conformation:

$$\{{X}_{1},\ldots ,{X}_{S}\}={\rm{\arg }}\mathop{{\rm{\min }}}\limits_{X}\, {\mathcal F} (X,G).$$
(1)

Alternatively, we could sample from a Gibbs distribution:

$$\{{X}_{1},\ldots ,{X}_{S}\} \sim {p}_{ {\mathcal F} }(X|G),$$
(2)

where

$${p}_{ {\mathcal F} }(X|G)=\frac{1}{\zeta (G)}\exp \{- {\mathcal F} (X,G)\},$$
(3)

where ζ is a normalizing constant. We use S to indicate the number of conformations we generate for each molecule.

Under this view, the problem of conformation generation is decomposed into two stages. In the first stage, a computationally-efficient energy function $${\mathcal F} (X,G)$$ is constructed. The second stage involves either performing optimization as in Eq. (1) or sampling as in Eq. (2) to generate a set of conformations from this energy function.

### Energy function construction

A conventional approach is to define an energy function semi-automatically. The functional form of an energy function is designed carefully to incorporate various chemical properties, whereas detailed parameters of the energy function are either computationally or experimentally estimated. Two examples of widely used energy functions are the Universal Force Field (UFF)5 and the Merck Molecular Force Field (MMFF)6. In contrast to these methods, here we will describe how to estimate the energy function or probability distribution directly from data using the latest techniques from deep learning.

### Energy minimization/sampling

Once the energy function is defined, a conventional approach is to run the minimization many times starting from different initial conformations. Due to the non-convexity of the energy function, each run is likely to end up in a unique local minimum, allowing us to collect a set of many conformations.

A typical approach is to use distance geometry (DG)7 or its variants, such as experimental-torsion basic knowledge distance geometry (ETKDG)8, to randomly generate an initial conformation that satisfies various geometric constraints such as lower and upper bounds on the distances between atoms. Starting from the initial conformation, an iterative optimization algorithm, such as L-BFGS9, gradually updates the conformation until it finds a minimum of the energy function. In this paper, we instead propose an approach based on deep generative models that allow us to sample directly from a distribution over all possible conformations given a molecule graph.

## Deep Generative Model for Molecular Geometry

We propose to “learn” an energy function $${\mathcal F} (G,X)$$ from a database containing many pairs of a molecule and its experimentally obtained conformation. Let $$|{\mathscr{D}}|=\{({G}_{1},{X}_{1}^{\ast }),\ldots ,({G}_{N},{X}_{N}^{\ast })\}$$ be a set of examples from such a database, where Xn* is “a” reference conformation, often obtained and verified empirically in a certain environment. These reference conformations may not necessarily correspond to the lowest energy configurations of the molecules, but are energetically favorable and more likely to be observed experimentally. Learning an energy function can then be expressed as the following optimization problem:

$$\hat{ {\mathcal F} }(G,X)={\rm{\arg }}\mathop{{\rm{\max }}}\limits_{ {\mathcal F} }\frac{1}{N}\mathop{\sum }\limits_{n=1}^{N}\mathop{\underbrace{\log \,{p}_{ {\mathcal F} }({X}_{n}^{\ast }|{G}_{n})}}\limits_{({\rm{a}})},$$
(4)

where $${p}_{ {\mathcal F} }$$ is a Gibbs distribution defined using $${\mathcal F}$$ as in Eq. (3). In other words, we can learn the energy function $${\mathcal F}$$ by maximizing the log-likelihood of the data D. In principle, the term “energy” has a very specific meaning in each context (e.g., potential energy, statistical free energy and etc). In our case, “energy” refers to an objective function that reflects the likelihood of a conformation given a molecular graph.

### Conditional variational graph autoencoders

We use a conditional version of a variational autoencoder10 to model the distribution $${p}_{ {\mathcal F} }$$ in Eq. (4) (a). This choice enables an underlying model to capture the complicated, multi-modal nature of this distribution, while allowing us to efficiently sample from this distribution. This is done by introducing a set of latent variables Z = {z1, …, zM}, where $${z}_{m}\in {{\mathbb{R}}}^{{d}_{z}}$$ and rewriting the conditional log-probability $$\log \,{p}_{ {\mathcal F} }(X|G)$$ as

$$\log \,p(X|G)=\,\log \,\int p(X|Z,G)p(Z|G){\rm{d}}Z,$$
(5)

where we omit the subscript $${\mathcal F}$$ for brevity.

The marginal log-probability in Eq. (5) is generally intractable to compute, and we instead maximize the stochastic approximation to its lower bound, as is standard practice in problems involving variational inference:

$$\log \,p(X|G)\ge {{\mathbb{E}}}_{Z \sim Q(Z|G,X)}[\log \,\mathop{\underbrace{p(X|Z,G)}}\limits_{({\rm{b}})\,{\rm{likelihood}}}]-{\rm{KL}}(\mathop{\underbrace{Q(Z|G,X)}}\limits_{({\rm{c}})\,{\rm{posterior}}}\parallel \mathop{\underbrace{P(Z|G)}}\limits_{({\rm{a}})\,{\rm{prior}}})$$
(6)
$$\approx \frac{1}{K}\mathop{\sum }\limits_{k=1}^{K}\,\log \,p(X|{Z}^{k},G)-{\rm{KL}}(Q(Z|G,X)\parallel P(Z|G)),$$
(7)

where Zk is the k-th sample from the (approximate) posterior distribution Q above. We assume that we can compute the KL divergence analytically, for instance by constructing Q and P to be normal distributions.

#### Modeling the graph using a message passing neural network

We use a message passing neural network (MPNN)11, a variant of a graph neural network12,13, which operates on a graph G directly and is invariant to graph isomorphism. The MPNN consists of L layers. At each layer l, we update the hidden vector $$h({v}_{i})\in {{\mathbb{R}}}^{{d}_{h}}$$ of each node and hidden matrix $$h({e}_{ij})\in {{\mathbb{R}}}^{{d}_{h}\times {d}_{h}}$$ of each edge using the equation

$${h}^{l}({v}_{i})={\rm{GRU}}({h}^{l-1}({v}_{i}),J({h}^{l-1}({v}_{i}),{h}^{l-1}({v}_{j\ne i}),h({e}_{i,j\ne i})),$$
(8)

where J is a linear one layer neural network that aggregates the information from neighboring nodes according to its hidden vectors of respective nodes and edges. GRU is a gated recurrent network that combines the new aggregate information and its corresponding hidden vector from previous layer14. The weights of the message passing function J and GRU are shared across the L layers of the MPNN.

#### Prior parameterization

We use the MPNN described above to model the prior distribution P(Z|G) in Eq. (6) (a). We initialize h0(vi) and h(eij) in Eq. (8) as linear transformations of the feature vectors vi and eij of the nodes and edges respectively:

$${h}^{0}({v}_{i})={U}_{{\rm{node}}}^{{\rm{prior}}}{v}_{i};\,h({e}_{ij})={U}_{{\rm{edge}}}^{{\rm{prior}}}{e}_{ij},$$
(9)

where Unodeprior and Uedgeprior are matrices representing the linear transformations for the nodes and edges respectively. The final hidden vector hL(vi) of each node is passed through a two layer neural network with hidden size df, whose output $${\tilde{h}}^{L}({v}_{i})$$ is transformed into the mean and variance vectors of a Normal distribution with a diagonal covariance matrix:

$${\mu }_{i}={W}_{\mu }^{{\rm{prior}}}{\tilde{h}}^{L}({v}_{i})+{b}_{\mu }^{{\rm{prior}}};$$
(10)
$${\sigma }_{i}^{2}=\exp \{{W}_{\sigma }^{{\rm{prior}}}{\tilde{h}}^{L}({v}_{i})+{b}_{\sigma }^{{\rm{prior}}}\},$$
(11)

where Wμprior and Wσprior are the weight matrices and bμprior and bσprior are the bias terms of the transformations. These are used to form the prior distribution:

$$\log \,P(Z|G)=\mathop{\sum }\limits_{i=1}^{N}\,\mathop{\sum }\limits_{j=1}^{3}\,-\frac{{({\mu }_{i,j}-{z}_{i,j})}^{2}}{2{\sigma }_{i,j}^{2}}-\,\log \,\sqrt{2\pi {\sigma }_{i,j}^{2}},$$
(12)

where μi,j and σi,j2 are the j-th components of the mean and variance vectors respectively. In other words, we parameterize the prior distribution as a factorized Normal distribution factored over the vertices and the dimensions in the 3-D coordinate.

#### Likelihood parameterization

We use a similar MPNN to model the likelihood distribution, P(X|Z, G) in Eq. (6) (b). The only difference is that this distribution is conditioned not only on the molecular graph G = (V, E) but also on the latent set Z = {z1, …, zM}. We incorporate the latent set Z by adding the linear transformation of the node feature vector vi to its corresponding latent variable zi. This result is used to initialize the hidden vector:

$${h}^{0}({v}_{i})={U}_{{\rm{node}}}^{{\rm{likelihood}}}{v}_{i}+{z}_{i};\,h({e}_{ij})={U}_{{\rm{edge}}}^{{\rm{likelihood}}}{e}_{ij},$$
(13)

where Unodelikelihood and Uedgelikelihood are matrices representing the linear transformations for the nodes and edges respectively. From there on, we run neural message passing as in Eqs. (811), with a new set of parameters, θlikelihood, Wμlikelihood, bμlikelihood, Wσlikelihood and bσlikelihood. The final mean and variance vectors are now three dimensional, representing the 3-D coordinates of each atom, and we can compute the log-probability of the coordinates using Eq. (12).

#### Posterior parameterization

As computing the exact posterior P(Z|G, X) is intractable, we resort to amortized inference using a parameterized, approximate posterior Q(Z|G, X) in Eq. (6) (c). We use a similar approach to our parameterization of the prior distribution above. However, we replace the input to the MPNN with the concatenation of an edge feature vector eij and the corresponding distance (proximity) matrix D(X*) of the reference 3-D conformation X*:

$$h({e}_{ij})={U}_{{\rm{edge}}}^{{\rm{posterior}}}[\begin{array}{l}{e}_{ij}\\ D({x}_{i}^{\ast })\end{array}].$$
(14)

With a new set of parameters, θposterior, Wμposterior, bμposterior, Wσposterior and bσposterior, the MPNN outputs a Normal distribution for each latent variable zi. Linear weight embeddings of nodes Unode are shared between prior, likelihood and posterior.

### Training the conditional variational graph autoencoder

With the choice of the Gaussian latent variables zi, we can use the reparameterization trick10 to compute the gradient of the stochastic approximation to the lower bound in Eq. (7) with respect to all the parameters of the three distributions10. This property allows us to train this model on a large dataset using stochastic gradient descent (SGD). However, there are two major considerations that must be made before training this model on a large molecule database.

#### Post-alignment likelihood

An important property of conformation generation over a usual problem of regression is that we must take into account rotation and translation. Let R be an alignment function that takes as input a target conformation and a predicted conformation. The function aligns the reference conformation to the predicted conformation and returns the aligned reference conformation. $$\hat{X}$$ = R(X, X*) is the conformation obtained by rotating and translating the reference conformation X* to have the smallest distance to the predicted conformation X according to a predefined metric such as RMSD:

$${\rm{RMSD}}(\hat{X},{X}^{\ast })=\sqrt{\frac{1}{M}\mathop{\sum }\limits_{i=1}^{M}\,\parallel {\hat{x}}_{i}-{x}_{i}^{\ast }{\parallel }^{2}}.$$
(15)

This alignment function R is selected according to the problem at hand, and we present below its use in a general form without exact specification.

We implement this invariance to rotation and translation by parameterizing the output of the likelihood distribution above to be aligned to the target molecule. That is,

$$\log \,p(X|G,Z)=\mathop{\sum }\limits_{i=1}^{M}\,\mathop{\sum }\limits_{j=1}^{3}\,-\frac{{({\mu }_{i,j}-{\hat{x}}_{i,j}^{\ast })}^{2}}{2{\sigma }_{i,j}^{2}}-\,\log \,\sqrt{2\pi {\sigma }_{i,j}^{2}},$$
(16)

where $${\hat{x}}_{i}^{\ast }$$ is the coordinate of the i-th atom aligned to the mean conformation {μ1, …, μN}. That is,

$$\{{\hat{x}}_{1}^{\ast },\ldots ,{\hat{x}}_{M}^{\ast }\}=R(\{{\mu }_{1},\ldots ,{\mu }_{M}\},{X}^{\ast }).$$
(17)

In other words, we rotate and translate the reference conformation X* to be best aligned to the predicted conformation (or its mean) before computing the log-probability. This encourages the model to assign high probability to a conformation that is easily aligned to the reference conformation X*, which is precisely the goal of maximum log-likelihood.

#### Unconditional prior regularization

The second term in the lower bound in Eq. (6), which is the KL divergence between the approximate posterior and prior, does not have a point minimum but an infinitely long valley consisting of minimum values. Consider the KL divergence between two univariate Normal distributions:

$${\rm{KL}}({\mathscr{N}}({\mu }_{1},{\sigma }_{1}^{2})\parallel {mathscr{N}}({\mu }_{2},{\sigma }_{2}^{2}))=\,\log \,\frac{{\sigma }_{2}}{{\sigma }_{1}}+\frac{{\sigma }_{1}^{2}+{({\mu }_{1}-{\mu }_{2})}^{2}}{2{\sigma }_{2}^{2}}-\frac{1}{2}.$$
(18)

When both distributions are shifted by the same amount, the KL divergence remains unchanged. This could lead to a difficulty in optimization, as the means of the posterior and prior distributions could both diverge.

In order to prevent this pathological behavior, we introduce an unconditional prior distribution P(Z) which is a factorized Normal distribution:

$$P(Z)=\mathop{\prod }\limits_{i=1}^{M}\,{mathscr{N}}({z}_{i}|0,I),$$
(19)

where $${mathscr{N}}$$ computes a Normal probability density, and I is a dz × dz identity matrix. We minimize the KL divergence between the original prior distribution P(Z|G) and this unconditional prior distribution P(Z) in addition to maximizing the lowerbound, leading to the following final objective function for each molecule:

$${\mathcal L} =\,\log \,p(X|{Z}^{1},G)-{\rm{KL}}(Q(Z|G,X)\parallel P(Z|G))-\alpha \cdot {\rm{KL}}(P(Z|G)\parallel P(Z)),$$
(20)

where we assume K = 1 and introduce a coefficient α ≥ 0.

### Inference: predicting molecular geometry

Learning a conditional variational autoencoder above corresponds to the first stage of conformation generation, that is, the stage of energy function construction. Once the energy function is constructed, we need to sample multiple conformations from the Gibbs distribution defined using the energy function, which is logP(X|G) in Eq. (5). Our parameterization of the Gibbs distribution using a directed graphical model15 allows us to efficiently sample from this distribution. We first sample from the prior distribution, $$\tilde{Z} \sim P(Z|G)$$, and then sample from the likelihood distribution, $$\tilde{X} \sim P(X|\tilde{Z},G)$$. In practice, we fix the output variance σi,j of the likelihood distribution to be 1 and take the mean set {μ1, …, μM} as a sample from the model.

## Experimental Setup

### Data

We experimentally verify the effectiveness of the proposed approach using three databases of molecules: QM916,17, COD18 and CSD19. These datasets are selected as they possess distinct properties from each other, which allows us to carefully study various aspects of the proposed approach. There is an overlap between COD and CSD databases, since both of these databases were based on published crystallography data. We only keep molecules from each database that can be processed by RDKitFootnote 1. We further remove disconnected compounds i.e. those whose Simplified Molecular-Input Line-Entry System20 (SMILES) representation contains ‘.’. See Fig. 1 for some other properties of these three datasets.

#### QM9

The filtered QM9 dataset contains 133, 015 molecules, each of which contains up to 9 heavy atoms of types C, N, O and F. Each molecule is paired with a reference conformation obtained by optimizing the molecular geometry with density functional theory (DFT) at the B3LYP/6-31G(2df,p) level of theory, which implies that the reference conformations are obtained from the same environment. We hold out separate 5,000 and 5,000 randomly selected molecules as validation and test sets, respectively.

#### COD

We use the organic part of the COD dataset. We further filter out any molecule that contains more than 50 heavy atoms of types B, C, N, O, F, Si, P, S, Cl, Ge, As, Se, Br, Te and I. This results in 66,663 molecules, out of which we hold out separate 3,000 and 3,000 randomly selected ones respectively for validation and test purposes. Reference conformations are voluntarily contributed to the dataset and are often determined either experimentally or by DFT calculations21. Thus, the reference conformations are obtained from different environments.

#### CSD

Similarly to COD, we remove any molecule that contains more than 50 heavy atoms, resulting in a total of 236, 985 molecules. We hold out separate 3,000 and 3,000 randomly selected molecules for validation and test purposes respectively. This dataset contains organic and metal-organic crystallographic structures which have been observed experimentally19. The atom types in this dataset are S, N, P, Be, Tc, Xe, Br, Rh, Os, Zr, In, As, Mo, Dy, Nb, La, Te, Th, Ga, Tl, Y, Cr, F, Fe, Sb, Yb, Tb, Pu, Am, Re, Eu, Hg, Mn, Lu, Nd, Ce, Ge, Sc, Gd, Ca, Ti, Sn, Ir, Al, K, Tm, Ni, Er, Co, Bi, Pr, Rb, Sm, O, Pt, Hf, Se, Np, Cd, Pd, Pb, Ho, Ag, Mg, Zn, Ta, V, B, Ru, W, Cl, Au, U, Si, Li, C and I. The reference conformations are obtained from crystal structures.

### Models

#### Baselines

As a point of reference, we minimize a force field starting from a conformation created using ETKDG8. We test both UFF and MMFF, and respectively call the resulting approaches ETKDG + UFF and ETKDG + MMFF. The environment from which each conformation is obtained affects the force field calculations. To keep comparisons fair and to abstract away the effects of the environment, we use the implementations in RDKit22 with the default hyperparameters. The default implementations have often been used in literature when comparing different conformation generation methods23,24,25.

#### Conditional variational graph autoencoder

We build one conditional variational graph autoencoder for each dataset. We use dh = 50 hidden units at each layer of neural message passing (Eq. 8) in each of the three MPNNs corresponding to the prior, likelihood and posterior distributions. We use df = 100 in the two layer neural network that comes after the MPNN. As described earlier, we fix the variance of the output in the likelihood distribution to 1. We use L = 3 layers per network for QM9 and L = 5 layers per network for COD and CSD. We chose these hyperparameter values by carrying out a grid-search and choosing the values that had the best performance on the validation set. The grid-search procedure and the performance of models with different hyperparameters are shown in the Supplementary Information.

#### Learning

For all models, we use dropout26 at each layer of the neural network that comes after the MPNN with a dropout rate of 0.2 to regularize learning. We set the coefficient α in Eq. (20) to 10−5. We train each model using Adam27 with a fixed learning rate of 3 × 10−4. All models were trained with a batch size of 20 molecules on 1 Nvidia GPU with 12 GB of RAM.

#### Inference

There are two modes of inference with the proposed approach. The first approach is to sample from a trained conditional variational graph autoencoder by first sampling from the prior distribution and taking the mean vectors from the likelihood distribution; we refer to this as CVGAE. We can then use these samples further as initializations of MMFF minimization; we refer to this as CVGAE + MMFF. The latter approach can be thought of as a trainable approach to initializing a conformation in place of DG or ETKDG.

### Evaluation

In principle, the quality of the sampled conformations should be evaluated based on their molecular energies, for instance by DFT, which is often more accurate than force field methods3. However, the computational complexity of the DFT calculation is superlinear with respect to the number of electrons in a molecule, and so is often impractical28. Instead, we follow prior work on conformation generation1 and evaluate the baselines and proposed method using the RMSD (Eq. 15) of the heavy atoms between a reference conformation and a predicted conformation which is fast and simple to calculate.

## Results

When evaluating each method, we first sample 100 conformations per molecule for each method in the test set. We can make several observations from Table 1. First, compared to other methods, our proposed CVGAE always succeeds at generating the specified number of conformations for any of the molecules in the test set. UFF and MMFF fail to generate conformations for some molecules, as they do not support handling every element but the pre-defined sets of elements. Since all other evaluated approaches were unsuccessful at generating at least one conformation for a very small number of test molecules, we report results for the molecules for which all evaluated methods generated at least one conformation. We report the median of the mean of the RMSD, the median of the standard deviation of the RMSD and the median of the best (lowest) RMSD among all generated conformations for each test molecule. Across all three datasets, every evaluated method achieves roughly the same median of the mean RMSD. More importantly, the standard deviation of the RMSD achieved by CVGAE is significantly lower than that achieved by ETKDG + Force Field. After the initial generation stage, conformations are usually further evaluated and optimized by running the computationally expensive DFT optimization. Reducing the standard deviation can lower the number of conformations on which DFT optimization has to be run in order to achieve a valid conformation. On the other hand, the best RMSD achieved by ETKDG + UFF/MMFF methods is lower than that achieved by CVGAE. Using MMFF initialized by CVGAE (CVGAE + MMFF) instead of ETKDG (ETKDG + MMFF) improves the mean results on the QM9 dataset for CVGAE, and yields a lower standard deviation and similar best RMSD compared to ETKDG + MMFF. Unfortunately, CVGAE + MMFF worsens the results achieved by CVGAE alone on the COD and CSD datasets. We additionally evaluate single point DFT energy for the subset of 1000 molecules in the QM9 test set for all 100 generated conformations. We find that all three methods ETKDG + MMFF, CVGAE and CVGAE + MMFF achieve similar median energy values of −411.52, −410.87 and −411.50 respectively. The energy was calculated using GAMESS software29 with default parameters.

We also report the diversity of conformations generated by all evaluated methods in Table 2. Diversity is measured by calculating the mean and standard deviation of the pairwise RMSD between each pair of generated conformations per molecule. Overall, we can see that despite having a smaller median of standard deviation of RMSD between generated conformations and reference conformations, CVGAE does not collapse to generating extremely similar conformations. Although, CVGAE generates relatively less diverse samples compared to ETKDG + MMFF baseline on all datasets. The conformations of molecules generated by CVGAE + MMFF are less diverse on the QM9 dataset and more diverse on COD/CSD datasets compared to ETKDG + MMFF baseline.

The computational efficiency of each of the evaluated approaches on the QM9 and COD datasets is shown in Fig. 2. For consistency, we generated one conformation for one molecule at a time using each of the evaluated methods on an Intel(R) Xeon(R) E5-2650 v4 CPU. On the QM9 dataset, CVGAE is 2× more efficient than ETKDG + UFF/MMFF, while CVGAE + MMFF is slightly slower than ETKDG + UFF/MMFF. On the COD dataset, which contains a larger number of atoms per molecule, CVGAE is almost 10× as fast as ETKDG + UFF/MMFF, while CVGAE + MMFF is about 2× as fast as ETKDG + UFF/MMFF. This shows that CVGAE scales much better than the baseline ETKDG + UFF/MMFF methods as the size of the molecule grows.

Figures 3 and 4 visualize the median, standard deviation and best RMSD results as a function of the number of heavy atoms in a molecule on the QM9 and COD/CSD datasets respectively. For all approaches, we can see that the best and median RMSD both increase with the number of heavy atoms. The standard deviation of the median RMSD for CVGAE and CVGAE + MMFF is lower than that for ETKDG + MMFF across molecules of almost all sizes. The standard deviation of the best RMSD is slightly higher for CVGAE and CVGAE + MMFF than for ETKDG + MMFF on molecules with at most 12 atoms, but is lower for larger molecules, particularly for CVGAE. Overall, CVGAE yields a lower or similar median RMSD compared to ETKDG + MMFF across molecules of all sizes and a lower standard deviation, whereas ETKDG + MMFF provides a lower best RMSD particularly for larger molecules observed in the COD/CSD datasets.

Figures 5 and 6 qualitatively compare the results of CVGAE against MMFF and CVGAE + MMFF against CVGAE respectively. For each dataset, each figure shows the three molecules for which the first method in each figure outperforms the second method by the greatest amount, and the three molecules for which the second method outperforms the first by the greatest amount. The reference molecules are shown alongside the conformations resulting from each of the methods for comparison.

We can see some general trends from both these figures. The conformations produced by the neural network are qualitatively much more similar to the reference in the case of the QM9 dataset than in the cases of the COD and CSD datasets. In the case of the COD and CSD datasets, the CVGAE predictions appear to be squashed or compressed in comparison to the reference molecules. For example, in almost every case we can see the absence of visible rings and the absence of bonds protruding from the lengthwise dimension of the molecule. At the same time we can see that on COD and CSD, CVGAE does better than ETKDG + MMFF in cases where ETKDG + MMFF creates loops and protrusions in the wrong places.

## Analysis and Future Work

Overall we observe that CVGAE performs better than ETKDG + MMFF on QM9 than on COD and CSD. One possible reason that could explain this phenomenon is that COD and CSD contain much larger number of heavy atoms per molecule than QM9. In the absence of adequate number of neural message passing steps and adequate number of hidden units, the network may converge to outputting a conformation that contains atoms largely along a single non-linear dimension in order to minimize outliers, which would be heavily penalized by the sum of squared distances term in the loss function. A neural network architecture with a larger number of neural message passing steps and larger number of hidden units may be needed to generate less conservative conformations and achieve comparable results to those for QM9. This is a recommended direction of future work that will require more computational resources, including distributed training on multiple GPUs with sufficient memory.

Another concern for COD and CSD is the inconsistency in the environments from which the reference conformations are obtained. The inconsistency would not be a serious concern for small molecules, but it can result in performance degradation with larger molecules. Further investigation should be performed with the dataset of larger molecules and their reference conformations whose corresponding environments are identical. Additionally, conditioning deep generative graph neural networks on the environment could be explored in the future.

We also observe that our CVGAE method has a lower variance than the baseline methods, so a relatively small number of samples needs to be taken before getting a conformation with a good RMSD. In addition, CVGAE is faster than force field methods and uses less computational resources once trained. Using conformations generated by CVGAE as an initialization to force field method showed promising results on the QM9 dataset that allowed to combine the best of two distinct methods. However, applying a force field method on the conformations generated by CVGAE leads to an increase in RMSD on the COD and CSD datasets - future work could explore why this is the case. Another avenue of future inquiry could be the joint training of CVGAE and a force field method, which would involve implementing force field minimization using a deep learning framework, connecting this to CVGAE and backpropagating through this aggregate model. This joint training could further yield better results than either method alone.