Background & Summary

Accurate and affordable prediction of molecular properties is a longstanding goal of computational chemistry. Predictions can be generated with rule-based1 or physics-based2 methods, which typically involve a trade-off between accuracy and speed. Machine learning offers an attractive alternative, as it is far quicker than physics-based methods and outperforms traditional rule-based baselines in many molecule-related tasks, including property prediction and virtual screening3,4,5, inverse design using generative models6,7,8,9,10, reinforcement learning11,12,13, differentiable simulators14,15, and synthesis planning and retrosynthesis16,17.

Advances in molecular machine learning have been enabled by algorithmic improvements18,19,20,21,22 and by reference datasets and tasks23. A number of reference datasets provide unlabeled molecules for generation tasks7,24,25,26,27 or experimentally labeled molecules for property prediction23,28,29,30,31. The molecules are typically represented as SMILES32 or InChI33 strings, which can be converted into 2D graphs, or as single 3D structures. These representations can be used as input to machine learning models that predict properties or generate new compounds. However, these representations fail to capture the flexibility of molecules, which consist of atoms in continual motion on a potential energy surface (PES). Molecular properties are a function of the conformers accessible at finite temperature34,35, which are not explicitly included in a 2D or single 3D representation (Fig. 1). Models that map conformer ensembles to experimental properties could be of interest, but they require a dataset with both conformers and experimental data.

Fig. 1

Molecular representations of the latanoprost molecule. Top: SMILES string. Left: stereochemical formula with edge features, including wedges for in- and out-of-plane bonds, and a double line for cis isomerism. Right: overlay of conformers. Higher transparency corresponds to lower statistical weight.

Here we present the Geometric Ensemble Of Molecules (GEOM), a dataset of high-quality conformers for 317,928 mid-sized organic molecules with experimental data, and 133,258 molecules from the QM9 dataset36. 304,466 drug-like species and their biological assay results were accessed as part of AICures (https://www.aicures.mit.edu), an open machine learning challenge to predict which drugs can be repurposed to treat COVID-19 and related illnesses. 16,865 molecules are from the MoleculeNet benchmark31. They are labeled with experimental properties related to physical chemistry, biophysics, and physiology. Conformers were generated with the CREST program37, which uses extensive sampling based on the semi-empirical extended tight-binding method (GFN2-xTB38) to generate reliable and accurate structures. CREST ensembles from 1,511 species in the BACE dataset39 were also labeled with high-accuracy single-point DFT energies and semi-empirical quasi-harmonic free energies. Of these ensembles, 534 were further refined with DFT geometry optimizations.

GEOM addresses two key gaps in the dataset literature. First, the data can be used to benchmark new models that take conformers as input to predict experimental properties, such as biological assay results for antiviral activity, or physicochemical and physiological properties. Such models could not be trained on the above molecular datasets, which contain only 2D graphs or single 3D structures. Some datasets provide single 3D structures for hundreds of thousands of molecules36,40,41, but do not include a full ensemble for each species. Others contain a continuum of high-quality 3D structures for each species, but only contain hundreds of molecules42,43,44,45,46,47. Yet others contain conformers for tens of thousands of molecules with experimental data48, but the conformers are of force-field quality (see below). GEOM is unique in its size, number of conformers per species, conformer quality, and connection with experiment.

Second, GEOM can be used to train generative models to predict conformers given an input molecular graph. This is an active area of research that seeks to lower the computational cost compared to exhaustive torsional approaches and to increase the speed, reliability and accuracy compared to stochastic approaches49,50,51,52,53,54,55. The size and simulation accuracy of the GEOM dataset make it ideal both as a training set and for pre-training generalizable models. Moreover, machine learning models for conformer generation are orders of magnitude faster than the methods used to generate GEOM. Hence models trained on GEOM may be able to reproduce its accuracy on unseen molecules at a fraction of the cost. As discussed below, the CREST ensembles have high coverage of the true thermally accessible conformers. Hence GEOM is an excellent benchmark for the recall and diversity of conformer generation methods. However, the CREST statistical weights for each conformer are rather inaccurate. Therefore, benchmarks that include conformer probabilities should use the DFT weights provided in GEOM.

Table 1 provides summary statistics of the molecules that make up the dataset. The drug-like molecules from AICures are generally medium-sized organic compounds, containing an average of 44.4 atoms (24.9 heavy atoms), up to a maximum of 181 atoms (91 heavy atoms). They vary widely in flexibility, as demonstrated by the mean (6.5) and maximum (53) number of rotatable bonds. 15% (45,712) of the molecules have specified stereochemistry, while 27% (83,326) have stereocenters but may or may not have specified stereochemistry. The QM9 dataset is limited to 9 heavy atoms (29 total atoms), with a much smaller molecular mass and few rotatable bonds. 72% (95,734) of the species have specified stereochemistry.

Table 1 Molecular descriptor statistics for the QM9 and AICures molecules in the GEOM dataset.

Table 2 summarizes the experimental properties in the GEOM dataset from the AICures dataset. Of note is data for the inhibition of SARS-CoV-2, and for the specific inhibition of the SARS-CoV-2 3CL protease. The 3CL protease has high sequence similarity to its SARS-CoV 3CL counterpart, for which there is significantly more experimental data. The similarity of the two proteases means that CoV-2 models may benefit from pre-training with CoV data, so GEOM can also be used to benchmark transfer learning methods. Another target of interest is the SARS-CoV PL protease56,57. The dataset also contains molecules screened for growth inhibition of E. coli and Pseudomonas aeruginosa, both of which can cause secondary infections in COVID-19 patients.

Table 2 Experimental data for GEOM species from AICures.

Table 3 shows the species from MoleculeNet31 that are included in GEOM. We used every compound from the physical chemistry and physiology categories. These molecules have experimental data for three physical chemistry tasks and 659 physiology tasks. The latter include blood-brain barrier penetration, qualitative toxicity, and whether a drug fails in clinical trials due to toxicity. GEOM also contains the BACE dataset39, which is part of the biophysics category of MoleculeNet. Each BACE molecule has an experimental binding affinity for human β-secretase 1 (BACE-1). The remaining biophysics datasets were excluded because of size, and because the AICures drug dataset is already sufficiently large. The “recovered” column in Table 3 shows that vacuum conformer-rotamer ensembles (CREs) were generated for over 98% of the molecules in each dataset other than SIDER. CREST CREs were also generated with an implicit solvent model of water for 99.9% of the BACE compounds. As mentioned above, these conformers were further annotated with single-point DFT energies and xTB quasi-harmonic free energies.

Table 3 Experimental data for GEOM species from MoleculeNet31.

GEOM contains vacuum CREs for 98% of the original molecules in all but one of the datasets within MoleculeNet. This means that future models using the CREs can be benchmarked against past predictions from 2D and single-conformer models31. Care should still be taken when making such comparisons, as the missing molecules share similar characteristics, so their absence may bias the resulting data. For example, many missing compounds are extremely flexible. For most of these compounds, the CREST calculations ran for several days with 40 cores and did not finish. Other missing compounds failed during initial xTB optimization, often because of unusual topologies; this was most common in the SIDER dataset.

Methods

CREST

Generation of conformers ranked by energy is computationally complex. Many exhaustive, stochastic, and Bayesian methods have been developed to generate conformers58,59,60,61,62,63,64,65. The exhaustive method is to enumerate all the possible rotations around every bond, but this approach has prohibitive exponential scaling with the number of rotatable bonds60,66. Stochastic algorithms available in cheminformatics packages such as RDKit64 suffer from two flaws. First, they explore conformational space very sparsely through a combination of pre-defined distances and stochastic samples67 and can miss many low-energy conformations. Second, in most standalone applications, conformer energies are determined with classical force fields, which are rather inaccurate47. Enhanced molecular dynamics simulations, such as metadynamics (MTD), can sample conformational space more exhaustively, but need to evaluate an energy function many times. Ab initio methods, such as DFT, can assign energies to conformers more accurately than force fields, but are also orders of magnitude more computationally demanding.

An efficient balance between speed and accuracy is offered by the newly developed CREST software37. This program uses semi-empirical tight-binding DFT to calculate the energy. The predicted energies are significantly more accurate than classical force fields, accounting for electronic effects, rare functional groups, and bond-breaking/formation of labile bonds, but are computationally less demanding than full DFT. Moreover, the search algorithm is based on MTD, a well-established thermodynamic sampling approach that can efficiently explore the low-energy search space. Finally, the CREST software identifies and groups rotamers, conformers that are identical except for atom re-indexing. It then assigns each conformer a probability through

$${p}_{i}^{{\rm{CREST}}}=\frac{{d}_{i}\,\exp (-{E}_{i}/{k}_{{\rm{B}}}T)}{{\sum }_{j}{d}_{j}\,\exp \left(-{E}_{j}/{k}_{{\rm{B}}}T\right)}.$$
(1)

Here \({p}_{i}^{{\rm{CREST}}}\) is the statistical weight of the ith conformer, \({d}_{i}\) is its degeneracy (i.e., how many chemically and permutationally equivalent rotamers correspond to the same conformer), \({E}_{i}\) is its energy, \({k}_{{\rm{B}}}\) is the Boltzmann constant, T is the temperature, and the sum is over all conformers. Equation (1) is an approximation to the true probability, \({p}_{i}\propto \exp (-{G}_{i}/{k}_{{\rm{B}}}T)\), where \({G}_{i}\) is the free energy [Eqs. (3) and (4)]. The solvation free energy can be incorporated into E with a solvent model, but the translational, rotational, and vibrational free energies are missing. The addition of these terms is discussed below.
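As a minimal sketch of Eq. (1), the following Python function converts conformer energies and degeneracies into normalized statistical weights. The function name, the choice of kcal/mol as the energy unit, and the default temperature of 298.15 K are illustrative assumptions and are not taken from CREST itself.

```python
import math

# Boltzmann constant in kcal/(mol K); energies are assumed to be in kcal/mol.
K_B = 0.0019872041


def crest_weights(energies, degeneracies, temperature=298.15):
    """Statistical weights from Eq. (1): p_i proportional to d_i * exp(-E_i / k_B T)."""
    kt = K_B * temperature
    e_min = min(energies)  # shifting by the minimum leaves the weights unchanged
    factors = [d * math.exp(-(e - e_min) / kt)
               for e, d in zip(energies, degeneracies)]
    total = sum(factors)
    return [f / total for f in factors]


# Hypothetical ensemble: the second and third conformers have the same energy,
# but the third has twice the degeneracy and so twice the weight.
weights = crest_weights([0.0, 0.5, 0.5], [1, 1, 2])
```

Subtracting the lowest energy before exponentiating does not change the result, since the common factor cancels in the ratio, but it avoids overflow when absolute energies are large.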

To generate conformers and rotamers, CREST takes a geometry as input and uses its flexibility to determine an MTD simulation time \({t}_{\max }\) (between 5 and 200 ps). The initial structure is deformed by propagating Newton’s equations of motion with an NVT thermostat68 from time t = 0 to \({t}_{\max }\). The potential at each time step is given by the sum of the GFN2-xTB potential energy and a bias potential,

$${V}_{{\rm{bias}}}=\mathop{\sum }\limits_{i}^{n}{k}_{i}\exp \left(-{\alpha }_{i}{\Delta }_{i}^{2}\right),$$
(2)

which forces the molecule into new conformations. The collective variables \({\Delta }_{i}\) are the root-mean-square displacements (RMSDs) of the structure with respect to the ith reference structure, n is the number of reference structures, \({k}_{i}\) is the pushing strength and \({\alpha }_{i}\) determines the shape of the potential. A new reference structure from the trajectory is added to \({V}_{{\rm{bias}}}\) every 1.0 ps, driving the molecule to explore new conformations. Different molecules require different \(({k}_{i},{\alpha }_{i})\) pairs to produce the best results, so twelve different MTD runs are used with different settings for the \({V}_{{\rm{bias}}}\) parameters.
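The bias in Eq. (2) can be evaluated directly once the RMSDs to the stored reference structures are known. The sketch below is only an illustration of the formula: the function name and the numerical values are hypothetical, and the units are arbitrary.

```python
import math


def bias_potential(rmsds, strengths, widths):
    """V_bias = sum_i k_i * exp(-alpha_i * Delta_i**2), as in Eq. (2)."""
    if not (len(rmsds) == len(strengths) == len(widths)):
        raise ValueError("need one (k_i, alpha_i) pair per reference structure")
    return sum(k * math.exp(-a * d * d)
               for d, k, a in zip(rmsds, strengths, widths))


# Hypothetical example with two reference structures.
# Close to the references the bias is large; far from them it decays to zero.
v_near = bias_potential([0.0, 0.1], [1.0, 1.0], [0.5, 0.5])
v_far = bias_potential([3.0, 3.5], [1.0, 1.0], [0.5, 0.5])
```

Because each term is largest at zero RMSD, the bias penalizes geometries that resemble structures already visited, which is what pushes the trajectory toward new conformations.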

Conformers are defined by rotation about dihedral angles. In MTD simulations with RMSD collective variables, the biasing potential in Eq. (2) generates energy for overcoming torsional barriers. Since it takes less energy to cross a rotational barrier than to break a covalent bond, the biasing term leads to exploration of conformational space through rotation, rather than to trivial fragmentation37,68. Indeed, the bare energy without the biasing term keeps the molecule from exploring ultra-high energy regions, and thus reduces the size of the 3N-6-dimensional PES to be explored, where N is the number of atoms. This also makes it more efficient at finding accessible minima than an exhaustive enumeration of dihedral angles, since the latter would include high-energy, thermally inaccessible structures.

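Geometries from the MTD runs are then optimized with GFN2-xTB. Conformers are identified as structures with \(\Delta E > {E}_{{\rm{thr}}}\), \({\rm{RMSD}} > {{\rm{RMSD}}}_{{\rm{thr}}}\), and \(\Delta {B}_{e} > {B}_{{\rm{thr}}}\), where \(\Delta E\) is the energy difference between structures, \(\Delta {B}_{e}\) is the difference in their rotational constants, and thr denotes a threshold value. Rotamers, which have the same energy as their parent conformer, are identified through \(\Delta E < {E}_{{\rm{thr}}}\), \({\rm{RMSD}} > {{\rm{RMSD}}}_{{\rm{thr}}}\), and \(\Delta {B}_{e} < {B}_{{\rm{thr}}}\). Duplicates are identified through \(\Delta E < {E}_{{\rm{thr}}}\), \({\rm{RMSD}} < {{\rm{RMSD}}}_{{\rm{thr}}}\), and \(\Delta {B}_{e} < {B}_{{\rm{thr}}}\). The defaults, which are used in this work, are \({E}_{{\rm{thr}}}\) = 0.1 kcal/mol, \({{\rm{RMSD}}}_{{\rm{thr}}}\) = 0.125 Å, and \({B}_{{\rm{thr}}}\) = 15.0 MHz. Conformers and rotamers are added to the CRE and duplicates are discarded.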

If a new conformer has a lower energy than the input structure, the procedure is restarted using the conformer as input, and the resulting structures are added to the CRE. The procedure is restarted between one and five times. The three conformers of lowest energy then undergo two normal molecular dynamics (MD) simulations at 400 K and 500 K. These are used to sample low-energy barrier crossings, such as simple torsional motions, which are needed to identify the remaining rotamers. Conformers and rotamers are once again identified and added to the CRE. All accumulated structures are then used as inputs to a genetic Z-matrix crossing algorithm68,69, the results of which are also added to the CRE. All geometries accumulated throughout the sampling process are optimized with a tight convergence threshold, identified as conformers, rotamers or duplicates, and sorted to yield the final set of structures. The process is restarted after the regular MD runs or the tight optimization if any conformers have lower energy than the input, with no limit to the number of restarts. The final CRE contains conformers and rotamers up to a maximum energy \({E}_{{\rm{win}}}\). The default \({E}_{{\rm{win}}}\) = 6.0 kcal/mol provides a safety net around errors in the xTB energies, as only conformers with \(E\;\lesssim \;2.5\) kcal/mol have significant population at room temperature.
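The role of the energy window can be illustrated with a short sketch that keeps only structures within 6.0 kcal/mol of the minimum and compares Boltzmann factors at 2.5 and 6.0 kcal/mol. The helper names and the temperature of 298.15 K are assumptions made for the example, and the energies are hypothetical.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def within_window(energies, e_win=6.0):
    """Indices of structures within e_win (kcal/mol) of the lowest energy."""
    e_min = min(energies)
    return [i for i, e in enumerate(energies) if e - e_min <= e_win]


def relative_population(delta_e, temperature=298.15):
    """Boltzmann factor exp(-dE / k_B T) relative to the lowest conformer."""
    return math.exp(-delta_e / (K_B * temperature))


# Hypothetical relative energies in kcal/mol; the last one is outside the window.
kept = within_window([0.0, 1.2, 5.9, 7.4])
pop_25 = relative_population(2.5)  # roughly a percent of the lowest conformer
pop_60 = relative_population(6.0)  # negligible at room temperature
```

A structure at the edge of the window is far too high in energy to matter if the xTB energies are exact, so the extra margin beyond 2.5 kcal/mol serves only to absorb errors in those energies.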

CREST generates ensembles with good coverage of the true CREs. For example, ref. 37 compared the experimental conformations of gas-phase citronellal, inferred through microwave spectroscopy70, with computational predictions. Each of the 15 lowest-energy experimental conformers was found in the CREST ensemble. The \({}^{1}\)H-NMR spectrum was then computed in chloroform using CREST conformers, together with DFT for energy re-ranking and computation of the coupling and shielding constants. The spectrum with the ensemble matched experiment far better than with only one conformer37. Investigation of macrocycles, a protonated peptide, metal-organic systems, and the 1-naphthol dimer yielded similarly good results.

DFT

CREST offers an excellent balance between cost and accuracy for generating an initial CRE. The GFN2-xTB method is fast enough to be used in long MTD runs, and its conformational energies are accurate to within 2 kcal/mol (see Technical Validation). The number of energy and force calculations can easily reach into the millions for a single CREST run, making full DFT prohibitive and xTB quite practical. Further, the CREST safety window of 6.0 kcal/mol ensures that the vast majority of accessible conformers should be present in the CRE. However, the typical xTB errors of 2 kcal/mol are too large for the accurate ranking of the conformers by statistical weight. This is because p is exponential in \(\Delta E/{k}_{{\rm{B}}}T\), and at room temperature \({k}_{{\rm{B}}}T\) = 0.59 kcal/mol, which is 3.4 times smaller than the average error. Further, the weights do not take into account the zero-point energy or the roto-translational and vibrational entropy (see below). Each of these contributions to the free energy is conformation-dependent, and can lead to non-negligible changes in statistical weight.
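To make this sensitivity concrete, consider two conformers whose energy gap is in error by \(\delta \). The ratio of their Boltzmann weights is then off by a factor of \(\exp (\delta /{k}_{{\rm{B}}}T)\). Using the values quoted above, \(\delta \) = 2 kcal/mol and \({k}_{{\rm{B}}}T\) = 0.59 kcal/mol, this gives

$$\exp \left(\frac{\delta }{{k}_{{\rm{B}}}T}\right)=\exp \left(\frac{2}{0.59}\right)\approx 30,$$

so an error of typical size can change the relative population of two conformers by more than an order of magnitude.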

DFT can be used to optimize conformers and compute their relaxed energies. However, each ensemble can contain hundreds of conformers, which makes DFT optimization extremely resource-intensive. Further, a Hessian calculation is required to compute the zero-point energy and entropic corrections to the free energy. Such calculations are among the most computationally demanding in quantum chemistry. Thus a full DFT optimization of each ensemble, together with an accurate free energy calculation, is a daunting task.

To address these issues, the developers of CREST recently introduced the CENSO program71. CENSO uses a series of optimizations at increasingly accurate levels of DFT. The free energy cutoff for discarding conformers is reduced at each stage, leading to fewer conformers in each successive round. Further, CENSO uses the recently developed r2scan-3c meta-GGA functional72 for the final optimization. r2scan-3c with the custom-made mTZVPP basis set is extremely accurate, yielding conformational energies that are within 0.3 kcal/mol of the CCSD(T) complete basis set limit72. It is also quite affordable given its accuracy, with a cost that is 100–1000 times lower than hybrid functionals with large basis sets72. The optimization is further accelerated by discarding duplicate conformers and high-energy geometries that are close to converged71. Lastly, CENSO computes entropic and zero-point corrections using the new biased Hessian method73. This technique uses xTB, which is quite computationally affordable, together with an extra biasing potential. The biasing potential accounts for energy differences between xTB and DFT, which allows xTB Hessians to be computed for DFT-optimized geometries.

The statistical weight computed by CENSO for the ith conformer is

$${p}_{i}^{{\rm{CENSO}}}=\frac{\exp (-{G}_{i}/{k}_{{\rm{B}}}T)}{{\sum }_{j}\exp \left(-{G}_{j}/{k}_{{\rm{B}}}T\right)},$$
(3)

where \({G}_{i}\) is the conformation-dependent free energy. Note that unlike CREST, CENSO does not include rotamer degeneracy in the calculation of p. The reason is that accounting for all rotamers, though attempted by CREST, is still a difficult task, and the difference in rotamers among different conformers should be small. The free energy is given by71:

$${G}_{i}={E}_{{\rm{gas}}}^{(i)}+\delta {G}_{{\rm{solv}}}^{(i)}(T)+{G}_{{\rm{trv}}}^{(i)}(T).$$
(4)

Here \({E}_{{\rm{gas}}}\) is the gas phase energy, \(\delta {G}_{{\rm{solv}}}(T)\) is the solvation free energy, and \({G}_{{\rm{trv}}}(T)\) is the free energy due to translation, rotation and vibration. \(\delta {G}_{{\rm{solv}}}\) can be calculated with implicit solvent methods such as COSMO-RS74,75 or C-PCM76. The solvation free energy predicted by r2scan-3c/COSMO-RS is typically accurate to within 0.5 kcal/mol71. Given the Hessian matrix and the associated normal modes, \({G}_{{\rm{trv}}}(T)\) can be computed within the standard modified rigid-rotor harmonic-oscillator approximation77. This term can be predicted quite accurately, with sub-chemical accuracy attainable even for semi-empirical methods71.
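Equations (3) and (4) can be combined in a few lines of code. The sketch below sums the three free energy contributions for each conformer and normalizes the resulting Boltzmann factors. It is not the CENSO implementation: the function names, units (kcal/mol), temperature, and numerical values are assumptions chosen for illustration.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def free_energy(e_gas, dg_solv, g_trv):
    """G_i = E_gas + dG_solv(T) + G_trv(T), as in Eq. (4); all in kcal/mol."""
    return e_gas + dg_solv + g_trv


def censo_weights(free_energies, temperature=298.15):
    """p_i = exp(-G_i / k_B T) / sum_j exp(-G_j / k_B T), as in Eq. (3)."""
    kt = K_B * temperature
    g_min = min(free_energies)  # shift for numerical stability
    factors = [math.exp(-(g - g_min) / kt) for g in free_energies]
    total = sum(factors)
    return [f / total for f in factors]


# Hypothetical contributions for two conformers. The second is higher in
# gas-phase energy but better solvated, so it ends up with the larger weight.
g = [free_energy(0.0, -3.0, 1.0), free_energy(0.4, -3.6, 1.1)]
p = censo_weights(g)
```

The example shows why the additional terms matter: the ordering by gas-phase energy alone can be reversed once solvation and thermal contributions are included.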

CENSO qualitatively reproduces the optical rotation of organic molecules measured in solution, which is a challenging task that depends sensitively on the CRE71. Further, it makes very accurate predictions of the octanol-water partition coefficients and pKa values of various organic molecules71. Conformers and statistical weights generated by CENSO are thus quite reliable.

In this work we apply CENSO to 534 species, yielding the highest-accuracy ensembles ever generated for drug-like molecules. Calculations are performed in implicit water solvent for 35% of the molecules in the BACE dataset31,39, which contains experimental binding affinities for inhibitors of BACE-1 (Table 3). Binding affinity models that incorporate CREs can be trained with this data. Models trained on a single conformer can also benefit from the CENSO ensembles. Since many of the drug-like molecules are quite flexible, the typical approach of optimizing a single force field conformer with DFT is likely to miss the true lowest-energy structure. Thus the lowest-energy CENSO structures are far more reliable inputs to single-conformer models. Lastly, the ensembles can be used for transfer learning (TL), so that generative models trained on the large CREST dataset can be fine-tuned with the CENSO data.

In addition to the fully optimized CREs, we provide single-point DFT energies for all 1.3 million CREST conformers in 1,511 out of 1,513 BACE species (99.9%). We also provide xTB vibrational frequencies to complete the calculation of G. Together, these calculations give statistical weights that are much more accurate than those of CREST, and somewhat less accurate than CENSO. Since nearly all BACE species have single-point calculations, future binding affinity models using the re-ranked CREs can be benchmarked against predictions from past 2D and 3D models31. All geometries with DFT energies are also annotated with DFT dipole moments, partial charges, and molecular orbital energies. This data can be used for multi-task learning to improve TL for conformer generation.

Conformer generation

SMILES pre-processing

SMILES strings from the QM9 dataset were used as given. SMILES strings and properties of the drug-like molecules were accessed from ref. 78 and https://github.com/yangkevin2/coronavirus_data/tree/master/data (original sources are3,56,57,79,80,81). Each SMILES string was converted to its canonical form using RDKit. This allowed us to assign multiple properties from multiple sources to a single species, even if different non-canonical SMILES strings were used in the original sources.

3.9% of the drug molecules accessed (11,886 total) were given as clusters, either with a counterbalancing ion (e.g. “.[Na+]”, “.[Cl-]”) or with an acid to represent the protonated salt (e.g. “.Cl”). For clusters that were not acid-base pairs, we identified the compound of interest as the heaviest component of the cluster. For the acid/base SMILES strings, we used reaction SMARTS in RDKit to generate the protonated molecule and counterion. This product SMILES was used in place of the original SMILES. Original SMILES strings are available in the dataset with the key uncleaned_smiles (see https://github.com/learningmatter-mit/geom for details). Not only does de-salting identify the drug-like compound in each cluster and correct its ionization state, it also homogenizes the molecular representations in the drug datasets. For MoleculeNet we also selected the heaviest component from each cluster SMILES, but did not perform protonation.

Initial structure generation

To generate conformers with CREST one must provide an initial guess geometry, ideally optimized at the same level of theory as the simulation (GFN2-xTB). For the drug molecules we therefore used RDKit to generate initial conformers from SMILES strings, optimized each conformer with GFN2-xTB, and used the lowest energy conformer as input to CREST.

Conformers were generated in RDKit using the EmbedMultipleConfs command with 50 conformers (numConfs = 50), a pruning threshold of similar conformers of 0.01 Å (pruneRmsThresh = 0.01), a maximum of five embedding attempts per conformer (maxAttempts = 5), coordinate initialization from the eigenvalues of the distance matrix (useRandomCoords = False), and a random seed. If no conformers were successfully generated then numConfs was increased to 500. Each conformer was then optimized with the MMFF force field82 in RDKit using the default arguments. Duplicate conformers, identified as those with an RMSD below 0.1 Å, were removed after optimization. Optimization was skipped for any molecules with cis/trans stereochemistry (indicated by “\” or “/” in the SMILES string), as such stereochemistry is not always maintained during RDKit optimization.

The ten MMFF-optimized conformers with the lowest energy were further optimized with xTB using Orca 4.2.0 (refs. 83, 84). The conformer with the lowest xTB energy was selected as the seed geometry for CREST. The QM9 molecules are already optimized with DFT, and so in principle did not need to be optimized further for CREST. However, since it is recommended to seed CREST with a structure optimized at the GFN2-xTB level of theory, we re-optimized each QM9 geometry with xTB before using it in CREST.

CREST simulation

A single xTB-optimized structure was used as input to the CREST simulation of each species. Default values were used for all CREST arguments, except for the charge of each geometry. CREST runs on the AICures drug dataset took an average of 2.8 hours of wall time on 32 cores on Knights Landing (KNL) nodes (89.1 core hours), and 0.63 hours on 13 cores on Cascade Lake and Sky Lake nodes (8.2 core hours). QM9 jobs were only performed on the latter two nodes, and took an average of 0.04 wall hours on 13 cores (0.5 core hours). 13 million KNL core hours and 1.2 million Cascade Lake/Sky Lake core hours were used in total.

CREST calculations on MoleculeNet species were run across several compute clusters, each with various node types and different core counts per node. KNL nodes were not used. Excluding species already present in the AICures dataset, each MoleculeNet job took 6.3 hours of wall time using 18.1 cores on average. These values are skewed by extremely flexible molecules whose CREST jobs took several days to finish: the median wall time was 1.4 hours, and the median core count was 12.0. 1.5 million CPU hours were used in total.

Graph re-identification

It was necessary to re-identify the graph of each conformer generated by CREST, for the following reasons. First, stereochemistry may not have been specified in the original SMILES string, but necessarily existed in each of the generated 3D structures. Second, reactivity such as dissociation or tautomerization may have occurred in the CREST simulations (CREST has specific commands to generate tautomers, but they were not used here). This would also lead to conformers with different graphs.

To re-identify the graphs we used xyz2mol85 (code accessed from https://github.com/jensengroup/xyz2mol) to generate an RDKit mol object. These mol objects were used to assign graph features to each conformer (see Data Records). It should be noted that xyz2mol sometimes assigned resonance structure graphs instead of the original graphs. In some cases this caused different conformers of the same species to have different graphs. This happened, for example, when the conformers had different cis/trans isomerism about a double bond that was only present because of the resonance structure used (see the RDKit tutorial at https://github.com/learningmatter-mit/geom). This is conceptually different from species whose conformer graphs differ because of reactivity. One may want to distinguish these two cases when analyzing the conformer mol objects. We also note that CREST changed the atom ordering of the input geometry, and hence of the subsequent conformers. This means that, even if a conformer did not react, we could not simply create an RDKit mol object with its canonical SMILES and set its coordinates.

CENSO simulation

534 molecules from the BACE dataset (35%) were optimized with CENSO. Initial CREs were generated with CREST using the ALPB model for water86. The CREs were refined with CENSO 1.1.2, using Orca 5.0.1 (ref. 87) to perform the DFT calculations. The C-PCM76 model of water was used for DFT and the ALPB model was used for xTB. Conformer and rotamer duplicates were removed throughout the optimization using CREST (crestcheck = “on”). Default values were used for all other parameters. We used the same clusters and nodes for CENSO as for CREST with MoleculeNet species. The average CENSO job took 1 day and 4 hours of wall time using 54 cores. 781,000 CPU hours were used in total.

Single point calculations

We performed single-point DFT calculations on all CREST conformers in the BACE dataset without further optimization. We used Orca version 5.0.2 and the same level of theory as in CENSO optimization (r2scan-3c functional, mTZVPP basis, C-PCM model of water, and default grid 2). The average run took 6.4 minutes of wall time using 8 cores. Calculations took a total of 1.1 million CPU hours for 1.3 million conformers.

Hessian calculations

We performed Hessian calculations on all CREST conformers in the BACE dataset, using xTB with the ALPB model for water. The average run took 41 seconds of wall time using 4 cores. Calculations took a total of 63,000 CPU hours for 1.3 million conformers.

Conformational property prediction

The GEOM dataset is significant because it allows for the training of conformer-based property predictors and generative models to predict new conformations. The first application will be explored in a future publication. The second application is necessary for using conformer-based ML models in practice, since generating CREST structures from scratch is too costly for the virtual screening of new species. Such work is already underway88, paving the way for graph → conformer ensemble → property models that can be trained end-to-end. Here we give an example of a simpler application in the same vein, benchmarking methods to predict summary statistics of each conformer ensemble, rather than the conformers themselves. Our proposed tasks are similar to the benchmark QM9 tasks, which measure a model’s ability to predict properties that are uniquely determined by geometry. Here, since we provide conformer ensembles for each species, we measure a model’s ability to predict properties defined by the ensemble. Because one chemical graph spawns a unique conformer ensemble, these tasks are also a metric of the performance of graph-based models to infer properties mediated through conformational flexibility.

We trained different models to predict three quantities related to conformational information. A summary of these quantities can be found in Table 4 and Fig. 2. The first quantity is the conformational free energy, G = −TS, where the ensemble entropy is \(S=-R{\sum }_{i}{p}_{i}{\rm{\log }}\,{p}_{i}\)37. Here the sum runs over all conformers, \({p}_{i}\) is the statistical probability of the ith conformer, and R is the gas constant. The conformational entropy is a measure of the conformational degrees of freedom available to a molecule. A molecule with only one conformer has an entropy of exactly 0, while a molecule with equal statistical weight for an infinite number of conformers has infinite conformational entropy. The conformational Gibbs free energy is an important quantity for predicting the binding affinity of a drug to a target. The affinity is determined by the change in Gibbs free energy of the molecule and protein upon binding, which includes the loss of molecular conformational free energy89. The second quantity is the average conformational energy. The average energy is given by \(\langle E\rangle ={\sum }_{i}{p}_{i}{E}_{i}\), where \({E}_{i}\) is the energy of the ith conformer. Each energy is defined with respect to the lowest-energy conformer. The third quantity is the number of unique conformers for a given molecule, as predicted by CREST within the default maximum energy window37.
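The first two target quantities follow directly from the conformer weights and energies. A minimal sketch is given below, assuming that the logarithm in the entropy is the natural logarithm and that energies are in kcal/mol; the function names and the temperature of 298.15 K are illustrative choices.

```python
import math

R = 0.0019872041  # gas constant in kcal/(mol K)


def conformational_entropy(weights):
    """S = -R * sum_i p_i ln p_i; terms with p_i = 0 contribute nothing."""
    return -R * sum(p * math.log(p) for p in weights if p > 0.0)


def conformational_free_energy(weights, temperature=298.15):
    """G = -T * S."""
    return -temperature * conformational_entropy(weights)


def average_energy(weights, energies):
    """<E> = sum_i p_i E_i, with E_i measured from the lowest-energy conformer."""
    e_min = min(energies)
    return sum(p * (e - e_min) for p, e in zip(weights, energies))


# A single conformer has zero entropy; two equally weighted conformers give R ln 2.
s_one = conformational_entropy([1.0])
s_two = conformational_entropy([0.5, 0.5])
g_two = conformational_free_energy([0.5, 0.5])
e_avg = average_energy([0.5, 0.5], [1.0, 3.0])
```

For N equally weighted conformers the entropy is R ln N, which grows without bound as N increases, consistent with the limiting cases described above.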

Table 4 CREST-based statistics for the QM9 and AICures drug datasets.

Fig. 2 Violin plots of CREST-based statistics for the QM9 and AICures drug datasets.

We trained a kernel ridge regression (KRR) model90, a random forest91, and three different neural networks to predict conformer properties. The random forest, KRR and feed-forward neural network (FFNN) models were trained on Morgan fingerprints92 generated through RDKit. Two different message-passing neural networks93 were trained. The first, called ChemProp, has achieved state-of-the-art performance on a number of benchmarks20. The second is based on the SchNet force field model94,95. We call it SchNetFeatures, as it learns from 3D geometries using the SchNet architecture, but also incorporates graph-based node and bond features. The SchNetFeatures models were trained on the highest-probability conformer of each species.

100,000 species were sampled randomly from the AICures drug subset of GEOM. We used the same 60-20-20 train-validation-test split for each model. The splits, trained models, and log files can be found at96, under the heading “synthetic”. Hyperparameters were optimized for each model type and for each task using the hyperopt package. Details of the hyperparameter searches, optimal parameters, and network architectures can be found in the same location as the models. Source code is available at https://github.com/learningmatter-mit/NeuralForceField.
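A reproducible 60-20-20 split of this kind can be sketched as follows. The actual splits are the ones distributed with the trained models; the function name, the seed, and the use of Python's `random` module are assumptions made only for illustration.

```python
import random


def split_species(smiles_list, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle species and split them into train, validation, and test sets."""
    shuffled = list(smiles_list)
    # A dedicated generator with a fixed seed keeps the split reproducible,
    # so that every model type can be trained on identical subsets.
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

Reusing the same seed for every model type gives each model the same training, validation, and test species, which is what makes the errors comparable.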

Results are shown in Table 5. ChemProp and SchNetFeatures are the strongest models overall, followed in order by FFNN, KRR, and random forest. Of the three models that use fixed 2D fingerprints, we see that the FFNN is best able to map these non-learnable representations to properties. ChemProp has the added flexibility of learning an ideal molecular representation directly from the graph, and so performs even better than the FFNN. The SchNetFeatures model retains this flexibility while incorporating extra information from one 3D structure. Compared to ChemProp, its prediction error is 10% lower for G, nearly equal for \(\left\langle E\right\rangle \), and 5% lower for ln(unique conformers). The small improvement in performance is not surprising, as the ensemble properties are mainly determined by molecular flexibility, which is a function of the graph through the number of rotatable bonds. A single 3D geometry would not provide extra information about this flexibility.

Table 5 Prediction mean absolute error (MAE) for three conformer-related properties.
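For reference, the error metric in Table 5 and the relative comparisons quoted in the text (for example, an error that is 10% lower than that of another model) can be computed as in the sketch below. The function names are hypothetical and the numbers in the usage note are not taken from the table.

```python
def mean_absolute_error(predicted, target):
    """Mean absolute error between two equally long sequences of values."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def relative_error_change(mae_model, mae_reference):
    """Fractional change in MAE relative to a reference model.

    A negative value means the model has a lower error than the reference.
    """
    return (mae_model - mae_reference) / mae_reference
```

As an illustration with made-up values, a model with an MAE of 0.9 compared against a reference MAE of 1.0 gives a relative change of about −0.1, that is, a 10% lower error.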

We see that various models can accurately predict conformer properties when trained on the GEOM dataset. With access to the dataset, researchers will therefore be able to predict results of expensive simulations without performing them directly. This has implications beyond ensemble-averaged properties, as generative models trained on the GEOM dataset will also be able to produce the conformers themselves88.

Data Records

The dataset is available online at97, and detailed tutorials for loading and analyzing the data can be found at https://github.com/learningmatter-mit/geom.

The data is available either through MessagePack, a language-agnostic binary serialization format, or through Python pickle files. There are two MessagePack files for the AICures drug dataset and two for QM9. Each file contains a dictionary, where the keys are SMILES strings and the values are sub-dictionaries. In the file with suffix crude, the sub-dictionaries contain both species-level information (experimental binding data, average conformer energy, etc.) and a list of dictionaries, one for each conformer. Each conformer dictionary has its own conformer-level information (geometry, energy, degeneracy, etc.). In the file with suffix featurized, each conformer dictionary also contains information about its molecular graph.
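The nested layout described above can be navigated as in the following sketch. The key names "conformers" and "relativeenergy", as well as both function names, are assumptions made for illustration and should be checked against the tutorials in the repository. The loader assumes that the third-party msgpack package is installed and that the file can be read as a stream of dictionaries.

```python
def load_first_chunk(path):
    """Read the first dictionary from a MessagePack file (assumed layout)."""
    import msgpack  # third-party package, assumed to be installed

    with open(path, "rb") as handle:
        unpacker = msgpack.Unpacker(handle)
        # Each unpacked object is expected to map SMILES -> species record.
        return next(iter(unpacker))


def lowest_energy_conformer(species_record,
                            conformer_key="conformers",
                            energy_key="relativeenergy"):
    """Return the conformer dictionary with the lowest relative energy.

    Both key names are hypothetical defaults, not confirmed by the text.
    """
    return min(species_record[conformer_key], key=lambda conf: conf[energy_key])
```

Passing the key names as parameters keeps the helper usable even if the actual field names in the files differ from the ones assumed here.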

The Python pickle files are organized in a different fashion. The main folder is divided into sub-folders for QM9, AICures, and MoleculeNet data, plus separate folders for BACE calculations in water and with CENSO. Each sub-folder contains one pickle file for each species. Each pickle file contains both summary information and conformer information for its species. Each conformer is stored as an RDKit mol object, so that it contains both the geometry and graph features. One may want to load only the pickle files of species with specific properties (e.g., those with experimental data for SARS-CoV-2 inhibition); for this one can use the summary JSON file. This file contains all summary information along with the path to the pickle file, but without the list of conformers. It is therefore lightweight and quick to load, and can be used to choose species before loading their pickles.
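The select-then-load workflow described above might look like the sketch below. The function names and the key names "pickle_path" and "uniqueconfs" are assumptions made for illustration; the tutorials in the repository give the actual field names. Unpickling the species files requires RDKit to be installed, and pickle files should only be loaded from trusted sources.

```python
import json
import os
import pickle


def load_summary(summary_path):
    """Load the lightweight summary file (SMILES -> summary dictionary)."""
    with open(summary_path, "r") as handle:
        return json.load(handle)


def select_species(summary, predicate):
    """Keep only the species whose summary dictionary satisfies predicate."""
    return {smiles: info for smiles, info in summary.items() if predicate(info)}


def load_species(root_dir, info, path_key="pickle_path"):
    """Load the full pickle of one selected species.

    path_key is a hypothetical name for the field holding the pickle path.
    """
    with open(os.path.join(root_dir, info[path_key]), "rb") as handle:
        return pickle.load(handle)
```

For example, `select_species(summary, lambda info: info.get("uniqueconfs", 0) > 10)` would keep only species with more than ten conformers, assuming the summary stores that count under the hypothetical key "uniqueconfs".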

Technical Validation

The quality of the data was validated in three different ways. First, we checked that the conformer data was accurately parsed from the CREST calculations. To do so we randomly sampled one conformer from 20 different species and manually confirmed that its data matched the data in the CREST output files.

Second, we re-identified the graphs of the conformers generated by CREST using xyz2mol. The graph re-attribution procedure succeeded for 88.4% of the QM9 molecules and 94.7% of the drug molecules, recovering the original molecular graph that was used to generate each conformer. Note that to compare graphs we removed stereochemical indicators from both the original and the re-generated graph. This was done because of cases in which stereochemistry was not specified originally but was specified in the generated conformers. All of the failed QM9 graphs underwent a chemical reaction, which can be explained by the presence of highly strained and unstable molecules. However, manual inspection of 53 failed cases in the AICures drug dataset suggests that 70% of the failed drug graphs failed only because of poor handling of resonance forms by xyz2mol (see above). This means that the original graph was likely recovered for 98.4% of all drugs. Of the inspected failures, 21% were caused by tautomerization (1% of all cases) and 9.4% by a different reaction (usually dissociation or ring formation; 0.5% of all cases). The high success rate of the graph re-identification indicates that, in the vast majority of cases, the geometries generated by CREST were actual conformers of the species.
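The percentages quoted for the drug molecules follow from simple arithmetic, sketched here with a hypothetical helper function. It assumes that the shares measured on the manually inspected failures carry over to all failed drug graphs.

```python
def share_of_all_species(success_rate, share_of_failures):
    """Convert a share of the failed cases into a share of all species."""
    return (1.0 - success_rate) * share_of_failures


success = 0.947  # fraction of drug graphs recovered by xyz2mol

# Failures attributed to resonance handling are counted as recovered graphs.
likely_recovered = success + share_of_all_species(success, 0.70)
tautomerized = share_of_all_species(success, 0.21)
other_reaction = share_of_all_species(success, 0.094)
```

This gives roughly 0.984 for the likely recovered fraction, 0.011 for tautomerization, and 0.005 for other reactions, matching the 98.4%, 1%, and 0.5% stated in the text.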

Third, we compared the CREST energies and coordinates to those from higher levels of theory. Figure 3 compares the GFN2-xTB calculations of CREST with single-point r2scan-3c calculations, both performed in water for 1,511 species in the BACE dataset. Panel (a) shows the relative energies of the two methods. The mean absolute error (MAE) of xTB is 1.96 kcal/mol, which is similar to reported values in conformational energy benchmarks38. The ranking accuracy can be measured with the Spearman correlation coefficient ρ, which ranges from −1 (perfect anti-correlation) to 1 (perfect correlation). The Spearman coefficient is 0.47 when using all geometries from all species. However, it is more meaningful to judge the energy rankings among different conformers in a single species. Computing ρ separately for each species yields the distribution in panel (b). The distribution of ρ is quite wide, with an average value of 0.39 and a standard deviation of 0.35. The mean value of ρ indicates moderate correlation between the methods. The correlation is significantly better than for classical force fields such as MMFF9482, UFF98, and GAFF99. For instance, the median ρ between MMFF94 and single-point DFT for drug-like molecules is between −0.45 and −0.1, meaning that the two methods are actually weakly anti-correlated (Supporting Information of ref. 47).
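The per-species rank correlation can be sketched in plain Python as the Pearson correlation of the ranks, with tied values sharing their average rank. The function names are hypothetical and this is not the analysis code behind Fig. 3; in practice a library routine such as SciPy's spearmanr would normally be used.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman coefficient of two equally long sequences (nan if undefined)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def per_species_rho(ensembles):
    """Map each species to rho between its two lists of conformer energies."""
    return {name: spearman_rho(e1, e2) for name, (e1, e2) in ensembles.items()}
```

Computing ρ within each species, as in `per_species_rho`, compares only conformers of the same molecule, which is the distribution described for panel (b).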

Fig. 3 Comparison of GFN2-xTB (CREST) energies and single-point r2scan-3c (DFT) energies. (a) xTB vs. r2scan-3c energies for all geometries in the BACE-1 dataset. The ideal correlation is shown with a dashed white line. (b) Distribution of Spearman rank correlation coefficients ρ, measuring the accuracy of xTB energy ranking for each of the ensembles.

Figure 4 compares single-point DFT calculations on CREST geometries (“SP”) with DFT results on fully optimized geometries (“CENSO”). Panel (a) shows the distribution of ρ for conformer energies. The average Spearman correlation is 0.69 and the standard deviation is 0.27, indicating good agreement between the two methods. Indeed, the MAE between optimized and single-point relative energies is 0.54 kcal/mol, which is 3.6 times lower than the xTB error (the MAE of the absolute energy, equal to the average energy released after optimization, is 5.74 kcal/mol). Panel (b) shows that the geometries change very little during optimization, with a mean RMSD of only 0.36 Å. This shows that the CREST geometries are quite good, thus validating the quality of the GEOM ensembles. The median RMSD among heavy atoms is 0.25 Å; this is 2.4 times lower than the value of 0.6 Å between MMFF94 and PM7 geometries100 for drug-like molecules47.
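The RMSD between two geometries of the same conformer can be sketched as below. This simplified version assumes that the two structures share the same atom ordering and have already been superimposed; it performs no alignment, and the alignment procedure used for Fig. 4 is not described in this section. The function names are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two pre-aligned lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both geometries must have the same number of atoms")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def heavy_atom_rmsd(symbols, coords_a, coords_b):
    """RMSD restricted to non-hydrogen atoms, given element symbols per atom."""
    keep = [i for i, symbol in enumerate(symbols) if symbol != "H"]
    return rmsd([coords_a[i] for i in keep], [coords_b[i] for i in keep])
```

Restricting the sum to heavy atoms, as in the second function, removes the contribution of hydrogen positions, which corresponds to the heavy-atom RMSD quoted in the text.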

Fig. 4 Comparison of CENSO and single-point DFT calculations. (a) Distribution of Spearman coefficients, measuring the accuracy of single-point ranking for each of the ensembles. (b) RMSDs between CREST geometries and DFT-optimized geometries.

Similar comparisons can be made between CENSO geometries and their most similar CREST counterparts (i.e., the CREST geometry with the lowest RMSD relative to a CENSO geometry). These may not be the same as the CREST geometries used to seed the optimization. We have found that using the most similar geometries does not significantly affect the results; for example, the Spearman coefficient only climbs to 0.72 ± 0.27, while the RMSD only drops to 0.33 ± 0.19 Å. Note also that the comparison of the methods only includes conformers with non-negligible weight after optimization (ΔG ≤ 2.5 kcal/mol), since CENSO discards high-energy conformers during optimization. High-energy conformers were therefore not fully optimized and are not included in the comparison.

Figure 5(a) compares the ordering of geometries with CREST and with CENSO. The Spearman correlation is ρ = 0.43 ± 0.41, which is similar to the correlation between CREST and single-point energies. This result should be interpreted with caution, however, since only the lowest-energy CENSO geometries are included in the comparison, whereas the rank correlation in Fig. 3 includes all CREST conformers. Lastly, Fig. 5(b) compares the ordering of CENSO geometries by energy and by free energy. The correlation is quite high (ρ = 0.85 ± 0.18), and the MAE between energies and free energies is only 0.33 kcal/mol. Hence energies alone can be quite good for ordering conformers by statistical weight. This also means that the statistical weight errors in GEOM are dominated by xTB errors, and that the quasi-harmonic errors are comparatively negligible.

Fig. 5 (a) Comparison of CENSO and CREST calculations. The distribution of Spearman coefficients shows the accuracy of CREST ranking for each of the ensembles. (b) Comparison of energy and free-energy ranking with CENSO. The distribution of Spearman coefficients shows the accuracy of energy ranking for each of the ensembles.

Usage Notes

Researchers are encouraged to use the data-loading tutorials given in https://github.com/learningmatter-mit/geom. We suggest loading the data through the RDKit pickle files, as RDKit mol objects are easy to handle and their properties can be readily analyzed. The MessagePack files, while secure and accessible in all languages, represent graphs through their features rather than objects with built-in methods, and are thus more difficult to analyze. To train 3D-based models we suggest following the tutorial and README file at https://github.com/learningmatter-mit/NeuralForceField.