## Background & Summary

Accurate descriptions of atomic and intermolecular interactions are a cornerstone of reliable computer simulations in biophysics, chemistry, and materials science. Over the past 50 years we have seen tremendous progress in the development of theoretical methods and software tools aiming to describe more complex systems and allow for longer time scales. Kohn-Sham density-functional theory (KS-DFT or DFT for short) has become by far the most popular electronic structure method in computational physics and chemistry1. DFT has found applications in many systems in organic chemistry2,3, biology4, catalysis3,5 and solid state chemistry6,7. It is also frequently combined with molecular dynamics (ab initio molecular dynamics, AIMD) and with classical force fields (quantum mechanics-molecular mechanics, QM-MM) to describe chemical reactions in extended systems.

Although DFT calculations have become affordable on modern supercomputers, we face a dilemma: standard computational algorithms representing an N-electron system require $O(N^2)$ storage and $O(N^3)$ arithmetic operations. This $O(N^3)$ complexity has become a critical bottleneck that limits the ability to study larger, realistic physical systems and the longer time scales relevant to actual experiments. Consequently, a lot of progress has been made in the development of atomistic potentials using machine learning (ML)8,9. The low numerical complexity and high accuracy of machine learning algorithms make them very attractive as a pragmatic substitute for ab-initio and DFT methods. Thanks to their remarkable ability to find complex relationships among data, in many cases these ‘machine learned’ models outperform more physically sound approximations (like force fields) while also reducing the computational time required for a given application9–15. These models are heavily dependent on the quality and quantity of data used in their fitting, also called training. Neural networks are highly efficient and effective at modeling reference training data, due to their flexible functional form. However, this flexibility comes at a cost: a vast amount of reference data is required to properly train these models.

The Chemical Space Project16 computationally enumerated all possible organic molecules up to a certain size, resulting in the creation of the GDB databases. Their latest GDB-17 database17 contains 166.4 billion molecules of up to 17 atoms of C, N, O, S, and halogens. All molecules follow the valency rules and are filtered for unstable substructures, non-synthesizable and strained topologies. GDB molecules are stored as SMILES [www.opensmiles.org] strings representing the composition and connectivity of a molecule.

The GDB databases were fundamental in creating the QM7 dataset18, one of the first benchmark datasets for training atomistic ML potentials. The QM7 dataset consists of 7,165 energy minimized (equilibrium) molecules calculated with the PBE0 functional. All structures are a small subset of GDB-13 (older GDB database of nearly 1 billion organic molecules) composed of molecules with up to 7 heavy atoms C, N, O, and S. Later, QM7 was extended to include 13 additional properties, like frontier molecular orbital energies, dipole moments, polarizability, and excitation energies19. The first ML model trained on QM7 used kernel ridge regression with the Coulomb matrix representation, which predicted atomization energies with a mean absolute error (MAE) of 9.9 kcal×mol−1. This error was quickly reduced to 3.3 kcal×mol−1 (ref. 20) and eventually was under 1 kcal×mol−1 (ref. 21).

QM9 is perhaps the most well-known benchmark dataset17,22. It consists of 133,885 equilibrium organic molecules containing up to nine heavy atoms (C, O, N, F) from the GDB-17 database. In addition to energy minima, it reports corresponding harmonic frequencies, dipole moments, polarizabilities, along with energies, enthalpies, and free energies of atomization. All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry. A subset of 6,095 constitutional isomers in QM9 corresponding to the molecular formula C7H10O2 was also calculated at the more accurate G4MP2 level of theory. Various molecular representations and ML methods were benchmarked against the QM9 dataset20,21,23,24. See also a recent survey of methods23. Later, a Message Passing Neural Network (MPNN)10 achieved chemical accuracy in 11 out of 13 target properties in the QM9 dataset. Finally, the hierarchical interacting particle neural network (HIP-NN)15 model of Lubbers et al. achieved state-of-the-art accuracy of just 0.26 kcal×mol−1 MAE on total energy prediction.

A common feature of all QMx datasets is that they only explore chemical degrees of freedom by providing information about energy minimized (equilibrium) molecular configurations. In these molecules, the forces on all atoms are equal to zero. Therefore, considerable efforts were undertaken to produce off-equilibrium datasets using ab initio molecular dynamics (AIMD) simulations. The C7O2H10–17 dataset includes energies from AIMD trajectories of 113 isomers of C7O2H10 (5 k frames each). All simulations used the DFT/PBE level of theory and were carried out at 500 K. Very recently Schütt et al.21 and Chmiela et al.25 released the MD17 dataset, a collection of eight AIMD/PBE+vdW-TS simulations for small organic molecules. Each of these consists of an MD trajectory for a single molecule extending from ~100 k to 900 k frames. In contrast to the QMx datasets, these MD datasets explore conformational space while keeping composition fixed.

We recently introduced a neural network potential (NNP) called ANI-1, the first NNP for organic molecules shown to transfer to molecular systems well outside of its training set. As presented, the ANI-1 potential was trained on a data set that spans both conformational and configurational space, built from small organic molecules of up to 8 heavy atoms. We showed its applicability to much larger systems, up to 50 atoms, including well known drug molecules and a random selection of molecules from the GDB-11 (refs 26,27) database with 10 heavy atoms. ANI-1 shows exceptional predictive power on the 10-heavy-atom test set, with RMSE versus DFT relative energies as low as 0.57 kcal×mol−1 when only considering molecular conformations that are within 30 kcal×mol−1 of the energy minimum for each molecule. More recently, Gastegger et al.28 showed similar results for large organic systems that were fragmented into smaller molecules, with DFT data generated on the fly for training. This was done in an active-learning fashion where the goal is to train the potential to a specific system during an MD simulation. Shortly after, Huang and von Lilienfeld29 used a fragmentation scheme for training an ML model to predict energies of large rigid drug molecules. Both studies back up the argument that information about the physics of large systems can be learned from data sets of small molecules.

In this data descriptor, we report a large dataset of non-equilibrium DFT total energy calculations for organic molecules. In total, we provide access to the total energies of ~20 M molecular conformations for 57,462 molecules from the GDB database26,27, which samples both chemical and conformational degrees of freedom at the same time. Now that modern ML methods have reached an accuracy of 1 kcal×mol−1 for equilibrium molecules on the QM9 benchmark, the ANI-1 data set provides 100x more data and a much more challenging learning task. Therefore, we expect it will become a new standard benchmark of comparison for current and future methods in the machine learned potential community. More importantly, it is a sound foundation for the development of future general-purpose machine learned potentials, providing an exhaustive head start on data generation, which can be augmented with future data sets covering relevant regions of chemical space.

## Methods

### QM calculations

All electronic structure calculations are carried out with the ωB97x (ref. 30) density functional and the 6-31G(d) basis set31 in the Gaussian 09 (ref. 32) electronic structure package. ωB97x is a hybrid-meta GGA functional30, which has been shown to be chemically accurate compared to high-level CCSD(T) calculations33–37.

### Molecular geometry generation

The GDB-11 database26,27 provides an exhaustive search of stable and chemically viable molecules, supplied in the SMILES [www.opensmiles.org] string format, containing C, N, O, and F atoms with up to 11 of these ‘heavy’ atoms. Hydrogen atoms are added through the RDKit cheminformatics software package [www.rdkit.org] to make molecular structures that are charge neutral and have a singlet electronic ground state. The ANI-1 data set presented here is built from an exhaustive sampling of a subset of the GDB-11 database containing molecules with between 1 and 8 heavy atoms and limiting the atomic species to C, N, and O. This leaves a subset of 57,947 starting molecules. All molecules are neutral and with a singlet electronic ground state. The conformation generation process is carried out in five steps starting with these 57,947 molecules. The steps are listed below and qualitatively depicted in Fig. 1.

SMILES strings from the GDB-11 subset described above are used to generate 3D conformations using RDKit. Also with RDKit, all structures are saturated with hydrogens such that each has charge 0 and multiplicity 1. The 3D structures are then pre-optimized to a stationary point using the MMFF94 force field38 as implemented in RDKit.

At the chosen DFT or ab-initio level of theory, geometries are optimized until they converge to an energy minimum. Optimization is carried out using Gaussian 09’s default method and convergence criteria. The resulting geometries correspond to the first stationary point reached on the potential surface, which is a local minimum or, in rare cases, a saddle point. If convergence fails, the structure is not included in the data set. At this step, 485 (0.84% of total) molecules failed to converge during the structural optimization. The final data set is built from the remaining 57,462 equilibrium geometries. Finally, for each of the 57,462 structurally optimized molecules, a normal mode calculation is performed in the Gaussian 09 package to obtain normal mode coordinates and their associated force constants. This is accomplished using the UltraFine DFT grid option with the ωB97x density functional.

### Normal mode sampling (NMS)

To carry out normal mode sampling on an energy minimized molecule of $N_a$ atoms, first a set of $N_f$ normal mode coordinates, $Q=\{q_1, q_2, q_3, \dots, q_{N_f}\}$, is computed at the desired ab-initio level of theory, where $N_f=3N_a-5$ for linear molecules and $N_f=3N_a-6$ for all others. The corresponding force constants, $K=\{K_1, K_2, K_3, \dots, K_{N_f}\}$, are obtained alongside $Q$. Then a set of $N_f$ uniformly distributed pseudo-random numbers, $c_i$, are generated such that $\sum_{i}^{N_f} c_i$ is in the range [0,1]. Next, a displacement, $R_i$, for each normal mode coordinate is computed by setting a harmonic potential equal to the $c_i$ scaled average energy of the system of particles at some temperature, $T$. Solving for the displacement gives,

$$R_i=\pm\sqrt{\frac{3c_iN_ak_bT}{K_i}}\qquad\text{(1)}$$

where $k_b$ is Boltzmann’s constant. The sign of $R_i$ is determined randomly from a Bernoulli distribution where $P=0.5$ to ensure that both sides of the harmonic potential are sampled equally. Each $R_i$ is used to scale the normalized normal mode coordinates by $q_i^R=R_i q_i$. Next, a new conformation of the molecule is generated by displacing the structurally optimized coordinates by $Q^R$, the superposition of all $q_i^R$. Finally, a single point energy at the desired level of theory is calculated using the newly displaced coordinates as input.
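The sampling step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the way the $c_i$ are rescaled so that their sum falls in [0, 1] is one possible choice (the text does not say how the constraint is enforced), and the unit system (Hartree and Bohr, with $k_b$ in Hartree/K) is assumed.

```python
import math
import random

# Boltzmann constant in Hartree per kelvin (assumed unit system: Hartree, Bohr).
K_B = 3.166811563e-6


def sample_scaling_factors(n_modes, rng):
    """Return n_modes non-negative c_i whose sum lies in [0, 1].

    Assumption: uniform draws are rescaled to a random total in [0, 1);
    the text does not specify how the constraint is enforced.
    """
    raw = [rng.random() for _ in range(n_modes)]
    total = sum(raw)
    if total == 0.0:
        return [0.0] * n_modes
    budget = rng.random()  # target value of sum(c_i)
    return [budget * r / total for r in raw]


def mode_amplitudes(c, force_constants, n_atoms, temperature, rng):
    """Evaluate Eq. (1): R_i = +/- sqrt(3 c_i N_a k_b T / K_i)."""
    amplitudes = []
    for c_i, k_i in zip(c, force_constants):
        r_i = math.sqrt(3.0 * c_i * n_atoms * K_B * temperature / k_i)
        sign = 1.0 if rng.random() < 0.5 else -1.0  # Bernoulli draw, P = 0.5
        amplitudes.append(sign * r_i)
    return amplitudes


def displace(coords, modes, amplitudes):
    """Add the superposition of R_i * q_i to the optimized coordinates.

    coords: one (x, y, z) per atom. modes: for each normal mode, one
    (x, y, z) vector per atom, assumed to be normalized.
    """
    new_coords = [list(xyz) for xyz in coords]
    for r_i, q_i in zip(amplitudes, modes):
        for atom, vec in enumerate(q_i):
            for axis in range(3):
                new_coords[atom][axis] += r_i * vec[axis]
    return new_coords
```

Passing a seeded `random.Random` instance as `rng` keeps a sampling run reproducible. Each displaced geometry would then be sent to the electronic structure code for a single point energy.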

N data points (new conformations) are generated, representing a window of the potential surface. N is calculated as S×K, where S is an empirically chosen value (see Table 1) based on the number of heavy atoms in each molecule and K is the number of degrees of freedom of the molecule (not to be confused with the force constants introduced above). The total energy, atomic symbols, and cartesian coordinates of each structure are stored as described in the File format section.
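As a small worked example of the N = S×K rule, the sketch below assumes that the number of degrees of freedom equals the number of normal modes (3Na−6, or 3Na−5 for linear molecules). The function names are hypothetical, and the S value passed in is a placeholder, since the Table 1 values are not reproduced here.

```python
def vibrational_dof(n_atoms, linear=False):
    """Number of normal modes: 3*Na - 5 for linear molecules, 3*Na - 6 otherwise."""
    return 3 * n_atoms - (5 if linear else 6)


def n_conformations(s_value, n_atoms, linear=False):
    """N = S * K, with K assumed equal to the number of normal modes."""
    return s_value * vibrational_dof(n_atoms, linear)


# Placeholder S = 10 (not a value from Table 1): a bent triatomic has
# 3 normal modes, so 30 conformations would be generated.
print(n_conformations(10, 3))
```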

## Data Records

The data set is provided in an HDF5 based file in a Figshare data repository (Data Citation 1). A GitHub repository containing a README file with technical usage details and examples of how to access the data set is supplied online (https://github.com/isayev/ANI1_dataset).

### File format

Data is stored per molecule as described in Fig. 2. Data for each molecule is stored in a Python dict containing all conformer data. The keys shown in Fig. 2 (coordinates, energies, and species) give access to containers of the type shown, holding the data described by the key. Species is a Python list of strings containing the atomic symbol of each atom, and its order corresponds to dimension 1 of the coordinates numpy array. Appending ‘HE’ to the end of the coordinates and energies keys will yield high energy structures as described in the Technical Validation section.
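The layout described above can be illustrated with a mock record. This is a sketch only: nested lists stand in for the numpy arrays, every number is a placeholder rather than a dataset value, and `record_shape` is a hypothetical helper, not part of the library supplied with the data set.

```python
# Mock per-molecule record using the keys described above.
# All coordinates and energies are placeholders, not dataset values.
record = {
    "species": ["O", "H", "H"],
    "coordinates": [  # shape (n_conformers, n_atoms, 3)
        [[0.00, 0.00, 0.12], [0.00, 0.76, -0.47], [0.00, -0.76, -0.47]],
        [[0.00, 0.00, 0.10], [0.00, 0.80, -0.45], [0.00, -0.72, -0.50]],
    ],
    "energies": [-76.0, -75.9],  # one value per conformer
    "coordinatesHE": [  # high energy structures use the 'HE' suffix
        [[0.00, 0.00, 0.30], [0.00, 0.40, -0.20], [0.00, -0.40, -0.20]],
    ],
    "energiesHE": [-75.5],
}


def record_shape(record, suffix=""):
    """Check internal consistency and return (n_conformers, n_atoms).

    suffix="" selects the low energy data, suffix="HE" the high energy data.
    """
    coords = record["coordinates" + suffix]
    energies = record["energies" + suffix]
    n_atoms = len(record["species"])
    if len(coords) != len(energies):
        raise ValueError("one energy is expected per conformer")
    for conformer in coords:
        # Dimension 1 of the coordinates must follow the species order.
        if len(conformer) != n_atoms or any(len(xyz) != 3 for xyz in conformer):
            raise ValueError("each conformer needs one (x, y, z) per atom")
    return len(coords), n_atoms


print(record_shape(record))        # low energy conformers
print(record_shape(record, "HE"))  # high energy conformers
```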

## Technical Validation

Since normal mode sampling is used to generate the non-equilibrium structures, high-energy conformers exist in the data set. These high energy conformations occur where the harmonic approximation of normal modes fails in anharmonic regions of a potential, and are caused by atomic clashes or other highly unfavorable molecular conformations. The distribution shown in Fig. 3b visualizes the energies in the dataset, which contains structures with energies as high as 15 Ha. For this reason, structures with energies more than 275 kcal×mol−1 above the lowest energy conformer were not included in the training set of the ANI-1 potential. This removed 2,630,435 structures (10.7% of the original total), leaving 22,057,374 structures. Regions this high in energy are generally not considered in biochemical research. However, this data might be useful for some purposes. Therefore, we include both the high-energy and the low energy datasets as described in the Data Records section. Figure 3c shows the new distribution of energies, which are never larger than 0 Ha in total energy minus the sum of atomic contributions to the total energy.
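The 275 kcal×mol−1 cutoff can be expressed as a short filter. The sketch below is illustrative rather than the authors' code: it assumes the energies of one molecule are given in Hartree, uses the standard conversion factor of about 627.509 kcal×mol−1 per Hartree, and the function name is hypothetical.

```python
HARTREE_TO_KCAL = 627.509474  # kcal/mol per Hartree


def split_by_energy(energies_ha, cutoff_kcal=275.0):
    """Split the conformer indices of one molecule into (low, high) lists.

    A conformer is 'high' when its energy lies more than cutoff_kcal
    above the lowest energy conformer of the same molecule.
    """
    e_min = min(energies_ha)
    low, high = [], []
    for index, energy in enumerate(energies_ha):
        relative = (energy - e_min) * HARTREE_TO_KCAL
        (high if relative > cutoff_kcal else low).append(index)
    return low, high


# Placeholder energies in Hartree: the third conformer lies about
# 314 kcal/mol above the minimum and is therefore classed as high energy.
low, high = split_by_energy([-76.0, -75.9, -75.5])
print(low, high)
```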

During the structural optimization phase, we do not distinguish between optimized structures that might land at a saddle point in the potential surface and those that land at a structural minimum. Given the goal of sampling conformational space, the fact that some structures might land at off-equilibrium geometries (saddle points) could in fact help in using this data to fit potential surfaces, as it will help to cover regions of conformational space not covered by equilibrium molecule normal mode sampling. However, if the optimization failed to converge to a stationary point, as it did for 485 molecules, then these structures were not included in the training set, as the validity of their configuration could not immediately be confirmed. Given the vast number of structures in the data set, it is likely that any interaction found in these 485 molecules can be found elsewhere in the data set.

A similar process of not including information for unconverged calculations is used in the generation of the total energies. For certain highly elongated bonds the molecular orbital optimization process, the self-consistent field procedure used in obtaining the total energy of the conformation, can fail to converge to a solution when two orbitals are too close in energy. For this reason, if a structure’s single point energy calculation failed to converge, then this data is not included in the data set.

The primary concept of including non-equilibrium data is to sample regions of chemical space that would be sparsely covered in equilibrium only data sets. Figure 3a provides validation of energy sampling by showing the distribution of total energies divided by the total number of electrons for each molecule in the GDB subsets from 4 to 8-heavy atoms. Figure 3b,c show the distribution of total energies minus the sum of all individual atomic energies (tabulated in Supplementary Information Table 1) for the full and ‘low energy’ (less than 275 kcal×mol−1 from the minimum energy) data sets, respectively.

Further validation of the non-equilibrium sampling is to show that the data set covers a large domain of the chemical degrees of freedom in conformational space. Figure 4 contains five panels representing the distribution of atomic distances in the resulting non-equilibrium data set (blue line) compared with a data set of equilibrium only conformations (red) of the same molecule. As expected, the normal mode sampling method used to generate non-equilibrium conformations visits areas of conformational space not covered by equilibrium only data. A similar plot, Supplementary Information Fig. 1, shows distance distributions for the remaining atomic pairs. Figure 5 shows distributions involving the angles in the data sets, and tells a similar story in terms of coverage in conformational space for three body interactions. The blue background density plot shows that the ANI-1 data set covers far more angular space than the equilibrium data sets (red and orange). The remaining plots are included in the Supplementary Information Figs 2–4.

## Usage Notes

To ensure that all readers have easy access to the ANI-1 data set, we have developed a Python library with an easy-to-use interface for extracting the data. Example uses of this library are included in the ‘readers’ folder.