Approaching coupled cluster accuracy with a general-purpose neural network potential through transfer learning

Computational modeling of chemical and biological systems at atomic resolution is a crucial tool in the chemist's toolset. The use of computer simulations requires a balance between cost and accuracy: quantum-mechanical methods provide high accuracy but are computationally expensive and scale poorly to large systems, while classical force fields are cheap and scalable but lack transferability to new systems. Machine learning can be used to achieve the best of both approaches. Here we train a general-purpose neural network potential (ANI-1ccx) that approaches CCSD(T)/CBS accuracy on benchmarks for reaction thermochemistry, isomerization, and drug-like molecular torsions. This is achieved by training a network to DFT data and then using transfer learning techniques to retrain it on a dataset of gold-standard QM calculations (CCSD(T)/CBS) that optimally spans chemical space. The resulting potential is broadly applicable to materials science, biology, and chemistry, and is billions of times faster than CCSD(T)/CBS calculations.

Another key component of our efficient computational method is the complete basis set (CBS) extrapolation. We have applied the widely used two-point "EP1" 4 extrapolation scheme originally proposed by Hobza and coworkers 5 . The total energy is computed as the sum of the MP2/CBS extrapolated energy and the difference between CCSD(T) and MP2 energies calculated with a smaller basis set. For the sake of computational efficiency, we used the cc-pVTZ and cc-pVQZ basis sets for extrapolation of the HF and MP2 energies using the formulas of Halkier 6 and Helgaker 7 (α_34 = 5.46, β_34 = 3.05) 8 . The difference between CCSD(T) and MP2 energies was estimated with the cc-pVTZ basis set, as shown in Supplementary Equation 1. The most computationally demanding term in Supplementary Equation 1 is the CCSD(T)/cc-pVTZ energy, which needs to be calculated with at least the TightPNO level of accuracy. To reduce the computational cost further, we approximated this term using an idea similar to the CBS extrapolation approach above: the TightPNO and NormalPNO CCSD(T) energies were calculated with a smaller basis set and their difference was added to the NormalPNO-CCSD(T)/cc-pVTZ energy. In summary, the CCSD(T)/cc-pVTZ part of Supplementary Equation 1 was estimated with Supplementary Equation 2.
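As a concrete illustration of the extrapolation step, the sketch below implements the standard two-point formulas with the exponents quoted above (α_34 = 5.46 for the exponential SCF extrapolation, β_34 = 3.05 for the inverse-power correlation extrapolation). Function names are ours, and the exact formula variants in the authors' pipeline may differ in detail.

```python
import math

def scf_cbs(e_x, e_y, x=3, y=4, alpha=5.46):
    """Two-point SCF extrapolation: E(X) = E_CBS + A*exp(-alpha*sqrt(X))."""
    fx = math.exp(-alpha * math.sqrt(x))
    fy = math.exp(-alpha * math.sqrt(y))
    return (e_x * fy - e_y * fx) / (fy - fx)

def corr_cbs(e_x, e_y, x=3, y=4, beta=3.05):
    """Two-point correlation extrapolation: E(X) = E_CBS + B*X**(-beta)."""
    return (x**beta * e_x - y**beta * e_y) / (x**beta - y**beta)

def mp2_cbs(hf_tz, hf_qz, mp2corr_tz, mp2corr_qz):
    """MP2/CBS = extrapolated HF energy + extrapolated MP2 correlation."""
    return scf_cbs(hf_tz, hf_qz) + corr_cbs(mp2corr_tz, mp2corr_qz)
```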
Such a composite scheme for approximating CCSD(T)/CBS energies is computationally efficient, as it requires computation of TightPNO DLPNO-CCSD(T) with at most a double-zeta basis set. The robustness of the suggested CCSD(T)*/CBS scheme is evaluated against two distinct datasets with high-quality CCSD(T)-F12 data available: the S66 benchmark 9 for weak intermolecular interactions and the extensive W4-11 benchmark 10 for thermochemistry. CCSD(T)*/CBS performs excellently on both benchmarks (see Supplementary Table 1).
In Supplementary Table 1, each extrapolation scheme is labeled with the "small" basis set used to calculate the ΔCCSD(T) correction with respect to MP2; the MP2/CBS energy is obtained by extrapolation with basis sets of higher cardinal number. CCSD(T)*/CBS is identical to the TightPNO-CCSD(T)*/CBS(TZ) scheme, which was applied to obtain our reference training dataset. TightPNO-CCSD(T)* denotes that the ΔTightPNO term, the difference between the TightPNO and NormalPNO energies, is computed with a basis set of lower cardinal number. The data in Supplementary Table 1 show that CCSD(T)*/CBS provides an excellent balance between cost and performance, with errors no worse than those of the canonical CCSD(T)/CBS(aDZ) method. It should also be noted that incorporating the ΔTightPNO correction achieves much lower error than NormalPNO alone without sacrificing much computational efficiency.
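Since Supplementary Equations 1 and 2 are not reproduced in this excerpt, the following sketch records how the composite terms combine as described in the prose; all function and variable names are illustrative, not from the authors' code.

```python
def ccsd_t_star_cbs(e_mp2_cbs, e_mp2_tz,
                    e_npno_ccsdt_tz,
                    e_tpno_ccsdt_dz, e_npno_ccsdt_dz):
    """Composite CCSD(T)*/CBS energy, as we read the scheme in the text."""
    # Supplementary Eq. 2 (as described): approximate CCSD(T)/cc-pVTZ from
    # NormalPNO at TZ plus a TightPNO-vs-NormalPNO correction evaluated at DZ.
    e_ccsdt_tz = e_npno_ccsdt_tz + (e_tpno_ccsdt_dz - e_npno_ccsdt_dz)
    # Supplementary Eq. 1 (as described): MP2/CBS plus the CCSD(T) - MP2
    # difference evaluated with the smaller (TZ) basis set.
    return e_mp2_cbs + (e_ccsdt_tz - e_mp2_tz)
```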

S1.2.1 Neural network architecture
The models trained in this work all utilize networks of different sizes for the different atomic species. This is done because the atoms in the ANI-1x dataset are distributed approximately 48% H, 30% C, 13% N, and 9% O. Supplementary Table 2 provides details of each model architecture used in this work.
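As an illustration of this species-resolved design (not the authors' NeuroChem implementation), a minimal PyTorch sketch might look as follows; the layer widths are placeholders, and the actual per-element sizes are those listed in Supplementary Table 2.

```python
import torch
import torch.nn as nn

class ANISketch(nn.Module):
    """One feed-forward network per element; the molecular energy is the
    sum of per-atom energies. Widths and activation are illustrative."""
    def __init__(self, aev_dim=384, hidden=(160, 128, 96)):
        super().__init__()
        def mlp():
            layers, d = [], aev_dim
            for h in hidden:
                layers += [nn.Linear(d, h), nn.CELU(0.1)]
                d = h
            layers.append(nn.Linear(d, 1))  # per-atom energy output
            return nn.Sequential(*layers)
        # In practice each element would get its own (different) sizes.
        self.nets = nn.ModuleDict({el: mlp() for el in "HCNO"})

    def forward(self, species, aevs):
        # species: list of element symbols; aevs: (n_atoms, aev_dim) tensor
        atomic_e = [self.nets[el](aev) for el, aev in zip(species, aevs)]
        return torch.stack(atomic_e).sum()
```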

S1.2.2 Atomic environment vector parameters
The ANI model atomic environment vector (AEV) is computed using the in-house NeuroChem software suite. These AEVs are computed identically to those published in the ANI-1 work. 11 In this work the atomic elements C, H, N, and O are described by AEVs built from the parameters supplied below, yielding a total of 384 AEV elements per atom.
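For orientation, here is a minimal sketch of the radial portion of an ANI-style AEV, following the functional form of the ANI-1 paper: Gaussians centered on radial shifts, damped by a cosine cutoff. The η, R_s, and R_c values below are placeholders for the parameters supplied below, and the angular terms are omitted.

```python
import numpy as np

def cutoff(r, rc):
    """Cosine cutoff: 0.5*cos(pi*r/rc) + 0.5 for r <= rc, else 0."""
    return np.where(r <= rc, 0.5 * np.cos(np.pi * r / rc) + 0.5, 0.0)

def radial_aev(distances, shifts, eta=16.0, rc=4.6):
    """Radial symmetry functions for one atom and one neighbor species.

    distances: (n_neighbors,) distances R_ij to neighbors of one element
    shifts:    (n_shifts,) radial shift parameters R_s
    Returns an AEV sub-vector of length n_shifts.
    """
    r = distances[:, None]                        # (n_neighbors, 1)
    g = np.exp(-eta * (r - shifts[None, :])**2)   # Gaussian at each shift
    return (g * cutoff(r, rc)).sum(axis=0)        # sum over neighbors
```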

S1.2.3 Training details
Prior to training the ANI models, a linear fitting of the energy to the atomic composition is performed; this amounts to an empirical self-energy term for each element. The linear fit is carried out over the entire dataset, using the number of atoms of each element in a given molecule as the input. The ANI models are then trained to the QM calculated energy minus the linear fitted prediction. The energy obtained from this process is roughly analogous to an atomization energy, but without any per-atom bias. The linear fitting parameters (in Hartrees) used in this work are provided below (a minimal sketch of the fit itself follows the parameter listing).
ANI-1x DFT linear fitting parameters:
ANI-1x CCSD(T)*/CBS linear fitting parameters:
ANI-1x DFT to CCSD(T)*/CBS Δ linear fitting parameters:
The following hyperparameters are used during training for all ANI models. These parameters were determined through rigorous hyperparameter searches during prior ANI potential development. 11,12
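A minimal sketch of the self-energy fit described above, assuming an ordinary least-squares setup over element counts (names and shapes are ours):

```python
import numpy as np

def fit_self_energies(counts, energies):
    """counts: (n_mols, n_elements) atom counts per molecule (e.g. H, C, N, O)
    energies: (n_mols,) QM total energies in Hartree.
    Returns the per-element self-energies, shape (n_elements,)."""
    sae, *_ = np.linalg.lstsq(counts, energies, rcond=None)
    return sae

# The network is then trained on the residual
#     target = E_QM - counts @ sae
# which behaves like an atomization energy without per-atom bias.
```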

S1.2.4 Ensemble held out test set results
Supplementary Table 3 provides the 1/8th held-out test set mean absolute error (MAE) and root mean squared error (RMSE) for the DFT-trained ANI-1x and the CCSD(T)*/CBS-trained ANI-1ccx and ANI-1ccx-R single models from the ensemble. Although the ANI-1ccx models were trained to a much smaller dataset (500k points), they fit their reference data as well as the ANI-1x models trained to the larger dataset (5M points). However, the ANI-1ccx-R models, which were trained only to the coupled cluster data with no transfer learning, perform significantly worse on the held-out test set. From this we conclude that transfer learning performs as designed, providing the accuracy of a model trained to 5M data points with only 500k data points.
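The transfer-learning step itself is not detailed in this excerpt. As one hedged illustration of how such a retrain can be set up (not necessarily the authors' exact recipe), the sketch below starts from a hypothetical DFT-trained checkpoint and fine-tunes on the coupled cluster data while freezing the first hidden layer of each per-element network; which layers to hold fixed is a design choice.

```python
import torch

# Reuses the illustrative ANISketch model from the S1.2.1 sketch above.
model = ANISketch()
model.load_state_dict(torch.load("ani1x_dft.pt"))  # hypothetical checkpoint

for net in model.nets.values():
    for p in net[0].parameters():   # first Linear layer of each element net
        p.requires_grad_(False)     # hold these DFT-learned weights fixed

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
# ...then continue training against the CCSD(T)*/CBS residual energies.
```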

S1.3 Active learning molecular torsions
Since molecular dynamics simulations and normal mode perturbations are used to sample the ANI-1x dataset, molecular torsion barriers can be poorly described by the ANI-1x potential. In other words, ANI torsion barriers tend to have higher error than near-equilibrium conformations because of the sampling methods used to generate the training data. To reduce this error and improve barrier sampling, we develop an iterative and entirely ML-driven active learning technique for automatically sampling molecular torsions. We begin with a dataset of SMILES [opensmiles.org] strings; in this work we start with the portion of the ChEMBL 14-16 database comprising molecules with fewer than 20 total atoms. We also include the SMILES strings for the molecules in the Genentech torsion benchmark 17 , since these molecules represent a diverse set of dihedrals in the simplest chemical environment in which they can be found. From the resulting set of SMILES strings we carry out active learning iterations as follows (a minimal sketch of steps 1-5 is given after this list):
1. Randomly select N SMILES from the database.
2. Embed the N molecules in 3D space.
3. Randomly select a rotatable bond and a direction of rotation.
4. Conduct a relaxed scan every 10 degrees (36 points) of each selected torsion using the current ANI model.
5. Carry out an ensemble disagreement test on the 36 * N generated molecular conformations (see our work on active learning for details; a selection criterion of ρ̂ = 0.2 kcal*mol^-1 is used in this work). 12
6. For all M conformations that fail the ensemble disagreement test, compute ANI normal modes.
7. Randomly perturb the M conformations along the normal modes to a maximum distance of 0.2 Å along each mode, generating 4 normal mode sampled (NMS) points for each of the M conformations.
8. Generate DFT energies and forces for all NMS points.
9. Add the resulting data to the training dataset and retrain the ANI model.
10. Return to step 1 and iterate.
We complete 20 iterations of the above scheme, generating 202k extra DFT data points. We subsample 19k points from this generated dataset and recompute them with our CCSD(T)*/CBS extrapolation scheme. Supplementary Figure 1 compares ANI-1ccx and ANI-1x with and without the active-learning-generated dihedral corrections.
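To make the loop concrete, here is a minimal sketch of steps 1-5 using RDKit for embedding and torsion manipulation. The ANI ensemble is stubbed out with a placeholder, the relaxed-scan optimization is elided, and the normalized-deviation form of the disagreement test is our reading of the cited active learning work, not a verbatim reproduction.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolTransforms

# Standard SMARTS for a non-ring, non-terminal rotatable single bond.
ROT_BOND = Chem.MolFromSmarts("[!$(*#*)&!D1]-&!@[!$(*#*)&!D1]")

def ensemble_energies(mol):
    """Placeholder: return per-model energies (kcal/mol) from an ANI ensemble."""
    raise NotImplementedError("plug in an ensemble of ANI models here")

def disagreement(energies, n_atoms, rho_hat=0.2):
    """Flag a conformation if the size-normalized ensemble standard
    deviation exceeds rho_hat (here 0.2 kcal*mol^-1, as in the text)."""
    return np.std(energies) / np.sqrt(n_atoms) > rho_hat

def scan_torsion(smiles, step=10):
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMolecule(mol, randomSeed=0)          # step 2: 3D embedding
    j, k = mol.GetSubstructMatches(ROT_BOND)[0]       # step 3: pick a bond
    i = [a.GetIdx() for a in mol.GetAtomWithIdx(j).GetNeighbors()
         if a.GetIdx() != k][0]
    l = [a.GetIdx() for a in mol.GetAtomWithIdx(k).GetNeighbors()
         if a.GetIdx() != j][0]
    conf, failed = mol.GetConformer(), []
    for angle in range(0, 360, step):                 # step 4: 36 points
        rdMolTransforms.SetDihedralDeg(conf, i, j, k, l, float(angle))
        # (a relaxed scan would re-optimize the other degrees of freedom here)
        if disagreement(ensemble_energies(mol), mol.GetNumAtoms()):
            failed.append(Chem.Mol(mol))              # step 5: send to NMS/DFT
    return failed
```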

Supplementary Tables
Supplementary Table 1: Computational efficiency for representative small molecules, and performance of the CCSD(T)/CBS approximation (CCSD(T)*/CBS) evaluated against CCSD(T)-F12 reference data and compared with other composite schemes.

Method | CPU-core hours^a | MAE / RMSD, kcal*mol^-1

Caption fragment: "... Figure 3 in the main article. One reaction involving the atomic element fluorine (F) was left out since the ANI potential in this work is only fit to CHNO."