VIB5 database with accurate ab initio quantum chemical molecular potential energy surfaces

High-level ab initio quantum chemical (QC) molecular potential energy surfaces (PESs) are crucial for accurately simulating molecular rotation-vibration spectra. Machine learning (ML) can help alleviate the cost of constructing such PESs, but requires access to the original ab initio PES data, namely potential energies computed on high-density grids of nuclear geometries. In this work, we present a new structured PES database called VIB5, which contains high-quality ab initio data on 5 small polyatomic molecules of astrophysical significance (CH3Cl, CH4, SiH4, CH3F, and NaOH). The VIB5 database is based on previously used PESs, which, however, are either publicly unavailable or lacking key information to make them suitable for ML applications. The VIB5 database provides tens of thousands of grid points for each molecule with theoretical best estimates of potential energies along with their constituent energy correction terms and a data-extraction script. In addition, new complementary QC calculations of energies and energy gradients have been performed to provide a consistent database, which, e.g., can be used for gradient-based ML methods.


Background & Summary
Many physical and chemical processes of molecular systems are governed by potential energy surfaces (PESs), which give the potential energy as a function of the molecular geometry defined by the nuclear positions 1 . Accurate ab initio quantum chemical (QC) molecular PESs are essential for predicting and understanding a multitude of physicochemical properties of interest, such as reaction thermodynamics, kinetics 2 , and rovibrational spectra [3][4][5] . For the latter, PESs of a number of molecules have been constructed and used in variational nuclear motion calculations to provide accurate rotation-vibration-electronic line lists that aid the characterization of exoplanet atmospheres, amongst other applications [6][7][8][9][10][11][12][13][14][15][16] .
Simulating rotation-vibration (rovibrational) spectra that approach the coveted spectroscopic accuracy of 1 cm^-1 over a broad range of temperatures requires a global PES covering all relevant regions of nuclear configurations. This can be achieved by defining the PES on a high-density grid of nuclear geometries with no holes, with the theoretical best estimate (TBE) of the energy at each point computed at a very high QC level of theory. The construction of an optimal grid usually involves many steps and human intervention, and often requires a staggeringly large number of grid points, e.g., ca. 100 thousand points even for a five-atom molecule such as methane 10 . The choice of QC level for TBE calculations is determined by the trade-off between accuracy and computational cost, but typically requires going well beyond the gold-standard [17][18][19] CCSD(T) 17 /CBS (coupled cluster with single and double excitations and a perturbative treatment of triple excitations/complete basis set) limit and adding many QC corrections on top of it. To put this in perspective, ca. 24 central processing unit (CPU) hours are required to calculate the TBE energy at each of the ~45 thousand methyl chloride (CH3Cl) grid points, amounting to over 100 CPU-years for the construction of its highly accurate ab initio PES 20 .
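As a rough check of this cost estimate, the product of the per-point cost and the number of grid points can be written out explicitly. The sketch below assumes the 44819 CH3Cl grid points quoted in the Methods section and a year of 8760 hours (365 days); both the rounding and the year length are simplifying assumptions.

```latex
\[
44819 \times 24\ \text{CPU-hours}
\approx 1.08 \times 10^{6}\ \text{CPU-hours}
\approx \frac{1.08 \times 10^{6}}{8760}\ \text{CPU-years}
\approx 123\ \text{CPU-years}
\]
```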
To reduce the high computational cost, machine learning (ML) has emerged as a powerful approach for constructing full-dimensional PESs 21-27 and the resulting ML PESs can be used 22,24,[28][29][30][31][32][33][34][35] for performing vibrational calculations. In particular, substantial cost reduction can be achieved by calculating TBE energies only for a small number of existing grid points and then interpolating between them with ML 36 ; such ML grids can be subsequently used for simulating rovibrational spectra with a relatively small loss of accuracy. Importantly,

Methods
Grid points generation. For each molecule, we take the grid points directly from previous studies by some of the authors. For the sake of completeness, we only briefly describe here how these grid points were generated, and refer the reader to the original publications cited for each molecule for further details (see Table 1).
CH3Cl. 44819 grid points for CH3Cl were taken from refs. 7,9,20 . A Monte Carlo random energy-weighted sampling algorithm was applied to nine internal coordinates of CH3Cl: the C-Cl bond length r_0; three C-H bond lengths r_1, r_2, and r_3; three ∠(H_iCCl) interbond angles β_1, β_2, and β_3; and two dihedral angles τ_12 and τ_13 between adjacent planes containing H_iCCl and H_jCCl (Fig. 1a). This procedure led to geometries in the range 1.3 ≤ r_0 ≤ 2.95 Å, 0.7 ≤ r_i ≤ 2.45 Å, 65 ≤ β_i ≤ 165° for i = 1, 2, 3 and 55 ≤ τ_jk ≤ 185° with jk = 12, 13. The grid also includes 1000 carefully chosen low-energy points to ensure an adequate description of the equilibrium region.
CH4. 97271 grid points for CH4 were taken from ref. 10 . The global grid was built in the same fashion as the grid for CH3Cl. Nine internal coordinates of CH4 are defined as follows: four C-H bond lengths r_1, r_2, r_3, and r_4; five ∠(H_j-C-H_k) interbond angles α_12, α_13, α_14, α_23, and α_24, where j and k label the respective hydrogen atoms (Fig. 1b). The grid points are in the range 0.71 ≤ r_i ≤ 2.60 Å for i = 1, 2, 3, 4 and 40 ≤ α_jk ≤ 140° with jk = 12, 13, 14, 23, 24.
NaOH. 15901 grid points for NaOH were taken from ref. 14 . Grid points were generated randomly with a dense distribution around the equilibrium region. Three internal coordinates of NaOH are defined as follows: the Na-O bond length r_NaO, the O-H bond length r_OH, and the interbond angle ∠(NaOH) (Fig. 1e). This procedure led to geometries in the range 1.435 ≤ r_NaO ≤ 4.400 Å, 0.690 ≤ r_OH ≤ 1.680 Å, and 40 ≤ ∠(NaOH) ≤ 180°.
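The coordinate ranges quoted for each grid lend themselves to a simple sanity check when reusing the data. The following is a minimal sketch for NaOH, assuming each grid point is available as a mapping from coordinate name to value; the names `r_NaO`, `r_OH`, and `angle_NaOH`, the function `in_grid_ranges`, and the small tolerance are illustrative choices, not part of the database.

```python
# Ranges quoted in the text for the NaOH grid: bond lengths in Å, angle in degrees.
NAOH_RANGES = {
    "r_NaO": (1.435, 4.400),
    "r_OH": (0.690, 1.680),
    "angle_NaOH": (40.0, 180.0),
}


def in_grid_ranges(point, ranges=NAOH_RANGES, tol=1e-9):
    """Return True if every coordinate of `point` lies inside its quoted range.

    `point` is a dict keyed by the (hypothetical) coordinate names above;
    `tol` absorbs floating-point noise at the range boundaries.
    """
    return all(
        low - tol <= point[name] <= high + tol
        for name, (low, high) in ranges.items()
    )


# Illustrative geometries only, not taken from the database.
inside = {"r_NaO": 1.95, "r_OH": 0.95, "angle_NaOH": 180.0}
outside = {"r_NaO": 5.00, "r_OH": 0.95, "angle_NaOH": 180.0}
```

The same function can be reused for the other molecules by passing a different `ranges` dictionary.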
Theoretical best estimates and constituent terms. For each molecule, we take the TBEs and energy corrections directly from previous studies by some of us. Here we only briefly describe how these calculations were performed, and refer the reader to the original publications cited for each molecule for details (see Table 1). The TBE is obtained as the sum of several constituent terms: E_CBS, ΔE_CV, ΔE_HO, ΔE_SR, and, for most molecules, ΔE_DBOC. E_CBS denotes the energy at the complete basis set (CBS) limit. ΔE_CV is the core-valence (CV) electron correlation energy correction. ΔE_HO is the energy correction due to higher-order (HO) coupled cluster terms, and ΔE_SR accounts for scalar relativistic (SR) effects. ΔE_DBOC denotes the diagonal Born-Oppenheimer correction; it was calculated for CH3Cl, CH4, CH3F, and NaOH, but not for SiH4 because it has little effect on the vibrational energy levels of this molecule.
The constituent terms were not calculated at the same level of theory across all molecules in the data set. The computational details of the five TBE constituent terms (E_CBS, ΔE_CV, ΔE_HO, ΔE_SR, and ΔE_DBOC) for the 5 molecules are given below and summarized in Table 2.
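The composition of the TBE from its constituent terms can be sketched in a few lines. This is a minimal illustration, assuming the terms are simply added and that the DBOC term is omitted when it is not available (as for SiH4); the function name `tbe_energy` and the numerical values are hypothetical and not taken from the database.

```python
import math


def tbe_energy(e_cbs, de_cv, de_ho, de_sr, de_dboc=None):
    """Sum the constituent terms of the theoretical best estimate (TBE).

    All terms must be in the same energy unit. `de_dboc` may be None,
    in which case the DBOC term is left out (as described for SiH4).
    """
    total = e_cbs + de_cv + de_ho + de_sr
    if de_dboc is not None:
        total += de_dboc
    return total


# Made-up numbers purely to show the bookkeeping.
with_dboc = tbe_energy(-40.0, -0.05, -0.001, -0.01, de_dboc=0.002)
without_dboc = tbe_energy(-40.0, -0.05, -0.001, -0.01)
```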
E_CBS. To extrapolate the energy to the CBS limit, the parameterized two-point formula 48 was used, with the CCSD(T)-F12b method 49 and the two basis sets cc-pVTZ-F12 and cc-pVQZ-F12 50 . The frozen core approximation was adopted, and the diagonal fixed amplitude ansatz 3C(FIX) 51 was employed with a Slater geminal exponent value 48 of β = 1.0 a_0^-1. As for the auxiliary basis sets (ABS), the resolution of the identity OptRI 52 basis and the cc-pV5Z/JKFIT 53 and aug-cc-pwCV5Z/MP2FIT 54 basis sets for density fitting were used for all 5 molecules. These calculations were carried out with either MOLPRO2012 55 (CH3Cl, CH4, SiH4, CH3F) or MOLPRO2015 55,56 (NaOH). The coefficients F^C_{n+1} in the two-point formula were F^CCSD-F12b = 1.363388 and F^(T) = 1.769474 48 for all molecules. The extrapolation was not applied to the Hartree-Fock (HF) energy; instead, the HF + CABS (complementary auxiliary basis set) singles correction 49 calculated with the cc-pVQZ-F12 basis set was used.
ΔE_CV. ΔE_CV was computed at CCSD(T)-F12b/cc-pCVQZ-F12 57 for CH3Cl and at CCSD(T)-F12b/cc-pCVTZ-F12 57 for the other 4 molecules (CH4, SiH4, CH3F, NaOH). The same ansatz and ABS as for E_CBS were employed, but the Slater geminal exponent value was changed to β = 1.5 a_0^-1 for CH3Cl and β = 1.4 a_0^-1 for the other 4 molecules. For this term, all-electron calculations were adopted, but with the 1s orbital of Cl frozen for CH3Cl, the 1s orbital of Si frozen for SiH4, and the 1s orbital of Na frozen for NaOH. No orbitals were frozen in the all-electron calculations for CH4 and CH3F. The software was the same as for E_CBS above.
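The two-point CBS extrapolation described for E_CBS can be illustrated as follows. The formula itself is not reproduced in this text, so the sketch assumes the commonly quoted form E_CBS = (E_{n+1} - E_n) F_{n+1} + E_n, applied separately to the CCSD-F12b correlation energy and the (T) contribution with n = cc-pVTZ-F12 and n+1 = cc-pVQZ-F12, and the unextrapolated HF + CABS singles energy added on top. The function names and all input energies are hypothetical; only the two coefficients come from the text.

```python
import math

# Extrapolation coefficients quoted in the text (used for all molecules).
F_CCSD_F12B = 1.363388
F_T = 1.769474


def two_point_cbs(e_n, e_n_plus_1, coefficient):
    """Assumed two-point form: E_CBS = (E_{n+1} - E_n) * F_{n+1} + E_n."""
    return (e_n_plus_1 - e_n) * coefficient + e_n


def cbs_total(hf_cabs_qz, ccsd_tz, ccsd_qz, t_tz, t_qz):
    """Combine the unextrapolated HF + CABS energy (QZ basis) with the
    separately extrapolated CCSD-F12b and (T) correlation contributions."""
    return (
        hf_cabs_qz
        + two_point_cbs(ccsd_tz, ccsd_qz, F_CCSD_F12B)
        + two_point_cbs(t_tz, t_qz, F_T)
    )


# Made-up correlation energies, purely illustrative.
ccsd_cbs = two_point_cbs(-0.30, -0.31, F_CCSD_F12B)
total = cbs_total(-40.0, -0.30, -0.31, -0.010, -0.011)
```

Keeping the HF part outside the extrapolation mirrors the statement above that the HF + CABS singles energy from the larger basis set is used directly.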
ΔE_SR. ΔE_SR was calculated using either the one-electron mass velocity and Darwin (MVD1) terms from the Breit-Pauli Hamiltonian in first-order perturbation theory 67 or the second-order Douglas-Kroll-Hess approach 68,69 . The former method was used for CH3Cl and the latter for the other 4 molecules (CH4, SiH4, CH3F, and NaOH). All-electron calculations (except for the 1s orbital of Cl) were adopted for CH3Cl, while the frozen core approximation was employed for the other 4 molecules. Calculations were performed at CCSD(T)/aug-cc-pCVTZ(+d for Cl) 70,71 using the MVD1 approach 72 implemented in CFOUR for CH3Cl, and at CCSD(T)/cc-pVQZ-DK 73 using MOLPRO (same versions as for E_CBS above) for the other 4 molecules.
ΔE_DBOC. ΔE_DBOC was computed using the CCSD method 74 as implemented in CFOUR. This correction was not included for SiH4. For this term, all-electron calculations were adopted, but with the 1s orbital of Cl frozen for CH3Cl, no frozen orbitals for CH4 and CH3F, and the 1s orbital of Na frozen for NaOH. The basis sets were aug-cc-pCVTZ(+d for Cl) for CH3Cl, aug-cc-pCVDZ for CH4 and CH3F, and aug-cc-pCVDZ(+d for Na) for NaOH.
Table 2 (excerpt, CH3Cl entries).
ΔE_CV: all-electron calculations kept the 1s orbital of Cl frozen; software: MOLPRO2012.
ΔE_HO: levels of theory: CCSD(T), CCSDT, and CCSDT(Q); the basis sets for the full triples and the perturbative quadruples calculations are aug-cc-pVTZ(+d for Cl) and aug-cc-pVDZ(+d for Cl), respectively.
ΔE_SR: method: one-electron mass velocity and Darwin (MVD1) terms from the Breit-Pauli Hamiltonian in first-order perturbation theory; all electrons correlated (except for the 1s of Cl); CCSD(T)/aug-cc-pCVTZ(+d for Cl).
… we use CFOUR V2.1 to perform calculations for some grid points in CH3Cl and NaOH that converge to high energy solutions); see Fig. 2 for the CFOUR input options. In the MP2/cc-pVTZ calculations, we use the default option FROZEN_CORE = OFF so that all electrons and all orbitals are correlated. In the CCSD(T)/cc-pVQZ calculations, the option FROZEN_CORE = ON is used for all molecules so that only the valence electrons are correlated. For CH3Cl, CH4, CH3F, and NaOH, SCF_CONV = 10, CC_CONV = 10, and LINEQ_CONV = 8 are set to specify the convergence criteria for the HF-SCF, CC amplitude, and linear equations, and CC_PROG = ECC is set to select the ECC coupled cluster program. For SiH4, we adopted the CFOUR default options SCF_CONV = 7, CC_CONV = 7, LINEQ_CONV = 7, and CC_PROG = VCC. We use the option GEO_MAXCYC = 1 to limit the number of geometry optimization iterations to one, which yields the gradient at the current nuclear configuration. From these calculations we also extracted the HF energies calculated with the corresponding basis sets cc-pVTZ and cc-pVQZ. In addition, for CH3Cl we include MP2/aug-cc-pVQZ energies calculated using MOLPRO2012 55 as reported in ref. 20 .
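The per-molecule and per-method keyword choices described above can be collected in a small lookup, which may help when reproducing or extending the complementary calculations. The sketch below uses only the keywords and values named in the text; the function name `cfour_options` and the assumption that the same convergence settings apply to both the MP2 and the CCSD(T) runs are ours, and the full input files are those shown in Fig. 2, not this dictionary.

```python
# Keyword values named in the text; everything else about the input is omitted.
TIGHT_OPTIONS = {"SCF_CONV": 10, "CC_CONV": 10, "LINEQ_CONV": 8, "CC_PROG": "ECC"}
DEFAULT_OPTIONS = {"SCF_CONV": 7, "CC_CONV": 7, "LINEQ_CONV": 7, "CC_PROG": "VCC"}


def cfour_options(molecule, method):
    """Return the keyword settings described in the text for one calculation.

    `molecule` is a formula string such as "CH4"; `method` is "MP2" or "CCSD(T)".
    """
    if method not in ("MP2", "CCSD(T)"):
        raise ValueError(f"unsupported method: {method}")
    # SiH4 used the program defaults; the other molecules used tighter criteria.
    options = dict(DEFAULT_OPTIONS if molecule == "SiH4" else TIGHT_OPTIONS)
    # MP2/cc-pVTZ correlates all electrons; CCSD(T)/cc-pVQZ uses a frozen core.
    options["FROZEN_CORE"] = "OFF" if method == "MP2" else "ON"
    # A single optimization cycle yields the gradient at the input geometry.
    options["GEO_MAXCYC"] = 1
    return options


keyword_line = ",".join(f"{k}={v}" for k, v in cfour_options("CH4", "CCSD(T)").items())
```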

Data records
All data for the 5 molecules are stored as a database in JSON format in the file VIB5.json, available for download from https://doi.org/10.6084/m9.figshare.16903288 79 . The first level of the database contains one item per molecule, in the order CH3Cl, CH4, SiH4, CH3F, and NaOH. At the next level, each molecule item first gives the chemical formula, the chemical name, the number of atoms, and the list of nuclear charges (in the same order as the atoms appear in the items with nuclear coordinates), followed by the description of the properties available for the grid points (property type, levels of theory, units). Finally, the items for each grid point are given (see Table 3 for a brief description and units). The geometry of each grid point in Cartesian coordinates and in internal coordinates can be accessed with the "XYZ" key and the "INT" key, respectively. The definition of the internal coordinates used in the database is shown in Fig. 3. The "HF-TZ", "HF-QC", "MP2", "CCSD-T", and "TBE" keys can be selected separately to obtain the energy of each grid point at HF/cc-pVTZ, HF/cc-pVQZ, MP2/cc-pVTZ, CCSD(T)/cc-pVQZ, and TBE, respectively. The database also provides the energy gradients in Cartesian coordinates and internal coordinates at the MP2/cc-pVTZ and CCSD(T)/cc-pVQZ levels of theory, which can be accessed through the "MP2_grad_xyz", "MP2_grad_int", "CCSD-T_grad_xyz", and "CCSD-T_grad_int" keys. See Table 3 for the summary and the keys of other properties.
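Reading a property for all grid points of one molecule can be sketched with the standard json module. The exact nesting of VIB5.json is not spelled out here, so the example works on a tiny mock: only the property keys "XYZ", "INT", and "TBE" come from the text, while the molecule key "NaOH", the container key "grid_points", the helper `get_property`, and all numbers are hypothetical placeholders that should be adapted to the real file.

```python
import json

# Tiny mock of the layout described above (hypothetical nesting and values).
MOCK_JSON = json.dumps({
    "NaOH": {
        "grid_points": [
            {"XYZ": [[0.0, 0.0, 0.0], [0.0, 0.0, 1.95], [0.0, 0.0, 2.90]],
             "INT": [1.95, 0.95, 180.0],
             "TBE": 0.0},
            {"XYZ": [[0.0, 0.0, 0.0], [0.0, 0.0, 2.05], [0.0, 0.0, 3.00]],
             "INT": [2.05, 0.95, 180.0],
             "TBE": 150.0},
        ]
    }
})


def get_property(db, molecule, key, points_key="grid_points"):
    """Return the values of one property for every grid point of a molecule."""
    return [point[key] for point in db[molecule][points_key]]


# For the real database, replace this with json.load() on the opened VIB5.json
# and adjust the molecule and container keys to the actual layout.
db = json.loads(MOCK_JSON)
energies = get_property(db, "NaOH", "TBE")
internal = get_property(db, "NaOH", "INT")
```

Gradient keys such as "MP2_grad_xyz" can be read the same way by passing them as `key`.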

Technical Validation
The TBE values and TBE constituent terms were validated by calculating rovibrational spectra and comparing them to experiment in the original peer-reviewed publications cited in the Methods section and Table 1. In brief, rovibrational energy levels were computed by fitting an analytical expression to the PES data and using it in variational calculations with the nuclear motion program TROVE 80 . The resulting rovibrational energy levels were then compared to experimental values (when available) to validate the accuracy of the underlying PES. The new complementary data calculated here were validated by making sure that all calculations fully converged. After the database was constructed, we performed additional checks for repeated geometries, which identified grid points with the same geometrical parameters in the CH4 grid. We removed such duplicates from the database, which slightly reduced the number of CH4 points (97217) compared to the number reported in the original publication (97271). This pruned grid is used in our final database.
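A check for repeated geometries of the kind described above can be sketched as follows. The text does not state how duplicates were detected, so this example assumes that two grid points are duplicates when their internal coordinates (the "INT" entry) agree after rounding to a fixed number of decimals; the function name and the tolerance are illustrative choices.

```python
def drop_duplicate_geometries(points, decimals=8):
    """Keep the first occurrence of each geometry, comparing rounded "INT" values.

    `points` is a list of dicts that each contain an "INT" list of internal
    coordinates; the original order of the retained points is preserved.
    """
    seen = set()
    unique = []
    for point in points:
        key = tuple(round(value, decimals) for value in point["INT"])
        if key not in seen:
            seen.add(key)
            unique.append(point)
    return unique


# Illustrative three-coordinate geometries; the second repeats the first.
grid = [
    {"INT": [1.09, 1.09, 109.5], "TBE": 0.0},
    {"INT": [1.09, 1.09, 109.5], "TBE": 0.0},
    {"INT": [1.10, 1.09, 109.5], "TBE": 12.3},
]
pruned = drop_duplicate_geometries(grid)
```

For reference, the point counts quoted above (97271 before and 97217 after pruning) correspond to 54 removed CH4 grid points.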

Usage Notes
We provide a Python script, extraction_data.py, that can be used to pull the data of interest from VIB5.json (Box 1). It is provided together with the database file at https://doi.org/10.6084/m9.figshare.16903288 79 .