Background & Summary

Many physical and chemical processes of molecular systems are governed by potential energy surfaces (PESs), which give the potential energy as a function of the molecular geometry defined by the nuclei1. Accurate ab initio quantum chemical (QC) molecular PESs are essential to predict and understand a multitude of physicochemical properties of interest, such as reaction thermodynamics and kinetics2, and to simulate rovibrational spectra3,4,5. For the latter, PESs of a number of different molecules have been constructed and used in variational nuclear motion calculations to provide accurate rotation-vibration-electronic line lists that aid the characterization of exoplanet atmospheres, amongst other applications6,7,8,9,10,11,12,13,14,15,16.

Simulating rotation-vibration (rovibrational) spectra that approach the coveted spectroscopic accuracy of 1 cm−1 over a broad range of temperatures requires a global PES covering all relevant regions of nuclear configurations. This can be achieved by defining the PES on a high-density grid of nuclear geometries with no holes and computing the theoretical best estimate (TBE) of the energies at a very high QC level of theory. The construction of an optimal grid usually involves many steps and human intervention, and often requires a staggeringly large number of grid points, e.g., ca. 100 thousand points even for a five-atom molecule such as methane10. The choice of QC level for TBE calculations is determined by the trade-off between accuracy and computational cost, but typically requires going well beyond the gold-standard17,18,19 CCSD(T)17/CBS (coupled cluster with single and double excitations and a perturbative treatment of triple excitations/complete basis set) limit and needs many QC corrections on top of it. To give a perspective, ca. 24 single-core central processing unit (CPU) hours are required to calculate the TBE energy of each of the ~45 thousand methyl chloride (CH3Cl) grid geometries, amounting to over 100 CPU-years for the construction of its highly accurate ab initio PES20.
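The cost estimate quoted above can be checked with simple arithmetic. The following minimal sketch only multiplies the two figures given in the text (about 45 thousand points at ca. 24 CPU-hours each); the function name and the 365-day year are illustrative assumptions.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap calendar year


def cpu_years(n_points: int, cpu_hours_per_point: float) -> float:
    """Total CPU time, in CPU-years, for n_points single-point calculations."""
    return n_points * cpu_hours_per_point / HOURS_PER_YEAR


# Figures quoted in the text for CH3Cl: ~45 thousand points, ca. 24 CPU-hours each.
ch3cl_cost = cpu_years(45_000, 24.0)
print(f"CH3Cl TBE grid: about {ch3cl_cost:.0f} CPU-years")
```

The result is consistent with the "over 100 CPU-years" stated for the CH3Cl PES.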

To reduce the high computational cost, machine learning (ML) has emerged as a powerful approach for constructing full-dimensional PESs21,22,23,24,25,26,27, and the resulting ML PESs can be used22,24,28,29,30,31,32,33,34,35 for performing vibrational calculations. In particular, substantial cost reduction can be achieved by calculating TBE energies only for a small number of existing grid points and then interpolating between them with ML36; such ML grids can subsequently be used for simulating rovibrational spectra with a relatively small loss of accuracy. Importantly, much larger savings in computational cost can be achieved20 when ML is applied to learn the various QC corrections using a hierarchical ML (hML) scheme based on Δ-learning37, rather than to learn the TBE energy directly.
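The idea behind Δ-learning can be illustrated with a toy example. The sketch below is not the hML scheme of the cited work: it uses a made-up one-dimensional "low-level" and "high-level" energy function and a plain Gaussian kernel ridge regression written with NumPy. The model is trained only on the difference between the two levels at a few points, and the high-level energy is then approximated as the low-level energy plus the learned correction. All function names, hyperparameters, and energy expressions are hypothetical.

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """Gaussian kernel matrix between two 1D arrays of coordinates."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma**2))


def krr_fit(x_train, y_train, sigma=0.3, lam=1e-8):
    """Solve (K + lam*I) alpha = y for the regression coefficients."""
    k = gaussian_kernel(x_train, x_train, sigma)
    return np.linalg.solve(k + lam * np.eye(len(x_train)), y_train)


def krr_predict(x_train, alpha, x_new, sigma=0.3):
    return gaussian_kernel(x_new, x_train, sigma) @ alpha


# Toy stand-ins for a cheap and an expensive level of theory (made up).
def e_low(r):
    return 0.5 * (r - 1.0) ** 2


def e_high(r):
    return e_low(r) + 0.05 * np.sin(3.0 * r)


x_all = np.linspace(0.5, 2.0, 200)    # "full grid"
x_train = np.linspace(0.5, 2.0, 12)   # few points with expensive energies

# Learn only the correction (high - low), not the high-level energy itself.
alpha = krr_fit(x_train, e_high(x_train) - e_low(x_train))
e_pred = e_low(x_all) + krr_predict(x_train, alpha, x_all)

baseline_err = np.max(np.abs(e_low(x_all) - e_high(x_all)))
delta_err = np.max(np.abs(e_pred - e_high(x_all)))
print(f"max error without correction: {baseline_err:.2e}")
print(f"max error with learned correction: {delta_err:.2e}")
```

Because the correction is typically smoother and smaller in magnitude than the total energy, fewer expensive reference points are needed to learn it to a given accuracy.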

Despite all the above efforts in constructing highly accurate PESs, there is still room for improvement, e.g., via creating denser grids, using higher QC levels, and further developing ML approaches, all of which require access to data. Unfortunately, the raw data containing geometries, TBEs and TBE constituent terms for many published studies are either missing or scattered. Thus, our data descriptor aims to organize these scattered data generated in the previous studies by some of us into a consolidated, structured PES database that we call VIB5. The VIB5 database contains five molecules: CH3Cl (refs. 7,9,20), CH4 (ref. 10), SiH4 (ref. 8), CH3F (ref. 12), and NaOH (ref. 14). The number of grid points per molecule ranges from 15 thousand to 100 thousand, with more than 300 thousand points altogether (Table 1). In addition, it is known that including energy gradient information can significantly reduce the number of training points needed for ML, which is efficiently exploited in gradient-based ML models38,39. Thus, for this database, we additionally calculate energies and energy gradients at two levels of theory, MP2/cc-pVTZ (second-order Møller–Plesset perturbation theory/correlation-consistent triple-zeta basis set) and CCSD(T)/cc-pVQZ (correlation-consistent quadruple-zeta basis set), and provide the HF (Hartree–Fock) energies calculated with the corresponding basis sets cc-pVTZ and cc-pVQZ.

Table 1 The number of grid points (grid size) for each molecule with references to original studies generating these grid points, theoretical best estimates (TBE), and TBE constituent terms.

Our database is complementary to existing databases used for developing ML PES models. Some existing databases contain only energies for equilibrium geometries of various compounds calculated at different levels (from density functional theory [DFT] up to coupled-cluster approaches): QM7 (ref. 40), QM7b (ref. 41), QM9 (ref. 42), revised QM9 (ref. 43), and ANI-1ccx (ref. 44). Another database, ANI-1 (ref. 45), also contains DFT energies for off-equilibrium geometries. DFT energies and energy gradients are available for equilibrium and off-equilibrium geometries of different molecules in the ANI-1x (ref. 44) and QM7-X (ref. 46) databases. The MD-17 dataset38,39 is a popular database with energies and energy gradients for geometries taken from MD trajectories of several small- to medium-sized molecules at DFT and, for a subset of points, at CCSD(T) with different basis sets. PESs generated from MD are, however, likely to have limited coverage of high-energy geometries and many holes, making them inapplicable to some kinds of accurate simulations such as diffusion Monte Carlo calculations, as was pointed out recently47. In contrast to these databases, our database provides reliable, global PESs with QC energies and energy gradients at different levels, including very accurate TBEs of energies going beyond CCSD(T)/CBS, which can be used for ML models trained on data from several levels of theory, such as hML and Δ-learning. Finally, our database comes with a convenient data-extraction script that can be used to pull the required information in a format suitable for, e.g., ML.

Methods

Grid points generation

For each molecule, we take grid points directly from the previous studies by some of the authors. Here we only describe in short how these grid points were generated for the sake of completeness. We refer the reader to the original publications cited for each molecule for further details (see Table 1).

CH3Cl

44819 grid points for CH3Cl were taken from Refs. 7,9,20. A Monte Carlo random energy-weighted sampling algorithm was applied to nine internal coordinates of CH3Cl: the C–Cl bond length r0; three C–H bond lengths r1, r2, and r3; three (HiCCl) interbond angles β1, β2, and β3; and two dihedral angles τ12 and τ13 between adjacent planes containing HiCCl and HjCCl (Fig. 1a). This procedure led to geometries in the range 1.3 ≤ r0 ≤ 2.95 Å, 0.7 ≤ ri ≤ 2.45 Å, 65 ≤ βi ≤ 165° for i = 1, 2, 3 and 55 ≤ τjk ≤ 185° with jk = 12, 13. The grid also includes 1000 carefully chosen low-energy points to ensure an adequate description of the equilibrium region.

Fig. 1

Definition of internal coordinates in each molecule. Internal coordinates of (a) CH3Cl; r0 is C–Cl bond length, ri and βi are C–Hi bond lengths and (HiCCl) angles (i = 1, 2, 3), τjk are HjCClHk dihedral angles (jk = 12, 13); only r0, r3, β1 and τ12 are shown; (b) CH4; ri and αjk are C–Hi bond lengths and (HjCHk) angles (i = 1, 2, 3, 4; jk = 12, 13, 14, 23, 24); only r4 and α14 are shown; (c) SiH4; ri and αjk are Si–Hi bond lengths and (HjSiHk) angles (i = 1, 2, 3, 4; jk = 12, 13, 14, 23, 24); only r4 and α14 are shown; (d) CH3F; r0 is C–F bond length, ri and βi are C–Hi bond lengths and (HiCF) angles (i = 1, 2, 3), τjk are HjCFHk dihedral angles (jk = 12, 13); only r0, r3, β1 and τ12 are shown; (e) NaOH; rNaO and rOH are Na–O and O–H bond lengths, θNaOH is (NaOH) bond angle.

CH4

97271 grid points for CH4 were taken from ref. 10. The global grid was built in the same fashion as the grid for CH3Cl. Nine internal coordinates of CH4 are defined as follows: four C–H bond lengths r1, r2, r3 and r4; five (Hj–C–Hk) interbond angles α12, α13, α14, α23, and α24, where j and k label the respective hydrogen atoms (Fig. 1b). The grid points are in the range 0.71 ≤ ri ≤ 2.60 Å for i = 1, 2, 3, 4 and 40 ≤ αjk ≤ 140° with jk = 12, 13, 14, 23, 24.

SiH4

84002 grid points for SiH4 were taken from ref. 8. Nine internal coordinates of SiH4 are defined in the same way as for CH4: four Si–H bond lengths r1, r2, r3 and r4; five (Hj–Si–Hk) interbond angles α12, α13, α14, α23, and α24, where j and k label the respective hydrogen atoms (Fig. 1c). The geometries are in the range 0.98 ≤ ri ≤ 2.95 Å for i = 1, 2, 3, 4 and 40 ≤ αjk ≤ 140° with jk = 12, 13, 14, 23, 24.

CH3F

82653 grid points for CH3F were taken from ref. 12. Nine internal coordinates of CH3F are defined in the same way as CH3Cl: the C–F bond length r0; three C–H bond lengths r1, r2, and r3; three (HiCF) interbond angles β1, β2, and β3; and two dihedral angles τ12 and τ13 between adjacent planes containing HiCF and HjCF (Fig. 1d). This procedure led to geometries in the range 1.005 ≤ r0 ≤ 2.555 Å, 0.705 ≤ ri ≤ 2.695 Å, 45.5 ≤ βi ≤ 169.5° for i = 1, 2, 3 and 40.5 ≤ τjk ≤ 189.5° with jk = 12, 13.

NaOH

15901 grid points for NaOH were taken from ref. 14. Grid points were generated randomly with a dense distribution around the equilibrium region. Three internal coordinates of NaOH are defined as follows: the Na–O bond length rNaO, the O–H bond length rOH, and the interbond angle (NaOH) (Fig. 1e). This procedure led to geometries in the range 1.435 ≤ rNaO ≤ 4.400 Å, 0.690 ≤ rOH ≤ 1.680 Å, and 40 ≤ (NaOH) ≤ 180°.

Theoretical best estimates and constituent terms

For each molecule, we take the TBEs and energy corrections directly from the previous studies by some of us. Here we only briefly introduce how these calculations were performed. We refer the reader to the original publications cited for each molecule for details (see Table 1). The TBE is obtained as the sum of several constituent terms: ECBS, ∆ECV, ∆EHO, ∆ESR, and, for most molecules, ∆EDBOC. ECBS is the energy at the complete basis set (CBS) limit. ∆ECV is the core-valence (CV) electron correlation energy correction. ∆EHO is the energy correction due to the higher-order (HO) coupled cluster terms, and ∆ESR accounts for scalar relativistic (SR) effects. ∆EDBOC is the diagonal Born–Oppenheimer correction; it was calculated for CH3Cl, CH4, CH3F, and NaOH, but not for SiH4, because it has little effect on the vibrational energy levels of this molecule.
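Written out as a single expression, the composite scheme described above reads as follows (the symbols are those of the text; the last term is omitted for SiH4):

```latex
\[
E_{\mathrm{TBE}} = E_{\mathrm{CBS}} + \Delta E_{\mathrm{CV}} + \Delta E_{\mathrm{HO}} + \Delta E_{\mathrm{SR}} + \Delta E_{\mathrm{DBOC}}
\]
```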

The constituent terms were not calculated at the same level of theory across all molecules in the data set. The computational details of the five TBE constituent terms (ECBS, ∆ECV, ∆EHO, ∆ESR, and ∆EDBOC) for the 5 molecules are described below and summarized in Table 2.

Table 2 The comparative table of the computational details behind the calculations of the constituent terms of theoretical best estimates for five molecules of the VIB5 database.

ECBS

To extrapolate the energy to the CBS limit, the parameterized two-point formula48 \(\left({E}_{CBS}^{C}=\left({E}_{n+1}-{E}_{n}\right){F}_{n+1}^{C}+{E}_{n}\right)\) was used. The CCSD(T)-F12b method49 was chosen together with the two basis sets cc-pVTZ-F12 and cc-pVQZ-F12 (ref. 50). In these calculations, the frozen core approximation was adopted and the diagonal fixed amplitude ansatz 3C(FIX)51 was employed with a Slater geminal exponent value48 of β = 1.0 \({a}_{0}^{-1}\). As for the auxiliary basis sets (ABS), the OptRI52 basis for the resolution of the identity and the cc-pV5Z/JKFIT53 and aug-cc-pwCV5Z/MP2FIT54 basis sets for density fitting were used for all 5 molecules. These calculations were carried out with either MOLPRO2012 (ref. 55; CH3Cl, CH4, SiH4, CH3F) or MOLPRO2015 (refs. 55,56; NaOH). For the coefficients \({F}_{n+1}^{C}\) in the two-point formula, \({F}^{CCSD-F12b}\) = 1.363388 and \({F}^{(T)}\) = 1.769474 (ref. 48) were used for all molecules. The extrapolation was not applied to the Hartree–Fock (HF) energy; instead, the HF + CABS (complementary auxiliary basis set) singles correction49 calculated with the cc-pVQZ-F12 basis set was used.
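The two-point extrapolation is simple enough to sketch in a few lines. The snippet below applies the formula separately to the CCSD-F12b correlation energy and the (T) contribution with the two coefficients quoted in the text, and adds the unextrapolated HF + CABS singles energy from the larger basis set. How the components are combined is our reading of the paragraph above, and the function names and input numbers are hypothetical placeholders, not values from the database.

```python
# Coefficients quoted in the text (ref. 48).
F_CCSD_F12B = 1.363388
F_T = 1.769474


def two_point_cbs(e_n: float, e_n1: float, f: float) -> float:
    """E_CBS = (E_{n+1} - E_n) * F_{n+1} + E_n for one energy component."""
    return (e_n1 - e_n) * f + e_n


def cbs_energy(e_hf_cabs_qz, e_corr_tz, e_corr_qz, e_t_tz, e_t_qz):
    """Assumed assembly: HF+CABS (QZ, not extrapolated) plus extrapolated
    CCSD-F12b correlation and (T) contributions (TZ -> QZ)."""
    return (
        e_hf_cabs_qz
        + two_point_cbs(e_corr_tz, e_corr_qz, F_CCSD_F12B)
        + two_point_cbs(e_t_tz, e_t_qz, F_T)
    )


# Made-up component energies in hartree, for illustration only.
e_cbs = cbs_energy(-40.2000, -0.2200, -0.2230, -0.0060, -0.0062)
print(f"E_CBS = {e_cbs:.6f} hartree")
```

A coefficient larger than one means the extrapolated value lies beyond the larger-basis result, in the direction of the change from the smaller to the larger basis set.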

∆ECV

∆ECV was computed at CCSD(T)-F12b/cc-pCVQZ-F12 (ref. 57) for CH3Cl and at CCSD(T)-F12b/cc-pCVTZ-F12 (ref. 57) for the other 4 molecules (CH4, SiH4, CH3F, NaOH). The same ansatz and ABS used for ECBS were employed for calculating ∆ECV, but the Slater geminal exponent value was changed: β = 1.5 \({a}_{0}^{-1}\) for CH3Cl and β = 1.4 \({a}_{0}^{-1}\) for the other 4 molecules. For this term, all-electron calculations were adopted, but with the 1s orbital of Cl frozen for CH3Cl, the 1s orbital of Si frozen for SiH4, and the 1s orbital of Na frozen for NaOH. No orbitals were frozen in the all-electron calculations for CH4 and CH3F. The software used is the same as described above for ECBS.

∆EHO

To obtain ∆EHO, the hierarchy of coupled cluster methods was used: \(\Delta {E}_{HO}={E}_{CCSDT}-{E}_{CCSD(T)}\) for NaOH, while \(\Delta {E}_{HO}=\Delta {E}_{T}+\Delta {E}_{(Q)}\) for the other 4 molecules (CH3Cl, CH4, SiH4, CH3F), with \(\Delta {E}_{T}={E}_{CCSDT}-{E}_{CCSD(T)}\) for the full triples contribution and \(\Delta {E}_{(Q)}={E}_{CCSDT(Q)}-{E}_{CCSDT}\) for the perturbative quadruples contribution. The frozen core approximation was employed in the calculations. Thus, energy calculations at CCSD(T) and CCSDT were performed for NaOH, while energy calculations at the CCSD(T), CCSDT, and CCSDT(Q) levels of theory were performed for the other 4 molecules. All of these calculations were carried out with the general coupled cluster approach58,59 implemented in the MRCC code (www.mrcc.hu)60 interfaced to CFOUR (www.cfour.de)61. As for the basis sets, the full triples and the perturbative quadruples contributions were calculated, respectively, with aug-cc-pVTZ(+d for Cl)62,63,64,65 and aug-cc-pVDZ(+d for Cl) for CH3Cl, cc-pVTZ62 and cc-pVDZ for CH4, cc-pVTZ(+d for Si)62,63,64,65 and cc-pVDZ(+d for Si) for SiH4, and cc-pVTZ62 and cc-pVDZ for CH3F. For NaOH, cc-pVTZ(+d for Na)62,66 was used for the CCSD(T) and CCSDT calculations.

∆ESR

∆ESR was calculated by using either the one-electron mass velocity and Darwin (MVD1) terms from the Breit–Pauli Hamiltonian in first-order perturbation theory67 or the second-order Douglas–Kroll–Hess approach68,69. The former method was used for CH3Cl and the latter for the other 4 molecules (CH4, SiH4, CH3F, and NaOH). All-electron calculations (except for the 1s orbital of Cl) were adopted for CH3Cl, while the frozen core approximation was employed for the other 4 molecules. Calculations were performed at CCSD(T)/aug-cc-pCVTZ(+d for Cl)70,71 using the MVD1 approach72 implemented in CFOUR for CH3Cl, and at CCSD(T)/cc-pVQZ-DK73 using MOLPRO (the same software versions as described above for ECBS) for the other 4 molecules.

∆EDBOC

∆EDBOC was computed using the CCSD method74 as implemented in CFOUR. This correction was not included for SiH4. For this term, all-electron calculations were adopted, but with the 1s orbital of Cl frozen for CH3Cl, all electrons correlated for CH4 and CH3F, and the 1s orbital of Na frozen for NaOH. As for the basis sets, calculations were performed with aug-cc-pCVTZ(+d for Cl) for CH3Cl, aug-cc-pCVDZ for CH4, aug-cc-pCVDZ for CH3F, and aug-cc-pCVDZ(+d for Na) for NaOH.

Complementary energy and gradient calculations

All complementary ab initio QC energy and gradient calculations for a total of 324592 grid points were performed at two levels of theory, MP2/cc-pVTZ (MP2: refs. 75,76; basis set: refs. 62,64,66) and CCSD(T)/cc-pVQZ (CCSD(T): refs. 17,77,78; basis set: refs. 62,64,66), using the CFOUR program package (Versions 1.0 and 2.1, ref. 61; CFOUR V2.1 was used for some grid points of CH3Cl and NaOH whose calculations converged to high-energy solutions); see Fig. 2 for the CFOUR input options. In the MP2/cc-pVTZ calculations, we use the default option FROZEN_CORE = OFF so that all electrons and all orbitals are correlated. In the CCSD(T)/cc-pVQZ calculations, the option FROZEN_CORE = ON is used for all molecules so that only the valence electrons are correlated. For CH3Cl, CH4, CH3F and NaOH, SCF_CONV = 10, CC_CONV = 10 and LINEQ_CONV = 8 are set to specify the convergence criteria for the HF-SCF, CC amplitude and linear equations, respectively, and CC_PROG = ECC is set to select the ECC coupled cluster program. For SiH4, we adopted the CFOUR default options SCF_CONV = 7, CC_CONV = 7, LINEQ_CONV = 7 and CC_PROG = VCC. We use the GEO_MAXCYC = 1 option to limit the number of geometry optimization iterations to one, which yields the gradient information for the current nuclear configuration. From these calculations we also extracted the HF energies calculated with the corresponding basis sets cc-pVTZ and cc-pVQZ. In addition, for CH3Cl we include MP2/aug-cc-pVQZ energies calculated using MOLPRO2012 (ref. 55) as reported in ref. 20.
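As a rough illustration, the options listed above can be assembled into a CFOUR-style keyword section with a small helper. This is only a sketch: the helper name is hypothetical, the CALC and BASIS values are our assumption for a CCSD(T)/cc-pVQZ calculation and are not quoted in the text, and the geometry section and any SCF convergence aids are omitted.

```python
def cfour_keyword_block(options: dict) -> str:
    """Format KEY=VALUE pairs as a '*CFOUR(...)' keyword section, one per line."""
    lines = [f"{key}={value}" for key, value in options.items()]
    return "*CFOUR(" + "\n".join(lines) + ")"


# Options named in the text for CH3Cl, CH4, CH3F and NaOH at CCSD(T)/cc-pVQZ.
ccsdt_options = {
    "CALC": "CCSD(T)",    # assumption: not quoted in the text
    "BASIS": "PVQZ",      # assumption: not quoted in the text
    "FROZEN_CORE": "ON",  # correlate valence electrons only
    "SCF_CONV": 10,
    "CC_CONV": 10,
    "LINEQ_CONV": 8,
    "CC_PROG": "ECC",
    "GEO_MAXCYC": 1,      # single optimization cycle, used to obtain the gradient
}

# SiH4 used the program defaults for convergence and the VCC program.
sih4_options = {**ccsdt_options, "SCF_CONV": 7, "CC_CONV": 7,
                "LINEQ_CONV": 7, "CC_PROG": "VCC"}

block = cfour_keyword_block(ccsdt_options)
print(block)
```

Keeping the shared options in one dictionary and overriding only the SiH4-specific values makes the difference between the two setups explicit.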

Fig. 2

Typical CFOUR input options for (a) MP2/cc-pVTZ, (b) CCSD(T)/cc-pVQZ for CH3Cl, CH4, CH3F, NaOH and (c) CCSD(T)/cc-pVQZ for SiH4. The blue options were used for most cases and the light grey options are examples of options used to improve SCF convergence only for some geometries.

Data Records

All data for the 5 molecules are stored as a database in JSON format in the file VIB5.json, available for download from https://doi.org/10.6084/m9.figshare.16903288 (ref. 79). The first level of the database contains one item per molecule in the order CH3Cl, CH4, SiH4, CH3F, and NaOH. For each molecule, the next level of the database first gives the chemical formula, chemical name, number of atoms, and the list of nuclear charges (in the same order as the atoms appear in the items with nuclear coordinates), followed by a description of the properties available for the grid points (property type, levels of theory, units). Finally, the items for each grid point are given; they contain the nuclear positions in both Cartesian and internal coordinates and the values of the properties (energies and energy gradients at different levels of theory, i.e., TBE, TBE constituent terms, and complementary data). The JSON keys of the items available for each grid point are listed in Table 3 with a brief description and units. The geometry of each grid point in Cartesian and in internal coordinates can be accessed with the “XYZ” key and the “INT” key, respectively. The definition of the internal coordinates used in the database is shown in Fig. 3. The “HF-TZ”, “HF-QZ”, “MP2”, “CCSD-T”, and “TBE” keys can be selected separately to obtain the energy of each grid point at HF/cc-pVTZ, HF/cc-pVQZ, MP2/cc-pVTZ, CCSD(T)/cc-pVQZ, and TBE, respectively. The database also provides the energy gradients in Cartesian and internal coordinates at the MP2/cc-pVTZ and CCSD(T)/cc-pVQZ levels of theory, which can be accessed through the “MP2_grad_xyz”, “MP2_grad_int”, “CCSD-T_grad_xyz”, and “CCSD-T_grad_int” keys. See Table 3 for the summary and the keys of other properties.
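To make the per-point layout concrete, the following sketch works on a hand-made miniature of one molecule's entry rather than on the real file. Only the per-point keys ("XYZ", "MP2", "CCSD-T", "TBE") are taken from the description above; the "grid_points" container name, the helper functions, and all numerical values are hypothetical, so the exact nesting should be checked against Table 3 and the actual VIB5.json before use.

```python
# Hypothetical miniature of a single molecule entry (placeholder numbers).
mock_molecule = {
    "grid_points": [  # assumed container name, not taken from the database
        {"XYZ": [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
         "MP2": -1.10, "CCSD-T": -1.15, "TBE": 0.0},
        {"XYZ": [[0.0, 0.0, 0.0], [0.0, 0.0, 1.2]],
         "MP2": -1.05, "CCSD-T": -1.11, "TBE": 350.0},
    ]
}


def select(points, keys):
    """Keep only the requested keys for every grid point."""
    return [{key: point[key] for key in keys} for point in points]


def delta_targets(points, low="MP2", high="CCSD-T"):
    """Energy differences between two levels, e.g. as Δ-learning targets."""
    return [point[high] - point[low] for point in points]


points = mock_molecule["grid_points"]
geometries_and_tbe = select(points, ["XYZ", "TBE"])
corrections = delta_targets(points)
print(geometries_and_tbe[0])
print(corrections)
```

With the real file, the same helpers would be applied after loading it with Python's standard `json` module and navigating to the list of grid-point items of the molecule of interest.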

Table 3 Layout of the VIB5.json file containing the VIB5 database.
Fig. 3

Definition of internal coordinates for each molecule used in the database file VIB5.json and in the complementary calculations. Internal coordinates of (a) CH3Cl; R0 is C–Cl bond length, Ri and Ai are C–Hi+2 bond lengths and (Hi+2CCl) angles (i = 1, 2, 3), Djk are Hj+2CClHk+2 dihedral angles (jk = 12, 13); only R0, R1, A1, A2, A3, D12, and D13 are shown; (b) CH4; Ri and A1j are C–Hi+1 bond lengths and (H2CHj+1) angles (i = 1, 2, 3, 4; j = 2, 3, 4), Dk2 are Hk+1CH2H3 dihedral angles (k = 3, 4); only R1, A12, A13, A14, D32, and D42 are shown; (c) SiH4; Ri and A1j are Si–Hi+1 bond lengths and (H2SiHj+1) angles (i = 1, 2, 3, 4; j = 2, 3, 4), Dk2 are Hk+1SiH2H3 dihedral angles (k = 3, 4); only R1, A12, A13, A14, D32, and D42 are shown; (d) CH3F; R0 is C–F bond length, Ri and Ai are C–Hi+2 bond lengths and (Hi+2CF) angles (i = 1, 2, 3), Djk are Hj+2CFHk+2 dihedral angles (jk = 12, 13); only R0, R1, A1, A2, A3, D12, and D13 are shown; (e) NaOH; R1 and R2 are Na–O and H–O bond lengths, RX is O–X bond length, AX1 and AX2 are (XONa) and (XOH) angles, and D is NaXOH dihedral angle. X is a dummy atom.

Technical Validation

The TBE values and TBE constituent terms were validated by calculating rovibrational spectra and comparing them to experiment in the original peer-reviewed publications cited in the Methods section and Table 1. In brief, rovibrational energy levels were computed by fitting an analytical expression to the PES and using it in variational calculations with the nuclear motion program TROVE80. The resulting rovibrational energy levels were then compared to experimental values (when available) to validate the accuracy of the underlying PES. The new complementary data calculated here were validated by making sure that all calculations fully converged. After the database was constructed, we performed additional checks for repeated geometries, which identified grid points with identical geometrical parameters in the CH4 grid. We removed such duplicates from the database, which slightly reduces the number of CH4 points (97217) compared to the number reported in the original publication (97271). This pruned grid is used in our final database.
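A check for repeated geometries of the kind described above could look like the following minimal sketch. The text does not state how the duplicates were detected, so the comparison on rounded internal coordinates, the tolerance of six decimals, and the function name are all assumptions made for illustration.

```python
def find_duplicate_geometries(geometries, decimals=6):
    """Return (first_index, duplicate_index) pairs of geometries whose
    coordinates agree after rounding to the given number of decimals."""
    first_seen = {}
    duplicates = []
    for index, geometry in enumerate(geometries):
        key = tuple(round(value, decimals) for value in geometry)
        if key in first_seen:
            duplicates.append((first_seen[key], index))
        else:
            first_seen[key] = index
    return duplicates


# Made-up internal coordinates (bond lengths in Å, angles in degrees).
grid = [
    (1.09, 1.09, 1.09, 1.09, 109.5, 109.5, 109.5, 109.5, 109.5),
    (1.20, 1.09, 1.09, 1.09, 109.5, 109.5, 109.5, 100.0, 109.5),
    (1.09, 1.09, 1.09, 1.09, 109.5, 109.5, 109.5, 109.5, 109.5),  # repeat of point 0
]

pairs = find_duplicate_geometries(grid)
unique_grid = [g for i, g in enumerate(grid) if i not in {dup for _, dup in pairs}]
print(pairs, len(unique_grid))
```

Keeping the first occurrence and dropping later repeats preserves the original ordering of the remaining grid points.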

Usage Notes

We provide a Python script, extraction_data.py, that can be used to pull the data of interest from VIB5.json (Box 1). It is provided together with the database file at https://doi.org/10.6084/m9.figshare.16903288 (ref. 79).