Abstract
While density functional theory (DFT) is often an accurate and efficient methodology for evaluating molecular properties such as energies and multipole moments, this approach often yields larger errors for response properties such as the dipole polarizability (α), which describes the tendency of a molecule to form an induced dipole moment in the presence of an electric field. In this work, we provide static α tensors (and other molecular properties such as total energy components, dipole and quadrupole moments, etc.) computed using quantum chemical (QC) and DFT methodologies for all 7,211 molecules in the QM7b database. We also provide the same quantities for the 52 molecules in the AlphaML showcase database, which includes the DNA/RNA nucleobases, uncharged amino acids, several open-chain and cyclic carbohydrates, five popular pharmaceutical molecules, and 23 isomers of C_{8}H_{n}. All QC calculations were performed using linear-response coupled-cluster theory including single and double excitations (LR-CCSD), a sophisticated approach for electron correlation, and the d-aug-cc-pVDZ basis set to mitigate basis set incompleteness error. DFT calculations employed the B3LYP and SCAN0 hybrid functionals, in conjunction with d-aug-cc-pVDZ (B3LYP and SCAN0) and d-aug-cc-pVTZ (B3LYP).
Design Type(s) | chemical structure classification objective • chemical reaction data analysis objective • modeling and simulation objective |
Measurement Type(s) | chemical structure analysis |
Technology Type(s) | ab initio quantum chemistry computational method |
Factor Type(s) | atom |
Sample Characteristic(s) |
Background & Summary
The molecular dipole polarizability, α, describes the tendency of a molecule to form an induced dipole moment in the presence of an external electric field. Knowledge of this fundamental response property is central to describing non-bonded interactions (such as induction and dispersion) between molecules in clusters or the condensed phase^{1,2,3}, computing Raman and sum frequency generation (SFG) spectra^{4,5,6,7}, and developing polarizable force fields^{8,9,10,11,12}. When compared to other ground-state molecular properties (e.g., multipole moments), the theoretical prediction of the α tensor is considerably more difficult to obtain, as this quantity is often more sensitive to the description of the underlying molecular electronic structure. In this regard, benchmark ab initio calculations of α are quite challenging to perform, as they require a simultaneous treatment of sophisticated electron correlation effects as well as mitigation of basis set incompleteness error to ensure sufficiently accurate and converged results.
To obtain benchmark values for α in molecular systems with a sizeable HOMO-LUMO gap (i.e., systems that are well-described by a single-reference wavefunction), one can utilize quantum chemical methods such as linear-response coupled-cluster theory (LR-CC)^{13,14,15}, which provides an accurate and reliable treatment of electron correlation. The downside of such wavefunction-based approaches is the large (and often prohibitive) computational cost associated with the inclusion of higher order excitations in the CC expansion. For example, LR-CC at the lowest order includes single and double excitations (LR-CCSD), and scales as O(n^{6}), where n is a measure of the system size (i.e., the number of orbitals). This computational cost keeps increasing as higher order excitations are included, and scales as O(n^{8}) with the inclusion of triple excitations (LR-CCSDT) and O(n^{10}) with the further inclusion of quadruple excitations (LR-CCSDTQ). As a result of this steep rise in the cost, such calculations are computationally prohibitive, even when one is dealing with relatively small molecules containing only 10–15 heavy (non-hydrogen) atoms. In addition to the computational cost required for a wavefunction-based treatment of the electron correlation, the error introduced by the use of a finite one-electron basis set is another factor that needs to be considered when computing α. In this regard, basis set incompleteness error in the prediction of α can be more severe than the error due to the lack of higher order (e.g., beyond doubles) excitations^{16,17,18,19}.
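As a rough worked illustration of these scaling laws (assuming the cost is dominated by the leading power of n and ignoring prefactors, which is a simplification), doubling the number of orbitals multiplies the cost by:

```latex
\[
\frac{(2n)^{6}}{n^{6}} = 64 \;\;(\text{LR-CCSD}), \qquad
\frac{(2n)^{8}}{n^{8}} = 256 \;\;(\text{LR-CCSDT}), \qquad
\frac{(2n)^{10}}{n^{10}} = 1024 \;\;(\text{LR-CCSDTQ}).
\]
```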
In this work, we provide static (frequency-independent) α tensors computed using LR-CCSD and hybrid density functional theory (DFT) for all molecules in the QM7b^{20,21,22} and AlphaML showcase databases^{23}. The QM7b database^{20,21,22} has become one of the de facto standard databases for machine-learning (ML) applications in chemistry, and contains N = 7,211 small organic molecules with up to seven heavy atoms (i.e., C, N, O, S, and Cl) and varying levels of H saturation. Recently introduced by Wilkins et al.^{23} for testing the transferability of ML-based predictions of α, the AlphaML showcase database consists of N = 52 larger organic molecules (with up to 16 heavy atoms), and includes the DNA/RNA nucleobases, uncharged amino acids, several open-chain and cyclic carbohydrates, five popular pharmaceutical molecules, and 23 isomers of C_{8}H_{n} (see Fig. 1). The diversity of structures in this combination of databases includes alkanes, alkenes, alkynes, (hetero)cycles, carbonyl and carboxyl groups, cyanides, amides, alcohols, amines, thiols, ethers, and epoxides, thereby providing a meaningful survey of α across a wide swath of chemical compound space.
Reference values for α were obtained with LR-CCSD with the doubly-augmented d-aug-cc-pVDZ basis set of Woon and Dunning^{19}, as this method (when employed in conjunction with a sufficiently large and diffuse one-particle basis set) has been shown to yield accurate and reliable predictions for α^{16,17,18,24}. The use of d-aug-cc-pVDZ greatly mitigates the basis set incompleteness error at the double-ζ level, and the validity of this basis set choice will be critically examined and discussed in more detail below. For comparative purposes, we also provide finite-field DFT values for α obtained with the popular B3LYP^{25,26} and SCAN0^{27} hybrid functionals in conjunction with the d-aug-cc-pVDZ (B3LYP and SCAN0) and d-aug-cc-pVTZ (B3LYP only) basis sets. Throughout the remainder of this work, the d-aug-cc-pVXZ basis sets (with X = D and T) will be referred to as daXZ, and all LR-CCSD/daDZ calculations will simply be denoted by CCSD/daDZ unless otherwise specified.
Methods
In this section, we provide the conventions used in generating and processing the geometries of the molecules in the QM7b and AlphaML showcase databases, all relevant computational details to ensure reproducibility of the quantum mechanical data, as well as a summary of the codes employed in this work.
Molecular Cartesian coordinates in the QM7b and AlphaML showcase databases
The molecular geometries for all 7,211 species in the QM7b database^{20,21,22} were obtained online via the quantum-machine.org website^{28}. All QM7b molecular geometries were first translated to their respective center of nuclear (ionic) charge, to remove the origin-dependence of the higher-order (i.e., quadrupole) multipole moments. Using farthest-point sampling (FPS)^{29}, all molecules were then reordered using a kernel-based similarity measure^{30}, and relabelled accordingly from molecule0001 to molecule7211 (padded to four digits with leading zeros). For consistency with the QM7b database, all 52 molecules in the AlphaML showcase database (see Fig. 1) were optimized with DFT using the PBE functional^{31} and a converged numerical atom-centered basis (i.e., tight settings with the tier-2 basis set in FHI-AIMS)^{32}. All AlphaML showcase molecules were also translated to their respective center of nuclear (ionic) charge, and are labelled from showcase0001 to showcase0052, as depicted in Fig. 1. All 7,263 structures are available on Materials Cloud^{33}, according to the format described below in the Data Records section.
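The translation to the center of nuclear charge can be sketched as follows. This is a minimal illustration, not the script used to prepare the databases: the function names and the three-atom example geometry are hypothetical, and the center is assumed to be the nuclear-charge-weighted average of the atomic positions.

```python
import numpy as np

def center_of_nuclear_charge(charges, coords):
    """Return sum_i Z_i * R_i / sum_i Z_i for nuclear charges Z_i and positions R_i."""
    charges = np.asarray(charges, dtype=float)
    coords = np.asarray(coords, dtype=float)
    return charges @ coords / charges.sum()

def translate_to_charge_center(charges, coords):
    """Shift all atoms so that the center of nuclear charge sits at the origin."""
    coords = np.asarray(coords, dtype=float)
    return coords - center_of_nuclear_charge(charges, coords)

# Hypothetical three-atom example (an O atom and two H atoms); coordinates are made up.
Z = [8, 1, 1]
xyz = [[0.0, 0.0, 0.1], [0.0, 0.75, -0.5], [0.0, -0.75, -0.5]]
shifted = translate_to_charge_center(Z, xyz)
```

After the shift, the charge-weighted sum of the coordinates vanishes, which is the condition that fixes the origin for the quadrupole moments.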
Details of the quantum mechanical calculations
All CCSD/daDZ, B3LYP/daDZ, and B3LYP/daTZ calculations were carried out using Psi4 v1.1^{34}, while all SCAN0/daDZ calculations were performed with Q-Chem v5.0^{35}. At the CCSD/daDZ level, all α tensors, unrelaxed dipole moments, μ, and unrelaxed quadrupole moments, Q, were calculated using LR-CCSD/daDZ, with the exception of the ten largest molecules in the AlphaML showcase database (i.e., (18) Phenylalanine, (19) Tyrosine, (20) Tryptophan, (21) Caffeine, (23) Aspirin, (25) Acyclovir, (26) D-Fructose, (27) β-D-Fructofuranose, (28) D-Glucose, and (29) α-D-glucopyranose, see Fig. 1). For these molecules, the memory required to solve the Λ-CC equations at the LR-CCSD/daDZ level was prohibitive, and only energy calculations with CCSD/daDZ could be performed with the available computational resources. For consistency with the unrelaxed LR-CCSD properties, this required the use of the orbital-unrelaxed finite-field method, in which the molecular orbitals were obtained from a field-free (unperturbed) Hartree-Fock calculation. To obtain μ and α, we computed first and second derivatives of the CCSD/daDZ energy (U) with respect to an external electric field, E, i.e., μ = −∂U/∂E and α = −∂^{2}U/∂E^{2}. Q values were not computed for the ten largest molecules in the AlphaML showcase database. All DFT calculations used the orbital-relaxed finite-field method, in which a self-consistent field (SCF) solution was obtained in the presence of each applied field, and α was computed via α = ∂μ/∂E. All other molecular properties at the DFT level (vide infra) were obtained directly from the field-free (unperturbed) calculation. All derivatives were computed numerically using two-point (for first derivatives) and three-point (for second derivatives) central difference formulae and a step size of E = 1.8897261250 × 10^{−5} atomic units.
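The central-difference formulae mentioned above can be sketched for a single field direction as follows. This is an illustration under stated assumptions rather than the production workflow: the sign convention U(E) = U_0 − μE − ½αE^{2} is assumed, the function names are hypothetical, and the energy and dipole callables stand in for real electronic structure calculations. The quadratic toy model and its parameters are made up for demonstration only.

```python
H = 1.8897261250e-5  # field step in atomic units (value quoted in the text)

def dipole_from_energy(energy, h=H):
    """Two-point central difference: mu = -dU/dE at zero field."""
    return -(energy(+h) - energy(-h)) / (2.0 * h)

def alpha_from_energy(energy, h=H):
    """Three-point central difference: alpha = -d2U/dE2 at zero field."""
    return -(energy(+h) - 2.0 * energy(0.0) + energy(-h)) / h**2

def alpha_from_dipole(dipole, h=H):
    """Two-point central difference: alpha = dmu/dE at zero field."""
    return (dipole(+h) - dipole(-h)) / (2.0 * h)

# Toy model with made-up parameters, standing in for a real calculation.
U0, MU0, ALPHA0 = -76.0, 0.7, 10.0

def toy_energy(field):
    return U0 - MU0 * field - 0.5 * ALPHA0 * field**2

def toy_dipole(field):
    return MU0 + ALPHA0 * field

mu_fd = dipole_from_energy(toy_energy)
alpha_fd_energy = alpha_from_energy(toy_energy)
alpha_fd_dipole = alpha_from_dipole(toy_dipole)
```

Because the second difference divides tiny energy changes by h^{2} (about 3.6 × 10^{−10} here), numerical noise in the energies is strongly amplified, which is consistent with the need for tight convergence criteria in finite-field calculations.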
For all LR-CCSD/daDZ calculations, the convergence criteria were set to their default values in Psi4, i.e., E_convergence = 1.0E-10 and D_convergence = 1.0E-10 for the energy and density during the solution of the HF equations, and E_convergence = 1.0E-08 and R_convergence = 1.0E-07 for the energy and residuals during the solution of the CCSD equations. For the ten largest molecules in the AlphaML showcase database, the finite-field CCSD/daDZ calculations were performed using the following convergence criteria in Psi4: E_convergence = 5.0E-10 and D_convergence = 5.0E-10 for the energy and density during the solution of the HF equations. Significantly tighter convergence criteria of E_convergence = 5.0E-10 and R_convergence = 5.0E-09 were employed for the energy and residuals during the solution of the CCSD equations to minimize errors in the numerical evaluation of μ and α. The frozen core (FC) approximation and scf_type = direct were used for all LR-CCSD/daDZ and CCSD/daDZ calculations. For all B3LYP/daDZ and B3LYP/daTZ calculations, the convergence criteria in Psi4 were again set to tight values to minimize numerical error in the finite-difference evaluation of α: E_convergence = 1.0E-10 and D_convergence = 1.0E-10 for the energy and density during the solution of the Kohn-Sham equations. For all the SCAN0/daDZ calculations, the convergence criteria were set to scf_convergence = 1.0E-10 and thresh = 1.0E-13 for the DIIS error and integral thresholding in Q-Chem. The Dunning-style daDZ and daTZ basis sets^{19} were obtained from the EMSL Basis Set Library^{36,37}.
Data Records
In this section, we briefly describe the molecular properties that have been computed in this work, as well as the conventions used to store and retrieve the generated data. In addition to a select set of molecular properties (such as energetic components, dipole and quadrupole moments, orbital eigenvalues, etc.), the provided data will also include the full output files from all of the calculations performed herein. In what follows, we focus the discussion on α, as this molecular response property is arguably the most challenging quantity computed in this work. In particular, we provide a statistical summary of the CCSD/daDZ α data in the QM7b and AlphaML databases, as well as a comparative analysis of the different quantum mechanical methods employed in this work.
Included molecular properties and file format
To store and disseminate the data generated in this work, we have created the following four data packages: CCSD_daDZ, B3LYP_daDZ, SCAN0_daDZ, and B3LYP_daTZ. Each data package contains 7,263 standard xyz files, and has been named according to the level of theory used to generate the data contained therein. Each of the included xyz files contains the translated geometries and calculated properties for a single molecule in the QM7b and AlphaML showcase databases. As described above, the 7,211 molecules in the QM7b database are contained in xyz files labelled from molecule0001 to molecule7211, and the 52 molecules in the AlphaML showcase database are contained in xyz files labelled from showcase0001 to showcase0052 (see Fig. 1).
All computed properties (for a given molecule) are provided on the “comment line” (i.e., the second line) of the corresponding xyz file (as comma-separated values), following the order provided in Table 1 (for CCSD_daDZ) and Table 2 (for B3LYP_daDZ, SCAN0_daDZ, and B3LYP_daTZ). Common molecular properties included in all four data packages are: the isotropic polarizability (α_{iso}), the anisotropic polarizability (α_{aniso}), all symmetry-unique components of the polarizability (α) tensor (i.e., α_{xx}, α_{yy}, α_{zz}, α_{xy}, α_{xz}, α_{yz}), all components of the dipole moment (μ) vector (i.e., μ_{x}, μ_{y}, μ_{z}), and all symmetry-unique components of the quadrupole moment (Q) tensor (i.e., Q_{xx}, Q_{yy}, Q_{zz}, Q_{xy}, Q_{xz}, Q_{yz}). In the CCSD_daDZ data package only, the following molecular properties are also included: the Hartree-Fock total energy (\({E}_{{\rm{tot}}}^{{\rm{HF}}}\)), same-spin (\({E}_{{\rm{ss}}}^{{\rm{MP}}2}\)) and opposite-spin (\({E}_{{\rm{os}}}^{{\rm{MP}}2}\)) correlation energies at the level of second-order Møller-Plesset perturbation (MP2) theory, and same-spin (\({E}_{{\rm{ss}}}^{{\rm{CCSD}}}\)) and opposite-spin (\({E}_{{\rm{os}}}^{{\rm{CCSD}}}\)) correlation energies at the CCSD level. In the B3LYP_daDZ, SCAN0_daDZ, and B3LYP_daTZ data packages only, the following molecular properties are also included: the DFT total energy (\({E}_{{\rm{tot}}}^{{\rm{DFT}}}\)) and the eigenvalues corresponding to the HOMO (\({\epsilon }_{{\rm{HOMO}}}\)) and LUMO (\({\epsilon }_{{\rm{LUMO}}}\)). All data described above are provided in atomic units and are available for download on Materials Cloud^{33}.
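A minimal sketch of how such a file could be read is given below. The helper names are hypothetical, the sample file content is made up (three placeholder numbers on the comment line, not real data), and every comment-line entry is assumed to be numeric. The mapping of each position to a named property must be taken from Table 1 or Table 2, so the parser deliberately returns a plain list.

```python
import io

def qm7b_label(index):
    """Label for a QM7b molecule, e.g. 1 -> 'molecule0001'."""
    return f"molecule{index:04d}"

def showcase_label(index):
    """Label for an AlphaML showcase molecule, e.g. 52 -> 'showcase0052'."""
    return f"showcase{index:04d}"

def read_xyz_with_properties(handle):
    """Parse an xyz file whose second line holds comma-separated numeric properties."""
    lines = handle.read().splitlines()
    n_atoms = int(lines[0].strip())
    # Comment line: comma-separated values in the order given by Table 1 or Table 2.
    properties = [float(value) for value in lines[1].split(",")]
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    return properties, atoms

# Made-up file content for illustration only.
sample = """3
1.0, 2.0, 3.0
O 0.0 0.0 0.1
H 0.0 0.75 -0.5
H 0.0 -0.75 -0.5
"""
properties, atoms = read_xyz_with_properties(io.StringIO(sample))
```

For a file on disk, the same function can be called on an open text file handle instead of the in-memory string.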
Statistical summary of the CCSD/daDZ α data
To provide an overview of the α data, Fig. 2 contains the normalized probability distributions of the CCSD/daDZ isotropic (α_{iso}, blue) and anisotropic (α_{aniso}, red) polarizabilities in the QM7b database. This is accompanied by Table 3, which provides a statistical analysis of all α data generated in this work. From Fig. 2 and Table 3, one can see that the CCSD/daDZ α_{iso} values in the QM7b database have a range of 16.80–106.50 Bohr^{3}, and are centered around a mean value of 〈α〉 = 74.07 Bohr^{3}. With a standard deviation (σ) that is nearly two times larger than α_{iso}, the CCSD/daDZ α_{aniso} values in this database are characterized by a broader distribution that is significantly skewed to the right. We note in passing that the range of α_{aniso} is larger than α_{iso} by ≈64%, and includes minimum values that are significantly smaller (cf. \({\alpha }_{{\rm{aniso}}}^{{\rm{\min }}}=2.16\times 1{0}^{-4}\) Bohr^{3} vs. \({\alpha }_{{\rm{iso}}}^{{\rm{\min }}}=16.80\) Bohr^{3}) and maximum values that are significantly larger (cf. \({\alpha }_{{\rm{aniso}}}^{{\rm{\max }}}=147.49\) Bohr^{3} vs. \({\alpha }_{{\rm{iso}}}^{{\rm{\max }}}=106.50\) Bohr^{3}). Also depicted in Fig. 2 are the subset of molecules in the QM7b database with the smallest and largest α_{iso} and α_{aniso} values (as well as those molecules with intermediate α_{iso} and α_{aniso} values of ≈20, 40, 60, and 80 Bohr^{3}). From these molecules, one clearly sees that α_{iso} is an extensive quantity that grows with molecular size, and α_{aniso} (which is a measure of the anisotropy in the α tensor, see Eq. (2)) is largest for molecules with elongated and non-spherical/asymmetric shapes.
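To make the two scalar measures concrete, the sketch below evaluates them from the six symmetry-unique tensor components. It assumes the commonly used definitions, \({\alpha }_{{\rm{iso}}}=\frac{1}{3}({\alpha }_{xx}+{\alpha }_{yy}+{\alpha }_{zz})\) and \({\alpha }_{{\rm{aniso}}}=\frac{1}{\sqrt{2}}{[{({\alpha }_{xx}-{\alpha }_{yy})}^{2}+{({\alpha }_{yy}-{\alpha }_{zz})}^{2}+{({\alpha }_{zz}-{\alpha }_{xx})}^{2}+6({\alpha }_{xy}^{2}+{\alpha }_{xz}^{2}+{\alpha }_{yz}^{2})]}^{1/2}\); these are assumed here to correspond to the definitions referred to in the text, and the function names and example tensors are hypothetical.

```python
import math

def alpha_iso(xx, yy, zz):
    """Isotropic polarizability: one third of the trace of the tensor."""
    return (xx + yy + zz) / 3.0

def alpha_aniso(xx, yy, zz, xy, xz, yz):
    """Anisotropic polarizability from the six symmetry-unique components."""
    diagonal = (xx - yy) ** 2 + (yy - zz) ** 2 + (zz - xx) ** 2
    off_diagonal = 6.0 * (xy ** 2 + xz ** 2 + yz ** 2)
    return math.sqrt(0.5 * (diagonal + off_diagonal))

# A spherical (isotropic) tensor has no anisotropy.
spherical = alpha_aniso(5.0, 5.0, 5.0, 0.0, 0.0, 0.0)

# An axially symmetric tensor: the anisotropy reduces to |alpha_parallel - alpha_perp|.
elongated_iso = alpha_iso(1.0, 1.0, 4.0)
elongated_aniso = alpha_aniso(1.0, 1.0, 4.0, 0.0, 0.0, 0.0)
```

With these definitions, a molecule that responds equally in all directions has a vanishing anisotropy, while an elongated molecule that is more polarizable along one axis has a large one, in line with the trends described above.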
The statistical summary of the α data corresponding to the 52 molecules in the AlphaML showcase database (see Table 3) also illustrates that the α_{iso} (α_{aniso}) distributions in this database are characterized by 〈α〉 values that are larger by ≈34% (≈23%) and σ values that are 2.5× (2.0×) larger than those found in the QM7b database. In addition, the range of α_{iso} (α_{aniso}) values is approximately 25% (12%) larger in the AlphaML showcase database, and does not include symmetric molecules with vanishingly small α_{aniso} values. Taken together, these statistical measures reflect the fact that the molecules in the AlphaML showcase database, which includes the DNA/RNA nucleobases, uncharged amino acids, several open-chain and cyclic carbohydrates, five popular pharmaceutical molecules, and 23 isomers of C_{8}H_{n}, are (in general) larger and more diverse than those contained in the QM7b database (see Fig. 1).
Comparative analysis of the quantum mechanical methodologies
To investigate the performance of different quantum mechanical methodologies in calculating the α tensor in the QM7b and AlphaML showcase databases, a detailed statistical error analysis was carried out for the following combinations of methods (Level//Reference Level): B3LYP/daDZ//CCSD/daDZ and SCAN0/daDZ//CCSD/daDZ (to compare the electron correlation level while keeping the basis set fixed), B3LYP/daDZ//SCAN0/daDZ (to compare the exchange-correlation functional while keeping the basis set fixed), and B3LYP/daDZ//B3LYP/daTZ (to quantify the basis set incompleteness error at the B3LYP level). A summary of statistical error measures, including the mean signed error, \({\rm{MSE}}\equiv \frac{1}{N}{\sum }_{i=1}^{N}({\alpha }_{i}-{\alpha }_{i}^{{\rm{ref}}})\), mean absolute error, \({\rm{MAE}}\equiv \frac{1}{N}{\sum }_{i=1}^{N}\left|{\alpha }_{i}-{\alpha }_{i}^{{\rm{ref}}}\right|\), and root-mean-square error, \({\rm{RMSE}}\equiv \sqrt{\frac{1}{N}{\sum }_{i=1}^{N}{({\alpha }_{i}-{\alpha }_{i}^{{\rm{ref}}})}^{2}}\), is provided in Table 4 together with the corresponding percent errors, \({\rm{MSPE}}\equiv \frac{1}{N}{\sum }_{i=1}^{N}\left(\frac{{\alpha }_{i}-{\alpha }_{i}^{{\rm{ref}}}}{{\alpha }_{i}^{{\rm{ref}}}}\right)\times 100{\rm{\% }}\), \({\rm{MAPE}}\equiv \frac{1}{N}{\sum }_{i=1}^{N}\left|\frac{{\alpha }_{i}-{\alpha }_{i}^{{\rm{ref}}}}{{\alpha }_{i}^{{\rm{ref}}}}\right|\times 100{\rm{\% }}\), and \({\rm{RMSPE}}\equiv \sqrt{\frac{1}{N}{\sum }_{i=1}^{N}{\left(\frac{{\alpha }_{i}-{\alpha }_{i}^{{\rm{ref}}}}{{\alpha }_{i}^{{\rm{ref}}}}\right)}^{2}}\times 100{\rm{\% }}\). To visualize these differences in more detail, correlation plots (corresponding to the four method combinations above) for α_{iso} and α_{aniso} (as well as probability distributions of the signed percent error (SPE)) are provided in Fig. 3.
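The six error measures defined above can be evaluated with a few lines of NumPy. The sketch below is illustrative only: the function name is hypothetical and the two-element input arrays are made-up numbers, not values from the databases.

```python
import numpy as np

def error_summary(values, reference):
    """Return MSE, MAE, RMSE and their percent counterparts (MSPE, MAPE, RMSPE)."""
    values = np.asarray(values, dtype=float)
    reference = np.asarray(reference, dtype=float)
    diff = values - reference        # signed errors
    rel = diff / reference           # signed relative errors
    return {
        "MSE": diff.mean(),
        "MAE": np.abs(diff).mean(),
        "RMSE": np.sqrt((diff ** 2).mean()),
        "MSPE": 100.0 * rel.mean(),
        "MAPE": 100.0 * np.abs(rel).mean(),
        "RMSPE": 100.0 * np.sqrt((rel ** 2).mean()),
    }

# Made-up example: one overestimate and one underestimate of equal magnitude.
summary = error_summary([11.0, 19.0], [10.0, 20.0])
```

In this toy case the signed errors cancel (MSE of zero) while the MAE does not, which is the same diagnostic used in the text: an MSE close to the MAE signals a systematic error, whereas an MSE much smaller than the MAE signals a more random error profile.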
When comparing B3LYP/daDZ to the reference CCSD/daDZ level for the molecules in the QM7b database, one sees that B3LYP/daDZ yields essentially identical MSE and MAE values for α_{iso} (i.e., 1.91 Bohr^{3} and 1.92 Bohr^{3}, respectively), indicating that B3LYP/daDZ systematically overestimates α_{iso} values by ≈2.5% (see Table 4). With an RMSE value that is ≈21% greater than the MAE, the magnitudes of the B3LYP/daDZ errors show substantial variations from molecule to molecule; this is particularly evident for the molecules with large α_{iso} values in Fig. 3(a). When comparing SCAN0/daDZ to CCSD/daDZ, one sees that SCAN0/daDZ outperforms B3LYP/daDZ by a large margin in the prediction of α_{iso}, yielding reductions of ≈90%, ≈50%, and ≈40% in the MSE, MAE, and RMSE values, respectively. In this regard, our finding that the SCAN0 functional provides greatly improved estimates for α_{iso} is also consistent with the recent benchmark study by Lao et al.^{18} on the dipole polarizability surface of the gas-phase water molecule. With an MSE value that is nearly 5× smaller than the MAE, it is also worth noting that SCAN0/daDZ α_{iso} values only have a slightly positive systematic error. From a quick glance at Fig. 3(b), it is clear that SCAN0/daDZ (like B3LYP/daDZ) also has more difficulties when treating molecules with large α_{iso} values; this is indicative of the challenges that one faces when computing α, a response property which becomes substantially more non-additive as the size and complexity of the molecules increase. When comparing B3LYP/daDZ to SCAN0/daDZ, one obtains nearly identical MSE, MAE, and RMSE values, which indicates that: (i) B3LYP/daDZ systematically overestimates α_{iso} with respect to SCAN0/daDZ, and (ii) the magnitudes of the B3LYP/daDZ errors do not show substantial molecule-to-molecule variations. Both of these findings are confirmed in Fig. 3(c), where one sees that: (i) the SPE distribution is centered around 2.3% (and not zero), and (ii) there is clearly a very strong linear correlation between the B3LYP/daDZ and SCAN0/daDZ α_{iso} values that does not deteriorate with molecular size and complexity. When comparing B3LYP/daDZ to B3LYP/daTZ, one finds that the increase from double- to triple-ζ in the underlying basis set does not lead to significantly different α_{iso} values. This finding is consistent with the (in general) rapid convergence of DFT with respect to the occupied space and the relatively weak dependence of DFT on the virtual/unoccupied space.
From a quick glance at Table 4, one also sees that both B3LYP and SCAN0 (when compared to the reference CCSD/daDZ level) yield larger errors when predicting α_{aniso} than α_{iso}. When comparing B3LYP/daDZ to CCSD/daDZ, for example, the MSPE, MAPE, and RMSPE values increased from 2.52%, 2.54%, and 2.97% for α_{iso} to 9.19%, 9.34%, and 10.4% for α_{aniso}; a similar increase was observed when comparing SCAN0/daDZ to CCSD/daDZ. These findings demonstrate that the tensorial properties of α (which govern α_{aniso}) are more difficult to predict than the average of the diagonal elements (i.e., α_{iso} values). When compared to CCSD/daDZ, SCAN0/daDZ is no longer performing substantially better than B3LYP/daDZ and now exhibits a (nearly) systematic overestimation of α_{aniso}. By looking at the upper left insets in Fig. 3(a,b), one again sees that the errors made by B3LYP and SCAN0 increase for molecules with larger α_{aniso} values; this is indicative of the increasing importance of including electron correlation effects when predicting α_{aniso} for molecules that are larger in size and potentially more anisotropic. Among the DFT functionals at the daDZ level, one sees that B3LYP/daDZ overestimates α_{aniso} with respect to SCAN0/daDZ in most cases, and that B3LYP/daDZ and SCAN0/daDZ are now in better agreement with each other than with CCSD/daDZ. When comparing B3LYP/daDZ and B3LYP/daTZ, the B3LYP functional again shows rapid convergence with respect to the underlying basis set in the prediction of α_{aniso}.
When performing a similar analysis for the molecules in the AlphaML showcase database, most of the findings described above for the QM7b database still hold. One interesting distinction is the finding that SCAN0/daDZ no longer outperforms B3LYP/daDZ when predicting α_{iso} values for the larger and more complex molecules contained in the AlphaML showcase database; in the same breath, we note that SCAN0/daDZ still maintains a relatively small MSE value, which is indicative of an error profile that is more random (and less systematic) than B3LYP/daDZ (see Fig. 3(a,b)).
Technical Validation
In this section, we explore the validity and reliability of the CCSD/daDZ α data in the QM7b database.
Validation of the CCSD/daDZ α Data
Since α describes the response of a molecule to an applied electric field, an accurate and reliable treatment of this quantity is particularly sensitive to the description of the underlying electronic structure as well as the quality of the basis set. The highest level α values provided in this work were computed with LR-CCSD, a sophisticated wavefunction-based method that consistently yields highly accurate α values for equilibrium and non-equilibrium molecular geometries when used with sufficiently large (and sufficiently diffuse) basis sets^{16,17,18,24}. To account for the basis set incompleteness error, which is almost always larger than the contributions from higher-order (e.g., beyond doubles) excitations in coupled-cluster theory^{16,17,18,19}, we employed the daDZ basis set. Although daDZ is a double-ζ basis set containing a moderate number of polarization functions, the incorporation of two sets of augmented functions (i.e., double augmentation) significantly reduces the basis set incompleteness error in the prediction of α. To validate the accuracy of our CCSD/daDZ calculations, we performed a series of calculations using the larger daTZ basis set^{19}, which is arguably the largest Dunning-style basis that can be used to compute α for the molecules in the QM7b database without significant supercomputer resources. To proceed with this technical validation, we used the FPS algorithm^{29,30} to choose the 100 most diverse molecules in the QM7b database (which we denote as the FPS-100 database). Due to the prohibitively large computational cost associated with LR-CCSD calculations with the daTZ basis set, we were only able to compute α for the 24 smallest molecules (by number of basis functions) in the FPS-100 database. 
A statistical error analysis of the α_{iso} and α_{aniso} values for these 24 molecules is provided in Table 5, and a more extensive discussion regarding the basis set convergence of our CCSD/daDZ calculations can be found in the main text and Supplementary Information of Ref.^{23}. From Table 5, one can immediately see that the CCSD/daDZ α_{iso} values have similar MSE, MAE, and RMSE values of ≈0.20 Bohr^{3}, which corresponds to a MAPE of ≈0.4%. For α_{aniso}, a measure of the anisotropy in the α tensor, we report slightly larger errors corresponding to a MAPE of \(\lesssim 1{\rm{ \% }}\). When compared to the errors made by the hybrid DFT functionals employed in this work (with CCSD/daDZ as the reference), namely 2.5% (B3LYP/daDZ) and 1.3% (SCAN0/daDZ) for α_{iso} and 9.3% (B3LYP/daDZ) and 7.9% (SCAN0/daDZ) for α_{aniso}, we conclude that the basis set incompleteness errors in our reference α values are significantly smaller (see Table 4). As such, the CCSD/daDZ α tensors presented in this work should be accurate and reliable enough for use in the development (and assessment) of next-generation force fields, density functionals, and quantum chemical methodologies, as well as machine-learning based approaches for predicting this fundamental response property.
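The farthest-point sampling step used to select a diverse subset (such as FPS-100) can be sketched as a greedy selection on a pairwise distance matrix. This is a generic illustration, not the implementation used in this work: the function name is hypothetical, the starting index is an arbitrary choice, and the distance matrix is assumed to have been derived beforehand from the kernel-based similarity measure. The one-dimensional toy points below are made up.

```python
import numpy as np

def farthest_point_sampling(distances, n_select, start=0):
    """Greedily pick points, each time taking the one farthest from all picked so far."""
    distances = np.asarray(distances, dtype=float)
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = distances[start].copy()
    for _ in range(n_select - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, distances[nxt])
    return selected

# Toy example: four points on a line at positions 0, 1, 2, and 10.
positions = np.array([0.0, 1.0, 2.0, 10.0])
toy_distances = np.abs(positions[:, None] - positions[None, :])
picked = farthest_point_sampling(toy_distances, n_select=3)
```

Starting from the first point, the procedure first picks the outlier at position 10 and then the point at position 2, so the selection spreads across the available range before filling in nearby points.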
Code Availability
As mentioned above, three different software packages were utilized in this work. Psi4 v1.1^{34} is freely available from its official website^{38}, whereas Q-Chem v5.0^{35} and FHI-AIMS^{32} must be downloaded from their official sites^{39,40} with a signed license.
References
- 1.
Stone, A. The Theory of Intermolecular Forces 2nd edn (Oxford University Press, 2016).
- 2.
Hermann, J., DiStasio, R. A. Jr. & Tkatchenko, A. First-principles models for van der Waals interactions in molecules and materials: concepts, theory, and applications. Chem. Rev. 117, 4714–4758 (2017).
- 3.
Grimme, S. In The Chemical Bond: Chemical Bonding Across the Periodic Table. (eds Frenking, G. & Shaik, S.) Ch. 16 (Wiley-VCH, 2014).
- 4.
Shen, Y. R. Surface properties probed by second harmonic and sum-frequency generation. Nature 337, 519–525 (1989).
- 5.
Morita, A. & Hynes, J. T. A theoretical analysis of the sum frequency generation spectrum of the water surface. Chem. Phys. 258, 371–390 (2000).
- 6.
Luber, S., Iannuzzi, M. & Hutter, J. Raman spectra from ab initio molecular dynamics and its application to liquid S-methyloxirane. J. Chem. Phys. 141, 094503 (2014).
- 7.
Medders, G. R. & Paesani, F. Dissecting the molecular structure of the air/water interface from quantum simulations of the sum-frequency generation spectrum. J. Am. Chem. Soc. 138, 3912–3919 (2016).
- 8.
Sprik, M. & Klein, M. L. A polarizable model for water using distributed charge sites. J. Chem. Phys. 89, 7556–7560 (1988).
- 9.
Fanourgakis, G. S. & Xantheas, S. S. Development of transferable interaction potentials for water. V. Extension of the flexible, polarizable, Thole-type model potential (TTM3-F, v. 3.0) to describe the vibrational spectra of water clusters and liquid water. J. Chem. Phys. 128, 074506 (2008).
- 10.
Ponder, J. W. et al. Current status of the AMOEBA polarizable force field. J. Phys. Chem. B 114, 2549–2564 (2010).
- 11.
Medders, G. R., Babin, V. & Paesani, F. Development of a “first-principles” water potential with flexible monomers. III. Liquid phase properties. J. Chem. Theory Comput. 10, 2906–2910 (2014).
- 12.
Bereau, T., DiStasio, R. A. Jr., Tkatchenko, A. & von Lilienfeld, O. A. Non-covalent interactions across organic and biological subsets of chemical space: physics-based potentials parametrized from machine learning. J. Chem. Phys. 148, 241706 (2018).
- 13.
Monkhorst, H. J. Calculation of properties with the coupled-cluster method. Int. J. Quantum Chem. 12, 421–432 (1977).
- 14.
Koch, H. & Jørgensen, P. Coupled cluster response functions. J. Chem. Phys. 93, 3333–3344 (1990).
- 15.
Christiansen, O., Jørgensen, P. & Hättig, C. Response functions from Fourier component variational perturbation theory applied to a time-averaged quasienergy. Int. J. Quantum Chem. 68, 1–52 (1998).
- 16.
Hammond, J. R., Govind, N., Kowalski, K., Autschbach, J. & Xantheas, S. S. Accurate dipole polarizabilities for water clusters n = 2–12 at the coupled-cluster level of theory and benchmarking of various density functionals. J. Chem. Phys. 131, 214103 (2009).
- 17.
Hammond, J. R., de Jong, W. A. & Kowalski, K. Coupled-cluster dynamic polarizabilities including triple excitations. J. Chem. Phys. 128, 224102 (2008).
- 18.
Lao, K. U., Jia, J., Maitra, R. & DiStasio, R. A. Jr. On the geometric dependence of the molecular dipole polarizability in water: a benchmark study of higher-order electron correlation, basis set incompleteness error, core electron effects, and zero-point vibrational contributions. J. Chem. Phys. 149, 204303 (2018).
- 19.
Woon, D. E. & Dunning, T. H. Jr. Gaussian basis sets for use in correlated molecular calculations. IV. Calculation of static electrical response properties. J. Chem. Phys. 100, 2975–2988 (1994).
- 20.
Blum, L. C. & Reymond, J.-L. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc. 131, 8732–8733 (2009).
- 21.
Rupp, M., Tkatchenko, A., Müller, K.-R. & von Lilienfeld, O. A. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 108, 058301 (2012).
- 22.
Montavon, G. et al. Machine learning of molecular electronic properties in chemical compound space. New J. Phys. 15, 095003 (2013).
- 23.
Wilkins, D. M. et al. Accurate molecular polarizabilities with coupled cluster theory and machine learning. Proc. Natl. Acad. Sci. USA 116, 3401–3406 (2019).
- 24.
Christiansen, O., Gauss, J. & Stanton, J. F. Frequency-dependent polarizabilities and first hyperpolarizabilities of CO and H_{2}O from coupled cluster calculations. Chem. Phys. Lett. 305, 147–155 (1999).
- 25.
Becke, A. D. Density-functional thermochemistry. III. The role of exact exchange. J. Chem. Phys. 98, 5648–5652 (1993).
- 26.
Stephens, P. J., Devlin, F. J., Chabalowski, C. F. & Frisch, M. J. Ab Initio calculation of vibrational absorption and circular dichroism spectra using density functional force fields. J. Phys. Chem. 98, 11623–11627 (1994).
- 27.
Hui, K. & Chai, J.-D. SCAN-based hybrid and double-hybrid density functionals from models without fitted parameters. J. Chem. Phys. 144, 044114 (2016).
- 28.
The QM7b Dataset, http://quantum-machine.org/datasets (2013).
- 29.
Imbalzano, G. et al. Automatic selection of atomic fingerprints and reference configurations for machine-learning potentials. J. Chem. Phys. 148, 241730 (2018).
- 30.
Bartók, A. P., Kondor, R. & Csányi, G. On representing chemical environments. Phys. Rev. B 87, 184115 (2013).
- 31.
Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett. 77, 3865–3868 (1996).
- 32.
Blum, V. et al. Ab initio molecular simulations with numeric atom-centered orbitals. Comput. Phys. Commun. 180, 2175–2196 (2009).
- 33.
Yang, Y. et al. Quantum mechanical static dipole polarizabilities in the QM7b and AlphaML showcase databases. Materials Cloud, https://doi.org/10.24435/materialscloud:2019.0002/v2 (2019).
- 34.
Parrish, R. M. et al. Psi4 1.1: an open-source electronic structure program emphasizing automation, advanced libraries, and interoperability. J. Chem. Theory Comput. 13, 3185–3197 (2017).
- 35.
Shao, Y. et al. Advances in molecular quantum chemistry contained in the Q-Chem 4 program package. Mol. Phys. 113, 184–215 (2015).
- 36.
Feller, D. The role of databases in support of computational chemistry calculations. J. Comput. Chem. 17, 1571–1586 (1996).
- 37.
Schuchardt, K. L. et al. Basis set exchange: a community database for computational sciences. J. Chem. Inf. Model. 47, 1045–1052 (2007).
- 38.
The PSI4 Project. Psi4: Open-Source Quantum Chemistry, http://www.psicode.org (2017).
- 39.
Q-Chem Inc. Quantum Computational Software; Molecular Modeling; Visualization, http://www.q-chem.com (2015).
- 40.
Theory Department of the Fritz-Haber-Institut der Max-Planck-Gesellschaft. FHI-aims, https://aimsclub.fhi-berlin.mpg.de (2009).
Acknowledgements
Y.Y., K.U.L. and R.A.D. acknowledge support from Cornell University through start-up funding. D.M.W. and M.C. acknowledge support from the European Research Council (Horizon 2020 Grant Agreement No. 677013-HBMAP). M.C. and A.G. acknowledge funding by the MPG-EPFL Center for Molecular Nanoscience and the NCCR MARVEL, funded by the Swiss National Science Foundation. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357. This research also used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231. This work was also supported by a grant from the Swiss National Supercomputing Centre (CSCS) under Project ID s843, and by computer time from the EPFL scientific computing centre.
Author information
Contributions
Y.Y., K.U.L. and R.A.D. designed and performed all polarizability calculations. D.M.W., A.G. and M.C. implemented and performed the farthest point sampling of the QM7b database. All authors contributed to the design of the AlphaML showcase database, analyzed the data, and contributed to the writing of the manuscript.
Corresponding authors
Correspondence to Michele Ceriotti or Robert A. DiStasio Jr.
Ethics declarations
Competing Interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files associated with this article.