Quantum mechanical static dipole polarizabilities in the QM7b and AlphaML showcase databases

While density functional theory (DFT) is often an accurate and efficient methodology for evaluating molecular properties such as energies and multipole moments, this approach often yields larger errors for response properties such as the dipole polarizability (α), which describes the tendency of a molecule to form an induced dipole moment in the presence of an electric field. In this work, we provide static α tensors (and other molecular properties such as total energy components, dipole and quadrupole moments, etc.) computed using quantum chemical (QC) and DFT methodologies for all 7,211 molecules in the QM7b database. We also provide the same quantities for the 52 molecules in the AlphaML showcase database, which includes the DNA/RNA nucleobases, uncharged amino acids, several open-chain and cyclic carbohydrates, five popular pharmaceutical molecules, and 23 isomers of C8Hn. All QC calculations were performed using linear-response coupled-cluster theory including single and double excitations (LR-CCSD), a sophisticated approach for electron correlation, and the d-aug-cc-pVDZ basis set to mitigate basis set incompleteness error. DFT calculations employed the B3LYP and SCAN0 hybrid functionals, in conjunction with d-aug-cc-pVDZ (B3LYP and SCAN0) and d-aug-cc-pVTZ (B3LYP).

www.nature.com/scientificdata
factor that needs to be considered when computing α. In this regard, basis set incompleteness error in the prediction of α can be more severe than the error due to the lack of higher-order (e.g., beyond doubles) excitations [16][17][18][19].
In this work, we provide static (frequency-independent) α tensors computed using LR-CCSD and hybrid density functional theory (DFT) for all molecules in the QM7b [20][21][22] and AlphaML showcase [23] databases. The QM7b database [20][21][22] has become one of the de facto standard databases for machine-learning (ML) applications in chemistry, and contains N = 7,211 small organic molecules with up to seven heavy atoms (i.e., C, N, O, S, and Cl) and varying levels of H saturation. Recently introduced by Wilkins et al. [23] for testing the transferability of ML-based predictions of α, the AlphaML showcase database consists of N = 52 larger organic molecules (with up to 16 heavy atoms), and includes the DNA/RNA nucleobases, uncharged amino acids, several open-chain and cyclic carbohydrates, five popular pharmaceutical molecules, and 23 isomers of C8Hn (see Fig. 1). The diversity of structures in this combination of databases includes alkanes, alkenes, alkynes, (hetero)cycles, carbonyl and carboxyl groups, cyanides, amides, alcohols, amines, thiols, ethers, and epoxides, thereby providing a meaningful survey of α across a wide swath of chemical compound space.
Reference values for α were obtained using LR-CCSD with the doubly-augmented d-aug-cc-pVDZ basis set of Woon and Dunning [19], as this method (when employed in conjunction with a sufficiently large and diffuse one-particle basis set) has been shown to yield accurate and reliable predictions for α [16][17][18][24]. The use of d-aug-cc-pVDZ greatly mitigates the basis set incompleteness error at the double-ζ level, and the validity of this basis set choice will be critically examined and discussed in more detail below. For comparative purposes, we also provide finite-field DFT values for α obtained with the popular B3LYP [25,26] and SCAN0 [27] hybrid functionals in conjunction with the d-aug-cc-pVDZ (B3LYP and SCAN0) and d-aug-cc-pVTZ (B3LYP only) basis sets. Throughout the remainder of this work, the d-aug-cc-pVXZ basis sets (with X = D and T) will be referred to as daXZ, and all LR-CCSD/daDZ calculations will simply be denoted by CCSD/daDZ unless otherwise specified.

Methods
In this section, we provide the conventions used in generating and processing the geometries of the molecules in the QM7b and AlphaML showcase databases, all relevant computational details to ensure reproducibility of the quantum mechanical data, as well as a summary of the codes employed in this work.
Molecular Cartesian coordinates in the QM7b and AlphaML showcase databases.
The molecular geometries for all 7,211 species in the QM7b database [20][21][22] were obtained online via the quantum-machine.org website [28]. All QM7b molecular geometries were first translated to their respective center of nuclear (ionic) charge, to remove the origin-dependence of the higher-order (i.e., quadrupole) multipole moments. Using farthest-point sampling (FPS) [29], all molecules were then reordered using a kernel-based similarity measure [30], and relabelled accordingly from molecule0001 to molecule7211 (padded to four digits with leading zeros). For consistency with the QM7b database, all 52 molecules in the AlphaML showcase database (see Fig. 1) were optimized with DFT using the PBE functional [31] and a converged numerical atom-centered basis (i.e., tight settings with the tier-2 basis set in FHI-AIMS) [32]. All AlphaML showcase molecules were also translated to their respective center of nuclear (ionic) charge, and are labelled from showcase0001 to showcase0052, as depicted in Fig. 1.
Fig. 1 (caption fragment): (26-29) open-chain and cyclic carbohydrates; and (30-52) 23 isomers of C8Hn. Throughout this work, these molecules will be specified by "showcase" followed by the corresponding index (padded to four digits with leading zeros, e.g., showcase0001 to showcase0052).
Details of the quantum mechanical calculations.
All CCSD/daDZ, B3LYP/daDZ, and B3LYP/daTZ calculations were carried out using Psi4 v1.1 [34], while all SCAN0/daDZ calculations were performed with Q-Chem v5.0 [35]. At the CCSD/daDZ level, all α tensors, unrelaxed dipole moments, μ, and unrelaxed quadrupole moments, Q, were calculated using LR-CCSD/daDZ, with the exception of the ten largest molecules in the AlphaML showcase database (i.e., (18) Phenylalanine, (19) Tyrosine, (20) Tryptophan, (21) Caffeine, (23) Aspirin, (25) Acyclovir, (26) D-Fructose, (27) β-D-Fructofuranose, (28) D-Glucose, and (29) α-D-Glucopyranose, see Fig. 1). For these molecules, the memory required to solve the Λ-CC equations at the LR-CCSD/daDZ level was computationally prohibitive, and only energy calculations with CCSD/daDZ could be performed with the available computational resources. For consistency with the unrelaxed LR-CCSD properties, this required the use of the orbital-unrelaxed finite-field method, in which the molecular orbitals were obtained from a field-free (unperturbed) Hartree-Fock calculation. To obtain μ and α, we computed first and second derivatives of the CCSD/daDZ energy (U) with respect to an external electric field, E, i.e., μ = ∂U/∂E and α = ∂²U/∂E². Q values were not computed for the ten largest molecules in the AlphaML showcase database. All DFT calculations used the orbital-relaxed finite-field method, in which a self-consistent field (SCF) solution was obtained in the presence of each applied field, and α was computed via α = ∂μ/∂E. All other molecular properties at the DFT level (vide infra) were obtained directly from the field-free (unperturbed) calculation.
All derivatives were computed numerically using two-point (for first derivatives) and three-point (for second derivatives) central difference formulae and a step size of E = 1.8897261250 × 10⁻⁵ atomic units.
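The central difference formulae mentioned above can be sketched as follows. This is a minimal illustration for a single field direction, not the scripts used to generate the data. The function names and the quadratic model energy are hypothetical. The `sign` argument is an assumption: it defaults to the common convention U(E) = U0 − μE − ½αE², in which the derivatives carry a minus sign, and should be set to +1 if the field enters the Hamiltonian with the opposite sign.

```python
H = 1.8897261250e-5  # field step in atomic units (value quoted in the text)


def dipole_from_energy(u_plus, u_minus, h=H, sign=-1.0):
    """Two-point central difference of the energy, U(+h) and U(-h)."""
    return sign * (u_plus - u_minus) / (2.0 * h)


def alpha_from_energy(u_plus, u_zero, u_minus, h=H, sign=-1.0):
    """Three-point central difference of the energy (second derivative)."""
    return sign * (u_plus - 2.0 * u_zero + u_minus) / h**2


def alpha_from_dipole(mu_plus, mu_minus, h=H):
    """Two-point central difference of the dipole, alpha = d(mu)/dE."""
    return (mu_plus - mu_minus) / (2.0 * h)


def model_energy(field, u0=-40.0, mu=0.5, alpha=20.0):
    """Hypothetical model energy, U(E) = U0 - mu*E - alpha*E^2/2."""
    return u0 - mu * field - 0.5 * alpha * field**2


def model_dipole(field, mu=0.5, alpha=20.0):
    """Dipole of the same hypothetical model, mu(E) = mu + alpha*E."""
    return mu + alpha * field


mu_fd = dipole_from_energy(model_energy(H), model_energy(-H))
alpha_fd = alpha_from_energy(model_energy(H), model_energy(0.0), model_energy(-H))
alpha_relaxed_style = alpha_from_dipole(model_dipole(H), model_dipole(-H))
print(mu_fd, alpha_fd, alpha_relaxed_style)
```

Because the second difference divides small energy changes by h², the result is sensitive to numerical noise in the energies, which is one reason tight convergence thresholds matter for finite-field calculations. Off-diagonal tensor components would require fields applied along more than one axis and are not covered by this sketch.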
For all LR-CCSD/daDZ calculations, the convergence criteria were set to their default values in Psi4, i.e., E_convergence = 1.0E-10 and D_convergence = 1.0E-10 for the energy and density during the solution of the HF equations, and E_convergence = 1.0E-08 and R_convergence = 1.0E-07 for the energy and residuals during the solution of the CCSD equations. For the ten largest molecules in the AlphaML showcase database, the finite-field CCSD/daDZ calculations were performed using the following convergence criteria in Psi4: E_convergence = 5.0E-10 and D_convergence = 5.0E-10 for the energy and
Table 1. Calculated properties at the CCSD/daDZ level. All properties are in atomic units and are provided on the "comment line" (i.e., the second line) of a standard xyz file. Footnotes: (a) 04-09: α_xx, α_yy, α_zz, α_xy, α_xz, α_yz. (b) 10-12: μ_x, μ_y, μ_z. (c) 13-18: Q_xx, Q_yy, Q_zz, Q_xy, Q_xz, Q_yz. (d) Q values are not provided for the ten largest molecules in the AlphaML showcase database (see text for details).

Data Records
In this section, we briefly describe the molecular properties that have been computed in this work, as well as the conventions used to store and retrieve the generated data. In addition to a select set of molecular properties (such as energetic components, dipole and quadrupole moments, orbital eigenvalues, etc.), the provided data will also include the full output files from all of the calculations performed herein. In what follows, we focus the discussion on α, as this molecular response property is arguably the most challenging quantity computed in this work. In particular, we provide a statistical summary of the CCSD/daDZ α data in the QM7b and AlphaML databases, as well as a comparative analysis of the different quantum mechanical methods employed in this work.

Included molecular properties and file format.
To store and disseminate the data generated in this work, we have created the following four data packages: CCSD_daDZ, B3LYP_daDZ, SCAN0_daDZ, and B3LYP_daTZ. Each data package contains 7,263 standard xyz files, and has been named according to the level of theory used to generate the data contained therein. Each of the included xyz files contains the translated geometries and calculated properties for a single molecule in the QM7b and AlphaML showcase databases. As described above, the 7,211 molecules in the QM7b database are contained in xyz files labelled from molecule0001 to molecule7211, and the 52 molecules in the AlphaML showcase database are contained in xyz files labelled from showcase0001 to showcase0052 (see Fig. 1).
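Since the geometries in these files are stated to be translated to the center of nuclear charge, a reader may wish to check this for a given file. The sketch below reads a standard xyz file from a string and computes the nuclear-charge-weighted center. The function names are hypothetical, the example geometry is made up for illustration (it is not taken from the data packages), and the element table only covers the elements listed for QM7b.

```python
import numpy as np

# Atomic numbers for the elements mentioned in the text (H, C, N, O, S, Cl).
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8, "S": 16, "Cl": 17}


def read_xyz(text):
    """Parse a standard xyz file: atom count, comment line, then atoms."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    comment = lines[1]
    symbols, coords = [], []
    for line in lines[2:2 + n_atoms]:
        parts = line.split()
        symbols.append(parts[0])
        coords.append([float(x) for x in parts[1:4]])
    return symbols, np.array(coords), comment


def center_of_nuclear_charge(symbols, coords):
    """Return sum_i Z_i * r_i / sum_i Z_i."""
    z = np.array([ATOMIC_NUMBER[s] for s in symbols], dtype=float)
    return (z[:, None] * coords).sum(axis=0) / z.sum()


# Hypothetical water-like geometry, used only to exercise the functions.
EXAMPLE_XYZ = """3
placeholder comment line
O  0.000  0.000  0.117
H  0.000  0.757 -0.467
H  0.000 -0.757 -0.467
"""

symbols, coords, comment = read_xyz(EXAMPLE_XYZ)
center = center_of_nuclear_charge(symbols, coords)
translated = coords - center  # geometry with its center of charge at the origin
print(center)
```

For a file that has already been translated, `center` should be close to the zero vector, up to the precision with which the coordinates were written.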
All computed properties (for a given molecule) are provided on the "comment line" (i.e., the second line) of the corresponding xyz file (as comma-separated values), following the order provided in Table 1 (for CCSD_daDZ) and Table 2 (for B3LYP_daDZ, SCAN0_daDZ, and B3LYP_daTZ). Common molecular properties included in all four data packages are: the isotropic polarizability (α_iso = (α_xx + α_yy + α_zz)/3), the anisotropic polarizability (α_aniso), all symmetry-unique components of the polarizability (α) tensor (i.e., α_xx, α_yy, α_zz, α_xy, α_xz, α_yz), all components of the dipole moment (μ) vector (i.e., μ_x, μ_y, μ_z), and all symmetry-unique components of the quadrupole moment (Q) tensor (i.e., Q_xx, Q_yy, Q_zz, Q_xy, Q_xz, Q_yz). In the CCSD_daDZ data package only, the following molecular properties are also included: the Hartree-Fock total energy (E_tot^HF), same-spin (E_ss^MP2) and opposite-spin (E_os^MP2) correlation energies at the level of second-order Møller-Plesset perturbation (MP2) theory, and same-spin (E_ss^CCSD) and opposite-spin (E_os^CCSD) correlation energies at the CCSD level. In the B3LYP_daDZ, SCAN0_daDZ, and B3LYP_daTZ data packages only, the following molecular properties are also included: the DFT total energy (E_tot^DFT) and the eigenvalues corresponding to the HOMO (ε_HOMO) and LUMO (ε_LUMO). All data described above are provided in atomic units and available for download on Materials Cloud [33].
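As an illustration of how the comment line might be consumed, the sketch below extracts the polarizability tensor and dipole vector and derives α_iso and α_aniso from them. Several points are assumptions rather than facts from the data packages: the footnotes of Table 1 are read as 1-based positions in the comma-separated list (04-09 for α, 10-12 for μ), the function names are hypothetical, the example line contains made-up numbers, and α_aniso uses the standard anisotropy definition, since the expression used in this work is not reproduced in the text.

```python
import numpy as np


def parse_ccsd_comment(comment):
    """Extract alpha (3x3) and mu (3,) from a CCSD_daDZ-style comment line.

    Assumes 1-based entries 04-09 hold a_xx, a_yy, a_zz, a_xy, a_xz, a_yz
    and entries 10-12 hold mu_x, mu_y, mu_z. Other entries are ignored.
    """
    fields = comment.split(",")
    axx, ayy, azz, axy, axz, ayz = (float(v) for v in fields[3:9])
    alpha = np.array([[axx, axy, axz],
                      [axy, ayy, ayz],
                      [axz, ayz, azz]])
    mu = np.array([float(v) for v in fields[9:12]])
    return alpha, mu


def isotropic(alpha):
    """Average of the diagonal elements of the tensor."""
    return np.trace(alpha) / 3.0


def anisotropic(alpha):
    """Standard anisotropy of a symmetric tensor (assumed definition)."""
    d = np.diag(alpha)
    diag_part = (d[0] - d[1])**2 + (d[1] - d[2])**2 + (d[2] - d[0])**2
    off_part = 6.0 * (alpha[0, 1]**2 + alpha[0, 2]**2 + alpha[1, 2]**2)
    return np.sqrt(0.5 * (diag_part + off_part))


# Made-up comment line: three unused leading entries, then alpha and mu.
EXAMPLE_COMMENT = "0,0,0,1.0,1.0,4.0,0.0,0.0,0.0,0.1,0.2,0.3"

alpha, mu = parse_ccsd_comment(EXAMPLE_COMMENT)
print(isotropic(alpha), anisotropic(alpha), mu)
```

Only the slice of fields that is needed is converted to floating point, so the sketch does not depend on the content of the remaining entries. The layout for the DFT data packages follows Table 2 and may differ.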
Statistical summary of the CCSD/daDZ α data.
To provide an overview of the α data, Fig. 2 contains the normalized probability distributions of the CCSD/daDZ isotropic (α_iso, blue) and anisotropic (α_aniso, red) polarizabilities in the QM7b database. This is accompanied by Table 3, which provides a statistical analysis of all α data generated in this work. From Fig. 2 and Table 3, one can see that the CCSD/daDZ α_iso values in the QM7b database have a range of 16.80-106.50 Bohr³, and are centered around a mean value of 〈α〉 = 74.07 Bohr³. With a standard deviation (σ) that is nearly two times larger than that of α_iso, the CCSD/daDZ α_aniso values in this database are characterized by a broader distribution that is significantly skewed to the right. We note in passing that the range of α_aniso is larger than that of α_iso by ≈64%, and includes minimum values that are significantly smaller (cf. Bohr³). Also depicted in Fig. 2 is the subset of molecules in the QM7b database with the smallest and largest α_iso and α_aniso values (as well as those molecules with intermediate α_iso and α_aniso values of ≈20, 40, 60, and 80 Bohr³). From these molecules, one clearly sees that α_iso is an extensive quantity that grows with molecular size, and α_aniso (which is a measure of the anisotropy in the α tensor, see Eq. (2)) is largest for molecules with elongated and non-spherical/asymmetric shapes.
The statistical summary of the α data corresponding to the 52 molecules in the AlphaML showcase database (see Table 3) also illustrates that the α_iso (α_aniso) distributions in this database are characterized by 〈α〉 values that are larger by ≈34% (≈23%) and σ values that are 2.5× (2.0×) larger than those found in the QM7b database. In addition, the range of α_iso (α_aniso) values is approximately 25% (12%) larger in the AlphaML showcase database, and does not include symmetric molecules with vanishingly small α_aniso values. Taken together, these statistical measures reflect the fact that the molecules in the AlphaML showcase database, which includes the DNA/RNA nucleobases, uncharged amino acids, several open-chain and cyclic carbohydrates, five popular pharmaceutical molecules, and 23 isomers of C8Hn, are (in general) larger and more diverse than those contained in the QM7b database (see Fig. 1).
Table 3. Statistical analysis of the isotropic (α_iso) and anisotropic (α_aniso) polarizabilities in the QM7b and AlphaML showcase databases computed at the CCSD/daDZ, B3LYP/daDZ, SCAN0/daDZ, and B3LYP/daTZ levels. Statistical quantities (in Bohr³) include: 〈α〉 (mean), σ (standard deviation), α_min (minimum value), and α_max (maximum value).

Comparative analysis of the quantum mechanical methodologies.
To investigate the performance of different quantum mechanical methodologies in calculating the α tensor in the QM7b and AlphaML showcase databases, a detailed statistical error analysis was carried out for the following combinations of methods (Level//Reference Level): B3LYP/daDZ//CCSD/daDZ and SCAN0/daDZ//CCSD/daDZ (to compare the electron correlation level while keeping the basis set fixed), B3LYP/daDZ//SCAN0/daDZ (to compare the exchange-correlation functional while keeping the basis set fixed), and B3LYP/daDZ//B3LYP/daTZ (to quantify the basis set incompleteness error at the B3LYP level). A summary of statistical error measures, including the mean signed error (MSE), mean absolute error (MAE), and root-mean-square error (RMSE), is provided in Table 4, as well as the corresponding percent errors (MSPE, MAPE, and RMSPE). To visualize these differences in more detail, correlation plots (corresponding to the four method combinations above) for α_iso and α_aniso (as well as probability distributions of the signed percent error (SPE)) are provided in Fig. 3. When comparing B3LYP/daDZ to the reference CCSD/daDZ level for the molecules in the QM7b database, one sees that B3LYP/daDZ yields essentially identical MSE and MAE values for α_iso (i.e., 1.91 Bohr³ and 1.92 Bohr³, respectively), indicating that B3LYP/daDZ systematically overestimates α_iso values by ≈2.5% (see Table 4). With an RMSE value that is ≈21% greater than the MAE, the magnitudes of the B3LYP/daDZ errors show substantial variations from molecule to molecule; this is particularly evident for the molecules with large α_iso values in Fig. 3(a). When comparing SCAN0/daDZ to CCSD/daDZ, one sees that SCAN0/daDZ outperforms B3LYP/daDZ by a large margin in the prediction of α_iso, yielding reductions of ≈90%, ≈50%, and ≈40% in the MSE, MAE, and RMSE values, respectively. In this regard, our finding that the SCAN0 functional provides greatly improved estimates for α_iso is also consistent with the recent benchmark study by Lao et al. [18] on the dipole polarizability surface of the gas-phase water molecule.
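The error measures named above can be computed as in the following sketch. It uses the standard definitions of these quantities and assumes that percent errors are taken relative to the reference value; the function name and the sample numbers are hypothetical and are not data from this work.

```python
import numpy as np


def error_summary(pred, ref):
    """Signed, absolute, and root-mean-square errors of pred against ref,
    in absolute units and as percentages of the reference values."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    err = pred - ref            # signed error per molecule
    pct = 100.0 * err / ref     # signed percent error (SPE) per molecule
    return {
        "MSE": err.mean(),
        "MAE": np.abs(err).mean(),
        "RMSE": np.sqrt((err**2).mean()),
        "MSPE": pct.mean(),
        "MAPE": np.abs(pct).mean(),
        "RMSPE": np.sqrt((pct**2).mean()),
    }


# Made-up values: one overestimate and one underestimate of the reference.
summary = error_summary(pred=[11.0, 19.0], ref=[10.0, 20.0])
print(summary)
```

These definitions make the reasoning in the text easy to follow: when every error has the same sign the MSE and MAE coincide, which is why nearly identical MSE and MAE values indicate a systematic over- or underestimation, while an RMSE well above the MAE indicates that the error magnitudes vary from molecule to molecule.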
With an MSE value that is nearly 5× smaller than the MAE, it is also worth noting that SCAN0/daDZ α_iso values only have a slightly positive systematic error. From a quick glance at Fig. 3(b), it is clear that SCAN0/daDZ (like B3LYP/daDZ) also has more difficulties when treating molecules with large α_iso values; this is indicative of the challenges that one faces when computing α, a response property which becomes substantially more non-additive as the size and complexity of the molecules increase. When comparing B3LYP/daDZ to SCAN0/daDZ, one obtains nearly identical MSE, MAE, and RMSE values, which indicates that: (i) B3LYP/daDZ systematically overestimates α_iso with respect to SCAN0/daDZ, and (ii) the magnitudes of the B3LYP/daDZ errors do not show substantial molecule-to-molecule variations. Both of these findings are confirmed in Fig. 3(c), where one sees that: (i) the SPE distribution is centered around 2.3% (and not zero), and (ii) there is clearly a very strong linear correlation between the B3LYP/daDZ and SCAN0/daDZ α_iso values that does not deteriorate with molecular size and complexity. When comparing B3LYP/daDZ to B3LYP/daTZ, one finds that the increase from double- to triple-ζ in the underlying basis set does not lead to significantly different α_iso values. This finding is consistent with the (in general) rapid convergence of DFT with respect to the occupied space and the relatively weak dependence of DFT on the virtual/unoccupied space.
From a quick glance at Table 4, one also sees that both B3LYP and SCAN0 (when compared to the reference CCSD/daDZ level) yield larger errors when predicting α_aniso than α_iso.
When comparing B3LYP/daDZ to CCSD/daDZ, for example, the MSPE, MAPE, and RMSPE values increased from 2.52%, 2.54%, and 2.97% for α_iso to 9.19%, 9.34%, and 10.4% for α_aniso; a similar increase was observed when comparing SCAN0/daDZ to CCSD/daDZ. These findings demonstrate that the tensorial properties of α (which govern α_aniso) are more difficult to predict than the average of the diagonal elements (i.e., α_iso values). When compared to CCSD/daDZ, SCAN0/daDZ is no longer performing substantially better than B3LYP/daDZ and now exhibits a (nearly) systematic overestimation of α_aniso. By looking at the upper left insets in Fig. 3(a,b), one again sees that the errors made by B3LYP and SCAN0 increase for molecules with larger α_aniso values; this is indicative of the increasing importance of including electron correlation effects when predicting α_aniso for molecules that are larger in size and potentially more anisotropic. Among the DFT functionals at the daDZ level, one sees that B3LYP/daDZ overestimates α_aniso with respect to SCAN0/daDZ in most cases, and that B3LYP/daDZ and SCAN0/daDZ are now in better agreement with each other than with CCSD/daDZ. When comparing B3LYP/daDZ and B3LYP/daTZ, the B3LYP functional again shows rapid convergence with respect to the underlying basis set in the prediction of α_aniso.
When performing a similar analysis for the molecules in the AlphaML showcase database, most of the findings described above for the QM7b database still hold. One interesting distinction is the finding that SCAN0/daDZ no longer outperforms B3LYP/daDZ when predicting α_iso values for the larger and more complex molecules contained in the AlphaML showcase database; in the same breath, we note that SCAN0/daDZ still maintains a relatively small MSE value, which is indicative of an error profile that is more random (and less systematic) than B3LYP/daDZ (see Fig. 3(a,b)).

Technical Validation
In this section, we explore the validity and reliability of the CCSD/daDZ α data in the QM7b database.
Validation of the CCSD/daDZ α data.
Since α describes the response of a molecule to an applied electric field, an accurate and reliable treatment of this quantity is particularly sensitive to the description of the underlying electronic structure as well as the quality of the basis set. The highest level α values provided in this work were computed with LR-CCSD, a sophisticated wavefunction-based method that consistently yields highly accurate α values for equilibrium and non-equilibrium molecular geometries when used with sufficiently large (and sufficiently diffuse) basis sets [16][17][18][24]. To account for the basis set incompleteness error, which is almost always larger than the contributions from higher-order (e.g., beyond doubles) excitations in coupled-cluster theory [16][17][18][19], we employed the daDZ basis set. Although daDZ is a double-ζ basis set containing a moderate number of polarization functions, the incorporation of two sets of augmented functions (i.e., double augmentation) significantly reduces the basis set incompleteness error in the prediction of α. To validate the accuracy of our CCSD/daDZ calculations, we performed a series of calculations using the larger daTZ basis set [19], which is arguably the largest Dunning-style basis that can be used to compute α for the molecules in the QM7b database without significant supercomputer resources. To proceed with this technical validation, we used the FPS algorithm [29,30] to choose the 100 most diverse molecules in the QM7b database (which we denote as the FPS-100 database). Due to the prohibitively large computational cost associated with LR-CCSD calculations with the daTZ basis set, we were only able to compute α for the 24 smallest molecules (by number of basis functions) in the FPS-100 database.
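The selection of a diverse subset by farthest-point sampling can be sketched generically as follows. This is an illustration of the FPS idea only: the kernel-based similarity measure used in this work is not reproduced here, and the function name, the starting index, and the one-dimensional example points are hypothetical. The sketch takes any precomputed pairwise distance matrix; for a kernel k, one common choice (an assumption here) is the induced distance d_ij = sqrt(k_ii + k_jj − 2 k_ij).

```python
import numpy as np


def farthest_point_sampling(dist, n_select, start=0):
    """Greedy FPS on a symmetric pairwise distance matrix.

    Starting from index `start`, repeatedly pick the point whose distance
    to the closest already-selected point is largest.
    """
    dist = np.asarray(dist, dtype=float)
    selected = [start]
    min_dist = dist[start].copy()   # distance of each point to the selection
    for _ in range(n_select - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return selected


# Hypothetical 1-D descriptors for four "molecules".
points = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(points[:, None] - points[None, :])

order = farthest_point_sampling(dist, n_select=3)
print(order)
```

Because each new pick maximizes the distance to everything chosen so far, the first entries of an FPS ordering span the descriptor space broadly, which is the sense in which the first 100 molecules of the reordered QM7b database form a diverse subset.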
A statistical error analysis of the α_iso and α_aniso values for these 24 molecules is provided in Table 5, and a more extensive discussion regarding the basis set convergence of our CCSD/daDZ calculations can be found in the main text and Supplementary Information of Ref. [23]. From Table 5, one can immediately see that the CCSD/daDZ α_iso values have similar MSE, MAE, and RMSE values of ≈0.20 Bohr³, which corresponds to a MAPE of ≈0.4%. For α_aniso, a measure of the anisotropy in the α tensor, we report slightly larger errors corresponding to a MAPE of 1%. When compared to the errors made by the hybrid DFT functionals employed in this work (with CCSD/daDZ as the reference), namely 2.5% (B3LYP/daDZ) and 1.3% (SCAN0/daDZ) for α_iso and 9.3% (B3LYP/daDZ) and 7.9% (SCAN0/daDZ) for α_aniso, we conclude that the basis set incompleteness errors in our reference α values are significantly smaller (see Table 4). As such, the CCSD/daDZ α tensors presented in this work should be accurate and reliable enough for use in the development (and assessment) of next-generation force fields, density functionals, and quantum chemical methodologies, as well as machine-learning based approaches for predicting this fundamental response property.
Table 5. Statistical error analysis of the CCSD/daDZ isotropic (α_iso) and anisotropic (α_aniso) polarizabilities in the FPS-100 database (i.e., the first 100 molecules in the QM7b database chosen by the FPS algorithm) computed with respect to the CCSD/daTZ level. Due to the increased computational cost associated with computing α at the CCSD/daTZ level, a subset of the FPS-100 database (which includes the 24 molecules with the smallest number of basis functions) was considered during this analysis.