BSE49, a diverse, high-quality benchmark dataset of separation energies of chemical bonds

We present an extensive and diverse dataset of bond separation energies associated with the homolytic cleavage of covalently bonded molecules (A-B) into their corresponding radical fragments (A· and B·). Our dataset contains two different classifications of model structures referred to as “Existing” (molecules with associated experimental data) and “Hypothetical” (molecules with no associated experimental data). In total, the dataset consists of 4502 datapoints (1969 datapoints from the Existing and 2533 datapoints from the Hypothetical classes). The dataset covers 49 unique X-Y type single bonds (except H-H, H-F, and H-Cl), where X and Y are H, B, C, N, O, F, Si, P, S, and Cl atoms. All the reference data was calculated at the (RO)CBS-QB3 level of theory. The reference bond separation energies are non-relativistic ground-state energy differences and contain no zero-point energy corrections. This new dataset of bond separation energies (BSE49) is presented as a high-quality reference dataset for assessing and developing computational chemistry methods.
Measurement(s): bond separation energies. Technology Type(s): ab initio quantum chemistry computational method. Factor Type(s): existing or hypothetical model structure.
Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.16680787

lack bond-type diversity 26,27 or are calculated using less accurate computational chemistry methods compared to the one used in this work [28][29][30] . To the best of our knowledge, an accurate and extensive dataset of computationally predicted BSEs is not available in the literature. The main reason for this absence is that BSE calculations with high accuracy require computationally expensive methods that tend to scale poorly with system size.
This work addresses the aforementioned gap in the literature by constructing a large dataset (4502 datapoints) of computationally predicted BSEs of 49 unique bond types, all of which are determined with a high-level composite theoretical procedure denoted as (RO)CBS-QB3 [31][32][33] . This approach ensures uniform, high-quality reference data and eliminates the need to collect and verify data gathered from various sources, which may differ substantially in their accuracy. The (RO)CBS-QB3 method is known to produce BDEs of high accuracy 8,[33][34][35][36][37] . Therefore, it is suitable for developing a database of BSEs that can be used to test and parametrize low-cost computational methods. One particular target application of our dataset is for the training of cost-effective computational approaches like atom-centered potentials [38][39][40] (ACPs) or machine learning potentials [28][29][30] .

Methods
Dataset composition. We present the BSE49 dataset, which comprises a broad range of bond separation energies for 49 unique bond types. The model systems present in the dataset are neutral molecules with X-H, X-F, X-Cl, X-X, and X-Y single bonds, where X and Y are B, C, N, O, Si, P, and S. The number of datapoints and the ranges of bond separation energies associated with each bond type are provided in Table 1. The structures of model systems on which the calculations were performed are divided into "Existing" and "Hypothetical" classes. The Existing type structures were built by selecting molecules with experimental data reported in the Comprehensive Handbook of Chemical Bond Dissociation Energies 41 . In contrast, the Hypothetical type structures were constructed by functional group substitutions of X-Y single bonds in order to include bond types that were not present in the handbook and to increase the diversity and number of datapoints for each bond type in the dataset. The candidate molecules for both Existing and Hypothetical subsets were generated using a partially automated computational workflow as described below.
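The count of 49 bond types follows directly from the bond classes listed above. The short Python sketch below enumerates them as a consistency check; the label format ("X-Y") and the function name are illustrative choices, not part of the dataset.

```python
from itertools import combinations_with_replacement

# Heavy atoms X, Y named in the text; H, F and Cl appear only as bond partners.
HEAVY = ["B", "C", "N", "O", "Si", "P", "S"]
TERMINAL = ["H", "F", "Cl"]


def enumerate_bond_types():
    """Return the unique single-bond types as 'X-Y' labels."""
    # X-X and X-Y bonds between heavy atoms (unordered pairs, repeats allowed).
    heavy_pairs = [f"{x}-{y}" for x, y in combinations_with_replacement(HEAVY, 2)]
    # X-H, X-F and X-Cl bonds.
    terminal_pairs = [f"{x}-{t}" for x in HEAVY for t in TERMINAL]
    return heavy_pairs + terminal_pairs


bond_types = enumerate_bond_types()
print(len(bond_types))  # 28 heavy-heavy pairs + 21 X-H/X-F/X-Cl pairs = 49
```

Because H, F, and Cl never pair with each other in this scheme, bonds such as H-H, H-F, and H-Cl are excluded by construction.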
Dataset generation. The calculated bond separation energies are defined as the negative of the difference in the ground-state electronic energies for the reaction A-B → A· + B·, where A· and B· represent the two radical fragments formed by homolytically breaking the A-B covalent bond in vacuum. Based on this reaction, the equilibrium geometries of the parent molecules and their respective radical fragments are required to calculate the bond separation energies. The geometries of the parent molecule and the associated radicals were constructed manually for both Existing and Hypothetical subsets using the Avogadro 42 program. The constructed geometries were then used as starting points for a conformer search. The CSD conformer generator 43 and FullMonte 44 codes were used to generate multiple conformers. The geometry of each conformer was relaxed to the corresponding local minimum using the Gaussian 45 software package. This relaxation was first carried out with a low-level method that combines the B3LYP 46-51 density functional and the 6-31G* 52,53 basis set with the D3 54-56 dispersion correction scheme using Becke-Johnson 57 damping (B3LYP-D3(BJ)/6-31G*). The optimized conformers were ranked using the B3LYP-D3(BJ)/6-31G* relative energies at the local minima. The ten lowest-energy conformers were then re-optimized at the higher-level CAM-B3LYP-D3(BJ)/def2-TZVP level of theory [54][55][56][57][58][59] . Range-separated functionals like CAM-B3LYP minimize the delocalization error, which could be important in the description of radical species 60 . The lowest-energy conformer obtained in this procedure was used for calculating the bond separation energies using the composite method described below.
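The conformer ranking and the energy difference described above can be sketched in a few lines of Python. This is a minimal illustration, not the workflow actually used: the function names and the example energies are hypothetical, and the sign convention E(A·) + E(B·) − E(A-B) is an assumption chosen so that a bound molecule yields a positive value.

```python
HARTREE_TO_KCAL_PER_MOL = 627.509474  # standard unit conversion factor


def lowest_energy_conformers(energies, keep=10):
    """Return conformer labels sorted by energy, truncated to the `keep` lowest.

    `energies` maps a conformer label to its energy (any consistent unit).
    """
    ranked = sorted(energies, key=energies.get)
    return ranked[:keep]


def bond_separation_energy(e_parent, e_frag_a, e_frag_b):
    """Bond separation energy in kcal/mol from electronic energies in Hartree.

    Assumed sign convention: E(A.) + E(B.) - E(A-B).
    """
    return (e_frag_a + e_frag_b - e_parent) * HARTREE_TO_KCAL_PER_MOL


# Hypothetical energies, for illustration only.
conformers = {"conf_a": -3.0, "conf_b": -5.0, "conf_c": -4.0}
print(lowest_energy_conformers(conformers, keep=2))  # ['conf_b', 'conf_c']
print(round(bond_separation_energy(-1.0, -0.40, -0.45), 2))  # 94.13
```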
All calculations employed the default self-consistent field (SCF) convergence criterion of 10⁻⁸ Hartree, an ultrafine integration grid, and tight optimization convergence criteria (maximum force = 1.5 × 10⁻⁵ Hartree/Bohr, RMS force = 1 × 10⁻⁵ Hartree/Bohr, maximum displacement = 6 × 10⁻⁵ Bohr, RMS displacement = 4 × 10⁻⁵ Bohr). This partially automated workflow produced structures that are not necessarily the global minima. A visual inspection of the structures revealed that about 20% of the conformers generated do not correspond to the global minima, which reflects the difficulty of reliably solving a global optimization problem (finding the most stable conformer) for such a large number of systems. In addition, due to computational constraints, no attempt was made to evaluate the conformational energy landscape or to statistically weight the low-energy conformers associated with each molecule. Therefore, the dataset is not appropriate for direct comparison to bond separation energies obtained by back-correcting experimental BDEs, but it is suitable for testing and training computationally less expensive methods regarding their ability to accurately calculate the energy difference between the chosen conformers of products (A· and B·) and reactant (A-B).
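As an illustration of what the tight optimization criteria mean in practice, the following Python sketch checks a set of residual forces and displacements against the four thresholds quoted above. It is a simplified check under stated assumptions: the dictionary keys and the function name are hypothetical, and the actual convergence logic of the quantum chemistry program may include additional rules.

```python
# Thresholds quoted in the text, in atomic units.
TIGHT_CRITERIA = {
    "max_force": 1.5e-5,         # Hartree/Bohr
    "rms_force": 1.0e-5,         # Hartree/Bohr
    "max_displacement": 6.0e-5,  # Bohr
    "rms_displacement": 4.0e-5,  # Bohr
}


def is_converged(values, criteria=TIGHT_CRITERIA):
    """Return True if every monitored quantity is below its threshold."""
    return all(abs(values[name]) < limit for name, limit in criteria.items())


# Hypothetical optimization step, for illustration only.
step = {
    "max_force": 9.0e-6,
    "rms_force": 4.0e-6,
    "max_displacement": 5.0e-5,
    "rms_displacement": 2.0e-5,
}
print(is_converged(step))  # True
```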

Data Records
The reference bond separation energies (in kcal/mol) and coordinates (in Å) of the structures presented in the BSE49 dataset are publicly available free of charge from the Figshare 68 and GitHub (https://github.com/aoterodelaroza/bse49) repositories in the plain-text database file format (DB format) described in Table 2. The atomic coordinates of the model structures are stored in a plain-text XYZ format in the Geometries directory. The BSE49 dataset contains one DB format file and three XYZ format files for each bond separation energy. In total, the deposited files include 4502 DB format files stored in the db-BSE49 directory and 13506 XYZ format files stored in their respective Existing or Hypothetical classification directories. Additional files labelled BSE49_Existing.org and BSE49_Hypothetical.org are also provided. These files contain the necessary information about the reference data for all the model systems. The DB format file contains a header line specifying the reference energy value (in kcal/mol) followed by three 'molc' (short for molecule) blocks containing a unique integer identifier, charge, multiplicity, and the atomic coordinates (in Å) of the parent molecule and its corresponding radical fragments. The XYZ format file contains a header line defining the number of atoms N, a comment line containing the charge and multiplicity, and N lines, each containing the element type and the X, Y, Z coordinates (in Å). The BSE49_Existing.org and BSE49_Hypothetical.org files are plain-text files with multiple lines and eight columns separated by the special character '|'. The columns are: (i) dataset name of the model system, (ii) unique integer identifier 1 indicating the A· fragment, (iii) geometry filename of the A· fragment, (iv) unique integer identifier 1 indicating the B· fragment, (v) geometry filename of the B· fragment, (vi) unique integer identifier -1 indicating the A-B model system, (vii) geometry filename of the A-B model system, and (viii) computational reference bond separation energy (in kcal/mol).
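The XYZ and .org layouts described above are simple enough to read with a few lines of Python. The sketch below is a minimal reader under stated assumptions: the comment line of an XYZ file is taken to hold the charge and multiplicity as two whitespace-separated integers, .org rows may carry leading and trailing '|' characters, and the example row and filenames are hypothetical rather than taken from the repository.

```python
from typing import NamedTuple


class Structure(NamedTuple):
    charge: int
    multiplicity: int
    atoms: list  # (element, x, y, z) tuples, coordinates in angstrom


def parse_xyz(text):
    """Parse an XYZ block: atom count, charge/multiplicity line, N atom lines."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    # Assumption: charge and multiplicity are the first two tokens.
    charge, multiplicity = (int(tok) for tok in lines[1].split()[:2])
    atoms = []
    for line in lines[2 : 2 + n_atoms]:
        element, x, y, z = line.split()[:4]
        atoms.append((element, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return Structure(charge, multiplicity, atoms)


def parse_org_row(line):
    """Split one '|'-separated row into the eight columns listed in the text."""
    fields = [f.strip() for f in line.strip().strip("|").split("|")]
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, found {len(fields)}")
    name, c_a, file_a, c_b, file_b, c_ab, file_ab, energy = fields
    return {
        "name": name,
        # (identifier, geometry filename) for A., B. and A-B, in that order.
        "terms": [(int(c_a), file_a), (int(c_b), file_b), (int(c_ab), file_ab)],
        "reference_bse": float(energy),  # kcal/mol
    }


# Hypothetical inputs, for illustration only.
h_atom = parse_xyz("1\n0 2\nH 0.0 0.0 0.0\n")
row = parse_org_row("| demo | 1 | a.xyz | 1 | b.xyz | -1 | ab.xyz | 100.0 |")
print(h_atom.multiplicity, row["terms"][2])  # 2 (-1, 'ab.xyz')
```

The identifiers 1, 1, and -1 are kept alongside the filenames so that a downstream script can combine the three energies without hard-coding which entry is the parent molecule.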

Technical Validation
For the generation of reference data, the reliable (RO)CBS-QB3 method was chosen for all the model systems considered in the BSE49 dataset. The (RO)CBS-QB3 method has been widely used in the literature in recent years. The developers of the (RO)CBS-QB3 method reported that it predicts heats of formation at 298 K with a mean absolute deviation (MAD) from experiment of 0.91 kcal/mol 33 . For bond dissociation enthalpies of eleven molecules with chemical structures typically found in amino acid sidechains, peptide termini, and peptide […] 8 . For small lignin model molecules, the CBS-QB3 approach was shown to yield bond dissociation enthalpies within 2.99 kcal/mol of experimental values 34 . (RO)CBS-QB3 has been used as a reference method for benchmarking various density functional theory methods to estimate bond dissociation enthalpies in a different study on small lignin model systems 23 . Hudzik and co-workers utilized the CBS-QB3 composite method to study the C-H bond separation energies of a few alkane molecules and reported good agreement with literature values 35 . The (RO)CBS-QB3 method has also been used for the prediction of bond dissociation enthalpies in previous work by Menon et al. 36 , who reported an MAD of only 0.60 kcal/mol from experiment and suggested it as a reliable and efficient procedure for calculating bond separation energies in comparison with the other composite methods tested. In another work, bond dissociation enthalpies of 200 molecules were calculated using an earlier version of this work's composite method, CBS-Q 37 . The CBS-Q composite procedure was shown to predict bond dissociation enthalpies to within 2.39 kcal/mol of the reported experimental values. Collectively, these results support the selection of (RO)CBS-QB3 as a practical and accurate method for the generation of reference data in this work. Note that the reference bond separation energies reported in this work are non-relativistic (RO)CBS-QB3 energies without zero-point energy corrections. This makes the reference data suitable to support the development of low-cost computational chemistry methods like those described in references [28][29][30][38][39][40] .
Table 2 (final rows). Line number(s) | Column | Content
[…] | […] | the multiplicity of the A-B parent molecule
n1 + n2 + 7, …, n1 + n2 + N + 6 | 1 | element type
n1 + n2 + 7, …, n1 + n2 + N + 6 | 2 | X coordinates (in Å)
n1 + n2 + 7, …, n1 + n2 + N + 6 | 3 | Y coordinates (in Å)
n1 + n2 + 7, …, n1 + n2 + N + 6 | 4 | Z coordinates (in Å)
n1 + n2 + N + 7 | 1 | 'end' string specifying end of the third molecular block
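The mean absolute deviation quoted throughout this section is also the natural statistic for testing a low-cost method against the BSE49 reference values. The Python sketch below shows one way it might be computed; the function name and the numerical values are hypothetical and do not come from the dataset.

```python
def mean_absolute_deviation(predicted, reference):
    """Mean absolute deviation between two equally long sequences (same units)."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


# Hypothetical bond separation energies in kcal/mol, for illustration only.
low_cost = [100.0, 85.5, 70.0]
reference = [101.0, 85.0, 72.0]
print(round(mean_absolute_deviation(low_cost, reference), 2))  # 1.17
```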

Code availability
Throughout this work, the Gaussian software package was used for geometry optimizations, frequency calculations, and composite (RO)CBS-QB3 calculations. The Gaussian software package can be purchased from Gaussian Inc. (http://gaussian.com/) under a commercial license. The CSD conformer generator was used for conformer generation and can be purchased under a commercial license from https://www.ccdc.cam.ac.uk/solutions/csd-enterprise/applications/conformer-generator/. The FullMonte software package was also used along with MOPAC16 (PM6-DH2 method). FullMonte can be downloaded free of cost from https://github.com/bobbypaton/FullMonte, whereas MOPAC16 can be installed after acquiring a free license from http://openmopac.net/. The Avogadro molecular editor and visualizer is an open-source program available at https://avogadro.cc/.