QMugs, quantum mechanical properties of drug-like molecules

Machine learning approaches in drug discovery, as well as in other areas of the chemical sciences, benefit from curated datasets of physical molecular properties. However, there currently is a lack of data collections featuring large bioactive molecules alongside first-principle quantum chemical information. The open-access QMugs (Quantum-Mechanical Properties of Drug-like Molecules) dataset fills this void. The QMugs collection comprises quantum mechanical properties of more than 665 k biologically and pharmacologically relevant molecules extracted from the ChEMBL database, totaling ~2 M conformers. QMugs contains optimized molecular geometries and thermodynamic data obtained via the semi-empirical method GFN2-xTB. Atomic and molecular properties are provided on both the GFN2-xTB and on the density-functional levels of theory (DFT, ωB97X-D/def2-SVP). QMugs features molecules of significantly larger size than previously-reported collections and comprises their respective quantum mechanical wave functions, including DFT density and orbital matrices. This dataset is intended to facilitate the development of models that learn from molecular data on different levels of theory while also providing insight into the corresponding relationships between molecular structure and biological activity.


Background & Summary
Machine learning methodologies are increasingly becoming well-established tools in many chemistry-related disciplines, such as drug discovery 1 , material science 2 , and physical chemistry 3 .In recent years, significant progress has been made in quantum-based machine learning (QML) methods 4 , which aim to accurately and computationally inexpensively predict the governing properties of atomistic systems, such as energies and forces [5][6][7][8][9][10][11][12] , dipole moments 13 , wave functions 14,15 and electron densities 16,17 .Despite the success and promise surrounding the applicability of such approaches, several challenges remain for QML.Arguably, one of the most important challenges is the increasing need for curated, comprehensive datasets. 13While several options, such as the QM9 18 or ANI-1 19 sets have paved the way for the development of current-generation QML methods [5][6][7][8][20][21][22] , the computational cost entailed in their generation limits both the scope of the explored chemical space (e.g., molecule size, atom-type diversity), and prospective modeling applicability 13,23 .
There has been a recent surge in interest in the delta-learning (∆-learning) of chemical properties, which aims to use a machine learning model to predict a physically relevant quantity, such as those generated by density-functional theory (DFT) by utilizing information extracted with a computationally cheaper method 21,24 (e.g., semi-empirical approaches such as GFN2-xTB [25][26][27][28][29] and PM6 30 ).Datasets that enable this type of learning are scarce and could promote the development of accurate models at potentially a fraction of the computational cost of more precise alternatives 31 .Furthermore, datasets that provide three-dimensional conformational data, for a wide variety of chemical space, at levels of theory higher than classical force fields 32,33 , could boost the performance of machine learning methods in predicting properties from ensembles as well as generative models of conformations.Relevant examples include the PubChemQC-PM6 22 and GEOM 33 datasets, which include molecules with properties computed using different semi-empirical levels of theory.Finally, there is a clear potential to open up new lines of research by combining biological annotations (e.g., from molecular databases such as ChEMBL 34 ), and additional QM-derived physical information.
This work introduces QMugs (Quantum-Mechanical Properties of Drug-like Molecules), a data collection of over 665k curated molecular structures extracted from the ChEMBL database, with accompanying computed quantum mechanical properties.Different levels of theory were combined in these calculations.Per compound, three conformers were generated, and their geometries were optimized using the semi-empirical GFN2-xTB method [25][26][27][28][29] , whereas a comprehensive array of quantum properties was computed at the DFT level of theory using the ωB97X-D functional 35 and the def2-SVP Karlsruhe basis set 36 .The data collection presented herein is put in the context of other sets that also feature DFT-level properties.A descriptive evaluation against QM9 18 , PubchemQC 37 , and the ANI-1 19 datasets is provided in Figure 1B, as well as in Table 1.As previously reported for the ChEMBL database 38 , most of the considered drug-like molecules in this study fall within the rod-disk axis in the principal moments of inertia plot 39 (Figure 1A).Furthermore, the vast majority of the included compounds (∼ 641k, 96.3%) were previously unreported in other DFT data collections, while also providing equivalent information at additional levels of theory, namely GFN2-xTB.With an average of 30.6 and a maximum of 100 heavy atoms per compound (Table 1 & Figure 2), QMugs also features molecular samples that are considerably larger than those provided by other datasets.To the best of our knowledge, this work is the first to provide a large and diverse dataset of quantum mechanical wave functions represented as local bases of atomic orbitals (i.e., DFT density and orbital matrices).Single-point properties as well as wave functions were computed with the Psi4 software suite 40 for all the conformers (∼ 2.0M) present in the database, totaling over 7 terabytes of supplied quantum mechanical data.
Overall, the utility of the presented dataset is fourfold: (i) it will provide researchers with the largest-to-date dataset to either directly predict the quantum chemical properties, or learn a property mapping between two popular quantum mechanical levels of theory (i.e., GFN2-xTB and ωB97X-D/def2-SVP); (ii) it will facilitate the development of novel machine-learning methodologies for the generation of molecular conformations and molecular property predictions via their ensembles; (iii) it will enable the development of novel deep learning frameworks for the prediction of the quantum mechanical wave function in a local basis of atomic orbitals from which all ground-state properties as well as electron densities can be derived; and (iv) it will enable research towards the exploration of QML methods quantum featurization in the context of pharmacologically relevant, annotated biological data.

Methods
Molecules were extracted from the ChEMBL database 34 (version 27).Conformers were generated using RDKit 41 and GFN2-xTB 25,26,28,29 .DFT (ωB97X-D/def2-SVP) calculations were carried out via Psi4 40 .A similar approach was adopted in a previous study on transition-metal complexes. 13An overview of the data processing pipeline is given in Figure 3, while individual steps are described in more detail in the following subsections.
In chemical terminology, the term "conformation" refers to any arrangement of atoms in space, whereas "conformer" refers to a conformation that is a local minimum on the potential energy surface of the molecule. 42In the analyses that follow, the term "conformation" is loosely used to refer to both, unless explicitly mentioned otherwise.

Data extraction and SMILES processing
Single-protein targets with assay information for at least 10 compounds with unique internal identifiers were extracted from the ChEMBL database.Several activity and annotation filters were subsequently applied to these compounds (see ESI for a detailed query description).This procedure resulted in 685, 917 molecules with unique external identifiers (ChEMBL-IDs), represented by their Simplified Molecular Input Line Entry Specification (SMILES) 43 .Molecules were neutralized, and salts and solvents were removed using the ChEMBL Structure Pipeline package 44,45 .For compounds consisting of multiple separate fragments after this "washing" procedure, all except the one with the highest number of heavy atoms were discarded.Additionally, molecules containing fewer than 3 or more than 100 heavy atoms, as well as radical species and molecules with a net charge different from zero after the attempted neutralization, were removed.Atom types included in the QMugs dataset are hydrogen, carbon, nitrogen, oxygen, fluorine, phosphorus, sulfur, chloride, bromide, and iodine.

Conformer generation and optimization
With the procedure described herein, a compromise between efficient molecular conformational search and practical computational expense considerations was sought.
The RDKit 41 implementation of the Experimental-Torsion Knowledge Distance Geometry (ETKDG) method 46 was used to generate up to 100 conformers for each molecule, with a maximum of 1000 embedding attempts and an initial coordinate assignment using distance-matrix eigenvalues and default settings (boxSizeMult=2.0,force-field tolerance=1e-3) .Upon no successful conformer generation, it was re-attempted via random assignment of the starting coordinates.The resulting conformers were further minimized using the Merck molecular force field 47 (MMFF94s) for a maximum of 1000 iterations, with default settings (nonBondedThresh=100.0).The lowest-energy conformer (according to the selected force field) for each structure was then used as a starting point for meta-dynamics (MTD) simulations.Stereocenters that were previously undefined in the SMILES extraction procedure were assigned in this conformer generation process.
For each generated conformer, an MTD simulation was performed with the xTB software package 28 for a duration of 5 ps with time steps of 1 fs, at a temperature of 300 K.The biasing root-mean-square deviation (RMSD) potential used for all MTD 2/14 simulations is given by where N is the number of reference structures, k i the pushing strength, ∆ i the collective variable (i.e., the RMSD between structure i and a reference structure), and α the width of the Gaussian potential used in the RMSD criterion.Simulations were carried out with α =1.2 a 0 −1 and k i =0.2 mE h with snapshots taken every 50 fs, resulting in 100 conformations stored with their corresponding energies.To obtain conformationally diverse samples, these structures were subsequently clustered into three groups via the k-means 48 algorithm, as implemented in the scikit-learn 49 (version 0.23.1)Python package using the pairwise RMSD of the aligned structures as molecular features.The conformation with the lowest-energy value from each cluster was then selected for further processing.The three resulting conformations for each molecule were then optimized using the GFN2-xTB 25,26,28,29 method using energy and gradient convergence criteria of 5 × 10 −6 E h and 1 × 10 −3 E h α −1 , respectively, and the approximate normal coordinate rational function optimizer (ANCopt).Harmonic frequencies, entropies, enthalpies and heat capacities at 298.15 K were extracted at the end of the geometry optimization process.Structures for which vibrational frequencies with imaginary wave numbers were obtainedindicative of failure to reach energy minima -were subjected to additional optimizations until no significant ones remained, up to a maximum of 100 attempts.

Quantum mechanical calculations
Single-point electronic calculations were performed for the optimized geometries using the ωB97X-D quantum functional and the def2-SVP basis set as implemented in the open-source quantum-chemistry software suite Psi4 40 .Single-point properties such as formation and orbital energies, dipole moments, rotational constants, partial charges, bond orders, valence numbers, as well as wave functions including α and β DFT-density matrices, orbital matrices, and the atomic-orbital-to-symmetry-orbital transformer matrix were obtained.For practical reasons, 52 structures whose DFT calculations required computational resources that exceeded empirically determined limits, or for which calculations were unsuccessful, were discarded (see ESI for details).

Data Records
All computed molecular structures, as well as their corresponding properties and wave functions are accessible through the ETH Library Collection service (Data Citation 1) 50 .
Format specification A summary.csv comma-separated file contains computed molecular-level properties and additional annotations.A compressed tarball file (structures.tar.gz) of ∼ 7 gigabytes (GB) contains plain MDL structure-data files 51 (SDFs) with embedded atomic and molecular properties, grouped in sub-directories according to their respective ChEMBL identifiers.These SDFs include single-point electronic properties calculated on the GFN2-xTB and ωB97X-D/def-SVP levels of theory, as described in Table 2.A second compressed tarball file (vibspectra.tar.gz,∼ 3 GB) contains vibrational spectra.
Wave function files, including additional properties such as DFT-density and orbital matrices, as described in Table 3, are split into 100 compressed tarballs (wfns_xx.tar.gz) of ∼ 50 GB each for easier management and downloading.These are supplied as NumPy 52 (.npy) binary files, which can be read using the Psi4 software package.Molecules (with all conformers grouped together) were assigned at random to the tarballs to enable easy use of subsets of the QMugs dataset without having to download all the files.The assignment of ChEMBL identifiers to tarballs is described in a tarball_assignment.csvfile.

Technical Validation
Optimized geometry sanity checks Four consecutive geometry checks were performed to filter out structures for which the geometry optimization procedure converged to unrealistic conformations.To determine suitable thresholds for removing a structure from our dataset, the generated geometries were compared to experimental reference values and to DFT-optimized geometries extracted from the PubChemQC dataset 37 .Specifically, we investigated (i) the deviation of bond lengths from experimental reference values, (ii) isomorphism between the initial molecular graphs and those obtained after geometry optimization, (iii) linearity of triple bonds, and (iv) planarity of aromatic rings.Structures were removed from the dataset if they failed any test of these tests.In total, 10, 986 (0.55%) conformations were discarded from the dataset.Each test is briefly described in the following subsections, with further technical details reported in the ESI.

Deviation of bond lengths from experimental reference values
Bond lengths in the optimized structures were compared to average experimental reference values for bonds of the same bond type (single, double, triple, or aromatic) and between the same atoms.Reference values were obtained from the Computational Chemistry Comparison and Benchmark DataBase (CCCBDB) 53 , and the largest absolute bond-length deviation from reference values was recorded per molecule.Bonds for which no reference value was available (0.75%) were omitted.The same analysis was carried out for molecules from the PubChemQC dataset containing the same atom types as QMugs, in order to obtain a comparable set with respect to the present atom types.The PubChemQC set (3, 834, 382 conformations with reference bond lengths) showed a deviation of 0.06 ± 0.04 Å (median ± 1 standard deviation), whereas the QMugs dataset (2, 004, 003 conformations with reference bond lengths) showed a deviation of 0.07 ± 0.03 Å.Based on the observed distribution of bond-length deviations from experimental reference values (Figure S1 in ESI) and manual investigation of example structures, 0.2 Å was determined to be a suitable threshold for a conformation to be removed from the dataset, which included 6, 131 (0.31%) examples.

Molecular graph isomorphism
It was investigated whether atom connectivity could be reconstructed after removing bond information from the generated SDFs.To this end, molecular graphs constructed exclusively from atom positions and types were compared to those obtained using the original atom connectivity (see ESI for details).1, 568 (0.08%) conformations for which the resulting molecular graphs were non-isomorphic failed this test.

Deviation of triple bonds from linear geometry
The deviation of triple bonds from their ideal linear geometry was examined.In this investigation, ring triple bonds were not considered owing to routinely-occurring deviations from linear geometry in systems with high ring strain 54 .The largest deviation from a 180°(linear) bond angle was recorded for each molecule containing at least one non-ring triple bond.The same analysis was performed on the PubChemQC dataset 37 .Triple-bond-containing molecules from PubchemQC and QMugs (273, 320 and 165, 101 samples, respectively) show deviations of 1.38 ± 1.46°(median ± 1 standard deviation), and 1.46 ± 2.13°, respectively.Based on the observed distribution of triple bond angles from a linear geometry (Figure S2 in ESI) and manual inspection of structures, a 10°deviation was identified as a suitable threshold.1, 147 (0.06%) conformations failed this test.

Deviation of aromatic rings from planar geometry
The planarity of carbon-containing aromatic rings was also investigated.For each molecule containing aromatic carbon atoms, the largest dihedral angle between the two planes spanned by each aromatic carbon atom and its three neighbors was recorded (see ESI for details).The same analysis was performed on the PubChemQC dataset 37 .Molecules from PubchemQC and QMugs (2, 391, 589 and 1, 950, 929 conformations with aromatic carbons, respectively) showed median dihedral angles ( ± 1 standard deviation) of 1.70 ± 1.85°and 2.99 ± 2.20°, respectively.Based on the observed distribution of dihedral angles from planar geometries (Figure S3 in ESI) and manual inspection of structures, 2, 769 (0.14%) conformations with aromatic carbon dihedral angles above 15°were discarded.

Further geometrical assessment
The changes in the molecular geometries along the applied pipeline were examined in order to evaluate the effects of the applied steps.Figure 4A shows the mean pairwise RMSD of atom positions between the conformations of each molecule at different steps along the pipeline.Conformations sampled during MTD simulations show a mean pairwise RMSD of 2.40 ± 0.52 Å (median ± 1 standard deviation).The k-means clustering procedure accomplishes the envisaged task of sampling conformations with higher geometric diversity (2.67 ± 0.74 Å).During the geometry optimization process, conformational diversity decreases (2.48 ± 0.86 Å).Unsurprisingly, for some molecules featuring rigid structures, conformations tend to converge toward the same energy minimum (0.09 % of molecules show a mean pairwise RMSD < 0.01 Å between their optimized conformers).
The degree to which the molecular geometries changed during the final optimization step was further analyzed.Molecules with initially more diverse conformations (higher mean pairwise RMSD of pre-optimized conformations) were shown to undergo a greater chance in atom positions (mean RMSD of pre-vs.post-optimized conformations) during optimization with the GFN2-xTB method (Figure 4B).The observed heteroscedastic behavior of these two properties indicates that while the mean RMSD of pre-vs.post-optimized conformations tends to increase with higher mean pairwise RMSD of pre-optimized conformations, its variance also increases.
Finally, the suitability of GFN2-xTB as a lower-cost surrogate for DFT-level geometry optimization (Figure 4C) was confirmed.500 randomly-chosen structures prior to semi-empirical geometry optimization from the QMugs dataset were further subjected to DFT-level geometry optimization (ωB97X-D/def2-SVP), discarding structures that could not be converged in 100 iterations or with the computational resources described in the ESI.The RMSDs between the structures independently optimized at both levels of theory were then measured.The pairs of structures show RMSDs of 0.47 ± 0.63 Å (median ± 1 standard deviation), indicating that the chosen semi-empirical method obtains similar geometries to those obtained with more expensive first-principle calculations.Large RMSDs in some example pairs (Figure 4C) could be interpreted as indicative of convergence to distinct local minima.
For 2, 067 molecules, their individual conformations have different SMILES describing two different (E)/(Z) isomers.Those structures are either α-β -unsaturated ketones, α-β -unsaturated nitriles, imine, or azo compounds, for which isomerization might be plausible 55,56 .In part due to the applied washing procedure, 17, 176 molecules can be represented with a SMILES string that is shared with at least one other ChEMBL-ID.

Validation of single-point properties
To validate the general agreement between the two methods employed in this work, the correlation between a series of singlepoint properties computed on both levels of theory was analyzed.Both global, molecular (Figure 5) and local, atomic/bond properties (Figures 6 & 7) were considered.All single-point molecular properties showed a high degree of correlation.Formation energies E Form (Figure 5A), which were obtained by subtracting atomic energies E Atom from total internal energies U RT , show a Pearson correlation coefficient (PCC) of 0.998.Dipole moments µ and rotational constants A (excl.22 small structures with very high rotational constants; Figure 5B,C) display PCCs of 0.969 and 0.999, respectively.Orbital energies, namely the energies for highest occupied (HOMO) E HOMO and lowest unoccupied molecular orbitals (LUMO) E LUMO and HOMO-LUMO gaps E Gap show PCCs of 0.769, 0.924 and 0.830, respectively (Figures 5 D, E & F).The observed PCCs for all six single-point molecular properties indicate good agreement between the two methods.Atom-type-specific partial charges for the 10 atom-types in QMugs (Figure 6, ESI Table 1) as well as the 15 most abundant covalent bonds orders (Figure 7, ESI Table 2) also show high correlations between the two methods used herein.Regarding partial charges, 7 out of the 10 atom types considered in QMugs were observed to have PCCs > 0.8, with the remaining carbon, nitrogen, and oxygen atom-types resulting in lower PCCs of 0.574, 0.124, and 0.274, respectively.Regarding bond orders, 10 out of the 15 show PCCs > 0.9 and 14 out of 15 display PCCs > 0.75 (see ESI Table 2 for additional metrics).Notably only carbon-fluorine bonds display a larger discrepancy between both levels of theory, with an observed PCC of 0.153.The observed correlations in both molecular and atomic single-point properties between GFN2-xTB and ωB97X-D/def2-SVP confirm the suitability of the former method as a computationally affordable starting point for ∆-learning of DFT-level properties.

Usage Notes
All data files can be accessed via any modern web browser, and can be programmatically downloaded using the provided instructions in the archive's readme.The provided SDFs can be processed using standard cheminformatics software (e.g., RDKit 41 , KNIME 57 ), and wave function files using the Psi4 40 software package or directly using Numpy 52 .    2 for additional metrics.For bond types which occurred > 1M times in the dataset, a randomly chosen sample of 1M bonds is plotted.* corresponding authors: Gisbert Schneider (gisbert@ethz.ch),José Jiménez-Luna (jose.jimenez@rethink.ethz.ch)† these authors contributed equally to this work

Overlap with other datasets
The overlap between the compounds included in the QMugs dataset with those included in other datasets featuring DFT-properties, namely QM9 1 , PubchemQC 2 and ANI-1 3 , was investigated.Overlap was computed based on the respective compounds' InChI 4 strings.For each dataset, the molecules were converted from their original formats (QMugs: .sdf,QM9: .xyz,PubChemQC: .mol,ANI-1: SMILES 5 , as obtained with the pyanitools module included in the ANI-1 repository) to InChIs using Openbabel 6,7 (version 3.1.1).Molecules that could not be successfully converted were skipped.The Venn diagram (Fig. 4, main article) was constructed using the pyvenn 8 package.Note that due to different washing procedures, the same compound identifier may refer to different InChI strings in different datasets (e.g., protonated and unprotonated).

ChEMBL SQL query
A locally-downloaded MySQL instance of the ChEMBL27 database 9 (version 8.0.19) was queried 10 .Our data extraction procedure is performed in two steps: First, a list of biological targets was extracted.Second, compounds for which a specific activity towards any of these targets was annotated were extracted.Both steps are described in the following.

Extraction of targets
"Single-protein" targets were selected, for which activity.standard_types were reported either as IC 50 , EC 50 , K i , K, K b , K a , K d , K e , or K m and in units of nM, or as a set of other activity.standard_types(e.g., − log K) and without units.The following types of annotations were excluded: Annotations with an assay.relationship_typeother than "homologous" or "direct", annotations with an assay.confidence_score below 7, annotations with assay descriptions for mutant species, annotations for potential activity duplicates, annotations with assay data validity comments other than "Manually validated" or "null", and annotations with activity.activity_commentsindicating missing data such as "Not Determined" or "Not tested".Only targets which had annotations for 10 or more unique compounds (as identified by their activity.molregnoidentifier) after this data extraction process were kept.

Extraction of compounds
For each of the targets extracted in the previous steps, we queried the SQL database for activity annotations.Annotations for which the activity.standard_valueor the activity.standard_unitwas missing were discarded.All activity.standard_valuesfor annotations were converted to negative decadic logarithm scales (pX) and values below -3 on this scale were corrected to their absolute value.The activity.standard_valuesfor annotations with activity comments denoting inactivity were set to 3 on this log-scale.Annotations where the converted activity.standard_valuelay outside the range of 3-12 were excluded.For each of the compounds that remained, we extracted the ChEMBL-ID and the canonical SMILES.

SMILES filtering
The molecules extracted from the ChEMBL database were filtered to exclude the following SMARTS patterns using RDKit 11 (version 2020.03.3.0): Compounds containing these substructures were excluded as they were either incompatible with the used MMFF94s force field 12 (e.g., B), contained halogens bound to two neighbors (e.g., [*]∼[F,Cl,Br,I]∼[*]), or elements with reduced relevance for drug-like molecules.

Computational resources
For practical reasons, molecules whose DFT-calculations demanded computational resources that exceeded empirically determined limits, were discarded.These limits include a maximum processing power of 4 CPU cores for 24 h wall-time, up to 40 GB of system memory, and 100 GB of scratch space.For the single-point property calculations, this was the case for 9 structures.Additionally, 19 structures (GFN2-xTB) and 24 structures (DFT) were discarded as the respective calculations could not be completed successfully.

Optimized geometry sanity checks
We performed four consecutive geometry checks (Sections 5.1 -5.4) to filter out structures for which geometry optimization had converged to unrealistic conformations.Structures failing any of these tests were removed from the dataset.Note that stated numbers of investigated molecules for the QMugs dataset are larger than the size of the final dataset, as only molecules that passed all geometry checks were included in the final dataset.Note that the reported numbers of conformations failing the tests described in the following subsections do not add up to 10, 986 (the number of conformations removed from the dataset) since some conformations failed in multiple tests.

Deviation of bond lengths from experimental reference values
We searched for bond-length outliers in the optimized structures.As reference values, average bondlength values between atom types from "Standard Reference Database, Computational Chemistry Comparison and Benchmark DataBase" 13 were extracted.For each bond in the optimized structures, the absolute deviation from its respective reference value was calculated.For each conformation, the largest absolute deviation was recorded.If no reference value for a specific bond type between two atom types was found, the bond was not considered for further analysis.This was the case for 876, 113 (0.75%) of all bonds in the investigated molecules from the QMugs dataset.The same procedure was carried out on molecules from the PubChemQC dataset 2 containing the same atom types as the QMugs dataset (Figure S1).The restriction of atom types was made in order to rule out effects of atom types on bond length deviations.This restriction removed 65, 316 (1.64%) molecules from the assessment.Further 82, 708 (2.08%) molecules which could not be successfully read with RDKit 11 were ignored.Based on the distribution of bond-length deviations from experimental reference values, we discarded molecules above the 0.2 Å threshold.Based on manual inspection of structures at different bond length deviation values, this threshold appeared as a suitable, conservative threshold.6, 131 (0.31%) conformations did not pass this test.

Molecular graph isomorphism
We investigated whether heavy-atom connectivity can be reconstructed when removing all bond information from the generated structure-data files (SDF) in the database.SDFs were converted to the .xyzfile format (which does not contain bond information) using OpenBabel 6,7 (version 3.1.1).We then attempted to perform a conversion from .xyz to InChI 4 , or to SMILES 5 upon failure of the former.If both were unsuccessful, we considered the graph isomorphism check as failed.If either succeeded, however, we compared the molecular graph of generated molecular strings (as read by RDKit) to the one originally obtained from the SDF (which includes bond information).The isomorphism of the molecular graphs was then checked using the NetworkX 14 Python package (version 2.5), considering nodes (representing atoms) in each graph labelled with their respective atom types.We did not use bond types in the previous comparison due to observed high false negative rates related to mislabelled bonds in nitro and other functional groups with multiple resonance structures, among further reasons.1, 568 (0.08%) conformations failed this test.

Deviation of triple bonds from linear geometry
For each molecule containing a triple bond which is not part of a ring, the deviation of bond angles γ from the ideal 180°(linear) geometry, denoted as ∆γ, was assessed.For each non-terminal atom in a nonring triple bond, the angle between the bonds to its two neighbors was computed.The largest deviation ∆γ from a perfectly linear triple bond was recorded per molecule.We limit this investigation to triple bonds outside a ring as triple bonds in rings can show substantial deviations from a linear geometry due to high ring strain (e.g., cyclooctyne, the smallest stable, cyclic hydrocarbon accommodating a triple bond, deviates by ∆γ = 17°from a linear geometry 15 ).For reference, we performed the same study on all molecules from the PubChemQC dataset 2 which include at least one non-ring triple bond (Figure S2), skipping molecules with could not be read with RDKit.From visual inspection of the distribution of triple bond angle deviations and manual inspection of example structures at different deviation levels, we decided to discard structures with a triple bond angle deviation of ∆γ > 10°. 1, 147 (0.06%) conformations failed this test.Structures for which GFN2-xTB [16][17][18][19] still indicated significant negative wavenumbers after 100 iterations (17, 502 conformations, 0.88%) are denoted in the summary.csvfile.

Deviation of aromatic rings from planar geometry
We furthermore investigated the planarity of carbon-containing aromatic rings.To do so, we assessed the dihedral angle between the two planes spanned by each aromatic carbon atom and its three neighbors.Angles greater than 90°were corrected to 180°-<angle> to remove directional dependency.Calculations were performed for all order permutations of the aromatic carbon atom and its three neighbors.The largest dihedral angle (and hence the largest deviation from a perfectly planar aromatic ring) was recorded per conformation.We performed the same study on the molecules from the PubChemQC dataset 2 described in Section 5.1 (Figure S3).From visual inspection of the distribution of aromatic carbon dihedral angle deviations and manual inspection of example structures at different deviation levels, we decided to discard structures with an aromatic carbon dihedral angle deviation of 15°or more.2, 769 (0.14%) conformations failed this test.

Independent terms of the Schrödinger equation
The Born Oppenheimer approximation 20 defines the molecular energy of the electronic Schrödinger equation as a sum of four independent terms, namely (i) nuclear repulsion energy VNN , (ii) exchange correlation energy VeN , (iii) kinetic electron energy also known as one electron energy Te , and (iv) electron repulsion energy also known as two electron energy Vee .The distribution of the four terms in QMugs are visualized in Figue S4.

Figure 1 .Figure 2 .
Figure 1.A) Principal-moments-of-inertia plot39 for molecules in the QMugs dataset.NPR x = x-th normalized principal moment, I x = x-th smallest principal moment of inertia.B) Venn diagram showing overlap between QMugs and other well-known datasets with DFT-level computed properties: QM9 18 , PubChemQC 37 , and ANI-119  .Overlap was computed based on the uniqueness of the InChI representations of the contained molecules.Numbers do not add up to those reported in Table1because of InChI strings that occur multiple times.

Figure 3 .Figure 4 .Figure 5 .Figure 6 .Figure 7 .
Figure 3. Overview of the data generation process.Molecules were extracted from the ChEMBL database, standardized, and filtered, and starting conformers were generated using the RDKit software package.Metadynamics (MTD) simulations were performed using the GFN2-xTB semi-empirical method to generate three diverse conformations before final geometry optimization.Molecules that did not pass a series of geometric sanity checks were removed.DFT-level properties (ωB97X-D/def2-SVP) were computed using the Psi4 software.

Figure S1 :
Figure S1: (Top) Distribution of largest absolute bond-length deviation from experimental reference values per conformation (histogram bin size 5 × 10 −4 Å).PubChemQC (3, 834, 382 conformations with reference bond lengths) shows a deviation of 0.0580 ± 0.0419 Å (median ± 1 standard deviation), whereas QMugs (2, 004, 003 conformations with reference bond lengths) exhibits a deviation of 0.0687 ± 0.0317 Å. 8, 492 (0.22%) and 926 (0.05%) conformations in the PubChemQC and QMugs sets, respectively, have higher deviations than 0.30 Å and are not shown.(Bottom) Distribution of the total number of atoms per conformation in both datasets (histogram bin size 1), showing that molecules in the QMugs sample are significantly larger on average.765 (0.04%) conformations in the QMugs dataset have more than 200 atoms and are not shown.Potentially-arising greater steric clashes in larger molecules may contribute to the slightly higher bond length deviations in the QMugs dataset, compared to the PubChemQC dataset.

Figure S4 :FFigure S5 :
Figure S4: Terms of the molecular hamiltonian Ĥ calculated on the ωB97X-D/def2-SVP level-of-theory fo moelcules in QMugs.(A) Nuclear repulsion energy VNN in E H . (B) Exchange correlation energy VeN in E H . (C) One electron energy Te in E H . (D) Two electron energy Vee in E H .

Table 1 .
59scriptive statistics of the dataset reported herein in the context of other DFT-level molecular datasets and the information provided by each.The number of molecules for PubChemQC corresponds to that available on the website of the project.59Heavyatom averages are weighted by the number of conformations.