Introduction

Datasets have become an essential part of computational chemistry1,2,3,4,5,6,7,8,9,10,11. Improvements in data availability, interoperability, and storage have been key to the development of cost-efficient Machine Learning (ML) methods12, and have promoted the use of Quantum Chemistry (QC) computations for the high-throughput screening of molecules and materials. For datasets to be adequate for both ML and QC, they must cover a well-defined albeit diverse portion of chemical space, and include the essential information needed to run an electronic structure computation, namely the structure (R) and the molecular charge (Q). Unfortunately, satisfying both requisites can be very challenging. Existing crystallographic databases (e.g., CSD13,14, COD15) offer the most comprehensive collection of crystals, resulting from decades of creative synthetic chemistry work. They represent an ideal basis for the data-driven discovery of materials, but these top-down databases contain no information about Q for individual molecules. For this reason, a popular alternative has been the construction of combinatorial bottom-up databases, in which molecules are assembled from a pool of building blocks for which R and Q are known beforehand16,17,18,19,20,21,22. While these databases are easy to construct and can grow considerably large, they suffer from a lack of diversity in the pool of building blocks and/or in the rules to combine them. Therefore, neither approach currently meets both the diversity and the information requirements needed for ML and QC methods to explore anything close to the full chemical space.

To improve on this situation, we report herein the cell2mol software (available on GitHub), a fully automatic pipeline to characterize molecular crystals that enables the construction of QC-ready datasets with large chemical diversity. The algorithm encodes chemical concepts and rules applied by chemists when interpreting crystallographic data. While cell2mol excels at characterizing purely organic crystals, it is particularly useful for crystals containing transition-metal (TM) complexes, which pose a bigger challenge due to their structural complexity and the multiple oxidation states (OS) of the metal ions. The elucidation of metal OS from structural data is an active topic of research. One of the most widespread methods is the Bond-Valence Sum (BVS), which uses a set of metal- and OS-dependent parameters to capture the correlation between OS and structure23. The BVS method can be very accurate for some metals with available parameters24, but has severe limitations in its applicability (as discussed below). Recently, ML alternatives to determine the metal OS in metal-organic frameworks25 and oxygen-coordinated metal atoms26 have been developed. An advantage of both the reported ML models and BVS is that they are solely based on the local environment of the metal center. As such, they scale very well with the number of metal centers in a molecule, and they do not suffer when structures have experimental uncertainties (e.g., disorder, missing H atoms) as long as these are far away from a metal center. However, while knowing the metal OS (QM) is valuable for analysis and filtering purposes, the charge of the ligand(s) (QL) is still unknown. Hence, these methods do not provide the total charge of the complex (QTMC) and are insufficient to set up a QC computation. With this aim in mind, Parsons et al. combined the BVS method for QM with three other approaches that retrieve QL24. While the overall approach is reported to be very reliable, it still suffers from the poor applicability of BVS and requires QC computations, which defeats the purpose of determining QTMC, since QTMC is itself a prerequisite for setting up a QC computation.
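For reference, the bond-valence sum discussed above is commonly evaluated as a sum of exponential terms, s = exp((R0 − R)/b), over the metal-ligand bonds, with b usually taken as 0.37 Å. The snippet below is only a minimal sketch of that expression. The function name, the bond lengths, and in particular the R0 value are illustrative assumptions; real applications require tabulated parameters for each metal, OS, and donor atom.

```python
from math import exp


def bond_valence_sum(bond_lengths, r0, b=0.37):
    """Sum of bond valences exp((r0 - r) / b) over metal-ligand bonds (angstrom)."""
    return sum(exp((r0 - r) / b) for r in bond_lengths)


# Hypothetical octahedral metal center with six equal 2.00 angstrom bonds.
# r0 = 1.759 is an illustrative placeholder, not a vetted parameter.
bvs = bond_valence_sum([2.00] * 6, r0=1.759)
print(f"bond-valence sum = {bvs:.2f}")
```

The result is compared with the integer OS values for which parameters exist. This makes explicit why the method cannot be applied when no parameters are available for a given metal, OS, and donor atom combination.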

Other approaches exist to extract more comprehensive information about crystallographic data27,28,29,30. On the one hand, CSD editors interpret and curate the database entries using algorithms that assess not only QM but also the molecular connectivity (C, i.e., the bond order network) and the formal atomic charges (qi)27. The goal of such algorithms is to interpret the whole unit cell, rather than only the metal center, which makes them much more valuable (e.g., for substructural or similarity searches31), but also more vulnerable to experimental uncertainties. Consequently, CSD algorithms have a lower, but still exceptional, success rate (ca. 75%) at interpreting unit cells27. However, they are closed, the final C and QM data are only accessible through the limited set of export options in the CCDC software, and they have only been applied to a few CSD entries, so their degree of automation is unclear. On the other hand, a fully automated topological analysis of crystals is available with the ToposPro package30 at the TopCryst web interface29, which includes a simplified form of C based on geometric considerations. It is worth mentioning that both the CSD algorithms and ML models exploit statistics to make their predictions. In other words, the accuracy for a particular structure (or metal center) depends on how often it has been seen before, which implies that less frequent structures, metal environments, and oxidation states are described less accurately25.

As opposed to these methods, cell2mol provides a reliable and comprehensive interpretation of a unit cell, including the connectivity (C) and charge (Q) of all molecular species contained in crystallography files. It does so in a deterministic manner, meaning that it can be applied to individual .cif files with no preliminary training, unlike probabilistic (i.e., ML-based25) models. Additionally, it does not require any QC computation32,33,34,35,36, and offers complete data interoperability37. cell2mol provides the necessary information to set up any subsequent QC computation (including solid-state ones) or ML model, or to classify compounds/ligands based on charge, denticity (κ) or hapticity (η), or through any substructural search, thus retaining full control over the species included in the final datasets. To validate its performance, we construct, analyze and distribute eight different databases of TM complexes, as well as a database of 13k unique ligands with all the necessary information to exploit them in molecular assemblers to achieve even greater chemical diversity.

Results and discussion

The algorithm

The characterization of a unit cell with cell2mol only requires its crystallography file (i.e., the .cif file). After an initial formatting with the cif2cell code38, the characterization proceeds in two main steps (see detailed workflow in Supplementary Note 1). The goal of step 1 is to obtain the stoichiometry of the different molecules in the unit cell. This information is not immediately available from the .cif file since the molecules are not yet recognized as such. Moreover, they tend to be severely fragmented into smaller groups of connected atoms (box A in Fig. 1). These groups are put together through the construction and block-diagonalization of the adjacency matrix (A), with Aii = 1, and Aij = 1 if the distance between atoms i and j is below a threshold and zero otherwise. A is evaluated from interatomic distances for simplicity and efficiency39, although more accurate and expensive alternatives based on Voronoi partitioning schemes exist40,41. After block-diagonalization, the resulting blocks correspond to either molecules or fragments (i.e., a portion of a molecule). Molecules are preserved, while fragments undergo translations along the three crystal directions, forming bigger fragments until all molecules are fully reconstructed (box B). Finally, ligands in TM complexes are identified using a similar block-diagonalization process, with AMj = AjM = 0 for all metal (M) and ligand (j) atoms (box C).
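The distance-based adjacency matrix and its block structure can be sketched in a few lines. This is a minimal illustration and not the cell2mol implementation. The covalent radii, the 1.3 scaling factor, the toy coordinates, and all function names are assumptions, and the translation of fragments across cell boundaries is not covered. Finding the connected components of A is equivalent to reading off the blocks after block-diagonalization.

```python
from itertools import combinations
from math import dist

# Illustrative covalent radii in angstrom (assumed values).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "O": 0.66, "Fe": 1.32}


def adjacency(labels, coords, factor=1.3):
    """A[i][i] = 1; A[i][j] = 1 if atoms i and j are closer than a threshold."""
    n = len(labels)
    A = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        threshold = factor * (COVALENT_RADII[labels[i]] + COVALENT_RADII[labels[j]])
        if dist(coords[i], coords[j]) < threshold:
            A[i][j] = A[j][i] = 1
    return A


def blocks(A, atoms=None, skip=()):
    """Connected components of A restricted to `atoms`, ignoring atoms in `skip`.

    Skipping the metal atoms has the same effect as setting A[M][j] = A[j][M] = 0.
    """
    pending = [i for i in (range(len(A)) if atoms is None else atoms) if i not in skip]
    allowed, seen, found = set(pending), set(), []
    for start in pending:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in allowed:
                if A[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        found.append(sorted(component))
    return found


# Toy system: an Fe atom with two CO ligands, plus a distant water molecule.
labels = ["Fe", "C", "O", "C", "O", "O", "H", "H"]
coords = [(0, 0, 0), (1.8, 0, 0), (2.95, 0, 0), (-1.8, 0, 0), (-2.95, 0, 0),
          (10, 0, 0), (10.96, 0, 0), (9.76, 0.93, 0)]
A = adjacency(labels, coords)
molecules = blocks(A)                              # whole molecules
ligands = blocks(A, atoms=molecules[0], skip=[0])  # ligands of the Fe complex
print(molecules, ligands)
```

In this toy case the first call separates the complex from the water molecule, and the second call, with the metal removed, splits the complex into its two carbonyl ligands.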

Fig. 1: Simplified workflow of cell2mol.

In step 2, cell2mol assigns the connectivity (C) and formal charge (Q) of all species in the unit cell by exploiting the charge neutrality rule. All ligands and any non-metallic species (i.e., counterions or lattice solvent) are interpreted, and their C, the associated formal total charge (Q), and the atomic charges (qi) are retrieved. Within cell2mol, the concept of connectivity is handled mathematically as the bond-order matrix C representing a Lewis structure. In essence, C adds the bond order (e.g., single or double bond) information to A. C is created from A through a modified and expanded version of the xyz2mol code developed by Jensen and coworkers, based on previous work by Kim42. The use of C to define the molecular graph offers improved capabilities with respect to other codes based on A29,30. Lewis structures give access to qi and Q, and enable advanced substructural searches including specific bond patterns or functional groups. While the adopted machinery offers a clear advantage, the number of possible C (i.e., Lewis structures) grows very fast with the number of atoms, especially in conjugated backbones. For this reason, together with the difficulty of handling periodic connectivities, cell2mol cannot be applied to periodic structures, for which other approaches based on A are available29. However, the generation of C requires a known Q, which is precisely our unknown variable. Therefore, our approach is to generate several C starting from a list of candidate initial charges, and to select those that are plausible (box D). The criteria to generate and select the best C are the same for non-complex molecules (e.g., solvent, counterions) and ligands. However, for ligands, there is a key preliminary step. To compensate for the missing M-L bonds when generating C, some connected ligand atoms (i.e., those that are coordinated to the metal) must be saturated, typically with H atoms. This is a complicated part of the algorithm due to the large chemical diversity of ligands and their different coordination modes. A comprehensive list of rules is specified in cell2mol for this task, dealing with a large variety of denticity and hapticity modes for both terminal and bridging ligands. For some ligands, especially polydentate ones with large conjugated moieties, the decision to saturate connected atoms is particularly difficult, because a chemically meaningful C can be achieved with and without any additional protonation. Thus, multiple protonation states are created and, for each of them, several C are generated (see Supplementary Note 2).

From the pool of C, those that minimize the total number of atomic charges and the absolute total charge, before and after correction for the removal of the added H+, are considered plausible and are pre-selected. Once plausible C are collected for all non-metal species in the unit cell, these are combined with a list of common OS for all metal species (see Supplementary Note 3), generating candidate charge distributions. When only one charge distribution fulfils charge neutrality, it is selected, and the unit cell is successfully interpreted (box E). When several do, the unit cell is considered “unresolved” (see Supplementary Note 4 for a detailed analysis of errors in step 2). This is increasingly common in unit cells with multiple redox-active species, such as in bi- or poly-metallic complexes (A0/B+1 vs. A+1/B0). Options are currently being explored to improve the interpretation of those systems (see Supplementary Note 5).
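The charge-neutrality selection described above amounts to enumerating every combination of candidate charges and keeping those that sum to zero over the unit cell. The following is a minimal sketch of that idea only. The function names, the "no_solution" label, and the list of candidate OS for Re are hypothetical. The first example uses the composition reported in this work for YOXKUS (four complexes per cell, each with one Cp-phosphine ligand, two iodides, and one CO), and the second mimics the A0/B+1 vs. A+1/B0 ambiguity.

```python
from itertools import product


def neutral_distributions(species):
    """species: list of (name, count, candidate_charges) tuples.

    Returns every assignment of one charge per species whose count-weighted
    sum over the unit cell is zero.
    """
    names = [name for name, _, _ in species]
    counts = [count for _, count, _ in species]
    solutions = []
    for charges in product(*(candidates for _, _, candidates in species)):
        if sum(n * q for n, q in zip(counts, charges)) == 0:
            solutions.append(dict(zip(names, charges)))
    return solutions


def interpret(species):
    """Label a unit cell according to the number of neutral distributions."""
    solutions = neutral_distributions(species)
    if len(solutions) == 1:
        return "resolved", solutions
    # "no_solution" is a label introduced here for illustration only.
    return ("unresolved" if solutions else "no_solution"), solutions


# YOXKUS-like cell; the Re candidate OS list is a hypothetical placeholder.
yoxkus = [("Re", 4, [1, 3, 5, 7]), ("Cp-phosphine", 4, [-1]),
          ("I", 8, [-1]), ("CO", 4, [0])]
status_yoxkus, sol_yoxkus = interpret(yoxkus)

# Two redox-active species A and B with one monoanionic counterion X.
bimetallic = [("A", 1, [0, 1]), ("B", 1, [0, 1]), ("X", 1, [-1])]
status_ab, sol_ab = interpret(bimetallic)
print(status_yoxkus, sol_yoxkus, status_ab, sol_ab)
```

With these inputs only Re(+3) balances the ligand charges in the first case, whereas the second case has two equally neutral distributions and is therefore left unresolved.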

Example: YOXKUS

To illustrate the interpretation capabilities of cell2mol, we take YOXKUS43 as an example (see Fig. 2). According to cell2mol, YOXKUS has four identical mono-metallic Re complexes and no counterion or solvent molecules in the unit cell. Each complex has three types of ligands. The first ligand is interpreted as being connected to the Re ion through two groups of atoms. One group consists of a substituted Cp ring with η5 hapticity and the other is the P atom of a diphenylphosphine, with κ1 denticity. cell2mol assigns this ligand a total charge of −1, after creating one protonation state, generating its connectivity under five possible charges (0, ±1, ±2), and selecting the most plausible one. The second ligand is an iodide with −1 charge, which appears twice, and the third is a neutral CO ligand, with formal charges of −1 and +1 on the C and O atoms, respectively, and a triple bond between them. All this information is stored in variables and saved in a Python object containing the interpretation of the whole unit cell. From this object, a user can easily export the Cartesian coordinates of the fully reconstructed unit cell, as well as the cell parameters, to prepare a solid-state QC computation. Alternatively, the user can extract the Q, R, and C of any of the individual molecules, or of the isolated ligands, metals, or atoms of those molecules. In addition, non-metal species can be accessed through their respective RDKit mol objects. This provides an unprecedented level of control over the final dataset, with all the potentially relevant information (i) for a substructure/similarity search (using C), (ii) to set up any QC computation (using Q and R), or (iii) for the generation of chemoinformatics (e.g., SMILES44) or QML-based45 representations for ML models (using Q, R, and C).
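The formal charges quoted above for CO and iodide follow directly from a bond-order matrix once lone pairs are assigned. The sketch below is a simplified illustration, assuming that every atom completes its octet (duet for H), and is not the procedure implemented in xyz2mol or cell2mol. The function name and the valence-electron table are assumptions. Note that, unlike A, the matrix used here has a zero diagonal.

```python
# Valence electrons of a few main-group elements.
VALENCE_ELECTRONS = {"H": 1, "C": 4, "N": 5, "O": 6, "I": 7}


def formal_charges(labels, bond_orders):
    """Formal charge per atom: valence - lone-pair electrons - bond orders.

    Lone-pair electrons are assigned by assuming a completed octet
    (duet for H), which is a simplification made for this sketch.
    """
    charges = []
    for label, row in zip(labels, bond_orders):
        bonds = sum(row)
        shell = 2 if label == "H" else 8
        lone_pair_electrons = shell - 2 * bonds
        charges.append(VALENCE_ELECTRONS[label] - lone_pair_electrons - bonds)
    return charges


# Carbon monoxide with a triple bond, and an isolated iodide.
q_co = formal_charges(["C", "O"], [[0, 3], [3, 0]])
q_iodide = formal_charges(["I"], [[0]])
print(q_co, sum(q_co), q_iodide)
```

Under this octet assumption the triple-bonded CO carries −1 on C and +1 on O with a total charge of zero, and the isolated iodine atom carries −1, in line with the interpretation given for YOXKUS.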

Fig. 2: Diagram of the CSD entry YOXKUS.

Performance of cell2mol

The capabilities of cell2mol are demonstrated by interpreting crystallographic information extracted from the CSD repository. For simplicity, datasets are constructed separately for eight TM ions, including the most electronically challenging ones from the 3d block (Cr, Mn, Fe, Co, Ni, Cu) and representatives from the 4d (Ru) and 5d (Re) blocks. The data is initially extracted from the CSD software ConQuest. The only filters applied at this stage are the presence of the respective TM ion, and the absence of any so-called polymeric bond. Thus, periodic systems are discarded, for which other approaches offer excellent topological analysis tools, or the prediction of metal OS25,26,29. Overall, our databases cover molecular crystals of organometallic and coordination complexes. No other limitation on the element types (except f-block), molecular size, or complexity is set. The resulting entries are exported from ConQuest in .cif format, and duplicate CSD-refcodes are discarded. Aiming at a complete interpretation of the unit cell, cell2mol is vulnerable to experimental uncertainties. Entries with disorder or missing H atoms cannot be interpreted correctly and are thus filtered out (see Fig. 3 and Supplementary Note 3). This pre-filtering step is crucial to obtain more reliable statistics of cell2mol. Less evident errors can only be identified after retrieving the connectivity, which is still unknown at this stage. For instance, assessing whether an O atom is missing a proton (OH vs. O vs. O) depends on the connectivity (−OH vs. =O vs. −O) of all molecules, and on fulfilling the charge neutrality criterion.

Fig. 3: General workflow of cell2mol performance analysis.

General workflow to set up the success rate and reliability of cell2mol, including values for the Fe-based database of mono-metallic complexes.

To evaluate the performance of cell2mol, we use two metrics: the success rate and the reliability. The former quantifies the percentage of CSD entries for which a plausible interpretation is given, and is related to the amount of chemical diversity that the code can handle without errors. The latter, which is the most important parameter for generating curated databases, measures how often the proposed interpretation is correct. While assessing the reliability based on the entire list of properties extracted for each CSD entry is not possible, most of them have a direct impact on the assignment of the metal OS. The metal OS is thus chosen to estimate the success rate and reliability of the cell2mol interpretation (see Fig. 3), and is compared to the metal OS given in the .cif file, which is taken as a reference. As discussed hereafter, the reference values are sometimes erroneous, which means that the reported reliabilities are slightly underestimated (~1%) for all methods. Also, cases of error compensation are possible, in which the cell2mol interpretation is incorrect in one of its variables without affecting the metal OS prediction. In any case, all CSD entries for which the OS reference is available are collected in the test set, while the other entries are collected in what we call the prediction set, given that cell2mol predicts their properties based on the available crystallographic data. Finally, both subsets are further split into mono-, bi-, and poly-metallic complexes46. To simplify the discussion, we focus on mono-metallic complexes, although complementary analyses of the other subsets are available in Supplementary Note 5.
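The two metrics defined above can be written down compactly. The snippet below is a minimal sketch under two assumptions: the success rate is taken over all entries that reach the interpretation stage, and the reliability over the interpreted entries only. The function name and the toy data are invented for illustration and do not correspond to any entry in the databases.

```python
def success_and_reliability(entries):
    """entries: list of (predicted_os, reference_os) pairs.

    predicted_os is None when no unique interpretation was obtained.
    Returns (success rate, reliability), both in percent.
    """
    interpreted = [(pred, ref) for pred, ref in entries if pred is not None]
    success = 100 * len(interpreted) / len(entries)
    correct = sum(pred == ref for pred, ref in interpreted)
    reliability = 100 * correct / len(interpreted) if interpreted else float("nan")
    return success, reliability


# Made-up toy data: 8 entries, 6 interpreted, 5 of them matching the reference.
toy_entries = [(2, 2), (3, 3), (2, 2), (0, 0), (2, 3), (3, 3), (None, 2), (None, 3)]
success, reliability = success_and_reliability(toy_entries)
print(f"success rate = {success:.1f}%, reliability = {reliability:.1f}%")
```

Keeping the two denominators separate makes clear that a method can have a modest success rate and still be highly reliable on the entries it does interpret.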

More than 75% of crystal structures containing mono-metallic complexes are unambiguously interpreted by cell2mol (see Table 1). This percentage rises to 94% for a pool of randomly selected purely organic crystals, and decreases to 71% for Re-based complexes, owing to their greater diversity in OS and metal-ligand coordination modes. Such a success rate is comparable to what has been reported for the CSD interpreters, and largely outperforms other popular methods to assign the metal OS such as BVS, especially for Cr, Ru, and Re (see Supplementary Note 6). Even more importantly, the reliability is extraordinarily high for all metals, especially for those with one dominant OS such as Cu (98%) or Ni (97%), and diminishes to 90% for Re complexes, owing to the larger number of common OS of Re. Entries with a disagreement are discarded from the final published datasets. However, manual inspection reveals that only about one-third of those cases are due to an error in cell2mol (see Supplementary Note 7). In most cases, the disagreement is due to incomplete or erroneous information in the .cif file, which suggests the potential use of cell2mol as a diagnostic tool. Also, the reliability of cell2mol is much higher than that of BVS (ca. 74%), which greatly underperforms here in comparison with what is typically reported in the literature, due to the much greater diversity of our datasets. Finally, to assess the performance of ML models on the same dataset, we trained a Random Forest (RF) ML model to predict the metal OS based on its local environment (see Methods for details). The accuracy of this model reaches ca. 94%, similar to what Smit and coworkers report for the application of their ML model to metal complexes (ca. 90%)25, and similar to cell2mol itself (see Table 1). We thus conclude that cell2mol offers comparable reliability, while providing not only the metal OS (as the ML and BVS methods do) but a comprehensive interpretation of the unit cell. The advantages of cell2mol become even clearer when interpreting the CSD entries in the prediction set (vide infra).

Table 1 Results on the cell2mol characterization of unit cells with mono-metallic TM complexes included in the test set (See Supplementary Note 5 for unit cells with bi- and poly-metallic complexes).

Chemical diversity

For 31,019 CSD entries included in the test set, cell2mol provided a unit cell interpretation that coincided with the metal OS provided in the .cif file. For those entries, two-dimensional maps of their chemical space have been constructed (see Methods), highlighting the charge and connectivity distribution (see Fig. 4 for Fe and Supplementary Notes 10–13 for other metals). These maps help identify structure–property correlations without any a priori assumption. For instance, (i) most Fe-based haptic compounds have Fe(II) or Fe(0) metal ions, (ii) Mn shows a clear correlation between structure and OS, and (iii) Cu complexes with coordination number 3 are almost exclusively associated with Cu(I). Overall, the eight metal centers can be found in 2407 different coordination sphere types (e.g., FeN4O2), and are coordinated to a pool of 13,819 unique (i.e., non-repeated) ligands with total charges that range from −6 to +2, and including 8 different hapticity modes (see Fig. 5). Those ligands are collected in a separate database that includes their coordinates, list of connected atoms, charge (Q), and bond network (C) representing their Lewis structure. On the one hand, C enables us to determine, through a structural search, that this pool of ligands contains, for instance, 6909 secondary amines, or 988 rings containing an O atom. On the other hand, the remaining data could be used to re-assemble46 those ligands into new complexes to create a bottom-up database encompassing an even broader region of chemical space. For instance, the 4942 unique bi-dentate ligands in the database can be combined to generate about 20 billion octahedral complexes48. Similarly, the re-assembled molecules could be combined, in a modular fashion47, with the identified 1246 unique non-complex molecules (i.e., solvent, counterions) to generate new candidate unit cells fulfilling charge neutrality. While the stability and shape of these new unit cells would have to be assessed48,49, we expect that cell2mol could also be exploited to generate chemical diversity at the supramolecular level.
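The order of magnitude quoted above for the number of octahedral complexes can be checked with a simple count. The sketch below assumes that each complex is defined by an unordered choice of three bi-dentate ligands, with repetition allowed, around a single metal center, and that stereoisomers and different metals are not counted. How the figure was actually obtained is not stated here, so this counting scheme is an assumption.

```python
from math import comb

n_bidentate = 4942  # unique bi-dentate ligands reported in the ligand database

# Multisets of size 3 drawn from n ligands: C(n + 2, 3).
n_complexes = comb(n_bidentate + 2, 3)
print(f"{n_complexes:.2e} ligand combinations")
```

Under these assumptions the count is roughly 2 × 10^10, consistent with the figure of about 20 billion.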

Fig. 4: Analysis of the chemical space covered by the Fe database and ML model performance.

Representation of the chemical space in the Fe mono-metallic dataset using the t-SNE projection. Each point is one TM complex in the database. Complexes are clustered by similarity in the local SLATM representation of their metal center, which describes the structure and chemical composition of the first coordination sphere (see Methods). The top panels show, for the test set: a Oxidation State, the distribution of metal OS (0 = green, 2 = red, 3 = cyan); b Hapticity, the presence of at least one haptic ligand (green = no, blue = yes); c Coord. Sphere, the 385 coordination sphere types for Fe; and d Metal Coord. Number, the coordination number of the metal, with haptic ligands counting 0 towards this number (green = 0, cyan = 3, purple = 4, navy = 5, pink = 6). The bottom panels show: e Max. ML probability (test set), the maximum probability associated with the ML prediction of the metal OS, as a measure of its confidence (yellow = 1, degrading to green = 0.5 and blue = 0); and f the overlap between the prediction (red) and test (blue) sets, which are also shown separately in g and h. The black circle indicates a region with poor overlap. See Supplementary Notes 10–13 for other metals.

Fig. 5: Chemical diversity in the ligands database.

Distribution of (left) total charges and (right) hapticity modes in the database of 13,819 unique ligands. The inset shows the single case of a ligand with −5 charge, in QORFAG.

Mining the CSD

We have shown that cell2mol is able to interpret molecular crystals and construct databases with great chemical diversity. So far, this has been done exclusively for what we defined as the test set, which accounts for ca. 50% of the total CSD entries. Ideally, databases would be constructed from the whole of the CSD, and not be restricted to a fraction of its chemical space. Since cell2mol is not a statistical method, the success rate and reliability established above for the test set should, in principle, hold for the prediction set, provided that the set of chemical rules in cell2mol is transferable enough. To test this, we used 1000 randomly selected mono-metallic CSD entries for each metal. As expected, similar success rates are obtained for all metals (ca. 73%, see Supplementary Note 8). To evaluate the reliability in the absence of metal OS information in the .cif file, we compared the metal OS predicted by cell2mol with the one provided by the ML model. Both methods agree for about 90% of the CSD entries with Mn, Fe, Ni, Cu, and Ru, which is close to the reliability reported for both methods in the test set (see Table 1). However, the agreement surprisingly drops to 70% for Cr, Co, and Re (see Fig. 6). Manual inspection of up to 100 cases with disagreement reveals that cell2mol is typically correct, which hints at deficiencies of the ML model (see Supplementary Note 9) that can be explained by the following two reasons. First, some metals exhibit a very poor correlation between structure (including chemical composition) and their OS (see Fig. 4a and Supplementary Note 10), which decreases the confidence of the ML model when assigning the OS (see Fig. 4e and Supplementary Note 12) and hence its accuracy (89% vs. 96% agreement for Fe vs. Mn). Second, the chemical landscape can be very different in the test vs. prediction sets (see Fig. 4f–h and Supplementary Note 13), which means that the ML model often has to extrapolate. When both problems coincide, as for Cr, Co, and Re, the ML models lose accuracy. Considering that we used all the available data in the CSD to train this model, this behavior is likely unavoidable, and points to a fundamental problem that statistical methods have when mining the rich chemical diversity of the entire CSD. This stresses the relevance of non-statistical alternatives such as cell2mol. Indeed, the most promising route for future work is the combination of a deterministic method for the comprehensive interpretation of the unit cell (e.g., cell2mol) with a local statistical method for the evaluation of specific properties of species when more than one interpretation is possible (e.g., the metal OS in unresolved CSD entries). Future work will focus on the implementation of this scheme, as well as on the improvement/extension of the chemical rules to understand M-L connectivity, and the incorporation of f-block metals.

Fig. 6: Performance of the ML model for the test and prediction sets.

Comparison of the performance of the trained ML model (see Methods) at predicting the metal OS. In blue, the reliability of the model, established by comparison with the .cif files in the test set. In red, the agreement between cell2mol and the ML model in the assignment of metal OS for the prediction set.

Summary

We have presented cell2mol, a tool that encodes chemical concepts and rules to interpret crystallographic data and extract comprehensive information about the individual molecules contained in unit cells. cell2mol can successfully interpret about 75% of the CSD entries containing mono-metallic complexes with a reliability of over 95%. We demonstrated that these metrics surpass other popular methods dedicated to the assignment of metal OS (BVS and ML), with cell2mol being much more versatile. Also, we showed that our software can generate top-down and bottom-up QC-ready databases with incomparable chemical diversity. To demonstrate its capabilities, we have used cell2mol to generate a publicly available database of 31,019 complexes containing eight different metal centers (Cr, Mn, Fe, Co, Ni, Cu, Ru, Re). Additionally, we generated a separate database of 13,819 constituent ligands that can be rearranged to generate billions of realistic new chemical structures. All content is fully searchable and interoperable using chemoinformatics software (e.g., RDKit, SMILES-based tools). We expect that cell2mol, with possible subsequent improvements, will pave the way towards making all crystallographic repositories entirely usable for molecular and materials design purposes.

Methods

CSD entries have been exported with the software ConQuest (version 5.42) included in the CCDC software, with the database updated to May 2021. The pre-filtering has been done with local bash scripts. The Random Forest (RF) model for metal-specific oxidation state prediction was constructed using the RandomForestClassifier implementation of scikit-learn50. The local SLATM (aSLATM) representation51 of the metal center under scrutiny, aimed at capturing the structure and composition of the first metal coordination sphere, was used as the input feature vector. All SLATM vectors were computed using the QML package52 with a modified discretization grid (0.4 a.u.) and cutoff (4.0 a.u.), to accommodate the requirements associated with the large number of element types. The training sets for the RF model were composed of all mono-metallic metal complexes for which a reference oxidation state was available in the CSD entry. Due to the large number of cases for Fe, Co, Ni, and Cu, these were truncated to 4000 random samples. Among the respective training sets, 10 stratified K-folds were performed for cross-validation, from which the overall out-of-sample accuracy was computed, as well as the maximum probabilities for each metal center (i.e., the probability of the assigned OS). The same local SLATM (aSLATM) representation was used to generate the t-SNE projections in Fig. 4, for which we used a perplexity value of 50.