Global analysis of the energy landscapes of molecular crystal structures by applying the threshold algorithm

Polymorphism in molecular crystals has important consequences for the control of materials properties and our understanding of crystallization. Computational methods, including crystal structure prediction, have provided important insight into polymorphism, but have usually been limited to assessing the relative energies of structures. We describe the implementation of the Monte Carlo threshold algorithm as a method to provide an estimate of the energy barriers separating crystal structures. By sampling the local energy minima accessible from multiple starting structures, the simulations yield a global picture of the crystal energy landscapes and provide valuable information on the depth of the energy minima associated with crystal structures. We present results from applying the threshold algorithm to four polymorphic organic molecular crystals, examine the influence of applying space group symmetry constraints during the simulations, and discuss the relationship between the structure of the energy landscape and the intermolecular interactions present in the crystals.


Threshold Results
CSP energy landscapes and disconnectivity graphs over the whole sampled energy range. Supplementary Figure 9: Connectivity graph from threshold simulations for TTBI in P 1 unit cell over the whole sampled energy range.

PCA and Clustering Results
(a) (b) Supplementary Figure 10: Projection of the structures sampled during the threshold simulations for the co-crystal GAZCES in space group P 2 1 /c, projected onto the first two principle components of (a) symmetry functions and (b) SOAP descriptors. Structures are colored by the energy basin that they occupy.
The HDBSCAN algorithm was applied for clustering with the parameter min_cluster_size, defining the minimum size of clusters, as 5, 10 or 20. HDBSCAN clusters are represented by the colour of edges on the disconnectivity graph, while non-clustered structures are represented with black edges.
For ACSFs, we tested three ways of calculating pairwise distances d between vector r 1 and r 2 : the Euclidean distance the infinite order Minkowski distance: where r i is the i t h component of r, and by calculating the Euclidean distance between each atom and taking the maximal atomic distance as d(r 1 , r 2 ) = max Among these three distance metrics, infinite order Minkowski and maximal atomic approaches gave similar clustering results where almost all the structures were classified into one cluster (Supplementary Figure 13 and Supplementary Figure 14). Using Euclidean distances, neither pbserved structure was assigned to a cluster and all the structures in basin I were identified as noise (Supplementary Figure 12 (a)). By varying the minimum cluster size, we found that either all the structures were identified as one huge cluster at min_cluster_size = 5 or the clusters vanished at larger size. Neither of two known crystal structures, I or II, are clustered into any group when min_cluster_size is 10 or larger. The reason why the clusters vanishes with larger min_cluster_size = 5 is because the HDBSCAN requires the core structure to connect to sufficient neighbours to form clusters. In this case a lot of structures are only reachable to less than 10 other structures and thus identified as noise when min_cluster_size exceeds 10. The failure to cluster the structures in individual basins indicates the dissimilarity between those structures.
The distribution of pairwise distances also indicated that there was no clear difference in distances between structures in the same basin and structures in different basins ( Supplementary Figures 11 (d), 12(b), 13(b), 14(b)).

Basin Identification
In the current strategy, the basin identification of a perturbed structure is by local energy minimization and duplicate comparison with database. In our first implementation of the threshold algorithm, in order not to include any possible uphill process to identify the correct minimum from a perturbed structure, the local energy minimization was a one-step approach: a direct minimization with FIT + DMA. However, we found that this strategy leads to some structures in which gaps have been introduced between layers of molecules. Taking the tetrolic acid threshold simulations in P 1 as an example, we find sets of structures with the same energy (Supplementary Figure 19(a)), but with a wide density range. Consistently, the energy landscape over all density and energy indicates groups of structures at the same energy with density distributing from 0.3 to 0.75 g/cm 3 , implying a loosely packed crystal structures. These structures are found to have layers of molecules divided by vacuum gaps (Supplementary Figure 19(b)). We find that these do not correspond to true energy minima, but occur because of the lack of significant intermolecular forces across these gaps. This is, in part, due to a cutoff that is applied to force field terms, which sets all interactions beyond the cutoff to zero to save computational expense. In our final implementation, and all results presented in this paper, these structures are removed by including an intermediate energy minimization step with a pressure applied, to remove such gaps. However, an alternative approach would be to leave these structures in and treat them as separate energy basins.
The key point here is what type of transitions should be allowed between crystal structures. In terms of energy landscape described by te energy potential with a distance cutoff, any structure should be included and thus those intermediates with vacuum gaps describe valid paths, although structures along the path do not correspond to equilibrium crystal structures. The transition between two crystal structures might go through states which are not crystals at all and these structures might suggest the need for volume expansion along the transition path. However it also makes sense if one is only interested in transition paths where the basins correspond to equilibrium crystal structures. In this case, a solution is to eliminate those structures with vacuum gaps without affecting those well packed structures. To ensure this, we apply the three-stage optimization. Most structures optimize to satisfactory crystal structures in the first step, and stay at the same energy minimum after application and removal of pressure in the second and third steps. However, those structures with vacuum gaps after the first step have these gaps removed when pressure is applied (to minimize the PV energy term) and are returned to the initial energy surface through re-optimization with the pressure term removed. One should bear in mind that here we are not focusing on local minimization, but the identification of which minimum the perturbed structure belongs to. Therefore both strategies are feasible and it is hard to say which one is better. One reason why we finally adopt the compressing minimization is because the duplicate identification of structures with vacuum gaps is quite difficult. The length of vacuum gap will not arouse energy variation with layer structure fixed, which forms a contour on the energy landscape. In principle, any structures on the contour should be identified as one unique structure to connect two structures with the contour in their transition path. However, due to lack of experience of dealing these vacuum gaps structures, we have not got an efficient way to identify these vacuum gap structures and cluster them. Therefore the compressed unit cells are finally utilized. If one wants to investigate all the transitions on the energy landscape, these contours formed by vacuum gaps should be taken into consideration and methods to deal with those structures need to be developed. In this project, we prefer all the intermediates still in crystal form and rule out all the minima with vacuum gaps.

Duplicate identification
Crystal structures were first compared using similarity of their calculated powder XRD patterns. Powder XRD patterns are calculated by Platon [1], normalized (such that their area under the curve was unity) and then compared with each other by the constrained Dynamic Time Warping (CDTW) algorithm, which is the summation of Euclidean distance with Sakoe-Chiba global constraint and window length being 10. The tolerances for comparing pairs of structures were 1.0 kJ·mol −1 and 0.05 g·cm −3 .
Matched structures were checked using the COMPACK algorithm, [2] which compares the intermolecular environment between two crystal structures. As to the intermolecular environment comparison, a 30-molecule cluster is constructed surrounding the molecules in the asymmetric unit for the reference and comparison structures. The origins of two clusters are positioned at the same point and the best overlapping for every molecule is searched by the action of rotation only [3]. The two molecules are considered matched if interatomic distances are within 20% tolerance and angles are within 20 degrees tolerance. The current tolerance for an identical pair is a match of 30 out 30 molecules with root mean squared distance less than 0.5 Å.

Convergence and adaptive sampling
The threshold algorithm trajectories were run for fixed numbers of steps. We also implemented methods with flexible number of steps automatically adjusted under each lid energy, but have not found a satisfactory convergence criterion, so suggest the fixed step method for now.
Several criteria were applied and tested, using the benzamide crystal structure landscape in space group F 2dd as a test. The idea was that more steps are needed as the lid energy increases, due to a larger accessible configurational space. The first property tested for monitoring convergence was the acceptance ratio. Three simulations were run under three lid energies: 5 kJ/mol, 25 kJ/mol and 50kJ/mol from the same initial structure. The variation of acceptance ratio with iteration is shown in Supplementary Figure 20 (a) for each lid energy. The acceptance ratio increased with the rising lid energy as our expectation. The variance of the acceptance ratio converged to 0 as the trajectories proceeded, but the rate of convergence was not what we expected (Supplementary Figure 20 (b)). The rate of convergence were similar under 25 kJ/mol and 50 kJ/mol and thus could not distinguish the length of simulations reqired for these two lid energies. The second property investigated was the ratio of structures with energy within 1 kJ/mol window from lid energy. This could be viewed as a projection of energy distribution ( Supplementary Figure 21 (a)). However, the convergence rate of this ratio was not found to vary systematically with lid energy (Supplementary Figure  21

Atom-centered symmetry functions
Atom-centered symmetry functions (ACSFs) are a descriptor of local atomic environments that are invariant to translation, rotation and permutation [4]. The symmetry function describes an N -atom system by an atomic environment vector G = {G 1 , G 2 , ..., G N } where G i represents the i th atom's radial and angular chemical environment individually implying two-body and three-body interactions. In the actual numerical calculation, G i approximates both radial and angular features in a local environment and the local environment is defined by a piece-wise cutoff functions (Supplementary Equation (1)) where R c is the cutoff radius. The cutoff function f c is designed to be continuous and differentiable, and ensures that the symmetry functions tend to 0 with distance exceeding the cutoff radius.
The radial functions are constructed as sums of two-body terms as (2)) where η and R s individually represents the width and center of Gaussian, and index m indicates a specific set of R s parameters. In this work multiple R s are used to probe specific regions of radial environments. With the same width, the center of Gaussian shifting with R s and height of the radial function is controlled by the cutoff function.
The angular functions are constructed as summations of three-body cosine functions of the angles θ ijk = arccos(

(Supplementary Equation (3))
The parameter θ s shifts the local angular environment, while R s here controls the radial centre of the function. The two parameters controlling width of distributions η and ζ are kept constant throughout calculations. With the above approaches, for each atom a radial and angular vector is obtained by caluclation of the symmetry functions at a series of parameters. However,in its original form, the local environment is only different on atomic positions and not distinguishable with elements, i.e., atomic numbers, which would find difficulty identifying different atoms at similar distance. To solve this issue, rather than summing over all elements, the local environments are represented by a radial part for each atomic number and a angular function for each atomic pair, which gives L radial vectors and L(L − 1)/2 angular vectors for L atomic numbers [5]. The dimension of a atomic environment vectors are larger than the degrees of freedom of the system, ensuring enough dimensionality and thus features of the system captured. The symmetry functions are able to describe symmetrical features with identification of elements and perform better in diverse chemical environments.

SOAP REMatch Kernel
In the SOAP kernel, the local region of each system is described individually, expanded by a set of orthonormal basis functions and a similarity matrix between systems is evaluated by dot products to simplify the calculation. The matrix containing the complete information on the pairwise similarity is further transformed by various approaches, in our case, REMatch kernel.
The basic idea of SOAP descriptor is comparing structures based on distributions of atoms in the local environment of each atom. The local density of atoms is represented by a sum of Gaussians centering on all the atoms within the local environment χ as (4)) where σ is the variance of the Gaussian. In a similar way to symmetry functions, a cutoff function is multiplied by the atomic density to select the atoms from central atom smoothly. The SOAP kernel defines the overlapk between two environments χ and χ by the integral over the space as well as rotationsR in three dimensions k(χ, χ ) = dR drρ χ (r)ρ χ (Rr) 2 .
(Supplementary Equation (8)) which will always hold since p is a unit vector. Given a database of structures as input, the power spectrum vector p is calculated individually for each structure, and the local similarity matrix is estimated for each pair of structures. Then the question is how to evaluate the similarity of two systems with the local similarity matrix. Here we note matrix C(A, B) = k(χ A , χ B ) and component C ij (A, B) = k(χ A i , χ B j ) for structures A and B. The approach applied here is to calculate the similarity K(A, B) is the regularized-entropy match (REMatch) kernel, defined aŝ (Supplementary Equation (9)) Parameter γ controls the contribution of entropy penalty and the calculated similarity is normalized to make self-similarity unity.