Introduction

Polymorphism and the prediction of the energetic stability of a crystal polymorph are a fundamental problem of condensed matter physics, especially for the research and applications of molecular crystals. Polymorphism is the capability of solid materials to form more than one distinct crystal structure1,2. It is particularly pronounced when multiple atomic or molecular packing arrangements are characterized by a similar free energy. The physicochemical properties of these systems, such as mechanical and optical characteristics, melting point, chemical reactivity, solubility, or stability are tied strongly to the crystal morphology, therefore increasing the relevance of a comprehensive structure screening and the prediction of the relative stability of polymorphs for a broad range of industries3.

High-throughput computational screening of crystal structures based on free energies is rarely performed due to its high complexity as well as large computational effort, in particular if a first-principles potential energy surface is required4. It is more common to evaluate the relative stability of crystal polymorphs by calculating the lattice energy taking into account only potential energy contributions5,6,7,8,9, effectively disregarding enthalpic and entropic contributions at finite temperature2,10. Finite pressure contributions when comparing different phases at different pressures is typically of a lower magnitude, reaching only about 1 kJ mol−1 per molecule for pressure difference of several gigapascals. It was shown11 that even if the vibrational free energy difference between two given polymporphs lies typically around 2 kJ mol−1 per molecule, it is sufficient to cause a rearrangement of the polymporph relative stability ranking. Furthermore, even when the vibrational contribution to the relative stability is taken into account in a number of cases, the effect of the thermal expansion of the crystal unit-cell on the free energy is most frequently omitted. This is due to the typically low impact of the thermal expansion on the free energy (around 1–2 kJ mol−1 per molecule12), which is, nevertheless, also sufficient to affect the polymporph stability ranking.

The vibrational part of the free energy can be accessed by, among others, two straightforward types of calculation: within the harmonic approximation given by lattice dynamics calculations13,14 and with statistical sampling methods that accounts for all anharmonic contributions, for example via thermodynamic integration (TI)15,16,17,18,19. Even though methods like TI are more accurate, they are also extremely computationally demanding, requiring a large amount of statistical sampling in order to achieve the necessary accuracy. This renders this technique often impossible to carry out within a high-throughput setting. Approximations to the contribution of anharmonic terms to the free energy can be accessed by a number of other methods that are less computationally demanding. However, such approximations have been shown not to present a significant improvement over the much less computationally demanding harmonic approximation for the investigation of polymporph relative stability20. Still, harmonic lattice dynamics are not a viable solution for high-throughput screening if force evaluations are a bottleneck, since the calculation typically involves hundreds of force evaluations for a single structure (or costly perturbation theory techniques), considering the full unit cell.

Within the last decades, the rapid increase of computer power, allied to the rise of machine learning (ML) and big-data algorithms in the realm of material science, allowed for large-scale screening of materials properties, including those related to polymorphism10,21,22,23,24,25,26,27,28. There are only a handful of examples where vibrational free energies29, or other quantities related to the vibrational density of states30,31,32,33, were successfully predicted with the assistance of ML methods. Those methods, however, do not focus on high transferability, or, if they do, rarely achieve the necessary accuracy to differentiate between polymorphs. Clearly, if one could train a very accurate ML interatomic potential for a large class of systems, it would represent the best solution for the evaluation of lattice energies and free energies at the same time. However, despite the exceptional performance of many such potentials, typical root-mean-square errors on the forces lie around 20 meV Å−1 per atom34,35,36,37,38,39. With such errors, the expected prediction accuracy of phonon modes is ±0.15 THz for the best performing potentials34. If the resulting phonon accuracy, as in ref. 34, is assumed to be constant along the entire frequency range, the harmonic free energy calculation error amounts to 0.38 kJ mol−1 per atom.

In this study, we target high accuracy and low computational cost for harmonic free energy predictions. We build a model for the prediction of Helmholtz harmonic free energies of molecular crystals based on Gaussian process regression (GPR) and smooth overlap of atomic positions (SOAP)40 descriptors for representing the local atomic environments. We optimize the training and validation set selection with a computationally cheap empirical potential, confirm its transferability to a first-principles potential, and proceed to achieve a model with first-principles accuracy with a very low cost of training. For a set of hydrocarbon crystals, we are able to achieve a mean absolute error on the free energies of 0.04 kJ mol−1 per atom. We analyzed the stability ranking for a few families of hydrocarbon crystal polymorphs up to 300 K, highlighting the power and accuracy of the model. Furthermore, this method can predict the anisotropic lattice expansion of these crystals, allowing a cheap evaluation of volume expansion and free energies in the quasi-harmonic approximation.

Results and discussion

Because it was shown20 that the harmonic approximation to the free energy can be a suitable estimate for the computation of the relative stability between different structures of molecular crystals, this project focuses on predicting the harmonic Helmholtz free energies F. Contributions from pressure that would be described instead by the Gibbs free energy are not considered, because the structures regarded in this study are typically observed much below 1 GPa of pressure, making this contribution to the free energy negligible. Throughout this paper, for the sake of simplicity, F is evaluated at the Γ point of the Brillouin zone of a given unit cell. We consider unit cells larger than the primitive cell where needed (see “Methods” section). The harmonic free energies are thus calculated as

$$F(V,T)=\mathop{\sum }\limits_{i=1}^{3N-3}\left(\frac{\hslash {\omega }_{{{{\rm{i}}}}}}{2}+{k}_{{{{\rm{B}}}}}T{{{{\mathrm{ln}}}}}\,\left(1-{e}^{-\frac{\hslash {\omega }_{{{{\rm{i}}}}}}{{k}_{{{{\rm{B}}}}}T}}\right)\right),$$
(1)

where ωi is the frequency of a given phonon mode at the Γ point. When taking lattice expansion into account, the vibrational frequencies depend indirectly on the temperature such that ωi = ωi(V(T)).

Definition of the GPR model

The key assumption of the free energy prediction approach explored in this project is that even if free energies are defined only for the entire collection of atoms of the crystal structure, they can be decomposed into local contributions of atomic environments. The approach of casting a global property on local environments was explored previously41,42 for the generation of an interatomic potential from quantum mechanical data. The problem of the harmonic Helmholtz free energy prediction is approached by connecting the atomic-wise free energy to the full free energy by

$${{{{{\bf{F}}}}}}={{{{{{\bf{M}}}}}}}^{{{{{{\rm{T}}}}}}}{{{{{\bf{f}}}}}},$$
(2)

where F is the vector with all measured free energies for a given crystal set of dimension Ns (number of crystal structures in the training set), M is an incidence matrix of dimension Ns × Nae (number of atom environments in the given set) and f is the vector of all, unobserved, atom-wise free energies in the chosen ensemble. Then, the prediction of f in the training set is modeled as

$${{{{{\bf{f}}}}}}^{\prime} ={{{{{\bf{C}}}}}}{{{{{\boldsymbol{\alpha }}}}}},$$
(3)

where C is the matrix containing the similarities between pairs of atomic environments (dimension Nae × Nae), defined as

$${C}_{{{{\rm{ij}}}}}=\sigma {{{\mbox{e}}}}^{-\frac{\mathop{\sum }\nolimits_{{{{\rm{d}}}} = 1}^{D}{\left({q}_{{{{\rm{d,i}}}}}-{q}_{{{{\rm{d,j}}}}}\right)}^{2}}{2{l}^{2}}},$$
(4)

where σ is a scaling prefactor, and qi is a vector of length D describing local atomic environments. Cij corresponds to the Gaussian kernel. In Eq. (3), α is a vector of Nae weights for each atomic environment, such that

$${{{{{\bf{F}}}}}}^{\prime} ={{{{{{\bf{M}}}}}}}^{{{{{{\rm{T}}}}}}}{{{{{\bf{C}}}}}}{{{{{\boldsymbol{\alpha }}}}}}.$$
(5)

Opening up this equation element-wise, the full free energy of one sample i in the training set is given by

$${F}_{{{{\rm{i}}}}}^{\prime}=\mathop{\sum }\limits_{{{{\rm{j}}}}=1}^{{{{{\rm{N}}}}}_{{{{{{\rm{ae}}}}}}}}\mathop{\sum }\limits_{k=1}^{{{{{\rm{N}}}}}_{{{{{{\rm{ae}}}}}}}}{\left({{{{{{\bf{M}}}}}}}^{{{{{{\rm{T}}}}}}}\right)}_{{{{\rm{ij}}}}}{C}_{{{{\rm{jk}}}}}{\alpha }_{{{{\rm{k}}}}}.$$
(6)

Optimizing the weights αk is equivalent to minimizing the loss function

$$L=\mathop{\sum }\limits_{{{{\rm{i}}}}}^{{{{{\rm{N}}}}}_{{{{\rm{s}}}}}}{\left[{F}_{{{{\rm{i}}}}}^{\prime}-{F}_{{{{\rm{i}}}}}\right]}^{2}+{\sigma }_{\epsilon }^{2}{{{{{{\boldsymbol{\alpha }}}}}}}^{{{{{{\rm{T}}}}}}}{{{{{\bf{C}}}}}}{{{{{\boldsymbol{\alpha }}}}}},$$
(7)

where \({\sigma }_{\epsilon }^{2}\) is a regularization parameter related to the variance of the noise of the data.

Finally, substituting Eqs. (6) into (7), the minimization is straightforward and leads to

$${{{{{\boldsymbol{\alpha }}}}}}={{{{{\bf{M}}}}}}{\left({{{{{{\bf{M}}}}}}}^{{{{{{\rm{T}}}}}}}{{{{{\bf{C}}}}}}{{{{{\bf{M}}}}}}+{\sigma }_{\epsilon }^{2}{{{{{\bf{I}}}}}}\right)}^{-1}{{{{{\bf{F}}}}}},$$
(8)

where I is the identity matrix of dimensions Ns × Ns. In this way, one can obtain the optimized weights with no need to define or observe atom-wise free energies.

Finally the prediction of the free energy of a new structure that is not contained in the training set is achieved by calculating

$$F({{{{{{\bf{q}}}}}}}^{* })=\mathop{\sum }\limits_{i=1}^{N}\mathop{\sum }\limits_{j=1}^{{N}_{{{{{{\rm{ae}}}}}}}}{\left({{{{{{\bf{C}}}}}}}^{* {{{{{\rm{T}}}}}}}\right)}_{ij}{\alpha }_{j}$$
(9)

where C* is the similarity matrix between the atomic environments q* of the new structure to the ones in the training set, with elements

$${C}_{{{{\rm{ij}}}}}^{* }=\sigma {{{\mbox{e}}}}^{-\frac{\mathop{\sum }\nolimits_{{{{\rm{d}}}} = 1}^{{{{\rm{D}}}}}{({q}_{{{{\rm{d,i}}}}}^{* }-{q}_{{{{\rm{d,j}}}}})}^{2}}{2{l}^{2}}}.$$
(10)

All hyper-parameters for the GPR model and the representations were selected by minimizing, using the steepest descent method, the negative log marginal likelihood function43

$$\begin{array}{ll}-{{{{\mathrm{ln}}}}}\,P({{{{{\bf{F}}}}}}| (l,{\sigma }_{\epsilon },{{{{{\boldsymbol{\theta }}}}}}))\,=\,\frac{1}{2}{{{{\mathrm{ln}}}}}\,| {{{{{{\bf{M}}}}}}}^{{{{{{\rm{T}}}}}}}{{{{{\bf{C}}}}}}(l,{{{{{\boldsymbol{\theta }}}}}}){{{{{\bf{M}}}}}}+{\sigma }_{\epsilon }^{2}{{{{{\bf{I}}}}}}|\\ \qquad\qquad\qquad\qquad\quad+\,\frac{1}{2}{\left({{{{{{\bf{M}}}}}}}^{{{{{{\rm{T}}}}}}}{{{{{\bf{C}}}}}}(l,{{{{{\boldsymbol{\theta }}}}}}){{{{{\bf{M}}}}}}+{\sigma }_{\epsilon }^{2}{{{{{\bf{I}}}}}}\right)}^{-1}+\,\frac{1}{2}{{{{\mathrm{ln}}}}}\,2\pi , \end{array}$$
(11)

where θ is a vector containing the hyperparameters of the representations entering q. The application of the steepest descent method is only guaranteed to find a local minimum. A wide space of hyper-parameters was considered in order to increase the probability of finding a global minimum.

In all supervised machine learning based models, the quality of the model strongly depends on the quality of the training set. Typically, selecting the training set can be done by either a random selection of samples, given that the considered ensemble is fairly homogeneous, or by implementing methods that aim at covering the sampled domain by maximizing the resulting prediction accuracy, such as the "correlation” clustering method44, genetic optimization45, or k-fold cross-validation46. Unfortunately, most of the methods from the latter group require a large pool of data for which the target property, like free energy in this case, is available. In this study, because one of the objectives is to minimize the computational cost of obtaining a good training set, the applied procedure focuses on selecting an optimal training subset based exclusively on the geometrical parameters of the crystal structure.

For this purpose, the farthest point sampling (FPS)47 method is applied, that searches for a subset of the entire investigated crystal structure ensemble that covers evenly all structural motifs of the sampled domain with minimal information overlap. First, a similarity measures between molecular crystal structure Ra→b is defined according to the best-match structural kernel48 method, as it is needed for the application of the FPS

$${R}_{a\to b}=\frac{1}{{n}_{{{{\rm{b}}}}}}\mathop{\sum }\limits_{{{{\rm{i}}}}=1}^{{n}_{{{{\rm{b}}}}}}\mathop{\max }\limits_{{{{\rm{j}}}}\in (1,{n}_{{{{\rm{a}}}}})}C({{{{{{\bf{q}}}}}}}_{{{{\rm{j}}}}}^{{{{\rm{a}}}}},{{{{{{\bf{q}}}}}}}_{{{{\rm{i}}}}}^{{{{\rm{b}}}}}),$$
(12)

where \(C({{{{{{\bf{q}}}}}}}_{{{{\rm{j}}}}}^{{{{\rm{a}}}}},{{{{{{\bf{q}}}}}}}_{{{{\rm{i}}}}}^{{{{\rm{b}}}}})\) is the kernel matrix element defined in Eq. (4), \({{{{{{\bf{q}}}}}}}_{{{{\rm{j}}}}}^{{{{\rm{a}}}}}\) and \({{{{{{\bf{q}}}}}}}_{{{{\rm{i}}}}}^{{{{\rm{b}}}}}\) are jth and ith atomic environment representations of structure a and b respectively, and similarly na and nb are the number of atoms in those structures. Ra→b defines how well atoms of structure a can represent geometrical motifs of structure b and Ra→b ≠ Rb→a. In other words, it is possible that atomic environments of structure a represent well those of structure b, while structure b contains geometric features not present in a. This method of defining the relationship between crystal structures is very similar to others typically chosen for such tasks40,48,49,50,51, with the difference that Ra→b is not invariant with respect to the crystal structure index (Ra→b ≠ Rb→a) so it is not a similarity metric in a strict sense. Next, according to the FPS algorithm, the training set is created by iteratively picking structures that are least represented by those already present in the training set. Since any crystal structure can be used as the starting point for the FPS algorithm, the applied method selects NΩ potential training sets, where NΩ is the number of crystal structures in the considered ensemble. In order to choose out of NΩ potential training sets, we have investigated the scaled cumulative sum I(Nm) of the Ra→b,

$$I({N}_{{{{\rm{m}}}}})=\frac{\mathop{\sum }\nolimits_{{{{\rm{a}}}}}^{{N}_{{{{\rm{m}}}}}}\mathop{\sum }\nolimits_{{{{\rm{b}}}}}^{{N}_{{{\Omega }}}}{R}_{a\to b}}{\mathop{\sum }\nolimits_{{{{\rm{a}}}}}^{{{{{\rm{N}}}}}_{{{\Omega }}}}\mathop{\sum }\nolimits_{{{{\rm{b}}}}}^{{{{{\rm{N}}}}}_{{{\Omega }}}}{R}_{a\to b}},$$
(13)

where Nm is the total number of the molecular crystal configurations in the training set. This quantity reveals how fast a given training set candidate converges to unity, which we consider to represent a full coverage of the sampled feature space. In another sense, the I(Nm) quantity can be seen as the description of the information acquisition during consecutive steps of the FPS algorithm. Finally, training set with the highest recorded value of I(Nm) after all Nm = 60 steps of the FPS algorithm is chosen. The training set size of Nm = 60 was chosen because above this number, the improvement of the prediction accuracy was too small to justify a larger training set and the associated increase in computational effort.

In the same spirit of maximizing accuracy and minimizing cost, with the objective of performing free energy predictions with ab initio data, the aim was to select an efficient and reliable validation set, without using the entire ensemble. Here the goal is to create such a subset that would represent well the entire set, so as to include, for example, a proportional number of outlier structures as found in the entire set. A random selection of validation set would not fulfill this criterion due to the limited size of validation set used in this project. Additionally, this task largely differs from selecting the training set, because it typically contains a greater relative number of outliers compared to the entire set. In order to optimally select the validation set, while preserving the density of outliers, a stratified approach was used. Here each crystal structure a is assigned a similarity index Sa, that compares a given crystal to entire set

$${S}_{{{{\rm{a}}}}}=\mathop{\sum }\limits_{{{{\rm{b}}}}}^{{{{{\rm{N}}}}}_{{{\Omega }}}}{R}_{a\to b}.$$
(14)

The relatively high values of Sa indicate a “typical” crystal and low values indicate “outliers”. Next, the entire set is sorted with respect to Sa and the validation set is chosen by selecting every nth element of the sorted set, with n = round(NΓ/NΘ) where NΓ and NΘ are the target numbers of structures in the validation and training sets. All sets sorted with respect to Sa are presented in Supplementary Fig. 2.

Within the discussed framework, and common to many ML models, the choice of method encoding the atomic environments to numerical representations has an impact on the resulting performance of the model. In this project, three well-established general-use atomic environment representations52 were selected and tested, namely: SOAP40 that uses spherical harmonics to locally expand atomic densities, many body tensor representation (MBTR)53 that uses distributions of different structural motifs (like radial or angular distribution functions) and atom-centered symmetry functions (ACSFs)54 that use two-body and three-body functions detecting specific features. The Python implementations of the mentioned representations found in the DScribe package55 were used.

Model implementation and validation

For the purpose of this work we have chosen crystals composed of seventeen different hydrocarbons: pyrene, methylcyclopentane, styrene, naphthalene, benzene, tetracene, mesitylene, pentane, pentacene, hexane, ethylbenzene, propane, heptane, phenanthrene, butane, hexacene, and anthracene. We have included most available polymporphs that could be obtained from the Cambridge Crystallographic Data Center56 (CCDC), leading to an ensemble of 74 structures. We noted that polymporphs of very similar lattice constant in CCDC tend to be almost identical, with close to negligible differences in atomic positions, for example, the case of ANTCEN20 and ANTCEN22. Finally, the sample domain was further expanded by introducing structures with perturbations of roughly 5% in the lattice parameters, as this can lead to up to 16% increase in unit cell volume—a typical volume expansion percentage for molecular crystals57. The addition of crystal structures with strongly perturbed lattice parameters was found to be crucial for the later prediction of lattice expansion coefficients. Finally, NΩ = 444 crystal structures were considered in this project.

The building and testing of the framework was initially performed using a classical force-field potential (AIREBO, as detailed in “Methods” section). In the first steps of the model verification, the training and validation set selection criterion, based on the FPS method and maximization of I(Nm), was evaluated. For this purpose, based on the classical force field data with prediction performed at 300 K and with SOAP atomic-environment representations, free energy mean absolute error FMAE was calculated

$${F}_{{{\mbox{MAE}}}}=\frac{1}{N}\mathop{\sum }\limits_{{{{\rm{i}}}}}^{{{{\rm{N}}}}}| {F}_{{{{\rm{i}}}}}-{F}_{{{{\rm{i}}}}}^{\prime}| ,$$
(15)

where N is the number of structures for which the prediction is performed. The results are presented in Fig. 1a in the form of learning curves, with increasing size of the training set N and with the validation set. It is visible that the learning curve obtained for the chosen training set, so with the highest I(Nm), shows one of the lowest FMAE at the target training set size among all potential sets obtained using FPS method.

Fig. 1: Performance of the developed prediction model.
figure 1

a Learning curves obtained for 100 randomly chosen learning sets out of total 444 in gray, and the chosen learning set with the highest recorded value of I(Nm) in red. Data was obtained using SOAP representation and the classical force field model with the free energy calculated at 300K. b Correlation between predicted and calculated free energies at 300 K (classical force field, SOAP representations). Different crystal families are represented by different colors.

Next, the linear and monotonic correlations between benchmark F and predicted \(F^{\prime}\) values was assessed by calculating the Pearson and Spearman correlation coefficients. For predictions performed at 300 K with the SOAP representation they were found to be 0.9996 and 0.9894, respectively. A value so close to 1 for these coefficients indicate a good performance of the developed framework. Furthermore, due to the low cost of the lattice dynamics calculations performed using classical force field, the FMAE was inspected for the entire set (300 K with the SOAP representation) and it was found to be 0.042 kJ mol−1 per atom. Additionally, the FMAE = 0.218 kJ mol−1 per atom was obtained for 10% of the crystals with the poorest prediction and 0.023 kJ mol−1 per atom for the remaining 90% of samples.

Figure 1b shows the predicted free energy values \(F^{\prime}\) compared with the benchmark data F for the different crystal families. The analysis gives an indication of the system-sensitive performance of the framework, revealing that crystals of pentane, pentacene, tetracene, and hexane are characterized by the poorest averaged prediction accuracy, with the FMAE around 2 kJ mol−1 per molecule, reaching a possible free energy difference between different polymporphs11,12. Additionally, the predictions performed for crystal structures with strongly perturbed lattice parameters were noticeably poorer, even if the training set contained parental crystal structures. Nevertheless, the prediction accuracy overall is very high, especially considering the diversity of hydrocarbons represented.

Figure 2a shows the learning curves by monitoring FMAE at 300 K with increasing training set size and a constant validation set. Additionally, the impact of the atomic environment representation on the efficiency of the method was investigated. The learning is well-behaved for all representations, as expected for properly parameterized machine learning models. The results obtained with the SOAP representation, with a 6 Å cutoff and 1 Å for the standard deviation of the Gaussian functions used to expand the atomic density, are characterized by the lowest FMAE, showing that it is the best representation within the investigated set. Finally, the accuracy of the predictions are noticeably affected by the temperature at which the free energies are required, going from 0.019 kJ mol−1 per atom at 300 K and 0.015 kJ mol−1 per atom at 200 K to as low as 0.002 kJ/mol at 0 K.

Fig. 2: Learning curves of FMAE (300 K) calculated for the training (dashed line) and the validation set (solid line) obtained with SOAP, MBTR and ACSF representations.
figure 2

Results are presented for GPR models obtained based on: a the empirical AIREBO force field and b density-functional theory (PBE functional with pairwise van der Waals corrections) data, and are presented as a function of the number of crystal structures in the training set. Error bars are equal to the standard deviation of FMAE of training sets with different structures.

Finally, it is worth noting that the atom-wise contributions to F are not guaranteed to have any physical meaning, but sometimes their analysis can give interesting insights. In this case, as further discussed in Supplementary Note 1, this analysis shows that there are no trends in the contribution to the free energy from different atomic species or classes of bonded atoms, which could be related to the strongly non-local character of the free energy.

Transferability of the prediction model

Once the framework was built and proven to deliver a satisfactory prediction of harmonic free energies based on data coming from an empirical potential, the transferability of the model when using DFT data was investigated. For that, the PBE exchange correlation functional with pairwise van der Waals interactions was employed, as detailed in “Methods” section. As a test, the similarity between relaxed structures obtained with the empirical potential and DFT was assessed by analysing the root mean square deviation (RMSD) of the atomic positions averaged over entire set. RMSD for carbon and hydrogen were 0.16 Å and 0.20 Å, respectively. Importantly, differences in the SOAP representation were also investigated by calculating the root mean square error normalized by the standard deviation ϵX, defined as

$${\epsilon }_{{{{\rm{X}}}}}=100\times \sqrt{\frac{\mathop{\sum }\nolimits_{{{{\rm{d}}}}}^{{{{\rm{D}}}}}\frac{1}{N}\mathop{\sum }\nolimits_{i}^{{{{\rm{N}}}}}{\left({q}_{{{{\rm{d,i}}}}}^{{{{{{\rm{FF}}}}}}}-{q}_{{{{\rm{d,i}}}}}^{{{{{{\rm{DFT}}}}}}}\right)}^{2}}{\mathop{\sum }\nolimits_{{{{\rm{d}}}}}^{{{{\rm{D}}}}}\frac{1}{N-1}\mathop{\sum }\nolimits_{{{{\rm{j}}}}}^{{{{\rm{N}}}}}{\left({\bar{q}}_{{{{\rm{d}}}}}^{{{{{{\rm{DFT}}}}}}}-{q}_{{{{\rm{d,i}}}}}^{{{{{{\rm{DFT}}}}}}}\right)}^{2}}},$$
(16)

where D is the number of features of the representation and N is the number of atoms for which the ϵX was calculated. Obtained values for both carbon and hydrogen are ϵC = 1.09 and ϵH = 1.15, respectively. Those results show that the overall structural features are in good agreement in these two potential energy surfaces. As a consequence of this structural similarity between two data sets, the training and validation sets obtained with the empirical potential, as explained in the previous section, can be automatically used in DFT. As a cross-check, the same training and validation set optimization procedure were independently applied on the optimized DFT structures, indeed obtaining the same results. This proved that the experience gathered from the first phase of the project, where only classical data was used, is fully transferable to the current stage, where we employ more accurate ab initio data. As a result, the more expensive ab initio lattice dynamics calculations were only performed for crystal structures included in the training and validation sets, greatly reducing the computational cost of the model generation.

Finally, the hyperparameters of the GPR model were re-optimized and were used to calculate learning curves for training and validation sets shown in Fig. 2b, with the SOAP, MBTR, and ACSF representations. All representations presented a good performance, with MBTR and SOAP yielding very similar learning curves. The obtained FMAE for the SOAP representation at full training set was found to be 0.038 kJ mol−1 per atom. Interestingly, a fairly good prediction performance can be obtained with as little as 20 crystal structures, resulting in FMAE = 0.07 kJ mol−1 per atom. Such small training sets typically do not contain all different molecular components of the crystals that are present in the entire set, but can still describe it well. The remainder of this manuscript will focus on results obtained based on the DFT data with the SOAP representation, exclusively.

The proposed framework is summarized in the flowchart in Fig. 3. In addition, as it is shown in the SI, the possibility of this model trained only on hydrocarbons to extrapolate to systems containing carbon, hydrogen, and nitrogen atoms was investigated. Although the prediction accuracy decreases as the concentration of nitrogen atoms in the samples increases, the model is not completely invalid. It shows that with a small addition of structures to the training set or building representations for new atoms that combine characteristics of the atoms that were previously trained58,59, this framework could be easily extended to other systems.

Fig. 3: Flowchart of the developed framework.
figure 3

An optimal and small training set is obtained by training the model on force field data (green boxes) and then used to train the model on DFT data (blue boxes).

Relative free energies of molecular crystals: stability ranking

The GPR model was employed to create a stability ranking of several families of hydrocarbon molecular crystals. Sixteen crystal families were considered, encompassing 38 polymorphs and 36 variants corresponding to different thermodynamical conditions with lattice parameters as they are given by the CCDC56. Additionally, 370 crystal structures with randomly distorted lattice parameters derived from the initial 74 were included. Figure 4 shows the lattice energy and the free energy obtained at various temperatures, presented as relative values to the crystal structure characterized by the lowest free energy at 300 K (full data is found in Table S2, in the SI). The identifiers of all crystal structures follow the convention used in CCDC56. For many crystal families, the structure with the lowest lattice energy is not the same as the one with the lowest free energy especially at the room temperature. A clear example is the pyrene crystal and its three polymorphs: Form I is represented by PYRENE02 and PYRENE03 (structures measured at 423 K and 113 K, respectively, and ambient pressure); Form II is represented by PYRENE07 and PYRENE10 (at 93 and 90 K, ambient pressure); and Form III is represented by PYRENE08 and PYRENE09 (measured at at 0.3 GPa and 298 K, and at 0.5 Gpa and 298 K, respectively). Form I is measured to be more stable than form III at all temperatures up to and beyond 430 K, at ambient pressure. Here, it is shown that the energy ranking formed based on lattice energy exclusively would place the high-pressure form III PYRENE09 (form III) structure very close to PYRENE02 (form I). An inclusion of zero-point-energy and vibrational contributions already at low temperatures irrevocably destabilizes form III.

Fig. 4: Lattice energy Elat. and predicted free energy differences between variants of four molecular crystal families: anthracene, naphthalene, benzene, and pyrene.
figure 4

Full data, also for a larger number of crystals and polymporphs, is available in the Table S2 in the SI. Relative energies are calculated separately at each temperature, always with respect to the lowest free energy structure at 300 K. The numbers on the labels represent the identifiers of the crystals following the convention used in CCDC56.

A similar example is the benzene crystal. Here, structures of the ambient-pressure form I, represented by, for example, BENZEN15, BENZEN19, or BENZEN26, are characterized by overall lower free energy comparing to the high-pressure form II structures, like BENZEN16 and BENZEN17. Interestingly, for this crystal family, the lattice energy can provides a satisfactory relative stability ranking. However, the need for including the vibrational contributions becomes visible once a high and ambient pressure variants of one polymorph are compared, e.g., BENZEN13 and BENZEN26. It is visible in Fig. 4 that if considering only lattice energies, BENZEN13 shows the lowest energy compared to other crystal variants, with lattice energy lower than that of BENZEN26 by 2.58 kJ mol−1 per molecule. However, the free energy prediction shows that at 300 K, the BENZEN26 structure becomes the most stable out of all those investigated, and its relative free energy with respect to BENZEN13 is now lower by 2.92 kJ mol−1 per molecule, effectively swapping places in the relative ranking stability with BENZEN26. For this case, and to test the predictions of the model in practice, the free energies for both BENZEN13 and BENZEN26 structures were additionally calculated with DFT. These calculations showed that BENZEN13 is characterized by a free energy that is 3.43 (kJ mol−1 per molecule) higher than that of BENZEN26 at 300 K, confirming the results obtained with the GPR model.

The rearrangement of the relative stability ranking when room temperature free energy is taken into account is a very common trend among the investigated samples, and there are a number of cases, where even at 0 K the zero point energy contribution is high enough to affect the relative stability ranking. These observations are in good agreement with previous studies, where more direct methods were used11. In some cases, the prediction accuracy of this model is not sufficient to determine the relative stability of some structures. Nevertheless, the model is accurate enough to point towards those few that are characterized by the lowest free energies. Here, even only narrowing the pool of considered strucrures can effectively decrease the computational effort of phonon calculations required, if more accuracy is needed.

Predicting lattice expansion

Because one of the challenges in high throughput computational screening of crystal structures is accounting for thermal lattice expansion, the application of the trained free energy model was explored in this context. To illustrate the procedure, a simple case where only one of the lattice parameters is being perturbed was considered. For this purpose the BENZEN11 crystal was chosen, with the lattice parameter a being sampled within 6.52 and 7.32 Å. Next, within the quasi-harmonic approximation, the free energy was calculated and predicted as a function of a. Figure 5a shows the comparison between the GPR model and DFT calculations for the free energy at 200 K. The optimal lattice parameter a is determined by a Birch-Murnaghan60 fit. While there are small differences between the DFT and the GPR curves, mostly consisting of a shift in energy, the resulting optimal lattice parameter a is very similar in both cases, and equal to 6.92 Å and 6.95 Å, respectively. This simple and fairly artificial example illustrates that the prediction accuracy of this framework is sufficient to be employed in the context of the lattice expansion/contraction prediction.

Fig. 5: Application of the developed framework to predict the thermal lattice expansion.
figure 5

a Predicted (red) and calculated (blue, DFT) free energies as a function of one lattice parameter of the BENZEN11 crystal at 200 K. Solid lines correspond to a Birch-Murnaghan fit. b 3D visualization, in lattice parameter space (a, b, c), of the free energy prediction including lattice expansion of the anthracene crystal at 200 K, with 300 different combinations of perturbed lattice parameters. The red line indicates the observed change in the lattice parameters when increasing the temperature from 0 to 300 K. c Relative predicted free energies of benzene crystal structures when considering fixed lattice constants (taken from CCDC), and when considering lattice expansion at 0 and at 300 K.

A more challenging task is the prediction of anisotropic lattice changes. Direct calculations of anisotropic lattice expansion requires lattice dynamic evaluations for, typically, hundreds of structures of the same crystal polymorph, making it a very costly calculation for a high-throughput setting. Although harmonic and quasi-harmonic models61 as well as an approach based on the assumption of the linear relation between free energy and volume have been proposed to overcome this cost62, with this framework these lattice changes can be estimated without relying on any ansatz for the dependence of the free energy on the lattice parameters. It is worth noticing that even if the free energy predictions at various temperatures requires training the ML model multiple times, it happens with minimal computational overhead once appropriate lattice vibrations have been computed. Four molecular crystals were picked, namely P21/a anthracene (ANTCEN), Pbca benzene (BENZEN), P1 pentacene (PENCEN01), and Pbcn styrene (ZZZTKA01) and hundreds of ionic relaxations with the a, b, and c lattice parameters perturbed by around 5% were performed. Next, for each of those perturbed structures free energy prediction at a number of temperatures from 0 to 300 K range was performed.

Figure 5b shows a 3D visualization of free energy predictions for over 300 different combinations of lattice parameters a, b, and c of the ANTCEN crystal. Even with such a high number of sampled lattice parameter combinations, the position of the free energy local minima might not overlap with the gathered data. In this case, in order to find the minimum in this high-dimensional space, an active learning based on the GPR algorithm is employed. Here, the GPR is used as a multi-dimensional, non-linear regressor, as implemented in the scikit-learn63 package. In detail, the following bootstrap procedure is used:

  • Identifying the position of the data point with the lowest free energy value according to the GPR 3D interpolation.

  • For the chosen set of (a, b, c) lattice parameters perform an ionic relaxation and predict the free energy with the trained model.

  • If the predicted free energy of the (a, b, c) sample varies from the free energy obtained by the 3D GPR regression, a new 3D GPR regression is performed, now explicitly including sample (a, b, c), then go back to step 1.

  • If the predicted free energy of the (a, b, c) sample is sufficiently close to the one of the 3D GPR regression (within ±0.1%), then the scheme is stopped and the optimal lattice parameters are considered to be found.

We found that typically only around three additional relaxations and free energy predictions (per temperature) are necessary to achieve sufficient convergence of the lattice parameters. By employing this procedure to predict the anisotropic lattice changes the lattice-parameter change is calculated, as well as the full volume change of the selected crystals, as shown in Fig. S5.

The results obtained can be compared to experimental values where data is available. For anthracene the experimentally measured volume change is \({V}_{{{{{{\rm{290K}}}}}}}^{{{\exp .}}}/{V}_{{{{{{\rm{90K}}}}}}}^{{{\exp .}}}=1.024\)64 and we obtained \({V}_{{{{{{\rm{290K}}}}}}}^{{{{{{\rm{ML}}}}}}}/{V}_{{{{{{\rm{90K}}}}}}}^{{{{{{\rm{ML}}}}}}}=1.034\). For pentacene, the comparison is \({V}_{{{{{{\rm{295K}}}}}}}^{{{\exp .}}}/{V}_{{{{{{\rm{90K}}}}}}}^{{{\exp .}}}\,=\,1.037\)65 and \({V}_{{{{{{\rm{295K}}}}}}}^{{{{{{\rm{ML}}}}}}}/{V}_{{{{{{\rm{90K}}}}}}}^{{{{{{\rm{ML}}}}}}}=1.031\); for benzene \({V}_{{{{{{\rm{270K}}}}}}}^{{{\exp .}}}/{V}_{{{{{{\rm{78K}}}}}}}^{{{\exp .}}}\,=\,1.089\)66 and \({V}_{{{{{{\rm{270K}}}}}}}^{{{{{{\rm{ML}}}}}}}/{V}_{{{{{{\rm{78K}}}}}}}^{{{{{{\rm{ML}}}}}}}=1.068\); for styrene \({V}_{{{{{{\rm{120K}}}}}}}^{{{\exp .}}}/{V}_{{{{{{\rm{83K}}}}}}}^{{{\exp .}}}\,=\,1.017\)67,68 and \({V}_{{{{{{\rm{120K}}}}}}}^{{{{{{\rm{ML}}}}}}}/{V}_{{{{{{\rm{83K}}}}}}}^{{{{{{\rm{ML}}}}}}}=1.009\). The predictions are quite close to experimental data and overall a high degree of anisotropy is observed. Moreover, a deviation from a linear behavior of the free energy change with respect to volume is observed, as shown in Fig. S6.

This framework can thus be used to create the relative stability ranking including the thermal expansion effect on the free energy. Here, one example of how this can impact the relative stability and crystal form of these systems is presented. For this purpose, BENZEN13 and BENZENB26 (high and low pressure variants of the P21/b21/c21/a benzene I polymorph69) are selected, as well as BENZEN16 (a high pressure P21/c benzene II polymorph70). The initial lattice constants were taken from the CCDC. As shown in Fig. 5c, by simply searching for the free energy minimum at 0 K using the procedure described above, BENZEN13 and BENZEN26 were found to end up being characterized by almost identical (predicted) free energies and lattice constants. Further inspection indicated that indeed the BENZEN13 and BENZEN26 structures converged to the same structure, and the same behavior was found at all investigated temperatures. Even if somewhat expected, given that they are high and low pressure phases within the same crystal group and in the absence of any applied pressure it is natural that they both adopt the low-pressure structure, the fact that this result came from the model alone, and that the free energy predictions were able to capture this transition, shows that the method is robust. The BENZEN16 structure is stabilized by 1 kJ mol−1 per molecule upon increasing the temperature from 0 to 300 K, as shown in Fig. 5c. This stabilization is accompanied by an appreciable lattice expansion with a volume increase of around 6% from 0 to 300 K.

In summary, the framework proposed here provides a machine learning model with first principles accuracy for the harmonic Helmholtz free energies of molecular crystals, that is suitable for high-throughput studies. In addition, it was shown that the training and validation set of the model can be optimized using a cheaper empirical potential, and then transferred to first-principles calculations, thus substantially decreasing the cost of training, without sacrificing accuracy.

The model was tested to predict the relative energetic stability ranking of several diverse hydrocarbon polymorphs and distorted crystal structures derived from them, and the changes on this ranking with increasing temperature was studied. We observed that, in most cases, omitting thermal effects and instead using only the lattice energy produces misleading results. Furthermore, it was shown that the model can be efficiently employed to calculate the anisotropic lattice expansion—a task rarely approached due to its complexity and high computational demand when performed at the ab initio level. Unsurprisingly, taking the anisotropic lattice expansion into account leads to further changes in the stability ranking. Naturally, the same framework could be used to predict other quantities derived from vibrational properties, like the vibrational heat capacity.

The strengths of this framework lie in its low computational cost, reliability and accuracy. However, because the model is trained to directly predict free energies, one still has to deal with the computational cost of obtaining optimized structures, which we here obtained from first-principles geometry optimizations. Fitting a machine-learned interatomic potential is becoming more streamlined71, even though these potentials rarely target the accurate description of vibrational properties due to the added complexity of including them in the learning procedure. The presented framework, on the other hand, can be easily combined with any potential that can predict structures in a reasonable manner and has the potential to be more accurate.

Extending this framework beyond hydrocarbon-based crystals could be straightforward, albeit perhaps requiring different training data. We have already observed that the framework is capable of predicting DFT free energies from FF-relaxed structures with promising accuracy (see Supplementary Note 2). Finally, targeting fully anharmonic free energies with ab initio accuracy is still a daunting task that can, nevertheless, profit from the knowledge gained in this study.

Methods

Force field model

Geometry optimization calculations with empirical potentials were performed using LAMMPS72 together with AIREBO73 interatomic potentials. The conjugate gradient minimization algorithm was used with dummy parameters to ensure full convergence, namely 10−25 (1) and 4 × 10−25 kJ mol−1 Å−1 for energy and forces respectively and with 5 × 104 maximum iterations of the minimizer. Phonon calculations with the empirical potentials were performed using the i-PI74 code, considering 2 × 2 × 2 repetitions of the primitive cell. The phonons were calculated by finite differences with a 0.005 Å displacement in all Cartesian directions.

Density functional theory model

All ab initio simulations were performed using the FHI-aims package75. For this purpose, we employed light settings for all atomic species, together with the Perdew-Burke-Ernzerhof exchange-correlation functional76 and many-body dispersion corrections77. We have used 5 × 5 × 5 k-point sampling of the Brillouin zone. A self-consistency convergence criterion of 10−5 eV Å−1 was imposed on the forces, which ensured that energies were converged to 10−7 eV or below. The relaxation was performed using the trust radius version of the Broyden–Fletcher–Goldfarb–Shanno78,79 optimization algorithm with the maximum residual force component threshold equal to 10−4 eV Å −1. Lattice dynamics calculations were performed through finite differences using Phonopy80. The atomic displacements were of 0.002 Å in all Cartesian directions. The size of the supercell was individually chosen for the different molecular crystals, with the requirement that at least twice the distance between molecular centers of mass of adjacent molecules was comprised by the vector lengths in each direction.

Framework development

The framework for the GPR model was developed using Python3 and Fortran95 languages. The SOAP, MBTR, and ACSF representations were calculated using the DScribe55 package.