Introduction

First-principles density functional theory (DFT) calculations have established themselves as a routine and reliable tool in computational materials science research1,2,3,4 and have enabled important advancements in materials discovery.1, 2, 5 Although implementations with increasing numerical efficiency and growing computational power have made it possible to simulate ever larger structures with DFT, the method’s intrinsic scaling with the number of electrons prevents applications that require large structures (thousands of atoms) and intensive sampling (millions of configurations). Approximate energy models fitted to DFT reference data, such as cluster expansion (CE) lattice models6,7,8,9 or machine learning regression,10, 11 can overcome these limitations by constructing computationally more efficient models with accuracies that are close to DFT for a chosen structural and chemical space. One prototypical application for approximate energy models is the prediction of ordered ground states based on an underlying lattice topology.12, 13

The concept of CEs goes back to the Ising model,14 which describes the magnetic phases of an atomic lattice in terms of pair interactions. A CE model, or generalized Ising model, is the discrete sum representation of materials properties in terms of lattice site topologies and site interactions, such as site pairs, triplets, quadruplets, and so on. CEs have broad applications in different fields of science,6, 15 such as magnetism15 and alloy thermodynamics.6

The key challenge in constructing CE models is to determine the expansion coefficients, the effective cluster interactions (ECIs), in a robust fashion through a fit to reference configurations.1, 16, 17 Conventional ECI fitting procedures1, 16, 17 focus on minimizing the overall difference between the CE fit and the input configurations with respect to the expanded quantity, such as the energy. In many cases, that input quantity may be determined by an accurate ab-intio method such as Density Functional Theory. One essential requirement that each CE fit must meet for practical applications is ground-state preservation, i.e., a physically accurate CE model must reproduce the ground states of the input if only the input configurations are considered. This requirement is important as the ground states usually govern the material properties at relevant temperature,18 such as finite temperature voltage profiles18 and phase diagrams.18 In this article, we revisit the ECI fitting problem with a focus on constraints that guarantee ground-state preservation. We propose a robust and efficient scheme to construct ground-state preserving CE models based on compressive sensing17, 19 and quadratic programming.20

The manuscript is organized as follows: First, we briefly review the CE formalism and define the ECI fitting problem in a rigorous mathematical manner. We then derive ground-state preserving constraints that can be used in conjunction with a quadratic programming solver. Afterwards, we consider the phase diagrams of two prototypical oxides of practical relevance as benchmark cases. Finally, we compare the strengths and weaknesses of our approach with established methods.

Results and Discussion

Compressed sensing and cluster expansion

For a rigorous mathematical introduction to cluster expansions and their formal relationship to the partition function of crystalline solids, we refer the readers to references.6, 9, 21 Here, we only illustrate the key features of cluster expansions that are of relevance to the present work.17

The general expression of a cluster expansion Hamiltonian is

$${{\bf{E}}_{{\rm{CE}}}}\left( \sigma \right) = \ {J_0} + \mathop {\sum}\limits_i^{{\rm{sites}}} {{J_i}{\sigma _i}} + \mathop {\sum}\limits_{i,\, j}^{{\rm{pairs}}} {{J_{i,\, j}}{\sigma _i}{\sigma _j}} + \mathop {\sum}\limits_{i,\, j,\, k}^{{\rm{triplets}}} {{J_{i,\, j,\, k}}{\sigma _i}{\sigma _j}{\sigma _k}} \\ + ... = \mathop {\sum}\limits_{c \in {\bf{C}}} {{{\it{J}}_c}{\sigma _c}} $$
(1)

where σ is the spin representation of an atomic configuration in which each component σ i (a spin variable) denotes the occupancy of site i. Following the Ising model convention, σ i takes on values of ±1 in a binary system, encoding the atomic species on site i. Each product of spin variables, σ i σ j ... , (spin product) corresponds to a cluster of lattice sites, and the cluster expansion energy E CE is a polynomial of the spin variables weighted by the expansion coefficients J, the ECIs. For brevity, we denote the set of interacting clusters as C. For any cluster cC, J c is the corresponding ECI and σ c is the corresponding spin product. Note that typically multiple clusters of the same type exist (e.g., the point term for each equivalent site or the cluster corresponding to the nearest-neighbor pair interaction), and symmetry requires the coefficients of equivalent clusters have to be identical.22 The summation in Eq. (1) is therefore actually over cluster types, and the individual spin products can be replaced by their average over all equivalent clusters, the cluster correlations.

From Eq. (1) it is obvious that the CE energy is linearly dependent on the ECIs, J, when the configuration σ is fixed. We can thus write

$${{\bf{E}}_{CE}}\left( \sigma \right) = {\bf{\Pi }} \left( \sigma \right){\bf{J}}$$
(2)

where Π(σ) is the row vector of cluster correlations (with multiplicity incorporated) corresponding to configuration σ. Given a set of input atomic configurations S and their DFT energies E DFT,S, the problem of determining the ECIs can then be naïvely expressed as minimization of the L 2 norm

$$\mathop {{\min }}\limits_{\bf{J}} {\left\| {{{\bf{E}}_{{\rm{DFT,S}}}} - {{\bf{\Pi }}_S}{\bf{J}}} \right\|_2}$$
(3)

where the rows of the feature matrix Π S are the cluster correlations of the configurations in S. Note that the L 2 norm is the conventional Euclidean norm, and the general L p norm \({\left\| {\bf{u}} \right\|_p}\) is defined as:

$${\left\| {\bf{u}} \right\|_p} = {\left( {\mathop {\sum}\limits_i {{{\left| {{u_i}} \right|}^p}} } \right)^{\kern -3pt 1/p}}$$
(4)

Simply minimizing the L2 norm in Eq. (3) essentially means that the ECIs are fitted such that the average squared difference between the DFT energies and the CE-predicted energies of all structures is minimized. However, such a direct minimization of the error function leads to overfitting when the number of ECIs (the model parameters) exceeds or becomes close to the number of reference configurations (the fitting parameters), i.e., when the system of linear equations Eq. (3) is underdetermined. Overfitting means that the ECIs accurately reproduce the energies of the reference structures (in-sample data) but deliver poor generalization, i.e., the CE model does not reliably predict the energy of other unseen structures (out-of-sample data). A standard method to avoid overfitting is regularization,23 i.e., the simultaneous minimization of the sum of the error function and the magnitude of the model parameters. Compressive sensing17, 19 implements L 1 norm regularization, which has been shown to be a nearly optimal and robust way to reconstruct signals from a small number of data points.24 The compressive sensing formulation of the cluster expansion problem is:

$$\mathop {{\min }}\limits_{\bf{J}} \left\| {{{\bf{E}}_{{\rm{DFT,S}}}} - {{\bf{\Pi }}_S}{\bf{J}}} \right\|_2^2 + \mu {\left\| {\bf{J}} \right\|_1}$$
(5)

where μ is a parameter controlling the sparseness of the fit. A higher value of μ shifts the weight towards minimizing the L 1 norm, when μ is small the minimization of the L 2 error dominates. The L 1 norm of a vector is a measure of the vector’s sparseness,24 thus larger μ values result in fewer ECIs not equal to zero and thereby reduce overfitting. An optimal μ value can be determined through minimizing the error of the CE model on unseen data.17

Constrained cluster expansion models

For practical applications it is often desirable that a CE model preserves some invariants on the input data. For example, predicting the qualitative features of a phase diagram may require that the energetic order of all structures is exactly preserved while quantitative errors in the structural energies might be tolerable.25 This is because the set of ground states and the ranking of excited states close in energy determines the topology of a phase diagram more than the actual energies themselves.26 As the energy difference between competing structures is typically small, minimization of the average error in reproducing the DFT energy does not by itself enforce the structural energy order one wants to preserve. As a result even very small energy errors in the CE can qualitatively change a phase diagram when it leads to new ground states.26 We have found practically that trying to preserve the structural ordering and ground states by increasing the relative weights of these input data rapidly leads to overfitting in the CE. In the following, we will develop a methodology that allows including constraints in the ECI optimization problem in a systematic, unbiased fashion and without overfitting.

In recent years, mathematical programming has been a rapidly growing field that enables the highly efficient, systematic and rigorous solution of problems in different standard forms.27 One rapidly growing area is quadratic programming (QP),28 for which robust solvers exist,20 and a variety of different approaches have been researched and implemented, such as the interior point method, the active set method and the augmented Lagrangian method.28 In essence, quadratic programming is a mathematical optimization technique for problems of the following specific form:

$$\begin{array}{l}\\ \mathop {{\min }}\limits_{\bf{x}} \;\,\frac{1}{2}{{\bf{x}}^T}{\bf{Qx}} + {{\bf{c}}^T}{\bf{x}}\\ \\ s.t.\quad {\bf{Ax}} \le {\bf{b}}\\ \\ \quad \;\;\;\,{\bf{Cx}} = {\bf{d}}\\ \end{array}$$
(6)

where Q is a positive semidefinite matrix, A and C are real matrices, and b, c and d are real vectors. Note that a matrix is positive semidefinite if and only if for all real vectors X, x T Qx ≥ 0. The semidefinite property is essential so that the optimization problem is convex. Also note that when Q = 0 Eq. (6) reduces to a standard linear programming problem which was introduced to CE optimization in reference.25

Our key strategy for CE fitting is to cast the compressive sensing problem Eq. (5) into a quadratic programming problem Eq. (6)29 and to add constraints that guarantee ground-state preservation. Explicitly, Eq. (5) can be rewritten as:

$$\begin{array}{l}\\ \quad \quad \mathop {{\min }}\limits_{\bf{J}} \left\| {{{\bf{E}}_{{\rm{DFT,S}}}} - {\Pi _{\rm{S}}}{\bf{J}}} \right\|_2^2 + \mu {\left\| {\bf{J}} \right\|_1}\\ \\ \Leftrightarrow \quad \mathop {{\min }}\limits_{\bf{J}} \left\| {{{\bf{E}}_{{\rm{DFT,S}}}} - {\Pi _{\rm{S}}}{\bf{J}}} \right\|_2^2 + \mu \mathop {\sum}\limits_{c \in {\bf{C}}} {\left| {{J_c}} \right|} \\ \\ \Leftrightarrow \quad \mathop {{\min }}\limits_{{\bf{J}},{\bf{z}}} \left\| {{{\bf{E}}_{{\rm{DFT,S}}}} - {\Pi _{\rm{S}}}{\bf{J}}} \right\|_2^2 + \mu \mathop {\sum}\limits_{c \in {\bf{C}}} {{z_c}} \\ \\ \quad \quad s.t.\quad {z_c} \ge {J_c}\quad \forall c \in {\bf{C}}\\ \\ \quad \quad \quad \quad {z_c} \ge - {J_c}\quad \forall c \in {\bf{C}}\\ \end{array}$$
(7)
$$\begin{array}{l}\\ \Leftrightarrow \quad \mathop {{\min }}\limits_{{\bf{J}}{\rm{,}}{\bf{z}}} {{\bf{J}}^T}{\Pi _{\rm{S}}}^T{\Pi _{\rm{S}}}{\bf{J}} - 2{{\bf{E}}_{{\rm{DFT,S}}}}^T{\Pi _{\rm{S}}}{\bf{J}} + \mu \mathop {\sum}\limits_{c \in {\bf{C}}} {{z_c}} + {{\bf{E}}_{{\rm{DFT,S}}}}^T{{\bf{E}}_{{\rm{DFT,S}}} } \\ \\ s.t.\quad {z_c} \ge {J_c}\quad \forall c \in {\bf{C}} \\ \\ \; \quad \quad {z_c} \ge - {J_c}\quad \forall c \in {\bf{C}} \\ \end{array}$$
(8)

In the conversion step in Eq. (7) auxiliary variables z c , corresponding to constraints on J, have been introduced to remove the L 1 norm of Eq. (5). The equivalence in Eq. (7) holds because every z c can be independently minimized while it is constrained to be larger than ± J c . Note that the QP formulation in Eq. (6) does not allow absolute value operations, so that two separate linear constraints are required in Eq. (7), z c  ≥ J c and z c  ≥ −J c , even though they are in combination essentially expressing the absolute value constraint z c  ≥ |J c |. The conversion step in Eq. (8) is a direct expansion of the L 2 norm into vector multiplication. Note that \({\Pi _{\rm{S}}}^T{\Pi _{\rm{S}}}\) is always positive semidefinite for every vector x, since \({{\bf{x}}^T}{\Pi _{\rm{S}}}^T{\Pi _{\rm{S}}}{\bf{x}} = {\left( {{\Pi _{\rm{S}}}{\bf{x}}} \right)^T}{\Pi _{\rm{S}}}{\bf{x}} \ge 0\). Hence, we have arrived at a formulation of the compressive sensing ECI problem Eq. (5) in terms of a QP problem.

The second key step of our methodology is to include suitable constraints for ground-state preservation in the QP formulation. Ground states, i.e., those configurations that are thermodynamically stable at zero temperature (0 K), can be identified by constructing the lower convex hull of the formation energies.30 When the energy of a configuration is above the ground-state hull it is thermodynamically unstable with respect to decomposition into neighboring ground states.

Note that there are 2 different scenarios that lead to inconsistent ground states from an ECI fit: The first type of ground-state inconsistency occurs when the energy of some non-ground-state configuration is underestimated so much that it erroneously becomes a ground-state of the CE model. This problem is illustrated in Fig. 1 (labeled with P1), where the energy of configuration s1 is predicted to be below its decomposition line in the input data, i.e., below the convex combination of configurations h2 and h3 (shown as the line connecting the points). To constrain the QP system such that no inconsistency of type 1 occurs, we add the first constraint:

Fig. 1
figure 1

Schematic of the two types of ground-state inconsistencies that may arise during the fit of a cluster expansion (CE) model to DFT reference data. (P1) illustrates the situation in which one particular configuration, s1, that is unstable based on the DFT input data becomes a stable ground-state of the CE model, as its CE energy is below the convex combination of its decomposition line defined by the ground-state configurations h2 and h3. (P2) illustrates the converse situation in which the CE energy of one ground-state configuration, h2, is greater than the convex combination of the neighboring ground states, h1 and h3, which causes h2 to be unstable in the CE model

(C1) for each configuration that is not on the ground-state hull (i.e., a configuration that would thermodynamically decompose into ground states), we require that its CE configuration energy is greater than its CE decomposition line. To express this condition formally, we denote the i-th ground-state configuration (i.e., the i-th configuration on the lower convex hull) by σ h,i with iH. With this notation, the decomposition of an unstable configuration σ s into the stable ground states can be expressed as Dec H (σ s ) = {(x i (σ s ),σ h,i ) iH } where x i (σ s ) is the fraction of σ h,i in the decomposition products. The constraint to remove ground-state inconsistencies of type 1 becomes

$$\Pi \left( {{\sigma _s}} \right){\bf{J}} \ge \mathop {\sum}\limits_{i \in H} {{x_i}\left( {{\sigma _s}} \right)} \Pi \left( {{\sigma _{h,i}}} \right){\bf{J}} + \varepsilon \,{\rm{for}}\,{\rm{all}}\,{\rm{unstable}}\,{\rm{configurations}}\,{\sigma _s}$$
(9)

where ε is some small number used as numerical tolerance.

Introducing constraint (C1) in Eq. (9) to the QP problem in Eq. (8) guarantees that all ground states of the CE model are also ground states of the DFT input data. However, the converse is not necessarily true, i.e., a DFT ground-state configuration might not be a ground-state of the CE model. This scenario is shown in Fig. 1 (P2), where configuration h2 has a greater CE energy than its convex hull decomposition line defined by h1 and h3. To remove this second type of ground-state inconsistency, we introduce a second constraint:

(C2) for each ground-state configuration σ h (i.e., for each configuration σ h on the lower convex hull), we require that its energy is smaller than the energy of a modified hull that results from removing σ h from the set of input ground states. Formally, given a ground-state configuration σ h on the input hull, we consider its decomposition into a modified ground-state hull as Dec H\h (σ h ) = {(x i,H\h (σ h ),σ h,i ) iH\h } where H\h is the index set of all input hull configurations not including σ h , and x i,H\h (σ h ) is the fraction of decomposition product σ h,i . The constraint to remove ground-state inconsistencies of type 2 thus becomes

$$\begin{array}{ccccc}\\ & \Pi \left( {{\sigma _h}} \right){\bf{J}} \le \mathop {\sum}\limits_{i \in H\backslash h} {{x_{i,H{\rm{\backslash }}h}}\left( {{\sigma _h}} \right)} \Pi \left( {{\sigma _{h,i}}} \right){\bf{J}} - \varepsilon \quad \\ & {\rm{for}}\,{\rm{all}}\,{\rm{ground}}\,{\rm{state}}\,{\rm{configurations}}\,{\sigma _h} \\ \end{array}$$
(10)

Constraint (C2) in Eq. (10) guarantees that all ground-state configurations in the (DFT) input data are also ground states of the CE model. Consequently, by combining (C1) and (C2), a configuration is a ground-state of the resulting CE model if and only if it is a ground-state of the input data. The full quadratic programming formulation for ground-state preserving CE fitting is

$$\begin{array}{l}\\ \mathop {{\min }}\limits_{{\bf{J}},{\bf{z}}} {{\bf{J}}^T}{\Pi _{\rm{S}}}^T{\Pi _{\rm{S}}}{\bf{J}} - 2{{\bf{E}}_{{\rm{DFT,S}}}}^T{\Pi _{\rm{S}}}{\bf{J}} + \mu \mathop {\sum}\limits_{c \in {\bf{C}}} {{z_c}} + {{\bf{E}}_{{\rm{DFT,S}}}}^T{{\bf{E}}_{{\rm{DFT,S}}}} \\ \\ s.t.\quad {z_c} \ge {J_c}\quad \forall c \in {\bf{C}} \\ \\ \quad \quad {z_c} \ge - {J_c}\quad \forall c \in {\bf{C}} \\ \\ \quad \quad \Pi \left( {{\sigma _s}} \right){\bf{J}} \ge \mathop {\sum}\limits_{i \in H} {{x_i}\left( {{\sigma _{h,i}}} \right)} \Pi \left( {{\sigma _{h,i}}} \right){\bf{J}} + \varepsilon \quad \\ \\ \quad \quad {\rm{for}}\,{\rm{all}}\,{\rm{unstable}}\,{\rm{configurations}}\,{\sigma _s} \\ \\ \quad \quad \Pi \left( {{\sigma _h}} \right){\bf{J}} \le \mathop {\sum}\limits_{i \in H\backslash h} {{x_{i,H\backslash h}}\left( {{\sigma _{h,i}}} \right)} \Pi \left( {{\sigma _{h,i}}} \right){\bf{J}} - \varepsilon \quad \\ \\ \quad \quad {\rm{for}}\,{\rm{all}}\,{\rm{ground}}\,{\rm{state}}\,{\rm{configurations}}\,{\sigma _h}\\ \end{array}$$
(11)

Cation ordering in the rocksalt-type lithium transition metal oxide systems Li x Fe(1−x)O and Li x Ti(1−x)O

We demonstrate the effectiveness of the QP approach to cation ordering in two oxide systems. Rocksalt-type lithium transition metal oxides, LiMO2 (M = one or more transition metal species), are the most important class of cathode materials for lithium-ion batteries in consumer electronics.31 During the last decade, materials with lithium excess compositions, Li(1+x)M(1-x)O2, have attracted much interest owing to their high lithium storage capacities.32, 33 One criterion for the suitability of Li(1+x)M(1-x)O2 as cathode material is whether the material is a sufficiently good conductor for Li ions, which critically depends on the cation (Li, M) ordering in the structure.34, 35 While conventional oxide-based cathode materials form in ordered crystal structures (such as layered LiCoO2 36), several cation-disordered lithium-excess materials with high practical capacities have recently been discovered.35, 37 Some of these new compositions contain Ti38 and Fe37 which makes them attractive for technological applications because of the metals’ high abundance and nontoxicity. However, LiTiO2 39, 40 and LiFeO2 41, 42 are the only LiMO2 with single transition metal species that form in cation-disordered structures in solid-state synthesis, and consequently their configurational phase diagrams are challenging to investigate experimentally.

In the following we employ the ground-state preserving QP methodology developed above to investigate the phase diagrams of Li x Fe(1−x)O and Li x Ti(1−x)O39,40,41,42 to obtain a better understanding of the relevant atomic configurations. The input consists of 863 and 602 reference configurations for Li x Fe(1−x)O and Li x Ti(1−x)O respectively. DFT calculations for Li x Fe(1−x)O configurations were performed within the Hubbard-U corrected Generalized Gradient Approximation (GGA+U), using the PBE exchange-correlation functional.43, 44 The U values are taken from the work of Jain et al.45 DFT calculations for Li x Ti(1−x)O configurations did not employ a Hubbard-U correction. For both systems, an initial set of configurations at x = 0.5 with supercell sizes up to 8 sites was generated using the enumerating algorithms by Hart et al.,46 and the reference sets were subsequently refined by including ground-state configurations of preliminary cluster expansions determined using a recently published ground-state search algorithm for lattice models.47 The corresponding ground-state input hulls are shown in Fig. 2 as black dots and lines. We note that both systems, Li x Fe(1−x)O and Li x Ti(1−x)O, cannot easily be fitted using the conventional (unconstrained) compressive sensing technique, as the approach gives rise to a number of spurious ground states as shown in Fig. 2 (with optimal μ parameter as will be discussed below). Specifically, some Li x Fe(1−x)O configurations with x = 1/3, 5/8 and 9/16 and Li x Ti(1−x)O configurations with x = 1/8, 1/6, 1/4, 5/9, 3/5, 5/8 are erroneously predicted to be ground states (i.e., inconsistencies of type 1 as defined above). These over-stabilized configurations are marked with arrows in Fig. 2. In addition, the actual Li x Ti(1−x)O ground-state configurations with x = 1/10, 1/5, 8/15 become unstable in the compressive sensing CE model (inconsistencies of type 2). These examples demonstrate that ground-state preservation is not an automatic feature inherent to the compressive sensing approach, and the problem needs to be addressed before predictive simulations of materials systems are possible.

Fig. 2
figure 2

Input hull as obtained from density functional theory (black lines and circles), cluster expansion hull fitted using the conventional (unconstrained) compressive sensing method (red lines and squares), and ground-state preserving cluster expansion hull fitted with the quadratic programming method of the present work (green lines and diamonds) for (a) Li x Fe(1−x)O and (b) Li x Ti(1−x)O. The input ground-state configurations for (a) Li x Fe(1−x)O correspond to FeO, LiFeO2 and Li2FeO3 and for (b) Li x Ti(1−x)O correspond to TiO, Li1Ti9O10, Li2Ti8O10, Li3Ti6O9, Li4Ti5O9, Li2TiO3, LiTiO2, Li8Ti7O15 and Li2TiO3. Arrows are used to mark the over-stabilized unstable configurations. The sparseness parameter, μ, was chosen to be 0.144 for both systems

As seen in Fig. 2, the QP fitting scheme achieves ground-state preservation for both materials and yields CE hulls that are spanned by the same configurations as the input DFT hulls. However, it is worth noting that the sparseness parameter, μ, of Eq. (11) needs to be carefully selected to arrive at this result. In the following section we will show that μ should be chosen such that the cross-validation error is minimized. The discussion of the cross-validation error is essential in that it provides a standard measure of predictive power of our fitting scheme which sets the method apart from other approaches for ground-state preservation, such as the adjustment of configuration weights, which will also be shown below.

Cross-validation of the choice of the sparseness parameter

Cross-validation is the standard way to decide the optimal sparseness of a numerical model, which is generally referred to as bias variance trade-off in statistical inference.48 To determine the sparseness parameter μ by means of cross-validation we randomly split the DFT data, D, into N = 10 equal parts. For each part D i , we define its complement \({\overline D _i}\) as all the DFT data points except those belonging to D i (formally, \({\overline D _i} \equiv D - {D_i}\)). Next, the QP scheme of Eq. (11) is applied to the complement set \({\overline D _i}\) to obtain a CE fit without the information in part D i , so that an out-of-sample validation can be performed by calculating the root mean square error (RMSE) of the unseen data D i . We denote the resulting out-of-sample RMSE as e i,μ . The cross-validation (cv) score cv μ given a sparseness parameter μ is then defined as the root mean square of the out-of-sample RMSE over all N data parts, i.e., formally \(c{v_\mu } = \sqrt {\mathop {\sum}\limits_{i = 1...N} {e_{i,\mu }^2{\rm{/}}N} } \). Using this definition, the optimal μ resulting in the model with greatest predictive power can be determined by plotting cv μ against μ and selecting the value of μ that minimizes the cv score.

The cross-validation score cv μ of the Li x Fe(1−x)O system as function of μ is shown in Fig. 3a for various different numbers of input clusters. Note that the number of input clusters to draw from is determined by a maximum interaction order (e.g., triplets) and a radial cutoff. Across all five curves, the cv score initially decreases and then increases with increasing μ. We consider the concept of bias variance tradeoff48 to understand this behavior: The input DFT energies may conceptually be understood as the sum of an ideal cluster expansion and a certain degree of noise ε, i.e., E DFT = Π(σ)J + ε. Here, the noise could originate from numerical errors in the DFT energies. For small values of μ, the CE fit uses all available degrees of freedom (i.e., all ECIs) to incorporate the noise ε into the CE model, resulting in severe overfitting. As the value of μ increases, the number of non-zero ECIs decreases and the effect of noise, i.e., the variance in fitting, becomes less severe. However, when μ becomes too large, the bias that ECIs should tend to 0 becomes dominating over the data itself, resulting in severe underfitting and thus increasing cv scores. As a consequence, the cv score has a pronounced minimum allowing to determine the optimal μ corresponding to the best tradeoff between the variance and bias during fitting.

Fig. 3
figure 3

a Cross-validation (cv) score cv μ plotted as function of the sparseness parameter μ for the example of the Li x Fe(1−x)O system for various numbers of input clusters. The optimal μ minimizes the cv score. b, c Optimal cv score for different numbers of input clusters as determined using the QP methodology and the conventional weight adjustment scheme for (b) Li x Fe(1−x)O and (c) Li x Ti(1−x)O

As seen in Fig. 3a, for small values of μ where \(\log \left( \mu \right) < - 2\) the cv score increases dramatically with the number of input clusters indicating overfitting as the result of insufficient regularization. As the sparseness parameter increases above \(\log \left( \mu \right) >- 2\), the cv score becomes less sensitive with respect to the number of input clusters, indicating that the regularization is effective and that most non-essential ECIs are fitted to zero regardless of the number of input clusters. The optimal cv scores are found for \( - 1 \le \log \left( \mu \right) \le 0\) and are plotted in Fig. 3b for different numbers of input clusters (labeled “QP methodology”). As seen in the figure, the optimal cv score decreases from 0.0345 eV/f.u. (formula unit) for 54 input clusters to 0.0261 eV/f.u. for 625 input clusters. The cv score stabilizes at 625 input clusters and barely changes for 1184 input clusters (0.0260 eV/f.u.). Hence, we conclude that 625 input clusters and a sparseness parameter μ = 0.144 result in a CE model with optimal predictive power for the Li x Fe(1−x)O system. The corresponding analysis for the Li x Ti(1−x)O system is shown in Fig. 3c, and the optimal parameters, μ = 0.144 and 411 input clusters, yield a cv score of 0.0331 eV/f.u.

In-sample ground-state preservation and comparison with conventional weight adjustment

Per construction, the QP form Eq. (11) guarantees that the CE fit preserves the ground states of the reference data set. Conventionally, such in-sample gound-state preservation is often achieved by assigning weights to the reference configurations to manually bias the fit. In the following, we compare the performance of the QP methodology with the conventional weight adjustment technique to further assess the utility of our approach. Before we detail the weight adjustment method, we briefly consider how configuration weights can be included in the QP approach in practice. For this purpose, we define a diagonal weight matrix W whose diagonal entries w i,i correspond to the weight of the ith input configuration. With this definition, W can be incorporated into Eq. (5) to achieve multiplying weights to the in-sample fitting error:

$$\mathop {{\min }}\limits_{\bf{J}} \left\| {{\bf{W}}\left( {{{\bf{E}}_{{\rm{DFT,S}}}} - {{\bf{\Pi }}_S}{\bf{J}}} \right)} \right\|_2^2 + \mu {\left\| {\bf{J}} \right\|_1}.$$
(12)

Note that large w i,i result in a strong bias of the fitting error for the ith input configuration to be 0. The concrete weight adjustment procedure that we employed in this work is as follows:

  1. (1)

    Initialize all weights to be 1.

  2. (2)

    Perform QP to construct a CE model.

  3. (3)

    Check if the CE model preserves in-sample ground states. If it does, the ground-state preserving fit is completed. If it does not, we define the set T of all DFT and CE ground-state configurations

$${\bf{T}} \equiv \left\{ {i \in {\rm{S:}}\,{{\rm{S}}_i}\,{\rm{is}}\,{\rm{DFT}}\,{\rm{Ground}}\,{\rm{state}}\,{\rm{or}}\,{\rm{CE}}\,{\rm{Ground}}\,{\rm{State}}} \right\}$$
(13)

Further, we define the maximum CE hull as \(er{r_{{\rm{hull}}}} \equiv \mathop {{\max }}\limits_{j \in {\bf{T}}} \left| {{{\bf{E}}_{{\rm{CE}},\, j}} - {{\bf{E}}_{{\rm{DFT}},\, j}}} \right|\) and introduce a weight-increment set T ≡ {iT:|ECE, i −EDFT, i |>0.5err hull}. For each configuration iT , w i,i is increased \(\root 4 \of {2}\) ≈ 1.19 times. The procedure is continued with step (2).

This weight adjustment scheme guarantees that in-sample ground states are preserved, since it iteratively converges the CE hull to the DFT hull and corrects spurious ground-state configurations.

A comparison of the optimal cv scores obtained for different numbers of input clusters using both methods is shown in Figs. 3b, c. The cv score is, once again, used as a standard measure for the predictive power of the fits. In case of the Li x Fe(1−x)O system, the predictive power of the QP fit is consistently better than the fit obtained using the weight adjustment technique, and the improvement of the cv score is generally found to be around 2 meV/f.u. or 10%. For the Li x Ti(1−x)O system, the cv score of the QP CE fit also improves about 3 meV/f.u. or 10% of the CE fit from weight adjustment, except for 625 input clusters for which both methods give equivalent results. However, considering all numbers of input clusters, the overall best cv score for the QP method is 1.5 meV/f.u. or 5% better. While absolute energy errors on the order of a few meV/f.u. are close to the inherent error of density functional theory, similar errors in the relative energies of different configurations may add up and thereby give rise to qualitatively different phase diagrams.

Generally, we observed that the weight adjustment method biases some configurations by more than a factor of one thousand (w i,i  > 103), resulting in overfitting of those particular configurations, whereas the QP scheme shows no evidence of such a partial overfitting.

In summary, we conclude that the QP methodology of this work has significant advantages over conventional weight adjustment for the preservation of in-sample ground states. However, as will be demonstrated in the following section, the superiority of the QP approach becomes truly evident when out-of-sample configurations are considered.

Out-of-sample ground-state preservation

Suppose that we are certain that the set of reference configurations available for the construction of the CE model comprises all physical ground states of the system. This situation could occur after an extensive exploration of the configurational space or when the DFT data agrees exceptionally well with experiment. With such confidence in the reference data set, we would like to guarantee that the fitted CE model not only reproduces the ground states of the reference data, but also does not possess any additional ground states that are not already present in the reference data. We call this property out-of-sample ground-state preservation. In the following, we describe an iterative procedure for constructing CE models that guarantee out-of-sample ground-state preservation up to a given number of periodic sites. We will further show that this procedure is generally a useful strategy to construct CE models even when, initially, it is not known whether ground states outside of the reference set exist.

The QP formulation established in Eq. (11) provides ground-state preservation within the set of input data. However, out of sample ground-state preservation is not guaranteed. In principle, if the true configuration polytope,49 P, is known for a set of possible ECIs, i.e., σP can be added and solved within a QP, one could add the following constraint:

$$\Pi \left( \sigma \right){\bf{J}} \ge \mathop {\sum}\limits_{i \in H} {{x_i}\left( \sigma \right)} \Pi \left( \sigma \right){\bf{J}}\quad \forall \sigma \in P$$
(14)

to Eq. (11) and the corresponding optimization problem will result in a globally ground-state preserved CE fit. In practice, however, solving the configurational polytope for an arbitrary CE is an undecidable problem.50 Although this does not necessarily mean that finding a ground-state preserving fit is globally undecidable as well, this fact hints at the intrinsic difficulty of the out-of-sample ground-state preservation problem.

Instead of determining a priori constraints that guarantee out-of-sample preservation, we first examine a CE fit with in-sample ground-state preservation obtained from the QP methodology with optimal parameters (sparseness, number of input clusters) and determine all ground states of the CE model up to a defined system size using the methodology of reference.47 The ground-state hull defined by the input configurations is denoted as the in-sample hull whereas we refer to the hull that is based on all identified ground states as the out-of-sample hull. A comparison of the in-sample and out-of-sample hulls for Li x Fe(1−x)O and Li x Ti(1−x)O for supercell sizes with up to 16 sites is shown in Fig. 4. For Li x Fe(1−x)O, one extra ground-state at x = 5/8 is identified that is predicted to be 6 meV below the in-sample hull. Even though the distance between the in-sample and out-of-sample hulls is small (6 meV), this CE would produce a qualitatively wrong phase diagram due to the spurious ground-state at x = 5/8. For Li x Ti(1−x)O, the discrepancy between the in-sample CE hull and the out-of-sample CE hull is even more severe, as shown in Fig. 4b.

Fig. 4
figure 4

Comparison of the convex hulls based only on the reference input data (in-sample CE hull) and based on all ground-state configurations of the cluster expansion model (out-of-sample CE hull) for (a) Li x Fe(1−x)O and (b) Li x Ti(1−x)O. The Li x Ti(1−x)O out-of-sample hull is below the in-sample hull with a maximal distance of 6 meV. An arrow is added to emphasize the out of sample ground-state at x = 0.625 for Li x Fe(1−x)O

In the following we would like to arrive at a scheme to construct a CE that does not lead to additional ground states, i.e., out-of-sample ground-state preservation. Such scheme is useful in efficient determination of new ground-state configurations and self-consistent CE. Instead of determining the true configurational polytope of Eq. (14), we arrive at a CE model with out-of-sample ground-state preservation iteratively by determining the global ground states of preliminary CE models (as above) to identify those configuration σP for which Eq. (14) is not satisfied. Afterwards, without additional DFT calculations, the constraint corresponding to these configurations σ are added to the QP form as in Eq. (14). By iteratively calculating the ground-state hulls and adding further constraints, global ground-state preservation up to a large supercell size can be achieved. The procedure is illustrated in Fig. 5a.

Fig. 5
figure 5

a Flow chart of the iterative procedure for constructing out-of-sample ground-state preserving cluster expansion models. b Flow chart of a combined DFT-CE configurational sampling resulting in the construction of ground-state consistent cluster expansion models

To demonstrate the convergence of this iterative refinement, we applied the procedure to the two model systems for supercell size of up to 16 sites. The weight adjustment procedure described above is used for comparison. Small initial weights, 10−4, and energies of about 1 meV above the hull are assigned to the predicted new ground states. The results are shown in Fig. 6a for Li x Fe(1−x)O and Fig. 6b for Li x Ti(1−x)O. For reference, the figure also shows the results of an iterative refinement using the weight adjustment method. The maximum distance between the in-sample hull and the out-of-sample hull is plotted in the upper panel as a measure of the difference between the two hulls as the iteration progresses. The corresponding cv score is plotted in the lower panel as a measure of the predictive power of the CE fit.

Fig. 6
figure 6

Results of the iterative construction of cluster expansion models with out-of-sample ground-state preservation for (a) Li x Fe(1−x)O, (b) Li x Ti(1−x)O based on the exact ground states up to a supercell size of 16 sites and (c) Li x Ti(1−x)O based on the simulated annealing with a supercell size of 8 × 8 × 8 (512 sites). The upper panels show the maximal distance between the in-sample and out-of-sample hulls, and the lower panels show the corresponding cross-validation scores. For comparison, the corresponding results using iterative adjustment of configuration weights is shown as red lines and squares

As seen in Fig. 6, for both systems, Li x Fe(1−x)O and Li x Ti(1−x)O, the maximum distance (defined as the difference of energy under the same x) between the out-of-sample and in-sample hulls decreases monotonously to 0 with the QP methodology. The iterative weight adjustment also converges for Li x Fe(1−x)O, though the distance between the hulls fluctuates and does not decay monotonously. For Li x Ti(1−x)O the weight adjustment method does not converge. More importantly, the cv scores of the QP fits are nearly constant throughout the iterations, whereas the cv score continuously increases for the weight adjustment algorithms. This means that, using the QP methodology, out-of-sample ground-state preservation can be achieved without sacrificing the predictive power of the CE fit. On the other hand, the weight adjustment technique that is often used for CE construction is not guaranteed to converge and rends to achieve gound-state preservation at the cost of predictive power (increasing cv score). We therefore conclude that the QP methodology developed in the present work allows for the systematic construction of CE models with in-sample and out-of-sample ground-state preservation.

The results above are based on an exact ground-state search for system sizes of up to 16 sites, however, for the purpose of phase diagram calculations via Monte Carlo simulations much larger supercell sizes may be required. To construct CE models that are in practice ground-state preserving even for sufficiently large system sizes, the exact ground-state search may be replaced by simulated annealing simulations, which allow to determine plausible ground states for larger supercell size (but cannot provide proof that all ground states have been identified, see ref. 47 for a more detailed discussion). We repeated the iterative procedure of Fig. 5a for Li x Ti(1−x)O for supercell sizes with up to 512 sites using simulated annealing, and the results are depicted in Fig. 6c.

As shown in Fig. 6a for smaller cell sizes, using the QP methodology, the distance between the in-sample and out-of-sample CE hulls decreases monotonously to 0 within 7 iterations, and the cv score remains nearly constant. As before, for the iterative weight adjustment algorithm does not achieve complete convergence even after 18 iterations and gives rise to a dramatic increase of the cv score. This final example demonstrates again that the QP methodology is a robust scheme to obtain ground-state preserving CE fits even for large system sizes that are suitable for realistic Monte Carlo simulations.

Finally, we point out that the iterative procedure for out-of-sample ground-state preservation is not only useful, when the ground states of the system are known a priori. Instead, the procedure may also serve as a means for the sampling of the configurational space to generate additional reference data. For this purpose, the configurations that were identified as “spurious” ground states may be evaluated with the reference method (i.e., DFT) to confirm whether any unknown ground-state has been discovered. By construction, this approach also provides a good stopping criterion for the cluster expansion fit when no additional ground states have been identified. This procedure is illustrated in Fig. 5b. If DFT calculations for all prospective new ground states are carried out and none of them turns out to be an actual ground-state, the out-of-sample ground-state preserving fit has the correct assumption and the resulting CE fit is a valid fit with consistently low cv errors. No further iteration is necessary, and the CE fit is finalized. On the other hand, if additional DFT ground states are found within the proposed set, then the out of sample ground-state preserving fit would have to be re-started.

To summarize, in this article, we presented a robust and efficient procedure to obtain ground-state preserving cluster expansion models. The method is formulated in terms of quadratic programming and compressive sensing and is mathematically rigorous. We demonstrated the robustness of the approach by application to the phase diagrams of Li x Fe(1−x)O and Li x Ti(1−x)O that are challenging to describe with conventional cluster expansion techniques. We further showed that out-of-sample ground-state preservation can be achieved up to large supercell sizes. These properties make the presented quadratic programming approach an attractive tool for the fit of general constraint lattice models and point the way towards the fully automated construction of cluster expansion models for materials simulations.

Data availability statement

The data that support the findings of this study have been deposited in Open Science Framework with the identifiers (DOI: 10.17605/OSF.IO/6DEHY).51