Reliable methods for the assessment of thermodynamic stability can accelerate materials design in at least two ways, one considering only energy and the other considering free energy. Identifying low-energy structures that are stable with respect to phase decomposition is needed to ensure that computer-designed materials are synthesizable and stable under operating conditions. In addition, including the role of temperature and entropy is required to understand phase transitions and to predict phase diagrams de novo.

The difficulty in quantifying the free energy difference between phases arises because, in principle, the evaluation of potentials that govern phase stability requires a summation over all possible states of the system that satisfy the corresponding thermodynamic constraints. In practice, Monte Carlo (MC) methods approximate equilibrium properties by identifying a relatively small number of representative system configurations from which ensemble averages can be estimated, and thus implement a strategy to compute relative free energies and determine stable phases. The broad applicability of MC approaches has led to the development of numerous software packages specifically geared toward the materials domain1,2,3,4,5,6,7,8,9. Generally, the most common algorithm to quantify phase stability is to: (1) consider a coarse-grained representation of a phase consisting of a supercell of fixed size and space group where states can be defined by a set of occupation variables S describing the atom at each site, (2) use a set of DFT (structure, energy) pairs to fit an empirical model that predicts the internal energy U(S) of a state with occupancy S, and (3) draw samples from the equilibrium distribution defined by U(S) using a Markov Chain. Each step of the chain, and thus the resulting representative configurations, is obtained through the stochastic proposal of a new state followed by an acceptance/rejection criterion determined by the relative probabilities of the new and previous states according to the equilibrium distribution.
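
The proposal and acceptance/rejection loop described above can be sketched as follows. This is a minimal illustration rather than the implementation of any of the cited packages: the single-site flip move, the spin-like occupation variables in {−1, +1}, the function names, and the one-dimensional `toy_energy` model are all assumptions made for the example, and units are chosen so that kb = 1.

```python
import numpy as np

def metropolis_chain(energy, state, temperature, n_steps, rng):
    """Single-site-flip Metropolis chain for occupation variables in {-1, +1}."""
    state = state.copy()
    e_old = energy(state)
    samples = []
    for _ in range(n_steps):
        # Stochastic proposal: flip the occupation of one randomly chosen site.
        site = rng.integers(state.size)
        proposal = state.copy()
        proposal.flat[site] *= -1
        e_new = energy(proposal)
        # Accept with probability min(1, exp(-(E_new - E_old) / T)), with k_b = 1.
        if rng.random() < np.exp(min(0.0, -(e_new - e_old) / temperature)):
            state, e_old = proposal, e_new
        samples.append(state.copy())
    return samples

def toy_energy(s):
    """Hypothetical periodic 1D nearest-neighbour model, used only for illustration."""
    return -float(np.sum(s * np.roll(s, 1)))

rng = np.random.default_rng(0)
chain = metropolis_chain(toy_energy, np.ones(8, dtype=int), 2.0, 200, rng)
```

Each state in `chain` depends on its predecessor, which is the origin of the serial computation and autocorrelation issues discussed next.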

While this method has demonstrated widespread utility, Markov Chain Monte Carlo (MCMC)10 requires serial computation and can suffer from critical slowing down near phase transitions, and results from simulations run at one set of fixed constraints are not generally transferable to other conditions. These issues can be partially mitigated by the design of specialized proposal/acceptance moves11,12, exchange between parallel simulations13, and random walks through the density of states14, but studies characterizing the mixing thermodynamics of complex, multi-component alloys often incur significant computational cost for large system sizes15.

These limitations have prompted the development of a number of MC methods specifically designed for multi-phase equilibria. Multi-cell Monte Carlo (MC2) implements carefully designed proposal/acceptance steps such that atoms can be exchanged between separate supercells. The impact of phase interfaces on these finite-size simulations is significantly reduced as multiple phases can coexist across different cells16,17,18. Variance-constrained semi-grand canonical simulations rely on a modified thermodynamic ensemble that can be leveraged to compute the free energies of systems within two-phase regions and improve the accuracy of recovered phase boundaries19. Furthermore, Wang–Landau methods14 have been adapted to the materials domain and applied to characterize benchmark systems20.

Alternatively, machine learning approaches can be used to produce realistic high-likelihood samples from complex distributions without explicit parametrization in so-called generative models21,22. The application of generative models to scientific calculations is a promising avenue to overcome the challenges of naive MC methods. Intuitively, these models are trained to draw samples by learning the typical values of the system’s physical variables at equilibrium. A perfectly tuned model could then simulate the system by simply averaging over a batch of ML-proposed samples. Generative models such as variational autoencoders (VAEs)22 and generative adversarial networks (GANs)23,24 have become standard tools for materials design by transforming samples drawn from Gaussian noise to resemble a target distribution. In recent years, VAEs and GANs have often been outperformed by autoregressive generative models25,26 that, instead of using a latent space, sequentially sample the output variables in a series of steps. When the probability distribution of each step is tractable, the probability of sampling any fully generated state can be computed exactly. Critically, when restricted to this class of exact-density models, the generative framework benefits from a loss function that relies on a variational estimate of the thermodynamic potential, as well as from reweighting27 and importance sampling techniques28 that can correct for sample distributions that deviate slightly from those at equilibrium.

The rigorous basis of these models and the explicit connection between exact likelihood and free energy have inspired a large number of physics-based applications. For continuous systems, exact-density flow models have been applied to reducing autocorrelations in lattice field theory29,30,31, sampling free energy barriers of biomolecules27, and studying relaxations of Ising models32,33. In discrete cases, autoregressive generative models have been used to extract thermodynamic quantities28,34 and determine ground states35,36 of spin models.

In this work, we introduce Semi-grand Ensemble Generation by Autoregressive Lattices (SEGAL), a generative approach to lattice simulations of phase stability in materials science. In particular, we demonstrate the applicability of exact-density generative models to the semi-grand canonical thermodynamic ensemble; assess model performance on well-known benchmark systems such as spin models, copper-gold and silver-palladium alloys; and extract estimates of phase stability of multi-component systems.


Autoregressive sampling for materials simulation

We seek to build a generative model that can successfully identify the representative states of the semi-grand canonical ensemble and their dependence on thermodynamic constraints, providing an alternative to traditional MC approaches. We refer to this model as SEGAL.

SEGAL associates each microstate the system can occupy with a predicted probability PAR. Due to the discrete structure of the coarse-grained crystal representation, we decompose the probability of a particular decoration of the crystal prototype as a product of site probabilities that can represent any possible distribution over microstates. This mathematical decomposition requires defining an ordering over sites whereby the atomic identity of a particular site is dependent on its predecessors28,34. Inspired by previous generative models that change the sampled distribution with temperature27,37,38, the dependencies between sites are also functions of the thermodynamic constraints, allowing the conditions to control the microstate probabilities:

$${P}_{{{{\rm{AR}}}}}({{{\bf{S}}}}| {{\Delta }}\mu ,T)=\mathop{\prod}\limits_{i}P({{{{\bf{S}}}}}_{i}| {{{{\bf{S}}}}}_{j\,{ < }\,i},{{\Delta }}\mu ,T).$$

We parameterize these conditional probabilities using a neural network, whose general architecture is shown in Fig. 1 and whose specific details per application are given in Supplementary Notes 15. Therefore, the parameters of the network are trained to capture the underlying correlations of the atomic orderings. In order to generalize easily to arbitrary numbers of components and increase the capacity of the model, we represent each site as a vector Si with length equal to the number of components. New decorations of the lattice prototype can be drawn from the model by sequentially sampling each site i from the categorical distribution P(Si | Sj<i, Δμ, T) such that, after sampling, Si is a one-hot encoded vector corresponding to the identity of the probabilistically chosen atom. The full state describing atomic labels over all sites is simply the concatenation of the Si vectors. Note that the first chosen site still has a dependence on the set Δμ and T.
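
The sequential sampling procedure can be illustrated with a small sketch. The conditional distribution below is a toy masked linear model, not the SEGAL neural network: the parameter arrays `W` and `c`, the way Δμ and T enter the logits, and the function name are all hypothetical choices made only to show how one-hot site vectors are drawn in order and how the exact log-probability of the full state accumulates.

```python
import numpy as np

def sample_autoregressive(W, c, dmu, temperature, rng):
    """Draw one lattice decoration site by site and return its exact log-probability.

    W has shape (n_sites, n_sites, n_comp, n_comp); only entries W[i, j] with j < i are used.
    c has shape (n_comp,) and couples the thermodynamic constraints to every site (toy choice).
    """
    n_sites, _, n_comp, _ = W.shape
    state = np.zeros((n_sites, n_comp))
    log_prob = 0.0
    for i in range(n_sites):
        # The first site depends only on the constraints; later sites also on predecessors.
        logits = c * dmu / temperature
        for j in range(i):
            logits = logits + state[j] @ W[i, j]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        k = rng.choice(n_comp, p=probs)   # categorical draw for site i
        state[i, k] = 1.0                 # one-hot encoding of the chosen atom
        log_prob += np.log(probs[k])      # chain rule: sum of conditional log-probabilities
    return state, log_prob

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 6, 2, 2))
c = np.array([1.0, -1.0])
state, log_prob = sample_autoregressive(W, c, dmu=0.1, temperature=2.0, rng=rng)
```

Because every conditional is a normalized categorical distribution, the product of the sampled probabilities is a properly normalized probability for the whole decoration.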

Fig. 1: Architecture.

Two-layer SEGAL model restricting dependence of all sites to previous sites and thermodynamic constraints. Each node shown in the figure represents a group of n neurons for an n-component system.

Ising model in a magnetic field

To demonstrate the use of SEGAL for a binary alloy, we first studied 10 × 10 periodic Ising spins in a magnetic field B. Through analysis of this model system, equivalences are drawn from spin variables to atomic site labels and from the magnetic field to the chemical potential difference. In particular, the long-range ordering of spins below the critical temperature is analogous to the opening of a two-phase miscibility gap in an alloy with unfavorable mixing. The internal energy function U(S) is the well-known nearest neighbor model with J = −1, working in units where kb = 1:

$$U({{{\bf{S}}}})=-\mathop{\sum}\limits_{(i,j)\in {{{\rm{NN}}}}}{s}_{i}\cdot {s}_{j}$$

where NN is the set of nearest neighbor pairs and si is the spin at site i. In the presence of a field B, an additional magnetic potential \(B{\sum }_{i}{s}_{i}\) plays the role of chemical work ΔμN for our model system. SEGAL is trained with T ∈ [1.5, 3.5] and B set to values [−0.4, −0.2, 0.0, 0.2, 0.4], a range over which both first-order and second-order phase transitions are known to occur. Qualitatively, samples from the trained network exhibit behavior consistent with expectations (Fig. 2a). At low temperature, ferromagnetic states are observed and demonstrate a first-order discontinuity at the critical magnetic field B = 0. In addition, with increasing temperature, the samples demonstrate an order-disorder transition. Below the critical temperature, some magnetization values are not sampled, which is indicative of thermodynamically unstable alloy compositions that decompose into a linear combination of two more pure phases.
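
As a concrete reference for the energy function, the following sketch evaluates the nearest-neighbor term and the field term on a periodic square lattice. The function name is illustrative, and the sign convention for the field term (subtracting B times the total spin, by analogy with subtracting the chemical work) is an assumption of this example.

```python
import numpy as np

def ising_energy(spins, field=0.0):
    """Return -sum_<ij> s_i s_j - B * sum_i s_i for a periodic 2D array of +/-1 spins."""
    # Pairing each site with one neighbour along each axis counts every bond once
    # (valid for lattices with more than two sites per side).
    bonds = spins * np.roll(spins, 1, axis=0) + spins * np.roll(spins, 1, axis=1)
    return -float(bonds.sum()) - field * float(spins.sum())

all_up = np.ones((10, 10), dtype=int)
checkerboard = 1 - 2 * (np.indices((10, 10)).sum(axis=0) % 2)

e_ferro = ising_energy(all_up)             # 200 aligned bonds
e_ferro_field = ising_energy(all_up, 0.4)  # field lowers the energy of the aligned state
e_anti = ising_energy(checkerboard)        # 200 anti-aligned bonds
```

For the 10 × 10 lattice studied here there are 200 nearest-neighbor bonds, so the fully aligned state has energy −200 at zero field and the checkerboard state has energy +200.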

Fig. 2: SEGAL model for Ising alloy.

a Samples from SEGAL under varying constraints T ∈ [1.5, 2.5, 3.5], B ∈ [−0.4, 0.0, 0.4]. b Numerical comparison between self-normalized importance sampling (SNIS) from SEGAL, Wang–Landau method39,40,41, and exact solutions43,44 with B = 0.0. Colors with lower opacity are sampled at intermediate values of B ∈ [0.05, 0.10, 0.15, 0.25, 0.30, 0.35]. SEGAL can interpolate over the whole training region. c Normalized effective sample size (NESS) over the training region. Estimates of NESS are taken with 10,000 samples each. d (left) Symmetry invariance of SEGAL model under an operation G. (right) An example of a transformed configuration with probabilities corresponding to the red dot on the left.

To quantitatively assess the validity of the model, we compared free energies estimated using self-normalized importance sampling (SNIS) on the output of SEGAL to those obtained from a Wang–Landau method that can interpolate between different temperatures but only at a fixed magnetic field39,40,41. When available, we also compared with exact results on finite-size Ising models42,43,44. The SEGAL-estimated free energies are obtained using 20,000 samples at each set of constraints (9 values of B ∈ [0.0, 0.4] and 101 values of T ∈ [1.5, 3.5]). Over the analyzed conditions, the differences in the free energy per site between the two methods are O(10⁻⁴) and comparable in magnitude with the standard deviations of 50 separate SEGAL estimates of F(T, B) (Fig. 2b and Supplementary Fig. 2). The total cost to train and perform one sampling iteration using this SEGAL model is 4.7 × 10⁷ energy evaluations. When comparing to the exact values at B = 0, the magnitude of the errors of the SEGAL estimates is similar to that of the benchmark Wang–Landau algorithm39,40 when run for 10⁹ evaluations and restricted to zero magnetic field strength (Supplementary Fig. 3). While this suggests that this SEGAL model is sample efficient in learning the typical ensemble configurations, we note that this reduction in energy evaluations does not translate exactly to acceleration in wall clock time, because of the overhead of the neural network operations, the ability of SEGAL to leverage batches to evaluate energies in parallel, and the Wang–Landau algorithm’s exploitation of the local structure of U to efficiently compute changes in energy between simulation steps. Though state-of-the-art exact-density approaches have achieved accuracies of ≈10⁻⁵ on 16 × 16 lattices at a single temperature28, sacrificing optimal performance for generalizability over the space of constraints may have more practical utility in regimes where many sets of conditions are of interest, as is the case in predicting materials phase diagrams.
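
To make the importance-sampling estimate concrete, the sketch below applies it to a system small enough to enumerate exactly. Everything here is an illustrative assumption rather than the setup of this work: the model is a periodic 8-site Ising chain, and a uniform distribution over spin states stands in for the trained generative model (it is trivially an exact-density model, with log-probability 8 log 0.5 for every state). The mean importance weight estimates the partition function, and self-normalized weights give observables such as the magnetization.

```python
import itertools
import numpy as np

def chain_energy(spins, field):
    """Toy periodic 1D Ising energy, -sum s_i s_{i+1} - B sum s_i, with k_b = 1."""
    bonds = np.sum(spins * np.roll(spins, 1, axis=-1), axis=-1)
    return -bonds - field * np.sum(spins, axis=-1)

def log_weights(samples, log_q, temperature, field):
    """Log of the unnormalized importance weights exp(-U/T) / q(S)."""
    return -chain_energy(samples, field) / temperature - log_q

def free_energy(log_w, temperature):
    """F = -T log Z, with Z estimated as the mean importance weight."""
    return -temperature * (np.logaddexp.reduce(log_w) - np.log(log_w.size))

def snis_average(values, log_w):
    """Self-normalized importance sampling estimate of an observable."""
    w = np.exp(log_w - log_w.max())
    return float(np.sum(w * values) / np.sum(w))

n_sites, temperature, field = 8, 2.5, 0.2
rng = np.random.default_rng(0)
samples = rng.choice([-1, 1], size=(20000, n_sites))   # stand-in for model samples
log_q = np.full(20000, n_sites * np.log(0.5))          # exact log-density of the stand-in
log_w = log_weights(samples, log_q, temperature, field)
f_est = free_energy(log_w, temperature)
m_est = snis_average(samples.mean(axis=1), log_w)

# Exact reference by enumerating all 2^8 states.
all_states = np.array(list(itertools.product([-1, 1], repeat=n_sites)))
log_b = -chain_energy(all_states, field) / temperature
f_exact = -temperature * np.logaddexp.reduce(log_b)
m_exact = float(np.sum(np.exp(log_b - np.logaddexp.reduce(log_b)) * all_states.mean(axis=1)))
```

A proposal distribution closer to equilibrium than the uniform stand-in would reduce the variance of both estimates, which is the role played by the trained model.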

In order to provide another estimate on the quality of the SNIS, we measured the normalized effective sample size (NESS) over the conditions the model saw during training (Fig. 2c). While the NESS cannot be used to guarantee accurate model performance, it indicates where the model performs poorly. Over a wide range of conditions, SEGAL performs adequately, with a minimum NESS of 0.40. Areas with lower NESS give some intuition on the limitations of conditional generation. For instance, there are regions of lower NESS near the boundary of the training region, which is likely an artifact of the strategy used to sample different conditions during training. NESS is also lower near the first-order phase transition where the typical configurations sampled by SEGAL change rapidly. Interestingly, above the critical temperature, performance no longer degrades significantly near B = 0, which can be interpreted through the disappearance of the first-order phase transition.
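
For reference, a common definition of the normalized effective sample size is the squared sum of the importance weights divided by the number of samples times the sum of squared weights. The sketch below assumes this definition (the exact form used for the figures is not restated in this section) and computes it from log-weights for numerical stability; the function name is illustrative.

```python
import numpy as np

def normalized_ess(log_w):
    """NESS = (sum w)^2 / (N * sum w^2), evaluated from log-weights."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - log_w.max())   # a common rescaling leaves the ratio unchanged
    return float(w.sum() ** 2 / (w.size * np.sum(w ** 2)))

uniform = normalized_ess(np.zeros(100))                  # equal weights
dominated = normalized_ess([0.0] + [-1000.0] * 9)        # one weight dominates
```

Equal weights give a value of 1, while a batch dominated by a single sample gives 1/N, so low values flag conditions where the sampled distribution departs from equilibrium.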

The NESS is not a foolproof metric for performance, because a model suffering from mode collapse—that is, repeatedly producing only a very small set of unique outputs—can still have high NESS. To address this concern, we further investigated potential mode collapse of the generative model. In particular, symmetry-related microstates must have the same unnormalized probability in the semi-grand canonical ensemble and that invariance should be preserved by SEGAL:

$${P}_{{{{\rm{SG}}}}}({{{\bf{S}}}})={P}_{{{{\rm{SG}}}}}(G\times {{{\bf{S}}}}),$$

where U is invariant upon the operation G. A poorly regularized model could prefer samples with a particular translational or rotational orientation that would break the physical symmetry. In order to test our model, we generated samples over the full range of conditions and recorded their probabilities PAR(S). We then applied a random symmetry operation and recorded the model probability of the symmetry-adapted sample PAR(G × S). If the generating field was non-zero, G was a random C4 rotation composed with random translations in horizontal and vertical directions. If the B-field was 0.0 (10% of the tests), an additional spin-flip operation was applied half the time. The \(\log ({P}_{{{{\rm{AR}}}}}({{{\bf{S}}}}))\) and \(\log ({P}_{{{{\rm{AR}}}}}(G\times {{{\bf{S}}}}))\) showed significant agreement (R² > 0.999), suggesting that the model captures the underlying physical symmetries without the use of data augmentation or invariances being explicitly encoded in the network (Fig. 2d). One possible explanation for this performance is that accurately capturing the ensemble under a range of (B, T) constraints forces the neural network toward varying regions of the system’s order parameters, including composition or site correlations. In this way, the training procedure may act as a natural regularizer of the generative model that incentivizes exploration and avoids mode collapse. Lastly, in Supplementary Fig. 4, we explore how automatic differentiation45 can be used to extract thermodynamic quantities by taking derivatives from the neural network predicted probabilities PAR(S) instead of relying on fluctuations.
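
The symmetry operations used in this test can be sketched as below. This is an assumed implementation for illustration: the function names are hypothetical, and the check shown verifies only that the energy is invariant under G, which is the property that makes the two probabilities comparable. Evaluating the model probabilities themselves would require the trained network.

```python
import numpy as np

def ising_energy(spins, field=0.0):
    """Periodic 2D nearest-neighbour energy with a field term (k_b = 1)."""
    bonds = spins * np.roll(spins, 1, axis=0) + spins * np.roll(spins, 1, axis=1)
    return -float(bonds.sum()) - field * float(spins.sum())

def random_symmetry(spins, rng, allow_spin_flip=False):
    """Random C4 rotation composed with random periodic translations.

    The optional global spin flip is only a symmetry when the field is zero.
    """
    out = np.rot90(spins, k=int(rng.integers(4)))
    shift = (int(rng.integers(spins.shape[0])), int(rng.integers(spins.shape[1])))
    out = np.roll(out, shift=shift, axis=(0, 1))
    if allow_spin_flip and rng.random() < 0.5:
        out = -out
    return out

rng = np.random.default_rng(1)
spins = rng.choice([-1, 1], size=(10, 10))
transformed = random_symmetry(spins, rng)
flipped = random_symmetry(spins, rng, allow_spin_flip=True)
```

With log-probabilities of original and transformed samples collected in two arrays, the agreement could then be summarized by the squared Pearson correlation, for example via `np.corrcoef`.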

Ground states of CuAu

In order to test the ability of SEGAL to detect low internal energy phases on realistic materials, we analyzed its performance detecting the stable ordered structures in a copper-gold alloy, a widely studied system for MC algorithms and software46,47,48,49. As is standard in materials science workflows, we trained a cluster expansion U(S) model to predict the energy of new decorations of fcc lattices with the aid of the CLEASE4 package.

Density functional theory (DFT) energies were computed for copper-gold fcc structures with varying cell sizes that were generated using CLEASE4. We observed that including all the data in the training process resulted in cluster expansions with relatively low accuracy. Prediction errors could be reduced by building models using a set of 41 training examples with at most 18 atoms and with formation energies below 0.02 eV/atom. This set included the pure phases as well as the CuAu, Cu3Au, and Au3Cu ground states. As predictions of thermodynamic stability rely predominantly on the properties of low-energy structures, it is reasonable to improve model accuracy for the most relevant system configurations by filtering out high-energy structures from the training data. Previous work also found that depending on the application context, cluster expansion performance can be sensitive to the choice of training data50. A final cluster expansion was trained using L2 regularization and obtained a leave-one-out cross-validation score of 8.15 meV/atom (Supplementary Fig. 5). The convex hull for a 16-site supercell and the effective cluster interaction parameters are shown in Fig. 3a, b, predicting Cu3Au, CuAu, and Au3Cu as stable intermetallics. We note that Au3Cu was not stable according to our DFT calculations, which is consistent with the results of the Materials Project51. However, the presence of a phase that is stable over only a narrow range of chemical potentials increases the challenge of the sampling problem and can more clearly illustrate SEGAL’s performance.

Fig. 3: CuAu ground states.

a Formation energies of all 2¹⁶ = 65,536 possible decorations of the 16-site fcc prototype lattice according to the model cluster expansion. b Effective cluster interactions (ECI) of two-body (blue), three-body (purple), and four-body (orange) clusters obtained using CLEASE4. c Ground states sampled as a function of chemical potential. Cells are expanded by nine times to aid visualization. Purple lines denote expected transitions based on the cluster expansion energy function. d Comparison of sampling of grand potential minima with SEGAL (orange) and random search (blue).

To sample ground states of varying composition, SEGAL was trained on a 16-site fcc lattice prototype over a range of chemical potential differences bounded by values where the pure phases are stable, Δμ ∈ [−0.24 eV, 0.24 eV]. The temperature was steadily decreased over each epoch in a simulated annealing-based approach to increase the likelihood of converging to the correct structures. A similar method was employed by Wu et al. to minimize the energy of spin systems34. Note that in contrast to SEGAL, the minimization of energy alone would only result in the detection of the structure with minimum formation energy (CuAu). The total number of energy evaluations required to train SEGAL on the CuAu system was 1,000,000, which exceeds the number of possible states on the 16-site lattice (65,536), but the resulting model can still be used to examine SEGAL’s behavior in the context of real materials systems.

Once trained, modifying the chemical potential difference allows SEGAL to sample stable alloy structures of varying composition, successfully identifying the pure phases as well as the Cu3Au, CuAu, and CuAu3 intermetallics. Furthermore, when stability is determined by the minimum value of the grand potential at 0 K over a batch of 1000 samples, the critical chemical potentials between stable structures closely match those predicted by the convex hull of the cluster expansion, suggesting that SEGAL has learned to approximate the location of phase transitions (Fig. 3c).
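
The relationship between the convex hull and the critical chemical potentials can be illustrated with a short sketch. The formation energies below are hypothetical round numbers chosen only to form a convex hull; they are not the cluster expansion values of this work. The sketch also assumes a per-atom grand potential at 0 K of the form E − Δμ·x, with x the Au fraction and Δμ measured relative to the pure-phase references.

```python
# Hypothetical (Au fraction, formation energy in eV/atom) values, for illustration only.
PHASES = {
    "Cu": (0.00, 0.000),
    "Cu3Au": (0.25, -0.040),
    "CuAu": (0.50, -0.060),
    "CuAu3": (0.75, -0.035),
    "Au": (1.00, 0.000),
}

def grand_potential(name, dmu):
    """Per-atom grand potential at 0 K: formation energy minus chemical work."""
    x, e_form = PHASES[name]
    return e_form - dmu * x

def stable_phase(dmu):
    """Phase with the lowest 0 K grand potential at a given chemical potential difference."""
    return min(PHASES, key=lambda name: grand_potential(name, dmu))

def critical_dmu(a, b):
    """Chemical potential difference at which phases a and b have equal grand potential.

    This equals the slope of the convex hull segment joining the two phases.
    """
    (xa, ea), (xb, eb) = PHASES[a], PHASES[b]
    return (eb - ea) / (xb - xa)

ground_states = [stable_phase(dmu) for dmu in (-0.2, -0.1, 0.0, 0.12, 0.2)]
```

Sweeping Δμ visits each hull phase in turn, whereas minimizing the formation energy alone (Δμ = 0 in this toy setup) returns only the deepest structure.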

We observe a greater degree of mode collapse than with the Ising model case, as the model finds Cu3Au, CuAu, and CuAu3 ground states with degeneracy 1, where the exact values determined through a brute force enumeration are 4, 6, and 4, respectively. The increased difficulty of this task could be due to the more complicated symmetry relationships between ground states or the convergence of training temperature to 0 K, which reduces the regularization effects of temperature variability, from which the Ising SEGAL model may have benefited more significantly.

We compared the effectiveness of the trained SEGAL model to a benchmark random algorithm that samples all configurations with equal frequency by recording the percent of samples that correctly identify the grand potential minima at 0 K (Fig. 3d). During the test, 1000 samples were drawn from each method at 36 separate values of Δμ. In total, only 1 of the 36,000 random samples identified the correct structure, whereas 73.5% of the SEGAL samples correspond to the grand potential minima. Therefore, we conclude that SEGAL is capable of extracting stability-relevant thermodynamic information from a model of a real material’s internal energy after being trained. Similar to the observation of the Ising model’s NESS, the lowest probability of sampling the correct structure occurs as Δμ approaches phase transitions, where two competing ground states have very similar grand potentials and SEGAL-generated structures must rapidly switch between phases. This effect is largest in the case of Au3Cu, which is only stable for a narrow range of chemical potentials (Supplementary Fig. 6).

AgPd alloy

We further explored the ability of SEGAL to capture the physics of a real metal alloy at finite temperature. As an example, we considered a 27-site fcc prototype (3 × 3 × 3 supercell) of silver and palladium, whose phase diagram features a miscibility gap extending to temperatures of up to 600 K. Below the top of the miscibility gap, unfavorable mixing interactions cause ranges of alloy compositions to be thermodynamically unstable. The gap exhibits a characteristic asymmetry, as palladium is highly soluble in silver, but silver has virtually no solubility in palladium at low temperatures52,53.

A cluster expansion approximation of the formation energy UCE was built using a dataset of 625 AgPd structures from the ICET3 tutorial database and obtained a ten-fold cross-validation error of 2.3 meV/atom (Supplementary Fig. 7a). SEGAL was trained using UCE over a temperature range of [200 K, 900 K] extending within and above the expected miscibility gap. Benchmark semi-grand canonical MCMC simulations using the same cluster expansion were run using CLEASE. In order to show the flexibility of SEGAL with regard to the energy model U, we also trained a crystal graph convolutional model for the formation energy UCGC over the same dataset, which achieved a test error of 1.49 meV/atom (Supplementary Fig. 7b)54. For the crystal graph convolutional model, we wrote our own CGC MCMC implementation to obtain reference values. While SEGAL is compatible with any parametrization of U, and benchmarks of graph-based neural networks55 seem to demonstrate similar performance across architectures when trained on large datasets, we found that model selection can be an important factor. In particular, a SchNet56 model trained on the AgPd dataset obtained a test error of 8.1 meV/atom, much higher than UCE or UCGC. Because our dataset is small, the initial atom featurizations developed for CGCNN models may lead to improved accuracy when compared with the simpler atomic-number encoding in SchNet. Furthermore, SchNet is typically used as an interatomic potential and is highly sensitive to parameterizing interatomic distances, which are irrelevant in our case, since the input representation to the U models is an idealized lattice with arbitrary fixed lattice constant at all compositions. As a result, though SEGAL can be trained with any U, the ability of later sampling tasks to predict physical properties will be dependent on the choice of internal energy model.

Results from SNIS and the Markov Chain estimates show strong numerical agreement across multiple temperatures for both UCE and UCGC energy models, with most deviations in composition being on the order of 10⁻³ (Supplementary Fig. 8). These errors are sufficiently small to recover the physical properties and phase stability of the alloy over the training region. At 250 K, the discontinuity in compositions indicates thermodynamically unstable compositions and confirms the presence of the two-phase region, separating a nearly pure Pd phase and a 60/40 mixture of Pd and Ag (Fig. 4a, b, d). At 750 K, both methods show continuous variation in composition with chemical potential, suggesting that the top of the miscibility gap has been exceeded (Fig. 4a, b). Importantly, SEGAL is applicable as a sampling method for both UCGC and UCE potentials, and can be readily generalized to any developed models for alloy energy that achieve sufficient accuracy.

Fig. 4: SEGAL applied to 27-site AgPd alloy.

Composition vs. Δμ computed with MCMC and self-normalized importance sampling (SNIS) at temperatures of 750 K (top) and 250 K (bottom) using (a) cluster expansion and (b) crystal graph convolution models for U. c Normalized effective sample size of SEGAL over the training region. Estimates of NESS are taken with 10,000 samples each. d Samples from SEGAL at T = 250 K, Δμ = −0.16 eV (left) and T = 250 K, Δμ = −0.04 eV (right).

The NESS of SEGAL (Fig. 4c) is reasonable over a large range of conditions, but indicates lower performance near the critical values of Δμ, at which the discontinuity in composition is observed and the typical lattice configurations at equilibrium change rapidly. These uncertainties near phase transitions can introduce deviations in the bounds of the two-phase region such as those observed at 250 K. We further note that above the miscibility gap (≈600 K), stable compositions change more continuously, and the subsequent decrease in the NESS metric is significantly less pronounced. By identifying regions of constraint space where typical states of the system change rapidly, NESS calculations of SEGAL models show some promise at the automatic detection of phase transitions.

Predicting phase stability

Finally, we give examples of how the SEGAL model can be used to extract information on phase stability. To reduce the artificial effects of a finite simulation cell, we trained SEGAL on larger cells for the AgPd (125-site) and CuAu (128-site) systems. After drawing 5000 samples from the AgPd model at each set of constraints for 15 temperatures between 200 and 900 K and 81 values of Δμ between −0.4 eV and +0.4 eV, a region of thermodynamically unstable compositions was visible and attributed to the miscibility gap. The top of the gap was approximated by using the Felzenszwalb–Huttenlocher image segmentation algorithm57,58 on \({\log }_{10}[{{{\rm{NESS}}}}({{\Delta }}\mu ,T)]\) to estimate the values of Δμ and T where the phase transition occurs (Supplementary Fig. 9a). The boundary of the gap was computed using a polynomial fit to select points (%Ag, T) from those obtained in the sampling procedure above. Below the critical temperature, the points exhibiting the greatest change in composition when Δμ changed by 0.02 eV were selected. At the critical temperature, the point with the largest composition change between Δμ − 0.01 eV and Δμ + 0.01 eV was selected. For the CuAu model, 5000 samples were drawn at each set of constraints for 21 temperatures from 200 to 1200 K and 36 values of Δμ from −0.2 eV to +0.15 eV. Observed discontinuities in stable compositions suggested the presence of a Cu3Au–CuAu two-phase region for temperatures below 700 K. Estimated bounds were determined from the maximum difference in composition between Δμ values separated by 0.02 eV, restricted to the composition range 0.2 < %Au < 0.6. Based on previous work of Takeuchi et al.20, bounds for order-disorder two-phase regions were estimated by locating the temperature with maximal heat capacity TC for each constant value of Δμ and approximating the bounds of the two-phase regions as the compositions at (Δμ − δ, TC) and (Δμ + δ, TC) with δ = 0.01 eV (Supplementary Fig. 9b). Results for both systems agree favorably with reference metadynamics simulations (Fig. 5a, b).
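
The selection of two-phase boundaries from composition discontinuities can be sketched as follows. The composition curve here is synthetic (a step with a weak linear trend), the grid spacing of 0.02 eV is chosen for the example, and the function name is hypothetical; the sketch only illustrates picking the pair of adjacent Δμ values with the largest composition change at one temperature.

```python
import numpy as np

def largest_jump(dmu, composition):
    """Return (dmu_left, dmu_right, x_left, x_right) for the largest change
    in composition between adjacent points of the chemical potential grid."""
    jumps = np.abs(np.diff(composition))
    k = int(np.argmax(jumps))
    return float(dmu[k]), float(dmu[k + 1]), float(composition[k]), float(composition[k + 1])

# Synthetic isotherm: a discontinuity between -0.10 eV and -0.08 eV.
dmu = np.linspace(-0.4, 0.4, 41)
composition = np.where(dmu < -0.09, 0.02, 0.60) + 0.05 * dmu

dmu_left, dmu_right, x_left, x_right = largest_jump(dmu, composition)
```

The two compositions on either side of the jump serve as estimates of the phase boundary at that temperature, and repeating the procedure over temperatures yields the points to which a boundary curve can be fitted.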

Fig. 5: Prediction of phase diagrams.

SEGAL compared with metadynamics benchmark for (a) 125-site AgPd model and (b) 128-site AuCu model. For a, error bars are computed from the uncertainty on the polynomial fit. For b, purple lines show the two-phase region for AuCu3 and AuCu. Pink lines show the two-phase equilibria between ordered compounds and disordered solid solution.

The total numbers of cluster expansion energy evaluations required to train and sample the AgPd and CuAu models with SEGAL were 1.3 × 10⁷ and 3.0 × 10⁷, respectively. The baseline metadynamics simulations required 4.1 × 10⁷ (AgPd) and 3.6 × 10⁸ (CuAu) energy evaluations. However, we note that due to the increased accuracy of the metadynamics simulations, highlighted by the detection of the Au3Cu phase, these values are not directly comparable. The NESS values of these larger models (Supplementary Fig. 10) exhibit many of the same trends as previous experiments, such as low values in the vicinity of phase transitions. In contrast, NESS values in the disordered phases can be less than 10⁻², significantly lower than those observed for the smaller AgPd alloy and the Ising system. As a result, the efficient scaling of SEGAL models to large cell sizes of complex alloys is an outstanding challenge but holds promise for the simulation of multi-component systems.


We have shown that general-purpose generative models for statistical physics can be readily modified for applications computing thermodynamic quantities in materials science. In particular, transforming to the semi-grand canonical ensemble avoids interfaces between competing phases and allows for greater control over the exploration of experimental order parameters such as composition and atomic ordering. Furthermore, a single model with no training examples from previous simulations can generalize across a wide range of constraints and accurately determine thermodynamic potentials, observables, and stable phases. SEGAL does not restrict the form of the potential U(S) in any way and can be trained with crystal graph convolution networks54 or other approaches capable of modeling complex multi-component systems59. As a result, generative models have the potential to become a useful tool alongside standard lattice simulation techniques.

While the approach is promising, a number of algorithmic changes are needed to improve its scalability and performance. The current architecture can be more sample efficient than baseline methods but does not scale to cell sizes comparable to those of typical simulations, which introduces finite-size effects and limits the precision of the final estimates. Though the problem of designing exact-density generative models capable of performing state-of-the-art calculations has not been completely solved34,60, research directions for further improvement have been proposed, including implementing the autoregressive network using graph convolutional layers to utilize the symmetry of the crystal system34 or exploiting the local structure of the energy model to improve the scalability of the generation process61,62. Another crucial step is to refine SEGAL’s sampling performance near phase transitions. SEGAL’s ability to identify these regions through a change in typical states and the associated decrease in NESS values could allow for modified training strategies. In particular, training batches can be more frequently focused in regions with low NESS so that additional examples can help the model to improve in cases where the learning task is difficult. Alternatively, SEGAL could be supplemented with standard MCMC simulations run with constraints close to the critical values of temperature and chemical potential or with strategies to account for exponentially suppressed configurations that increase the variance of importance sampling estimates63.


Thermodynamic ensembles

To draw samples from a particular equilibrium ensemble, lattice Monte Carlo simulations must be run under a chosen set of thermodynamic constraints. In the canonical ensemble, temperature and composition are fixed and system configurations are sampled according to their relative Boltzmann weight \(\propto {{{{\rm{e}}}}}^{-\frac{U}{{k}_{{{{\rm{b}}}}}T}}\). Free energies obtained through this approach can characterize a wide range of phenomena in statistical physics. However, when investigating multi-component materials thermodynamics, the free energy minimum can be achieved by any linear combination of phases that satisfies the composition constraints. Therefore, at equilibrium, multiple phases can coexist in a manner that cannot be represented with a single fixed lattice prototype without introducing phase boundaries. The presence of these multi-phase regions must then be inferred from non-convex regions of the free energy as a function of composition observed in the simulation. To alleviate this challenge, materials scientists often work in the grand canonical ensemble with fixed chemical potentials and temperature. In this ensemble, for each set of constraints only a single phase will be present at equilibrium, except at the critical values where phase transitions occur. As a result, simulations avoid multi-phase equilibria and are better suited to a single lattice cell. While GANs have been applied to the grand canonical ensemble in the context of scalar field theory64, most previous exact-density approaches27,28,34 have modeled the canonical ensemble.

The grand potential and resulting microstate probabilities can be derived for a system of i species through a Legendre transform of the canonical ensemble. With a fixed total number of sites ∑iNi = Ntot, the system is in the semi-grand canonical ensemble and is determined by a set of i − 1 chemical potential differences, Δμi = μi − μ0, and the temperature:

$${{\Phi }}(T,\{{{\Delta }}{\mu }_{i}\})=F-\mathop{\sum}\limits_{i\ne 0}{{\Delta }}{\mu }_{i}{N}_{i}=U-TS-\mathop{\sum}\limits_{i\ne 0}{{\Delta }}{\mu }_{i}{N}_{i}$$
$${P}_{{{{\rm{SG}}}}}({{{\bf{S}}}}| T,\{{{\Delta }}{\mu }_{i}\})=\frac{{{{{\rm{e}}}}}^{[-U({{{\bf{S}}}})+{\sum }_{i\ne 0}{{\Delta }}{\mu }_{i}{N}_{i}({{{\bf{S}}}})]/{k}_{{{{\rm{b}}}}}T}}{{Z}_{{{{\rm{SG}}}}}}$$

The relative probabilities, and thus the representative configurations the system occupies at equilibrium, change in response to the above constraints. In particular, varying the chemical potential differences creates driving forces for changes in composition, and increasing the temperature leads to a greater contribution to the grand potential from configurational entropy and greater system disorder. We demonstrate the dependence of composition on chemical potential for a toy system in Supplementary Fig. 1.
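
The response of composition to the chemical potential difference can be illustrated by exact enumeration of a very small lattice. The sketch below is a minimal illustration only and is not the toy system of Supplementary Fig. 1: it assumes a binary system on a periodic one-dimensional chain, a hypothetical nearest-neighbor energy `energy` with coupling `J`, and units in which kb = 1. All function names are introduced here for illustration.

```python
import itertools
import math


def energy(state, J=1.0):
    """Hypothetical toy energy: nearest-neighbor coupling on a periodic 1D chain."""
    n = len(state)
    # Map occupations {0, 1} to {-1, +1}; like neighbors lower the energy.
    return -J * sum(
        (2 * state[k] - 1) * (2 * state[(k + 1) % n] - 1) for k in range(n)
    )


def semi_grand_distribution(n_sites, T, dmu, J=1.0):
    """Enumerate all occupations S and return (states, P_SG, log Z_SG), with k_b = 1."""
    states = list(itertools.product((0, 1), repeat=n_sites))
    # Log of the unnormalized weight: [-U(S) + dmu * N_1(S)] / T
    log_w = [(-energy(s, J) + dmu * sum(s)) / T for s in states]
    shift = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    probs = [x / total for x in w]
    return states, probs, shift + math.log(total)


def mean_composition(n_sites, T, dmu, J=1.0):
    """Equilibrium fraction of sites occupied by species 1."""
    states, probs, _ = semi_grand_distribution(n_sites, T, dmu, J)
    return sum(p * sum(s) / n_sites for s, p in zip(states, probs))


if __name__ == "__main__":
    for dmu in (-2.0, -1.0, 0.0, 1.0, 2.0):
        x = mean_composition(8, 2.0, dmu)
        print(f"dmu = {dmu:+.1f}  mean fraction of species 1 = {x:.3f}")
```

Because this toy energy is symmetric under exchange of the two species, the mean composition is exactly 0.5 at Δμ = 0 and increases monotonically with Δμ. Enumeration is feasible only for a handful of sites, which is why sampling methods are required for realistic cells.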


If the sampler were perfect, all microstate configurations would appear with the same relative probabilities as they do in the studied thermodynamic ensemble. One approach to encourage the model probability distribution to converge on the correct values is to minimize the KL divergence, a measure of the difference between two probability distributions, between the model and the ensemble, \({{{\rm{KL}}}}({P}_{{{{\rm{AR}}}}}| {P}_{{{{\rm{SG}}}}})\). It can be shown (Supplementary Methods) that the resulting minimization objective can be expressed as:

$${{{\rm{KL}}}}({P}_{{{{\rm{AR}}}}}| {P}_{{{{\rm{SG}}}}}) - \log(Z_{SG})={{\mathbb{E}}}_{{{{\rm{AR}}}}}\left[\frac{U({{{\bf{S}}}})}{{k}_{{{{\rm{b}}}}}T}+\log [{P}_{{{{\rm{AR}}}}}({{{\bf{S}}}})]-\frac{{\sum }_{i\ne 0}{{\Delta }}{\mu }_{i}{N}_{i}({{{\bf{S}}}})}{{k}_{{{{\rm{b}}}}}T}\right]=\frac{{{{\Phi }}}_{{{{\rm{AR}}}}}(T,\{{{\Delta }}{\mu }_{i}\})}{{k}_{{{{\rm{b}}}}}T}$$
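
As a brief worked step, and assuming the standard identity \({{{\Phi }}}_{{{{\rm{SG}}}}}=-{k}_{{{{\rm{b}}}}}T\log ({Z}_{{{{\rm{SG}}}}})\), which is not written out explicitly here, the objective can be rearranged as:

$${{{\rm{KL}}}}({P}_{{{{\rm{AR}}}}}| {P}_{{{{\rm{SG}}}}})=\frac{{{{\Phi }}}_{{{{\rm{AR}}}}}(T,\{{{\Delta }}{\mu }_{i}\})}{{k}_{{{{\rm{b}}}}}T}+\log ({Z}_{{{{\rm{SG}}}}})=\frac{{{{\Phi }}}_{{{{\rm{AR}}}}}(T,\{{{\Delta }}{\mu }_{i}\})-{{{\Phi }}}_{{{{\rm{SG}}}}}(T,\{{{\Delta }}{\mu }_{i}\})}{{k}_{{{{\rm{b}}}}}T}\ge 0$$

Since the KL divergence is non-negative and vanishes only when the two distributions coincide, minimizing the model estimate of the grand potential over the model parameters is equivalent to minimizing the KL divergence, even though ZSG itself is unknown.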

The true grand potential is the minimum of 〈U − ST − ∑i≠0ΔμiNi〉 for all possible probability distributions over microstates and will provide a lower bound on the training loss function such that ΦAR ≥ ΦSG. While Eq. (6) is not differentiable due to the discrete, stochastic sampling step, gradients can be estimated through34,65:

$${\nabla }_{\phi }{{{\rm{KL}}}}({P}_{{{{\rm{AR}}}}}| {P}_{{{{\rm{SG}}}}})={{\mathbb{E}}}_{{{{\rm{AR}}}}}\left[\log \left(\frac{{P}_{{{{\rm{AR}}}}}({{{\bf{S}}}})}{{\hat{P}}_{{{{\rm{SG}}}}}({{{\bf{S}}}})}\right){\nabla }_{\phi }\log ({P}_{{{{\rm{AR}}}}}({{{\bf{S}}}}))\right]$$
$$\log ({\hat{P}}_{{{{\rm{SG}}}}}({{{\bf{S}}}}))=-\frac{U({{{\bf{S}}}})}{{k}_{{{{\rm{b}}}}}T}+\frac{{\sum }_{i\ne 0}{{\Delta }}{\mu }_{i}{N}_{i}({{{\bf{S}}}})}{{k}_{{{{\rm{b}}}}}T}+\frac{{\hat{{{\Phi }}}}_{{{{\rm{AR}}}}}(T,\{{{\Delta }}{\mu }_{i}\})}{{k}_{{{{\rm{b}}}}}T}$$

where \({\hat{{{\Phi }}}}_{{{{\rm{AR}}}}}\) is an estimate of Eq. (6) over the whole batch of samples. Intuitively, the model will seek to lower the likelihood of configurations for which \({P}_{{{{\rm{AR}}}}} \,>\, {\hat{P}}_{{{{\rm{SG}}}}}\) and increase the likelihood of configurations for which \({P}_{{{{\rm{AR}}}}} \,<\, {\hat{P}}_{{{{\rm{SG}}}}}\). Because U(S) is not required to be differentiable, a wide range of standard energy models can be easily incorporated into this approach.
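
The gradient estimator above can be sketched with a deliberately simplified model. The example below is not the SEGAL architecture: as a stand-in for the autoregressive network it assumes a factorized model with one independent Bernoulli parameter per site, a binary system, and units in which kb = 1. The names `sample`, `kl_gradient`, and `train` are hypothetical and introduced only for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample(theta, rng):
    """Draw one configuration S and its exact log-probability log P_AR(S)."""
    state, log_p = [], 0.0
    for t in theta:
        p = sigmoid(t)
        s = 1 if rng.random() < p else 0
        state.append(s)
        log_p += math.log(p) if s else math.log(1.0 - p)
    return state, log_p


def kl_gradient(theta, energy, T, dmu, batch_size, rng):
    """Score-function estimate of grad KL and of Phi_AR / T for one batch."""
    batch = [sample(theta, rng) for _ in range(batch_size)]
    # Per-sample loss: U(S)/T + log P_AR(S) - dmu * N_1(S)/T
    losses = [energy(s) / T + lp - dmu * sum(s) / T for s, lp in batch]
    phi_hat = sum(losses) / batch_size  # batch estimate of Phi_AR / T
    probs = [sigmoid(t) for t in theta]
    grad = [0.0] * len(theta)
    for (s, _), loss in zip(batch, losses):
        log_ratio = loss - phi_hat  # equals log(P_AR / P_SG_hat)
        for k, p in enumerate(probs):
            # d log P_AR / d theta_k = s_k - p_k for a Bernoulli site
            grad[k] += log_ratio * (s[k] - p) / batch_size
    return grad, phi_hat


def train(energy, n_sites, T, dmu, steps=300, batch_size=200, lr=0.5, seed=0):
    """Plain gradient descent on the KL objective; no example configurations needed."""
    rng = random.Random(seed)
    theta = [0.0] * n_sites
    history = []
    for _ in range(steps):
        grad, phi_hat = kl_gradient(theta, energy, T, dmu, batch_size, rng)
        theta = [t - lr * g for t, g in zip(theta, grad)]
        history.append(phi_hat)
    return theta, history
```

For a non-interacting energy such as `lambda s: 0.3 * sum(s)`, the target distribution is itself factorized, so this simplified model can represent it exactly and the learned parameters should approach (Δμ − ε)/T with ε = 0.3. For interacting energies, a factorized model cannot capture correlations between sites, which is one reason an autoregressive parameterization is used instead. Note that `energy` is only ever evaluated, never differentiated.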

Training SEGAL does not require any example configurations, only an energy function U(S) to model. Batches of samples are iteratively drawn and used to estimate the loss function and update model parameters. As training continues, the estimated grand potential \({\hat{{{\Phi }}}}_{{{{\rm{AR}}}}}\) decreases toward the true minimum ΦSG, and the relative probabilities of the samples approach their equilibrium values. We found multiple procedures could be implemented in order to effectively allow the model to capture the condition-dependent equilibrium distribution. The chemical potential differences Δμbatch and temperature Tbatch of each batch could be set randomly using a uniform distribution within the bounds being investigated \({T}_{{{{\rm{batch}}}}}\in [{T}_{\min },{T}_{\max }]\), \({{\Delta }}{\mu }_{{{{\rm{batch}}}}}\in [{\mu }_{\min },{\mu }_{\max }]\) or set to specific values chosen as hyperparameters. Training can be stabilized by computing the loss over several sets of conditions [Tbatch, Δμbatch] simultaneously before updating parameters. In this case, estimates of \({\hat{{{\Phi }}}}_{{{{\rm{AR}}}}}\) are computed separately over constant conditions. In addition, because the magnitude of thermodynamic potentials can differ significantly depending on the constraints, when combining samples generated under different conditions the gradients were further normalized by the absolute value of \(\frac{{\hat{{{\Phi }}}}_{{{{\rm{AR}}}}}(T,\{{{\Delta }}{\mu }_{i}\})}{{k}_{{{{\rm{b}}}}}T}\). Following the learning procedure, the model can draw samples over the entire range of conditions it was exposed to during training.

Self-normalized importance sampling

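Despite the physics-informed training procedure, generative models will not achieve perfect performance for any ensemble, and estimates of thermodynamic observables can be significantly biased28,34. However, if the probability of the proposed samples PAR is known exactly, the statistical power of numerical estimates of an observable O(S) can be improved by weighting samples using the relation:

$${{\mathbb{E}}}_{{{{\rm{SG}}}}}[O({{{\bf{S}}}})]={{\mathbb{E}}}_{{{{\rm{AR}}}}}\left[\frac{{P}_{{{{\rm{SG}}}}}({{{\bf{S}}}})}{{P}_{{{{\rm{AR}}}}}({{{\bf{S}}}})}O({{{\bf{S}}}})\right]$$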

where, for example, samples that appear more frequently in the generated distribution than in the target distribution are given less weight to compensate for their increased rate of appearance. While the normalizing constant of PSG is unknown in many practical problems, samples can still be treated as a well-designed proposal distribution for a Markov Chain29 or used as a biasing distribution for histogram reweighting27. Nicoli et al.28 introduced the use of generative models with self-normalized importance sampling (SNIS), which offers the added benefit of providing estimates of both normalizing constants and observables. Defining w(S) as the unnormalized ensemble probability divided by the generative model probability PAR:

$${Z}_{{{{\rm{SG}}}}}={{\mathbb{E}}}_{{{{\rm{AR}}}}}[w({{{\bf{S}}}})]={{\mathbb{E}}}_{{{{\rm{AR}}}}}\left[{{{{\rm{e}}}}}^{[-U({{{\bf{S}}}})+\mathop{\sum}\limits_{i\ne 0}{{\Delta }}{\mu }_{i}{N}_{i}({{{\bf{S}}}})]/{k}_{{{{\rm{b}}}}}T}/{P}_{{{{\rm{AR}}}}}({{{\bf{S}}}})\right]$$

Because an estimate of ZSG must be used in Eq. (10), SNIS is still biased in practice, but the biases can be substantially smaller than those obtained by simply averaging over samples of the generative model. One metric to evaluate this approach is the effective sample size (ESS), which provides an estimate of the number of samples from the true target distribution required to match the performance of SNIS. The ESS can be normalized (NESS) to evaluate the typical quality of generated samples when compared with the target distribution:

$${{{\rm{NESS}}}}=\frac{1}{n}\frac{{\left(\mathop{\sum }\nolimits_{i}^{n}{w}_{i}\right)}^{2}}{\mathop{\sum }\nolimits_{i}^{n}{w}_{i}^{2}}$$

Note that if the generated distribution closely resembles the target distribution and all wi are close to ZSG, the NESS will approach 1. As the generated distribution deviates from the target and the variation in wi increases, the NESS will approach \(\frac{1}{n}\).
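
The estimators in this section can be sketched in a few lines. The function below is a minimal illustration, with the hypothetical name `snis`, that assumes the log-weights log w(S) have already been computed for each generated sample as [−U(S) + ∑i≠0ΔμiNi(S)]/kbT − log PAR(S). It works in log space because the raw weights can easily overflow or underflow.

```python
import math


def snis(log_w, observable):
    """Self-normalized importance sampling from per-sample log-weights.

    Returns (log of the batch estimate of Z_SG, weighted mean of the
    observable, NESS).
    """
    n = len(log_w)
    shift = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    # Z_SG is estimated by the plain batch average of w(S)
    log_z = shift + math.log(total / n)
    # Weighted average; the common factor exp(shift) cancels in the ratio
    mean_obs = sum(wi * o for wi, o in zip(w, observable)) / total
    # NESS is invariant to rescaling the weights, so the shift cancels here too
    ness = total ** 2 / (n * sum(wi * wi for wi in w))
    return log_z, mean_obs, ness
```

When all weights are equal the function returns a NESS of 1 and the weighted mean reduces to the plain batch average. When a single sample dominates, the NESS falls to 1/n and the estimate is effectively determined by that one sample.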

Density functional theory calculations

DFT calculations were carried out using the Vienna Ab initio Simulation Package66,67 v. 5.4.4, within the projector-augmented wave method68,69. The Perdew–Burke–Ernzerhof functional within the generalized gradient approximation70 was employed as the exchange-correlation functional, including dispersion corrections through Grimme’s D3 method71,72. The kinetic energy cutoff for plane waves was restricted to 520 eV. Integrations over the Brillouin zone were performed using Monkhorst-Pack k-point meshes73 with a uniform density of 64 k-points/Å\({}^{-3}\). A stopping criterion of \(1{0}^{-6}\) eV was adopted for the electronic convergence within the self-consistent field cycle. Optimization of unit cell parameters and atomic positions was performed until the Hellmann–Feynman forces on atoms were smaller than 10 meV/Å.