Introduction

Metallic1 and ceramic2 alloys are often synthesized in phases where the chemical elements are dispersed almost randomly on a crystal lattice, namely solid solution (SS) phases. Characterizing the spatial distribution of chemical elements at the atomic scale is critical in establishing structure–property relationships in SSs. For example, spatial variations in chemistry cause strengthening via solute-dislocation-like interactions3, while the percolation of locally passive chemical regions is associated with corrosion resistance4. In a truly random SS the spatial distribution of chemical elements is defined by the underlying geometry of the crystal lattice and its associated symmetries. However, in real materials, thermal effects induce a trade-off between enthalpy and entropy that favors low-energy chemical motifs5,6,7,8 (Fig. 1a), thereby affecting the spatial distribution of chemical elements. This deviation from randomness is known as chemical short-range order (SRO).

Fig. 1: Representation and identification of local chemical motifs.

a A local chemical motif is defined by the central atom and its first coordination polyhedron. b Distinct chemical motifs with the same average chemical composition contribute equally to the first nearest neighbors Warren–Cowley parameters. c Illustration of the chemical-motif identification framework. Each atom in the system is first given awareness of its chemical environment by being represented as a local chemical motif. During the identification step, the graph representation of the motif is employed by an E(3)-equivariant graph neural network (E(3)-GNN) to identify equivalent motifs, i.e., motifs that can be transformed into each other by Euclidean symmetries.

The state of SRO is often characterized using the first nearest neighbors (1NN) Warren–Cowley (WC) parameters9,10,11,12,13,14,15,16,17,18,19, defined as

$${\alpha }_{{\rm{AB}}}=1-\frac{p({\rm{A}}| {\rm{B}})}{{c}_{{\rm{A}}}}=1-\frac{1}{{c}_{{\rm{A}}}}\left[\frac{1}{N{c}_{{\rm{B}}}}\times \mathop{\sum }\limits_{i=1}^{N}{p}^{(i)}({\rm{A}}| {\rm{B}})\right],$$
(1)

where A and B are chemical elements, cA is the average concentration of atoms of type A, N is the total number of atoms, and p(A|B) is the probability of finding an atom of type A in the 1NN shell of a B atom, which can be broken down into a per-chemical-motif contribution \({p}^{(i)}({\rm{A}}| {\rm{B}})\). When considering only the 1NN, these nc(nc − 1)/2 independent WC parameters (where nc is the number of chemical elements in the system) provide an incomplete8 description of SRO at the first-neighbor shell since distinct chemical motifs with the same chemical concentration are indistinguishable and contribute equally to the WC parameters. For example, Fig. 1b illustrates several chemical motifs with different bonding environments that all have the same contribution to WC parameters. Consequently, as shown in our previous work (ref. 8), reverse-engineered atomic configurations from WC parameters yield unphysical degenerate solutions, and the lack of non-degenerate descriptors prevents the connection of per-atom properties (e.g., generalized stacking fault energy20 and magnetic moments15) with their corresponding chemical motif. Meanwhile, multi-shell WC parameters can be seen as a full characterization of the chemical concentration variation with distance21,22. Despite the completeness of this description (for chemical concentration), understanding and quantifying the compatibility conditions among WC parameters across different shells remains a longstanding challenge23.
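As a concrete illustration of Eq. (1), the following minimal sketch evaluates a 1NN WC parameter by summing the per-atom contributions. The one-dimensional ring with two neighbors per atom, the function names, and the two test configurations are assumptions made for this example only; they are not taken from the original work.

```python
from collections import Counter


def warren_cowley(types, neighbors, a, b):
    """1NN Warren-Cowley parameter alpha_AB, following the structure of Eq. (1)."""
    n = len(types)
    conc = Counter(types)
    c_a, c_b = conc[a] / n, conc[b] / n
    # p^(i)(A|B): fraction of A atoms in the 1NN shell of atom i,
    # counted only when atom i itself is of type B.
    total = 0.0
    for i, t in enumerate(types):
        if t == b:
            shell = neighbors[i]
            total += sum(types[j] == a for j in shell) / len(shell)
    p_a_given_b = total / (n * c_b)
    return 1.0 - p_a_given_b / c_a


def ring_neighbors(n):
    """Hypothetical toy lattice: a periodic 1D chain with two 1NN per atom."""
    return [((i - 1) % n, (i + 1) % n) for i in range(n)]


nbrs = ring_neighbors(8)
alpha_ordered = warren_cowley(list("ABABABAB"), nbrs, "A", "B")
alpha_segregated = warren_cowley(list("AAAABBBB"), nbrs, "A", "B")
print(alpha_ordered, alpha_segregated)
```

In this toy case the alternating chain gives a negative parameter (A–B pairs are favored) and the segregated chain a positive one. Because only the number of A atoms in each shell enters the sum, any rearrangement of atoms within a shell leaves the result unchanged, which is the degeneracy discussed in the text.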

To move beyond the characterization of bonding preferences provided by WC parameters, experimental efforts have employed transmission electron microscopy techniques24,25,26,27,28,29,30,31 to assess spatial correlations among atomic columns’ chemistry. Yet, these methods are unable to access the complete 3D spatial chemical distribution, and the signals associated with SRO may have originated from other atomistic features32,33,34. Meanwhile, atom probe tomography approaches35,36,37,38 are nascent and provide complete 3D characterization, but are still limited in accuracy and by anisotropic resolution39,40. As experimental efforts evolve in their ability to capture the spatial distribution of chemical elements and correlate them with material properties, they would benefit from a framework for the non-degenerate identification of chemical motifs and quantification of SRO beyond WC parameters.

Here, we propose an approach to characterize the state of SRO using all of the 3D atomistic information available. By employing machine learning (ML) and group theory together, we are able to extend the local chemical motif method introduced in ref. 8 into a framework that is applicable to arbitrary crystal lattices with any number of elements; the mathematical foundation for such generalization is provided. This approach naturally leads to a proper information-theoretic measure for quantifying SRO and a reduced but complete (i.e., non-degenerate) representation of the chemical motif space. We demonstrate the application of this approach by identifying all possible chemical motifs in face-centered cubic (fcc), body-centered cubic (bcc), and hexagonal close-packed (hcp) systems containing up to five chemical elements, and quantitatively characterizing the state of SRO in the bcc MoTaNbTi high-entropy alloy using a machine learning potential.

Results

Representation and enumeration of chemical motifs

The local chemical motif \({{\mathcal{M}}}_{i}\) is defined as the group of atoms composed of a central atom i and its first coordination polyhedron (1CP), as illustrated in Fig. 1a. The resulting polyhedron is the cornerstone of our atomic-scale analysis because it completely characterizes the local atomic environment surrounding atom i. In an fcc lattice, the 1CP (Fig. 1a) takes the form of a cuboctahedron with octahedral symmetry point group Oh. This cuboctahedron comprises eight triangular faces, six square faces, and 12 vertices, representing the Na = 12 1NN surrounding the central atom. Consequently, the chemical motif consists of 36 edges connecting 13 atoms (including the central atom). Out of these edges, 24 correspond to connections among the 1NN, while the remaining 12 edges account for the connections between the central atom and the 1NN. A similar description of the geometry of bcc (Fig. 2b) and hcp (Fig. 2c) motifs is provided in Supplementary Section 1.

Fig. 2: Counting of distinct chemical motifs in three-element crystal lattices.

The ternary diagrams indicate the chemical composition of the first coordination polyhedron (1CP, illustrated in Fig. 1a) and the color bar on the right shows the number of distinct 1CPs for that corresponding composition. The number of distinct chemical motifs is obtained after multiplying by nc = 3 to account for the central atom type. This analysis was performed for the: a face-centered cubic, b body-centered cubic, and c hexagonal close-packed lattices.

A chemical motif \({{\mathcal{M}}}_{i}\) is deemed equivalent to another motif \({{\mathcal{M}}}_{j}\) (i.e., \({{\mathcal{M}}}_{i} \sim {{\mathcal{M}}}_{j}\)) if they are related through Euclidean symmetry operations. Conversely, two motifs are said to be distinct (i.e., \({{\mathcal{M}}}_{i}\nsim {{\mathcal{M}}}_{j}\)) if they cannot be related to each other by any Euclidean symmetry. Figure 1c illustrates how a particular motif is equivalent to two other motifs and distinct from a third motif.

Consider a system with nc chemical elements in a crystal structure in which atoms have Na 1NN. Out of the \({n}_{{\rm{c}}}^{{N}_{{\rm{a}}}+1}\) possible chemical motifs that can be constructed, only a select few are distinct. To analytically count the number of distinct chemical motifs, we apply Polya’s pattern inventory formula (Eq. (7)), which stems from Polya’s enumeration theory (see the “Methods” subsection “Polya’s enumeration theory”). This approach is a general mathematical formalism based on group theory for counting the number of distinct colorings of objects under the action of a permutation group. When adapting Polya’s theory to the counting of distinct chemical motifs, the objects being counted are only the 1CPs because the role of the central atom (Fig. 1a) is trivial to account for as a simple multiplicative factor given by the number of atom types nc (represented here by the number of different colors). The symmetry of the 1CP is defined by the crystal lattice, which also defines the permutation group under consideration.

We illustrate this approach by applying it to fcc, bcc, and hcp three-element (nc = 3) systems. The outcome of Polya’s pattern inventory formula (Eq. (7)) is a polynomial in which the exponents indicate the chemical composition of the 1CP, while the coefficients are the number of distinct 1CPs with that same composition. While these polynomials are given in Supplementary Section 2, the information contained in them is better represented in a ternary diagram (Fig. 2) that indicates the chemical composition of the 1CP along with the corresponding number of distinct 1CPs with that composition. The number of distinct 1CPs for a fixed chemical composition quantifies the degeneracy of the WC representation for that 1CP. For example, for the three-element fcc system (Fig. 2a), there are 768 distinct equiatomic 1CPs, which all have the same WC parameters given a specific central atom type. Summing the number of distinct motifs across this compositional diagram (i.e., all of the 91 possible 1CP compositions) results in a total of only 3 × 12,111 = 36,333 distinct chemical motifs, out of the \({3}^{13}\) = 1,594,323 possible ones.
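The total count quoted above can be cross-checked with Burnside's lemma, the special case of Polya's theory in which every color has unit weight. The sketch below is an independent illustration rather than the authors' Polya package: it assumes that the Oh group is represented by the 48 signed 3 × 3 permutation matrices and that the fcc 1CP vertices sit at the permutations of (±1, ±1, 0). All function names are hypothetical.

```python
from itertools import permutations, product

import numpy as np


def octahedral_group():
    """All 48 signed 3x3 permutation matrices, i.e. the point group O_h."""
    ops = []
    for perm in permutations(range(3)):
        for signs in product((1, -1), repeat=3):
            m = np.zeros((3, 3), dtype=int)
            for row, (col, s) in enumerate(zip(perm, signs)):
                m[row, col] = s
            ops.append(m)
    return ops


def count_cycles(perm):
    """Number of cycles of a permutation given as a list of image indices."""
    seen, cycles = set(), 0
    for start in range(len(perm)):
        if start not in seen:
            cycles += 1
            j = start
            while j not in seen:
                seen.add(j)
                j = perm[j]
    return cycles


def distinct_colorings(vertices, n_colors):
    """Burnside's lemma: average number of colorings fixed by each operation."""
    index = {tuple(v): k for k, v in enumerate(vertices)}
    ops = octahedral_group()
    total = 0
    for m in ops:
        perm = [index[tuple(int(x) for x in m @ np.array(v))] for v in vertices]
        total += n_colors ** count_cycles(perm)
    return total // len(ops)


# fcc 1CP: the 12 cuboctahedron vertices are the permutations of (+-1, +-1, 0).
fcc_1cp = sorted({p for s in product((1, -1), repeat=2)
                  for p in permutations((s[0], s[1], 0))})
# bcc 1CP: the 8 corners of a cube.
bcc_1cp = list(product((1, -1), repeat=3))

n_fcc = distinct_colorings(fcc_1cp, 3)
n_bcc = distinct_colorings(bcc_1cp, 3)
print("fcc distinct 1CPs:", n_fcc, "-> distinct motifs:", 3 * n_fcc)
print("bcc distinct 1CPs:", n_bcc)
print("possible fcc motifs:", 3 ** 13)
```

For the three-element fcc case this reproduces the 12,111 distinct 1CPs stated in the text. The same function applies to any vertex set that the signed permutation matrices map onto itself.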

The enumeration of distinct motifs reveals the incomplete description of WC-like parameters, which may result in the misleading characterization of the diversity of chemical motifs in physical systems. This problem becomes particularly alarming for chemically complex materials, such as high-entropy alloys and ceramics. In Table 1 we have applied our approach to up to nc = 5 chemical elements. Consider, for example, that for a five-element hcp system, there are only 9,100 distinct chemical motif compositions but more than 100 million distinct chemical motifs: a difference of four orders of magnitude in the complexity of the chemical representation.
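The composition counts quoted in this section follow from a standard stars-and-bars argument: the number of ways to distribute Na neighbor sites among nc species, multiplied by nc choices for the central atom. The short sketch below, with hypothetical function names, reproduces the figures of 91 and 9,100 given in the text.

```python
from math import comb


def n_shell_compositions(n_atoms, n_species):
    """Number of 1CP compositions: multisets of size n_atoms over n_species."""
    return comb(n_atoms + n_species - 1, n_species - 1)


def n_motif_compositions(n_atoms, n_species):
    """Include the choice of central atom type as a multiplicative factor."""
    return n_species * n_shell_compositions(n_atoms, n_species)


ternary_fcc = n_shell_compositions(12, 3)       # 1CP compositions in Fig. 2a
quinary_hcp = n_motif_compositions(12, 5)       # motif compositions, five elements
quinary_hcp_graphs = 5 ** 13                    # all possible five-element motifs
print(ternary_fcc, quinary_hcp, quinary_hcp_graphs)
```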

Table 1 Counting of distinct chemical motifs

Classifying chemical motifs with machine learning

While Polya’s theory enables the counting of distinct chemical motifs, it is not capable of classifying an unidentified motif, which is a fundamental component of the framework illustrated in Fig. 1c. Classification of an unknown chemical motif requires finding which distinct motif is equivalent to the unknown motif. However, rigorously establishing this equivalency requires the determination of graph isomorphisms, a computational task for which no general polynomial-time algorithm is known. Here we circumvent this computational limitation by employing a randomly initialized E(3)-equivariant graph neural network41 (E(3)-GNN, see the “Methods” subsection “E(3)-equivariant graph neural network”). GNNs are often capable of creating representations that capture intra-graph relationships and topology, effectively distinguishing between graph structures42. This capability can also be understood from the similarity between the GNNs’ message-passing algorithm and the Weisfeiler–Lehman test for graph isomorphism43,44.

Our E(3)-GNN employs a graph convolutional neural network on the 1CP, where nodes represent atoms and edges represent the connections between them. The network processes this data through hidden layers where graph features undergo transformations adhering to the principles of irreducible representations of 3D rotations and spatial inversion; in the end, a fingerprint zi that encodes the chemical motif \({{\mathcal{M}}}_{i}\) is generated. With this neural network architecture, any two equivalent chemical motifs (i.e., motifs that can be related to each other by an E(3) symmetry operation) are guaranteed to have the same fingerprint; or, if \({{\mathcal{M}}}_{i} \sim {{\mathcal{M}}}_{j}\) then zi = zj.

One nuanced point of this approach is that it does not guarantee that any two distinct chemical motifs will have different fingerprints, i.e., if \({{\mathcal{M}}}_{i} \nsim {{\mathcal{M}}}_{j}\) then it is not guaranteed that zi ≠ zj. Thus, validation is required. Here, we accomplish this by evaluating the pattern inventories in Table 1 with E(3)-GNN as follows. First, we create a data set with all possible 1CPs for a given crystal lattice and number of chemical elements, then we compute the fingerprint of each 1CP with E(3)-GNN. A grouping algorithm is employed to cluster all equivalent 1CPs, from which the number of distinct 1CPs with the same composition can be counted. This ML-obtained pattern inventory matches exactly each of the analytically obtained pattern inventories in Table 1, confirming that our framework is able to differentiate between any possible chemical motifs in these systems. Notice how this approach can quickly become computationally intractable: a five-element hcp system requires the creation of 1.2 billion graphs. This is addressed in Supplementary Section 3, where we discuss how to significantly reduce the computational cost by validating the network’s expressivity solely on a symmetrically complete subset of the data set, resulting in orders of magnitude savings.
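The validation logic (fingerprint every 1CP, group equal fingerprints, compare the number of groups with the analytical count) can be sketched without a neural network by substituting an exact but brute-force fingerprint. The example below is not the E(3)-GNN: it assumes a canonical form defined as the lexicographically smallest relabeling of the shell under all symmetry operations, and it uses the small bcc cube with three species so that full enumeration stays cheap. All names are hypothetical.

```python
from itertools import permutations, product

# bcc 1CP: the 8 cube corners; O_h acts as signed coordinate permutations.
VERTS = list(product((1, -1), repeat=3))
INDEX = {v: k for k, v in enumerate(VERTS)}


def group_permutations():
    """Vertex permutations induced by the 48 operations of O_h."""
    perms = set()
    for axes in permutations(range(3)):
        for signs in product((1, -1), repeat=3):
            image = [tuple(s * v[a] for a, s in zip(axes, signs)) for v in VERTS]
            perms.add(tuple(INDEX[w] for w in image))
    return sorted(perms)


PERMS = group_permutations()


def canonical_form(center, shell):
    """Exact stand-in fingerprint: smallest shell labeling over the whole orbit."""
    best = min(tuple(shell[k] for k in p) for p in PERMS)
    return (center, best)


# Group all 3^8 shell colorings by fingerprint and count the groups.
classes = {canonical_form(0, s)[1] for s in product(range(3), repeat=8)}
print("distinct bcc 1CPs found by grouping:", len(classes))

# Equivalent motifs share a fingerprint by construction.
shell = (0, 1, 2, 0, 0, 1, 2, 2)
moved = tuple(shell[k] for k in PERMS[5])
print(canonical_form(1, shell) == canonical_form(1, moved))
```

Unlike a learned or randomly initialized fingerprint, this canonical form separates distinct motifs by construction, but its cost grows with the group order times the number of motifs, so it is only a reference for small cases.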

Information-theoretic quantification of short-range order

The probability of observing a distinct motif \({{\mathcal{M}}}_{i}\) at temperature T is

$$P({{\mathcal{M}}}_{i},T)=\frac{N({{\mathcal{M}}}_{i},T)}{\mathop{\sum }\limits_{j=1}^{{N}_{{\rm{dm}}}}N({{\mathcal{M}}}_{j},T)},$$

where \(N({{\mathcal{M}}}_{i},T)\) is the number of times that \({{\mathcal{M}}}_{i}\) is observed in the thermally equilibrated system at temperature T, and the sum in the denominator is over all Ndm distinct motifs \({{\mathcal{M}}}_{j}\). A random SS (indicated here by T = ∞) is such that \(N({{\mathcal{M}}}_{i},\infty )\) is proportional to the total number of motifs equivalent to \({{\mathcal{M}}}_{i}\), given by \(m({{\mathcal{M}}}_{i})\). For example, Fig. 3a illustrates one motif for each of the possible values of \(m({{\mathcal{M}}}_{i})\) observed in the three-element systems of Fig. 2. Meanwhile Fig. 3b shows the fraction of distinct motifs with a given value of \(m({{\mathcal{M}}}_{i})\).

Fig. 3: Degeneracy of chemical motifs.

The total number of motifs equivalent to \({{\mathcal{M}}}_{i}\) is given by \(m({{\mathcal{M}}}_{i})\). a Illustration of one motif for each of the possible values of \(m({{\mathcal{M}}}_{i})\) observed in the three-element systems of Fig. 2. b Fraction of distinct motifs with a given value of \(m({{\mathcal{M}}}_{i})\).

Thermal effects in a real SS induce a trade-off between enthalpy and entropy that favors low-energy chemical motifs5,6,7,8. This deviation from randomness, namely chemical SRO, can be quantified with a proper information-theoretic measure of the difference between \(P({\mathcal{M}},T)\) and \(P({\mathcal{M}},\infty )\), known as the Kullback–Leibler (KL) divergence45:

$${D}_{{\rm{KL}}}\left[P({\mathcal{M}},T)\,| | \,P({\mathcal{M}},\infty )\right]=\mathop{\sum }\limits_{i=1}^{{N}_{{\rm{dm}}}}P({{\mathcal{M}}}_{i},T)\,{\log }_{2}\left[\frac{P({{\mathcal{M}}}_{i},T)}{P({{\mathcal{M}}}_{i},\infty )}\right].$$
(2)

To illustrate the capabilities of this approach, we have evaluated \(P({{\mathcal{M}}}_{i},T)\) through Monte Carlo (MC) simulations for the bcc high-entropy alloy MoTaNbTi across a wide range of temperatures (see the “Methods” subsection “Monte Carlo simulations”). The KL divergence (Fig. 4a) is seen to approach zero at high temperatures, indicating that the probability of observing motifs is converging towards a random SS (i.e., entropy-dominated). Conversely, at low temperatures, the KL divergence is significantly different from zero, indicating deviations from a random SS due to SRO (i.e., an entropy–enthalpy trade-off). Figure 4a shows that Eq. (2) is a convenient way of summarizing the complex information about all motifs, provided by \(P({{\mathcal{M}}}_{i},T)\), in a single quantity.
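Equation (2) is straightforward to evaluate once motif counts are available. The sketch below assumes a hypothetical system with only two distinct motifs of equal multiplicity, so the numbers are illustrative and unrelated to MoTaNbTi; the random-SS distribution is built from the multiplicities m, as stated in the text.

```python
import numpy as np


def motif_probabilities(counts):
    """Normalize motif counts N(M_i, T) into probabilities P(M_i, T)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()


def kl_divergence_bits(p_t, p_inf):
    """Eq. (2) with a base-2 logarithm; terms with P(M_i, T) = 0 contribute zero."""
    p_t = np.asarray(p_t, dtype=float)
    p_inf = np.asarray(p_inf, dtype=float)
    mask = p_t > 0
    return float(np.sum(p_t[mask] * np.log2(p_t[mask] / p_inf[mask])))


multiplicity = [1, 1]                        # hypothetical m(M_i) for two motifs
p_random = motif_probabilities(multiplicity)  # random SS: proportional to m(M_i)
p_ordered = motif_probabilities([3, 1])       # hypothetical counts at finite T

d_random = kl_divergence_bits(p_random, p_random)
d_ordered = kl_divergence_bits(p_ordered, p_random)
print(d_random, d_ordered)
```

The divergence vanishes when the observed distribution equals the random-SS reference and grows as one motif becomes over-represented, mirroring the high- and low-temperature limits described above.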

Fig. 4: Quantification of chemical short-range order in bcc MoTaNbTi.

a The Kullback–Leibler divergence (DKL) is a proper information-theoretic quantification of chemical short-range order, i.e., the difference between the probability of observing a local chemical motif in thermal equilibrium \(P({{\mathcal{M}}}_{i},T)\) versus in a random solid solution \(P({\mathcal{M}},\infty )\). b Without the local chemical motif representation, there is no atomic-level granularity in understanding the distribution of local lattice strain across the system. c Association of representative chemical motifs with their corresponding local lattice strain and probability \(P({{\mathcal{M}}}_{i},T)\) at T = 300 K. The inset shows that the three motifs with the lowest local strain are variations of motifs observed in a B2-ordered alloy.

This approach also enables the association of any per-atom property with its corresponding motif and \(P({{\mathcal{M}}}_{i},T)\). For example, Fig. 4b shows the distribution of local lattice strains46,47 for the entire system (see the “Methods” subsection “Local lattice strains”), with no atomic-scale granularity in understanding. Meanwhile, using our approach, we obtain Fig. 4c, where representative chemical motifs are associated with their corresponding local lattice strain. In this figure it can be seen that lower strains are associated with motifs that are observed much more often in thermal equilibrium when compared to a random SS, as measured by the probability ratio \(P({{\mathcal{M}}}_{i},T)/P({{\mathcal{M}}}_{i},\infty )\) that is part of Eq. (2).

The inset in Fig. 4c further illustrates the nuanced characterization of SRO provided by our approach. It is commonly accepted that chemical SRO is the precursor of ordered structures (or precipitates), such as the B2 structure shown in Fig. 4c (inset on the left). In this figure, we quantify this concept by showing that the two motifs associated with the TaMo B2 exhibit lower than average local lattice strain while being significantly more frequent in the thermally equilibrated system than in a random SS. Notice that because the system is still in the SS phase (Fig. 4c, inset on the right), the B2 motif with Ta at the center is more often observed in “defected” states (i.e., with one Ti or Nb atom substituting the site of a Mo) than in the ideal configuration expected from the B2 ordered structure. Supplementary Section 5 shows the decomposition of other ordered crystal structures into local chemical motifs.

Motifs dissimilarity

The relative probability of motifs (Eq. (2)) is not a complete characterization of SRO because it does not contain any information about the spatial distribution of these motifs. This missing information can be provided by rigorously defined correlation functions8 between a motif \({{\mathcal{M}}}_{i}\) and other motifs at a distance r from \({{\mathcal{M}}}_{i}\):

$${\phi }_{i}(r,T)=1-2\,{ \langle {d}_{ij} \rangle }_{| {{\bf{r}}}_{i}-{{\bf{r}}}_{j}| = r},$$
(3)

where dij is a dissimilarity measure between motifs \({{\mathcal{M}}}_{i}\) and \({{\mathcal{M}}}_{j}\) (i.e., it quantifies how different these two motifs are), with \({ \langle \ldots \rangle }_{| {{\bf{r}}}_{i}-{{\bf{r}}}_{j}| = r}\) being an average that includes only motifs \({{\mathcal{M}}}_{j}\) (located at rj) at a distance r from \({{\mathcal{M}}}_{i}\) (located at ri). While the framework developed in ref. 8 is still applicable, the dissimilarity measure dij needs to be extended to account for arbitrary lattice geometries and an arbitrary number of elements.

Here we generalize the definition of dij (Eq. (4) in ref. 8) by rewriting it as the sum of three separate terms:

$${d}_{ij}={w}_{1}\cdot {d}_{ij}^{{\rm{cat}}}+{w}_{2}\cdot {d}_{ij}^{{\rm{com}}}+{w}_{3}\cdot {d}_{ij}^{{\rm{dmo}}},$$
(4)

where the weights w1, w2, and w3 govern the importance of each term, which are all normalized to fall within the closed interval [0, 1]. The definition of each dissimilarity component in Eq. (4) is described next, with the support of Supplementary Fig. 3 as a visual guide to the calculation of each component.

The first term (\({d}_{ij}^{{\rm{cat}}}\)) captures the dissimilarity between the central atoms:

$${d}_{ij}^{{\rm{cat}}}=1-{\delta }_{{\tau }_{i}{\tau }_{j}},$$
(5)

where τi is the central atom type of \({{\mathcal{M}}}_{i}\), and δ is the Kronecker delta.

The second term (\({d}_{ij}^{{\rm{com}}}\)) represents the dissimilarity between the chemical compositions of the 1CPs:

$${d}_{ij}^{{\rm{com}}}=\parallel {{\bf{k}}}_{i}-{{\bf{k}}}_{j}{\parallel }_{2},$$

where ki are the Cartesian coordinates obtained from the barycentric coordinates of an (nc − 1)-simplex (see the “Methods” subsection “Simplex and barycentric coordinates”). For example, for a ternary alloy the motifs at different vertices of the composition triangle are separated by a dissimilarity distance of one (see the section “Chemical composition space” in Supplementary Fig. 3).
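One way to realize this composition distance is sketched below. It assumes a particular embedding that may differ from the one in the Methods: the simplex vertices are placed at the unit vectors scaled by 1/√2, which makes every pair of vertices exactly one unit apart. This embedding lives in nc dimensions rather than nc − 1, but the distances agree with those of any regular simplex of unit edge length. Function names are hypothetical.

```python
import numpy as np


def simplex_coordinates(composition):
    """Map 1CP atom counts to a point in a regular simplex with unit edges."""
    x = np.asarray(composition, dtype=float)
    x = x / x.sum()              # barycentric coordinates (atomic fractions)
    return x / np.sqrt(2.0)      # assumed embedding: vertices at e_i / sqrt(2)


def d_com(comp_i, comp_j):
    """Euclidean (L2) distance between two 1CP compositions."""
    return float(np.linalg.norm(simplex_coordinates(comp_i)
                                - simplex_coordinates(comp_j)))


pure_a, pure_b, equiatomic = (12, 0, 0), (0, 12, 0), (4, 4, 4)
print(d_com(pure_a, pure_b))        # two vertices of the composition triangle
print(d_com(pure_a, equiatomic))    # vertex to the center of the triangle
```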

Finally, \({d}_{ij}^{{\rm{dmo}}}\) represents the dissimilarity between distinct motifs with the same 1CP chemical composition:

$${d}_{ij}^{{\rm{dmo}}}={\left\Vert {{\bf{z}}}_{i}-{{\bf{z}}}_{j}\right\Vert }_{2}\times \frac{1}{M},$$
(6)

where zi is the E(3)-GNN embedding of the \({{\mathcal{M}}}_{i}\) graph, and M is a normalization factor set to the maximum L2 distance among all possible motifs.

The weights (w1, w2, and w3) for each dissimilarity component are chosen to be proportional to the number of chemical bonds associated with their corresponding structures in the motif:

$${\bf{w}}=\left[\begin{array}{c}{w}_{1}\\ {w}_{2}\\ {w}_{3}\end{array}\right]=\frac{1}{{N}_{{\rm{a}}}+2{N}_{{\rm{b}}}}\cdot \left[\begin{array}{c}{N}_{{\rm{a}}}\\ {N}_{{\rm{b}}}\\ {N}_{{\rm{b}}}\end{array}\right],$$

where Na is the number of atoms in the 1CP, and Nb is the number of bonds within the 1CP. For example, the fcc crystal structure has Na = 12 and Nb = 24, while in bcc Na = 8 and Nb = 12.
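The weighting scheme and Eq. (4) combine as in the sketch below. The composition and distinct-motif terms are taken as precomputed inputs in [0, 1]; the function names and the example species are illustrative assumptions.

```python
def motif_weights(n_a, n_b):
    """Weights (w1, w2, w3) proportional to (Na, Nb, Nb), normalized to sum to one."""
    norm = n_a + 2 * n_b
    return (n_a / norm, n_b / norm, n_b / norm)


def dissimilarity(tau_i, tau_j, d_com, d_dmo, n_a, n_b):
    """Eq. (4): weighted sum of central-atom, composition, and motif terms."""
    w1, w2, w3 = motif_weights(n_a, n_b)
    d_cat = 0.0 if tau_i == tau_j else 1.0   # Eq. (5): 1 - Kronecker delta
    return w1 * d_cat + w2 * d_com + w3 * d_dmo


w_fcc = motif_weights(12, 24)
w_bcc = motif_weights(8, 12)
print(w_fcc, w_bcc)
print(dissimilarity("Mo", "Ta", d_com=1.0, d_dmo=1.0, n_a=8, n_b=12))
```

Because each component lies in [0, 1] and the weights sum to one, the combined dissimilarity also stays in [0, 1], so the correlation function of Eq. (3) is bounded between −1 and 1.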

Using this generalized approach, we have completed the characterization of SRO in MoTaNbTi, which was initiated in Fig. 4. Figure 5a shows the spatial correlation function for a representative motif in this system. With correlation functions such as this one, it is possible to evaluate the length scale8 (ξi) of chemical fluctuations for each motif \({{\mathcal{M}}}_{i}\), which is an important materials parameter in the understanding of various chemistry–microstructure relationships3,4. Figure 5b shows the probability distribution of the length scale of chemical fluctuations, where it can be seen that the effect of SRO is to increase the average length scale. Figure 5 also shows that the distribution of chemical fluctuations converges towards the distribution of a random SS (i.e., entropy-dominated) at high temperatures, similar to what was observed in Fig. 4a for the probability distribution of chemical motifs.

Fig. 5: Length scale of chemical fluctuations in bcc MoTaNbTi.

a Spatial correlation function for a representative chemical motif. Higher temperatures reduce the amount of spatial correlation between motifs. b Probability distribution of the length scale of chemical fluctuations (ξi). Inset shows the temperature dependence of the maximum of the probability distribution, where it can be seen that the distribution of chemical fluctuations converges towards the distribution of a random SS (i.e., entropy dominated) at high temperatures.

Notice that \({d}_{ij}^{{\rm{dmo}}}\) (Eq. (6)) directly depends on the E(3)-GNN embedding, which makes the correlation functions implicitly depend on the model parameters. In Supplementary Section 7, we demonstrate that the chemical fluctuation length scale is robust against different random initializations of the E(3)-GNN. The authors believe that this robustness originates from the definition8 of the correlation function \({C}_{i}(r,T)={\phi }_{i}(r,T)-{\phi }_{i}^{0}(r,T)\), which is relative to a “baseline” \({\phi }_{i}^{0}(r,T)\) that is also influenced by the random initialization:

$${\phi }_{i}^{0}(r,T)=1-2\,{ \langle {d}_{ij} \rangle }_{P({\mathcal{M}},T)},$$

where \({ \langle \ldots \rangle }_{P({\mathcal{M}},T)}\) indicates an average evaluated with motifs \({{\mathcal{M}}}_{j}\) randomly sampled from the thermally equilibrated distribution \(P({\mathcal{M}},T)\).
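The relation between Eq. (3), the baseline, and the correlation function can be sketched as follows. The inputs are hypothetical: a row of dissimilarities d_ij for one motif i, the matching distances, and a set of dissimilarities to motifs sampled from the equilibrium distribution. Shell selection by a distance tolerance is an assumption of this example.

```python
import numpy as np


def phi_at_distance(d_row, dist_row, r, tol=1e-6):
    """Eq. (3): 1 - 2 <d_ij>, averaged over motifs j at distance r from motif i."""
    d_row = np.asarray(d_row, dtype=float)
    shell = np.abs(np.asarray(dist_row, dtype=float) - r) < tol
    return 1.0 - 2.0 * float(np.mean(d_row[shell]))


def phi_baseline(d_samples):
    """Baseline: same form, averaged over motifs sampled from P(M, T)."""
    return 1.0 - 2.0 * float(np.mean(d_samples))


def correlation(d_row, dist_row, r, d_samples):
    """C_i(r, T) = phi_i(r, T) - phi_i^0(r, T)."""
    return phi_at_distance(d_row, dist_row, r) - phi_baseline(d_samples)


# Toy data: near neighbors resemble motif i, distant ones look like random draws.
d_row = [0.1, 0.1, 0.5, 0.5]
dist_row = [1.0, 1.0, 2.0, 2.0]
d_samples = [0.5, 0.5, 0.5]
c_near = correlation(d_row, dist_row, 1.0, d_samples)
c_far = correlation(d_row, dist_row, 2.0, d_samples)
print(c_near, c_far)
```

Any uniform rescaling of d_ij caused by a different network initialization affects both terms of the difference, which is consistent with the robustness argument given above.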

Discussion

The generalized framework presented here characterizes chemical fluctuations in arbitrary crystal lattices with any number of chemical elements, with a rigorous mathematical foundation for the generalization being provided. The framework culminates in a reduced representation of the chemical space (Supplementary Fig. 3) and an information-theoretic quantification of SRO. Analytical results using group theory demonstrate that this approach eliminates degeneracies present in other representations of SRO (Fig. 1b). Our framework can identify motifs in atomistic data with a computational speed of \(1.2\times {10}^{6}\) atoms per hour on a single CPU core of an Apple Silicon M1 processor, or \(63\times {10}^{6}\) atoms per hour on an NVIDIA V100s GPU. While this is indeed more computationally expensive than computing first nearest-neighbors Warren–Cowley parameters, the computational cost is still negligible for typical large-scale atomistic simulations with tens of millions of atoms. The application of this framework to arbitrary crystal structures can be automated by employing symmetry finder algorithms48 to determine the symmetry group of chemical motifs. It is important to consider the neural network expressivity and memory constraints when expanding the scope of this method, especially when working with high-entropy materials. In Supplementary Section 3 we provide a series of strategies for validating the expressivity on a symmetrically complete subset of the data set, which enables the application of this approach to systems with at least 1.2 billion motifs.

The approach introduced here is complete for the first-coordination environment of an atom, while the WC parameters are not (Fig. 1b). Yet, our approach is not complete beyond the first-coordination environment, and this subtle point warrants some discussion as it remains a long-standing challenge in the field. The incompleteness of our approach beyond first-neighbors is due to the fact that the full set of correlation functions (Eq. (3)) is still not enough to completely reconstruct the chemical state of the system. Fundamentally, this originates from the aggregation of the contribution of all motifs \({{\mathcal{M}}}_{j}\) to the correlation function of motif \({{\mathcal{M}}}_{i}\) in Eq. (3). A complete description would require pairwise correlation functions (i.e., individual correlations between all motif pairs), and possibly higher-order correlation functions (e.g., between motif triplets), which might be computationally impractical to evaluate. Perhaps more importantly than computational feasibility: the authors are not aware of experimental measurements on this complete set of correlation functions or any suggestion in the literature of how they affect materials' properties. Meanwhile, the set of correlation functions evaluated here (Eq. (3)) is, in principle, physically equivalent to approaches currently in use to evaluate SRO length scale with electron microscopy27,30, and has been historically employed to evaluate experimental observables such as scattering cross sections and susceptibilities in statistical physics27,30,49,50.

A discussion is also warranted regarding the interpretability of our approach beyond the first-coordination environment when compared to WC parameters. By aggregating (Eq. (4)) the contribution of the central atom chemical species (\({d}_{ij}^{{\rm{cat}}}\), Eq. (5)) with the other contributions (namely, 1CP chemical composition \({d}_{ij}^{{\rm{com}}}\) and motif structure \({d}_{ij}^{{\rm{dmo}}}\)) in the calculation of the correlation function (Eq. (3)) we end up with a description beyond the first-coordination environment that is less interpretable than WC parameters, which describe the variation in chemical composition with distance in a physically intuitive manner. This reduced interpretability is traded for an increase in the physical fidelity of the evaluation of SRO length scales through the inclusion of effects beyond chemical composition (i.e., \({d}_{ij}^{{\rm{com}}}\) and \({d}_{ij}^{{\rm{dmo}}}\)).

Extending our current framework to account for long-range order would be a natural continuation of the work presented here because this would allow the investigation of phenomena where ordered compounds and solid solutions are both present (e.g., precipitation hardening). This could be accomplished, for example, with motif-node-based graphs51,52, where each node in the graph describes an entire chemical motif, or with variational autoencoder-based order parameters53. In Supplementary Section 8 we show that the chemical fluctuation length scale (Fig. 5b) shows variations that are compatible with peaks in the specific heat of CrCoNi that have been attributed to transitions to long-range order (i.e., an order–disorder transition).

The capabilities developed here facilitate the evaluation of chemistry–microstructure relationships that will be valuable for materials theory and experiments alike. For example, this approach is useful for augmenting the visualization of large-scale atomistic simulations13,14,17,18,19,54 or experimental imaging at the atomic scale24,25,26,27,28,29,30,35,36,37,38, leading to better characterization of chemical SRO and its connection with physical properties (e.g., Fig. 4b, c). Our approach could also better inform the chemistry during the development of machine learning interatomic potentials for chemically complex systems55,56,57,58,59, which is currently a challenge for the state-of-the-art in the field. The results presented here demonstrate how data science and machine learning can be employed to uncover chemical complexity in large atomic-scale data sets and transform these findings into quantities of relevance for the physical modeling of these materials.

Methods

Polya’s enumeration theory

Consider a system with nc chemical elements in a specified crystal structure, which defines the 1CP. We define S to be the set of vertices of a 1CP, C to be the set of possible atom types (i.e., chemical elements), and G to be a group isomorphic to the 1CP symmetry group. The 1CP pattern inventory is given by Polya’s enumeration theory60:

$${P}_{G}\left(\sum _{c\in C}w(c),\sum _{c\in C}w{(c)}^{2},\ldots \,,\sum _{c\in C}w{(c)}^{d}\right),$$
(7)

where PG is the cycle index polynomial of G, w(c) is the atomic type label of c ∈ C (e.g., “Mo”, “Ta”, “Nb”, or “Ti” for MoTaNbTi), and d = |S| is the cardinality of S (i.e., number of elements in the set). The pattern inventories for fcc, hcp, and bcc lattices with three elements are given in Supplementary Section 2. Pattern inventories for arbitrary lattices and numbers of elements can be evaluated using our Polya Python package61.

Equation (7) contains all of the information required to enumerate (i.e., count) the number of distinct 1CPs for each of its possible chemical compositions. For example, the pattern inventory of an ABC ternary hcp system (Supplementary Section 2) allows us to easily read from the polynomial coefficients that there are 1444 distinct 1CPs with chemical composition A5B2C5. Similarly, the total number of distinct 1CPs (Ndm/nc, also known in group theory as the number of orbits) can be obtained by setting w(c) = 1 for all c ∈ C in the polynomial:

$$\frac{{N}_{{\rm{dm}}}}{{n}_{{\rm{c}}}}={P}_{G}(| C| ,\ldots \,,| C| ).$$
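A composition-resolved pattern inventory can be sketched in pure Python by expanding the cycle index with one variable per species. This is an independent illustration, not the Polya package referenced above. The cycle types listed for the fcc (Oh) and hcp (D3h) 1CPs were worked out by hand for this example and should be treated as assumptions; they are checked here only against the counts quoted in the text (768, 12,111, and 1444).

```python
from collections import defaultdict


def multiply(p, q):
    """Multiply two polynomials stored as {exponent tuple: coefficient}."""
    out = defaultdict(int)
    for ea, ca in p.items():
        for eb, cb in q.items():
            out[tuple(a + b for a, b in zip(ea, eb))] += ca * cb
    return dict(out)


def pattern_inventory(cycle_classes, n_colors):
    """Eq. (7) from cycle types given as (class size, list of cycle lengths)."""
    order = sum(size for size, _ in cycle_classes)
    inventory = defaultdict(int)
    for size, cycles in cycle_classes:
        term = {(0,) * n_colors: 1}
        for length in cycles:
            # A cycle of this length contributes sum_c w(c)^length.
            factor = {tuple(length if k == c else 0 for k in range(n_colors)): 1
                      for c in range(n_colors)}
            term = multiply(term, factor)
        for expo, coeff in term.items():
            inventory[expo] += size * coeff
    return {expo: coeff // order for expo, coeff in inventory.items()}


# Assumed cycle types of the symmetry operations acting on the 12 shell sites.
FCC_OH = [(1, [1] * 12), (6, [4] * 3), (3, [2] * 6), (8, [3] * 4),
          (6, [1, 1] + [2] * 5), (1, [2] * 6), (6, [4] * 3),
          (3, [1] * 4 + [2] * 4), (8, [6] * 2), (6, [1, 1] + [2] * 5)]
HCP_D3H = [(1, [1] * 12), (2, [3] * 4), (3, [2] * 6),
           (1, [1] * 6 + [2] * 3), (2, [3, 3, 6]), (3, [1, 1] + [2] * 5)]

fcc = pattern_inventory(FCC_OH, 3)
hcp = pattern_inventory(HCP_D3H, 3)
print("fcc A4B4C4:", fcc[(4, 4, 4)], "total:", sum(fcc.values()))
print("hcp A5B2C5:", hcp[(5, 2, 5)])
```

Each key of the returned dictionary is a composition (the exponents) and each value is the number of distinct 1CPs with that composition (the coefficient). Summing the values is equivalent to setting w(c) = 1 for every species.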

E(3)-equivariant graph neural network

Local chemical motifs were converted to graphs in which nodes represent atoms and carry the atomic type (i.e., chemical element) as a one-hot-encoded attribute. Graph edges store the direction (unit) vector \({\hat{{\bf{r}}}}_{ij}\) between atoms i and j. Graph embeddings are made invariant to local lattice distortions by remapping atoms to their ideal positions before constructing the graphs. Each graph is then processed through a randomly initialized E(3)-GNN41 (implemented using the e3nn package62) composed of \({\mathbb{E}}(3)\)-equivariant convolutions and gates, generating a fingerprint \({{\bf{z}}}_{i}\in {{\mathbb{R}}}^{4}\) of the chemical motif \({{\mathcal{M}}}_{i}\).
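The graph construction described above can be sketched with NumPy alone. This is an illustrative sketch under stated assumptions, not the implementation used in this work: it assumes a bcc motif (central atom plus eight first neighbors at ideal positions), and it assumes a star-shaped connectivity in which edges run only from the central atom to each neighbor, which the text does not specify. The function name `motif_to_graph` and the species ordering are hypothetical.

```python
import numpy as np

# Ideal bcc first-neighbor offsets, in units of half the lattice parameter.
BCC_1NN = np.array(
    [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
    dtype=float,
)


def motif_to_graph(center_type, neighbor_types, species):
    """Build node attributes, edge indices, and edge unit vectors for a motif.

    Atoms are placed at ideal bcc positions, so any local lattice
    distortion present in the input structure is discarded by construction.
    """
    types = [center_type, *neighbor_types]
    index = {s: k for k, s in enumerate(species)}

    # One-hot encoding of the chemical element of each node.
    node_attr = np.zeros((len(types), len(species)))
    for row, t in enumerate(types):
        node_attr[row, index[t]] = 1.0

    positions = np.vstack([np.zeros(3), BCC_1NN])

    # Assumed connectivity: directed edges from the center (node 0)
    # to each first neighbor (nodes 1..8).
    src = np.zeros(len(neighbor_types), dtype=int)
    dst = np.arange(1, len(types))
    vec = positions[dst] - positions[src]
    unit = vec / np.linalg.norm(vec, axis=1, keepdims=True)
    return node_attr, np.stack([src, dst]), unit


if __name__ == "__main__":
    species = ["Mo", "Ta", "Nb", "Ti"]
    neighbors = ["Ta", "Ta", "Nb", "Ti", "Mo", "Mo", "Nb", "Ti"]
    x, edge_index, r_hat = motif_to_graph("Mo", neighbors, species)
    print(x.shape, edge_index.shape, r_hat.shape)
```

The unit vectors returned here are the quantities on which spherical harmonics would be evaluated in an equivariant network.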

Our E(3)-GNN architecture is composed of three convolutions using spherical harmonics \({Y}_{\ell }^{m}\left({\hat{{\bf{r}}}}_{ij}\right)\) as edge attributes, with degrees as large as \({\ell }_{\max }=2\). A total of 10 cosine radial basis functions are employed, with ranges evenly spread from zero to three distance units (measured as multiples of the nearest-neighbor distance). The final output length is four, i.e., \({{\bf{z}}}_{i}\in {{\mathbb{R}}}^{4}\). The irreducible representations of the hidden layers, in e3nn notation, were set to 3 × 2o + 3 × 2e + 3 × 1o + 3 × 1e + 3 × 0o + 3 × 0e. Equivariant neural networks have been shown to be capable symmetry compilers, enabling the representation of important features of atomic geometry even with random initialization63. Our choice of randomly initializing the network and not training it is meant to maximize the influence of each weight. This choice is also supported by the agreement between the pattern inventories obtained with this computational approach and the analytical results from Polya’s enumeration theory.

Monte Carlo simulations

Monte Carlo simulations using the machine learning moment tensor potential64 from ref. 65 were employed to sample the thermal-equilibrium chemical configurations for the bcc MoTaNbTi high-entropy alloy. Starting configurations were composed of 13 × 13 × 13 chemically random supercells at the equiatomic composition. The simulations were performed for a total of 30 Monte Carlo steps per atom, with acceptance probabilities based on the Metropolis-Hastings algorithm66,67. Periodic boundary conditions were enforced in all dimensions.
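The acceptance rule can be sketched as follows. This is a minimal illustration of the standard Metropolis criterion, assuming energies in eV and temperatures in K. The type of trial move and the energy evaluation with the moment tensor potential are not shown, and the function names are hypothetical.

```python
import math
import random

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def acceptance_probability(delta_e, temperature):
    """Metropolis acceptance probability for an energy change delta_e (eV)."""
    if delta_e <= 0.0:
        return 1.0  # moves that do not raise the energy are always accepted
    return math.exp(-delta_e / (K_B * temperature))


def accept_move(delta_e, temperature, rng=random):
    """Draw a uniform random number and decide whether to accept the move."""
    return rng.random() < acceptance_probability(delta_e, temperature)


if __name__ == "__main__":
    for temperature in (300.0, 1000.0, 1700.0):
        print(temperature, acceptance_probability(0.05, temperature))
```

At fixed positive energy change, the acceptance probability grows with temperature, which is how the simulation samples increasingly random chemical configurations at high temperature.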

Simulations were carried out for temperatures ranging from 300 to 1700 K in 100 K increments; the lattice parameter at each temperature accounted for thermal expansion. A total of 20 independent simulations were performed for each temperature, and only the final configuration of each simulation was employed in our analyses, which resulted in a data set of 87,880 motifs per temperature (13 × 13 × 13 unit cells × 2 atoms per bcc unit cell × 20 simulations). The nearest-neighbor Warren–Cowley parameters of this system, calculated using our Ovito68 WarrenCowleyParameters Python modifier (github.com/killiansheriff/WarrenCowleyParameters), are shown in Supplementary Section 4 as a function of temperature.

Local lattice strain

The final configuration of each Monte Carlo simulation was relaxed with fixed simulation box dimensions, and the local lattice strain of each atom n was evaluated:

$${\delta }_{n}(T)=\frac{\parallel {{\bf{r}}}_{n}^{{\rm{f}}}-{{\bf{r}}}_{n}^{{\rm{i}}}{\parallel }_{2}}{{a}_{{\rm{NN}}}(T)},$$

where \({{\bf{r}}}_{n}^{{\rm{f}}}\) is the atom’s position after relaxation, \({{\bf{r}}}_{n}^{{\rm{i}}}\) is the position before relaxation (i.e., in the ideal bcc structure), \(\parallel \cdot {\parallel }_{2}\) is the L2 norm, and aNN(T) is the nearest-neighbor distance at temperature T.
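The strain definition above translates directly into a few lines of NumPy. The sketch below is illustrative only: the function names are hypothetical, and the optional minimum-image correction assumes an orthorhombic periodic box, which is an added assumption for atoms that cross a periodic boundary during relaxation. The bcc nearest-neighbor distance is obtained from the lattice parameter a as √3a/2.

```python
import numpy as np


def bcc_nearest_neighbor_distance(lattice_parameter):
    """Nearest-neighbor distance in a bcc lattice: sqrt(3) * a / 2."""
    return np.sqrt(3.0) * lattice_parameter / 2.0


def local_lattice_strain(r_initial, r_final, a_nn, box_lengths=None):
    """Per-atom displacement magnitude divided by the nearest-neighbor distance.

    r_initial, r_final: arrays of shape (n_atoms, 3).
    box_lengths: optional box edge lengths for an orthorhombic periodic cell;
    when given, displacements are wrapped with the minimum-image convention.
    """
    disp = np.asarray(r_final, dtype=float) - np.asarray(r_initial, dtype=float)
    if box_lengths is not None:
        box = np.asarray(box_lengths, dtype=float)
        disp = disp - box * np.round(disp / box)
    return np.linalg.norm(disp, axis=1) / a_nn


if __name__ == "__main__":
    a_nn = bcc_nearest_neighbor_distance(3.2)  # illustrative lattice parameter
    r_i = np.array([[0.0, 0.0, 0.0], [1.6, 1.6, 1.6]])
    r_f = np.array([[0.03, 0.04, 0.0], [1.6, 1.6, 1.6]])
    print(local_lattice_strain(r_i, r_f, a_nn))
```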

Simplex and barycentric coordinates

A simplex69 is a generalization of triangles (2-simplex) and tetrahedra (3-simplex) to higher dimensions. It represents the simplest possible polytope in any given dimension. Mathematically, the standard n-simplex is defined as the subset of \({{\mathbb{R}}}^{n+1}\) given by

$${\nabla }^{n}=\left\{{\bf{t}}\in {{\mathbb{R}}}^{n+1}:\mathop{\sum }\nolimits_{i = 1}^{(n+1)}{t}_{i}=1\wedge {t}_{i}\ge 0\,{\rm{for}}\,i=1,\ldots ,(n+1)\right\}.$$

A barycentric coordinate system specifies the location of a point with respect to a simplex. In our case, the barycentric coordinates correspond to the chemical composition of a 1CP. For example, consider a motif \({{\mathcal{M}}}_{j}\) and its associated 1CP. The barycentric coordinates for the composition of this 1CP are

$${{\boldsymbol{\lambda }}}_{j}=\frac{1}{{N}_{{\rm{a}}}}\left({N}^{(1)},{N}^{(2)},\ldots ,{N}^{({n}_{{\rm{c}}})}\right),$$

where Na is the number of atoms in the 1CP, N(i) is the number of atoms of type “i”, and nc is the number of chemical elements in the system. While the barycentric coordinates are a convenient form to express the chemical composition, they can be converted into the more conventional Cartesian coordinates \({{\bf{k}}}_{j}\in {{\mathbb{R}}}^{({n}_{{\rm{c}}}-1)}\) with the following equation:

$${{\bf{k}}}_{j}=\frac{1}{{N}_{{\rm{a}}}}\mathop{\sum }\limits_{i=1}^{{n}_{{\rm{c}}}}{N}^{(i)}{{\bf{v}}}_{i},$$

where vi are vertices of the (nc−1)-simplex (in Cartesian coordinates). The numerical implementation of n-simplex objects can be found in our Simplex Python package70.
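As an illustration of the conversion from barycentric to Cartesian coordinates, the sketch below treats a ternary system, for which the simplex is a triangle. It is not the Simplex package referenced in the text. The placement of the triangle vertices (unit edge length, first vertex at the origin) and the function name are arbitrary assumptions made for this example.

```python
import numpy as np

# Assumed vertices of a 2-simplex (equilateral triangle with unit edges),
# one vertex per chemical element A, B, C.
TRIANGLE = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.5, np.sqrt(3.0) / 2.0],
])


def barycentric_to_cartesian(counts, vertices):
    """Map atom counts per element in a 1CP to a point inside the simplex.

    counts: number of atoms of each type, N^(1), ..., N^(nc).
    vertices: array of shape (nc, nc - 1) with the simplex vertices.
    """
    counts = np.asarray(counts, dtype=float)
    lam = counts / counts.sum()  # barycentric coordinates, summing to one
    return lam @ np.asarray(vertices, dtype=float)


if __name__ == "__main__":
    # hcp 1CP (12 atoms) with composition A5B2C5.
    print(barycentric_to_cartesian([5, 2, 5], TRIANGLE))
```

Because the barycentric coordinates are non-negative and sum to one, every composition maps to a point inside or on the boundary of the triangle, with pure compositions landing on the vertices.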