Properties of solid-state materials are strongly related to their crystal structures. Even in the same elemental composition, the physical properties such as magnetization and adsorption energy are significantly affected by the crystal structures1,2,3,4. HEA, which consists of more than five elements, has drawn intensive attention for its outstanding mechanical properties5,6 when forming the solid solution phase. The mechanical properties of solid solution HEA depend on its phase. The fcc HEA has high ductility7, and the bcc HEA has high strength8. That`s why valence electron concentration (VEC) is used to classify the bcc and the fcc solid solution phase of HEA9,10.

To confirm the crystal structures efficiently, the structural searching in combination with the evolutionary algorithm with density functional theory (DFT) have been applied11,12. Recent approaches for crystal structure prediction become accelerated by adopting machine learning algorithms trained with the available experimental and theoretical database. Learning-based methods even predict the crystal structures of unknown materials using a sufficient number of training data13,14. As a result, one can bypass direct experiments or calculations to find the structural phases, so the cost for exploring the unknown materials and their characteristics becomes significantly reduced. In practice, the existing database, such as the inorganic crystal structure database (ICSD)15 and Automatic-Flow (AFLOW)16 have been used for training data. For instance, to investigate the most probable Mn-Ge and Li-Mn-Ge system structure, deep neural network (DNN) with ICSD has been used to predict the crystal structures13. When the number of the existing training data is insufficient, the calculation based on DFT can be applied to generate the training data17.

However, the above approaches cannot be applied to the unexplored multi-elements alloys such as HEA18 because of the insufficient data in the experiment. In addition, the possible compositional number of HEA is more than 106, so preparing training data set of HEA using DFT calculation like other crystal system19,20 is infeasible. Although some machine learning-based approach shows accurate performance 21,22, the most approaches for predicting phases of unexplored HEA are restricted to nearly equiatomic cases23,24. It is because the calculation of the non-equiatomic HEA dataset requires huge computation due to its vast compositional space25. Therefore, the prediction of the HEA’s crystal structures without the calculation in the vast space is a demanding issue26.

In this sense, we develop a learning-based approach to predict the vast compositional space of multi-element alloys (binary alloy, ternary alloy, and HEA), while only the binary alloy dataset is involved as the training set.

For structural phase prediction using a learning-based approach, designing proper features is crucial, because it determines the cost and accuracy of the prediction. Conventionally, the compositional properties such as Z(i) (atomic number), \(n_{d}\)(i) (d-orbital occupancy), and \(\sigma_{d}\)(i) (d-orbital spin) for ith atom are used as proper features for predicting structural phases of binary alloys27.

Especially in previous works, it is revealed that \(n_{d}\) and \(\sigma_{d}\) denotes occupancy of d electrons30. The d electron occupancy effectively involves in cohesive interaction and determines the stability of the crystal structural phase. Therefore, from several decades ago, this occupancy is widely used to classify the structural phase of transition metal. H. L. Skriver classify bcc, fcc, and hcp phase of 3d, 4d and 5d non-magnetic transition metal using \(n_{d}\)28, and it is expanded to magnetic transition metal using \(n_{d}\) and \(\sigma_{d}\) features29,30.

However, this approach is not directly applicable to multi-element alloys because the number of features is increasing as the types of elements increase. Although {\(n_{d}\)(N), \(\sigma_{d}\)(N)}, as a list of paired features for N-elements alloy, are well known as features for the crystal structure prediction of transition metal30, expensive DFT calculation is still necessary to obtain those values for multi-element alloys.

In this work, we propose expandable {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} features, which are transformed from {\(n_{d}\)(N), \(\sigma_{d}\)(N)} features as illustrated in Fig. 1. For the transformation from {\(n_{d}\)(N), \(\sigma_{d}\)(N)} to {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)}, we utilized ensemble trees31 considering each atoms’ surrounding condition in the alloy. In practice, \(n_{d}\) and \(\sigma_{d}\) of the transition metal can be changed by the electron transfer from s or p orbitals when the lattice constants or surrounding atoms are changed in transition metal alloy32. For example, \(\sigma_{d}\) of Mn can be significantly enlarged when the volume of the alloy increases33. To consider those conditions, the concentration (C) and atomic radius difference (δ) are added as additional features to obtain {\(n_{d}^{tr \left( N \right)}\), \(\sigma_{d}^{tr \left( N \right)} \}\) features in alloy condition. Finally, the {\(n_{d}^{tr \left( N \right)}\), \(\sigma_{d}^{tr \left( N \right)} \}\) features are reduced to {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} features by average pooling, as shown in Fig. 1. (The details of the feature transformation process are in section "Algorithm of this work".) Note the {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} features are always two variables in any number of element types in the multi-elements alloy. So, these expandable features can be used to train of binary alloy dataset, and then applied to the prediction of the multi-elements alloy properties as demonstrated in the following.

Figure 1
figure 1

Schematic representation for feature transformation module to obtain {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} features from {\(n_{d}\)(N), \(\sigma_{d}\)(N)} features. {\(n_{d}\)(N), \(\sigma_{d}\)(N), C, δ} features in alloy {\(M\)(N)} consist of raw features. With regression tree ensembles, {\(n_{d}\)(N), \(\sigma_{d}\)(N)} features are transformed to {\(n_{d}\) tr (N), \(\sigma_{d}\) tr (N)}. In this transformation, {C, δ} features are used in edges in the ensemble tree. Then, by vector-wise average pooling, {\(n_{d}\) tr (N), \(\sigma_{d}\) tr (N)} is reduced to {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} features, which used C of the constituent atom as weight. {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} is used for the training of the calculated binary alloy dataset. Note that {\(n_{d}\)(N), \(\sigma_{d}\)(N)} features are the information from a pure transition metal, while {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} features represent the information in alloy condition. After the training of the module, the {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} features obtained from multi-element alloys can be used for the prediction of the structural phases.

Results and discussions

For the generation of binary alloy dataset, the stable crystal structures of disordered transition metal binary alloys are calculated in all compositional space. Using DFT calculations, we consider three structural phases of body-centered cubic (bcc), face-centered cubic (fcc), and hexagonal close-packed (hcp), which are competing with each other depending on the elemental configurations. The calculated structural phases are compared to the experimental results, which show good agreement. So, we believe that our calculated results can be used for the training set of crystal structure prediction of the experimental data.

To validate the {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} features, we compare the accuracies of structural phase predictions using {\(n_{d}^{\left( N \right)}\), \(\sigma_{d}^{\left( N \right)}\)} and {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} features by the evaluation from the test set of the calculated binary alloys in Fig. 2. Figure 3(a) shows structural phase classification region trained by {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} features. The accuracy with {\(n_{d}^{\left( N \right)}\), \(\sigma_{d}^{\left( N \right)}\)} features is 81.1% and with {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} features is 78.74%. This validation indicates that the transformed features, {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} well reproduce the prediction accuracy with raw features, {\(n_{d}\)(N), \(\sigma_{d}\)(N)} in the binary alloy.

Figure 2
figure 2

Binary alloy dataset of \(M_{x}^{\left( 1 \right)} M_{1 - x}^{\left( 2 \right)}\)(\(M^{\left( 1 \right)}\) = Mn, Fe, Co, and Ni; \(M^{\left( 2 \right)}\) = TM) generated from DFT calculation. bcc, fcc, and hcp structures are denoted in blue, red, and green, respectively. The lowest energy phase is denoted as a stable phase and has the second-lowest energy used as a meta-stable phase. The energy difference between stable phase and meta-stable phase is denoted as ΔEDFT. To confirm the validity of the DFT calculation, the phases of binary alloys from experimental reports are denoted as dot and cross for correct and incorrect predictions, respectively. Among the experimental data of binary alloys, 212 alloys containing Mn, Fe, Co, and Ni are used (The experimental binary alloy data is available in Table S1).

Figure 3
figure 3

(a) Classification region of the structural phases in {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} feature space, which is trained from the calculated binary alloy dataset. bcc, fcc, and hcp are denoted with blue, red, and green shaded regions, respectively. Since the trained regions from the {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} and {\(n_{d}^{\left( N \right)}\),\({ }\sigma_{d}^{\left( N \right)}\)} features are similar, and we only show the trained result from {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} features in this figure. (b) A t-SNE58 plot of {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\), δ} features. The 1862 binary alloys from DFT calculation and the experimentally determined 870 multi-element alloys are denoted with circles for comparison. All the experimental multi-element alloys are located in the range of the calculated binary alloys. (c) Mean accuracy of the test set for the calculated data of binary alloy and the experimental data of binary alloy, ternary alloy, and HEA with various feature sets. 10% of the calculation data is randomly chosen as the test set, and the remaining 90% of the calculated data is used as the training set. The error bar denoted the standard deviation of the accuracy.

In addition to the transformed {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} features, we also use atomic size difference (δ), configurational entropy (Sc) and electronegativity difference (\(\chi\) d) which are known to determine the stability of HEA49. To predict the structural phase with the chosen features, the support vector machine50 was used with the calculated binary alloy dataset in Fig. 2. Figure 3(c) shows the accuracies of the phase prediction for each set of chosen features. As expected, the accuracy for binary dataset increases with a large number of features, and it is up to 91.78%. This behavior is well known and shown in most machine learning works22,23 when the training set and test set are divided from the same data set.

We applied this trained algorithm to the experimentally reported multi-elements alloys (Tables S1 and S2) for the practical demonstration of this work. Here, many binary alloys, ternary alloys, and HEAs such as VNbMoTaW51 and CoCrFeMnNi52, well known for application, are included in the test set. As shown in Fig. 3(c), the accuracies of the calculated data and all the experimental data are comparable in case of {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} and {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\), δ} feature spaces. In {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\), δ} feature space as shown in Fig. 3(b), the accuracy for experimental data of all multi-elements alloys is 80.56%, which is comparable with 81.79% accuracy for calculated data of binary alloys. Especially, the accuracy of the HEA increases up to 84.20%, and it comes from that the existing HEA data mainly consists of bcc and fcc phases. In both the calculated data and the experimental data, the misclassification data mainly comes from fcc and hcp phases (See the confusion matrices in figure S3). It implies that the accurate determination of fcc and hcp phases in the calculated data will improve the classification performance of the experimental multi-element alloy.

This result implies that the trained algorithm by binary alloys can be expanded to the prediction for the experimental multi-elements dataset, including HEA. The accuracy of HEA is comparable with the previous works that classify the phases of HEA with machine learning. For classification of bcc, fcc, and NSP (non-single phase) of HEA with support vector machine (SVM), it has 90.69% accuracy22 and classification of bcc, fcc, and hcp phase of HEA 87 ~ 89% accuracy23.

Unlike the accuracy of the calculated binary alloy data, the accuracy of the experimental data of ternary alloy, and HEA drastically decreases when Sc and \(\chi\) d features are added. Figure 4(a) shows why Sc can`t use as expandable features. In Fig. 4(a), binary can have \(S_{c,max}^{bin}\) when it forms the binary equiatomic alloy. However, a multi-element alloy which consists of MEA and HEA have larger Sc than \(S_{c,\max }^{{{\text{bin}}}}\) 53. It makes the HEA data located in the extrapolative region. In Fig. 4 (b), most Sc of ternary alloy and HEA data distributed out of range of the binary alloy data. It is well known that machine learning shows poor performance in extrapolation region54, so the HEA data shows 30% accuracy with the five features, including Sc. When \(\chi\) d is applied as an additional feature, the accuracy of HEA also significantly decreases compared with the binary and ternary alloy, so the \(\chi\) d feature is not considered as an expandable feature for phase prediction of HEA.

Figure 4
figure 4

(a) Sc of binary alloy as a function of c (Gray line). \(S_{c,max}^{bin}\) is the theoretical maximum Sc that binary alloy can have. Beige color and orange color denote the MEA (Medium entropy alloy) and HEA region, respectively. (b) Data distribution of calculated binary alloy and experimental data (binary alloy, ternary alloy, HEA) in and Sc feature.

This work has an advantage of saving costs for generating training set data. The existing work based on raw features and neural networks used at least 80% data of HEA in the training process to classify the phases of HEA23,55,56. The HEA data for training is limited because it is based on the experimental results, and it is hard to get calculation data for its substantial computational cost. However, in this work, no HEA data is involved in the training process. Instead, a simple binary alloy dataset is used to predict the phase of HEA with comparable accuracy. This kind of approach is not common and applicable only when the training set and the test set are located in similar feature space (Fig. 3b).

By using expandable features for the structural phase prediction of multi-elements alloy, this work shows significant improvement in the cost of preparing the training set. As shown in Fig. 5, the cost for generating a training set with raw features increases when the number of elemental types of alloy increases. By using the expandable features, however, the cost for generating a binary alloy dataset is only required. To schematically evaluate the cost for training set, the average cost for equiatomic 3d transition metals is calculated with AKAI-KKR-CPA40 code in 2.1 GHz Intel Xeon E5-2620 processor. By considering the number of possible configurations with the cost, the total cost for the binary alloy dataset requires 0.56 years/core. Likewise, the cost becomes 20.46, 1,144.11, and 159,056.46 years for ternary, quaternary, and quinary alloys, respectively. So, it becomes more than three orders of magnitude larger computational cost for HEA than this work. It implies that for training the machine learning algorithm to predict phases of the unknown HEA, obtaining HEA data in vast compositional space can be bypassed.

Figure 5
figure 5

Computational cost of generating a training dataset using DFT calculation. For the training set using raw features, we suppose that 80% of the training set is required from the original dataset. However, using expandable features, the cost for generating the training set of the binary alloy is only required because the training algorithm using expandable features can be directly applicable to the multi-element alloys.

We believe that this work can be practically applied to find new multi-element alloy by combining with further experiments, likewise the previous works based on machine learning49,57. Further experiment should be needed about the issue which can`t be solved in machine learning level for the lack of data. Combining with this work, new multi-element alloy such as the solid solution of HEA can be practically found by dealing with the issue such as segregation58 in the further experiment.


To conclude, we suggest a learning-based algorithm to predict structural phase of multi-element alloy (binary alloy, ternary alloy and HEA) from binary alloy dataset. In this approach, we transformed the raw features {\(n_{d}\)(N), \(\sigma_{d}\)(N)} to accurate and expandable features {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)}. By employing the {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\), δ} features, it shows 80.56% accuracy for the experimentally reported multi-element alloys, which shows the practicality of the algorithm. These expandable features enable to obtain comparable accuracy without using the multi-element alloys in the training data. Furthermore, it only requires at least three orders of magnitude smaller computational cost for HEA than generating a training set with raw features.

We suggest that this work can be used to find new multi-element alloy such as the solid solution HEA with further experiment. In further work, we expect the approach can be expanded to find unknown solid solution phase HEA by screening multi-phase and intermetallic phase with training result from binary alloy data, which guarantees small computational cost for the training set.


Dataset for training and test

We make a binary alloy dataset for the training and utilize this for inferencing multi-element alloys. The binary alloy dataset is shown in Fig. 2. The generated dataset consists of 1876 kinds of disordered binary alloys, as indicated below:

$$M_{x}^{{\left( 1 \right)}} M_{{1 - x}}^{{\left( 2 \right)}} (M^{{\left( 1 \right)}} = {\text{ Mn}},{\text{ Fe}},{\text{ Co}},{\text{ Ni}};M^{{\left( 2 \right)}}:{\text{transition}}\;{\text{metal}};\;x = {\mkern 1mu} 0.05,{\mkern 1mu} 0.1,{\mkern 1mu} 0.15 \ldots {\mkern 1mu} 0.95)$$

To prevent the overfitting \(\sigma_{d} = 0\) region, we restricted the binary alloy training data by locating magnetic center Mn, Fe, Co, Ni in \(M^{\left( 1 \right)}\) site. Since d electron bandwidth is broad to stabilize Madelung energy, it prefers closed packed structure such as bcc, fcc, and hcp. Therefore, we only consider bcc, fcc, and hcp structures of each composition34. Although the alloy can have additional intermetallic phases, the classification of these simple phases is still important. In HEA, the fcc phase HEA has high ductility, and the bcc phase HEA has high strength 10,35, and the phases are dominant in both binary alloy and HEA when the atomic size difference (δ) is small (binary alloy: figure S2, HEA36,37). It implies that the classification of the simple lattice structures is valid in some compositional spaces. Therefore, in this work, we focused on our interest in these simple phases. Based on the structural phase of a binary alloy, including intermetallic phases (figure S2), we`ll extend to classify solid solution phases and intermetallic phases of multi-element alloy such as HEA in further work.

In addition, since we used the all-d-metal binary alloy data as the training set, we restricted the test set of the multi-element alloy as all-d-metal alloy such as CoCrFeMnNi, which is still located in vast compositional space and practically applicable38,39.

For generating the dataset, DFT calculation in AKAI-KKR-CPA40 code was applied. For the calculation, Korringa-Kohn-Rostoker (KKR)41 method is implemented with Coherent-Potential-Approximation (CPA). CPA method effectively consider disordered random alloy such as \(M_{x}^{\left( 1 \right)} M_{1 - x}^{\left( 2 \right)}\) by considering one lattice site with the average concentration of \(M^{\left( 1 \right)}\) and \(M^{\left( 2 \right)}\). To find lattice parameter at the ground state, we calculated the total energy of \(M_{x}^{\left( 1 \right)} M_{1 - x}^{\left( 2 \right)}\) with various volumes. The electron exchange–correlation potential is considered with the generalized gradient approximation, Perdew–Wang functional, (GGA91)42. Spin orbit coupling (SOC) was considered when 5d transition metal in the binary alloy. The structural phase calculation from AKAI-KKR-CPA showed consistency with 165 kinds of compositions among the 212 compositions from the experiment. For predicting the structural phase of multi-element alloy in experimental data, we make the multi-element dataset with 611 binary alloys and 106 ternary alloys from NIMS material database43, and 259 HEA data23,25,44 was used. Since our attention is restricted to the d valence element, we excluded the alloy with p, s, or f valence elements in multi-element alloy.

Main feature of this work

With {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} feature, we additionally choose features which determine stability of HEA as below equations: \(S_{c} = - R\mathop \sum \limits_{i = 1}^{N} c_{i} lnc_{i}\)

$${\varvec{\varDelta}}H = 4\mathop \sum \limits_{i = 1,i \ne j}^{N}{\varvec{\varDelta}}H_{ij}^{liq} c_{i} c_{j}$$
$${\text{VEC}} = \mathop \sum \limits_{i = 1}^{N} c_{i} {\text{VEC}}_{i}$$
$$\delta = 100\% \sqrt {\mathop \sum \limits_{i = 1}^{N} c_{i} \left( {1 - \frac{{r_{i} }}{{\tilde{r}_{i} }}} \right)^{2} }$$
$$\chi_{d} = \sqrt {\mathop \sum \limits_{i = 1}^{N} c_{i} \left( {\chi_{i} - \tilde{\chi }_{i} } \right)^{2} }$$

Configurational entropy (Sc), mixing enthalpy (ΔH), valence electron concentration (VEC), atomic size difference (δ) and electronegativity difference (\(\chi\) d) are used to classify the structural phase of HEA to bcc, fcc, and non-single phases22. N is the number of elements in the alloy, and \(c_{i}\) is molar fraction of element i. \({\varvec{\varDelta}}H_{ij}^{liq}\) is the mixing enthalpy of element i and j of binary liquid alloy from Miedema`s theory48. VEC is evaluated by averaging of VEC of element i. Among the features of HEA, VEC is excluded in this work for its similarity with \(n_{d}^{ex}\). ΔH, is also excluded for its required improvement60. Then, the feature set become {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\), δ, Sc, \(\chi\) d}. To reduce possible configuration of the feature set, we choose two main features among {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\), δ, Sc, \(\chi\) d}. In Figure S1, we evaluated the accuracy of the test set with various paired features, and {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} shows the best accuracy, 0.8346. Therefore, we use {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\)} as two main features and add additional features among the remaining feature set, {δ, Sc, \(\chi\) d}.

Algorithm of this work

Figure 6 describes the detailed process of how raw features in N-element alloy can be transformed into expandable features. From the multi-element alloys (\(M_{{C_{1} }}^{1} M_{{C_{2} }}^{2} \cdots\) \(M_{{C_{N} }}^{N}\)), the raw features ({\(n_{d}^{{M^{N} }}\), \(C_{N} ,\) \(\delta\)}, {\(\sigma_{d}^{{M^{N} }}\), \(C_{N} ,\) \(\delta\)}) can be obtained from their compositional information. To obtain the transformed features ({\(n_{d}^{{tr, M^{N} }} , \sigma_{d}^{{tr,M^{N} }}\)}), the regression ensemble tree is used. {\(C_{N} ,\) \(\delta\)} features are used as decision rules in the ensemble tree. All parameters such as nodes and depth in the ensemble tree are optimized by training the ensemble tree with the binary alloy data. For training the ensemble tree, {\(n_{d}^{{M^{N} }}\)} and {\(\sigma_{d}^{{M^{N} }}\)} of the calculated 1862 binary alloy is used as {\(n_{d}^{{tr, M^{N} }}\)} and {\(\sigma_{d}^{{tr,M^{N} }}\)}. From the trained ensemble tree, transformed features in each tree {\(n_{d,k}^{{tr, M^{N} }}\)} are obtained, and their averaged value is used as the transformed features of the multi-element alloy. Then, by weighted average pooling when {\(C_{N}\)} used as a weight, 1xN array of the transformed features are reduced to a scalar, expandable features (\(n_{d}^{ex} , \sigma_{d}^{ex}\)) as follow:

$$n_{d}^{ex} = \frac{{\sum C_{i} n_{d}^{tr,M\left( i \right)} }}{{\sum C_{i} }}$$
Figure 6
figure 6

Details of the feature transformation process in this work. Through the ensemble regression tree and average pooling, raw features in the N-element alloy are transformed into expandable features. With {\(\sigma_{d}^{{M^{N} }}\), \(C_{N} ,\) \(\delta\)} features, \(\sigma_{d}^{ex}\) can be obtained in the same way.

For classification of the phase of the multi-element, we utilize the support vector machine (SVM) algorithm with error-correcting output coding (ECOC)45 as implemented in MATLAB46. Three structural phases, bcc, fcc, and hcp, were used as classes, and all hyper-parameters in the SVM are optimized to minimize the classification error. Various subsets of the feature set {\(n_{d}^{ex}\), \(\sigma_{d}^{ex}\), δ, Sc, \(\chi\) d} used as input in the algorithm. The bcc, fcc, and hcp phases are represented using integer encoding. To obtain an unbiased prediction error of the classification model, we perform fivefold cross validation47. To cope with the nonlinearity using SVM, we used Gaussian kernel function K with support vector x and kernel scale G, \(K{ }\sim{ }e^{{ - \left| {x - \overline{x}} \right|/G}}\).