Cluster-formula-embedded machine learning for design of multicomponent β-Ti alloys with low Young’s modulus

The present work formulated a materials design approach, a cluster-formula-embedded machine learning (ML) model, to search for body-centered-cubic (BCC) β-Ti alloys with low Young’s modulus (E) in the Ti–Mo–Nb–Zr–Sn–Ta system. The characteristic parameters, including the Mo equivalence and the cluster-formula approach, are implemented into the ML to ensure the accuracy of prediction, in which the former parameter represents the BCC-β structural stability, and the latter reflects the interactions among elements expressed with a composition formula. Both auxiliary gradient-boosting regression tree and genetic algorithm methods were adopted to deal with the optimization problem in the ML model. This cluster-formula-embedded ML can not only predict alloy property in the forward design, but also design and optimize alloy compositions with desired properties in multicomponent systems efficiently and accurately. By setting different objective functions, several new β-Ti alloys with either the lowest E (E = 48 GPa) or a specific E (E = 55 and 60 GPa) were predicted by ML and then validated by a series of experiments, including the microstructural characterization and mechanical measurements. It could be found that the experimentally obtained E of predicted alloys by ML could reach the desired objective E, which indicates that the cluster-formula-embedded ML model can make the prediction and optimization of composition and property more accurate, effective, and controllable.

INTRODUCTION β-Ti alloys with a body-centered-cubic (BCC) structure have attracted more attention due to their prominent properties (high strength, low Young's modulus (E), and good corrosion resistance), showing a great potential for biomedical applications 1,2 . However, the low Young's modulus and BCC-β structural stability are a pair of contradiction, which is difficult to be solved in simple alloy systems. Low-E β-Ti alloys with E = 40-65 GPa are generally formed in multicomponent systems, such as Ti-35Nb-5Ta-7Zr (E = 55 GPa) and Ti-24Nb-4Zr-7.9Sn (E = 42 GPa, the number before each element represents the weight percent, wt. %) [2][3][4][5][6][7] . The BCC-β structural stabilities of these alloys are always in the vicinity of the lower limit of β stabilization, which was often characterized by a Mo-equivalent parameter (Mo eq ) 8,9 . Besides Mo eq , the d-electron theory method 10,11 and valence electron concentration [12][13][14] are frequently used to design multicomponent low-E alloys. A cluster-formula approach is also applied to design alloy compositions in multicomponent systems, which considers the correlations among alloying solute elements and the base element in light of the chemical short-range orders (CSROs) of solid solutions and then gives chemical compositions intuitively [15][16][17] . Thus, a series of low-E β-Ti alloys have been achieved by the cluster formula of [(Mo,Sn) − (Ti,Zr) 14 ](Nb,Ta) 1-3 , in which the lowest E = 48 GPa could reach at the [(Mo 0.5 Sn 0.5 ) − (Ti 13 Zr 1 )]Nb 1 alloy (Ti 81.25 Zr 6.25 Nb 6.25 Mo 3.125 Sn 3.125 in atomic percent at. %) 18 . Moreover, these alloy compositions were restrained further by a modified Mo eq parameter, and it was found that the Mo eq values of these low-E alloys are very close to the lower limit of β stabilization (11.8 wt.% Mo) 19 . A higher Mo eq corresponds to a relatively high E, and a relatively lower Mo eq could not render alloys with a single BCC-β structure, which induces a high E due to the precipitation of second phases, such as ω and α′′ 20 .
Although the experimental trial-and-error approaches based on several physical-metallurgy-guided rules could design and obtain new materials with targeted properties, it could not consider issues comprehensively and efficiently 21,22 . Recently, artificial intelligence methods, especially the machine learning (ML), have attracted more attention since these methods could predict alloy properties and design new high-performance materials efficiently and accurately with the assistance of physical-metallurgy rules [23][24][25][26][27][28] . For instance, the ML model could establish a relationship between the input and output through training data, in which the predicted capability strongly depends on the size and quality of the database as well as the range and distribution of the input parameters. Since these databases generally contain small data, the predictions that rely on ML alone may not be accurate enough. Furthermore, complex interactions among alloying elements will also affect the predicted accuracy. Therefore, the ML method should be combined with some characteristic parameters in a given system, which could result in more effective predictions. Xue et al. carried out the ML surrogate model, together with characteristic parameters (such as itinerant electron concentration e/a, modulus mismatch η, and work function w, etc.), to search for high-entropy alloys with high microhardness in Al-Co-Cr-Cu-Fe-Ni system, in which the predicted results are well consistent with the experiments since these selected characteristic parameters are closely related to the desired property 29 . Similarly, the combination of structural and compositional features (cohesive energy, atomic radius, and electronegativity) with the ML could well predict the elastic properties of Al-Co-Cr-Fe-Ni high-entropy alloys 30 . Xu et al. embedded the physical-metallurgy parameters, the volume fraction, and driving force of second-phase precipitates that represent the microstructural features, into the ML model to design advanced ultrahighstrength stainless steels successfully 31 . The optimal prototype alloy predicted by this model was well demonstrated by experiments, in which an excellent agreement could be obtained for the predicted optimal parameter settings and the final mechanical property (microhardness). In fact, the introduction of any characteristic parameters in the ML process aims mainly at constructing a perfect target-oriented loop of the input (alloy compositions) and output (properties) for an accurate and efficient prediction and design.
It is noted that any composition formula has not been considered in the ML method until now. Therefore, the present work will embed a cluster-formula approach into the ML model in multicomponent low-E β-Ti alloys because this approach is closely related to alloy compositions, besides the Mo eq parameter. In the forward design from the input (alloy compositions, c i ) to output (Young's modulus, E), the auxiliary gradient-boosting regression tree (XGBoost) method 32 , random forest (RF) method 33 , and support vector regression (SVR) method 34 will be applied to establish the relationship between the composition and Young's modulus based on the existing experimental results in Ti-Mo-Nb-Zr-Sn-Ta systems, including binary, ternary, quaternary, and quinary low-E β-Ti alloys. The Mo eq parameter will be embedded into this process to constrain the BCC-β structural stability. Then, in the reverse design of an alloy composition according to a specific E value, the cluster-formula approach will also be embedded into the ML model to simplify the selection of alloy compositions since a given E value would correspond to many compositions. Moreover, the genetic algorithm (GA) will be adopted to deal with the nonlinear alloy optimization problem. Finally, the predicted alloys with the cluster-formula-embedded ML model will be validated through a series of experiments, including microstructural characterizations and mechanical measurements.

RESULTS
Characteristic parameters in low-E β-Ti alloys Among all the parameters to characterize the structural stability of the metastable β-Ti, the Mo equivalence (Mo eq ) is the most popular parameter in multicomponent low-E β-Ti alloys. It is expressed with the equation of Mo eq ¼ 1:0 Mo þ 1:25V þ 0:59W þ 0:33Nb þ 0:25Ta þ 1:93Fe þ 1:84Cr þ 1:50Cu þ 2:46Ni þ 2:67Co þ 2:26Mn þ 0:30Sn þ 0:31Zr þ 3:01Si À 1:47Al in wt:% ð Þ in which the coefficient before each alloying element was obtained by the slope ratio of the [β/ (α + β)] phase boundary of the binary Ti-M phase diagram to that of the Ti-Mo, representing the contribution of each element to the β structural stability 9,19 . The critical lower limit of β stabilization is determined as (Mo eq ) C = 11.8 wt.%, indicating that any alloy with a larger Mo eq value above 11.8 wt.% would exhibit a single BCC β-Ti structure 9,19 . It is known that in the vicinity of the lower limit of β stabilization, some second phases of ω, α′′, and α′ could be precipitated inevitably since they are sensitive to the preparation processing and heat treatments, resulting in an increase of E 20 . Experimentally, only when the Mo eq value is larger than 13.0 wt.%, these second phases could be avoided and the E could reach the minimum 9,35 . It is noted that the Mo eq parameter ignores the interactions among elements, which is an important factor because the excessive addition of elements with strong interactions could result in the formation of intermetallic compounds in solid-solution alloys 9,20 . Furthermore, the Mo eq describes the β structural stability of Ti alloys alone, and could not correlate directly with the properties, such as the Young's modulus. Actually, the interactions among alloying elements with the base element could be incarnated into the CSROs of solid solutions, which is the significant microstructural feature [15][16][17] . Based on CSROs, we proposed a 'cluster-plus-glue-atom' structural model to describe the local atomic distribution of alloying solute elements. In this model, the cluster is the nearest-neighbor polyhedron centered by a solute atom that has strong interaction (characterized by a large negative enthalpy of mixing (ΔH) 36 ) with the base solvent atoms to represent the strongest CSRO. Some other solute atoms having weak interactions with the base (a positive ΔH) are certainly required to fill the space among clusters for the balance of atomic-packing density, named as glue atoms 37 . Thus, an uniform composition formula [cluster](glue atom) m (m being the glue-atom number) can be abstracted from this model, which is called the cluster-formula approach 18,19 . Specifically, in the Ti-Mo-Nb-Zr-Sn-Ta multicomponent system, Mo and Sn occupy preferentially the cluster center due to the relatively strong interactions with Ti (ΔH Ti-Mo = −4 kJ/mol, ΔH Ti-Sn = −21 kJ/ mol), while Nb and Ta tend to occupy the glue sites owing to their weak interactions with Ti (ΔH Ti-Nb = +2 kJ/mol, ΔH Ti-Ta = +1 kJ/ mol). For the Zr element, it can enter into any site to replace the Ti since they are in the same group (ΔH Ti-Zr = 0 kJ/mol). Thus, an ideal cluster formula could be expressed as [(Mo,Sn) − (Ti,Zr) 14 ] (Nb,Ta) m , which contains crucial information on the alloy chemistry, i.e., chemical compositions and chemical interactions 18 . When considering the β-stability and low-E simultaneously, this formula should be classified into two groups: in which the total amount (x + y + z 3 ) of Nb, Ta, and Zr (i.e., the glue-atom number, m) varies from m = 1 to m = 3 according to the Mo content (u) since Mo is a strong β-stabilized element; the amount of Mo and Sn in the cluster center is u + v = 1 (0 ≤ v < 1); the amount of the base Ti and partial Zr in the cluster shell is defined as w + z 1 = 14 (0 ≤ z 1 ≤ 1); and the total amount of Zr is the sum of that in the cluster shell and glue site, being z = z 1 + z 3 . It is noted that the glue atoms contain the Zr element occasionally for the consistence with the existing experimental results of Zr substitution for Nb.
in which the glue-atom number (x + y) of Nb and Ta should be increased because Nb and Ta are weak β-stabilizers when the Mo element is not added; the amount of Sn and partial Zr in the cluster center is ; the amount of the base Ti and partial Zr in the cluster shell is w + z 1 = 14 (0 ≤ z 1 ≤ 1); the total Zr content is z = z 1 + z 2 . These concrete composition formulas will be then embedded into the ML to constrain complex compositions during the reverse design process in the present work. Machine-learning model A ML model will be trained based on a dataset containing Young's modulus (E) and alloy composition (c i ) for each element to build the relationship of E = f(c i ) in the present work, as seen in Fig. 1. Thus, the E value of any alloy could be predicted when the composition is known in the forward design (Loop I in Fig. 1). On the other side, the prediction of an alloy composition could also be realized if the E value is given in the reverse design (Loop II in Fig. 1). In order to make the prediction more accurate, some characteristic parameters representing the low-E and β structural F. Yang et al. stability should be set into the ML process to constrain the relationship of E = f(c i ), since it is a function of multiple variables for a single target. For the dataset, it is constituted of 82 existing low-E β-Ti alloy samples 3,5,7,[9][10][11]18,19, , including binary, ternary, quaternary, and quinary alloys in Ti-Mo-Nb-Zr-Sn-Ta systems, which are listed in Supplementary Table 1. Sample data were screened strictly with the Mo eq parameter to avoid the effects of second-phase precipitation and processing conditions on E. That is to say, alloys containing only a minor amount of α″ and ω, which was checked scarcely, were permitted in the dataset since the obvious phase precipitation could enhance the E. Besides, all these alloys were prepared by rapid-quenching processing.
In the forward design of ML (Loop I in Fig.1), we employed three representative ML models, XGBoost implemented in the XGBoost machine-learning libraries 32 , RF, and SVR with a radial basis function kernel implemented in the scikit-learn package 34 . During the ML process, the characteristic Mo eq parameter was also put into the dataset to constrain alloy compositions further because it represents the β structural stabilities of Ti alloys. Then, this dataset was split into a training set and a testing set by using a split ratio of 9:1 for the generalization ability of the models, i.e., the training set was taken from 90% of the dataset. We used both the root mean-squares error (RMSE) and the coefficient of determination (R 2 ) as the criterion for the prediction accuracy. The RMSE and R 2 are expressed with the equations of and respectively, where n is the number of samples, y is the mean Young's modulus of samples, y i and f(x i ) are the experimental Young's modulus and the predicted value by ML, respectively. Since the dataset capacity of 82 samples is somewhat limited in the present work, it is necessary to employ the multiple hold-out method to calculate the RMSE and R 2 of ML models for the guarantee of accuracy 31 . This procedure was repeated for 500 times in these three models by partitioning the training set and testing set randomly with a ratio of 9/1, and then the RMSE and R 2 values were counted, as shown in Fig. 2. The mean values of RMSE of the training set are 1.3 ± 0.6, 3.8 ± 0.3, and 3.9 ± 0.4 GPa for XGBoost, RF, and SVR models, respectively; and these values of the testing set are 4.6 ± 0.7, 4.9 ± 0.9, and 5.2 ± 0.9 GPa for XGBoost, RF, and SVR, respectively. Besides, the mean values of R 2 of the training set are 98 ± 1, 89 ± 1, and 88 ± 2% for XGBoost, RF, and SVR, respectively; and these values of the testing set are 87 ± 2, 84 ± 4, and 73 ± 3%, respectively. Thereof, both the predicted accuracy and generalized ability of the XGBoost model are better than the RF and SVR models for the prediction of alloy property. Figure 3 gives the optimal results for both the training and testing sets trained by the XGBoost, RF, and SVR models, respectively, in which the blue and red points represent the training set and the testing set. The RMSE values of the training and testing sets for XGBoost are 1.5 and 3.2 GPa, respectively, while the RMSE values of the training and testing sets for RF and SVR are 3.8 and 4.1 GPa, 3.9 and 4.7 GPa, respectively. Meanwhile, the R 2 values of the training and testing sets for XGBoost are 98 and 92%, while the corresponding R 2 values are 90 and 88% for RF, and 89 and 79% for SVR, respectively. It could be found that most of the predicted values by the XGBoost model in both the training and testing sets are more consistent with the experimental results than the RF and SVR predictions, indicating that the XGBoost prediction is more accurate than the RF and SVR.
Once the relationship between the alloy composition and Young's modulus was established by the XGBoost model in Loop I, the Young's modulus of any given composition alloy could be well predicted. However, in the reverse design of ML, a given E value  might correspond to dozens or hundreds of multicomponent compositions, which would certainly lead to a complexity in the prediction of low-E β-Ti alloys. In order to solve this problem, we implemented the cluster-formula approach into the reverse design (Loop II in Fig. 1) to constrain the correlations among alloying elements with the aid of their interactions with the base Ti, besides the Mo eq parameter. Thus, a given E value will correspond to very few compositions alone in a given system, which could reduce the difficulty of composition exploration drastically. Thereof, this problem can be converted into an optimization problem by virtue of the cluster-composition formula: Minimize Eðx; y; z; u; v; wÞ subject to CFðx; y; z; u; v; wÞ 11:0 Mo eq ðx; y; z; u; v; wÞ 14:0 Or Minimize jEðx; y; z; u; v; wÞ À E S j subject to CFðx; y; z; u; v; wÞ 11:0 Mo eq ðx; y; z; u; v; wÞ 14:0 ; where x, y, z, u, v, and w represent the atom numbers (or the amounts) of Nb, Ta, Zr, Mo, Sn, and Ti in a given clustercomposition formula, respectively; E(x, y, z, u, v, w) is the predicted Young's modulus trained by the XGBoost model; CF(x, y, z, u, v, w) is the specific cluster-composition formula given in abovementioned Eqs. (2) and (3). Here, we took the strengthen elitist selection of the GA 60-63 to solve this optimization problem. For the optimization problem given in Eq. (6), we aim at exploring of β-Ti alloys with the lowest Young's modulus (Min E). While for a specific E S value, it is necessary to minimize the difference between E(x, y, z, u, v, w) and E S , i.e., Min |E − E S |, as described in the optimization problem given in Eq. (7). Thus, an integrated ML process would be established, as presented in Fig. 1. According to it, the predicted new composition alloys will be verified by a series of experiments, then they will be put into the original database for a much more accurate prediction.

Experimental verification
Firstly, the XGBoost model is used to predict the Young's modulus E of any given alloy in Ti-Mo-Nb-Zr-Sn-Ta system in the forward design of ML. We selected two alloys randomly only with the Mo eq constraint, Ti 84.5 Mo 2.0 Nb 10.0 Sn 1.0 Ta 2.5 (I-1, at. %) and Ti 82.5 Mo 3.0 Nb 10.5 Sn 1.0 Ta 3.0 (I-2, at. %). The E values of these two alloys predicted by ML is E = 67 GPa for I-1 and E = 65 GPa for I-2 alloy, respectively (as seen in Table 1). Then, a series of experimental measurements were carried out to verify this prediction.
The X-ray diffractometer (XRD) results in Fig. 4a show that the I-2 alloy with Mo eq = 13.30 wt.% exhibits a single BCC-β structure, while the diffraction peaks of the α′′ martensite with an orthorhombic structure appear in I-1 alloy, besides the BCC-β. It is resulted from that the Mo eq = 11.28 wt.% of I-1 is relatively lower than the critical lower limit ((Mo eq ) C = 11.8 wt.%) for the β stabilization. The tensile tests of these alloys were then measured, and the engineering tensile stress-strain curves were given in Fig. 4b, from which the Young's modulus could be calculated. The experimental E values of I-1 and I-2 alloys are E = 76 ± 3 and E = 73 ± 2 GPa, respectively, which is somewhat different from the corresponding predicted values (67 and 65 GPa). The difference between the predicted and experimental E values might be ascribed to that alloy samples in this quinary system were not included in the dataset due to no existing results, which would give rise to an uncertainty and then reduce the prediction accuracy. In addition, the higher measured E value of I-1 alloy is also resulted from the obvious precipitation of α′′ martensite that could increase the E slightly. Thereof, the predication of E is not accurate enough when the Mo eq is considered in ML alone. There Fig. 3 The experimental E values vs. the predicted values by the optimal model. a XGBoost. b RF. c SVR. The blue points and red points represent the training set and the testing set, respectively. must exist some inevitable uncertainties in the ML process, which needs to consider more other parameters for a further constraint of alloy compositions, such as the factor to reflect complex interactions among elements in multicomponent systems. Thereof, we add the cluster formulas as a constraint in the reverse design (Loop II) to explore alloy composition according to a given E value.
During the process of the GA optimization in the reverse design (Loop II), we still use the XGBoost model in the forward design and take the Mo eq and the cluster formulas as constraints simultaneously. A wide composition range of each element was first set, in which a population containing 3000000 individuals (i.e., alloy compositions) was generated randomly. Then, the composition alloys satisfying both the cluster formula and the Mo eq parameters were obtained with the XGBoost model, and the fitness of each alloy composition was calculated according to the objective function of Min E for the optimization problem given in Eq. (6) or Min |E − E S | for the optimization problem given in Eq. (7). Thus, the next generation of population was produced by the genetic operations (including selection, crossover, and mutation) in light of the survival of the fittest principle. Finally, the GA would terminate after 100 iterations, and the optimal alloy composition could be achieved. Figure 5 gives the evolution histories of the GA with the objective functions of Min E (Fig. 5a-c) and Min |E − E S | (Fig. 5d, e), in which the red line represents the objective function while the black line represents the mean value of E or |E − E S | for each generation. It was found that these two values will be converged after about 50 iterations, indicating a high efficiency of   Figure 6a gives the XRD patterns of these five predicted alloys by the reverse design in ML, in which the II-2, II-4, and II-5 alloys possess a single BCC-β structure. While for II-1 and II-3 alloys, several weak diffraction peaks of the α′′ martensite appear on the β matrix. Although the Mo eq values of these alloys surpass the critical lower limit ((Mo eq ) C = 11.8 wt.%) of β stabilization, they are still in the vicinity of this limit, at which the β stability is susceptible to external conditions. Figure 7 gives the tensile engineering stress-strain curves of these alloys, from which the E values were measured, being E = 49 ± 1, 46 ± 1, 47 ± 2, 58 ± 2, and 56 ± 2 GPa, respectively. Both the experimental and predicted E values of these designed alloys are all listed in Table 1. It could be found that the predicted E is consistent well with the experimental E for each alloy, which indicates that the clusterformula-embedded ML can predict the alloy property (forward design) or alloy composition (reverse design) precisely. In addition, the Mo eq values of these alloys, as well as alloy compositions (both at. % and wt. %) are also given in Table 1.
We also did the XRD analysis on the alloy samples after tension to check the variation of the crystalline structure induced by the applied load, which is shown in Fig. 6b. It is found that the diffraction peaks of the stress-induced α′′ martensite appear obviously in the β matrix of each alloy, which is also validated by the double-yielding phenomenon in the engineering stress-stain curves shown in Fig. 7. It is ascribed to the stress-induced α′′ martensite transformation when the β structural stability is relatively lower, which is consistent with the existing results 20  , the SAED pattern shows that the II-2 alloy exhibits a single β-Ti structure without any other diffraction spots of either α′′ or ω phase, as presented in Fig. 8a. In fact, the Young's moduli of α′′ and ω phases are higher than that of β-Ti, especially the ω phase 20 . While the diffraction spots of the α′′ martensite appear in II-3 alloy, as found in Fig. 9a, in which both the bright-field and dark-field images demonstrate that a small amount of α′′ martensitic plates are distributed in the β-Ti matrix. It is consistent with the XRD result, which might be resulted from the relatively lower Mo eq value (Mo eq = 11.95 wt. %) in II-3 than that (Mo eq = 13.20 wt. %) in II-2. Nevertheless, there does not exist the ω phase in II-3 alloy yet, in which the appearance of ω could enhance the Young's modulus of β-Ti alloys drastically. After tension, the stress-induced α′′ martensitic plates (including coarse and fine plates) appear in the β matrix in the II-2 alloy, as observed in the dark-field images (Fig. 8b). By comparison, a much larger amount of α′′ martensitic plates appear in the tensioned II-3 alloy,   as seen in Fig. 9b, which is closely related to the relatively lower β stability in the latter. Actually, it could be also confirmed by the fact that the double-yielding platform of II-3 is more obvious than that of II-2, indicating that a much higher amount of α′′ martensite transformed from the β matrix during the tension process.
For the designed [(Mo 0.5 Sn 0.5 ) − (Ti 13 Zr 1 )](Nb 0.5 Zr 0.5 ) (II-5) alloy with a given Young's modulus of E S = 60 GPa, the TEM characterization (Fig. 10a) indicates that this alloys exhibit a single BCC β-Ti structure owing to the relatively higher Mo eq value (Mo eq = 13.53 wt. %). After tension, there still exists a small amount of the stressinduced α′′ martensite (Fig. 10b), similar to those in II-2 and II-3 alloys. All of these results certify again that the BCC-β phase with lower stability would like to be accompanied by the protogenous α′′ or stress-induced α′′ martensite, which could render β-Ti alloys with lower Young's moduli.

DISCUSSION
In the forward design of Loop I in ML, the relationship between the alloy composition (c i ) and alloy property (Young's modulus, E) was established by using the XGBoost model, in which the characteristic Mo eq parameter for the representation the β structural stability of alloys is implemented to further restrict alloy compositions. Based on it, two alloys of I-1 and I-2 in Ti-Mo-Nb-Sn-Ta system were designed, but the experimental E values of these two alloys are somewhat different from the predicted ones, as seen in Table 1. This trend might be caused by the fact that alloy samples in this quinary system are not included in the ML database due to no existing experimental results, which would give rise to an uncertainty, and then reduce the prediction accuracy. More importantly, in the reverse design of Loop I, a given specific E value could predict many compositions because it is a multivariable function for a single target. For instance, if we set the specific Young's modulus as E S = 55 GPa only with the Mo eq constraint, we can obtain 85 compositions, in which the compositions vary with a step size of 1.0 at. %, as seen in Supplementary Table 2. It is difficult to validate all these predicted composition alloys by experiments, which could not manifest the superiority and efficiency of ML. In addition, there might exist some inevitable uncertainties that could not be reflected by the Mo eq alone, which has been demonstrated by the present experiments for I-1 and I-2 alloys.
Then, we embedded the cluster-formula approach into the reverse design of ML (Loop II), which is a direct constraint to the composition. This approach considers the interactions among alloying elements with the base Ti, as well as among themselves.
Our previous experimental results have demonstrated the validity of the cluster-formula approach, i.e., the β-Ti alloys with cluster formulas generally possess relatively lower Young's moduli, as evidenced by the quinary β-Ti [(Mo 0.5 Sn 0.5 ) − (Ti 13 Zr 1 )](Nb 1 ) alloy  with E = 48 GPa 20 . Thus, for a specific E value, there exist a limited number of alloy commotions or just only one composition in a given multicomponent system. For example, there have only two compositions (II-1, II-2) in the Ti-Mo-Nb-Zr-Sn-Ta hexacomponent system and one composition (II-3) in the Ti-Nb-Zr-Sn-Ta quinary system to correspond to the Min E (Optimization problem given in Eq. (6)). More significantly, the experimental E values of these three composition alloys are in a great consistence with the predicted ones (Table 1). Furthermore, these two complex systems are not included in the ML database since no existing experimental results have been reported. Similar tendency also appears in II-4 and II-5 alloys for a specific E value. Therefore, by embedding the cluster formula and Mo eq into the ML, we can optimize alloy compositions orientated by targets more effectively and precisely than the previous random research. In fact, when selecting alloy compositions in the forward design of ML, the compositions should satisfy the cluster-formula approach for a better prediction of E. It means that implementation of characteristic parameters that represent the features of a given system is crucial to establish the ML model, which is contributed to the prediction and design of the alloy composition and property with a much higher accuracy.
In our previous alloys, we applied the cluster-formula approach into other compositionally complex systems to explore the relationship between the alloy composition and properties 37,64 . For instance, it is found that the compositions of Ni-base single crystal superalloys with prominent creep resistance satisfy a uniform cluster formula of ½Al À Ni 12 ðAl 1:5 Cr 1:5 Þ, in which all the alloying elements are classified into three groups, Al series (Al), Cr series (Cr), and Ni series (Ni) 65 . Since the creep rupture lifetime of these superalloys could be correlated to the composition in light of the cluster-formula approach, the cluster-formula-embedded ML model would predict the rupture lifetime of alloys and design new alloy compositions with high performance.
To summarize, a property-orientated alloy-design strategy combining the ML and characteristic parameters has been proposed to search for the low-E β-Ti alloys in the Ti-Mo-Nb-Zr-Sn-Ta system. The characteristic parameters are Mo eq and cluster-composition formula, respectively. The XGBoost could establish the relationship between the alloy composition and Young's modulus E well in the forward design of ML with the guidance of Mo eq . However, the experimental E value is indeed somewhat different from the predicted value of a given alloy when such alloy samples are not included into the ML database. On this basis, the cluster-formula approach was embedded in the reverse design to design new alloys with the desired E. ](Nb 2.5 Ta 0.5 ), respectively, even though such kind of alloy samples are not included in the ML database. Therefore, for any specific E, this cluster-formulaembedded ML model could predict alloy composition precisely and efficiently in multicomponent systems. This design framework would be expected to predict other high-performance alloys, such as high-entropy alloys and Ni-base superalloys, in which the cluster-formula approach is of importance due to their chemical complexity.

Modeling method
All the machine-learning models were created using the Scikit-learn (including RF and SVR) and XGBoost machine-learning libraries. The reverse design was realized by the GA using geatpy library 63 , Scikit-learn, XGBoost, and geatpy are available under open-source licenses.

Experimental procedures
The predicted alloys by ML were prepared by copper-mold suction cast processing, in which the experimental details are same as those in ref. 19 . Crystalline structures of alloys were first analyzed by the BRUKER D8 XRD with a Cu Kα radiation (λ = 0.15406 nm). The microstructures were then verified with the JEM2100F FEG scanning transmission electron microscopy, in which the samples were prepared by twin-jet electro-polishing in a solution of 6% HClO 4 + 59% CH 3 OH + 35% CH 3 (CH 2 ) 3 OH (volume fraction) at about 243 K. Tensile tests were executed on an UTM5504-G Material Test System (MTS) with a strain gage, where the tensile rate is set as 0.5 mm/min. The gauge size of rod tensile samples is 3 × 25 mm (diameter × length), and three samples for each composition alloy were measured to assure the reliabilities of tensile data.

DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.