Introduction

Novel materials have a significant impact on our daily lives and modern industries such as the aerospace, biomedical, and energy sectors1,2,3,4,5. However, the conventional research and design (R&D) of novel materials utilizes a “trial-and-error" approach, which is challenging due to the complexity and diversity of materials. This approach incurs high costs, and the commercial implementation of novel materials generally takes decades. Fortunately, owing to the rapid advancements in artificial intelligence (AI) and machine learning (ML), materials R&D has evolved to a state-of-the-art data-driven paradigm6,7,8. The data-driven paradigm is expected to diminish the cost and duration of materials R&D by half, and thus expedite the materials development cycle from decades to a few years.

The key concept of the date-driven paradigm is the integration of AI techniques and materials science. Various AI techniques, especially ML algorithms, have been employed to uncover Composition-Process-Structure-Property (CPSP) relationships in materials science9,10,11,12,13. ML models trained on extensive data can aid in the discovery of innovative materials such as organic compounds14,15, solar cells16,17, alloys18,19,20, and perovskites21,22. For example, Rao et al. proposed an active learning strategy to accelerate the exploration of high-entropy Invar alloys. With the assistance of ML, they characterized two novel Invar alloys from a plethora of potential combinations23. Raccuglia et al. utilized successful and unsuccessful historical reactions to train their ML model, discovering critical factors that affect chemical reactions and synthesizing new organic compounds based on this trained model24.

Within the material and physical computation community, several notable AI platforms have emerged. Materials Cloud25 offers an ensemble material simulation platform with a primary focus on ab initio computation to emulate material mechanisms. Similarly, the Materials Project26 stands out as a quantum computation platform providing a valuable inorganic materials database and associated quantum properties, leveraging advanced ML acceleration methods such as ML-based potential fields: M3GNet27, MEGNet28, etc. AFLOW-ML29 and JARVIS-ML30 contribute significantly by offering crystal properties prediction tools based on DFT calculations or ML surrogate models, covering formation energies, exfoliation energies, bandgaps, and magnetic moments, among other attributes. MatMiner31 and Magpie32, popular ML tools in materials research, host various downstream ML libraries that greatly benefit the material community. In addition, the AI/materials community has also developed general AI toolkits and platforms that have broad applications in materials science33,34,35,36. For instance, a simple but powerful tool for non-data researchers has been developed by Peng37, which possesses a command-line interface and a web-based GUI for classification and regression tasks. Another representative is AlphaMat, an emerging materials informatics hub developed by Wang38. This platform focuses on data preprocessing and downstream ML models, offering a streamlined workflow for materials designers. AI platforms like AlphaMat can automate and expedite the construction of accurate ML models for different materials, providing AI support for materials scientists.

Despite these advancements, utilizing the aforementioned general-purpose AI toolkits and platforms requires programming skills, which might be a barrier for material designers lacking programming experience. Additionally, existing platforms emphasize model construction and overlook inverse materials design. There is significant scope for enriching AI platforms in material R&D. Therefore, we have created the MLMD (machine learning for materials design) platform, providing a friendly interface and more comprehensive AI tools for material designers. MLMD distinguishes itself with an entirely code-free interface for seamlessly executing property inference and inverse material design. Crucially, MLMD not only considers inherent physical structure properties but also explores the impact of material defects and processing/testing technologies on material performance. It includes a range of ML algorithms for constructing models besides data analysis and descriptor refactoring, enabling end-to-end (data to new materials) novel materials discovery. MLMD can automatically construct classification or regression models based on user-uploaded data, with a strong emphasis on privacy protection (no data will be stored). Users can simply select an ML algorithm, and the associated hyper-parameters will be automatically tuned. In addition to model construction, MLMD also incorporates model inference20,39, surrogate optimization40,41,42, and active learning43,44,45,46,47,48 techniques for materials inverse design. By utilizing model inference or surrogate optimization, users can screen out novel materials from a virtual search space based on the constructed models.

In addition, the MLMD framework has made significant strides in addressing the challenge of limited data availability. To address data scarcity in materials science, we develop a Bayesian toolkit for material design to integrate into active modules in MLMD. This feature incorporates nine utility functions that balance exploration and exploitation. Through the active learning module, MLMD enables the provision of novel advanced materials with single or multiobjective properties through an iterative design process. MLMD also integrates transfer learning into heuristic algorithms (TL-opt) to address small data problems in material design, demonstrating its applicability in both single and multiple objective material design, showcasing advantages in Al alloy design42.

The effectiveness and robustness of the MLMD platform in both model construction and inverse design have been demonstrated through various datasets in the work, including perovskites, steel, high-entropy alloy, et al. This framework adheres to the material genome concept, thereby emphasizing the emerging paradigm of material design. We firmly believe that the MLMD platform has the potential to greatly enhance the accessibility and utility of AI techniques for materials communities.

Nomenclature

   

LR49

Logic regression

MLPR49

Multi-layer perception regression

SVC49

Support vector classification

RFR49

Random forest regression

BTC49

Bagging tree classification

XGBR50

XGBoost regression

RFC49

Random forest classification

LOO

Leave one out

XGBC50

XGBoost classification

R2

Determination coefficient

CBC51

CatBoost classification

NSGA-II52

Non-dominated sorting genetic algorithm II

CV

Cross validation

GA53

Genetic algorithm

SVR49

Support vector regression

DE54

Differential evolution

KNNR49

K-nearest neighbor regression

PSO55

Particle swarm optimization

SA56

Simulated annealing

EI57

Expected improvement

AEI57

Augmented expected improvement

EQI57

Expected quantile improvement

REI57

Reinterpolation expected improvement

PES58

Predictive entropy search

POI59

Probability of improvement

LassoR49

Lasso regression

UCB60

Upper confidence bound

KG61

Knowledge gradient

EHVI62

Expected hypervolume improvement

LinearR49

Linear regression

PCA49

Principal component analysis

t-SEN49

t-distribution stochastic neighbor embedding

ABR49

AdaBoost regression

BR49

Bagging regression

CBR51

CatBoost regression

GPR49

Gaussian process regression

DTR49

Decision tree regression

GBR49

Gradient boosting regression

RidgeR49

Ridge regression

ABC49

AdaBoost classification

EIP57

Expected improvement with “plugin"

SHAP63

SHapley Additive exPlanations

GBC49

Gradient boosting classification

SMS-EMOA52

\({{{\mathscr{L}}}}\) metric selection evolutionary multiobjective optimization algorithm

Results

Overview and architecture

The primary objective of MLMD is to make ML programming free and empower materials scientists with an end-to-end approach to materials design. To train a prediction model on MLMD, users are required to upload a CSV-formatted data file containing feature matrix X and target variable Y. The feature matrix X includes information regarding material components and processes, and the target variable Y incorporates one or more material properties. MLMD was developed with six core modules, as depicted in Fig. 1, and seven functionalities:

  1. (1)

    Database. MLMD provides materials scientists with databases containing material data (e.g., polycrystalline ceramic, HEAs, ferroelectric perovskites) generated from experiments or collected from literatures. These databases are downloadable and serve various purposes, including serving as source domains for transfer learning. Additionally, MLMD offers outlier detection algorithms like DBSCAN64, IsolationForest65, LocalOutlierFactor66, and One Class SVM67 to identify data points that deviate significantly from the rest. Outlier detection can greatly enhance the generalization of ML models.

  2. (2)

    Data visualization. MLMD offers an initial data overview, including the distribution of features and targets, along with the statistics derived from the data.

  3. (3)

    Feature engineering. Material compositions and processes significantly influence the structure, properties, and performance of materials. These components are commonly used as the feature descriptors in ML, determining the performance limits of prediction models. Thus, MLMD integrates feature engineering, encompassing handling missing and duplicate values, assessing feature correlation, and ranking feature importance. Additionally, MLMD also provides transformation functions to transform the composition descriptors to atomic descriptors, such as atomic radius, band gap, and valences.

  4. (4)

    Quantitative CPSP relationships (QCPSP). The establishment of QCPSP in material through ML is fundamental to material design. MLMD supports nearly all widely utilized regression and classification algorithms, such as linear analysis, sparse kernel machine, probability model, neural network, transfer learning, and ensemble learning. The most suitable model can be selected for training on the data and making inference.

  5. (5)

    Surrogate optimization. Integrating predictive models into numerical optimization algorithms can accelerate the attainment of optimal material compositions, processes, and other relevant features that align with desired properties. Subsequently, the discovered advanced materials will undergo experimental verification.

  6. (6)

    Active learning. Achieving a high-accuracy prediction model is a challenge in materials science due to the limited data. Consequently, sampling-based material design strategies are provided within the active learning module in MLMD platform. Global optimization by Bayesian-based active learning has gained prominence for the ability to address data scarcity and reduce material discovery costs. Typically, this approach explores the design space using an optimal policy that balances exploitation and exploration to identify the global optimum. In active learning, a probability surrogate model, Gaussian process (GP), is constructed using initial input-output data obtained from costly experiments or simulations.

  7. (7)

    Interpretable ML. Achieving physical interpretability is a significant challenge and goal in material informatics. Interpretable ML can enhance materials scientists’ understanding of the CPSP relationships of materials. MLMD platform also provides the Shapley Additive Explanations (SHAP) method to facilitate model explanation.

Fig. 1: Overview and architecture of MLMD.
figure 1

a Data module, which encompasses material databases, data visualization, and feature engineering. b Regression module, which comprises a group of ML regression algorithms. These algorithms can be further utilized in the surrogate optimization module. c Classification module, which involves a group of ML classification algorithms. These algorithms can be further leveraged in the surrogate optimization module. d Surrogate optimization module, where the ML model is incorporated into numerical algorithms to accelerate materials design. e Active learning module, sampling methods based on Bayesian are provided to search the material composition space and discover novel materials, particularly under limited available data. f Other module, which provides advanced ML algorithms such as transfer learning, dimensionality reduction, and interpretable ML.

As illustrated in Fig. 2, there are three primary flowcharts for materials design within the MLMD platform. These include model inference, surrogate optimization, and active learning. The efficiency of model inference and surrogate optimization relies on the robustness of the prediction model (surrogate model), and the model performance is limited by available data. In surrogate optimization, the well-trained prediction model will be integrated into stochastic optimization algorithms to accelerate materials design. Active learning in MLMD employs a sampling strategy grounded in Bayesian principles. It balances exploration and exploitation in order to formulate an optimal material design strategy. The active learning module will recommend the next experiment via Bayesian global optimization under limited data. The recommended experiment can be conducted, and the new results of the experiment will validate the ML prediction and simultaneously are fed back to the dataset for the following cycle of iterations in the active learning loop. In the materials design flowcharts within MLMD platform, some useful tools were used, including Streamlit, Scikit-Learn49, Pymoo52, extreme gradient boosting decision tree (XGBoost)50, Scikit-Opt, and Bgolearn.

Fig. 2: Flowcharts of materials design in MLMD platform.
figure 2

a Model inference involves establishing an ML model, generating visual samples, executing model inference, and subsequently verifying the recommended new materials through experiment. b Surrogate optimization entails establishing ML models, incorporating feature physical constraints, selecting heuristic optimization algorithms, and subsequently verifying the recommended new materials through experiment. c Active learning involves utilizing GPR model, creating a virtual sample space, selecting a suitable utility function, and ultimately verifying the recommended new materials through experiment. Notably, the results from experiments on the new materials of the three methods can be iterated into the next loop for building more better ML model.

Classification module

Here, eight materials datasets labeled as C1-C3 and R1-R5 were used as case studies to showcase the reliability and effectiveness of four commonly used modules within our MLMD platform. Details of the datasets can be found in Table 1.

Table 1 Eight case material datasets

The proposed classification module aims to address classification issues in materials science. It merely necessitates uploading a dataset in the CSV format and the programming-free selection of an MLMD-implemented algorithm to complete model construction. Moreover, the user can straightforwardly adjust or auto-optimize hyper-parameters to further refine constructed MLMD classification models to enhance their accuracy (metric details are provided in Supplementary Note 1). The performances of six MLMD-implemented classification algorithms (LR, SVC, BTC, RFC, XGBC, and CBC), were accessed and compared with the baseline models across three distinct classification issues, described as follows. C1: Identify the crystalline structure of a polycrystalline ferroelectric ceramic, categorizing it as either a perovskite or a non-perovskite structure. C2: Categorize an alloy into one of three classes: crystalline alloy (CRA), ribbon metallic glass (RMG), or bulk metallic glass (BMG). C3: Discriminate between solid-solution HEAs and classify them as hexagonal close-packed (HCP), body-centered cubic (BCC), face-centered cubic (FCC), or mixed solid solution (MSS). The baseline models were SVC implemented in R75, RFC implemented in Java13, and RFC implemented in python76 for C1, C2 and C3, respectively.

As depicted in Fig. 3a, c, e, the default MLMD-implemented SVC, RFC, and XGBC models achieved a 10-fold CV accuracy exceeding 80% for all three issues. The results demonstrate that the default MLMD-implemented model can provide satisfactory classification accuracy without requiring any other operations. In addition, the optimization of hyper-parameters can significantly enhance the performance of all MLMD-implemented models. The MLMD platform also offers a user-friendly hyper-parameter tuning feature that can achieve improved models without requiring programming skills. In the three cases, the recommended MLMD-implemented models are the tuned XGBC model (CV-accuracy = 86.5%), tuned RFC model (CV-accuracy = 87.4%), and tuned RFC model (CV-accuracy = 92.6%) for C1, C2, and C3, respectively. The recommended MLMD-implemented model performed comparably to the baseline model for C2, and outperformed the baseline model for C1 and C3, indicating the robust classification capability of our platform. MLMD platform also provides a confusion matrix for each recommended model in the classification module, as illustrated in Fig. 3b, d, f. The confusion matrix is used to observe the performance of the classification model in each category, and is able to calculate the other classify performance metrics, such as precision and recall. (raw confusion matrix plots are provided in Supplementary Fig. 1). According to the CPSP relationship, the property of a material significantly relies on its microstructure. Therefore, identifying the microstructure based on composition and process is very important for material design. For instance, the BCC HEAs are much harder than FCC HEAs, and a HEA that belongs to BCC class should be designed for wear resistance. Researchers generally modify microstructures based on experience in conventional materials design paradigm, while our platform provides a convenient tool for identifying the microstructure through classification.

Fig. 3: Cross-validation results of six ML models through classification module within our MLMD platform.
figure 3

a The 10-fold CV accuracy for classifying perovskite formability of ferroelectric perovskites (Case C1). b The confusion matrix of the tuned MLMD classification model of perovskite formability of ferroelectric perovskites (Case C1). c The 10-fold CV accuracy for classifying the glass-forming ability of alloys (Case C2). d The confusion matrix of the tuned MLMD classification model of the glass-forming ability of alloys (Case C2). e The 10-fold CV accuracy for classifying solid-solution structures of HEAs (Case C3). f The confusion matrix of the tuned MLMD classification model of solid-solution structures of HEAs (Case C3). In each sub-figure, the lighter bar, darker bar, and red dashed line represent the default MLMD model, tuned MLMD model, and baseline model, respectively.

Regression module

Similar to the Classification Module, the regression module only requires a CSV-format dataset for constructing predictive models. It is flexible to select various regression algorithms and adjust corresponding hyper-parameters without programming. We compared the performance of six MLMD-implemented regressors (SVR, KNNR, MLPR, RFR, XGBR, and CBR) to the baseline model for predicting the fracture stress of low-alloy steels (R1), the Curie temperature of ferroelectric perovskite (R2), and the flow stress of FGH98 superalloys under hot deformation (R3). The baseline models utilized here were RFR implemented in Java77, SVR implemented in R75, and GPR implemented in python78 for R1, R2 and R3, respectively. As can be seen in Fig. 4a, c, e, the 10-fold CV-R2 (metric details are provided in Supplementary Note 1) of the recommended XGBR model for R1, SVR model for R2, and CBR model for R3 are 0.9427, 0.8480, and 0.9828, respectively. The recommended MLMD-implemented models outperform the baseline model for all three regression problems. The properties predicted from the recommended MLMD-implemented regressors have been plotted against the experiment measurement, as shown in Fig. 4b, d, f. Notably, the data points clustering near the diagonal line exhibit that our MLMD platform delivers satisfactory performance across diverse regression problems (raw plots are provided in Supplementary Fig. 2). Different from the classification, regression is commonly employed to predict material performance in relation to properties like strength, elongation, and hardness, among others. Researchers can leverage well-trained regression models to surrogate time-consuming trial-and-error experiments and design advanced materials at low cost. The regression module within MLMD offers a convenient tool for the experiment researchers lacking programming skills.

Fig. 4: Cross-validation results of six ML models through regression module within our MLMD platform.
figure 4

a The 10-fold CV R2 for regressing fracture strength of steels (Case R1). (b) The prediction of tuned MLMD regression model of perovskite fracture strength of steels (Case R1). c The 10-fold CV R2 for regressing high ferroelectric Curie temperature of perovskite (Case R2). d The prediction of tuned MLMD regression model of high ferroelectric Curie temperature of perovskite (Case R2). e The 10-fold CV R2 for regressing flow stress of FGH98 superalloys under hot deformation. (Case R3). f The prediction of tuned MLMD regression model of flow stress of FGH98 superalloys under hot deformation (Case R3). In each sub-figure, the lighter bar, darker bar, and red dashed line represent the default MLMD model, tuned MLMD model, and baseline model, respectively.

In summary, the classification and regression module within MLMD platform allow the acquisition of accurate models through programming-free algorithm selection and hyper-parameter tuning. Additionally, well-trained prediction models can be preserved for various other applications.

Surrogate optimization module

The well-trained regression model serves as a surrogate model, which can be integrated into the efficient numerical optimization algorithm to accelerate the materials design. The surrogate optimization module within MLMD platform necessitates an experimental or simulation dataset, corresponding ML prediction models, and boundary constraints for feature variable. The surrogate optimization module provided a convenient tool that quickly finds out a reasonable combination in the search space of compositions and processing parameters according to the targeted properties.

Reduced activation ferritic-martensitic (RAFM) steels developed from conventional 9Cr-1Mo steels have been regarded as promising candidate structural materials for fusion reactors, owing to their good thermo-physical, thermomechanical, and irradiation-resistant properties compared to those of austenitic steels68,69. In the section, the surrogate optimization module was utilized to design RAFM steels with enhanced strength and excellent ductility. Initially, two ML prediction models were constructed via the regression module in MLMD platform based on dataset R4. It is worth noting that an accurate prediction model is an essential prerequisite for optimization. The CV-ρ (metric details are provided in Supplementary Note 1) of the regressor for predicting UTS and total elongation (TE) is 0.9912 and 0.8816, respectively, as shown in Fig. 5a, b. Hence, these two ML prediction models can be employed to discover novel RAFM steels with given physical constraints (details are provided in Supplementary Table 1). Figure 5c, d depict the tensile properties of RAFM steels in R4-dataset and MLMD-recommended steels at 600 °C and 300 °C, respectively, and the Pareto front of RAFM steel is significantly pushed forward, especially at 300 °C. Li et al.70 proposed an intelligent design model, which consists of the forward model that maps composition and processing to properties and the reverse model that maps properties to composition and processing. The intelligent design model is capable of implementing property-oriented composition and processing design. As illustrated in Fig. 5c, the optimal RAFM steel at ambient temperature 600 °C achieved by Li is denoted by the blue star, and the RAFM steels designed by MLMD are highlighted by the red star. It is evident that the properties of the designed materials are very close. Furthermore, the compositions and processing of the materials are also similar, as shown in Table 2. Besides, MLMD can also discover other advanced materials with interesting properties that lie on Pareto Front, and the novel material number depends on the simple hyper-parameter setting. These materials can be applied in various scenarios according to specific requirements (the composition and processes at 600 °C are provided in Supplementary Table 2). Meanwhile, we also design RAFM steels at temperature 300 °C to further demonstrate the convenience and effectiveness of material design in MLMD platform, as shown in Fig. 5d. The recommended properties exhibit a sharp improvement compared to the report71, with a UTS of 643 MPa and TE of 14.64%. The selected material on Pareto Front, highlighted with a red star at 723.10 MPa and a TE of 20.7%, shows a 12.5% improvement in UTS and a 41.4% improvement in TE. An experiment will be conducted to validate the finding in a future study (the composition and processes at 300 °C are provided in Supplementary Table 3). In summary, these results demonstrate the powerful ability of the surrogate optimization module within the MLMD platform to effectively accelerate the discovery of high performance materials (raw plots are provided in Supplementary Fig. 3).

Fig. 5: The RAFM steels design process through surrogate optimization module in MLMD.
figure 5

a The prediction of UTS via the tuned MLMD regression model. b The prediction of TE via the tuned MLMD regression model. c The Pareto front of RAFM steels in original data is represented by red circles, and the pushed forward Pareto front of RAFM steels by NSGA-II algorithm with surrogate optimization module are represented by yellow circles at ambient temperature 600 °C. The blue Pentasta represents the material designed by the original work. d The Pareto front of RAFM steels in original data are represented by blue circles, and the pushed forward Pareto front of RAFM steels by NSGA-II algorithm with surrogate optimization module are represented by yellow circles at ambient temperature 300 °C. The red Pentasta represents the optimal material designed by MLMD at an ambient temperature 300 °C.

Table 2 Comparison among the AI-model-based designs and experiment results

Active learning module

As previously mentioned, an accurate predictive model is necessary for implementing surrogate optimization, but it often requires a large quantity of data, while the data in materials science are typically limited. Consequently, we provide a sampling-based material design strategy within the active learning module of MLMD platform. The active learning module necessitates only two datasets: the experimental/simulation data and the virtual sample data. Novel materials with desired properties can be discovered via Bayesian sampling implemented in MLMD platform from the virtual sample space.

HEAs possess excellent properties, including cryogenic toughness, strength, and thermal stability at elevated temperatures, as well as good corrosion and wear resistance72,73. Wen et al. designed a strategy combining SVM with experimental design algorithms to search for AlxCoyCrzCuuFevNiw HEAs with large hardness74. After seven iterations by iterative loops of AI-dominated and knowledge-dominated methods, the 10 new alloys with high hardness were achieved and synthesized, and their compositions are listed in Table 3. To assess the effectiveness and convenience of the active learning module within the MLMD platform, we use the identical samples as those utilized in Wen’ report74, which encompassed 155 synthesized HEAs as well as the supplementary virtual sample space, to design the HEAs with high hardness. The concentrations of six-component alloys in the virtual sample space are constrained within 34 < x < 47, 5 < y < 33, 8 < z < 34, 0 < u < 13, 5 < v < 20, and 0 < w < 16 at.%. Various utility functions are utilized for HEAs design within active learning module, towards desired properties, including EI, EIP, AEI, EQI, REI, UCB, POI, and PES. The distinction between utility functions lies in their varying emphasis on exploration or exploitation during the sampling process. Furthermore, the appropriate utility function may exhibit varying levels of efficiency, necessitating a case-by-case evaluation. Fortunately, the selection of the utility functions can be easily changed due to the programming-free nature of MLMD. Here, the results of three samples designed using EI, REI, and UCB in active module are illustrated in Fig. 6a–c, d–f, g–i, respectively (the results of remaining utility functions are provided in Supplementary Note 2). The compositions of three samples recommended by the three sampling approaches closely resemble the alloys designed in the original work74 through multiple iterations. The detailed composition of HEAs through the active learning module is presented in Table 4. Consequently, these alloys designed using MLMD demonstrate a high potential to posses high hardness. A similar design strategy can be applied to optimize other properties, such as light HEAs with high strength and parameters of HEA coatings. The active learning module within MLMD platform can also be extended to bulk metallic glasses, superalloys, and other materials.

Table 3 The top 10 newly predicted and synthesized alloys after seven iterations by iterative loops of AI-dominated and knowledge-dominated methods in original work74
Fig. 6: The atomic percentage distribution of novel AlxCoyCrzCuuFevNiw HEAs designed by Wen and MLMD.
figure 6

ac The HEAs designed by Wen through iterations are represented by blue circles, while those designed by the EI strategy in active learning in MLMD are denoted by red Pentastar. df The HEAs designed by Wen through iterations are represented by blue circles, while those designed by the REI strategy in active learning in MLMD are denoted by red Pentastar. gi The HEAs designed by Wen through iterations are represented by blue circles, while those designed by the UCB strategy in active learning in MLMD are denoted by red Pentastar.

Table 4 The 9 newly designed alloys in one iteration by EI, REI, and UCB in MLMD

Discussion

In this work, MLMD, a cutting-edge AI platform for material design was developed, aiming to accelerate the discovery of advanced materials. MLMD provides a user-friendly programming-free interface. MLMD enables efficient end-to-end materials design with one or more desired properties. Simultaneously, to tackle the issues of limited data availability in materials science, MLMD also provides the sampling-based active learning method to recommend new materials efficiently. It also integrates commonly utilized functions in AI platform such as data analysis, descriptors refactoring, and prediction modules. The outcomes of regression, classification, surrogate optimization, and active learning modules within MLMD in the study demonstrate the strong power of material design.

Notably, the approach to material design employed in MLMD differs from that outlined in the original research work70,74. In the surrogate optimization and active learning module, MLMD employs a distinct method for recommending novel materials, albeit yielding highly consistent results within previous research. Additionally, the user-friendly interface of MLMD streamlines the design process, ensuring that researchers without extensive programming skills can focus their efforts on conducting experiments, analyzing mechanisms and characterizing novel materials.

Finally, we will continue to commit to ongoing enhancements and releases of MLMD, aiming to address the challenges frequently encountered in material design, including the development of more efficient ML algorithms, more frontier tools, and visualization interfaces to enhance the efficiency of processing material data. We believe that MLMD has the potential to become an indispensable tool for material design, especially friendly for researchers who are unfamiliar with programming, and facilitate the advancement of material informatics.

Methods

Architecture

MLMD offers a complete materials design wrokflow, encompassing data collection, data preprocessing, feature engineering, model establishment, parameters optimization, material discovery, and experiment validation, as illustrated in Fig. 2. This process can be executed through a user-friendly interface. Various material data can be sourced from simulations, experiments, literature (manually collected data from published papers and patents), and open databases. In addition, MLMD provides material databases (e.g., polycrystalline ceramic, HEAs, ferroelectric perovskites). The feature engineering module involves descriptor refactoring, correlation analysis, and feature importance ranking. The classification and regression modules are used to predict the material properties, which integrate different AI models. The surrogate optimization module encompasses GA, PSO, DE, SA, and NSGA-II to accelerate the material discovery with single or multiple objectives. The active learning module offers various material design strategies such as EI, PI, AEI, UCB, and EHVI to direct experiments. The core elements and detailed architecture can be found in Fig. 1.