## Introduction

Machine learning (ML) models and algorithms are increasingly applied in materials science for a wide variety of tasks, ranging from materials characterization and property prediction to structure and composition generation and design, as reviewed in refs. 1,2,3,4,5,6,7,8,9,10,11. These data-driven algorithms have dramatically sped up the exploration of the vast chemical design space and have helped to discover many novel functional materials12. However, compared to the mature bioinformatics field with thousands of web servers (>9000)13,14, the ecosystem of materials informatics is still in the embryonic stage with <100 web servers, most of them being data infrastructures15. This can also be seen in our survey in Table 1, which focuses on inorganic crystal materials. We find that the ecosystem of chemical informatics web apps is likewise in a primitive stage, as reviewed in ref. 16. In contrast, the bioinformatics field even has a search engine named bio.tools, which indexes and tracks biological scientific web servers throughout their lifetime.

Here we argue that despite the increased sharing of data, programs, and source code in the materials informatics community, the lack of web apps for these tools has significantly impeded the progress of our field. Most experimental teams do not have the expertise to implement, train, and deploy these tools locally, so many of the proposed materials informatics algorithms are under-used. Furthermore, compared to bioinformatics, materials informatics web tools fall far short in quantity, diversity, and quality. Developing and providing web servers can make complex algorithms accessible to a broad research and user community. Web servers not only provide user-friendly services to materials researchers; a recent study has also found a positive association between the number of citations and the probability of a web server being reachable14.

Currently, the most widely used web services in materials science include the Materials Project (MP)17, Aflow-lib18, and OQMD19, which are all mainly used as data sources. Even though these major databases come with several related analysis tools, many web apps that are strongly needed in exploratory materials discovery research are still missing. This process can be generally divided into several major stages, each needing specific convenient web apps: characterization, property prediction, synthesis, theory discovery, and materials design20.

Starting from composition exploration, one would need tools and models that can check the charge neutrality and electronegativity balance of a candidate composition and estimate its formation energy. Composition-based prediction of crystal symmetry, lattice constants, or even crystal structures is also highly desirable. When structures can be predicted or obtained via element substitution, tools for structural relaxation, formation energy calculation, e-above-hull energy calculation, Pauling rule checks, phonon calculation, and synthesizability prediction are all useful to evaluate the feasibility of the candidate materials. The second major category of tools needed is property prediction web apps, as provided by several existing servers18,21. However, many of these property prediction web apps do not support screening multiple inputs, which limits their usage in high-throughput screening for new materials, even though modern deep generative materials design models can easily generate millions of candidate compositions22 and structures23. Also, many of these tools do not support a convenient download of the prediction results. In addition, it is desirable that databases of hypothetical new materials be made available for users to find novel functional materials.

In this paper, we first survey current state-of-the-art (SOTA) web services in the inorganic materials community and identify the requirements of a sufficient materials web app and the limitations of current web apps. We then introduce MaterialsAtlas.org, our materials informatics web app platform for supporting the whole life cycle of materials discovery. It includes multiple candidate materials composition and structure validations/checks, materials property prediction modules, hypothetical materials databases, and utility tools. Our web apps are developed with high-throughput materials discovery processes in mind with a user-friendly web interface and an easy download of results.

## Results

### Survey of existing web apps for materials discovery

While there are many known AI or ML studies applied to the materials discovery process10,24, many of them do not share their code, programs, web apps, or even datasets, which significantly lowers their potential in materials research. Compared to the thousands of bioinformatics web apps, materials informatics web apps are far fewer and are often developed in an ad hoc way without considering the high-throughput screening requirements of the materials discovery process. Table 1 shows a list of web apps and tools that support the materials discovery process.

Materials characterization is a key step in experimental analysis, especially with the progress of high-throughput characterization techniques that generate huge amounts of data. There is an increasing number of algorithmic studies on phase mapping of X-ray diffraction data25,26, symmetry determination in electron diffraction27, predicting crystallographic dimensionality and space group from a limited number of thin-film XRD patterns28, predicting accurate scale factor, lattice parameter, and crystallite size maps for all phases29, and tuning of parameters in the Rietveld method30. However, most of these studies do not provide user-friendly web services. In our survey, only the UCSD team provides a web tool for coordination environment prediction from X-ray absorption spectroscopy31.

The second major category of web tools is for materials property prediction. This includes aflow-ML18, JARVIS-ML21, Crystal.AI32, thermoelectric predictor33, NIMS tools34, SUNCAT catalysis property predictor35, and matlearn36. These web apps cover a variety of materials properties. For example, JARVIS-ML from NIST can predict formation energies, exfoliation energies, bandgaps, magnetic moments, refractive index, dielectric and thermoelectric properties, and maximum piezoelectric and infrared modes. However, many of these web apps are developed in an ad hoc way; they usually only accept one composition or structure at a time and cannot be used for screening. They usually do not come with a performance measure to indicate the prediction confidence. More importantly, many of the algorithms or descriptors are outdated. For example, a recent benchmark study37 showed that the best algorithms for formation energy and bandgap prediction are based on Graph Neural Networks (GNN), which are all much better than the structural descriptor-based methods used in refs. 18 and 21.

The third category of web apps is diverse utility tools for structure and composition analysis, including the crystal toolkit, phase diagram, and other tools from the Materials Project17, the prototype finder from aflowlib18, the phase diagram tool from OQMD19, analysis tools from JARVIS21, Matgenie from UCSD38, the phonon visualizer from MaterialsCloud39, and the crystal symmetry tool from the Bilbao Crystallographic Server.

The fourth category of web tools is the materials design tools including polymer designer40, Matlearn composition explorer36, SUNCAT catalysis designer41, and heterostructure designer in JARVIS21.

There are several offline tools that are very useful for materials discovery, including crystal structure prediction software such as USPEX42 and CALYPSO43. There are also platform tools such as JAMIP, which includes property ML models and first-principles calculation job management.

### MaterialsAtlas: platform of materials discovery tools

The MaterialsAtlas platform includes four types of web apps for supporting exploratory materials discovery including: composition and structure check and validation, materials property prediction, screening of hypothetical materials, and utility tools.

## Tools for composition and structure validation

### Chemical validity check

Given a predicted or generated material composition or structure, several steps are needed to verify its physical feasibility. The first quick check of chemical validity is the charge neutrality and electronegativity balance check (Fig. 1). These two checks are based on the SMACT package44, with improvements to speed up the enumeration and search process. Both checks need only composition information. Another chemical validation check is the Pauling rules check, in which we check the input structure against only the first three Pauling rules45.
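The charge-neutrality test can be illustrated with a minimal sketch. It assumes a small hand-written table of common oxidation states (`OXIDATION_STATES` and `is_charge_neutral` are illustrative names, not the SMACT data or API) and assigns one oxidation state per element, so it is only an approximation of what the platform's optimized search does.

```python
from itertools import product

# Illustrative subset of common oxidation states (an assumption for this
# sketch; the platform relies on the SMACT package instead).
OXIDATION_STATES = {
    "Li": [1], "Na": [1], "Mg": [2], "Ti": [2, 3, 4], "Fe": [2, 3],
    "O": [-2], "Cl": [-1], "S": [-2, 4, 6],
}


def is_charge_neutral(composition):
    """Return True if some choice of one oxidation state per element
    gives a total charge of zero.

    composition: dict mapping element symbol -> integer atom count.
    """
    elements = list(composition)
    state_lists = [OXIDATION_STATES[el] for el in elements]
    # Try every combination of oxidation states, one per element.
    for states in product(*state_lists):
        total = sum(q * composition[el] for q, el in zip(states, elements))
        if total == 0:
            return True
    return False


print(is_charge_neutral({"Li": 2, "O": 1}))   # Li2O
print(is_charge_neutral({"Na": 1, "Cl": 2}))  # NaCl2
```

Because each element receives a single oxidation state, this sketch rejects mixed-valence compounds such as Fe3O4; a production check needs to handle that case.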

### Formation energy and e-above-hull energy check

Another structure validation step is to check the thermodynamic stability via formation energy calculation. This step is usually done by DFT relaxation followed by the calculation of the total energy and formation energy. However, this computation is expensive for a large number of structures. Here, input materials can first be optimized using Bayesian optimization with symmetry relaxation as introduced by Zuo et al.46. We implemented two ML models for formation energy prediction. One is based on the Roost algorithm47 with only the composition as input; this model has demonstrated exceptionally good performance for compound stability prediction among composition-only ML models48. The other, structure-based energy prediction model is based on our deep global attention graph neural network (DeeperGATGNN)49, chosen for its exceptional performance in our systematic benchmark studies. The e-above-hull energy prediction module has been implemented based on Pymatgen APIs: given an input materials composition and its total energy, it reports the e-above-hull energy.
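To make the two quantities concrete, the following sketch computes a formation energy per atom and an energy above the convex hull for a hypothetical binary A-B system. It is a pure-Python illustration under stated assumptions: the function names, the reference energies, and the phase list are all made up for the example, and the platform itself uses the Pymatgen APIs rather than this code.

```python
def formation_energy_per_atom(total_energy, composition, reference_energies):
    """Formation energy per atom (eV/atom).

    total_energy: total energy of the compound cell (eV).
    composition: dict element -> number of atoms in that cell.
    reference_energies: dict element -> energy per atom of the elemental
        reference phase (eV/atom).
    """
    n_atoms = sum(composition.values())
    reference = sum(n * reference_energies[el] for el, n in composition.items())
    return (total_energy - reference) / n_atoms


def hull_energy(x, phases):
    """Lower convex hull energy at fraction x of element B.

    phases: list of (x_B, formation_energy_per_atom) points; it must contain
    the two end members (0.0, 0.0) and (1.0, 0.0).
    """
    best = float("inf")
    # In one dimension the hull value is the lowest linear interpolation
    # between any two known phases that bracket x.
    for xi, ei in phases:
        for xj, ej in phases:
            if xi <= x <= xj:
                if xi == xj:
                    energy = min(ei, ej)
                else:
                    t = (x - xi) / (xj - xi)
                    energy = (1 - t) * ei + t * ej
                best = min(best, energy)
    return best


def energy_above_hull(x, formation_energy, phases):
    """Distance of a candidate from the hull of the known phases (eV/atom)."""
    return formation_energy - hull_energy(x, phases)


# Made-up numbers for a hypothetical A-B system, for illustration only.
known_phases = [(0.0, 0.0), (1.0, 0.0), (0.5, -1.0)]
e_form = formation_energy_per_atom(-13.2, {"A": 3, "B": 1}, {"A": -3.0, "B": -3.0})
print(e_form)
print(energy_above_hull(0.25, e_form, known_phases))
```

A negative result from `energy_above_hull` would mean the candidate lies below the hull of the known phases, that is, it would itself become a hull member.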

### Prediction of crystal symmetry (space group and crystal systems) and lattice parameters

Given a materials composition, predicting its structure is very valuable, as many of its macroscopic properties such as ion conductivity, thermal conductivity, band gap, and formation energy can then be calculated using first-principles calculations. However, crystal structure prediction is currently an unsolved problem. In this case, predicting the crystal symmetry, such as the crystal system or space group, can be very useful for estimating some of the material's properties. Here we implement neural network models for space group and crystal system prediction50 which have achieved SOTA performance. Another important piece of structural information for crystals is the unit cell parameters, whose precise estimation can greatly help the crystal structure prediction step. Here we implemented a deep neural network model for lattice parameter estimation, which has demonstrated exceptionally good performance for cubic systems and reasonably good results for other crystal systems51.

### Template-based crystal structure prediction

We have developed and implemented a template-based crystal structure prediction algorithm, TCSP, for fast structure determination52. By exploiting the vast number of known crystal structures, our algorithm has demonstrated good performance in CSP as benchmarked on the Materials Project dataset. The only input to this app is a material formula with an optional space group parameter; it then generates multiple hypothetical crystal structures along with the template information used.

## Materials property prediction with composition or structures

Depending on the types of features used to train the algorithm, we can categorize ML property prediction models as either composition-based or structure-based. Composition-based prediction algorithms have been demonstrated to be reliable, accurate, and even preferred at times53. The composition-based category includes models that primarily use descriptors derived from the chemical composition, such as elemental representations or chemical composition features54,55. Algorithms used in these composition-based ML models range from basic ML techniques such as decision trees56 to more complex deep learning algorithms such as Convolutional Neural Networks57 or Graph Neural Networks47.

Composition-based ML models for property prediction come with both advantages and disadvantages. Because these models only use chemical composition descriptors as inputs, their predictive performance heavily relies on the quality of these features and the dataset. Therefore, the application of these models requires careful curation steps53. As the composition ML models omit the structural information of the materials, they generally offer inferior predictive performance compared to structure-based ML models, especially when the size of the dataset is sufficiently large37,58. However, thanks to this omission of structural information, composition-based models are more computationally efficient than structure-based ones and can be used to screen a much larger chemical space, as material compositions are much easier to acquire than crystal structure data47. This omission can be very beneficial in some scenarios since structural-feature extraction is generally very complex and needs to be symmetry-invariant53. With just composition descriptors, composition-based ML models can adopt simple ML algorithms such as decision trees and support vector machines and still obtain accurate results53. Composition-based models can also adopt more robust ML algorithms from deep learning, as shown in several deep learning models for property prediction including ElemNet (17 fully-connected layers)58, Roost (GNN)47, and the periodic-table-based Convolutional Neural Network59. We note that composition-based predictors have one inherent limitation: a given composition may correspond to several polymorphic structures, which may bias these models.

Another category of ML models for materials property prediction is structure-based ML models. As almost all materials properties are heavily dependent on their structures, structure-based ML models for materials property prediction usually achieve greater accuracy than composition-based ML models60,61. Structure-based models use structure-based descriptors or features learned from raw structure information60,62,63. Structure Graph, Voxel Grids64, Coulomb Matrix65, and Voronoi Tessellation12 are some of the most popular techniques to represent materials based on knowledge of their structure. Although models of this category achieve better prediction results, they can only predict properties of materials whose structures are already known, either from repositories like the Inorganic Crystal Structure Database (≈165,000 materials)66 or the Materials Project Database (≈125,000 materials)17, or from hypothetical materials generated using generative models22,67, whereas the number of possible chemical materials is infinite.

Recent studies have shown that when structural descriptors are learned by deep neural network models, they can predict materials properties with much better accuracy than methods that use descriptors based on physicochemical information37,68. For this purpose, GNN models have been intensively used as they have shown great success in this task60,63,69. GNN models have been found to achieve SOTA performance for various materials property prediction tasks. CGCNN60, MEGNet63, GATGNN68, SchNet69, and MPNN70 are some of the well-known GNN models for materials property prediction that use graph representation learning. One of the problems of these existing GNN models is that they cannot go deep, i.e., their performance decreases with an increasing number of graph convolution layers as the representations of all the node vectors become indistinguishable. This problem is known as the over-smoothing problem71,72,73,74, and almost all GNN models suffer from it. Recently, we designed a deeper and much improved version of the GATGNN model (DeeperGATGNN49) using Differentiable Group Normalization (DGN)75 and skip-connections76,77. These allow DeeperGATGNN to use a large number of graph convolution layers and to predict materials properties with better accuracy than all the above-mentioned GNN models on the five datasets used in a recent large-scale benchmark study37 and on the Band Gap dataset from the Materials Project Database. In our current system, the structure-based formation energy predictor is based on CGCNN, and the structure-based predictors for band gap, elastic moduli, hardness, and thermal conductivity are based on our DeeperGATGNN trained with samples from the Materials Project. The details of the datasets used to train our DeeperGATGNN models are presented in Table 2.

## Materials property prediction tools

### Predicting 2D materials from composition

We train a Random Forest classification model to predict whether a given composition forms a 2D or layered structure78. For the training data, 6351 2D materials (positive samples) are collected from the 2DMatPedia dataset79; 15,959 negative samples are gathered from the Materials Project by removing 2D materials. After training, our model achieves a classification accuracy of 88.98%. For a given input formula, our model outputs a predicted label (True or False) with the corresponding probability in the downloaded results file. Inputs of multiple formulas are also supported, either as a CSV file or by typing them into the input box separated by a comma or space. Clicking the ’Check now’ button shows the 2D materials found; clicking the ’Download results’ link downloads the detailed results.
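The multi-formula input described above can be handled with a few lines of parsing. The sketch below is only an illustration of the comma-or-space convention; the helper names `split_formulas` and `parse_formula` are hypothetical, and the parser assumes simple formulas without parentheses or charges.

```python
import re


def split_formulas(text):
    """Split an input-box string on commas and/or whitespace."""
    return [token for token in re.split(r"[,\s]+", text.strip()) if token]


def parse_formula(formula):
    """Parse a simple formula such as 'SrTiO3' into element counts.

    Parentheses, hydrates, and charges are not handled in this sketch.
    """
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        # A missing number after an element symbol means a count of one.
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


for formula in split_formulas("MoS2, BN  SrTiO3"):
    print(formula, parse_formula(formula))
```

Each parsed composition could then be converted to a feature vector and passed to the classifier in a single batch.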

### Predicting noncentrosymmetric materials from composition

A Random Forest classification model is trained to predict whether a material is noncentrosymmetric80. For training this model, a total of 82,506 samples are collected from the Materials Project after removing those compositions that belong to multiple space groups with conflicting centrosymmetric tendencies; here, 60,687 of them are positive samples and 21,919 are negative samples. The prediction accuracy reaches 84.8%. The input format and output form are the same as for the 2D materials predictor above.

### Predicting band gap from composition or structure

The band gap prediction models are trained with a dataset of 36,837 samples downloaded from the Materials Project. The composition-based ML model is based on CrabNet81, a compositionally restricted attention-based network for materials property prediction that uses a transformer self-attention mechanism82. Evaluations on 28 datasets have shown good performance compared to other models. The structure-based band gap predictor is trained on the dataset downloaded from the Materials Project using the DeeperGATGNN graph attention network model49. For a given input formula, the composition-based model outputs the predicted band gap value.

### Predicting elastic moduli from composition or structure

We trained two types of models for elastic moduli prediction: composition-based models and structure-based ones. The former are Roost neural network models47 trained with only materials compositions. Our structure-based elastic moduli prediction models are based on our recent DeeperGATGNN algorithm49, a global attention-based scalable deep graph neural network model with state-of-the-art performance for structure-based materials property prediction. Both types of models are trained using the known materials with elastic information in the Materials Project database. For each category, we train four models to predict the bulk modulus, shear modulus, Young’s modulus, and Poisson’s ratio based on the composition or structure information.

### Predicting hardness from composition or structure

A recent study uses deep learning for hardness prediction and has shown good performance83. Another study84 uses 1062 experimentally measured load-dependent Vickers hardness data points extracted from the literature to train the XGBoost ML algorithm with composition-only descriptors, achieving excellent accuracy (R² = 0.97). In a related study85, XGBoost has been applied to build a temperature-dependent Vickers hardness prediction model with R² = 0.91 using only 593 labeled samples. Here we trained a Roost ML model for composition-based hardness prediction and a graph neural network model for structure-based hardness prediction using our DeeperGATGNN algorithm49.

### Predicting thermal conductivity from composition or structure

The most recent study on thermal conductivity prediction is ref. 86, in which GNNs (CGCNN) and random forest approaches are combined to build the prediction model. That model is trained with 2668 ordered and stoichiometric inorganic structures from the ICSD. Here we build a Roost graph neural network model47 for composition-based prediction and a CGCNN graph neural network model60 for structure-based prediction. The dataset is downloaded from ref. 87 and contains thermal conductivity values for 2701 crystal structures in the ICSD database. Due to the limited data size, the predictions should be used for experimental purposes only.

### Predicting superconductor transition temperature from composition

We also train a random forest model and a CrabNet model to predict the superconductor transition temperature. The dataset is collected from the SuperCon database88.

In our current implementation of materials predictors, all models generate only a single-point prediction without uncertainty estimation, as is the case for almost all materials prediction algorithms37. However, in practice, it is desirable to obtain robust predictions with accurate uncertainty estimation89, which can be achieved via methods such as ensemble90, Bayesian91, or evidential deep learning regression models92. While such methods have rarely been used in materials property prediction, we expect their wider adoption in the future and will add them to our models in future upgrades.
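As an illustration of the ensemble route to uncertainty estimation, the sketch below reports the mean and standard deviation of predictions from several independently trained models. The three "models" are hypothetical stand-in functions, not the platform's predictors, and the spread across ensemble members is only one possible uncertainty proxy.

```python
from statistics import mean, stdev


def ensemble_predict(models, x):
    """Return (mean prediction, standard deviation across ensemble members).

    models: list of callables, each mapping an input to a float prediction.
    """
    predictions = [model(x) for model in models]
    return mean(predictions), stdev(predictions)


# Hypothetical stand-ins for models trained with different random seeds.
models = [
    lambda x: 2.0 * x,
    lambda x: 2.0 * x + 0.3,
    lambda x: 2.0 * x - 0.3,
]

prediction, uncertainty = ensemble_predict(models, 1.0)
print(prediction, uncertainty)
```

A large spread flags inputs on which the ensemble members disagree, which is where a single-point prediction is least trustworthy.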

## Generative design and screening for materials discovery

### Deep generative design of materials compositions/formulas

Generative models, such as the variational autoencoder (VAE)93 and the Wasserstein generative adversarial network (WGAN)94, play an important part in computer vision, audio processing, natural language processing, and molecular science. However, few works have focused on using generative models to generate virtual inorganic materials (e.g., compositions and crystal structures). Researchers mainly use generative models in materials science in two directions. The first is to generate compositions22,95. Dan et al.22 proposed using WGAN models, trained on the ICSD dataset, to generate hypothetical materials compositions. Their models not only rediscover most compositions from existing materials databases but also generate many novel compositions that are chemically valid. Here we provide the screening tools for such hypothetical materials.

### Deep generative design of cubic crystal materials

Compared to generating virtual materials compositions, generating virtual crystal structures is more helpful for practitioners seeking novel materials, since many materials properties can only be calculated with structural information. Several methods based on VAEs96,97,98 and GANs23,67,99,100 have been proposed to generate material structures. CubicGAN, proposed by Zhao et al.23, is the first method that can achieve large-scale generative design of novel cubic materials. It not only rediscovers most of the cubic materials in the Materials Project and ICSD but also discovers new prototypes with stable materials. In their work23, they found 31 new prototypes for the space groups Fm$\bar{3}$m, F$\bar{4}$3m, and Pm$\bar{3}$m, of which 4 prototypes contain stable materials. A total of 506 cubic materials have been verified as stable by phonon dispersion calculation. In our web app platform, we provide a search function for those materials (Table 3 and Fig. 2).

### Tools for hypothetical materials screening

One of the major goals of the materials informatics community is to expand the existing materials repositories in terms of materials compositions, structures, and properties, which can help accelerate the discovery of materials with novel functions. Using our recently developed materials composition generative model (MATGAN)22, we have generated a large number of hypothetical material compositions, which are deposited in the Hypothetical composition database for screening (Fig. 3). For convenience, we also selected the lithium compound candidates and built the Hypothetical lithium materials database. Using our crystal structure generator, CubicGAN23, we have created a cubic materials database for screening. Hypothetical materials compositions can also be combined with element substitution-based structure prediction to generate new materials databases. Finally, we trained a 2D materials classifier and used it to screen all the hypothetical compositions generated by MATGAN; the predicted 2D candidates are deposited in the hypothetical 2D materials database.

## Utility Tools

Several utility tools (Fig. 4) to assist the materials discovery process have been developed and deployed on our platform. These include a chemical composition enumeration tool, feature generation and click-and-run machine learning models for users’ datasets, composition and structure search, a supercell generator, and a structure file format converter.

### Composition enumerator

Given several elements, what are the possible chemically valid formulas that can be synthesized and are stable? Based on the SMACT materials informatics package44,101, we developed this composition enumerator to generate target materials compositions given a set of elements or an existing formula with one or more dopant elements. Due to oxidation state preferences, the number of possibilities is limited, and this tool can help the investigator narrow down the search space. A case study on how to use this module for discovering new materials is reported in our work52. With the hypothetical compositions, one can then apply crystal structure prediction to get their crystal structures and then predict their properties using composition-based or structure-based predictors.
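The idea behind such an enumerator can be sketched as a brute-force search over small stoichiometries, keeping only charge-neutral, reduced formulas. The oxidation state table, the function name, and the `max_count` limit are assumptions made for this illustration; the actual tool builds on SMACT and also supports dopants, which this sketch does not.

```python
from functools import reduce
from itertools import product
from math import gcd

# Illustrative subset of common oxidation states (not the SMACT table).
OXIDATION_STATES = {
    "Li": [1], "Mg": [2], "Ti": [2, 3, 4], "Fe": [2, 3],
    "O": [-2], "Cl": [-1],
}


def enumerate_neutral_formulas(elements, max_count=3):
    """List reduced formulas whose charges can sum to zero.

    elements: ordered list of element symbols.
    max_count: largest stoichiometric coefficient tried per element.
    """
    state_lists = [OXIDATION_STATES[el] for el in elements]
    formulas = []
    for counts in product(range(1, max_count + 1), repeat=len(elements)):
        # Skip non-reduced stoichiometries such as Fe2O2.
        if reduce(gcd, counts) != 1:
            continue
        neutral = any(
            sum(q * n for q, n in zip(states, counts)) == 0
            for states in product(*state_lists)
        )
        if neutral:
            formulas.append(
                "".join(f"{el}{n if n > 1 else ''}" for el, n in zip(elements, counts))
            )
    return formulas


print(enumerate_neutral_formulas(["Li", "O"]))
print(enumerate_neutral_formulas(["Fe", "O"]))
```

The search space grows exponentially with the number of elements and `max_count`, which is why a production implementation needs the kind of speed-ups mentioned for the validity checks.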

### Feature generation

The very first step for developing materials property prediction models is to generate and select a set of good descriptors. Here we implemented a pipeline that allows users to choose feature combinations from diverse feature types such as composition features, structure features, electronic features, etc. This will greatly simplify the steps for materials scientists without a strong materials informatics background to develop ML models.

### Composition-based ML models for user-specified property prediction

We have built an ML pipeline that allows the user to specify the dataset, the target property values, and the algorithm; the web tool then builds composition-based ML models and reports the prediction performance. The test input is a group of materials formulas.

### Structure-based ML models for user-specified property prediction

We have built a pipeline that allows the user to train a structure-based ML model for custom property prediction at http://materialsatlas.org/mlstructure. This can greatly help materials scientists try different representations and ML algorithms to get the best performance.

### Finding similar compositions and structures

In many tinkering and exploratory studies of the materials design space, it is very helpful to find similar materials and explore how their properties change. This search function supports that need. We use the Earth Mover’s Distance102 to search for the top N most similar formulas from different databases. For structure similarity, we use computed XRD features103 to search for similar structures.
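A minimal sketch of a composition-level Earth Mover's Distance is given below. It assumes that each element is placed on a one-dimensional axis by its atomic number, which is a choice made only for this illustration (the element ordering used by the platform is not specified here), and the `ATOMIC_NUMBER` table and function names are hypothetical.

```python
# Illustrative element positions; atomic number is an assumed ordering.
ATOMIC_NUMBER = {"Li": 3, "O": 8, "Na": 11, "Cl": 17, "K": 19, "Br": 35}


def fractions(composition):
    """Normalize element counts to atomic fractions."""
    total = sum(composition.values())
    return {el: n / total for el, n in composition.items()}


def composition_emd(comp_a, comp_b):
    """1D Earth Mover's Distance between two compositions.

    In one dimension the EMD equals the area between the two cumulative
    distributions along the element axis.
    """
    mass_a = {ATOMIC_NUMBER[el]: f for el, f in fractions(comp_a).items()}
    mass_b = {ATOMIC_NUMBER[el]: f for el, f in fractions(comp_b).items()}
    positions = sorted(set(mass_a) | set(mass_b))
    emd, carried = 0.0, 0.0
    for left, right in zip(positions, positions[1:]):
        # 'carried' is the cumulative mass difference up to 'left'.
        carried += mass_a.get(left, 0.0) - mass_b.get(left, 0.0)
        emd += abs(carried) * (right - left)
    return emd


def most_similar(query, database, n=2):
    """Return the names of the n database entries closest to the query."""
    return sorted(database, key=lambda name: composition_emd(query, database[name]))[:n]


database = {
    "KCl": {"K": 1, "Cl": 1},
    "NaBr": {"Na": 1, "Br": 1},
    "Li2O": {"Li": 2, "O": 1},
}
print(most_similar({"Na": 1, "Cl": 1}, database))
```

With this ordering, swapping Na for K moves half of the mass by eight positions, giving a distance of 4.0 between NaCl and KCl.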

For the convenience of the community, we have included other utility tools such as structure file conversion and supercell generation apps.

## Discussion

In addition to candidate materials composition and structure validation, materials property prediction, and screening of materials, several additional tools and services are highly desirable for exploratory materials discovery and will be added to our platform to lower the barrier for materials scientists in data-driven exploratory materials discovery.

### Phonon prediction, synthesizability prediction, and additional crystal structure prediction algorithms

One important validation step for a newly proposed hypothetical material is to assess its dynamic stability. This can be done by calculating the phonon dispersion spectrum: a material is dynamically stable at 0 K when the spectrum contains no imaginary frequencies. The phonon dispersion relations of hypothetical materials are also important for studying the k-space dependence of the frequencies of normal modes. However, first-principles phonon dispersion calculation is computationally expensive. Based on recent work on phonon density of states prediction104 and phonon vibration frequency prediction105, we are developing a graph neural network model for phonon dispersion prediction, with the aim of using it to check the dynamic stability of structures. Another module under development is a material synthesizability prediction model; such models have been shown to achieve good performance for inorganic materials using semi-supervised ML106,107. In addition, we find that crystal structure prediction plays an important role in exploratory materials discovery, and current DFT-based global optimization algorithms are applicable only to small systems due to the inherent challenges of crystal structure prediction. In addition to the template-based crystal structure prediction service52, we are planning to develop deep learning-based crystal structure prediction algorithms by exploiting the databases of known crystal structures.
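The stability criterion itself is simple once the frequencies are available. The sketch below assumes a common convention in which imaginary modes are reported as negative frequencies, and it uses a small tolerance for numerical noise in the acoustic modes near the zone center; the function name, the tolerance value, and the example numbers are all assumptions for illustration.

```python
def is_dynamically_stable(band_frequencies, tolerance=0.05):
    """Check a phonon spectrum for imaginary modes.

    band_frequencies: list of q-points, each a list of mode frequencies (THz),
        with imaginary modes encoded as negative numbers (assumed convention).
    tolerance: small negative values down to -tolerance are treated as noise.
    """
    lowest = min(f for qpoint in band_frequencies for f in qpoint)
    return lowest >= -tolerance


# Made-up spectra for two hypothetical structures.
stable_spectrum = [[0.0, 0.0, 0.0, 5.1], [1.2, 1.3, 2.0, 5.4]]
unstable_spectrum = [[-0.01, 0.0, 0.0, 5.1], [-1.5, 0.9, 2.0, 5.2]]

print(is_dynamically_stable(stable_spectrum))
print(is_dynamically_stable(unstable_spectrum))
```

An ML-predicted dispersion could be passed through the same check, with the tolerance widened to reflect the model's prediction error.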

### Predicting ion conductivity from composition or structure

Due to the extremely limited datasets, prediction of ion conductivity has been very challenging, with only moderate success achieved using a set of hand-crafted structural descriptors108,109. This module is under development and will be added to our platform in the future.

### Extensible servers and API services

To expand the coverage of functionalities, our MaterialsAtlas web server is open to including third-party web apps for materials research. We welcome any investigator to collaborate with us and deploy their applications on our platform. Only executable code or Python code that runs in a Linux environment is needed. Another useful feature is a set of REST API services that lets other web services call our APIs to run queries or calculations, an approach that has shown great success with the Materials Project’s Pymatgen APIs.

### Visualization and interactive exploration of design space

Interactive exploration of the materials design space has great potential to help researchers. We will add modules that support the visualization of materials property distributions in the structural or composition space, as shown in Fig. 5. In this figure, we map the structures into a 2D space using t-SNE110 and an XRD representation of the structures. Red dots mark the samples with annotated thermal conductivity, with the dot size representing its magnitude. Such interactive maps will greatly facilitate the search for high-performance materials.

Despite the rapid progress of ML for materials research, many studies have led only to papers without shared software, while others shared their source code but did not create a user-friendly web service or web app for it. Based on the experience of the bioinformatics field, it is critical for materials informatics researchers to develop and share easy-to-use web apps that wrap their algorithms, for maximum adoption and usage of such data-driven tools in real-life materials discovery and analysis. We have surveyed the status quo of materials informatics web apps and find that they drastically lag behind those of the bioinformatics community. Here we report our MaterialsAtlas.org web platform, which implements and integrates a variety of user-friendly tools for exploring the materials design space, generating candidates, and validating them. These tools, together with those planned, will greatly lower the barrier for materials researchers without deep computing or ML backgrounds to effectively exploit such tools.

## Methods

### System architecture and web app

MaterialsAtlas uses Django’s built-in SQLite3 database for storing hypothetical materials found by our generative materials design models22,23,78. A RESTful API framework is used to send data from the Django back-end to the Vue.js front-end and vice versa. For example, a user inputs either a chemical formula or an element in one of the apps, and the input is interpreted through the Django REST framework. The data is then queued as a job using Redis, and a Python worker subsequently passes the data to the corresponding app function. Once the worker has finished the job, the result is returned to the front-end to be viewed by the user. MaterialsAtlas also uses Ajax for some of the applications to communicate with our API. Nginx is used as the web application’s HTTP server and also proxies requests to the back-end and front-end servers. For easier deployment, Docker is used to package each web service as a container, allowing the web application to work as a whole.
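The request flow described above (submit a job, let a worker run the app function, return the result) can be sketched with the Python standard library. This is only an in-process stand-in: `queue.Queue` plays the role of the Redis job queue, a thread plays the role of the Python worker, and the app function, job identifiers, and names are hypothetical rather than taken from the MaterialsAtlas code base.

```python
import queue
import threading

job_queue = queue.Queue()  # stand-in for the Redis job queue
results = {}               # stand-in for the store read by the front-end


def count_atoms(formula_counts):
    """Hypothetical app function: total number of atoms in a formula unit."""
    return sum(formula_counts.values())


APPS = {"count_atoms": count_atoms}


def worker():
    """Pull jobs off the queue and run the matching app function."""
    while True:
        job = job_queue.get()
        if job is None:  # sentinel used to stop the worker
            job_queue.task_done()
            break
        job_id, app_name, payload = job
        results[job_id] = APPS[app_name](payload)
        job_queue.task_done()


def submit(job_id, app_name, payload):
    """What the API view would do after validating the user input."""
    job_queue.put((job_id, app_name, payload))


thread = threading.Thread(target=worker, daemon=True)
thread.start()

submit("job-1", "count_atoms", {"Sr": 1, "Ti": 1, "O": 3})
job_queue.join()           # wait until the worker has finished the job
print(results["job-1"])

job_queue.put(None)        # shut the worker down
thread.join()
```

Decoupling submission from execution in this way keeps the web request short even when the underlying ML model takes seconds to run.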

### Backend models

Python is used as MaterialsAtlas’ primary back-end language to compute each application result and write to the Django database.

### Job submission

When integrating a web application with any ML model, latency is a large concern. Using Redis’ job queue and fast in-memory data storage functionality allows a web application of this nature to run smoothly.