Introduction

Research in the field of materials informatics (MI) has become increasingly active with recent developments in ceramic material databases and computational performance (CPUs and GPUs). MI combines informatics with materials exploration to accelerate the search for new functional materials. Inorganic materials databases, such as the Materials Project1 and the Inorganic Crystal Structure Database (ICSD)2, contain a vast number of crystal structure data points and material properties estimated using computational simulations. These databases have facilitated many studies in materials science using high-throughput calculations, such as first-principles and force-field calculations3,4,5. By applying MI to these databases or via high-throughput studies, researchers can organize complex relationships between the compositions, structures, and physical properties of materials, facilitating a more efficient search for useful materials. The process typically involves converting material data into descriptors xi, and learning a prediction function that yields the desired physical properties as the objective variable Y6. Descriptors numerically represent crystal structures and serve as an interface between crystal structure and material property data. Researchers (users) can propose candidate new materials, convert them into descriptors, and predict their physical properties instantly4,7,8,9. By identifying promising materials in advance, experimental costs can be significantly reduced. Furthermore, analyzing the constructed prediction functions using methods such as variable importance analysis may provide chemical semantic interpretations of the descriptors10,11. However, existing methods face challenges in identifying unknown materials that are not registered in databases. Inorganic crystalline material descriptors contain information derived from composition, crystal structure, and other characteristics. 
Although it is easy to generate descriptors from conventional crystal structure descriptions, such as the lattice parameters and fractional coordinates of ions, reconstruction of crystal structures from those descriptors remains challenging. Consequently, the MI approach, which uses such descriptors, cannot predict the physical properties of undiscovered compounds—i.e., non-registered compounds—in the material database. Attempts have been made to evaluate phase stability and other physical properties by constructing machine learning models using only compositional information, without using structure-derived descriptors12,13,14. A composition and its descriptors are reversible, and users can easily obtain evaluation results by simply considering their compositions. However, the accuracy of evaluation results is greatly reduced as material properties are highly dependent on both composition and crystal structure information. Nevertheless, there is significant potential in areas outside the scope of existing material databases. According to a prior report15, it is estimated that only 16% of inorganic compounds have been discovered in ternary systems and merely 0.6% in quaternary systems. In another paper16, it was estimated that, in compositional combinations considering charge neutrality, there were 30 million cases in ternary systems and 30 billion cases in quaternary systems. This indicates that novel functional materials are likely hidden in the vast space of undiscovered materials, necessitating an efficient exploration approach. The conventional MI approach alone is inefficient for determining the compositions and structures of undiscovered materials with desired properties. To support specialized researchers and enhance their intuition, this study introduces the concept of a materials map, which visualizes existing datasets in a two-dimensional space. 
Even today, numerous groundbreaking material discoveries are believed to heavily rely on the intuition and expertise of exceptional specialists. Therefore, the development of MI-derived tools that can effectively and efficiently support these experts is expected to accelerate the discovery process.

One challenge experienced by researchers when using the MI approach arises from the use of high-dimensional descriptors as inputs for the MI scheme. To make the most of the intuition and perspective of researchers, this study introduces the concept of a materials map, which transforms existing datasets into a two-dimensional space. This map evaluates materials based on two intuitive scales: chemical composition and crystal structure. Consequently, it enables the specification of the search range for known materials while simultaneously identifying unexplored materials. Notably, 2D materials maps using fundamental properties, such as electronegativity and ionic radius, have been proposed in several prior studies17,18. However, because conventional materials maps consider only specific indices, selecting relevant coordinates relies heavily on the experience of researchers. Therefore, in this study, the descriptors were dimensionally reduced using an autoencoder, a type of deep learning model, while retaining as much information related to composition and structure as possible. Furthermore, by applying a neural network structure associated with the target properties, we developed a materials map tailored to user-specified target properties.

Dataset

To validate the materials mapping method developed in this study, we present an application example focusing on candidate solid electrolyte materials in all-solid-state batteries, which are attracting attention as a next-generation energy source. One crucial property of solid electrolyte materials is high Li-ion conductivity. We used a dataset comprising 708 conductivity simulations of Li–O-based inorganic solid materials, which we reported previously19. All compounds were selected from the Materials Project database and matched two conditions: (1) inclusion of Li and O, and (2) incorporation of 25 elements (Mg, Ca, Sr, Ba, Sc, Y, La, Ti, Zr, Hf, Nb, Ta, Zn, B, Al, Ga, In, C, Si, Ge, Sn, P, As, Sb, S). For samples meeting these conditions, the migration energy (ME, denoted as Emig [eV]) of lithium-ion diffusion in the solids was calculated using an automated ion conductivity evaluation algorithm based on the Bond Valence Force Field (BVFF)20 method, where smaller values of Emig indicate a smaller barrier to Li-ion diffusion and thus better conductivity. The compositional information considered in this study includes the chemical composition ratio and various properties of constituent elements (atomic number; atomic weight; electronegativity; melting point in the metallic state; atomic/ion/covalent bond radii; number of s-, p-, d-, and f-electrons; and Mendeleev number) expressed in histogram form21. Three types of structural information were employed: the Radial Distribution Function (RDF), the Angular Distribution Function (ADF), and the Voronoi diagram. Compositional and structural descriptors were represented by numerical sets of 1782 and 858 dimensions, respectively. All descriptors and Emig values were scaled by the maximum value for each type (compositional descriptors, RDF, ADF, Voronoi diagrams, and Emig) and converted to a range of 0 to 1. All data tables are available in the Supporting Information.
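The per-type scaling described above can be illustrated with a minimal sketch. It assumes that each descriptor type is divided by the single largest value found anywhere in that type and that all values are non-negative (as for histograms); the function name, the toy column ranges, and the toy numbers are hypothetical and are not taken from the dataset.

```python
def scale_by_group_max(rows, groups):
    """Divide every value in a descriptor group by that group's maximum.

    rows   -- list of samples, each a list of non-negative floats
    groups -- dict mapping a group name to a (start, stop) column range
    Assumption: one maximum per descriptor type, taken over all samples.
    """
    scaled = [list(r) for r in rows]
    for start, stop in groups.values():
        group_max = max(v for r in rows for v in r[start:stop])
        if group_max == 0:
            continue  # an all-zero group is left unchanged
        for r in scaled:
            for j in range(start, stop):
                r[j] = r[j] / group_max
    return scaled

# Toy data: 2 "compositional" columns followed by 3 "RDF" columns (hypothetical sizes).
toy = [[2.0, 4.0, 10.0, 0.0, 5.0],
       [1.0, 0.0, 20.0, 5.0, 0.0]]
toy_groups = {"comp": (0, 2), "rdf": (2, 5)}
scaled_toy = scale_by_group_max(toy, toy_groups)
```

Scaling by one maximum per type keeps the relative heights within each histogram-type descriptor intact, which a column-by-column scaling would not.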

Methods

In this study, an autoencoder was used to compress each of the compositional and structural descriptors into one dimension each. The autoencoder22 is a neural network model with specific constraints on the number of neurons and layer shapes. It is an unsupervised machine learning technique that can effectively reduce the dimensionality of input information. Upon receiving an n-dimensional input, the number of neurons (dimensions) is reduced (encoded) to the user-specified dimensionality through the application of fully connected layers and activation functions (encoded values), and the encoded values are then expanded (decoded) to ensure that the output’s dimensionality matches that of the input (decoded values). When the autoencoder successfully reproduces the input descriptors in its outputs, the encoded values effectively capture all the information from the original descriptors. In this study, the compositional and structural descriptors were compressed to one dimension each and then used to generate a two-dimensional map based on the compositional and structural axes. Furthermore, an intermediate layer was introduced to learn the relationship between ME and the 2D information in the coding layer, allowing for the creation of a materials map associated with ME. A schematic of the neural network model used in this study is presented in Fig. 1. To optimize the model weights, loss functions were defined as Eqs. (1) and (2), corresponding to minimization of the mean squared error (MSE).

$$\begin{array}{c}{Loss}_{comp}= \frac{1}{n}\sum \limits_{i=1}^{n}\sum\limits_{j=1}^{d}{\left({Out \,comp}_{i, j}-{In \,comp}_{i, j}\right)}^{2},\end{array}$$
(1)
$$\begin{array}{c}{Loss}_{str}= \frac{1}{n}\sum \limits_{i=1}^{n}\sum\limits_{j=1}^{d}{\left({Out \,str}_{i, j}-{In \,str}_{i, j}\right)}^{2},\end{array}$$
(2)

where n and d denote the number of samples and the number of descriptor dimensions, respectively.
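As a small worked illustration of Eqs. (1) and (2), the sketch below sums the squared error over the d descriptor dimensions of each sample and averages over the n samples. The function name and the toy arrays are assumptions made for this example only.

```python
def reconstruction_loss(outputs, inputs):
    """Eqs. (1)/(2): squared error summed over dimensions, averaged over samples."""
    n = len(inputs)
    total = 0.0
    for out_row, in_row in zip(outputs, inputs):
        total += sum((o - i) ** 2 for o, i in zip(out_row, in_row))
    return total / n

# Two toy samples with three descriptor dimensions each (n = 2, d = 3).
toy_inputs = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.0]]
toy_outputs = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]
toy_loss = reconstruction_loss(toy_outputs, toy_inputs)  # (0.25 + 0.5) / 2 = 0.375
```

Note that, as written in the equations, the error is normalized by n only and not by n × d, so descriptors with more dimensions contribute a larger loss for the same per-dimension error.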

Figure 1

Autoencoder architecture for the materials map. Compositional and structural descriptors are condensed into a one-dimensional format, producing encoded values, denoted as xcomp and xstruct, which are linked to the migration energy (ME) as the target objective.
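The data flow in Fig. 1 can be sketched in plain Python as follows. This is only an untrained, illustrative forward pass: the hidden-layer sizes, the tanh activation, the random initialization, and all names (`TinyAutoencoder`, `me_head`, the toy input sizes) are assumptions, since the text does not specify them.

```python
import math
import random

def dense(x, weights, bias):
    """Fully connected layer for one sample: returns W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (arbitrary choice for the sketch)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def tanh_vec(x):
    return [math.tanh(v) for v in x]

class TinyAutoencoder:
    """n_in -> hidden -> 1 -> hidden -> n_in (untrained)."""
    def __init__(self, n_in, hidden, rng):
        self.enc1 = init_layer(n_in, hidden, rng)
        self.enc2 = init_layer(hidden, 1, rng)
        self.dec1 = init_layer(1, hidden, rng)
        self.dec2 = init_layer(hidden, n_in, rng)

    def encode(self, x):
        return dense(tanh_vec(dense(x, *self.enc1)), *self.enc2)

    def decode(self, z):
        return dense(tanh_vec(dense(z, *self.dec1)), *self.dec2)

rng = random.Random(0)
comp_ae = TinyAutoencoder(n_in=6, hidden=4, rng=rng)    # stands in for 1782 dimensions
struct_ae = TinyAutoencoder(n_in=5, hidden=4, rng=rng)  # stands in for 858 dimensions
me_head = (init_layer(2, 3, rng), init_layer(3, 1, rng))  # (x_comp, x_struct) -> ME

comp_in = [0.1, 0.0, 0.5, 0.2, 0.0, 1.0]
struct_in = [0.3, 0.3, 0.0, 0.9, 0.1]
x_comp = comp_ae.encode(comp_in)[0]        # one-dimensional compositional code
x_struct = struct_ae.encode(struct_in)[0]  # one-dimensional structural code
me_pred = dense(tanh_vec(dense([x_comp, x_struct], *me_head[0])), *me_head[1])[0]
comp_out = comp_ae.decode([x_comp])        # reconstruction, same size as comp_in
struct_out = struct_ae.decode([x_struct])  # reconstruction, same size as struct_in
```

The point of the sketch is the shape of the computation: each descriptor type passes through its own bottleneck of width one, and only the pair (x_comp, x_struct) is visible to the ME regression branch.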

In addition, the two encoded variables derived from the composition and structure descriptors are associated with the lithium ions’ ME using a network structure, as illustrated in Fig. 1. The loss function is defined by Eq. (3), which also considers the MSE.

$$\begin{array}{c}{Loss}_{ME}= \frac{1}{n}\sum\limits_{i=1}^{n}{\left({Pred}_{i}-{True}_{i}\right)}^{2}.\end{array}$$
(3)

The loss function for the entire model is defined as the sum of Losscomp, Lossstr, and LossME, where only LossME is multiplied by the weight factor W (Eq. 4).

$$\begin{array}{c}{Loss}_{all}={Loss}_{comp}+{Loss}_{str}+W\times {Loss}_{ME}.\end{array}$$
(4)
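Equations (3) and (4) can be evaluated numerically as in the following sketch. The loss values and predictions used here are made-up toy numbers, and the function names are assumptions for illustration.

```python
def mse(pred, true):
    """Eq. (3): mean squared error between predicted and true ME."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

def total_loss(loss_comp, loss_str, loss_me, w):
    """Eq. (4): only the ME term is weighted by W."""
    return loss_comp + loss_str + w * loss_me

toy_loss_me = mse([0.2, 0.6], [0.4, 0.2])  # (0.04 + 0.16) / 2 = 0.1
toy_totals = {w: total_loss(0.05, 0.03, toy_loss_me, w) for w in (0, 0.01, 0.1, 1)}
```

With W = 0 the ME term vanishes and the model reduces to two independent reconstruction problems.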

The weight factor, W, controls the priority assigned to ME prediction. A larger W prioritizes the accuracy of ME regression, whereas a smaller W prioritizes the recovery of compositional and structural descriptors. We examined W = 0, 0.01, 0.1, and 1 to balance the accuracy of ME prediction with the reconstruction of input descriptors. The batch size, learning rate, and L2 penalty were treated as hyperparameters, and a grid search was conducted to determine the combination that minimized Lossall. The model was optimized using the Adam optimizer23, and training was stopped when Lossall had not improved for 80 consecutive optimizations. Hyperparameter tuning resulted in a batch size of 16, a learning rate of 0.001, and an L2 penalty of 10⁻⁶. To train the materials map autoencoder, 80% of the data were randomly selected and assigned to the training set, and the remaining 20% were used as test data to verify the generalizability of the model.
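The random 80/20 split and the stopping rule can be sketched as below. Two assumptions are made here: the "80 consecutive optimizations" are treated as 80 consecutive loss evaluations without improvement, and the test-set size is obtained by rounding; neither detail is stated in the text. The function names, the seed, and the toy loss history are hypothetical.

```python
import random

def train_test_split(indices, test_fraction, seed):
    """Shuffle sample indices and split them into training and test lists."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def stopping_point(loss_history, patience):
    """Return (step, best_loss) at which training stops.

    Training stops at the first step where the loss has failed to improve
    for `patience` consecutive evaluations; otherwise at the last step.
    """
    best = float("inf")
    stale = 0
    for step, loss in enumerate(loss_history):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return step, best
    return len(loss_history) - 1, best

train_idx, test_idx = train_test_split(range(708), test_fraction=0.2, seed=0)

# Toy loss curve: improves for three evaluations, then stays flat.
toy_history = [1.0, 0.8, 0.7] + [0.7] * 80
stop_step, best_loss = stopping_point(toy_history, patience=80)
```

In practice the weights corresponding to the best loss would normally be kept rather than those at the stopping step, although the text does not say which were used.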

Results and discussion

Figure 2 shows the values of Losscomp, Lossstr, and LossME obtained for different values of W, demonstrating a trade-off between the reconstruction of the composition and structure descriptors (Losscomp and Lossstr) and the regression of the migration energy (LossME) as a function of W. We adopted a weight factor of W = 0.1 to balance the performance of both processes. Figure 3a–c depict diagnostic plots of the test data for composition, structure, and ME regression predictions at W = 0.1. Flattened values for the compositional (1782 dimensions) and structural (858 dimensions) descriptors are displayed in Fig. 3a,b. The coefficient of determination, R2 score, exceeded 0.8, indicating reasonably accurate reconstruction of both descriptor types. Figure 3d–f compare the input and output descriptors for the three samples that exhibit the largest loss values (Lossall), confirming that the rough shapes of the descriptors were reconstructed. On the other hand, the ME regression (Fig. 3c) exhibits a very low R2 score of 0.29, which stems from the insufficient predictive power of the original compositional and structural descriptors, as reported in our previous paper19.
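For reference, the coefficient of determination on flattened descriptors can be computed as in this minimal sketch. It uses the standard definition R2 = 1 − SSres/SStot; the helper names and toy values are assumptions, and the numbers are unrelated to the results in Fig. 3.

```python
def flatten(rows):
    """Concatenate all descriptor values of all samples into one list."""
    return [v for row in rows for v in row]

def r2_score(true, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot

toy_true = flatten([[0.0, 1.0], [2.0, 3.0]])
toy_pred = flatten([[0.0, 1.0], [2.0, 2.0]])
toy_r2 = r2_score(toy_true, toy_pred)  # SS_res = 1, SS_tot = 5, so R2 = 0.8
```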

Figure 2

Loss functions of (a) compositional descriptors, (b) structural descriptors, and (c) ME regression, represented as functions of the weight parameter W. This parameter governs the trade-off in performance between composition/structure reconstruction and ME prediction.

Figure 3

Reconstruction evaluation of compositional/structural descriptors and ME prediction performance of the materials map autoencoder. Diagnostic plots of (a) compositional and (b) structural descriptors, as well as (c) ME predictions, along with (d–f) three examples of input and output histogram descriptors that correspond to the three largest reconstruction losses. Compositional descriptors are represented by numbers 0–1782, while structural descriptors are represented by numbers 1783–2640.

Figure 4 shows the constructed materials map, with the horizontal axis obtained by compressing the compositional descriptor (1782 dimensions) to one dimension, and the vertical axis likewise obtained by compressing the structural descriptor (858 dimensions) to one dimension. Because each axis was produced by an autoencoder, the axes have no units or clear chemical interpretation. The colors of the data points in the figure correspond to the MEs of lithium ions, with larger and smaller values represented in dark red and dark blue, respectively. Note that battery materials require high lithium-ion conductivity, which corresponds to a low ME. The encoded values for the compositional and structural descriptors, and the calculated and predicted ME for all 708 materials, are listed in the Supporting Information. In the figure, crowded regions indicate areas where many materials with similar compositions and structures have been reported, and where the material search has progressed relatively far. By contrast, empty regions correspond to unknown materials for which no such studies have been conducted. These areas are considered risky in the context of material exploration owing to the high possibility that the materials cannot be synthesized. Hence, the sparsely populated areas might be promising in terms of both synthesizability and discovery of unknown materials. In Fig. 4, the colors shift from red to blue (corresponding to a decrease in ME) as the value of the structural coding increases, indicating that ME is primarily linked to structural information. Appropriate control of the crystal structure is therefore crucial in obtaining highly conductive lithium-ion materials. As the compositional coding increased, the distribution became narrower, concentrating in a range from −2 to 2 on the structural axis. 
Two possible reasons are suggested: (1) the range of possible crystal structures is limited for a given chemical composition, so that such combinations of composition and structure are not possible; or (2) vacancies in the map correspond to materials that researchers have not yet investigated. In either case, the materials map presented in Fig. 4 may be useful for summarizing reported research, visualizing unknown material areas, and providing a bird’s-eye view for materials researchers. In particular, the inclusion of an ME regression component in the autoencoder shown in Fig. 1 is beneficial in the search for high-ionic-conductivity materials, as the user can roughly evaluate the ME even at coordinates of the materials map where no data are plotted. Further details are discussed below.
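One simple way to quantify "crowded", "sparse", and "empty" regions of such a map is to count the known materials within a fixed radius of a query coordinate. The sketch below is an illustrative assumption, not a procedure described in this study; the radius, the function name, and the toy coordinates are hypothetical.

```python
import math

def neighbours_within(query, points, radius):
    """Count map points whose Euclidean distance to `query` is at most `radius`."""
    return sum(1 for p in points if math.dist(query, p) <= radius)

# Toy (x_comp, x_struct) coordinates: a small cluster plus one isolated material.
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
crowded_count = neighbours_within((0.0, 0.0), toy_points, radius=0.5)  # inside the cluster
sparse_count = neighbours_within((3.0, 3.2), toy_points, radius=0.5)   # near one material
empty_count = neighbours_within((2.0, 2.0), toy_points, radius=0.5)    # no neighbours
```

Because the encoded axes have no physical units, any radius chosen this way is only meaningful relative to the overall spread of the data on the map.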

Figure 4

Map for Li-ion conductive materials in the Li–O system. The horizontal and vertical axes correspond to the compressed compositional and structural descriptors, respectively, achieved through the autoencoder approach.

Although analyzing the significance of variables encoded by an autoencoder is generally challenging, we investigated factors affecting the composition and structure axes using a trial-and-error approach. The relationship between composition, structure, and high Li-ion conductivity was also considered. As a result, we infer that the lithium-to-oxygen ratio is one of the factors affecting the compositional axis. In Fig. 5, the molar ratios of (a) Li and (b) O in the composition are plotted as functions of the encoded value of the compositional descriptor xcomp. At xcomp > ~1.0, the molar ratio of Li tends to be < ~0.3, whereas at xcomp < ~1.0, it ranges from 0 to 0.6 (Fig. 5a). Similarly, the molar ratio of oxygen tends to be > 0.55 at xcomp < ~1.0 (Fig. 5b), except for compounds whose molar ratio of oxygen was < 0.3. Thus, the compositional feature at xcomp > ~1.0 appears to reflect relatively low Li and high O concentrations, which suggests that the proportions of lithium and oxygen are factors associated with ME. This interpretation is consistent with the connection between xcomp and ME in the neural network model illustrated in Fig. 1. However, the correlation between xcomp and the ratio of Li and/or O is rather poor even at xcomp > 1; therefore, numerous other composition-related parameters must be intricately involved in the encoded value of xcomp.

Figure 5

Correlation between encoded compositional variables and (a) Li concentration and (b) O concentration. Relationship between encoded structural variables and (c) Li concentration and (d) O concentration.

The relationships between the structural axis and the molar ratios of lithium and oxygen in the composition are displayed in Fig. 5c,d, respectively. A more distinct dependence on the molar ratios of Li and O is evident for xstruct than for xcomp. Specifically, the molar ratios of Li and O are scattered at less than 0.3 and greater than 0.5, respectively, at −5 < xstruct < 0. However, an increase in the lithium proportion and a decrease in the oxygen proportion are clearly visible at xstruct > 0. This suggests that the structural axis correlates with the lithium and oxygen concentrations in the lattice at xstruct > 0. We infer that, when xstruct > 0, the compositional information on the molar ratios of Li and O was indirectly extracted from the radial distribution functions (RDFs) of Li–O, Li–Li, etc., even though these descriptors are classified as structural. In addition, regions exceeding zero on the structural axis represent clusters of materials with relatively small ME (blue distribution). Therefore, we suggest that a dense distribution of lithium and a low oxygen concentration in the lattice are advantageous for achieving high ionic conduction.

The Li–O materials used in this study include representative host structures for fast Li-ion conductors, such as the perovskite-type24,25,26, garnet-type27,28,29,30,31, and NASICON-type structures32,33,34,35,36,37,38. Specific ID numbers from the Materials Project database (mp-id), chemical compositions, ME values, and encoded values for composition and structure descriptors are listed in Table 1. These data points are plotted on the materials map in Fig. 6a, where the materials of each structure type are relatively clustered. However, the garnet- and NASICON-type structures are located far from each other. Although differences in crystal structure (type) are expected to appear along the structural axis (vertical axis) of the materials map, materials with the same structure type exhibit a larger variation on the structural axis than on the compositional axis. The distributions of the garnet and NASICON structures exhibit similar values on the structural axis, which, taken at face value, would mean that they are distinguished solely by composition rather than structure. One possible reason for such a counterintuitive conclusion may be the strong dependence of the structural axis on the local structure of Li and O ions, as suggested previously. In other words, the structural axis is expected to be strongly influenced by the Li coordination state and arrangement information, rather than the similarity of the host structure. For example, a significant difference in structural values was observed between Li7La3Zr2O12 and Li7La3Sn2O12, both of which are garnet-type materials. Specifically, the lithium ions in Li7La3Zr2O12 are preferentially distributed in octahedral sites39, whereas those in Li7La3Sn2O12 as well as other compositions are distributed in tetrahedral sites31. This suggests that the structural axis is not dependent on the host structure, as classified by the perovskite, garnet, and NASICON types, but on the local structure around Li. 
This is likely because both the structural and compositional axes are associated with the target-variable ME in the neural network model, as shown in Fig. 1.

Table 1 Details of perovskite, garnet, and NASICON materials in this dataset.
Figure 6

Typical Li-ion conductive materials (perovskite-, garnet-, and NASICON-type materials) visualized on a materials map. Distribution of representative Li-ion conductive materials in the map at (a) W = 0.1 and (b) W = 0. Note that there are no discernible relationships between compositional/structural descriptors and ME values when W = 0.

To confirm this, we reconstructed the material map by setting the weight of ME loss (LossME) to zero; i.e., W = 0. Consequently, the encoded variables became independent of ME. Figure 6b presents the resulting material map. Apparently, materials with Garnet, NASICON, and Perovskite structures are plotted as clusters and exhibit different structural encoded values in descending order of the structure axis (Fig. 6b). This suggests that structural descriptors (RDF, ADF, and Voronoi polyhedron) can capture the grouping of crystal host structures, even without considering Li-ion conductivity.

Figure 7 presents a heat map of the ME predicted for arbitrary coordinates on the materials map using the neural network that maps the two encoded variables to the ME (see Fig. 1). Although the predictive performance for ME is low (see Fig. 3c) and, in some areas, the predicted ME values do not match the calculated values shown as data points, the map generally matches the distribution trend of the calculated values. The region around xcomp ~ 0.0 and 0 < xstruct < 4 on the materials map corresponds to lower predicted ME values, indicating higher Li-ion conductivity. Furthermore, in this region, materials registered in the database are sparsely distributed, and the possibility of synthesizing real materials is high. As shown in Fig. 5, the area is characterized by high proportions of lithium in the compounds. For example, this area prominently features garnet-type solid electrolytes, which have a higher molar ratio of Li ions than NASICON- or perovskite-type solid electrolytes. Therefore, the concentration of Li ions may be an efficient guideline for the optimization of garnet compounds by metal substitution40. Furthermore, several undiscovered compounds may be present in this area. Unfortunately, specifying the composition and structure for given coordinates (xcomp, xstruct) on the materials map is difficult in principle, as an accurate conversion from descriptors to crystal structure remains infeasible. However, a materials map may inspire experienced researchers to examine the distribution of materials around the coordinates of interest. For example, Fig. 8 shows the compositional distribution of materials in the proximity of the relevant area, which may support the researchers’ intuition. In addition, descriptors can be restored by decoding the input coordinates on the materials map. Although the crystal structure cannot be directly restored from descriptor information, it is possible to obtain information that can be used as material design guidelines. 
In summary, these findings provide valuable insights for material researchers, enabling them to explore the distribution of materials, predict properties, and recover descriptors using the material map.
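The two operations described above, evaluating a predicted ME over a grid of map coordinates and listing the known materials nearest to a coordinate of interest, can be sketched as follows. The predictor here is a hypothetical stand-in with an arbitrary functional form, not the trained network of this study, and the material labels and coordinates are invented placeholders.

```python
import math

def toy_predict_me(x_comp, x_struct):
    """Hypothetical stand-in for the trained ME branch (NOT the model in this study).

    It simply decreases with x_struct to give the grid a visible trend.
    """
    return 1.0 / (1.0 + math.exp(x_struct))

def me_grid(x_values, y_values, predictor):
    """Evaluate the predictor on a grid; rows follow y_values, columns follow x_values."""
    return [[predictor(x, y) for x in x_values] for y in y_values]

def nearest_materials(target, materials, k):
    """Return the k entries of (label, (x_comp, x_struct)) closest to `target`."""
    return sorted(materials, key=lambda m: math.dist(target, m[1]))[:k]

grid = me_grid([-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], toy_predict_me)

# Placeholder labels and coordinates, not entries from the dataset.
toy_materials = [("A", (0.0, 1.0)), ("B", (0.2, 3.5)), ("C", (2.0, -4.0))]
closest = nearest_materials((0.0, 2.0), toy_materials, k=2)
closest_labels = [label for label, _ in closest]
```

Listing the nearest known compositions in this way mirrors the enlarged view in Fig. 8, where the labels around a promising coordinate serve as starting points for a researcher's own judgment.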

Figure 7

Materials map with predicted ME distribution (Background color represents predicted ME values; data points are obtained from materials simulations).

Figure 8

Enlarged materials map highlighting areas with anticipated high lithium conductivity and limited research. Compositions are labeled in the vicinity of their respective data points.

Conclusion

In the field of materials informatics for inorganic materials, existing databases contain data on hundreds of thousands of crystal structures. Vigorous research is underway to predict material properties using machine learning, as well as to develop fast and efficient methods for optimizing functional materials. However, the vast space of materials not yet registered in these databases suggests the presence of numerous undiscovered materials with desirable properties. Furthermore, it is difficult to systematically search a database using conventional MI methods.

In this study, we designed a materials map to visualize the distribution of known materials by using compositional and structural axes as the basis for organizing the materials. The materials map offers a quick and comprehensive overview of the current state of material exploration. For instance, densely populated areas on the materials map indicate extensive research efforts, reducing the likelihood of discovering new materials. Conversely, sparsely populated areas suggest uncertainty in material synthesis. Additionally, we linked the structurally and compositionally encoded values to ME values, allowing us to visualize regions on the materials map where high ionic conductivity is expected. By analyzing both the density of data points and the predicted ionic conductivity values on the materials map, materials researchers can make informed decisions about their next exploration area, considering the expected results and associated risks. This approach has the potential to facilitate the discovery of materials with exceptional ionic conductivity. While other machine learning techniques, such as t-Distributed Stochastic Neighbor Embedding (t-SNE)41 and Principal Component Analysis (PCA)42, can reduce descriptor dimensionality, they face challenges in decoding descriptors from arbitrarily chosen encoded values and establishing connections between encoded values and objective variables, such as the migration energy in this context. In this regard, the autoencoder-based materials map we present here offers a distinct advantage. Unfortunately, although the autoencoder-derived materials map makes it feasible to specify encoded variables for optimized ionic conductivity, the reconstruction of crystal structures is technically difficult owing to the irreversibility between the structural data and their descriptors. 
We, therefore, believe that integrating the experience and knowledge of materials researchers may be necessary for overcoming this difficulty, and we anticipate that the 2D visualization of materials information will support researchers in intuitively understanding material distributions.