Introduction

Innovation in materials technology often begins with the discovery of materials. For example, the discovery of powerful permanent magnets and lithium battery materials has led to the emergence of modern, mass-produced electric vehicles, with a significant impact on our society. Two scenarios are possible for materials discovery. The first is the discovery of unknown functions in already known compounds. For this purpose, an experimental database of known compounds is searched using features representing the function. The features are chosen based on physical and/or empirical rules using information on constituent elements and crystal structures of compounds. Systematic first-principles calculations are sometimes performed to obtain features. The second scenario begins with discovering a compound as yet unreported by experiments, i.e., an as-yet-unknown compound. This is challenging since the chemical composition space of inorganic compounds with multiple elements and multiple crystal sites is vast. The space cannot be explored efficiently without a good strategy to narrow down the search space. Combining an experimental database with data-driven analysis of that database is a powerful approach. In this article, we focus on the second scenario, i.e., the discovery of as-yet-unknown compounds.

Currently, several inorganic compound databases are available, such as the Inorganic Crystal Structure Database (ICSD)1, in which approximately 250000 compounds are registered. The yearly trend of the number of unique compositions registered in ICSD is shown in Fig. 1. The counts cover only ternary and quaternary compounds consisting of multiple cations and a single anion with chemical compositions of integer ratios, which can be selected using the ANX formula in ICSD. Anions are taken from groups 15 (pnictogen), 16 (chalcogen), and 17 (halogen) in the periodic table. Cations are taken from the remaining groups, except for group 18 (noble gas) and hydrogen. These compounds include complex or pseudo-binary (-ternary) pnictogenides, chalcogenides, and halides. According to this rule, carbonates and silicates are included, but nitrates, phosphates, and sulfates are not. The number of ternary compounds composed of two cations and one anion registered to date is 5823. Similarly, the number of quaternary compounds consisting of three cations and one anion is 4897. Among them, 2574 (44%) of the ternary and 3428 (70%) of the quaternary compounds are oxides. The predominance of oxides is natural since they are relatively easy to find as natural minerals or to synthesize artificially. The bar chart shows that the annual increase in the number of registered compounds has saturated or declined. This trend suggests that the discovery of ternary compounds is becoming more difficult each year if the same traditional approach is continued. At the same time, given the high diversity of elemental combinations, there is a good chance of discovering new quaternary compounds, especially non-oxides.

Fig. 1: The yearly trend of the number of unique compositions registered in ICSD (2021 Ver.2) for ternary and quaternary ionic compounds (orange bars).

Only compounds reported to be experimentally synthesized, satisfying the charge-neutral condition, and having no partially occupied sites were adopted. Oxides are shown separately by green bars. Data-extraction procedures from ICSD to construct these figures are given in the Supplementary Information.

A chemically relevant composition (CRC) is a chemical composition that gives a stable or metastable compound under given thermodynamic conditions. Thermodynamically stable compounds lie on the convex hull of formation energies, while metastable compounds have formation energies slightly above the convex hull. It is typically not easy to estimate the convex hull of the formation energy experimentally for a given thermodynamic condition. Identifying stable and metastable compounds by experiments is time- and labor-intensive. On the other hand, the convex hull at zero temperature can be drawn from energetics obtained by systematic first principles calculations. Additional phonon and configurational calculations must be performed to incorporate temperature effects, which is possible but rather time-consuming. It should be noted, however, that estimating the formation energies of compounds is quite costly when their crystal structures are unknown, since the structures must be determined prior to the first principles calculations. If CRCs can be estimated prior to experiments or first principles calculations, the information is very useful for narrowing down the chemical compositions in the search for compounds.
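To make the convex-hull criterion concrete, the following minimal sketch computes the lower convex hull of formation energies along a single pseudo-binary line and the height of each phase above it. The energies, the function names, and the restriction to one compositional axis are assumptions made for illustration only; they are not taken from the works reviewed here.

```python
import numpy as np


def lower_hull(points):
    """Return the lower convex hull of (x, energy) points, ordered by x."""
    # Keep only the lowest energy at each composition x.
    lowest = {}
    for x, e in points:
        lowest[x] = min(e, lowest.get(x, e))
    hull = []
    for p in sorted(lowest.items()):
        # Drop the previous vertex while it lies on or above the segment
        # joining the vertex before it to the new point p.
        while len(hull) >= 2:
            (x1, e1), (x2, e2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - e1) - (e2 - e1) * (p[0] - x1)
            if cross > 0:
                break
            hull.pop()
        hull.append(p)
    return hull


def energy_above_hull(points):
    """Map each (x, energy) point to its height above the lower hull."""
    hull = lower_hull(points)
    hx = [x for x, _ in hull]
    he = [e for _, e in hull]
    return {(x, e): e - float(np.interp(x, hx, he)) for x, e in points}


# Hypothetical formation energies (eV/atom) along a pseudo-binary line,
# where x is the mole fraction of the second end member.
phases = [(0.0, 0.0), (0.25, -0.30), (0.5, -0.50), (0.75, -0.20), (1.0, 0.0)]
above = energy_above_hull(phases)
stable = [p for p, d in above.items() if abs(d) < 1e-9]
```

In this toy data set the phase at x = 0.75 lies above the hull and would be classified as metastable or unstable, depending on the energy tolerance one chooses to adopt.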

In the last decade, large databases of first principles calculations of inorganic compounds have been constructed and made available to many users2,3,4,5,6,7. Attempts to find CRCs by combining machine learning models with first principles data have been reported8,9,10,11,12. In ref. 8, a procedure was given to estimate the probability of a composition being a CRC using compositional similarity. In ref. 9, using a database of first principles formation energies, a machine-learning model was constructed only with chemical compositions to predict yet-unknown CRCs. The present authors used a list of compounds registered in ICSD as training data and adopted methods used in recommender systems for the discovery of CRCs13,14. A recommender system15,16,17 is a type of information filtering system that is increasingly popular in a variety of fields, for example, E-commerce and social networking services. It attempts to estimate personalized recommendation scores of items for users based on their history of purchases or ratings. When this method is used for materials discovery, the purchase history corresponds to the experimental database of compounds. The recommendation score is then related to the probability of finding a CRC. In our studies13,14, two types of algorithms were used to estimate the recommendation scores. One is a descriptor-based recommender system with features specific to chemical elements. The other is a tensor-based recommender system. They are explained in the following chapters, together with some examples of the successful synthesis of as-yet-unreported compounds. In the last chapter, we describe the construction of a recommender system for experimental processing conditions of compounds based on a parallel experimental data-set collected in-house. Synthesis condition data were fed into a tensor-based recommender system to evaluate recommendation scores for unexperimented conditions.

Compositional descriptor-based recommender system

First, a recommender system for CRCs was constructed using compositional descriptors13. The descriptors are built from 22 elemental features, such as the atomic number and Pauling electronegativity, which can be classified into (1) intrinsic quantities of elements, (2) heuristic quantities of elements, and (3) physical properties of elemental substances. The compositional space was constructed from the means, standard deviations, and covariances of these 22 elemental features, weighted by the concentrations of the constituent chemical elements. Here, the compositional space was restricted to ionic compounds with integer-valency cations and anions. Grid points were placed in the compositional space at integer composition ratios. The points corresponding to compounds registered in the ICSD were designated as ‘entries’. The rest of the grid points were treated as ‘no-entries’. The data were then supplied to machine learning for binary classification, in which the response takes two distinct values, y = 1 and 0. A score of y = 1 was given to ‘entries’, and y = 0 to ‘no-entries’. Although a composition with y = 1 can be regarded as a CRC, y = 0 does not necessarily mean that the composition is not a CRC. Synthesis experiments at a ‘no-entry’ composition may simply have been insufficient. It is also possible that the composition is a CRC but the corresponding compound is difficult to synthesize experimentally.
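As a rough illustration of how such a descriptor could be assembled, the sketch below computes concentration-weighted means, standard deviations, and covariances for a composition. Only two of the elemental features are used (atomic number and Pauling electronegativity), and the function name, the feature table layout, and the ordering of the output vector are assumptions for illustration, not the implementation of ref. 13.

```python
import numpy as np

# Illustrative feature table: [atomic number, Pauling electronegativity].
# The study used 22 elemental features; two are shown here for brevity.
ELEMENT_FEATURES = {
    "Li": [3.0, 0.98],
    "Ge": [32.0, 2.01],
    "P": [15.0, 2.19],
    "O": [8.0, 3.44],
}


def compositional_descriptor(composition):
    """Weighted means, standard deviations, and covariances of features.

    `composition` maps element symbols to amounts, e.g. {"Li": 6, "O": 3}.
    """
    elements = list(composition)
    amounts = np.array([composition[e] for e in elements], dtype=float)
    w = amounts / amounts.sum()                      # concentrations
    x = np.array([ELEMENT_FEATURES[e] for e in elements])
    mean = w @ x                                     # weighted mean per feature
    centered = x - mean
    cov = (centered * w[:, None]).T @ centered       # weighted covariance
    std = np.sqrt(np.diag(cov))
    upper = np.triu_indices(x.shape[1], k=1)         # distinct feature pairs
    return np.concatenate([mean, std, cov[upper]])


descriptor = compositional_descriptor({"Li": 6, "Ge": 2, "P": 4, "O": 17})
```

Each grid point in the compositional space would then be represented by one such vector and labeled y = 1 or y = 0 before being passed to a classifier.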

After training the classifiers, a recommendation score, ŷ, was estimated for approximately 1.3 million pseudo-binary and approximately 3.8 million pseudo-ternary compositions that were not registered in ICSD. The compositions were then arranged in descending order of recommendation score. To verify whether the chemical compositions with high recommendation scores correspond to currently unknown CRCs, we examined whether they were listed in another database, ICDD-PDF18. As there was a large overlap between the compositions registered in ICSD and ICDD-PDF, the compositions not included in ICSD were extracted from ICDD-PDF. We then examined whether chemical compositions with high recommendation scores were included in this extracted set. Figure 2a shows the cumulative numbers of verified CRCs for pseudo-binary compositions as a function of the ranking of the recommendation scores. The results of three classifiers, i.e., random forest, gradient boosting, and logistic regression, are much better than those of random sampling in all cases, indicating that the approach is helpful for discovering currently unknown CRCs that are not present in the training database. Among the three classifiers, the random forest method performed the best. The histogram of the number of verified CRCs by the random forest method is shown in Fig. 2b. The discovery rate, defined as the fraction of verified CRCs among the candidate CRCs, is 18% for the top 1000 and 15% for the top 3000 candidates. The discovery rate for the top 1000 is 60 times greater than that of random sampling, 0.29%. It should be noted, however, that the discovery rate evaluated in this way is only a lower limit, since unknown compounds not registered in ICDD-PDF cannot be counted. First principles calculations can be used to examine whether the candidate CRCs are on the convex hull of formation energies. This will be discussed in the next chapter with Fig. 6.
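The verification procedure amounts to counting how many of the top-ranked candidates appear in an independent database. A minimal sketch follows; the composition labels and function names are hypothetical placeholders, and only the two quoted rates (18% and 0.29%) come from the text.

```python
from itertools import accumulate


def cumulative_hits(ranked, verified):
    """Cumulative number of verified compositions down a ranked list."""
    return list(accumulate(1 if c in verified else 0 for c in ranked))


def discovery_rate(ranked, verified, top_n):
    """Fraction of the top_n ranked candidates found in the verification set."""
    top = ranked[:top_n]
    return sum(c in verified for c in top) / len(top)


# Hypothetical candidates, already sorted by descending recommendation score.
ranked = ["A2BX3", "AB2X4", "A3BX4", "ABX2", "A2B3X6"]
# Hypothetical compositions found in an independent database.
verified = {"A2BX3", "A3BX4"}

hits = cumulative_hits(ranked, verified)
rate_top3 = discovery_rate(ranked, verified, 3)

# Enrichment over random sampling, using the rates quoted in the text.
enrichment = 0.18 / 0.0029
```

With the quoted rates the enrichment evaluates to about 62, consistent with the roughly 60-fold improvement stated above.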

Fig. 2: The numbers of verified CRCs for pseudo-binary compositions with the ranking of the recommendation scores.

a The cumulative numbers of verified CRCs by three classifiers, i.e., random forest, gradient boosting, and logistic regression, are compared with that by random sampling. b The histogram by the random forest method, i.e., the differential form of a for the random forest method.

Experimental efforts were carried out in collaboration with synthesis experts to synthesize unknown compounds with high recommendation scores19. Figure 3 shows the Li2O-GeO2-P2O5 pseudo-ternary system with chemical compositions registered in three databases, i.e., ICSD, ICDD-PDF, and Springer Materials (SpMat)20. Chemical compositions of candidate CRCs with high recommendation scores that are not registered in any of the databases are numbered according to their recommendation scores. Synthesis experiments were performed at the target compositions by firing the mixed starting powders in air. The products were examined by powder X-ray diffraction. At composition 6 in Fig. 3, Li6Ge2P4O17, the diffraction patterns could not be assigned to any known compound. After optimization of the synthesis conditions and detailed characterization, a phase having the composition Li6Ge2P4O17 was identified. The discovered phase showed a crystal structure different from those of any known compounds in the three databases.

Fig. 3: Candidate CRCs on the Li2O-GeO2-P2O5 pseudo-ternary system with chemical compositions registered in three databases, i.e., ICSD, ICDD-PDF, and SpMat.

Adapted from ref. 19 with small modifications.

Another set of synthesis experiments was carried out for the AlN-Si3N4-LaN pseudo-ternary system21. Fifteen compositions with high recommendation scores were selected as candidate CRCs. Synthesis experiments were performed at the target compositions by firing the mixed starting powders at 1900 °C under 1.0 MPa N2. A pseudo-ternary nitride, La4Si3AlN9, with a crystal structure different from those of any known compounds was successfully identified. An as-yet-unknown variant (isomorphous substituent) of a known compound was also discovered at the composition La7Si6N15.

Tensor-based recommender system

Unlike the system described in the previous chapter, the recommender system in this chapter does not use any descriptors. The CRCs registered in the ICSD database were used as training data. They were stored in a tensor, which was decomposed assuming that the tensor has a low-rank structure. The recommendation scores for unknown data were then evaluated. A simplified scheme of the matrix-based recommender system often used in E-commerce is shown in Fig. 4a. The vertical axis corresponds to a list of customers. The history of each customer is stored along the horizontal axis as records of purchased items. A low-rank structure of the matrix means that customers with similar preferences are interested in purchasing similar items. The matrix in E-commerce contains an enormous amount of data but is typically sparse. Combined with an appropriate decomposition technique, this type of recommender system is known to be very helpful for both customers and E-shops.
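The low-rank idea behind the matrix-based scheme can be sketched with a truncated singular value decomposition of a small purchase matrix. The toy data, the choice of rank, and the treatment of missing records as zeros are simplifying assumptions for illustration; production recommender systems typically use decomposition techniques that handle missing entries more carefully.

```python
import numpy as np


def low_rank_scores(ratings, rank):
    """Rank-`rank` approximation of a customer-item matrix via truncated SVD."""
    u, s, vt = np.linalg.svd(ratings, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


# Hypothetical purchase records: rows are customers, columns are items,
# 1 = purchased, 0 = no record (treated as zero here for simplicity).
purchases = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

scores = low_rank_scores(purchases, rank=2)

# Rank the items without a purchase record for customer 0 by their score.
unseen = np.flatnonzero(purchases[0] == 0)
recommended = unseen[np.argsort(-scores[0, unseen])]
```

The reconstructed values at positions with no record play the role of recommendation scores: items favored by customers with similar histories receive higher values.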

Fig. 4: Schematic illustration of matrix- and tensor-based recommender systems.

a A simplified scheme of the matrix-based recommender system used in E-commerce. b An example of a 3rd-order tensor expressing binary compounds. c Using the Tucker decomposition method, a large tensor can be approximated by a product of a small core tensor and three matrices. Adapted from ref. 14 with small modifications.

In the work reported in ref. 14, the compositional space was restricted to ionic compounds composed of two, three, or four cations {A, B, C, D} and one anion {X} having integer valency. The number of candidates was approximately 7.4 million for ternary AaBbXx with max(a, b, x) = 8, approximately 1.2 billion for quaternary AaBbCcXx with max(a, b, c, x) = 20, and approximately 23 billion for quinary AaBbCcDdXx with max(a, b, c, d, x) = 20. The numbers of training data in ICSD were 9313, 7742, and 1321 for the ternary, quaternary, and quinary systems, respectively. Figure 4b shows an example of a 3rd-order tensor expressing binary compounds. The three axes are the cation type, the anion type, and the set of integers specifying the chemical composition. Using the Tucker decomposition method22, the 3rd-order tensor can be approximated by a product of a core tensor and three matrices, as shown in Fig. 4c. For verification, compositions unregistered in ICSD but included in two other databases, ICDD-PDF and SpMat, were used. Figure 5 shows the cumulative numbers of verified CRCs as a function of the ranking of the recommendation scores for the ternary, quaternary, and quinary systems. The discovery rates for the top 100 candidates were 59%, 52%, and 15% for the ternary, quaternary, and quinary systems, respectively. The lower discovery rate for the quinary system can be ascribed to the smaller number of training data compared with the ternary and quaternary systems. These results confirm the high discovery rate of the present tensor-based recommender system, which does not use any descriptors.
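For readers unfamiliar with the Tucker form, the sketch below computes one simple Tucker approximation, the truncated higher-order SVD, using only NumPy. This is an assumption-laden illustration: the fitting algorithm actually used in ref. 14 is not described here and may differ, and the toy tensor, its axis sizes, and the chosen ranks are hypothetical.

```python
import numpy as np


def unfold(tensor, mode):
    """Matricize `tensor` so that axis `mode` indexes the rows."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` (shape J x I_mode) along axis `mode`."""
    moved = np.moveaxis(tensor, mode, -1)
    return np.moveaxis(moved @ matrix.T, -1, mode)


def hosvd(tensor, ranks):
    """Truncated higher-order SVD: returns (core tensor, factor matrices)."""
    factors = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :r])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)
    return core, factors


def reconstruct(core, factors):
    """Expand the core tensor back to full size with the factor matrices."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_product(out, u, mode)
    return out


# Hypothetical 3rd-order tensor: (cation type, anion type, composition label),
# with 1 for a registered compound and 0 otherwise.
entries = np.zeros((3, 2, 4))
entries[0, 0, 0] = entries[1, 0, 0] = entries[1, 1, 2] = entries[2, 1, 2] = 1.0

core, factors = hosvd(entries, ranks=(2, 2, 2))
scores = reconstruct(core, factors)   # values at zero entries act as scores
```

The core tensor is much smaller than the original tensor, and the reconstructed values at unregistered positions can be sorted to rank candidate compositions.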

Fig. 5: The cumulative numbers of verified CRCs with the ranking of the recommendation scores for ternary, quaternary and quinary systems.

a The top 100 candidates. b The top 3000 candidates. Adapted from ref. 14 with small modifications.

A set of first principles calculations was made to examine if the candidate CRCs are on the convex hull of formation energies. Pseudo-binary systems that contain candidate CRCs with the top 27 recommendation scores were selected. First principles calculations were performed using the plane-wave basis projector augmented wave (PAW) method23,24 as implemented in the VASP code25,26. Since crystal structures were scarcely known a priori, calculations were exhaustively made, adopting all possible prototype structures registered in ICSD. Lowest energy structures were then used to draw the convex hull. A part of the results for pseudo-binary oxides is shown in Fig. 6 together with discovered CRCs and their recommendation scores in parentheses. Known CRCs registered in three databases are also plotted. As described in ref. 14, among 27 candidate CRCs, 23 compositions (85%) were found on the convex hull. Recalling that the 23 CRCs are not registered in any of the three databases, this result demonstrates the high performance of the present recommender system.

Fig. 6: The convex hull of the formation energy by the DFT calculations for pseudo-binary-oxide systems containing candidate CRCs.

Closed circles (green) denote compounds on the convex hull. Closed triangles (blue) and squares (violet) denote CRCs registered in ICSD and ICDD-PDF + SpMat, respectively. Candidate compositions are given with recommendation scores in parentheses. Adapted from ref. 14 with modifications.

Synthesis condition recommender system

Methods to estimate recommendation scores for unknown CRCs have been described in the previous chapters. Some compounds were indeed discovered experimentally at the CRCs proposed by the recommendation. However, we also found that synthesis experiments at the proposed CRCs were often unsuccessful. Since the predictive performance was well confirmed as described in the previous chapters, the failures are likely attributable to a lack of knowledge about successful synthesis conditions. It is natural that a yet-undiscovered compound is difficult to synthesize. Experts in experimental chemistry attempt to synthesize compounds based on their experience and knowledge of similar compounds. If a database of various synthesis conditions for diverse compounds is available, a computer may be able to suggest synthesis conditions efficiently through machine learning, in place of a human expert. Synthesis conditions can be collected through text mining of the scientific literature27,28,29,30. Such databases have been constructed recently and may be useful for finding successful synthesis conditions. While they provide valuable information, there is a major problem when they are applied to machine learning: a data-set obtained from the literature is strongly biased toward successful synthesis results, whereas a good balance of successful and unsuccessful results is preferred for reliable machine learning. For this purpose, it is desirable to develop equipment that can automatically perform a large number of synthesis experiments in parallel without human bias, which is called combinatorial or parallel synthesis equipment.

Automated experimental equipment for constructing such a database has been reported recently31,32,33,34,35. The present authors reported parallel synthesis experiments to prepare precursor powders of various inorganic oxides in four different ways, i.e., solid-state reaction, polymerized complex, cyclic ether sol–gel, and spray coprecipitation31. In the work of ref. 32, pseudo-binary inorganic oxide compositions were targeted, and parallel synthesis experiments were performed by a polymerized complex method. There were 28C2 × 27 = 10206 combinations of two cations from 28 elements and 27 compositional ratios, as shown in Fig. 7a. Among them, 1139 compositions were known to be CRCs and registered in ICDD-PDF. For the remaining 9067 compositions, it was unknown whether they were CRCs. Some of them may be unstable. Others may be difficult to synthesize experimentally and require special synthesis conditions. Since the synthesis of pseudo-binary inorganic oxides has a long history of in-depth investigation, the chance of discovering yet-to-be-found oxides may be quite low. Therefore, to discover compounds, it is necessary to employ a much more efficient method than random trials over chemical compositions and synthesis conditions.
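The size of the chemistry space quoted above follows from a simple count, reproduced here as a short check. The variable names are illustrative; all numbers are taken from the text.

```python
from math import comb

n_elements = 28      # cation elements considered
n_ratios = 27        # compositional ratios per cation pair
n_known = 1139       # compositions registered in ICDD-PDF as CRCs

n_pairs = comb(n_elements, 2)            # unordered pairs of two cations
n_compositions = n_pairs * n_ratios      # target compositions in total
n_unknown = n_compositions - n_known     # compositions of unknown status
```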

Fig. 7: A synthesis-condition recommender system.

a The chemistry space and the synthesis condition space. b A schematic of the Tucker decomposition of the synthesis condition tensor. c Results of additional synthesis experiments for the top 300 synthesis conditions. The numbers of successful (orange) and unsuccessful (blue) results are shown as a function of the recommendation score. d The fractions of the successful synthesis conditions, i.e., success rate, for each bin of the recommendation score in c. Adapted from ref. 32 with small modifications.

Here, the synthesis condition space was composed of 66150 conditions. At each target composition, a maximum of five different synthesis temperatures, ranging from 873 to 1273 K, were adopted. Three starting materials were used for each of V and Mo; for the remaining cations, one starting material was used per cation. To obtain training data for the machine learning, synthesis experiments were performed under 1542 conditions in total, which included 600 conditions at compositions where the presence of a CRC was unknown and 942 conditions at known CRCs. Both sets were randomly selected. As shown in Fig. 7a, at the known CRCs, the target compound was successfully synthesized under 499 of the 942 conditions. On the other hand, at the compositions of unknown status, not a single condition out of 600 was successful. The results of the synthesis experiments were stored in a fourth-order tensor with four axes, namely, ‘starting material #1’, ‘starting material #2’, ‘cation mixing ratio’, and ‘firing temperature’, as shown in Fig. 7b. The tensor was then subjected to the Tucker decomposition, and recommendation scores for unexperimented conditions were estimated. To verify the predictive performance of the recommender system, additional synthesis experiments were conducted at the top 300 synthesis conditions of unexperimented compositions. The histogram in Fig. 7c displays the numbers of successful and unsuccessful results as a function of the recommendation score. The fractions of successful synthesis conditions, i.e., the success rates, for each bin of the recommendation score are shown in Fig. 7d. The success rate was about 20% when the recommendation score was 0.2 and increased in proportion to the recommendation score, reaching about 50% when the recommendation score was 0.5. In this way, the usefulness of the recommendation score for estimating the success rate of synthesis conditions was demonstrated.
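The text gives the total of 66150 conditions without its breakdown. The sketch below shows one reading that reproduces this number, assuming that every pair of starting materials belonging to two different cations is combined with all 27 mixing ratios and five firing temperatures. This breakdown is an inference from the stated figures, not a description taken from ref. 32, and the placeholder cation labels are hypothetical.

```python
from itertools import combinations

n_elements = 28
n_ratios = 27
n_temperatures = 5
# Cations with three starting materials each; all others have one.
multi_precursor = {"V": 3, "Mo": 3}

# Build a list of (cation, precursor index) starting materials.
starting_materials = [
    (f"cation_{i}", 0) for i in range(n_elements - len(multi_precursor))
]
for cation, count in multi_precursor.items():
    starting_materials += [(cation, k) for k in range(count)]

# Pair only starting materials that belong to two different cations.
pairs = [
    (a, b) for a, b in combinations(starting_materials, 2) if a[0] != b[0]
]
n_conditions = len(pairs) * n_ratios * n_temperatures

# Success rate of the randomly selected conditions at known CRCs.
known_success_rate = 499 / 942
```

Under this reading there are 32 starting materials and 490 admissible pairs, and the baseline success rate at known CRCs is about 53%.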

The top 300 synthesis conditions included 135 conditions for 75 unknown compositions. Synthesis experiments under the targeted conditions successfully found two as-yet-unknown pseudo-binary oxides: La4V2O11 and La7Sb3O18. To identify their crystal structures, the structures were first determined using the EXPO2014 code37, and the powder X-ray diffraction profiles were then analyzed by the Rietveld method using the RIETAN-FP program36. La4V2O11 and La7Sb3O18 were found to be isostructural with the known compounds γ-Bi4V2O11 and La7Ru3O18, respectively. Although the discovery of inorganic pseudo-binary oxides was thought to be difficult, two as-yet-unreported compounds were successfully synthesized using the recommender system for process conditions.

Conclusion and outlook

The recommender system is increasingly popular in a variety of fields in our society, such as E-commerce and social networking services. Based on a database, it attempts to suggest to an individual user what products to buy, what movies to watch, and so on. The method can be applied to materials discovery using an experimental database. The recommendation score can be related to the probability of finding the most pertinent chemical composition, synthesis conditions, etc. In this article, we described such studies on recommender systems for materials discovery.

First, studies on the discovery of as-yet-unknown compounds using recommender systems were reviewed. The training dataset was taken from the compounds registered in ICSD. Two kinds of techniques were used to estimate recommendation scores. One method used compositional descriptors made up of elemental features. The other method used a tensor decomposition technique. The predictive performance for currently unknown CRCs was evaluated by examining their presence in other databases (ICDD-PDF and SpMat), from which data overlapping with ICSD were omitted. Synthesis experiments were then performed according to the recommendations. Two pseudo-ternary compounds with previously unknown crystal structures, Li6Ge2P4O17 and La4Si3AlN9, were successfully discovered.

Next, a synthesis-condition recommender system was constructed by machine learning on a parallel experimental data-set collected in-house using a polymerized complex method. Recommendation scores for unexperimented conditions were then evaluated. Additional synthesis experiments were conducted at the top 300 synthesis conditions of unexperimented compositions to verify the predictive performance of the recommender system. Although inorganic pseudo-binary oxides have historically been the subject of much research and discovering new compounds was thought to be difficult, two as-yet-unknown pseudo-binary oxides, La4V2O11 and La7Sb3O18, were successfully synthesized.

The high performance of the recommender systems for the discovery of CRCs and synthesis conditions was well demonstrated in these works. It is of interest to compare the advantages of the tensor-based and descriptor-based approaches. In general, the relative advantages depend on the nature of the problem and on the quality and quantity of the dataset. When many data are uniformly distributed in the search space, the tensor-based approach should be preferred. Otherwise, the descriptor-based approach helps avoid the so-called cold-start problem, which occurs when few known CRCs are available. In particular, when descriptors representing the target property (formation energy, synthesis condition, etc.) are clearly identified, the descriptor-based approach is worth adopting.

As for the synthesis condition recommender system, the data acquisition speed is rate-controlling. A breakthrough is expected to occur when the recommender system is combined with a high-speed and automated synthesis robot to improve the quality of the recommendation iteratively.

The use of recommender systems is still in its infancy. It will be important to consider their application to a variety of problems and data in materials science and technology.