## Introduction

The fast conduction of lithium (Li) ions in a solid is a phenomenon of significant scientific interest and technological importance. The room-temperature Li-ion conductivities (σRT) in poorly conductive and fast conducting materials can differ by more than twenty orders of magnitude1,2. The high σRT in electrode and electrolyte materials are essential for high power/rate performance of batteries. In particular, replacing the flammable liquid electrolyte used in commercial Li-ion batteries with a fast Li-conducting solid electrolyte, to produce an all-solid-state battery, provides improved safety, excellent stability, and long cycling life1,2. Although there are several thousands of known lithium-containing compounds, fast Li+ conduction with σRT close to 10−3–10−2 S cm−1, comparable to the level in liquid electrolytes, is a rare property held by only a few solid-state Li-ion conductors (SSLCs), such as lithium thiophosphates (e.g., Li7P3S113 and Li10GeP2S124, LGPS), garnet (e.g., Li7Li3Zr2O125, LLZO), NASICON (e.g., Li1.3Al0.3Ti1.7(PO4)36, LATP), perovskite (e.g., Li0.5La0.5TiO37, LLTO), Li3N8, and argyrodite9 (e.g., Li6PS5Cl) (Fig. 1a). Since these known SSLCs do not meet all desired attributes required for the commercialization of all-solid-state batteries10, there is significant interest in discovering new SSLC materials with high σRT. The challenges in predicting new SSLCs are largely a result of the diverse chemistries and structures of SSLCs, and current computational predictions and laboratory syntheses are often performed on a limited number of candidates1,2. SSLCs have compositions ranging from oxides and sulfides to nitrides and mixed halides, and a diverse set of crystalline structures including perovskite, argyrodite, garnet, and NASICON, and newly discovered structures, such as LGPS and Li7P3S11. Over the past few years, first-principles computation has played an important role in the successful prediction of a number of novel SSLCs11,12,13,14,15. Recent studies have determined a number of key physical factors required for fast Li-ion diffusion, such as anion lattice packing13, lattice dynamics16,17, frustration of the mobile-ion sublattice18,19, and concerted ion migration14. So far, transforming a theory into a predictive model to explore a vast composition-structure space of many materials remains a significant challenge.

Machine learning (ML) has emerged as a technique for materials discovery thanks to its capability of recognizing complex patterns in data20,21,22,23,24,25,26,27,28 by representing materials with critical descriptors such as the combination of chemistry, composition, and crystal structure that yields desired materials properties. While significant research progress has been achieved by improving the materials descriptors over the years29,30,31,32,33,34,35,36,37, the applications of ML for materials discovery is in general plagued by two significant challenges. First, a ML model requires training on a sufficient amount of data to capture the correlation between a desired physical property and the features of materials. Unfortunately, only a few select materials exhibit the property of interest, as is often the case in materials discovery. In many cases, even the data for materials with poor properties is scarce, due to lack of interest in performing and reporting these measurements. For example, most solids with poor ionic conductivity do not have conductivity data. The second challenge is that the parameterization of a ML model is highly susceptible to variances and errors in property data38. In the case of SSLCs, the conductivity obtained through experimental measurements can vary by a few orders of magnitude due to factors including synthesis method, sample preparation, and measurement technique39. For SSLCs, it is challenging to train a ML model of Li+ conductivity from only a few compounds with known values of σRT with large variances and to make reliable predictions for thousands of compounds. This scarcity of high-quality property data greatly limits the applicability of supervised ML models to capture and predict complex structure-property relationships over a broader space of materials beyond known examples.

Unlike supervised learning models, which require well-labeled training data, unsupervised learning can be readily applied to vast datasets regardless of whether any properties or labels exist. As a technique to draw inferences from features of data without explicitly labeled properties, unsupervised learning has been applied in materials science for feature extraction, pattern recognition, clustering, and phase mapping40,41,42,43,44. However, the application of unsupervised learning to directly discover new materials with enhanced properties has rarely been explored27. As shown in this study, unsupervised learning, through training on a broad range of materials, can draw boundaries between good and poor examples, identifying candidates similar to good examples, which are then further verified by more accurate first-principles calculations. This new approach using unsupervised learning for materials discovery has multiple advantages. Switching the target of ML from predicting the property (e.g., σRT) in supervised learning to grouping materials in unsupervised learning alleviates the issues of poor data quality and accuracy. Rather than predicting the targeted materials property accurately for each candidate, the output from unsupervised learning is a significantly narrowed list of materials candidates for subsequent exploration by more accurate first-principles calculations, thus significantly reducing the cost for an expensive high-throughput first-principles screening by utilizing a limited quantity of low-quality data. In addition, unsupervised learning uses unlabeled data and readily expands the applicability of the ML model to the entire materials space.

In this study, we propose an unsupervised learning scheme for guiding materials discovery, and demonstrate it for materials discovery of SSLCs. We apply unsupervised learning to screen all known Li-containing compounds from the Inorganic Crystal Structure Database. Our trained unsupervised learning models cluster Li-containing compounds into groups of SSLCs with high conductivity and other groups of materials with poor ionic conduction. Using ab initio molecular dynamics (AIMD) simulation to quantify σRT for predicted compounds45, 16 new candidates having σRT exceeding 10−4 S cm−1 are identified, and three of them have σRT exceeding 10−2 S cm−1, on par with known SSLCs with highest σRT. As proposed and demonstrated, our new approach of ML-guided materials discovery circumvents the data scarcity challenges, identifies new materials using a small number of known examples, and provides unique insight on structure-property relations.

## Results

### Scheme of the unsupervised discovery of SSLCs

We illustrate our scheme of the unsupervised discovery of SSLC materials in Fig. 1. In order to train the unsupervised model, a quantitative representation of the complex materials structure (Fig. 1a) is required as input. Instead of using a combination of hand-picked features, we used digital diffraction patterns of the crystal structure. Specifically, a representation for each crystal structure was built based on Bragg’s law to map the three-dimensional periodic crystal lattice into a set of X-ray diffraction intensities at a fixed set of 2θ values (Method and Fig. 1b)35,46,47. Here, we only considered the anion lattice of the crystal structure, relying on the knowledge that anion configuration and Li+-anion interactions significantly affect Li sites, diffusion channels, and the energy landscape of Li migration1,13,15. The anionic lattice was set to S anion and was scaled to the same atomistic volume, so that the representation was invariant to lattice parameter or the chemical constituent (Method). The resulting representation, called modified X-ray diffraction (mXRD), is unambiguously defined for every anion lattice (Fig. 1b), fully capturing the anionic crystal structure information. Here, we performed our unsupervised discovery on 2986 compounds that contain lithium but not transition metals. Since some compounds have the same structure, one representative structure was used. A dataset of 528 representative anionic structures and their mXRDs were performed for the unsupervised learning (Method).

### Unsupervised clustering of Li-containing compounds

We performed clustering, a common unsupervised learning technique, to group materials with similar mXRD representations. We first generated a model (named C1) based on the agglomerative hierarchical clustering method to train a bottom-up grouping of the mXRD dataset (Method and Fig. 2a). The grouping showed a good quality of clustering as the mXRDs shared similar characteristics within the same groups (Fig. 2d) and different groups were well differentiated (Supplementary Note 1, Supplementary Fig. 1 and 2). More importantly, a visible clustering of SSLC materials is found using this model (Fig. 2b). Most known SSLCs with σRT close to 10−3–10−2 S cm−1, despite being structurally distinctive, were clustered into two groups in the center of the dendrogram out of a total seven groups, including LGPS, Li7P3S11, LLZO, and Li3N, in group VI, and argyrodite, β-Li3PS4, LLTO in group V. LATP, as an exception in group VII, lay close to the boundaries of group V and VII and its mXRD pattern still exhibited some similarity with group V. In addition, statistical analysis of σRT within the group quantitatively confirmed the correlation on σRT (Supplementary Note 1, Supplementary Figs 3 and 4). The violin plot of σRT of group V and VI showed significantly higher σRT (Fig. 2c), and the majority of compounds outside of group V and VI had σRT significantly below 10−4 S cm−1. A statistically significant difference of σRT of the two groups V and VI versus the rest groups was proved by the t-test (Supplementary Note 2).

As confirmed by quantitative correlation between the groups and σRT, our unsupervised learning model captured the physical dependence of fast solid-state Li+-diffusion on anion lattice. To critically assess the robustness of clustering in capturing the observed physical correlation, we performed three different clustering techniques. In addition to the aforementioned model, we trained a second model (named C2) to create a top-down grouping by recursively applying divisive spectral clustering (Method). These two models were purely based on the mXRD dataset of anion lattices without seeing any labeled σRT data. In our third grouping, the model (named C3) used the limited available σRT information to optimize the clustering of known SSLC examples (Supplementary Note 4). Despite the differences in the clustering methodologies of three models, the observed aggregation of fast-conducting examples was mostly consistent. Known SSLCs largely overlapped among the groups generated by these three models (Supplementary Notes 35, Supplementary Table 1 and Supplementary Figs 59). In particular, LGPS, Li7P3S11, LLZO, and Li3N were always clustered to the same group by all three models. Our results from three distinct models confirmed the reliability of clustering fast-conducting versus poor-conducting materials based on unsupervised learning using mXRD representations of the structures.

### Physical insights from unsupervised learning

The clustering of SSLCs by mXRD provides new insight into the understanding of crystal structures exhibiting fast-ion conduction. While Li-ion diffusion in solids has been shown to correlate with various parameters, such as lattice volume1, anion chemistry48, bond ionicity25,48, phonon mode16,17, and Li coordination number25, no single unified theory explains the similarity among highly distinctive crystal structures of all SSLCs. Our unsupervised clustering quantitatively confirmed the similarity among the mXRD patterns of anion lattice of SSLCs. The mXRD encodes the symmetry and ordering of the anionic lattice and showed strong correlation with ionic conductivity (Fig. 2). Given the information of lattice volume and anion chemistry critical for ion diffusion were removed from the mXRD descriptor, the resulted clustering of Li-conducting phases suggests that the long-range periodicity of the anion lattice as encoded in mXRD plays a fundamental role in Li-ion diffusion. By analyzing the structural origin of the clustered groups, (Supplementary Note 6), we found the materials in Group I, II, and III correspond to highly symmetrical fcc (face centered cubic), hcp (hexagonal close packed), and bcc anion lattices, respectively. For these anion lattices, Li ions are symmetrically confined in highly symmetric tetrahedral or octahedral sites of anions (as an example, Fig. 2e for Li2S), and migrate among these well-defined sites13. Groups IV, V, and VI show a moderate level of variance, which can be understood as mild distortion of the anion lattices. The distortion of anion lattices disturbs Li+ bonding environments and causes Li+ to deviate from highly symmetric locations to geometrically frustrated configurations. For example, in LGPS and LLZO, the distorted anion polyhedra generate multiple positions to host Li ions, observed as the spread Li-ion probability density observed in AIMD simulations (Fig. 2e), which were represented as partially occupied Li sites (e.g., Li1 and 96 h sites in LGPS and LLZO, respectively) from diffraction experiments4,5. Having multiple positions for Li+ to occupy may lead to a degeneracy of Li sublattice energy and an entropically-enabled disordered-Li sublattice migrating among metastable configurations18,19. Therefore, as observed in their mXRD representations, the SSLCs clustered in group V and VI exhibit the characteristics of moderately distorted anion lattices, which is closely related to disordered Li sublattice for fast Li-ion conduction. The materials in Group VII, as reflected by the high standard deviation of mXRD peaks, correspond to the least symmetric and highly disordered anion lattices (Supplementary Figs 1012). The highly disordered anion lattices in these materials may locally trap Li ions and impede Li-ion percolation across the crystal structure (Supplementary Fig. 13), resulting in the low conductivities observed for compounds in this group.

### SSLC confirmed by AIMD simulations

Given the successful clustering of known SSLC materials by unsupervised learning models, the other structures clustered into the same groups are expected to exhibit fast Li-ion conduction. To further assess the conductivity of these compounds discovered from the unsupervised grouping, we conducted AIMD simulations, which have been demonstrated as a highly accurate and predictive computation approach for calculating Li ion conductivity11,14,15,45. From the screening of initial 2989 compounds from the ICSD, we narrowed the evaluation of the ion-conduction property down to 82 unique compounds, which were from the intersection of these fast-conducting groups in the aforementioned three models. Thus, our unsupervised learning scheme successfully reduced a high throughput screening of thousands of compounds to a focused exploration of <100 candidates with much higher success probability. Among these, we rediscovered LiZnPS4, which was previously discovered by the bcc-anion-packing rule and was confirmed with an experimental σRT of 5.7 × 10−4 S cm−111,12,13. According to AIMD simulations (Fig. 3), 16 more candidates are predicted to have σRT higher than 10–4 S cm−1. In particular, three new materials systems, Li8N2Se, Li6KBiO6 and Li5P2N5, have σRT exceeding 10−2 S cm−1, a conductivity higher than that of the best known SSLCs. A list of these materials and the calculated Li+ conduction properties are summarized in Supplementary Tables 34 and Supplementary Fig. 14. Figure 3 plots the predicted σRT and activation energy of newly discovered SSLCs (filled symbols), in comparison with σRT reported in the past few decades (open symbols, Supplementary Table 2). The newly discovered SSLCs are in the upper left corner of Fig. 3, which corresponds to high σRT of >10−5 S cm−1 and low Ea of 0.17–0.45 eV. More importantly, these SSLCs comprise new structures, chemistries, and compositions significantly different from known SSLCs, demonstrating the capability of our crystal-structure-based unsupervised learning model to discover materials beyond existing chemistries.

## Discussion

A fraction of compounds uncovered by our grouping did not show fast Li-ion diffusion in AIMD simulations (Supplementary Tables 57). Among these compounds, a majority exhibit too small of a percolation radius for Li-ion migration, a blocking of diffusion network by other cations, or a poorly connected diffusion network. The inclusion of these compounds was attributed to the fact that our unsupervised models were trained solely on the anionic geometry without considering factors such as the effects of other cations. In addition, some compounds with low ionic conductivity may be further optimized via doping or tuning Li concentration. Future extension of our scheme should attempt to include features in addition to the anion lattice for more accurate prediction.

For these Li-ion conductors to be utilized as solid electrolytes for solid-state Li-ion batteries, other materials properties, such as electrochemical window, interface compatibility, and mechanical properties1,2,10,15,16, are also required. We employed the first-principles computation techniques established in the previous studies10,15 to evaluate the thermodynamic intrinsic electrochemical window of these newly identified ion conductors (Supplementary Fig. 15). Consistent with the general trend identified in the previous studies15, most of the materials have limited electrochemical windows. Many identified nitrides are stable with Li metal in agreement with the previous computation study49, while other compounds are not stable against Li metal or at low potential due to the reduction of cations. The identified fluorides have a very high oxidation limit of >6 V, which may be ideal for stable protection of high-voltage cathodes. The oxides have decent electrochemical windows but most have relatively low ionic conductivities of ~10−4 S cm−1 (except for Li6KBiO6). The identified sulfides have narrow windows but these two sulfides may have significantly better air/moisture stability than currently used thio-phosphates. In summary, while our discovery does not identify an ionic conductor that out-competes current solid electrolytes, the potential choices of fast ion conductors with improvements in certain aspects (such as stability against Li metal, high voltage, or air) are predicted from the computation discovery. The properties and applicabilities of these materials in solid-state batteries may require further computational or experimental studies and optimizations.

In summary, the unsupervised learning models succeeded in distinguishing fast Li-conducting and poor Li-conducting materials, leading to the prediction of sixteen new compounds as solid-state Li-ion conductors with room-temperature conductivities higher than 10−4 S cm−1 with a few new compounds exceeding 10−2 S cm−1. These newly discovered candidates have highly different structures and chemical compositions from current known fast Li-ion conductors, demonstrating the effectiveness of our unsupervised learning approach for discovering new materials over a wide materials space. This novel unsupervised learning approach also reveals the unique structure–property relationship between anion lattice and Li+ conduction over a large materials space. Whereas the supervised learning has been widely adopted in the majority of machine-learning studies for materials, our unsupervised learning scheme, which narrows a high-throughput screening to a focused prioritized list by utilizing a limited amount of low-quality data, presents a different approach of using ML for materials discovery, and is generally applicable for other physical properties.

## Methods

### Data preprocessing

The raw data of crystalline structures were exported from the Inorganic Crystalline Structure Database (ICSD) in the format of cif files50. The range of analysis in the current study includes all compounds containing Li but not transition metal species, except Sc, Y, La, Ti, and Zr. The exclusion of transition metal species is based on the consideration that compounds containing transition metal ions are usually redox active and hence may not be suitable for application as solid-state electrolytes. These filters yielded a total of 2986 ICSD entries (ver. November 2016). The representative structures for each entry was identified either as the “chemical_name_structure_type” flag in the cif files or as the chemical formula if this flag was not set explicitly. The entries that were structurally similar in the hierarchical clustering were further filtered to remove duplicates in the training set. The final training set included 528 unique representative structures for the unsupervised learning analysis.

### Representation

The anionic sublattice of Li-containing compounds is uniquely represented in the X-ray diffraction pattern based on Bragg’s law. For the diffraction from (hkl) plane, the angle is determined by

$$2\sin \left( \theta \right) = \lambda \cdot d_{hkl}^{ - 1}$$
(1)

where the interplane distance dhkl is a function of the size and shape of the unit cell

$$d_{hkl}^2 = h^2a^{ \ast 2} + k^2b^{ \ast 2} + l^2c^{ \ast 2} + 2hka^ \ast b^ \ast \cos \left( \gamma \right) + 2hla^ \ast c^ \ast \cos \left( \beta \right) + 2klb^ \ast c^ \ast \cos \left( \alpha \right)$$
(2)

The intensity is determined by the amplitude of light scattered from the lattice plane

$$F_{hkl} =\sum_{j = 1}^m {N_jf_j{\mathrm{exp}}\left[ {2\pi i(hx_i + ky_i + lz_i)} \right.}$$
(3)

where the sum runs over all atoms of the unit cell on (hkl) plane, Nj is the fraction of every equivalent position that is occupied by atom j at coordinates (xj, yj, zj). The scattering factor fj is a product describing the interaction of the X-ray with the electrons around an atom. Using Eqs. 1, 2, and 3, the X-ray diffraction of a periodic lattice is determined by the size and shape of unit cell, as well as the position and identity of atoms on a given plane. The following procedure was employed to obtain the XRD representation of the geometry of anion sublattice of the crystalline structure. First, we removed all cations from the crystalline structure, keeping only the anionic sublattice in the unit cell. Second, we substituted the remaining anions for a unitary species (e.g., S2−), removing the influence of the scattering factor f on the diffraction intensity. Third, the unit cell was isotropically expanded or compressed to a pre-determined volume per anion of 40 Å3, removing the effect of unit cell size on the position of the diffraction peaks. After these initial steps, the X-ray diffraction of modified lattice encodes only information for the geometry and topology of anion sublattice. The calculation of diffraction pattern is then performed at a fixed set of 2θ values from 0 to 89.98° at a step size of 0.1° using the pymatgen package51, generating a 900-dimensional vector for each diffraction pattern51. We confirmed the results of hierarchical clustering was consistent when the step size was increased to 0.02° (Supplementary Note 7). A Gaussian smearing was then performed to normalize the integrated intensity of diffraction to a unitary value.

### Unsupervised learning

We used the Dendrogram function from the SciPy package to perform agglomerative hierarchical clustering (AHC)52. In AHC, each sample starts in its own cluster, and the clusters merge progressively according to the similarity metric as one moves up the hierarchy. The output from AHC is a bottom-up hierarchical tree diagram (dendrogram). The Euclidean distance (L2) between two diffraction profiles was used as the similarity metric and Ward linkage was used to measure the cluster dissimilarity53. The same clustering results were also obtained using the hclust package in R.

In addition to the hierarchical clustering, we used the kernlab package in R to perform spectral clustering. Spectral clustering uses the eigenvalues of the similarity matrix of the data to divide the samples in to K groups, where K is a manually selected integer54. To create hierarchical grouping results, we recursively applied the bisectional divide (K = 2) on the larger portion from the previous grouping, and obtained a divisive top-down hierarchical diagram after the clustering.

### First-principles calculation

All Density Functional Theory (DFT) calculations were performed using the Vienna Ab initio Simulation package (VASP) within the projector augmented-wave approach and Perdew–Burke–Ernzerhof (PBE) generalized-gradient approximation (GGA) functionals55,56,57. The parameters in static DFT calculations were consistent with the Materials Project58. Ab initio molecular dynamics (AIMD) simulations were performed in supercell models using non-spin-polarized DFT calculations with a Γ-centered k-point. The time step was set to 2 fs. The initial structures were statically relaxed and were set to an initial temperature of 100 K. The structures were then heated to targeted temperatures at a constant rate by velocity scaling during 2 ps. During the estimation of Li ion diffusion, NVT ensemble using Nosé–Hoover thermostat was adopted. The total time of AIMD simulations were in the range of 100 ps to 1000 ps until the diffusivity was converged. The ionic diffusivity and conductivity were calculated following established method in previous study45.