Unsupervised discovery of solid-state lithium ion conductors

Although machine learning has gained great interest in the discovery of functional materials, the advancement of reliable models is impeded by the scarcity of available materials property data. Here we propose and demonstrate a distinctive approach for materials discovery using unsupervised learning, which does not require labeled data and thus alleviates the data scarcity challenge. Using solid-state Li-ion conductors as a model problem, unsupervised materials discovery utilizes a limited quantity of conductivity data to prioritize a candidate list from a wide range of Li-containing materials for further accurate screening. Our unsupervised learning scheme discovers 16 new fast Li-conductors with conductivities of 10−4–10−1 S cm−1 predicted in ab initio molecular dynamics simulations. These compounds have structures and chemistries distinct from known systems, demonstrating the capability of unsupervised learning for discovering materials over a wide materials space with limited property data.

The fast conduction of lithium (Li) ions in a solid is a phenomenon of significant scientific interest and technological importance. The room-temperature Li-ion conductivities (σRT) of poorly conductive and fast-conducting materials can differ by more than twenty orders of magnitude [1,2]. High σRT in electrode and electrolyte materials is essential for high power/rate performance of batteries. In particular, replacing the flammable liquid electrolyte used in commercial Li-ion batteries with a fast Li-conducting solid electrolyte, to produce an all-solid-state battery, provides improved safety, excellent stability, and long cycling life [1,2]. Although there are several thousand known lithium-containing compounds, fast Li+ conduction with σRT close to 10−3–10−2 S cm−1, comparable to the level in liquid electrolytes, is a rare property held by only a few solid-state Li-ion conductors (SSLCs), such as lithium thiophosphates (e.g., Li7P3S11 [3] and Li10GeP2S12 [4], LGPS), garnet (e.g., Li7La3Zr2O12 [5], LLZO), NASICON (e.g., Li1.3Al0.3Ti1.7(PO4)3 [6], LATP), perovskite (e.g., Li0.5La0.5TiO3 [7], LLTO), Li3N [8], and argyrodite [9] (e.g., Li6PS5Cl) (Fig. 1a). Since these known SSLCs do not meet all the attributes required for the commercialization of all-solid-state batteries [10], there is significant interest in discovering new SSLC materials with high σRT. The challenges in predicting new SSLCs are largely a result of the diverse chemistries and structures of SSLCs, and current computational predictions and laboratory syntheses are often performed on a limited number of candidates [1,2]. SSLCs have compositions ranging from oxides and sulfides to nitrides and mixed halides, and a diverse set of crystalline structures including perovskite, argyrodite, garnet, and NASICON, as well as newly discovered structures such as LGPS and Li7P3S11.
Over the past few years, first-principles computation has played an important role in the successful prediction of a number of novel SSLCs [11-15]. Recent studies have determined a number of key physical factors required for fast Li-ion diffusion, such as anion lattice packing [13], lattice dynamics [16,17], frustration of the mobile-ion sublattice [18,19], and concerted ion migration [14]. So far, however, transforming such theories into a predictive model that can explore the vast composition-structure space of many materials remains a significant challenge.
Machine learning (ML) has emerged as a technique for materials discovery thanks to its capability of recognizing complex patterns in data [20-28] by representing materials with critical descriptors, such as the combination of chemistry, composition, and crystal structure, that yield desired materials properties. While significant research progress has been achieved by improving the materials descriptors over the years [29-37], the application of ML to materials discovery is in general plagued by two significant challenges. First, an ML model requires training on a sufficient amount of data to capture the correlation between a desired physical property and the features of materials. Unfortunately, as is often the case in materials discovery, only a few select materials exhibit the property of interest. In many cases, even the data for materials with poor properties are scarce, owing to a lack of interest in performing and reporting these measurements. For example, most solids with poor ionic conductivity do not have conductivity data. The second challenge is that the parameterization of an ML model is highly susceptible to variances and errors in property data [38]. In the case of SSLCs, the conductivity obtained through experimental measurements can vary by a few orders of magnitude due to factors including synthesis method, sample preparation, and measurement technique [39]. For SSLCs, it is challenging to train an ML model of Li+ conductivity from only a few compounds whose known values of σRT have large variances, and then to make reliable predictions for thousands of compounds. This scarcity of high-quality property data greatly limits the applicability of supervised ML models to capture and predict complex structure-property relationships over a broader space of materials beyond known examples.
Unlike supervised learning models, which require well-labeled training data, unsupervised learning can be readily applied to vast datasets regardless of whether any properties or labels exist. As a technique to draw inferences from features of data without explicitly labeled properties, unsupervised learning has been applied in materials science for feature extraction, pattern recognition, clustering, and phase mapping [40-44]. However, the application of unsupervised learning to directly discover new materials with enhanced properties has rarely been explored [27]. As shown in this study, unsupervised learning, through training on a broad range of materials, can draw boundaries between good and poor examples and identify candidates similar to good examples, which are then further verified by more accurate first-principles calculations. This new approach using unsupervised learning for materials discovery has multiple advantages. Switching the target of ML from predicting the property (e.g., σRT) in supervised learning to grouping materials in unsupervised learning alleviates the issues of poor data quality and accuracy. Rather than predicting the targeted materials property accurately for each candidate, unsupervised learning outputs a significantly narrowed list of candidate materials for subsequent exploration by more accurate first-principles calculations. It thereby significantly reduces the cost of an expensive high-throughput first-principles screening while utilizing only a limited quantity of low-quality data. In addition, unsupervised learning uses unlabeled data and readily expands the applicability of the ML model to the entire materials space.
In this study, we propose an unsupervised learning scheme for guiding materials discovery, and demonstrate it for the discovery of SSLCs. We apply unsupervised learning to screen all known Li-containing compounds from the Inorganic Crystal Structure Database. Our trained unsupervised learning models cluster Li-containing compounds into groups of SSLCs with high conductivity and other groups of materials with poor ionic conduction. Using ab initio molecular dynamics (AIMD) simulation to quantify σRT for predicted compounds [45], 16 new candidates having σRT exceeding 10−4 S cm−1 are identified, and three of them have σRT exceeding 10−2 S cm−1, on par with the known SSLCs with the highest σRT. As proposed and demonstrated, our new approach of ML-guided materials discovery circumvents the data scarcity challenges, identifies new materials using a small number of known examples, and provides unique insight into structure-property relations.

Results
Scheme of the unsupervised discovery of SSLCs. We illustrate our scheme of the unsupervised discovery of SSLC materials in Fig. 1. In order to train the unsupervised model, a quantitative representation of the complex materials structure (Fig. 1a) is required as input. Instead of using a combination of hand-picked features, we used digital diffraction patterns of the crystal structure. Specifically, a representation for each crystal structure was built based on Bragg's law to map the three-dimensional periodic crystal lattice into a set of X-ray diffraction intensities at a fixed set of 2θ values (Methods and Fig. 1b) [35,46,47]. Here, we only considered the anion lattice of the crystal structure, relying on the knowledge that anion configuration and Li+-anion interactions significantly affect Li sites, diffusion channels, and the energy landscape of Li migration [1,13,15]. All anions were replaced with S and the lattice was scaled to the same volume per anion, so that the representation was invariant to the lattice parameter and the chemical constituents (Methods). The resulting representation, called modified X-ray diffraction (mXRD), is unambiguously defined for every anion lattice (Fig. 1b) and fully captures the anionic crystal structure information. We performed our unsupervised discovery on 2986 compounds that contain lithium but not transition metals. Since some compounds share the same structure, one representative structure was used for each such set. A dataset of 528 representative anionic structures and their mXRDs was used for the unsupervised learning (Methods).
Unsupervised clustering of Li-containing compounds. We performed clustering, a common unsupervised learning technique, to group materials with similar mXRD representations. We first generated a model (named C1) based on the agglomerative hierarchical clustering method to build a bottom-up grouping of the mXRD dataset (Methods and Fig. 2a). The grouping showed a good quality of clustering, as the mXRDs shared similar characteristics within the same groups (Fig. 2d) and different groups were well differentiated (Supplementary Note 1, Supplementary Figs 1 and 2). More importantly, a visible clustering of SSLC materials was found using this model (Fig. 2b). Most known SSLCs with σRT close to 10−3–10−2 S cm−1, despite being structurally distinctive, were clustered into two groups in the center of the dendrogram out of a total of seven groups: LGPS, Li7P3S11, LLZO, and Li3N in group VI, and argyrodite, β-Li3PS4, and LLTO in group V. LATP, as an exception in group VII, lay close to the boundary between groups V and VII, and its mXRD pattern still exhibited some similarity with group V. In addition, statistical analysis of σRT within each group quantitatively confirmed the correlation between the grouping and σRT (Supplementary Note 1, Supplementary Figs 3 and 4). The violin plot showed significantly higher σRT for groups V and VI (Fig. 2c), and the majority of compounds outside of groups V and VI had σRT significantly below 10−4 S cm−1. A statistically significant difference in σRT between groups V and VI and the remaining groups was confirmed by a t-test (Supplementary Note 2).
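The group comparison described above can be illustrated with a two-sample t statistic. The following is a minimal sketch, assuming Welch's unequal-variance form applied to log10(σRT); the text does not state which t-test variant was used, and the function name `welch_t` and all numerical values are hypothetical.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance is the unbiased (n - 1) sample variance.
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, dof

# Hypothetical log10(sigma_RT / S cm^-1) values, NOT data from the study.
fast_groups = [-2.0, -3.0, -4.0]
other_groups = [-6.0, -7.0, -8.0]
t_stat, dof = welch_t(fast_groups, other_groups)
print(f"t = {t_stat:.3f}, dof = {dof:.1f}")
```

Working in log10(σRT) is a choice made here because conductivities span many orders of magnitude; a p-value would follow from the t distribution with the returned degrees of freedom.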
As confirmed by the quantitative correlation between the groups and σRT, our unsupervised learning model captured the physical dependence of fast solid-state Li+ diffusion on the anion lattice. To critically assess the robustness of clustering in capturing the observed physical correlation, we applied three different clustering techniques. In addition to the aforementioned model, we trained a second model (named C2) to create a top-down grouping by recursively applying divisive spectral clustering (Methods). These two models were based purely on the mXRD dataset of anion lattices without seeing any labeled σRT data. In our third grouping, the model (named C3) used the limited available σRT information to optimize the clustering of known SSLC examples (Supplementary Note 4). Despite the differences in the clustering methodologies of the three models, the observed aggregation of fast-conducting examples was mostly consistent. Known SSLCs largely overlapped among the groups generated by these three models (Supplementary Notes 3-5, Supplementary Table 1 and Supplementary Figs 5-9). In particular, LGPS, Li7P3S11, LLZO, and Li3N were always clustered into the same group by all three models. Our results from three distinct models confirmed the reliability of clustering fast-conducting versus poorly conducting materials based on unsupervised learning using mXRD representations of the structures.
Physical insights from unsupervised learning. The clustering of SSLCs by mXRD provides new insight into the understanding of crystal structures exhibiting fast-ion conduction. While Li-ion diffusion in solids has been shown to correlate with various parameters, such as lattice volume [1], anion chemistry [48], bond ionicity [25,48], phonon mode [16,17], and Li coordination number [25], no single unified theory explains the similarity among the highly distinctive crystal structures of all SSLCs. Our unsupervised clustering quantitatively confirmed the similarity among the mXRD patterns of the anion lattices of SSLCs. The mXRD encodes the symmetry and ordering of the anionic lattice and showed strong correlation with ionic conductivity (Fig. 2). Given that the information on lattice volume and anion chemistry, both critical for ion diffusion, was removed from the mXRD descriptor, the resulting clustering of Li-conducting phases suggests that the long-range periodicity of the anion lattice as encoded in mXRD plays a fundamental role in Li-ion diffusion. By analyzing the structural origin of the clustered groups (Supplementary Note 6), we found that the materials in Groups I, II, and III correspond to highly symmetrical fcc (face centered cubic), hcp (hexagonal close packed), and bcc (body centered cubic) anion lattices, respectively. For these anion lattices, Li ions are confined in highly symmetric tetrahedral or octahedral sites of anions (as an example, Fig. 2e for Li2S), and migrate among these well-defined sites [13]. Groups IV, V, and VI show a moderate level of variance, which can be understood as mild distortion of the anion lattices. The distortion of anion lattices disturbs Li+ bonding environments and causes Li+ to deviate from highly symmetric locations to geometrically frustrated configurations. For example, in LGPS and LLZO, the distorted anion polyhedra generate multiple positions to host Li ions, seen as the spread-out Li-ion probability density in AIMD simulations (Fig. 2e) and represented as partially occupied Li sites (e.g., Li1 and 96h sites in LGPS and LLZO, respectively) in diffraction experiments [4,5]. Having multiple positions for Li+ to occupy may lead to a degeneracy of the Li sublattice energy and an entropically enabled disordered Li sublattice migrating among metastable configurations [18,19]. Therefore, as observed in their mXRD representations, the SSLCs clustered in groups V and VI exhibit the characteristics of moderately distorted anion lattices, which are closely related to the disordered Li sublattice needed for fast Li-ion conduction. The materials in Group VII, as reflected by the high standard deviation of mXRD peaks, correspond to the least symmetric and highly disordered anion lattices (Supplementary Figs 10-12). The highly disordered anion lattices in these materials may locally trap Li ions and impede Li-ion percolation across the crystal structure (Supplementary Fig. 13), resulting in the low conductivities observed for compounds in this group.
SSLC confirmed by AIMD simulations. Given the successful clustering of known SSLC materials by unsupervised learning models, the other structures clustered into the same groups are expected to exhibit fast Li-ion conduction. To further assess the conductivity of these compounds discovered from the unsupervised grouping, we conducted AIMD simulations, which have been demonstrated as a highly accurate and predictive computational approach for calculating Li-ion conductivity [11,14,15,45]. Starting from the initial 2986 compounds from the ICSD, we narrowed the evaluation of the ion-conduction property down to 82 unique compounds, which were from the intersection of the fast-conducting groups in the aforementioned three models. Thus, our unsupervised learning scheme successfully reduced a high-throughput screening of thousands of compounds to a focused exploration of <100 candidates with much higher success probability. Among these, we rediscovered LiZnPS4, which was previously discovered by the bcc-anion-packing rule and was confirmed with an experimental σRT of 5.7 × 10−4 S cm−1 [11-13]. According to AIMD simulations (Fig. 3), 16 more candidates are predicted to have σRT higher than 10−4 S cm−1. In particular, three new materials systems, Li8N2Se, Li6KBiO6 and Li5P2N5, have σRT exceeding 10−2 S cm−1, a conductivity comparable to that of the best known SSLCs. A list of these materials and the calculated Li+ conduction properties are summarized in Supplementary Tables 3-4 and Supplementary Fig. 14. Figure 3 plots the predicted σRT and activation energy of newly discovered SSLCs (filled symbols), in comparison with σRT reported in the past few decades (open symbols, Supplementary Table 2). The newly discovered SSLCs are in the upper left corner of Fig. 3, which corresponds to high σRT of >10−4 S cm−1 and low Ea of 0.17-0.45 eV. More importantly, these SSLCs comprise new structures, chemistries, and compositions significantly different from known SSLCs, demonstrating the capability of our crystal-structure-based unsupervised learning model to discover materials beyond existing chemistries.
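The shortlisting step, which keeps only compounds that fall in a fast-conducting group in all three clustering models, amounts to a set intersection. A minimal sketch follows, assuming each model's fast-conducting members are available as a set of labels; the function name `shortlist` and the compound labels are hypothetical placeholders.

```python
def shortlist(fast_groups_by_model):
    """Intersect the fast-conducting candidate sets of several clustering models."""
    candidate_sets = [set(group) for group in fast_groups_by_model.values()]
    return set.intersection(*candidate_sets)

# Hypothetical compound labels; the real group memberships are in the
# supplementary data of the study and are not reproduced here.
fast_groups_by_model = {
    "C1": {"compound_a", "compound_b", "compound_c", "compound_d"},
    "C2": {"compound_b", "compound_c", "compound_d", "compound_e"},
    "C3": {"compound_a", "compound_c", "compound_d"},
}
candidates = shortlist(fast_groups_by_model)
print(sorted(candidates))
```

Requiring agreement among all models trades recall for precision: a compound misplaced by any one model is dropped, which suits a workflow where each surviving candidate triggers an expensive AIMD run.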

Discussion
A fraction of the compounds uncovered by our grouping did not show fast Li-ion diffusion in AIMD simulations (Supplementary Tables 5-7). Among these compounds, a majority exhibit too small a percolation radius for Li-ion migration, a blocking of the diffusion network by other cations, or a poorly connected diffusion network. The inclusion of these compounds is attributed to the fact that our unsupervised models were trained solely on the anionic geometry without considering factors such as the effects of other cations. In addition, some compounds with low ionic conductivity may be further optimized via doping or tuning the Li concentration. Future extensions of our scheme should attempt to include features in addition to the anion lattice for more accurate prediction. For these Li-ion conductors to be utilized as solid electrolytes in solid-state Li-ion batteries, other materials properties, such as electrochemical window, interface compatibility, and mechanical properties [1,2,10,15,16], are also required. We employed the first-principles computation techniques established in previous studies [10,15] to evaluate the intrinsic thermodynamic electrochemical window of these newly identified ion conductors (Supplementary Fig. 15). Consistent with the general trend identified in previous studies [15], most of the materials have limited electrochemical windows. Many identified nitrides are stable with Li metal, in agreement with a previous computational study [49], while other compounds are not stable against Li metal or at low potential due to the reduction of cations. The identified fluorides have a very high oxidation limit of >6 V, which may be ideal for stable protection of high-voltage cathodes. The oxides have decent electrochemical windows, but most have relatively low ionic conductivities of ~10−4 S cm−1 (except for Li6KBiO6). The identified sulfides have narrow windows, but these two sulfides may have significantly better air/moisture stability than currently used thiophosphates. Overall, while our discovery does not identify an ionic conductor that outcompetes current solid electrolytes, it predicts potential choices of fast ion conductors with improvements in certain aspects (such as stability against Li metal, high voltage, or air). The properties and applicability of these materials in solid-state batteries may require further computational or experimental studies and optimizations.
In summary, the unsupervised learning models succeeded in distinguishing fast Li-conducting from poorly Li-conducting materials, leading to the prediction of sixteen new compounds as solid-state Li-ion conductors with room-temperature conductivities higher than 10−4 S cm−1, with a few new compounds exceeding 10−2 S cm−1. These newly discovered candidates have structures and chemical compositions highly different from currently known fast Li-ion conductors, demonstrating the effectiveness of our unsupervised learning approach for discovering new materials over a wide materials space. This unsupervised learning approach also reveals the structure-property relationship between the anion lattice and Li+ conduction over a large materials space. Whereas supervised learning has been adopted in the majority of machine-learning studies for materials, our unsupervised learning scheme, which narrows a high-throughput screening to a focused prioritized list by utilizing a limited amount of low-quality data, presents a different approach to using ML for materials discovery, and is generally applicable to other physical properties.

Methods
Data preprocessing. The raw data of crystalline structures were exported from the Inorganic Crystal Structure Database (ICSD) in the format of cif files [50]. The range of analysis in the current study includes all compounds containing Li but no transition metal species other than Sc, Y, La, Ti, and Zr. The exclusion of transition metal species is based on the consideration that compounds containing transition metal ions are usually redox active and hence may not be suitable for application as solid-state electrolytes. These filters yielded a total of 2986 ICSD entries (ver. November 2016). The representative structure for each entry was identified either from the "chemical_name_structure_type" flag in the cif file or from the chemical formula if this flag was not set explicitly. The entries that were structurally similar in the hierarchical clustering were further filtered to remove duplicates in the training set. The final training set included 528 unique representative structures for the unsupervised learning analysis.
Representation. The anionic sublattice of Li-containing compounds is uniquely represented in the X-ray diffraction pattern based on Bragg's law. For the diffraction from the (hkl) plane, the angle is determined by

$$ n\lambda = 2 d_{hkl} \sin\theta \qquad (1) $$

where $\lambda$ is the X-ray wavelength, $n$ is the diffraction order, and the interplanar distance $d_{hkl}$ is a function of the size and shape of the unit cell,

$$ \frac{1}{d_{hkl}} = \left| h\,\mathbf{a}^{*} + k\,\mathbf{b}^{*} + l\,\mathbf{c}^{*} \right| \qquad (2) $$

with $\mathbf{a}^{*}$, $\mathbf{b}^{*}$, $\mathbf{c}^{*}$ the reciprocal lattice vectors of the unit cell (defined without the factor of $2\pi$). The intensity is determined by the amplitude of light scattered from the lattice plane,

$$ F_{hkl} = \sum_{j} N_j f_j \exp\left[ 2\pi i \left( h x_j + k y_j + l z_j \right) \right] \qquad (3) $$

where the sum runs over all atoms of the unit cell on the (hkl) plane, and $N_j$ is the fraction of every equivalent position that is occupied by atom $j$ at coordinates $(x_j, y_j, z_j)$. The scattering factor $f_j$ describes the interaction of the X-ray with the electrons around an atom. According to Eqs. 1, 2, and 3, the X-ray diffraction of a periodic lattice is determined by the size and shape of the unit cell, as well as the position and identity of the atoms on a given plane. The following procedure was employed to obtain the XRD representation of the geometry of the anion sublattice of the crystalline structure. First, we removed all cations from the crystalline structure, keeping only the anionic sublattice in the unit cell. Second, we substituted the remaining anions with a single species (e.g., S2−), removing the influence of the scattering factor $f$ on the diffraction intensity. Third, the unit cell was isotropically expanded or compressed to a pre-determined volume per anion of 40 Å3, removing the effect of unit cell size on the position of the diffraction peaks. After these initial steps, the X-ray diffraction of the modified lattice encodes only information on the geometry and topology of the anion sublattice. The calculation of the diffraction pattern is then performed at a fixed set of 2θ values from 0 to 89.98° at a step size of 0.02° using the pymatgen package [51], generating a 4500-dimensional vector for each diffraction pattern. We confirmed that the results of hierarchical clustering were consistent when the step size was increased to 0.1° (Supplementary Note 7). A Gaussian smearing was then applied, and the integrated intensity of the diffraction pattern was normalized to a unitary value.
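The numerical part of this procedure (rescaling to a fixed volume per anion, converting plane spacings to Bragg angles, and turning discrete peaks into a smeared, normalized vector on a fixed 2θ grid) can be sketched as follows. This is an illustrative, dependency-free sketch and not the authors' pymatgen-based code; the function names, the Cu Kα1 wavelength, the smearing width `sigma`, and the use of a discrete sum as a stand-in for the integrated intensity are all assumptions. The grid spacing is left as a parameter.

```python
import math

def isotropic_scale_factor(cell_volume, n_anions, target_volume_per_anion=40.0):
    """Linear factor that rescales the cell to the target volume per anion (in Å^3)."""
    return (target_volume_per_anion * n_anions / cell_volume) ** (1.0 / 3.0)

def two_theta_deg(d_spacing, wavelength=1.5406):
    """First-order Bragg angle 2θ in degrees; the Cu Kα1 wavelength is an assumed default."""
    return 2.0 * math.degrees(math.asin(wavelength / (2.0 * d_spacing)))

def smear_to_grid(peaks, step, two_theta_max=90.0, sigma=0.2):
    """Spread (2θ, intensity) peaks onto a fixed grid with Gaussians, then
    normalize so that the entries sum to 1 (a proxy for unit integrated intensity)."""
    n_bins = int(round(two_theta_max / step))
    grid = [i * step for i in range(n_bins)]
    profile = [0.0] * n_bins
    for position, intensity in peaks:
        for i, x in enumerate(grid):
            profile[i] += intensity * math.exp(-0.5 * ((x - position) / sigma) ** 2)
    total = sum(profile)
    return [value / total for value in profile]

# Toy example with two made-up peaks on a coarse grid.
vector = smear_to_grid([(30.0, 100.0), (60.0, 50.0)], step=0.5)
print(len(vector), round(sum(vector), 6))
```

Because every structure is mapped onto the same grid and normalized in the same way, the resulting vectors can be compared directly with a Euclidean distance, which is what the clustering step requires.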
Unsupervised learning. We used the dendrogram function from the SciPy package to perform agglomerative hierarchical clustering (AHC) [52]. In AHC, each sample starts in its own cluster, and the clusters merge progressively according to the similarity metric as one moves up the hierarchy. The output from AHC is a bottom-up hierarchical tree diagram (dendrogram). The Euclidean distance (L2) between two diffraction profiles was used as the similarity metric, and Ward linkage was used to measure the cluster dissimilarity [53]. The same clustering results were also obtained using the hclust function in R.
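To make the Ward criterion concrete, the sketch below implements a naive bottom-up merge in plain Python: at each step it joins the two clusters whose merger gives the smallest increase in within-cluster sum of squares. It is a didactic stand-in rather than the SciPy or R routines used in the study, and the toy "diffraction profiles" and function names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean (L2) distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def ward_clusters(vectors, n_clusters):
    """Merge clusters bottom-up, always choosing the pair with the smallest
    Ward cost n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                c_a = centroid([vectors[i] for i in clusters[a]])
                c_b = centroid([vectors[i] for i in clusters[b]])
                n_a, n_b = len(clusters[a]), len(clusters[b])
                cost = n_a * n_b / (n_a + n_b) * squared_distance(c_a, c_b)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(cluster) for cluster in clusters]

# Toy 3-bin "profiles" (made up): two resemble each other, as do the other two.
profiles = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
print(ward_clusters(profiles, n_clusters=2))
```

This version recomputes centroids at every step and scales poorly with the number of samples; for real datasets, `scipy.cluster.hierarchy.linkage(X, method="ward")` together with `scipy.cluster.hierarchy.dendrogram` is the practical route.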
In addition to the hierarchical clustering, we used the kernlab package in R to perform spectral clustering. Spectral clustering uses the eigenvalues of the similarity matrix of the data to divide the samples into K groups, where K is a manually selected integer [54]. To create hierarchical grouping results, we recursively applied the bisectional divide (K = 2) on the larger portion from the previous grouping, and obtained a divisive top-down hierarchical diagram after the clustering.
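The recursive bisection can be sketched in plain Python as below. This is a simplified illustration and not the kernlab routine: it assumes a Gaussian similarity kernel with a hand-picked width `gamma`, obtains the second eigenvector of the normalized similarity matrix by power iteration, and splits on its sign. All function names and the toy points are hypothetical.

```python
import math
import random

def spectral_bisect(points, indices, gamma, iters=300):
    """Split `indices` in two by the sign of the second eigenvector of the
    normalized similarity matrix S = D^-1/2 W D^-1/2 (Gaussian kernel W)."""
    n = len(indices)
    w = [[math.exp(-gamma * sum((a - b) ** 2
                                for a, b in zip(points[i], points[j])))
          for j in indices] for i in indices]
    deg = [sum(row) for row in w]
    s = [[w[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # The leading eigenvector of S is known analytically: D^1/2 * 1 (eigenvalue 1).
    top = [math.sqrt(d) for d in deg]
    top_norm = math.sqrt(sum(t * t for t in top))
    top = [t / top_norm for t in top]
    rng = random.Random(0)  # fixed seed for reproducibility
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        # Power iteration on S + I (the shift keeps eigenvalues non-negative),
        # projecting out the leading eigenvector at every step.
        v = [sum(s[i][j] * v[j] for j in range(n)) + v[i] for i in range(n)]
        overlap = sum(a * b for a, b in zip(v, top))
        v = [a - overlap * b for a, b in zip(v, top)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    left = [idx for idx, a in zip(indices, v) if a < 0]
    right = [idx for idx, a in zip(indices, v) if a >= 0]
    return left, right

def recursive_bisection(points, n_groups, gamma=0.05):
    """Bisect, set the smaller part aside, and keep bisecting the larger part."""
    groups = []
    current = list(range(len(points)))
    while len(groups) + 1 < n_groups:
        left, right = spectral_bisect(points, current, gamma)
        smaller, larger = sorted((left, right), key=len)
        groups.append(smaller)
        current = larger
    groups.append(current)
    return sorted(sorted(group) for group in groups)

# Toy one-dimensional points (made up) forming three unevenly spaced clumps.
points = [(0.0,), (0.1,), (4.0,), (4.1,), (10.0,), (10.1,)]
print(recursive_bisection(points, n_groups=3))
```

Always continuing on the larger portion yields a top-down hierarchy in which small, distinct groups split off first; the kernel width strongly affects where those splits fall and would need tuning on real mXRD vectors.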
First-principles calculation. All density functional theory (DFT) calculations were performed using the Vienna Ab initio Simulation Package (VASP) within the projector augmented-wave approach and with Perdew-Burke-Ernzerhof (PBE) generalized-gradient approximation (GGA) functionals [55-57]. The parameters in static DFT calculations were consistent with the Materials Project [58]. Ab initio molecular dynamics (AIMD) simulations were performed in supercell models using non-spin-polarized DFT calculations with a Γ-centered k-point. The time step was set to 2 fs. The initial structures were statically relaxed and were set to an initial temperature of 100 K. The structures were then heated to the targeted temperatures at a constant rate by velocity scaling over 2 ps. For the estimation of Li-ion diffusion, the NVT ensemble with a Nosé-Hoover thermostat was adopted. The total time of the AIMD simulations was in the range of 100 ps to 1000 ps, continuing until the diffusivity converged. The ionic diffusivity and conductivity were calculated following the method established in a previous study [45].
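The text defers the conductivity calculation to an earlier study. As a hedged illustration of how AIMD diffusivities at elevated temperatures are commonly turned into a room-temperature conductivity, the sketch below assumes an Arrhenius fit of ln D against 1/T followed by the Nernst-Einstein relation; this is our reading of a conventional workflow, not a detail confirmed by the text, and the function names and the synthetic numbers are made up.

```python
import math

K_B_EV = 8.617333262e-5      # Boltzmann constant in eV/K
K_B_J = 1.380649e-23         # Boltzmann constant in J/K
E_CHARGE = 1.602176634e-19   # elementary charge in C

def arrhenius_fit(temperatures, diffusivities):
    """Least-squares fit of ln D = ln D0 - Ea / (kB T); returns (D0, Ea in eV)."""
    xs = [1.0 / t for t in temperatures]
    ys = [math.log(d) for d in diffusivities]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope * K_B_EV

def nernst_einstein(diffusivity, number_density, temperature, charge=1):
    """Conductivity in S/cm from D (cm^2/s) and carrier density (cm^-3)."""
    q = charge * E_CHARGE
    return number_density * q * q * diffusivity / (K_B_J * temperature)

# Synthetic diffusivities generated from D0 = 1e-3 cm^2/s and Ea = 0.25 eV.
temps = [600.0, 800.0, 1000.0, 1200.0]
diffs = [1e-3 * math.exp(-0.25 / (K_B_EV * t)) for t in temps]
d0, ea = arrhenius_fit(temps, diffs)
d_300 = d0 * math.exp(-ea / (K_B_EV * 300.0))   # extrapolate to 300 K
sigma_300 = nernst_einstein(d_300, number_density=1e22, temperature=300.0)
print(f"Ea = {ea:.3f} eV, sigma(300 K) = {sigma_300:.2e} S/cm")
```

Extrapolating over several hundred kelvin presumes a single activation energy across the whole temperature range, so the statistical error of the fit matters as much as the fitted value itself.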

Data availability
The diffraction data and AIMD simulation results are available through the GitHub repository https://github.com/tri-na?tab=repositories. Other data generated and/or analyzed during the current study are available from the corresponding authors on reasonable request.

Code availability
The code used to create the unsupervised learning models in the manuscript is available in the GitHub repository at https://github.com/tri-na?tab=repositories.