Element selection for crystalline inorganic solid discovery guided by unsupervised machine learning of experimentally explored chemistry

The selection of the elements to combine delimits the possible outcomes of synthetic chemistry because it determines the range of compositions and structures, and thus properties, that can arise. For example, in the solid state, the elemental components of a phase field will determine the likelihood of finding a new crystalline material. Researchers make these choices based on their understanding of chemical structure and bonding. Extensive data are available on those element combinations that produce synthetically isolable materials, but it is difficult to assimilate the scale of this information to guide selection from the diversity of potential new chemistries. Here, we show that unsupervised machine learning captures the complex patterns of similarity between element combinations that afford reported crystalline inorganic materials. This model guides prioritisation of quaternary phase fields containing two anions for synthetic exploration to identify lithium solid electrolytes in a collaborative workflow that leads to the discovery of Li3.3SnS3.3Cl0.7. The interstitial site occupancy combination in this defect stuffed wurtzite enables a low-barrier ion transport pathway in hexagonal close-packing.


Machine Learning methodology
We treat the problem of detecting phase fields that contain stable compounds as an instance of the inlier/outlier detection problem, which is tightly connected to anomaly and novelty detection tasks as well. The overview of deep learning approaches for this problem is given in 1 . In our case, inliers (or normal data) are phase fields containing stable compounds, and outliers (or anomalous data) are the phase fields without them. We assume that the outliers are not available to us at all and that we need to rely only on inliers to build a model, bringing us into the unsupervised learning setting. As described in the Section Machine learning models, we select an autoencoder for this task from a range of potential machine learning approaches.
Autoencoders are a class of non-linear dimensionality reduction techniques that are used in an unsupervised manner. An autoencoder trained on normal data should fail to reconstruct the anomalous data samples and produce large reconstruction errors for them. In addition, the variational autoencoder learns a probability distribution from the input normal data, improving its description. We use reconstruction errors to rank unseen phase fields by their closeness to the normal data.
In order to avoid overfitting, we employ the following strategy discussed in detail in the Section Model Validation. We hold-out a portion of our normal data prior to training and compare the distributions of reconstruction errors on the training and hold-out sets after fitting the model.
Generalizability is deemed to be good when the resulting distributions are very similar to each other. In our case, we compare a simple statistic of these reconstruction errors. In general, validation of anomaly detection methods without access to a small sample of outliers is a subject of ongoing research 2 .
In our case, we prefer models with a relatively broad distribution of reconstruction errors, when both average reconstruction errors and validation metrics are comparable. We expect such models to produce a ranking that is more discriminative and stable.
Training data and its representation are discussed in the Section Training set for the Variational Autoencoder Configuration.

Machine learning models
We studied different computationally undemanding machine learning methods 3 for phase field ranking. Each of these methods provides a method-specific metric of dissimilarity between the sample and normal data -the outlier score, which can be used to rank the phase fields. Since these algorithms are stochastic in nature, their outlier scores vary considerably depending on the initialization conditions, e.g., random seed number. In order to alleviate that, we take the mean value of scores across multiple runs as the final score, which requires a large number of iterations to achieve convergence ( Supplementary Fig. 1). Noting that the more computationally involved neural-network-based approaches account for statistical variations by considering multiple models in a large network and a dropout 4 , yielding converged results in every iteration, we employ a neuralnetwork-based Variational Autoencoder (VAE), which uses the autoencoder's reconstruction error as a metric, in our study.
Supplementary Figure 1. Convergence of the different methods′ scores with the number of models used. ML methods used for phase field ranking are histogram-based (HBOS), isolation forest (Iforest), cluster-based (CBLOF), minimum covariance (MCD), connectivity-based (COD) 3 . Unlike the Variational Autoencoder neural network, accounting for statistical variations, up to 1000 iterations (models) are required to achieve convergence with these approaches.

Training set for the Variational Autoencoder Configuration
The open-source software (implementation of VAE and Phase Field Ranking method) developed for this study is available at https://www.github.com/lrcfmd/PhaseFieldRanking. In this paper, we use the VAE to learn the similarities between unexplored phase fields and the reported phase fields in ICSD. This is conducted by learning the multi-dimensional representations of the phase fields and their hidden features in the latent space with a dimensionality of choice in the way that best reconstructs the training data (RE and KL divergence are minimized). There is a small difference in distributions of RE between 2-and 4-dimensional latent space, and a robust distribution of RE that persists in choices of 4, 8, 16 dimensions for the latent space 7 . Thus, we choose 4-dimensional latent space for the model used.

Model validation
We perform 5-fold cross validation. In each of the five iterations, we hold out a different 20% of the training data and train the model on the remaining 80%. We then apply the VAE model to the validation data (the separate 20%), and check that the distribution of scores of the validation set follows the distribution of the training set, ensuring the width, position and distribution of RE around the peak are within the same range -broadly analogous to the cohesion and separation methods for internal validation of unsupervised learning 8 .
To simplify the comparison of RE for different data points, we use the normalised RE, calculated as (RE -REmin)/(REmax -REmin). The validation error is calculated as the percentage of the validation set that is distributed in the second half of the training distribution (i.e., the number of phase fields from the validation set which have the normalised RE higher than the threshold value of 0.5 where IQR is the interquartile range. These data size-adapted bins, with widths 0.04 for training and 0.077 for testing data respectively, allow size-independent comparison of RE distribution in the training and testing data ( Figure 2a in the main text and Supplementary Fig. 3). Such data distribution results in ten data entries located in the first bin of testing data. Figure 3. A histogram of normalised RE (scores) in ICSD and testing datasets. The width of the bins depends on the number of entries in the two datasets, according to Supplementary Equation 1. The scores distributions demonstrate that the patterns for the majority of the data entries are well-recognised by the model (normalised RE < 0.5) with a few outliers.

Feature space
We study different basis sets of features for description of atomic elements built from 39 elemental features 12 Supplementary Fig. 5a. This reveals some strongly positively and negatively correlated pairs of atomic descriptors, including some well-known relations (e.g., density and atomic weight, reverse correlations between covalent radius and first ionisation energy, polarisability and electronegativity 15 , etc.). Supplementary Fig. 5b, however, demonstrates that FRE practically do not correlate with the total RE, suggesting the importance of maximizing the number of wellreconstructed elemental descriptors for detailed representation of the phase fields. We also confirm that in general, maximization of the number of features enhances the dispersion of the total RE between the data points in the model by increasing the width of the RE distribution, thereby improving the precision in discrimination between the phase fields and identification of the best candidates ( Supplementary Fig. 5c).
We validate this elemental feature basis set by developing a Monte-Carlo (MC) approach, in which at every step a basis set of features is augmented or decreased with a randomly selected feature -a move that can be accepted or discarded depending on the VAE validation error produced with this basis (Supplementary Fig. 5d). The MC-derived basis set produces outcomes that are similar to but less effective than the set above (larger validation error, smaller distribution width), supporting the use of the set discussed above in the model.

Synthesis
Samples with composition LiSnS2.33Cl0. 33  hours, and then cooled to room temperature at a ramp rate of 5 °C min -1 . The resulting powder was then manually ground in order to obtain a fine powder. Precursors and resulting powders were handled in an Ar-filled glovebox (O2 < 1 ppm). Instruments. For accurate quantification, data were also collected from standard materials (SnO2, Bi2S3, NaCl standards supplied by the manufacturer) for each element.

Diffraction and refinement
Routine analysis of phase purity and lattice parameters were performed on a Bruker D8 Advance diffractometer with a monochromated Cu source (Kα1, λ = 1.54060 Å) in powder transmission Debye Scherrer geometry (capillary) with sample rotation.
The structural models were refined by the Rietveld method as implemented in the Fullprof suite 16 . For the sake of realism, all uncertainties were increased by Berar's factor (4.5, 3.2 and 3.9 for SXRD, NPD Bank 4 and NPD Bank 5, respectively according to FullProf).
The program FullProf was used to obtain information about the microstructure, following the method described by Rodriguez-Carvajal et al 16,17 . This model uses the Scherrer formula, which considers that the size broadening can be written as a linear combination of spherical harmonics.
Peak shapes were modelled using the spherical harmonics expansions in a hexagonal material with Laue class 6/mmm.

Maximum Entropy Method Analysis of Diffraction Data
The maximum entropy method (MEM) applied to diffraction data consists of optimizing the reconstruction of the scattering density from the observed structure factors by finding the maximum of the informational entropy under several constraints through an iterative procedure 18 . MEM applied to crystallography is a powerful tool for reconstructing scattering density from incomplete and/or noisy data systems and limits termination effects obtained through usual Fourier synthesis, particularly important in disordered systems 19

DC polarisation measurements
For DC polarization measurements, a cylindrical pellet of Li3.3SnS3.3Cl0.7 with a thickness of 1.75(9) mm and a surface area of 0.18 (9)

Ex-situ Raman microscopy and XRD after Li plating/stripping
After completing Li plating/stripping measurements, the symmetric cell was disassembled inside the Ar-filled glovebox. The pellet material was collected and sealed inside a custom-built stainless-steel holder with a CaF2 window. Raman spectra of the cycled and pristine solid electrolyte were collected using a Renishaw In-Via Rama spectrometer equipped with an inverted microscope to provide spatial sensitivity and the ability to focus on individual particles. Spectra were collected with 785 nm wavelength laser at room temperature. Laboratory XRD patterns were collected for material extracted from the bulk of the cycled pellet, and powder scraped from the Li|Li3.3SnS3.3Cl0.7 interfacial region. This interfacial powder was mixed with amorphous boron and sealed in a 0.5 mm diameter borosilicate capillary for measurement.

Nuclear Magnetic Resonance (NMR)
SLR rates in the laboratory frame (T1 -1 ) were obtained using a saturation recovery pulse sequence and the data were fitted to a stretch exponential function of the form: where τ are the variable delays and α is the stretch exponential (values between 0.4 and 1).
SLR rates in the rotating frame (T1ρ -1 ) were recorded using a standard spin-lock pulse sequence at frequencies of ω1/2π ( 7 Li) = 15, 45, and 80 kHz, and data were fitted to a stretch exponential function of the form: where β values are between 0.3 and 1.
Temperature calibrations were performed using the chemical shift thermometers Pb(NO3)2, CuI and CuBr using 207 Pb and 63 Cu NMR. [23][24][25][26] The errors associated with this method were calculated using the broadening of the isotropic peak and ranged from 5 -20 K.

Ranking of the quaternary unexplored Li-M-A-A′ phase fields
For the ranking, we calculated reconstruction errors (RE) of all phase fields of interest as the SnS2/LiCl∼ 0.05), determined by the Rietveld method. The relative weight fractions of impurities derived from Rietveld refinement on these two samples were used to form a system of equations and determine the subsequent composition to make. The intersection of the two lines derived from equations: Li3SnS3Cl -x × (0.9 × Li0.8Sn0.8S2 + LiCl) and Li11SnS3Cl9 -x × (0.05 × SnS2 + LiCl) (short and long yellow lines, respectively, on Figure 2b) gives a composition close to Li3.6SnS3.6Cl0.4.  Figure 6. Laboratory XRD patterns of samples. a #2, b #6 and c #7 made in the Li-Sn-S-Cl phase field (cf. Figure 2b and Supplementary Table 5), synthesized in sealed quartz tubes at 700 °C for 12 h. Known phases are denoted with signs defined in the legend. Phase A is denoted with the " * " sign.

Crystal structure determination
The observed reflections for the SXRD pattern of Li3.3SnS3.3Cl0.7 indicated the presence of a 63 screw axis and a c glide plane, therefore leaving the possibility for the three space groups P63mc, P6 � 2c and P63/mmc (extinction symbol P --c). 36 The presence of a minor impurity of orthorhombic Li4SnS4 was identified in the SXRD pattern and this phase was included in the refinement. The final refinement yields no more than 4 wt% of this phase. The screening of the Fourier difference map using the NPD data, revealed the presence of a negative nuclear density in the octahedral interstice of the anion sublattice ( Supplementary Fig. 7).
A lithium atom was added in this position and its occupancy refined to 0.092(8) while the goodness of fit for the refinement, χ 2 , decreased from 9.53 and 6.90 to 5.69 and 1.59 for Banks 4 and 5, respectively.
A Rietveld refinement using a structural model containing a Li atom on the Tsite was envisaged (constraining the overall occupancy of the T + -Tface shared unit to be less than 1). However, at such low occupancy, the error obtained on the refined sof was too high (sof = 0.01(2)) for this site to be kept in the final structural model. The position, displacement parameters and sof of all atoms were then refined simultaneously constraining the composition to be charge-neutral, with no constraint of full occupancy for the anion site, and this yielded the final model. The outcome of the refinement is presented in Supplementary Table 6 and Supplementary Table 7.
The final refined composition is Li3.41(4)SnS3.29 (3) Supplementary Table 6. Summary of the outcome of the refinement of Li3.3SnS3.3Cl0.7 against synchrotron X-ray powder diffraction (SXRD) and neutron powder diffraction (NPD) data.  The results of these fits are shown in Supplementary Table 9.

Temp /°C
Supplementary Figure 10. a-e I-t curve measured at different steady applied voltages (value in legend) and f I-V curve of the electronic contribution. A fit using Ohm's law yields a total electronic conductivity 1.1(1) × 10 -8 S cm -1 .
Prior to the DC polarization, EIS measurements were performed at room temperature using a Biologic VSP-300 potentiostat/galvanostat, which enables the collection of the impedance spectra down to a low frequency of 1 mHz, permitting visualization the Warburg tail at low frequency region ( Supplementary Fig. 11). Fitting of the data points using a (R-CPE) equivalent electrical circuit yields a total conductivity of 1.0(4) × 10 -6 S cm -1 close to the value obtained using the Keysight instrument (data collected to 20 Hz, cf. Figure 4, main text).

Evidence of ionic conductivity and stability against Li metal through Li plating/stripping in a
Li|Li3.3SnS3.3Cl0.7|Li symmetric cell The voltage profile of the Li|Li3.3SnS3.3Cl0.7|Li symmetric cell upon repeated Li plating/stripping at room temperature, 323 K and 343 K over 400 h, is shown in Figure 4e, and Supplementary Fig.   12a. The transport of Li + through the Li3.3SnS3.3Cl0.7 solid-state electrolyte, as well as across the Li3.3SnS3.3Cl0.7|Li 0 interface, can be confirmed. An initial polarization of ±1.28 V was observed at a low current density of 10 µA cm -2 at room temperature due to the moderate ionic conductivity of the material. At elevated temperatures, the polarization is largely reduced, confirming improved ionic conductivity. The change in overpotential can be used to evaluate the Li|Li3.3SnS3.3Cl0.7 interfacial reactions and possible degradation of Li3.3SnS3.3Cl0.7. During the initial 50 h, the overpotential rises gently, in line with the increased diameter of the semicircle over cycling observed from the EIS ( Supplementary Fig. 12), indicating the formation of an interphase layer. Interestingly, the overpotentials remain constant over the subsequent plating/stripping, suggesting a stable interphase layer is formed. This promising result is in contrast with what has been observed in the Sb-and Asdoped Li4SnS4 material, 39,40 which showed irregular voltage profiles over 20 h. After galvanostatic stripping and plating, the symmetric cell was disassembled. At the Li|Li3.3SnS3.3Cl0.7 interface, the colour of Li3.3SnS3.3Cl0.7 pellet became darker. XRD measurements of Li3.3SnS3.3Cl0.7 material extracted from the bulk pellet and also from this interfacial region show negligible change when compared against the as-synthesised powder ( Supplementary Fig. 13). No difference can be seen from the Raman spectra between the cycled bulk pellet and the pristine Li3.3SnS3.3Cl0.7 ( Supplementary Fig. 14). However, the dark colour powders scraped from the Li surface show new Raman bands, which can be tentatively assigned to S-S-S bending (i.e., S8, 158/215 cm -1 ), 41 Sn-S (185 cm -1 ), 42 and SCl2/S3 -(~525 cm -1 ) 43 that have been observed at the interface between Li 0 and sulfide-based solid state electrolytes. 41 The mixed S-Cl anion material enhances stability properties against Li metal compared to the monoanionic sulfide Li4SnS4 without degrading the ionic conductivity.
Supplementary Figure 12. Galvanostatic Li plating/stripping of a symmetric Li|Li3.3SnS3.3Cl0.7|Li cell. a Enlarged snapshots of voltage profiles at the three studied temperatures. b Electrochemical impedance spectra collected at 50 cycle intervals for the cell shown in Figure 4e Supplementary Figure 13. Ex situ investigation of Li3.3SnS3.3Cl0.7 after galvanostatic Li plating/stripping. Laboratory XRD patterns show comparison between as-synthesized Li3.3SnS3.3Cl0.7, material extracted from the bulk of the cycled pellet, and powder which was scraped from the interfacial region of the Li|Li3.3SnS3.3Cl0.7|Li cell. This interfacial powder was mixed with amorphous boron (B) and sealed in a 0.5 mm diameter borosilicate capillary for measurement. Figure 14. Ex situ Raman spectra of Li3.3SnS3.3Cl0.7 after galvanostatic Li plating/stripping. Raman microscopy of the as-synthesized pristine Li3.3SnS3.3Cl0.7 (black), the bulk solid electrolyte after cycling (blue), and of the small dark particles within the bulk cycled powder visible under the microscope (red). No peaks were observed in the higher wavenumber regions 700-4000 cm -1 for any of the studied materials.

Nuclear Magnetic Resonance Spectroscopy
In static 7 Li solid-state NMR spectra the absence of mobility is displayed through a broadening of the 1/2 ↔ −1/2 central transition arising from the strong homonuclear 7 Li -7 Li dipolar coupling interactions. At low temperatures the material is said to be in the rigid lattice regime, in this regime spectra dominated by dipolar broadening are observed. This effect is present at approximately 165 K, where the line width of the central transition is approximately 5.2 kHz for Li3.3SnS3.3Cl0.7.
As the temperature is increased the dipolar interactions are continuously averaged due to the increasing motion of the Li spins causing the spectra to narrow significantly ( Supplementary Fig. 9).
This effect can be seen by plotting the full width half maximum of the peak in function of the temperature (Figure 4b), where the onset of motional narrowing, Tonset occurs at around 200 K. As Li3.3SnS3.3Cl0.7 appears to be in the fast-motional regime at room temperature, this is a reasonable explanation for the absence of two distinct resonances attributed to the material in the 6 Li MAS NMR spectrum. At room temperature rapid exchange between tetrahedral and octahedral Li sites will occur on the NMR timescale leading to the observation of a single resonance.
The temperature dependence of the 7 Li SLR in the laboratory (T1 −1 ) and rotating frame (T1ρ −1 ) under static conditions were monitored using the saturation recovery and spin lock pulse sequences.
These measurements were performed in order to obtain values for the activation energy, conductivity and dimensionality of the Li diffusion. The SLR T1ρ −1 rates are purely induced by diffusion processes initially increasing with temperatures above room temperature before decreasing thereby passing through a maximum at temperatures that are characteristic of the Li correlation rates  (Figure 4c). The solid lines correspond to linear fits of (τ/ω) 0.5 and τln(1/ωτ ) for one and two-dimensional diffusion, respectively. The frequency dependence of the SLR T1ρ −1 rates (also shown in Figure 4c) clearly rules out the possibility of three-dimensional diffusion. Errors in the spinlock frequencies ω1 are estimated to be 10% while the errors in the correlation time τ are extracted from the fit in Supplementary Fig. 16. Errors in T1ρ -1 are obtained from the outputs of the fits to Supplementary Equation 3.

Exploration of Li-Mg-S-Cl quaternary phase fields Computational and Experimental Results
Li-Mg-S-Cl is ranked #4 in the list of unexplored phase fields by the variational autoencoder (Supplementary Table 3). Compositions in this phase field were sampled computationally prior to synthesis using a method similar to that described in the main text, identifying a low energy region which lead to the discovery of new compounds ( Supplementary Fig. 18). The variational autoencoder identified and ranked highly this series of compounds that had not been discovered in the previous four decades since the initial report of ionic conductivity in Li2MgCl4. 50 Fig. 19). A solubility limit exists within the new Li2-2xMg1+x+yCl4-ySy series. Six compositions were synthesized using the conditions described above, shown by blue squares in Supplementary Fig.  20a. Synthesis attempts at other compositions (green squares and triangles in Supplementary Fig.   20a) were unsuccessful, resulting in significant amounts of MgS or MgCl2 as secondary phases and no further change in lattice parameter of the majority cubic phase. Quenching of powder samples by placing sealed tubes into a water bath from synthesis temperature, rather than cooling at a controlled rate, was used to promote further solubility of Mg through entropic mechanisms. This enabled the isolation of two further compositions (cyan squares in Supplementary Fig. 20a)