## Introduction

Energy storage is a key technology to meet growing energy demand by harnessing renewable sources. Liquid electrolyte-based lithium ion batteries have been extensively deployed in the portable electronic and electric vehicle markets. Alternative batteries that utilize solid state electrolytes (SSEs) avoid the safety issues associated with organic liquid electrolytes and offer high energy density by enabling the use of a lithium metal anode1,2. The most significant obstacle to the adoption of SSEs is the realization of solid-state materials with the full suite of required properties, including sufficiently high ionic conductivity, stability against both lithium metal and the oxidizing cathode material (in practice this is often kinetic and associated with the formation of stable electronically insulating interfaces) together with appropriate mechanical properties3. As such, considerable research has been devoted to the discovery and development of SSEs that meet these requirements4,5.

The amount of time and effort required to discover a suitable material in any domain has driven the application of machine learning methods to predict material properties6. Recent works have used previously published data7,8 to train machine learning models and predict the ionic conductivity performance of materials using only their composition9. This approach is limited by the quality and quantity of the data available to train models. Literature reports in materials science tend to focus on subsets or particular families of materials with favourable or promising properties, leading to many reports on a limited range of materials10,11. While natural language processing (NLP) tasks have access to billions of training examples, in experimental materials science even large datasets typically contain fewer than 10,000 entries12. Due to these comparatively small training sets, it is imperative that the highest quality data are used to avoid providing inaccurate data to predictive models. As there are no large repositories of experimental ionic conductivities currently available for solid lithium ion conductors to perform a machine learning investigation, the first step must be sourcing high quality data.

Machine learning models for material’s figure-of-merit performance can be built from knowledge of either the composition alone, or the structure and composition. While models built from knowledge of both structure and composition are generally superior in performance, composition-only models are important both for general reasons and for specific considerations relevant to lithium ion conductors. The experimentally measured conductivity of a material derives from its non-averaged structure which is defined by its composition. This will include structural defects that cannot be captured fully in an average crystal structure recorded in a database such as the inorganic crystal structure database (ICSD), unless the material is fully ordered without fractional site occupancy or substitutional disorder. Most structures with lithium ion conductivity that have been reported in detail (i.e., with the lithium positions) exhibit considerable disorder of this type. Even the average structure is unavailable for potential compositions that have not been experimentally studied, and in addition, many experimental reports of ionic conductivity give composition but not structural analysis of the materials investigated. Reported average crystallographic structures for lithium ion conductors frequently do not give precisely determined lithium positions because of the low X-ray scattering power and extensive structural disorder, again raising the important technical question of the connection between the potentially decisive local structure and the crystallographically-determined average structure. We thus build a dataset for machine learning models to predict lithium ion conductivity based on composition. There will be limitations of this approach, for example, the model will be unable to discriminate between polymorphs of a given compound. 
Nevertheless, crystal structure is not always known nor can it be for entirely novel compositions, thus a compositional model with low computational requirements is necessary for screening unexplored chemical space. The most direct measurement of the ionic conductivity of a material is via a.c. impedance spectroscopy (ACIS) measurement, usually on a dense ceramic13. All of the ionic conductivities for the materials included in this database were measured via ACIS.

For a specialist domain topic like solid electrolyte chemistry, the task of digesting the presented information requires significant expertise. Throughout the literature, there are inconsistencies in how data are presented, which introduces difficulties when comparing different reports. A broad knowledge of the background literature is essential for recognizing potentially problematic experimental procedures affecting both composition and conductivity, uncovering discrepancies in reported data, and identifying materials and properties that have in fact been computationally derived rather than experimentally measured (which, unfortunately, is not always clearly stated in the body of the text). All of these challenges increase the difficulty and time required to construct a high-quality database of experimentally reported data.

Leading NLP approaches have demonstrated their capability to extract chemical data from the extensive corpus of past scientific literature14, a process referred to as automated scraping. Text mining has been demonstrated to be a powerful tool in creating materials datasets. For example, Court and Cole15 created a dataset of materials and their associated magnetic ordering temperatures. This is possible as a magnetic ordering temperature is reported as a single number usually in the text. Unfortunately for ionic conductors, the task of finding and pairing compositions, temperature of measurement, and conductivities is too complex even for state of the art NLP techniques to be effective. There are the standard issues of tokenizing chemical formulae consistently, and parsing correct values in text and tables. In particular, for ionic conductors with a non-crystalline component, the composition is reported as a mixture of reactants rather than a stoichiometric chemical formula. Furthermore, as the vast majority of reported data is presented in figures with no standardized units for conductivity and extreme heterogeneity between entries, extracting relevant data is a combined challenge in both the fields of NLP and computer vision. Accordingly, the creation of a reliable database is unattainable with present automated capabilities, and thus a manual approach is employed here.

Previous investigations have predicted the ionic conductivity of solid-state materials using statistical methods. Due to the aforementioned difficulties in gathering initial datasets of sufficient size and quality, these approaches build models that are based on relatively small experimentally-derived datasets (of the order of 40–82 entries)8,9,16. In this study, we have reviewed the literature to gather a dataset of experimentally reported solid-state lithium ion conductivities which, with 403 unique compositions, is an order of magnitude larger than previously available. A statistical overview of the dataset is presented, with the range of conductivities examined for each structural prototype. Unsupervised embedding and clustering techniques are used to partition this dataset into nine families by compositional similarity, thus assessing the diversity of the dataset. We develop supervised regression and classification models to predict the lithium ion conductivity and assess whether a material will possess an ionic conductivity log10(σ) ≥ −4 at room temperature, where the conductivity is reported in units of S cm–1. The best regression models achieve a mean absolute error for log10(σ) of 0.85, and the best classification models have a Matthews Correlation Coefficient (MCC) of 0.63, assessed under k-fold cross-validation in both cases.

## Results and discussion

### Database construction

A large collection of solid-state lithium electrolyte literature was gathered, and the ionic conductivities were extracted for the materials reported in each study. The experimental procedures in a given source were critically assessed to understand how each sample was synthesized, characterized, and processed into a ceramic. We ensured that, in each of the studies, samples had clearly defined compositions and that the reported conductivities were direct measurements taken via ACIS. The values of ionic conductivity in the database are a mixture of bulk and total values, as the two are not always distinguished, with only a small number of studies providing sufficient detail in labelling the reported values as such. Where the exact stoichiometry was unclear from the given reagents, any studies that lacked supporting characterization (such as ICP analysis) to confirm the presence of lithium were discarded. The ionic conductivity and material composition are of equal importance in the database, as the predictive models are constructed with these two variables. By ensuring that data are exclusively gathered from experimental studies of high calibre, we gain confidence in the quality of the results of subsequent machine learning analysis. Gathering each conductivity typically requires extracting the values from an Arrhenius plot and converting each value from the plotted units (commonly plotted as either σ in S cm–1 or S m–1, log10(σ), log10(σT), or ln(σT)) to conductivity in S cm−1 at a specific temperature. In some reports these values may also be provided in tables, or stated in the main body of text along with supporting discussion, allowing for cross-checking of the reported value.
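The unit conversion described above can be sketched as follows. This is a minimal illustration rather than the procedure used to build the database: the function names and the `kind` labels are hypothetical, and for the σT forms it is assumed that σ is plotted in S cm−1 and T in kelvin.

```python
import math


def temperature_from_inverse_axis(x):
    """Temperature in K from an Arrhenius x-axis value given as 1000/T."""
    return 1000.0 / x


def to_s_per_cm(value, kind, temperature_k=None):
    """Convert a value read from a plot to a conductivity in S cm^-1.

    kind is a hypothetical label for the plotted quantity:
    'S/cm', 'S/m', 'log10_sigma', 'log10_sigmaT' or 'ln_sigmaT'.
    The sigma*T forms assume sigma in S cm^-1 and T in K.
    """
    if kind == "S/cm":
        return value
    if kind == "S/m":
        return value / 100.0  # 1 S m^-1 = 0.01 S cm^-1
    if kind == "log10_sigma":
        return 10.0 ** value
    if kind in ("log10_sigmaT", "ln_sigmaT"):
        if temperature_k is None:
            raise ValueError("temperature_k is required for sigma*T quantities")
        sigma_t = 10.0 ** value if kind == "log10_sigmaT" else math.exp(value)
        return sigma_t / temperature_k
    raise ValueError(f"unknown plotted quantity: {kind}")
```

For instance, a point read at log10(σT) = 2.0 and T = 400 K corresponds to σ = 100/400 = 0.25 S cm−1 under these assumptions.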

The first stage of the initial literature review was carried out by an undergraduate student to collate source papers of reported conductivities from keyword searches using search engines, and reviews of the field16,17,18,19,20. This survey focussed on tabulating the physical properties reported in each paper: composition, ionic conductivity, temperature at which the conductivity was measured, activation energy, and structural prototype. Following this initial tabulation, the activation energy was excluded from the final database as it is not reported frequently enough to warrant inclusion.

Owing to the complexities described above, further expert validation of the data was required. The ionic conductivity of a material is typically determined using ACIS, although it can also be calculated through molecular dynamics simulations21, or examined by NMR diffusion experiments22, ion migration studies23, or entirely different measurements not directly related to ion transport (e.g., maximum entropy method analysis of diffraction data24). Even experimental papers which report a measured conductivity for a material through ACIS may themselves involve a variety of measurements and sample preparations, creating uncertainty around reported values. Postgraduate and postdoctoral researchers with more than two years' direct experience of battery research and a broad knowledge of the background literature assessed experimental procedures, consistency in sample preparation, quality, and other aspects of the reported data based on the details provided. Each researcher handled a selection of entries and was tasked with validating the database entry against the source report.

Dealing with such a large table of data in spreadsheet form adds significant challenges. Specifically, working with an online spreadsheet directly with twenty researchers leads to issues with version conflicts, edit histories, issues with concurrent user access, merging changes from multiple users, as well as assigning and tracking tasks. These issues were avoided by reducing the individual tasks to their core components through a bespoke interface developed with the streamlit prototyping library, shown in Supplementary Fig. 1. The interface was created to present a single entry from the database with its composition, associated conductivity at a specific temperature, and source paper. For each entry, the researcher was tasked with evaluating the conductivity at that specific temperature, making note of any mistakes with the composition, and reported conductivity or temperature from the source. Positive feedback to researchers was provided through the presentation of a unique compliment provided by a GPT-2 transformer based language generation model25,26, displayed to the researcher after evaluating and recording each entry.

### Database overview

A database was created with 820 entries collected from 214 sources; each entry contains the ionic conductivity of a chemical composition at a specific temperature, ranging from 5 to 873 °C, with an expert-assigned structural label. There are 434 different entries (Table 1) in the database for ionic conductivities experimentally measured at room temperature (15–35 °C). For a further 31 materials, the room temperature conductivities are extrapolated from measurements above room temperature, to obtain a dataset of 465 entries, with 403 unique compositions, as 37 room temperature compositions have conductivities extracted from multiple reports. The room temperature conductivities span the range of 5.00 × 10–16 to 2.50 × 10–2 S cm–1, with a mean log10(σ) of −5.01 and median of −4.41 (Fig. 1). The distribution of conductivities in this dataset and the associated standard deviation are estimated by optimizing the parameters of many probability distribution functions using the Fitter library (github.com/cokelaer/fitter); the distribution which fits the data with the lowest error is an asymmetric Laplace distribution. The interquartile range (50% of the data; materials from the 25th to the 75th centile of log10(σ) in the dataset) spans from −7.30 to −3.03.

During database construction, each material in the dataset was manually allocated a label, based on the structural prototype the material belongs to. If the material structure was not discussed directly in the text and its family could not be deduced with reasoning, then this composition was assigned the structural label of Other. The breadth of structural chemistry encompassed by this dataset is shown by the fifteen unique families present in this set of expert-curated labels (Supplementary Table 1), which can be used to partition this database and expose trends that have been reported in the literature.

In Fig. 2 the distribution of log10(σ) for each structural family for which room temperature data is available, has been created by fitting a density kernel to the conductivities. This consists of placing a Gaussian distribution of fixed height and width at the x co-ordinate for each conductivity, and summing these together to approximate the probability density, allowing us to estimate the spread of reported conductivities. Irregular distributions with long tails are observed for some structural families. As the majority of these sets contain fewer than 50 reported materials, reports of materials with higher conductivities in the literature will lead to anthropogenically biased distributions27. Anthropogenic bias is inescapable when constructing a dataset of experimentally measured properties from the literature. The reduced scientific interest in undertaking the lengthy characterization of materials with little importance to electrolyte chemistry, has meant that materials with very low or negligible conductivity are underreported. Distributions will be skewed towards conductivities of interest, and thus not truly representative of the underlying chemistry.
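The kernel density construction described above can be illustrated with a short sketch. The bandwidth of 0.5 (in log10 units) is an assumed value chosen for illustration; the bandwidth used for Fig. 2 is not stated here.

```python
import math


def gaussian_kde(samples, grid, bandwidth=0.5):
    """Sum one fixed-width Gaussian per sample, evaluated at each grid point.

    samples: log10(sigma) values; grid: x values at which to evaluate.
    bandwidth is an assumed, illustrative width. The prefactor normalizes
    the summed curve so that it integrates to one.
    """
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    density = []
    for x in grid:
        total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        density.append(norm * total)
    return density
```

A wider bandwidth smooths the curve at the cost of blurring multi-modal structure, which matters for families with few reported members.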

The room temperature dataset predominantly consists of NASICON, garnet, perovskite, glass, thio-LISICON, and LISICON type materials, each with more than 27 members. The anion chemistries of the materials are provided in Table 2, showing that 75% of the materials in the database are pure oxide compounds (consisting of 44% NASICON, 19% garnet, 18% perovskite, and 8% LISICON type materials), 12% are pure sulfides, and 2% are pure halide compounds. Mixed anion materials (oxyhalides, oxysulphides, etc.) make up 11% of the materials included (46% of these are argyrodites such as Li6PS5Cl, and 16% are antiperovskites such as Li3OCl). In general, materials containing sulfur as an anion exhibit higher minimum and maximum conductivities which is supportive of the outlook that is commonly encountered in the literature that sulfides exhibit the highest Li ion conductivities.

### Machine learning

With a database of materials gathered, unsupervised or supervised machine learning (ML) may be applied to these compositions to extract chemical trends. Unsupervised learning involves the application of embedding and clustering techniques based on the elements in the material, with no further knowledge of chemical properties such as conductivity required. Unsupervised techniques are beneficial as they do not require time-intensive labelling, and may highlight trends and similarities that are not immediately apparent from a large collection of data in a table. Unsupervised clustering has successfully been applied in previous investigations to cluster electrolyte materials8 based on crystal structure, through hierarchical clustering applied to the anionic frameworks of 528 lithium-containing structures from the ICSD. Conversely, supervised techniques attempt to fit a predictive function for a property to chemical descriptors, such that the property can be predicted for a new material by statistical learning from known examples in a given training set. Here, machine learning is applied to compositional descriptors to predict each material’s room temperature lithium ion conductivity (a regression task), or to predict whether each material possesses a room temperature lithium ion conductivity log10(σ) ≥ −4 (a classification task).
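As a small illustration of the classification task and of the Matthews Correlation Coefficient used to score it, the sketch below labels materials against a threshold of log10(σ) = −4 (σ in S cm−1) and computes the MCC from the confusion matrix. The function names are hypothetical, and returning 0 when the denominator vanishes is a common convention assumed here.

```python
import math


def is_high_conductor(log10_sigma, threshold=-4.0):
    """Binary class label: True if log10(sigma / S cm^-1) >= threshold."""
    return log10_sigma >= threshold


def matthews_corrcoef(y_true, y_pred):
    """MCC from the binary confusion matrix; 0.0 if the denominator is zero."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike accuracy, the MCC remains informative when the two classes are imbalanced, since it uses all four cells of the confusion matrix.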

In our previous work, we introduced the Element Movers Distance (ElMD)28 as a metric to quantify the similarity between two chemical formulae. This was demonstrated to be an expressive measure of chemical similarity that aligns with domain expert judgement. This metric can be combined with unsupervised dimensionality reduction and automated clustering to present chemical composition data to those who study these spaces. This brings high-dimensional compositional spaces into concise structured representations, such as maps, that can be interpreted by humans. In doing this, the landscape of known compositions can be categorized according to our knowledge of related materials. Following the methods described previously with the ElM2D plotting library (github.com/lrcfmd/ElM2D), we construct a distance matrix of ElMD scores between the compositions in the ICSD (2021)29 and the compositions contained within the ionic conductors database here. This metric space is reduced to two dimensions with principal component analysis (PCA) (Fig. 3). A Gram centred matrix30 is first obtained from the given distance matrix, and singular value decomposition of the Gram matrix is then carried out to obtain the coordinates of each point projected onto the first two principal components. PCA linearly scales each metric distance to maximally preserve each of the interpoint relationships across the dataset, which has previously been shown to closely reflect the true structure of the metric space28. Figure 3 thus represents the distribution of this dataset in the compositional space of the materials constituting the ICSD.
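The step from a distance matrix to two-dimensional coordinates can be sketched with NumPy as below. This is a generic double-centring construction followed by singular value decomposition, offered as an illustration under the assumption that it mirrors the procedure described; it is not the ElM2D implementation, and the function name is hypothetical.

```python
import numpy as np


def pca_from_distances(dist, n_components=2):
    """Embed points from a pairwise distance matrix.

    The squared distances are double-centred to give a Gram matrix,
    whose singular value decomposition yields the coordinates along
    the leading components. For a non-Euclidean metric the Gram matrix
    may have negative eigenvalues, which this sketch does not treat
    specially.
    """
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ d2 @ centring
    u, s, _ = np.linalg.svd(gram)  # singular values sorted in descending order
    return u[:, :n_components] * np.sqrt(s[:n_components])
```

For distances that come from points in a plane, the returned coordinates reproduce the input distances exactly, up to rotation and reflection.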

The lithium-containing compounds of the ICSD are highlighted against other compositions of the ICSD and the 455 unique compositions from our entire database (i.e., compositions with data recorded at any temperature) in Fig. 3a, with the expert-curated labels of the structural families included in the lithium conductors database in Fig. 3b. Though structure has not been included in the initial representation, expert-identified structural families tend to cluster in this compositional embedding, reflecting the connection between composition and structure. Perovskites (Supplementary Fig. 2), NASICONs (Supplementary Fig. 3), thio-LISICONs, and garnets are found in distinct areas of the compositional map; each of these structural families is grouped tightly on the map, despite the absence of structural information (Fig. 3b). The lithium ion conducting materials in the database are found in the same regions of compositional space as known lithium compounds, and match the diversity of lithium chemistry that has been explored to date reasonably well. This reflects the anthropogenic bias intrinsic to the research process, as much of the work devoted to discovering new lithium-containing materials has been driven by applications in battery technologies. There are a number of areas of accessible lithium-based chemistry (compounds seen on the right-hand side of Fig. 3a) where known materials appear underexplored with regard to ionic conductivity. This compositional space should be considered in the search for new families of lithium ion conductors.

Previous work has shown that, while PCA gives an accurate realization of compositional space with respect to ElMD28, it is not the best representation for further processing with automated clustering techniques. The compact and concentric patterns that these clusters follow are difficult to unravel both visually and algorithmically, particularly when framed against the noise of so many unrelated compounds. We find that non-linear dimension reduction techniques attain a much clearer separation of the space into distinct regions of compositional similarity, which can be clustered more consistently (Fig. 4). Uniform manifold approximation and projection (UMAP) draws apart the points of a space by first forming a neighbourhood graph of points in the metric space then embedding this graph to a two-dimensional plane of projection via Laplacian Eigenmaps to capture global information31. These 2D distances are then refined through a ball and spring model32 to capture the local intricacies of the metric space.

UMAP (Fig. 4a, b) and PCA (Fig. 4c, d) are applied to evaluate the reduced space of the 403 compositions of room temperature solid state lithium ion conductors in the database reported here. The UMAP plot contains several clear regions, which can be separated into nine distinct clusters using the density-based spatial clustering of applications with noise (DBSCAN) algorithm33 with an epsilon radius of 4 (Fig. 4a). The epsilon value determines the radius of disks that are overlaid on every point in the two-dimensional plot, which are then used to classify the points into different clusters. If two points cover each other with overlapping disks, then these will be assigned the same cluster label. DBSCAN has the ability to capture dense regions of an embedding, but if epsilon is too large then the output will fail to separate disjoint clusters. In this study, epsilon was chosen manually to maximize consistency between automated clusters and the clusters that can be visually observed.
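The role of epsilon can be illustrated with a simplified sketch that joins any two points lying within a distance epsilon of each other and labels the resulting connected groups. This is deliberately not the full DBSCAN algorithm: DBSCAN additionally requires a minimum number of neighbours for a point to seed a cluster and can mark isolated points as noise (for example, scikit-learn exposes these as the `eps` and `min_samples` arguments of `sklearn.cluster.DBSCAN`). The function name below is hypothetical.

```python
import math


def cluster_by_radius(points, eps):
    """Label 2D points so that points within eps of each other share a label.

    Simplified stand-in for DBSCAN (no minimum-neighbour rule, no noise
    label), implemented with a union-find structure over point indices.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)

    # Relabel roots as consecutive integers in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(points))]
```

With the toy points `[(0, 0), (1, 0), (10, 0), (11, 0), (30, 0)]`, an epsilon of 4 separates three groups, whereas an epsilon of 25 merges everything into one, showing how too large a radius fails to separate disjoint clusters.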

Each of these unsupervised ML-derived clusters from Fig. 4a are chemically reasonable, with clear stoichiometric substitutions or structural similarities connecting their constituents. This becomes apparent from comparison with the expert-derived structural family labelling in Fig. 4b, d. For example, Clusters 0 and 8 from the automated clustering are predominantly populated by NASICONs, perovskites are exclusively found in Clusters 5 and 6, whereas Cluster 4 is almost exclusively garnet structure materials. In addition to the practical benefits automated embedding and classification provides to rationally organize materials with minimal human bias, these clusters have further application in supervised training. As some data must be withheld from training and retained to test the performance of a trained model, each DBSCAN-derived cluster will be used as a testing set in a process referred to as Leave One Cluster Out Cross Validation (LOCO-CV). These clusters range in size from 6 materials to 93 materials, with the training set then typically containing 85–90% of the available data to train each model. The distributions of log10(σ) for each LOCO cluster have been plotted in Supplementary Fig. 4, with basic statistics given in Supplementary Table 2, where many of the clusters span similar ranges of conductivity. Given the intra-cluster chemical consistency and inter-cluster dissimilarity, these assessments are a measure of how each model performs at predicting the ionic conductivities of materials that are chemically dissimilar from those on which the model has been trained.
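The LOCO-CV splitting scheme can be sketched as a generator that holds out one cluster at a time. The function name and the index-based return format are illustrative choices, not taken from the study's code.

```python
def loco_cv_splits(cluster_labels):
    """Yield (held_out_cluster, train_indices, test_indices) for LOCO-CV.

    cluster_labels: one cluster label per material, e.g. from DBSCAN.
    Each cluster serves as the test set exactly once, and all remaining
    materials form the training set for that fold.
    """
    for held_out in sorted(set(cluster_labels)):
        test_idx = [i for i, c in enumerate(cluster_labels) if c == held_out]
        train_idx = [i for i, c in enumerate(cluster_labels) if c != held_out]
        yield held_out, train_idx, test_idx
```

Because the folds follow chemical clusters rather than a random shuffle, fold sizes are unequal and each test fold is chemically distinct from its training set.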

### Supervised learning

A dataset of 403 entries is constructed, where compositions with duplicate room temperature conductivities from differing sources have been represented by the median of these multiple reported conductivities. With this dataset in hand, we apply the best available ML models that can be implemented with minimal modification, i.e., off the shelf. This is done with traditional statistical learners (ensemble models) with mat2vec14 composition-based feature vectors34, and deep learning techniques (CrabNet). For statistical learners, we wish to ensure the best models and associated hyperparameters are chosen, so that we do not simply overfit to one portion of the data. A simple model with fixed hyperparameters is not guaranteed to give good predictions on unseen compounds. Such models may overfit to the training data, leading to poor predictions on unseen compositions, or give exceptional performance on certain subsets of the data with poor performance on the rest. Some of the issues of overprediction can be remedied by surveying a range of statistical models35. State of the art techniques for predicting materials properties through composition apply this principle by training an ensemble of models, in the belief that each model will learn to focus on a different set of features. The predictions of each individual model are combined, which tends to give more robust predictions across the entire domain. In statistical models, the ensemble approach is notably used in the random forest (RF) algorithm36, where large ensembles of decision trees are randomly constructed and kept or discarded depending on their predictive quality. The resulting quality of RF predictions depends on the values of each hyperparameter chosen when initializing the model, and poor choices can lead to very poor models. 
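The de-duplication step at the start of this section can be sketched as follows. Taking the median in log10 units is an assumption made for this illustration (for an even number of reports, the median of σ and of log10(σ) differ), and the conductivity values in the usage note are invented placeholders rather than database entries.

```python
from collections import defaultdict
from statistics import median


def aggregate_duplicates(entries):
    """Collapse repeated compositions to the median reported value.

    entries: iterable of (composition, log10_sigma) pairs.
    Returns a dict mapping each unique composition to the median of its
    reported log10(sigma) values (assumed aggregation scale).
    """
    grouped = defaultdict(list)
    for composition, log10_sigma in entries:
        grouped[composition].append(log10_sigma)
    return {comp: median(values) for comp, values in grouped.items()}
```

For example, three placeholder reports of −3.5, −3.3 and −6.0 for one composition collapse to −3.5, so a single outlying report does not dominate the training label.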
To alleviate this, best practice has traditionally focussed on trialling a range of hyperparameters in combination with one another, but this is time consuming and does not guarantee that the optimal configuration will be found. More recent AutoML approaches37 improve on this by framing the choice of statistical model and its associated hyperparameters as a meta-problem to be solved. Many separate algorithms and hyperparameters can be trialled and assessed in combination, with the measured performance used to update a selection policy for future trials until optimal combinations are found.

In AutoSklearn38, many types of models and data pre-processing stages from the scikit-learn library are chained together to form data processing pipelines. The supplied training data is shuffled into k-folds cross-validation sets and used to assess each pipeline, with the performance noted. This performance is used to update the parameters of a tree-based Bayesian optimization selection policy, which will decide the models and hyperparameters to choose in future iterations, alternating between exploring untried combinations, and exploiting relationships known to give good results. Given that RFs return more robust predictions through ensembling many weaker models together, we would expect an ensemble of effective models to give even stronger predictions. As simple models are quick to train, thousands of pipelines can be evaluated during the AutoSklearn training process. After the allotted training time of ten minutes, the 50 pipelines with the highest performance are selected to form a trained ensemble which can be used to predict unseen data.

In comparison, Compositionally Restricted Attention Based Networks (CrabNets)39 are an implementation of the transformer model40 of deep learning. Here, self-attention is employed to learn how relationships between each of the elemental vectors in a composition are aligned with a target property. The transformer’s positional encoder is repurposed as a fractional encoder to capture the ratio of each element in the composition, which enables CrabNets to capture similarities and small variations in stoichiometry with precision. This is particularly relevant for ionic conductors, where minor substituents (e.g., those controlling the exact lithium content) can significantly influence the ionic conductivity because they determine the defect concentrations and associated local structure that can govern ionic motion.

One shortcoming of deep neural networks such as CrabNets is that they require large quantities of training data which are typically unavailable for materials science problems. This limitation can be alleviated by transfer learning, which involves pretraining networks on much larger datasets of compounds and their associated properties, such as the computed energy of formation. The trained parameters of this network can be exported to initialize future models for different properties, as opposed to initializing all of these values randomly. The desired benefit of pretraining the network on a wider range of compositions and their associated formation energies, is that the knowledge of chemical relationships absent in our training set can be extrapolated to future predictions. By transferring this knowledge from another domain, the most salient chemical relations are intended to be well represented in the network. This typically leads to a faster convergence to the optimal value when training the neural network on the desired property, and can lead to improved predictive performance in the target domain. This has been demonstrated in other investigations41,42, where the application of transfer learning and neural networks has achieved state of the art for materials property prediction. In this work we compare the performance of AutoSklearn ensembles, randomly initialized CrabNets, and CrabNets that have been pretrained on compositions and their formation energies from the OQMD43.

Training CrabNets involves iteratively updating many model parameters of the network on the same dataset multiple times; each iteration is called a training epoch. Once an iteration has completed, the millions of model parameters will have been more finely tuned to align the data with the target property, which should give a better model than the previous iteration. When model training begins, we expect poor performance when predicting properties of materials in the test set, but as the model is further biased by training data after several epochs, more robust predictions should be attained. In general, when training neural networks, the training error steadily decreases over time, as the parameters of the model get more aligned with the input. After prolonged training, however, these parameters begin to overfit to the training data, and the model gets steadily worse at predicting anything outside the training set44.

The training and testing performance at each epoch can be plotted on a training curve, which characterizes how performance evolves with the number of training epochs (Supplementary Figs. 5–8). A training curve can be used to determine the optimal training time (e.g., number of epochs). Model parameters can be exported from the training epoch that displays the best performance at test set predictions. Training for a sufficiently long time (to see degradation in test set performance) and then reverting to an earlier state in training is referred to as early stopping, in contrast to deciding the number of training epochs a priori, or training indefinitely. Early stopping across 500 training epochs is applied in this study, with each model taking the optimal set of training weights, giving a reasonable measure of how CrabNets with and without transfer learning perform using standard hyperparameters (discussed in Supplementary Note 1).
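The early stopping procedure can be sketched generically as a loop that keeps a copy of the model state from the epoch with the lowest held-out error. The callables `train_one_epoch` and `evaluate` are hypothetical placeholders for a real training step and a held-out scoring function; this is not the CrabNet training code.

```python
import copy


def train_with_early_stopping(model_state, train_one_epoch, evaluate, n_epochs=500):
    """Train for n_epochs and return the state with the lowest held-out error.

    train_one_epoch(state) -> new state; evaluate(state) -> error on the
    held-out data. Both are placeholders supplied by the caller.
    Returns (best_state, best_epoch, best_error), with epochs counted from 0.
    """
    best_state, best_epoch, best_error = copy.deepcopy(model_state), -1, float("inf")
    for epoch in range(n_epochs):
        model_state = train_one_epoch(model_state)
        error = evaluate(model_state)
        if error < best_error:
            # Snapshot the parameters so later overfitting can be undone.
            best_state, best_epoch, best_error = copy.deepcopy(model_state), epoch, error
    return best_state, best_epoch, best_error
```

With a toy state that increases by one each epoch and an error of `(state - 4) ** 2`, the loop returns the state from the fourth epoch (index 3), after which the error rises again.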

The performance of AutoSklearn and CrabNet regression and classification models at predicting the conductivities of the materials in this dataset is evaluated through four methods: control studies, parity plots, scoring metrics, and cross-validation techniques. We then use the best approach from this assessment to train final regression and classification models on all available data.

To give some measure of the worst-case performance, we provide two control experiments. In the first control experiment, we take the reported conductivity of each material, shuffle these labels, and treat the average of five of these shuffled values as an ensemble prediction from a poor model. This has the effect of providing a quasi-random prediction that demonstrates how ensembles can bring predictions closer to the mean (Fig. 5a). In the second control experiment, we demonstrate how a model which simply predicts the mean will perform. We take the mean of all of the room temperature conductivities (−5.02 in log10(σ)) and treat these as the output prediction for each material, giving the same prediction for every entry. The true conductivities are plotted against each of these control predictions to observe the performance (Fig. 5b).
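The two controls can be sketched in a few lines. This is an illustrative reading of the description above, not the authors' code: the function names, the random seed, and the example values in `y` are assumptions.

```python
import random
from statistics import mean


def shuffled_control(y_true, n_shuffles=5, seed=0):
    """Average n_shuffles random permutations of the labels, entry by entry."""
    rng = random.Random(seed)
    shuffles = []
    for _ in range(n_shuffles):
        labels = list(y_true)
        rng.shuffle(labels)
        shuffles.append(labels)
    return [mean(column) for column in zip(*shuffles)]


def mean_control(y_true):
    """Predict the dataset mean for every entry."""
    mu = mean(y_true)
    return [mu] * len(y_true)


# Hypothetical log10(sigma) values, for illustration only.
y = [-2.5, -3.1, -4.0, -5.2, -6.3, -7.8, -8.4]
print(shuffled_control(y))
print(mean_control(y))
```

Because each shuffle is a permutation of the same labels, averaging several shuffles pulls every prediction towards the dataset mean, which is the effect described for the first control.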

Plots are an effective method to directly confirm the performance of a statistical model. For regression tasks, we plot the actual conductivities of each material against the predicted conductivities of a trained model. An ideal model would place every prediction on the leading diagonal. Dense point clouds can be difficult to interpret visually, so the error of each prediction (ypred − ytrue) is calculated and plotted as a histogram to quantify the distribution of errors. A Student’s t-distribution is fitted to the errors of all repetitions (without averaging) to provide intervals for how many predictions are within certain bounds of error for each model. The shuffled control has a zero-centred Gaussian distribution of errors on the histogram with a standard deviation of 2.34 (Fig. 5c). The mean control has an average error of −0.44, i.e., its predictions fall below the true value on average, with 68% of the predictions having an error within −1.99 to 1.10 of the true log10(σ) (Fig. 5d). Given this worst-case performance, we may demonstrate how the best compositional models perform at predicting new compositions.

When we have many plots for different models, it becomes difficult to visually confirm the best performing model. To quantify which of these models perform best, we must use statistical metrics to rank the quality of the output predictions for each model. Regression models are often scored via the Mean Absolute Error (MAE) and the R2 score (coefficient of determination). The MAE returns the average absolute difference between each prediction and its known value, where values closer to 0 reflect stronger model performance. The R2 score measures the agreement between the true and predicted values, where 1 is a perfect score, and anything below zero indicates that on average model predictions perform worse than simply returning the mean of the test set for all inputs.
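Both regression metrics are short enough to write out directly. The sketch below assumes that R2 is the coefficient of determination, 1 − SS_res/SS_tot, since that is the definition under which a score below zero means worse than predicting the test-set mean. The example values are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between predictions and true values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mu = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mu) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical log10(sigma) values, for illustration only.
y_true = [-3.0, -4.0, -5.0, -6.0]
mean_prediction = [-4.5] * 4           # always predicts the test-set mean
reversed_prediction = [-6.0, -5.0, -4.0, -3.0]

print(mean_absolute_error(y_true, mean_prediction), r2_score(y_true, mean_prediction))
print(mean_absolute_error(y_true, reversed_prediction), r2_score(y_true, reversed_prediction))
```

The mean prediction scores an R2 of exactly zero, and the reversed prediction scores below zero, matching the interpretation given above.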

For classification tasks, the performance may be demonstrated via a confusion matrix. This is a 2 × 2 matrix that compares the predictions made by the classification model against the true classification labels. An ideal result would have non-zero values only on the leading diagonal (True Positives and True Negatives) and zeros elsewhere, but in reality, many predictions will be False Positives and False Negatives. For simplicity, however, the most frequently reported score for classification is accuracy. The accuracy score is defined as the number of correct predictions divided by the total count of values in the testing set:

$${{{\mathrm{accuracy}}}} = \frac{{{{TP}} + {{TN}}}}{{{{TP}} + {{TN}} + {{FP}} + {{FN}}}}$$
(1)

On heavily imbalanced datasets with few negative class instances, the accuracy can return a high score for poor classifiers that output a single classification. This is due to the small number of negative instances, which do not significantly alter the denominator even if they are heavily misclassified (Eq. 1). To prevent misleading reporting, the MCC45 can be taken as a more informative score46 by considering the proportion of each class in the confusion matrix:

$${{{\mathrm{MCC}}}} = \frac{{{{TP}} \cdot {{TN}} - {{FP}} \cdot {{FN}}}}{{\sqrt {\left( {{{TP}} + {{FP}}} \right) \cdot ({{TP}} + {{FN}}) \cdot ({{TN}} + {{FP}}) \cdot ({{TN}} + {{FN}})} }}$$
(2)

The MCC is calculated by taking the difference of the product of true predictions and the product of false predictions, and dividing by the square root of the product of the four row and column totals of the confusion matrix. This returns a value from 1 for perfect classifications to −1 for entirely incorrect classifications. The MCC gives more weight to misclassified values, allowing us to judge the outcome of the confusion matrix succinctly. By themselves, isolated scores do not convey the strength of a model and these must be compared against a known point of reference, such as a control study, to understand the significance of a particular result.
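The contrast between accuracy and the MCC on an imbalanced test set can be checked numerically. The sketch below uses the standard definitions of both scores. The confusion-matrix counts are hypothetical, and returning 0.0 when a row or column total is zero is a common convention assumed here, not something stated in the text.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when a row or column total is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical imbalanced test set: 90 positives, 10 negatives,
# and a classifier that labels everything positive.
print(accuracy(tp=90, tn=0, fp=10, fn=0))  # 0.9
print(mcc(tp=90, tn=0, fp=10, fn=0))       # 0.0
```

The single-class classifier looks strong by accuracy but scores zero by the MCC, which is why the MCC is the more informative score here.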

As an aim of machine learning models is to predict the behaviour of as-yet unknown materials, it is important to distinguish between performance in interpolation between materials that have similar chemistries, where similar structure-property-composition relationships would be expected, and in extrapolation to materials characterized by structure and bonding that is not found in the training set. For example, predicting performance within a solid solution family with some members in the training set used would be interpolation, whereas evaluating the conductivity from a material with a new structure type would be extrapolation.

This question naturally arises when evaluating ML model performance. Here, it is important that the data being tested have not been previously used to train the model, but in and of itself, this does not directly address interpolation versus extrapolation ability. The standard method of splitting data is via k-folds cross-validation, where the dataset is split into k equal sets, and one of these sets is used to test the model. In this report we take k = 5, where the model is trained on four of these subsets (80% of the data) and then tested on the fifth (20% of the data). This process is repeated for each set, and the mean score across all test sets is used as the final measure of performance. As many of the compounds in this dataset possess some similarity with one another, we expect the model should be able to interpolate relationships between known compositions.
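A minimal sketch of the five-fold split is given below, using only the standard library. The function name, the shuffling seed, and the strided assignment of samples to folds are illustrative assumptions. With 403 samples the folds cannot be exactly equal, so their sizes differ by at most one.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# 403 matches the number of unique room temperature compositions in the dataset.
for train_idx, test_idx in k_fold_indices(403, k=5):
    print(len(train_idx), len(test_idx))
```

The final score would be the mean of the metric evaluated on each of the five test folds.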

Ideally, we want predictive models to be able to extrapolate beyond known materials, and statistically infer future chemical relationships from observed compositions. To test this, we utilize the DBSCAN labels assigned in Fig. 4 as Leave One Cluster Out (LOCO) labels to separate the 403 unique room temperature conductors into testing sets. As the compositions within each cluster have been confirmed to share chemical similarity, and to have dissimilarity from other clusters, using each cluster shown in Fig. 4a as a testing set provides a better estimate of the ability of a model to screen novel compositions than the k-folds approach, which will entail greater chemical similarity between the training and testing sets.
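The LOCO split differs from k-folds only in how the test sets are chosen: each cluster label defines one test set. The sketch below is illustrative, and the function name and example labels are hypothetical. How DBSCAN noise points are handled is not stated here, so any such label would simply be treated as one more cluster in this sketch.

```python
from collections import defaultdict


def loco_splits(cluster_labels):
    """Yield (cluster, train_idx, test_idx), holding out one cluster at a time."""
    members = defaultdict(list)
    for i, label in enumerate(cluster_labels):
        members[label].append(i)
    for label in sorted(members):
        test_idx = members[label]
        train_idx = [i for i, other in enumerate(cluster_labels) if other != label]
        yield label, train_idx, test_idx


# Hypothetical cluster labels for eight compositions.
labels = [0, 0, 1, 2, 1, 0, 2, 2]
for cluster, train_idx, test_idx in loco_splits(labels):
    print(cluster, train_idx, test_idx)
```

Because no member of the held-out cluster appears in the training set, every test prediction is an extrapolation to a chemically dissimilar group.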

Both of these cross-validation techniques are applied to train AutoSklearn and CrabNet regressors and classifiers, with the average of five repetitions of each experiment taken as the final score. We collate the performance of the two control studies and the ML models for regression and classification in Tables 3 and 4, respectively. Plots of the performance of all regression models can be found in Supplementary Figs. 9, 10.

The two control studies give the highest MAE and lowest R2 scores between the actual and the predicted values under each cross-validation scheme. These numbers are important to consider when evaluating any improvement in predictive performance. All models perform better than these controls, and under k-folds cross-validation, AutoSklearn models perform comparably to randomly initialized CrabNet models. However, under LOCO-CV, the AutoSklearn model fails to fit a suitable decision boundary to predict unseen materials; performance metrics reveal no significant improvement over the mean control. CrabNet models are better than AutoSklearn models at the extrapolatory LOCO task, and these see improved performance in both MAE and R2 correlation. CrabNet models with transfer learning outperform all other models across each metric and cross-validation scheme. The ~10% increase in performance of transfer learning regression models over those initialized randomly suggests that pretraining in other domains has given the model a clear advantage when inferring unseen chemical relationships. To demonstrate this further, the parity plots and error distributions of three of the regression models are given in Fig. 6. These plots allow us to visually judge models against one another, and to assess each model’s performance at predicting materials similar to those within the training dataset (k-folds) as opposed to materials with unseen chemistry (LOCO-CV).

The AutoSklearn regression model under LOCO-CV (Fig. 6a) demonstrates tighter prediction error bounds than the shuffled control, but still leads to predictions with an error of −0.68 on average and a standard deviation of 1.55 (Fig. 6d). An ML model which typically achieves predictions of ionic conductivity within two orders of magnitude could be interpreted as a positive outcome. However, comparison to the mean control demonstrates that this model has not learned a meaningful representation for extrapolating beyond the chemistries within the training set. The AutoSklearn error distribution is not an improvement over the mean control, which has an average error of −0.44 and a standard deviation of 1.54 (Fig. 5d). CrabNets with and without transfer initialization output a range of predictions closer to the real values, with tighter error bounds than AutoSklearn models. The CrabNet regression models with transfer learning trained under LOCO-CV (Fig. 6b) are not as consistently skewed as AutoSklearn, with an average error of −0.02 and a standard deviation of 0.811 (Fig. 6e). These models typically return predictions with less error for high and medium conductivity materials, but often fail to capture the outlying low conductivity regions. This highlights the complexity of predicting exact materials properties when there has been little exposure to these unexplored chemistries. The best regression performance is achieved using CrabNet models with transfer learning under k-folds cross-validation (Fig. 6c), which leads to a distribution of errors centred around −0.01, and a standard deviation of 0.58 (Fig. 6f). As LOCO-CV forces each model to extrapolate future predictions, it is expected that the figures of merit will be less attractive than under k-folds cross-validation. Whereas regression models achieve only a modest improvement to the bounds set by the respective control studies, this is not the case for each of the classification models, which we turn to now.

Table 4 displays the average MCC and accuracy score for each model’s test set performance across five runs, where it is seen that control models may seem initially reasonable when judged by accuracy. A complete table of results under standard metrics may be found in Supplementary Tables 3 and 4 for comparison, although we consider MCC to carry the strongest judgement of model performance. CrabNet models with transfer learning return the highest MCC of 0.63 under k-folds cross-validation, and CrabNets without transfer learning return a slightly lower score of 0.57. AutoSklearn models do not give as strong performance, with an MCC of 0.46, but this is clearly a step improvement on the MCC scores of the control studies, with accuracy also seen to improve by some margin when comparing each model to the controls. As with the regression models, classification models trained under LOCO-CV return lower scores. This is highlighted by the AutoSklearn model, which has a particularly poor MCC of 0.10 (close to the MCC of zero of the two controls) when classifying LOCO test set materials, despite a promising accuracy score. The highest scoring LOCO classification model is the CrabNet with transfer learning; an MCC of 0.38 indicates that more of the highly conductive materials are correctly classified as having log10(σ) ≥ −4 than misclassified, which is supported by the high test set accuracy of 0.73.

The two distinct cross-validation techniques have been applied to rank these statistical models against one another. However, interpolation between related materials within known chemistries (defined as known structure and bonding) should be considered independently from extrapolating into unknown chemistries beyond the training data. Accordingly direct comparison should not be drawn between the metrics for the two different cross-validation protocols, as these assess different aspects of the performance of the ML models trained against the dataset. We are forced to use the data in our possession to assess the quality of each model. The data arise from the efforts of researchers in the field, and thus reflect various research trends and foci that have emerged, rather than directly expressing the possibilities for structure, bonding, and performance for materials drawn from element combination at the level of the periodic table. Given this anthropogenic bias, there will be consistencies and trends within each chemical family of the dataset.

By separating the materials of the database into clusters by chemical similarity and testing under LOCO-CV, the reduced performance compared to validation by k-folds highlights the challenge of extrapolating known compositional relationships to other chemical families that may span different ranges of conductivity. Comparatively, under k-folds cross-validation, each material in the testing set has a greater likelihood of having corresponding materials with similar elemental composition to their own in the training set. The model under assessment thus has more opportunities to interpolate between compositions in the training data, allowing it to make stronger predictions as it has to some extent been presented with similar examples during the training, rather than having them deliberately withheld.

This emphasizes the strength of structure-property-composition relationships in lithium ion transport. It is reasonable to assume that ion transport takes place by local hopping through barriers governed by physical models that are closely connected in their physicochemical origin across all materials in the dataset regardless of structure and bonding. However, the changes in structure and bonding between these machine-identified materials clusters in which lithium transport occurs by similar, unifying diffusion mechanisms are sufficient to hinder extrapolation of performance from one set of chemistry to another, despite no fundamental change in mechanism taking place between the clusters. This contrasts with the situation prevailing for example in superconductivity, where entirely different mechanisms may govern high-temperature superconductivity in cuprates and low temperature superconductivity in elemental and alloy systems that pair by weak-coupling BCS. This mechanistic difference has been shown to undermine attempts to extrapolate with machine learning from superconductors with one pairing mechanism to another47, whereas for lithium ion transport it is the chemistry (the structure and bonding) that controls performance even under a unified physical mechanism. Nevertheless, CrabNet models with transfer learning are seen to consistently outperform both the control studies and AutoSklearn models at predicting ionic conductivity. This is shown statistically across all cross-validation schemes and metrics in both classification and regression models, and can be visually attested from the parity plots. As such, further discussion will assume these models as the focus unless stated otherwise.

### The Final Models

When screening compositions with machine learning we want to use the best possible model to increase the likelihood of making robust predictions. Model performance is typically improved by using the most training data available, and choosing an optimal training time. As discussed earlier, the optimal training time can be determined by assessing the performance vs. epoch training curve to decide which set of model parameters to use (i.e., early stopping). An important practical consideration is that any model to predict ionic conductivity would be most valuable when screening new materials. Accordingly, to assess the ability of our ML models to estimate the ionic conductivities of unstudied materials or novel chemistries, we train a final classifier and a final regressor on the entire initial database of unique room temperature conductivities and test it against eleven newly reported materials that have not been included in the initial database. We refer to this new set of materials as the experimental holdout set. These are selected to represent a range of chemistries and also conductivities, which matches the situation facing the experimentalist targeting new families of ion-transporting materials: it is desirable to understand the likely lithium conductivity of a particular composition in order to aid the selection of specific new chemistries for investigation.

We select CrabNet with transfer learning as the architecture for these two models, as k-folds and LOCO-CV assessment show that it offers the best interpolation and extrapolation performance based on the considerations above. The final CrabNet models are trained on all unique entries of the initial database presented here. In the earlier validation investigations, early stopping could be employed by using the test data to select the set of network weights at the best performing training epoch on the training curve. In our final models, a fixed number of training epochs are determined a priori by assessing the training curves of CrabNets with transfer learning under LOCO-CV and selecting a training time which typically attains optimal performance (Supplementary Note 2). Final models are trained on all unique compositions with room temperature conductivity (i.e., all 9 LOCO clusters), with the classification model trained for 98 epochs, and the regression model trained for 323 epochs.

The performance of these neural networks at classifying or predicting the log10(σ) of a selection of recently reported materials is assessed across a range of reported conductivities. The individual performance for each material in the holdout set is given in Table 5. As there are more training data available than in the validation investigations, the final models should have similar or improved performance to the results observed through cross-validation. The final classification model predicts whether the compounds of the experimental holdout set possess high (log10(σ) ≥ −4) or low ionic conductivity with an accuracy of 0.91 and an MCC of 0.83. The final regression model achieves an MAE of 1.34 on the holdout set, with an R2 score of 0.51. The performance of the final model against this necessarily small holdout set is consistent with the more robust performance indicators obtained from the previous validation investigations.

Despite the disparity in chemistries between the majority oxide training set and the more varied experimental holdout set (Supplementary Fig. 11), it appears from these metrics and also from consideration at the level of individual materials, that the regressor predicts properties reasonably. Compositions with exceptionally high conductivity are underestimated by the regression model. For nine of the eleven materials, the conductivity has been correctly predicted within two orders of magnitude, which would be expected for materials related to Li10GeP2S12, as this is contained in the training data. However, for the non-oxide materials of the holdout set that are dissimilar to those in the training set, performance is reasonable even when these materials have crystal structures that differ from other materials included in the training set. Li3.3SnS3.3Cl0.7 is the first lithium ion conducting defect stuffed wurtzite based on hexagonal close packed S2– anions48. Li3P5O14 has an ultraphosphate crystal structure defined by extended anionic layers, and is also structurally distinct from materials included in the training set49. Given that these are structurally differentiated materials, the ionic conductivities have been reasonably predicted (within 1.69 of the true log10(σ)) by a regression model that is based purely on composition. These models can be used as screening tools to motivate the further study of candidate materials and phase fields, and assist in the prioritization of resource commitment for experimental synthetic work.

Given the intended purpose as a screening tool, and the more favourable metrics demonstrated by the classification model, a reliable classification of high conductivity materials is more helpful than an absolute estimate of the ionic conductivity from the regressor. There are fewer materials with exceptionally high or low conductivity in the database, and as such there will be greater uncertainty when predicting a specific conductivity for materials in these extrema. Training on classification features gives a more balanced distribution of positive and negative class labels, which gives the model a less skewed dataset for judging its composition-based decision boundary, as reflected in the more favourable performance scores of the classification models. Although there is identified anthropogenic bias present in the dataset, the MCC score under LOCO-CV improves in comparison to each control. This leads us to conclude that these classification models predict with sufficient reliability whether a material has a log10(σ) ≥ −4 for these to be further employed to screen candidate ionic conductors (e.g., the material contains lithium and is likely to have low electronic conductivity). This does not replace expert chemical knowledge and judgement, instead providing a complementary numerical insight based on the evaluation of data at a scale hard for human experts to assimilate.

Here we present a dataset of experimentally reported lithium SSEs. This dataset includes the composition, structural type, conductivity, and measured temperatures of 789 ACIS measured conductivities, with 403 unique compositions with an associated ionic conductivity near room temperature. Multiple stages of data validation were carried out by a team of domain experts to ensure that all data are correctly imported from the literature. The creation of a reliable database is a task that is particularly difficult to carry out with automated tools due to the wide inconsistencies in how data are reported in the field of ionic conductors, necessitating lengthy human validation. Automated scraping would be a viable strategy if all future reports were to prominently state in the abstract a well-defined composition, ionic conductivity in common and clearly stated units (e.g., S cm−1), the temperature at which it was measured (e.g., 298 K) and the technique used to measure it (e.g., ACIS). With this in mind, we encourage researchers and journal editors to consider reporting core findings in this manner, which will enable materials science researchers to leverage tools from the NLP community to gather even larger datasets in the future.

The dataset represents the diversity of chemistry spanned by lithium-containing materials, with a numerical preponderance of oxide-based examples. There are 15 structural families represented at room temperature, including oxides, sulfides, halides, and mixed anion materials. These room temperature compositions are visualized and clustered with the ElM2D package to partition the dataset into nine chemically distinct clusters for leave one cluster out cross-validation (LOCO-CV) assessment of the performance of machine learning models.

Supervised statistical (AutoSklearn) and deep learning (CrabNet) models have been applied to this dataset to predict the ionic conductivity of a material from its elemental composition alone. Regression and classification models have been evaluated with standard statistical metrics under different cross-validation regimes to assess their performance at predicting the ionic conductivities of novel materials. The ionic conductivity of a material is the product of many chemical and structural considerations, and also depends on external factors such as temperature. Further, the measured conductivity can also strongly depend on sample preparation, the presence of impurity phases, and crystallite size distribution, which are often discussed collectively under the nebulous term ‘sample quality’. This makes ionic transport a difficult property to reliably predict from limited and anthropogenically biased compositional data. Given this challenge, we go beyond standard statistical metrics by designing control studies to investigate the models more thoroughly. We show that CrabNets with transfer learning demonstrate the best performance under both k-folds and LOCO cross-validation.

We present a classification model that is able to estimate whether a material has high or low conductivity with reasonable reliability. This is a practical tool to aid experimentalists in their decisions to prioritize candidates for further investigation as lithium ion conductors. Predictions from this model for chemistries dissimilar to those contained in the database are likely to be less reliable than those of closer chemistries, and materials that may have received a low conductivity prediction from these models may still be of interest. This emphasizes the importance of reporting newly synthesized materials with distinct chemistry and their measured properties. This should be encouraged even if said property is not seen as being “exceptional” in comparison to heavily investigated and optimized materials families that have seized the attention of many researchers.

Acquiring new data is the only route to improving the performance of supervised models in outlier conductivity regions. Diversification of the structure and bonding within studied ionic conductors expands the predictive utility of these models because the database on which they are trained is more representative. This experimental synthetic exploration of uncharted chemical (composition and structure) space to generate new examples is thus of foundational importance, regardless of the absolute performance of the arising material. Each qualitatively distinct material in terms of differentiated structure and bonding assists our understanding of where high performing materials may be located in chemical space. This distinguishes the generation of materials closely related to existing examples—which is valuable for optimization—from studies that explore distinct parts of the relevant chemical space. The model performance here reinforces the importance of exploratory discovery synthesis coupled with definition of structure-property-composition relationships for lithium ion transport.

## Methods

### Database construction

A visual interface was developed using the Python library Streamlit 0.60.0. Data were read into the application using pandas 1.0.1, with interface fields to select the researcher and the currently presented data entry. The PDFs of each paper, which had been downloaded during earlier validation stages, were presented to each researcher on each page by dynamically updating the file address in an embedded iframe, and running a Python 3.7 HTTP server in the PDF folder. Fields for comments were included in the application, which were stored in a CSV file and updated manually after each round of validation.

### Unsupervised learning

The PCA map of the ICSD was created by using the numpy 1.21.2 singular value decomposition implementation, applied to a centred 32-bit floating point ElMD28 0.4.15 kernel matrix, to project the distances of each point to two-dimensional co-ordinates. UMAP32 embeddings were generated using umap-learn 0.5.2 with an increased spread value of 5, a random seed of 5, and default parameters otherwise.

### Supervised learning

LOCO and k-folds cross-validation methods were applied (discussed previously) using AutoSklearn38 and CrabNet39 models. AutoSklearn 0.14.5 models were trained on 128 vCPUs (dual AMD EPYC 7502) with default hyperparameters and a timeout of 600 s. CrabNet (commit 6296be6b06dde24a5d32e3a42657ef0ba0339344) models were generated using a batch size of 512, a RobustL1 loss function, a Lamb lookahead optimizer with stochastic weight averaging, a cyclic learning rate from 1 × 10−4 to 6 × 10−3, and a Leaky ReLU activation function. CrabNet models were trained as discussed previously on an Nvidia Quadro RTX 4000. Experiments may be replicated using the code provided in the “Code availability” section.