Data mining of molecular dynamics data reveals Li diffusion characteristics in garnet Li7La3Zr2O12

Understanding Li diffusion in solid conductors is essential for the next generation Li batteries. Here we show that density-based clustering of the trajectories computed using molecular dynamics simulations helps elucidate the Li diffusion mechanism within the Li7La3Zr2O12 (LLZO) crystal lattice. This unsupervised learning method recognizes lattice sites, is able to give the site type, and can identify Li hopping events. Results show that, while the cubic LLZO has a much higher hopping rate compared to its tetragonal counterpart, most of the Li hops in the cubic LLZO do not contribute to the diffusivity due to the dominance of back-and-forth type jumps. The hopping analysis and local Li configuration statistics give evidence that Li diffusivity in cubic LLZO is limited by the low vacancy concentration. The hopping statistics also shows uncorrelated Poisson-like diffusion for Li in the cubic LLZO, and correlated diffusion for Li in the tetragonal LLZO in the temporal scale. Further analysis of the spatio-temporal correlation using site-to-site mutual information confirms the weak site dependence of Li diffusion in the cubic LLZO as the origin for the uncorrelated diffusion. This work puts forward a perspective on combining machine learning and information theory to interpret results of molecular dynamics simulations.

(MD) simulations, that while synchronous collective Li migration takes place in the tetragonal phase, the cubic phase is characterized by asynchronous hops. Later research by Burbano et al. 28 showed that the slow cyclic-type Li movements in the tetragonal LLZO explained the low conductivity. One reason for this discrepancy may lie in the analysis of the Li hops. While the study of a relatively small system in AIMD simulations allows for the tracking of the atom jumps with ease, the conclusions may not be statistically accurate due to the small size and the limited number of Li hops. On the other hand, the large trajectory dataset obtained from classical MD simulations enables a credible statistical analysis but at the same time its massive size is a barrier to obtaining insight from the data. In addition, the spatial distribution of lithium around its equilibrium lattice sites adds to the difficulty since the identification of the Li jumps would require an accurate spatial definition of Li sites. Previous works have used spherical regions or polyhedral constructions to assign particles to given sites, where the site location was given a priori [29][30][31] . Other works borrowed ideas from computer science and leveraged the crystallographic information by performing k-means clustering to link particles to specific sites [32][33][34] . However, the k-means clustering is known to be NP-hard and a large dataset would require a good initial guess of centers. On the other hand, despite that they are relatively easy implementations, all these methods are only able to separate datasets with convex shape, i.e., for each data point pair, the data point that lies on the straight line connecting the data pair is also within the shape region. In ionic systems, if ions stay stationary, the crystallographic sites of these ions are defined on their locations. However, at finite temperatures, the thermal effects trigger faster vibration and movement of the ions. In this scenario, the crystallographic sites could be defined by aggregating the ion trajectories over time and then assigning the regions with higher density as the crystallographic sites. In other words, the continuous movement of particles allows one to define the nuclear density where the low density regions can be representative of site separation. In this work, we develop a novel density-based clustering of trajectories (DCT) method that learns the site information from the MD simulation trajectories. This method calculates the nuclear density from the trajectories and learns information from the density.

Results and Discussion
The DCT method developed in this article is based on the observation that ionic diffusive events in the crystal lattice are relative infrequent relative to the usual simulation scales. In addition, the ions mostly oscillate around their equilibrium lattice site, leading to high density therein as depicted in Fig. 1. This insight leads to the development of the following algorithm: first, the Li nuclear density is estimated from all Li ion trajectories by binning the Li positions to cubic voxels of side length dr. We note that by taking dr = 1 Å the density pattern from the simulation matches well its experimental counterpart as obtained from diffraction 35 . Second, we select a local density threshold ρ to separate the various spatial regions. As shown in Fig. 1, a small ρ results in connected density, which cannot be used to differentiate the Li sites. On the other hand, when it is too large, the corresponding clusters are too small causing potential loss of information. We suggest that one could choose an intermediate ρ such that where V 0 is the entire domain of the simulation and V ρ is the region with density greater than ρ, i.e., We shall call V ρ the core domain and the corresponding voxels the core voxels. The voxels, whose density are nonzero and less than ρ, are named noise voxels. Third, we use a breadth-first search algorithm to find core voxels adjacent to interested core voxels and therefore to identify the voxel clusters. Starting from one core voxel, the breadth-first search method finds all the neighboring core voxels and then for each of them, the neighboring search is repeated until no connected core voxels are further found, see Scheme S1 for illustration. Lastly, in the mapping step, a reverse density calculation that map the density to the particles is performed to assign cluster labels of the voxel to the particles therein.
The DBSCAN algorithm developed by Kriegel and co-workers 36 shares similar characteristics as our method, such as the ability to cluster the data based on the density and the ability to automatically recognize non-convex spatial patterns. However, the DBSCAN does not compute the density directly, but relies on calculating the distance between every data entry. This is advantageous when dealing with higher dimensional data on small datasets. In contrast, the data size of MD trajectories is typically large while the dimensionality of the particle trajectory is relatively low (3 dimensions). While DBSCAN requires the computation of all the pair distances (i.e. if the simulation is characterized by N atoms and T snapshots then it requires (NT) 2 operations), DCT relies on the computation of density and the neighboring search is done by indexing, requiring M calculations for a density map with M voxels. In addition, the nuclear density computation is ubiquitous in molecular dynamics simulations as a way to assist the interpretation of particle transport mechanism [37][38][39][40][41] . The proposed method therefore can be used in conjunction with the exploratory data analysis of the nuclear density pattern. Lastly, the clustering analysis yields two time series. The first time series consists in the site occupancy of all available sites with time. The second time series tracks how each Li's site changes with time. In MD analysis, velocity autocorrelation functions and Van Hove correlation functions are widely used techniques 42 . While both methods can be used to reduce the data and to categorize particles, any local effects, such as the impact of the neighboring cationic environments, are lost in the analysis. The DCT framework complements these methods by allowing the tracking of individual particles, and the determination of the overall particle dynamics.
The clustering results on a 2 × 2 × 2 supercell give a total number of 576 clusters (all available sites) for the cubic LLZO (c-LLZO) in the whole temperature range, and 448 clusters for the tetragonal LLZO (t-LLZO) from 300 K to 900 K and 576 cluster for temperatures greater than 900 K. The change of cluster number of t-LLZO at 900 K is accompanied by Li diffusivity change as t-LLZO starts to be similar to c-LLZO above 900 K, see Figure S1. The abrupt change in the total number of Li clusters and diffusivity in t-LLZO suggests that certain transition Scientific RepoRts | 7:40769 | DOI: 10.1038/srep40769 occurs at a temperature between 900 K and 1000 K 43 . In c-LLZO we did not find the splitting of 48 g sites into 96 h sites in c-LLZO since the two sites are closely connected.
As shown in Fig. 1, since we have the site attributes for all Li ions at all time steps, we can use this to derive the site characteristics. Even without crystallographic information, the cluster analysis of the local Li density detects two types of sites in c-LLZO (i.e. 48 g and 24d sites) and three occupied site types in t-LLZO (i.e., 8a, 16f and 32e sites) as shown in Section S2. The determination of crystallographic sites enables the validation of the simulations against experiments and could be in principle used to detect phase transitions. In addition, the site occupancy can also be obtained as discussed in Section S3. Generally, Li vacancy clusters are rarely found and the most frequent local configurations of Li occupancy are T 04 , T 13 and O 11 as shown in Figure S6, where the upper case letter indicates the center site (Td or Oh), the first index is the number of Li at the center site and the second index is the total number of Li nearest neighbors.
The DCT allows us to obtain the statistics of the Li jumps, which, as noted above, is naturally encoded in a time series obtained by associating every Li to an identified cluster at a given time step. The hopping rate in c-LLZO is greater than that in t-LLZO, see Figure S8. The activation barrier energy is also different within the two phases, as shown in Figure S9, where the barrier of t-LLZO is higher.
We notice that in c-LLZO, despite the high jump rate, it is highly probable that Li returns to the original site after hopping to the neighboring site as shown in Figure S10. Since the net displacement of the Li is zero in the back-and-forth jumps, such events do not contribute to the Li diffusivity. To verify our assumption, we computed the jump diffusivity of Li, D jump , from the jump statistics by the following formula 29 where l is the distance between neighboring site centers, calculated from the average Li coordinates within the given center, and ω is the Li hopping frequency. We filter out the back-and-forth Li hops and compute the jump diffusivity D jump . The comparison between D jump and D self , is shown in Fig. 2. Reasonable agreement is reached as a result of filtering out the back-and-forth Li hops. Without filtering, however, the jump diffusivity is much higher, as shown in Figure S12d. The difference is even more prominent at lower temperature. This result also supports our previous conjecture that, since T 04 and T 13 configurations are the most frequent, Li diffusion is relatively short-ranged because continuously open channels are missing. In other words, the analysis suggests that one bottleneck to higher diffusion rate is the relatively low vacancy concentration. If more vacancies are created, the total number of back-and-forth Li hops could be reduced, thereby resulting in more effective Li diffusional rates. Therefore, it is expected that doping the LLZO with higher valence cations increases the Li vacancy concentration and enhances Li self-diffusion. This has been observed if LLZO is doped with, for example, penta-valent elements Ta 5+ and Nb 5+ 23,44-46 or if Li is partially substituted with Al 3+ and Ga 3+ 47-49 . As shown in Figure S13, an increase of diffusivity is observed if the vacancy concentration increases, and this trend continues until the composition reaches Li 5.3 La 3 Zr 2 O 12 .
The intriguing Li jump characteristics calls for a more detailed study, especially on the correlation between Li jumps and Li sites both in time and space. The nature of the temporal stochastic process can be studied by benchmarking the probability P(k) of observing k simultaneous jumps within 0.1 ps against the Poisson distribu- . Interestingly, the Li hopping in c-LLZO follows a Poisson distribution (see also Figure S20), while t-LLZO deviates substantially from it as shown in Fig. 3 for simulations at 900 K (data at other temperatures are shown in Figure S14). This correlated jump behavior in t-LLZO is consistent with previous findings by Meier et al. 27 and Burbano et al. 28 .
In c-LLZO, while the aggregate behavior of Li hopping appears to be qualitatively Poissonian in the temporal scale, studying the site-to-site interaction may provide further spatial scale information. In order to clarify the spatial correlation among Li sites, we calculated the site mutual information for all site combinations as defined by ref. 51.
where X and Y are the random variables denoting corresponding sites and p ij is the probability of the (X, Y) to be in occupancy state (i, j), where i and j could be 0 or 1, e.g., p 01 is the probability that site X is unoccupied and site Y is occupied. Detailed explanation is provided in Section S5. We note that such mutual information is the Kullback-Leibler distance of the individual site occupancy distributions, implying that it measures the amount of information into X is contained in Y, i.e., the site dependence. It is important to note that I(X, Y) > 0 indicates dependence between X and Y, and I(X, Y) = 0 if and only if X and Y are independent. Furthermore, the mutual information is positively related to the correlation. In the special case where X and Y are Gaussian random variables, I ( X , Y ) is rel ate d to t he bivar i ate cor rel at ion ρ b ( X , Y ) fol low ing t he for mu l a ρ = − − X Y I X Y ( , ) 1 exp( 2 ( , )) b 52 . In addition to I(X, Y), we also computed the delayed mutual information defined as ) u 1 2 53,54 . All data are presented as a function of time delay and static site distance, i.e., the number of hops needed to go from one site to the another as computed by Dijkstra's algorithm 55 , see Fig. 4. Other details are presented in Section S5.
The value of I u,d (X, Y) when u = d = 0 corresponds to the entropy. Due to an occupancy close to 0.5 for the Td sites and greater randomness than the Oh sites, see Fig. 5, the entropy of Td sites is higher than that of the Oh sites, as shown in Fig. 4.
The autoinformation, that is I u,d (X, Y) for u > 0 and d = 0, measures the autocorrelation of a given X site. As a consequence of occupancy and in spite of considerable site randomness, the Td sites are slightly autocorrelated as shown in Fig. 4b and, for T = 1200 K, the autocorrelation time is on the order of 1 ps. We can also show the mutual information as a function of distance. Td sites seem to have a weak long-ranged spatial correlation which is greater than the Oh-Oh and Td-Oh. 1D plots of mutual information I u,d (X, Y) versus d for fixed u are shown in Figure S17.
Previous analysis on the site occupancy in Section S3 demonstrated that certain neighboring occupancy patterns, such as O 11 , occurred at a higher probability than others. Surprisingly, this does not mean that the neighboring Oh and Td site are highly correlated. On the contrary, the neighboring sites show decreased correlation in comparison to the Td-Td and Oh-Oh interactions at second nearest neighbors, as evidenced by the low mutual information between Oh and Td in Fig. 4c. Overall, the spatial correlation is weak and short-ranged, confirming the observed uncorrelated Poissonian type of diffusion. Mutual information results at other temperatures also show weak correlation, as shown in Figure S18.

Conclusions
Li diffusion in the cubic and tetragonal phase LLZO is studied by MD simulations. We develop a density-based clustering method to obtain the Li site shape, type and occupation information, and Li jump statistics. The results show that in c-LLZO, the Li occupancy is much higher at the Oh sites than the Td sites, and t-LLZO has an occupancy of 1 for three types of sites. Our study also shows that at RT the Li jumps in c-LLZO consist mostly of a back-and-forth type, and this type of jumps does not contribute to the effective diffusion. In line with the Li occupancy analysis, the reason for the back-and-forth jump is that when Li jump from an Oh to the neighboring empty Td site, the other diffusion paths are mostly blocked by the Li. Therefore, the main obstacle to Li diffusion in LLZO is the low vacancy concentration. Substituting cations with higher valence is an effective strategy to increase the vacancy concentration, as found in experimental observations. In addition, Li hops in c-LLZO are uncorrelated and fit well with the Poisson process while it deviates from the Poisson process in t-LLZO. Lastly, the information theoretic analysis of c-LLZO confirms that the low correlation between sites likely induces a Poissonian type of Li hopping. However, the Li sites still have some short-ranged time and space correlation and this correlation is stronger for the Td sites in comparison to the Oh sites.

Methods
MD simulations were performed in LAMMPS 56 . The potential parameters, presented in Table S1, were fitted from DFT calculations by Klenk and Lai 43 , where the lattice parameters, phase transition temperatures, and Li diffusivities are benchmarked against previous experiments 57 . In order to simulate both the cubic and tetragonal phase of LLZO, and examine the difference in Li diffusion behaviors, we artificially imposed isotropic pressure and biaxial pressure respectively to constraint the lattice shape. For the cubic phase, since 2/9 of the Li sites are vacant, we generated randomly distributed Li vacancy positions and constructed 8 different vacancy configurations. The simulated trajectories from different configurations are collected together to yield the nuclear density estimate. The simulation temperatures ranged from 300 K to 1200 K with a 100 K interval. During each simulation, the temperature was first equilibrated at 1200 K and then cooled to the set temperature in the NPT ensemble employing a Noose-Hover thermostat and barostat with a damping parameter of 0.05 ps for the temperature and 0.25 ps for the pressure. Then the systems were equilibrated at the set temperature for 50 ps followed by the data collection run in NVT ensemble. For the cubic phase, the simulation time was 500 ps long and the trajectories were sampled every 0.1 ps. For the tetragonal phase, a longer 4 ns run was performed, where the same sampling rate was used. The 4 ns runs of tetragonal phase at low temperatures, especially at 300 K, may not be able to generate reliable diffusivity results due to the intrinsically slow Li movements. However, the simulation results have shown experimentally-consistent characteristics of the site information, such as the site types and multiplicity, and the low Li jump statistics, which are necessary in the current study as explained in the results section. For all simulations, a velocity Verlet algorithm with a 1 fs interval was chosen for the time-stepping method. For a system with extra Li vacancies that cause unbalanced charge, a uniform background charge is added, as implemented in LAMMPS.