Autonomously revealing hidden local structures in supercooled liquids

Few questions in condensed matter science have proven as difficult to unravel as the interplay between structure and dynamics in supercooled liquids. To explore this link, much research has been devoted to pinpointing local structures and order parameters that correlate strongly with dynamics. Here we use an unsupervised machine learning algorithm to identify structural heterogeneities in three archetypical glass formers—without using any dynamical information. In each system, the unsupervised machine learning approach autonomously designs a purely structural order parameter within a single snapshot. Comparing the structural order parameter with the dynamics, we find strong correlations with the dynamical heterogeneities. Moreover, the structural characteristics linked to slow particles disappear further away from the glass transition. Our results demonstrate the power of machine learning techniques to detect structural patterns even in disordered systems, and provide a new way forward for unraveling the structural origins of the slow dynamics of glassy materials.


SUPPLEMENTARY METHODS
In this paper, we employ an unsupervised machine learning (UML) method to classify particles into different groups based on their local environment. As mentioned in the Methods section, we describe the local environment of particle i in terms of a vector Q(i) of bond order parameters. This vector has length d = 8. As clustering particles with similar environments is difficult in such a high-dimensional space, we reduce the dimensionality of this vector using a neural-network-based autoencoder. A sketch of such an autoencoder is shown in Supplementary Figure 1.
Essentially, the autoencoder is a neural network trained to reproduce its input as its output. The neural network is especially designed to contain a "bottleneck" with a dimensionality lower than the input and output vectors, such that the network is forced to compress the information, and subsequently decompress it again. After training the autoencoder (which can be done on a single simulation snapshot), we only retain the encoding part of the network, and use it as our dimensionality reducer (as shown in Fig. 1 of the main text). This assigns to each particle a lower-dimensional vector which can be used to group particles with similar environments.
In order to cluster together similar environments in the low-dimensional subspace found by the encoder, we use Gaussian mixture models (GMMs) as implemented in scikit-learn [1]. This allows us to separate the particles into distinct clusters which differ by their local structure. In all cases, we found a separation into two clusters to be optimal. Based on the resulting classification, we assign to each particle a probability P_red of belonging to a specific one of the two clusters. Note that in principle, the two clusters carry no inherent meaning. However, in order to make our figures consistent, we choose (with hindsight) the red cluster to correspond to the more mobile particles in the system.
In the following, we describe in more detail both the dimensionality reduction and the clustering procedures, which closely follow the method introduced in Ref. [2]. Note that, for all systems analysed in this work, we perform a separate analysis for the two species of particles.

A. Nonlinear dimensionality reduction using neural-network-based autoencoders
In order to reduce the dimensionality of the vectors Q(i), i.e. to extract the relevant information they contain, we use neural-network-based autoencoders [3-7]. An autoencoder is a neural network that is trained to perform the identity mapping: the network inputs, Q(i) ∈ R^d, are approximately reproduced at the output layer, Q̂(i) ∈ R^d. The network may be viewed as consisting of two parts: an encoder network, which performs a nonlinear projection of the input data onto a low-dimensional subspace (the bottleneck), Y(i) ∈ R^c with c < d, and a decoder network that attempts to reconstruct the input data from the low-dimensional projection. This architecture is represented in Supplementary Figure 1.

Supplementary Figure 1: Architecture of a neural-network-based autoencoder. The encoder network (highlighted in blue) finds a low-dimensional representation of the input, from which the decoder reconstructs an approximation of the input as output.
In this paper, we train the autoencoder using the data associated with the N different local environments from a single snapshot, where N is the number of particles in the simulation. By training the autoencoder to perform the input reconstruction task over this set of N training examples, the encoder is forced to learn a low-dimensional nonlinear projection that preserves the most relevant features of the data, and from which the higher-dimensional inputs can be approximately reconstructed by the decoder.
The number of input and output nodes, d, is specified by the dimension of the input vectors. Nonlinearity is achieved by providing both the encoder and the decoder with a fully-connected hidden layer with a nonlinear activation function. In this work, we set the number of nodes in the hidden layers to 5d and use a hyperbolic tangent as the activation function. For the bottleneck and output layers, instead, a linear activation function is used.
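As a concrete illustration, the architecture described above (a 5d-node tanh hidden layer on either side of a c-dimensional bottleneck) can be sketched with scikit-learn's MLPRegressor trained to reproduce its input. This is a simplified stand-in, not the code used in this work: the data are random stand-ins for the vectors Q(i), and MLPRegressor applies the tanh activation to every hidden layer, including the bottleneck, whereas the text specifies a linear activation for the bottleneck and output layers.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
d, c, N = 8, 2, 1000              # input dimension, bottleneck dimension, particles
Q = rng.normal(size=(N, d))       # synthetic stand-in for the bond-order vectors Q(i)
Q[:, :2] += Q[:, 2:4]             # inject some correlation so compression is possible

# Autoencoder: input -> 5d -> c -> 5d -> output, trained to reproduce its input.
ae = MLPRegressor(hidden_layer_sizes=(5 * d, c, 5 * d),
                  activation='tanh',
                  solver='sgd', momentum=0.9, learning_rate_init=0.01,
                  alpha=1e-5,                # L2 weight decay, cf. lambda = 1e-5
                  max_iter=500, random_state=0)
ae.fit(Q, Q)

def encode(ae, X):
    """Forward pass through the first two layers only: the encoder."""
    h = np.tanh(X @ ae.coefs_[0] + ae.intercepts_[0])      # hidden layer, 5d nodes
    return np.tanh(h @ ae.coefs_[1] + ae.intercepts_[1])   # bottleneck, c nodes

Y = encode(ae, Q)
print(Y.shape)   # (1000, 2): one low-dimensional vector per particle
```

After training, only `encode` is used, mirroring how the encoder part of the network alone is retained as the dimensionality reducer.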
The internal parameters of the autoencoder, i.e. the weights W ≡ {w_j} and biases B ≡ {b_k}, are initialized with the normalized initialization proposed by Glorot and Bengio in Ref. [8] (often called Xavier initialization), and are optimized during training by iteratively minimizing the reconstruction error of the input data over the set of N training examples. Specifically, we consider the mean squared error with the addition of a weight-decay regularization term [7] to control the magnitude of the network weights,

E(W, B) = (1/N) Σ_{i=1}^{N} |Q(i) − Q̂(i)|² + (λ/M) Σ_{j=1}^{M} w_j²,   (Supplementary Equation 1)

where M is the total number of weights, whose value depends on the dimensions of the network, and we set λ = 10^−5. The function in Supplementary Equation 1 is minimized using mini-batch stochastic gradient descent with momentum [7,9,10].
The optimal number of nodes in the bottleneck layer, c, which defines the unknown relevant dimensionality of the input data, can be determined by computing the fraction of variance explained (FVE) by the reconstruction,

FVE = 1 − Σ_{i=1}^{N} |Q(i) − Q̂(i)|² / Σ_{i=1}^{N} |Q(i) − Q̄|²,

where Q̄ is the mean input vector. Fig. 2a of the main text shows the FVE as a function of c for the three systems analyzed in this work. Here, we choose the optimal value of c as the smallest value for which the FVE is at least 0.75, i.e. at least 75% of the variance of the original data is explained by the autoencoder's reconstruction. As shown in Fig. 2a, this condition is satisfied with c = 2 for both the hard-sphere and Wahnström systems, while a dimensionality of c = 4 is necessary for the Kob-Andersen system. Once the autoencoder is trained, only the encoder network (highlighted in blue in Supplementary Figure 1) is retained in order to perform the nonlinear mapping of the input vectors Q(i) onto the low-dimensional subspace defined by the bottleneck layer, Y(i).

Supplementary Figure 2: a Scatterplot of the local descriptors of all large particles in the Wahnström system after dimensionality reduction. The points are colored according to their value of P_red. b The same scatterplot, with a representation of the two Gaussians found by the clustering. c The same scatterplot, with points colored according to the dynamic propensity measured at the time of maximum correlation. Note that the particles are colored according to their deviation from the mean dynamic propensity, in units of the standard deviation σ.
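The FVE criterion can be sketched as follows; the data here are synthetic stand-ins, and the two limiting cases are easy to check: a perfect reconstruction gives FVE = 1, while an uninformative reconstruction gives a value near (or slightly below) zero.

```python
import numpy as np

def fraction_of_variance_explained(Q, Q_hat):
    """FVE = 1 - (reconstruction error) / (total variance about the mean input)."""
    Q_bar = Q.mean(axis=0)                        # mean input vector
    residual = np.sum((Q - Q_hat) ** 2)           # reconstruction error
    total = np.sum((Q - Q_bar) ** 2)              # variance of the original data
    return 1.0 - residual / total

rng = np.random.default_rng(1)
Q = rng.normal(size=(500, 8))                     # synthetic input vectors
print(round(fraction_of_variance_explained(Q, Q), 3))   # 1.0: perfect reconstruction
print(fraction_of_variance_explained(Q, np.zeros_like(Q)))  # near zero: no information retained
```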

B. Clustering
GMM is a probabilistic model that assumes that the observed data are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. The optimal values of these parameters are found iteratively with the expectation-maximization (EM) algorithm [11] in order to create a probability density function that agrees well with the distribution of the data. The number of Gaussian components in the mixture, N_G, is usually found by minimizing the Bayesian information criterion (BIC) [12], which measures how well a GMM fits the observed data while penalizing models with many parameters to prevent overfitting. Fig. 2b of the main text shows that the BIC has a minimum at N_G = 2 for all three systems and for both species of particles, indicating that a separation into two clusters is optimal for these systems.
The output of a trained GMM is a list of probabilities, P_j(i), corresponding to the posterior probability that the i-th observation arises from the j-th component of the mixture model. Here, we arbitrarily label the two clusters "red" and "white" and refer to these probabilities as P_red(i) and P_white(i) = 1 − P_red(i).
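A minimal sketch of this clustering step with scikit-learn, on synthetic two-dimensional data standing in for the encoded environments Y(i): the number of components is selected by minimizing the BIC, and predict_proba supplies the posterior probabilities P_j(i).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Synthetic stand-in for the encoded environments Y(i): two overlapping 2D clouds.
Y = np.vstack([rng.normal(loc=(-1, 0), scale=0.5, size=(500, 2)),
               rng.normal(loc=(+1, 0), scale=0.5, size=(500, 2))])

# Select the number of Gaussian components N_G by minimizing the BIC.
bics = {n: GaussianMixture(n_components=n, random_state=0).fit(Y).bic(Y)
        for n in range(1, 6)}
n_best = min(bics, key=bics.get)

gmm = GaussianMixture(n_components=n_best, random_state=0).fit(Y)
P = gmm.predict_proba(Y)        # posterior probabilities P_j(i)
P_red = P[:, 0]                 # probability of belonging to one (arbitrary) cluster
print(n_best)                   # BIC selects 2 components for this synthetic data
```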

C. Dimensionality reduction in the Wahnström mixture
In order to demonstrate the effect of the dimensionality reduction, we show in Supplementary Figure 2a a scatterplot of all local environments of large particles in the Wahnström system at density ρ* = 0.81 and temperature T* = 0.7. For each particle, the encoding part of the autoencoder has been used to reduce the dimensionality of the vector Q to two dimensions. As a result, the information retained by the autoencoder can be plotted as a point in 2D space for each particle. Using the Gaussian mixture model, we classify this set of points into two clusters (white and red), corresponding to two Gaussian distributions of points (Supplementary Figure 2b). The connection to dynamics can also be clearly visualized in this reduced space by coloring particles according to their dynamic propensity (Supplementary Figure 2c). This leads to a clear gradient in coloring, indicating that the faster particles (red) lie predominantly in the top left cluster, while the slow particles lie in the bottom right one.

SUPPLEMENTARY NOTE 1: ROBUSTNESS OF THE METHOD

A. Different trainings
The training of the autoencoder is performed with a stochastic optimization algorithm and, as a consequence, the learned projection of the input data differs for every particular training. Although the FVE is an excellent indication of the quality of the learned projection, it is not a priori clear how strongly the final results depend on the particular projection found by the encoder.
To explore this, we repeated the analysis for the large particles of the Wahnström system at reduced density ρ* = 0.81 and reduced temperature T* = 0.7 multiple times. Each time, we trained a new autoencoder (with the same architecture and a different random seed) and subsequently performed a new clustering. Finally, we compared the Spearman's rank correlation between the dynamic propensities of the particles and their membership probabilities P_red obtained for the different trainings of the UML method.
Supplementary Figure 3 shows that the results obtained for three different trainings are essentially the same, indicating that the method is robust to changes in the training.

B. Autoencoder architecture
In all the analysis performed in this work, we used the same autoencoder architecture, i.e. one hidden layer of dimension 5d in both the encoder and the decoder parts of the network. This choice is somewhat arbitrary and was determined empirically without a thorough optimization.
In this section, we compare the results obtained with different autoencoder architectures with either a larger hidden layer or multiple hidden layers. We specify the number and the dimensions of the hidden layers with a tuple of the form (h_1, h_2, . . . ), whose length corresponds to the number of hidden layers, and whose components indicate the size of each layer.
Supplementary Figure 4 shows the FVE as a function of c obtained with different architectures for the large particles of the Wahnström system at density ρ * = 0.81 and temperature T * = 0.7. The results are approximately equal for all the architectures considered, indicating that there is no advantage in using a deeper autoencoder.

C. Bottleneck dimensionality
As explained in the Supplementary Methods, we choose the optimal dimensionality of the bottleneck, c, as the smallest value for which the FVE is at least 0.75. This choice guarantees that at least 75% of the variance of the original data can be explained by the input reconstruction, meaning that only a small part of the original information is discarded by reducing the dimensionality. However, 75% is an arbitrary choice, and it is not obvious how the final results might depend on this choice.
To answer this question, we performed a new analysis of the three main snapshots presented in the paper with a different choice of the bottleneck size, c.

D. Input dimensionality
As explained in the methods section of the main text, our description of the local environment is encoded into a vector of d = 8 BOPs, q_l with l = 1, 2, . . . , d, which is then used as the input of the UML method. One could in principle consider a larger or smaller set of BOPs, with the obvious consequence of having a more or less complete description of a particle's local environment. In general, using a smaller set of BOPs carries the risk of missing the relevant structural differences in the system, while including more BOPs should give a better description of the structure and, hence, a better classification. In our previous study [2], where we applied the same UML approach to identify different crystal phases in colloidal systems, we found a choice of d = 8 to be sufficiently large for all the systems examined. In this study, however, it is not obvious how the results, in particular the correlations with the dynamics, would be affected by a different choice of d.
To explore this, we repeated the analysis of the three main snapshots presented in the paper with different choices of d: d = 4, 6, 8, 12, 16. For each value considered, we trained a new autoencoder and subsequently performed a new clustering. Additionally, in every case we determined the optimal dimensionality of the bottleneck, c, as the smallest value with a FVE of at least 0.75. Note that, with this choice, c increases with d. The results for the large and small particles of the three systems are shown in Supplementary Figure 6.
For all systems we find a choice of d = 8 to be nearly optimal for both species of particles. Specifically, for the hard-sphere and Wahnström systems, the correlations get worse using d < 8, clearly indicating that at least part of the relevant information that correlates with the dynamics is encoded in BOPs with higher l. Additionally, increasing d leads to essentially the same or even worse correlations than d = 8, indicating that the BOPs with l > 8 capture structural differences that correlate poorly with the dynamics. For the Kob-Andersen system, instead, the correlations are only weakly affected by a different choice of d.

SUPPLEMENTARY NOTE 2: COMPARISON TO SUPERVISED METHODS FOR LINKING STRUCTURE AND DYNAMICS
As shown in the main text, we find significant correlations between the structural heterogeneities found by the UML order parameter P_red and the dynamical heterogeneities encoded in the dynamic propensity. In order to obtain an estimate of how much dynamical information is retained during the dimensionality reduction, it is interesting to compare these correlations to the results of supervised machine learning techniques, which explicitly fit dynamical data based on an input set of structural features.

Supplementary Figure 7: Comparison of the Pearson's correlation coefficient between different order parameters and the dynamic propensity. We compare our results on the Kob-Andersen mixture to previous results taken from Ref. [13]. The methods explored there include three supervised machine learning methods: graph neural networks (GNN), convolutional neural networks (CNN), and support-vector machines (SVM) [14], which are all trained explicitly to fit dynamic propensity data. In addition, Ref. [13] examined three physics-based order parameters, based on soft modes [15], the Debye-Waller factor [16], and the potential energy [17,18]. We compare these results to the prediction given by our UML order parameter P_red, the locally averaged P̄_red, and a simple neural network (NN) taking the same vector Q as input as our autoencoder and fitted to the dynamic propensity.
For the Kob-Andersen mixture, Ref. [13] provides an excellent set of benchmarking data for various temperatures. Specifically, they compare the performance of state-of-the-art supervised ML techniques for predicting dynamic propensity based on structure. This includes the support-vector machine (SVM) approach introduced in Ref. [14], as well as more advanced machine learning methods: convolutional neural networks (CNN) and graph neural networks (GNN). As shown in Supplementary Figure 7, these algorithms all provide better predictions of the dynamic propensity than our P_red and its locally averaged version P̄_red. This is expected: explicitly drawing on dynamical data should always allow for a better fit than relying purely on structural information. It is, however, interesting to note that P̄_red does outperform all three of the physical order parameters used as benchmarks in Ref. [13] (also plotted in Supplementary Figure 7), which are based on soft modes in the system [15], the local Debye-Waller factor [16], and the potential energy [17,18].
One key factor in the ability of a (supervised or unsupervised) machine-learned order parameter to predict dynamic propensity is its input data. For our UML parameter, we use a relatively small set of eight bond-order parameters, and it is interesting to see how well a supervised learning algorithm performs with such an input. To test this, we performed a supervised machine learning analysis on the state points included in Supplementary Figure 7. Specifically, we trained a standard neural network (3 hidden layers, each of width 2d) to fit the dynamic propensity of each particle in the snapshot based on the input vector Q. The result is shown as the black dashed line in Supplementary Figure 7. From this, we can make two observations. First, the correlations for P_red are only slightly lower than those for the NN, indicating that the UML approach retains the vast majority of the dynamical information encoded in the full set of bond order parameters. Second, for Kob-Andersen, the small set of bond order parameters is not a great predictor of dynamic propensity: the NN scores significantly worse than e.g. the SVM approach, which used 440 hand-crafted structural features as input [13,19]. Based on this, it is possible that a larger set of input data per particle could also improve our UML approach, although we leave this for future work.

SUPPLEMENTARY NOTE 3: DYNAMIC PROPENSITY CORRELATIONS FOR SMALL PARTICLES
In the main text, we report the correlation between P_red and dynamic propensity for the large particles in all three of our glass forming models at different degrees of supercooling. For completeness, in Supplementary Figure 8 we show the same data for the small particles, where we observe approximately the same results.

SUPPLEMENTARY NOTE 4: LOCAL AVERAGING OF THE STRUCTURAL ORDER PARAMETER
In the main text, we locally average the structural order parameter found by the UML method, P_red, over a spherical local region of radius r_c = 2σ_A for all systems. Note that, as our method is purely based on structural information, this radius was not optimized in order to maximize the correlations with the dynamics.
In this section we explore how the choice of such a radius influences the correlations with the dynamics, and whether we can extract a growing length scale with the degree of supercooling of the system.

A. Correlations for different cutoff radii
Here, we consider the three main snapshots analyzed in the main text. For each snapshot, we compute the locally averaged P̄_red using three different cutoff radii, r_c/σ_A = 1, 2, 3, and see how the Spearman correlation with the dynamic propensity is influenced by this choice. A comparison of the correlations obtained for the three systems is shown in Supplementary Figure 9.
In general, we find that the maximum in the correlation shifts to longer times with increasing r_c. Moreover, in all cases, the correlation improves going from r_c/σ_A = 1 to r_c/σ_A = 2. However, for both species of particles in the hard-sphere system and for the large particles of the Wahnström system, the correlation gets worse for r_c/σ_A = 3. In the remaining cases (small particles of the Wahnström system and both species of particles in the Kob-Andersen system), we find very small differences between r_c/σ_A = 2 and r_c/σ_A = 3.
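A minimal sketch of the local averaging itself (not the code used in this work), assuming a cubic periodic box and synthetic positions and P_red values: scipy's cKDTree collects all particles within r_c of each particle i (including i itself), and their P_red values are averaged to give P̄_red(i).

```python
import numpy as np
from scipy.spatial import cKDTree

def local_average(positions, p_red, r_c, box):
    """Average P_red(i) over all particles within distance r_c of particle i
    (particle i included), with cubic periodic boundaries of side `box`."""
    tree = cKDTree(positions, boxsize=box)
    neighbours = tree.query_ball_point(positions, r=r_c)
    return np.array([p_red[idx].mean() for idx in neighbours])

rng = np.random.default_rng(3)
box = 10.0
pos = rng.uniform(0, box, size=(2000, 3))        # synthetic particle positions
p_red = rng.uniform(0, 1, size=2000)             # synthetic P_red values
p_bar = local_average(pos, p_red, r_c=2.0, box=box)
print(p_bar.shape)   # (2000,): one locally averaged value per particle
```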
Supplementary Figure 9: Spearman's rank correlation between the particles' dynamic propensity D_i and their locally averaged membership probability P̄_red(i) obtained with different averaging cutoffs, r_c, for a-c large and d-f small particles in the hard sphere (a, d), Wahnström (b, e), and Kob-Andersen (c, f) systems at the highest degree of supercooling.

B. Length scale
In order to explore whether we can extract a growing length scale with the supercooling of the system, we now consider the three systems at different degrees of supercooling, i.e. different packing fraction η for hard spheres, and different temperature T * for the Wahnström and Kob-Andersen systems.
For each system, we optimize the cutoff radius by maximizing the correlations of P̄_red with the dynamic propensity. Specifically, we compute P̄_red using different radii (r_c/σ_A = 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0), and choose the optimal radius r*_c as the one for which the maximum in the correlation between P̄_red and the dynamic propensity is the highest. Supplementary Figure 10 shows the results of this analysis for the three systems.
For the Wahnström and especially for the Kob-Andersen systems we clearly find that the optimal cutoff increases with the degree of supercooling. This behaviour is less pronounced and more irregular for the hard-sphere systems, which might be caused by the relatively small system size considered for such systems (only N = 2000 particles compared to the N = 64000 particles of the Wahnström and Kob-Andersen systems). Nonetheless, the largest optimal cutoff is always found at the deepest supercooling.
It should be noted that this length scale is not a purely structural one as we find the optimal r * c by comparing to dynamical data. As a result, it seems likely that the optimal r * c is linked to the dynamical correlation length of the system rather than being indicative of a growing static length scale.

SUPPLEMENTARY NOTE 5: CORRELATIONS IN THE TRAINED ORDER PARAMETERS
The next question we want to address is whether our trained order parameter shows signs of a growing structural correlation length in our systems. To investigate this, we calculate the radial autocorrelation function associated with P_red, defined as

C(r) = ⟨ Σ_{i≠j} p_i p_j δ(r − |r_i − r_j|) ⟩ / ⟨ Σ_{i≠j} δ(r − |r_i − r_j|) ⟩,

where r_i is the position of particle i, and the normalized order parameter p_i of each particle is defined as

p_i = P_red(i) − ⟨P_red⟩,

with ⟨P_red⟩ the average value of P_red taken over all particles. The resulting correlation functions for all three systems are shown in Supplementary Figure 11. While we do observe some increase in overall correlations as each system is supercooled, the range over which these correlations exist does not grow appreciably. This is in contrast to the growing dynamical length scale associated with r*_c in Supplementary Figure 10.
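A brute-force sketch of such a radial autocorrelation function, assuming a cubic periodic box with minimum-image distances and synthetic, uncorrelated P_red values (for which C(r) should simply fluctuate around zero). The pair sums are binned in r by histogramming; the division by the pair count per bin implements the normalization by the delta-function average.

```python
import numpy as np

def radial_autocorrelation(positions, P_red, box, r_max, n_bins=50):
    """C(r): average of p_i * p_j over all pairs at distance r, with
    p_i the mean-subtracted order parameter; cubic periodic box."""
    p = P_red - P_red.mean()
    # minimum-image pair separations
    delta = positions[:, None, :] - positions[None, :, :]
    delta -= box * np.round(delta / box)
    dist = np.sqrt((delta ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(positions), k=1)       # each pair counted once
    d, pp = dist[i, j], p[i] * p[j]
    bins = np.linspace(0.0, r_max, n_bins + 1)
    counts, _ = np.histogram(d, bins=bins)
    sums, _ = np.histogram(d, bins=bins, weights=pp)
    with np.errstate(invalid='ignore'):
        C = sums / counts                             # NaN where a bin has no pairs
    r = 0.5 * (bins[:-1] + bins[1:])
    return r, C

rng = np.random.default_rng(4)
box = 8.0
pos = rng.uniform(0, box, size=(400, 3))              # synthetic positions
P_red = rng.uniform(0, 1, size=400)                   # uncorrelated synthetic values
r, C = radial_autocorrelation(pos, P_red, box, r_max=4.0)
```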

SUPPLEMENTARY NOTE 6: BOND ORDER PARAMETERS IN THE TWO CLUSTERS OF SMALL PARTICLES
In the main text, we report the mean value of all bond order parameters q_1, . . . , q_8 for both the red (more mobile, P_red > 0.5) and white (P_red < 0.5) clusters of large particles in all three of our glass forming models. For completeness, in Supplementary Figure 12 we show the same data for the small particles, where we observe approximately the same results.

SUPPLEMENTARY NOTE 7: MODE COUPLING THEORY FITS
For each of our model glass formers, we extract the mode coupling theory (MCT) critical temperature (T_MCT) or packing fraction (η_MCT) by performing standard MCT fits to the scaling of the structural relaxation time τ_α as a function of temperature (or packing fraction),

τ_α = A ε^{−γ},   (Supplementary Equation 5)

where ε is the distance from the MCT critical point, i.e. ε = T − T_MCT for Kob-Andersen and Wahnström and ε = η_MCT − η for binary hard spheres. The fits are displayed in Supplementary Figure 13 and the resulting fit parameters are presented in Supplementary Table I.
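A sketch of such a power-law fit with scipy, using synthetic relaxation times generated from known parameters (the values below are illustrative, not our fit results). The bounds keep T_MCT below the lowest simulated temperature during the fit, so the power law stays well defined.

```python
import numpy as np
from scipy.optimize import curve_fit

def mct_power_law(T, A, T_mct, gamma):
    """MCT scaling: tau_alpha = A * (T - T_mct)**(-gamma), valid for T > T_mct."""
    return A * (T - T_mct) ** (-gamma)

# Synthetic relaxation times from known (illustrative) parameters.
A_true, T_mct_true, gamma_true = 0.1, 0.435, 2.5
T = np.linspace(0.5, 1.0, 12)
tau = mct_power_law(T, A_true, T_mct_true, gamma_true)

popt, _ = curve_fit(mct_power_law, T, tau, p0=(1.0, 0.4, 2.0),
                    bounds=([1e-6, 0.0, 0.1], [10.0, 0.49, 10.0]))
print(popt)   # recovers approximately (0.1, 0.435, 2.5)
```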

SUPPLEMENTARY NOTE 8: SANN VS. CUTOFF IN KOB-ANDERSEN
In order to describe the local environment of each particle i in terms of BOPs, needed for Q(i), we require a definition of which neighbors to include in the bond order calculation. For both hard spheres and Wahnström, we used the definition of nearest neighbors from the solid-angle nearest neighbor (SANN) algorithm [21]. This algorithm is known to work well for hard spheres, and as Wahnström is very similar, we expected it to work well here as well. For Kob-Andersen, we also initially used SANN, but later found that a tuned fixed cutoff radius worked better. In the following we compare the results of SANN and our cutoff of 1.2 σ_A. In particular, we analyse a snapshot of a glassy Kob-Andersen mixture at density ρ* = 1.2 and temperature T* = 0.5. Supplementary Figure 14 shows a comparison of the Spearman's rank correlation between the particles' dynamic propensity D_i and their membership probability P_red(i) obtained by using either SANN or a fixed cutoff radius.

Supplementary Figure 13: Fits used to extract the mode coupling temperatures and packing fraction in Supplementary Table I. The points are measurements of the structural relaxation time τ_α from our simulations, and the dashed lines are fits following Supplementary Equation 5.

Supplementary Figure 14: Comparison of the Spearman's rank correlation between the particles' dynamic propensity D_i and their membership probability P_red(i) obtained by using either SANN or a cutoff radius of r_c = 1.2σ_A for both large and small particles.
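A fixed-cutoff neighbour search of this kind can be sketched with scipy's cKDTree, assuming a cubic periodic box and random stand-in positions at number density ρ* = 1.2 (this is a simple alternative to SANN, which is parameter-free but more involved to implement).

```python
import numpy as np
from scipy.spatial import cKDTree

def neighbours_within_cutoff(positions, r_c, box):
    """Fixed-cutoff neighbour lists (excluding the particle itself),
    with cubic periodic boundaries of side `box`."""
    tree = cKDTree(positions, boxsize=box)
    lists = tree.query_ball_point(positions, r=r_c)
    return [[j for j in lst if j != i] for i, lst in enumerate(lists)]

rng = np.random.default_rng(5)
N = 1000
box = (N / 1.2) ** (1 / 3)          # side length giving number density 1.2
pos = rng.uniform(0, box, size=(N, 3))   # synthetic stand-in positions
nbrs = neighbours_within_cutoff(pos, r_c=1.2, box=box)
```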

SUPPLEMENTARY NOTE 9: PEARSON VS. SPEARMAN CORRELATION
All our results are presented in terms of the Spearman's rank correlation between the structural order parameter found by the UML method and the dynamic propensities of the particles. The Spearman's rank correlation is a measure of how strongly the relationship between two variables can be described by a monotonic function (whether linear or not).
However, most other works in which supervised machine learning is used to predict the dynamics of particles in glassy systems instead use the Pearson correlation as a measure of the performance of their methods, which complicates a direct comparison.
Note that our UML method develops an order parameter that is purely based on local structure and does not rely on any dynamical information. As such, there is no obvious reason to believe that P_red and the dynamic propensity should be linearly related, which is why we decided to use the Spearman's rank correlation. Nonetheless, we find that the two types of correlations are very similar for all systems analyzed. As an example, Supplementary Figure 15 shows a comparison of the Pearson and Spearman's rank correlations between P̄_red (averaged over a local neighborhood of radius r_c/σ_A = 2) and the dynamic propensity for the large particles of the three main snapshots analyzed in the paper.
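The difference between the two measures can be illustrated on synthetic data with a monotonic but nonlinear relationship, for which the Spearman rank correlation stays close to one while the Pearson correlation is degraded by the nonlinearity:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(6)
x = rng.normal(size=5000)
y = np.exp(x) + 0.05 * rng.normal(size=5000)   # monotonic but strongly nonlinear

r_pearson = pearsonr(x, y)[0]                  # sensitive to the nonlinearity
r_spearman = spearmanr(x, y)[0]                # depends only on the ranks
print(r_spearman > r_pearson)                  # True
```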