Predicting heterogeneous ice nucleation with a data-driven approach

Water in nature predominantly freezes with the help of foreign materials through a process known as heterogeneous ice nucleation. Although this effect was exploited more than seven decades ago in Vonnegut’s pioneering cloud seeding experiments, it remains unclear what makes a material a good ice former. Here, we show through a machine learning analysis of nucleation simulations on a database of diverse model substrates that a set of physical descriptors for heterogeneous ice nucleation can be identified. Our results reveal that, beyond Vonnegut’s connection with the lattice match to ice, three new microscopic factors help to predict the ice nucleating ability. These are: local ordering induced in liquid water, density reduction of liquid water near the surface and corrugation of the adsorption energy landscape felt by water. With this we take a step towards quantitative understanding of heterogeneous ice nucleation and the in silico design of materials to control ice formation.

Acronyms of the systems displayed in Supplementary Figure 1. The interaction of water with the LJ substrates, the back wall and the carbon atoms in graphene was modelled with a Lennard-Jones interaction. The interaction of the OH groups with water was treated as an mW-mW interaction.5

Acronym      Description                               Interaction   Ref.
LJ_fcc111    Fcc crystal with (111) surface exposed    LJ            2
LJ_fcc100    Fcc crystal with (100) surface exposed    LJ            2
LJ_fcc110    Fcc crystal with (110) surface exposed    LJ            2
LJ_fcc211    Fcc crystal with (211) surface exposed    LJ

We give a brief overview of the initial features we consider and how they were computed.
• dyn: The starting point is a simulation of liquid water interfacing with the substrate. We run two sets of trajectories: one 100 ns long where we save a frame every 1 ps (for the Steinhardt parameters q_l,9 the local Steinhardt parameters lq_l,10 and the number of nearest neighbors nn) and one 100 ps long where we save a frame every 10 fs (for forces and velocities).
• rssB_flex: Energies from the minimization of n-mer water clusters and cages placed at many random positions above the surface. The ice structures are free to relax and adjust.
• rssB_rigid: Similar to the flexible approach, but keeping the deposited ice structure rigid. Since energy minimization with rigid bodies is highly non-trivial, we performed short MD simulations with rigid constraints,12,13 slightly pushing the ice structure downwards while draining the kinetic energy with a friction term in the equations of motion. In this manner the ice structures have enough room to find a local minimum with respect to position and orientation.
• lmatch: Generalized lattice match, calculated as in Ref. Here we place ice layers corresponding to a certain face randomly over the surface and compute, for each ice molecule, the shortest distance to a substrate atom, provided this distance is shorter than a certain cutoff (yielding N_M contacts); a minimal sketch of this contact counting is given after this list. We considered 2D projections of the ice lattices as well as their actual 3D structures, and different cutoffs for the neighbor choice. The ice faces considered were Ih(001), Ih(100), Ih(110) and Ic(001).
• dens: Number density of liquid water in different layers obtained from the dyn runs.
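To make the contact counting behind lmatch concrete, the following minimal Python sketch (illustrative only; the array names are hypothetical and the actual implementation, treatment of periodic boundaries and choice of cutoffs differ in detail) counts, for a randomly shifted ice layer, how many molecules find a substrate atom within a given cutoff.

```python
import numpy as np

def count_contacts(ice_xyz, substrate_xyz, cutoff=3.2):
    """Return N_M, the number of ice molecules whose nearest substrate
    atom lies within `cutoff` (Angstrom).

    ice_xyz       : (N_ice, 3) positions of one deposited ice layer
    substrate_xyz : (N_sub, 3) positions of the substrate atoms
    """
    # all pairwise distances between ice molecules and substrate atoms
    d = np.linalg.norm(ice_xyz[:, None, :] - substrate_xyz[None, :, :], axis=-1)
    # shortest distance to the substrate for every ice molecule
    d_min = d.min(axis=1)
    return int(np.sum(d_min < cutoff))

def lattice_match(ice_xyz, substrate_xyz, n_trials=100, cutoff=3.2, seed=0):
    """Average fraction of contacting ice molecules over random in-plane shifts
    (periodic boundary conditions are ignored in this sketch)."""
    rng = np.random.default_rng(seed)
    lx = np.ptp(substrate_xyz[:, 0])
    ly = np.ptp(substrate_xyz[:, 1])
    fractions = []
    for _ in range(n_trials):
        shift = np.array([rng.uniform(0.0, lx), rng.uniform(0.0, ly), 0.0])
        n_contacts = count_contacts(ice_xyz + shift, substrate_xyz, cutoff)
        fractions.append(n_contacts / len(ice_xyz))
    return float(np.mean(fractions))
```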
Here are four examples of acronyms and their actual meaning:

1. lmatch2D_Ih001_c2: Generalized lattice match calculated with the second cutoff (c2 = 3.2 Å) of the 2D-projected basal face (Ih001).

2. dyn_nn_all_median: Median number of nearest neighbors in the whole liquid (cutoff 3.4 Å).

3. rssBr_4mer_Eall_range: Total range of all adsorption energies of the water tetramer obtained with the rigid random structure search approach.

4. dyn_lq3_l12_var: Variance of the lq_3 parameter in the water layer l12 (definition in Supplementary Figure 2).

Supplementary Figure 2: Overview of the different feature families and their automatic naming. The meanings of "dyn", "disp", "rssA", "rssB_rigid", "rssB_flex", "dens" and "lmatch" are explained in the text. Names are created by following the arrows, e.g. "disp_xy_l3_50" or "dyn_lq3_l1_mean" are possible names describing "the xy-displacement in the 3rd layer after 50 ps" and "the mean lq3 values in the first layer", respectively. The illustration on the top right shows the definition of the layers, where each single layer has a z-extension of 3.68 Å.
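The automatic naming can be thought of as a simple concatenation of building blocks. The following Python snippet is a minimal sketch of that idea; the lists of families, observables, layers and statistics shown here are examples only, not the exhaustive sets used in this work.

```python
from itertools import product

# Example building blocks (illustrative, not the full sets used in this work)
families    = ["dyn"]
observables = ["q3", "lq3", "nn"]
layers      = ["l1", "l2", "l12", "all"]
statistics  = ["mean", "median", "var", "range"]

feature_names = ["_".join(parts)
                 for parts in product(families, observables, layers, statistics)]
print(feature_names[:3])  # ['dyn_q3_l1_mean', 'dyn_q3_l1_median', 'dyn_q3_l1_var']
```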
We provide a few more details on the algorithms and metrics used in the machine learning workflow.
After the cooling ramps have been performed and all possible features have been computed, we train a random forest model.14 To this end we label a substrate as good when its average T_n was above 225 K and as bad otherwise. The model is trained to predict whether a substrate is good or bad, which is a binary classification problem.
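A minimal sketch of this classification setup with scikit-learn is given below; the feature matrix, nucleation temperatures and hyperparameters are placeholders rather than the actual data and settings used in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data: in reality X holds the computed features for every substrate
# and T_n the average nucleation temperatures from the cooling ramps.
rng = np.random.default_rng(0)
X   = rng.normal(size=(35, 50))           # (n_substrates, n_features)
T_n = rng.uniform(200.0, 260.0, size=35)  # average nucleation temperatures in K

y = (T_n > 225.0).astype(int)             # 1 = "good" ice nucleator, 0 = "bad"

clf = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # rough cross-validated accuracy
print(scores.mean())
```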
A random forest is a collection of decision trees. A single decision tree is created by performing binary splits on certain variables, where in each step the variable is chosen that gives the best improvement in the training metric. This is done until a maximum number of splits is reached or the improvement in the metric falls below a certain threshold. The random forest is then created by fitting many such decision trees, each on a randomly selected subset of all data: the substrates are selected via bootstrapping, and in addition only a random subset of the features is considered for each split.

To deal with feature correlations we cluster the features by hierarchical clustering, a classical and simple clustering algorithm which relies on a distance metric. We want to consider two features f_1 and f_2 as close when they are strongly correlated. Thus, we define

d(f_1, f_2) = 1 - MIC(f_1, f_2),

where MIC(f_1, f_2) is the maximum information coefficient.15 The MIC is essentially a variant of mutual information that measures the common information in two features, additionally screens through many different grid sizes needed to calculate the underlying histograms, and is bounded between 0 and 1. Most importantly (and contrary to the standard Pearson correlation) it is able to recognize non-linear and periodic correlations and is thus very well suited for our distance metric.
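A minimal sketch of this feature distance, assuming the minepy package is used for the MIC (the parameters alpha and c are illustrative defaults, not necessarily the settings used here):

```python
import numpy as np
from minepy import MINE  # maximal information coefficient

def mic_distance(f1, f2):
    """Feature distance d(f1, f2) = 1 - MIC(f1, f2).

    Since MIC is bounded between 0 and 1, strongly (possibly non-linearly)
    correlated features end up close to each other.
    """
    mine = MINE(alpha=0.6, c=15)  # illustrative default parameters
    mine.compute_score(f1, f2)
    return 1.0 - mine.mic()

def feature_distance_matrix(X):
    """Pairwise distances between the columns (features) of X."""
    n_feat = X.shape[1]
    D = np.zeros((n_feat, n_feat))
    for i in range(n_feat):
        for j in range(i + 1, n_feat):
            D[i, j] = D[j, i] = mic_distance(X[:, i], X[:, j])
    return D
```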
The clustering algorithm works in a simple manner. It starts by assigning the two closest data points to a cluster. Once data points are assigned to a cluster, that cluster is regarded as a new data point; when the distance to such a cluster is needed we take the mean distance to all of the data points in that cluster, a choice known as average linkage. This is repeated until all components are connected. We also tested other feature selection approaches that are not based on clustering, see Supplementary Note 4.
The feature clustering allows for a systematic identification of n clusters, in that the connected components with the largest distance are iteratively cut until n clusters are left. This can be used to select n features by generating n clusters in this manner and selecting out of each cluster the feature with the highest importance. The features selected in such a way are to some degree decorrelated, but also important for the prediction task (in our case the prediction of the target variable T_n). Supplementary Figure 4 shows a matrix of MIC correlations between features, ordered by the feature distances, together with several choices of how n clusters would be selected.
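This selection step can be sketched with standard SciPy routines as follows; this is a simplified illustration, not the exact implementation used in this work.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def select_n_features(D, importances, n):
    """Select one representative feature per cluster.

    D           : (n_feat, n_feat) symmetric matrix with entries 1 - MIC
    importances : numpy array of per-feature importances from the trained model
    n           : number of clusters (= number of selected features)
    """
    Z = linkage(squareform(D, checks=False), method="average")  # average linkage
    labels = fcluster(Z, t=n, criterion="maxclust")             # cut into n clusters
    selected = []
    for cluster in np.unique(labels):
        members = np.where(labels == cluster)[0]
        # keep the most important feature of this cluster
        selected.append(members[np.argmax(importances[members])])
    return sorted(selected)
```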
We have tried several models for the task of predicting T_n. We treat this problem as a regression problem, i.e. the exact value of T_n should be predicted. We have used random forests,14 XGBoost16 and support vector machines.17,18 The first was explained in detail in the previous subsection. XGBoost is a popular variant of gradient boosting with trees.
It is also based on single decision trees; however, the model does not average the votes of all trees as in a random forest. Rather, the model ŷ_i is built iteratively, adding new trees f_t that are fit to predict the residual error between the previous model and the prediction target y_i:

ŷ_i^(t) = ŷ_i^(t-1) + η f_t(x_i),

where η is the so-called learning rate. In this manner the model learns how to correct its own errors. The support vector machine uses a kernel to map points into a space in which they are separable in a manner corresponding to the separation in the target variable. For details the reader is referred to the literature.
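The update rule can be illustrated with a toy implementation for squared loss; XGBoost additionally uses regularization and second-order information, so this is only a sketch of the basic idea.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=100, eta=0.1, max_depth=3):
    """Toy gradient boosting for squared loss, illustrating the update
    y_hat^(t) = y_hat^(t-1) + eta * f_t(x)."""
    y_hat = np.full(len(y), y.mean())          # start from the mean prediction
    trees = []
    for _ in range(n_trees):
        residual = y - y_hat                   # what the current model gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        y_hat = y_hat + eta * tree.predict(X)  # boosting update
        trees.append(tree)
    return trees, y_hat
```

The essential difference to the random forest is thus that boosting corrects the residuals of the existing ensemble instead of averaging independent trees.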
All models come with a variety of hyperparameters, for instance the number of trees in a random forest or the learning rate for boosting. We are using a Bayesian tree-structured Parzen estimator approach to optimize these hyperparameters. It is beyond the scope of this work to benchmark many different machine learning models; however, a brief assessment was done in order to get a feeling for the difficulty of the task.
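As an illustration, such a hyperparameter search could be set up with the hyperopt package as sketched below; the library choice, search space and placeholder data are assumptions for the purpose of the example, not necessarily what was used here.

```python
import numpy as np
from hyperopt import fmin, tpe, hp
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Placeholder data standing in for the feature matrix and the target T_n
rng = np.random.default_rng(0)
X = rng.normal(size=(35, 20))
y = rng.uniform(200.0, 260.0, size=35)

def objective(params):
    """Cross-validated mean squared error for one set of hyperparameters."""
    model = XGBRegressor(n_estimators=int(params["n_estimators"]),
                         learning_rate=params["learning_rate"],
                         max_depth=int(params["max_depth"]))
    return -cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_squared_error").mean()

space = {"n_estimators":  hp.quniform("n_estimators", 50, 1000, 50),
         "learning_rate": hp.loguniform("learning_rate", np.log(1e-3), np.log(0.3)),
         "max_depth":     hp.quniform("max_depth", 2, 10, 1)}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)
```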
We find that the random forest and XGBoost perform best (the latter is discussed in the main text) and perform considerably better than the mean guess and a linear model (elastic net). The trend of decreasing improvement after 4 included features is also clear.