Understanding homogeneous crystal nucleation under deep undercooling conditions remains a formidable issue, as crystallization is essentially heterogeneous in nature and initiated from impurities, surfaces, or near grain boundaries that often hinder its occurrence1,2. Unreachable until very recently, experimental observations of early stages of nuclei was achieved by a tour de force using time tracking of three-dimensional (3D) Atomic Electron Tomography3 of metallic nanoparticles. Those complex phenomena remain to date out-of-reach experimentally for bulk systems, thus hindering our theoretical understanding. This line of research still belongs mostly to the domain of atomic-level simulations and more particularly to molecular dynamics (MD) with generic interaction models4,5. To reach statistically meaningful events, large scale simulations are required. This still remains challenging as only few studies are providing now million-to billion-atom simulations for monatomic metals2.

To identify translational and orientational orderings during homogeneous nucleation in MD simulations, an unsupervised learning approach6 based on topological data analysis (TDA) signatures was developed through persistent homology (PH)7,8. PH is an intrinsically flexible, yet highly informative, tool which detects meaningful topological features deduced from atomic configurations. It was successfully applied very recently to characterise structural environments in metallic glasses9, ice10 and complex molecular liquids11. Always used as a structural analysis in these studies, the originality here is to use PH as a translational and rotational invariant descriptor to encode the local structures required for the clustering method. For the latter a model-based method is used, namely Gaussian Mixture Models (GMM)12, (already used with success to analyse MD simulations13) and its estimation by an Expectation Maximization (EM) algorithm14. The number of clusters (The word ‘cluster’ is used for groups detected by the machine learning method throughout the text.) is selected by Integrated Completed Likelihood (ICL15), a refinement for clustering of Bayesian Integrated Likelihood (BIC16). The inferred model from the method called hereafter TDA-GMM, is used to identify and describe the structural and morphological properties of the nuclei as well as their liquid environment at various steps of the crystal nucleation.

Figure 1
figure 1

Unsupervised learning of homogeneous nucleation. Snapshot of a 10-million atom MD simulation of Ta during nucleation along the \(T=1900\) K isotherm (a,b). Independent local atomic structures within a cut-off-radius of 6.8 Å form a train set represented in the descriptor space by 215 PH components up to dimension 2. (c) Evolution of the ICL criterion as a function of the number of clusters is used to get autonomously the optimal number of clusters shown in (d). In (a) the snapshot is represented only with atoms in cluster \(C_1\) and cluster \(C_2\) revealing all nuclei (see text), while in (b) atoms of all clusters are displayed showing that those in cluster \(C_3\) are located mainly at the border of the nuclei and \(C_4\), \(C_5\) and \(C_6\) correspond to the surrounding liquid with various topological characteristics.

Results and discussion

With this unsupervised approach, the homogeneous nucleation process was studied in three monatomic metals chosen for the variety of their underlying crystalline phase, namely body-centered cubic (bcc) for Ta, face centred-cubic (fcc) for Al, and hexagonal-closed packed (hcp) for Mg. Large-scale molecular dynamics simulations17 comprising one and ten million atoms were performed with a similar procedure used in our preceding work on pure Zr18 and described in more details in “Methods” section. Figure 1 depicts the methodology applied here to Ta. A rapid quenching at constant pressure brings the liquid from \(T=3300\) K down to \(T=1900\) K close to the time-temperature-transformation (TTT) nose. Crystal nucleation is observed along an isothermal process during which a configuration of the simulation is chosen for the clustering. As it contains many nuclei with different sizes and a substantial amount of liquid, it is considered as representative of the phenomenon. From its inherent structure19, a training set of 5 000 non overlapping local spherical structures within a cutoff radius of 6.8 Å was sampled for the unsupervised learning (see Supplementary Information), with the constraints of covering the entire simulation box uniformly and randomly. Among all possible sets upon applying the GMM, the one with 6 clusters shown in Fig. 1d was automatically inferred to be representative of the system based on the minimum ICL criterion Fig. 1c. The snapshot of the simulation box in Fig. 1a displays only atoms of type \(C_1\) and \(C_2\), as they show clearly a crystalline order, refraining at this stage from characterising it. They reveal all nuclei as it will be seen below, along with their structure, size and morphology out of the simulation box displayed in Fig. 1b. From this model, each atom of each configuration generated by the MD simulation can be assigned to one of the six clusters (the one with the highest probability). Such a clustering training is performed independently for each metal and shows that more than 99.99 % of the structures have a probability to belong to the most probable Gaussian component greater than 0.999, even for structures not in the initial training set.

Figure 2
figure 2

Homogeneous nucleation in Ta and Al undercooled liquids. Snapshots of the molecular dynamics simulations for Ta (a) and Al (b) with respectively 10 and 1 million atoms, during isothermal nucleation at different times at temperatures close to the nose of the Time-Temperature Transformation (TTT) for Ta (c) and Al (d). From stored configurations during fast cooling (blue curves), nucleation events along several isotherms were observed by monitoring the sharp drop of the internal energy (insets in (c,d)). The average nucleation times \(\tau _N\) (symbols) were determined from 5 independent simulations for each temperature giving the TTT curves in the vicinity of the nose (orange lines).

Figure 2 shows typical homogeneous nucleation events in undercooled Ta and Al during an isothermal process close to the nose of the TTT, which can be done by standard MD simulations without the need of an accelerated methods such as the Forward-flux sampling method20. The liquids above the melting point \(T_{M}\) were first quenched down at ambient pressure to the glass transition sufficiently rapidly to avoid nucleation (see Table S1 in Supplementary Information). From stored configurations during cooling, the TTT curves in the vicinity of the nose were built from observation of the nucleation along several isotherms as shown in Fig. 2c,d. An isotherm slightly above the TTT nose is chosen for the analysis, i.e. \(T=1900\) K for Ta and \(T=650\) K for Al. From chosen configurations during the nucleation and growth process, the clustering is obtained from application of the corresponding trained model as described above. For all metals considered here, strongly growing fraction of mainly two clusters, concomitant to the sharp drop of the energy, is observed. For Ta and Al, only local structures belonging to these clusters are drawn in Fig. 2a,b, revealing evidently the nuclei and their evolution in time, recalling that solely the topological vector is describing the local structure. The nuclei morphologies show globular shapes that are rather spherical, characteristic of high \(\Delta T\), although obviously not strictly as revealed more quantitatively from a convex hull analysis. Interestingly, atoms from one of the two clusters (coloured in red) are mainly located inside the nuclei while atoms from the second one (coloured in pink) steadily remain essentially at the border upon growing. They stay finally at grain boundaries after full solidification of the simulation boxes. Its appearance inside the nuclei reveals also the presence of defaults, as it will be examined below.

The simulations of homogeneous nucleation shown in Fig. 2 were performed with 10 and 1 million atoms for Ta and Al, respectively. In both cases, the vast majority of the embryos dissolve back to the liquid while those attaining the critical size are rare and grow. The larger simulation box for Ta allows to follow the nucleation process for a longer time, sufficient to observe more secondary nucleation events21. Direct estimation of the critical size is still unreachable by experiment, as nuclei can be detected only at larger size3. This is also scarcely studied by MD simulation as it is not easy to define their boundary from the surrounding liquid22,23, especially in the case of non-spherical or ramified shape24. Here, the size distribution of nuclei was obtained by counting the number of atoms in overlapping structures identified as red and pink clusters within the cut-off radius. An estimation of the critical size was inferred from the size of the nuclei that persist between the first and second configurations shown in Fig. 2, at least without loosing atoms they contained initially. As it can be seen for Ta on Fig. 1d, the local structures of the two clusters forming the nuclei are unambiguously crystalline (with only a slight distortion for structures from cluster \(C_2\)) giving a clear definition of them. This is repeated in the subsequent consecutive pairs of configurations to refine statistics, and the results for all metals are gathered in Table S1 in Supplementary Information. For Ta, embryos with size less than 120 atoms always dissolve back to the liquid while the few nuclei found with size larger that 150 atoms always grew. Similar values of the critical radius were determined very recently for bcc Fe and fcc Cu25 and fcc Zn26 in similar high \(\Delta T\) regime. For Al and Mg the simulations were performed at lower \(\Delta T\) yielding obviously larger critical nuclei which are consistent with the Lennard–Jones case5,22 and also with Al but somewhat lower with respect to recent MD simulations27.

Figure 3
figure 3

Translational and bond-orientational order parameters for Ta. (a) Radial density profile of the largest nucleus during the growth at 2.7 ns along the \(T=1900\) K isotherm. The red and blue dashed horizontal lines correspond respectively to the average bulk crystalline density and average bulk undercooled liquid without nucleation events, both being simulated at \(T=1900\) K at ambient pressure. (b) Corresponding slice of the nucleus through its centre and the surrounding liquid where atoms have been coloured according to the cluster they belong to (see Fig. 1d). (c) Total radial density profile of the largest nucleus during growth at times corresponding to Fig. 2 before solidification. Inset: time evolution of the radius of the nucleus. (d) Bond-orientational order in terms of bonded pairs of the Common-Neighbor Analysis28 for each cluster of the model.

The nucleation process is characterized at least by two order parameters, the translational order (TO) and the crystalline ordering called hereafter the bond-orientational order (BOO). A dedicated representation of the TO is the number density. It is primarily applied to the embryos and the nuclei at different stage of the growth, through the radial partial atomic density profiles \(\rho _i(r) = N_i(r)/\frac{4\pi }{3} [(r+\Delta r)^3-r^3]\) as a function of distance r of the estimated centre of the nucleus, \(N_i(r)\) being the number of atoms belonging to cluster \(C_i\) in a spherical shell of radius r and thickness \(\Delta r=1\) Å. Considering the nucleation process of Ta as an illustration, Fig. 3a depicts the density profiles \(\rho _i(r)\) for all 6 clusters for the largest nucleus shown in Fig. 2a and its surrounding liquid at time 2.7 ns. The corresponding slice of the nucleus through its centre is drawn in Fig. 3b. Thus, the nucleus is defined by atoms belonging to clusters \(C_1\) and \(C_2\) as described above, atoms of \(C_1\) forming the centre of the nucleus, while atoms of \(C_2\) being mainly located at its border, as can be easily confirmed visually. It should be noted that atoms of cluster \(C_3\) are mainly located at the boundary of the nucleus, but they cannot be considered as being part of it, as they are also present in the entire box. From the total density profile of the nucleus \(\rho _N(r) = \rho _1(r)+\rho _2(r)\), it can be seen clearly that the density of nucleus has already reached at this stage the one of the bulk crystal at the same temperature. Defining the remaining clusters (\(C_3\) to \(C_6\)) as belonging to the liquid yields to a total density profile \(\rho _L(r) = \sum _{i=3}^6\rho _i(r)\) showing that even in the vicinity of the nucleus the liquid is negligibly influenced by its presence, keeping the density of the bulk undercooled liquid.

Figure 3c shows the evolution of the density profile \(\rho _N(r)\) at different times of the growing process. The average radius \(r_N\) of the nucleus is taken as the value of r at half-maximum of \(\rho _N(r)\) and its evolution with time is shown in the inset, displaying a linear behaviour in agreement with CNT2. Whatever the size of the nuclei, the density of the inner part is close to the bulk crystal. More importantly, this is all the more true for all the embryos below the critical size up to a single local structure of type \(C_1\) or \(C_2\) corresponding to the minimal size of about 65 atomic structures identified by the TDA-GMM given the chosen cutoff radius (see Supplementary Information). This feature appears to be general as similar results are found for Al and Mg as shown in the Supplementary Information.

The BOO of each cluster is identified through the Common-Neighbour Analysis (CNA)28, chosen as a well-known and robust tool. The CNA signature30 given in Fig. 3d reveals that structures from clusters \(C_1\) and \(C_2\) possess respectively a perfect and slightly distorted bcc crystalline ordering confirming the above analysis of nucleation and growth in terms of topological descriptors. Structures from clusters \(C_4\), \(C_5\) and \(C_6\) display various high degrees of five-fold symmetry (FFS) characteristic of the liquid state together with a small but non negligible degree of bcc ordering, while structures from cluster \(C_3\) retains both FFS and bcc order in similar proportions. As shown in the Supplementary Information file, the use of Steinhardt parameters29 leads to a similar result. Such a BOO of the four clusters associated to the liquid agrees well with ab initio molecular dynamics simulations31 and was interpreted as compatible with the A15 crystalline phase. This analysis in terms of CNA highlights and confirms that the TDA-GMM unsupervised learning approach is a powerful method to capture the structural picture in its finest details.

Figure 4
figure 4

Early nucleation stage for Ta. (a) Snapshot of the simulation with 1 million atoms along isotherm \(T=1900 K\) showing the onset of nucleation when the first nucleus starts to grow (highlighted by the black circle). Atoms have been coloured according to the cluster they belong to (see Fig. 1d). (b) p-values computed on the projection of atomic positions on each direction of the box from a Kolmogorov–Smirnov test against the uniform distribution.

The peculiar spatial distribution of structure of type \(C_3\) shown in Fig. 3a deserves further attention. Firstly, its location at the boundaries of the nuclei could suggest the existence of a new type of ordering like in nucleation of ice32. However, here it is consistent with the mixed bcc and FFS orderings. This is then seen as an effect of the TDA-GMM procedure that picks up structures covering a part of the nucleus and of the liquid in the configuration used for the training. As a matter of fact, increasing the number of clusters in the GMM reveal several clusters with various crystalline and FFS ordering as shown in Supplementary Information file using a Principal Component Analysis (PCA)12. More interestingly, its presence in the whole simulation box indicates that in the undercooled liquid, some regions with higher bcc ordering might develop apart from the vicinity of the growing nuclei. Figure 4a shows a snapshot of the simulation at the onset of nucleation when the first nucleus starts to grow, all atoms being coloured according to the cluster they belong to. Figure 4b depicts a table of the concurrent p-values obtained, for each cluster, on the projection of atomic positions on the 3 directions of space, from a Kolmogorov–Smirnov test33 against the uniform distribution. For a level 0.01, the test is always rejected in at least one direction, which proves that the distribution of the clusters in the box is not uniform, i.e. their heterogeneity. Focusing on atoms of type \(C_4\) (green) and \(C_6\) (yellow), which represent more than 90 % of the atoms at this stage (58% for \(C_4\) and 34% for \(C_6\)), it clearly shows that the undercooled liquid embodies structural heterogeneities with varying degree of FFS. Moreover, higher bcc ordering characterized by structures of type \(C_3\) appears in localized regions of lower FFS (green) from which, in most of the cases, embryos formed by structures from clusters \(C_1\) and \(C_2\) emerge. The same conclusion of structural heterogeneity is obtained for Mg and Al, with particularly low p-values for Mg (see Supplementary Information).

The question whether the onset of nucleation is initiated primarily by translational or by orientational ordering is still open2,34 and was debated during the last decade with a controversy essentially centred on the hard sphere and associated colloidal systems35,36,37. For Ta, the small emerging embryos at the onset of nucleation, corresponding to one structure of 55 to 70 atoms belonging to \(C_1\) or \(C_2\) with bcc crystalline BOO, show bond lengths of their bcc lattice close to the density of the bulk crystal at \(T=1900\) K, a feature that also holds for the other metals investigated here. This provides evidence for the size of embryos that can be detected here: translational and bond-orientational orders appear simultaneously and rule out the scenario in which homogeneous nucleation is driven by BOO first35,38 for pure metallic systems. For more complex metallic alloys like Al–Ni34, it was shown, using Steinhardt parameters, that nucleation is initiated by orientational ordering, followed by density ordering. However, our results are consistent with the fact that, at least for pure metallic systems with strong bonding, unlike hard spheres, are more energy driven rather than entropy driven systems.

All these features allow us to propose a nucleation pathway for the metals considered here, guided by their electronic structure characteristics and underlying stable crystalline structure. For Ta, our findings show a single step process with an onset of homogeneous nucleation taking place in low FFS domains of the heterogeneous liquid, where emerging bcc embryos have simultaneously the density of the bulk solid. After reaching the critical size, the nuclei grow in a rather globular shape with a bcc structure, a small amount of defects, and a diffuse interface with decreasing bcc ordering. During the growth the surrounding liquid keeps the bulk liquid density. This single-step nucleation might be explained by the strong stability of the bcc structure of Ta coming from its half filled 5d-band structure with one of the highest cohesive energy47. A similar one step nucleation pathway also holds for Al in which embryos emerge from the low FFS regions directly with the fcc bond ordering. The growing nuclei have here a more patchy morphology and a significant amount of fcc stacking faults. For Mg, a two steps process is identified as can be seen in the Supplementary Information: an onset of nucleation showing embryos having mainly a bcc ordering followed by growth of nuclei with a mixed fcc/hcp structure and some bcc ordering at the surface of the nuclei. The difference in the nucleation pathways might be surprising as the fcc and ideal hcp structures displayed by Al and Mg have identical first and second neighbors atoms47,48. In describing these sp-valent metals within the second-order perturbation theory, it was shown49,50 that their structural energy difference come from more distant neighbors and is small, and even smaller for Mg. Nevertheless, the relative stability with respect to the bcc structure is \(\sim 3\) times larger for Al than for Mg47,48. This might explain why the onset of nucleation is a one-step nucleation process due to the higher fcc stability, while it is a two-step process for Mg, starting with in the first step with embryos having locally preferred high symmetry bcc structure, due to the much smaller structural difference with respect to the hcp. The presence of defective hcp structures in fcc Al nuclei and fcc local structure in hcp Mg nuclei may result from the very small hcp-fcc energy difference. For Mg,, the scenario is more akin to the Lennard–Jones case5,22 following the Landau Theory in which the bcc precursor is favoured in the early stages of crystal nucleation39 as well as the Ostwald step rule40 for which the primary crystal phase nucleating from the liquid is not necessarily the thermodynamic stable one.


The present unsupervised learning approach was shown to be a powerful tool to unravel the atomic scale mechanisms of crystal nucleation in monatomic metals. It allowed us to reveal general aspects in the homogeneous nucleation process as well as specificities depending on the metallic element under consideration. Our results are in line with the emerging idea that heterogeneities which exist in the undercooled liquid37 play the foremost role in the onset of nucleation. For all metals, nucleation have been found to start in low FFS regions, which is consistent with Frank’s argument41 and also observed for Al–Ni alloys34. For pure metals, translational and orientational ordering taking place simultaneously in emerging embryos. Given the fact that metals considered here have different crystalline ordering, this feature is induced by the bonding character of interactions. Moreover, embryos as well as nuclei during the growth possess the bulk crystal density driven by the metallic bond length while the surrounding liquid keeps the bulk liquid density in accordance with the classical nucleation theory2. However, our analysis reveals also some aspects beyond the CNT, such as nuclei having a diffuse interface with the surrounding liquid and metals possessing their own nucleation pathways, involving e.g. for Mg a two step mechanism40. This may trigger further theoretical developments, and for instance in the diffuse interface approach2. The complexity and richness found here for metals and in other systems22,36,37 underline the future challenges in stepping forward in our theoretical understanding beyond the CNT. This promising methodology more generally opens the door to a deeper and autonomous investigation of atomic level mechanisms in materials science.


Simulation method

Molecular dynamics simulations were performed with the lammps code17 in a fully periodic situation. Verlet’s algorithm in the velocity form for the numerical integration of the phase space trajectory was used with a time step of 2 fs for Ta with a number of atoms \(N=10^7\) (\(10^6\) for the training) and 1 fs for Al and Mg with \(N=10^6\). Interatomic interaction were taken in the Embedded Atom Model form and chosen for their ability to reproduce the liquid and solid properties as described in the Supplementary Information. Control of the thermodynamic conditions was done with the Nosé–Hoover thermostat and barostat42 was used to maintain the ambient pressure whatever the temperature. The time-temperature transformation curves were first built for each metal following the procedure established recently18. The procedure to compute the TTT curves follows the one from our previous work on Zr18. Along an isotherm located slightly above the TTT nose, 6 configurations of interest were selected for the purpose of monitoring the crystal nucleation process. The unsupervised method is applied to the inherent structures19 of the chosen configurations. The latter are obtained from a minimization of the energy by means of a conjugate gradient algorithm to bring the system in a local minimum of the potential energy surface. In this manner, the thermal noise is removed and allows us to reveal the genuine structural features. This is an important aspect for the unsupervised method used here, either for the construction of the topological descriptor or for the clustering. It avoids identifying several clusters with similar structural characteristics in the same local minimum of the PEL. With the system size considered here neither partial criystallisation nor change in the nucleation process were observed by applying the minimization.

Definition of the local structures

Figure 5 shows schematically the pair-correlation function g(r) of the undercooled liquid and crystalline states of Ta at \(T=1900\) K with the respective mean structure assigned to the clusters \(C_{4}\) (preponderant liquid) and \(C_{1}\) (preponderant bcc). A cut-off radius of 6.8 \(\AA\) was set to capture topological informations with the help of the Python package gudhi43 and ripser.py44 up to the second neighbour shell. The local atomic structures were extracted with Python package pyscal45.

Figure 5
figure 5

Cut-off radius for the clustering based on the g(r) function with the second minimum leading to structures with two neighbors shells with the one associated in Ta to the preponderant liquid in green (a) and the one to bcc ordering in red (b).

In the context of classical descriptors like the averaged bond-orientational order analysis it was shown29 that information from the second neighbour shell increase the accuracy in the discrimination of local structures, but at the expense of a loss in the spatial resolution. This is also observed in the topological descriptors set up here which give rise to more \(H_0\) and \(H_1\) components as well as \(H_2\) components which appear only when considering more than just one neighbour shell. It should be pointed out that when increasing further the cut-off radius up to the third neighbour shell and beyond, the benefit gained in topological information is counterbalanced by a too large spatial extension leading to a loss of resolution in the Gaussian Mixture Model (GMM) clustering. This compromise between the accuracy and the spatial resolution bring us the optimal choice of the second neighbour shell to define local structures consistently with earlier findings29.

Persistent homological descriptors’ space (TDA)

The unsupervised learning in the MD configurations is performed in terms of the local atomic environment of each atom (called the local structure) within a cut-off radius defined as the second minimum of the pair-correlation function g(r) in the liquid, as described in Fig. S1 of Supplementary Information. The use of two atomic neighbour shells to represent the local environment was shown to optimize the local structural information of descriptors at the expense of a loss of the spatial resolution29. In Persistent Homology7,8, components of homological dimensions \(H_{0}\), \(H_{1}\) and \(H_{2}\) are then used in the form of a topological vector of dimension \(n_{PH}\) to represent each local structure. Its components are calculated from the Persistent Diagrams (PD) representing the birth and death characteristics of each topological component, as shown in Fig. 6. More precisely, for each pair of points (xy) in a PD, D, the values of the topological vector components are calculated, except for the infinite point, for a fixed level of homology8 by

$$\begin{aligned} m_D(x,y) = \min \{\Vert x-y\Vert _\infty , d_\Delta (x), d_\Delta (y)\}, \end{aligned}$$

where \(d_\Delta (\cdot )\) denotes the \(\ell ^\infty\) distance to the diagonal. The number of \(H_0\) is fixed by the number of neighbour atoms and the number of components of \(H_1\) and \(H_2\) is inferred from a subsampling approach as described in46 to remove the noise.

Figure 6
figure 6

Persistence diagrams of homological dimensions \(H_{0}\), \(H_{1}\) and \(H_{2}\) for the mean structures of \(C_{1}\) and \(C_{4}\) which represent respectively the bcc ordering and the preponderant liquid structure.

Clustering using a Gaussian mixture model (GMM)

In order to build a training set for the learning, a sampling of 5000–7000 structures, that covers the entire simulation box by means of their central particles at least separated by two times a cut-off radius are extracted from a million atoms configuration chosen during the nucleation.

From the built topological descriptors’ space as described above, a mixture of M Gaussian distributions \((\phi (\;\cdot \;;\varvec{\mu }_{m},\Sigma _{m}))_{1\le m \le M}\) of weights \((\alpha _{m})_{1\le m \le M}\) is written as

$$\begin{aligned} \sum _{m=1}^{M}\alpha _{m}\phi (\;\cdot \;;\varvec{\mu }_{m},\Sigma _{m}), \end{aligned}$$

where \(\varvec{\mu }_{m}\) is the position of the mean and \(\Sigma _{m}\) the covariance matrix of the mth Gaussian distribution. The number of Gaussian components is set using the ICL criterion15 and full covariance matrices with 3 000 K-means initializations are used to construct a model for applications on configurations along the nucleation process. Doubling manually the number of clusters to be identified, even if it is not optimal in the sense of the ICL, leads to the same evolution of the CNA of the clusters as in Fig. 3d with additional clusters having intermediate orderings displaying a “more continuous” evolution as shown in the Supplementary Information file.