The phase-field method has emerged as a powerful, heuristic tool for modeling and predicting mesoscale microstructural evolution in a wide variety of material processes1,2,3,4,5. This method models interfacial dynamics without the overhead of resorting to advanced interfacial tracking algorithms such as level-set6 or adaptive meshing7. Scalar, auxiliary, continuous field variables (so-called phase-field variables) are used to represent the evolutionary state of the microstructure dynamics, such as in crack growth and propagation8,9, thin-film deposition10,11, and dislocation dynamics12, to name a few. The Cahn-Hilliard nonlinear diffusion equation1,13,14 is one of the most commonly used governing equations in phase-field models. It describes the process of phase separation, by which a two-phase mixture spontaneously separates and forms domains pure in each component. The Cahn-Hilliard equation finds applications in diverse fields ranging from complex fluids to soft matter and serves as the starting point of many phase-field models for microstructure evolution.

Traditional numerical approaches to solve the fourth-order parabolic Cahn-Hilliard equation include finite differences15, spectral approximation16, finite element analysis with mixed methods17, and isogeometric analysis18,19. The coupled stiff equation simultaneously captures a quick phase separation and a very slow coalescence. These two sub-processes operate on significantly different spatial and temporal scales, making the equation challenging to solve efficiently and accurately within realistic time constraints and reasonable computational capabilities20. Improvements in computational complexity have been enabled by the growing interest in data-driven models using machine learning (ML) methods. However, striking a balance between computational efficiency and accuracy has often been a challenge while employing these methods. Indeed, for complex and multi-variate phase-field models, the efficient Green’s function21 does not ensure an accurate solution, while Bayesian optimization22,23 techniques solve such coupled models but at a higher computational cost.

Modern ML models have paved the way for the development of fast emulators for solving parametric partial differential equations (PDEs)23,24,25,26,27,28,29,30,31,32,33,34,35,36. Among the strategies for accelerating the simulation of PDEs, a promising approach for phase-field-based microstructure evolution problems consists of using recurrent neural networks (RNNs) to learn the time-dependent microstructure evolution in latent space37,38. Within this framework, statistical functions combined with linear and nonlinear embedding techniques are used to represent the microstructure evolution in latent space. Such RNN-based surrogate models demonstrated success in generating rapid predictions of the time evolution of the microstructural auto-correlation function. The microstructure reconstructed from these statistical functions, using for instance a phase recovery algorithm39, was then used as an input for a high-fidelity solver that marches ahead in time. The developed approach reported a 5% loss in accuracy against the high-fidelity phase-field solvers. However, this class of models also comes with challenges. First, training and inference using RNNs as a surrogate model can be relatively slow because the current predicted field depends on the fields predicted at previous time steps, which limits the efficiency of the algorithm for large datasets. Second, the RNN-based architecture learns the underlying evolutionary dynamics in terms of statistical functions (non-primitive variables) of the microstructure. Reconstructing a microstructure from these statistical functions is a non-trivial and ill-posed problem40. This reconstruction step can incur additional errors, especially for interfacial dynamics problems where resolving intricate spatial length scales, such as in dendrite growth phase-field problems, is key.

In this work, we propose an alternative approach to circumvent the aforementioned challenges. We formulate the microstructure evolution problem as being equivalent to learning a mapping function \({{{\mathscr{G}}}}:{{{\boldsymbol{u}}}}\to {{{\boldsymbol{\phi }}}}\) such that,

$${{{\mathscr{G}}}}\left({{{\boldsymbol{u}}}}(x,y,t)\right)={{{\boldsymbol{\phi }}}}(x,y,t),$$

where u is the history of the microstructure evolution and ϕ(x, y, t) is the state of the microstructure at time t. We develop a framework that integrates a convolutional autoencoder architecture with a Deep Operator Network41 (DeepONet) to learn this mapping. Figure 1 illustrates the complete end-to-end workflow of the proposed algorithm. We utilize a convolutional autoencoder to provide a compact representation of the microstructure data in a low-dimensional, latent space. This convolutional autoencoder approach is then combined with the DeepONet architecture to learn the dynamics of two-phase microstructures in the autoencoder latent space. The DeepONet architecture has demonstrated its ability to model the governing differential equations (ordinary differential equations (ODEs) and PDEs) of such problems by learning the underlying operator, a mapping from functions to functions, from the available datasets for a broad range of problems28,42. We show that such an architecture is more robust than the RNN-based architecture in terms of training, computational efficiency, and sensitivity to noise. The decoder part of the convolutional autoencoder can efficiently reconstruct the time-evolved microstructure from the DeepONet predictions bypassing the challenges associated with reconstruction-induced errors when using statistical functions to represent the microstructure for instance. Overall, the trained autoencoder–DeepONet framework can then be used to replace the high-fidelity phase-field numerical solver in interpolation tasks for parameters inside the distribution of inputs used during training or to accelerate the numerical solver in extrapolation tasks for parameters outside this distribution.

Fig. 1: Schematic representation of DeepONet with convolutional autoencoder.

Step 1 involves training of the convolutional autoencoder to minimize \({{{{\mathscr{L}}}}}_{{{{\rm{ae}}}}}\). The encoder learns a suitable transformation from the high-dimensional microstructure to a low-dimensional latent space through a series of convolution (blue layers) and MaxPooling (green layers) operations. The decoder remaps the latent representation of the microstructure back to the original, real space by performing transpose convolution (orange layers) operations. A detailed description of the architecture is provided in Table 1. In step 2, we train the DeepONet in the latent space to minimize \({{{{\mathscr{L}}}}}_{{{{\rm{d}}}}}\). The entire history of 80 steps is encoded by the pre-trained convolutional encoder as \(\tilde{\Phi }\). DeepONet learns to predict \(\tilde{\phi }(t)\) at any desired time t, fed to the trunk network. The latent representation of the microstructure predicted by DeepONet is then re-mapped back to the primitive space by the transpose convolutional decoder.
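The way the branch and trunk outputs might be merged in step 2 can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: it assumes that the 3920-wide final layers quoted in the Fig. 2 caption factor as ld × p, with ld = 196 and p = 20 basis functions per latent component (this factorization, the function name `deeponet_combine`, and the constants are our assumptions), and random vectors stand in for the trained branch and trunk networks.

```python
import numpy as np

LATENT_DIM = 196   # ld of the best model reported in the text
N_BASIS = 20       # assumption: 3920 = 196 * 20 basis functions per latent component


def deeponet_combine(branch_out, trunk_out, latent_dim=LATENT_DIM, n_basis=N_BASIS):
    """Merge branch and trunk features into one latent vector.

    branch_out: (latent_dim * n_basis,) features of the encoded history
    trunk_out:  (latent_dim * n_basis,) features of the query time t
    Returns a (latent_dim,) vector: one dot product over the basis per component.
    """
    b = branch_out.reshape(latent_dim, n_basis)
    t = trunk_out.reshape(latent_dim, n_basis)
    return np.sum(b * t, axis=1)


rng = np.random.default_rng(0)
branch_out = rng.standard_normal(LATENT_DIM * N_BASIS)  # stand-in for the branch network
trunk_out = rng.standard_normal(LATENT_DIM * N_BASIS)   # stand-in for the trunk network
latent_pred = deeponet_combine(branch_out, trunk_out)   # would be passed to the decoder
```

Under this assumption, each of the 196 latent components is an inner product between 20 history-dependent coefficients and 20 time-dependent basis values, so querying a new time t only requires a new trunk evaluation.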


Training and optimization of neural operators and autoencoder architectures

We first investigated the impact of the size of the latent dimension of the autoencoder, ld, on the model performance. To this end, we trained five autoencoder models with ld = 9, 25, 64, 100, and 196 respectively. Details of the hyper-parameters used in these five convolutional autoencoders are provided in Table 1. For any given time step during the evolution of the microstructure, the encoder reduced a 128 × 128 microstructure ϕ(x, y, t) to a latent vector of size ld. The decoder mapped the microstructural latent space representation back to a 128 × 128 microstructure \(\hat{\phi }(x,y,t)\) (see Methods for more details). Each autoencoder training took approximately 33 h on one NVIDIA GeForce RTX 3090 GPU. Next, we trained the DeepONet model for 120,000 epochs on the latent space learned by the convolutional encoder for each of the five trained autoencoder models. The last layer of the branch and trunk networks for all the models uses a linear activation function. The output of the DeepONet model was then sent to the trained convolutional decoder, which performed a mapping from the latent space back to the original microstructure space, \(\hat{\phi }(x,y,t)\).
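To give a sense of how aggressive this dimensionality reduction is, the compression ratio for each latent size can be computed directly. The snippet below only restates numbers from the text (the 128 × 128 grid and the five values of ld); the variable names are ours.

```python
GRID = 128 * 128                      # values in one microstructure snapshot
LATENT_DIMS = [9, 25, 64, 100, 196]   # latent sizes ld surveyed in the text

# Ratio between the size of a snapshot and the size of its latent vector
compression = {ld: GRID / ld for ld in LATENT_DIMS}

for ld, ratio in compression.items():
    print(f"ld = {ld:3d}: {GRID} values -> {ld} values (about {ratio:.0f}x smaller)")
```

Even the largest latent space considered here (ld = 196) represents each snapshot with more than 80 times fewer values than the original field.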

Table 1 Details of the hyper-parameters used in the convolutional autoencoder.

We evaluated the effect of the size of the latent dimension of each of the models on the basis of the relative L2 norm computed across the training and testing dataset for all the time steps, including the forecasting time frames, t = {t90, … , t99} not seen by the surrogate model (Note that one time frame is equal to 500,000 time steps, t = 500,000Δt, see Methods for additional details). All the details of this survey analysis, including the DeepONet architecture, the L2 norm of relative error on train and test datasets, and the computational time taken for training the DeepONet model are reported in Table 2. From this survey, we observe that the model predictions improve when we increase the size of the latent dimension. In general, DeepONet models with \(\tanh\) and \(\sin\) activation functions performed better compared to models with a ReLU activation for this particular class of problems. As such, our best model consists of a convolutional autoencoder with ld = 196 and a DeepONet model with architecture 1 and \(\sin\) activation function (shown in Table 2). Although the training dataset consists of 1,600 different microstructure-evolution trajectories, each represented by over 80 snapshots from t = {t10, t11, … , t89}, the DeepONet training is faster compared to popular RNN architectures such as the Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) networks38,42,43. Since DeepONet does not have recurrent connections, there are no temporal dependencies during the training or at the inference stage. Instead, the network relies on the convolution operations that encode information about the history through the branch network. In addition, due to the lack of temporal dependencies, the fully connected layers in the trunk network and convolutional layers in the branch network of the DeepONet architecture can be easily parallelized, unlike LSTMs. This makes training and inference of DeepONet significantly faster than the RNN architectures.

Table 2 Detailed survey of different latent dimension size, ld, network architecture, and non-linear activation functions.

We carried out additional simulations to analyze the sensitivity of the proposed approach to the number of samples used for training. We considered training datasets with 25%, 50%, 75%, and 100% of the 1,600 training samples. We adopted the same methodology described in Methods and trained separate autoencoder–DeepONet models on each of these datasets. Details can be found in Supplementary Note 2. The model performance was evaluated on the basis of forecasting errors on test data, shown in Supplementary Figure 2. As expected, we observe better accuracy in the model predictions when increasing the number of training samples. The model trained with 1,200 data samples shows similar accuracy to the best model trained with 1,600 data samples, indicating convergence of the training procedure.

Finally, we also evaluated the effect of using different loss functions on training the autoencoder models. Specifically, we trained various autoencoders by minimizing L1 loss, relative L1 loss, L2 loss, relative L2 loss, and mixed loss (L2 loss for the initial 5,000 epochs and L1 loss for the remaining epochs). The choice of a loss function determines the landscape that the optimizer traverses in hyperspace in pursuit of the global minimum, or the best local minimum, while avoiding saddle points. For this task, we used a DeepONet model with architecture 1 (Table 2) on each of the learned latent microstructure data and re-transformed the DeepONet predictions using a pre-trained decoder to retrieve the microstructure. We analyzed the model performance by computing the forecasting error, \({{{{\mathscr{D}}}}}_{{{{\rm{test}}}}}(t)\), on unseen test data, as shown in Supplementary Note 3. We observed that models for which the autoencoder is trained on L2 loss performed better than those trained on L1 loss. When the solutions contain no outliers, L2 loss is expected to perform better than L1. In the presence of outliers, the L2 loss squares their contribution, whereas they contribute only linearly to the L1 norm. Similarly, mean values of relative L1 and L2 are a better choice for autoencoder loss, \({{{{\mathscr{L}}}}}_{{{{\rm{ae}}}}}\), than the mean of L1 and L2, respectively. The relative loss values are always of \({{{\mathscr{O}}}}(1)\) and help in achieving convergence faster as the learning rate is of \({{{\mathscr{O}}}}(1{0}^{-3})\). We have observed such an improvement in convergence in other problems, e.g. electro-convection, where we had sharp interfaces and multiscale dynamics44. Overall, all the models performed consistently well. As such, our autoencoder architecture of choice is an autoencoder with ld = 196 using a relative L2 loss function.

Taken together, these results demonstrate not only the ability of our framework to accurately provide a compact representation of the microstructure data in a low-dimensional latent space, but they also illustrate the robustness of the training of this framework.
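The loss functions compared above can be written compactly in NumPy. The sketch below is illustrative only: the exact normalization used by the authors for the relative losses is not given in this section, so dividing by the magnitude of the target field is our assumption, as are the function names and the small `eps` guard. The mixed loss follows the schedule stated in the text (L2 for the first 5,000 epochs, L1 afterwards).

```python
import numpy as np


def l1_loss(true, pred):
    """Mean absolute error."""
    return np.mean(np.abs(true - pred))


def l2_loss(true, pred):
    """Mean squared error."""
    return np.mean((true - pred) ** 2)


def relative_l1_loss(true, pred, eps=1e-12):
    """Absolute error normalized by the magnitude of the target (assumed form)."""
    return np.sum(np.abs(true - pred)) / (np.sum(np.abs(true)) + eps)


def relative_l2_loss(true, pred, eps=1e-12):
    """Squared error normalized by the squared magnitude of the target (assumed form)."""
    return np.sum((true - pred) ** 2) / (np.sum(true ** 2) + eps)


def mixed_loss(true, pred, epoch, switch_epoch=5000):
    """L2 loss for the first switch_epoch epochs, L1 loss afterwards."""
    return l2_loss(true, pred) if epoch < switch_epoch else l1_loss(true, pred)


# Synthetic stand-in data: a field in [0, 1] and a slightly perturbed prediction
rng = np.random.default_rng(0)
true = rng.uniform(0.0, 1.0, size=(16, 16))
pred = true + 0.05 * rng.standard_normal(true.shape)
scale = 10.0  # rescaling both fields changes the L2 loss but not the relative loss
```

Rescaling both fields by a constant multiplies the plain L2 loss by the square of that constant but leaves the relative losses unchanged, which is one way to see why relative losses stay at a comparable magnitude across datasets.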

Performance accuracy and forecasting ability

A comparison of the predictions from our accelerated framework with that from high-fidelity phase-field simulations for a representative case of microstructure evolution at three different time steps is shown in Fig. 2. During the initial time steps, the microstructure is rich with multiple features and evolves rapidly with respect to time. Our autoencoder–DeepONet as a surrogate model is able to successfully predict the larger features and the overall morphology of the microstructure. The point-wise error snapshots suggest that the model fails to identify the relatively smaller features in the microstructure and contains significant errors along the sharp boundaries. In other words, the spatial gradient of the phase concentration is not as sharp as that of the true microstructure obtained from high-fidelity, phase-field simulations.

Fig. 2: Predictions of microstructure evolutions.

The true (top row), predicted (middle row), and point-wise error (bottom row) for a microstructure realization evolving in time. The snapshots at time frames t = t10, t30, t99 are shown here. The network used for this simulation has ld = 196, and uses the following network architecture: Branch network -- 2 × [conv(128, (3, 3))] + [3920]; Trunk network -- 2 × [100] + [3920].

From Fig. 2, we qualitatively get the intuition that the predicted microstructures contain errors at the earlier time steps because of the missing, small-size features and at the later time steps due to smoother boundaries predicted by the model. To confirm and quantify this notion, we computed the L2 norm of the relative error at each time step, \({{{\mathscr{D}}}}(t)\), defined as:

$${{{\mathscr{D}}}}(t)=\frac{{\sum }_{x}{\sum }_{y}{\left(\phi (x,y,t)-\hat{\phi }(x,y,t;{{{\boldsymbol{\theta }}}})\right)}^{2}}{{\sum }_{x}{\sum }_{y}\phi {\left(x,y,t\right)}^{2}},t\in \left[{t}_{10},{t}_{11},\ldots ,{t}_{99}\right].$$

To analyze the accuracy of the prediction at each time step, we calculated \({{{\mathscr{D}}}}(t)\) across the samples in the training and testing datasets, and created a boxplot as shown in Fig. 3. The error is high for the initial time steps, where features span multiple length scales and evolve rapidly with time. However, the predictions improve over time when the evolution process slows down and the microstructure features coarsen. The time steps shown in Fig. 3a were used during the training of the model.
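The error metric \({{{\mathscr{D}}}}(t)\) and the per-time-step statistics behind a box plot can be sketched as follows. The array layout (samples, time steps, y, x) and the synthetic data are our assumptions for illustration; the formula itself is the ratio of summed squared errors to summed squared true values given above.

```python
import numpy as np


def relative_l2_error(phi_true, phi_pred):
    """D(t) for arrays shaped (n_samples, n_times, ny, nx).

    Sums over the two spatial axes and returns an array of shape
    (n_samples, n_times), i.e. one error value per sample and time step.
    """
    num = np.sum((phi_true - phi_pred) ** 2, axis=(-2, -1))
    den = np.sum(phi_true ** 2, axis=(-2, -1))
    return num / den


# Synthetic stand-in data: 4 samples, 3 time steps, 8 x 8 grid
rng = np.random.default_rng(0)
phi_true = rng.uniform(0.1, 1.0, size=(4, 3, 8, 8))
phi_pred = phi_true + 0.01 * rng.standard_normal(phi_true.shape)

d = relative_l2_error(phi_true, phi_pred)      # shape (4, 3)
median_per_step = np.median(d, axis=0)         # centre line of each box
q1, q3 = np.percentile(d, [25, 75], axis=0)    # lower and upper box edges
```

A perfect prediction gives \({{{\mathscr{D}}}}(t)=0\), while predicting zero everywhere gives \({{{\mathscr{D}}}}(t)=1\), which provides a convenient reference scale when reading the box plots.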

Fig. 3: L2 norm of the relative error between true and predicted microstructures at each time step.

a Box plot with respect to \({{{\mathscr{D}}}}(t)\) computed over the training time steps over the train and test datasets. Error bars are equivalent throughout the figure. b Same error metric, but in future time steps never seen during the training phase. Error bars are equivalent throughout the figure.

Next, we evaluated the capability of the model to forecast time frames t = {t90, t91, … , t99}. From Fig. 3b, the error is seen to increase gradually when the model extrapolates to unseen time instances. A closer look at the forecasting predictions offers further insights into the DeepONet predictive performance. We computed the mean of \({{{\mathscr{D}}}}(t)\) across the training and testing datasets for all the models given in Table 2. We also plotted these values in Supplementary Fig. 4 with additional details in Supplementary Note 4. We observe that the mean relative L2 error decreases when increasing the latent dimension of the autoencoder model. In other words, a model with a larger latent space predicts the evolution of the microstructure more accurately in forecasting mode. This is intuitive because a larger dimension of the latent space implies that there are more basis functions to express the encoded information about the microstructure and its evolution, and therefore the network has an improved representation capability. However, this trend seems to saturate beyond ld = 100. For the model with ld = 100 or ld = 196, the forecasting error is always less than 6%. The logarithm of the relative L2 error increases linearly over the forecasting time steps for ld = 64, 100, and 196, whereas for ld = 9 and ld = 25, the error is high and remains constant.

Robustness of the surrogate DeepONet framework: Sensitivity to noise

We evaluated the accuracy and robustness of the predictions from our surrogate model by systematically increasing the noise levels in the model input. For this analysis, we considered the best model with ld = 196 and DeepONet architecture 1 (see Table 2) with \(\sin\) activation functions. We added Gaussian white noise to our microstructure data with zero mean and standard deviations, σ = 0.5%, 1%, 2%, 3%, 4%, 5%, 10%. To evaluate the model performance, we used the relative L2 norm, \({{{\mathscr{D}}}}\), as defined in Eq. (8). The forecasting error, \({{{\mathscr{D}}}}\), is calculated across the samples present in the test dataset. Details can be found in Supplementary Note 5.
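The noise-injection step can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the text does not say what the percentages are relative to, so we assume that σ is expressed as a fraction of the unit range of ϕ (the field takes values mostly near 0 and 1), and the two-phase test field and function name are hypothetical.

```python
import numpy as np

# Noise levels from the text, written as fractions (0.5% ... 10%)
NOISE_LEVELS = [0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.10]


def add_gaussian_noise(phi, sigma, rng):
    """Add zero-mean Gaussian white noise with standard deviation sigma.

    Assumption: sigma is a fraction of the unit range of phi.
    """
    return phi + rng.normal(loc=0.0, scale=sigma, size=phi.shape)


# Hypothetical stand-in microstructure: left half phase 0, right half phase 1
rng = np.random.default_rng(0)
phi = np.zeros((128, 128))
phi[:, 64:] = 1.0

noisy = {sigma: add_gaussian_noise(phi, sigma, rng) for sigma in NOISE_LEVELS}
```

Each noisy field would then be passed through the trained encoder, DeepONet, and decoder, and the resulting predictions compared with the noise-free ground truth using \({{{\mathscr{D}}}}\).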

From Supplementary Table 1 and Supplementary Figure 5, the relative L2 norm does not increase noticeably when noise is added to the model input. In fact, the surrogate is almost invariant to noise up to 10% Gaussian white noise, as presented in Supplementary Figure 5. Previous studies45,46,47 illustrated the capability of autoencoders to denoise noisy images. The transformation to a low-dimensional latent space forces the autoencoder to retain the dominant features alone while discarding unnecessary noise. The convolutional autoencoder used in our approach does exactly that by denoising the noisy microstructure input. The encoder filters out noise and only retains the dominant energy modes of the microstructure data. The output of the convolutional encoder is almost in its pure form, free from noise and therefore it enables the DeepONet to make stable predictions. The decoder accurately reconstructs the microstructure from the predictions made by DeepONet in the latent space. This performance illustrates the robustness and efficacy of the present framework as compared to other machine-learned frameworks that use statistical functions to encode the microstructure representation40. Indeed, it has been illustrated by Herman and coworkers40 and others48 that, while statistical functions such as the microstructure auto-correlation functions are sufficient to capture the salient features of the microstructure in latent space, such representation does not uniquely map back to the true microstructure as it is an ill-posed, inverse problem. Here, our results essentially bypass such a challenge by taking advantage of the fact that autoencoders are robust to corruption in the representations they learn. The denoising nature of a trained autoencoder enables the encoder to learn a stable and consistent mapping to the latent space. This makes training of the DeepONet much more stable and results in accurate predictions of the microstructure at any desired time step.

Effect of time resolution

The high-fidelity phase-field forward numerical solver (MEMPHIS) discretizes time with a time step \(\Delta t=1\times {10}^{-4}\) (see Methods for additional details). The stability of this numerical integration scheme is ensured by strictly satisfying the Courant-Friedrichs-Lewy (CFL) condition while solving the Cahn-Hilliard equation for 50M time steps. The solver saves snapshots of the solution at every 500,000th time step, resulting in 100 microstructure time frames for each realization. We initially utilized 80 equally spaced time frames between the 10th and 90th time frames for training the surrogate model. Therefore, for the surrogate DeepONet model, each time step was 500,000 × Δt = 50.
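The bookkeeping behind these numbers can be checked with a few lines of Python. The values are taken from the text; the variable names are ours, and the split into training frames t10 to t89 and forecasting frames t90 to t99 follows the description given earlier in the Results.

```python
DT = 1e-4                  # solver time step
STEPS_PER_FRAME = 500_000  # solver steps between two saved snapshots
N_FRAMES = 100             # saved time frames per realization

frame_interval = STEPS_PER_FRAME * DT      # physical time between saved frames
total_steps = STEPS_PER_FRAME * N_FRAMES   # solver steps per realization

training_frames = list(range(10, 90))      # t10 ... t89, used for training
forecast_frames = list(range(90, 100))     # t90 ... t99, never seen in training

print(f"time between frames: {frame_interval:g}")
print(f"solver steps per realization: {total_steps:,}")
print(f"{len(training_frames)} training frames, {len(forecast_frames)} forecast frames")
```

One surrogate time step therefore stands in for half a million solver steps, which is where the potential for acceleration comes from.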

To investigate the effect of different spacing of physical time on the surrogate DeepONet model, we performed a sensitivity study using data that are spaced differently in time to train the model. Specifically, we trained DeepONet models on datasets with a time spacing of 500kΔt, 2 × 500kΔt, 5 × 500kΔt, and 10 × 500kΔt. The DeepONet predictions were then remapped to the primitive space using the pre-trained convolutional decoder to recreate the microstructure at the required time step. We plot the mean relative L2 forecasting error corresponding to the models trained on differently spaced datasets in Fig. 4. As expected, just like with any other time-integration scheme, we observe that the forecasting error increases for larger spacing in physical time. However, the computational efficiency of the DeepONet predictions remained the same for predicting consecutive time frames regardless of the time spacing used.
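One plausible way to build the differently spaced training sets is to stride through the saved frames. The sketch below assumes exactly that (the text does not state how the coarser datasets were constructed), and it shows how quickly the number of training snapshots per realization shrinks as the spacing grows; the names `TRAIN_FRAMES`, `STRIDES`, and `subsampled` are ours.

```python
TRAIN_FRAMES = list(range(10, 90))   # t10 ... t89, spacing of 500k * dt
STRIDES = [1, 2, 5, 10]              # multiples of 500k * dt studied in the text

# Assumption: coarser datasets keep every k-th saved frame
subsampled = {k: TRAIN_FRAMES[::k] for k in STRIDES}

for k, frames in subsampled.items():
    print(f"spacing {k} x 500k dt: {len(frames)} frames, "
          f"first {frames[0]}, last {frames[-1]}")
```

Under this assumption, the coarsest dataset retains only 8 of the 80 snapshots per realization, which is consistent with the larger forecasting error reported for larger spacings.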

Fig. 4: Variation of forecasting error on test data for models trained on data with different spacing in time.

Spacing in time tested are: 500kΔt, 2 × 500kΔt, 5 × 500kΔt, 10 × 500kΔt. \({{{{\mathscr{D}}}}}_{{{{\rm{test}}}}}(t)\) represents the relative L2 error computed across the samples in test dataset at different time steps.

Training strategy for learning concurrently multi-scale features

We showed in a previous work37,38 that the microstructure evolution in latent space is non-linear. Indeed, for early time steps the microstructure evolves rapidly and then later on it evolves more slowly once the phase separation dynamics have taken effect.

We noted that the DeepONet model architecture presented above was not able to resolve the small-scale microstructural features as shown in Fig. 2. In the early time steps, which represent fast dynamics, small wavelength features are hard to capture due to spectral bias of neural networks; Fig. 3 quantifies this difficulty.

To circumvent this issue, during training of the DeepONet model, we increased the weight given to earlier time steps of each realization in the dataset. By placing more emphasis on early snapshots during training, we endow DeepONet with an inductive bias to learn the fast dynamics accurately. Practically speaking, we are forcing the DeepONet model, \({{{\mathscr{G}}}}(\tilde{{{{\mathbf{\Phi }}}}})(t;{{{{\boldsymbol{\theta }}}}}_{{{{\rm{d}}}}})\), to predict earlier time steps repeatedly by creating a new training dataset with repeated \(\tilde{{{{\boldsymbol{\phi }}}}}(t)\) for each realization. Since the DeepONet model is trained to minimize the mean squared error between the true and predicted microstructures, the model is driven to give greater emphasis to microstructures developing at earlier time steps. The results from this training procedure are depicted in Fig. 5. Indeed, from the comparison of the predicted microstructure without and with an emphasis on earlier time steps in Fig. 5b and c, respectively, we observe that increasing the weight given to the earlier time steps of evolution for each realization results in a DeepONet model capable of recovering smaller, high-frequency components. This ability to accurately resolve multiple length scales is particularly important in dynamic problems such as dendrite or grain growth problems, for instance, where the simulated microstructure dynamics can be extremely sensitive to the development of multiple length scales concurrently.
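The effect of repeating early snapshots can be made concrete with a small NumPy sketch. The number of early frames and the repetition factor used by the authors are not given in this section, so `n_early=10` and `n_repeats=3` below are purely hypothetical, as are the function name and the placeholder error values. The sketch shows that averaging a squared error over the augmented dataset is the same as using a weighted mean squared error on the original dataset.

```python
import numpy as np


def repeat_early_frames(frame_ids, n_early, n_repeats):
    """Return frame ids in which the first n_early entries appear n_repeats times."""
    frame_ids = np.asarray(frame_ids)
    counts = np.ones(len(frame_ids), dtype=int)
    counts[:n_early] = n_repeats
    return np.repeat(frame_ids, counts)


frames = np.arange(10, 90)                                        # t10 ... t89
augmented = repeat_early_frames(frames, n_early=10, n_repeats=3)  # hypothetical values

# Placeholder per-frame squared errors, only to illustrate the equivalence
rng = np.random.default_rng(0)
sq_err = rng.uniform(size=frames.size)
counts = np.where(np.arange(frames.size) < 10, 3, 1)

mse_repeated = np.mean(sq_err[augmented - frames[0]])  # MSE over the augmented set
mse_weighted = np.average(sq_err, weights=counts)      # weighted MSE, original set
```

Repetition is thus a simple way to implement a time-dependent loss weighting without changing the loss function itself, at the cost of a somewhat larger training set.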

Fig. 5: Capturing small microstructural features at earlier time steps.

a True microstructure at t = t10. b Predicted microstructure without emphasizing earlier time steps during training. c Microstructure predicted by DeepONet trained on a dataset where the earlier time steps were repeated to increase the importance given to them.

Integration of DeepONet with a numerical high-fidelity phase-field solver

The results above show that a pre-trained autoencoder–DeepONet model can be used as a robust and efficient surrogate of the numerical solver when inference is requested for initial microstructure and parameters within the distributions of the training datasets (interpolation task). Our proposed framework can also be used for extrapolation tasks and be integrated into the phase-field numerical solver to accelerate the predictions for initial microstructure and parameters that are outside the aforementioned distributions (extrapolation task). To demonstrate this point, we devised a hybrid approach that integrates the autoencoder–DeepONet framework with our high-fidelity phase-field Mesoscale Multiphysics Phase Field Simulator (MEMPHIS solver). This hybrid model unites the efficiency and computational speed of the autoencoder–DeepONet framework with the accuracy of high-fidelity phase-field numerical solvers.

The hybrid framework consists of alternating between predictions from the high-fidelity phase-field simulations and those from the autoencoder–DeepONet model. The high-fidelity phase-field simulation step provides accuracy in the description of the dynamics, while the autoencoder–DeepONet model enables us to ‘leap in time’. The algorithm is presented in Algorithm 1. Here we choose to split the time evolution equally between the high-fidelity simulation and the autoencoder–DeepONet model. Each solver within this integrated scheme sequentially predicts 10 time frames, which corresponds to 10 × 500k = 5M time steps for the high-fidelity phase-field solver alone. A schematic of the approach and results are shown in Fig. 6 for the training data. Results for the test data are provided in Supplementary Fig. 6. The discontinuity along the centerlines of the predictions made by the autoencoder–DeepONet model arises from splitting each realization into four different realizations, as discussed in the ‘Microstructure-evolution dataset’ subsection.

Fig. 6: Schematic of our hybrid approach for integrating DeepONet with the numerical solver MEMPHIS to accelerate phase-field predictions.

The computational time corresponding to the autoencoder–DeepONet model and the MEMPHIS solver for one realization in the training dataset is reported in this figure. The error is shown in the third column.

Algorithm 1

Integration of DeepONet with high-fidelity phase field simulator (MEMPHIS)

Require: \({\phi }_{0}\): Initial condition, ϕ(x, y, 0)

Require: \({N}_{T}\): Number of total time steps

Require: \({n}_{t}\): Initial number of time steps to be simulated by MEMPHIS

Require: \({{{{\rm{DON}}}}}_{nt}\): Number of time steps to be leaped by DeepONet

\(n\leftarrow 0\) (Initialize)

while \(n\ne {N}_{T}\) do

\({\phi }_{{n}_{t}}\leftarrow {{{\rm{MEMPHIS}}}}({\phi }_{0},{n}_{t})\) (Solution from MEMPHIS)

\({\phi }_{{n}_{t}+{{{{\rm{DON}}}}}_{nt}}\leftarrow {{{\rm{DeepONet}}}}({\phi }_{{n}_{t}})\) (Prediction from DeepONet)

\({\phi }_{0}\leftarrow {\phi }_{{n}_{t}+{{{{\rm{DON}}}}}_{nt}}\) (Update the input for MEMPHIS)

\(n\leftarrow n+{n}_{t}+{{{{\rm{DON}}}}}_{nt}\) (Leaping n: MEMPHIS + DeepONet)

end while
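The control flow of Algorithm 1 can be expressed as a short Python function. This is a schematic sketch, not the MEMPHIS or DeepONet interface: `solver` and `surrogate` are hypothetical callables that advance a state by a given number of steps, and the stand-ins below simply count steps so that the loop can be run. The sketch uses `n < n_total` together with an explicit divisibility check, which is a slightly more defensive variant of the termination test in the algorithm.

```python
def hybrid_rollout(phi0, n_total, n_solver, n_leap, solver, surrogate):
    """Alternate a high-fidelity solver and a surrogate until n_total steps are covered.

    solver(phi, k) and surrogate(phi, k) are hypothetical callables that
    return the state advanced by k time steps.
    """
    if n_total % (n_solver + n_leap) != 0:
        raise ValueError("n_total must be a multiple of n_solver + n_leap")
    phi, n = phi0, 0
    while n < n_total:
        phi = solver(phi, n_solver)    # high-fidelity segment
        phi = surrogate(phi, n_leap)   # surrogate leap in time
        n += n_solver + n_leap         # steps covered by one full cycle
    return phi


# Stand-in "solvers": the state is just the number of steps taken so far
calls = []


def fake_solver(phi, k):
    calls.append(("solver", k))
    return phi + k


def fake_surrogate(phi, k):
    calls.append(("surrogate", k))
    return phi + k


final_state = hybrid_rollout(0, 40, 10, 10, fake_solver, fake_surrogate)
```

In a real setting, the surrogate would also need access to the recent history of frames that feeds the branch network; this detail is omitted here to keep the sketch close to Algorithm 1 as written.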

We see in Fig. 6 that forecasting 10 time frames using the high-fidelity phase-field solver MEMPHIS running on 32 CPU cores (Intel® Xeon® E5-2670) takes approximately 90 min. The subsequent 10 time frames predicted by the autoencoder–DeepONet model take only 2 s. A comparison of the computational cost between the high-fidelity phase-field solver alone and the hybrid approach is reported in Table 3. Here, we achieve a speed-up of 29%. This performance can be improved by a much greater factor with more extensive offline training on a richer dataset of operating conditions, which will lead to better generalization. For each evolution, our hybrid approach saves 135 min, without loss of accuracy as shown previously in the Results section. The choice of a specific time step splitting, for which the system is evolved using the high-fidelity phase-field framework and then by the autoencoder–DeepONet model, is arbitrary and can be considered as a hyper-parameter. For instance, one could use very short time windows within the high-fidelity phase-field solver to course-correct the physical predictions and much longer time windows when using DeepONet to accelerate the time evolution predictions. This type of time-splitting integration scheme would increase the speed-up even further. Additionally, although we showed the microstructure evolution only until t = t105, such a hybrid time integration strategy can be adopted to forecast the time evolution of the microstructure over much longer time windows, for instance as long as the input time history used to train the DeepONet model, while still keeping a good accuracy and leading to substantial savings in CPU hours.

Table 3 A comparison between computational time for high-fidelity phase-field simulations (MEMPHIS) and proposed hybrid model (Hybrid) for a single microstructure evolution realization from time frame t1 to time frame t105.


In this work, we investigated the effectiveness of a convolutional autoencoder–DeepONet approach for modeling the evolution dynamics of mesoscale microstructures. The proposed framework consists of two parts: first, learning a non-linear mapping to a latent manifold using convolutional autoencoders, and second, learning the dynamics in this latent space using DeepONet. We trained our model on high-fidelity, phase-field data generated by solving the Cahn-Hilliard equation. The results presented above show that the trained DeepONet architecture can be used robustly to replace the high-fidelity phase-field numerical solver in interpolation tasks or to speed up the numerical solver for extrapolation tasks. We showed that increasing the latent dimension used to describe the microstructure evolution and putting more emphasis on earlier time steps during the training improve the overall representation capability of the framework. Given its performance, this framework offers several advantages as compared to other machine-learned architectures used for accelerating the prediction of the phase-field-based microstructure evolution.

First, unlike existing methods37,40 that train machine-learning-based surrogate models using low-dimensional representations of microstructures based on statistical functions (e.g. auto-correlation function), our autoencoder–DeepONet approach learns a suitable low-dimensional latent space using a convolutional autoencoder. We showed that this approach bypasses any post-processing steps (e.g. a phase-recovery algorithm) necessary to reconstruct the microstructure from statistical functions40. The advantage of training DeepONet in the autoencoder latent space is two-fold. On one hand, training DeepONet in a low-dimensional space is computationally efficient. On the other hand, the presence of high-gradient regions in the microstructure data (see Fig. 7a) can make the training of the model challenging. However, the encoder transforms microstructure data to a latent space (Fig. 7b), where the gradients are lower and more gradual. In other words, the encoder learns a non-linear mapping of the microstructure data coming from an untrainable distribution in the primitive space, as shown in Fig. 7c, to a trainable distribution in the low-dimensional latent space, as shown in Fig. 7d. Such a transformation from data with high gradients to data with gradual, smoother gradients facilitates the training of the DeepONet model. Although we have trained our own autoencoder model in this work, we believe that fine-tuning any autoencoder pre-trained on existing image datasets with similarities to our microstructure images will be suitable for this task. Reusing such readily available pre-trained autoencoders can further save on the computational cost of our workflow.

Fig. 7: Representation of microstructure data and statistical insights.

(a) A 3D visualization of the function to be approximated by the surrogate model. The presence of several high-gradient regions at every time step makes it challenging for neural network models to learn the evolution dynamics of microstructures. (b) A smoother latent microstructure learned by the encoder during the autoencoder training. (c) The microstructure data, ϕ(x, y, t), is predominantly represented by 1s or 0s. (d) The encoder transforms ϕ to a latent space, \(\tilde{\phi }\), where deep neural networks can learn easily. The curves in (c) and (d) represent the smoothed density estimates of the histograms.

Second, even though the workflow presented in this work is focused on two-dimensional (2D) microstructure data, it can easily be extended to three-dimensional (3D) microstructure data. For the 3D microstructure evolution case, each realization can be represented by a sequence of 3D tensor data structures. Hence, we could use an autoencoder with 3D convolution layers in the encoder and 3D transpose convolution layers in the decoder to learn a suitable non-linear mapping to a low-dimensional latent manifold. Following the same approach, a DeepONet model can be trained to learn the dynamics in the latent space, and its predictions can then be remapped to the primitive space using the already trained decoder. As shown in a recent theoretical paper, DeepONet can tackle the curse of dimensionality in the input space, so training it in high dimensions is not a prohibitive issue49.

Third, the autoencoder and DeepONet are trained solely from data, making the proposed approach purely data-driven and independent of the boundary conditions. The boundary conditions are never explicitly provided as input data to the framework at any stage. They are implicitly supplied through the latent representation of the microstructure history fed to the branch network of DeepONet. Therefore, any information regarding a change in the boundary condition will be available in the latent microstructure history fed to the branch network, enabling the DeepONet model to predict the dynamics accordingly in the latent manifold. In this manner, the purely data-driven nature makes the proposed autoencoder–DeepONet framework agnostic to any changes in boundary conditions within the considered history. For clarity, periodic boundary conditions were imposed at all four boundaries of the computational domain while generating the data with the numerical solver MEMPHIS. We also note that DeepONet can be trained to map boundary conditions to an output field, if so desired, for a very general set of variable boundary conditions.

There are several extensions to the present framework that can be implemented in order to improve the accuracy, predictability, and acceleration performance. These improvements are related to the training of the model and extension to a multi-fidelity implementation. The first topic is related to improving the accuracy of the model with physics constraints in order to better capture non-linearities in the model evolution. The second topic is related to fusing different sources of data within our dataset, e.g., ‘low’ fidelity from simulation and ‘high’ fidelity from physical experiments of the same nature.

Regarding a physics-informed implementation, Wang et al.50, for instance, put forward a physics-informed DeepONet, where the PDE of the underlying system is added as a soft constraint to the loss function. In the present study, the training framework is purely data-driven and we are learning the dynamical system in the latent space defined by non-primitive coordinates, except for time, which is fed as an input to the trunk net. However, similar to Wang et al., some physical constraints could be injected into the current framework by using, for instance, the fact that the mass ϕ is conserved at all times.
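One way such a mass-conservation constraint could enter the loss is sketched below. This is a minimal illustration and not part of the present framework: it assumes the constraint is applied to decoded frames in the primitive space, that "mass" means the spatial average of ϕ (which the Cahn-Hilliard equation preserves under periodic boundary conditions), and the names `mass_penalty`, `total_loss`, and `weight` are hypothetical.

```python
import numpy as np

def mass_penalty(phi_pred, phi_ref):
    """Mean squared mismatch between the spatial average of each predicted
    frame and the spatial average of a reference frame.

    phi_pred: (n_frames, ny, nx) decoded predictions of one realization
    phi_ref:  (ny, nx) any ground-truth frame of the same realization
    """
    mass_pred = phi_pred.mean(axis=(1, 2))  # one spatial average per frame
    mass_ref = phi_ref.mean()               # conserved quantity (assumption)
    return float(np.mean((mass_pred - mass_ref) ** 2))

def total_loss(data_loss, phi_pred, phi_ref, weight=1.0):
    """Data-driven loss plus a soft mass-conservation term.
    `weight` is a hypothetical tuning parameter."""
    return data_loss + weight * mass_penalty(phi_pred, phi_ref)
```

The penalty vanishes when every predicted frame has the same average as the reference and grows quadratically with any drift, so it acts as a soft constraint rather than an exact one.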

Regarding the multi-fidelity implementation, the present approach can be extended to incorporate experimental data coming from similar processes. For instance, several researchers51,52,53 have recently explored diverse ways of exploiting the inherent correlations between datasets coming from different sources with different levels of fidelity to obtain optimal predictions. In this context, the data obtained from phase-field models using a numerical solver can be considered as a low-fidelity dataset, and the limited amount of experimental microstructure image data from similar processes can be treated as a high-fidelity dataset. The autoencoder–DeepONet framework proposed here can be extended to generate accurate predictions from a limited number of high-fidelity experimental microscopy images by exploiting their high correlation with the abundant low-fidelity phase-field data. Experimental data can be assimilated into the present DeepONet architecture by concatenating them with the numerical data as additional realizations, leaving the proposed workflow unchanged. Merging and taking advantage of both experimental and modeling efforts is a future direction of our research.

To summarize, we developed and applied a machine-learned framework based on neural operators and autoencoder architectures to efficiently and rapidly predict complex microstructural evolution problems. Such an architecture is not only computationally efficient and accurate, but it is also robust to noisy data. The demonstrated performance makes it an attractive alternative to other existing machine-learned strategies to accelerate the predictions of microstructure evolution. It opens up a computationally viable and efficient path forward for discovering, understanding, and predicting materials processes where evolutionary mesoscale phenomena are critical, such as in materials optimization and design problems.


Phase-field model of the spinodal decomposition of a two-phase mixture

We illustrate our accelerated phase-field workflow on the simplest case of the spinodal decomposition of a two-phase mixture. This model is highly relevant to many phase-field models. The spinodal decomposition of a two-phase mixture is described by a single order parameter, ϕ(x, t), representing the atomic fraction of solute diffusing within a matrix. The evolution of ϕ is governed by the Cahn-Hilliard equation, based on the Onsager force-flux relationship, such that

$$\frac{\partial \phi }{\partial t}=\nabla \cdot \left({M}_{\rm{c}}(\phi )\nabla \left[{\omega }_{\rm{c}}({\phi }^{3}-\phi )-{\kappa }_{\rm{c}}{\nabla }^{2}\phi \right]\right),$$

where ωc is the height of the energy barrier between the two phases, κc is the gradient energy coefficient, and Mc denotes the concentration-dependent mobility, with Mc = s(ϕ)MA + (1 − s(ϕ))MB. The function s defines a smooth interpolation to switch from phase ‘A’ to phase ‘B’. This interpolation function is defined as \(s(\phi )=\frac{1}{4}(2-\phi ){(1+\phi )}^{2}\). In the present model, both the mobility and the interfacial energy are taken to be isotropic, and ωc and κc are set to unity for simplicity. The bulk free energy is expressed as a symmetric double-well potential, with minima at ϕ = ±1.
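The interpolation function and the mobility can be checked numerically, and a single time step of the governing equation can be sketched on a periodic grid. The following is only an illustration under stated assumptions: it uses the standard chemical potential μ = ωc(ϕ³ − ϕ) − κc∇²ϕ, an explicit Euler update, and second-order finite differences, none of which is claimed to be the scheme used in MEMPHIS. All function names are hypothetical.

```python
import numpy as np

def s(phi):
    """Interpolation function s(phi) = (2 - phi)(1 + phi)^2 / 4,
    which gives s(-1) = 0 and s(1) = 1."""
    return 0.25 * (2.0 - phi) * (1.0 + phi) ** 2

def mobility(phi, m_a, m_b):
    """Concentration-dependent mobility M_c = s*M_A + (1 - s)*M_B."""
    w = s(phi)
    return w * m_a + (1.0 - w) * m_b

def laplacian(f, dx=1.0):
    """Five-point Laplacian with periodic boundaries."""
    return (np.roll(f, 1, 0) + np.roll(f, -1, 0)
            + np.roll(f, 1, 1) + np.roll(f, -1, 1) - 4.0 * f) / dx**2

def cahn_hilliard_step(phi, m_a, m_b, dt=1e-4, dx=1.0, omega=1.0, kappa=1.0):
    """One explicit Euler step of d(phi)/dt = div(M_c grad(mu)).
    Explicit Euler is an assumption made for brevity."""
    mu = omega * (phi**3 - phi) - kappa * laplacian(phi, dx)
    m = mobility(phi, m_a, m_b)
    div = np.zeros_like(phi)
    for axis in (0, 1):
        # Flux through the face between cell i and cell i+1 along `axis`
        m_face = 0.5 * (m + np.roll(m, -1, axis))
        flux = m_face * (np.roll(mu, -1, axis) - mu) / dx
        # Divergence: flux leaving through face i+1/2 minus flux entering at i-1/2
        div += (flux - np.roll(flux, 1, axis)) / dx
    return phi + dt * div
```

Because the update is written in flux form on a periodic grid, the face fluxes cancel in pairs and the spatial average of ϕ is preserved to round-off, mirroring the mass conservation of the continuous equation.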

Microstructure-evolution dataset

The phase-field model described above is implemented using Sandia’s in-house multi-physics phase-field modeling code MEMPHIS10,54. In order to generate a diverse and large set of simulation results exhibiting a rich variety of microstructure features, we independently sampled the phase fraction ϕA, such that each phase has a concentration of at least 0.15 (note that ϕB = 1 − ϕA), and the phase mobilities MA and MB of species ‘A’ and ‘B’. Phase mobilities are sampled independently to vary in the range [0.01, 100]. In total, we generated 500 triplets (ϕA, MA, MB) using Latin Hypercube Sampling. In the simple case of the spinodal decomposition, only the tuple (ϕA, MA/MB) is necessary. However, as demonstrated in other studies10,40 of systems whose microstructure evolution resembles spinodal decomposition, it is in general necessary to handle (ϕA, MA, MB) separately, since the ratio MA/MB by itself is no longer sufficient to characterize the dynamics of the microstructure evolution. Herein, we frame the present work in this broader context for generality. All the simulations were performed using a two-dimensional (2D) square domain Ω = [0, 1] × [0, 1], discretized with 512 × 512 grid points, with a dimensionless spatial discretization of unity in either direction, and a temporal discretization of Δt = 1 × 10⁻⁴. The simulation domain’s composition field is initialized using a truncated Gaussian distribution in the range [−1, 1] with μ = ϕA and σ = 0.35. The microstructure was allowed to evolve and grow for 50,000,000 time steps, saving the state of the microstructural domain every 500,000 time steps; hence, a total of 100 time frames were saved from each simulated case.
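The sampling of the 500 triplets can be sketched with a small Latin Hypercube routine. This is a minimal sketch under stated assumptions: the text does not say whether the mobilities are sampled on a linear or a logarithmic scale, so a log-uniform scaling over [0.01, 100] is assumed here, and the function names are hypothetical.

```python
import random

def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in [0, 1)^n_dims such that, in every dimension,
    each of the n_samples equal-width strata contains exactly one point."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_samples))
        rng.shuffle(strata)  # random pairing of strata across dimensions
        columns.append([(k + rng.random()) / n_samples for k in strata])
    return list(zip(*columns))

def sample_parameters(n_samples=500, seed=0):
    """Draw (phi_A, M_A, M_B) triplets from the ranges quoted in the text."""
    rng = random.Random(seed)
    triplets = []
    for u_phi, u_a, u_b in latin_hypercube(n_samples, 3, rng):
        phi_a = 0.15 + 0.70 * u_phi       # phi_A and phi_B = 1 - phi_A both >= 0.15
        m_a = 10.0 ** (-2.0 + 4.0 * u_a)  # log-uniform in [0.01, 100] (assumption)
        m_b = 10.0 ** (-2.0 + 4.0 * u_b)
        triplets.append((phi_a, m_a, m_b))
    return triplets
```

Compared with plain random sampling, the stratification guarantees that every one of the 500 intervals of each parameter range is visited exactly once, which helps cover the parameter space with a limited number of simulations.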

In order to use the data in the proposed algorithm, we down-sampled each snapshot of our 512 × 512 domain into four images of 256 × 256, and later used cubic interpolation55 to further reduce the resolution to 128 × 128. Hence, from the 500 microstructure evolution samples, we were able to generate 2000 microstructure evolution samples of 128 × 128 resolution. From this dataset, we used 1600 cases for training the DeepONet and 400 cases for testing the network accuracy. Since the compositional field is randomly distributed spatially, the microstructure has no recognizable features at the first frame, t0. The quick development of subdomains is then observed between frames t0 and t10, followed by a smooth and steady coalescence and growth of the microstructure from time frames t10 to t100. Based on this observation, we trained our proposed model starting at time frame t10, when the microstructure had reached a slow and steady development regime.
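The preprocessing of a single snapshot can be sketched as follows. Two assumptions are made and should be read as such: the "four images" are taken to be the four quadrants of the snapshot (a strided split such as `frame[i::2, j::2]` would also be consistent with the text), and 2 × 2 block averaging is used as a simple stand-in for the cubic interpolation of the original workflow. The function names are hypothetical.

```python
import numpy as np

def split_quadrants(frame):
    """Split a (512, 512) snapshot into four (256, 256) quadrants (assumption)."""
    h, w = frame.shape[0] // 2, frame.shape[1] // 2
    return np.stack([frame[:h, :w], frame[:h, w:], frame[h:, :w], frame[h:, w:]])

def downsample_2x(images):
    """Halve the resolution of a stack of images by averaging 2x2 blocks.
    This replaces the cubic interpolation of the text for brevity."""
    n, h, w = images.shape
    return images.reshape(n, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def preprocess(frame):
    """(512, 512) snapshot -> four (128, 128) images."""
    return downsample_2x(split_quadrants(frame))
```

Applied to every snapshot of the 500 simulations, this step yields 4 × 500 = 2000 evolutions at 128 × 128 resolution, matching the dataset size quoted above.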

Training the autoencoder: learning the latent microstructure representation

In this work, each microstructure evolution is represented by (NT, Nx, Ny) = (80, 128, 128), with NT representing the number of snapshots and Nx × Ny denoting the spatial resolution along the x- and y-directions, respectively. To handle the entire feature space (\({{\mathbb{R}}}^{128\times 128}\)), 16,384 distinct features are required to represent the microstructure at each time step. Consequently, the 1,600 training microstructure evolutions comprise 1,600 × 80 × 16,384 (≈2.1 billion) 32-bit floating-point values. Learning microstructure dynamics from such a high-dimensional dataset is challenging.
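The size of the training set follows directly from the dimensions above; the short calculation below reproduces the count and converts it to memory, assuming 4 bytes per 32-bit value. The variable names are illustrative only.

```python
n_train, n_t, n_x, n_y = 1600, 80, 128, 128

features_per_frame = n_x * n_y                 # 16,384 features per snapshot
n_values = n_train * n_t * features_per_frame  # 2,097,152,000 values in total
size_gib = n_values * 4 / 2**30                # 4 bytes per 32-bit float

print(features_per_frame, n_values, size_gib)
```

The training data therefore amounts to roughly 2.1 billion values, or about 7.8 GiB in single precision, which motivates learning the dynamics in a compressed latent space instead.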

To circumvent issues pertaining to the data dimensionality and to prepare the phase-field microstructure data for DeepONet training, we explored a couple of options. First, we tried Principal Component Analysis (PCA) with a linear kernel to reduce the dimensionality of the data21,56,57. The low-dimensional representation of the data obtained from PCA is a linear transformation of the high-dimensional data and discards the insignificant modes (eigen/singular) corresponding to the lower eigen/singular values (λi). However, the system considered here is non-diffusive, which is confirmed by the cumulative explained variance and the energy distribution of the system over the principal modes. Therefore, using PCA to reduce the dimensionality of the microstructure description could result in the loss of valuable information if only a conveniently low number of principal components is retained. A detailed explanation of the PCA of the microstructure dataset is presented in Supplementary Note 1. Learning a non-linear mapping from a high-dimensional to a low-dimensional latent space is one way to compress data without losing as much information as with PCA. An autoencoder does precisely this by learning a non-linear transformation to a low-dimensional latent space using an encoder. The decoder learns the mapping to retrieve the initial high-dimensional data from its latent representation.
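The cumulative explained variance used to judge how many principal modes are needed can be computed from the singular values of the centered snapshot matrix. The sketch below is a generic illustration, not the analysis of Supplementary Note 1; the function names and the 99% threshold are assumptions.

```python
import numpy as np

def cumulative_explained_variance(snapshots):
    """snapshots: (n_samples, n_features) matrix of flattened frames.
    Returns the fraction of total variance captured by the first k modes,
    for k = 1, 2, ..."""
    centered = snapshots - snapshots.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values**2  # proportional to the PCA eigenvalues
    return np.cumsum(variance) / variance.sum()

def modes_needed(snapshots, threshold=0.99):
    """Smallest number of modes whose cumulative variance reaches `threshold`."""
    cev = cumulative_explained_variance(snapshots)
    return int(np.searchsorted(cev, threshold) + 1)
```

When the curve returned by `cumulative_explained_variance` rises slowly, many modes are required to reach a given threshold, which is the situation described above in which a convenient low number of principal components would discard valuable information.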

In this study, we have used a convolutional autoencoder58 with convolutional layers in the encoder and transpose convolutional layers in the decoder as shown in Fig. 1. The encoder learns a nonlinear mapping of the high-dimensional microstructure data, ϕ(x, y, t), to a low-dimensional latent space represented by \(\tilde{{{{\boldsymbol{\phi }}}}}(t)\) and is expressed as

$${\alpha }_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{enc}}}}}}:{{{\boldsymbol{\phi }}}}({{{\boldsymbol{x}}}},{{{\boldsymbol{y}}}},t)\to \tilde{{{{\boldsymbol{\phi }}}}}(t),$$
$${\beta }_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{dec}}}}}}:\tilde{{{{\boldsymbol{\phi }}}}}(t)\to \hat{{{{\boldsymbol{\phi }}}}}({{{\boldsymbol{x}}}},{{{\boldsymbol{y}}}},t),$$

where α and β represent the mappings performed by the encoder and the decoder, respectively. In Eq. 4, the encoder takes \({{{\boldsymbol{\phi }}}}({{{\boldsymbol{x}}}},{{{\boldsymbol{y}}}},t)\in {{\mathbb{R}}}^{128\times 128}\) as input, and maps it to \(\tilde{{{{\boldsymbol{\phi }}}}}(t)\in {{\mathbb{R}}}^{{l}_{{{{\rm{d}}}}}}\), where ld is the dimension of the latent space. θenc represents the trainable parameters of the convolutional encoder. Equation 5 represents the decoder network, which takes the latent dimensional representation, \(\tilde{{{{\boldsymbol{\phi }}}}}(t)\in {{\mathbb{R}}}^{{l}_{{{{\rm{d}}}}}}\) as the input and predicts the primitive microstructure, \(\hat{{{{\boldsymbol{\phi }}}}}({{{\boldsymbol{x}}}},{{{\boldsymbol{y}}}},t)\in {{\mathbb{R}}}^{128\times 128}\), using transpose convolutional operations. The details of the autoencoder architecture are provided in Table 1.

θae represents the trainable parameters of the autoencoder. These parameters are learned by minimizing the loss function, \({{{{\mathscr{L}}}}}_{{{{\rm{ae}}}}}\), which reads

$${{{{\mathscr{L}}}}}_{{{{\rm{ae}}}}}=\mathop{\min }\limits_{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ae}}}}}=\{{{{{\boldsymbol{\theta }}}}}_{{{{\rm{enc}}}}},{{{{\boldsymbol{\theta }}}}}_{{{{\rm{dec}}}}}\}}{\Vert {{{\boldsymbol{\phi }}}}(x,y,t)-\hat{{{{\boldsymbol{\phi }}}}}(x,y,t;{{{{\boldsymbol{\theta }}}}}_{{{{\rm{ae}}}}})\Vert }_{2}^{2}.$$

In contrast to PCA, the autoencoder provides a low-dimensional representation of the microstructure by learning a non-linear transformation to a latent space with ld features. We also observe that it is easier to learn the microstructure dynamics in the latent space representation learned by the autoencoder than in the original primitive form of the microstructure in real space. This is due to the presence of several high-gradient regions in the original form of the microstructure, as shown in Fig. 7a. These high-gradient regions in the solution are due to the nature of the governing Cahn-Hilliard equation. The latent microstructure representation in Fig. 7b is smoother and does not have high-gradient regions. The latent representation of the data offers higher regularity and, therefore, we achieve faster convergence during the training of the surrogate neural network model.

Training the DeepONet: learning the microstructure dynamics in lower dimensions

Neural operators generate nonlinear mappings across infinite-dimensional function spaces on bounded domains, giving a simulation framework for multidimensional complex dynamics prediction in real time. Once properly trained, such models are discretization invariant, which means they share the same network parameters regardless of how the underlying functional data is parameterized. DeepONet, originally proposed by Lu and coworkers41, allows the mapping between infinite-dimensional functions using deep neural networks. This subsection provides a detailed description of the training of DeepONet to model the evolution of the microstructure in the latent dimension.

The unstacked DeepONet architecture is made up of two concurrent deep neural networks: one encodes the input function at fixed sensor locations (branch network), while the other represents the domain of the output function (trunk network). Time, \(t\in {{\mathbb{R}}}^{1}\), is given as input to the trunk network while \(\tilde{{{{\mathbf{\Phi }}}}}=\{\tilde{\phi }({t}_{10}),\tilde{\phi }({t}_{11}),\ldots ,\tilde{\phi }({t}_{89})\}\in {{\mathbb{R}}}^{80\times {l}_{{{{\rm{d}}}}}}\) is the input fed to the branch network. \(\tilde{{{{\mathbf{\Phi }}}}}\) represents the phase field in the latent dimension, ld, for all the 80 time steps available in the given dataset. The goal of the DeepONet is to learn the solution operator, \(\tilde{\phi }(t)\approx \hat{\tilde{\phi }}(t)={{{\mathscr{G}}}}(\tilde{{{{\mathbf{\Phi }}}}})(t)\) from the 1600 microstructure evolutions provided in the training dataset. The output of the DeepONet is a vector \(\in {{\mathbb{R}}}^{{l}_{{{{\rm{d}}}}}}\) and is expressed as \({{{\mathscr{G}}}}(\tilde{{{{\mathbf{\Phi }}}}})(t;{{{{\boldsymbol{\theta }}}}}_{{{{\rm{d}}}}})\), where \({{{{\boldsymbol{\theta }}}}}_{{{{\rm{d}}}}}=\left\{{{{{\bf{W}}}}}_{{{{\rm{d}}}}},{{{{\bf{b}}}}}_{{{{\rm{d}}}}}\right\}\) includes the trainable weights, Wd, and biases, bd, of the DeepONet model. The framework of the DeepONet allows the branch network to have a flexible architecture. To model the microstructure evolution, we have considered a fully connected neural network for the trunk network. Due to the high-dimensional nature of the branch network input, \({{\mathbb{R}}}^{80\times {l}_{{{{\rm{d}}}}}}\), a convolutional neural network is used as the branch network because it utilizes the same kernels across the time axis and enables the branch network to encode the entire history in a memory-efficient manner. Hence, the input has to be reshaped to \({{\mathbb{R}}}^{80\times \sqrt{{l}_{{{{\rm{d}}}}}}\times \sqrt{{l}_{{{{\rm{d}}}}}}}\) before feeding it to the branch network. The network architecture is presented in Fig. 1. The trainable parameters of the DeepONet, θd, are obtained by minimizing a loss function, \({{{{\mathscr{L}}}}}_{{{{\rm{d}}}}}\), defined as:

$${{{{\mathscr{L}}}}}_{{{{\rm{d}}}}}=\mathop{\min }\limits_{{\theta }_{{{{\rm{d}}}}}}{\left|\left|\tilde{{{{\mathbf{\phi }}}}}(t)-{{{\mathscr{G}}}}(\tilde{{{{\boldsymbol{\Phi }}}}})(t;{{{{\boldsymbol{\theta }}}}}_{{{{\rm{d}}}}})\right|\right|}_{2}^{2},$$

where \(\tilde{\phi }(t)\) is the ground truth for the low-dimensional phase-field representation at time t, obtained from the convolutional encoder. The trained DeepONet is used to predict \(\hat{\tilde{\phi }}(t)\in {{\mathbb{R}}}^{{l}_{{{{\rm{d}}}}}}\). The output of the DeepONet is fed into the transposed convolutional decoder to predict \(\hat{\phi }(t)\in {{\mathbb{R}}}^{128\,\times\, 128}\). The DeepONet is trained using the Adam optimizer59. The implementation has been carried out using the TensorFlow framework60. We use Xavier initialization61 to initialize the weights of all the models.
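The branch-trunk structure and the loss above can be illustrated with a small NumPy forward pass. This is a sketch under several assumptions and not the architecture of Fig. 1: the branch network is a fully connected stand-in for the convolutional branch, the activations are tanh, and the vector output in \({{\mathbb{R}}}^{{l}_{{{{\rm{d}}}}}}\) is formed by splitting the branch output into ld groups of p coefficients that are each contracted with the p trunk outputs. The names `mlp`, `init_mlp`, `deeponet`, `loss`, and the width p are hypothetical.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Xavier (Glorot) uniform weights and zero biases for a dense network."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        limit = np.sqrt(6.0 / (n_in + n_out))
        params.append((rng.uniform(-limit, limit, (n_in, n_out)), np.zeros(n_out)))
    return params

def mlp(x, params):
    """Dense network with tanh hidden layers and a linear output layer."""
    for w, b in params[:-1]:
        x = np.tanh(x @ w + b)
    w, b = params[-1]
    return x @ w + b

def deeponet(history, t, branch, trunk, l_d, p):
    """history: (80, l_d) latent history; t: scalar time.
    Returns the predicted latent vector of length l_d."""
    coeffs = mlp(history.reshape(-1), branch).reshape(l_d, p)  # branch output
    basis = mlp(np.array([t]), trunk)                          # trunk output, (p,)
    return coeffs @ basis                                      # one dot product per latent feature

def loss(history, times, targets, branch, trunk, l_d, p):
    """Sum of squared errors between latent targets and predictions."""
    preds = np.stack([deeponet(history, t, branch, trunk, l_d, p) for t in times])
    return float(np.sum((targets - preds) ** 2))
```

A typical call would build `branch = init_mlp([80 * l_d, 64, l_d * p], rng)` and `trunk = init_mlp([1, 64, p], rng)`; the prediction at any queried time t is then passed through the trained decoder to recover the 128 × 128 microstructure.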

Error metrics

The L2 norm of relative error, \({{{\mathscr{D}}}}\), is used as the evaluation metric to analyze the performance of each model considered in this study. \({{{\mathscr{D}}}}\) is defined as:

$${{{\mathscr{D}}}}=\frac{{\sum }_{n}{\sum }_{x}{\sum }_{y}{\sum }_{t}{\left({\phi }^{(n)}(x,y,t)-{\hat{\phi }}^{(n)}(x,y,t;{{{\boldsymbol{\theta }}}})\right)}^{2}}{{\sum }_{n}{\sum }_{x}{\sum }_{y}{\sum }_{t}{\phi }^{(n)}{(x,y,t)}^{2}},$$

where n corresponds to the nth sample of the given dataset.
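The metric can be evaluated directly from arrays of true and predicted microstructures. The sketch below implements the expression exactly as written above, that is, as a ratio of sums of squares without a square root; the function name and the array layout are assumptions.

```python
import numpy as np

def relative_error(phi_true, phi_pred):
    """Relative error D summed over samples, time frames, and grid points.

    phi_true, phi_pred: arrays of identical shape, e.g. (n_samples, n_t, n_y, n_x).
    """
    numerator = np.sum((phi_true - phi_pred) ** 2)
    denominator = np.sum(phi_true ** 2)
    return float(numerator / denominator)
```

A perfect prediction gives a value of 0, while predicting zero everywhere gives a value of 1, which provides a convenient baseline when comparing models.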