Functional data-driven framework for fast forecasting of electrode slurry rheology simulated by molecular dynamics

The computational simulation of the manufacturing process of lithium-ion battery composite electrodes based on mechanistic models allows capturing the influence of manufacturing parameters on electrode properties. However, ensuring that these properties match with experimental data is typically computationally expensive. In this work, we tackled this costly procedure by proposing a functional data-driven framework, aiming first to retrieve the early numerical values calculated from a molecular dynamics simulation to predict if the observable being calculated is prone to match with our range of experimental values, and in a second step, recover additional values of the ongoing simulation to predict its final result. We demonstrated this approach in the context of the calculation of electrode slurries viscosities. We report that for various electrode chemistries, the expected mechanistic simulation results can be obtained 11 times faster with respect to the complete simulations, while being accurate with a Rscore2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${R}_{\rm{score}}^{2}$$\end{document} equals to 0.96.


Introduction
In a modern world where Artificial Intelligence (AI) and Machine Mearning (ML) applications are blooming [1,2], one may think that the use of more traditional mechanistic models based on mathematical descriptions of physical processes, is becoming obsolete.However, this is not true since mechanistic models still represent nowadays unavoidable tools to support the analysis of complex systems.In contrast to empirical models (and per se ML approaches) which study the real world to elaborate a theory, mechanistic models are based on a theory to predict the real-world behavior.Models of that kind are omnipresent in numerous domains such as medicine [3], battery manufacturing [4,5], nanotechnologies [6], biology [7,8], or in environmental sciences [9].The significant progress in computational hardware achieved in the last decades boosted the emergence of a massive amount of academic and commercial software [10,11], allowing to create and solve the equations behind mechanistic models describing systems with increasing complexity [12,13].This is the case in a plethora of scales, e.g.materials properties from their electronic structure [14] or the dynamics of colliding galaxies from gravitational forces [15] and gas hydrodynamics [16].Furthermore, the capability of a mechanistic model to simulate the system behavior by precisely controlling the hypothesis of work, makes possible to perform on-demand computer experiments to evaluate the impact of different assumptions on the results.The latter makes this type of models particularly suitable to support analysis of experimental data in both fundamental and applied research activities.Despite the continuous improvement in the numerical methods used to closely match the experimental context of interest with high-fidelity data, computations are usually time and resources consuming.The overall time and mandatory resources required to recover the complete simulation process determine the computational cost of the corresponding simulation, and varies as a function of the specific model utilized.A high computational cost narrows the number of possible parameter sets for testing new conditions [17], restraining the use of mechanistic models in a high-throughput way [18,19,20,21].
However, Machine Learning (ML) techniques can be combined with mechanistic models to reduce their computational cost.ML techniques are very popular nowadays to support researchers to move away from pure trial-and-error approaches in experimental contexts, to facilitate properties calculations and to ease the parameterization of mechanistic models [22,23].For instance, Wang et.al. applied different ML algorithms to analyze molecular dynamics simulations [24], whereas Adam et.al. combined neural networks with physics-based models to improve the accuracy of lithographic process modeling [25].From an industrial perspective, Maleh et.al. studied the use of ML techniques for Internet of Things (IoT) intrusion detection in aerospace cyber-physical systems [26], and Tuncali et.al. evaluated cyber-physical systems with ML in the context of autonomous driving systems [27].In contrast to usual data analysis techniques for multivariate datasets, Functional Data Analysis (FDA) that processes time series, can also be connected with ML techniques, but for the forecasting or prognostic of time-series variables [28].Therefore, such a combination can constitute a straightforward solution for dealing with wavering data from mechanistic models as part of the computational cost reduction [29], for example, workflows dedicated to their parameters optimization.
One of these contexts is process engineering and the particular case of manufacturing process of lithium ion batteries (LIBs) [30], taken here as an example for the demonstration of our approach.This process encompasses multiple steps and parameters which are interlinked [31,32].Such steps concern the slurry preparation, its coating and drying, the calendering of the resulting electrodes, the cell assembly, the electrolyte infiltration and the formation of the solid electrolyte interphase [33,34,35,36].This complex process has been historically simulated using empirical models with parameters fitted to experimental trends or by using mechanistic models based for instance on the Continuum Fluid Dynamics (CFD) approach [37].In the recent years, ML algorithms started to emerge in this field, illustrating the powerful capabilities of these techniques to unravel interdependencies between manufacturing parameters and electrode properties [34,38,39].Furthermore, mechanistic models allowing to predict the electrode microstructure in 3D, as a function of the manufacturing parameters started also to emerge [40,41].Those 3D mechanistic models account by the explicit spatial location and trajectory of active material (AM) and carbon-binder domain (CBD) phases (Figure 1A) [42,43].As an example for the slurry and the coating/drying, these models are supported on the Coarse-Grained Molecular Dynamics (CGMD) approach accounting for attractive and repulsive forces between the considered particles [5].In contrast to such a support, the calendering accounts the Discrete Element Method (DEM) [4], whereas the Lattice Boltzmann Method is used for the electrolyte infiltration modeling.These models have a significant computational cost, but still they need to be run in order to generate high fidelity data (data that generally agrees with the experimental one) for derivation of surrogate models and optimization loops [44].It requires mechanistic models' validation by comparison with experimental observables.One of these observables is the slurry viscosity as a function of the applied shear rate.The electrode coating quality strongly depends on this rheological property.In the mechanistic model, once the slurry microstructure is predicted, non-equilibrium molecular dynamics (NEMD) simulations are performed to calculate the shear-viscosity curve (γ-η curve).This curve is calculated point by point, i.e. a shear-rate is applied through the deformation of the simulated slurry box and a viscosity value is retrieved (see details in Supporting  Information).Each calculation must to be run for a time long enough to be confident of having reached a stable η value, usually leading to overall high computational costs.
In this work, we tackled the issue of computational cost reduction of 3D-resolved mechanistic models, through the use of a functional data-driven framework for fast predictions of their simulation results.Briefly, this consists at running the first numerical steps (i.e.time frames) of the mechanistic simulation, to retrieve early results, and then bypass the full simulation process by predicting its results.More precisely, the aforementioned framework proposes first a screening step, whose main goal is to identify running simulations of interest from an experimental perspective, e.g.expected to give results in agreement with an experimental target value (viscosity for a given shear rate in the application case used here as an illustration).Second, it proposes a forecasting step to quickly predict their results, considering only the previously filtered simulations.Both steps couple two algorithms: one based on Functional Principal Component Analysis (FPCA) achieving a compression of time series in a low dimensional space (i.e.dimensionality reduction), and another one based on K-Nearest-Neighbors (KNN) performing the predictive task.We applied this framework for in silico slurry modeling based on NEMD (see details in the Supporting Information) after accumulating simulations over three different LIB electrode chemistries [45].We tracked the evolution of the calculated viscosity values (η) along the simulation process to determine the time series.Despite the illustration here for LIB electrode slurries, the proposed framework can be applied to other fields where mechanistic models generate time series and can trigger on significant computational cost diminution for faster optimization process of the mechanistic model parameters to match experimental observables.
2 Time series processing 2.1 Functional Data Analysis NEMD calculations of electrode slurry viscosities generate discretized values at each numerical step (i.e.time frame) along the calculation process.Those steps are independent from the allocated resources used, and produce a list of numerical values to constitute a curve as a function of time.Such a curve is the descriptor of one simulation, and an appropriate data treatment of this requires to fix the number of time frames for the definition of the corresponding functional space I [46].The functional form of the cure is written as opposed to a collection of functional variables associated to one simulation, written as Functional Data Analysis (FDA) properly carries out those types of variables by defining highly regular data over the space I through smoothing techniques.FDA as shown to be very relevant approach in numerous contexts, as for instance in clinical studies [47], sport performance analysis [48], or materials discovery [49,50].Within this study, we applied a Functional Principal Component Analysis (FPCA) on time series as a tool to achieve their data compression, i.e. dimensionality reduction, for further predictive tasks.In particular, FPCA balances between the deep of the compression and the resulting variability, to obtain a meaningful low dimensional representation of time series [51,52].Such a data processing is already well known within the usual Principal Component Analysis (PCA) [53], but here it is extended to the functional form of time series.
FPCA requires a smoothing technique to reconstruct a set of discrete values from time series (Eq.1), over the functional space.The latter space is characterized by a set of basis functions, that enables a basic decomposition of time series in a finite dimensional space [54].Let's consider time series like written in (1), and a set of basic functions φ = {φ 1 , φ 2 , ..., φ p } all defined on I with p in IN.X is given by where the set c = {c 1 , c 2 , ..., c p } is the vector of basis coefficients, which describes X by a finite number of values.
The most common solution for the basis decomposition is to use a B-Spline family [55].The associated methodology consists at interpolating time series with a family of polynomial functions (basic functions), which are sufficient differentiable in discrete knots.Considering I = [I 0 , I 1 ] with I 0 = t 0 < t 1 < ... < t p = I 1 , each polynomial is defined in sub-intervals of I, expressed by (t i ) (i≤p) values.The latter polynomials are positive on at the most p sub-intervals, and their degree is equal to 3 in this study.
Based on the expression in (3), FPCA can estimate M eigenfunctions (ψ i ) (i≤M ) associated to M eigenvalues (ν i ) (i≤M ) , and finally M scores (ρ i ) (i≤M ) representing the projection of time series along new axis, also called functional principal components.In particular, M is very limited compared to p, since the FPCA must keep as much variability (variance) as possible from the initial time series in a short amount functional principal components, and illustrates the concept of dimensionality reduction.Regarding the usual PCA [53,56], the covariance matrix denoting correlations between random variables is replaced by a covariance operator χ, due to the functional form of time series.It allows the assessment of the eigenvalues associated to the eigenfunctions (extension of the Hilbert-Schmidt theorem [57]) of χ, according to its approximation based on the above basis decomposition.From the linear algebra, the resulting FPCA is expressed by X following the Karhunen-Loève decomposition [58] The extension of the Singular Values Decomposition (SVD) [53] to functional forms for time series explicits a straightforward relationship between ψ m and ν m for all m, being In our study, we applied the same basis decomposition, i.e. basis family, for the smoothing of each time series (X j ) j≤N and eigenfunctions ψ j (t) = p i=1 b i,j × φ i (t) j ≤ M .As a result, (3) becomes matrix-wise with C and Φ, the matrix of basis coefficients and the vector of basis functions respectively.
Accordingly to previous expressions, the initial problem (5) can be shaped as the following product of matrices to link any couple of eigenfunction and eigenvalue [51] 1 where W is included in IR p×p , containing the pairwise scalar products of basis functions ([< φ i , φ j > 2 ]) (p) .b is the vector of coefficients from the basis decomposition of ψ.
We denote a = W 1/2 b to simplify the equation above, allowing to write Such a final equation relates an eigenproblem that is common for a SVD in an usual PCA, where the eigenvalues are associated to the matrix 1 [53,56].Therefore, we obtain a, then calculate b to find any (ψ i ) (i≤M ) , and lastly At the end of the FPCA, the resulting scores are sufficient to describe each time series in a discret manner, to considerably reduce the number of values.This benefits to further supervised ML implementations of our framework for predictive tasks.The main advantages concern the possibility to decrease the training time due to a narrowed number of possible variables combinations, but also to avoid an overfitting during the training step [59,60].The latter mentioned fitting usually happens when the feature space becomes increasingly sparse with a high number of parameters (i.e.inputs), because ML models can be very sensitive to a short set of parameters [61].Therefore, the application of dimensionality reduction for time series helps in that sense, to represent a meaningful data processing for defining inputs of ML models.

Physical relevance from an experimental perspective
In our framework, we coupled such a FPCA with a ML algorithm performing two different predictive tasks, being the two pillars of our study.The first one concerns an early identification of running simulations of physical relevance from an experimental perspective, commonly called supervised classification in the ML terminology.In other words, it means that the framework achieves a screening of mechanistic simulations in the very first numerical steps, by predicting if such a simulation will be ending with relevant results for further experimental comparisons.In our context of slurry modeling on NEMD, there is a physical relevance when the results associated to a simulation (i.e.slurry viscosity) are similar to experimental target values.At the end, the prediction provided by the framework concludes on the relevance of each running simulation to let it continue.It represents an automatized "go" or "no go"decision through the analysis of how a simulation behaves.
In terms of data processing, each simulation is labeled by a binary value representing the output of the supervised classification.This output depends on the final behavior of the simulation like Figure 2A displays.Indeed, we used the average of the numerical values resulting from the last 1000 time frames to consider the result of a simulation.In contrast to a simulation with a high result that suggests to stop its process, a low result recommends to let the simulation running, because of the agreement of the value with the experimental one.As a consequence, we defined a threshold to assess the label, distinguishing high and low results.Figure 2B displays the distribution of the average slurry viscosity, i.e. simulations result, that are available in our study.In accordance with an experimental perspectives and the same Figure 2B, we empirically set the threshold at 50 P a.s.

(B)
Time frames (Pa.s) (A) Figure 2 : (A) : Slurry mesostructure evolution and associated viscosity (η) during the non-equilibrium molecular dynamics (NEMD) simulation utilized to assess the slurry viscosity at a given shear-rate (γ) [5].The last 1000 viscosity values from a single simulation are used as a vector, to calculate an average value (µ) and a standard deviation (σ).These latter values enable to color-code the slurry viscosity evolution within a box, where the two limits are the average plus minor the standard deviation (µ +/σ).(B) : Empirical distribution of the average slurry viscosity value (η) associated to the mechanistic simulations carried out in this study.This show the heterogeneous distribution of represented values.

K-Nearest-Neighbors
This screening step (classification) within the aforementioned framework is done using a K-Neareast-Neighbors (KNN) classification [62].One basis of the algorithm is the memorization of distances between data from the features space.The latter of which is defined by the functional principal components from the FPCA, enhancing calculations of sparse distances between time series (inputs) to make it a meaningful choice for our study [63].The KNN algorithm lies on the choice of k neighbors and a distance metric to evaluate pairwise distances between input data, to then attribute a prediction (output) for a new unseen input data, based on the outputs of its k nearest neighbors [64].Moreover, a right choice of k is crucial to take into account a suitable number of neighbors when predicting the output.This affects the predictive capabilities of the KNN.In this direction, the algorithm is straightforward and can be summarized as it follows i Choose the number of neighbors k; ii Calculate the distance between data considering a specific metric; iii Get the k nearest neighbors for a new data we want to predict; iv Assign the associated output by a majority vote (classification), or by an average value (regression), for the outputs of its k nearest neighbors.
Section S2 in the Supporting Information provides a graphical interpretation of the KNN algorithm for the case of classification and regression.From the perspectives of the framework, the chosen number of time frames to determine the time series in the screening step must be low, reflecting the efficiency to predict a further physical relevance of a running simulation within the early numerical steps.In the meantime, this enables to stop quickly a running simulation if the latter is predicted as a case without agreement with experimental values.It is particularly relevant for freeing up computational resources and launching another simulation with new operating conditions.This is discussed below by comparing the predictive capabilities of the KNN classification as a function of the number of time frames for time series definition.

Fast predictions from mechanistic simulations
The second pillar of the aforesaid framework concerns the fast prediction of the results, only for mechanistic simulations that have been previously considered of physical relevance from an experimental perspective.In this context, the results represent the informations we can extract from the last numerical steps of the simulation.As Figure 2A shows, it is the average value (µ) and the standard deviation (σ) from the vector formed by the numerical values of slurry viscosities within the last 1000 time frames.These results provide not only the final average viscosity of the associated simulation, but also how the latter behaves.To achieve this forecasting step, we also couple a FPCA with a KNN algorithm, but in the context of a supervised regression for the latter algorithm with here two outputs, i.e. µ-σ and µ+σ.
Nevertheless, to still tackle the issue of high computational cost of the mechanistic simulation process, we used another number of time frames to define time series for the dimensionality reduction (FPCA).In practice, a simulation is run to be confident of having reaching a stable value to define its final result (Figure 4).This suggests that by recovering the numerical values within a large number of time frames, we may expect the KNN regression to have predictions very closed to the simulation result.In the meantime, this action does not reduce the computational cost since the simulation runs for a large number of time frames.As a consequence, the forecasting step balances between the number of time frames to use for the definition of time series, and the corresponding computational cost to fast predict the simulation result, with high predictive capabilities for the KNN.As we did for the screening step, we compared the predictive capabilities of the KNN regression for different number of time frames to evaluate the best compromise.The results are discussed below.
Figure 4 : Examples of electrode slurry viscosity calculations with two different behaviors.In contrast to the simulation color-coded in red, the blue one is considered with physical relevance from an experimental perspective.The threshold for the screening step is highlighted by the grey horizontal line at 50 P a.s.

Results and Discussion
Datasets Time series were extracted from NEMD simulations to capture the rheological behavior of LIB slurries (viscosity vs. shear-rate).Simulations were launched using LAMMPS software [10].In total, we executed 2172 simulations coming from the modeling of three different slurries: 1773 from Nickel-Manganese-Cobalt (NMC) slurries, 183 from graphite and 216 from Lithium-Iron-Phosphate (LFP).We notice that the type of chemistry does not determine an input value for the KNN algorithm, since only the corresponding rheological behavior was used for our data processing.Moreover, our code under LAMMPS was executed using the MatriCS HPC platform from Université de Picardie Jules Verne (Amiens, France) [65].In total, the number of time frames to reach stable viscosity values (η) was between 1450 and 1800, with an average number equals to 1734.In the screening step, all executed simulations complete a functional dataset to train and test our KNN classification.The training dataset contained 80 % of simulations randomly picked up from the functional dataset, and the testing dataset contained the remaining 20 %.In the forecasting step, only simulations labeled with physical relevance were used to form another functional dataset to train and test our KNN regression, whose size is equal to 1927 simulations.

Time frames selection
We chose the number of time frames to set each time series according to Figure 5, which displays the trends for the predictive capabilities, i.e. validation metrics, as a function of the number of time frames to set time series for the screening and forecasting step.The same Figure 5 reflects the best compromise between the maximization of the validation metrics and the maximization of the computational efficiency, inversely proportional to the computational cost.In our context, the higher the time to predict the simulation results, the lower the computational efficiency.More precisely, we calculated a numerical mean of the validation metric per number of time frames, by renewing a training/testing dataset 20 times.It is particularly interesting to avoid a calculation of the validation metrics biased by a one single random split of the data into the training and testing datasets.The selected validation metrics are detailed in the methods section below.score as a function of the number of time frames.The latter number define the time series needed for the FPCA within the forecasting step of the framework.(C) : Illustration of the compromise foreseen, between the selected number of time frames and the corresponding computational efficiency to calculate the simulation results.The green trend illustrates the results from (A) and (B).At the end, the best compromise is located for a number of time frames from where the validation metrics are not significantly increasing in order to narrow the computational cost.
We retained a number of time frames equals to 100 for the screening step, whereas the number for the forecasting step was equal to 150.Indeed, Figure 5A and Figure 5B capture increasing trends of validation metrics when incrementing the number of time frames, while starting to stagnate from those two chosen numbers.At the end, the corresponding validation metrics were in average equal to 0.84 and 0.90 for the Recall, F 1 score (KNN classification), and equal to 0.97 and 0.96 for the R 2 score of the two respective outputs (KNN regression).Figure 6 displays regression plots related to the chosen KNN regression, for the comparisons of predicted values as a function of their corresponding simulation results (µ +/σ).This visualization illustrates how predictions are close with real simulation results.
In terms of computational cost reduction, the framework enables to use time series defined over 100 and 150 time frames to accurately predict meaningful results from a running simulation.By considering the average number of time frames for all available simulations detailed above, we elaborated a framework efficient to determine the slurry viscosity 11 times faster.In the meantime, the same framework determines the physical relevance of a simulation here 17 times faster.Those results illustrate a drastic reduction of the time required to recover the simulation results, and also to provide quick information on how a given simulation behaves.This is essential to discard running simulations without physical relevance from an experimental perspective, and then avoid the analysis on simulation results that can not be consistent with an experimental comparison.Moreover, the fast prediction of simulation results (slurry viscosity) enables to allocate free available resources on a HPC platform for another simulation to fast study new operating conditions.In our context of slurry modeling, and especially of battery manufacturing, such advantages to fast predict simulation results, facilitate the seek of the best parameter set among extensive possibilities, to trigger further and faster optimization processes of mechanistic model parameters.

Conclusions
In this study, we proposed a functional data-driven framework for fast predictions of mechanistic simulations results, from the very first numerical iterations solving the underlying physical equations.We demonstrated this data-driven approach for the case of the calculations of the viscosities versus applied shear rate for lithium ion battery electrode slurries, with different active material chemistries.Such calculations are supported on Non-Equilibrium Molecular Dynamics and are usually computational expensive.The aforementioned framework quickly determined slurries viscosities within the forecasting step, leading to a computational cost reduction by a factor 11. Besides, such a framework enabled to predict the relevance of a mechanistic simulations from an experimental perspective, here 17 times faster within the screening step.In terms of predictive capabilities, the KNN classification reached a F 1 score equals to 0.90, whereas the KNN regression achieved a R 2 score equals to 0.96.Those validation metrics were considered as reliable from an experimental point to analyze viscosities.In particular, the results in this work presents the capabilities to treat mechanistic simulations as time-series, for the application of FDA techniques.In this way, it serves to support different supervised ML algorithms to accurately analyze the rheology of electrode slurries.In addition, the proposed framework can be transfered to any other field where mechanistic models generate time series, to figure out drastic decrease of corresponding computational costs for quicker parameters optimization processes to design better physics-based models.

Validation metrics
We evaluated the KNN classification and regression by using two different metrics on the testing datasets.The F 1 score [66] and especially the corresponding Recall score, were used to validate the KNN classification, whereas we assessed the goodness of fitting for the KNN regression with the R 2 score [67].The first mentioned metric adjusted the Precision and Recall (9) to increase the capabilities of a binary classifier to face the issue of wrong predictions.In our context, this is especially relevant with an unbalanced number between mechanistic simulations which will give a correct result and the ones which will not within the dataset.As an example, we considered to minimize the error from the KNN classification to forecast a running simulation with physical relevance, though its real output is without interest

Figure 1 :
Figure 1 : Schematic of a lithium ion battery (LIB) electrode manufacturing process by using mechanistic modeling.Our methodology is illustrated with three different electrode slurry chemistries: Nickel Manganese Cobalt Oxyde (NMC-111), Graphite and Lithium-Iron-Phosphate (LFP).

Figure 3 :
Figure 3 : Schematic representation of the functional framework with the different steps, to fast predict the mechanistic simulation results.IR N and IR M characterize the definition spaces of time series, with N and M (N ≤ M ) the number of numerical values already generated by the simulation.Such a framework illustrates how a running simulation is handled along its process within early numerical steps, to predict its results and narrow its computational cost.

Figure 5 :
Figure 5 : (A) : Evolution of the average Recall and F 1 score as a function of the number of time frames.The latter number define the time series needed for the FPCA within the screening step of the framework.The black slight bar represents the deviation around the mean value of the validation metric.(B) : Evolution of the average R 2score as a function of the number of time frames.The latter number define the time series needed for the FPCA within the forecasting step of the framework.(C) : Illustration of the compromise foreseen, between the selected number of time frames and the corresponding computational efficiency to calculate the simulation results.The green trend illustrates the results from (A) and (B).At the end, the best compromise is located for a number of time frames from where the validation metrics are not significantly increasing in order to narrow the computational cost.

Figure 6 :
Figure6: Regression plots to evaluate the predictive capabilities of the KNN regression selected in the data-driven framework, using the testing dataset.The two plots compared the predicted outcomes by the regression as function of the real outcomes from NEMD simulations.The plot of the left corresponds to the output µ − σ whereas the one of the right corresponds to the output µ + σ.For each case, the R 2 score was rounded with two significant digits.

Table 3 :
Hyperparameters tuning from the cross validation, for the KNN classification and regression.Others KNN hyperparameters are set with initial values proposed by the function.the Institut Universitaire de France for the support.F.C. acknowledges the European Union's Horizon 2020 research, an innovation program under grant agreement No. 957189 (BIG MAP).We acknowledge Daphne Boursier at LRCS for the proofreading of the article and useful comments.