Elucidating the constitutive relationship of calcium–silicate–hydrate gel using high throughput reactive molecular simulations and machine learning

Prediction of material behavior using machine learning (ML) requires consistent, accurate, and representative large datasets for training. However, such consistent and reliable experimental datasets are not always available for materials. To address this challenge, we synergistically integrate ML with high-throughput reactive molecular dynamics (MD) simulations to elucidate the constitutive relationship of calcium–silicate–hydrate (C–S–H) gel, the primary binding phase in concrete, formed via the hydration of ordinary portland cement. Specifically, a highly consistent dataset on the nine elastic constants of more than 300 compositions of C–S–H gel is developed using high-throughput reactive simulations. From a comparative analysis of various ML algorithms, including neural networks (NN) and Gaussian process (GP), we observe that NN provides excellent predictions. To interpret the predictions from the NN, we employ SHapley Additive exPlanations (SHAP), which reveal that the influence of the silicate network on all the elastic constants of C–S–H is significantly higher than that of the water and CaO content. Additionally, the water content is found to have a more prominent influence on the shear components than on the normal components along the direction of the interlayer spaces within C–S–H. This result suggests that the in-plane elastic response is controlled by water molecules, whereas the transverse response is mainly governed by the silicate network. Overall, by seamlessly integrating MD simulations with ML, this work can serve as a starting point toward the accelerated optimization of C–S–H nanostructures to design efficient cementitious binders with targeted properties.

i.e., $x^n$, $n = 2, 3, \ldots, N$. Second, the mapped higher-order polynomial predictors are used to formulate a regression problem identical to linear regression,

$$\hat{y} = \beta_0 + \sum_{j=1}^{p} \beta_j z_j$$

where $z_j$ denotes the polynomial-mapped predictors. Using the least-squares method, the coefficients ($\beta_0$ and $\beta_j$) can be estimated by minimizing the error, which is the sum of the squared differences between the true responses and the predicted responses.
Hence, the complexity of the PR models highly depends on the choice of the polynomial degree $N$. After obtaining $\beta_0$ and $\beta_j$, the unknown responses can be obtained from the new predictor vectors, as follows:

$$\hat{y}^{\mathrm{new}} = \beta_0 + \sum_{j=1}^{p} \beta_j z_j^{\mathrm{new}}$$

The linear model with polynomial-mapped features is selected by comparing the MSE with increasing polynomial order of the features. Without losing generality, the polynomial features are not limited to only the crossing terms (Supplementary Figure 1).
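For illustration, the following is a minimal sketch of this polynomial-regression workflow using scikit-learn (an assumption; the study's actual implementation is not specified here). The random arrays merely stand in for the C–S–H composition predictors and one elastic-constant response.

```python
# Minimal sketch of the polynomial regression (PR) workflow: map predictors to
# polynomial features, fit a linear model by least squares, and select the
# polynomial order by comparing train/validation MSE (order 3 in the text).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))  # placeholder composition predictors
y = 1.5 * X[:, 0] ** 2 - X[:, 1] * X[:, 2] + 0.05 * rng.normal(size=300)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_tr, y_tr)
    mse_tr = mean_squared_error(y_tr, model.predict(X_tr))
    mse_va = mean_squared_error(y_va, model.predict(X_va))
    print(f"degree {degree}: train MSE = {mse_tr:.4f}, validation MSE = {mse_va:.4f}")
```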

Supplementary Figure 2. Variation of the mean squared error (MSE) of the train and validation sets with increasing polynomial order of the features.
Beyond a polynomial order of 3, almost all the cases show a significant increase in the MSE of the validation set with increasing polynomial order. The train set also shows an irregular trend beyond a polynomial order of 3 (for the cases of C44 and C55). The overall trend in Supplementary Figure 2 (the validation errors deviate farther from the train errors as the polynomial order increases beyond 3) indicates significant overfitting beyond a polynomial order of 3 (except for the cases of C44 and C55). As such, a polynomial order of 3 is chosen here.
The prediction results are shown in Supplementary Figure 3 by directly overlaying the predictions over the train and validation sets for all nine elastic constants.

Supplementary Figure 3. Comparison of the predicted stiffness components by polynomial regression (with a degree equal to 3) and the measured values computed by molecular dynamics simulations.
While higher R² values are obtained for most of the cases, a few cases (such as C13, C55, and C23) show relatively lower R² values due to the presence of irregularities in the C–S–H structure in the form of interlayer spaces.
Nevertheless, the overall results suggest that reasonable prediction accuracies (R² > 0.7) can be achieved for all the elastic constants of C–S–H.

Support Vector Machine (SVM)
The support vector machine builds on the support vector classifier, which determines the best separating hyperplanes in a space of higher dimension than the original space of the predictors 1 . Raising the predictors to a higher-dimensional space is realized through kernel tricks applied to the predictors. Support vector regression is a convex optimization problem that yields a unique solution for a given set of predictors and responses. The support vector regression can be expressed as follows:

$$\min_{w,\, b,\, \xi,\, \xi^*} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*)$$

subject to

$$y_i - w^{T}\phi(x_i) - b \leq \epsilon + \xi_i, \qquad w^{T}\phi(x_i) + b - y_i \leq \epsilon + \xi_i^*, \qquad \xi_i,\, \xi_i^* \geq 0$$

where $\epsilon$ is the pre-defined margin size, or the maximum error tolerated by the model, $\xi$ and $\xi^*$ are the slack variables that account for the tolerance of out-of-margin data points, and $C$ constrains the overall tolerance of the out-of-margin cases when finding the SVM model. This constraint acts as a regularization term: as $C$ decreases, the penalty on out-of-margin points is relaxed and the regression result is less prone to overfitting the given data. In this study, a radial basis function (RBF) kernel is adopted.
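A minimal sketch of $\epsilon$-SVR with an RBF kernel is given below for illustration, using scikit-learn (an assumption about tooling). The feature scaling step and the hyperparameter values (C, epsilon) are illustrative choices, not the values used in the study.

```python
# Epsilon-SVR with an RBF kernel: C weights the slack penalty (smaller C gives
# stronger regularization), epsilon sets the width of the error-insensitive tube.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))  # placeholder composition predictors
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.normal(size=300)

svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.05))
svr.fit(X, y)
print("train R^2:", svr.score(X, y))
```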
Supplementary Figure 4 shows the predicted stiffness components from SVM against the measured values from MD simulations. The R² values for the different outputs are also computed. While the algorithm performs well for C11, C22, C66, and C12, some of the components (such as C13) show relatively lower R² values for the validation set.

k-Nearest Neighbor (kNN)
kNN is a simple ML technique that is used for both classification and regression problems. In kNN, the predictions are based on the entire training dataset by calculating the similarity between the input sample and each training instance. It is also a non-parametric model, as it does not make strong assumptions about the underlying mapping function; this makes the model more flexible. Here, the k-value characterizes the complexity of the model. For a very low value of k (equal to 1), the model tends to overfit the training set, which leads to high error values when tested on the validation set. On the other hand, for a high value of k, the model performs poorly on both the training and validation sets. It can be clearly seen from Supplementary Figure 6(a) that the validation error reaches its minimum, and the R² value its peak, at k equal to 4. Any further increase in k does not improve the prediction accuracy. Thus, a k-value of 4 can be taken as the optimum value for the model.
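The k-sweep described above can be sketched as follows, assuming a scikit-learn implementation and placeholder data; the optimum k = 4 reported in the text would emerge from the actual C–S–H dataset, not from this toy example.

```python
# kNN regression tuning: sweep k and track validation MSE and R^2; the k that
# minimizes the validation error is retained as the model complexity.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))  # placeholder composition predictors
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + 0.05 * rng.normal(size=300)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

for k in range(1, 11):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    mse_va = mean_squared_error(y_va, knn.predict(X_va))
    print(f"k = {k:2d}: validation MSE = {mse_va:.4f}, R^2 = {knn.score(X_va, y_va):.3f}")
```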
Supplementary Figure 7 shows the predicted responses vs. the measured values from MD simulations for all the elastic constants, along with the corresponding R² values. Overall, the k-nearest neighbor algorithm shows good prediction accuracy for C11, C22, C12, and C66, whereas the R² values are relatively lower for the other components.

Decision tree
Decision trees are a family of ML techniques that are used for both classification and regression problems [2][3][4] . While other regression methods, such as neural networks, fit a set of parameters in a mathematical formula, decision trees are "rule-based" models that aim to identify logical splits in the data 2 . In this algorithm, the input space is split into a series of partitions (also denoted as leaf nodes), and then a simple model (often simply a constant value) is used to predict the output in each leaf node 2 . The splits are selected such that a minimum value of error is obtained. A popular method to determine the splits is the classification and regression tree (CART) algorithm 5 . This model overfits the data points when the size of the tree is too large or when each leaf node contains an insufficient number of data points 2,5 .
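As a minimal sketch (again assuming scikit-learn and placeholder data), a CART regression tree with the maximum depth as the complexity control can be set up as follows; max_depth = 5 mirrors the choice discussed below.

```python
# CART regression tree: recursive binary splits of the input space, with a
# constant prediction in each leaf node; max_depth limits the tree size.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))  # placeholder composition predictors
y = np.where(X[:, 0] > 0.5, 2.0, 1.0) * X[:, 1] + 0.05 * rng.normal(size=300)

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
print("number of leaves:", tree.get_n_leaves())
print("train R^2:", tree.score(X, y))
```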
Supplementary Figure 8 shows the variation of the MSE and R² of the train and validation sets with the maximum depth of the tree. Here, the maximum depth characterizes the complexity of the model. It is observed that at a maximum depth of 5, the minimum MSE and maximum R² for the validation data are achieved. Thus, a model complexity equal to 5 is considered as the best model for this dataset. With further increase in the model complexity, the MSE of the train set decreases; however, the MSE for the validation data increases. This shows that the model experiences overfitting as the maximum depth increases beyond 5, which is also reflected in the R² plot. For some of the stiffness components, however, the prediction accuracy is comparatively lower for both the training set and the test set.
Supplementary Figure 9. Comparison of the predicted stiffness components by decision tree (with a maximum depth equal to 5) and the measured values computed by molecular dynamics simulations.

Random Forest (RF)
Random forest is an ensemble-learning method based on decision trees: it is an average of a large collection of decorrelated decision trees. Such an ensemble method can both increase the prediction accuracy and reduce overfitting problems 2 . In this model, a large number of trees are trained individually, each using only a subset of the input variables 2,3 . In each tree, a bootstrap sample of the training data is used instead of the entire training set. This procedure is known as bootstrap aggregation, or bagging 2 . The predictions of the individual trees are then averaged to obtain the prediction of the random forest ensemble. This method is similar to boosting in many aspects but can be more easily trained and tuned.
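A minimal sketch of such an ensemble, assuming scikit-learn and placeholder data, is given below; n_estimators = 9 mirrors the best model reported below, while the other settings are illustrative.

```python
# Random forest: bagged CART trees, each grown on a bootstrap sample of the
# training data and considering a random subset of predictors at each split.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))  # placeholder composition predictors
y = X[:, 0] * X[:, 1] + 0.3 * X[:, 2] + 0.05 * rng.normal(size=300)

rf = RandomForestRegressor(
    n_estimators=9,       # number of trees averaged in the ensemble
    max_features="sqrt",  # random subset of predictors at each split
    bootstrap=True,       # bagging: each tree sees a bootstrap sample
    random_state=0,
).fit(X, y)
print("train R^2:", rf.score(X, y))
```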
Supplementary Figure 10 shows the variation of the MSE and R² with the number of trees in the ensemble. To assess the accuracy of the models, Supplementary Figure 11 presents the predictions obtained from the best RF model (with the number of trees equal to 9) against the measured values computed from MD.
It is observed that the RF model performs better than the decision tree. However, it fails to accurately predict some elastic constants, such as C23.

GP with Matern kernel
Supplementary Figure 12 shows the predictions of GPR with the Matern kernel against the measured values computed by MD simulations. Here, the GPR model is trained on the train set using the Matern kernel along with a white-noise term. The model is updated until the hyperparameters converge to an optimum.
Overall, the prediction accuracies obtained using the GP with the Matern kernel are found to be similar to those obtained using the GP with the RBF kernel.
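For illustration, the following sketch sets up GP regression with a Matern kernel plus a white-noise term, assuming scikit-learn and placeholder data; the kernel hyperparameters shown are illustrative initial values, refined during fitting by maximizing the marginal likelihood.

```python
# GP regression with a Matern kernel and a white-noise term; the kernel
# hyperparameters are optimized on the training data during fit().
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))  # placeholder composition predictors
y = np.sin(4 * X[:, 0]) + X[:, 1] + 0.05 * rng.normal(size=300)

kernel = Matern(length_scale=1.0, nu=1.5) + WhiteKernel(noise_level=1e-2)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5,
                               random_state=0).fit(X, y)
print("optimized kernel:", gpr.kernel_)
print("train R^2:", gpr.score(X, y))
```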

Hyperparameter tuning for NN
In this work, a multilayer perceptron (MLP) approach is implemented, which is a class of feedforward neural network containing an input layer, one or more hidden layers, and an output layer. The MLP NN model is trained using the back-propagation algorithm. Here, two hidden layers are trained with varying numbers of neurons. Supplementary Figure 13 shows the resulting prediction accuracies for the different numbers of neurons considered.
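The hyperparameter sweep described above can be sketched as follows, assuming a scikit-learn MLP and placeholder data; the architectures and solver settings shown are illustrative assumptions, not the configuration used in the study.

```python
# MLP with two hidden layers trained by back-propagation; sweep the number of
# neurons per hidden layer and compare validation MSE.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))  # placeholder composition predictors
y = X[:, 0] ** 2 + np.sin(3 * X[:, 1]) + 0.05 * rng.normal(size=300)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

for n in (8, 16, 32, 64):  # neurons per hidden layer
    mlp = make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(n, n), max_iter=5000, random_state=0),
    ).fit(X_tr, y_tr)
    mse_va = mean_squared_error(y_va, mlp.predict(X_va))
    print(f"hidden layers ({n}, {n}): validation MSE = {mse_va:.4f}")
```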