Predicting densities and elastic moduli of SiO2-based glasses by machine learning

Chemical design of SiO2-based glasses with high elastic moduli and low weight is of great interest. However, it is difficult to find a universal expression to predict the elastic moduli according to the glass composition before synthesis since the elastic moduli are a complex function of interatomic bonds and their ordering at different length scales. Here we show that the densities and elastic moduli of SiO2-based glasses can be efficiently predicted by machine learning (ML) techniques across a complex compositional space with multiple (>10) types of additive oxides besides SiO2. Our machine learning approach relies on a training set generated by high-throughput molecular dynamic (MD) simulations, a set of elaborately constructed descriptors that bridges the empirical statistical modeling with the fundamental physics of interatomic bonding, and a statistical learning/predicting model developed by implementing least absolute shrinkage and selection operator with a gradient boost machine (GBM-LASSO). The predictions of the ML model are comprehensively compared and validated with a large amount of both simulation and experimental data. By just training with a dataset only composed of binary and ternary glass samples, our model shows very promising capabilities to predict the density and elastic moduli for k-nary SiO2-based glasses beyond the training set. As an example of its potential applications, our GBM-LASSO model was used to perform a rapid and low-cost screening of many (~105) compositions of a multicomponent glass system to construct a compositional-property database that allows for a fruitful overview on the glass density and elastic properties.


Introduction
SiO2-based glasses are a group of materials known for its diverse applications as both structural and functional materials in various industrial fields 1-3 . Density and elastic moduli are two of the most common properties of SiO2-based glasses. Particularly, discovering new glass compositions to achieve high elastic moduli and low densities is of great interests for the development of strengthened and durable SiO2-based glass materials nowadays. Finding universal expressions or correlations to efficiently predict and further optimize densities and elastic moduli of SiO2-based glasses according to the chemical composition is not very straightforward due to their non-crystalline structures.
Different from the crystalline materials, the elastic moduli of a SiO2-based glass are not only determined by the atomic bonding strength but also a complex function of many other physical properties at different length scales [4][5][6][7] , such as cation coordination, formation of atomic ring, chain, layer and polyhedral atomic clusters, and even the structural organization at mesoscopic scale, e.g. the formation of nanodomains 4 .
Moreover, the additive oxides besides SiO2 introduce cations with various valence states, which not only change the cation-oxygen bonding strengths but also modify the degree of network polymerization. As a result, elastic moduli of the glass are complex functions of the chemical compositions of the additive oxides.
Through linear or polynomial regression analyses, many efforts have been devoted previously to fit the densities and elastic moduli with either the glass composition only [8][9][10] or a single parameter related to atomistic structures, such as molar volume 11 and the correlation length of x-ray diffraction peak 5,12 . Although these regression models were demonstrated to provide valid descriptions for some certain glass systems, they may have two major shortcomings that impede their usage in the practical design of new glass compositions. Firstly, the models are usually accurate for specific glass systems.
Once the type of the additive oxides changed, the regression results may significantly be varied, or an alternative modeling method must be applied. As a result, it is difficult to extrapolate the developed models to capture the mixed effects of multiple additive oxides in the design space for industrial glass products. Secondly, for the models built on non-compositional variables, their outcomes are hard to be directly used for discovering new glass compositions, because it could be difficult to quantitatively interpret the optimization results with respect to glass chemistries. For example, elastic extremeness may occur at a certain correlation length of x-ray diffraction peak 5,12 , while it is still unknown what glass chemistries result in such correlation length. These shortcomings may originate from the fact that these models were usually built from regression algorithms based on presumed analytical formulas and a few variables that were predetermined relying on historical intuition and knowledge.
Machine learning (ML) techniques offer an alternative way to create predictive models that bridge the materials property of interest with its potential descriptors quickly and automatically [13][14][15][16] . In addition, the model created from ML does not require to rely on presumed fitting expressions or any historical intuition of material behaviors. As a result, the ML approaches can be a particularly powerful tool for modeling the property that is determined by many factors in a complex way with unclear underlying mechanisms. To date, the ML approaches have been widely used to build predictive models for a handful of materials properties and applications, including the modeling of elastic moduli of both crystalline [17][18][19] and amorphous materials [20][21][22][23][24] . Using the artificial neural networks and genetic evolution algorithms, Mauro et al. 20,24 recently showed that Young's moduli of over 250 different glass samples can be accurately regressed and predicted using glass compositions as inputs. Most recently, by using glass composition as input descriptors, Yang et al. performed extensive studies to show that Young's moduli of the CaO-Al2O3-SiO2 ternary glass system can be accurately predicted through several different ML models 21 . Additionally, in a recent work by Bishnoi et al., Young's moduli of four important ternary glass systems were comprehensively studied and well predicted based on nonparametric ML regression models 22 . All these recent works show great promise in the application of ML techniques on the chemistry design of advanced glass materials.
One could encounter several challenges to model densities and elastic moduli of SiO2based glasses under a ML-based framework. A typical one would be the availability of sufficient quantities of training data to sample the predictive space. It could be harmful for extrapolative predictions if the training data are clustered around one or several particular regions of the design space. However, unfortunately, experimental data are usually clustered due to the constraints of practical manufacturing. This situation can be overcome by employing atomistic simulations such as molecular dynamics (MD) and molecular statics (MS) simulations, which were proved to be able to accurately compute the elastic moduli of many glassy systems 5,7,25 . Particularly, the MD simulations offer a promise of being able to predict the elastic moduli for the glass compositions that have not been experimentally synthesized 20,26 . As a result, one can achieve a compositionally homogenous sampling for any glass system of interest without the need of concerning the practical manufacturing constraints. However, even though the MD simulation is an effective and efficient tool, with current and near-term computing techniques, it can only access a limited fraction of discrete compositions in a practical design space that contains several (~5) oxide-components using tens of millions of CPU hours. Therefore, from the practical view, it is expected that the developed ML model is capable of giving reliable predictions over a large and even the entire compositional space despite the training is based on a limit set of data of lowerorder systems (e.g., binary and ternary SiO2-based systems). To achieve this goal, the model cannot be purely empirical. A subtle set of descriptors should be constructed to include not only the information of glass composition but also the physical information related to the chemical characteristic of the components 26 , such as the parameters associated with atomic bond energies. In fact, several recently developed physic-based topological models have demonstrated quantitative connections between glass elasticity and the free energies associated with breaking different bond constraints between cations and anions 27,28 .
In this work, through merging ML approaches with high-throughput MD simulations, we aimed to develop a quantitatively accurate model to predict densities and elastic moduli of SiO2-based glasses according to the glass composition but across a complex compositional space. The effects of 11 types of additive oxides were investigated, namely Li2O, Na2O, K2O, CaO, SrO, Al2O3, Y2O3, La2O3, Ce2O3, Eu2O3 and Er2O3. The training set was generated using MD simulations to homogenously sample the density and elastic properties of a part of the constituent binary and ternary systems. A set of descriptors was carefully constructed from the force-field potentials used for MD simulations and elemental mole fractions to include both physical and compositional information. Sequentially, enlightened by the previous work 17 , a statistical learning/predicting model was developed by implementing the least absolute shrinkage and selection operator 29 with a gradient boost machine 30 (GBM-LASSO). As a comparison, a traditional decision tree-based model 31,32 was also employed. By validating with a large amount (>>1000) of both simulation and experimental data, the GBM-LASSO model was demonstrated to have promising prediction capabilities on both densities and elastic moduli for the SiO2-based glasses not only within the composition range of the training set but also the high-dimension compositional spaces beyond the training set. The developed ML model could be useful for rapid glass composition-property screening that allows for a fruitful estimation and overview on the density and elastic properties of the general multi-component glass systems, especially the novel composition regions.

Details of MD simulations
To establish the training set, high-throughput MD simulations followed by energy minimizations were employed to calculate the density, bulk and shear moduli over 498 different glass compositions. The compositions were from 11 binary and 20 ternaries systems, which are specified in Table 1 In the present work, the MD simulations were performed using a set of interatomic potentials developed by Du and Cormack 25,[33][34][35][36][37][38][39][40][41][42] , which are found to yield reliable predictions on the densities and elastic moduli of various SiO2-based glasses 25,[43][44][45][46] .
Another advantage of this potential set is that it covers the common oxides that include most of the industrial glass components. The potential consists of long-range where ri,j is the interatomic distance between atom i and j, qi and qj are the effective ionic charges of atom i and j, respectively, and Ai,j, Bi,j and Ci,j are the energy parameters of the Buckingham form between i and j. In this set of potential, the short-range interactions between cations are not considered since it is assumed that two cations cannot be the first-nearest neighbor ions/atoms. The values of the effective ionic charges and Buckingham parameters for each element are summarized in Table 2.
Moreover, it should be noted that, by following the method developed by Deng and All the MD simulations were performed using the LAMMPS package 48 . Coulomb interactions were evaluated by the Ewald summation method, with a cutoff of 12 Å.
The cutoff distance of the short-range interactions was chosen to be 8.0 Å. Cubic simulation boxes were constructed to consist of about 2100 atoms so that the mole fraction of each oxide species of the samples in the training set can be achieved. Initial atomic coordinates were randomly generated using the program PACKMOL 49 . The simulation protocol was initiated with relatively equilibration runs of 0.5 ns at 5000 K to remove the memory effects of the initial structure, followed by a linear cooling procedure with a nominal cooling rate of 10K/ps to 3000 K in the canonical (NVT) ensemble. Then, the system was further equilibrated for 0.5 ns at 3000K in the isothermal-isobaric ensemble (NPT with zero pressure) to allow a relaxation of the simulation box and atomic positions simultaneously. After this, a MD run with the microcanonical ensemble (NVE) was performed for another 0.5 ns to further equilibrate the system. After the equilibration at 3000K, the system was gradually cooled down to 300 K through steps of 2500 K, 2000 K, 1000 K, 300 K with a nominal cooling rate of 0.5 K/ps under NPT condition. At each step temperature, the system was equilibrated for 0.5 ns under NPT condition, and then run with an NVE ensemble for another 0.5 ns.
At 300 K, the system is equilibrated for 1 ns under NPT condition, which is then followed by a 0.5 ns NVE run. During the final 500,000 NVE steps, atomic configurations were recorded every 1000 steps, and an average of the configurations was taken every 50 records. Eventually, 10 (10 = 500,000/1000/50) atomic configurations of each glass composition were obtained and used for the further calculations of densities and elastic moduli. Recording multiple atomic configurations would allow us to avoid accidentally using a single unreasonable configuration that can lead to large errors in the following energy minimization calculations.
The elastic constants cij for a system are defined as the second derivative of the potential energy U at the corresponding local minimum (the curvature of the potential energy) with respect to small strain deformations, εi, Based on the Voigt approximation 50 , which provides the upper bound of elastic properties in terms of uniform strains, the bulk modulus (K) of the system is calculated as, Based on K and G, the Young's modulus (E) is given by, With the glassy structures collected from the MD simulations, the density and elastic moduli were computed by means of the GULP code 51 . The cutoffs for the Coulomb and short-range interactions were set to be same as the MD simulations. A Newton-Raphson energy minimization was performed at zero pressure and temperature to fully relax the output glassy structures from LAMMPS simulations. Then, the density was calculated theoretically by dividing the total system mass by the volume of the relaxed structure. In the present work, the descriptors associated with the Coulomb interactions for a given glass composition is written as, where and denote the effective ionic charges listed in Table 2 , where , 0 is the distance where the first derivative of the Buckingham form becomes zero. Therefore, for each type of the ions, , 0 is actually calculated from the values of , , , and , . In addition, since , of Li has a zero value, extra procedures were applied to obtain the value of the , 0 term for Li, which is described in detail in Section 3 in Supplementary Material. The calculated values of the , ′ term for all the elements studied in the present work are summarized in Table 2, along with their MD parameters.
Thus, the descriptors associated with the short-range interactions are eventually generated from , , , ′ and , based on the glass composition ( ) as the following, ). In addition, we include the multiplications between any two of the thirty descriptors as interaction terms to consider the non-linear relations among these descriptors. Finally, we also include the arithmetic mean of the atomic mass as an individual descriptor. As a result, overall 511 descriptors are generated for the ML models, in which there are fifteen descriptors associated with long-range Coulomb interactions, thirty descriptors generated from the MD parameters of the Buckingham term and 465 corresponding interaction terms (including self-interactions, thus 30 1 + 30 2 =465), and one descriptor representing the mean atomic mass.

Statistic models
To leverage the training data as wisely as possible, two types of statistical learning models, namely the GBM-LASSO and the M5P regression tree model 31

Regressions accuracy of training data
In the present work, the training dataset was generated by high-throughput MD simulations, which contains the densities, bulk and shear moduli (i.e., K and G) of 498 individual glass compositions in 11 binary and 20 ternary SiO2-based systems as summarized in Table S5 in Supplementary Material. 11 types of additive oxides were considered, namely Li2O, Na2O, K2O, CaO, SrO, Al2O3, Y2O3, La2O3, Ce2O3, Eu2O3 and Er2O3. The ML models were applied to learn each of the glass properties separately.
The densities from the MD-calculated training dataset are plotted in Figure 1 against the corresponding regression results from the ML models. For the sake of a clear representation, the data points are grouped into four categories, which are pure amorphous SiO2, type-I glasses that only contain alkali and alkaline earth oxides as additives, type-II glasses that contain Al2O3 and other oxides, and type-III glasses that contain rare earth and other oxides. As shown in Figure 1, the glass densities produced Here, to further evaluate the regression accuracy of the ML models, we define the relative error as,

(X=density, K or G)
where is the density or elastic modulus calculated from MD simulation and is the prediction from the GBM-LASSO or M5P model. As shown in Table 3, for both K and G, over 60% of the predictions from both ML models have a relative error of less than 5%, and over 90% predictions are within a relative error of less than 10%, indicating that excellent regression accuracy is achieved. Additionally, we find that the LASSO method has indeed significantly shrunk the size of the descriptor set. Among the 511 input descriptors, only 119, 127 and 87 descriptors are found to have non-zero regression coefficients when the ML models predict the glass density, K and G, respectively. It is also found that many of these descriptors have been multiply used for the LASSO regressions at different GBM iterative steps, indicating they are indeed important and useful to describe these glass properties.

Prediction capability
Since shown as parity plots in Figure 4. In addition, the prediction errors are analyzed and summarized in Table 4 in the same way as the error analysis of the training process (Table 3). On the one hand, it is found that the M5P model seems to yield large uncertainties when extrapolating. As shown in Table 4, the RMSEs of the predictions from the M5P model with respect to MD validations are 0.1774g·cm -3 , 5.24 and 2.27 GPa for density, K and G, respectively, which are much larger compared to the RMSEs of the learning results listed in Table 3 (0.0325g·cm -3 , 2.59 and 0.97 GPa for density, K and G). In addition, as shown in Figures 4a-4c, the data points in the parity plots of the extrapolative predictions are more scattered compared to the results of the training process (Figure 1c, Figure 2c and Figure 3c). Particularly, as marked out in Figure 4b and 4c, there are several predictions for the bulk and shear moduli that largely deviated from the MD results. Their relative errors are found to be over 20%. Moreover, it is worth to note that the M5P model is also trained by further decreasing the number of descriptors, which only resulted in a further increase in the training RMSEs but no significant improvements on the prediction RMSEs.
On the other hand, the developed GBM-LASSO model shows very promising prediction capabilities for the multicomponent glass systems beyond the training set.
As shown in Figures 4d-4f, the density, K and G predicted from the GBM-LASSO model are in very good agreement with the MD results. Nearly 85% of the predictions for K and over 90% for G have relative errors less than 10%. Moreover, as shown in Table 4, the RMSEs of the predictions from the GBM-LASSO model with respect to MD validations are 0.0536 g·cm -3 , 3.69 and 1.34 GPa for density, K and G, respectively, agreeing well to the training uncertainties of the model listed in Table 3.
The results suggest that, after training with a small set of data for only binary and  Table 3. As shown in Figures 5a and 5b, the non-linear effects of B2O3 on the bulk and shear moduli are accurately described for the xB2O3-(100-x)SiO2 and xB2O3-(30-x)Na2O-70SiO2 glasses after training. Moreover, the newly trained model can then be expanded to predict for the multicomponent glasses that contain B2O3 and ZrO2. As shown in Figure 5c, the ML predictions for several B2O3-containing compositions, which are not in the training set, are well confirmed by MD validations. Similar results are also observed for the ZrO2-containing glasses as shown in Figure S4. These results suggest that the developed GBM-LASSO has great potentials to be further expanded to cover more types of additive oxides in the future.
To achieve such expansions, we only need a few of MD simulations to generate the binary and ternary data containing new types of oxides for the training set. In these cases, the properties of interests (e.g., density and elastic moduli) are reasonably continuous and smooth to the descriptors (e.g., compositions), but the training set is relatively small and established from the studies of sparse regions.

Comparison between ML predictions and experimental measurements
To further evaluate the model reliability, the predictions of the present GBM-LASSO Besides the general agreement between the ML predictions and experimental data, as shown in Figure 6, it is noted that there are still scattered ML predictions that are largely deviated from the experimental values. After a careful analysis, we found that many of these prediction outliers should result from the inconsistency between the experimental data as they were gathered from different sources. In other words, the predictions of the  Table S4). In addition, we acknowledge that, for some of the prediction outliers in Figure 6,

Rapid screening of glass density and elastic moduli
The GBM-LASSO model developed in the present work is able to predict the density The prediction results are visualized in Figure 8 as a 2D histogram plot with respect to density and Young's modulus, E, which is calculated from predicted K and G based on . From a practical point of view, one would expect a structural glass to have Young's modulus as high as possible, and meanwhile keep a relatively low density.
From Figure 8 we can know that most of the glasses in the Na2O-CaO-Al2O3-Y2O3-SiO2 system have Young's moduli around 83 GPa and densities around 2.6 g·cm -3 .
From the screening, it is also found that low Young's moduli generally occur for the glasses with high Na2O contents, while the large additions of Al2O3 and Y2O3 result in a significant enhancement on Young's moduli, which is consistent with the previous experimental observation 66 . As marked by the red-dashed-line circle in Figure 8, one can achieve a series of glasses with Young's moduli higher than 100 GPa and densities ranging from 2.5 ~ 3.1 g·cm -3 by optimizing the contents of the additive oxides. In addition, from the screening results, one can also know that it is probably difficult to prepare glasses with densities lower than 2.4 g·cm -3 but Young's moduli larger than 80 GPa in this system. All in all, using the present developed GBM-LASSO model, a compositional-property database for any glass systems of interest can be rapidly generated as long as the corresponding force-field potentials are available and accurate enough to describe the structural and elastic properties. These databases allow the designers to have a fruitful overview on the density and elastic properties to enlighten their own design before experimental syntheses.

Conclusion
In this work, we demonstrated a novel machine-learning framework to efficiently learn The present work is focused entirely on the modeling of glass density and elastic moduli; however, our ML framework could also be advantageous for the study of other glass physical properties and structural features. Our future studies will be a ML modeling on a few of fundamental glass structural properties, such as bridge/non-bridge oxygen ratio and angle distribution, ring size distributions of the network formers and average coordination number and bond length of cations, which are well-known to be essential to understand many of the physical and mechanical behaves of the SiO2-based glasses.
With the present work and more future works, a composition-structure-property database that sits nicely in the "Materials Genome Initiative" landscape 26,77-80 is desired to be developed via ML techniques and serve as powerful tools for the practical design of new glasses in the future. More generally, the methods of descriptor construction and the ML framework introduced in the present work could also be advantageous for many other materials science problems, where the datasets are of modest size and extrapolative predictions in high-dimensional space are required from the learning based on the low-dimensional sparse regions.  Table 2 Effective ionic charge and Buckingham potential parameters used for MD simulations 25,[33][34][35][36][37][38][39][40][41][42] . Here, , , , and , are the short-range interaction parameters between an ion element i and oxygen anion. The short-range interactions between the cation elements are ignored in the present set of MD potentials.    The data points are grouped into four categories based on their glass chemistry, which are pure amorphous SiO2, type-I glasses that only contain alkane and alkane earth oxides as additives, type-II glasses that contain Al2O3 and other oxides, and type-III glasses that contain rare earth and other oxides.    (Table S6). The data points within the black dot-dashed region have relative errors less than 10%.  Table S7) that are not included in the training dataset.

Figure 7
Bulk modulus of the Y2O3-SiO2 binary glasses calculated from the classical MD and AIMD simulations and predicted from the GBM-LASSO and MM models 75 . The error bar of the AIMD results are generated from the results calculated under different applied strains.

Figure 8
A 2D histogram to visualize the distributions of the density and Young's modulus (E) of the glasses in the Na2O-CaO-Al2O3-Y2O3-SiO2 system. The histogram is generated from 82,251 compositions, where the GMB-LASSO model is employed to predict the density and Young's modulus. The content of SiO2 is constrained to be no less than 65 mol%. The hexagonal unit with a hotter color means that there are more glass compositions having density and E within the coverage area of the unit.