AtomSets -- A Hierarchical Transfer Learning Framework for Small and Large Materials Datasets

Predicting materials properties from composition or structure is of great interest to the materials science community. Deep learning has recently garnered considerable interest for materials predictive tasks, achieving low model errors when dealing with large materials data. However, deep learning models suffer in the small-data regime that is common in materials science. Here we leverage the transfer learning concept and the graph network deep learning framework to develop AtomSets, a machine learning framework that delivers consistently high model accuracy on both small and large materials datasets. The AtomSets models can work with both compositional and structural materials data. By combining with transfer-learned features from graph networks, they achieve state-of-the-art accuracy from small compositional datasets (<400 data points) to large structural datasets (>130,000 data points). The AtomSets models show much lower errors than the state-of-the-art graph network models in the small-data limit and than the classical machine learning models in the large-data limit. They also transfer better in a simulated materials discovery process where the targeted materials have property values outside the training data range. The models require minimal domain knowledge inputs and are free from feature engineering. The presented AtomSets model framework opens new routes for machine learning-assisted materials design and discovery.


Introduction
Machine learning (ML) has garnered substantial interest as an effective method for developing surrogate models for materials property predictions in recent years. 1,2 However, a critical bottleneck is that materials datasets are often small and inhomogeneous, making it challenging to train reliable models. While large density functional theory (DFT) databases such as the Materials Project, 3 the Open Quantum Materials Database (OQMD) 4 and AFLOWLIB 5 contain ∼O(10^6) relaxed structures and computed energies, data on other computed properties such as band gaps, elastic constants, dielectric constants, etc., tend to be several times or even orders of magnitude fewer. 2 In general, deep learning models based on neural networks require much more data to train, resulting in lower performance on small datasets relative to non-deep-learning models. For example, Dunn et al. 6 found that while deep learning models such as the MatErials Graph Network (MEGNet) 7 and the Crystal Graph Convolutional Neural Network (CGCNN) 8 achieve state-of-the-art performance for datasets with > O(10^4) data points, ensembles of non-deep-learning models (using AutoMatminer) outperform these deep learning models when the dataset size is < O(10^4), and especially when it is < O(10^3).
Several approaches have been explored to address the data bottleneck. The most popular approach is transfer learning (TL), wherein the weights from models trained on a property with a large data size are "transferred" to a model on a smaller data size. Most TL studies were performed on the same property. For example, Hutchinson et al. 9 developed three TL approaches that reduced the model errors in predicting experimental band gaps by including DFT band gaps. Similarly, Jha et al. 10 trained models on the formation energies in the large OQMD database and demonstrated that transferring the model weights from OQMD can improve the models on the small DFT-computed and even experimental formation energy data. TL has also been demonstrated between different properties in some cases. For example, the present authors 7 found that transferring the weights from large-data-size formation energy MEGNet models to smaller-data-size band gap and elastic moduli models improved the convergence rate and accuracy. Another approach uses multi-fidelity models, where datasets of multiple fidelities (e.g., band gaps computed with different functionals or measured experimentally) are used to improve prediction performance on the more valuable, high-fidelity properties. For example, two-fidelity co-kriging methods have demonstrated success in improving the predictions of the Heyd-Scuseria-Ernzerhof (HSE) 11 band gaps of perovskites, 12 defect energies in hafnia 13 and DFT bulk moduli. 14 In a recently published work, the present authors also developed multi-fidelity MEGNet models that utilize band gap data from four DFT functionals (Perdew-Burke-Ernzerhof 15 or PBE, Gritsenko-van Leeuwen-van Lenthe-Baerends with solid correction 16,17 or GLLB-SC, strongly constrained and appropriately normed 18 or SCAN, and HSE 11) and experimental measurements to significantly improve the prediction of experimental band gaps. 19 In this work, we develop "AtomSets", a hierarchical framework for TL using MEGNet models that can achieve uniformly excellent performance across diverse datasets with different sizes. The AtomSets framework unifies compositional and structural features under one umbrella. We show, for the first time, TL from structural models to compositional models.
Using 13 MatBench datasets, 6 we show that the AtomSets models can achieve excellent performance even when the inputs are compositional and the data size is small (∼ 300), while retaining MEGNet's state-of-the-art performance on properties with large data sizes. Furthermore, the model construction requires minimal domain knowledge and no feature engineering.

MatErials Graph Network
The MEGNet formalism has been described extensively in previous works 7,19 and interested readers are referred to those publications for details. Briefly, the MEGNet framework featurizes a material into a graph G = (V, E, u), where v_i ∈ V are the atom or node features, e_k ∈ E are the bond or edge features, and u are the state features. In each graph convolution (GC) layer (Figure 1a), the feature matrices/vectors are updated by sequentially updating the bond, atom and state features:

$e_k^{(i+1)} = \phi_e(v_{s_k}^{(i)} \oplus v_{r_k}^{(i)} \oplus e_k^{(i)} \oplus u^{(i)})$

$v_j^{(i+1)} = \phi_v(\bar{e}_j^{(i+1)} \oplus v_j^{(i)} \oplus u^{(i)})$

$u^{(i+1)} = \phi_u(\bar{e}^{(i+1)} \oplus \bar{v}^{(i+1)} \oplus u^{(i)})$

where i is an index indicating the layer of the GC, s_k and r_k are the indices of the two atoms forming bond k, ⊕ denotes concatenation, $\phi_e$, $\phi_v$ and $\phi_u$ are learnable update functions, $\bar{e}_j^{(i+1)}$ is the average of the updated features of the bonds connected to atom j, and $\bar{e}^{(i+1)}$ and $\bar{v}^{(i+1)}$ are the averages of all updated bond and atom features, respectively. In the initial structural graph (S), the atom attributes are simply the atomic numbers of the elements, embedded into a vector space via an atom embedding (AE) layer, as shown in Figure 1b. The bonds are constructed by considering atom pairs within a certain cutoff radius R_c. With each GC layer, information is exchanged between atoms, bonds and the state. As more GC layers are stacked (e.g., GC 2 and GC 3 in Figure 1b), information on each atom can be propagated to farther distances.

Figure 1: Graph networks and AtomSets schematics. a, The graph convolution (GC) takes an input graph with labeled atom (V), bond (E) and state (u) attributes and outputs a new computed graph with updated attributes. b, The graph network model architecture. The input to the model is the structure graph (S) with atomic numbers as the atom attributes. The graph is passed to an atom embedding (AE) layer, followed by three GC layers. After the GC layers, the graph is read out to a structure-wise vector f, and f is further passed to multi-layer perceptron (MLP) models. Within the model, each layer output is captured for later use. c, The AtomSets model takes a site-wise/element-wise feature matrix and passes it to MLP layers. After the MLP, a readout function is applied to derive a structure-wise/formula-wise vector, followed by final MLP layers.
In this work, a MEGNet model with three GC layers was first trained on the formation energies of more than 130,000 Materials Project crystals (database version of Jun 1, 2019), henceforth referred to as the "parent" model. The training procedures and hyperparameter settings of the MEGNet models are similar to those in previous work. 7
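As a hedged illustration of how the transferred atom features might be pulled out of such a pretrained model with Keras, consider the sketch below. The model name "Eform_MP_2019" is one of the pretrained formation-energy models shipped with the megnet package; the layer names "embedding_1" and "meg_net_layer_1" are placeholders and should be replaced after inspecting backbone.summary() for the installed version.

```python
# Hedged sketch: extracting intermediate atom features from a pretrained MEGNet model.
from megnet.utils.models import load_model
from tensorflow.keras.models import Model

parent = load_model("Eform_MP_2019")   # MEGNet model trained on ~130,000 MP formation energies
backbone = parent.model                # underlying Keras model wrapped by MEGNetModel

# V0: atom features right after the elemental embedding (composition-only information).
v0_extractor = Model(backbone.inputs, backbone.get_layer("embedding_1").output)

# V1: atom features after the first graph convolution block. A MEGNet GC layer outputs
# the updated [atom, bond, state] feature list, so the atom tensor is the first element.
v1_extractor = Model(backbone.inputs, backbone.get_layer("meg_net_layer_1").output[0])
```

The extracted feature matrices can then be cached and fed directly into the AtomSets and MLP models described below.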

AtomSets Framework
In our proposed AtomSets framework, the output atom features after the atom embedding (V_0) and after each GC layer (V_1, V_2 and V_3) are extracted from the parent model and transferred to develop models for other properties. Bond features are not considered in TL since the number of bonds depends on the graph construction settings and parameters, such as the cutoff radius. As shown in Figure 1c, an AtomSets model takes the atom-wise feature matrix V_i of shape N_a × N_f as input to an MLP model, where N_a is the number of atoms and N_f is the feature dimension. These features can either be compositional, e.g., elemental properties, or structural, e.g., local environment descriptors. Afterwards, the output feature matrix is read out to a vector, compressing the atom number dimension.
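Conceptually, an AtomSets model is just an atom-wise MLP, a permutation-invariant readout over the atom dimension, and a final MLP. The following is a minimal Keras sketch of such a model; the layer widths, activations and the weighted-average readout are illustrative assumptions rather than the hyperparameters used in this work.

```python
# Minimal AtomSets-style model sketch: atom-wise MLP -> weighted-average readout -> MLP.
import tensorflow as tf
from tensorflow.keras import layers, Model

n_features = 32  # dimension of the input atom features (e.g., transferred V0 or V1)

atom_feats = layers.Input(shape=(None, n_features))   # (batch, N_a, N_f); N_a may vary
atom_weights = layers.Input(shape=(None, 1))           # atom/site fractions w_i

h = layers.Dense(64, activation="softplus")(atom_feats)    # atom-wise MLP
h = layers.Dense(64, activation="softplus")(h)

def weighted_average(inputs):
    feats, w = inputs
    return tf.reduce_sum(feats * w, axis=1) / (tf.reduce_sum(w, axis=1) + 1e-8)

readout = layers.Lambda(weighted_average)([h, atom_weights])   # structure-wise vector

out = layers.Dense(32, activation="softplus")(readout)         # final MLP
out = layers.Dense(1)(out)                                      # regression target

atomsets = Model(inputs=[atom_feats, atom_weights], outputs=out)
atomsets.compile(optimizer="adam", loss="mae")
```

When the input features are purely compositional (one row per element, weighted by its atomic fraction), the same architecture doubles as a composition-only model.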
The purpose of the readout function is to reduce feature matrices with different numbers of atoms to structure-wise vectors subject to permutational invariance. Simple functions that compute statistics along the atom number dimension can be used as readout functions. In this work, we tested two types of readout functions. The linear mean readout function computes the weighted average of the feature vectors,

$\bar{x} = \frac{\sum_i w_i x_i}{\sum_i w_i}$

where x_i is the feature row vector for atom i and w_i is the corresponding weight. The weights are the atom fractions on each site, e.g., w_Fe = 0.01 and w_Ni = 0.99 in Fe_0.01Ni_0.99.
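As a concrete illustration, a plain NumPy version of this weighted mean readout could look like the sketch below; the 16-dimensional random vectors stand in for learned element features and are made up for illustration.

```python
# Weighted linear-mean readout over the atom dimension (permutation invariant).
import numpy as np

def weighted_mean_readout(x, w):
    """Collapse an (N_a, N_f) feature matrix into one vector using atom fractions w_i."""
    w = np.asarray(w, dtype=float).reshape(-1, 1)
    return (w * np.asarray(x)).sum(axis=0) / w.sum()

x_fe, x_ni = np.random.rand(16), np.random.rand(16)                      # stand-in element features
readout = weighted_mean_readout(np.stack([x_fe, x_ni]), [0.01, 0.99])    # Fe0.01Ni0.99 example
```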
We also tested a weight-modified, attention-based set2set 20 readout function. We start with memory vectors $m_i = x_i W + b$ and initialize $q_0^* = 0$, where W and b are learnable weights and biases, respectively. At step t, the updates are calculated using long short-term memory (LSTM) and attention mechanisms as follows:

$q_t = \mathrm{LSTM}(q_{t-1}^*)$

$e_{i,t} = m_i \cdot q_t$

$a_{i,t} = \frac{w_i \exp(e_{i,t})}{\sum_j w_j \exp(e_{j,t})}$

$r_t = \sum_i a_{i,t} m_i$

$q_t^* = q_t \oplus r_t$

where the attention coefficients $a_{i,t}$ are modified by the atom weights $w_i$. A total of three steps are used in the weighted-set2set readout function, and the final $q_t^*$ serves as the structure-wise readout vector.
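Below is a hedged sketch of this weight-modified set2set readout as a custom Keras layer. It assumes the atom-weight scaling enters through the attention normalization as written above; the hidden dimension and initialization details are illustrative.

```python
# Hedged sketch of a weight-modified set2set readout layer.
import tensorflow as tf

class WeightedSet2Set(tf.keras.layers.Layer):
    def __init__(self, units=16, steps=3, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.steps = steps
        self.lstm = tf.keras.layers.LSTMCell(units)
        self.memory = tf.keras.layers.Dense(units)   # m_i = x_i W + b

    def call(self, x, w):
        # x: (batch, N_a, N_f) atom feature matrix; w: (batch, N_a, 1) atom weights
        m = self.memory(x)                                     # memory vectors m_i
        batch = tf.shape(x)[0]
        q_star = tf.zeros([batch, 2 * self.units])             # q*_0 = 0
        state = [tf.zeros([batch, self.units]), tf.zeros([batch, self.units])]
        for _ in range(self.steps):                            # three steps in this work
            q, state = self.lstm(q_star, state)                # q_t = LSTM(q*_{t-1})
            e = tf.reduce_sum(m * q[:, None, :], axis=-1, keepdims=True)  # e_{i,t} = m_i . q_t
            a = w * tf.exp(e)                                   # weight-modified attention
            a = a / (tf.reduce_sum(a, axis=1, keepdims=True) + 1e-8)
            r = tf.reduce_sum(a * m, axis=1)                    # r_t = sum_i a_{i,t} m_i
            q_star = tf.concat([q, r], axis=-1)                 # q*_t = q_t (+) r_t
        return q_star                                           # structure-wise readout vector
```

In a full AtomSets model, this readout would replace the weighted-average readout before the final MLP layers.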
The readout vector can then be used to predict properties with the help of MLP or other models, as shown in Figure 1c. The models investigated in this work are summarized in Table 1.

Data and Model Training
The 13 materials datasets were obtained from the matbench repository. 6 A summary is provided in Table S1. The model with the lowest validation error was chosen as the "best" one. Each model was fitted five times using different random splits, and the average and standard deviation of the metric on the test set were reported.

Table 1: Models investigated in this work, categorized by model type, i.e., compositional (C) or structural (S), and by whether they utilize transfer learning (TL). In our definition, S-type models contain compositional information as a superset. It should be noted that the MLP-u_i, MLP-f and MLP-v_r models are classified as S-type because u_i, f and v_r implicitly incorporate structural information due to information passing in the graph convolution layers.
Model name | Type | TL | Description
AtomSets | C | No | Compositional models directly trained from data
AtomSets-V_0 | C | Yes | Compositional models transferring the learned V_0 from the parent formation energy model
... | C | Yes | Compositional MLP models using statistics calculated on V_0 from the parent formation energy model as inputs
AtomSets-V_i (i = 1, 2, 3) | S | Yes | Structural models transferring the learned V_i from the parent formation energy model
MLP-u_i (i = 1, 2, 3), MLP-f and MLP-v_r | S | Yes | MLP models using the learned u_i, f or v_r from the parent formation energy model
MEGNet | S | No | Graph network models trained directly on each dataset without transfer learning

A grid search was performed on the hyperparameters of the AtomSets and MLP models. The parameter candidates are provided in the Supplementary Information. During the screening process, a 5-fold random shuffle split was applied to the dataset, and the parameter set with the lowest average validation error was chosen. The matbench steels (compositional) and matbench phonons (structural) datasets were first used to perform an initial screening for relatively good parameter sets. Starting from these parameter sets, a further grid search over all datasets was then performed for the most generalizable AtomSets-V_0 (compositional) and AtomSets-V_1 (structural) models.
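The screening loop described above is straightforward to express with scikit-learn's ShuffleSplit. The sketch below is illustrative only: `build_atomsets` is a hypothetical factory that returns a compiled model for a given parameter set, and the candidate grid is made up.

```python
# Hedged sketch of the 5-fold random-shuffle hyperparameter screening.
import itertools
import numpy as np
from sklearn.model_selection import ShuffleSplit

param_grid = {"n_neurons": [32, 64, 128], "n_hidden_layers": [2, 3]}
candidates = [dict(zip(param_grid, values))
              for values in itertools.product(*param_grid.values())]

def screen_hyperparameters(X, y, build_atomsets, n_splits=5):
    splitter = ShuffleSplit(n_splits=n_splits, test_size=0.1, random_state=42)
    best_params, best_mae = None, np.inf
    for params in candidates:
        fold_maes = []
        for train_idx, val_idx in splitter.split(X):
            model = build_atomsets(**params)                 # hypothetical model factory
            model.fit(X[train_idx], y[train_idx], epochs=200, verbose=0)
            pred = np.ravel(model.predict(X[val_idx]))
            fold_maes.append(np.mean(np.abs(pred - y[val_idx])))
        if np.mean(fold_maes) < best_mae:                    # keep the lowest mean validation MAE
            best_params, best_mae = params, float(np.mean(fold_maes))
    return best_params, best_mae
```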

Model Accuracies
The mean absolute error (MAE) for the regression tasks and the area under the curve (AUC) for the classification tasks are shown in Table 2 for the various models and datasets.
In addition, hyperparameter optimization was carried out on the AtomSets-V_0 and V_1 models (see the Supplementary Information), but did not seem to have a significant effect on model performance. Here, we will focus our discussion on the models without further hyperparameter optimization.
To frame our analysis, we first recapitulate that a key finding of Dunn et al. 6 is that MEGNet models tend to outperform other models when the data size is large (>10,000 data points) but underperform for small data sizes. This can be seen in the last two columns of Table 2. The errors of the compositional AtomSets-V_0 models and the structural AtomSets-V_1 models for the JDFT-2D exfoliation energy, the MP phonon DOS peak, and the refractive index datasets are within the standard deviations of the corresponding AutoMatminer 6 results.

For example, two polymorphs of the same composition have DFT-computed moduli of 96 GPa and 490 GPa, respectively, while the compositional AtomSets-V_0 model predicts 177 GPa for both. In contrast, the perovskites and MP formation energy datasets require structural models to achieve accurate results. This observation is consistent with a recent study by Bartel et al. 32 Comparing the AtomSets models with various V's, the results show that the features extracted from the earlier GC stages, e.g., V_0 and V_1, are more generalizable and yield higher accuracy across all tasks than those produced by the later GC layers. The structure-wise state vectors u_i (i = 1, 2, 3) and the readout atom feature vector v_r are relatively poor features, as shown by the large errors of the corresponding models across all tasks. However, the final structure-wise readout vector f, combined with MLP models, offers excellent accuracy in the MP metallicity and formation energy tasks.

Model Convergence
A convergence study of the best models, namely two compositional models (AtomSets and AtomSets-V_0) and two structural models (AtomSets-V_1 and MLP-f), was performed with respect to data size by training on different fractions of the maximum available data.
Comparing the two compositional models, the AtomSets-V_0 model achieves higher accuracy throughout all the tasks and generally converges faster than its non-TL counterpart, the AtomSets model, as shown in Figure 2. For the structural datasets in Figures 2c and 2d, consistent with the previous benchmark results, the structural AtomSets-V_1 and MLP-f models are generally more accurate than the compositional models. The rapid convergence of the MLP-f models on the MP formation energy dataset is expected, since the structural features f were generated by the formation energy MEGNet model in the first place. Model convergence on the rest of the datasets is provided in the Supplementary Information.
Model performance was also probed on very small datasets. We used several MP datasets, for which consistent results can be obtained, and down-sampled each to 100, 200, 400, 600, 1000, and 2000 data points. For comparison, we also include the non-TL MEGNet structural models, as shown in Figure 3. Similar to the convergence study at larger data sizes, the TL compositional AtomSets-V_0 models outperform the non-TL compositional AtomSets models at all data sizes. For the structural models, the TL AtomSets-V_1 models achieve consistent accuracy in the small-data limit for all four tasks and consistently outperform the non-TL MEGNet models.
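The down-sampling procedure can be sketched as a simple learning-curve loop. The helper `train_and_score` below is hypothetical (it would fit an AtomSets or MEGNet model on the given indices and return the test MAE); the sizes follow the ones listed above.

```python
# Minimal sketch of the small-data convergence study (repeated down-sampling).
import numpy as np

SIZES = [100, 200, 400, 600, 1000, 2000]

def small_data_curve(n_total, train_and_score, n_repeats=5, test_fraction=0.1, seed=0):
    rng = np.random.default_rng(seed)
    curve = {}
    for size in SIZES:
        maes = []
        for _ in range(n_repeats):
            perm = rng.permutation(n_total)
            n_test = int(test_fraction * n_total)
            test_idx, pool = perm[:n_test], perm[n_test:]
            maes.append(train_and_score(pool[:size], test_idx))   # down-sampled training set
        curve[size] = (float(np.mean(maes)), float(np.std(maes)))
    return curve
```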
Interestingly, the MLP-f models specialize in the MP metallicity and MP formation energy data, in line with the benchmark results shown in Table 2.

Model Extrapolability
In a typical materials design problem, the target is not to find a material with similar performance to most existing materials, but rather materials with extraordinary properties that lie outside the current materials pool. Such extrapolation presents a major challenge for most ML models. Previous works have designed leave-one-cluster-out cross-validation (LOCO CV) 34 or k-fold forward cross-validation 35 to evaluate a model's ability to extrapolate into data regions outside the training data. Here we adopted the concept of forward cross-validation by splitting the data according to their target value ranges, and applied the method to the elasticity data (MP log(K_VRH) and MP log(G_VRH)) to imitate the process of finding super-incompressible (high K) and superhard (roughly high G) materials. First, we held out the materials with the top 10% of target values as a test dataset (high-test, extrapolation). The remaining data were then split into train, validation, and test (low-test, interpolation) datasets, giving two test data regimes in total. We selected the AtomSets, AtomSets-V_0, AtomSets-V_1, and MEGNet models for the comparison. For the bulk moduli K, the low-test errors for the compositional models AtomSets and AtomSets-V_0 are identical. However, when the test target values lie outside the training data range, the errors increase rapidly above the low-test errors. Nevertheless, the TL model AtomSets-V_0 generalizes better in the high-test (extrapolation) regime, as shown by the lower extrapolation errors in Figures 4a and 4b. For the structural models, the low-test errors are again almost the same, yet the TL AtomSets-V_1 models have lower errors than their MEGNet counterparts (Figures 4c and 4d). Similar conclusions can be reached using the shear moduli dataset, as shown in Figure S3. These results demonstrate that TL approaches can significantly enhance model accuracy in the extrapolation tasks critical to new materials discovery.

Figure 4: Absolute differences between predicted and DFT log(K_VRH), i.e., |∆ log(K_VRH)|, against the DFT value range for the test data. The training and validation data are randomly sampled from the 0% to 90% target quantile range (vertical dashed line). Half of the test data comes from the 90%-100% quantile (extrapolation) and the other half is from the same target range as the train-validation data (interpolation).
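A minimal sketch of this forward (extrapolation) split is shown below; the 80/10/10 partition of the remaining data into train, validation, and interpolation-test sets is an illustrative assumption, not the exact fractions used in this work.

```python
# Forward split: top 10% of target values held out as the extrapolation test set.
import numpy as np

def forward_split(targets, extrapolation_quantile=0.9, seed=0):
    targets = np.asarray(targets)
    rng = np.random.default_rng(seed)
    cutoff = np.quantile(targets, extrapolation_quantile)
    high_test = np.where(targets >= cutoff)[0]         # extrapolation regime
    rest = rng.permutation(np.where(targets < cutoff)[0])
    n = len(rest)
    train = rest[: int(0.8 * n)]
    val = rest[int(0.8 * n): int(0.9 * n)]
    low_test = rest[int(0.9 * n):]                     # interpolation regime
    return train, val, low_test, high_test
```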

Discussion
The hierarchical MEGNet features provide a cascade of descriptors that capture both short-ranged interactions in the early GC layers (e.g., V_0, V_1) and long-ranged interactions in the later GC layers (e.g., V_2, V_3). The early GC features are better TL features across a variety of tasks, while the features generated by the later GC layers generally exhibit worse performance. This can be explained by analogy to convolutional neural networks (CNNs) in facial recognition, where the early feature maps capture generic features such as lines and shapes while the later feature maps form human faces. 36 It is not surprising that if such a CNN is transferred to other domains, for example recognizing general objects beyond faces, the early feature maps may still work while the later ones will not.

One surprising result from our studies is the relatively good performance of the compositional models (AtomSets-V_0) on many properties, e.g., the phonon DOS peak and the bulk and shear moduli. It would be erroneous to conclude that these properties are not structure-dependent. We believe the main reason for the outperformance of the compositional models is that most compositions either do not exhibit polymorphism or have many polymorphs with somewhat similar properties, e.g., the well-known family of SiC polymorphs. These results highlight the importance of generating diverse data beyond the existing known materials. Existing databases such as the Materials Project typically prioritize computations on known materials, e.g., ICSD crystals. While such a strategy undoubtedly provides the most value to the community for the study of existing materials, the discovery of new materials with extraordinary properties requires exploration beyond known materials; additional training data on hypothetical materials is critical for the development of ML models that can extrapolate beyond known materials design spaces. The use of TL, as shown in this work, is nevertheless critical for improving the extrapolability of models.
The AtomSets framework can be viewed as a particular case of the graph network models, one that operates only on the atom features followed by the readout and MLP steps (Figure 1c).

Conclusion
This work introduces a new and straightforward deep learning framework, AtomSets, as an effective way to learn materials properties at all data sizes and for both compositional and structural data. By combining with TL, structure-embedded compositional and structural information can be readily incorporated into the models. The simple model architecture makes it possible to train the models on much smaller datasets and with lower computational cost compared to graph models. We show that the AtomSets models achieve consistently low errors from small data tasks, e.g., the steel strength dataset, to large data tasks, e.g., the MP computational datasets, and that the model accuracy further improves with TL. We also show better model convergence for the AtomSets models. The AtomSets framework provides a facile deep learning approach that can help accelerate the materials discovery process by combining accurate compositional and structural materials models.

Code Availability
Table S1: Materials data name, data sizes, input types, property names, units and task types. "Type" shows the input data type, where "Comp" means composition and "Struct" means structure. The tasks include regression (R) and classification (C). For the matbench perovskites dataset, the original data gives the formation energy in eV, while in the final presentation of error metrics, we converted it to eV/atom.

Figure S3: Absolute differences between predicted and DFT log(G_VRH), i.e., |∆ log(G_VRH)|, against the DFT value range. The training and validation data are randomly sampled from the 0% to 90% target quantile range (vertical dashed line). Half of the test data comes from the 90%-100% quantile (extrapolation) and the other half is from the same target range as the train-validation data (interpolation).