Benchmarking graph neural networks for materials chemistry

Graph neural networks (GNNs) have received intense interest as a rapidly expanding class of machine learning models remarkably well-suited for materials applications. To date, a number of successful GNNs have been proposed and demonstrated for systems ranging from crystal stability to electronic property prediction and to surface chemistry and heterogeneous catalysis. However, a consistent benchmark of these models remains lacking, hindering the development and consistent evaluation of new models in the materials field. Here, we present a workflow and testing platform, MatDeepLearn, for quickly and reproducibly assessing and comparing GNNs and other machine learning models. We use this platform to optimize and evaluate a selection of top performing GNNs on several representative datasets in computational materials chemistry. From our investigations we note the importance of hyperparameter selection and find roughly similar performances for the top models once optimized. We identify several strengths in GNNs over conventional models in cases with compositionally diverse datasets and in its overall flexibility with respect to inputs, due to learned rather than defined representations. Meanwhile several weaknesses of GNNs are also observed including high data requirements, and suggestions for further improvement for applications in materials chemistry are discussed.


INTRODUCTION
In the search for materials with various functional applications ranging from catalysis to energy storage to electronics, machine learning (ML) has quickly gained traction as a powerful and flexible approach, especially where a broad exploration of the materials space is needed [1][2][3][4][5][6] . The adoption of ML for materials discovery is expected to expand even further with the ongoing growth in the availability of high-throughput density functional theory (DFT) datasets and continued advancements in ML algorithms [7][8][9][10][11][12][13][14] . Conventionally, ML models in materials chemistry are descriptor-based, where the key descriptors representing the system must first be defined prior to fitting a suitable ML model for prediction. General examples of these descriptors include stoichiometry, the elemental properties such as group, period, electronegativity and radius, and electronic properties such as partial charges and s, p, d-band positions. A number of structural descriptors have also been proposed satisfying translation and rotational invariance, including but not limited to the Coulomb matrix 15 , atom-centered symmetry functions (ACSFs) 16 , and smooth overlap of atomic positions (SOAP) 17 . However, finding effective descriptors can prove challenging for problems with a large amount of compositionally and structurally diverse materials.
In recent years, graph neural networks (GNNs) [18][19][20] have received increasing attention as a method that could potentially overcome the limitations of static descriptors by learning the representations on flexible, graph-based inputs. Within this overarching class of ML method, a number of GNN models have been proposed for chemistry-related problems, with the earliest adopters focusing on molecular systems [21][22][23][24] . Subsequently, GNNs have also been used in materials prediction, with a number of studies tackling systems such as periodic crystals [25][26][27][28][29][30] and surfaces 11,[31][32][33][34] . These systems are generally described by their atomic structures, where the atoms can be represented by nodes and the neighbors encoded by the edges. Information regarding the atoms and bonds such as element type and bond distances, respectively, can be further encoded in the node and edge attributes. GNNs operate on these atom-based graphs to create node-level embeddings through convolutions with neighboring nodes and edges. This differs from static or pre-defined descriptors which are obtained via a dictionary lookup or a fixed function with no trainable parameters, whereas in a GNN embeddings are obtained from a trainable neural network 35 .
Given the rapid advances in GNNs for computational materials chemistry currently, a critical evaluation of the current state-ofthe-art (SOTA) is warranted. To accomplish this, several criteria should be met: (1) the same datasets should be used across the evaluated models, (2) the datasets used should represent diverse problems in materials chemistry, (3) the same input information and representation should be used, (4) the hyperparameters of the models should be optimized to the same extent, and (5) these should be performed in a reproducible manner. In this work, we attempt to address these criteria and provide an open-source framework, MatDeepLearn, which can be used for further studies in this area ( Fig. 1) 36 . Provided with atomic structures and target properties from any dataset of choice, MatDeepLearn handles the processing of structures to graphs, offers a library of GNN models, and provides hyperparameter optimization for the models. Within this framework, improvements and additions to the input representations and model architectures can be easily made, which can greatly reduce the development time needed for new ML models. GNNs can also be critically evaluated, and actionable information can be quickly obtained when applied to specific problems in chemistry and the materials sciences. We then use this framework to benchmark several SOTA models and provide a timely snapshot of current progress and suggestions for future progress.

Evaluation of performance across datasets
We evaluated a total of seven ML models for regression tasks on the five datasets, summarized in Table 1, with the specific  hyperparameters for each top model listed in Supplementary  Table 1, with GNN entries shaded in gray. The MAE is obtained through five-fold cross-validation over the entire dataset. We find all four SOTA models (SchNet, MPNN, CGCNN, MEGNet) performed very well for all tested datasets, once again demonstrating the capability of GNNs for accurate predictions when ample training data is available. Unexpectedly, we find the SOTA models all performed equally well in most cases, which shows hyperparameter optimization can be just as important as the choice of graph convolutional operator and overall machine learning architecture. The SOTA models also offer advantages over simpler GNNs that were not designed for materials chemistry, such as the GCN model, which performed markedly worse for nearly all datasets except the bulk crystals. The particularly poor GCN performance, especially for Pt clusters, suggests convolutions involving edge attributes containing interatomic distances in the Fig. 1 Scheme of the GNN testing framework and workflow. A general outline of the approach is presented, starting with materials data (with several examples shown) in the form of structure files and target properties, followed by data processing, model construction, and hyperparameter optimization as part of the MatDeepLearn framework. A general graph neural network architecture is constructed, taking in graphs containing nodes, edges, node attributes, and edge attributes, inputted into an embedding layer, GC blocks, pooling, and dense layers. This allows models to be similarly compared within a shared hyperparameter space. Gaussian basis, which the SOTA models share, are much more effective for capturing the spatial information of atomic structures. Meanwhile, for non-GNN models, SM performed poorly across the datasets, but SOAP performed surprisingly well for bulk crystals and 2D materials and had a performance similar to or better than the SOTA models for Pt clusters. On one hand, this suggests the SOAP descriptors, like other similar descriptors such as ACSFs, remain excellent choices for capturing spatial information in atomic structures, particularly for use in machine-learned potentials. On the other hand, this nevertheless highlights another strength in GNN-based models in that a similar level of performance can still be achieved without any pre-defined descriptors or existing knowledge.
Relative to the Baseline, the SOTA GNNs had the highest relative errors in predicting work functions for the 2D dataset, and this is likely due to the much smaller size of the dataset compared to the other datasets. This is followed by the MOFs dataset, which performed second worst. Next, the GNNs performed similarly well for bulk and surfaces, with both containing ample amounts of data for training. Finally, the GNNs performed best for the clusters relative to the Baseline; however, these errors, at approximately 0.015 eV/atom, remain relatively high for use as machine-learned potentials. A more thorough evaluation would be needed in the future for GNN performance in ML potentials, which may require more significant changes in the model architecture and training data and the inclusion of forces.

Visualization of GNN features
We then compare the ability of the different GNNs in learning the structure and composition representations by obtaining the graph-level embedding from the output of the readout/pooling layer and visualizing with t-distributed stochastic neighbor embedding (t-SNE) in Fig. 2, using the Bulk dataset as an example. The plots in Fig. 2 represent a combined structure-composition latent space for the trained materials where points within a grouping can be expected to share similarities in both their atomic structures and elemental compositions. Each of the four SOTA GNNs is able to generate viable representations leading to groupings of crystals delineated by similar formation energies. In some cases, each GNN obtained a similar grouping in the latent space, such as for structures labeled with the star or square symbols. In other cases, the GNNs obtained dissimilar grouping, for example with the ones denoted by the diamond, and to a lesser extent the triangle and circle symbols. Ultimately, for purposes of regression each of the plotted latent spaces is equally valid and provides similar prediction errors; though additional investigation may be needed in the future to reveal additional differences for applications such as generative machine learning.
Training size dependence Next, we investigated the training size dependence of the models, another avenue of comparison between models and datasets, and estimate the maximum performance with respect to data size in Fig. 3, with specific values in Supplementary Table 2. The model performance is obtained from five parallel runs for each training size using a different train/test split and averaging the errors. We find, in almost all situations, the training size dependence to be very similar between the GNN models for a particular dataset, approximately irrespective of the number of parameters in the model. Unsurprisingly, SOAP has a better training size scaling for the cluster dataset, likely thanks to the effectiveness of the predefined descriptors. However, when applied to compositionally diverse systems, SOAP loses this advantage over GNNs and has a similar training size scaling with the other models. We also estimate performance on data sizes by extrapolating training size dependence curves using a power-law function, ε(m) = αm β where ε is the error as a function of samples, m 37 . We extrapolate to five times the current data set size for each case. For the bulk datasets, fitting the power-law function gives exponents of~−0.3 for the GNNs, and extrapolating predicts MAEs of~0.02-0.03 for 5x data at~150,000 training data. For surfaces, power-law fitting suggests a better scaling than bulk with exponents of~−0.5 and provides an MAE estimate of~0.03-0.04 eV. Meanwhile, a recent study reached an opposite conclusion, finding worse scaling for surface adsorption compared to the bulk 11 . This is likely due to unrelaxed structures being used as inputs, which significantly increases the dimensionality and difficulty of the task.
Evaluation of representation sensitivity So far, we have focused primarily on model performance and maintained a consistent representation for all models and datasets. We examine the soundness of this methodology by testing several different representations and observing their impact on the performance (Table 2). Here we use the optimal hyperparameters in Supplementary Table 1, and test using the CGCNN model. Possible avenues for modifying the representations include changing the edge cutoffs and selection criteria for generating the graphs and changing the properties of the node and edge attributes. First, we find reducing the number of considered neighbors from 12 to 4 did not significantly change prediction accuracy for most cases besides the cluster dataset where the performance decreases significantly. Meanwhile, a fully connected graph did not improve prediction accuracy beyond the default, while greatly increasing the computational cost. Moving to node attributes, we found using node features based simply on element identity with one-hot encoding was also sufficient, and including additional elemental properties such as electronegativity and radius were generally unnecessary for sufficiently large data sets. We also tested using blank input node attributes (with all zeroes), thereby removing any information regarding the elemental composition; this significantly reduced the performance for all cases except the cluster dataset (which only has one element in the system). Finally, a test was performed where the length of edge attributes was reduced from 50 to 10, thereby greatly reducing the structural resolution encoded by the Gaussian basis. This had little to no impact on 4 out of 5 datasets, but greatly reduced the prediction ability for clusters as the information needed to distinguish small changes in Pt-Pt bond distances is lost. Thus, we find the default representation used in this work to be satisfied within the scope of including only structure and atomic information as inputs. Variations in the representation, up to an extent, did not appreciably affect performance likely due to the property of GNNs in producing learned representations via training.

DISCUSSION
With the rapid expansion of readily accessible high-throughput DFT datasets and new GNN models which can train on them, there is a strong demand for a general benchmarking tool to assess which ML models are best for a given target application in materials chemistry. We have developed such a framework, MatDeepLearn, and used it to gauge the performance of several GNNs on different materials data sets. We use the Pytorch and Pytorch-Geometric libraries which allow for fast ML calculations that are optimized for GPU-based computing resources, and the Ray library which provides distributed hyperparameter optimization on multiple nodes. In this work, training time ranges from 10 min to~1-2 h on an NVIDIA Tesla V100 GPU for the smallest and largest datasets, respectively. Hyperparameter training with 160 trials can be finished in roughly 2 days on a single GPU node containing 8 V100 GPUs. In this work, five GNN models were tested, but this list can be rapidly expanded with the provided message-passing network class in Pytorch-Geometric, and with approximately 40 existing methods already implemented for use. Consequently, the development time needed for GNNs can be shortened significantly, along with the rapid testing and benchmarking of these developed methods. In general, we find MatDeepLearn can provide a highly competitive baseline performance with little to no human interaction or effort required; only a suitable dataset containing atomic structures is needed as the inputs.
Our current study also provides some general observations regarding GNNs for materials applications. For the prediction tasks in this work, a GNN which contains nonlinear update functions for nodes (usually neural networks), and a sufficiently descriptive representation of bond distances generally performs quite well, and differences between the current tested graph convolutional operators become small through hyperparameter optimization. Within the scope of our study, including edge updates did not appear to improve the performance perceptibly. In moving forward, a more exhaustive screening of the GNN design space may be fruitful, such as those proposed in a recent study by You et al. 38 . Structure-agnostic models have also been developed in recent years which rely only on composition and their related properties 39,40 , with some approaches also using the GNN architecture 41 . For certain applications such as bulk crystal properties, these have shown to reach comparable performance with the methods in this work.
Training the GNNs can be data-intensive, and we find a minimum of 10 3 -10 4 data points are needed to achieve adequate accuracy in the models. This general sentiment is echoed in a similar study finding descriptor-based models performing better than GNNs with small data sizes, but with GNNs performing better when ample data is available 25 . For applications with very small datasets, pre-defined descriptors still work best, but will quickly fail when moving outside of its domain of applicability. Methods that can improve the quality of the initial guess structures could help reduce the training costs significantly.
Approaches to effectively incorporate domain knowledge with GNN models which complements their existing flexibility and ability for learnable representations would also likely improve performance. For example, additional information about the system can be incorporated into the node or edge attributes, such as the bond types, aromaticity, and chirality in the case of molecular systems, which is not inherently known from just the atomic structure. Along these lines, recent approaches incorporating some form of features relating to atomic orbitals and their interactions have also yielded promising results 30,42 . Alternatively, the inclusion of physical constraints in the loss functions or through other means which have been used for ML, in general, could also be considered for the GNNs here.
Additionally, effectively incorporating knowledge from preexisting datasets through transfer learning is also another promising avenue for improvement by leveraging current highthroughput computational databases. Besides the datasets used in this study, many other computational databases with 10 4 or more data entries are also available which would be well-suited for training with GNNs 7-11 . GNN methods that can train on multi- fidelity data have been demonstrated recently which would improve integration with multiple datasets 43 .

METHODS Datasets
The datasets we used were each chosen to reasonably represent a variety of different classes of inorganic materials ranging from 3D to 0D, each with different target properties, summarized in Table 3. 3D materials are periodic in three dimensions and include bulk crystals, while 2D materials refer to solids with the finite thickness (single or few-layer), and 0D materials refer to nanoclusters and nanoparticles which are non-periodic. All five datasets were generated through DFT calculations. Four of the datasets were obtained from curated, open computational databases. For bulk materials, we compiled approximately 37,000 structures from the Materials Project 12 , which contains the widest elemental diversity of the datasets in this work, as well as the greatest diversity in structure size from the number of atoms. Bulk crystals have been used as a de facto benchmark system in the computational materials literature, with many successful examples of regression and classification using both GNN models [25][26][27][28][29][30] . The property selected for this dataset is formation energy in units of eV/atoms. We note this database is constantly expanding, and here we used a snapshot of a subset of the available data.
For surfaces, we employed a dataset of another roughly 37,000 structures from the Catalysis-Hub database, containing eleven adsorbates on approximately 2000 unique metal alloy surfaces 44 . For this dataset, we used the relaxed surfaces containing the adsorbate as the input structures and adsorption energy in eV as the target property. Unrelaxed surfaces are not used in this study due to the ambiguities involved in the placement of the adsorbate which will not be explored in detail here. With this dataset, we intend to evaluate the general ability of the model for predicting surface chemistry properties, which are relevant for catalysis.
For porous materials, we used a metal-organic framework (MOF) dataset from the QMOF database containing roughly 13,000 processed structures at the time of access, from experimentally synthesized MOFs and containing band gaps in eV as the target property 45 .
For 2D materials, we used a dataset from C2DB, containing around 4000 structures, with the target property being work function in eV, another important electronic property 46 . This dataset is smaller than the other examples in this work while still being compositionally diverse with 60 elements included, and may prove challenging for GNNs with their generally high data requirements.
The last dataset is compositionally narrow with only one element, Pt, but is structurally diverse with around 20,000 different nanoclusters ranging from 10 to 14 atoms, and with total energies in eV as the target. These clusters were obtained from basin-hopping global optimization in a previous study 47 . This dataset was included to evaluate the ability of GNNs to capture structural sensitivity for potential use in machine-learned potentials, a task for which descriptor-based methods are currently most commonly used.

Structure/graph representations
The data are organized as a set of structures and a set of target properties associated with the structures. The structures are described by a set of atomic positions in Cartesian coordinates and an accompanying lattice vector describing the dimensions of the periodic cell. Non-periodic structures can also be represented by simply placing the molecule/cluster in a sufficiently large empty cell and ensuring the distance between images are larger than distance cutoffs for edges. Each atom is represented by a node in the graph, and the edges are usually determined by interatomic distances within a certain radius cutoff. The node attributes in this work contain elemental properties of the atoms which are one-hot encoded as described by Xie et al. 27 . Edge attributes encode the interatomic distances, described later in the section.

Machine learning models and training
The general architecture of overall GNN models used so far in materials chemistry contains several shared characteristics, which we unify into a general architecture here, illustrated in Fig. 1. First, an embedding or preprocessing layer is present which transforms the node attributes from the input to a specified dimension. This is followed by N number of graph convolutional blocks, which perform the convolution and aggregation of the nodes. We used the same graph convolutional operators used in the original GNN models. This is followed by a graph-wide readout/pooling layer which provides an overall graph representation by aggregating the node attributes; in this work, we choose from max, average, sum, and set2set pooling. This is followed by M dense layers and the scalar output. The dimensions of the embedding, graph convolutional, and dense blocks, the type of pooling layer used, layer counts N and M, batch size, and learning rate are hyperparameters that are selected through optimization for each model and each dataset (search ranges in the SI).
We tested five different graph convolutional operators here: The SchNet 22 convolutional operator: Here, d i;j is defined as the interatomic distance between atoms i and j, and h Θ is a neural network containing dense layers which generate filters from interatomic distances. Before being fed to the neural network, the distances are expanded by a Gaussian basis function, which provides a continuous, non-sparse representation. This was also later successfully applied for use in other GNNs such as CGCNN and MEGNet. In the rest of this work, we define this term as and use these as the edge attributes. Additionally, update functions are applied to the node attributes in the form of dense layers. In the original SchNet model, atom features are transformed to scalar atom-wise values before being summed over the whole graph. This is common practice for machine-learned potentials which first calculate the atom-wise energies before sum pooling to obtain the total energy 48,49 . Instead, in this work we pool first to obtain the graphlevel features before predicting the scalar property using dense layers; this is consistent with other approaches for materials property predictions such as MEGNet and CGCNN.
The Message Passing Neural Network (MPNN) 21 convolutional operator: h Θ is a neural network containing dense layers, and update functions are also applied to the nodes, this time in the form of a gated recurrent unit.  27 convolutional operator: Here, z i;j ¼ x i È x j È e i;j , and σ and g are sigmoid and softplus functions respectively.
The MatErials Graph Network (MEGNet) 26 convolutional operator: Two dense layers are added at the beginning of each MEGNet graph convolutional block to preprocess the inputs. Here h Θe and h Θv are edge and node update functions, which are also dense layers. The updates follow the order of edges, nodes, and global attributes. A skip connection adds the unprocessed input attributes of each block with the output attributes. In the original work, the global attributes were left blank for the inorganic crystals dataset and are similarly unused here.
The Graph Convolutional Network (GCN) 50 convolutional operator: We include this as a baseline graph convolutional with much simpler construction and not specifically developed for materials chemistry applications. In contrast to the earlier models, a purely linear update function Θ is used here for the node attributes, and edge attributes are not included. Instead, the edge weights are used, containing the inverse normalized atomic distances.
For each model and dataset, hyperparameter optimization was performed for 160 trials using the HyperOpt optimizer with an 80-20 split for training and validation, and the model with the lowest validation error was selected 51 . The hyperparameter search space can be found in the Supplementary Information. To confirm whether 160 trials were sufficient, two tests were performed for the bulk dataset with 640 trials and found no significant improvement in performance. The reported performance is then obtained from five-fold crossvalidation on the selected model. We found the performance may differ from the values in the literature; for example, the error for CGCNN in this study is 0.049 while it is 0.039 in the original paper 27 . This may be the result of many potential factors, such as data selection, processing, and model optimization.
Here, we limited the training to within 200 epochs to make hyperparameter optimization computationally affordable, while a more thoroughly optimized model may have a better performance. Meanwhile, we also note large variations in performance for the same model in the literature 25,29,30 . This furthermore reinforces the need for a consistent framework for fairly and reproducibly comparing models.
In addition, we tested two non-GNN models, using the Sine matrix 52 (SM) and the Smooth Overlap of Atomic Positions 17 (SOAP) descriptors, for comparison. For the SM descriptor, the size of the matrix is padded with zeros to the maximum atomic size of the dataset, and only the eigenvalues are used, sorted in descending order of their absolute values. For the SOAP descriptor, Gaussian type orbital basis functions are used and an "inner" average is obtained for each element present in the dataset, whereby the average is taken over the sites before summing up the magnetic quantum numbers. The SOAP parameters, the distance cutoff, the number of radial basis functions, the degree of spherical harmonics, and the standard deviation of the Gaussians in the basis functions are all considered as hyperparameters to be optimized. For both models, the descriptors are fed into dense layers with variable layer count and size as determined from hyperparameter optimization. Finally, an overall baseline is provided for comparison in the form of a dummy regressor which only returns the mean of the training dataset.
The code was written in Python 3.7 and uses PyTorch v1.6 and PyTorch-Geometric 53 v1.6 libraries for the ML models 36 . The DScribe library was used to obtain SM and SOAP descriptors 54 . We use the Ray library which provides distributed hyperparameter optimization on multiple nodes 55 .

DATA AVAILABILITY
The datasets used for testing are also provided in full or with instructions included at https://github.com/vxfung/MatDeepLearn.