Abstract
As explanations are increasingly used to understand the behavior of graph neural networks (GNNs), evaluating the quality and reliability of GNN explanations is crucial. However, assessing the quality of GNN explanations is challenging as existing graph datasets have no or unreliable ground-truth explanations. Here, we introduce a synthetic graph data generator, ShapeGGen, which can generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. The flexibility to generate diverse synthetic datasets and corresponding ground-truth explanations allows ShapeGGen to mimic the data in various real-world areas. We include ShapeGGen and several real-world graph datasets in a graph explainability library, GraphXAI. In addition to synthetic and real-world graph datasets with ground-truth explanations, GraphXAI provides data loaders, data processing functions, visualizers, GNN model implementations, and evaluation metrics to benchmark GNN explainability methods.
Introduction
As graph neural networks (GNNs) are being increasingly used for learning representations of graph-structured data in high-stakes applications, such as criminal justice^{1}, molecular chemistry^{2,3}, and biological networks^{4,5}, it becomes critical to ensure that the relevant stakeholders can understand and trust their functionality. To this end, previous work developed several methods to explain predictions made by GNNs^{6,7,8,9,10,11,12,13,14}. With the increase in newly proposed GNN explanation methods, it is critical to ensure their reliability. However, explainability in graph machine learning is an emerging area lacking standardized evaluation strategies and reliable data resources to evaluate, test, and compare GNN explanations^{15}. While several works have acknowledged this difficulty, they tend to base their analysis on specific real-world^{2} and synthetic^{16} datasets with limited ground-truth explanations. In addition, relying on these datasets and associated ground-truth explanations is insufficient as they are not indicative of diverse real-world applications^{15}. To this end, developing a broader ecosystem of data resources for benchmarking state-of-the-art GNN explainers can support explainability research in GNNs.
A comprehensive data resource to correctly evaluate the quality of GNN explanations will ensure their reliability in high-stakes applications. However, the evaluation of GNN explanations is a growing research area with relatively little work, where existing approaches mainly leverage ground-truth explanations associated with specific datasets^{2} and are prone to several pitfalls (as outlined by Faber et al.^{16}). Further, multiple underlying rationales can generate the correct class labels, creating redundant or non-unique explanations. A trained GNN model may only capture one or an entirely different rationale. In such cases, evaluating the explanation output by a state-of-the-art method using the ground-truth explanation is incorrect because the underlying GNN model does not rely on that ground-truth explanation. In addition, even if a unique ground-truth explanation generates the correct class label, the GNN model trained on the data could be a weak predictor using an entirely different rationale for prediction. Therefore, the ground-truth explanation cannot be used to assess post hoc explanations of such models. Lastly, the ground-truth explanations corresponding to some of the existing benchmark datasets are not good candidates for reliably evaluating explanations as they can be recovered using trivial baselines (e.g., random node or edge as explanation). The above discussion highlights a clear need for general-purpose data resources which can evaluate post hoc explanations reliably across diverse real-world applications.
While various benchmark datasets (e.g., Open Graph Benchmark (OGB)^{17}, GNNMark^{18}, GraphGT^{19}, MalNet^{20}, Graph Robustness Benchmark (GRB)^{21}, Therapeutics Data Commons^{22,23}, and EFO1QA^{24}) and programming libraries for deep learning on graphs (e.g., Dive Into Graphs (DIG)^{25}, PyTorch Geometric (PyG)^{26}, and Deep Graph Library (DGL)^{27}) exist in the graph machine learning literature, they are mainly used to benchmark the performance of GNN predictors and are not suited to evaluate the correctness of GNN explainers because they do not capture ground-truth explanations.
Here, we address the above challenges by introducing a general-purpose data resource that is not prone to ground-truth pitfalls (e.g., redundant explanations, weak GNN predictors, trivial explanations) and can cater to diverse real-world applications. To this end, we present ShapeGGen (Fig. 2), a novel and flexible synthetic XAI-ready (explainable artificial intelligence ready) dataset generator which can automatically generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. ShapeGGen ensures that the generated ground-truth explanations are not prone to the pitfalls described in Faber et al.^{16}, such as redundant explanations, weak GNN predictors, and trivially correct explanations. Furthermore, ShapeGGen can evaluate the goodness of any explanation (e.g., node feature-based, node-based, edge-based) across diverse real-world applications by seamlessly generating synthetic datasets that can mimic the properties of real-world data in various domains.
We incorporate ShapeGGen and several other synthetic and real-world graphs^{28} into GraphXAI, a general-purpose framework for benchmarking GNN explainers. GraphXAI also provides a broader ecosystem (Fig. 1) of data loaders, data processing functions, visualizers, and a set of evaluation metrics (e.g., accuracy, faithfulness, stability, fairness) to reliably benchmark the quality of any given GNN explanation (node feature-based or node/edge-based). We leverage various synthetic and real-world datasets and evaluation metrics from GraphXAI to empirically assess the quality of explanations output by eight state-of-the-art GNN explanation methods. Across many GNN explainers, graphs, and graph tasks, we observe that state-of-the-art GNN explainers fail on graphs with large ground-truth explanations (i.e., explanation subgraphs with a higher number of nodes and edges) and cannot produce explanations that preserve fairness properties of underlying GNN predictors.
Results
To evaluate GraphXAI, we show how GraphXAI enables systematic benchmarking of eight state-of-the-art GNN explainers on both ShapeGGen (see the Methods section) and real-world graph datasets. We explore the utility of the ShapeGGen generator to benchmark GNN explainers on graphs with homophilic vs. heterophilic, small vs. large, and fair vs. unfair ground-truth explanations. Additionally, we examine the utility of GNN explanations on datasets with varying degrees of informative node features. Next, we outline the experimental setup, including details about performance metrics, GNN explainers, and underlying GNN predictors, and proceed with a discussion of benchmarking results.
Experimental setup
GNN explainers
GraphXAI defines an Explanation class capable of storing multiple types of explanations produced by GNN explainers and provides a graphxai.BaseExplainer class that serves as the base for all explanation methods in GraphXAI. We incorporate eight GNN explainability methods, including gradient-based methods: Grad^{29}, GradCAM^{11}, GuidedBP^{6}, and Integrated Gradients^{30}; perturbation-based methods: GNNExplainer^{14}, PGExplainer^{10}, and SubgraphX^{31}; and a surrogate-based method: PGMExplainer^{13}. Finally, following Agarwal et al.^{15}, we consider random explanations as a reference: (1) Random Node Features, a node feature mask defined by a d-dimensional Gaussian-distributed vector; (2) Random Nodes, a 1 × n node mask randomly sampled from a uniform distribution, where n is the number of nodes in the enclosing subgraph; and (3) Random Edges, an edge mask drawn from a uniform distribution over a node’s incident edges.
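The three random reference explanations can be illustrated with a minimal sketch. It is not the GraphXAI implementation: the function names are hypothetical, the Gaussian is assumed to be standard normal, and the uniform draws are assumed to lie in [0, 1).

```python
import random

def random_node_feature_mask(d, rng):
    """Random Node Features: a d-dimensional vector of Gaussian draws.
    Mean 0 and standard deviation 1 are assumptions of this sketch."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def random_node_mask(n, rng):
    """Random Nodes: a 1 x n mask of uniform draws in [0, 1)."""
    return [rng.random() for _ in range(n)]

def random_edge_mask(incident_edges, rng):
    """Random Edges: one uniform score per edge incident to the node."""
    return {edge: rng.random() for edge in incident_edges}

rng = random.Random(0)
feat_mask = random_node_feature_mask(d=5, rng=rng)
node_mask = random_node_mask(n=4, rng=rng)
edge_mask = random_edge_mask([(0, 1), (0, 2), (0, 3)], rng)
```

Such masks carry no information about the model, so any explainer worth using should score better than them on the metrics defined in this section.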
Implementation details
We use a three-layer GIN model^{32} and a GCN model^{33} as GNN predictors for our experiments. The GIN model comprises three GIN convolution layers with ReLU nonlinear activation functions and a Softmax activation for the final layer. The hidden dimensionality of the layers is set to 16. We follow an established approach for generating explanations^{8,15} and use reference algorithm implementations. We select the top-k (k = 25%) most important nodes, node features, or edges and use them to generate explanations for all graph explainability methods. For training GIN models, we use an Adam optimizer with a learning rate of 1 × 10^{−2}, a weight decay of 1 × 10^{−5}, and 1000 training epochs. For training GCN models, we use an Adam optimizer with a learning rate of 3 × 10^{−2}, no weight decay, and 1500 training epochs. We set the hyperparameters of GNN explainability methods following the authors’ recommendations.
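The top-k selection step can be sketched as follows. This is an illustrative helper rather than the GraphXAI code: the name top_k_mask is hypothetical, and rounding k up and keeping at least one element are assumptions.

```python
import math

def top_k_mask(scores, fraction=0.25):
    """Binarize importance scores by keeping the top `fraction` of entries.
    Rounding up and keeping at least one entry are assumptions of this sketch."""
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]

# Eight importance scores; with k = 25% the two largest are kept.
mask = top_k_mask([0.9, 0.1, 0.4, 0.05, 0.7, 0.2, 0.3, 0.6])
```

The same thresholding applies to node, node feature, and edge scores, since each is just a list of importance values.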
We use a fixed random split provided within the GraphXAI package to split the datasets. For each ShapeGGen dataset, we use a 70/5/25 split for training, validation, and testing, respectively. For MUTAG, Benzene, and Fluoride Carbonyl datasets, we use a 70/10/20 split throughout each dataset. Average performance is reported across each sample in the testing set of each dataset.
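GraphXAI ships its own fixed splits; the sketch below only illustrates what a seeded 70/5/25 split looks like. The function name, the seed value, and the rounding rule are assumptions.

```python
import random

def fixed_split(n, fractions=(0.70, 0.05, 0.25), seed=0):
    """Shuffle indices 0..n-1 with a fixed seed and cut them into
    train/validation/test parts; the test part takes the remainder."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = fixed_split(100)
```

Reusing the same seed reproduces the same partition, which keeps results comparable across explainers.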
Performance metrics
In addition to the synthetic and real-world data resources, we consider four broad categories of performance metrics: (i) Graph Explanation Accuracy (GEA); (ii) Graph Explanation Faithfulness (GEF); (iii) Graph Explanation Stability (GES); and (iv) Graph Explanation Fairness (GECF, GEGF) to evaluate the explanations on the respective datasets. In particular, all evaluation metrics leverage predicted explanations, ground-truth explanations, and other user-controlled parameters, such as top-k features. Our GraphXAI package implements these performance metrics and additional utility functions within the graphxai.metrics module. Figure 3 shows a code snippet for evaluating the correctness of output explanations for a given GNN prediction in GraphXAI.
Graph explanation accuracy (GEA)
We report graph explanation accuracy as an evaluation strategy that measures an explanation’s correctness using the ground-truth explanation M^{g}. In ground-truth and predicted explanation masks, every node, node feature, or edge takes a value in {0, 1}, where ‘0’ means that an attribute is unimportant and ‘1’ means that it is important for the model prediction. To measure accuracy, we use the Jaccard index^{34} between the ground-truth M^{g} and predicted M^{p}:
\[{\rm{JAC}}({{\bf{M}}}^{g},{{\bf{M}}}^{p})=\frac{{\rm{TP}}}{{\rm{TP}}+{\rm{FP}}+{\rm{FN}}}\tag{1}\]
where TP denotes true positives, FP denotes false positives, and FN indicates false negatives. Most synthetic and real-world graphs have multiple ground-truth explanations. For example, in the MUTAG dataset^{35}, carbon rings with either NH_{2} or NO_{2} chemical groups are valid explanations for the GNN model to recognize a given molecule as mutagenic. For this reason, the accuracy metric must account for the possibility of multiple equally valid explanations existing for any given prediction. Hence, we define ζ as the set of all possible ground-truth explanations, where |ζ| = 1 for graphs having a unique explanation. Therefore, we calculate GEA as:
\[{\rm{GEA}}(\zeta ,{{\bf{M}}}^{p})=\mathop{\max }\limits_{{{\bf{M}}}^{g}\in \zeta }{\rm{JAC}}({{\bf{M}}}^{g},{{\bf{M}}}^{p})\tag{2}\]
Here, we can calculate GEA using predicted node feature, node, or edge explanation masks. Finally, Eq. 1 quantifies the accuracy between the ground-truth and predicted explanation masks. Higher values mean a predicted explanation is more likely to match the ground-truth explanation.
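The accuracy computation can be sketched in a few lines, assuming GEA takes the best Jaccard score over the set ζ of valid ground-truth masks. The function names are illustrative rather than the graphxai.metrics API, and returning 0 when both masks are empty is an assumption.

```python
def jaccard(truth, pred):
    """Jaccard index TP / (TP + FP + FN) between two binary masks."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0  # empty-mask convention is an assumption

def graph_explanation_accuracy(ground_truths, pred):
    """Best Jaccard score over all equally valid ground-truth masks."""
    return max(jaccard(truth, pred) for truth in ground_truths)

pred = [1, 1, 0, 0, 1]
truths = [[1, 1, 1, 0, 0], [0, 1, 0, 0, 1]]
gea = graph_explanation_accuracy(truths, pred)  # max(2/4, 2/3)
```

Taking the maximum means an explainer is not penalized for recovering one valid rationale rather than another.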
Graph explanation faithfulness (GEF)
We extend existing faithfulness metrics^{2,15} to quantify how faithful explanations are to an underlying GNN predictor. In particular, we obtain the prediction probability vector \({\widehat{y}}_{u}\) using the GNN, i.e., \({\widehat{y}}_{u}=f({{\mathcal{S}}}_{u})\), and using the explanation, i.e., \({\widehat{y}}_{u{\prime} }=f({{\mathcal{S}}}_{u{\prime} })\), where we generate a masked subgraph \({{\mathcal{S}}}_{u{\prime} }\) by only keeping the original values of the top-k features identified by an explanation, and get their respective predictions \({\widehat{y}}_{u{\prime} }\). Finally, we compute the graph explanation unfaithfulness metric as:
\[{\rm{GEF}}({\widehat{y}}_{u},{\widehat{y}}_{u{\prime} })=1-\exp \left(-{\rm{KL}}({\widehat{y}}_{u}\parallel {\widehat{y}}_{u{\prime} })\right)\tag{3}\]
where the Kullback-Leibler (KL) divergence score quantifies the distance between two probability distributions, and the “∥” operator indicates the statistical divergence measure. Note that Eq. 3 is a measure of the unfaithfulness of the explanation. So, higher values indicate a higher degree of unfaithfulness.
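As a hedged illustration, the sketch below assumes the unfaithfulness score has the form 1 − exp(−KL(ŷ_u ∥ ŷ_u′)), which maps the KL divergence into [0, 1). The function names and the smoothing constant eps are assumptions, not part of GraphXAI.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def graph_explanation_unfaithfulness(y_full, y_masked):
    """Assumed form 1 - exp(-KL): 0 for identical predictions, approaching 1 as they diverge."""
    return 1.0 - math.exp(-kl_divergence(y_full, y_masked))

same = graph_explanation_unfaithfulness([0.7, 0.3], [0.7, 0.3])
close = graph_explanation_unfaithfulness([0.9, 0.1], [0.8, 0.2])
far = graph_explanation_unfaithfulness([0.9, 0.1], [0.5, 0.5])
```

Here y_full is the prediction on the original subgraph and y_masked the prediction on the subgraph that keeps only the top-k explained features.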
Graph explanation stability (GES)
Formally, an explanation is defined to be stable if the explanation for a given graph and its perturbed counterpart (generated by making infinitesimally small perturbations to the node feature vector and associated edges) are similar^{15,36}. We measure graph explanation stability w.r.t. the changes in the model behavior. In addition to requiring similar output labels for \({{\mathcal{S}}}_{u}\) and the perturbed \({{\mathcal{S}}}_{u{\prime} }\), we employ a second level of check where the difference between model behaviors for \({{\mathcal{S}}}_{u}\) and \({{\mathcal{S}}}_{u{\prime} }\) is bounded by an infinitesimal constant \(\delta \), i.e., \(\parallel {\mathcal{L}}({{\mathcal{S}}}_{u})-{\mathcal{L}}({{\mathcal{S}}}_{u{\prime} }){\parallel }_{p}\le \delta \). Here, \({\mathcal{L}}(\cdot )\) refers to any form of model knowledge like output logits \({\widehat{y}}_{u}\) or embeddings \({{\bf{z}}}_{u}\). We compute graph explanation instability as:
\[{\rm{GES}}({{\bf{M}}}_{{{\mathcal{S}}}_{u}}^{p},{{\bf{M}}}_{{{\mathcal{S}}}_{u{\prime} }}^{p})=\mathop{\max }\limits_{{{\mathcal{S}}}_{u{\prime} }\in \beta ({{\mathcal{S}}}_{u})}D({{\bf{M}}}_{{{\mathcal{S}}}_{u}}^{p},{{\bf{M}}}_{{{\mathcal{S}}}_{u{\prime} }}^{p})\tag{4}\]
where D represents the cosine distance metric, \({{\bf{M}}}_{{{\mathcal{S}}}_{u}}^{p}\) and \({{\bf{M}}}_{{{\mathcal{S}}}_{u{\prime} }}^{p}\) are the predicted explanation masks for \({{\mathcal{S}}}_{u}\) and \({{\mathcal{S}}}_{u{\prime} }\), and \(\beta \) represents a \(\delta \)-radius ball around \({{\mathcal{S}}}_{u}\) for which the model behavior is the same. Eq. 4 is a measure of instability, and higher values indicate a higher degree of instability.
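The instability score can be sketched as the worst-case cosine distance between the explanation of the original subgraph and the explanations of its perturbed counterparts. The sketch assumes the perturbed masks have already been generated and filtered to those within the δ-ball. The function names and the value returned for all-zero masks are assumptions.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 for an all-zero mask (an assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def graph_explanation_instability(mask, perturbed_masks):
    """Largest cosine distance between the original mask and any perturbed mask."""
    return max(cosine_distance(mask, m) for m in perturbed_masks)

original = [1, 0, 1, 0]
perturbed = [[1, 0, 1, 0], [1, 1, 0, 0]]
ges = graph_explanation_instability(original, perturbed)
```

In this example the second perturbed mask shares only one of two selected nodes with the original, so the score is about 0.5.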
Counterfactual fairness mismatch
To measure counterfactual fairness^{15}, we verify if the explanations corresponding to \({{\mathcal{S}}}_{u}\) and its counterfactual counterpart (where the protected node feature is modified) are similar (dissimilar) if the underlying model predictions are similar (dissimilar). We calculate counterfactual fairness mismatch as:
\[{\rm{GECF}}({{\bf{M}}}^{p},{{\bf{M}}}_{s}^{p})=D({{\bf{M}}}^{p},{{\bf{M}}}_{s}^{p})\tag{5}\]
where M^{p} and \({{\bf{M}}}_{s}^{p}\) are the predicted explanation masks for \({{\mathcal{S}}}_{u}\) and for the counterfactual counterpart of \({{\mathcal{S}}}_{u}\), respectively. Note that we expect a lower GECF score for graphs having weakly-unfair ground-truth explanations because explanations are similar for both original and counterfactual graphs, whereas, for graphs with strongly-unfair ground-truth explanations, we expect a higher GECF score as explanations change when we modify the protected attribute.
Group fairness mismatch
We measure group fairness mismatch^{15} as:
\[{\rm{GEGF}}({\widehat{{\bf{y}}}}_{{\mathcal{K}}},{\widehat{{\bf{y}}}}_{{\mathcal{K}}}^{{{\bf{E}}}_{u}})=\left|{\rm{SP}}({\widehat{{\bf{y}}}}_{{\mathcal{K}}})-{\rm{SP}}({\widehat{{\bf{y}}}}_{{\mathcal{K}}}^{{{\bf{E}}}_{u}})\right|\tag{6}\]
where \({\widehat{{\bf{y}}}}_{{\mathcal{K}}}\) and \({\widehat{{\bf{y}}}}_{{\mathcal{K}}}^{{{\bf{E}}}_{u}}\) are predictions for a set of \({\mathcal{K}}\) graphs using the original and the essential features identified by an explanation, respectively, and SP is the statistical parity. Finally, Eq. 6 is a measure of group fairness mismatch of an explanation where higher values indicate that the explanation is not preserving group fairness.
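A small sketch of the group fairness mismatch follows, under two assumptions not stated in the text: labels and the protected attribute are binary, and statistical parity is the absolute difference in positive-prediction rates between the two protected groups. The mismatch is then taken as the absolute gap between the parity of the original predictions and that of the explanation-based predictions. All names are illustrative.

```python
def statistical_parity(preds, protected):
    """|P(pred = 1 | s = 0) - P(pred = 1 | s = 1)| for binary preds and groups."""
    group0 = [p for p, s in zip(preds, protected) if s == 0]
    group1 = [p for p, s in zip(preds, protected) if s == 1]
    return abs(sum(group0) / len(group0) - sum(group1) / len(group1))

def group_fairness_mismatch(preds_full, preds_expl, protected):
    """Gap between the parity of full-feature and explanation-only predictions."""
    return abs(statistical_parity(preds_full, protected)
               - statistical_parity(preds_expl, protected))

protected = [0, 0, 0, 0, 1, 1, 1, 1]
preds_full = [1, 1, 1, 0, 1, 0, 0, 0]   # positive rates 0.75 vs 0.25, SP = 0.5
preds_expl = [1, 1, 0, 0, 1, 1, 0, 0]   # positive rates 0.50 vs 0.50, SP = 0.0
gegf = group_fairness_mismatch(preds_full, preds_expl, protected)
```

In this example the explanation hides a disparity that the full model exhibits, which the score reports as a mismatch of 0.5.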
Evaluation and analysis of GNN explainability methods
Next, we discuss experimental results that answer critical questions concerning synthetic and real-world graphs and different ground-truth explanations.
Benchmarking GNN explainers on synthetic and real-world graphs
We evaluate the performance of GNN explainers on a collection of synthetically generated graphs with various properties and molecular datasets using metrics described in the experimental setup. Results in Tables 1–5 show that, while no explanation method performs well across all properties, across different ShapeGGen node classification datasets (Table 6), SubgraphX outperforms other methods on average. In particular, SubgraphX generates 145.95% more accurate and 64.80% less unfaithful explanations than other GNN explanation methods. Gradient-based methods, such as GradCAM and GuidedBP, perform the next best of all methods, with Grad producing the second-lowest unfaithfulness score and GradCAM achieving the second-highest explanation accuracy score. PGExplainer generates the least unstable explanations, with 35.35% lower instability than the average across other GNN explainers. In summary, the results in Tables 1–5 show that node explanation masks are more reliable than edge and node feature explanation masks and that state-of-the-art GNN explainers achieve better faithfulness on synthetic graph datasets than on real-world graphs.
Analyzing homophilic vs. heterophilic ground-truth explanations
We compare GNN explainers by generating explanations on GNN models trained on homophilic and heterophilic graphs generated using the SG-Heterophilic generator. Then, we compute the graph explanation unfaithfulness scores of output explanations generated using state-of-the-art GNN explainers. We find that GNN explainers produce 55.98% more faithful explanations when ground-truth explanations are homophilic than when ground-truth explanations are heterophilic (i.e., low unfaithfulness scores for light green bars in Fig. 4). These results reveal an important gap in existing GNN explainers. Namely, existing GNN explainers fail to perform well on diverse graph types, such as homophilic, heterophilic, and attributed graphs. This observation, enabled by the flexibility of the ShapeGGen generator, highlights an opportunity for future algorithmic innovation in GNN explainability.
Analyzing the reliability of graph explainers to smaller vs. larger ground-truth explanations
Next, we examine the reliability of GNN explainers when used to predict explanations for models trained on graphs generated using the SG-SmallEx graph generator. Results in Fig. 5 show that explanations from existing GNN explainers are faithful (i.e., lower GEF scores) to the underlying GNN models when ground-truth explanations are smaller, i.e., S = ‘triangle’. On average, across all eight GNN explainers, we find that existing GNN explainers are highly unfaithful on graphs with large ground-truth explanations, with an average GEF score of 0.7476. Further, we observe that explanations generated on ‘triangular’ (smaller) ground-truth explanations are 59.98% less unfaithful than explanations for ‘house’ (larger) ground-truth explanations (i.e., low unfaithfulness scores for light purple bars in Fig. 5). However, the Grad explainer, on average, achieves 9.33% lower unfaithfulness on large ground-truth explanations compared to other explanation methods. This limitation of existing GNN explainers has not been previously known and highlights an urgent need for additional analysis of GNN explainers.
Examining fair vs. unfair ground-truth explanations
To measure the fairness of predicted explanations, we train GNN models on SG-Unfair, which generates graphs with controllable fairness properties. Next, we compute the average GECF and GEGF values for predicted explanations from eight GNN explainers. The fairness results in Fig. 6 show that GNN explainers do not preserve counterfactual fairness and are highly prone to producing unfair explanations. We note that for weakly-unfair ground-truth explanations (light red in Fig. 6), explanations M^{p} should not change as the label-generating process is independent of the protected attribute. Still, we observe high GECF scores for most explanation methods. For strongly-unfair ground-truth explanations, we find that explanations from most GNN explainers fail to capture (i.e., low GECF scores for dark red bars in Fig. 6) the unfairness enforced using the protected attribute and generate similar explanations even when we flip/modify the respective protected attribute. We see that GradCAM and PGEx explanations outperform other GNN explainers in preserving counterfactual fairness for weakly-unfair ground-truth explanations. In contrast, the PGMEx explainer preserves counterfactual fairness better than other explanation methods on strongly-unfair ground-truth explanations. Our results highlight the importance of studying fairness in XAI, as it can enhance a user’s confidence in the model and assist in detecting and correcting unwanted bias.
Faithfulness shift with varying degrees of node feature information
Using ShapeGGen’s support for node features and ground-truth explanations on those features, we evaluate explainers that generate explanations for node features. Results for node feature explanations on SG-Base are given in Table 4. In addition, we explore the performance of explainers under varying proportions of informative node features. Informative node features, defined in the ShapeGGen construction (Algorithm 1), are node features correlated with the label of a given node, as opposed to redundant features, which are sampled randomly from a Gaussian distribution. Figure 7 shows the results of experiments on three datasets, SG-MoreInform, SG-Base, and SG-LessInform. All datasets have similar graph topology, but SG-MoreInform has a higher proportion of informative features while SG-LessInform has a lower proportion of these informative features. SG-Base is used as a baseline with a proportion of informative features greater than SG-LessInform but less than SG-MoreInform. There are minimal differences between explainers’ faithfulness across datasets; however, unfaithfulness tends to increase with fewer informative node features. As seen in Table 4, the Gradient explainer shows the best faithfulness score across all datasets for node feature explanation. Still, this faithfulness is relatively weak, only 0.001 better than random explanation. These results show that the faithfulness of node feature explanations worsens under sparse node feature signals.
Visualization results
GraphXAI provides functions that visualize explanations produced by GNN explainability methods. Users can compare both node- and graph-level explanations. In addition, function implementations for visualization are parameterized, allowing users to change colors and weight interpretation. Functions are compatible with matplotlib^{37} and networkx^{38}. Visualizations are generated by the graphxai.Explanation.visualize_node function for node-level explanations and the graphxai.Explanation.visualize_graph function for graph-level explanations. In Fig. 8, we show the output explanation from four different GNN explainers as produced by our visualization function. Figure 9 shows example outputs from multiple explainers in the GraphXAI package on a ShapeGGen-generated dataset.
Discussion
GraphXAI provides a general-purpose framework to evaluate GNN explanations produced by state-of-the-art GNN explanation methods. GraphXAI provides data loaders, data processing functions, visualizers, real-world graph datasets with ground-truth explanations, and evaluation metrics to benchmark the quality of GNN explanations. GraphXAI introduces a novel and flexible synthetic dataset generator called ShapeGGen to automatically generate benchmark datasets and corresponding ground-truth explanations robust against known pitfalls of GNN explainability methods. Our experimental results show that existing GNN explainers perform well on graphs with homophilic ground-truth explanations but perform considerably worse on heterophilic and attributed graphs. Across multiple graph datasets and types of downstream prediction tasks, we show that existing GNN explanation methods fail on graphs with larger ground-truth explanations and cannot generate explanations that preserve the fairness properties of the underlying GNN model. In addition, GNN explainers tend to underperform on sparse node feature signals compared to more densely informative node features. These findings indicate the need for methodological innovation and a thorough analysis of future GNN explainability methods across performance dimensions.
GraphXAI provides a flexible framework for evaluating GNN explanation methods and promotes reproducible and transparent research. We maintain GraphXAI as a centralized library for evaluating GNN explanation methods and plan to add newer datasets, explanation methods, diverse evaluation metrics, and visualization features to our existing framework. In the current version of GraphXAI, we mostly employ real-world molecular chemistry datasets as the need for model understanding is motivated by experimental evaluation of model predictions in the laboratory, and it includes a wide variety of graph sizes (ranging from 1,768 to 12,000 instances in the dataset), node feature dimensions (ranging from 13 to 27 dimensions), and class imbalance ratios. In addition to the scale-free ShapeGGen dataset generator, we will include other random graph model generators in the next GraphXAI release to support benchmarking of GNN explainability methods on other graph types. Our evaluation metrics can also be extended to explanations from self-explaining GNNs, e.g., GraphMASK^{12} identifies edges at each layer of the GNN during training that can be ignored without affecting the output model predictions. In general, self-explaining GNNs also return a set of edge masks as an output explanation for a GNN prediction that can be converted to edge importance scores for computing GraphXAI metrics. We anticipate that GraphXAI can help algorithm developers and practitioners in graph representation learning develop and evaluate principled explainability techniques.
Methods
ShapeGGen is a key component of GraphXAI and serves as a synthetic dataset generator of XAI-ready graph datasets. It is grounded in graph theory and designed to address the pitfalls (see Introduction) of existing graph datasets in the broad area of explainable AI. As such, ShapeGGen can facilitate the development, analysis, and evaluation of GNN explainability methods (see Results). We proceed with the description of the ShapeGGen data generator.
Notation
Graphs
Let \({\mathcal{G}}=\left({{\mathcal{V}}}_{{\mathcal{G}}},{{\mathcal{E}}}_{{\mathcal{G}}},{{\bf{X}}}_{{\mathcal{G}}}\right)\) denote an undirected graph comprising a set of nodes \({{\mathcal{V}}}_{{\mathcal{G}}}\) and a set of edges \({{\mathcal{E}}}_{{\mathcal{G}}}\). Let \({{\bf{X}}}_{{\mathcal{G}}}=\left\{{{\bf{x}}}_{1},{{\bf{x}}}_{2},\ldots ,{{\bf{x}}}_{N}\right\}\) denote the set of node feature vectors for all nodes in \({{\mathcal{V}}}_{{\mathcal{G}}}\), where \({{\bf{x}}}_{v}\in {{\bf{X}}}_{{\mathcal{G}}}\) is a d-dimensional vector which captures the attribute values of a node v and \(N=\left|{{\mathcal{V}}}_{{\mathcal{G}}}\right|\) denotes the number of nodes in the graph. Let \({\bf{A}}\in {{\mathbb{R}}}^{N\times N}\) be the graph adjacency matrix where element A_{uv} = 1 if there exists an edge \(e\in {{\mathcal{E}}}_{{\mathcal{G}}}\) between nodes u and v and A_{uv} = 0 otherwise. We use \({{\mathcal{N}}}_{u}\) to denote the set of immediate neighbors of node u, \({{\mathcal{N}}}_{u}=\{v\in {{\mathcal{V}}}_{{\mathcal{G}}}\mid {A}_{uv}=1\}\). Finally, the function \({\rm{\deg }}:{{\mathcal{V}}}_{{\mathcal{G}}}\mapsto {{\mathbb{Z}}}_{\ge 0}\) is defined as \({\rm{\deg }}(u)=\left|{{\mathcal{N}}}_{u}\right|\) and outputs the degree of a node \(u\in {{\mathcal{V}}}_{{\mathcal{G}}}\).
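The notation maps directly onto a small data structure. The sketch below builds the adjacency matrix A of a toy undirected graph and derives the neighbor set and degree of a node from it. The example graph and function names are illustrative only.

```python
# Toy undirected graph with N = 4 nodes and four edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
N = 4

# Adjacency matrix: A[u][v] = 1 if an edge joins u and v, else 0.
A = [[0] * N for _ in range(N)]
for u, v in edges:
    A[u][v] = 1
    A[v][u] = 1  # undirected, so A is symmetric

def neighbors(A, u):
    """Immediate neighbors of u: all v with A[u][v] = 1."""
    return {v for v, a_uv in enumerate(A[u]) if a_uv == 1}

def degree(A, u):
    """Degree of u: the size of its neighbor set."""
    return len(neighbors(A, u))
```

For this graph, node 2 has neighbors {0, 1, 3} and degree 3, while node 3 has degree 1.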
Graph neural networks
Most GNNs can be formulated as message passing networks^{39} using three operators: Msg, Agg, and Upd. In an L-layer GNN, these operators are recursively applied on \({\mathcal{G}}\), specifying how neural messages (i.e., embeddings) are exchanged between nodes, aggregated, and transformed to arrive at node representations in the last layer of transformations. Formally, a message between a pair of nodes (u, v) in layer l is defined as a function of hidden representations of nodes \({{\bf{h}}}_{u}^{l-1}\) and \({{\bf{h}}}_{v}^{l-1}\) from the previous layer: \({{\bf{m}}}_{uv}^{l}=Msg({{\bf{h}}}_{u}^{l-1},{{\bf{h}}}_{v}^{l-1})\). In Agg, messages from all nodes \(v\in {{\mathcal{N}}}_{u}\) are aggregated as: \({{\bf{m}}}_{u}^{l}=Agg({{\bf{m}}}_{uv}^{l}\mid v\in {{\mathcal{N}}}_{u})\). In Upd, the aggregated message \({{\bf{m}}}_{u}^{l}\) is combined with \({{\bf{h}}}_{u}^{l-1}\) to produce u’s representation for layer l as \({{\bf{h}}}_{u}^{l}=Upd({{\bf{m}}}_{u}^{l},{{\bf{h}}}_{u}^{l-1})\). The final node representation \({{\bf{z}}}_{u}={{\bf{h}}}_{u}^{L}\) is the output of the last layer. Lastly, let f denote a downstream GNN classification model that maps the node representation z_{u} to a softmax prediction vector \({\widehat{y}}_{u}\in {{\mathbb{R}}}^{C}\), where C is the total number of labels.
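To make the three operators concrete, the sketch below instantiates them with deliberately simple choices that are assumptions of this example, not the GIN or GCN layers used in the experiments: Msg forwards the neighbor's representation, Agg takes the mean, and Upd adds the aggregate to the node's own representation and applies a ReLU. No learned weights are involved, and every node is assumed to have at least one neighbor.

```python
def msg(h_u, h_v):
    """Msg: here simply the neighbor's previous-layer representation."""
    return h_v

def agg(messages):
    """Agg: element-wise mean over all incoming messages."""
    dim = len(messages[0])
    return [sum(m[i] for m in messages) / len(messages) for i in range(dim)]

def upd(m_u, h_u):
    """Upd: add the aggregate to the node's own state, then apply ReLU."""
    return [max(0.0, a + b) for a, b in zip(m_u, h_u)]

def gnn_layer(h, neighbors):
    """One round of message passing over every node."""
    return {u: upd(agg([msg(h_u, h[v]) for v in neighbors[u]]), h_u)
            for u, h_u in h.items()}

def run_gnn(h0, neighbors, num_layers):
    """Apply num_layers rounds; the result plays the role of z_u = h_u^L."""
    h = h0
    for _ in range(num_layers):
        h = gnn_layer(h, neighbors)
    return h

# Path graph 0 - 1 - 2 with two-dimensional input features.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
h0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
z = run_gnn(h0, neighbors, num_layers=1)
```

After L rounds each representation depends only on nodes at most L hops away, which is why L also bounds the size of the ground-truth explanation in ShapeGGen.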
GNN explainability methods
Given the prediction f(u) for node u made by a GNN model, a GNN explainer \({\mathcal{O}}\) outputs an explanation mask M^{p} that provides an explanation of f(u). These explanations can be given with respect to node attributes \({{\bf{M}}}_{a}\in {{\mathbb{R}}}^{d}\), nodes \({{\bf{M}}}_{n}\in {{\mathbb{R}}}^{N}\), or edges \({{\bf{M}}}_{e}\in {{\mathbb{R}}}^{N\times N}\), depending on the specific GNN explainer, such as GNNExplainer^{14}, PGExplainer^{10}, and SubgraphX^{31}. For all explanation methods, we use a graph masking function MASK that outputs a new graph \({\mathcal{G}}{\prime} =\left({{\mathcal{V}}{\prime} }_{{\mathcal{G}}{\prime} },{{\mathcal{E}}{\prime} }_{{\mathcal{G}}{\prime} },{{\bf{X}}{\prime} }_{{\mathcal{G}}{\prime} }\right)\) by performing an element-wise multiplication between the masks (M_{a}, M_{n}, M_{e}) and their respective attributes in the original graph \({\mathcal{G}}\), e.g., A′ = A ⊙ M_{e}. Finally, we denote the ground-truth explanation mask as M^{g}, which is used to evaluate the performance of GNN explainers.
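The masking operation can be sketched for the edge and node attribute cases. The function names are hypothetical, and broadcasting the d-dimensional attribute mask across all nodes is an assumption about how M_a is applied.

```python
def mask_adjacency(A, M_e):
    """A' = A ⊙ M_e: element-wise product of adjacency matrix and edge mask."""
    return [[a * m for a, m in zip(row_a, row_m)] for row_a, row_m in zip(A, M_e)]

def mask_features(X, M_a):
    """X' = X ⊙ M_a: the same d-dimensional attribute mask applied to every node."""
    return [[x * m for x, m in zip(row, M_a)] for row in X]

A = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]       # triangle graph
M_e = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]     # keep only the edge between 0 and 1
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
M_a = [1, 0]                                 # keep only the first attribute

A_masked = mask_adjacency(A, M_e)
X_masked = mask_features(X, M_a)
```

Feeding the masked graph back through the predictor is what the faithfulness metric relies on.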
ShapeGGen dataset generator
We propose a novel and flexible synthetic dataset generator called ShapeGGen that can automatically generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. Furthermore, the flexibility to generate diverse synthetic datasets and corresponding ground-truth explanations allows us to mimic the data generated by various real-world applications. ShapeGGen is a generator of XAI-ready graph datasets supported by graph theory and is particularly suitable for benchmarking GNN explainers and studying their limitations.
Flexible parameterization of ShapeGGen
ShapeGGen has tunable parameters for data generation. By varying these parameters, ShapeGGen can generate diverse types of graphs, including graphs with varying degrees of class imbalance and graph sizes. Formally, a graph is generated as: \({\mathcal{G}}\) = ShapeGGen \(({\mathcal{S}},{N}_{s},p,{n}_{s},K,{n}_{f},{n}_{i},{s}_{f},{c}_{f},\varphi ,\eta ,L)\), where:

S is the motif, defined as a subgraph, to be planted within the graph.

N_{s} denotes the number of subgraphs used in the initial graph generation process.

p represents the connection probability between two subgraphs and controls the average shortest path length for all possible pairs of nodes. Ideally, a smaller p value for larger N_{s} is preferred to avoid low average path length and, therefore, the poor performance of GNNs.

n_{s} is the expected size of each subgraph in the ShapeGGen generation procedure. For a fixed S shape, large n_{s} values produce graphs with long-tailed degree distributions. Note that the expected total number of nodes in the generated graph is N_{s} × n_{s}.

K is the number of distinct classes defined in the graph downstream task.

n_{f} represents the number of dimensions for node features in the generated graph.

n_{i} is the number of informative node features. These are features correlated with the node labels, as opposed to randomly generated non-informative features. The indices of the informative features define the ground-truth explanation mask for node features in the final ShapeGGen instance. In general, larger n_{i} results in an easier classification and explanation task, as it increases the node feature-level ground-truth explanation size.

s_{f} is defined as the class separation factor that represents the strength of discrimination of class labels between features for each node. Higher s_{f} corresponds to a stronger signal in the node features, i.e., if a classifier is trained only on the node features, a higher s_{f} value would result in an easier classification task.

c_{f} is the number of clusters per class. A larger c_{f} value increases the difficulty of the classification task with respect to node features.

\({\varphi }\in [0,1]\) is the protected feature noise factor that controls the strength of correlation r between the protected features and the node labels. This value corresponds to the probability of “flipping” the protected feature with respect to the node’s label. For instance, \({\varphi }=0.5\) results in zero correlation (r = 0) between the protected feature and the label (i.e. complete fairness), \({\varphi }=0\) results in a positive correlation (r = 1), and \({\varphi }=1\) results in a negative correlation (r = −1) between the label and the protected feature.

η is the homophily coefficient that controls the strength of homophily or heterophily in the graph. Positive values (η > 0) produce a homophilic graph while negative values (η < 0) produce a heterophilic graph.

L is the number of layers in the GNN predictor corresponding to the GNN’s receptive field. For the purposes of ShapeGGen, L is used to define the size of the GNN’s receptive field and thus the size of the groundtruth explanation generated for each node.
This wide array of parameters allows ShapeGGen to generate graph instances with vastly differing properties (Fig. 10).
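To make the parameterization concrete, the twelve arguments can be collected in a small configuration object. The sketch below is illustrative only: the class name `ShapeGGenConfig`, its field names, and its validation checks are hypothetical and are not taken from the GraphXAI API. The values are the SGBase settings reported in this paper, and the expected node count is computed as N_{s} × n_{s}.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShapeGGenConfig:
    """Hypothetical container for the ShapeGGen parameters (not the GraphXAI API)."""
    motif: str            # S: shape planted in each subgraph, e.g. 'house'
    n_subgraphs: int      # N_s: number of subgraphs
    p_connect: float      # p: connection probability between two subgraphs
    subgraph_size: int    # n_s: expected size of each subgraph
    n_classes: int        # K: number of classes
    n_features: int       # n_f: number of node feature dimensions
    n_informative: int    # n_i: number of informative node features
    class_sep: float      # s_f: class separation factor
    n_clusters: int       # c_f: clusters per class
    flip_prob: float      # phi: protected feature noise factor, in [0, 1]
    homophily: float      # eta: > 0 homophilic, < 0 heterophilic
    n_layers: int         # L: number of GNN layers (receptive field)

    def __post_init__(self):
        # Basic sanity checks implied by the parameter definitions.
        if not 0.0 <= self.flip_prob <= 1.0:
            raise ValueError("phi must lie in [0, 1]")
        if self.n_informative > self.n_features:
            raise ValueError("n_i cannot exceed n_f")

    def expected_num_nodes(self) -> int:
        # Expected total number of nodes: N_s * n_s.
        return self.n_subgraphs * self.subgraph_size


sg_base = ShapeGGenConfig('house', 1200, 0.006, 11, 2, 11, 4, 0.6, 2, 0.5, 1.0, 3)
print(sg_base.expected_num_nodes())  # 13200
```

Freezing the dataclass keeps a dataset definition immutable once created, which makes it easier to record exactly which settings produced a given benchmark graph.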
Generating graph structure
Figure 2 summarizes the process to generate a graph \({\mathcal{G}}=\left({{\mathcal{V}}}_{{\mathcal{G}}},{{\mathcal{E}}}_{{\mathcal{G}}},{{\bf{X}}}_{{\mathcal{G}}}\right)\). ShapeGGen generates N_{s} subgraphs that exhibit the preferential attachment property^{40}, which occurs in many realworld graphs. Each subgraph is first given a motif, i.e., a subgraph \({\mathcal{S}}=\left({{\mathcal{V}}}_{{\mathcal{S}}},{{\mathcal{E}}}_{{\mathcal{S}}},{{\bf{X}}}_{{\mathcal{S}}}\right)\). A preferential attachment algorithm is then performed on the base structure \({\mathcal{S}}\), adding \(n{\prime} \) nodes, with \(n{\prime} \sim {\rm{Poisson}}(\lambda ={n}_{s}-| {{\mathcal{V}}}_{{\mathcal{S}}}| )\), by creating edges to nodes in \({{\mathcal{V}}}_{{\mathcal{S}}}\). The Poisson distribution determines the size of each subgraph used in the generation process; its rate \(\lambda ={n}_{s}-| {{\mathcal{V}}}_{{\mathcal{S}}}| \) is the difference between the expected subgraph size \({n}_{s}\) and the number of nodes in the motif. After creating a list of randomly generated subgraphs \({\bf{S}}=\{{{\mathcal{S}}}_{1},\ldots ,{{\mathcal{S}}}_{{N}_{s}}\}\), edges are created to connect subgraphs, creating the structure of an instance of ShapeGGen. Subgraph connections and local subgraph construction are performed in such a way that each node in the final graph \({\mathcal{G}}\) has between 1 and K motifs within its neighborhood, i.e., \(\left| {\bigcup }_{i=1}^{{N}_{s}}{{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}\cap {{\mathcal{N}}}_{v}\right| \in \left\{1,2,\ldots ,K\right\}\) for any v and \({{\mathcal{S}}}_{i}\). This naturally defines a classification task f that maps each node to a label in {0, 1, …, K−1}. More details on the ShapeGGen structure generation can be found in Algorithm 2.
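The growth of a single subgraph can be sketched in a few lines. This is a simplified illustration rather than the procedure of Algorithm 2: the function names are hypothetical, the Poisson sampler uses Knuth's method (adequate for the small rates involved), and new nodes are allowed to attach to any existing node with probability proportional to its degree, which is one possible reading of the preferential attachment step.

```python
import math
import random

# Five-node 'house' motif: a square (0-1-2-3) with a roof node (4).
HOUSE_EDGES = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 4), (1, 4)]


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def grow_subgraph(motif_edges, n_s, rng):
    """Plant a motif, then add n' ~ Poisson(n_s - |V_S|) nodes by preferential attachment."""
    edges = list(motif_edges)
    n_motif = len({u for e in motif_edges for u in e})
    n_new = sample_poisson(max(n_s - n_motif, 0), rng)
    # Each node appears once per incident edge, so a uniform draw from this
    # list selects a node with probability proportional to its degree.
    endpoints = [u for e in edges for u in e]
    for new_node in range(n_motif, n_motif + n_new):
        target = rng.choice(endpoints)
        edges.append((target, new_node))
        endpoints.extend([target, new_node])
    return edges, n_motif + n_new


rng = random.Random(0)
edges, n_nodes = grow_subgraph(HOUSE_EDGES, n_s=11, rng=rng)
print(n_nodes, len(edges))
```

Because the number of added nodes has mean n_{s} minus the motif size, the subgraph contains n_{s} nodes in expectation, and the motif edges are always preserved at the start of the edge list.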
Generating labels for prediction
A motif defined as a subgraph \({\mathcal{S}}=\left({{\mathcal{V}}}_{{\mathcal{S}}},{{\mathcal{E}}}_{{\mathcal{S}}},{{\bf{X}}}_{{\mathcal{S}}}\right)\) occurs randomly throughout \({\mathcal{G}}\), with the set \({\bf{S}}=\{{{\mathcal{S}}}_{1},\ldots ,{{\mathcal{S}}}_{{N}_{s}}\}\). The task on this graph is a motif detection problem, where each node has exactly 1, 2, …, or K motifs in its 1hop neighborhood. A motif \({{\mathcal{S}}}_{i}\) is considered to be within the neighborhood of a node \(v\in {{\mathcal{V}}}_{{\mathcal{G}}}\) if any node \(s\in {{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}\) is also in the neighborhood of v, i.e., if \(| {{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}\cap {{\mathcal{N}}}_{v}| > 0\). Therefore, the task that a GNN predictor needs to solve is to predict the label \({y}_{v}\) of each node v, defined by:
\({y}_{v}={\sum }_{i=1}^{{N}_{s}}{{\mathbb{1}}}_{{{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}}\left({{\mathcal{N}}}_{v}\right)-1,\)
where \({{\mathbb{1}}}_{{{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}}\left({{\mathcal{N}}}_{v}\right)=0\) if \(| {{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}\cap {{\mathcal{N}}}_{v}| =0\) and 1 otherwise.
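The labeling rule can be illustrated with plain Python sets. The sketch assumes, consistent with a node having between 1 and K motifs in its neighborhood and labels ranging over {0, …, K−1}, that the label equals the number of motifs touching the neighborhood minus one. The function name and the toy node sets are hypothetical.

```python
def motif_count_label(neighborhood, motif_node_sets):
    """Label of a node: (number of motifs sharing a node with its neighborhood) - 1."""
    hits = sum(1 for nodes in motif_node_sets if nodes & neighborhood)
    return hits - 1


# Node sets of two planted motifs, V_{S_1} and V_{S_2}.
motifs = [{0, 1, 2}, {7, 8, 9}]

print(motif_count_label({2, 3, 4}, motifs))  # touches S_1 only   -> 0
print(motif_count_label({2, 3, 7}, motifs))  # touches S_1 and S_2 -> 1
```

Each set intersection plays the role of the indicator function: it is non-empty exactly when the motif and the neighborhood share at least one node.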
Generating node feature vectors
ShapeGGen uses a latent variable model to create node feature vectors and associate them with network structure. This latent variable model is based on the one developed by Guyon^{41} for the MADELON dataset and implemented in Scikit-learn’s make_classification function^{42}. The latent variable model creates n_{i} informative features for each node based on the node’s generated label and also creates noninformative features as noise. Having noninformative/redundant features allows us to evaluate GNN explainers, such as GNNExplainer^{14}, that formulate explanations as node feature masks. More detail on node feature generation is given in Algorithm 1.
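As a rough illustration of how the feature parameters relate to the Scikit-learn generator, the sketch below calls `make_classification` directly with SGBase-like settings. The mapping of n_{f}, n_{i}, K, c_{f}, and s_{f} onto the function's arguments is an assumption made for illustration. Note also that `make_classification` draws its own labels, whereas ShapeGGen conditions features on labels derived from the graph structure, so this is not the generator's actual code path.

```python
from sklearn.datasets import make_classification

# SGBase-like feature settings (assumed mapping onto scikit-learn arguments).
n_f, n_i, K, c_f, s_f = 11, 4, 2, 2, 0.6

X, y = make_classification(
    n_samples=1000,
    n_features=n_f,            # n_f: total feature dimensions
    n_informative=n_i,         # n_i: features correlated with the label
    n_redundant=0,             # remaining columns are pure noise
    n_repeated=0,
    n_classes=K,               # K: number of classes
    n_clusters_per_class=c_f,  # c_f: clusters per class
    class_sep=s_f,             # s_f: class separation factor
    shuffle=False,             # keeps the informative columns first
    random_state=0,
)

# Ground-truth feature mask: 1 for informative columns, 0 for noise columns.
feature_mask = [1] * n_i + [0] * (n_f - n_i)
print(X.shape, feature_mask)
```

Setting `shuffle=False` is what makes the mask trivial to write down, since the informative features then occupy the first n_{i} columns.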
ShapeGGen can generate graphs with both homophilic and heterophilic groundtruth explanations. We control the balance between homophily and heterophily by taking advantage of redundant node features, i.e., features that are not associated with the generated labels, and manipulating them based on a userspecified homophily parameter η. The magnitude of the η parameter determines the degree of homophily/heterophily in the generated graph. The algorithm for node features is given in Algorithm 3. While not every node feature in the feature vector is optimized for homophily/heterophily indication, we experimentally verified that the cosine similarity between node feature vectors produced by Algorithm 3 reveals a strong homophily/heterophily pattern. Finally, ShapeGGen can generate protected features to enable the study of fairness^{1}. By controlling the value assignment for a selected discrete node feature, ShapeGGen introduces bias between the protected feature and node labels. The biased feature is a proxy for a protected feature, such as gender or race. The procedure for generating node features is outlined in NodeFeatureVectors within Algorithm 1.
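The effect of the noise factor φ on the protected feature can be checked numerically. The sketch below assumes binary labels (K = 2) and a binary protected feature that starts as a copy of the label and is flipped with probability φ. The function names are hypothetical, and the Pearson correlation is computed by hand to keep the example dependency-free.

```python
import random


def protected_feature(labels, phi, rng):
    """Copy each binary label, flipping it with probability phi."""
    return [1 - y if rng.random() < phi else y for y in labels]


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(10_000)]
for phi in (0.0, 0.5, 1.0):
    r = pearson(labels, protected_feature(labels, phi, rng))
    print(f"phi={phi}: r={r:.2f}")
```

With φ = 0 the feature is an exact copy of the label and with φ = 1 it is the exact complement, which reproduces the two extreme correlations, while φ = 0.5 yields a correlation close to zero.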
Generating groundtruth explanations
In addition to generating groundtruth labels, ShapeGGen has the unique capability to generate groundtruth explanations. To accommodate diverse types of GNN explainers, every groundtruth explanation in ShapeGGen contains information on a) the importance of nodes, b) the importance of node features, and c) the importance of edges. This information is represented by three distinct masks defined over enclosing subgraphs of nodes \(v\in {{\mathcal{V}}}_{{\mathcal{S}}}\), i.e., the Lhop neighborhood around the node. We denote the enclosing subgraph of node \(v\in {{\mathcal{V}}}_{{\mathcal{S}}}\) for a given GNN predictor with L layers as: \({\rm{Sub}}\left(v;L\right)=\left({{\mathcal{V}}}_{{\rm{Sub}}(v)},{{\mathcal{E}}}_{{\rm{Sub}}(v)},{{\bf{X}}}_{{\rm{Sub}}(v)}\right)\subseteq {\mathcal{G}}\). Let motifs within this enclosing subgraph be: \({{\bf{S}}}_{v}=\left({{\mathcal{V}}}_{{{\bf{S}}}_{v}},{{\mathcal{E}}}_{{{\bf{S}}}_{v}},{{\bf{X}}}_{{{\bf{S}}}_{v}}\right)={\bf{S}}\cap {\rm{Sub}}(v)\). Using this notation, we define groundtruth explanation masks:

a) Node explanation mask. Nodes in Sub(v) are assigned a value of 1 or 0 based on whether or not they are located within a motif. For any node \({v}_{i}\in {{\mathcal{V}}}_{{\rm{Sub}}(v)}\), we set \({{\mathbb{1}}}_{{{\mathcal{V}}}_{{{\bf{S}}}_{v}}}({v}_{i})=1\) if \({v}_{i}\in {{\mathcal{V}}}_{{{\bf{S}}}_{v}}\) and 0 otherwise. This function \({{\mathbb{1}}}_{{{\mathcal{V}}}_{{{\bf{S}}}_{v}}}\) is applied to all nodes in the enclosing subgraph of v to produce an importance score for each node, yielding the final mask as: \({{\bf{M}}}_{n}=\{{{\mathbb{1}}}_{{{\mathcal{V}}}_{{{\bf{S}}}_{v}}}(u)\,| \,u\in {{\mathcal{V}}}_{{\rm{Sub}}(v)}\}\).

b) Node feature explanation mask. Each feature in v’s feature vector is labeled as 1 if it represents an informative feature and 0 if it is a random feature. This procedure produces a binary node feature mask for node v as: \({{\bf{M}}}_{f}\in {\{0,1\}}^{d}\), where d = n_{f} is the number of node features.

c) Edge explanation mask. To each edge \(e=\left({v}_{i},{v}_{j}\right)\in {{\mathcal{E}}}_{{\rm{Sub}}(v)}\) we assign a value of either 0 or 1 based on whether e belongs to a motif within Sub(v). The masking function is defined as follows: \({{\mathbb{1}}}_{{{\mathcal{E}}}_{{{\bf{S}}}_{v}}}(e)=1\) if \(e\in {{\mathcal{E}}}_{{{\bf{S}}}_{v}}\) and 0 otherwise.
This function \({{\mathbb{1}}}_{{{\mathcal{E}}}_{{{\bf{S}}}_{v}}}\) is applied to all edges \(e\in {{\mathcal{E}}}_{{\rm{Sub}}(v)}\) to produce the groundtruth edge explanation as: \({{\bf{M}}}_{e}=\{{{\mathbb{1}}}_{{{\mathcal{E}}}_{{{\bf{S}}}_{v}}}(e)\,| \,e\in {{\mathcal{E}}}_{{\rm{Sub}}(v)}\}\). The procedure to generate these groundtruth explanations is thoroughly described in Algorithm 1.
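The three masks can be computed for a toy graph as follows. This is a minimal sketch under stated assumptions, not the GraphXAI implementation: the graph is an adjacency dictionary, the enclosing subgraph is found by breadth-first search to depth L, an edge is treated as important when it is one of the planted motif edges, and all function and argument names are hypothetical.

```python
from collections import deque


def l_hop_nodes(adj, v, L):
    """Nodes within L hops of v, i.e. the node set of Sub(v; L)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == L:
            continue  # do not expand beyond the receptive field
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)


def ground_truth_masks(adj, v, L, motif_nodes, motif_edges, informative_idx, n_f):
    """Return (node mask, edge mask, feature mask) for the enclosing subgraph of v."""
    sub_nodes = l_hop_nodes(adj, v, L)
    # Undirected edges of the enclosing subgraph, stored as sorted pairs.
    sub_edges = sorted({tuple(sorted((u, w)))
                        for u in sub_nodes for w in adj[u] if w in sub_nodes})
    node_mask = {u: int(u in motif_nodes) for u in sorted(sub_nodes)}
    edge_mask = {e: int(e in motif_edges) for e in sub_edges}
    feat_mask = [int(j in informative_idx) for j in range(n_f)]
    return node_mask, edge_mask, feat_mask


# Toy graph: a path 0-1-2 attached to a triangle motif on nodes {2, 3, 4}.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2, 4], 4: [2, 3]}
node_mask, edge_mask, feat_mask = ground_truth_masks(
    adj, v=1, L=2,
    motif_nodes={2, 3, 4},
    motif_edges={(2, 3), (2, 4), (3, 4)},
    informative_idx={0, 1}, n_f=4,
)
print(node_mask)
print(edge_mask)
print(feat_mask)
```

Keeping the masks keyed by node and edge identifiers, rather than by position, makes it straightforward to compare them against an explainer's output on the same enclosing subgraph.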
Datasets in GraphXAI
We proceed with a detailed description of synthetic and realworld graph data resources included in GraphXAI.
Synthetic graphs
The ShapeGGen generator outlined in the Methods section is a dataset generator that can be used to generate any number of userspecified graphs. In GraphXAI, we provide a collection of pregenerated XAIready graphs with diverse properties that are readily available for analysis and experimentation.
Base ShapeGGen graphs (SGBase)
We provide a base version of ShapeGGen graphs. This instance of ShapeGGen is homophilic, large, and contains houseshaped motifs for groundtruth explanations, formally described as \({\mathcal{G}}\) = ShapeGGen (\({\mathcal{S}}\) = ‘house’, N_{s} = 1200, p = 0.006, n_{s} = 11, K = 2, n_{f} = 11, n_{i} = 4, s_{f} = 0.6, c_{f} = 2, ϕ = 0.5, η = 1, L = 3). The node features in this graph exhibit homophily, a property commonly found in social networks. With over 10,000 nodes, this graph also provides enough examples of groundtruth explanations for rigorous statistical evaluation of explainer performance. The houseshaped motifs follow one of the earliest synthetic graphs used to evaluate GNN explainers^{14}.
Homophilic and heterophilic ShapeGGen graphs
GNN explainers are typically evaluated on homophilic graphs^{1,43,44,45}, as low homophily levels in graphs can degrade the performance of GNN predictors^{46,47}. As a result, there are no heterophilic graphs with groundtruth explanations in current GNN XAI literature despite such graphs being abundant in realworld applications^{46}. To demonstrate the flexibility of the ShapeGGen data generator, we use it to generate graphs with: i) homophilic groundtruth explanations (SGBase) and ii) heterophilic groundtruth explanations (SGHeterophilic), i.e., \({\mathcal{G}}\) = ShapeGGen (\({\mathcal{S}}\) = ‘house’, N_{s} = 1200, p = 0.006, n_{s} = 11, K = 2, n_{f} = 11, n_{i} = 4, s_{f} = 0.6, c_{f} = 2, ϕ = 0.5, η = −1, L = 3).
Weakly and strongly unfair ShapeGGen graphs
We use the ShapeGGen data generator to generate graphs with controllable fairness properties, i.e., synthetic graphs with realworld fairness properties in which we can enforce unfairness w.r.t. a given protected attribute. We generate graphs with: i) weaklyunfair groundtruth explanations (SGBase) and ii) stronglyunfair groundtruth explanations (SGUnfair), i.e., \({\mathcal{G}}\) = ShapeGGen (\({\mathcal{S}}\) = ‘house’, N_{s} = 1200, p = 0.006, n_{s} = 11, K = 2, n_{f} = 11, n_{i} = 4, s_{f} = 0.6, c_{f} = 2, ϕ = 0.75, η = 1, L = 3). Here, for the first time, we generate unfair synthetic graphs that can serve as pseudogroundtruth for quantifying whether current GNN explainers preserve counterfactual fairness.
Small and large ShapeGGen explanations
We explore the faithfulness of explanations w.r.t. different groundtruth explanation sizes. This is important because a reliable explanation should identify important features independent of the explanation size. However, current data resources only provide graphs with small groundtruth explanations. Here, we use the ShapeGGen data generator to generate graphs with (i) smaller groundtruth explanations (SGSmallEx), \({\mathcal{G}}\) = ShapeGGen (\({\mathcal{S}}\) = ‘triangle’, N_{s} = 1200, p = 0.006, n_{s} = 12, K = 2, n_{f} = 11, n_{i} = 4, s_{f} = 0.5, c_{f} = 2, ϕ = 0.5, η = 1, L = 3), and (ii) larger groundtruth explanations using house motifs.
Low and high proportions of salient features
We examine the faithfulness of node feature masks produced by explainers under different levels of sparsity for classassociated signal in the node features. The feature generation procedure in ShapeGGen specifies node feature parameters n_{i}, the number of informative features that are generated to correlate with node labels, and n_{f}, number of total node features. The remaining n_{f}−n_{i} features are redundant features that are randomly distributed and have no correlation to the node label. Using an equivalent graph topology to SGBase, we change the relative proportion of node features which are attributed to the label by adjusting n_{i} and n_{f}. We create SGMoreInform, a ShapeGGen instance with a high proportion of groundtruth features to total features (8:11). Likewise, we create SGLessInform, a ShapeGGen instance with a low proportion of groundtruth features to total features (4:21). This proportion in SGBase falls in the middle of SGMoreInform and SGLessInform instances with a proportion of groundtruth to total features of 4:11. Formally, we define SGMoreInform as \({\mathcal{G}}\) = ShapeGGen (\({\mathcal{S}}\) = ‘house’, N_{s} = 1200, p = 0.006, n_{s} = 11, K = 2, n_{f} = 11, n_{i} = 8, s_{f} = 0.6, c_{f} = 2, ϕ = 0.5, η = 1, L = 3) and SGLessInform as \({\mathcal{G}}\) = ShapeGGen (\({\mathcal{S}}\) = ‘house’, N_{s} = 1200, p = 0.006, n_{s} = 11, K = 2, n_{f} = 21, n_{i} = 4, s_{f} = 0.6, c_{f} = 2, ϕ = 0.5, η = 1, L = 3).
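The pregenerated datasets described in this section differ from SGBase in only a few parameters, which can be summarized as overrides on a shared base configuration. The dictionary keys and the helper `settings` below are hypothetical names chosen for illustration; the values are copied from the formal definitions given in the text.

```python
# SGBase parameters as stated in the text (key names are illustrative).
SG_BASE = dict(motif='house', N_s=1200, p=0.006, n_s=11, K=2, n_f=11,
               n_i=4, s_f=0.6, c_f=2, phi=0.5, eta=1, L=3)

# Each variant lists only the parameters that differ from SGBase.
VARIANTS = {
    'SGBase': {},
    'SGHeterophilic': {'eta': -1},
    'SGUnfair': {'phi': 0.75},
    'SGSmallEx': {'motif': 'triangle', 'n_s': 12, 's_f': 0.5},
    'SGMoreInform': {'n_i': 8},
    'SGLessInform': {'n_f': 21},
}


def settings(name):
    """Full parameter dictionary for a named dataset variant."""
    return {**SG_BASE, **VARIANTS[name]}


for name in VARIANTS:
    cfg = settings(name)
    # Proportion of ground-truth (informative) features to total features.
    print(f"{name}: informative/total = {cfg['n_i']}:{cfg['n_f']}")
```

Expressing each dataset as a difference from the base makes it easy to see which single property an experiment isolates, for example the 8:11, 4:11, and 4:21 feature proportions.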
BAShapes
In addition to ShapeGGen, we provide a version of BAShapes^{14}, a synthetic graph data generator for node classification tasks. We start with a base Barabási-Albert (BA)^{48} graph with N nodes and a set of K five-node “house”-structured motifs randomly attached to nodes of the base graph. The final graph is perturbed by adding random edges. The nodes in the output graph are categorized into two classes corresponding to whether the node is in a house (1) or not in a house (0).
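The construction can be sketched with the standard library alone. This is an illustrative approximation rather than the GraphXAI implementation: the function `ba_shapes`, its arguments, the seed clique used to start the Barabási-Albert growth, and the choice of which house node is wired to the base graph are all assumptions.

```python
import random

# Five-node 'house': a square (0-1-2-3) with a roof node (4).
HOUSE = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 4), (1, 4)]


def ba_shapes(n_base, n_houses, m=2, n_random_edges=0, seed=0):
    """Return (sorted edge list, node labels) for a BA graph with attached houses."""
    rng = random.Random(seed)
    # Seed clique on m + 1 nodes, then grow by preferential attachment.
    edges = {(u, w) for u in range(m + 1) for w in range(u + 1, m + 1)}
    endpoints = [u for e in sorted(edges) for u in e]
    for new in range(m + 1, n_base):
        targets = set()
        while len(targets) < m:  # m distinct, degree-weighted targets
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.add((t, new))
            endpoints.extend([t, new])
    labels = [0] * n_base  # base nodes are not in a house
    n_nodes = n_base
    # Attach each house to a randomly chosen base node via one bottom corner.
    for _ in range(n_houses):
        offset = n_nodes
        edges.update((u + offset, w + offset) for u, w in HOUSE)
        edges.add((rng.randrange(n_base), offset + 2))
        labels.extend([1] * 5)  # house nodes
        n_nodes += 5
    # Perturb the structure with extra random edges.
    added = 0
    while added < n_random_edges:
        u, w = rng.sample(range(n_nodes), 2)
        e = (min(u, w), max(u, w))
        if e not in edges:
            edges.add(e)
            added += 1
    return sorted(edges), labels


edges, labels = ba_shapes(n_base=20, n_houses=3, m=2, n_random_edges=5)
print(len(labels), sum(labels), len(edges))
```

All edges are stored as sorted pairs so that the duplicate check for random edges treats the graph as undirected.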
Realworld graphs
We describe the realworld graph datasets with and without groundtruth explanations provided in GraphXAI. To this end, we provide data resources from crime forecasting, financial lending, and molecular chemistry and biology^{1,2,35}. We consider these datasets as they are used to train GNNs for generating predictions in highstakes downstream applications. In particular, we include chemical and biological datasets because they are used to identify whether an input graph (i.e., a molecular graph) contains a specific pattern (i.e., a chemical group with a specific property in the molecule). Knowledge about such patterns, which determine molecular properties, represents groundtruth explanations^{2}. We provide a statistical description of realworld graphs in Tables 6–8. Below, we discuss the details of each of the realworld datasets that we employ:
MUTAG
The MUTAG^{35} dataset contains 1,768 molecular graphs labeled into two classes according to their mutagenic properties, i.e., their effect on the Gram-negative bacterium S. typhimurium. Kazius et al.^{35} identify several toxicophores (motifs in the molecular graph) that correlate with mutagenicity. The dataset is trimmed from its original 4,337 graphs to 1,768, based on those whose labels directly correspond to the presence or absence of our chosen toxicophores: NH_{2}, NO_{2}, aliphatic halide, nitroso, and azo-type (terminology as used in Kazius et al.^{35}).
Alkane carbonyl
The Alkane Carbonyl^{2} dataset contains 1,125 molecular graphs labeled into two classes where a positive sample indicates a molecule that contains an unbranched alkane and a carbonyl (C = O) functional group. The groundtruth explanations include any combinations of alkane and carbonyl functional groups within a given molecule.
Benzene
The Benzene^{2} dataset contains 12,000 molecular graphs extracted from the ZINC15^{49} database and labeled into two classes where the task is to identify whether a given molecule has a benzene ring or not. Naturally, the ground truth explanations are the nodes (atoms) comprising the benzene rings, and in the case of multiple benzenes, each of these benzene rings forms an explanation.
Fluoride carbonyl
The Fluoride Carbonyl^{2} dataset contains 8,671 molecular graphs labeled into two classes where a positive sample indicates a molecule that contains a fluoride (F^{−}) and a carbonyl (C = O) functional group. The groundtruth explanations consist of combinations of fluoride atoms and carbonyl functional groups within a given molecule.
German credit
The German Credit^{1} graph dataset contains 1,000 nodes representing clients in a German bank connected based on the similarity of their credit accounts. The dataset includes demographic and financial features like gender, residence, age, marital status, loan amount, credit history, and loan duration. The goal is to associate clients with credit risk.
Recidivism
The Recidivism^{1} dataset includes samples of bail outcomes collected from multiple state courts in the USA between 1990 and 2009. It contains past criminal records, demographic attributes, and other details of 18,876 defendants (nodes) who were released on bail at U.S. state courts. Defendants are connected based on the similarity of past criminal records and demographics, and the goal is to classify defendants into bail vs. no bail.
Credit defaulter
The Credit defaulter^{1} graph has 30,000 nodes representing individuals that we connected based on the similarity of their spending and payment patterns. The dataset contains applicant features like education, credit history, age, and features derived from their spending and payment patterns. The task is to predict whether an applicant will default on an upcoming credit card payment.
Data availability
The GraphXAI data resource^{28} is hosted on Harvard Dataverse under a persistent identifier https://doi.org/10.7910/DVN/KULOS8. We have deposited a number of different ShapeGGen-generated datasets and realworld graphs at this repository.
Code availability
The project website for GraphXAI is at https://zitniklab.hms.harvard.edu/projects/GraphXAI. The code to reproduce results, documentation, and tutorials are available in GraphXAI’s GitHub repository at https://github.com/mimsharvard/GraphXAI. The repository contains Python scripts to generate and evaluate explanations using performance metrics and to visualize explanations. In addition, the repository contains information and Python scripts to build new versions of GraphXAI as the underlying primary resources get updated and new data become available.
References
Agarwal, C., Lakkaraju, H. & Zitnik, M. Towards a unified framework for fair and stable graph representation learning. In UAI (2021).
SanchezLengeling et al. Evaluating attribution for graph neural networks. NeurIPS (2020).
Giunchiglia, V., Shukla, C. V., Gonzalez, G. & Agarwal, C. Towards training GNNs using explanation directed message passing. In The First Learning on Graphs Conference (2022).
Morselli Gysi, D. et al. Network medicine framework for identifying drugrepurposing opportunities for covid19. Proceedings of the National Academy of Sciences (2021).
Zitnik, M., Agrawal, M. & Leskovec, J. Modeling polypharmacy side effects with graph convolutional networks. In Bioinformatics (2018).
Baldassarre, F. & Azizpour, H. Explainability techniques for graph convolutional networks. In ICML Workshop on LRGR (2019).
Faber, L. et al. Contrastive graph neural network explanation. In ICML Workshop on Graph Representation Learning and Beyond (2020).
Huang, Q., Yamada, M., Tian, Y., Singh, D. & Chang, Y. Graphlime: Local interpretable model explanations for graph neural networks. IEEE Transactions on Knowledge and Data Engineering (2022).
Lucic, A., Ter Hoeve, M. A., Tolomei, G., De Rijke, M. & Silvestri, F. Cfgnnexplainer: Counterfactual explanations for graph neural networks. In AISTATS (PMLR, 2022).
Luo, D. et al. Parameterized explainer for graph neural network. In NeurIPS (2020).
Pope, P. E., Kolouri, S., Rostami, M., Martin, C. E. & Hoffmann, H. Explainability methods for graph convolutional neural networks. In CVPR (2019).
Schlichtkrull, M. S., De Cao, N. & Titov, I. Interpreting graph neural networks for nlp with differentiable edge masking. In ICLR (2021).
Vu, M. N. & Thai, M. T. PGMExplainer: probabilistic graphical model explanations for graph neural networks. In NeurIPS (2020).
Ying, R., Bourgeois, D., You, J., Zitnik, M. & Leskovec, J. GNNExplainer: generating explanations for graph neural networks. In NeurIPS (2019).
Agarwal, C. et al. Probing GNN explainers: A rigorous theoretical and empirical analysis of GNN explanation methods. In AISTATS (2022).
Faber, L., K. Moghaddam, A. & Wattenhofer, R. When comparing to ground truth is wrong: On evaluating GNN explanation methods. In KDD (2021).
Hu, W. et al. Open Graph Benchmark: datasets for machine learning on graphs. In NeurIPS (2020).
Baruah, T. et al. GNNMark: a benchmark suite to characterize graph neural network training on gpus. In ISPASS (2021).
Du, Y. et al. GraphGT: machine learning datasets for graph generation and transformation. In NeurIPS Datasets and Benchmarks (2021).
Freitas, S. et al. A largescale database for graph representation learning. In NeurIPS Datasets and Benchmarks (2021).
Zheng, Q. et al. Graph robustness benchmark: Benchmarking the adversarial robustness of graph machine learning. In NeurIPS Datasets and Benchmarks (2021).
Huang, K. et al. Therapeutics Data Commons: Machine learning datasets and tasks for drug discovery and development. In NeurIPS Datasets and Benchmarks (2021).
Huang, K. et al. Artificial intelligence foundation for therapeutic science. Nature Chemical Biology 18, 1033–1036 (2022).
Wang, Z., Yin, H. & Song, Y. Benchmarking the combinatorial generalizability of complex query answering on knowledge graphs. In NeurIPS Datasets and Benchmarks (2021).
Liu, M. et al. DIG: A turnkey library for diving into graph deep learning research. JMLR (2021).
Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. ICLR 2019 (RLGM Workshop) (2019).
Wang, M. et al. Deep Graph Library: Towards efficient and scalable deep learning on graphs. In ICLR workshop on representation learning on graphs and manifolds (2019).
Agarwal, C., Queen, O., Lakkaraju, H. & Zitnik, M. Evaluating explainability for graph neural networks. Harvard Dataverse https://doi.org/10.7910/DVN/KULOS8 (2022).
Simonyan, K. et al. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR (2014).
Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In ICML (2017).
Yuan, H., Yu, H., Wang, J., Li, K. & Ji, S. On explainability of graph neural networks via subgraph explorations. In ICML (2021).
Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? In ICLR (2019).
Kipf, T. N. & Welling, M. Semisupervised classification with graph convolutional networks. In ICLR (2017).
Taha, A. A. & Hanbury, A. Metrics for evaluating 3d medical image segmentation: analysis, selection, and tool. In BMC Medical Imaging (2015).
Kazius, J. et al. Derivation and validation of toxicophores for mutagenicity prediction. In Journal of Medicinal Chemistry (2005).
Yuan, H., Yu, H., Gui, S. & Ji, S. Explainability in graph neural networks: A taxonomic survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
Hunter, J. D. Matplotlib: A 2d graphics environment. Computing in Science & Engineering (2007).
Hagberg, A. A., Schult, D. A. & Swart, P. J. Exploring network structure, dynamics, and function using networkx. In Proceedings of the 7th Python in Science Conference (2008).
Wu, Z. et al. A comprehensive survey on graph neural networks. In IEEE Transactions on Neural Networks and Learning Systems (2020).
Barabási, A.L. & Albert, R. Emergence of scaling in random networks. In Science (1999).
Guyon, I. Design of experiments of the nips 2003 variable selection benchmark. In NIPS 2003 workshop on feature extraction and feature selection (2003).
Pedregosa, F. et al. Scikitlearn: Machine learning in Python. In JMLR (2011).
McCallum, A. K. et al. Automating the construction of internet portals with machine learning. In Information Retrieval (2000).
Sen, P. et al. Collective classification in network data. In AI magazine (2008).
Wang, K. et al. Microsoft academic graph: When experts are not enough. In Quantitative Science Studies (2020).
Zhu, J. et al. Beyond homophily in graph neural networks: Current limitations and effective designs. In NeurIPS (2020).
Jin, W. et al. Node similarity preserving graph convolutional networks. In WSDM (2021).
Albert, R. & Barabási, A.L. Statistical mechanics of complex networks. Reviews of modern physics (2002).
Sterling, T. & Irwin, J. J. Zinc 15–ligand discovery for everyone. Journal of Chemical Information and Modeling (2015).
Acknowledgements
C.A., O.Q., and M.Z. gratefully acknowledge the support by NSF under Nos. IIS2030459 and IIS2033384, US Air Force Contract No. FA870215D0001, and awards from Harvard Data Science Initiative, Amazon Research, Bayer Early Excellence in Science, AstraZeneca Research, and Roche Alliance with Distinguished Scientists. H.L. was supported in part by NSF under Nos IIS2008461 and IIS2040989, and research awards from Google, JP Morgan, Amazon, Bayer, Harvard Data Science Initiative, and D3 Institute at Harvard. O.Q. was supported, in part, by Harvard Summer Institute in Biomedical Informatics. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funders.
Author information
Contributions
C.A., O.Q., H.L. and M.Z. contributed new analytic tools and wrote the manuscript. C.A. and O.Q. retrieved, processed, and harmonized datasets. C.A. and O.Q. implemented the synthetic dataset generator and performed the analyses for technical validation of the new resource. M.Z. conceived the study.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Agarwal, C., Queen, O., Lakkaraju, H. et al. Evaluating explainability for graph neural networks. Sci Data 10, 144 (2023). https://doi.org/10.1038/s4159702301974x
DOI: https://doi.org/10.1038/s4159702301974x