As graph neural networks (GNNs) are being increasingly used for learning representations of graph-structured data in high-stakes applications, such as criminal justice1, molecular chemistry2,3, and biological networks4,5, it becomes critical to ensure that the relevant stakeholders can understand and trust their functionality. To this end, previous work developed several methods to explain predictions made by GNNs6,7,8,9,10,11,12,13,14. With the increase in newly proposed GNN explanation methods, it is critical to ensure their reliability. However, explainability in graph machine learning is an emerging area lacking standardized evaluation strategies and reliable data resources to evaluate, test, and compare GNN explanations15. While several works have acknowledged this difficulty, they tend to base their analysis on specific real-world2 and synthetic16 datasets with limited ground-truth explanations. In addition, relying on these datasets and associated ground-truth explanations is insufficient as they are not indicative of diverse real-world applications15. To this end, developing a broader ecosystem of data resources for benchmarking state-of-the-art GNN explainers can support explainability research in GNNs.

A comprehensive data resource to correctly evaluate the quality of GNN explanations will ensure their reliability in high-stake applications. However, the evaluation of GNN explanations is a growing research area with relatively little work, where existing approaches mainly leverage ground-truth explanations associated with specific datasets2 and are prone to several pitfalls (as outlined by Faber et al.16). Further, multiple underlying rationales can generate the correct class labels, creating redundant or non-unique explanations. A trained GNN model may only capture one or an entirely different rationale. In such cases, evaluating the explanation output by a state-of-the-art method using the ground-truth explanation is incorrect because the underlying GNN model does not rely on that ground-truth explanation. In addition, even if a unique ground-truth explanation generates the correct class label, the GNN model trained on the data could be a weak predictor using an entirely different rationale for prediction. Therefore, the ground-truth explanation cannot be used to assess post hoc explanations of such models. Lastly, the ground-truth explanations corresponding to some of the existing benchmark datasets are not good candidates for reliably evaluating explanations as they can be recovered using trivial baselines (e.g., random node or edge as explanation). The above discussion highlights a clear need for general-purpose data resources which can evaluate post hoc explanations reliably across diverse real-world applications. While various benchmark datasets (e.g., Open Graph Benchmark (OGB)17, GNNMark18, GraphGT19, MalNet20, Graph Robustness Benchmark (GRB)21, Therapeutics Data Commons22,23, and EFO-1-QA24) and programming libraries for deep learning on graphs (e.g., Dive Into Graphs (DIG)25, Pytorch Geometric (PyG)26, and Deep Graph Library (DGL)27) in graph machine learning literature exist, they are mainly used to only benchmark the performance of GNN predictors and are not suited to evaluate the correctness of GNN explainers because they do not capture ground-truth explanations.

Here, we address the above challenges by introducing a general-purpose data resource that is not prone to ground-truth pitfalls (e.g., redundant explanations, weak GNN predictors, trivial explanations) and can cater to diverse real-world applications. To this end, we present ShapeGGen (Fig. 2), a novel and flexible synthetic XAI-ready (explainable artificial intelligence ready) dataset generator which can automatically generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. ShapeGGen ensures that the generated ground-truth explanations are not prone to the pitfalls described in Faber et al.16, such as redundant explanations, weak GNN predictors, and trivially correct explanations. Furthermore, ShapeGGen can evaluate the goodness of any explanation (e.g., node feature-based, node-based, edge-based) across diverse real-world applications by seamlessly generating synthetic datasets that can mimic the properties of real-world data in various domains.

We incorporate ShapeGGen and several other synthetic and real-world graphs28 into GraphXAI, a general-purpose framework for benchmarking GNN explainers. GraphXAI also provides a broader ecosystem (Fig. 1) of data loaders, data processing functions, visualizers, and a set of evaluation metrics (e.g., accuracy, faithfulness, stability, fairness) to reliably benchmark the quality of any given GNN explanation (node feature-based or node/edge-based). We leverage various synthetic and real-world datasets and evaluation metrics from GraphXAI to empirically assess the quality of explanations output by eight state-of-the-art GNN explanation methods. Across many GNN explainers, graphs, and graph tasks, we observe that state-of-the-art GNN explainers fail on graphs with large ground-truth explanations (i.e., explanation subgraphs with a higher number of nodes and edges) and cannot produce explanations that preserve fairness properties of underlying GNN predictors.

Fig. 1
figure 1

Overview of GraphXAI. GraphXAI provides data loader classes for XAI-ready synthetic and real-world graph datasets with ground-truth explanations for evaluating GNN explainers, implementations of explanation methods, visualization functions for GNN explainers, utility functions to support new GNN explainers, and a diverse set of performance metrics to evaluate the reliability of explanations generated by GNN explainers.

Fig. 2
figure 2

Overview of ShapeGGen graph dataset generator. ShapeGGen is a novel dataset generator for graph-structured data that can be used to benchmark graph explainability methods using ground-truth explanations. Graphs are created by combining subgraphs containing any given motif and additional nodes. The number of motifs in a k-hop neighborhood determines the node label (in the figure, we use a 1-hop neighborhood for labeling, and nodes with two motifs in its 1-hop neighborhood are highlighted in red). Feature explanations are some masks over important node features (green striped), with an option to add a protected feature (shown in purple) whose correlation to node labels is controllable. Node explanations are nodes contained in the motifs (horizontal striped nodes) and edge explanations (bold lines) are edges connecting nodes within motifs.


To evaluate GraphXAI, we show how GraphXAI enables systematic benchmarking of eight state-of-the-art GNN explainers on both ShapeGGen (in the Methods section) and real-world graph datasets. We explore the utility of the ShapeGGen generator to benchmark GNN explainers on graphs with homophilic vs. heterophilic, small vs. large, and fair vs. unfair ground-truth explanations. Additionally, we examine the utility of GNN explanations on datasets with varying degrees of informative node features. Next, we outline the experimental setup, including details about performance metrics, GNN explainers, and underlying GNN predictors, and proceed with a discussion of benchmarking results.

Experimental setup

GNN explainers

The GraphXAI defines an Explanation class capable of storing multiple types of explanations produced by GNN explainers and provides a graphxai.BaseExplainer class that serves as the base for all explanation methods in GraphXAI. We incorporate eight GNN explainability methods, including gradient-based: Grad29, GradCAM11, GuidedBP6, Integrated Gradients30; perturbation-based: GNNExplainer14, PGExplainer10, SubgraphX31; and surrogate-based methods: PGMExplainer13. Finally, following Agarwal et al.15, we consider random explanations as a reference: (1) Random Node Features, a node feature mask defined by an d-dimensional Gaussian distributed vector; (2) Random Nodes, a 1 × n node mask is randomly sampled using a uniform distribution, where n is the number of nodes in the enclosing subgraph; and (3) Random Edges, an edge mask drawn from a uniform distribution over a node’s incident edges.

Implementation details

We use a three-layer GIN model32 and a GCN model33 as GNN predictors for our experiments. We use a model comprising three GIN convolution layers with ReLU non-linear activation function and a Softmax activation for the final layer. The hidden dimensionality of the layers is set to 16. We follow an established approach for generating explanations8,15 and use reference algorithm implementations. We select top-k (k = 25%) important nodes, node features, or edges and use them to generate explanations for all graph explainability methods. For training GIN models, we use an Adam optimizer with a learning rate of 1 × 10−2, weight decay of 1 × 10−5, and the number of epochs to 1000. We use an Adam optimizer with a learning rate of 3 × 10−2, no weight decay, and 1500 training epochs for training GNN models. We set hyperparameters of GNN explainability models following the authors’ recommendations.

We use a fixed random split provided within the GraphXAI package to split the datasets. For each ShapeGGen dataset, we use a 70/5/25 split for training, validation, and testing, respectively. For MUTAG, Benzene, and Fluoride Carbonyl datasets, we use a 70/10/20 split throughout each dataset. Average performance is reported across each sample in the testing set of each dataset.

Performance metrics

In addition to the synthetic and real-world data resources, we consider four broad categories of performance metrics: (i) Graph Explanation Accuracy (GEA); (ii) Graph Explanation Faithfulness (GEF); (iii) Graph Explanation Stability (GES); and (iv) Graph Explanation Fairness (GECF, GEGF) to evaluate the explanations on the respective datasets. In particular, all evaluation metrics leverage predicted explanations, ground-truth explanations, and other user-controlled parameters, such as top-k features. Our GraphXAI package implements these performance metrics and additional utility functions within graphxai.metrics module. Figure 3 shows a code snippet for evaluating the correctness of output explanations for a given GNN prediction in GraphXAI.

Fig. 3
figure 3

Example use case of the GraphXAI package. An example of explaining a prediction in the GraphXAI package. With just a few lines of code, one can calculate an explanation for a node or graph, calculate metrics based on that explanation, and visualize the explanation.

Graph explanation accuracy (GEA)

We report graph explanation accuracy as an evaluation strategy that measures an explanation’s correctness using the ground-truth explanation Mg. In ground-truth and predicted explanation masks, every node, node feature, or edge belongs to {0, 1}, where ‘0’ means that an attribute is unimportant and ‘1’ means that it is important for the model prediction. To measure accuracy, we use Jaccard index34 between the ground-truth Mg and predicted Mp:


where TP denotes true positives, FP denotes false positives, and FN indicates false negatives. Most synthetic- and real-world graphs have multiple ground-truth explanations. For example, in the MUTAG dataset35, carbon rings with both NH2 or NO2 chemical groups are valid explanations for the GNN model to recognize a given molecule as mutagenic. For this reason, the accuracy metric must account for the possibility of multiple equally valid explanations existing for any given prediction. Hence, we define ζ as a set of all possible ground-truth explanations, where |ζ| = 1 for graphs having a unique explanation. Therefore, we calculate GEA as:

$${\rm{GEA}}(\zeta ,{{\bf{M}}}^{p})=\max \;{\rm{JAC}}({{\bf{M}}}^{g},{{\bf{M}}}^{p})\;\forall {{\bf{M}}}^{g}\in \zeta .$$

Here, we can calculate GEA using predicted node feature, node, or edge explanation masks. Finally, Eq. 1 quantifies the accuracy between the ground-truth and predicted explanation masks. Higher values mean a predicted explanation is more likely to match the ground-truth explanation.

Graph explanation faithfulness (GEF)

We extend existing faithfulness metrics2,15 to quantify how faithful explanations are to an underlying GNN predictor. In particular, we obtain the prediction probability vector \({\widehat{y}}_{u}\) using the GNN, i.e., \({\widehat{y}}_{u}=f({{\mathcal{S}}}_{u})\), and using the explanation, i.e., \({\widehat{y}}_{u{\prime} }=f({{\mathcal{S}}}_{u{\prime} })\), where we generate a masked subgraph \({{\mathcal{S}}}_{u{\prime} }\) by only keeping the original values of the top-k features identified by an explanation, and get their respective predictions \({\widehat{y}}_{u{\prime} }\). Finally, we compute the graph explanation unfaithfulness metric as:

$${\rm{GEF}}(f,{{\mathcal{S}}}_{u},{{\mathcal{S}}}_{u{\prime} })=1-{{\rm{\exp }}}^{-{\rm{KL}}(f({{\mathcal{S}}}_{u})| | f({{\mathcal{S}}}_{u{\prime} }))},$$

where Kullback-Leibler (KL) divergence score quantifies the distance between two probability distributions, and the “||” operator indicates statistical divergence measure. Note that Eq. 3 is a measure of the unfaithfulness of the explanation. So, higher values indicate a higher degree of unfaithfulness.

Graph explanation stability (GES)

Formally, an explanation is defined to be stable if the explanation for a given graph and its perturbed counterpart (generated by making infinitesimally small perturbations to the node feature vector and associated edges) are similar15,36. We measure graph explanation stability w.r.t. the changes in the model behavior. In addition to similar output labels for \({{\mathcal{S}}}_{u}\) and the perturbed \({{\mathcal{S}}}_{u{\prime} }\), we employ a second level of check where the difference between model behaviors for \({{\mathcal{S}}}_{u}\) and \({{\mathcal{S}}}_{u{\prime} }\) is bounded by an infinitesimal constant \(\delta \), i.e., \(| | {{\mathcal{L}}}_{{{\mathcal{S}}}_{u}}-{{\mathcal{L}}}_{{{\mathcal{S}}}_{u{\prime} }}| {| }_{p}\le \delta \). Here, \({\mathcal{L}}(\cdot )\) refers to any form of model knowledge like output logits \({\widehat{y}}_{u}\) or embeddings \({{\bf{z}}}_{u}\). We compute graph explanation instability as:

$${\rm{GES}}({{\bf{M}}}_{{{\mathcal{S}}}_{u}}^{p},{{\bf{M}}}_{{{\mathcal{S}}}_{u{\prime} }}^{p})={\rm{m}}{\rm{a}}{\rm{x}}\;D({{\bf{M}}}_{{{\mathcal{S}}}_{u}}^{p},{{\bf{M}}}_{{{\mathcal{S}}}_{u{\prime} }}^{p}),\quad \forall {{\mathcal{S}}}_{u{\prime} }\in \beta ({{\mathcal{S}}}_{u})$$

where D represents the cosine distance metric, \({{\bf{M}}}_{{{\mathcal{S}}}_{u}}^{p}\) and \({{\bf{M}}}_{{{\mathcal{S}}}_{u{\prime} }}^{p}\) are the predicted explanation masks for \({{\mathcal{S}}}_{u}\) and \({{\mathcal{S}}}_{u{\prime} }\), and \(\beta \) represents a \(\delta \)-radius ball around \({{\mathcal{S}}}_{u}\) for which the model behavior is same. Eq. 4 is a measure of instability, and higher values indicate a higher degree of instability.

Counterfactual fairness mismatch

To measure counterfactual fairness15, we verify if the explanations corresponding to \({{\mathcal{S}}}_{u}\) and its counterfactual counterpart (where the protected node feature is modified) are similar (dissimilar) if the underlying model predictions are similar (dissimilar). We calculate counterfactual fairness mismatch as:


where, Mp and \({{\bf{M}}}_{s}^{p}\) are the predicted explanation mask for \({{\mathcal{S}}}_{u}\) and for the counterfactual counterpart of \({{\mathcal{S}}}_{u}\). Note that we expect a lower GECF score for graphs having weakly-unfair ground-truth explanations because explanations are similar for both original and counterfactual graphs, whereas, for graphs with strongly-unfair ground-truth explanations, we expect a higher GECF score as explanations change when we modify the protected attribute.

Group fairness mismatch

We measure group fairness mismatch15 as:

$${\rm{GEGF}}({\widehat{{\bf{y}}}}_{{\mathcal{K}}},{\widehat{{\bf{y}}}}_{{\mathcal{K}}}^{{{\bf{E}}}_{u}})=| {\rm{SP}}({\widehat{{\bf{y}}}}_{{\mathcal{K}}})-{\rm{SP}}({\widehat{{\bf{y}}}}_{{\mathcal{K}}}^{{{\bf{E}}}_{u}})| ,$$

where \({\widehat{{\bf{y}}}}_{{\mathcal{K}}}\) and \({\widehat{{\bf{y}}}}_{{\mathcal{K}}}^{{{\bf{E}}}_{u}}\) are predictions for a set of \({\mathcal{K}}\) graphs using the original and the essential features identified by an explanation, respectively, and SP is the statistical parity. Finally, Eq. 6 is a measure of group fairness mismatch of an explanation where higher values indicate that the explanation is not preserving group fairness.

Evaluation and analysis of GNN explainability methods

Next, we discuss experimental results that answer critical questions concerning synthetic and real-world graphs and different ground-truth explanations.

Benchmarking GNN explainers on synthetic and real-world graphs

We evaluate the performance of GNN explainers on a collection of synthetically generated graphs with various properties and molecular datasets using metrics described in the experimental setup. Results in Tables 15 show that, while no explanation method performs well across all properties, across different ShapeGGen node classification datasets (Table 6), SubgraphX outperforms other methods on average. In particular, SubgraphX generates 145.95% more accurate and 64.80% less unfaithful explanations than other GNN explanation methods. Gradient-based methods, such as GradCam and GuidedBP, perform the next best of all methods, with Grad producing the second-lowest unfaithfulness score and GradCAM achieving the second-highest explanation accuracy score. PGExplainer generates the least unstable explanations–35.35% less unstable explanations than the average instability across other GNN explainers. In summary, results of Tables 15 show that node explanation masks are more reliable than edge- and node feature explanation masks and state-of-the-art GNN explainers achieve better faithfulness for synthetic graph datasets as compared to real-world graphs.

Table 1 Evaluation of GNN explainers on SG-Base graph dataset based on node explanation masks \({{\bf{M}}}_{N}^{p}\). Arrows (↑/↓) indicate the direction of better performance.
Table 2 Evaluation of GNN explainers on SG-Base graph dataset based on node explanation masks \({{\bf{M}}}_{N}^{p}\).
Table 3 Evaluation of GNN explainers on SG-Base graph dataset based on edge explanation masks \({{\bf{M}}}_{E}^{p}\).
Table 4 Evaluation of GNN explainers on SG-Base graph dataset based on node feature explanation masks \({{\bf{M}}}_{F}^{p}\).
Table 5 Evaluation of GNN explainers for real-world molecular datasets with ground-truth explanations.
Table 6 Statistics of graphs generated using ShapeGGen data generator for evaluating different properties of GNN explanations.

Analyzing homophilic vs. heterophilic ground-truth explanations

We compare GNN explainers by generating explanations on GNN models trained on homophilic and heterophilic graphs generated using the SG-Heterophilic generator. Then, we compute the graph explanation unfaithfulness scores of output explanations generated using state-of-the-art GNN explainers. We find that GNN explainers produce 55.98% more faithful explanations when ground-truth explanations are homophilic than when ground-truth explanations are heterophilic (i.e., low unfaithfulness scores for light green bars in Fig. 4). These results reveal an important gap in existing GNN explainers. Namely, existing GNN explainers fail to perform well on diverse graph types, such as homophilic, heterophilic and attributed graphs. This observation, enabled by the flexibility of ShapeGGen generator, highlights an opportunity for future algorithmic innovation in GNN explainability.

Fig. 4
figure 4

Unfaithfulness scores across eight GNN explainers on SG-Heterophilic graph dataset consisting of either homophilic or heterophilic ground-truth (GT) explanations. GNN explainers produce more faithful explanations (lower GEF scores) on homophilic graphs than heterophilic graphs, revealing an important limitation of existing GNN explainers.

Analyzing the reliability of graph explainers to smaller vs. larger ground-truth explanations

Next, we examine the reliability of GNN explainers when used to predict explanations for models trained on graphs generated using the SG-SmallEx graph generator. Results in Fig. 5 show that explanations from existing GNN explainers are faithful (i.e., lower GEF scores) to the underlying GNN models when ground-truth explanations are smaller, i.e., S = ‘triangle’. On average, across all eight GNN explainers, we find that existing GNN explainers are highly unfaithful to graphs with large ground-truth explanations with an average GEF score of 0.7476. Further, we observe that explanations generated on ‘triangular’ (smaller) ground-truth explanations are 59.98% less unfaithful than explanations for ‘house’ (larger) ground-truth explanations (i.e., low unfaithfulness scores for light purple bars in Fig. 5). However, the Grad explainer, on average, achieves 9.33% lower unfaithfulness on large ground-truth explanations compared to other explanation methods. This limited behavior of existing GNN explainers has not been previously known and highlights an urgent need for additional analysis of GNN explainers.

Fig. 5
figure 5

Unfaithfulness scores across eight GNN explainers on SG-SmallEx graph dataset with smaller (triangle shapes) or (house shapes) ground-truth (GT) explanations. Results show that GNN explainers produce more faithful explanations (lower GEF scores) on graphs with smaller GT explanations than on graphs with larger GT explanations.

Examining fair vs. unfair ground-truth explanations

To measure the fairness of predicted explanations, we train GNN models on SG-Unfair, which generates graphs with controllable fairness properties. Next, we compute the average GECF and GEGF values for predicted explanations from eight GNN explainers. The fairness results in Fig. 6 show that GNN explainers do not preserve counterfactual fairness and are highly prone to producing unfair explanations. We note that for weakly-unfair ground-truth explanations (light red in Fig. 6), explanations Mp should not change as the label-generating process is independent of the protected attribute. Still, we observe high GECF scores for most explanation methods. For strongly-unfair ground-truth explanations, we find that explanations from most GNN explainers fail to capture (i.e., low GECF scores for dark red bars in Fig. 6) the unfairness enforced using the protected attribute and generate similar explanations even when we flip/modify the respective protected attribute. We see that GradCAM and PGEx explanations outperform other GNN explainers in preserving counterfactual explanations for weakly-unfair ground-truth explanations. In contrast, the PGMEx explainer preserves counterfactual fairness better than other explanation methods on strongly-unfair ground truth explanations. Our results highlight the importance of studying fairness in XAI as they can enhance a user’s confidence in the model and assist in detecting and correcting unwanted bias.

Fig. 6
figure 6

Counterfactual fairness mismatch scores across eight GNN explainers on SG-Unfair graph dataset with weakly-unfair or strongly-unfair ground-truth (GT) explanations. Results show that explanations produced on graphs with strongly-unfair ground-truth explanations do not preserve fairness and are sensitive to flipping/modifying the protected node feature.

Faithfulness shift with varying degrees of node feature information

Using ShapeGGen’s support for node features and ground-truth explanations on those features, we evaluate explainers that generate explanations for node features. Results for node feature explanations on SG-Base are given in Table 4. In addition, we explore the performance of explainers under varying proportions of informative node features. Informative node features, defined in the ShapeGGen construction (Algorithm 1), are node features correlated with the label of a given node, as opposed to redundant features, which are sampled randomly from a Gaussian distribution. Figure 7 shows the results of experiments on three datasets, SG-MoreInform, SG-Base, and SG-LessInform. All datasets have similar graph topology, but SG-MoreInform has a higher proportion of informative features while SG-LessInform has a lower proportion of these informative features. SG-Base is used as a baseline with a proportion of informative features greater than SG-LessInform but less than SG-MoreInform. There are minimal differences between explainers’ faithfulness across datasets, however, unfaithfulness tends to increase with fewer informative node features. As seen in Table 4, the Gradient explainer shows the best faithfulness score across all datasets for node feature explanation. Still, this faithfulness is relatively weak, only 0.001 better than random explanation. These results show that the faithfulness of node feature explanations worsens under sparse node feature signals.

Fig. 7
figure 7

Unfaithfulness scores across five GNN explainers that produce node feature explanations. Every GNN explainer is evaluated on three datasets whose network topology is equivalent to SG-Base and by varying the ratio between informative and redundant node features: most informative node features, control node features, and least informative node features. Results show that across all explainers, unfaithfulness decreases as the proportion of informative to redundant features increases, with explainers trained on the graph with the most informative node features having consistently lower unfaithfulness scores than explainers trained on graphs with the least informative node features.

Visualization results

GraphXAI provides functions that visualize explanations produced by GNN explainability methods. Users can compare both node- and graph-level explanations. In addition, function implementations for visualization are parameterized, allowing users to change colors and weight interpretation. Functions are compatible with matplotlib37 and networkx38. Visualizations are generated by graphxai.Explanation.visualize_node for node-level explanations and graphxai.Explanation.visualize_graph functions for graph-level explanations. In Fig. 8, we show the output explanation from four different GNN explainers as produced by our visualization function. Figure 9 shows example outputs from multiple explainers in the GraphXAI package on a ShapeGGen -generated dataset.

Fig. 8
figure 8

Visualization of four explainers from the G-XAI Bench library on the BA-Shapes dataset. The visualization is for explaining the prediction of node u. We show the L + 1-hop around node u, where L is the number of layers of the GNN model predicting on the dataset. Two color bars indicate the intensity of attribution scores for the node and edge explanations. Note that edge importance is not defined for every method, so edges are set to black to indicate that the method does not provide edge scores. Visualization tools are a native part of the GraphXAI package, including user-friendly functions graphxai.Explanation.visualize_node and graphxai.Explanation.visualize_graph to visualize GNN explanations. The visualization tools in GraphXAI allow users to compare the explanations of different GNN explainers, such as gradient-based methods (Gradient and Grad-CAM) and perturbation-based methods (GNNExplainer and SubgraphX).

Fig. 9
figure 9

Example of a particularly challenging example in a ShapeGGen dataset. All explanation methods that output node-wise importance scores are shown, including the ground-truth explanation at the top of the figure. Importance and edge scores are highlighted by relative value across each explanation method, as shown by the scales at right in the figure. The central node, i.e., the node being classified in this example, is shown in red on each subgraph. Visualizations are generated by graphxai.Explanation.visualize_node, a function native to the graphxai package. Some explainers can capture portions of the ground-truth explanation, such as SubgraphX and GNNExplainer, but others attribute no importance to the ground-truth shape, such as CAM and Gradient.


GraphXAI provides a general-purpose framework to evaluate GNN explanations produced by state-of-the-art GNN explanation methods. GraphXAI provides data loaders, data processing functions, visualizers, real-world graph datasets with ground-truth explanations, and evaluation metrics to benchmark the quality of GNN explanations. GraphXAI introduces a novel and flexible synthetic dataset generator called ShapeGGen to automatically generate benchmark datasets and corresponding ground truth explanations robust against known pitfalls of GNN explainability methods. Our experimental results show that existing GNN explainers perform well on graphs with homophilic ground-truth explanations but perform considerably worse on heterophilic and attributed graphs. Across multiple graph datasets and types of downstream prediction tasks, we show that existing GNN explanation methods fail on graphs with larger ground-truth explanations and cannot generate explanations that preserve the fairness properties of the underlying GNN model. In addition, GNN explainers tend to underperform on sparse node feature signals compared to more densely informative node features. These findings indicate the need for methodological innovation and a thorough analysis of future GNN explainability methods across performance dimensions.

GraphXAI provides a flexible framework for evaluating GNN explanation methods and promotes reproducible and transparent research. We maintain GraphXAI as a centralized library for evaluating GNN explanation methods and plan to add newer datasets, explanation methods, diverse evaluation metrics, and visualization features to our existing framework. In the current version of GraphXAI, we mostly employ real-world molecular chemistry datasets as the need for model understanding is motivated by experimental evaluation of model predictions in the laboratory, and it includes a wide variety of graph sizes (ranging from 1,768 to 12,000 instances in the dataset), node feature dimensions (ranging from 13 to 27 dimensions), and class imbalance ratios. In addition to the scale-free ShapeGGen dataset generator, we will include other random graph model generators in the next GraphXAI release to support benchmarking of GNN explainability methods on other graph types. Our evaluation metrics can be also extended to explanations from self-explaining GNNs, e.g., GraphMASK12 identifies edges at each layer of the GNN during training that can be ignored without affecting the output model predictions. In general, self-explaining GNNs also return a set of edge masks as an output explanation for a GNN prediction that can be converted to edge importance scores for computing GraphXAI metrics. We anticipate that GraphXAI can help algorithm developers and practitioners in graph representation learning develop and evaluate principled explainability techniques.


ShapeGGen is a key component of GraphXAI and serves as a synthetic dataset generator of XAI-ready graph datasets. It is founded in graph theory and designed to address the pitfalls (see Introduction) of existing graph datasets in the broad area of explainable AI. As such, ShapeGGen can facilitate the development, analysis, and evaluation of GNN explainability methods (see Results). We proceed with the description of ShapeGGen data generator.



Let \({\mathcal{G}}=\left({{\mathcal{V}}}_{{\mathcal{G}}},{{\mathcal{E}}}_{{\mathcal{G}}},{{\bf{X}}}_{{\mathcal{G}}}\right)\) denote an undirected graph comprising of a set of nodes \({{\mathcal{V}}}_{{\mathcal{G}}}\) and a set of edges \({{\mathcal{E}}}_{{\mathcal{G}}}\). Let \({{\bf{X}}}_{{\mathcal{G}}}=\left\{{{\bf{x}}}_{1},{{\bf{x}}}_{2},\ldots ,{{\bf{x}}}_{N}\right\}\) denote the set of node feature vectors for all nodes in \({{\mathcal{V}}}_{{\mathcal{G}}}\), where \({{\bf{x}}}_{v}\in {{\bf{X}}}_{{\mathcal{G}}}\) is an d-dimensional vector which captures the attribute values of a node v and \(N=\left|{{\mathcal{V}}}_{{\mathcal{G}}}\right|\) denotes the number of nodes in the graph. Let \({\bf{A}}\in {{\mathbb{R}}}^{N\times N}\) be the graph adjacency matrix where element Auv = 1 if there exists an edge \(e\in {{\mathcal{E}}}_{{\mathcal{G}}}\) between nodes u and v and Auv = 0 otherwise. We use \({{\mathcal{N}}}_{u}\) to denote the set of immediate neighbors of node u, \({{\mathcal{N}}}_{u}=\{v\in {{\mathcal{V}}}_{{\mathcal{G}}}| {A}_{uv}=1\}\). Finally, the function \({\rm{\deg }}:{{\mathcal{V}}}_{{\mathcal{G}}}\mapsto {{\mathbb{Z}}}_{\ge 0}\) is defined as \({\rm{\deg }}(u)=\left|{{\mathcal{N}}}_{u}\right|\) and outputs the degree of a node \(u\in {{\mathcal{V}}}_{{\mathcal{G}}}\).

Graph neural networks

Most GNNs can be formulated as message passing networks39 using three operators: Msg, Agg, and Upd. In a L-layer GNN, these operators are recursively applied on \({\mathcal{G}}\), specifying how neural messages (i.e. embeddings) are exchanged between nodes, aggregated, and transformed to arrive at node representations in the last layer of transformations. Formally, a message between a pair of nodes (u, v) in layer l is defined as a function of hidden representations of nodes \({{\bf{h}}}_{u}^{l-1}\) and \({{\bf{h}}}_{v}^{l-1}\) from the previous layer: \({{\bf{m}}}_{uv}^{l}=Msg({{\bf{h}}}_{u}^{l-1},{{\bf{h}}}_{v}^{l-1})\). In Agg, messages from all nodes \(v\in {{\mathcal{N}}}_{u}\) are aggregated as: \({{\bf{m}}}_{u}^{l}=Agg({{\bf{m}}}_{uv}^{l}| v\in {{\mathcal{N}}}_{u})\). In Upd, the aggregated message \({{\bf{m}}}_{u}^{l}\) is combined with \({{\bf{h}}}_{u}^{l-1}\) to produce u’s representation for layer l as \({{\bf{h}}}_{u}^{l}=Upd({{\bf{m}}}_{u}^{l},{{\bf{h}}}_{u}^{l-1})\). Final node representation \({{\bf{z}}}_{u}={{\bf{h}}}_{u}^{L}\) is the output of the last layer. Lastly, let f denote a downstream GNN classification model that maps the node representation zu to a softmax prediction vector \({\widehat{y}}_{u}\in {{\mathbb{R}}}^{C}\), where C is the total number of labels.

GNN explainability methods

Given the prediction f(u) for node u made by a GNN model, a GNN explainer \({\mathcal{O}}\) outputs an explanation mask Mp that provides an explanation of f(u). These explanations can be given with respect to node attributes \({{\bf{M}}}_{a}\in {{\mathbb{R}}}^{d}\), nodes \({{\bf{M}}}_{n}\in {{\mathbb{R}}}^{N}\), or edges \({{\bf{M}}}_{e}\in {{\mathbb{R}}}^{N\times N}\), depending on specific GNN explainer, such as GNNExplainer14, PGExplainer10, and SubgraphX31. For all explanation methods, we use a graph masking function MASK that outputs a new graph \({\mathcal{G}}{\prime} =\left({{\mathcal{V}}{\prime} }_{{\mathcal{G}}{\prime} },{{\mathcal{E}}{\prime} }_{{\mathcal{G}}{\prime} },{{\bf{X}}{\prime} }_{{\mathcal{G}}{\prime} }\right)\) by performing element-wise multiplication operation between the masks (Ma, Mn, Me) and their respective attributes in the original graph \({\mathcal{G}}\), e.g. A′ = AMe. Finally, we denote the ground-truth explanation mask as Mg that is used to evaluate the performance of GNN explainers.

ShapeGGen dataset generator

We propose a novel and flexible synthetic dataset generator called ShapeGGen that can automatically generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. Furthermore, the flexibility to generate diverse synthetic datasets and corresponding ground-truth explanations allows us to mimic the data generated by various real-world applications. ShapeGGen is a generator of XAI-ready graph datasets supported by graph theory and is particularly suitable for benchmarking GNN explainers and studying their limitations.

Flexible parameterization of ShapeGGen

ShapeGGen has tunable parameters for data generation. By varying these parameters, ShapeGGen can generate diverse types of graphs, including graphs with varying degrees of class imbalance and graph sizes. Formally, a graph is generated as: \({\mathcal{G}}\) = ShapeGGen \(({\mathcal{S}},{N}_{s},p,{n}_{s},K,{n}_{f},{n}_{i},{s}_{f},{c}_{f},\varphi ,\eta ,L)\), where:

  • S is the motif, defined as a subgraph, to be planted within the graph.

  • Ns denotes the number of subgraphs used in the initial graph generation process.

  • p represents the connection probability between two subgraphs and controls the average shortest path length for all possible pairs of nodes. Ideally, a smaller p value for larger Ns is preferred to avoid low average path length and, therefore, the poor performance of GNNs.

  • ns is the expected size of each subgraph in the ShapeGGen generation procedure. For a fixed S shape, large ns values produce graphs with long-tailed degree distributions. Note that the expected total number of nodes in the generated graph is N × ns.

  • K is the number of distinct classes defined in the graph downstream task.

  • nf represents the number of dimensions for node features in the generated graph.

  • ni is the number of informative node features. These are features correlated to the node labels instead of randomly generated non-informative features. The indices for the informative features define the ground-truth explanation mask for node features in the final ShapeGGen instance. In general, larger ni results in an easier classification and explanation task, as it increases the node feature-level ground-truth explanation size.

  • sf is defined as the class separation factor that represents the strength of discrimination of class labels between features for each node. Higher sf corresponds to a stronger signal in the node features, i.e., if a classifier is trained only on the node features, a higher sf value would result in an easier classification task.

  • cf is the number of clusters per class. A larger cf value increases the difficulty of the classification task with respect to node features.

  • \({\varphi }\in [0,1]\) is the protected feature noise factor that controls the strength of correlation r between the protected features and the node labels. This value corresponds to the probability of “flipping” the protected feature with respect to the node’s label. For instance, \({\varphi }=0.5\) results in zero correlation (r = 0) between the protected feature and the label (i.e. complete fairness), \({\varphi }=0\) results in a positive correlation (r = 1), and \({\varphi }=1\) results in a negative correlation (r = −1) between the label and the protected feature.

  • η is the homophily coefficient that controls the strength of homophily or heterophily in the graph. Positive values (η > 0) produce a homophilic graph while negative values (η < 0) produce a heterophilic graph.

  • L is the number of layers in the GNN predictor corresponding to the GNN’s receptive field. For the purposes of ShapeGGen, L is used to define the size of the GNN’s receptive field and thus the size of the ground-truth explanation generated for each node.

This wide array of parameters for ShapeGGen allows for the generation of graph instances with vastly differing properties Fig. 10.

Fig. 10
figure 10

Comparison of degree distribution for (a) ShapeGGen dataset (SG-Base), (b) random Erdös-Rényi graph (p = 5 × 10−4) graph, (c) German Credit dataset, and (d) Credit Defaulter dataset. All plots are shown with a log scale of frequency for the y-axis. SG-Base and both real-world graphs show a power-law degree distribution commonly observed in real-world datasets exhibiting preferential attachment properties. Datasets generated by ShapeGGen are designed to present power-law degree distributions to match real-world dataset topologies, such as those observed in German Credit and Credit Defaulter. The degree distribution of SG-Base is much different than the binomial distribution exhibited in Erdös-Rényi graph (b), an unstructured random graph model.

Generating graph structure

Figure 2 summarizes the process to generate a graph \({\mathcal{G}}=\left({{\mathcal{V}}}_{{\mathcal{G}}},{{\mathcal{E}}}_{{\mathcal{G}}},{{\bf{X}}}_{{\mathcal{G}}}\right)\). ShapeGGen generates Ns subgraphs that exhibit the preferential attachment property40, which occurs in many real-world graphs. Each subgraph is first given a motif, i.e., a subgraph \({\mathcal{S}}=\left({{\mathcal{V}}}_{{\mathcal{S}}},{{\mathcal{E}}}_{{\mathcal{S}}},{{\bf{X}}}_{{\mathcal{S}}}\right)\). A preferential attachment algorithm is then performed on base structure \({\mathcal{S}}\), adding \(n{\prime} \) (\(n{\prime} \sim {\rm{Poisson}}(\lambda ={n}_{s}-| {{\mathcal{V}}}_{{\mathcal{S}}}| )\)) nodes by creating edges to nodes in \({{\mathcal{V}}}_{{\mathcal{S}}}\). The Poisson distribution is used for determining the sizes of each subgraph used in the generation process, with \(\lambda ={n}_{s}-| {{\mathcal{V}}}_{{\mathcal{S}}}| \), the difference between the number of nodes in the motif and the expected subgraph size \({n}_{s}\). After creating a list of randomly-generated subgraphs \({\bf{S}}=\{{{\mathcal{S}}}_{1},\ldots ,{{\mathcal{S}}}_{N}\}\), edges are created to connect subgraphs, creating the structure of an instance of ShapeGGen. Subgraph connections and local subgraph construction is performed in such a way that each node in the final graph \({\mathcal{G}}\) has between 1 and K motifs within its neighborhood, i.e., \(\left|{\bigcup }_{i=1}^{N}{{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}\cap {{\mathcal{N}}}_{v}\right|\;\in \;\left\{1,2,...,K\right\}\) for any v and \({{\mathcal{S}}}_{i}\). This naturally defines a classification task in the domain of f to {0, 1, …, K−1}. More details on the ShapeGGen structure generation can be found in Algorithm 2.

Generating labels for prediction

A motif defined as a subgraph \({\mathcal{S}}=\left({{\mathcal{V}}}_{{\mathcal{S}}},{{\mathcal{E}}}_{{\mathcal{S}}},{{\bf{X}}}_{{\mathcal{S}}}\right)\) occurs randomly throughout \({\mathcal{G}}\), with the set \({\bf{S}}=\{{{\mathcal{S}}}_{1},\ldots ,{{\mathcal{S}}}_{{N}_{S}}\}\). The task on this graph is a motif detection problem, where each node has exactly 1, 2, or K motifs in its 1-hop neighborhood. A motif \({{\mathcal{S}}}_{i}\) is considered to be within the neighborhood of a node \(v\in {{\mathcal{V}}}_{{\mathcal{G}}}\) if any node \(s\in {{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}\) is also in the neighborhood of v, i.e., if \(| {{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}\cap {{\mathcal{N}}}_{v}| > 0\). Therefore, the task that a GNN predictor needs to solve is defined by:

$$f(v\in {{\mathcal{V}}}_{{\mathcal{G}}})=\sum _{{{\mathcal{S}}}_{i}\in {\bf{S}}}{{\mathbb{1}}}_{{{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}}({{\mathcal{N}}}_{v})-1,$$

where \({{\mathbb{1}}}_{{{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}}\left({{\mathcal{N}}}_{v}\right)=0\) if \(| {{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}\cap {{\mathcal{N}}}_{v}| =0\) and 1 otherwise.

Generating node feature vectors

ShapeGGen uses a latent variable model to create node feature vectors and associate them with network structure. This latent variable model is based on that developed by Guyon41 for the MADELON dataset and implemented in Scikit-Learn’s make_classification function42. The latent variable model creates ni informative features for each node based on the node’s generated label and also creates non-informative features as noise. Having non-informative/redundant features allows us to evaluate GNN explainers, such as GNNExplainer14, that formulate explanations as node feature masks. More detail on node feature generation is given in Algorithm 1.

ShapeGGen can generate graphs with both homophilic and heterophilic ground-truth explanations. We optimize between homophily vs. heterophily by taking advantage of redundant node features, i.e., features that do not associate with the generated labels, and manipulate them appropriately based on a user-specified homophily parameter η. The magnitude of the η parameter determines the degree of homophily/heterophily in the generated graph. The algorithm for node features is given in Algorithm 3. While not every node feature in the feature vector is optimized for homophily/heterophily indication, we experimentally verified the cosine similarity between node feature vectors produced by Algorithm 3 reveals a strong homophily/heterophily pattern. Finally, ShapeGGen can generate protected features to enable the study of fairness1. By controlling the value assignment for a selected discrete node feature, ShapeGGen introduces bias between the protected feature and node labels. The biased feature is a proxy for a protected feature, such as gender or race. The procedure for generating node features is outlined in NodeFeatureVectors within Algorithm 1.

Generating ground-truth explanations

In addition to generating ground-truth labels, ShapeGGen has a unique capability to generate unique ground-truth explanations. To accommodate diverse types of GNN explainers, every ground-truth explanation in ShapeGGen contains information on a) the importance of nodes, b) the importance of node features, and c) the importance of edges. This information is represented by three distinct masks defined over enclosing subgraphs of nodes \(v\in {{\mathcal{V}}}_{{\mathcal{S}}}\), i.e., the L-hop neighborhood around the node. We denote the enclosing subgraph of node \(v\in {{\mathcal{V}}}_{{\mathcal{S}}}\) for a given GNN predictor with L layers as: \({\rm{Sub}}\left(v;L\right)=\left({{\mathcal{V}}}_{{\rm{Sub}}(v)},{{\mathcal{E}}}_{{\rm{Sub}}(v)},{{\bf{X}}}_{{\rm{Sub}}(v)}\right)\subseteq {\mathcal{G}}\). Let motifs within this enclosing subgraph be: \({{\bf{S}}}_{v}=\left({{\mathcal{V}}}_{{{\bf{S}}}_{v}},{{\mathcal{E}}}_{{{\bf{S}}}_{v}},{{\bf{X}}}_{{{\bf{S}}}_{v}}\right)={\bf{S}}\cap {\rm{Sub}}(v)\). Using this notation, we define ground-truth explanation masks:

  1. a)

    Node explanation mask. Nodes in Sub(v) are assigned a value of 0 or 1 based on whether they are located within a motif or not. For any node \({v}_{i}\in {{\mathcal{V}}}_{{\rm{Sub}}(v)}\), we set \({{\mathbb{1}}}_{{{\mathcal{V}}}_{{{\bf{S}}}_{v}}}({v}_{i})=1\) if \({v}_{i}\in {{\mathcal{V}}}_{{{\bf{S}}}_{v}}\) and 0 otherwise. This function \({{\mathbb{1}}}_{{{\mathcal{V}}}_{{{\bf{S}}}_{v}}}\) is applied to all nodes in the enclosing subgraph of v to produce an importance score for each node, yielding the final mask as: \({{\bf{M}}}_{n}=\{{{\mathbb{1}}}_{{{\mathcal{V}}}_{{{\bf{S}}}_{v}}}(u)| u\in {{\mathcal{V}}}_{{\rm{Sub}}(v)}\}\).

  2. b)

    Node feature explanation mask. Each feature in v’s feature vector is labeled as 1 if it represents an informative feature and 0 if it is a random feature. This procedure produces a binary node feature mask for node v as: \({{\bf{M}}}_{f}\in {\{0,1\}}^{d}\).

  3. c)

    Edge explanation mask. To each \(e=\left({v}_{i},{v}_{j}\right)\in {{\mathcal{E}}}_{{\rm{Sub}}(v)}\) we assign a value of either 0 or 1 based on whether e connects any two nodes in Sub(v). The masking function is defined as follows:

$$\begin{array}{ccc}{{\mathbb{1}}}_{{\varepsilon }_{{{\bf{S}}}_{v}}}(e) & = & \left\{\begin{array}{cc}0 & {\rm{if}}\;{v}_{{\rm{i}}}\notin {{\mathcal{V}}}_{{{\bf{S}}}_{{\rm{v}}}}\cup {\rm{\{}}v{\rm{\}}}\vee {v}_{{\rm{j}}}\notin {{\mathcal{V}}}_{{{\bf{S}}}_{v}}\cup {\rm{\{}}v{\rm{\}}}\\ 1 & {\rm{if}}\;{v}_{{\rm{i}}}\in {{\mathcal{V}}}_{{{\bf{S}}}_{{\rm{v}}}}\cup {\rm{\{}}v{\rm{\}}}\wedge {v}_{{\rm{j}}}\in {{\mathcal{V}}}_{{{\bf{S}}}_{{\rm{v}}}}\cup {\rm{\{}}v{\rm{\}}}\end{array}\right.\end{array}$$

This function \({{\mathbb{1}}}_{{{\mathcal{E}}}_{{{\bf{S}}}_{v}}}\) is applied to all edges \(e\in {{\mathcal{E}}}_{{\rm{Sub}}(v)}\) to produce ground-truth edge explanation as: \({{\bf{M}}}_{e}=\{{{\mathbb{1}}}_{{{\mathcal{E}}}_{{{\bf{S}}}_{v}}}(e)| e\in {{\mathcal{E}}}_{{\rm{Sub}}(v)}\}\). The procedure to generate these ground-truth explanations is thoroughly described in Algorithm 1.

Datasets in GraphXAI

We proceed with a detailed description of synthetic and real-world graph data resources included in GraphXAI.

Synthetic graphs

The ShapeGGen generator outlined in the Methods section is a dataset generator that can be used to generate any number of user-specified graphs. In GraphXAI, we provide a collection of pre-generated XAI-ready graphs with diverse properties that are readily available for analysis and experimentation.

Base ShapeGGen graphs (SG-Base)

We provide a base version of ShapeGGen graphs. This instance of ShapeGGen is homophilic, large, and contains house-shaped motifs for ground-truth explanations, formally described as \({\mathcal{G}}\) = ShapeGGen (\({\mathcal{S}}\) = ‘house’, Ns = 1200, p = 0.006, ns = 11, K = 2, nf = 11, ni = 4, sf = 0.6, cf = 2, ϕ = 0.5, η = 1, L = 3). The node features in this graph exhibit homophily, a property commonly found in social networks. With over 10,000 nodes, this graph also provides enough examples of ground-truth explanations for rigorous statistical evaluation of explainer performance. The house-shaped motifs follow one of the earliest synthetic graphs used to evaluate GNN explainers14.

Homophilic and heterophilic ShapeGGen graphs

GNN explainers are evaluated on homophilic graphs1,43,44,45 as low homophily levels in graphs can degrade the performance of GNN predictors46,47. To this end, there are no heterophilic graphs with ground-truth explanations in current GNN XAI literature despite such graphs being abundant in real-world applications46. To demonstrate the flexibility of the ShapeGGen data generator, we use it to generate graphs with: i) homophilic ground-truth explanations (SG-Base) and ii) heterophilic ground-truth explanations (SG-Heterophilic), i.e., \({\mathcal{G}}\) = ShapeGGen (\({\mathcal{S}}\) = ‘house’, Ns = 1200, p = 0.006, ns = 11, K = 2, nf = 11, ni = 4, sf = 0.6, cf = 2, ϕ = 0.5, η = −1, L = 3).

Weakly and strongly unfair ShapeGGen graphs

We utilize the ShapeGGen data generator to generate graphs with controllable fairness properties, i.e., leverage ShapeGGen to generate synthetic graphs with real-world fairness properties where we can enforce unfairness w.r.t. a given protected attribute. We use the ShapeGGen to generate graphs with: i) weakly-unfair ground-truth explanations (SG-Base) and ii) strongly-unfair ground-truth explanations (SG-Unfair) \({\mathcal{G}}\) = ShapeGGen (\({\mathcal{S}}\) = ‘house’, Ns = 1200, p = 0.006, ns = 11, K = 2, nf = 11, ni = 4, sf = 0.6, cf = 2, ϕ = 0.75, η = 1, L = 3). Here, for the first time, we generated unfair synthetic graphs that can serve as pseudo-ground-truth for quantifying whether current GNN explainers preserve counterfactual fairness.

Small and large ShapeGGen explanations

We explore the faithfulness of explanations w.r.t. different ground-truth explanation sizes. This is important because a reliable explanation should identify important features independent of the explanation size. However, current data resources only provide graphs with smaller-size ground-truth explanations. Here, we use the ShapeGGen data generator to generate graphs having (i) smaller ground-truth explanations size (SG-SmallEx) \({\mathcal{G}}\) = ShapeGGen (\({\mathcal{S}}\) = ‘triangle’, Ns = 1200, p = 0.006, ns = 12, K = 2, nf = 11, ni = 4, sf = 0.5, cf = 2, ϕ = 0.5, η = 1, L = 3) and (ii) larger ground-truth explanations size using house motifs.

Low and high proportions of salient features

We examine the faithfulness of node feature masks produced by explainers under different levels of sparsity for class-associated signal in the node features. The feature generation procedure in ShapeGGen specifies node feature parameters ni, the number of informative features that are generated to correlate with node labels, and nf, number of total node features. The remaining nfni features are redundant features that are randomly distributed and have no correlation to the node label. Using an equivalent graph topology to SG-Base, we change the relative proportion of node features which are attributed to the label by adjusting ni and nf. We create SG-MoreInform, a ShapeGGen instance with a high proportion of ground-truth features to total features (8:11). Likewise, we create SG-LessInform, a ShapeGGen instance with a low proportion of ground-truth features to total features (4:21). This proportion in SG-Base falls in the middle of SG-MoreInform and SG-LessInform instances with a proportion of ground-truth to total features of 4:11. Formally, we define SG-MoreInform as \({\mathcal{G}}\) = ShapeGGen (\({\mathcal{S}}\) = ‘house’, Ns = 1200, p = 0.006, ns = 11, K = 2, nf = 11, ni = 8, sf = 0.6, cf = 2, ϕ = 0.5, η = 1, L = 3) and SG-LessInform as \({\mathcal{G}}\) = ShapeGGen (\({\mathcal{S}}\) = ‘house’, Ns = 1200, p = 0.006, ns = 11, K = 2, nf = 21, ni = 4, sf = 0.6, cf = 2, ϕ = 0.5, η = 1, L = 3).


In addition to ShapeGGen, we provide a version of BA-Shapes14, a synthetic graph data generator for node classification tasks. We start with a base Barabasi-Albert (BA)48 graph using N nodes and a set of five-node “house”-structured motifs K randomly attached to nodes of the base graph. The final graph is perturbed by adding random edges. The nodes in the output graph are categorized into two classes corresponding to whether the node is in a house (1) or not in a house (0).

Real-world graphs

We describe the real-world graph datasets with and without ground-truth explanations provided in GraphXAI. To this end, we provide data resources from crime forecasting, financial lending, and molecular chemistry and biology1,2,35. We consider these datasets as they are used to train GNNs for generating predictions in high-stakes downstream applications. In particular, we include chemical and biological datasets because they are used to identify whether an input graph (i.e., a molecular graph) contains a specific pattern (i.e., a chemical group with a specific property in the molecule). Knowledge about such patterns, which determine molecular properties, represents ground-truth explanations2. We provide a statistical description of real-world graphs in Tables 68. Below, we discuss the details of each of the real-world datasets that we employ:

Table 7 Statistics of real-world graph classification datasets in GraphXAI with ground-truth (GT) explanations.
Table 8 Statistics of real-world node classification datasets in GraphXAI without ground-truth (GT) explanations.


The MUTAG35 dataset contains 1,768 graph molecules labeled into two different classes according to their mutagenic properties, i.e., effect on the Gram-negative bacterium S. Typhimuriuma. Kazius et al.35 identifies several toxicophores - motifs in the molecular graph - that correlate with mutagenicity. The dataset is trimmed from its original 4,337 graphs to 1,768, based on those whose labels directly correspond to the presence or absence of our chosen toxicophores: NH2, NO2, aliphatic halide, nitroso, and azo-type (terminology, as referred to in Kazius et al.35).

Alkane carbonyl

The Alkane Carbonyl2 dataset contains 1,125 molecular graphs labeled into two classes where a positive sample indicates a molecule that contains an unbranched alkane and a carbonyl (C = O) functional group. The ground-truth explanations include any combinations of alkane and carbonyl functional groups within a given molecule.


The Benzene2 dataset contains 12,000 molecular graphs extracted from the ZINC1549 database and labeled into two classes where the task is to identify whether a given molecule has a benzene ring or not. Naturally, the ground truth explanations are the nodes (atoms) comprising the benzene rings, and in the case of multiple benzenes, each of these benzene rings forms an explanation.

Fluoride carbonyl

The Fluoride Carbonyl2 dataset contains 8,671 molecular graphs labeled into two classes where a positive sample indicates a molecule that contains a fluoride (F) and a carbonyl (C = O) functional group. The ground-truth explanations consist of combinations of fluoride atoms and carbonyl functional groups within a given molecule.

German credit

The German Credit1 graph dataset contains 1,000 nodes representing clients in a German bank connected based on the similarity of their credit accounts. The dataset includes demographic and financial features like gender, residence, age, marital status, loan amount, credit history, and loan duration. The goal is to associate clients with credit risk.


The Recidivism1 dataset includes samples of bail outcomes collected from multiple state courts in the USA between 1990–2009. It contains past criminal records, demographic attributes, and other demographic details of 18,876 defendants (nodes) who got released on bail at the U.S. state courts. Defendants are connected based on the similarity of past criminal records and demographics, and the goal is to classify defendants into bail vs. no bail.

Credit defaulter

The Credit defaulter1 graph has 30,000 nodes representing individuals that we connected based on the similarity of their spending and payment patterns. The dataset contains applicant features like education, credit history, age, and features derived from their spending and payment patterns. The task is to predict whether an applicant will default on an upcoming credit card payment.