Evaluating explainability for graph neural networks

Agarwal, Chirag; Queen, Owen; Lakkaraju, Himabindu; Zitnik, Marinka

doi:10.1038/s41597-023-01974-x

Download PDF

Article
Open access
Published: 18 March 2023

Evaluating explainability for graph neural networks

Chirag Agarwal^1,2^na1,
Owen Queen^2,3^na1,
Himabindu Lakkaraju^4,5,6 &
…
Marinka Zitnik ORCID: orcid.org/0000-0001-8530-7228^2,5,7

Scientific Data volume 10, Article number: 144 (2023) Cite this article

19k Accesses
15 Citations
11 Altmetric
Metrics details

Subjects

Abstract

As explanations are increasingly used to understand the behavior of graph neural networks (GNNs), evaluating the quality and reliability of GNN explanations is crucial. However, assessing the quality of GNN explanations is challenging as existing graph datasets have no or unreliable ground-truth explanations. Here, we introduce a synthetic graph data generator, ShapeGGen, which can generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. The flexibility to generate diverse synthetic datasets and corresponding ground-truth explanations allows ShapeGGen to mimic the data in various real-world areas. We include ShapeGGen and several real-world graph datasets in a graph explainability library, GraphXAI. In addition to synthetic and real-world graph datasets with ground-truth explanations, GraphXAI provides data loaders, data processing functions, visualizers, GNN model implementations, and evaluation metrics to benchmark GNN explainability methods.

Enhancing property and activity prediction and interpretation using multiple molecular graph representations with MMGX

Article Open access 05 April 2024

Chemistry-intuitive explanation of graph neural networks for molecular property prediction with substructure masking

Article Open access 04 May 2023

Graph neural networks

Article 07 March 2024

Introduction

As graph neural networks (GNNs) are being increasingly used for learning representations of graph-structured data in high-stakes applications, such as criminal justice¹, molecular chemistry^2,3, and biological networks^4,5, it becomes critical to ensure that the relevant stakeholders can understand and trust their functionality. To this end, previous work developed several methods to explain predictions made by GNNs^{6,7,8,9,10,11,12,13,14}. With the increase in newly proposed GNN explanation methods, it is critical to ensure their reliability. However, explainability in graph machine learning is an emerging area lacking standardized evaluation strategies and reliable data resources to evaluate, test, and compare GNN explanations¹⁵. While several works have acknowledged this difficulty, they tend to base their analysis on specific real-world² and synthetic¹⁶ datasets with limited ground-truth explanations. In addition, relying on these datasets and associated ground-truth explanations is insufficient as they are not indicative of diverse real-world applications¹⁵. To this end, developing a broader ecosystem of data resources for benchmarking state-of-the-art GNN explainers can support explainability research in GNNs.

A comprehensive data resource to correctly evaluate the quality of GNN explanations will ensure their reliability in high-stake applications. However, the evaluation of GNN explanations is a growing research area with relatively little work, where existing approaches mainly leverage ground-truth explanations associated with specific datasets² and are prone to several pitfalls (as outlined by Faber et al.¹⁶). Further, multiple underlying rationales can generate the correct class labels, creating redundant or non-unique explanations. A trained GNN model may only capture one or an entirely different rationale. In such cases, evaluating the explanation output by a state-of-the-art method using the ground-truth explanation is incorrect because the underlying GNN model does not rely on that ground-truth explanation. In addition, even if a unique ground-truth explanation generates the correct class label, the GNN model trained on the data could be a weak predictor using an entirely different rationale for prediction. Therefore, the ground-truth explanation cannot be used to assess post hoc explanations of such models. Lastly, the ground-truth explanations corresponding to some of the existing benchmark datasets are not good candidates for reliably evaluating explanations as they can be recovered using trivial baselines (e.g., random node or edge as explanation). The above discussion highlights a clear need for general-purpose data resources which can evaluate post hoc explanations reliably across diverse real-world applications. While various benchmark datasets (e.g., Open Graph Benchmark (OGB)¹⁷, GNNMark¹⁸, GraphGT¹⁹, MalNet²⁰, Graph Robustness Benchmark (GRB)²¹, Therapeutics Data Commons^22,23, and EFO-1-QA²⁴) and programming libraries for deep learning on graphs (e.g., Dive Into Graphs (DIG)²⁵, Pytorch Geometric (PyG)²⁶, and Deep Graph Library (DGL)²⁷) in graph machine learning literature exist, they are mainly used to only benchmark the performance of GNN predictors and are not suited to evaluate the correctness of GNN explainers because they do not capture ground-truth explanations.

Here, we address the above challenges by introducing a general-purpose data resource that is not prone to ground-truth pitfalls (e.g., redundant explanations, weak GNN predictors, trivial explanations) and can cater to diverse real-world applications. To this end, we present ShapeGGen (Fig. 2), a novel and flexible synthetic XAI-ready (explainable artificial intelligence ready) dataset generator which can automatically generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. ShapeGGen ensures that the generated ground-truth explanations are not prone to the pitfalls described in Faber et al.¹⁶, such as redundant explanations, weak GNN predictors, and trivially correct explanations. Furthermore, ShapeGGen can evaluate the goodness of any explanation (e.g., node feature-based, node-based, edge-based) across diverse real-world applications by seamlessly generating synthetic datasets that can mimic the properties of real-world data in various domains.

We incorporate ShapeGGen and several other synthetic and real-world graphs²⁸ into GraphXAI, a general-purpose framework for benchmarking GNN explainers. GraphXAI also provides a broader ecosystem (Fig. 1) of data loaders, data processing functions, visualizers, and a set of evaluation metrics (e.g., accuracy, faithfulness, stability, fairness) to reliably benchmark the quality of any given GNN explanation (node feature-based or node/edge-based). We leverage various synthetic and real-world datasets and evaluation metrics from GraphXAI to empirically assess the quality of explanations output by eight state-of-the-art GNN explanation methods. Across many GNN explainers, graphs, and graph tasks, we observe that state-of-the-art GNN explainers fail on graphs with large ground-truth explanations (i.e., explanation subgraphs with a higher number of nodes and edges) and cannot produce explanations that preserve fairness properties of underlying GNN predictors.

Results

To evaluate GraphXAI, we show how GraphXAI enables systematic benchmarking of eight state-of-the-art GNN explainers on both ShapeGGen (in the Methods section) and real-world graph datasets. We explore the utility of the ShapeGGen generator to benchmark GNN explainers on graphs with homophilic vs. heterophilic, small vs. large, and fair vs. unfair ground-truth explanations. Additionally, we examine the utility of GNN explanations on datasets with varying degrees of informative node features. Next, we outline the experimental setup, including details about performance metrics, GNN explainers, and underlying GNN predictors, and proceed with a discussion of benchmarking results.

Experimental setup

GNN explainers

The GraphXAI defines an Explanation class capable of storing multiple types of explanations produced by GNN explainers and provides a graphxai.BaseExplainer class that serves as the base for all explanation methods in GraphXAI. We incorporate eight GNN explainability methods, including gradient-based: Grad²⁹, GradCAM¹¹, GuidedBP⁶, Integrated Gradients³⁰; perturbation-based: GNNExplainer¹⁴, PGExplainer¹⁰, SubgraphX³¹; and surrogate-based methods: PGMExplainer¹³. Finally, following Agarwal et al.¹⁵, we consider random explanations as a reference: (1) Random Node Features, a node feature mask defined by an d-dimensional Gaussian distributed vector; (2) Random Nodes, a 1 × n node mask is randomly sampled using a uniform distribution, where n is the number of nodes in the enclosing subgraph; and (3) Random Edges, an edge mask drawn from a uniform distribution over a node’s incident edges.

Implementation details

We use a three-layer GIN model³² and a GCN model³³ as GNN predictors for our experiments. We use a model comprising three GIN convolution layers with ReLU non-linear activation function and a Softmax activation for the final layer. The hidden dimensionality of the layers is set to 16. We follow an established approach for generating explanations^8,15 and use reference algorithm implementations. We select top-k (k = 25%) important nodes, node features, or edges and use them to generate explanations for all graph explainability methods. For training GIN models, we use an Adam optimizer with a learning rate of 1 × 10⁻², weight decay of 1 × 10⁻⁵, and the number of epochs to 1000. We use an Adam optimizer with a learning rate of 3 × 10⁻², no weight decay, and 1500 training epochs for training GNN models. We set hyperparameters of GNN explainability models following the authors’ recommendations.

We use a fixed random split provided within the GraphXAI package to split the datasets. For each ShapeGGen dataset, we use a 70/5/25 split for training, validation, and testing, respectively. For MUTAG, Benzene, and Fluoride Carbonyl datasets, we use a 70/10/20 split throughout each dataset. Average performance is reported across each sample in the testing set of each dataset.

Performance metrics

In addition to the synthetic and real-world data resources, we consider four broad categories of performance metrics: (i) Graph Explanation Accuracy (GEA); (ii) Graph Explanation Faithfulness (GEF); (iii) Graph Explanation Stability (GES); and (iv) Graph Explanation Fairness (GECF, GEGF) to evaluate the explanations on the respective datasets. In particular, all evaluation metrics leverage predicted explanations, ground-truth explanations, and other user-controlled parameters, such as top-k features. Our GraphXAI package implements these performance metrics and additional utility functions within graphxai.metrics module. Figure 3 shows a code snippet for evaluating the correctness of output explanations for a given GNN prediction in GraphXAI.

Graph explanation accuracy (GEA)

We report graph explanation accuracy as an evaluation strategy that measures an explanation’s correctness using the ground-truth explanation M^g. In ground-truth and predicted explanation masks, every node, node feature, or edge belongs to {0, 1}, where ‘0’ means that an attribute is unimportant and ‘1’ means that it is important for the model prediction. To measure accuracy, we use Jaccard index³⁴ between the ground-truth M^g and predicted M^p:

$${\rm{J}}{\rm{A}}{\rm{C}}({{\bf{M}}}^{g},{{\bf{M}}}^{p})=\frac{{\rm{T}}{\rm{P}}({{\bf{M}}}^{g},{{\bf{M}}}^{p})}{{\rm{T}}{\rm{P}}({{\bf{M}}}^{g},{{\bf{M}}}^{p})+{\rm{F}}{\rm{P}}({{\bf{M}}}^{g},{{\bf{M}}}^{p})+{\rm{F}}{\rm{N}}({{\bf{M}}}^{g},{{\bf{M}}}^{p})},$$

(1)

where TP denotes true positives, FP denotes false positives, and FN indicates false negatives. Most synthetic- and real-world graphs have multiple ground-truth explanations. For example, in the MUTAG dataset³⁵, carbon rings with both NH₂ or NO₂ chemical groups are valid explanations for the GNN model to recognize a given molecule as mutagenic. For this reason, the accuracy metric must account for the possibility of multiple equally valid explanations existing for any given prediction. Hence, we define ζ as a set of all possible ground-truth explanations, where |ζ| = 1 for graphs having a unique explanation. Therefore, we calculate GEA as:

$${\rm{GEA}}(\zeta ,{{\bf{M}}}^{p})=\max \;{\rm{JAC}}({{\bf{M}}}^{g},{{\bf{M}}}^{p})\;\forall {{\bf{M}}}^{g}\in \zeta .$$

(2)

Here, we can calculate GEA using predicted node feature, node, or edge explanation masks. Finally, Eq. 1 quantifies the accuracy between the ground-truth and predicted explanation masks. Higher values mean a predicted explanation is more likely to match the ground-truth explanation.

Graph explanation faithfulness (GEF)

We extend existing faithfulness metrics^2,15 to quantify how faithful explanations are to an underlying GNN predictor. In particular, we obtain the prediction probability vector ${\widehat{y}}_{u}$ using the GNN, i.e., ${\widehat{y}}_{u}=f({{\mathcal{S}}}_{u})$, and using the explanation, i.e., ${\widehat{y}}_{u{\prime} }=f({{\mathcal{S}}}_{u{\prime} })$, where we generate a masked subgraph ${{\mathcal{S}}}_{u{\prime} }$ by only keeping the original values of the top-k features identified by an explanation, and get their respective predictions ${\widehat{y}}_{u{\prime} }$. Finally, we compute the graph explanation unfaithfulness metric as:

$${\rm{GEF}}(f,{{\mathcal{S}}}_{u},{{\mathcal{S}}}_{u{\prime} })=1-{{\rm{\exp }}}^{-{\rm{KL}}(f({{\mathcal{S}}}_{u})| | f({{\mathcal{S}}}_{u{\prime} }))},$$

(3)

where Kullback-Leibler (KL) divergence score quantifies the distance between two probability distributions, and the “||” operator indicates statistical divergence measure. Note that Eq. 3 is a measure of the unfaithfulness of the explanation. So, higher values indicate a higher degree of unfaithfulness.

Graph explanation stability (GES)

Formally, an explanation is defined to be stable if the explanation for a given graph and its perturbed counterpart (generated by making infinitesimally small perturbations to the node feature vector and associated edges) are similar^15,36. We measure graph explanation stability w.r.t. the changes in the model behavior. In addition to similar output labels for ${{\mathcal{S}}}_{u}$ and the perturbed ${{\mathcal{S}}}_{u{\prime} }$, we employ a second level of check where the difference between model behaviors for ${{\mathcal{S}}}_{u}$ and ${{\mathcal{S}}}_{u{\prime} }$ is bounded by an infinitesimal constant $\delta $, i.e., $| | {{\mathcal{L}}}_{{{\mathcal{S}}}_{u}}-{{\mathcal{L}}}_{{{\mathcal{S}}}_{u{\prime} }}| {| }_{p}\le \delta $. Here, ${\mathcal{L}}(\cdot )$ refers to any form of model knowledge like output logits ${\widehat{y}}_{u}$ or embeddings ${{\bf{z}}}_{u}$. We compute graph explanation instability as:

$${\rm{GES}}({{\bf{M}}}_{{{\mathcal{S}}}_{u}}^{p},{{\bf{M}}}_{{{\mathcal{S}}}_{u{\prime} }}^{p})={\rm{m}}{\rm{a}}{\rm{x}}\;D({{\bf{M}}}_{{{\mathcal{S}}}_{u}}^{p},{{\bf{M}}}_{{{\mathcal{S}}}_{u{\prime} }}^{p}),\quad \forall {{\mathcal{S}}}_{u{\prime} }\in \beta ({{\mathcal{S}}}_{u})$$

(4)

where D represents the cosine distance metric, ${{\bf{M}}}_{{{\mathcal{S}}}_{u}}^{p}$ and ${{\bf{M}}}_{{{\mathcal{S}}}_{u{\prime} }}^{p}$ are the predicted explanation masks for ${{\mathcal{S}}}_{u}$ and ${{\mathcal{S}}}_{u{\prime} }$, and $\beta $ represents a $\delta $-radius ball around ${{\mathcal{S}}}_{u}$ for which the model behavior is same. Eq. 4 is a measure of instability, and higher values indicate a higher degree of instability.

Counterfactual fairness mismatch

To measure counterfactual fairness¹⁵, we verify if the explanations corresponding to ${{\mathcal{S}}}_{u}$ and its counterfactual counterpart (where the protected node feature is modified) are similar (dissimilar) if the underlying model predictions are similar (dissimilar). We calculate counterfactual fairness mismatch as:

$${\rm{GECF}}({{\bf{M}}}^{p},{{\bf{M}}}_{s}^{p})=D({{\bf{M}}}^{p},{{\bf{M}}}_{s}^{p}),$$

(5)

where, M^p and ${{\bf{M}}}_{s}^{p}$ are the predicted explanation mask for ${{\mathcal{S}}}_{u}$ and for the counterfactual counterpart of ${{\mathcal{S}}}_{u}$. Note that we expect a lower GECF score for graphs having weakly-unfair ground-truth explanations because explanations are similar for both original and counterfactual graphs, whereas, for graphs with strongly-unfair ground-truth explanations, we expect a higher GECF score as explanations change when we modify the protected attribute.

Group fairness mismatch

We measure group fairness mismatch¹⁵ as:

$${\rm{GEGF}}({\widehat{{\bf{y}}}}_{{\mathcal{K}}},{\widehat{{\bf{y}}}}_{{\mathcal{K}}}^{{{\bf{E}}}_{u}})=| {\rm{SP}}({\widehat{{\bf{y}}}}_{{\mathcal{K}}})-{\rm{SP}}({\widehat{{\bf{y}}}}_{{\mathcal{K}}}^{{{\bf{E}}}_{u}})| ,$$

(6)

where ${\widehat{{\bf{y}}}}_{{\mathcal{K}}}$ and ${\widehat{{\bf{y}}}}_{{\mathcal{K}}}^{{{\bf{E}}}_{u}}$ are predictions for a set of ${\mathcal{K}}$ graphs using the original and the essential features identified by an explanation, respectively, and SP is the statistical parity. Finally, Eq. 6 is a measure of group fairness mismatch of an explanation where higher values indicate that the explanation is not preserving group fairness.

Evaluation and analysis of GNN explainability methods

Next, we discuss experimental results that answer critical questions concerning synthetic and real-world graphs and different ground-truth explanations.

Benchmarking GNN explainers on synthetic and real-world graphs

We evaluate the performance of GNN explainers on a collection of synthetically generated graphs with various properties and molecular datasets using metrics described in the experimental setup. Results in Tables 1–5 show that, while no explanation method performs well across all properties, across different ShapeGGen node classification datasets (Table 6), SubgraphX outperforms other methods on average. In particular, SubgraphX generates 145.95% more accurate and 64.80% less unfaithful explanations than other GNN explanation methods. Gradient-based methods, such as GradCam and GuidedBP, perform the next best of all methods, with Grad producing the second-lowest unfaithfulness score and GradCAM achieving the second-highest explanation accuracy score. PGExplainer generates the least unstable explanations–35.35% less unstable explanations than the average instability across other GNN explainers. In summary, results of Tables 1–5 show that node explanation masks are more reliable than edge- and node feature explanation masks and state-of-the-art GNN explainers achieve better faithfulness for synthetic graph datasets as compared to real-world graphs.

Table 1 Evaluation of GNN explainers on SG-Base graph dataset based on node explanation masks ${{\bf{M}}}_{N}^{p}$. Arrows (↑/↓) indicate the direction of better performance.

Full size table

Table 2 Evaluation of GNN explainers on SG-Base graph dataset based on node explanation masks ${{\bf{M}}}_{N}^{p}$.

Full size table

Table 3 Evaluation of GNN explainers on SG-Base graph dataset based on edge explanation masks ${{\bf{M}}}_{E}^{p}$.

Full size table

Table 4 Evaluation of GNN explainers on SG-Base graph dataset based on node feature explanation masks ${{\bf{M}}}_{F}^{p}$.

Full size table

Table 5 Evaluation of GNN explainers for real-world molecular datasets with ground-truth explanations.

Full size table

Table 6 Statistics of graphs generated using ShapeGGen data generator for evaluating different properties of GNN explanations.

Full size table

Analyzing homophilic vs. heterophilic ground-truth explanations

We compare GNN explainers by generating explanations on GNN models trained on homophilic and heterophilic graphs generated using the SG-Heterophilic generator. Then, we compute the graph explanation unfaithfulness scores of output explanations generated using state-of-the-art GNN explainers. We find that GNN explainers produce 55.98% more faithful explanations when ground-truth explanations are homophilic than when ground-truth explanations are heterophilic (i.e., low unfaithfulness scores for light green bars in Fig. 4). These results reveal an important gap in existing GNN explainers. Namely, existing GNN explainers fail to perform well on diverse graph types, such as homophilic, heterophilic and attributed graphs. This observation, enabled by the flexibility of ShapeGGen generator, highlights an opportunity for future algorithmic innovation in GNN explainability.

Analyzing the reliability of graph explainers to smaller vs. larger ground-truth explanations

Next, we examine the reliability of GNN explainers when used to predict explanations for models trained on graphs generated using the SG-SmallEx graph generator. Results in Fig. 5 show that explanations from existing GNN explainers are faithful (i.e., lower GEF scores) to the underlying GNN models when ground-truth explanations are smaller, i.e., S = ‘triangle’. On average, across all eight GNN explainers, we find that existing GNN explainers are highly unfaithful to graphs with large ground-truth explanations with an average GEF score of 0.7476. Further, we observe that explanations generated on ‘triangular’ (smaller) ground-truth explanations are 59.98% less unfaithful than explanations for ‘house’ (larger) ground-truth explanations (i.e., low unfaithfulness scores for light purple bars in Fig. 5). However, the Grad explainer, on average, achieves 9.33% lower unfaithfulness on large ground-truth explanations compared to other explanation methods. This limited behavior of existing GNN explainers has not been previously known and highlights an urgent need for additional analysis of GNN explainers.

Examining fair vs. unfair ground-truth explanations

To measure the fairness of predicted explanations, we train GNN models on SG-Unfair, which generates graphs with controllable fairness properties. Next, we compute the average GECF and GEGF values for predicted explanations from eight GNN explainers. The fairness results in Fig. 6 show that GNN explainers do not preserve counterfactual fairness and are highly prone to producing unfair explanations. We note that for weakly-unfair ground-truth explanations (light red in Fig. 6), explanations M^p should not change as the label-generating process is independent of the protected attribute. Still, we observe high GECF scores for most explanation methods. For strongly-unfair ground-truth explanations, we find that explanations from most GNN explainers fail to capture (i.e., low GECF scores for dark red bars in Fig. 6) the unfairness enforced using the protected attribute and generate similar explanations even when we flip/modify the respective protected attribute. We see that GradCAM and PGEx explanations outperform other GNN explainers in preserving counterfactual explanations for weakly-unfair ground-truth explanations. In contrast, the PGMEx explainer preserves counterfactual fairness better than other explanation methods on strongly-unfair ground truth explanations. Our results highlight the importance of studying fairness in XAI as they can enhance a user’s confidence in the model and assist in detecting and correcting unwanted bias.

Faithfulness shift with varying degrees of node feature information

Using ShapeGGen’s support for node features and ground-truth explanations on those features, we evaluate explainers that generate explanations for node features. Results for node feature explanations on SG-Base are given in Table 4. In addition, we explore the performance of explainers under varying proportions of informative node features. Informative node features, defined in the ShapeGGen construction (Algorithm 1), are node features correlated with the label of a given node, as opposed to redundant features, which are sampled randomly from a Gaussian distribution. Figure 7 shows the results of experiments on three datasets, SG-MoreInform, SG-Base, and SG-LessInform. All datasets have similar graph topology, but SG-MoreInform has a higher proportion of informative features while SG-LessInform has a lower proportion of these informative features. SG-Base is used as a baseline with a proportion of informative features greater than SG-LessInform but less than SG-MoreInform. There are minimal differences between explainers’ faithfulness across datasets, however, unfaithfulness tends to increase with fewer informative node features. As seen in Table 4, the Gradient explainer shows the best faithfulness score across all datasets for node feature explanation. Still, this faithfulness is relatively weak, only 0.001 better than random explanation. These results show that the faithfulness of node feature explanations worsens under sparse node feature signals.

Visualization results

GraphXAI provides functions that visualize explanations produced by GNN explainability methods. Users can compare both node- and graph-level explanations. In addition, function implementations for visualization are parameterized, allowing users to change colors and weight interpretation. Functions are compatible with matplotlib³⁷ and networkx³⁸. Visualizations are generated by graphxai.Explanation.visualize_node for node-level explanations and graphxai.Explanation.visualize_graph functions for graph-level explanations. In Fig. 8, we show the output explanation from four different GNN explainers as produced by our visualization function. Figure 9 shows example outputs from multiple explainers in the GraphXAI package on a ShapeGGen -generated dataset.

Discussion

GraphXAI provides a general-purpose framework to evaluate GNN explanations produced by state-of-the-art GNN explanation methods. GraphXAI provides data loaders, data processing functions, visualizers, real-world graph datasets with ground-truth explanations, and evaluation metrics to benchmark the quality of GNN explanations. GraphXAI introduces a novel and flexible synthetic dataset generator called ShapeGGen to automatically generate benchmark datasets and corresponding ground truth explanations robust against known pitfalls of GNN explainability methods. Our experimental results show that existing GNN explainers perform well on graphs with homophilic ground-truth explanations but perform considerably worse on heterophilic and attributed graphs. Across multiple graph datasets and types of downstream prediction tasks, we show that existing GNN explanation methods fail on graphs with larger ground-truth explanations and cannot generate explanations that preserve the fairness properties of the underlying GNN model. In addition, GNN explainers tend to underperform on sparse node feature signals compared to more densely informative node features. These findings indicate the need for methodological innovation and a thorough analysis of future GNN explainability methods across performance dimensions.

GraphXAI provides a flexible framework for evaluating GNN explanation methods and promotes reproducible and transparent research. We maintain GraphXAI as a centralized library for evaluating GNN explanation methods and plan to add newer datasets, explanation methods, diverse evaluation metrics, and visualization features to our existing framework. In the current version of GraphXAI, we mostly employ real-world molecular chemistry datasets as the need for model understanding is motivated by experimental evaluation of model predictions in the laboratory, and it includes a wide variety of graph sizes (ranging from 1,768 to 12,000 instances in the dataset), node feature dimensions (ranging from 13 to 27 dimensions), and class imbalance ratios. In addition to the scale-free ShapeGGen dataset generator, we will include other random graph model generators in the next GraphXAI release to support benchmarking of GNN explainability methods on other graph types. Our evaluation metrics can be also extended to explanations from self-explaining GNNs, e.g., GraphMASK¹² identifies edges at each layer of the GNN during training that can be ignored without affecting the output model predictions. In general, self-explaining GNNs also return a set of edge masks as an output explanation for a GNN prediction that can be converted to edge importance scores for computing GraphXAI metrics. We anticipate that GraphXAI can help algorithm developers and practitioners in graph representation learning develop and evaluate principled explainability techniques.

Methods

ShapeGGen is a key component of GraphXAI and serves as a synthetic dataset generator of XAI-ready graph datasets. It is founded in graph theory and designed to address the pitfalls (see Introduction) of existing graph datasets in the broad area of explainable AI. As such, ShapeGGen can facilitate the development, analysis, and evaluation of GNN explainability methods (see Results). We proceed with the description of ShapeGGen data generator.

Notation

Graphs

Let ${\mathcal{G}}=\left({{\mathcal{V}}}_{{\mathcal{G}}},{{\mathcal{E}}}_{{\mathcal{G}}},{{\bf{X}}}_{{\mathcal{G}}}\right)$ denote an undirected graph comprising of a set of nodes ${{\mathcal{V}}}_{{\mathcal{G}}}$ and a set of edges ${{\mathcal{E}}}_{{\mathcal{G}}}$. Let ${{\bf{X}}}_{{\mathcal{G}}}=\left\{{{\bf{x}}}_{1},{{\bf{x}}}_{2},\ldots ,{{\bf{x}}}_{N}\right\}$ denote the set of node feature vectors for all nodes in ${{\mathcal{V}}}_{{\mathcal{G}}}$, where ${{\bf{x}}}_{v}\in {{\bf{X}}}_{{\mathcal{G}}}$ is an d-dimensional vector which captures the attribute values of a node v and $N=\left|{{\mathcal{V}}}_{{\mathcal{G}}}\right|$ denotes the number of nodes in the graph. Let ${\bf{A}}\in {{\mathbb{R}}}^{N\times N}$ be the graph adjacency matrix where element A_uv = 1 if there exists an edge $e\in {{\mathcal{E}}}_{{\mathcal{G}}}$ between nodes u and v and A_uv = 0 otherwise. We use ${{\mathcal{N}}}_{u}$ to denote the set of immediate neighbors of node u, ${{\mathcal{N}}}_{u}=\{v\in {{\mathcal{V}}}_{{\mathcal{G}}}| {A}_{uv}=1\}$. Finally, the function ${\rm{\deg }}:{{\mathcal{V}}}_{{\mathcal{G}}}\mapsto {{\mathbb{Z}}}_{\ge 0}$ is defined as ${\rm{\deg }}(u)=\left|{{\mathcal{N}}}_{u}\right|$ and outputs the degree of a node $u\in {{\mathcal{V}}}_{{\mathcal{G}}}$.

Graph neural networks

Most GNNs can be formulated as message passing networks³⁹ using three operators: Msg, Agg, and Upd. In a L-layer GNN, these operators are recursively applied on ${\mathcal{G}}$, specifying how neural messages (i.e. embeddings) are exchanged between nodes, aggregated, and transformed to arrive at node representations in the last layer of transformations. Formally, a message between a pair of nodes (u, v) in layer l is defined as a function of hidden representations of nodes ${{\bf{h}}}_{u}^{l-1}$ and ${{\bf{h}}}_{v}^{l-1}$ from the previous layer: ${{\bf{m}}}_{uv}^{l}=Msg({{\bf{h}}}_{u}^{l-1},{{\bf{h}}}_{v}^{l-1})$. In Agg, messages from all nodes $v\in {{\mathcal{N}}}_{u}$ are aggregated as: ${{\bf{m}}}_{u}^{l}=Agg({{\bf{m}}}_{uv}^{l}| v\in {{\mathcal{N}}}_{u})$. In Upd, the aggregated message ${{\bf{m}}}_{u}^{l}$ is combined with ${{\bf{h}}}_{u}^{l-1}$ to produce u’s representation for layer l as ${{\bf{h}}}_{u}^{l}=Upd({{\bf{m}}}_{u}^{l},{{\bf{h}}}_{u}^{l-1})$. Final node representation ${{\bf{z}}}_{u}={{\bf{h}}}_{u}^{L}$ is the output of the last layer. Lastly, let f denote a downstream GNN classification model that maps the node representation z_u to a softmax prediction vector ${\widehat{y}}_{u}\in {{\mathbb{R}}}^{C}$, where C is the total number of labels.

GNN explainability methods

Given the prediction f(u) for node u made by a GNN model, a GNN explainer ${\mathcal{O}}$ outputs an explanation mask M^p that provides an explanation of f(u). These explanations can be given with respect to node attributes ${{\bf{M}}}_{a}\in {{\mathbb{R}}}^{d}$, nodes ${{\bf{M}}}_{n}\in {{\mathbb{R}}}^{N}$, or edges ${{\bf{M}}}_{e}\in {{\mathbb{R}}}^{N\times N}$, depending on specific GNN explainer, such as GNNExplainer¹⁴, PGExplainer¹⁰, and SubgraphX³¹. For all explanation methods, we use a graph masking function MASK that outputs a new graph ${\mathcal{G}}{\prime} =\left({{\mathcal{V}}{\prime} }_{{\mathcal{G}}{\prime} },{{\mathcal{E}}{\prime} }_{{\mathcal{G}}{\prime} },{{\bf{X}}{\prime} }_{{\mathcal{G}}{\prime} }\right)$ by performing element-wise multiplication operation between the masks (M_a, M_n, M_e) and their respective attributes in the original graph ${\mathcal{G}}$, e.g. A′ = A ⊙ M_e. Finally, we denote the ground-truth explanation mask as M^g that is used to evaluate the performance of GNN explainers.

ShapeGGen dataset generator

We propose a novel and flexible synthetic dataset generator called ShapeGGen that can automatically generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. Furthermore, the flexibility to generate diverse synthetic datasets and corresponding ground-truth explanations allows us to mimic the data generated by various real-world applications. ShapeGGen is a generator of XAI-ready graph datasets supported by graph theory and is particularly suitable for benchmarking GNN explainers and studying their limitations.

Flexible parameterization of ShapeGGen

ShapeGGen has tunable parameters for data generation. By varying these parameters, ShapeGGen can generate diverse types of graphs, including graphs with varying degrees of class imbalance and graph sizes. Formally, a graph is generated as: ${\mathcal{G}}$ = ShapeGGen $({\mathcal{S}},{N}_{s},p,{n}_{s},K,{n}_{f},{n}_{i},{s}_{f},{c}_{f},\varphi ,\eta ,L)$, where:

S is the motif, defined as a subgraph, to be planted within the graph.
N_s denotes the number of subgraphs used in the initial graph generation process.
p represents the connection probability between two subgraphs and controls the average shortest path length for all possible pairs of nodes. Ideally, a smaller p value for larger N_s is preferred to avoid low average path length and, therefore, the poor performance of GNNs.
n_s is the expected size of each subgraph in the ShapeGGen generation procedure. For a fixed S shape, large n_s values produce graphs with long-tailed degree distributions. Note that the expected total number of nodes in the generated graph is N × n_s.
K is the number of distinct classes defined in the graph downstream task.
n_f represents the number of dimensions for node features in the generated graph.
n_i is the number of informative node features. These are features correlated to the node labels instead of randomly generated non-informative features. The indices for the informative features define the ground-truth explanation mask for node features in the final ShapeGGen instance. In general, larger n_i results in an easier classification and explanation task, as it increases the node feature-level ground-truth explanation size.
s_f is defined as the class separation factor that represents the strength of discrimination of class labels between features for each node. Higher s_f corresponds to a stronger signal in the node features, i.e., if a classifier is trained only on the node features, a higher s_f value would result in an easier classification task.
c_f is the number of clusters per class. A larger c_f value increases the difficulty of the classification task with respect to node features.
${\varphi }\in [0,1]$ is the protected feature noise factor that controls the strength of correlation r between the protected features and the node labels. This value corresponds to the probability of “flipping” the protected feature with respect to the node’s label. For instance, ${\varphi }=0.5$ results in zero correlation (r = 0) between the protected feature and the label (i.e. complete fairness), ${\varphi }=0$ results in a positive correlation (r = 1), and ${\varphi }=1$ results in a negative correlation (r = −1) between the label and the protected feature.
η is the homophily coefficient that controls the strength of homophily or heterophily in the graph. Positive values (η > 0) produce a homophilic graph while negative values (η < 0) produce a heterophilic graph.
L is the number of layers in the GNN predictor corresponding to the GNN’s receptive field. For the purposes of ShapeGGen, L is used to define the size of the GNN’s receptive field and thus the size of the ground-truth explanation generated for each node.

This wide array of parameters for ShapeGGen allows for the generation of graph instances with vastly differing properties Fig. 10.

Generating graph structure

Figure 2 summarizes the process to generate a graph ${\mathcal{G}}=\left({{\mathcal{V}}}_{{\mathcal{G}}},{{\mathcal{E}}}_{{\mathcal{G}}},{{\bf{X}}}_{{\mathcal{G}}}\right)$. ShapeGGen generates N_s subgraphs that exhibit the preferential attachment property⁴⁰, which occurs in many real-world graphs. Each subgraph is first given a motif, i.e., a subgraph ${\mathcal{S}}=\left({{\mathcal{V}}}_{{\mathcal{S}}},{{\mathcal{E}}}_{{\mathcal{S}}},{{\bf{X}}}_{{\mathcal{S}}}\right)$. A preferential attachment algorithm is then performed on base structure ${\mathcal{S}}$, adding $n{\prime} $ ($n{\prime} \sim {\rm{Poisson}}(\lambda ={n}_{s}-| {{\mathcal{V}}}_{{\mathcal{S}}}| )$) nodes by creating edges to nodes in ${{\mathcal{V}}}_{{\mathcal{S}}}$. The Poisson distribution is used for determining the sizes of each subgraph used in the generation process, with $\lambda ={n}_{s}-| {{\mathcal{V}}}_{{\mathcal{S}}}| $, the difference between the number of nodes in the motif and the expected subgraph size ${n}_{s}$. After creating a list of randomly-generated subgraphs ${\bf{S}}=\{{{\mathcal{S}}}_{1},\ldots ,{{\mathcal{S}}}_{N}\}$, edges are created to connect subgraphs, creating the structure of an instance of ShapeGGen. Subgraph connections and local subgraph construction is performed in such a way that each node in the final graph ${\mathcal{G}}$ has between 1 and K motifs within its neighborhood, i.e., $\left|{\bigcup }_{i=1}^{N}{{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}\cap {{\mathcal{N}}}_{v}\right|\;\in \;\left\{1,2,...,K\right\}$ for any v and ${{\mathcal{S}}}_{i}$. This naturally defines a classification task in the domain of f to {0, 1, …, K−1}. More details on the ShapeGGen structure generation can be found in Algorithm 2.

Generating labels for prediction

A motif defined as a subgraph ${\mathcal{S}}=\left({{\mathcal{V}}}_{{\mathcal{S}}},{{\mathcal{E}}}_{{\mathcal{S}}},{{\bf{X}}}_{{\mathcal{S}}}\right)$ occurs randomly throughout ${\mathcal{G}}$, with the set ${\bf{S}}=\{{{\mathcal{S}}}_{1},\ldots ,{{\mathcal{S}}}_{{N}_{S}}\}$. The task on this graph is a motif detection problem, where each node has exactly 1, 2, or K motifs in its 1-hop neighborhood. A motif ${{\mathcal{S}}}_{i}$ is considered to be within the neighborhood of a node $v\in {{\mathcal{V}}}_{{\mathcal{G}}}$ if any node $s\in {{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}$ is also in the neighborhood of v, i.e., if $| {{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}\cap {{\mathcal{N}}}_{v}| > 0$. Therefore, the task that a GNN predictor needs to solve is defined by:

$$f(v\in {{\mathcal{V}}}_{{\mathcal{G}}})=\sum _{{{\mathcal{S}}}_{i}\in {\bf{S}}}{{\mathbb{1}}}_{{{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}}({{\mathcal{N}}}_{v})-1,$$

(7)

where ${{\mathbb{1}}}_{{{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}}\left({{\mathcal{N}}}_{v}\right)=0$ if $| {{\mathcal{V}}}_{{{\mathcal{S}}}_{i}}\cap {{\mathcal{N}}}_{v}| =0$ and 1 otherwise.

Generating node feature vectors

ShapeGGen uses a latent variable model to create node feature vectors and associate them with network structure. This latent variable model is based on that developed by Guyon⁴¹ for the MADELON dataset and implemented in Scikit-Learn’s make_classification function⁴². The latent variable model creates n_i informative features for each node based on the node’s generated label and also creates non-informative features as noise. Having non-informative/redundant features allows us to evaluate GNN explainers, such as GNNExplainer¹⁴, that formulate explanations as node feature masks. More detail on node feature generation is given in Algorithm 1.

ShapeGGen can generate graphs with both homophilic and heterophilic ground-truth explanations. We optimize between homophily vs. heterophily by taking advantage of redundant node features, i.e., features that do not associate with the generated labels, and manipulate them appropriately based on a user-specified homophily parameter η. The magnitude of the η parameter determines the degree of homophily/heterophily in the generated graph. The algorithm for node features is given in Algorithm 3. While not every node feature in the feature vector is optimized for homophily/heterophily indication, we experimentally verified the cosine similarity between node feature vectors produced by Algorithm 3 reveals a strong homophily/heterophily pattern. Finally, ShapeGGen can generate protected features to enable the study of fairness¹. By controlling the value assignment for a selected discrete node feature, ShapeGGen introduces bias between the protected feature and node labels. The biased feature is a proxy for a protected feature, such as gender or race. The procedure for generating node features is outlined in NodeFeatureVectors within Algorithm 1.

Generating ground-truth explanations

In addition to generating ground-truth labels, ShapeGGen has a unique capability to generate unique ground-truth explanations. To accommodate diverse types of GNN explainers, every ground-truth explanation in ShapeGGen contains information on a) the importance of nodes, b) the importance of node features, and c) the importance of edges. This information is represented by three distinct masks defined over enclosing subgraphs of nodes $v\in {{\mathcal{V}}}_{{\mathcal{S}}}$, i.e., the L-hop neighborhood around the node. We denote the enclosing subgraph of node $v\in {{\mathcal{V}}}_{{\mathcal{S}}}$ for a given GNN predictor with L layers as: ${\rm{Sub}}\left(v;L\right)=\left({{\mathcal{V}}}_{{\rm{Sub}}(v)},{{\mathcal{E}}}_{{\rm{Sub}}(v)},{{\bf{X}}}_{{\rm{Sub}}(v)}\right)\subseteq {\mathcal{G}}$. Let motifs within this enclosing subgraph be: ${{\bf{S}}}_{v}=\left({{\mathcal{V}}}_{{{\bf{S}}}_{v}},{{\mathcal{E}}}_{{{\bf{S}}}_{v}},{{\bf{X}}}_{{{\bf{S}}}_{v}}\right)={\bf{S}}\cap {\rm{Sub}}(v)$. Using this notation, we define ground-truth explanation masks:

a)
Node explanation mask. Nodes in Sub(v) are assigned a value of 0 or 1 based on whether they are located within a motif or not. For any node ${v}_{i}\in {{\mathcal{V}}}_{{\rm{Sub}}(v)}$, we set ${{\mathbb{1}}}_{{{\mathcal{V}}}_{{{\bf{S}}}_{v}}}({v}_{i})=1$ if ${v}_{i}\in {{\mathcal{V}}}_{{{\bf{S}}}_{v}}$ and 0 otherwise. This function ${{\mathbb{1}}}_{{{\mathcal{V}}}_{{{\bf{S}}}_{v}}}$ is applied to all nodes in the enclosing subgraph of v to produce an importance score for each node, yielding the final mask as: ${{\bf{M}}}_{n}=\{{{\mathbb{1}}}_{{{\mathcal{V}}}_{{{\bf{S}}}_{v}}}(u)| u\in {{\mathcal{V}}}_{{\rm{Sub}}(v)}\}$.
b)
Node feature explanation mask. Each feature in v’s feature vector is labeled as 1 if it represents an informative feature and 0 if it is a random feature. This procedure produces a binary node feature mask for node v as: ${{\bf{M}}}_{f}\in {\{0,1\}}^{d}$.
c)
Edge explanation mask. To each $e=\left({v}_{i},{v}_{j}\right)\in {{\mathcal{E}}}_{{\rm{Sub}}(v)}$ we assign a value of either 0 or 1 based on whether e connects any two nodes in Sub(v). The masking function is defined as follows:

$$\begin{array}{ccc}{{\mathbb{1}}}_{{\varepsilon }_{{{\bf{S}}}_{v}}}(e) & = & \left\{\begin{array}{cc}0 & {\rm{if}}\;{v}_{{\rm{i}}}\notin {{\mathcal{V}}}_{{{\bf{S}}}_{{\rm{v}}}}\cup {\rm{\{}}v{\rm{\}}}\vee {v}_{{\rm{j}}}\notin {{\mathcal{V}}}_{{{\bf{S}}}_{v}}\cup {\rm{\{}}v{\rm{\}}}\\ 1 & {\rm{if}}\;{v}_{{\rm{i}}}\in {{\mathcal{V}}}_{{{\bf{S}}}_{{\rm{v}}}}\cup {\rm{\{}}v{\rm{\}}}\wedge {v}_{{\rm{j}}}\in {{\mathcal{V}}}_{{{\bf{S}}}_{{\rm{v}}}}\cup {\rm{\{}}v{\rm{\}}}\end{array}\right.\end{array}$$

(8)

This function ${{\mathbb{1}}}_{{{\mathcal{E}}}_{{{\bf{S}}}_{v}}}$ is applied to all edges $e\in {{\mathcal{E}}}_{{\rm{Sub}}(v)}$ to produce ground-truth edge explanation as: ${{\bf{M}}}_{e}=\{{{\mathbb{1}}}_{{{\mathcal{E}}}_{{{\bf{S}}}_{v}}}(e)| e\in {{\mathcal{E}}}_{{\rm{Sub}}(v)}\}$. The procedure to generate these ground-truth explanations is thoroughly described in Algorithm 1.

Datasets in GraphXAI

We proceed with a detailed description of synthetic and real-world graph data resources included in GraphXAI.

Synthetic graphs

The ShapeGGen generator outlined in the Methods section is a dataset generator that can be used to generate any number of user-specified graphs. In GraphXAI, we provide a collection of pre-generated XAI-ready graphs with diverse properties that are readily available for analysis and experimentation.

Base ShapeGGen graphs (SG-Base)

We provide a base version of ShapeGGen graphs. This instance of ShapeGGen is homophilic, large, and contains house-shaped motifs for ground-truth explanations, formally described as ${\mathcal{G}}$ = ShapeGGen (${\mathcal{S}}$ = ‘house’, N_s = 1200, p = 0.006, n_s = 11, K = 2, n_f = 11, n_i = 4, s_f = 0.6, c_f = 2, ϕ = 0.5, η = 1, L = 3). The node features in this graph exhibit homophily, a property commonly found in social networks. With over 10,000 nodes, this graph also provides enough examples of ground-truth explanations for rigorous statistical evaluation of explainer performance. The house-shaped motifs follow one of the earliest synthetic graphs used to evaluate GNN explainers¹⁴.

Homophilic and heterophilic ShapeGGen graphs

GNN explainers are evaluated on homophilic graphs^1,43,44,45 as low homophily levels in graphs can degrade the performance of GNN predictors^46,47. To this end, there are no heterophilic graphs with ground-truth explanations in current GNN XAI literature despite such graphs being abundant in real-world applications⁴⁶. To demonstrate the flexibility of the ShapeGGen data generator, we use it to generate graphs with: i) homophilic ground-truth explanations (SG-Base) and ii) heterophilic ground-truth explanations (SG-Heterophilic), i.e., ${\mathcal{G}}$ = ShapeGGen (${\mathcal{S}}$ = ‘house’, N_s = 1200, p = 0.006, n_s = 11, K = 2, n_f = 11, n_i = 4, s_f = 0.6, c_f = 2, ϕ = 0.5, η = −1, L = 3).

Weakly and strongly unfair ShapeGGen graphs

We utilize the ShapeGGen data generator to generate graphs with controllable fairness properties, i.e., leverage ShapeGGen to generate synthetic graphs with real-world fairness properties where we can enforce unfairness w.r.t. a given protected attribute. We use the ShapeGGen to generate graphs with: i) weakly-unfair ground-truth explanations (SG-Base) and ii) strongly-unfair ground-truth explanations (SG-Unfair) ${\mathcal{G}}$ = ShapeGGen (${\mathcal{S}}$ = ‘house’, N_s = 1200, p = 0.006, n_s = 11, K = 2, n_f = 11, n_i = 4, s_f = 0.6, c_f = 2, ϕ = 0.75, η = 1, L = 3). Here, for the first time, we generated unfair synthetic graphs that can serve as pseudo-ground-truth for quantifying whether current GNN explainers preserve counterfactual fairness.

Small and large ShapeGGen explanations

We explore the faithfulness of explanations w.r.t. different ground-truth explanation sizes. This is important because a reliable explanation should identify important features independent of the explanation size. However, current data resources only provide graphs with smaller-size ground-truth explanations. Here, we use the ShapeGGen data generator to generate graphs having (i) smaller ground-truth explanations size (SG-SmallEx) ${\mathcal{G}}$ = ShapeGGen (${\mathcal{S}}$ = ‘triangle’, N_s = 1200, p = 0.006, n_s = 12, K = 2, n_f = 11, n_i = 4, s_f = 0.5, c_f = 2, ϕ = 0.5, η = 1, L = 3) and (ii) larger ground-truth explanations size using house motifs.

Low and high proportions of salient features

We examine the faithfulness of node feature masks produced by explainers under different levels of sparsity for class-associated signal in the node features. The feature generation procedure in ShapeGGen specifies node feature parameters n_i, the number of informative features that are generated to correlate with node labels, and n_f, number of total node features. The remaining n_f−n_i features are redundant features that are randomly distributed and have no correlation to the node label. Using an equivalent graph topology to SG-Base, we change the relative proportion of node features which are attributed to the label by adjusting n_i and n_f. We create SG-MoreInform, a ShapeGGen instance with a high proportion of ground-truth features to total features (8:11). Likewise, we create SG-LessInform, a ShapeGGen instance with a low proportion of ground-truth features to total features (4:21). This proportion in SG-Base falls in the middle of SG-MoreInform and SG-LessInform instances with a proportion of ground-truth to total features of 4:11. Formally, we define SG-MoreInform as ${\mathcal{G}}$ = ShapeGGen (${\mathcal{S}}$ = ‘house’, N_s = 1200, p = 0.006, n_s = 11, K = 2, n_f = 11, n_i = 8, s_f = 0.6, c_f = 2, ϕ = 0.5, η = 1, L = 3) and SG-LessInform as ${\mathcal{G}}$ = ShapeGGen (${\mathcal{S}}$ = ‘house’, N_s = 1200, p = 0.006, n_s = 11, K = 2, n_f = 21, n_i = 4, s_f = 0.6, c_f = 2, ϕ = 0.5, η = 1, L = 3).

BA-Shapes

In addition to ShapeGGen, we provide a version of BA-Shapes¹⁴, a synthetic graph data generator for node classification tasks. We start with a base Barabasi-Albert (BA)⁴⁸ graph using N nodes and a set of five-node “house”-structured motifs K randomly attached to nodes of the base graph. The final graph is perturbed by adding random edges. The nodes in the output graph are categorized into two classes corresponding to whether the node is in a house (1) or not in a house (0).

Real-world graphs

We describe the real-world graph datasets with and without ground-truth explanations provided in GraphXAI. To this end, we provide data resources from crime forecasting, financial lending, and molecular chemistry and biology^1,2,35. We consider these datasets as they are used to train GNNs for generating predictions in high-stakes downstream applications. In particular, we include chemical and biological datasets because they are used to identify whether an input graph (i.e., a molecular graph) contains a specific pattern (i.e., a chemical group with a specific property in the molecule). Knowledge about such patterns, which determine molecular properties, represents ground-truth explanations². We provide a statistical description of real-world graphs in Tables 6–8. Below, we discuss the details of each of the real-world datasets that we employ:

Table 7 Statistics of real-world graph classification datasets in GraphXAI with ground-truth (GT) explanations.

Full size table

Table 8 Statistics of real-world node classification datasets in GraphXAI without ground-truth (GT) explanations.

Full size table

MUTAG

The MUTAG³⁵ dataset contains 1,768 graph molecules labeled into two different classes according to their mutagenic properties, i.e., effect on the Gram-negative bacterium S. Typhimuriuma. Kazius et al.³⁵ identifies several toxicophores - motifs in the molecular graph - that correlate with mutagenicity. The dataset is trimmed from its original 4,337 graphs to 1,768, based on those whose labels directly correspond to the presence or absence of our chosen toxicophores: NH₂, NO₂, aliphatic halide, nitroso, and azo-type (terminology, as referred to in Kazius et al.³⁵).

Alkane carbonyl

The Alkane Carbonyl² dataset contains 1,125 molecular graphs labeled into two classes where a positive sample indicates a molecule that contains an unbranched alkane and a carbonyl (C = O) functional group. The ground-truth explanations include any combinations of alkane and carbonyl functional groups within a given molecule.

Benzene

The Benzene² dataset contains 12,000 molecular graphs extracted from the ZINC15⁴⁹ database and labeled into two classes where the task is to identify whether a given molecule has a benzene ring or not. Naturally, the ground truth explanations are the nodes (atoms) comprising the benzene rings, and in the case of multiple benzenes, each of these benzene rings forms an explanation.

Fluoride carbonyl

The Fluoride Carbonyl² dataset contains 8,671 molecular graphs labeled into two classes where a positive sample indicates a molecule that contains a fluoride (F⁻) and a carbonyl (C = O) functional group. The ground-truth explanations consist of combinations of fluoride atoms and carbonyl functional groups within a given molecule.

German credit

The German Credit¹ graph dataset contains 1,000 nodes representing clients in a German bank connected based on the similarity of their credit accounts. The dataset includes demographic and financial features like gender, residence, age, marital status, loan amount, credit history, and loan duration. The goal is to associate clients with credit risk.

Recidivism

The Recidivism¹ dataset includes samples of bail outcomes collected from multiple state courts in the USA between 1990–2009. It contains past criminal records, demographic attributes, and other demographic details of 18,876 defendants (nodes) who got released on bail at the U.S. state courts. Defendants are connected based on the similarity of past criminal records and demographics, and the goal is to classify defendants into bail vs. no bail.

Credit defaulter

The Credit defaulter¹ graph has 30,000 nodes representing individuals that we connected based on the similarity of their spending and payment patterns. The dataset contains applicant features like education, credit history, age, and features derived from their spending and payment patterns. The task is to predict whether an applicant will default on an upcoming credit card payment.

Data availability

GraphXAI data resource²⁸ is hosted on Harvard Dataverse under a persistent identifier https://doi.org/10.7910/DVN/KULOS8. We have deposited different a number of ShapeGGen -generated datasets and real-world graphs at this repository.

Code availability

Project website for GraphXAI is at https://zitniklab.hms.harvard.edu/projects/GraphXAI. The code to reproduce results, documentation, and tutorials are available in GraphXAI ‘s Github repository at https://github.com/mims-harvard/GraphXAI. The repository contains Python scripts to generate and evaluate explanations using performance metrics and also visualize explanationa. In addition, the repository contains information and Python scripts to build new versions of GraphXAI as the underlying primary resources get updated and new data become available.

References

Agarwal, C., Lakkaraju, H. & Zitnik, M. Towards a unified framework for fair and stable graph representation learning. In UAI (2021).
Sanchez-Lengeling et al. Evaluating attribution for graph neural networks. NeurIPS (2020).
Giunchiglia, V., Shukla, C. V., Gonzalez, G. & Agarwal, C. Towards training GNNs using explanation directed message passing. In The First Learning on Graphs Conference (2022).
Morselli Gysi, D. et al. Network medicine framework for identifying drug-repurposing opportunities for covid-19. Proceedings of the National Academy of Sciences (2021).
Zitnik, M., Agrawal, M. & Leskovec, J. Modeling polypharmacy side effects with graph convolutional networks. In Bioinformatics (2018).
Baldassarre, F. & Azizpour, H. Explainability techniques for graph convolutional networks. In ICML Workshop on LRGR (2019).
Faber, L. et al. Contrastive graph neural network explanation. In ICML Workshop on Graph Representation Learning and Beyond (2020).
Huang, Q., Yamada, M., Tian, Y., Singh, D. & Chang, Y. Graphlime: Local interpretable model explanations for graph neural networks. IEEE Transactions on Knowledge and Data Engineering (2022).
Lucic, A., Ter Hoeve, M. A., Tolomei, G., De Rijke, M. & Silvestri, F. Cf-gnnexplainer: Counterfactual explanations for graph neural networks. In AISTATS (PMLR, 2022).
Luo, D. et al. Parameterized explainer for graph neural network. In NeurIPS (2020).
Pope, P. E., Kolouri, S., Rostami, M., Martin, C. E. & Hoffmann, H. Explainability methods for graph convolutional neural networks. In CVPR (2019).
Schlichtkrull, M. S., De Cao, N. & Titov, I. Interpreting graph neural networks for nlp with differentiable edge masking. In ICLR (2021).
Vu, M. N. & Thai, M. T. PGM-Explainer: probabilistic graphical model explanations for graph neural networks. In NeurIPS (2020).
Ying, R., Bourgeois, D., You, J., Zitnik, M. & Leskovec, J. GNNExplainer: generating explanations for graph neural networks. In NeurIPS (2019).
Agarwal, C. et al. Probing GNN explainers: A rigorous theoretical and empirical analysis of GNN explanation methods. In AISTATS (2022).
Faber, L., K. Moghaddam, A. & Wattenhofer, R. When comparing to ground truth is wrong: On evaluating GNN explanation methods. In KDD (2021).
Hu, W. et al. Open Graph Benchmark: datasets for machine learning on graphs. In NeurIPS (2020).
Baruah, T. et al. GNNMark: a benchmark suite to characterize graph neural network training on gpus. In ISPASS (2021).
Du, Y. et al. GraphGT: machine learning datasets for graph generation and transformation. In NeurIPS Datasets and Benchmarks (2021).
Freitas, S. et al. A large-scale database for graph representation learning. In NeurIPS Datasets and Benchmarks (2021).
Zheng, Q. et al. Graph robustness benchmark: Benchmarking the adversarial robustness of graph machine learning. In NeurIPS Datasets and Benchmarks (2021).
Huang, K. et al. Therapeutics Data Commons: Machine learning datasets and tasks for drug discovery and development. In NeurIPS Datasets and Benchmarks (2021).
Huang, K. et al. Artificial intelligence foundation for therapeutic science. Nature Chemical Biology 18, 1033–1036 (2022).
Article CAS PubMed Google Scholar
Wang, Z., Yin, H. & Song, Y. Benchmarking the combinatorial generalizability of complex query answering on knowledge graphs. In NeurIPS Datasets and Benchmarks (2021).
Liu, M. et al. DIG: A turnkey library for diving into graph deep learning research. JMLR (2021).
Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. ICLR 2019 (RLGM Workshop) (2019).
Wang, M. et al. Deep Graph Library: Towards efficient and scalable deep learning on graphs. In ICLR workshop on representation learning on graphs and manifolds (2019).
Agarwal, C., Queen, O., Lakkaraju, H. & Zitnik, M. Evaluating explainability for graph neural networks. Harvard Dataverse https://doi.org/10.7910/DVN/KULOS8 (2022).
Simonyan, K. et al. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR (2014).
Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In ICML (2017).
Yuan, H., Yu, H., Wang, J., Li, K. & Ji, S. On explainability of graph neural networks via subgraph explorations. In ICML (2021).
Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? In ICLR (2019).
Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR (2017).
Taha, A. A. & Hanbury, A. Metrics for evaluating 3d medical image segmentation: analysis, selection, and tool. In BMC Medical Imaging (2015).
Kazius, J. et al. Derivation and validation of toxicophores for mutagenicity prediction. In Journal of Medicinal Chemistry (2005).
Yuan, H., Yu, H., Gui, S. & Ji, S. Explainability in graph neural networks: A taxonomic survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
Hunter, J. D. Matplotlib: A 2d graphics environment. Computing in Science & Engineering (2007).
Hagberg, A. A., Schult, D. A. & Swart, P. J. Exploring network structure, dynamics, and function using networkx. In Proceedings of the 7th Python in Science Conference (2008).
Wu, Z. et al. A comprehensive survey on graph neural networks. In IEEE Transactions on Neural Networks and Learning Systems (2020).
Barabási, A.-L. & Albert, R. Emergence of scaling in random networks. In Science (1999).
Guyon, I. Design of experiments of the nips 2003 variable selection benchmark. In NIPS 2003 workshop on feature extraction and feature selection (2003).
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. In JMLR (2011).
McCallum, A. K. et al. Automating the construction of internet portals with machine learning. In Information Retrieval (2000).
Sen, P. et al. Collective classification in network data. In AI magazine (2008).
Wang, K. et al. Microsoft academic graph: When experts are not enough. In Quantitative Science Studies (2020).
Zhu, J. et al. Beyond homophily in graph neural networks: Current limitations and effective designs. In NeurIPS (2020).
Jin, W. et al. Node similarity preserving graph convolutional networks. In WSDM (2021).
Albert, R. & Barabási, A.-L. Statistical mechanics of complex networks. Reviews of modern physics (2002).
Sterling, T. & Irwin, J. J. Zinc 15–ligand discovery for everyone. Journal of Chemical Information and Modeling (2015).

Download references

Acknowledgements

C.A., O.Q., and M.Z. gratefully acknowledge the support by NSF under Nos. IIS-2030459 and IIS-2033384, US Air Force Contract No. FA8702-15-D-0001, and awards from Harvard Data Science Initiative, Amazon Research, Bayer Early Excellence in Science, AstraZeneca Research, and Roche Alliance with Distinguished Scientists. H.L. was supported in part by NSF under Nos IIS-2008461 and IIS-2040989, and research awards from Google, JP Morgan, Amazon, Bayer, Harvard Data Science Initiative, and D3 Institute at Harvard. O.Q. was supported, in part, by Harvard Summer Institute in Biomedical Informatics. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funders.

Author information

These authors contributed equally: Chirag Agarwal, Owen Queen.

Authors and Affiliations

Media and Data Science Research Lab, Adobe, Noida, 201304, India
Chirag Agarwal
Department of Biomedical Informatics, Harvard University, Boston, MA, 02115, USA
Chirag Agarwal, Owen Queen & Marinka Zitnik
Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, 37996, USA
Owen Queen
Harvard Business School, Boston, MA, 02163, USA
Himabindu Lakkaraju
Harvard Data Science Initiative, Cambridge, MA, 02138, USA
Himabindu Lakkaraju & Marinka Zitnik
Department of Computer Science, Harvard University, Boston, MA, 02134, USA
Himabindu Lakkaraju
Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
Marinka Zitnik

Authors

Chirag Agarwal
View author publications
You can also search for this author in PubMed Google Scholar
Owen Queen
View author publications
You can also search for this author in PubMed Google Scholar
Himabindu Lakkaraju
View author publications
You can also search for this author in PubMed Google Scholar
Marinka Zitnik
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

C.A., O.Q., H.L. and M.Z. contributed new analytic tools and wrote the manuscript. C.A. and O.Q. retrieved, processed, and harmonized datasets. C.A. and O.Q. implemented the synthetic dataset generator and performed the analyses for technical validation of the new resource. M.Z. conceived the study.

Corresponding author

Correspondence to Marinka Zitnik.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Agarwal, C., Queen, O., Lakkaraju, H. et al. Evaluating explainability for graph neural networks. Sci Data 10, 144 (2023). https://doi.org/10.1038/s41597-023-01974-x

Download citation

Received: 19 August 2022
Accepted: 17 January 2023
Published: 18 March 2023
DOI: https://doi.org/10.1038/s41597-023-01974-x

This article is cited by

A unified pre-training and adaptation framework for combinatorial optimization on graphs
- Ruibin Zeng
- Minglong Lei
- Lan Cheng
Science China Mathematics (2024)
Reliable interpretability of biology-inspired deep neural networks
- Wolfgang Esser-Skala
- Nikolaus Fortelny
npj Systems Biology and Applications (2023)
CoSP: co-selection pick for a global explainability of black box machine learning models
- Dou El Kefel Mansouri
- Seif-Eddine Benkabou
- Souleyman Chaib
World Wide Web (2023)

Subjects

Abstract

Similar content being viewed by others

Enhancing property and activity prediction and interpretation using multiple molecular graph representations with MMGX

Chemistry-intuitive explanation of graph neural networks for molecular property prediction with substructure masking

Graph neural networks

Introduction

Results

Experimental setup

GNN explainers

Implementation details

Performance metrics

Graph explanation accuracy (GEA)

Graph explanation faithfulness (GEF)

Graph explanation stability (GES)

Counterfactual fairness mismatch

Group fairness mismatch

Evaluation and analysis of GNN explainability methods

Benchmarking GNN explainers on synthetic and real-world graphs

Analyzing homophilic vs. heterophilic ground-truth explanations

Analyzing the reliability of graph explainers to smaller vs. larger ground-truth explanations

Examining fair vs. unfair ground-truth explanations

Faithfulness shift with varying degrees of node feature information

Visualization results

Discussion

Methods

Notation

Graphs

Graph neural networks

GNN explainability methods

ShapeGGen dataset generator

Flexible parameterization of ShapeGGen

Generating graph structure

Generating labels for prediction

Generating node feature vectors

Generating ground-truth explanations

Datasets in GraphXAI

Synthetic graphs

Base ShapeGGen graphs (SG-Base)

Homophilic and heterophilic ShapeGGen graphs

Weakly and strongly unfair ShapeGGen graphs

Small and large ShapeGGen explanations

Low and high proportions of salient features

BA-Shapes

Real-world graphs

MUTAG

Alkane carbonyl

Benzene

Fluoride carbonyl

German credit

Recidivism

Credit defaulter

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

A unified pre-training and adaptation framework for combinatorial optimization on graphs

Reliable interpretability of biology-inspired deep neural networks

CoSP: co-selection pick for a global explainability of black box machine learning models

Search

Quick links