Evaluating explainability for graph neural networks

As explanations are increasingly used to understand the behavior of graph neural networks (GNNs), evaluating the quality and reliability of GNN explanations is crucial. However, assessing the quality of GNN explanations is challenging as existing graph datasets have no or unreliable ground-truth explanations. Here, we introduce a synthetic graph data generator, ShapeGGen, which can generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. The flexibility to generate diverse synthetic datasets and corresponding ground-truth explanations allows ShapeGGen to mimic the data in various real-world areas. We include ShapeGGen and several real-world graph datasets in a graph explainability library, GraphXAI. In addition to synthetic and real-world graph datasets with ground-truth explanations, GraphXAI provides data loaders, data processing functions, visualizers, GNN model implementations, and evaluation metrics to benchmark GNN explainability methods.


Introduction
As graph neural networks (GNNs) are being increasingly used for learning representations of graph-structured data in high-stakes applications, such as criminal justice [1], molecular chemistry [2,3], and biological networks [4,5], it becomes critical to ensure that the relevant stakeholders can understand and trust their functionality. To this end, previous work developed several methods to explain predictions made by GNNs [6][7][8][9][10][11][12][13][14]. With the increase in newly proposed GNN explanation methods, it is critical to ensure their reliability. However, explainability in graph machine learning is an emerging area lacking standardized evaluation strategies and reliable data resources to evaluate, test, and compare GNN explanations [15]. While several works have acknowledged this difficulty, they tend to base their analysis on specific real-world [2] and synthetic [16] datasets with limited ground-truth explanations. In addition, relying on these datasets and associated ground-truth explanations is insufficient as they are not indicative of diverse real-world applications [15]. To this end, developing a broader ecosystem of data resources for benchmarking state-of-the-art GNN explainers can support explainability research in GNNs.
A comprehensive data resource to correctly evaluate the quality of GNN explanations will ensure their reliability in high-stakes applications. However, the evaluation of GNN explanations is a growing research area with relatively little work, where existing approaches mainly leverage ground-truth explanations associated with specific datasets [2] and are prone to several pitfalls (as outlined by Faber et al. [16]). Further, multiple underlying rationales can generate the correct class labels, creating redundant or non-unique explanations. A trained GNN model may capture only one of them, or an entirely different rationale. In such cases, evaluating the explanation output by a state-of-the-art method against the ground-truth explanation is incorrect because the underlying GNN model does not rely on that ground-truth explanation. In addition, even if a unique ground-truth explanation generates the correct class label, the GNN model trained on the data could be a weak predictor using an entirely different rationale for prediction. Therefore, the ground-truth explanation cannot be used to assess post hoc explanations of such models. Lastly, the ground-truth explanations corresponding to some of the existing benchmark datasets are not good candidates for reliably evaluating explanations as they can be recovered using trivial baselines (e.g., a random node or edge as the explanation). The above discussion highlights a clear need for general-purpose data resources which can evaluate post hoc explanations reliably across diverse real-world applications. While various benchmark datasets (e.g., Open Graph Benchmark (OGB) [17], GNNMark [18], GraphGT [19], MalNet [20], Graph Robustness Benchmark (GRB) [21], Therapeutics Data Commons [22,23], and EFO-1-QA [24]) and programming libraries for deep learning on graphs (e.g., Dive Into Graphs (DIG) [25], Pytorch Geometric (PyG) [26], and Deep Graph Library (DGL) [27]) exist in the graph machine learning literature, they are mainly used to benchmark the performance of GNN predictors and are not suited to evaluate the correctness of GNN explainers because they do not capture ground-truth explanations.
Here, we address the above challenges by introducing a general-purpose data resource that is not prone to ground-truth pitfalls (e.g., redundant explanations, weak GNN predictors, trivial explanations) and can cater to diverse real-world applications. To this end, we present SHAPEGGEN (Figure 2), a novel and flexible synthetic XAI-ready (explainable artificial intelligence ready) dataset generator which can automatically generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. SHAPEGGEN ensures that the generated ground-truth explanations are not prone to the pitfalls described in Faber et al. [16], such as redundant explanations, weak GNN predictors, and trivially correct explanations. Furthermore, SHAPEGGEN can evaluate the goodness of any explanation (e.g., node feature-based, node-based, edge-based) across diverse real-world applications by seamlessly generating synthetic datasets that mimic the properties of real-world data in various domains.
We incorporate SHAPEGGEN and several other synthetic and real-world graphs [28] into GRAPHXAI, a general-purpose framework for benchmarking GNN explainers. GRAPHXAI also provides a broader ecosystem (Figure 1) of data loaders, data processing functions, visualizers, and a set of evaluation metrics (e.g., accuracy, faithfulness, stability, fairness) to reliably benchmark the quality of any given GNN explanation (node feature-based or node/edge-based). We leverage various synthetic and real-world datasets and evaluation metrics from GRAPHXAI to empirically assess the quality of explanations output by eight state-of-the-art GNN explanation methods. Across many GNN explainers, graphs, and graph tasks, we observe that state-of-the-art GNN explainers fail on graphs with large ground-truth explanations (i.e., explanation subgraphs with a higher number of nodes and edges) and cannot produce explanations that preserve the fairness properties of underlying GNN predictors.

Results
To evaluate GRAPHXAI, we show how it enables systematic benchmarking of eight state-of-the-art GNN explainers on both SHAPEGGEN (in the Methods section) and real-world graph datasets. We explore the utility of the SHAPEGGEN generator to benchmark GNN explainers on graphs with homophilic vs. heterophilic, small vs. large, and fair vs. unfair ground-truth explanations. Additionally, we examine the utility of GNN explanations on datasets with varying degrees of informative node features. Next, we outline the experimental setup, including details about performance metrics, GNN explainers, and underlying GNN predictors, and proceed with a discussion of benchmarking results.
Experimental setup

GNN explainers. GRAPHXAI defines an Explanation class capable of storing multiple types of explanations produced by GNN explainers and provides a graphxai.BaseExplainer class that serves as the base for all explanation methods in GRAPHXAI. We incorporate eight GNN explainability methods, including gradient-based: Grad [29], GradCAM [11], GuidedBP [6], Integrated Gradients [30]; perturbation-based: GNNExplainer [14], PGExplainer [10], SubgraphX [31]; and surrogate-based methods: PGMExplainer [13]. Finally, following Agarwal et al. [15], we consider random explanations as a reference: 1) Random Node Features, a node feature mask defined by a d-dimensional Gaussian-distributed vector; 2) Random Nodes, a 1 × n node mask randomly sampled from a uniform distribution, where n is the number of nodes in the enclosing subgraph; and 3) Random Edges, an N × N edge mask drawn from a uniform distribution over a node's incident edges.
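The three random baselines are simple to reproduce. Below is a minimal, self-contained sketch of how such masks could be drawn; the function names and the plain-list mask representation are our own illustration, not the GRAPHXAI API:

```python
import random

def random_node_feature_mask(d, rng=random.Random(0)):
    # Random Node Features: a d-dimensional Gaussian-distributed vector
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def random_node_mask(n, rng=random.Random(0)):
    # Random Nodes: a 1 x n mask sampled uniformly over the enclosing subgraph
    return [rng.random() for _ in range(n)]

def random_edge_mask(n, incident_edges, rng=random.Random(0)):
    # Random Edges: an n x n mask, uniform only over a node's incident edges
    mask = [[0.0] * n for _ in range(n)]
    for (u, v) in incident_edges:
        mask[u][v] = rng.random()
    return mask
```

In each case the baseline carries no information about the model, so any explainer that does not clearly beat these masks on the metrics below is uninformative.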
Implementation details. We use a three-layer GIN model [32] and a GCN model [33] as GNN predictors for our experiments. The GIN model comprises three GIN convolution layers with a ReLU non-linear activation function and a Softmax activation for the final layer. The hidden dimensionality of the layers is set to 16. We follow an established approach for generating explanations [8,15] and use reference algorithm implementations. We select the top-k (k = 25%) most important nodes, node features, or edges and use them to generate explanations for all graph explainability methods.
For training GIN models, we use an Adam optimizer with a learning rate of 1 × 10^−2, weight decay of 1 × 10^−5, and 1000 training epochs. For training GCN models, we use an Adam optimizer with a learning rate of 3 × 10^−2, no weight decay, and 1500 training epochs. We set the hyperparameters of the GNN explainability models following the authors' recommendations.
We use a fixed random split provided within the GRAPHXAI package to split the datasets. For each SHAPEGGEN dataset, we use a 70/5/25 split for training, validation, and testing, respectively. For the MUTAG, Benzene, and Fluoride Carbonyl datasets, we use a 70/10/20 split. Average performance is reported across all samples in the testing set of each dataset.

Performance metrics
In addition to the synthetic and real-world data resources, we consider four broad categories of performance metrics: i) Graph Explanation Accuracy (GEA); ii) Graph Explanation Faithfulness (GEF); iii) Graph Explanation Stability (GES); and iv) Graph Explanation Fairness (GECF, GEGF) to evaluate the explanations on the respective datasets. In particular, all evaluation metrics leverage predicted explanations, ground-truth explanations, and other user-controlled parameters, such as top-k features. Our GRAPHXAI package implements these performance metrics and additional utility functions within the graphxai.metrics module. Figure 3 shows a code snippet for evaluating the correctness of output explanations for a given GNN prediction in GRAPHXAI.
Graph explanation accuracy (GEA). We report graph explanation accuracy as an evaluation strategy that measures an explanation's correctness using the ground-truth explanation M_g. In the ground-truth and predicted explanation masks, every node, node feature, or edge belongs to {0, 1}, where '0' means that an attribute is unimportant and '1' means that it is important for the model prediction. To measure accuracy, we use the Jaccard index [34] between the ground-truth M_g and predicted M_p:

JAC(M_g, M_p) = TP / (TP + FP + FN),   (1)

where TP denotes true positives, FP denotes false positives, and FN denotes false negatives. Most synthetic and real-world graphs have multiple ground-truth explanations. For example, in the MUTAG dataset [35], carbon rings with either NH_2 or NO_2 chemical groups are valid explanations for the GNN model to recognize a given molecule as mutagenic. For this reason, the accuracy metric must account for the possibility of multiple equally valid explanations existing for any given prediction. Hence, we define ζ as the set of all possible ground-truth explanations, where |ζ| = 1 for graphs having a unique explanation. Therefore, we calculate GEA as:

GEA(ζ, M_p) = max_{M_g ∈ ζ} JAC(M_g, M_p).   (2)

GEA can be calculated using predicted node feature, node, or edge explanation masks. Equation 1 quantifies the accuracy between the ground-truth and predicted explanation masks, and higher values mean a predicted explanation is more likely to match a ground-truth explanation.
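As a concrete illustration, the Jaccard-based accuracy with a max over the set of valid ground truths can be computed as follows; this is a hedged sketch over binary masks represented as plain lists, not the graphxai.metrics implementation:

```python
def jaccard(gt, pred):
    # TP / (TP + FP + FN) over two binary masks of equal length
    tp = sum(1 for g, p in zip(gt, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gt, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gt, pred) if g == 1 and p == 0)
    denom = tp + fp + fn
    return tp / denom if denom else 1.0

def graph_explanation_accuracy(ground_truths, pred):
    # GEA: best Jaccard score over all equally valid ground-truth masks
    return max(jaccard(gt, pred) for gt in ground_truths)
```

Taking the maximum over ζ is what prevents an explainer from being penalized for recovering a valid rationale that simply differs from the first listed ground truth.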
Graph explanation faithfulness (GEF). We extend existing faithfulness metrics [2,15] to quantify how faithful explanations are to an underlying GNN predictor. In particular, we obtain the prediction probability vector ŷ_u using the GNN on the original subgraph, i.e., ŷ_u = f(S_u), and on the explanation, i.e., ŷ'_u = f(S'_u), where the masked subgraph S'_u keeps only the original values of the top-k features identified by the explanation. Finally, we compute the graph explanation unfaithfulness metric as:

GEF(f, S_u, S'_u) = 1 − exp(−KL(ŷ_u || ŷ'_u)),   (3)

where the Kullback-Leibler (KL) divergence score quantifies the distance between two probability distributions, and the '||' operator denotes the statistical divergence measure. Note that Equation 3 is a measure of the unfaithfulness of the explanation, so higher values indicate a higher degree of unfaithfulness.
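The unfaithfulness score can be sketched in a few lines. The form below assumes the 1 − exp(−KL(·)) shape of Equation 3, with prediction vectors represented as plain probability lists; it is an illustrative sketch rather than the graphxai.metrics implementation:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete probability vectors; eps avoids log(0)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def graph_explanation_unfaithfulness(y_full, y_masked):
    # GEF = 1 - exp(-KL): 0 when the masked-subgraph prediction matches the
    # full prediction exactly, approaching 1 as the two predictions diverge
    return 1.0 - math.exp(-kl_divergence(y_full, y_masked))
```

The exponential squashing keeps the score in [0, 1), which makes unfaithfulness comparable across datasets with different numbers of classes.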
Graph explanation stability (GES). Formally, an explanation is defined to be stable if the explanations for a given graph and its perturbed counterpart (generated by making infinitesimally small perturbations to the node feature vectors and associated edges) are similar [15,36]. We measure graph explanation stability w.r.t. changes in the model behavior. In addition to requiring similar output labels for S_u and the perturbed S'_u, we employ a second level of check where the difference between model behaviors for S_u and S'_u is bounded by an infinitesimal constant δ, i.e., ||L(S_u) − L(S'_u)||_p ≤ δ. Here, L(•) refers to any form of model knowledge, such as the output logits ŷ_u or embeddings z_u. We compute graph explanation instability as:

GES(M_p, S_u) = max_{S'_u ∈ B(S_u)} D(M_p^{S_u}, M_p^{S'_u}),   (4)

where D(•) represents the cosine distance metric, M_p^{S_u} and M_p^{S'_u} are the predicted explanation masks for S_u and S'_u, and B(•) represents a δ-radius ball around S_u within which the model behavior stays the same. Equation 4 is a measure of instability, and higher values indicate a higher degree of instability.
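The instability computation reduces to a worst-case cosine distance over the perturbed masks. A minimal sketch, assuming the perturbed masks have already been produced by re-running the explainer on δ-bounded perturbations of S_u:

```python
import math

def cosine_distance(a, b):
    # D(a, b) = 1 - cos(a, b) for two non-zero mask vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def graph_explanation_instability(mask_original, perturbed_masks):
    # GES: worst-case distance between the original explanation mask and the
    # masks obtained on perturbed graphs whose model behavior stayed similar
    return max(cosine_distance(mask_original, m) for m in perturbed_masks)
```

Note that only perturbations inside the δ-ball B(S_u) should be passed in, since a change in explanation is acceptable when the model's behavior itself changed.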
Counterfactual fairness mismatch. To measure counterfactual fairness [15], we verify whether the explanations corresponding to S_u and its counterfactual counterpart (where the protected node feature is modified) are similar (dissimilar) if the underlying model predictions are similar (dissimilar). We calculate counterfactual fairness mismatch as:

GECF(M_p, M'_p) = D(M_p, M'_p),   (5)

where M_p and M'_p are the predicted explanation masks for S_u and for the counterfactual counterpart of S_u, and D(•) is the cosine distance. Note that we expect a lower GECF score for graphs having weakly-unfair ground-truth explanations because explanations should be similar for both the original and counterfactual graphs, whereas, for graphs with strongly-unfair ground-truth explanations, we expect a higher GECF score because explanations should change when we modify the protected attribute.
Group fairness mismatch. We measure group fairness mismatch [15] as:

GEGF = |SP(ŷ_K) − SP(ŷ_K^E)|,   (6)

where ŷ_K and ŷ_K^E are predictions for a set of K graphs using the original features and the essential features identified by an explanation, respectively, and SP is statistical parity. Equation 6 is a measure of the group fairness mismatch of an explanation, where higher values indicate that the explanation does not preserve group fairness.
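As an illustration, the sketch below computes statistical parity as the gap in positive-prediction rates between two groups and then takes the GEGF difference; it assumes binary predictions and a binary protected attribute, and is our own simplification rather than the library's implementation:

```python
def statistical_parity(preds, protected):
    # SP: absolute difference in positive-prediction rate between the two
    # groups defined by the binary protected attribute
    g0 = [p for p, a in zip(preds, protected) if a == 0]
    g1 = [p for p, a in zip(preds, protected) if a == 1]
    return abs(sum(g0) / len(g0) - sum(g1) / len(g1))

def group_fairness_mismatch(preds_full, preds_expl, protected):
    # GEGF: |SP(full-feature predictions) - SP(explanation-only predictions)|
    return abs(statistical_parity(preds_full, protected)
               - statistical_parity(preds_expl, protected))
```

A GEGF of zero means the explanation's essential features leave the model's group-level disparity unchanged, whether or not that disparity is itself desirable.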

Evaluation and analysis of GNN explainability methods
Next, we discuss experimental results that answer critical questions concerning synthetic and real-world graphs and different ground-truth explanations.
Benchmarking GNN explainers on synthetic and real-world graphs. We evaluate the performance of GNN explainers on a collection of synthetically generated graphs with various properties and on molecular datasets, using the metrics described in the experimental setup. Results in Tables 1-5 show that, while no explanation method performs well across all properties, across the different SHAPEGGEN node classification datasets (Table 6) SubgraphX outperforms other methods on average. In particular, SubgraphX generates 145.95% more accurate and 64.80% less unfaithful explanations than other GNN explanation methods. Gradient-based methods, such as GradCAM and GuidedBP, perform next best of all methods, with Grad producing the second-lowest unfaithfulness score and GradCAM achieving the second-highest explanation accuracy score. PGExplainer generates the least unstable explanations: 35.35% less unstable than the average instability across other GNN explainers. In summary, the results in Tables 1-5 show that i) node explanation masks are more reliable than edge- and node feature explanation masks, and ii) state-of-the-art GNN explainers achieve better faithfulness on synthetic graph datasets than on real-world graphs.
Analyzing homophilic vs. heterophilic ground-truth explanations. We compare GNN explainers by generating explanations for GNN models trained on homophilic and heterophilic graphs generated using the SG-HETEROPHILIC generator. Then, we compute the graph explanation unfaithfulness scores of output explanations generated using state-of-the-art GNN explainers. We find that GNN explainers produce 55.98% more faithful explanations when ground-truth explanations are homophilic than when they are heterophilic (i.e., low unfaithfulness scores for light green bars in Figure 4). These results reveal an important gap in existing GNN explainers: namely, they fail to perform well on diverse graph types, such as homophilic, heterophilic, and attributed graphs. This observation, enabled by the flexibility of the SHAPEGGEN generator, highlights an opportunity for future algorithmic innovation in GNN explainability.
Analyzing the reliability of graph explainers on smaller vs. larger ground-truth explanations. Next, we examine the reliability of GNN explainers when used to predict explanations for models trained on graphs generated using the SG-SMALLEX graph generator. Results in Figure 5 show that explanations from existing GNN explainers are faithful (i.e., lower GEF scores) to the underlying GNN models when ground-truth explanations are smaller, i.e., S = 'triangle'. On average, across all eight GNN explainers, we find that existing GNN explainers are highly unfaithful on graphs with large ground-truth explanations, with an average GEF score of 0.7476. Further, we observe that explanations generated for 'triangle' (smaller) ground-truth explanations are 59.98% less unfaithful than explanations for 'house' (larger) ground-truth explanations (i.e., low unfaithfulness scores for light purple bars in Figure 5). However, the Grad explainer, on average, achieves 9.33% lower unfaithfulness on large ground-truth explanations compared to other explanation methods. This limitation of existing GNN explainers was not previously known and highlights an urgent need for additional analysis of GNN explainers.
Examining fair vs. unfair ground-truth explanations. To measure the fairness of predicted explanations, we train GNN models on SG-UNFAIR, which generates graphs with controllable fairness properties. Next, we compute the average GECF and GEGF values for predicted explanations from eight GNN explainers. The fairness results in Figure 6 show that GNN explainers do not preserve counterfactual fairness and are highly prone to producing unfair explanations. We note that for weakly-unfair ground-truth explanations (light red in Figure 6), explanations M_p should not change because the label-generating process is independent of the protected attribute. Still, we observe high GECF scores for most explanation methods. For strongly-unfair ground-truth explanations, we find that explanations from most GNN explainers fail to capture the unfairness enforced through the protected attribute (i.e., low GECF scores for dark red bars in Figure 6) and generate similar explanations even when we flip the respective protected attribute. We see that GradCAM and PGExplainer outperform other GNN explainers in preserving counterfactual explanations for weakly-unfair ground-truth explanations. In contrast, PGMExplainer preserves counterfactual fairness better than other explanation methods on strongly-unfair ground-truth explanations. Our results highlight the importance of studying fairness in XAI, as fair explanations can enhance a user's confidence in the model and assist in detecting and correcting unwanted bias.
Faithfulness shift with varying degrees of node feature information. Using SHAPEGGEN's support for node features and ground-truth explanations on those features, we evaluate explainers that generate explanations for node features. Results for node feature explanations on SG-BASE are given in Table 4. In addition, we explore the performance of explainers under varying proportions of informative node features. Informative node features, defined in the SHAPEGGEN construction (Algorithm 1), are node features correlated with the label of a given node, as opposed to redundant features, which are sampled randomly from a Gaussian distribution. Figure 7 shows the results of experiments on three datasets: SG-MOREINFORM, SG-BASE, and SG-LESSINFORM. All datasets have similar graph topology, but SG-MOREINFORM has a higher proportion of informative features while SG-LESSINFORM has a lower proportion. SG-BASE is used as a baseline, with a proportion of informative features greater than SG-LESSINFORM but less than SG-MOREINFORM. There are minimal differences between explainers' faithfulness across datasets; however, unfaithfulness tends to increase with fewer informative node features. As seen in Table 4, the Gradient explainer shows the best faithfulness score across all datasets for node feature explanation. Still, this faithfulness is relatively weak, only 0.001 better than the random explanation. These results show that the faithfulness of node feature explanations worsens under sparse node feature signals.
Visualization results. GRAPHXAI provides functions that visualize explanations produced by GNN explainability methods. Users can compare both node- and graph-level explanations. In addition, function implementations for visualization are parameterized, allowing users to change colors and weight interpretation. The functions are compatible with matplotlib [37] and networkx [38]. Visualizations are generated by the graphxai.Explanation.visualize_node function for node-level explanations and the graphxai.Explanation.visualize_graph function for graph-level explanations. In Figure 8, we show the output explanations from four different GNN explainers as produced by our visualization function. Figure 9 shows example outputs from multiple explainers in the GRAPHXAI package on a SHAPEGGEN-generated dataset.

Discussion
GRAPHXAI provides a general-purpose framework to evaluate GNN explanations produced by state-of-the-art GNN explanation methods. GRAPHXAI provides data loaders, data processing functions, visualizers, real-world graph datasets with ground-truth explanations, and evaluation metrics to benchmark the quality of GNN explanations. GRAPHXAI introduces a novel and flexible synthetic dataset generator called SHAPEGGEN to automatically generate benchmark datasets and corresponding ground-truth explanations robust against known pitfalls of GNN explainability methods. Our experimental results show that existing GNN explainers perform well on graphs with homophilic ground-truth explanations but perform considerably worse on heterophilic and attributed graphs. Across multiple graph datasets and types of downstream prediction tasks, we show that existing GNN explanation methods fail on graphs with larger ground-truth explanations and cannot generate explanations that preserve the fairness properties of the underlying GNN model.
In addition, GNN explainers tend to underperform on sparse node feature signals compared to more densely informative node features. These findings indicate the need for methodological innovation and a thorough analysis of future GNN explainability methods across performance dimensions. GRAPHXAI provides a flexible framework for evaluating GNN explanation methods and promotes reproducible and transparent research. We maintain GRAPHXAI as a centralized library for evaluating GNN explanation methods and plan to add newer datasets, explanation methods, diverse evaluation metrics, and visualization features to our existing framework. In the current version of GRAPHXAI, we mostly employ real-world molecular chemistry datasets, as the need for model understanding is motivated by experimental evaluation of model predictions in the laboratory; these datasets include a wide variety of graph sizes (ranging from 1,768 to 12,000 instances per dataset), node feature dimensions (ranging from 13 to 27 dimensions), and class imbalance ratios. In addition to the scale-free SHAPEGGEN dataset generator, we will include other random graph model generators in the next GRAPHXAI release to support benchmarking of GNN explainability methods on other graph types. Our evaluation metrics can also be extended to explanations from self-explaining GNNs; e.g., GraphMASK [12] identifies edges at each layer of the GNN during training that can be ignored without affecting the output model predictions. In general, self-explaining GNNs also return a set of edge masks as an output explanation for a GNN prediction, which can be converted to edge importance scores for computing GRAPHXAI metrics. We anticipate that GRAPHXAI can help algorithm developers and practitioners in graph representation learning develop and evaluate principled explainability techniques.

Methods
SHAPEGGEN is a key component of GRAPHXAI and serves as a synthetic dataset generator of XAI-ready graph datasets. It is grounded in graph theory and designed to address the pitfalls (see Introduction) of existing graph datasets in the broad area of explainable AI. As such, SHAPEGGEN can facilitate the development, analysis, and evaluation of GNN explainability methods (see Results). We proceed with a description of the SHAPEGGEN data generator.

Notation
Graphs. Let G = (V_G, E_G, X_G) denote an undirected graph comprising a set of nodes V_G and a set of edges E_G. Let X_G = {x_1, x_2, . . ., x_N} denote the set of node feature vectors for all nodes in V_G, where x_v ∈ X_G is a d-dimensional vector which captures the attribute values of a node v and N = |V_G| denotes the number of nodes in the graph. Let A ∈ R^{N×N} be the graph adjacency matrix, where element A_uv = 1 if there exists an edge e ∈ E_G between nodes u and v, and A_uv = 0 otherwise. We use N_u to denote the set of immediate neighbors of node u, i.e., N_u = {v ∈ V_G : A_uv = 1}.

Graph neural networks. Most GNNs can be formulated as message passing networks [39] using three operators: MSG, AGG, and UPD. In an L-layer GNN, these operators are recursively applied on G, specifying how neural messages (i.e., embeddings) are exchanged between nodes, aggregated, and transformed to arrive at node representations in the last layer of transformations. Formally, a message between a pair of nodes (u, v) in layer l is defined as a function of the hidden representations h_u^{l−1} and h_v^{l−1} from the previous layer, i.e., m_uv^l = MSG(h_u^{l−1}, h_v^{l−1}). In AGG, messages from all nodes v ∈ N_u are aggregated as m_u^l = AGG({m_uv^l | v ∈ N_u}). In UPD, the aggregated message m_u^l is combined with h_u^{l−1} to produce u's representation for layer l, i.e., h_u^l = UPD(m_u^l, h_u^{l−1}); the final representation z_u = h_u^L is the output of the last layer. Lastly, let f denote a downstream GNN classification model that maps the node representation z_u to a softmax prediction vector ŷ_u ∈ R^C, where C is the total number of labels.
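The MSG/AGG/UPD decomposition can be made concrete with a toy, framework-free sketch. Here node states are scalars and the three operators are passed in as plain functions; this is a pedagogical illustration of the message-passing pattern, not a GNN implementation:

```python
def gnn_layer(h, neighbors, msg, agg, upd):
    # one message-passing layer: MSG along each edge (u, v), AGG over N_u,
    # then UPD combines the aggregate with u's previous state
    h_next = {}
    for u in h:
        messages = [msg(h[u], h[v]) for v in neighbors[u]]
        h_next[u] = upd(agg(messages), h[u])
    return h_next

# toy run: messages are the neighbor states, AGG is a sum,
# and UPD averages the aggregate with the previous state
h0 = {0: 1.0, 1: 2.0, 2: 3.0}
nbrs = {0: [1, 2], 1: [0], 2: [0]}
h1 = gnn_layer(h0, nbrs,
               msg=lambda hu, hv: hv,
               agg=sum,
               upd=lambda m, h_prev: 0.5 * (m + h_prev))
```

Stacking L such layers gives each node a receptive field of its L-hop neighborhood, which is why the parameter L below also bounds the size of the ground-truth explanation.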
GNN explainability methods. Given the prediction f(u) for node u made by a GNN model, a GNN explainer O outputs an explanation mask M_p that provides an explanation of f(u). These explanations can be given with respect to node attributes M_a ∈ R^d, nodes M_n ∈ R^N, or edges M_e ∈ R^{N×N}, depending on the specific GNN explainer, such as GNNExplainer [14], PGExplainer [10], or SubgraphX [31]. For all explanation methods, we use a graph masking function MASK that outputs a new graph G' = (V_G, E'_G, X'_G) by performing an element-wise multiplication between the masks (M_a, M_n, M_e) and their respective attributes in the original graph G, e.g., A' = A ⊙ M_e. Finally, we denote the ground-truth explanation mask as M_g, which is used to evaluate the performance of GNN explainers.
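The MASK operation is just a set of element-wise (Hadamard) products. A minimal sketch over nested lists, where a mask left as None means "keep unchanged" (our own illustration, not the library's MASK function):

```python
def apply_masks(x, adj, feat_mask=None, node_mask=None, edge_mask=None):
    # element-wise products of the masks with node features and adjacency:
    # X' = X * M_a * M_n (broadcast per row/column), A' = A * M_e
    n, d = len(x), len(x[0])
    x_new = [[x[i][j]
              * (feat_mask[j] if feat_mask is not None else 1.0)
              * (node_mask[i] if node_mask is not None else 1.0)
              for j in range(d)] for i in range(n)]
    a_new = [[adj[i][j] * (edge_mask[i][j] if edge_mask is not None else 1.0)
              for j in range(n)] for i in range(n)]
    return x_new, a_new
```

Feeding the masked graph back through f is exactly what the faithfulness metric above relies on: f(S'_u) should match f(S_u) when the mask keeps the truly important attributes.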

SHAPEGGEN dataset generator
We propose a novel and flexible synthetic dataset generator called SHAPEGGEN that can automatically generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. Furthermore, the flexibility to generate diverse synthetic datasets and corresponding ground-truth explanations allows us to mimic the data generated by various real-world applications. SHAPEGGEN is a generator of XAI-ready graph datasets supported by graph theory and is particularly suitable for benchmarking GNN explainers and studying their limitations.
Flexible parameterization of SHAPEGGEN. SHAPEGGEN has tunable parameters for data generation. By varying these parameters, SHAPEGGEN can generate diverse types of graphs, including graphs with varying degrees of class imbalance and graph sizes. Formally, a graph is generated as: G = SHAPEGGEN(S, N_s, p, n_s, K, n_f, n_i, s_f, c_f, φ, η, L), where:
• S is the motif, defined as a subgraph, to be planted within the graph.
• N_s denotes the number of subgraphs used in the initial graph generation process.
• p represents the connection probability between two subgraphs and controls the average shortest path length over all possible pairs of nodes. Ideally, a smaller p value for larger N_s is preferred to avoid a low average path length and, therefore, poor performance of GNNs.
• n_s is the expected size of each subgraph in the SHAPEGGEN generation procedure. For a fixed S shape, large n_s values produce graphs with long-tailed degree distributions. Note that the expected total number of nodes in the generated graph G is N_s × n_s.
• K is the number of distinct classes defined in the graph downstream task.
• n_f represents the number of dimensions for node features in the generated graph.
• n_i is the number of informative node features. These are features correlated with the node labels, as opposed to randomly generated non-informative features. The indices of the informative features define the ground-truth explanation mask for node features in the final SHAPEGGEN instance. In general, a larger n_i results in an easier classification and explanation task, as it increases the node feature-level ground-truth explanation size.
• s_f is the class separation factor, which represents the strength of discrimination of class labels between features for each node. Higher s_f corresponds to a stronger signal in the node features, i.e., if a classifier is trained only on the node features, a higher s_f value results in an easier classification task.
• c_f is the number of clusters per class. A larger c_f value increases the difficulty of the classification task with respect to node features.
• φ ∈ [0, 1] is the protected feature noise factor that controls the strength of the correlation r between the protected feature and the node labels. This value corresponds to the probability of "flipping" the protected feature with respect to the node's label. For instance, φ = 0.5 results in zero correlation (r = 0) between the protected feature and the label (i.e., complete fairness), φ = 0 results in a positive correlation (r = 1), and φ = 1 results in a negative correlation (r = −1) between the label and the protected feature.
• η is the homophily coefficient that controls the strength of homophily or heterophily in the graph.
• L is the number of layers in the GNN predictor, corresponding to the GNN's receptive field. For the purposes of SHAPEGGEN, L defines the size of the GNN's receptive field and thus the size of the ground-truth explanation generated for each node.
This wide array of parameters allows SHAPEGGEN to generate graph instances with vastly differing properties.
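The role of φ can be made concrete with a few lines of code. The sketch below (our own illustration, not SHAPEGGEN's implementation) flips a binary protected attribute away from the node label with probability φ, which yields the stated correlations at the extremes:

```python
import random

def protected_feature(labels, phi, rng=random.Random(0)):
    # flip the protected attribute away from the binary label w.p. phi:
    # phi = 0 -> attribute equals the label (r = 1),
    # phi = 0.5 -> attribute independent of the label (r = 0),
    # phi = 1 -> attribute is the label's complement (r = -1)
    return [y if rng.random() >= phi else 1 - y for y in labels]
```

Intermediate values of φ interpolate smoothly between the weakly-unfair and strongly-unfair regimes evaluated in the Results section.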
Generating graph structure. Figure 2 summarizes the process to generate a graph G = (V_G, E_G, X_G). SHAPEGGEN generates N_s subgraphs that exhibit the preferential attachment property [40], which occurs in many real-world graphs. Each subgraph is first given a motif, i.e., a subgraph S = (V_S, E_S, X_S). A preferential attachment algorithm is then performed on the base structure S, adding n nodes (n ∼ Poisson(λ = n_s − |V_S|)) by creating edges to nodes in V_S. The Poisson distribution determines the size of each subgraph used in the generation process, with λ = n_s − |V_S| being the difference between the expected subgraph size n_s and the number of nodes in the motif. After creating a list of randomly-generated subgraphs S = {S_1, . . ., S_{N_s}}, edges are created to connect the subgraphs, completing the structure of a SHAPEGGEN instance. Subgraph connections and local subgraph construction are performed in such a way that each node in the final graph G has between 1 and K motifs within its neighborhood, i.e., |{S_i : V_{S_i} ∩ N_v ≠ ∅}| ∈ {1, 2, ..., K} for any v. This naturally defines a classification task with codomain {0, 1, ..., K−1} for f. More details on the SHAPEGGEN structure generation can be found in Algorithm 2.
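The per-subgraph growth step can be sketched as follows, assuming the Poisson draw for the number of extra nodes has already been made upstream; function and variable names are our own and do not correspond to Algorithm 2:

```python
import random

def grow_subgraph(motif_edges, motif_nodes, n_extra, rng=random.Random(0)):
    # preferential attachment onto a planted motif: each new node attaches to
    # an existing node chosen proportionally to its current degree (+1 so
    # isolated nodes remain reachable), mimicking long-tailed degree growth
    edges = list(motif_edges)
    nodes = list(motif_nodes)
    degree = {v: 0 for v in nodes}
    for (u, v) in edges:
        degree[u] += 1
        degree[v] += 1
    for new in range(len(nodes), len(nodes) + n_extra):
        target = rng.choices(nodes, weights=[degree[v] + 1 for v in nodes])[0]
        edges.append((target, new))
        degree[new] = 1
        degree[target] += 1
        nodes.append(new)
    return nodes, edges
```

Because the motif nodes accumulate degree first, they tend to become hubs, so the planted explanation stays structurally central within each grown subgraph.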
Generating labels for prediction. A motif, defined as a subgraph S = (V_S, E_S, X_S), occurs randomly throughout G, giving the set S = {S_1, ..., S_N}. The task on this graph is a motif detection problem, where each node has between 1 and K motifs in its 1-hop neighborhood. A motif S_i is considered to be within the neighborhood of a node v ∈ V_G if any node s ∈ V_{S_i} is also in the neighborhood of v, i.e., if |V_{S_i} ∩ N_v| > 0. Therefore, the task that a GNN predictor needs to solve is to recover the label y_v = |{S_i ∈ S : |V_{S_i} ∩ N_v| > 0}| − 1 ∈ {0, 1, ..., K − 1}, i.e., the number of motifs in v's neighborhood minus one.
Generating node feature vectors. SHAPEGGEN uses a latent variable model to create node feature vectors and associate them with network structure. This latent variable model is based on the one developed by Guyon [41] for the MADELON dataset and implemented in Scikit-Learn's make_classification function [42]. The latent variable model creates n_i informative features for each node based on the node's generated label, and also creates non-informative features as noise. Having non-informative/redundant features allows us to evaluate GNN explainers, such as GNNExplainer [14], that formulate explanations as node feature masks. More detail on node feature generation is given in Algorithm 1.
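Because the latent variable model follows Guyon's construction as implemented in Scikit-Learn, a feature matrix with the same informative/redundant/noise split can be produced directly with make_classification. The parameter values below are illustrative placeholders standing in for n_f, n_i, s_f, and c_f, not the settings SHAPEGGEN actually uses.

```python
from sklearn.datasets import make_classification

# One feature vector per node; hypothetical values for illustration.
X, y = make_classification(
    n_samples=100,           # number of nodes
    n_features=10,           # n_f: total features per node
    n_informative=4,         # n_i: features tied to the node label
    n_redundant=2,           # linear combinations of informative features
    n_classes=2,             # K
    n_clusters_per_class=2,  # c_f
    class_sep=1.0,           # s_f
    random_state=0,
)
print(X.shape, y.shape)  # (100, 10) (100,)
```

The remaining n_f − n_i − n_redundant columns are pure noise, which is exactly what makes feature-mask explainers testable: a faithful explainer should concentrate its mask on the informative columns.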
SHAPEGGEN can generate graphs with both homophilic and heterophilic ground-truth explanations. We toggle between homophily and heterophily by taking advantage of redundant node features, i.e., features that are not associated with the generated labels, and manipulating them appropriately based on a user-specified homophily parameter η. The magnitude of η determines the degree of homophily/heterophily in the generated graph. The algorithm for node features is given in Algorithm 3. While not every node feature in the feature vector is optimized for homophily/heterophily indication, we experimentally verified that the cosine similarity between node feature vectors produced by Algorithm 3 reveals a strong homophily/heterophily pattern. Finally, SHAPEGGEN can generate protected features to enable the study of fairness [1]. By controlling the value assignment for a selected discrete node feature, SHAPEGGEN introduces bias between the protected feature and the node labels. The biased feature is a proxy for a protected attribute, such as gender or race. The procedure for generating node features is outlined in NODEFEATUREVECTORS within Algorithm 1.
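The cosine-similarity check mentioned above can be reproduced in a few lines. The toy graph and feature vectors below are invented for illustration and are not part of Algorithm 3; the sketch only shows how mean neighbor similarity separates homophilic from heterophilic feature patterns.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mean_neighbor_similarity(edges, features):
    """Average cosine similarity across edges: high values indicate feature
    homophily, low or negative values indicate heterophily."""
    sims = [cosine(features[u], features[v]) for u, v in edges]
    return sum(sims) / len(sims)

# Toy example: nodes 0 and 1 share a feature direction; node 2 opposes it.
features = {0: [1.0, 0.2], 1: [0.9, 0.1], 2: [-1.0, 0.0]}
homophilic_edges = [(0, 1)]
heterophilic_edges = [(0, 2), (1, 2)]
print(mean_neighbor_similarity(homophilic_edges, features) > 0)   # True
print(mean_neighbor_similarity(heterophilic_edges, features) < 0)  # True
```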
Generating ground-truth explanations. In addition to generating ground-truth labels, SHAPEGGEN has the unique capability to generate ground-truth explanations. To accommodate diverse types of GNN explainers, every ground-truth explanation in SHAPEGGEN contains information on a) the importance of nodes, b) the importance of node features, and c) the importance of edges. This information is represented by three distinct masks defined over the enclosing subgraph of each node v ∈ V_G, i.e., the L-hop neighborhood around the node. We denote the enclosing subgraph of node v ∈ V_G for a given GNN predictor with L layers as SUB(v) = (V_SUB(v), E_SUB(v)), the motifs within this enclosing subgraph as S_v = {S_i ∈ S : V_{S_i} ∩ V_SUB(v) ≠ ∅}, and the union of their nodes as V_{S_v}. Using this notation, we define the ground-truth explanation masks: a) Node explanation mask. Nodes in SUB(v) are assigned a value of 1 or 0 based on whether or not they are located within a motif, i.e., 1_{V_{S_v}}(u) = 1 if u ∈ V_{S_v} and 0 otherwise. This indicator 1_{V_{S_v}} is applied to all nodes in the enclosing subgraph of v to produce an importance score for each node, yielding the final mask M_n = {1_{V_{S_v}}(u) | u ∈ V_SUB(v)}. b) Node feature explanation mask. Each feature in v's feature vector is labeled 1 if it is an informative feature and 0 if it is a random feature. This procedure produces a binary node feature mask for node v, M_f ∈ {0, 1}^d. c) Edge explanation mask. Each edge e = (v_i, v_j) ∈ E_SUB(v) is assigned a value of 1 or 0 based on whether or not it connects two motif nodes, i.e., 1_{E_{S_v}}(e) = 1 if v_i, v_j ∈ V_{S_v} and 0 otherwise. This indicator 1_{E_{S_v}} is applied to all edges e ∈ E_SUB(v) to produce the ground-truth edge explanation M_e = {1_{E_{S_v}}(e) | e ∈ E_SUB(v)}. The procedure to generate these ground-truth explanations is described thoroughly in Algorithm 1.
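Putting the three masks together, a minimal sketch looks as follows. The inputs (subgraph, motif membership, informative feature indices) are hypothetical stand-ins; in SHAPEGGEN the actual masks are produced by Algorithm 1.

```python
def ground_truth_masks(sub_nodes, sub_edges, motif_nodes, informative_idx, d):
    """Build node, feature, and edge ground-truth masks for one node's
    enclosing L-hop subgraph. `motif_nodes` is the set of subgraph nodes
    inside a motif; `informative_idx` indexes the informative features
    among d total features."""
    # a) Node mask: 1 iff the node lies within a motif.
    node_mask = {u: 1 if u in motif_nodes else 0 for u in sub_nodes}
    # b) Feature mask: 1 for informative features, 0 for noise features.
    feature_mask = [1 if j in informative_idx else 0 for j in range(d)]
    # c) Edge mask: 1 iff both endpoints lie within a motif.
    edge_mask = {e: 1 if e[0] in motif_nodes and e[1] in motif_nodes else 0
                 for e in sub_edges}
    return node_mask, feature_mask, edge_mask

sub_nodes = [0, 1, 2, 3]
sub_edges = [(0, 1), (1, 2), (2, 3)]
node_mask, feature_mask, edge_mask = ground_truth_masks(
    sub_nodes, sub_edges, motif_nodes={1, 2}, informative_idx={0, 3}, d=5)
print(node_mask)     # {0: 0, 1: 1, 2: 1, 3: 0}
print(feature_mask)  # [1, 0, 0, 1, 0]
print(edge_mask)     # {(0, 1): 0, (1, 2): 1, (2, 3): 0}
```

Each mask matches the shape expected by a different family of explainers, which is what lets a single generated graph benchmark node-, feature-, and edge-level explanation methods at once.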

Datasets in GRAPHXAI
We proceed with a detailed description of synthetic and real-world graph data resources included in GRAPHXAI.

Synthetic graphs
The SHAPEGGEN generator outlined in the Methods section can be used to generate any number of user-specified graphs. In GRAPHXAI, we provide a collection of pre-generated XAI-ready graphs with diverse properties that are readily available for analysis and experimentation.
Base SHAPEGGEN graphs (SG-BASE). We provide a base version of the SHAPEGGEN graphs. This instance of SHAPEGGEN is homophilic, large, and contains house-shaped motifs for ground-truth explanations, formally described as G = SHAPEGGEN(S='house', N_s=1200, p=0.006, n_s=11, …). The node features in this graph exhibit homophily, a property commonly found in social networks. With over 10,000 nodes, this graph also provides enough examples of ground-truth explanations for rigorous statistical evaluation of explainer performance. The house-shaped motifs follow one of the earliest synthetic graphs used to evaluate GNN explainers [14].
Homophilic and heterophilic SHAPEGGEN graphs. GNN explainers are typically evaluated on homophilic graphs [1, 43-45], as low homophily levels in graphs can degrade the performance of GNN predictors [46,47]. As a result, there are no heterophilic graphs with ground-truth explanations in the current GNN XAI literature, despite such graphs being abundant in real-world applications [46]. To demonstrate the flexibility of the SHAPEGGEN data generator, we use it to generate graphs with: i) homophilic ground-truth explanations (SG-BASE) and ii) heterophilic ground-truth explanations (SG-HETEROPHILIC), i.e., G = SHAPEGGEN(S='house', N_s=1200, p=0.006, n_s=11, K=2, …).
Weakly and strongly unfair SHAPEGGEN graphs. We utilize the SHAPEGGEN data generator to generate graphs with controllable fairness properties, i.e., we leverage SHAPEGGEN to generate synthetic graphs with real-world fairness properties where we can enforce unfairness w.r.t. a given protected attribute. We use SHAPEGGEN to generate graphs with: i) weakly-unfair ground-truth explanations (SG-BASE) and ii) strongly-unfair ground-truth explanations (SG-UNFAIR), i.e., G = SHAPEGGEN(…, φ=0.75, η=1, L=3). Here, for the first time, we generate unfair synthetic graphs that can serve as benchmarks for evaluating the fairness of GNN explanations.

Real-world graphs
We describe the real-world graph datasets, with and without ground-truth explanations, provided in GRAPHXAI. To this end, we provide data resources from crime forecasting, financial lending, and molecular chemistry and biology [1,2,35]. We consider these datasets as they are used to train GNNs for generating predictions in high-stakes downstream applications. In particular, we include chemical and biological datasets because they are used to identify whether an input graph (i.e., a molecular graph) contains a specific pattern (i.e., a chemical group with a specific property in the molecule). Knowledge about such patterns, which determine molecular properties, represents ground-truth explanations [2]. We provide a statistical description of real-world graphs in Tables 6-8.
Below, we discuss the details of each of the real-world datasets that we employ.
MUTAG. The MUTAG [35] dataset contains 1,768 molecular graphs labeled into two classes according to their mutagenic properties, i.e., their effect on the Gram-negative bacterium S. Typhimurium. Kazius et al. [35] identify several toxicophores (motifs in the molecular graph) that correlate with mutagenicity. The dataset is trimmed from its original 4,337 graphs to 1,768, keeping those whose labels directly correspond to the presence or absence of our chosen toxicophores: NH2, NO2, aliphatic halide, nitroso, and azo-type (terminology as used in Kazius et al. [35]).
Alkane-Carbonyl. The Alkane-Carbonyl [2] dataset contains 1,125 molecular graphs labeled into two classes, where a positive sample indicates a molecule that contains both an unbranched alkane and a carbonyl (C=O) functional group. The ground-truth explanations include any combinations of alkane and carbonyl functional groups within a given molecule.
Benzene. The Benzene [2] dataset contains 12,000 molecular graphs extracted from the ZINC15 [49] database and labeled into two classes, where the task is to identify whether a given molecule has a benzene ring. Naturally, the ground-truth explanations are the nodes (atoms) comprising the benzene rings; in the case of multiple benzene rings, each ring forms a separate explanation.
Credit defaulter. The Credit defaulter [1] graph has 30,000 nodes representing individuals, connected based on the similarity of their spending and payment patterns. The dataset contains applicant features like education, credit history, and age, as well as features derived from their spending and payment patterns. The task is to predict whether an applicant will default on an upcoming credit card payment.

GNN predictor evaluator
NODEFEATUREVECTORS
Input: Nodes V; labels of all nodes Y; number of classes/labels K; number of features n_f; number of informative features n_i; separation factor s_f; number of clusters per class c_f; protected feature noise factor φ; homophily coefficient η
Output: Node features X; ground-truth feature explanation M_F
Partition V into K sets by label, {V_1, ..., V_K}; for each node with label i, set the protected feature q equal to i, then flip q to another random label with probability φ.
Table 2: Evaluation of GNN explainers on the SG-BASE graph dataset based on node explanation masks M^p_N. The base GNN is a GCN [33], as opposed to Table 1, which is based on explaining a GIN model [32]. Overall, explainer performance is very similar to that of the GIN, with SubgraphX performing best on faithfulness and accuracy metrics, while gradient-based methods and PGExplainer typically perform best for fairness and stability.

Fluoride carbonyl.
The Fluoride Carbonyl [2] dataset contains 8,671 molecular graphs labeled into two classes, where a positive sample indicates a molecule that contains both a fluoride (F−) and a carbonyl (C=O) functional group. The ground-truth explanations consist of combinations of fluoride atoms and carbonyl functional groups within a given molecule.
German credit. The German Credit [1] graph dataset contains 1,000 nodes representing clients in a German bank, connected based on the similarity of their credit accounts. The dataset includes demographic and financial features such as gender, residence, age, marital status, loan amount, credit history, and loan duration. The goal is to classify clients into good vs. bad credit risks.
Recidivism. The Recidivism [1] dataset includes samples of bail outcomes collected from multiple state courts in the USA between 1990 and 2009. It contains past criminal records, demographic attributes, and other details of 18,876 defendants (nodes) who were released on bail at U.S. state courts. Defendants are connected based on the similarity of their past criminal records and demographics, and the goal is to classify defendants into bail vs. no bail.

Figure 1: Overview of GRAPHXAI. GRAPHXAI provides data loader classes for XAI-ready synthetic and real-world graph datasets with ground-truth explanations for evaluating GNN explainers, implementations of explanation methods, visualizations for GNN explanations, utility functions to support new GNN explainers, and a diverse set of performance metrics to evaluate the reliability of explanations generated by GNN explainers.

Figure 2:

Figure 3: Example use case of the GRAPHXAI package: explaining a prediction. With just a few lines of code, one can compute an explanation for a node or graph, calculate metrics based on that explanation, and visualize the explanation.

Figure 6: Counterfactual fairness mismatch scores across eight GNN explainers on the SG-UNFAIR graph dataset with weakly-unfair or strongly-unfair ground-truth (GT) explanations. Results show that explanations produced on graphs with strongly-unfair ground-truth explanations do not preserve fairness and are sensitive to flipping/modifying the protected node feature.

Figure 7: Unfaithfulness scores across five GNN explainers that produce node feature explanations. Every GNN explainer is evaluated on three datasets whose network topology is equivalent to SG-BASE, varying the ratio between informative and redundant node features: most informative, control, and least informative node features. Results show that, across all explainers, unfaithfulness decreases as the proportion of informative to redundant features increases, with explainers trained on the graph with the most informative node features having consistently lower unfaithfulness scores than explainers trained on graphs with the least informative node features.

Figure 8: Visualization of four explainers from the G-XAI Bench library on the BA-Shapes dataset. The visualization explains the prediction for node u. We show the (L+1)-hop neighborhood around node u, where L is the number of layers of the GNN model predicting on the dataset. Two color bars indicate the intensity of attribution scores for the node and edge explanations. Note that edge importance is not defined for every method, so edges are set to black to indicate that a method does not provide edge scores. Visualization tools are a native part of the GRAPHXAI package, including user-friendly functions graphxai.Explanation.visualize_node and graphxai.Explanation.visualize_graph to visualize GNN explanations. The visualization tools in GRAPHXAI allow users to compare the explanations of different GNN explainers, such as gradient-based methods (Gradient and Grad-CAM) and perturbation-based methods (GNNExplainer and SubgraphX).

Figure 9:

Figure 10: Comparison of degree distributions for (a) a SHAPEGGEN dataset (SG-BASE), (b) a random baseline graph, and two real-world datasets: (c) German Credit and (d) Credit Defaulter. All plots show frequency on a log-scale y-axis. SG-BASE and both real-world graphs show the power-law degree distribution commonly observed in real-world datasets exhibiting preferential attachment. Datasets generated by SHAPEGGEN are designed to exhibit power-law degree distributions matching real-world topologies, such as those observed in German Credit and Credit Defaulter. The degree distribution of SG-BASE differs markedly from the binomial distribution exhibited by the Erdős-Rényi graph in (b), an unstructured random graph model.
Unfaithfulness scores across eight GNN explainers on the SG-HETEROPHILIC graph dataset consisting of either homophilic or heterophilic ground-truth (GT) explanations. GNN explainers produce more faithful explanations (lower GEF scores) on homophilic graphs than on heterophilic graphs, revealing an important limitation of existing GNN explainers.

Unfaithfulness scores across eight GNN explainers on the SG-SMALLEX graph dataset with smaller (triangle-shaped) or larger (house-shaped) ground-truth (GT) explanations. Results show that GNN explainers produce more faithful explanations (lower GEF scores) on graphs with smaller GT explanations than on graphs with larger GT explanations.

Algorithm 1: Overview of the SHAPEGGEN Algorithm
Input: Shape S = (V_S, E_S); number of subgraphs N_s; probability of connection p; subgraph size n_s; number of classes K; number of features n_f; number of informative features n_i; class separation factor s_f; number of clusters per class c_f; protected feature noise factor φ; homophily coefficient η; model layers L
G ← SHAPEGGENSTRUCTURE(S, N_s, p, n_s, K) (see Algorithm 2)
Y ← labels for each node v ∈ G by Equation (

Table 1: Evaluation of GNN explainers on the SG-BASE graph dataset based on node explanation masks M^p_N. Arrows (↑/↓) indicate the direction of better performance. SubgraphX far outperforms other methods in accuracy and faithfulness, while PGExplainer is best for stability and counterfactual fairness. In general, gradient methods produce the most fair explanations across both counterfactual and group fairness metrics. See Tables 3-4 for results on edge and feature explanation masks.

Table 3: Evaluation of GNN explainers on the SG-BASE graph dataset based on edge explanation masks M^p_E. Arrows (↑/↓) indicate the direction of better performance. The SubgraphX method, on average, produces the most reliable edge explanations when evaluated across all five performance metrics. Note that, of the explainers tested in this study, only the above four methods produce edge explanations.

Table 4: Evaluation of GNN explainers on the SG-BASE graph dataset based on node feature explanation masks M^p_F. Arrows (↑/↓) indicate the direction of better performance. All GNN explainers produce highly faithful node feature explanations. However, the stability of these methods on node features is closer to that of random explanations than is observed for the node explanations in Tables 1 and 2. Note that, of the explainers tested in this study, only the above five methods produce node feature explanations.

Table 6: Statistics of graphs generated using the SHAPEGGEN data generator for evaluating different properties of GNN explanations.

Table 7: Statistics of real-world graph classification datasets in GRAPHXAI with ground-truth (GT) explanations.

Table 8: Statistics of real-world node classification datasets in GRAPHXAI without ground-truth (GT) explanations.