Evaluating Explainability for Graph Neural Networks

GraphXAI is a resource for systematically evaluating and benchmarking the quality of GNN explanations. A key component is ShapeGGen, a novel and flexible synthetic dataset generator that can automatically generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) together with ground-truth explanations that are robust to known pitfalls of explainability methods.

As graph AI models are increasingly used in high-stakes applications, it becomes essential to ensure that the relevant stakeholders can understand and trust their functionality. Only if stakeholders clearly understand the behavior of these models can they judge when and how much to rely on them, and detect potential biases or errors in them. To this end, several approaches have been proposed to explain the predictions of GNNs. Based on the techniques they employ, these approaches can be broadly categorized into perturbation-based, gradient-based, and surrogate-based methods.

To ensure that GNN explanations are reliable, it is important to correctly evaluate their quality. However, evaluating the quality of GNN explanations is a rather nascent research area with relatively little work. The approaches proposed thus far mainly leverage ground-truth explanations associated with specific datasets. However, this strategy is prone to several pitfalls:

  • For instance, there could be multiple underlying rationales (redundant/non-unique explanations) that could generate the true class labels and a given ground-truth explanation may only capture one of those, but the GNN model trained on the data may be relying on an entirely different rationale. In such a case, evaluating the explanation output by a state-of-the-art method using the ground-truth explanation is incorrect because the underlying GNN model itself is not relying on that ground-truth explanation.

  • In addition, even if there is a unique ground-truth explanation which generates the true class labels, the GNN model trained on the data could be a weak predictor which uses an entirely different rationale for making predictions. Post hoc explanations of such a model should not be evaluated based on the ground-truth explanation either.

  • Lastly, the ground-truth explanations corresponding to some of the existing benchmark datasets can be recovered using trivial baselines (e.g., random node or edge as explanation), and such datasets are not good candidates for reliably evaluating explanation quality.
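The third pitfall can be checked empirically before a dataset is adopted as a benchmark. The following is a minimal sketch, not part of GraphXAI: it assumes node-level explanations represented as sets of node ids and Jaccard overlap as the score, and the names `jaccard` and `random_baseline_score` are hypothetical. It estimates how well a uniformly random node subset recovers the ground truth.

```python
import random


def jaccard(pred, truth):
    """Jaccard overlap between two sets of node ids."""
    pred, truth = set(pred), set(truth)
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0


def random_baseline_score(nodes, truth, size, trials=1000, seed=0):
    """Mean Jaccard score of uniformly random node subsets of a given size."""
    rng = random.Random(seed)
    nodes = list(nodes)
    total = 0.0
    for _ in range(trials):
        total += jaccard(rng.sample(nodes, size), truth)
    return total / trials


# Toy 10-node subgraph (illustrative data, not a real benchmark).
nodes = range(10)
dense_truth = set(range(8))   # ground truth covers most of the subgraph
sparse_truth = {0, 1}         # ground truth is a small, specific motif

dense_score = random_baseline_score(nodes, dense_truth, size=8)
sparse_score = random_baseline_score(nodes, sparse_truth, size=2)
```

When the ground truth covers most of the subgraph, the random baseline scores highly, so a strong score from a real explainer says little. A dataset with small, specific ground-truth explanations keeps the random baseline low.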

Overview of GraphXAI

GraphXAI is a resource for systematic benchmarking and evaluation of GNN explainability methods. The process to evaluate explanation methods is to choose a graph problem and a GNN architecture to train, then train the GNN model and use a GNN explainer on its predictions to generate explanations. Finally, we compare explanations with a problem-given ground truth to provide a performance score for the GNN explainer. To this end, GraphXAI provides the following:

  • Dataset generator D that can generate diverse types of graphs G, including homophilic, heterophilic, and attributed graphs suitable for the study of graph explainability. Prevailing benchmark datasets are designed for benchmarking GNN predictors and typically consist of a graph or a set of graphs and associated ground-truth label information. While these datasets are sufficient for studying GNN predictors, they cannot be readily used for studying GNN explainers because they lack a critical component, namely information on ground-truth explanations. GraphXAI addresses this gap by providing the ShapeGGen generator, which creates graphs with ground-truth explanations that are uniquely suited for studying GNN explainers.

  • GNN predictor f that is a user-specified GNN model trained on a dataset produced by D and optimized to predict labels for a particular downstream task.

  • GNN explanation method(s) O that takes a prediction f(u) and returns an explanation M(u) = O(f, u) for it.

  • Explanation quality metrics P such that each metric takes a set of explanations and evaluates them for correctness relative to ground-truth explanations.

Taken together, these components provide the functionality needed to systematically benchmark and evaluate GNN explainability methods. Further, GraphXAI addresses the above-mentioned pitfalls of state-of-the-art evaluation setups for GNN explanation methods.
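How the four components fit together can be sketched in plain Python. This is an illustrative toy, not the GraphXAI API: the graph is a hand-written adjacency dictionary standing in for the output of D, the predictor f is a fixed degree rule rather than a trained GNN, the explainer O is a simple edge-removal (perturbation-based) procedure, and the metric P is Jaccard overlap. All function names are hypothetical.

```python
from typing import Callable, Dict, Set

Graph = Dict[int, Set[int]]              # adjacency: node -> neighbors
Predictor = Callable[[Graph, int], int]  # f(graph, u) -> label


def jaccard(pred: Set[int], truth: Set[int]) -> float:
    """Metric P: overlap between an explanation and the ground truth."""
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0


def degree_predictor(graph: Graph, u: int) -> int:
    """Stand-in for a trained GNN f: label 1 if u has at least two neighbors."""
    return int(len(graph[u]) >= 2)


def drop_edge(graph: Graph, u: int, v: int) -> Graph:
    """Return a copy of the graph with the undirected edge (u, v) removed."""
    pruned = {n: set(nbrs) for n, nbrs in graph.items()}
    pruned[u].discard(v)
    pruned[v].discard(u)
    return pruned


def perturbation_explainer(f: Predictor, graph: Graph, u: int) -> Set[int]:
    """Explainer O: keep neighbors whose removal flips the prediction f(u)."""
    base = f(graph, u)
    important = {v for v in graph[u] if f(drop_edge(graph, u, v), u) != base}
    return {u} | important


def evaluate(graph, ground_truth, f, explainer, metric) -> float:
    """Average metric score of M(u) = O(f, u) over nodes with ground truth."""
    scores = [metric(explainer(f, graph, u), truth)
              for u, truth in ground_truth.items()]
    return sum(scores) / len(scores)


# Toy path graph 0-1-2-3 with hand-written ground-truth explanations.
graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
ground_truth = {1: {0, 1, 2}, 2: {2, 3}}
score = evaluate(graph, ground_truth, degree_predictor,
                 perturbation_explainer, jaccard)
```

Passing the predictor into the explainer mirrors the notation M(u) = O(f, u): the explanation is always of the model's prediction, not of the label itself.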

GraphXAI includes the following:

  • novel generator ShapeGGen to automatically generate diverse types of XAI-ready benchmark datasets, including homophilic, heterophilic, and attributed graphs, each accompanied by ground-truth explanations,

  • graph and explanation functions compatible with deep learning frameworks, such as PyTorch and PyTorch Geometric libraries,

  • training and visualization functions for GNN explainers,

  • utility functions to support the development of new GNN explainers, and

  • comprehensive set of performance metrics to evaluate the correctness of explanations produced by GNN explainers relative to ground-truth explanations.
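As an illustration of what a correctness metric relative to ground-truth explanations might look like, the sketch below thresholds a soft edge mask and scores it against each of several admissible ground-truth explanations, keeping the best match. The choice of Jaccard overlap, the 0.5 threshold, and the function names are assumptions made for this example, not the library's definitions.

```python
def jaccard(pred, truth):
    """Jaccard overlap between two sets of edges."""
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0


def explanation_accuracy(edge_scores, ground_truths, threshold=0.5):
    """Best Jaccard overlap between a thresholded edge mask and any ground truth.

    edge_scores: dict mapping an edge (u, v) to an importance score in [0, 1].
    ground_truths: list of edge sets, one per admissible explanation.
    """
    predicted = {edge for edge, s in edge_scores.items() if s >= threshold}
    return max(jaccard(predicted, truth) for truth in ground_truths)


# Illustrative explainer output and two alternative ground-truth explanations.
edge_scores = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.1, (3, 4): 0.7}
ground_truths = [{(0, 1), (1, 2)}, {(3, 4), (4, 5)}]
accuracy = explanation_accuracy(edge_scores, ground_truths)
```

Taking the maximum over ground truths means an explainer is not penalized for recovering one valid rationale when several exist, which is the first pitfall described earlier.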

ShapeGGen Data Generator

ShapeGGen is a generator of XAI-ready graph datasets that is supported by graph theory and is particularly suitable for benchmarking GNN explainers and studying their limitations.

ShapeGGen generates graphs by combining subgraphs containing any given motif and additional nodes. The number of motifs in a node's k-hop neighborhood determines its label (in the figure, we use a 1-hop neighborhood for labeling, and nodes with two motifs in their 1-hop neighborhood are highlighted in red). Feature explanations are masks over important node features (green striped), with an option to add a protected feature (shown in purple) whose correlation to node labels is controllable. Node explanations are the nodes contained in the motifs (horizontal striped nodes), and edge explanations (bold lines) are the edges connecting nodes within motifs.
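The labeling rule and the edge explanations described above can be sketched on a toy graph. This is not the ShapeGGen implementation: the adjacency dictionary, the `motif_of` mapping from nodes to motif ids, and the function names are hypothetical, and the sketch returns the raw motif count rather than a class label derived from it.

```python
from collections import deque


def k_hop_neighborhood(adj, source, k):
    """Nodes within k hops of source (including source), found by BFS."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen


def motif_counts(adj, motif_of, k=1):
    """Number of distinct motifs touching each node's k-hop neighborhood."""
    counts = {}
    for u in adj:
        hood = k_hop_neighborhood(adj, u, k)
        counts[u] = len({motif_of[v] for v in hood if v in motif_of})
    return counts


def motif_edges(adj, motif_of):
    """Edge explanation: edges whose endpoints lie in the same motif."""
    return {(u, v) for u in adj for v in adj[u]
            if u < v and u in motif_of and motif_of[u] == motif_of.get(v)}


# Two triangle motifs (A: 0-1-2, B: 4-5-6) joined through a plain node 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
motif_of = {0: "A", 1: "A", 2: "A", 4: "B", 5: "B", 6: "B"}

counts_1hop = motif_counts(adj, motif_of, k=1)
counts_2hop = motif_counts(adj, motif_of, k=2)
edges = motif_edges(adj, motif_of)
```

In this toy graph only the bridging node 3 sees two motifs within one hop. With k = 2, node 2 also reaches the second motif, showing how the choice of k changes which nodes receive the two-motif label.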


Evaluating Explainability for Graph Neural Networks
Chirag Agarwal*, Owen Queen*, Himabindu Lakkaraju and Marinka Zitnik
Scientific Data 2023 [arXiv]

* Equal Contribution

  title={Evaluating Explainability for Graph Neural Networks},
  author={Agarwal, Chirag and Queen, Owen and Lakkaraju, Himabindu and Zitnik, Marinka},
  journal={Scientific Data},
  publisher={Nature Publishing Group}


Datasets and a PyTorch implementation of GraphXAI are available in the GitHub repository.



Zitnik Lab  ·  Artificial Intelligence in Medicine and Science  ·  Harvard  ·  Department of Biomedical Informatics