Evaluating Explainability for Graph Neural Networks

GraphXAI is a resource to systematically evaluate and benchmark the quality of GNN explanations. A key component is a novel and flexible synthetic dataset generator called ShapeGGen that can automatically generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) together with ground-truth explanations that are robust to known pitfalls of explainable algorithms.

As graph AI models are increasingly used in high-stakes applications, it becomes essential to ensure that the relevant stakeholders can understand and trust their functionality. Only if the stakeholders clearly understand the behavior of these models, they can evaluate when and how much to rely on these models, and detect potential biases or errors in them. To this end, several approaches have been proposed to explain the predictions of GNNs. Based on the techniques they employ, these approaches can be broadly categorized into perturbation-based, gradient-based, and surrogate-based models.

To ensure that GNN explanations are reliable, it is important to correctly evaluate their quality. However, evaluating the quality of GNN explanations is a rather nascent research area with relatively little work. The approaches proposed thus far mainly leverage ground-truth explanations associated with specific datasets. However, this strategy is prone to several pitfalls:

  • For instance, there could be multiple underlying rationales (redundant/non-unique explanations) that could generate the true class labels and a given ground-truth explanation may only capture one of those, but the GNN model trained on the data may be relying on an entirely different rationale. In such a case, evaluating the explanation output by a state-of-the-art method using the ground-truth explanation is incorrect because the underlying GNN model itself is not relying on that ground-truth explanation.

  • In addition, even if there is a unique ground-truth explanation which generates the true class labels, the GNN model trained on the data could be a weak predictor which uses an entirely different rationale for making predictions. Post hoc explanations of such a model should not be evaluated based on the ground-truth explanation either.

  • Lastly, the ground-truth explanations corresponding to some of the existing benchmark datasets can be recovered using trivial baselines (e.g., random node or edge as explanation), and such datasets are not good candidates for reliably evaluating explanation quality.

Overview of GraphXAI

GraphXAI is a resource for systematic benchmarking and evaluation of GNN explainability methods. The process to evaluate explanation methods is to choose a graph problem and a GNN architecture to train, then train the GNN model and use a GNN explainer on its predictions to generate explanations. Finally, we compare explanations with a problem-given ground truth to provide a performance score for the GNN explainer. To this end, GraphXAI provides the following:

  • Dataset generator D} that can generate diverse types of graphs G, including homophilic, heterophilic, and attributed graphs suitable for the study of graph explainability. Prevailing benchmark datasets are designed for benchmarking GNN predictors and typically consist of a graph or a set of graphs and associated ground-truth label information. While these datasets are sufficient for studying GNN predictors, they cannot be readily used for studying GNN explainers because they lack a critical component, namely information on ground-truth explanations. GraphXAI addresses this critical gap by providing the SHapeGraph generator to create graphs with ground-truth explanations that are uniquely suited for studying GNN explainers.

  • GNN predictor f that is a user-specified GNN model trained on a dataset produced by D and optimized to predict labels for a particular downstream task.

  • GNN explanation method(s) O that takes a prediction f(u) and returns an explanation M(u) = O(f, u) for it.

  • Explanation quality metrics P such that each metric takes a set of explanations and evaluates them for correctness relative to ground-truth explanations.

When taken together, GraphXAI provides all the necessary functionality needed to systematically benchmark and evaluate GNN explainability methods. Further, it addresses the above mentioned pitfalls of state-of-the-art evaluation setups for GNN explanation methods.

GraphXAI includes the following:

  • novel generator ShapeGGen to automatically generate diverse types of XAI-ready benchmark datasets, including homophilic, heterophilic, and attributed graphs, each accompanied by ground-truth explanations,

  • graph and explanation functions compatible with deep learning frameworks, such as PyTorch and PyTorch Geometric libraries,

  • training and visualization functions for GNN explainers,

  • utility functions to support the development of new GNN explainers, and

  • comprehensive set of performance metrics to evaluate the correctness of explanations produced by GNN explainers relative to ground-truth explanations.

ShapeGGen Data Generator

ShapeGGen is a generator of XAI-ready graph datasets supported by graph theory and particularly suitable for benchmarking GNN explainers and study their limitations.

ShapeGGen generates graphs by combining subgraphs containing any given motif and additional nodes. The number of motifs in a k-hop neighborhood determines the node label (in the figure, we use a 1-hop neighborhood for labeling, and nodes with two motifs in its 1-hop neighborhood are highlighted in red). Feature explanations are some mask over important node features (green striped), with an option to add a protected feature (shown in purple) whose correlation to node labels is controllable. Node explanations are nodes contained in the motifs (horizontal striped nodes) and edge explanations (bold lines) are edges connecting nodes within motifs.

Publication

Evaluating Explainability for Graph Neural Networks
Chirag Agarwal*, Owen Queen*, Himabindu Lakkaraju and Marinka Zitnik
Scientific Data 2023 [arXiv]

* Equal Contribution

@article{agarwal2023evaluating,
  title={Evaluating Explainability for Graph Neural Networks},
  author={Agarwal, Chirag and Queen, Owen and Lakkaraju, Himabindu and Zitnik, Marinka},
  journal={Scientific Data},
  volume={10},
  number={144},
  url={https://www.nature.com/articles/s41597-023-01974-x},
  year={2023},
  publisher={Nature Publishing Group}
}

Code

Datasets and Pytorch implementation of GraphXAI are available in the GitHub repository.

Authors

Latest News

Aug 2024:   Graph AI in Medicine

Excited to share a new perspective on Graph Artificial Intelligence in Medicine in Annual Reviews.

Aug 2024:   How Proteins Behave in Context

Harvard Medicine News on our new AI tool that captures how proteins behave in context. Kempner Institute on how context matters for foundation models in biology.

Jul 2024:   PINNACLE in Nature Methods

PINNACLE contextual AI model is published in Nature Methods. Paper. Research Briefing. Project website.

Jul 2024:   Digital Twins as Global Health and Disease Models of Individuals

Paper on digitial twins outlining strategies to leverage molecular and computational techniques to construct dynamic digital twins on the scale of populations to individuals.

Jul 2024:   Three Papers: TrialBench, 3D Structure Design, LLM Editing

Jun 2024:   TDC-2: Multimodal Foundation for Therapeutics

The Commons 2.0 (TDC-2) is an overhaul of Therapeutic Data Commons to catalyze research in multimodal models for drug discovery by unifying single-cell biology of diseases, biochemistry of molecules, and effects of drugs through multimodal datasets, AI-powered API endpoints, new tasks and benchmarks. Our paper.

May 2024:   Broad MIA: Protein Language Models

Apr 2024:   Biomedical AI Agents

Mar 2024:   Efficient ML Seminar Series

We started a Harvard University Efficient ML Seminar Series. Congrats to Jonathan for spearheading this initiative. Harvard Magazine covered the first meeting focusing on LLMs.

Mar 2024:   UniTS - Unified Time Series Model

UniTS is a unified time series model that can process classification, forecasting, anomaly detection and imputation tasks within a single model with no task-specific modules. UniTS has zero-shot, few-shot, and prompt learning capabilities. Project website.

Mar 2024:   Weintraub Graduate Student Award

Michelle receives the 2024 Harold M. Weintraub Graduate Student Award. The award recognizes exceptional achievement in graduate studies in biological sciences. News Story. Congratulations!

Mar 2024:   PocketGen - Generating Full-Atom Ligand-Binding Protein Pockets

PocketGen is a deep generative model that generates residue sequence and full-atom structure of protein pockets, maximizing binding to ligands. Project website.

Feb 2024:   SPECTRA - Generalizability of Molecular AI

Feb 2024:   Kaneb Fellowship Award

The lab receives the John and Virginia Kaneb Fellowship Award at Harvard Medical School to enhance research progress in the lab.

Feb 2024:   NSF CAREER Award

The lab receives the NSF CAREER Award for our research in geometric deep learning to facilitate algorithmic and scientific advances in therapeutics.

Feb 2024:   Dean’s Innovation Award in AI

Jan 2024:   AI's Prospects in Nature Machine Intelligence

We discussed AI’s 2024 prospects with Nature Machine Intelligence, covering LLM progress, multimodal AI, multi-task agents, and how to bridge the digital divide across communities and world regions.

Jan 2024:   Combinatorial Therapeutic Perturbations

New paper introducing PDGrapher for combinatorial prediction of chemical and genetic perturbations using causally-inspired neural networks.

Zitnik Lab  ·  Artificial Intelligence in Medicine and Science  ·  Harvard  ·  Department of Biomedical Informatics