Knowledge Graph Based Agent for Knowledge-Intensive QA in Medicine

Biomedical knowledge is uniquely complex and structured, requiring distinct reasoning strategies compared to other scientific disciplines like physics or chemistry. Biomedical scientists do not rely on a single approach to reasoning; instead, they use various strategies, including rule-based, prototype-based, and case-based reasoning. This diversity calls for flexible approaches that accommodate multiple reasoning strategies while leveraging in-domain knowledge. We introduce KGARevion, a knowledge graph (KG) based agent designed to address the complexity of knowledge-intensive medical queries. Upon receiving a query, KGARevion generates relevant triplets by drawing on the internal knowledge base of the LLM. These triplets are then verified against a grounded KG to filter out erroneous information, ensuring that only accurate, relevant knowledge contributes to the final answer. Unlike RAG-based models, this multi-step verification process makes reasoning robust while adapting to different models of medical reasoning. Evaluations on four gold-standard medical QA datasets show that KGARevion improves accuracy by over 5.2%, outperforming 15 models in handling complex medical questions. To further probe its capabilities, we curated three new medical QA datasets with varying levels of semantic complexity, on which KGARevion achieved a 10.4% improvement in accuracy.

Motivation

Medical reasoning involves making diagnostic and therapeutic decisions while also understanding the pathology of diseases. Unlike many other scientific domains, medical reasoning is often vertically organized, relying heavily on analogy. For instance, in biomedical research, an organism such as Drosophila serves as an exemplar to model a disease mechanism, which is then applied by analogy to other organisms, including humans. In clinical practice, the patient serves as an exemplar, with generalizations drawn from many overlapping disease models and similar patient populations. In contrast, fields like physics and chemistry tend to be horizontally organized, with general principles applied to specific cases. This distinction highlights the unique challenges that medical reasoning poses for question-answering (QA) models.

While large language models (LLMs) have demonstrated strong general capabilities, their responses to medical questions often suffer from incorrect retrieval, missing key information, and misalignment with current scientific and medical knowledge. Additionally, they can struggle to provide contextually relevant answers that account for specific local contexts, such as patient demographics or geography, as well as specific areas of biology. A major issue lies in these models’ inability to systematically integrate different types of evidence. Specifically, they have difficulty combining scientific factual (structured, codified) knowledge derived from formal, rigorous research with tacit (noncodified) knowledge—expertise and lessons learned—which is crucial for contextualizing and interpreting scientific evidence in relation to the specific modifying factors of a given medical question.

LLM-powered QA models often lack such multi-source and grounded knowledge necessary for medical reasoning, which requires understanding the nuanced and specialized nature of medical concepts. Additionally, LLMs trained on general knowledge may struggle to solve medical problems that demand specialized in-domain knowledge. This shortcoming arises from their inability to discern subtle, granular differences that are critical in medical contexts. As a result, LLMs face challenges in complex medical reasoning because such reasoning requires both: 1) simultaneous consideration of dependencies across multiple medical concepts within an input question, and 2) precise, local in-domain knowledge of semantically similar concepts that can carry different medical meanings, as we show in the figure below.

The prevailing strategy to address these challenges is information retrieval, such as retrieval-augmented generation (RAG), which follows a Retrieve-then-Answer paradigm. Although these methods can provide multi-source knowledge from external databases, the accuracy of the generated answers depends heavily on the quality of the retrieved information, making them vulnerable to retrieval errors. The data repositories and knowledge bases these models draw from can contain incomplete or incorrect information, leading to inaccurate retrieval. Further, many RAG-based methods lack post-retrieval verification mechanisms to confirm that retrieved information is factually correct and does not omit key information. Knowledge graphs (KGs) of medical concepts have been widely adopted as grounded knowledge bases that provide precise, specialized in-domain knowledge for medical QA models. While KGs can enhance the performance of these models, they are often incomplete. Consequently, approaches that retrieve medical concepts from a KG based solely on the presence of direct associations (edges) between concepts are insufficient. For instance, concepts representing two proteins with distinct biological roles may not be directly connected in the KG, even though the proteins share similar biological representations.

To advance LLM-powered models for knowledge-intensive medical QA, it is essential to develop models that can (1) consider complex associations between several medical concepts at the same time, (2) systematically integrate multi-source knowledge, and (3) effectively verify and ground the retrieved information to ensure contextual relevance and accuracy.

KGARevion: KG-Based Agent for Knowledge-Intensive QA in Medicine

We introduce KGARevion, a knowledge graph-based LLM agent designed for complex medical question answering (QA). KGARevion integrates the non-codified knowledge of LLMs with the structured, codified knowledge embedded in medical concept KGs. It operates through four key actions, as shown in the figure below.

First, KGARevion prompts the LLM to generate relevant triplets based on the input question. To ensure the accuracy of these generated triplets and fully leverage the structured KG, KGARevion fine-tunes the LLM on a KG completion task. This involves incorporating pre-trained structural embeddings of triplets as prefix tokens. The fine-tuned model is then used to evaluate the correctness of the generated triplets.
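The Generate and Review steps can be sketched as follows. This is a minimal illustration, not the released KGARevion API: `Triplet`, `review`, and `verifier_score` are hypothetical names, and `verifier_score` stands in for the LLM fine-tuned on KG completion with structural prefix embeddings. Triplets found verbatim in the grounded KG pass immediately; the rest are scored by the verifier, so a triplet is not rejected merely because the incomplete KG lacks a direct edge.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    head: str
    relation: str
    tail: str

# Toy grounded KG: a set of triplets standing in for a medical concept KG.
KG = {
    Triplet("BRCA1", "associated_with", "breast cancer"),
    Triplet("TP53", "regulates", "apoptosis"),
}

def review(candidates, kg, verifier_score, threshold=0.5):
    """Review action: split LLM-generated triplets into verified/rejected.

    Triplets present in the grounded KG are accepted directly; the rest
    are scored by `verifier_score`, a stub for the fine-tuned KG-completion
    model that judges triplet correctness.
    """
    verified, rejected = [], []
    for t in candidates:
        score = 1.0 if t in kg else verifier_score(t)
        (verified if score >= threshold else rejected).append(t)
    return verified, rejected

# Example: one true triplet and one fabricated one; the stub verifier
# assigns a low score to anything outside the KG.
candidates = [
    Triplet("TP53", "regulates", "apoptosis"),
    Triplet("TP53", "causes", "influenza"),
]
verified, rejected = review(candidates, KG, verifier_score=lambda t: 0.1)
```

In the actual model, the score comes from the LLM conditioned on a pre-trained structural embedding of the triplet prepended as soft prefix tokens, rather than from a fixed constant as in this stub.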

Next, KGARevion performs a ‘Revise’ action to correct any erroneous triplets, ultimately identifying the correct answer based on the verified triplets. Given the complexity of medical reasoning, KGARevion adaptively selects the most appropriate reasoning strategy for each question, allowing for more nuanced and context-aware QA. This flexibility enables KGARevion to handle both multiple-choice and open-ended questions effectively.

Publication

Knowledge Graph Based Agent for Complex, Knowledge-Intensive QA in Medicine
Xiaorui Su, Yibo Wang, Shanghua Gao, Xiaolong Liu, Valentina Giunchiglia, Djork-Arné Clevert, Marinka Zitnik
International Conference on Learning Representations, ICLR 2025 [arXiv] [OpenReview]

@inproceedings{su2025knowledge,
  title={Knowledge Graph Based Agent for Complex, Knowledge-Intensive QA in Medicine},
  author={Su, Xiaorui and Wang, Yibo and Gao, Shanghua and Liu, Xiaolong and Giunchiglia, Valentina and Clevert, Djork-Arn{\'e} and Zitnik, Marinka},
  booktitle={International Conference on Learning Representations, ICLR},
  year={2025}
}

Code Availability

A PyTorch implementation of KGARevion is available in the GitHub repository.

Zitnik Lab  ·  Artificial Intelligence in Medicine and Science  ·  Harvard  ·  Department of Biomedical Informatics