KGARevion - An AI Agent for Knowledge-Intensive Biomedical QA

Biomedical reasoning integrates structured, codified knowledge with tacit, experience-driven insights. Depending on the context, quantity, and nature of available evidence, researchers and clinicians use diverse strategies, including rule-based, prototype-based, and case-based reasoning. Effective medical AI models must handle this complexity while ensuring reliability and adaptability.

We introduce KGARevion, a knowledge graph-based agent that answers knowledge-intensive questions. Upon receiving a query, KGARevion generates relevant triplets by leveraging the latent knowledge embedded in a large language model. It then verifies these triplets against a grounded knowledge graph, filtering out errors and retaining only accurate, contextually relevant information for the final answer. This multi-step process strengthens reasoning, adapts to different models of medical inference, and outperforms retrieval-augmented generation-based approaches that lack effective verification mechanisms.

Evaluations on medical QA benchmarks show that KGARevion improves accuracy by over 5.2% over 15 models in handling complex medical queries. To further assess its effectiveness, we curated three new medical QA datasets with varying levels of semantic complexity, where KGARevion improved accuracy by 10.4%. The agent integrates with different LLMs and biomedical knowledge graphs for broad applicability across knowledge-intensive tasks. We evaluated KGARevion on AfriMed-QA, a newly introduced dataset focused on African healthcare, demonstrating its strong zero-shot generalization to underrepresented medical contexts.

Motivation

Medical reasoning involves making diagnostic and therapeutic decisions while also understanding the pathology of diseases. Unlike many other scientific domains, medical reasoning often relies on vertical reasoning, using analogy more heavily. For instance, in biomedical research, an organism such as Drosophila is used as an exemplar to model a disease mechanism, which is then applied by analogy to other organisms, including humans. In clinical practice, the patient serves as an exemplar, with generalizations drawn from many overlapping disease models and similar patient populations. In contrast, fields like physics and chemistry tend to be horizontally organized, where general principles are applied to specific cases. This distinction highlights the unique challenges that medical reasoning poses for question-answering (QA) models.

While large language models (LLMs) have demonstrated strong general capabilities, their responses to medical questions often suffer from incorrect retrieval, missing key information, and misalignment with current scientific and medical knowledge. Additionally, they can struggle to provide contextually relevant answers that account for specific local contexts, such as patient demographics or geography, as well as specific areas of biology. A major issue lies in these models’ inability to systematically integrate different types of evidence. Specifically, they have difficulty combining scientific factual (structured, codified) knowledge derived from formal, rigorous research with tacit (noncodified) knowledge—expertise and lessons learned—which is crucial for contextualizing and interpreting scientific evidence in relation to the specific modifying factors of a given medical question.

LLM-powered QA models often lack such multi-source and grounded knowledge necessary for medical reasoning, which requires understanding the nuanced and specialized nature of medical concepts. Additionally, LLMs trained on general knowledge may struggle to solve medical problems that demand specialized in-domain knowledge. This shortcoming arises from their inability to discern subtle, granular differences that are critical in medical contexts. As a result, LLMs face challenges in complex medical reasoning because such reasoning requires both: 1) simultaneous consideration of dependencies across multiple medical concepts within an input question, and 2) precise, local in-domain knowledge of semantically similar concepts that can carry different medical meanings, as we show in the figure below.

The prevailing strategy to address these challenges is the use of information retrieval techniques, such as retrieval-augmented generation (RAG), which follows a Retrieve-then-Answer paradigm. Although these methods can provide multi-source knowledge from external databases, the accuracy of the generated answers depends heavily on the quality of the retrieved information, making them vulnerable to potential errors. Data repositories and knowledge bases these models draw from contain incomplete or incorrect information, leading to inaccurate retrieval. Further, many RAG-based methods lack post-retrieval verification mechanisms to validate that retrieved information is factually correct and does not miss key information. Knowledge graphs (KGs) of medical concepts have been widely adopted as a grounded knowledge base to provide precise and specialized in-domain knowledge for medical QA models. While KGs can enhance the performance of these models, they are often incomplete. Consequently, approaches that retrieve medical concepts from a KG based solely on the presence of direct associations (edges) between concepts are insufficient. For instance, concepts representing two proteins with distinct biological roles may not be directly connected in the KG, even though these proteins share similar biological representations.

To advance LLM-powered models for knowledge-intensive medical QA, it is essential to develop models that can (1) consider complex associations between several medical concepts at the same time, (2) systematically integrate multi-source knowledge, and (3) effectively verify and ground the retrieved information to ensure contextual relevance and accuracy.

KGARevion: KG-Based Agent for Knowledge-Intensive QA in Medicine

We introduce KGARevion, a knowledge graph-based LLM agent designed for complex medical question answering (QA). KGARevion integrates the non-codified knowledge of LLMs with the structured, codified knowledge embedded in medical concept KGs. It operates through four key actions, as shown in the figure below.

First, KGARevion prompts the LLM to generate relevant triplets based on the input question. To ensure the accuracy of these generated triplets and fully leverage the structured KG, KGARevion fine-tunes the LLM on a KG completion task. This involves incorporating pre-trained structural embeddings of triplets as prefix tokens. The fine-tuned model is then used to evaluate the correctness of the generated triplets.

Next, KGARevion performs a ‘Revise’ action to correct any erroneous triplets, ultimately identifying the correct answer based on the verified triplets. Given the complexity of medical reasoning, KGARevion adaptively selects the most appropriate reasoning strategy for each question, allowing for more nuanced and context-aware QA. This flexibility enables KGARevion to handle both multiple-choice and open-ended questions effectively.

Publication

KGARevion: An AI Agent for Knowledge-Intensive Biomedical QA
Xiaorui Su, Yibo Wang, Shanghua Gao, Xiaolong Liu, Valentina Giunchiglia, Djork-Arné Clevert, Marinka Zitnik
International Conference on Learning Representations, ICLR 2025 [arXiv] [OpenReview]

@article{su2025knowledge,
  title={KGARevion: An AI Agent for Knowledge-Intensive Biomedical QA},
  author={Su, Xiaorui and Wang, Yibo and Gao, Shanghua and Liu, Xiaolong and Giunchiglia, Valentina and Clevert, Djork-Arn{\'e} and Zitnik, Marinka},
  journal={International Conference on Learning Representations, ICLR},
  year={2025}
}

Code Availability

Pytorch implementation of KGARevion is available in the GitHub repository.

Authors

Latest News

Mar 2025:   Multimodal AI predicts clinical outcomes of drug combinations from preclinical data

Mar 2025:   KGARevion: AI Agent for Knowledge-Intensive Biomedical QA

KGARevion is an AI agent designed for complex biomedical QA that integrates the non-codified knowledge of LLMs with the structured, codified knowledge found in knowledge graphs. [ICLR 2025 publication]

Feb 2025:   MedTok: Unlocking Medical Codes for GenAI

Meet MedTok, a multimodal medical code tokenizer that transforms how AI understands structured medical data. By integrating textual descriptions and relational contexts, MedTok enhances tokenization for transformer-based models—powering everything from EHR foundation models to medical QA. [Project website]

Feb 2025:   What If You Could Rewrite Biology? Meet CLEF

What if we could anticipate molecular and medical changes before they happen? Introducing CLEF, an approach for counterfactual generation in biological and medical sequence models. [Project website]

Feb 2025:   Digital Twins as Global Health and Disease Models

Jan 2025:   LLM and KG+LLM agent papers at ICLR

Jan 2025:   Artificial Intelligence in Medicine 2

Excited to share our new graduate course on Artificial Intelligence in Medicine 2.

Jan 2025:   ProCyon AI Highlighted by Kempner

Thanks to Kempner Institute for highlighting our latest research, ProCyon, our protein-text foundation model for modeling protein functions.

Jan 2025:   AI Design of Proteins for Therapeutics

Dec 2024:   Unified Clinical Vocabulary Embeddings

New paper: A unified resource provides a new representation of clinical knowledge by unifying medical vocabularies. (1) Phenotype risk score analysis across 4.57 million patients, (2) Inter-institutional clinician panels evaluate alignment with clinical knowledge across 90 diseases and 3,000 clinical codes.

Dec 2024:   SPECTRA in Nature Machine Intelligence

Are biomedical AI models truly as smart as they seem? SPECTRA is a framework that evaluates models by considering the full spectrum of cross-split overlap: train-test similarity. SPECTRA reveals gaps in benchmarks for molecular sequence data across 19 models, including LLMs, GNNs, diffusion models, and conv nets.

Nov 2024:   Ayush Noori Selected as a Rhodes Scholar

Congratulations to Ayush Noori on being named a Rhodes Scholar! Such an incredible achievement!

Nov 2024:   PocketGen in Nature Machine Intelligence

Oct 2024:   Activity Cliffs in Molecular Properties

Oct 2024:   Knowledge Graph Agent for Medical Reasoning

Sep 2024:   Three Papers Accepted to NeurIPS

Exciting projects include a unified multi-task time series model, a flow-matching approach for generating protein pockets using geometric priors, and a tokenization method that produces invariant molecular representations for integration into large language models.

Sep 2024:   TxGNN Published in Nature Medicine

Aug 2024:   Graph AI in Medicine

Excited to share a new perspective on Graph Artificial Intelligence in Medicine in Annual Reviews.

Zitnik Lab  ·  Artificial Intelligence in Medicine and Science  ·  Harvard  ·  Department of Biomedical Informatics