Biomedical knowledge is uniquely complex and structured, requiring distinct reasoning strategies compared to other scientific disciplines like physics or chemistry. Biomedical scientists do not rely on a single approach to reasoning; instead, they use various strategies, including rule-based, prototype-based, and case-based reasoning. This diversity calls for flexible approaches that accommodate multiple reasoning strategies while leveraging in-domain knowledge. We introduce KGARevion, a knowledge graph (KG) based agent designed to address the complexity of knowledge-intensive medical queries. Upon receiving a query, KGARevion generates relevant triplets by drawing on the internal knowledge of the LLM. These triplets are then verified against a grounded KG to filter out erroneous information and ensure that only accurate, relevant data contribute to the final answer. Unlike RAG-based models, which answer directly from retrieved content, KGARevion uses this multi-step process to ensure robust reasoning while adapting to different models of medical reasoning. Evaluations on four gold-standard medical QA datasets show that KGARevion improves accuracy by over 5.2%, outperforming 15 models in handling complex medical questions. To test its capabilities, we curated three new medical QA datasets with varying levels of semantic complexity, where KGARevion achieved a 10.4% improvement in accuracy.
Motivation
Medical reasoning involves making diagnostic and therapeutic decisions while also understanding the pathology of diseases. Unlike many other scientific domains, medicine tends to be vertically organized, relying heavily on reasoning by analogy. For instance, in biomedical research, an organism such as Drosophila serves as an exemplar to model a disease mechanism, which is then applied by analogy to other organisms, including humans. In clinical practice, the patient serves as an exemplar, with generalizations drawn from many overlapping disease models and similar patient populations. In contrast, fields like physics and chemistry tend to be horizontally organized, with general principles applied to specific cases. This distinction highlights the unique challenges that medical reasoning poses for question-answering (QA) models.
While large language models (LLMs) have demonstrated strong general capabilities, their responses to medical questions often suffer from incorrect retrieval, missing key information, and misalignment with current scientific and medical knowledge. Additionally, they can struggle to provide contextually relevant answers that account for specific local contexts, such as patient demographics or geography, as well as specific areas of biology. A major issue lies in these models’ inability to systematically integrate different types of evidence. Specifically, they have difficulty combining scientific factual (structured, codified) knowledge derived from formal, rigorous research with tacit (noncodified) knowledge—expertise and lessons learned—which is crucial for contextualizing and interpreting scientific evidence in relation to the specific modifying factors of a given medical question.
LLM-powered QA models often lack such multi-source and grounded knowledge necessary for medical reasoning, which requires understanding the nuanced and specialized nature of medical concepts. Additionally, LLMs trained on general knowledge may struggle to solve medical problems that demand specialized in-domain knowledge. This shortcoming arises from their inability to discern subtle, granular differences that are critical in medical contexts. As a result, LLMs face challenges in complex medical reasoning because such reasoning requires both (1) simultaneous consideration of dependencies across multiple medical concepts within an input question, and (2) precise, local in-domain knowledge of semantically similar concepts that can carry different medical meanings, as we show in the figure below.

The prevailing strategy to address these challenges is the use of information retrieval techniques, such as retrieval-augmented generation (RAG), which follows a Retrieve-then-Answer paradigm. Although these methods can provide multi-source knowledge from external databases, the accuracy of the generated answers depends heavily on the quality of the retrieved information, making them vulnerable to retrieval errors. The data repositories and knowledge bases these models draw from can contain incomplete or incorrect information, leading to inaccurate retrieval. Further, many RAG-based methods lack post-retrieval verification mechanisms to confirm that retrieved information is factually correct and does not omit key information. Knowledge graphs (KGs) of medical concepts have been widely adopted as grounded knowledge bases that provide precise, specialized in-domain knowledge for medical QA models. While KGs can enhance the performance of these models, they are often incomplete. Consequently, approaches that retrieve medical concepts from a KG based solely on the presence of direct associations (edges) between concepts are insufficient. For instance, concepts representing two proteins with distinct biological roles may not be directly connected in the KG, even though these proteins share similar biological representations.
To advance LLM-powered models for knowledge-intensive medical QA, it is essential to develop models that can (1) consider complex associations between several medical concepts at the same time, (2) systematically integrate multi-source knowledge, and (3) effectively verify and ground the retrieved information to ensure contextual relevance and accuracy.
KGARevion: KG-Based Agent for Knowledge-Intensive QA in Medicine
We introduce KGARevion, a knowledge graph-based LLM agent designed for complex medical question answering (QA). KGARevion integrates the non-codified knowledge of LLMs with the structured, codified knowledge embedded in medical concept KGs. It operates through four key actions, as shown in the figure below.

First, KGARevion prompts the LLM to generate relevant triplets based on the input question. To ensure the accuracy of these generated triplets and fully leverage the structured KG, KGARevion fine-tunes the LLM on a KG completion task. This involves incorporating pre-trained structural embeddings of triplets as prefix tokens. The fine-tuned model is then used to evaluate the correctness of the generated triplets.
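To make this step concrete, below is a minimal sketch of how pre-trained structural embeddings of a triplet could be injected as prefix tokens ahead of the textual prompt when fine-tuning the verifier on the KG completion task. The tensor dimensions, the single linear projection, the binary true/false head, and the assumption of a Hugging Face-style backbone that accepts `inputs_embeds` are illustrative choices, not the exact released implementation.

```python
# Illustrative sketch only: dimensions, the linear projection, and the binary
# classification head are assumptions, not KGARevion's exact design.
import torch
import torch.nn as nn


class TripletVerifier(nn.Module):
    """Scores whether a (head, relation, tail) triplet is correct, conditioning
    the LLM on the triplet's pre-trained KG embeddings via prefix tokens."""

    def __init__(self, llm_backbone: nn.Module, kg_dim: int = 128, llm_dim: int = 4096):
        super().__init__()
        self.llm = llm_backbone                     # assumed HF-style model taking inputs_embeds
        self.proj = nn.Linear(kg_dim, llm_dim)      # map KG space -> token-embedding space
        self.classifier = nn.Linear(llm_dim, 2)     # logits over {correct, incorrect}

    def forward(self, kg_emb: torch.Tensor, token_emb: torch.Tensor) -> torch.Tensor:
        # kg_emb:    (batch, 3, kg_dim)  structural embeddings of head, relation, tail
        # token_emb: (batch, seq, llm_dim) embeddings of the textual triplet prompt
        prefix = self.proj(kg_emb)                            # (batch, 3, llm_dim)
        inputs = torch.cat([prefix, token_emb], dim=1)        # prepend as prefix tokens
        hidden = self.llm(inputs_embeds=inputs).last_hidden_state
        return self.classifier(hidden[:, -1])                 # score from the final position
```

During fine-tuning, correct triplets from the grounded KG and corrupted negatives would supply the true/false labels; at inference, the same model scores the LLM-generated triplets in the Review action.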
Next, KGARevion performs a ‘Revise’ action to correct any erroneous triplets, ultimately identifying the correct answer based on the verified triplets. Given the complexity of medical reasoning, KGARevion adaptively selects the most appropriate reasoning strategy for each question, allowing for more nuanced and context-aware QA. This flexibility enables KGARevion to handle both multiple-choice and open-ended questions effectively.
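Putting the actions together, the control flow can be thought of as a short loop over Generate, Review, Revise, and Answer. The sketch below is a schematic of that flow under our own assumptions; the function names, the revision budget, and the stopping rule are illustrative and not taken from the released code.

```python
# Schematic action loop; the callables and the fixed revision budget are
# illustrative assumptions, not the released KGARevion implementation.
def kgarevion_answer(question, generate, review, revise, answer, max_revisions=2):
    triplets = generate(question)            # Generate: LLM proposes candidate triplets
    verified = review(triplets)              # Review: KG-grounded verifier filters them
    for _ in range(max_revisions):
        rejected = [t for t in triplets if t not in verified]
        if not rejected:                     # stop once every triplet passes review
            break
        triplets = verified + revise(question, rejected)  # Revise: correct rejected triplets
        verified = review(triplets)
    return answer(question, verified)        # Answer: reason over verified triplets only
```

Because each action is an independent callable, the Answer step can be instantiated differently for multiple-choice and open-ended questions while the Generate-Review-Revise loop stays the same.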
Publication
Knowledge Graph Based Agent for Complex, Knowledge-Intensive QA in Medicine
Xiaorui Su, Yibo Wang, Shanghua Gao, Xiaolong Liu, Valentina Giunchiglia, Djork-Arné Clevert, Marinka Zitnik
International Conference on Learning Representations, ICLR 2025 [arXiv] [OpenReview]
@article{su2025knowledge,
title={Knowledge Graph Based Agent for Complex, Knowledge-Intensive QA in Medicine},
author={Su, Xiaorui and Wang, Yibo and Gao, Shanghua and Liu, Xiaolong and Giunchiglia, Valentina and Clevert, Djork-Arn{\'e} and Zitnik, Marinka},
journal={International Conference on Learning Representations, ICLR},
year={2025}
}
Code Availability
A PyTorch implementation of KGARevion is available in the GitHub repository.