Relational Reasoning
Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables.
Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables. While this capability is essential for scientific reasoning, most existing evaluations of relational reasoning in large language models focus on structured inputs such as tables, graphs, or synthetic relational tasks, and do not isolate the sources of difficulty that arise from higher-arity relational binding. We study this problem through the lens of Relational Complexity (RC), defined as the minimum number of independent entities or operands that must be simultaneously bound to apply a relation. RC provides a principled way to vary reasoning difficulty independently of confounders such as input size, vocabulary, and representational choices. Building on RC, we introduce REL, a generative benchmark framework spanning algebra, chemistry, and biology that varies RC within each domain. Evaluating frontier LLMs, we observe a consistent and monotonic degradation in performance as RC increases, even when the total number of entities is held fixed. This failure mode persists under increased test-time compute and with in-context learning, suggesting a limitation tied to the arity of the required relational binding rather than insufficient inference steps or exposure to examples. Our results identify a well-defined regime of higher-arity reasoning in which current models struggle, and motivate revisiting reasoning benchmarks through the lens of relational complexity.
Capability Overview
We study relational reasoning through the lens of relational complexity, defined in the paper as the minimum number of independent entities or operands that must be simultaneously bound to apply a relation.
Relational Reasoning
Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables.
Relational Complexity
RC provides a principled way to vary reasoning difficulty independently of confounders such as input size, vocabulary, and representational choices.
REL Benchmark
Building on RC, REL spans algebra, chemistry, and biology and varies relational complexity within each domain.
Application Areas
Each area below includes a task sketch, a short explanation of why the task is relational, and a toy generator with adjustable parameters.
Math · REL-A1
The task presents a Raven-style matrix whose visible panels all instantiate the same latent attribute triple, and asks the model to choose the missing panel from a candidate set.
Correctness depends on binding three attributes together under one shared rule and matching that structure against multiple distractor panels.
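A toy REL-A1 generator can be sketched in a few lines. The attribute vocabularies, the 8-panel context, and the one-attribute distractor scheme below are illustrative assumptions, not the benchmark's actual generator.

```python
import random

SHAPES = ["circle", "square", "triangle"]
COLORS = ["red", "green", "blue"]
COUNTS = [1, 2, 3]

def gen_rel_a1(n_distractors=3, seed=0):
    """Toy REL-A1 item: every visible panel instantiates one latent
    (shape, color, count) triple; the answer repeats the triple and each
    distractor perturbs exactly one of its three attributes."""
    rng = random.Random(seed)
    triple = (rng.choice(SHAPES), rng.choice(COLORS), rng.choice(COUNTS))
    panels = [triple] * 8  # 3x3 grid with the bottom-right panel missing
    distractors = set()
    while len(distractors) < n_distractors:  # needs n_distractors <= 6
        shape, color, count = triple
        slot = rng.randrange(3)  # which attribute to perturb
        if slot == 0:
            shape = rng.choice([s for s in SHAPES if s != shape])
        elif slot == 1:
            color = rng.choice([c for c in COLORS if c != color])
        else:
            count = rng.choice([k for k in COUNTS if k != count])
        distractors.add((shape, color, count))
    candidates = [triple, *distractors]
    rng.shuffle(candidates)
    return panels, candidates, candidates.index(triple)
```

Because each distractor matches the latent triple on two of three attributes, the item cannot be solved by checking any single attribute in isolation.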
Math · REL-A2
The task asks the model to recover a progression over structured numeric panels and identify which candidate completes the bottom-right cell of the Raven matrix.
Success depends on coordinating row-major structure across multiple positions and attributes rather than matching a single local arithmetic cue.
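A toy REL-A2 generator might look like the following. The affine row/column rule and the local-cue distractors are illustrative assumptions chosen so that no single row or column alone determines the answer.

```python
import random

def gen_rel_a2(size=3, seed=0):
    """Toy REL-A2 item: panel values follow base + r*row_step + c*col_step,
    so the missing bottom-right cell is only recoverable by coordinating
    the row and column progressions jointly."""
    rng = random.Random(seed)
    base = rng.randrange(1, 10)
    row_step = rng.randrange(1, 5)
    col_step = rng.randrange(1, 5)
    grid = [[base + r * row_step + c * col_step for c in range(size)]
            for r in range(size)]
    answer = grid[size - 1][size - 1]
    grid[size - 1][size - 1] = None  # hide the bottom-right panel
    # distractors that match a single local cue but not the joint rule
    candidates = sorted({answer, answer + row_step, answer + col_step, answer - 1})
    return grid, candidates, answer
```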
Biology · REL-B1
The model is given a multiple sequence alignment and a phylogenetic tree, and must determine whether homoplasy is present and, if so, which taxa exhibit it.
The model must bind the same motif across multiple independent lineages simultaneously to decide which taxa belong together.
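A toy REL-B1 generator can plant a shared motif either across independent clades (homoplasy) or within one clade. The fixed 4-taxon tree, the `GGG` motif, and the G-free background alphabet are illustrative assumptions that keep the toy item unambiguous.

```python
import random

def gen_rel_b1(homoplasy=True, seq_len=12, seed=0):
    """Toy REL-B1 item: a fixed tree ((A,B),(C,D)) plus an alignment in
    which a motif appears either in taxa from independent clades
    (homoplasy) or within a single clade (no homoplasy)."""
    rng = random.Random(seed)
    tree = "((A,B),(C,D));"
    taxa = ["A", "B", "C", "D"]
    motif = "GGG"
    background = "ACT"  # G-free so the motif uniquely marks carriers
    carriers = ["A", "C"] if homoplasy else ["A", "B"]
    pos = rng.randrange(seq_len - len(motif))
    alignment = {}
    for t in taxa:
        seq = [rng.choice(background) for _ in range(seq_len)]
        if t in carriers:
            seq[pos:pos + len(motif)] = motif
        alignment[t] = "".join(seq)
    return tree, alignment, homoplasy, carriers
```

Deciding the label requires jointly binding the carriers' positions in the alignment to their positions in the tree, which is exactly the relational step the task isolates.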
Chemistry · REL-C1
The task asks whether a set of molecules is consistent with a shared molecular formula.
The model must compare multiple molecular candidates under the same formula-level constraint.
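A toy REL-C1 generator might represent each candidate molecule as an atom-count formula; the element ranges and the H-perturbation used to break consistency are illustrative assumptions.

```python
import random
from collections import Counter

def gen_rel_c1(n_molecules=4, consistent=True, seed=0):
    """Toy REL-C1 item: n candidate molecules given as atom-count
    formulas; the question is whether all share one molecular formula."""
    rng = random.Random(seed)
    formula = Counter({"C": rng.randrange(3, 8),
                       "H": rng.randrange(6, 16),
                       "O": rng.randrange(0, 3)})
    molecules = [Counter(formula) for _ in range(n_molecules)]
    if not consistent:
        odd = rng.randrange(n_molecules)
        molecules[odd]["H"] += 2  # perturb exactly one candidate
    label = all(m == molecules[0] for m in molecules)
    return molecules, label
```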
Chemistry · REL-C2
This task asks the model to reason over common structure across several molecules rather than evaluating one candidate alone.
The answer requires jointly comparing multiple operands and identifying what they share under a common constraint.
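A toy REL-C2 generator can model each molecule as a set of substructure tokens, where the shared structure is only recoverable by intersecting all sets jointly. The fragment vocabulary and one-distractor-per-molecule scheme are illustrative assumptions.

```python
import random

def gen_rel_c2(n_molecules=3, seed=0):
    """Toy REL-C2 item: every molecule contains one shared pair of
    substructure tokens plus a distinct distractor token, so any
    pairwise comparison alone is insufficient."""
    rng = random.Random(seed)
    fragments = ["OH", "C=O", "NH2", "ring", "CH3", "COOH"]
    shared = set(rng.sample(fragments, 2))
    remaining = [f for f in fragments if f not in shared]
    extras = rng.sample(remaining, n_molecules)  # one distinct extra each
    molecules = [shared | {e} for e in extras]
    answer = set.intersection(*molecules)
    return molecules, answer
```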
Chemistry · REL-C3
The task shows a partial family of molecules and asks the model to infer which constitutional isomer is missing.
The model must jointly bind the molecular formula, the observed subset, and the full candidate family to recover the missing structure.
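A toy REL-C3 generator might model the isomer family as positional isomers of a single functional group; the string encoding and the decision to ignore end-to-end chain symmetry are illustrative simplifications.

```python
import random

def gen_rel_c3(chain_len=5, seed=0):
    """Toy REL-C3 item: the family is all placements of one OH group on
    a linear carbon chain (ignoring end-to-end symmetry for simplicity);
    one member is hidden and must be inferred from the rest."""
    rng = random.Random(seed)
    family = [f"C{chain_len}H{2 * chain_len + 1}OH@{pos}"
              for pos in range(1, chain_len + 1)]
    missing = rng.choice(family)
    observed = [m for m in family if m != missing]
    rng.shuffle(observed)
    return observed, missing
```

Recovering the missing member requires binding the shared formula, the observed placements, and the enumerable family at once, mirroring the structure of the full task.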
Results
Each application area below reports its main empirical pattern alongside a compact table summarizing how performance changes.
Math Results
Performance drops as the decision rule becomes more relationally demanding and the input grid grows.
The hardest conditions remain at low accuracy even as the grid grows further, showing a limit in coordinating multiple interacting algebraic relations.
| Rank | Model | Avg Accuracy | REL-A1 | REL-A2 | REL-A3 | REL-A4 |
|---|---|---|---|---|---|---|
| #1 | GPT-5.2 | 0.761 | 1.000 | 0.966 | 0.798 | 0.280 |
| #2 | Gemini 3 Pro | 0.698 | 1.000 | 0.968 | 0.588 | 0.236 |
| #3 | Claude Opus 4.5 | 0.674 | 1.000 | 0.986 | 0.542 | 0.168 |
| Rank | Model | Avg Accuracy | REL-A5 | REL-A6 | REL-A7 |
|---|---|---|---|---|---|
| #1 | GPT-5.2 | 0.497 | 0.792 | 0.594 | 0.106 |
| #2 | Claude Opus 4.5 | 0.464 | 0.742 | 0.542 | 0.108 |
| #3 | Gemini 3 Pro | 0.449 | 0.716 | 0.508 | 0.122 |
Biology Results
Models struggle to consider multiple taxa at once when determining their positions relative to each other and to the rest of the tree; columns report accuracy by the number of taxa involved.
| Rank | Model | Avg Accuracy | 2 | 3 | 4 | 5 | 10 | 15 | 20 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|
| #1 | Gemini 3 Pro Preview | 0.202 | 0.161 | 0.247 | 0.430 | 0.400 | 0.203 | 0.077 | 0.070 | 0.026 |
| #2 | GPT-5.2 | 0.200 | 0.207 | 0.444 | 0.350 | 0.380 | 0.137 | 0.056 | 0.022 | 0.000 |
| #3 | Claude Opus 4.5 | 0.115 | 0.096 | 0.214 | 0.266 | 0.240 | 0.059 | 0.043 | 0.000 | 0.000 |
Chemistry Results
REL-C1
The model must compare candidate molecules and decide whether they belong to the same structural family under one shared compositional constraint.
| Rank | Model | Avg Score | <20 mol | 20–40 mol | ≥40 mol |
|---|---|---|---|---|---|
| #1 | Gemini 3 Pro | 75.7% | 64.3% | 75.5% | 87.3% |
| #2 | Claude Opus 4.5 | 70.7% | 60.0% | 75.8% | 76.3% |
| #3 | GPT-5.2 | 50.1% | 50.3% | 50.0% | 50.0% |
REL-C2
Difficulty rises because the model must distinguish the correct relation from a broader and more confusable set of alternatives.
| Rank | Model | Avg Score | <20 mol | 20–40 mol | ≥40 mol |
|---|---|---|---|---|---|
| #1 | Claude Opus 4.5 | 40.1% | 38.8% | 41.7% | 39.7% |
| #2 | Gemini 3 Pro | 37.8% | 43.2% | 41.1% | 29.2% |
| #3 | GPT-5.2 | 36.1% | 37.4% | 33.4% | 37.3% |
REL-C3
This is the most demanding chemistry condition: the right answer depends on coordinating the broadest set of compositional and structural constraints.
| Rank | Model | Avg Score | <20 mol | 20–40 mol | ≥40 mol |
|---|---|---|---|---|---|
| #1 | Gemini 3 Pro | 34.6% | 41.7% | 36.2% | 25.9% |
| #2 | Claude Opus 4.5 | 30.2% | 34.1% | 31.3% | 25.3% |
| #3 | GPT-5.2 | 12.8% | 14.2% | 14.0% | 10.2% |
@article{fesser2026rel,
title = {Evaluating Relational Reasoning in LLMs with REL},
author = {Lukas Fesser and Yasha Ektefaie and Ada Fang and Sham M. Kakade and Marinka Zitnik},
year = {2026},
journal = {arXiv preprint arXiv:2604.12176},
eprint = {2604.12176},
archivePrefix = {arXiv},
url = {https://arxiv.org/abs/2604.12176}
}