Relational Reasoning
Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables.
Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables. While this capability is essential for scientific reasoning, most existing evaluations of relational reasoning in large language models focus on structured inputs such as tables, graphs, or synthetic relational tasks, and do not isolate the sources of difficulty that arise from higher-arity relational binding. We study this problem through the lens of Relational Complexity (RC), defined as the minimum number of independent entities or operands that must be simultaneously bound to apply a relation. RC provides a principled way to vary reasoning difficulty independently of confounders such as input size, vocabulary, and representational choices. Building on RC, we introduce REL, a generative benchmark framework spanning algebra, chemistry, and biology that varies RC within each domain. Evaluating frontier LLMs, we observe a consistent and monotonic degradation in performance as RC increases, even when the total number of entities is held fixed. This failure mode persists under increased test-time compute and with in-context learning, suggesting a limitation tied to the arity of the required relational binding rather than insufficient inference steps or exposure to examples. Our results identify a well-defined regime of higher-arity reasoning in which current models struggle, and motivate revisiting reasoning benchmarks through the lens of relational complexity.
Capability Overview
We study relational reasoning through the lens of relational complexity, defined in the paper as the minimum number of independent entities or operands that must be simultaneously bound to apply a relation.
Relational Reasoning
Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables.
Relational Complexity
RC provides a principled way to vary reasoning difficulty independently of confounders such as input size, vocabulary, and representational choices.
REL Benchmark
Building on RC, REL spans algebra, chemistry, and biology and varies relational complexity within each domain.
Application Areas
Each area below includes a task sketch, a short explanation of why the task is relational, and a toy generator with adjustable parameters.
Math · REL-A1
The task presents a Raven-style matrix whose visible panels all instantiate the same latent attribute triple, and asks the model to choose the missing panel from a candidate set.
Correctness depends on binding three attributes together under one shared rule and matching that structure against multiple distractor panels.
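A toy REL-A1 generator can be sketched in a few lines. The attribute vocabularies, the 8-panel context, and the one-attribute distractor scheme below are illustrative assumptions, not the benchmark's actual generator.

```python
import random

SHAPES = ["circle", "square", "triangle"]
COLORS = ["red", "green", "blue"]
COUNTS = [1, 2, 3]

def gen_rel_a1(n_distractors=3, seed=0):
    """Toy REL-A1 item: every visible panel instantiates one latent
    (shape, color, count) triple; the answer repeats the triple and each
    distractor perturbs exactly one of its three attributes."""
    rng = random.Random(seed)
    triple = (rng.choice(SHAPES), rng.choice(COLORS), rng.choice(COUNTS))
    panels = [triple] * 8  # 3x3 grid with the bottom-right panel missing
    distractors = set()
    while len(distractors) < n_distractors:  # needs n_distractors <= 6
        shape, color, count = triple
        slot = rng.randrange(3)  # which attribute to perturb
        if slot == 0:
            shape = rng.choice([s for s in SHAPES if s != shape])
        elif slot == 1:
            color = rng.choice([c for c in COLORS if c != color])
        else:
            count = rng.choice([k for k in COUNTS if k != count])
        distractors.add((shape, color, count))
    candidates = [triple, *distractors]
    rng.shuffle(candidates)
    return panels, candidates, candidates.index(triple)
```

Because each distractor matches the latent triple on two of three attributes, the item cannot be solved by checking any single attribute in isolation.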
Math · REL-A2
The task asks the model to recover a progression over structured numeric panels and identify which candidate completes the bottom-right cell of the Raven matrix.
Success depends on coordinating row-major structure across multiple positions and attributes rather than matching a single local arithmetic cue.
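A toy REL-A2 generator might look like the following. The affine row/column rule and the local-cue distractors are illustrative assumptions chosen so that no single row or column alone determines the answer.

```python
import random

def gen_rel_a2(size=3, seed=0):
    """Toy REL-A2 item: panel values follow base + r*row_step + c*col_step,
    so the missing bottom-right cell is only recoverable by coordinating
    the row and column progressions jointly."""
    rng = random.Random(seed)
    base = rng.randrange(1, 10)
    row_step = rng.randrange(1, 5)
    col_step = rng.randrange(1, 5)
    grid = [[base + r * row_step + c * col_step for c in range(size)]
            for r in range(size)]
    answer = grid[size - 1][size - 1]
    grid[size - 1][size - 1] = None  # hide the bottom-right panel
    # distractors that match a single local cue but not the joint rule
    candidates = sorted({answer, answer + row_step, answer + col_step, answer - 1})
    return grid, candidates, answer
```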
Biology · REL-B1
The model is given a multiple sequence alignment and a phylogenetic tree, and must determine whether homoplasy is present and, if so, which taxa exhibit it.
The model must bind the same motif across multiple independent lineages simultaneously to decide which taxa belong together.
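A toy REL-B1 generator can plant a shared motif either across independent clades (homoplasy) or within one clade. The fixed 4-taxon tree, the `GGG` motif, and the G-free background alphabet are illustrative assumptions that keep the toy item unambiguous.

```python
import random

def gen_rel_b1(homoplasy=True, seq_len=12, seed=0):
    """Toy REL-B1 item: a fixed tree ((A,B),(C,D)) plus an alignment in
    which a motif appears either in taxa from independent clades
    (homoplasy) or within a single clade (no homoplasy)."""
    rng = random.Random(seed)
    tree = "((A,B),(C,D));"
    taxa = ["A", "B", "C", "D"]
    motif = "GGG"
    background = "ACT"  # G-free so the motif uniquely marks carriers
    carriers = ["A", "C"] if homoplasy else ["A", "B"]
    pos = rng.randrange(seq_len - len(motif))
    alignment = {}
    for t in taxa:
        seq = [rng.choice(background) for _ in range(seq_len)]
        if t in carriers:
            seq[pos:pos + len(motif)] = motif
        alignment[t] = "".join(seq)
    return tree, alignment, homoplasy, carriers
```

Deciding the label requires jointly binding the carriers' positions in the alignment to their positions in the tree, which is exactly the relational step the task isolates.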
Chemistry · REL-C1
The task asks whether a set of molecules is consistent with a shared molecular formula.
The model must compare multiple molecular candidates under the same formula-level constraint.
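A toy REL-C1 generator might represent each candidate molecule as an atom-count formula; the element ranges and the H-perturbation used to break consistency are illustrative assumptions.

```python
import random
from collections import Counter

def gen_rel_c1(n_molecules=4, consistent=True, seed=0):
    """Toy REL-C1 item: n candidate molecules given as atom-count
    formulas; the question is whether all share one molecular formula."""
    rng = random.Random(seed)
    formula = Counter({"C": rng.randrange(3, 8),
                       "H": rng.randrange(6, 16),
                       "O": rng.randrange(0, 3)})
    molecules = [Counter(formula) for _ in range(n_molecules)]
    if not consistent:
        odd = rng.randrange(n_molecules)
        molecules[odd]["H"] += 2  # perturb exactly one candidate
    label = all(m == molecules[0] for m in molecules)
    return molecules, label
```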
Chemistry · REL-C2
This task asks the model to reason over common structure across several molecules rather than evaluating one candidate alone.
The answer requires jointly comparing multiple operands and identifying what they share under a common constraint.
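A toy REL-C2 generator can model each molecule as a set of substructure tokens, where the shared structure is only recoverable by intersecting all sets jointly. The fragment vocabulary and one-distractor-per-molecule scheme are illustrative assumptions.

```python
import random

def gen_rel_c2(n_molecules=3, seed=0):
    """Toy REL-C2 item: every molecule contains one shared pair of
    substructure tokens plus a distinct distractor token, so any
    pairwise comparison alone is insufficient."""
    rng = random.Random(seed)
    fragments = ["OH", "C=O", "NH2", "ring", "CH3", "COOH"]
    shared = set(rng.sample(fragments, 2))
    remaining = [f for f in fragments if f not in shared]
    extras = rng.sample(remaining, n_molecules)  # one distinct extra each
    molecules = [shared | {e} for e in extras]
    answer = set.intersection(*molecules)
    return molecules, answer
```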
Chemistry · REL-C3
The task shows a partial family of molecules and asks the model to infer which constitutional isomer is missing.
The model must jointly bind the molecular formula, the observed subset, and the full candidate family to recover the missing structure.
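A toy REL-C3 generator might model the isomer family as positional isomers of a single functional group; the string encoding and the decision to ignore end-to-end chain symmetry are illustrative simplifications.

```python
import random

def gen_rel_c3(chain_len=5, seed=0):
    """Toy REL-C3 item: the family is all placements of one OH group on
    a linear carbon chain (ignoring end-to-end symmetry for simplicity);
    one member is hidden and must be inferred from the rest."""
    rng = random.Random(seed)
    family = [f"C{chain_len}H{2 * chain_len + 1}OH@{pos}"
              for pos in range(1, chain_len + 1)]
    missing = rng.choice(family)
    observed = [m for m in family if m != missing]
    rng.shuffle(observed)
    return observed, missing
```

Recovering the missing member requires binding the shared formula, the observed placements, and the enumerable family at once, mirroring the structure of the full task.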
Results
Each application area below reports its main empirical pattern alongside a compact table summarizing how performance changes.
Math Results
Performance drops as the decision rule becomes more relationally demanding and the input grid grows.
The hardest conditions remain at low accuracy even as the grid grows further, showing a limit in coordinating multiple interacting algebraic relations.
| Rank | Model | Avg Accuracy | REL-A1 | REL-A2 | REL-A3 | REL-A4 |
|---|---|---|---|---|---|---|
| #1 | GPT-5.2 | 0.761 | 1.000 | 0.966 | 0.798 | 0.280 |
| #2 | Gemini 3 Pro | 0.698 | 1.000 | 0.968 | 0.588 | 0.236 |
| #3 | Claude Opus 4.5 | 0.674 | 1.000 | 0.986 | 0.542 | 0.168 |
| Rank | Model | Avg Accuracy | REL-A5 | REL-A6 | REL-A7 |
|---|---|---|---|---|---|
| #1 | GPT-5.2 | 0.497 | 0.792 | 0.594 | 0.106 |
| #2 | Claude Opus 4.5 | 0.464 | 0.742 | 0.542 | 0.108 |
| #3 | Gemini 3 Pro | 0.449 | 0.716 | 0.508 | 0.122 |
Biology Results
Models struggle to consider multiple taxa at once when determining their positions relative to each other and to the rest of the tree; columns report accuracy by the number of taxa involved.
| Rank | Model | Avg Accuracy | 2 | 3 | 4 | 5 | 10 | 15 | 20 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|
| #1 | Gemini 3 Pro Preview | 0.202 | 0.161 | 0.247 | 0.430 | 0.400 | 0.203 | 0.077 | 0.070 | 0.026 |
| #2 | GPT-5.2 | 0.200 | 0.207 | 0.444 | 0.350 | 0.380 | 0.137 | 0.056 | 0.022 | 0.000 |
| #3 | Claude Opus 4.5 | 0.115 | 0.096 | 0.214 | 0.266 | 0.240 | 0.059 | 0.043 | 0.000 | 0.000 |
Chemistry Results
REL-C1
The model must compare candidate molecules and decide whether they belong to the same structural family under one shared compositional constraint.
| Rank | Model | Avg Score | <20 mol | 20–40 mol | ≥40 mol |
|---|---|---|---|---|---|
| #1 | Gemini 3 Pro | 75.7% | 64.3% | 75.5% | 87.3% |
| #2 | Claude Opus 4.5 | 70.7% | 60.0% | 75.8% | 76.3% |
| #3 | GPT-5.2 | 50.1% | 50.3% | 50.0% | 50.0% |
REL-C2
Difficulty rises because the model must distinguish the correct relation from a broader and more confusable set of alternatives.
| Rank | Model | Avg Score | <20 mol | 20–40 mol | ≥40 mol |
|---|---|---|---|---|---|
| #1 | Claude Opus 4.5 | 40.1% | 38.8% | 41.7% | 39.7% |
| #2 | Gemini 3 Pro | 37.8% | 43.2% | 41.1% | 29.2% |
| #3 | GPT-5.2 | 36.1% | 37.4% | 33.4% | 37.3% |
REL-C3
This is the most demanding chemistry condition: the right answer depends on coordinating the broadest set of compositional and structural constraints.
| Rank | Model | Avg Score | <20 mol | 20–40 mol | ≥40 mol |
|---|---|---|---|---|---|
| #1 | Gemini 3 Pro | 34.6% | 41.7% | 36.2% | 25.9% |
| #2 | Claude Opus 4.5 | 30.2% | 34.1% | 31.3% | 25.3% |
| #3 | GPT-5.2 | 12.8% | 14.2% | 14.0% | 10.2% |
@article{fesser2026rel,
title = {Evaluating Relational Reasoning in LLMs with REL},
author = {Lukas Fesser and Yasha Ektefaie and Ada Fang and Sham M. Kakade and Marinka Zitnik},
year = {2026},
journal = {arXiv preprint arXiv:2604.12176},
eprint = {2604.12176},
archivePrefix = {arXiv},
url = {https://arxiv.org/abs/2604.12176}
}