Do LLMs have fluid intelligence?

Lessons from competing in ARC AGI 2

Dibya Chakravorty, Debsankha Manik & Bernhard Altaner

TNG Big Techday 26

2026-05-22

Intro

About us

and many thanks to everyone involved!

Dibya Chakravorty

Debsankha Manik

  • Dynamical systems, graph theory; data science × discrete optimisation.

  • Loves teaching.

Bernhard Altaner

  • Thermodynamics and information processing in complex systems.

  • Has local GlaDOS in his smart home.

Supported by Nicolas Berg – co-host of the AGI Munich meetup.

Thank you to Somayeh Vojdani & Tariq Baig-Meininghaus for initial discussions and continued encouragement.

Thanks to lambda.ai for providing free credits that helped support this research.

What is this talk about?

ARC-AGI 2 and our journey into forefront AI research

  • ARC-AGI 2: Abstraction and Reasoning Corpus for Artificial General Intelligence (2nd edition, 2025)
  • Task: Transform 2D input pixel grids into corresponding output grids

From a few train grid pairs, infer a general transformation rule…

…and apply it to a test grid!

Easy for humans, hard for AI?

What gives human an edge?

  • ARC AGI puzzles are not well-defined mathematically.
  • Humans still agree on a correct™ solution
  • Core human perceptive priors: objects, groups, proximity, numbers, etc.
  • Another example: What happens when you think about this puzzle?
  • A child with sandbox experience might solve it, but AI has never played in the mud…
  • … and neither had you at some point in your life.

ARC-AGI as a general intelligence benchmark

What constitutes general intelligence?

Raymond Catell, 1948, defined two aspects of general intelligence

General intelligence combines knowledge with adaptation!

  • ARC AGI designed to test skill acquisition efficiency1, a hallmark of fluid intelligence.
  • Requires In-Context-Learning, i.e. adaption to new situations not encountered in training data.
  • Essential problem: AI model weights are frozen during inference.
  • So how could LLMs ever exhibit fluid intelligence?

State of the ARC in April 2025

On the shoulders of giants

Snapshot of https://www.arcprize.org/leaderboard (Wayback Machine, 17 April 2025).

Snapshot of https://www.arcprize.org/leaderboard (Wayback Machine, 17 April 2025).

ARC-AGI 1: Not saturated; multiple paradigms used; reasoning models give good results, but expensive

ARC-AGI 2: barely any scores beyond noise (~3%) level

Challenge accepted!

Curiosity-driven YOLO meets Dunning-Krueger

  • Reasoning models seemed like an an interesting way to scale compute during test time
  • Deepseek R1 and its distillates had just been released
  • Straightforward plan:
    • Step one: Generate synthetic data for this domain.
    • Step two: Fine-tune a small open weights models and introduce reasoning like Deepseek R1 with reenforcement learning.
    • Step three: Profit!
  • How hard could it be?

Surely this seems quite straightforward, Right?

Surely this seems quite straightforward, Right?

Main part

Our journey

Our journey

Our journey

Our journey

Let’s begin

Training data

Solution trajectories

Solution trajectories

Solution trajectories

Given the following training input-output pairs, find the correct output for the given test input.

011    000
000 -> 000
000    011

222    000
020 -> 222
000    020

500
500 -> ?
000
Let me think step by step. The first example seems to be a reflection. The second example is not a reflection. It seems that clusters of numbers are actually moving downwards until they hit the last row. By that logic the answer is:

000
500
500

Solution trajectories

Rewards

DeepSeek-R1 recipe: Reinforcement Learning

Problem: training data shortage

Problem: training data shortage

Solution: make synthetic puzzles

But how?

  • The best model OpenAI o3 had near zero score on ARC AGI 2

Hypothesis

Even though o3 cannot solve puzzles with ARC AGI 2 level difficulty, it might be able to create difficult puzzles.

8000 synthetic puzzles, 671 concepts / priors

Complex, difficult puzzles like ARC AGI 2

8000 synthetic puzzles, 671 concepts / priors

Complex, difficult puzzles like ARC AGI 2

8000 synthetic puzzles, 671 concepts / priors

Complex, difficult puzzles like ARC AGI 2

Puzzle generation as program synthesis

def generate_inputs(type: Literal["train", "test"]):
    """
    Generates train and test input grids.
    Usage: train_inputs = list(generate_inputs("train"))
    """
    ...

def transform(grid: list[list[int]]):
    """
    Implements the puzzle's transformation logic.
    Usage: output_grid = transform(input_grid)
    """
   ...

Puzzle generation pipeline

Puzzle generation pipeline

Puzzle generation pipeline

Puzzle generation pipeline

Lessons

Lesson 1

Non-reasoning models are necessary for creativity, diversity and naturalness.

Lesson 2

Tool calling (e.g. code execution) makes reasoning models more reliable.

Tool calling before o3 and o4-mini

Non-reasoning models e.g. gpt-4o

Tool calling before o3 and o4-mini

Reasoning models e.g. o1, Claude 4

Tool calling before o3 and o4-mini:

Reasoning models e.g. o1, Claude 4

Tool calling with o3 and o4-mini

Interleaved thinking

Interleaved thinking scales test-time compute

Instead of DeepSeek-R1 recipe…

Agentic RL

Example trace

Solve this puzzle. The training examples and test inputs are already loaded in the interpreter.

Let me think. The transformation seems to be a reflection. Let me verify with code.

def transform(grid):
...

assert transform(train0_inp) == train0_out
AssertionError

There is a mismatch. Perhaps the transformation is gravity related? Let me test.

def transform(grid):
    ...

assert transform(train0_inp) == train0_out

No output

Here is the transform function that solves the puzzle:

def transform(grid):
    ...

Our journey

Side quest: learning visual priors

ARC AGI assumes core visual priors

  • 9 chambers in 3 x 3 formation, separated by gray walls
  • Each chamber contains a single shape
  • 5 chambers have blue plus shapes
  • 2 chambers have red T shapes
  • 2 chambers have green Z shapes

Vision models expected to win, but…

  • Counterintuitively, language models performed better on ARC AGI 1.
  • Do language models have core visual priors?

Language models are efficient visual learners

Language models are efficient visual learners

Qwen3-14B before and after fine-tuning

Held-out puzzle grid

Before SFT (summary)

  • Yellow pixels at (2,2), (2,3)
  • Navy cross at (5, 3), …
  • Gray pixels at …
  • Pink square at …

After SFT (summary)

  • Yellow hollow square spanning row, col (3, 2) to (5, 4) with red central pixel
  • Navy hollow square spanning … with green central pixel
  • Green hollow square spanning … with yellow central pixel
  • Pink hollow square spanning … with blue central pixel
  • Surrounded by gray walls

Lesson

The power of text

Text is a surprisingly rich medium that can be used to teach about other sensory modes.

Our journey

Our journey

The plan

Reinforce the successes

The reality with Qwen3-14B

There’s nothing to reinforce

Solution: distillation from successful traces

Cold-start SFT

Solution: distillation from successful traces

Not all is well in the synthetic world

Solution: distillation from successful traces

This time, the real thing

Example trace

Solve this puzzle. The training examples and test inputs are already loaded in the interpreter.

Let me think. The transformation seems to be a reflection. Let me verify with code.

def transform(grid):
...

assert transform(train0_inp) == train0_out
AssertionError

There is a mismatch. Perhaps the transformation is gravity related? Let me test.

def transform(grid):
    ...

assert transform(train0_inp) == train0_out

No output

Here is the transform function that solves the puzzle:

def transform(grid):
    ...

Model requirements for distillation

  • Open reasoning
  • Interleaved thinking capable
  • Intelligent enough to get 10% on ARC AGI 2

Reality

  • No model satisfied these requirements
  • Closest match: GPT OSS 120B
    • Expected to get 0% on ARC AGI 2

GPT OSS 120B seemed dumb

Something was fundamentally off

  • Invalid tool calls
  • Server errors, most likely originating from the inference engine
  • No interleaved thinking

Inference providers we tried

Inference engines we tried

vLLM and SGLang on Lambda AI H100 and GH200 GPUs

Harmony chat template

  • OpenAI didn’t follow existing chat template standards like ChatML with GPT OSS
  • They released their own standard: Harmony

ChatML vs. Harmony

System / developer message

ChatML

<|im_start|>system
# Tools

<tools>
{"type": "function", "function": {"name": "get_weather", "description": "Get weather for a city", "parameters": {"type": "object", "properties": {"city": {"type": "string"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}}, "required": ["city"]}}}
</tools>

Harmony

<|start|>developer<|message|># Tools
## functions
namespace functions {
// Get weather for a city
type get_weather = (_: {
  city: string,
  unit?: "celsius" | "fahrenheit", // default: celsius
}) => any;
} // namespace functions<|end|>

ChatML vs. Harmony

User message

ChatML

<|im_start|>user
What's the weather in Tokyo?<|im_end|>

Harmony

<|start|>user<|message|>What's the weather in Tokyo?<|end|>

ChatML vs. Harmony

Reasoning

ChatML

<|im_start|>assistant
</think>
User wants Tokyo weather. Need to call get_weather function.
</think>

Harmony

<|start|>assistant<|channel|>analysis<|message|>User wants Tokyo weather. Need to call get_weather function.<|end|>

ChatML vs. Harmony

Tool call preamble

ChatML

I'll check the current weather in Tokyo for you.

Harmony

<|start|>assistant<|channel|>commentary<|message|>I'll check the current weather in Tokyo for you.<|end|>

ChatML vs. Harmony

Tool call

ChatML

<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo", "unit": "celsius"}}
</tool_call><|im_end|>

Harmony

<|start|>assistant<|channel|>commentary to=functions.get_weather<|constrain|>json<|message|>{"city": "Tokyo", "unit": "celsius"}<|call|>

ChatML vs. Harmony

Tool response

ChatML

<|im_start|>user
<tool_response>
{"temperature": 22, "condition": "partly cloudy", "humidity": 65}
</tool_response>

Harmony

<|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>{"temperature": 22, "condition": "partly cloudy", "humidity": 65}<|end|>

ChatML vs. Harmony

Final assistant response

ChatML

<|im_start|>assistant
The current weather in Tokyo is 22°C with partly cloudy skies and 65% humidity.<|im_end|>

Harmony

<|start|>assistant<|channel|>final<|message|>The current weather in Tokyo is 22°C with partly cloudy skies and 65% humidity.<|return|>

Harmony chat template bug

  • Both vLLM and SGLang had wrong implementations of Harmony
  • Inference providers who used these inference engines inherited the bugs
  • We patched vLLM and started using our patched version

GPT OSS after fix

An interleaved thinking pro

GPT OSS after fix

Plain CoT vs interleaved thinking

State of the art results

Conclusions

Our journey

At the bleeding edge

The frontier may be closer than it appears to be

  • Tool call + interleaved thinking ⟶ orchestrating own code execution sandbox (Q2’25).
  • Finetuning for tool calling, limited VRAM ⟶ YOLO libraries (unsloth) (Q2’25).
    • Manually fixing chat template for Qwen3.
  • (Non redacted) interleaved thinking traces ⟶ Patching vLLM and hosting own GPT-OSS model (Q4’25).
  • Vibe codingAutonomous task completion with single prompt (with tool calls) (Q4’25).

Lessons learned along the way

Taking the birds eye view

  • Tool use + reasoning ⟶ significant jump in LLM performance in reasoning tasks.
    • Python (or similar tools) grounds the model - filter bad hypotheses early and replan.
    • Labs underreported their flagship models’ performance on ARC because they evaluated without tool use.
  • Small / open-weight models punch above their reputation
    • We roughly ’d score on gpt-oss-120b with the right harness.
  • Synthetic data isn’t magic: Have to watch out for distribution shift.
    • “Shortcut-y” traces skipped the messy reality of wrong hypotheses and recovery — undermines performance in inference time where replanning is necessary.

Practical take-home messages

For working effectively with modern LLMs

  • Don’t over engineer agentic harnesses (just like prompts).
    • Complex harnesses generalize poorly, especially across tasks with varying difficulty levels.
    • LLMs armed with reasoning+tool use can dynamically adapt compute to the task: imposing a rigid structure hamstrings them.
  • Simplicity + verification
    • Tell the model how to deterministically evaluate its output against the goal (partly why models struggle at open-ended creative tasks).
    • But careful: LLMs will hack the evaluator function if they can.
  • There’s plenty of room for innovation outside frontier labs
    • Models internalize more and more abilities → harnesses must continuously evolve.
    • Limits of innate abilities of models → exactly where things get exciting!
    • Open weight models catch up to frontier models after a while.

What about the water filling puzzle?

Back to the (python) sandbox

Using GPT 5.5 XHigh

Puzzle
understands
Visual features
"water settling into crevices..."
"...under gravity"
develops
Awareness of walls and flow dynamics
"flows around walls..."
"...spills over a wall"
transforms grid
Text representation
........WW
........WW
.......GWW
.......GWW
.......GWW
.....G.GWW
.....G.GWW
....RG.GWW
....RG.GWW
.RRRRGGGWW
hypothesizes
Flow dynamics
"drops straight until it hits an obstacle"
"...may slide down the diagonal"
Tool call
Trial solution using simulation
checks gravity direction, horizontal flow rule bad → oscillations
adds tracing
Oscillation bug found
‘Fixes’ by left-over-right tie breaking
applies solution on
Test example
left-over-right tie breaking needed to avoid infinite loop
submits
Solution

Simple sandbox physics solves the puzzle…

…on its own devices

Water-flow simulation: 20 steps converging to the output grid

What do you think? Do LLMs have fluid intelligence?

Good news for (us) physicists…

…still much to play with and discover (like parity symmetry)

Water-flow simulation: 20 steps converging to the output grid

Thank you for your attention!

Any Questions?