Do LLMs have fluid intelligence?

Lessons from competing in ARC AGI 2

Dibya Chakravorty, Debsankha Manik & Bernhard Altaner

TNG Big Techday 26

2026-05-22

Intro

About us

and many thanks to everyone involved!

Dibya Chakravorty

Generalist software engineer.
Organises the AGI Munich meetup and PyMunich.

Debsankha Manik

Dynamical systems, graph theory; data science × discrete optimisation.
Loves teaching.

Bernhard Altaner

Thermodynamics and information processing in complex systems.
Has a local GLaDOS in his smart home.

Supported by Nicolas Berg – co-host of the AGI Munich meetup.

Thank you to Somayeh Vojdani & Tariq Baig-Meininghaus for initial discussions and continued encouragement.

Thanks to lambda.ai for providing free credits that helped support this research.

What is this talk about?

ARC-AGI 2 and our journey to the forefront of AI research

ARC-AGI 2: Abstraction and Reasoning Corpus for Artificial General Intelligence (2nd edition, 2025)
Task: Transform 2D input pixel grids into corresponding output grids

From a few train grid pairs, infer a general transformation rule…

…and apply it to a test grid!

So what is this talk about? [click]
It is the story of how we, a small amateur team, competed in a challenge to solve an AI benchmark known as ARC-AGI 2. [click]
ARC-AGI 2 stands for Abstraction and Reasoning Corpus for Artificial General Intelligence (2nd edition, 2025) [click]
More practically, ARC-AGI is a set of pixel puzzles, where an input needs to be transformed into an output grid, like in this example [click]
Each puzzle consists of a small set of training pairs [click]
and a test input grid without the corresponding output.
Let us briefly look at this example:
- The input grid consists of four stacks of objects
- In each train input, two stacks are underlined by a green or red marker
- In the output, the stack above the red marker becomes grey,
- while the stack above the green marker “grows” by the height of the greyed-out stack
- Notice that in the test grid, unlike in the training examples, we have two red-underlined stacks
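For readers who prefer data to pictures, here is a minimal sketch of how such a puzzle can be represented: a dictionary with `train` pairs and a `test` entry whose output is withheld, each grid being a list of rows of colour indices. The toy task below (mirror every grid left to right) and the helper `mirror` are made up for illustration and are far simpler than the stack puzzle on the slide.

```python
Grid = list[list[int]]  # rows of colour indices

# Hypothetical toy task: every grid is mirrored left to right.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[0, 3, 3]], "output": [[3, 3, 0]]},
    ],
    # The solver only sees the test input; the output is what it must predict.
    "test": [{"input": [[4, 0, 0], [0, 5, 0]]}],
}


def mirror(grid: Grid) -> Grid:
    """Candidate rule: reverse every row."""
    return [row[::-1] for row in grid]


# A candidate rule is only trusted if it reproduces every training output.
solved_train = all(mirror(pair["input"]) == pair["output"] for pair in task["train"])
prediction = mirror(task["test"][0]["input"])
print(solved_train, prediction)
```

Note that agreeing with all training pairs does not prove the rule is the intended one. It is merely the only check available to the solver.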

Easy for humans, hard for AI?

What gives humans an edge?

ARC AGI puzzles are not well-defined mathematically.
Humans still agree on a correct™ solution
Core human perceptive priors: objects, groups, proximity, numbers, etc.
Another example: What happens when you think about this puzzle?

A child with sandbox experience might solve it, but AI has never played in the mud…
… and neither had you at some point in your life.

What makes these puzzles hard for AI? [click]
Notice that these puzzles are not well-defined mathematically:
- fitting a complicated function to 2-5 data points ((x, y) tuples, i.e. input-label pairs)
- expecting that it evaluates “correctly” for some fixed input (the test input grid)
- mathematically, an under-determined system [click]
However, humans tend to agree on the correct solution. How can this be?
Note that in describing the transformation rule for the last puzzle we used words like “stacks”, “markers”, “growing” and so on.
These are high-level concepts that are related to core human perceptive priors.
So human intuition already enters in the previous puzzle.
In order to make this even more explicit, let’s look at another example, and pay attention to your own thought process. [click]
- You might remember your experiences of building castles and dams in the sandbox,
- But AI has never played in the mud…[click]
At some point you had to learn the concept of gravity and fluid flow and acquire the skill to build beautiful castles and dams, even if it was through play.

ARC-AGI as a general intelligence benchmark

What constitutes general intelligence?

Raymond Cattell, 1948, defined two aspects of general intelligence

General intelligence combines knowledge with adaptation!

ARC AGI is designed to test skill acquisition efficiency¹, a hallmark of fluid intelligence.
Requires In-Context-Learning, i.e. adaptation to new situations not encountered in training data.
Essential problem: AI model weights are frozen during inference.
So how could LLMs ever exhibit fluid intelligence?

Thinking about how we solve the water-flow puzzle already touches on the notion of *general* intelligence [click]
The question of what constitutes general intelligence is a big topic of research in psychology and neuroscience
Raymond Cattell, 1948, defined two aspects of general intelligence [click]
crystallized: priors, world model, domain knowledge, mechanics of language -> knowledge, experience
- this is probably a strong aspect of how you just solved the puzzle [click]
fluid: working memory, adaptation, complex skills, creativity -> adaptation
- this is probably more how you sharpened your sandcastle skills all those many years ago [click]
General intelligence combines knowledge with adaptation! [click]
ARC AGI is designed to test skill acquisition efficiency, a hallmark of fluid intelligence. [click]
Requires In-Context-Learning (ICL), i.e. adaptation to new situations not encountered in training data. [click]
Essential problem: AI model weights are frozen during inference. [click]
So how could LLMs ever exhibit fluid intelligence?

State of the ARC in April 2025

On the shoulders of giants

Snapshot of https://www.arcprize.org/leaderboard (Wayback Machine, 17 April 2025).

⏺ ARC-AGI 1: Not saturated; multiple paradigms used; reasoning models give good results, but expensive

▲ ARC-AGI 2: barely any scores beyond noise (~3%) level

Challenge accepted!

Curiosity-driven YOLO meets Dunning-Kruger

Reasoning models seemed like an interesting way to scale compute at test time
DeepSeek-R1 and its distillates had just been released

Straightforward plan:
- Step one: Generate synthetic data for this domain.
- Step two: Fine-tune a small open-weights model and introduce reasoning like DeepSeek-R1 with reinforcement learning.
- Step three: Profit!

How hard could it be?

Surely this is quite straightforward, right?

Main part

Our journey

Let’s begin

Training data

Solution trajectories

Given the following training input-output pairs, find the correct output for the given test input.

011    000
000 -> 000
000    011

222    000
020 -> 222
000    020

500
500 -> ?
000

Let me think step by step. The first example seems to be a reflection. The second example is not a reflection. It seems that clusters of numbers are actually moving downwards until they hit the last row. By that logic the answer is:

000
500
500
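The reasoning in this trajectory can be checked mechanically. The sketch below is our own illustration, not part of the training data: `flip_vertical` encodes the first hypothesis (reflection) and `drop_to_floor` the second (gravity), under the simplifying assumption that all non-zero cells move together as one rigid body until the lowest one reaches the last row.

```python
Grid = list[list[int]]


def parse(text: str) -> Grid:
    """Turn whitespace-separated rows like '011 000 000' into a grid."""
    return [[int(ch) for ch in row] for row in text.split()]


def flip_vertical(grid: Grid) -> Grid:
    """Hypothesis 1: reflect the grid top to bottom."""
    return [row[:] for row in grid[::-1]]


def drop_to_floor(grid: Grid) -> Grid:
    """Hypothesis 2: shift all non-zero rows down until the lowest touches the floor."""
    height, width = len(grid), len(grid[0])
    occupied = [r for r, row in enumerate(grid) if any(row)]
    if not occupied:
        return [row[:] for row in grid]
    shift = (height - 1) - max(occupied)
    out = [[0] * width for _ in range(height)]
    for r in occupied:
        out[r + shift] = grid[r][:]
    return out


train = [
    (parse("011 000 000"), parse("000 000 011")),
    (parse("222 020 000"), parse("000 222 020")),
]

# Reflection explains the first pair only; gravity explains both.
reflection_ok = [flip_vertical(x) == y for x, y in train]
gravity_ok = [drop_to_floor(x) == y for x, y in train]
answer = drop_to_floor(parse("500 500 000"))
print(reflection_ok, gravity_ok, answer)
```

The output for the test input is the grid 000 / 500 / 500, matching the answer in the trajectory.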

Solution trajectories

Rewards

DeepSeek-R1 recipe: Reinforcement Learning

Problem: training data shortage

Solution: make synthetic puzzles

But how?

The best model at the time, OpenAI o3, scored near zero on ARC AGI 2

Hypothesis

Even though o3 cannot solve puzzles with ARC AGI 2 level difficulty, it might be able to create difficult puzzles.

8000 synthetic puzzles, 671 concepts / priors

Complex, difficult puzzles like ARC AGI 2

Puzzle generation as program synthesis

from typing import Literal

def generate_inputs(type: Literal["train", "test"]):
    """
    Generates train and test input grids.
    Usage: train_inputs = list(generate_inputs("train"))
    """
    ...

def transform(grid: list[list[int]]):
    """
    Implements the puzzle's transformation logic.
    Usage: output_grid = transform(input_grid)
    """
    ...
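To make the interface concrete, here is a deliberately trivial, hypothetical instance. The mirror rule, the grid sizes, the seeds and the helper `build_puzzle` are all assumptions for illustration. The real generated puzzles were far more complex, and the actual pipeline is not shown in this talk.

```python
import random
from typing import Iterator, Literal

Grid = list[list[int]]


def generate_inputs(type: Literal["train", "test"]) -> Iterator[Grid]:
    """Yield random input grids: three for training, one for testing."""
    rng = random.Random(0 if type == "train" else 1)  # fixed seeds for reproducibility
    count = 3 if type == "train" else 1
    for _ in range(count):
        height, width = rng.randint(3, 6), rng.randint(3, 6)
        yield [[rng.choice([0, 0, 1, 2]) for _ in range(width)] for _ in range(height)]


def transform(grid: Grid) -> Grid:
    """Toy transformation rule: mirror the grid left to right."""
    return [row[::-1] for row in grid]


def build_puzzle() -> dict:
    """Run both functions to materialise the puzzle as input-output pairs."""
    return {
        split: [{"input": g, "output": transform(g)} for g in generate_inputs(split)]
        for split in ("train", "test")
    }


puzzle = build_puzzle()
print(len(puzzle["train"]), len(puzzle["test"]))
```

The appeal of this framing is that the outputs are correct by construction: whatever `transform` does is, by definition, the rule of the puzzle.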

Puzzle generation pipeline

Lessons

Lesson 1

Non-reasoning models are necessary for creativity, diversity and naturalness.

Lesson 2

Tool calling (e.g. code execution) makes reasoning models more reliable.

Tool calling before o3 and o4-mini

Non-reasoning models e.g. gpt-4o

Tool calling before o3 and o4-mini

Reasoning models e.g. o1, Claude 4

Tool calling with o3 and o4-mini

Interleaved thinking

Interleaved thinking scales test-time compute

Instead of DeepSeek-R1 recipe…

Agentic RL

Example trace

Solve this puzzle. The training examples and test inputs are already loaded in the interpreter.

Let me think. The transformation seems to be a reflection. Let me verify with code.

def transform(grid):
    ...

assert transform(train0_inp) == train0_out

AssertionError

There is a mismatch. Perhaps the transformation is gravity related? Let me test.

def transform(grid):
    ...

assert transform(train0_inp) == train0_out

No output

Here is the transform function that solves the puzzle:

def transform(grid):
    ...
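The interpreter side of such a trace can be sketched in a few lines. This is an illustration under our own assumptions, not the harness we used: `run_cell` is a hypothetical name, the two code cells stand in for the elided `transform` bodies, and a bare `exec` offers no isolation, so a real harness needs a proper sandbox.

```python
import contextlib
import io


def run_cell(code: str, namespace: dict) -> str:
    """Execute code in a persistent namespace and report what the model would see."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # NOT a sandbox: illustration only
    except Exception as exc:
        return type(exc).__name__
    return buffer.getvalue() or "No output"


# The training example is preloaded, as promised in the prompt.
namespace = {
    "train0_inp": [[2, 2, 2], [0, 2, 0], [0, 0, 0]],
    "train0_out": [[0, 0, 0], [2, 2, 2], [0, 2, 0]],
}

reflection_cell = """
def transform(grid):
    return grid[::-1]

assert transform(train0_inp) == train0_out
"""

gravity_cell = """
def transform(grid):
    rows = [row for row in grid if any(row)]
    blank = [[0] * len(grid[0]) for _ in range(len(grid) - len(rows))]
    return blank + rows

assert transform(train0_inp) == train0_out
"""

first = run_cell(reflection_cell, namespace)   # wrong hypothesis
second = run_cell(gravity_cell, namespace)     # revised hypothesis
print(first, "|", second)
```

Because the namespace persists between calls, the model can build on earlier definitions, and each tool result feeds directly into its next reasoning step.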

Our journey

Side quest: learning visual priors

ARC AGI assumes core visual priors

9 chambers in 3 x 3 formation, separated by gray walls
Each chamber contains a single shape
5 chambers have blue plus shapes
2 chambers have red T shapes
2 chambers have green Z shapes

Vision models expected to win, but…

Counterintuitively, language models performed better on ARC AGI 1.
Do language models have core visual priors?

Language models are efficient visual learners

Qwen3-14B before and after fine-tuning

Before SFT (summary)

Yellow pixels at (2,2), (2,3) …
Navy cross at (5, 3), …
Gray pixels at …
Pink square at …

After SFT (summary)

Yellow hollow square spanning row, col (3, 2) to (5, 4) with red central pixel
Navy hollow square spanning … with green central pixel
Green hollow square spanning … with yellow central pixel
Pink hollow square spanning … with blue central pixel
Surrounded by gray walls

Lesson

The power of text

Text is a surprisingly rich medium that can be used to teach about other sensory modes.

Our journey

The plan

Reinforce the successes

The reality with Qwen3-14B

There’s nothing to reinforce

Solution: distillation from successful traces

Cold-start SFT

Solution: distillation from successful traces

Not all is well in the synthetic world

Solution: distillation from successful traces

This time, the real thing

Example trace

Solve this puzzle. The training examples and test inputs are already loaded in the interpreter.

Let me think. The transformation seems to be a reflection. Let me verify with code.

def transform(grid):
    ...

assert transform(train0_inp) == train0_out

AssertionError

There is a mismatch. Perhaps the transformation is gravity related? Let me test.

def transform(grid):
    ...

assert transform(train0_inp) == train0_out

No output

Here is the transform function that solves the puzzle:

def transform(grid):
    ...

Model requirements for distillation

Open reasoning
Interleaved thinking capable
Intelligent enough to get 10% on ARC AGI 2

Reality

No model satisfied these requirements
Closest match: GPT OSS 120B
- Expected to get 0% on ARC AGI 2

GPT OSS 120B seemed dumb

Something was fundamentally off

Invalid tool calls
Server errors, most likely originating from the inference engine
No interleaved thinking

Inference providers we tried

Inference engines we tried

vLLM and SGLang on Lambda AI H100 and GH200 GPUs

Harmony chat template

OpenAI didn’t follow existing chat template standards like ChatML with GPT OSS
They released their own standard: Harmony

ChatML vs. Harmony

System / developer message

ChatML

<|im_start|>system
# Tools

<tools>
{"type": "function", "function": {"name": "get_weather", "description": "Get weather for a city", "parameters": {"type": "object", "properties": {"city": {"type": "string"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}}, "required": ["city"]}}}
</tools>

Harmony

<|start|>developer<|message|># Tools
## functions
namespace functions {
// Get weather for a city
type get_weather = (_: {
  city: string,
  unit?: "celsius" | "fahrenheit", // default: celsius
}) => any;
} // namespace functions<|end|>

ChatML vs. Harmony

User message

ChatML

<|im_start|>user
What's the weather in Tokyo?<|im_end|>

Harmony

<|start|>user<|message|>What's the weather in Tokyo?<|end|>

ChatML vs. Harmony

Reasoning

ChatML

<|im_start|>assistant
<think>
User wants Tokyo weather. Need to call get_weather function.
</think>

Harmony

<|start|>assistant<|channel|>analysis<|message|>User wants Tokyo weather. Need to call get_weather function.<|end|>

ChatML vs. Harmony

Tool call preamble

ChatML

I'll check the current weather in Tokyo for you.

Harmony

<|start|>assistant<|channel|>commentary<|message|>I'll check the current weather in Tokyo for you.<|end|>

ChatML vs. Harmony

Tool call

ChatML

<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo", "unit": "celsius"}}
</tool_call><|im_end|>

Harmony

<|start|>assistant<|channel|>commentary to=functions.get_weather<|constrain|>json<|message|>{"city": "Tokyo", "unit": "celsius"}<|call|>

ChatML vs. Harmony

Tool response

ChatML

<|im_start|>user
<tool_response>
{"temperature": 22, "condition": "partly cloudy", "humidity": 65}
</tool_response>

Harmony

<|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>{"temperature": 22, "condition": "partly cloudy", "humidity": 65}<|end|>

ChatML vs. Harmony

Final assistant response

ChatML

<|im_start|>assistant
The current weather in Tokyo is 22°C with partly cloudy skies and 65% humidity.<|im_end|>

Harmony

<|start|>assistant<|channel|>final<|message|>The current weather in Tokyo is 22°C with partly cloudy skies and 65% humidity.<|return|>
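The structural difference between the two formats becomes clearer when the messages are rendered in code. The two helpers below are hypothetical and only reproduce the plain user and final-answer strings from the slides; they do not cover tool definitions, tool calls or multi-turn bookkeeping, and they are no substitute for the official template of either format.

```python
from typing import Optional


def render_chatml(role: str, content: str) -> str:
    """ChatML: the role sits on its own line, one end token for every message."""
    return f"<|im_start|>{role}\n{content}<|im_end|>"


def render_harmony(
    role: str,
    content: str,
    channel: Optional[str] = None,
    stop: str = "<|end|>",
) -> str:
    """Harmony: optional channel in the header, and the stop token varies by message kind."""
    header = role if channel is None else f"{role}<|channel|>{channel}"
    return f"<|start|>{header}<|message|>{content}{stop}"


question = "What's the weather in Tokyo?"
answer = "The current weather in Tokyo is 22°C with partly cloudy skies and 65% humidity."

chatml_user = render_chatml("user", question)
harmony_user = render_harmony("user", question)
harmony_final = render_harmony("assistant", answer, channel="final", stop="<|return|>")
print(harmony_user)
print(harmony_final)
```

Even this toy version shows where an implementation can go wrong: in Harmony the channel and the stop token carry meaning, so an engine that mishandles either changes what the model sees on the next turn.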

Harmony chat template bug

Both vLLM and SGLang had wrong implementations of Harmony
Inference providers who used these inference engines inherited the bugs
We patched vLLM and started using our patched version

GPT OSS after fix

An interleaved thinking pro

GPT OSS after fix

Plain CoT vs interleaved thinking

State of the art results

Conclusions

Our journey

At the bleeding edge

The frontier may be closer than it appears to be

Tool call + interleaved thinking ⟶ orchestrating own code execution sandbox (Q2’25).
Finetuning for tool calling, limited VRAM ⟶ YOLO libraries (unsloth) (Q2’25).
- Manually fixing chat template for Qwen3.
(Non-redacted) interleaved thinking traces ⟶ Patching vLLM and hosting own GPT-OSS model (Q4’25).
~~Vibe coding~~ ⟶ Autonomous task completion with single prompt (with tool calls) (Q4’25).

In April, we were using OpenAI models with tool calling for generating synthetic ARC-like puzzles.
- No agentic coding harnesses were available. We orchestrated our own tool calling sandbox.
In May, we started our foray into finetuning a Qwen model with tool calling.
- Libraries (transformers and unsloth) “technically” supported it, but we had to fix the logic for masking tool call tokens correctly.
In August 2025, GPT-OSS was released: capable of interleaved thinking and tool calling.
- We started using it in our agentic harness immediately.
- No commercial provider deployed GPT-OSS in a way that supported interleaved thinking. We had to host our own inference engine (with custom patches).
For our SOTA result: we used LLMs in a completely autonomous tool calling loop without human involvement.

Lessons learned along the way

Taking the bird’s-eye view

Tool use + reasoning ⟶ significant jump in LLM performance in reasoning tasks.
- Python (or similar tools) grounds the model - filter bad hypotheses early and replan.
- Labs underreported their flagship models’ performance on ARC because they evaluated without tool use.
Small / open-weight models punch above their reputation
- We roughly quadrupled the score of gpt-oss-120b with the right harness.
Synthetic data isn’t magic: you have to watch out for distribution shift.
- “Shortcut-y” traces skipped the messy reality of wrong hypotheses and recovery, which undermines performance at inference time, where replanning is necessary.

Practical take-home messages

For working effectively with modern LLMs

Don’t over-engineer agentic harnesses (just like prompts).
- Complex harnesses generalize poorly, especially across tasks with varying difficulty levels.
- LLMs armed with reasoning+tool use can dynamically adapt compute to the task: imposing a rigid structure hamstrings them.
Simplicity + verification
- Tell the model how to deterministically evaluate its output against the goal (partly why models struggle at open-ended creative tasks).
- But careful: LLMs will hack the evaluator function if they can.
There’s plenty of room for innovation outside frontier labs
- Models internalize more and more abilities → harnesses must continuously evolve.
- Limits of innate abilities of models → exactly where things get exciting!
- Open weight models catch up to frontier models after a while.

What about the water filling puzzle?

Back to the (`python`) sandbox

Using GPT 5.5 XHigh

Puzzle

understands

Visual features

"water settling into crevices..."
"...under gravity"

develops

Awareness of walls and flow dynamics

"flows around walls..."
"...spills over a wall"

transforms grid

Text representation

........WW
........WW
.......GWW
.......GWW
.......GWW
.....G.GWW
.....G.GWW
....RG.GWW
....RG.GWW
.RRRRGGGWW

hypothesizes

Flow dynamics

"drops straight until it hits an obstacle"
"...may slide down the diagonal"

Trial solution using simulation

checks gravity direction, horizontal flow rule bad → oscillations

adds tracing

Oscillation bug found

‘Fixes’ by left-over-right tie breaking

applies solution on

Test example

left-over-right tie breaking needed to avoid infinite loop

submits

Solution

Simple sandbox physics solves the puzzle…

…left to its own devices

Water-flow simulation: 20 steps converging to the output grid
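To give a flavour of what “simple sandbox physics” means, here is a minimal sketch of a falling-water simulation. It is not the code the model wrote and not the actual puzzle: the symbols, the start grid and the move rules are our own assumptions. Water drops straight down if it can, otherwise slides down a diagonal, with left tried before right. Since every move is downward, this sketch cannot oscillate; the tie-breaking only makes it deterministic.

```python
EMPTY, WALL, WATER = ".", "#", "~"


def step(grid: list[list[str]]) -> bool:
    """Advance one tick in place; return True if any water cell moved."""
    height, width = len(grid), len(grid[0])
    moved = False
    # Scan bottom-up so that a cell which just fell is not moved twice per tick.
    for r in range(height - 2, -1, -1):
        for c in range(width):
            if grid[r][c] != WATER:
                continue
            # Straight down first, then left diagonal, then right diagonal.
            for dc in (0, -1, 1):
                nc = c + dc
                if 0 <= nc < width and grid[r + 1][nc] == EMPTY:
                    grid[r][c], grid[r + 1][nc] = EMPTY, WATER
                    moved = True
                    break
    return moved


def settle(rows: list[str], max_steps: int = 1000) -> list[str]:
    """Run the simulation until nothing moves (or max_steps is reached)."""
    grid = [list(row) for row in rows]
    for _ in range(max_steps):
        if not step(grid):
            break
    return ["".join(row) for row in grid]


start = [
    "..~..",
    "..~..",
    ".....",
    "#...#",
    "#####",
]
result = settle(start)
print("\n".join(result))
```

The first drop lands in the middle of the basin; the second finds the cell below it occupied and slides to the left because of the tie-breaking order.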

What do you think? Do LLMs have fluid intelligence?

Good news for (us) physicists…

…still much to play with and discover (like parity symmetry)

Thank you for your attention!

Any Questions?