Artificial intelligence models are getting more impressive by the day. They’re writing essays, solving math problems, even helping write software. But a central question remains: Are they really reasoning? Or are they just really good at appearing smart?
A new study takes this question head-on, focusing on a new generation of models called Large Reasoning Models (LRMs). These models are designed to think through problems step by step, generating long chains of logic rather than just final answers. But when researchers tested them using carefully controlled logic puzzles, the results were sobering.
Here’s what they found:
But Why Puzzles? Why Not Benchmarks?
Most AI models are tested with benchmarks like math problems, coding tasks, and standardized quizzes. But these tests have serious flaws:
Training Contamination: The model might’ve already seen similar questions during training.
Final Answer Bias: These tests judge only the output, not how the model got there.
To fix this, researchers used classic logic puzzles like Tower of Hanoi, checker jumping, river crossing, and block stacking. These puzzles can be scaled up in complexity and aren’t likely to have appeared in the training data.
More importantly, the researchers asked the models to “think out loud”—generating every step of their reasoning trace. And using simulators, they could check the logic at every stage.
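To make this concrete, here is a minimal sketch of how such a simulator can work. This is illustrative Python, not the paper’s actual evaluation code, and the function name and move format are assumptions made for the example:

```python
# Minimal, illustrative move checker for Tower of Hanoi (not the paper's code).
# A move is a (source_peg, target_peg) pair, with pegs numbered 0, 1, 2.

def check_hanoi_moves(num_disks, moves):
    """Return None if every move is legal and the puzzle ends solved;
    otherwise return the index of the first problem."""
    # Peg 0 starts with all disks, largest at the bottom.
    pegs = [list(range(num_disks, 0, -1)), [], []]

    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:                                # moving from an empty peg
            return i
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:  # larger disk onto smaller
            return i
        pegs[dst].append(pegs[src].pop())

    # Solved only if all disks ended up on the last peg.
    return None if len(pegs[2]) == num_disks else len(moves)


# The optimal 3-disk solution (7 moves) passes the check.
assert check_hanoi_moves(
    3, [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
) is None
```

Because every intermediate move is checked against the puzzle’s rules, this kind of harness can pinpoint exactly where in a long reasoning trace a model goes off the rails, not just whether its final answer is right.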
What Happens When You Crank Up the Difficulty?
The results revealed three very different “regimes” of performance depending on puzzle complexity:
1. Low Complexity: No Help Needed
In simple problems, standard models (those not designed for step-by-step thinking) performed just as well as, or even better than, the more complex reasoning models. The extra thinking just created overhead.
2. Medium Complexity: Reasoning Pays Off
As the puzzles got tougher, the LRMs began to outperform the standard models. Their detailed thought traces helped them navigate the added complexity, though at a significant computational cost: they needed far more tokens to do it.
3. High Complexity: Collapse
Once the puzzles passed a certain complexity threshold, both standard models and LRMs failed completely. Accuracy dropped to zero. It didn’t matter how many tokens they had to “think with.” They hit a wall.
Even more puzzling, right before failure the LRMs actually reduced their reasoning effort. Instead of spending more tokens on harder problems, they spent fewer, a counterintuitive scaling failure that the researchers dubbed a reasoning collapse.
Peeking Inside the AI’s Thought Process
By analyzing the models’ internal reasoning traces, researchers noticed clear patterns:
In easy tasks, the models often found the right answer early but then kept “thinking,” wasting compute by exploring wrong options. They were essentially overthinking.
In medium tasks, they wandered through wrong paths before correcting themselves.
In hard tasks, they never found a valid solution at all.
There was some evidence of self-correction, but not enough to overcome complexity. The models just didn’t have the capacity for deep, sustained reasoning.
When You Hand AI the Answer and It Still Fails
Here’s where it gets wild.
Researchers gave the model the exact algorithm for solving Tower of Hanoi. All the model had to do was follow instructions.
And yet… it still failed. On harder versions, it broke down just as quickly as when it had to figure out the solution itself.
Why? Because these models struggle not just with planning, but with executing precise logical sequences. They don’t handle exact computation well, especially over long stretches.
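For context, the procedure in question is tiny. Something along the lines of the classic recursive algorithm below is all a model needs to follow; this is a generic textbook version in Python, not the paper’s exact prompt:

```python
# Classic recursive Tower of Hanoi: emits the full move sequence.
# Pegs are numbered 0, 1, 2; a move is a (source_peg, target_peg) pair.

def hanoi(n, src, dst, aux, moves):
    if n == 0:
        return
    hanoi(n - 1, src, aux, dst, moves)   # park the n-1 smaller disks on the spare peg
    moves.append((src, dst))             # move the largest disk to its destination
    hanoi(n - 1, aux, dst, src, moves)   # bring the smaller disks back on top

moves = []
hanoi(3, 0, 2, 1, moves)
print(moves)  # [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
```

There is no search or planning here, just bookkeeping, and that kind of long, exact execution is precisely what the models turned out to struggle with.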
The Weirdest Finding of All
You might think the models’ performance would correlate with the number of steps a puzzle requires.
Nope.
One model produced 100 correct moves in a Tower of Hanoi puzzle that required over 1,000 steps (an n-disk instance needs at least 2^n − 1 moves, so that’s roughly a ten-disk puzzle). Impressive!
But in a much simpler river crossing puzzle with only 11 total steps, the same model failed after just four moves.
That kind of inconsistency suggests these models don’t reason in a general way. Their success is tied closely to specific problem structures, not just complexity. Maybe Tower of Hanoi is more common in training data. Maybe river crossing requires harder state tracking. Either way, it’s clear: these aren’t flexible general thinkers.
Are LRMs Actually Reasoning?
This study suggests that while LRMs do better on certain tasks, their reasoning is fragile, narrow, and easily overwhelmed. Even when you hand them the answer, they can fail to follow through.
That’s not robust intelligence. It’s closer to very sophisticated pattern-matching that breaks under pressure.
To build AIs that can truly reason as general-purpose, step-by-step, scalable problem solvers, we’re going to need something much more advanced, possibly something fundamentally different in architecture.
Until then, it’s worth remembering: just because an AI sounds like it’s thinking... doesn’t mean it is.
Watch on YouTube, or listen on Spotify.
Read the original paper: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity