The Illusion of Thinking: A Deep Dive into Large Reasoning Models
A groundbreaking study, freshly published by a research team at Apple, dives into the heart of artificial intelligence's latest darlings: Large Reasoning Models (LRMs). Titled The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, this paper—authored by Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar—challenges the hype surrounding models like OpenAI's o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking. Released just days ago, it questions whether these models truly "reason" as promised. By using cleverly designed puzzle environments, the researchers reveal surprising truths about what LRMs can—and cannot—do. Written for a blog audience, this article unpacks the study's insights in a way that's engaging yet comprehensive, shining a light on the future of AI reasoning.
A New Way to Test Reasoning
Traditional tests for LRMs, like MATH-500 or AIME, focus on final answer accuracy but come with baggage: training data contamination means models may have "seen" similar problems before, and these tests reveal little about the reasoning process itself. The Apple team, led by Shojaee and colleagues, sidesteps this by introducing four controllable puzzle environments: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World. These puzzles let researchers tweak complexity—think more disks or checkers—while keeping the logic consistent. This setup ensures models can't just regurgitate memorised answers and allows a peek into their step-by-step "thoughts" via reasoning traces.
Why puzzles? They're less likely to be in training data, forcing models to rely on actual reasoning. Plus, custom simulators validate every move, offering a detailed look at where models succeed or stumble. This approach, crafted by Apple's research squad, gives a clearer view of how LRMs "think" and where they fall apart.
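To make the "simulators validate every move" idea concrete, here is a minimal sketch of a Tower of Hanoi move checker in Python. It is an illustration only: the (disk, from_peg, to_peg) move format and the function name are assumptions for this sketch, not the authors' actual tooling.

```python
# Minimal Tower of Hanoi move validator: a sketch of how a puzzle simulator
# can check a model's proposed solution step by step. The (disk, src, dst)
# move format is an assumption, not necessarily the paper's exact schema.

def validate_hanoi(n_disks, moves):
    """Return (solved, index_of_first_invalid_move). Pegs are 0, 1, 2."""
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 holds disks n..1, smallest on top
    for i, (disk, src, dst) in enumerate(moves):
        if not pegs[src] or pegs[src][-1] != disk:
            return False, i                       # disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, i                       # cannot place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    solved = len(pegs[2]) == n_disks
    return solved, None if solved else len(moves)

# Example: the optimal 3-disk solution (7 moves) passes the check.
solution = [(1, 0, 2), (2, 0, 1), (1, 2, 1), (3, 0, 2), (1, 1, 0), (2, 1, 2), (1, 0, 2)]
print(validate_hanoi(3, solution))  # (True, None)
```

Because every intermediate move is checked, a validator like this can pinpoint exactly where a reasoning trace first goes off the rails, not just whether the final answer is right.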
Three Regimes of Reasoning
The study uncovers a fascinating pattern: LRMs and their non-reasoning cousins (standard LLMs) perform differently depending on problem complexity. The researchers identify three distinct regimes:
- Low Complexity: Standard LLMs Take the Crown. For simple puzzles, standard LLMs like Claude 3.7 Sonnet (without thinking) or DeepSeek-V3 often outperform LRMs. They're more accurate and use fewer tokens, making them more efficient. Why? LRMs tend to "overthink," exploring unnecessary paths even after finding the right solution. For instance, in a Tower of Hanoi puzzle with three disks, an LRM might nail the moves early but then waste effort on invalid ones, dragging down efficiency.
- Medium Complexity: LRMs Shine. As puzzles get trickier, LRMs start to show their worth. Their ability to generate long chains of thought (CoT) and self-reflect gives them an edge. In puzzles like Checker Jumping with a moderate number of checkers, LRMs outperform standard LLMs by carefully exploring moves and correcting errors. This suggests that extra computational effort pays off when problems demand more planning.
- High Complexity: Total Collapse. Here's the kicker: beyond a certain complexity threshold, both LRMs and standard LLMs crash and burn, with accuracy plummeting to zero. In Tower of Hanoi with 15 disks or Blocks World with 40 blocks, no model could solve the puzzle (the sketch after this list shows how quickly the required number of moves grows). Even more bizarrely, LRMs reduce their reasoning effort (measured by token usage) as problems get harder, despite having plenty of computational room left. This "scaling limit" raises red flags about their ability to tackle complex real-world tasks.
The Overthinking Trap and Other Oddities
Digging into the reasoning traces, the Apple team found that LRMs behave differently depending on complexity. In simple puzzles, they often find the correct solution early but keep exploring wrong paths—a phenomenon called "overthinking." As complexity ramps up, they start with incorrect solutions and only later (if at all) find the right ones. At high complexity, they fail entirely, showing weak self-correction skills.
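The kind of trace analysis described here can be sketched in a few lines: take the candidate solutions a simulator has extracted and graded from a model's thinking trace, then measure where the first correct one appears and how much "thinking" follows it. The trace representation below is hypothetical, chosen for illustration rather than taken from the authors' code.

```python
# A sketch of overthinking analysis: given candidate solutions found in a
# thinking trace (token position, plus whether a simulator judged them
# correct), report where the first correct answer shows up and how much of
# the trace comes after it. The data format is hypothetical.

def overthinking_report(candidates, trace_length):
    """candidates: list of (token_position, is_correct) tuples, in trace order."""
    first_correct = next((pos for pos, ok in candidates if ok), None)
    if first_correct is None:
        return "no correct solution found in the trace"
    wasted = trace_length - first_correct
    return (f"first correct solution at token {first_correct} of {trace_length}; "
            f"{100 * wasted / trace_length:.0f}% of the trace comes after it")

# Example: the model finds the answer a third of the way in, then keeps going.
print(overthinking_report([(900, False), (2100, True), (4800, False)], 6000))
```

Run over many traces, a measure like this turns the "overthinking" claim into a number: how much compute is spent after the problem is effectively already solved.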
Even more surprising, LRMs struggle with exact computation. The researchers tested this by giving models a clear algorithm for Tower of Hanoi, expecting better performance. Shockingly, it made no difference—models still failed at the same complexity threshold. This suggests they're not just struggling to devise strategies but also to execute precise logical steps. Another quirk: models performed better on Tower of Hanoi (solving up to 100 moves for 10 disks) than on River Crossing (failing after just 4 moves for three actor-agent pairs). The team suspects this is because River Crossing puzzles are rarer online, meaning LRMs may not have memorised them during training.
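For context, the algorithm in question is the classic recursive Tower of Hanoi procedure, shown below in Python. This is the textbook recursion, not necessarily the exact pseudocode the authors placed in the prompt, but it makes the point of the experiment clear: once the recursion is known, generating the moves is purely mechanical, yet models still failed at the same threshold.

```python
# The standard recursive Tower of Hanoi procedure: move n disks from `src`
# to `dst` using `aux` as the spare peg. Generating the sequence is trivial
# for a program; the paper's finding is that handing models an algorithm
# like this did not help them execute it at scale.

def solve_hanoi(n, src=0, aux=1, dst=2, moves=None):
    """Append the optimal move sequence for n disks to `moves` and return it."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    solve_hanoi(n - 1, src, dst, aux, moves)   # park the n-1 smaller disks on the spare peg
    moves.append((n, src, dst))                # move the largest disk to the target peg
    solve_hanoi(n - 1, aux, src, dst, moves)   # bring the smaller disks back on top
    return moves

print(len(solve_hanoi(10)))  # 1023 moves: easy to generate, hard to execute token by token
```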
What's Going Wrong?
The study, spearheaded by Apple's research crew, points to several core issues with LRMs:
- No Generalisable Reasoning: Despite fancy self-reflection mechanisms, LRMs don't develop robust problem-solving skills for complex planning tasks. Their accuracy tanks at high complexity, hinting they rely more on pattern matching than true reasoning.
- Scaling Limits: The drop in reasoning effort as problems get tougher is a head-scratcher. LRMs seem to "give up" when faced with overwhelming complexity, even with ample tokens available (a toy illustration of this pattern follows the list).
- Inconsistent Performance: Models excel on some puzzles (like Tower of Hanoi) but flop on others (like River Crossing) at similar complexity levels, likely due to uneven training data exposure. This questions their ability to generalise across domains.
- Overthinking and Inefficiency: LRMs waste resources on redundant or incorrect paths, especially in simpler tasks, highlighting inefficiencies in their reasoning processes.
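The "scaling limits" point is essentially about the shape of a curve: thinking-token usage rises with complexity, peaks, and then falls while accuracy is already collapsing. The toy sketch below uses made-up numbers, not the paper's measurements, purely to show what spotting that pattern looks like.

```python
# Hypothetical (complexity, thinking-token) measurements. The peak marks
# where reasoning effort starts to fall even though the token budget is
# nowhere near exhausted. Numbers are illustrative only.

measurements = [(3, 4_000), (5, 9_500), (7, 18_000), (9, 14_000), (11, 7_000)]

peak_complexity, peak_tokens = max(measurements, key=lambda m: m[1])
declining = [c for c, t in measurements if c > peak_complexity]

print(f"thinking effort peaks at complexity {peak_complexity} ({peak_tokens} tokens)")
print(f"effort then falls at complexities {declining} despite unused budget")
```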
What This Means for AI
These findings, hot off the press from Apple's research team, are a reality check for AI enthusiasts. If LRMs can't handle complex problems or execute precise algorithms, their use in real-world scenarios—like scientific research or logistics—may be limited. The study challenges claims that LRMs are a leap toward general artificial intelligence, suggesting they're more about clever pattern matching than deep reasoning.
For AI developers, this is a call to action. Current training methods, relying on reinforcement learning and massive datasets, aren't cutting it for building generalisable reasoning. New approaches—perhaps better symbolic manipulation or stronger self-correction mechanisms—are needed to break through these barriers.
Looking Forward
This study from Apple's research squad, led by Shojaee, Mirzadeh, Alizadeh, Horton, Bengio, and Farajtabar, is a wake-up call. LRMs are powerful, but they're not the reasoning wizards we might hope for. Their overthinking on simple tasks, collapse on complex ones, and struggles with exact computation show we're still a long way from human-like reasoning. The use of controlled puzzle environments is a brilliant step forward, offering a cleaner way to test AI without the clutter of contaminated benchmarks.
As we dream of smarter AI, this research reminds us to stay grounded. LRMs are impressive tools, but their limitations—laid bare by Apple's team—push us to rethink how we build and evaluate reasoning in machines. The "illusion of thinking" is a challenge to do better, paving the way for AI that can plan, verify, and execute with true precision.