Recursion in AI: The Next Scaling Law Beyond Bigger Models
Key Insights
- Recursion outperforms scale: Tiny Recursive Models (TRMs) with just 7 million parameters solve puzzles that billion-parameter models cannot, reaching about 87% on Sudoku-Extreme, where chain-of-thought LLM baselines score 0%, and about 45% on Arc Prize 1 (ARC-AGI-1)
- The hidden state advantage: Unlike transformers that must retain entire contexts, recursive models compress reasoning into efficient hidden states, reducing computational overhead while improving performance
- Truncated backpropagation works: Training with T=1 truncated backpropagation through time (BPTT) proves surprisingly effective, challenging conventional wisdom about gradient flow requirements
- Three-tier recursion architecture: Hierarchical Reasoning Models (HRMs) use low-level (ZL), high-level (ZH), and outer refinement loops working in concert
- EM-like optimization: Models learn to store and retrieve information iteratively, similar to solving a Sudoku puzzle incrementally rather than guessing all cells at once
- Task-specific reasoning wins: When combined with general-purpose embeddings from large models, small recursive models could unlock unprecedented efficiency and reasoning capability
The Limitations of Current Large Language Models in Reasoning
The artificial intelligence landscape has undergone a dramatic shift from RNNs to transformers, but this transition came with hidden costs. Modern large language models (LLMs) such as the GPT family behind ChatGPT operate as feed-forward networks: each output token is produced by a single forward pass through a fixed stack of layers. This architecture delivers exceptional scaling properties during training, because all time steps are processed in parallel, which enables massive wall-clock efficiency and distributed training.
However, this architectural choice imposes fundamental reasoning limitations. Consider a simple sorting problem: any comparison-based sort needs on the order of N log N comparisons in the worst case. If you have 31 elements to sort but only 30 transformer layers, the argument goes, the model cannot chain enough dependent comparisons in a single forward pass to sort correctly. The model lacks sufficient computational depth. Similarly, Sudoku puzzles, mazes, and incompressible algorithmic problems expose transformer blind spots: the reasoning they require cannot be compressed into a single forward pass.
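The depth argument can be made concrete with the information-theoretic lower bound for comparison sorting: at least ceil(log2(N!)) comparisons are needed in the worst case. The sketch below computes that bound. The function name is illustrative, and equating one transformer layer with one sequential comparison step is a simplification used for intuition, not a formal result.

```python
import math

def min_comparisons(n: int) -> int:
    """Worst-case lower bound on comparisons needed to sort n distinct items."""
    # A comparison sort must distinguish all n! possible orderings, and each
    # comparison yields at most one bit, so ceil(log2(n!)) are required.
    return math.ceil(math.log2(math.factorial(n)))
```

For example, `min_comparisons(31)` returns 113, well above a budget of 30 sequential steps.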
The root issue traces back to the absence of external memory mechanisms. Traditional algorithms exploit caches, tapes, and auxiliary structures to sidestep bounds that apply to more restricted procedures. Radix sort, for instance, avoids the O(N log N) lower bound on comparison sorting by distributing keys into memory buckets digit by digit, running in O(N·k) time for k-digit keys, which is effectively linear when k is small and fixed. Transformers have no analogous mechanism built into their architecture, so they cannot take the computational shortcuts that such algorithms provide. On this view, the limitation is not a training data problem or a model scale issue but an architectural constraint that scaling alone does not remove.
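To illustrate the bucket idea, here is a minimal least-significant-digit radix sort. It is a sketch restricted to non-negative integers, and it never compares two keys with each other: all ordering comes from writing values into buckets, which is the kind of auxiliary memory the paragraph refers to.

```python
def radix_sort(values: list[int], base: int = 10) -> list[int]:
    """LSD radix sort for non-negative integers using bucket lists."""
    if any(v < 0 for v in values):
        raise ValueError("this sketch handles non-negative integers only")
    result = list(values)
    largest = max(result, default=0)
    place = 1
    # One stable bucketing pass per digit, least significant digit first.
    while largest // place > 0:
        buckets = [[] for _ in range(base)]
        for v in result:
            buckets[(v // place) % base].append(v)
        # Concatenating buckets in order preserves earlier passes (stability).
        result = [v for bucket in buckets for v in bucket]
        place *= base
    return result
```

The number of passes equals the number of digits in the largest key, so the total work grows with N times the key length rather than with N log N comparisons.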
Chain-of-thought prompting offers a workaround, enabling models to output intermediate reasoning steps at test time. Yet this approach has critical limitations. Models don't discover novel algorithms; they retrieve algorithms from training data. A model trained only on bubble sort will only perform bubble sort, and likely inefficiently. Chain-of-thought cannot invent merge sort from first principles—it can only demonstrate algorithms already present in the training corpus. For unsolved problems like Millennium Prize Problems or novel scientific challenges, chain-of-thought reaches its limits because no human trace exists to retrieve.
This distinction between reasoning and retrieval separates transformers from true computational agents. Transformers excel at semantic embedding and pattern matching within discrete token spaces, but algorithmic reasoning—the kind requiring iterative refinement and state management—remains their Achilles heel. This gap has driven recent breakthroughs in recursive model architectures that address these fundamental limitations.
Hierarchical Reasoning Models: Recursion at Multiple Scales
Hierarchical Reasoning Models (HRMs) represent a return to recursive principles, reimagined for modern deep learning. Rather than the problematic backpropagation-through-time mechanisms that plagued RNNs, HRMs implement three nested levels of recursion inspired by neuroscience observations that different brain regions oscillate at different frequencies. Low-frequency regions handle high-level abstractions while high-frequency regions manage low-level details, a principle embedded directly into the HRM architecture.
The model architecture operates through elegant recursion. An input—whether a Sudoku puzzle, maze, or Arc Prize challenge—enters the system. The low-level module (L-net) executes TL recursive steps, processing local information. Then the high-level module (H-net) runs TH times, integrating L-net outputs into broader patterns. Finally, an outer refinement loop executes N times, allowing the entire system to improve iteratively. Crucially, the same weights apply throughout each recursion tier—no new parameters are introduced, only repeated application of existing weights.
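The loop structure described above can be sketched with scalar states. This is a toy illustration, not the published architecture: the six weights, the tanh updates, and the names `l_net`, `h_net`, and `recursive_forward` are assumptions chosen to show how the same weights are reapplied at every tier.

```python
import math

# Hypothetical toy weights: six scalars shared by every recursion step.
WEIGHTS = {"l_self": 0.5, "l_high": 0.8, "l_in": 1.0,
           "h_self": 0.5, "h_low": 0.9, "out": 2.0}

def l_net(z_l, z_h, x, w):
    """Low-level update from the previous low state, the high state, and the input."""
    return math.tanh(w["l_self"] * z_l + w["l_high"] * z_h + w["l_in"] * x)

def h_net(z_h, z_l, w):
    """High-level update that integrates the latest low-level state."""
    return math.tanh(w["h_self"] * z_h + w["h_low"] * z_l)

def recursive_forward(x, w, t_l=4, t_h=2, n_outer=3):
    """Run the three loops and return one readout per outer refinement pass."""
    z_l, z_h = 0.0, 0.0          # hidden states persist across all loops
    readouts, applications = [], 0
    for _ in range(n_outer):     # outer refinement loop (N passes)
        for _ in range(t_l):     # T_L low-level steps
            z_l = l_net(z_l, z_h, x, w)
            applications += 1
        for _ in range(t_h):     # T_H high-level steps
            z_h = h_net(z_h, z_l, w)
            applications += 1
        readouts.append(w["out"] * z_h)
    return readouts, applications
```

With the defaults, the network is applied 18 times while the parameter count stays at six, which is the sense in which recursion adds depth without adding weights.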
The breakthrough lies in the training mechanism. Rather than backpropagating through all recursion steps (which causes vanishing gradients and memory explosion), HRMs use a one-step gradient approximation motivated by deep equilibrium (DEQ) models. The model processes a batch forward, computes the loss, and backpropagates. Then, remarkably, it processes the same batch again with updated weights, repeating this process up to 16 times. The hidden states ZL and ZH are not reset between iterations, effectively creating new "minibatches" from varying memory states despite identical external inputs.
This approach circumvents the catastrophic issues that killed RNNs. By treating repeated passes as distinct training iterations with frozen earlier activations, the model avoids storing activation records across hundreds of steps. The gradient signal remains clean because you're only backpropagating through the last iteration's changes, not through cascading multiplication of Jacobians across deep time. The mathematics resembles expectation-maximization, where the E-step refines hidden states while the M-step optimizes weights.
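The difference between backpropagating through every step and through only the last one can be seen in a scalar recurrence. The sketch below assumes a toy update z <- tanh(w*z + x), which is not the HRM network. It compares the full chain-rule factor, a product of one derivative per step, with the single factor used when earlier states are treated as constants.

```python
import math

def unroll(w, x, z0, steps):
    """Iterate z <- tanh(w*z + x) and return every state, z0 included."""
    states = [z0]
    for _ in range(steps):
        states.append(math.tanh(w * states[-1] + x))
    return states

def full_bptt_factor(w, states):
    """d z_T / d z_0: the product of per-step derivatives w * (1 - z_t^2)."""
    factor = 1.0
    for z_next in states[1:]:
        factor *= w * (1.0 - z_next ** 2)
    return factor

def one_step_factor(w, states):
    """d z_T / d z_{T-1} only; earlier states are treated as constants."""
    return w * (1.0 - states[-1] ** 2)
```

With w = 0.5 and 50 steps, every per-step derivative is at most 0.5, so the full product is below 0.5^50 and effectively vanishes, while the one-step factor stays at a usable magnitude and needs only the final state in memory.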
The results proved transformative. HRM is a 27-million-parameter model trained from scratch on roughly 1,000 examples per task, with no pretraining. It reported about 40% on Arc Prize 1, ahead of much larger chain-of-thought models, and about 55% on Sudoku-Extreme and 75% on Maze-Hard, benchmarks where models such as OpenAI's o3-mini-high scored 0%. On those puzzles this was not a marginal improvement: it meant solving problems that far larger, internet-trained models could not solve at all. For Arc Prize 2, performance remained competitive despite a dramatic parameter reduction compared to general-purpose models.
The model's elegance extends to interpretability. Variable scoping maps naturally to the architecture: ZL represents local, low-level deductions (like determining which numbers are possible in a Sudoku cell), while ZH captures high-level hypotheses and constraints. The iterative refinement mimics human problem-solving—you can't solve Sudoku by guessing all cells simultaneously; you make incremental deductions based on available information, propagating constraints upward and downward until convergence.
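The incremental style of deduction can be shown with plain constraint propagation on a 4x4 Sudoku. This is a hand-written classical procedure, not what a trained model computes. The per-cell candidate sets play the role the text assigns to low-level deductions, and each pass works from a snapshot of the grid, so some cells only become forced after earlier deductions land.

```python
def candidates(grid, r, c):
    """Digits still allowed in cell (r, c) of a 4x4 Sudoku grid (0 = empty)."""
    used = set(grid[r]) | {grid[i][c] for i in range(4)}
    br, bc = 2 * (r // 2), 2 * (c // 2)
    used |= {grid[i][j] for i in range(br, br + 2) for j in range(bc, bc + 2)}
    return {1, 2, 3, 4} - used

def propagate(grid):
    """Fill forced cells pass by pass until no cell is forced."""
    grid = [row[:] for row in grid]
    passes = 0
    while True:
        # Collect deductions from a snapshot of the current grid.
        forced = {}
        for r in range(4):
            for c in range(4):
                if grid[r][c] == 0:
                    options = candidates(grid, r, c)
                    if len(options) == 1:
                        forced[(r, c)] = next(iter(options))
        if not forced:
            return grid, passes
        # Commit the deductions, then look again with the new information.
        for (r, c), digit in forced.items():
            grid[r][c] = digit
        passes += 1

PUZZLE = [[1, 0, 0, 4],
          [0, 4, 1, 0],
          [2, 0, 4, 0],
          [0, 3, 0, 1]]
```

On `PUZZLE`, the first pass fills six cells and the remaining two become forced only in the second pass, a small instance of deductions enabling further deductions.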
Tiny Recursive Models: Simplification Through Focused Recursion
Building on HRM's insights, Tiny Recursive Models (TRMs) achieve something seemingly paradoxical: better performance with fewer parameters and a simpler architecture. A 7-million-parameter TRM reached about 87% on Sudoku-Extreme, up from roughly 55% for the 27-million-parameter HRM, and about 45% on Arc Prize 1, up from roughly 40%, while also improving on Arc Prize 2. This roughly 4x parameter reduction paired with higher accuracy directly challenges the scaling hypothesis that dominates current AI research.
TRMs simplify the architecture by collapsing the separate L-net and H-net into a single weight-shared network. Rather than maintaining distinct transformer modules for low- and high-level processing, TRMs use one unified network for both, with two layers instead of four. The hidden states ZL and ZH remain conceptually distinct (representing local and high-level information), but they flow through identical weight matrices.
The critical innovation lies in truncated backpropagation. TRMs backpropagate through exactly one recursion step (T=1), not through the entire recursive chain. This counterintuitive finding emerged first in HRM experiments but reaches its logical conclusion in TRMs. By limiting backpropagation to single steps, the model avoids the vanishing gradient and memory problems that plagued RNNs, while still enabling gradients to flow through meaningful recursion. The outer refinement loop provides additional learning passes, allowing the model to iterate six times (or more during training) before computing loss and backpropagating.
The optimization proceeds through distinct phases: a forward pass without gradient tracking (no_grad) through the initial recursion steps, then a single step with full gradient computation (latent recursion), followed by outer refinement loops where loss is calculated and weights updated. This architecture resembles expectation-maximization algorithms, where local computation (ZL updates) and higher-level integration (ZH updates) alternate until convergence.
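The three phases can be sketched on a scalar toy problem with hand-derived gradients. Everything here is an assumption made for illustration: the update z <- tanh(w*z + u*x), the squared-error loss, the learning rate, and the function names. A real implementation would use an autodiff framework and a full network.

```python
import math

def step(z, x, w, u):
    """One recursion step of the toy network: z <- tanh(w*z + u*x)."""
    return math.tanh(w * z + u * x)

def train_truncated(x=1.0, y=0.5, w=0.0, u=0.1, lr=0.5, t_steps=3, n_sup=16):
    """Train with gradients taken through the last recursion step only."""
    z = 0.0                      # hidden state, carried across supervision steps
    losses = []
    for _ in range(n_sup):       # outer refinement loop
        # Phase 1: recursion steps with no gradient tracking.
        for _ in range(t_steps - 1):
            z = step(z, x, w, u)
        # Phase 2: one step with gradients; z_prev is treated as a constant.
        z_prev = z
        z = step(z_prev, x, w, u)
        err = z - y
        losses.append(0.5 * err ** 2)
        # Phase 3: one-step gradients of the loss, then a weight update.
        g = err * (1.0 - z ** 2)
        w -= lr * g * z_prev
        u -= lr * g * x
    return w, u, losses
```

Calling `train_truncated()` returns the updated weights and the loss recorded at each of the 16 supervision steps. On this toy problem the loss shrinks steadily even though no gradient ever flows through more than one step.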
Testing reveals fascinating architectural insights. On Sudoku, a basic multilayer perceptron (MLP) matches transformer performance, suggesting that architectural complexity offers no advantage for certain problems. Conversely, maze tasks show zero performance with MLPs, indicating that transformers' sequential attention mechanisms provide necessary structure for spatial reasoning. This variability challenges the assumption that more complex architectures universally outperform simpler ones.
The most striking finding concerns test-time performance: models trained with many refinement steps often require only single-step refinement during testing for near-maximum performance. This suggests that training teaches the model an efficient reasoning algorithm that converges rapidly once learned. The model discovers how to solve problems incrementally, compressing multi-step reasoning into a compact recursive process that executes efficiently at inference.
Recursion Versus Scale: The Emerging Paradigm Shift
The historical arc of AI research has consistently rewarded scale: bigger models, more parameters, larger datasets, and greater computational budgets. Scaling laws emerged as organizing principles—doubling model size provided predictable performance gains. This paradigm produced transformers and now dominates large language model development. Yet HRM and TRM results suggest an orthogonal scaling axis: algorithmic depth through recursion rather than parameter width.
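One way to picture the two axes is to separate parameter count from the number of sequential layer applications. The sketch below uses made-up configuration values that are not taken from either paper. The class name and fields are hypothetical, and "effective depth" here simply means layers times recursion steps times refinement passes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecursiveConfig:
    layers: int              # layers in the shared network
    params_per_layer: int    # illustrative parameter count per layer
    recursion_steps: int     # times the network is reapplied per refinement pass
    refinement_passes: int   # outer refinement loop iterations

    @property
    def parameters(self) -> int:
        # Weight sharing: reusing the network adds no parameters.
        return self.layers * self.params_per_layer

    @property
    def effective_depth(self) -> int:
        # Sequential layer applications along the computation path.
        return self.layers * self.recursion_steps * self.refinement_passes

small = RecursiveConfig(2, 3_500_000, 6, 16)
deeper = RecursiveConfig(2, 3_500_000, 12, 16)
```

Doubling `recursion_steps` doubles `effective_depth` from 192 to 384 while `parameters` stays at 7,000,000, which is the sense in which recursion scales depth independently of width.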
Consider the comparison directly: a 7-million-parameter TRM solving Arc Prize tasks versus GPT-4 and other billion-parameter models failing entirely. The tiny model demonstrates that reasoning capability derives not from parameter count but from architecture enabling iterative refinement. Melanie Mitchell's research, examining the relationship between model scale and performance, argues that "it is sufficient, not necessary, to go bigger and get better performance." Recursion provides a complementary sufficiency: recursive depth enables capabilities that parameter quantity cannot.
The mechanisms differ fundamentally. Large models rely on embedding spaces—they map inputs to high-dimensional semantic representations where statistical patterns become apparent. Their "reasoning" manifests as navigation through these learned feature spaces, constrained to operations on discrete token sequences. The embedding approach excels at capturing correlations, relationships, and patterns in data, but it struggles with problems requiring algorithmic execution: systematic exploration of solution spaces, constraint propagation, and state management.
Recursive models maintain explicit state through hidden variables (ZL and ZH), enabling algorithmic problem-solving within their architecture. They can enforce constraints iteratively, track partial solutions, and adjust hypotheses based on feedback. This mimics human problem-solving, where incremental deductions gradually narrow solution space. The continuous latent space (unlike transformers' discrete token space) provides richer expressiveness for intermediate representations.
The practical implication: neither approach is universally superior. Embedding-space reasoning excels at semantic understanding and transfer learning across domains. Algorithmic reasoning excels at systematic problem-solving within well-defined domains. A hybrid architecture combining both could theoretically achieve unprecedented capability: transfer learning from large models' semantic embeddings paired with recursive reasoning within that embedded space.
Integration: Combining Recursive Reasoning with General Models
The path forward likely involves synthesis rather than replacement. Pure recursive models are task-specific—a Sudoku solver cannot solve mazes without retraining. Large models are general but lack deep reasoning capability. The convergence point: use large models to learn rich semantic embeddings across diverse data, then embed recursive reasoning modules within that representation space.
Imagine training a large model on internet-scale data to create powerful semantic embeddings. Rather than predicting next tokens, take that embedding space and use it as the foundation for task-specific recursive reasoning models. The recursive module operates not on raw text or pixels but on the semantic representations the large model learned. A small recursive model embedded in this space could solve specific reasoning tasks (Sudoku, mazes, constraint satisfaction problems) efficiently and accurately, while inheriting the semantic knowledge from the large model's pretraining.
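The interface this paragraph imagines can be sketched as two components: a frozen embedding function and a small recursive module that refines a latent state on top of it. Both are stand-ins. `frozen_embed` is a deterministic toy, not a real pretrained encoder, and `RecursiveHead` with its two scalar weights is a hypothetical placeholder for a trained recursive network.

```python
import math

def frozen_embed(text, dim=4):
    """Stand-in for a pretrained encoder: a fixed, non-trainable text-to-vector map."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += (ord(ch) % 17) / 17.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class RecursiveHead:
    """Small module that refines a latent state conditioned on the embedding."""

    def __init__(self, dim=4, self_weight=0.5, input_weight=1.0):
        self.dim = dim
        self.self_weight = self_weight      # would be learned in practice
        self.input_weight = input_weight    # would be learned in practice

    def refine(self, embedding, steps=6):
        z = [0.0] * self.dim
        for _ in range(steps):
            # The same weights are reapplied at every refinement step.
            z = [math.tanh(self.self_weight * zi + self.input_weight * ei)
                 for zi, ei in zip(z, embedding)]
        return z
```

Only the head would be trained per task in this arrangement. The embedding function is shared and untouched, which is why the choice of embedding space becomes the main design decision.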
This architecture addresses both critical constraints: you gain the semantic transfer and general knowledge from large models without their reasoning limitations, and you gain the algorithmic problem-solving capabilities of recursive models without requiring task-specific retraining on raw data. The interface between components—the embedding space—becomes the crucial design choice.
Evidence already suggests this direction shows promise. Google's Recursive Language Models, derived from similar principles, demonstrate strong capabilities. Companies likely incorporating recursive thinking into their models (possibly Google's Gemini and other advanced systems) are combining scaling benefits with recursion benefits. The inflection point arrives when both design principles mature simultaneously.
The Future: Beyond Scaling Laws
The broader AI research implications are profound. For a decade, scaling laws provided the organizing principle for research: predict performance from model size, compute budget, and data quantity. These laws proved remarkably predictive and drove billions in infrastructure investment. Yet they implicitly assumed architectural stasis: you scale existing transformer designs rather than fundamentally reimagine computation.
Recursion introduces a new variable to scaling equations. How does performance scale with recursion depth? Does it follow predictable laws similar to parameter scaling? Do scale and recursion show complementary or diminishing returns? Early evidence suggests complementarity—doubling model size while increasing recursion depth could potentially multiply gains rather than add them linearly.
This creates a vast unexplored research space. We know that scaling works. We're beginning to understand that recursion works. But we barely understand combining them effectively. What happens to a billion-parameter model with recursive refinement loops during inference? Could you achieve reasoning capabilities currently requiring trillion-parameter models through architectural innovations rather than parameter count?
The computational economics also shift. If 7-million-parameter models solve problems that billion-parameter models cannot, the inference cost drops dramatically. Smaller models consume less energy, demand less compute infrastructure, and run efficiently on edge devices. If this trend continues—more capability with fewer parameters through recursive architecture—it would invert current scaling trends and reshape AI deployment economics.
Furthermore, the interpretability benefits matter profoundly. Transformers' black-box nature stems partly from semantic embedding spaces requiring millions of dimensions. Recursive models maintaining explicit state through ZL and ZH variables offer potential interpretability windows—you can trace what local computations produced what high-level hypotheses. This matters for safety, alignment, and scientific understanding of AI systems.
Conclusion
Recursion represents a genuine paradigm shift in AI scaling, orthogonal to parameter scaling and proving equally potent. Tiny Recursive Models demonstrate that algorithmic reasoning through recursive refinement enables reasoning capabilities that far larger models cannot achieve. Hierarchical Reasoning Models show that multiple recursion tiers operating at different frequencies create efficient computational structures mimicking biological intelligence.
The most exciting possibility lies ahead: combining large model semantic embeddings with small recursive reasoning modules. This hybrid approach could yield systems combining general knowledge with specialized reasoning depth, efficient inference with powerful capability, and interpretability with performance. As recursive model research matures and integrates with large-scale pretraining, expect acceleration in reasoning-focused AI systems.
The era of "bigger is always better" is ending. The era of "deeper reasoning through recursion" is beginning. Forward.
Original source: Beyond Bigger Models: Recursion As The Next Scaling Law In AI