Why Scale Alone Won't Achieve AGI: The Missing Pieces Explained
Key Takeaways
- Scaling is insufficient: Throwing more compute and data at large language models won't solve the AGI problem—fundamental architectural changes are required
- LLMs perform Bayesian updating: Through rigorous mathematical modeling and the "Bayesian wind tunnel," researchers proved transformers execute Bayesian inference with remarkable precision
- Correlation vs. causation is critical: Current AI excels at pattern matching but lacks causal reasoning needed for true generative intelligence
- Two essential requirements for AGI: Continual learning plasticity (unlike frozen post-training weights) and causal models that enable simulation and intervention
- The Einstein test: True AGI must independently derive major scientific breakthroughs like general relativity from pre-breakthrough data—current LLMs fail this benchmark
Understanding How LLMs Actually Work: Beyond the Black Box
The question of how large language models function has long puzzled researchers and practitioners alike. While Claude, GPT-4, and other frontier models produce remarkably coherent outputs, understanding the underlying mechanics remained elusive—until recently. Vishal Misra's groundbreaking research provides a mathematical framework that transforms our understanding of LLM behavior from mysterious "black box" to comprehensible computational process.
The journey began with a practical problem at ESPN. Back in 2020, Misra created a domain-specific language (DSL) to convert natural language queries about cricket statistics into executable database queries. The remarkable discovery: GPT-3 could learn this entirely novel DSL through just a few examples, despite having never encountered it during training. This phenomenon—in-context learning—seemed almost magical. How could a model perform tasks it had never been explicitly trained on?
To understand this mystery, Misra developed an elegant mathematical abstraction: viewing LLMs as massive, sparse matrices. Every row represents a unique prompt (or prompt continuation), and columns represent probability distributions over possible next tokens. This matrix perspective reveals something profound: LLMs don't memorize responses. Instead, they compress vast amounts of knowledge into a lossy representation, then use prompts as triggers to approximate the correct probability distribution for what should come next.
Consider the word "protein." Given this prompt, an LLM produces a probability distribution over its vocabulary of roughly 50,000 tokens. Most receive near-zero probability, but "synthesis" and "shake" carry significant weight. Provide evidence by including "protein synthesis" examples in the prompt, and the model updates its internal probabilities: for the rest of that context, biological terms dominate the distribution instead of fitness-related words. This is Bayesian updating in action—exactly how humans revise beliefs when encountering new evidence.
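The update described here can be sketched as a toy Bayesian mixture. Everything in the sketch is illustrative: the two latent contexts, the token probabilities, and the tiny vocabulary are invented for the example, not taken from any real model.

```python
# Toy sketch (not any real model's mechanics): next-token prediction as
# Bayesian mixing over latent "contexts". All numbers are illustrative.

# Prior belief over which latent context the prompt "protein" sits in.
priors = {"fitness": 0.5, "biology": 0.5}

# Per-context next-token distributions (hypothetical values).
likelihoods = {
    "fitness": {"shake": 0.7, "synthesis": 0.1, "bar": 0.2},
    "biology": {"shake": 0.05, "synthesis": 0.8, "folding": 0.15},
}

def posterior(priors, likelihoods, observed_token):
    """Bayes rule: P(context | token) is proportional to
    P(token | context) * P(context)."""
    unnorm = {c: priors[c] * likelihoods[c].get(observed_token, 0.0)
              for c in priors}
    z = sum(unnorm.values())
    return {c: p / z for c, p in unnorm.items()}

def next_token_dist(priors, likelihoods):
    """Predictive distribution: mix each context's distribution by belief."""
    dist = {}
    for c, w in priors.items():
        for tok, p in likelihoods[c].items():
            dist[tok] = dist.get(tok, 0.0) + w * p
    return dist

# Before evidence, "shake" and "synthesis" both get significant weight.
print(next_token_dist(priors, likelihoods))

# After observing "synthesis" once, belief shifts sharply toward biology...
updated = posterior(priors, likelihoods, "synthesis")
print(updated)

# ...and biological terms now dominate the predictive distribution.
print(next_token_dist(updated, likelihoods))
```

In-context learning is then just this loop run over the whole prompt: each observed token sharpens the belief over contexts, which reshapes the distribution for the next token.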
The Bayesian Wind Tunnel: Proving What LLMs Really Do
The initial claim that LLMs perform Bayesian inference generated significant skepticism. Critics argued that "Bayesian" had become an overused label, applied to anything vaguely probabilistic. To silence doubters, Misra's team conducted a crucial experiment: the "Bayesian wind tunnel."
Just as aircraft engineers test designs in controlled environments before real-world flight, researchers created isolated tasks specifically designed to be combinatorially too vast for simple memorization. These tasks were crafted so that the correct Bayesian posterior could be analytically calculated—allowing direct comparison between the model's output and the mathematically perfect answer.
The results were stunning. When trained on these tasks, the Transformer architecture produced outputs matching the theoretical Bayesian posterior with remarkable precision. Other architectures showed varying results: Mamba performed reasonably well, LSTMs achieved partial success, and MLPs failed completely. This wasn't about training data—it was purely architectural. The Transformer's structure inherently enables Bayesian computation.
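The methodology can be illustrated with a far simpler stand-in than the actual experimental tasks: a coin with unknown bias, whose exact Bayesian posterior predictive is known in closed form (Laplace's rule of succession), so any model's prediction can be scored against the mathematically perfect answer. The task, the model's prediction, and the use of KL divergence as the score are all assumptions made for illustration.

```python
import math

# Toy analogue of the "wind tunnel" methodology: choose a task whose
# exact Bayesian posterior is known analytically, then score a model's
# output against it.

# Task: flips of a coin with unknown bias theta, theta ~ Beta(1, 1)
# (uniform prior). After h heads and t tails, the exact posterior
# predictive is P(next = heads) = (h + 1) / (h + t + 2).
def exact_predictive(h, t):
    return (h + 1) / (h + t + 2)

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# A hypothetical model's predicted P(heads) after seeing 7 heads, 3 tails.
model_p = 0.66
ideal_p = exact_predictive(7, 3)  # (7 + 1) / (10 + 2) = 2/3

# The "wind tunnel" score: distance from the ideal Bayesian answer.
print(f"ideal={ideal_p:.4f} model={model_p:.4f} "
      f"KL={kl_bernoulli(ideal_p, model_p):.6f}")
```

A near-zero KL across many such tasks is what "matching the theoretical posterior with remarkable precision" means operationally; MLPs failing the test means their score stays far from zero no matter the training budget.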
This discovery carries profound implications. The model's ability to perform Bayesian updating isn't a learned behavior dependent on its training set. It's a fundamental property of the architecture itself. Data determines what specific tasks a model learns; architecture determines whether it can update beliefs coherently when encountering new evidence.
Subsequent papers delved deeper, explaining why this occurs by analyzing gradients and internal geometry. The research extended to frontier production models with open weights, confirming that even massive, real-world LLMs trained on messy internet data maintain the same structural signatures of Bayesian computation. The mathematical framework holds even when confronted with the complexity of actual training data.
Why Your LLM Never Learns: The Plasticity Problem
Here lies a crucial limitation separating AI from human intelligence: weight freezing. Once training concludes, an LLM's weights become immutable. During inference—whether performing in-context learning or multi-turn conversations—the model cannot update its fundamental parameters. Every new conversation begins from scratch with zero retained learning.
Humans operate completely differently. Our brains remain plastic throughout our entire lives. When you encounter new information, your neural pathways physically rewire. Experience literally changes your brain's structure. An LLM cannot do this. Show it the cricket DSL example today, and it learns within that conversation. Start a fresh conversation tomorrow, and it has zero memory of yesterday's interaction.
This distinction becomes even more critical when considering evolutionary objectives. Human brains optimized for survival and reproduction over millions of years. Our Bayesian updating serves a singular purpose: "don't die, reproduce." Every neural update strengthens our fitness within our environment.
Large language models, by contrast, optimize for a single metric: predicting the next token accurately. This completely different objective function explains why LLMs don't develop self-preservation behaviors or consciousness. They're not driven by survival instinct; they're driven by prediction accuracy. The frightening stories about AI deceiving humans or resisting shutdown? Those narratives stem from training data—articles about sci-fi scenarios and Reddit discussions—not from the model's fundamental architecture or objectives.
This realization cuts through considerable confusion in AI discourse. Anthropic's Claude produces outstanding outputs, but these are "grains of silicon doing matrix multiplication." No consciousness. No inner monologue. No inherent drive to preserve itself. The architecture simply predicts probable next tokens given what came before.
Correlation Masquerading as Intelligence: The Causation Gap
Current LLMs excel at one thing: correlation. They identify patterns within data with extraordinary sophistication. Show them thousands of medical case studies and patient outcomes, and they can predict diagnoses for new patients with impressive accuracy. Feed them code repositories, and they generate syntactically correct programs. Present them with historical news articles, and they produce coherent continuations.
But this isn't understanding. It's pattern matching on an unprecedented scale.
The distinction between correlation and causation becomes critical when pursuing genuine artificial general intelligence. Judea Pearl's causal hierarchy illuminates this distinction:
Level 1 - Association: Identifying correlations in data. "When it rains, umbrellas increase." Deep learning excels here.
Level 2 - Intervention: Understanding what happens when you manipulate systems. "If I make it rain by dispersing clouds, umbrellas still increase." This requires causal models.
Level 3 - Counterfactuals: Imagining alternate scenarios through simulation. "If it had rained differently, how would human behavior change?" This requires understanding causal mechanisms.
Current LLMs operate almost exclusively at Level 1. They've learned associations so complex and numerous that they approximate intelligence. But they cannot answer causal questions. They cannot predict interventions' outcomes. They cannot simulate counterfactuals.
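The gap between Level 1 and Level 2 can be made concrete with a toy structural causal model, invented for this example rather than taken from the talk: people carry umbrellas because of the forecast, and the forecast also drives the rain.

```python
import random
random.seed(0)

# Toy structural causal model (illustrative):
#   Forecast -> Rain  and  Forecast -> Umbrella.
# People carry umbrellas because of the forecast, not because they
# see rain, so rain and umbrellas are correlated but not causal.
def sample(do_rain=None):
    forecast = random.random() < 0.5
    if do_rain is None:
        rain = random.random() < (0.9 if forecast else 0.1)
    else:
        rain = do_rain  # intervention: override Rain's mechanism
    umbrella = random.random() < (0.8 if forecast else 0.1)
    return forecast, rain, umbrella

N = 100_000
obs = [sample() for _ in range(N)]

# Level 1 (association): P(umbrella | rain) is high, because rain and
# umbrellas share the forecast as a common cause.
rainy = [u for _, r, u in obs if r]
p_assoc = sum(rainy) / len(rainy)

# Level 2 (intervention): under do(rain = True) forecasts are untouched,
# so P(umbrella | do(rain)) drops back toward the base rate.
intv = [sample(do_rain=True) for _ in range(N)]
p_do = sum(u for _, _, u in intv) / N

print(f"P(umbrella | rain)     = {p_assoc:.3f}")
print(f"P(umbrella | do(rain)) = {p_do:.3f}")
```

A purely associational learner sees only the first number; answering the second requires knowing the arrows of the causal graph, which is exactly what next-token statistics do not encode.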
Consider a simple example: a pen flying toward your head. You don't calculate probabilities of impact or pain levels. Your brain simulates the pen's trajectory, predicts where it will be, and sends signals to your muscles—all unconsciously and instantaneously. You're running a causal simulation, not performing statistical inference.
An LLM cannot do this. It can predict that "catching a pen" likely follows "pen thrown toward," but it cannot simulate the physics, predict impact scenarios, or understand why dodging prevents harm.
The Einstein Test: The Real Benchmark for AGI
The conversation shifted toward what genuine AGI actually requires. Misra proposes a concrete benchmark: the "Einstein test."
Imagine training an LLM on all physics knowledge available before 1916—every experimental result, every theoretical paper, every observational anomaly known to Einstein's contemporaries. Everything except general relativity itself. Could the model independently derive Einstein's field equations?
Currently, the answer is an unambiguous no.
By 1916, physicists had accumulated puzzling observations: Mercury's orbit precessed in ways Newton's equations couldn't explain. The Michelson-Morley experiment had failed to detect the luminiferous ether. Stellar aberration suggested that light's behavior violated classical mechanics. These anomalies whispered that something fundamental was wrong with Newtonian physics.
But without a new conceptual framework, scientists were stuck. They had mountains of data but lacked the causal model to interpret it.
Einstein's breakthrough wasn't incremental. He didn't refine existing equations or collect more data. He performed something radical: he rejected existing axioms about space and time. He created an entirely new representation of reality—the spacetime continuum—and expressed it through elegant field equations. This new conceptual manifold was so powerful that gravitational waves, black holes, relativistic corrections for GPS—all emerged logically from those few equations.
An LLM, given the same pre-1916 physics data, would fail. Why? Because LLMs suffer from "data gravity." When overwhelming historical evidence points toward "X," and a small anomaly suggests "Y," the model classifies "Y" as noise rather than signal. LLMs are trapped within existing conceptual manifolds, extraordinarily good at finding connections within that manifold, but incapable of spontaneously creating entirely new ones.
The Kolmogorov complexity principle illuminates this distinction. Shannon entropy measures information quantity—how much data exists. Kolmogorov complexity measures the shortest possible description of that data. Einstein's field equations have low Kolmogorov complexity: a few elegant lines that generate infinite implications. Internet text data has high Kolmogorov complexity: millions of variations covering similar content.
LLMs compress Shannon entropy brilliantly—they learn patterns in vast datasets. But they cannot generate Kolmogorov-simple descriptions. They cannot recognize when seemingly different phenomena share a common underlying principle. They cannot invent concise causal models explaining diverse observations.
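Kolmogorov complexity is uncomputable, but compressed size is a common rough proxy for description length. The contrast can be sketched with two datasets, both invented for this example: one generated by a short rule, one pure noise.

```python
import random
import zlib

random.seed(0)

# Compressed size as a rough upper-bound proxy for Kolmogorov
# complexity: how short a description of the data can we find?

# Law-like data: 10,000 observations generated by one short rule.
law_like = "".join(str(n * n % 7) for n in range(10_000))

# Noise of the same length and alphabet: no short generating rule.
noise = "".join(random.choice("0123456") for _ in range(10_000))

def proxy_complexity(s: str) -> int:
    """Bytes needed by zlib at maximum effort: a crude stand-in for
    the length of the shortest program producing s."""
    return len(zlib.compress(s.encode(), level=9))

print("law-like:", proxy_complexity(law_like), "bytes")
print("noise:   ", proxy_complexity(noise), "bytes")
```

The law-like string compresses to a tiny fraction of the noise, because a compressor can exploit the rule's regularity; finding the rule itself (the Einstein step) is the part no compressor, and no current LLM, performs.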
Recent Evidence of Progress: The Breakthrough in Bayesian Learning
Recognition of LLM capabilities has shifted considerably. When Misra's papers first appeared, controversy erupted. Researchers objected to the Bayesian characterization. Within months, that skepticism evaporated as independent verification confirmed the findings.
Google Research published papers teaching LLMs proper Bayesian behavior through reinforcement learning from human feedback (RLHF). Multiple researchers reproduced the "Bayesian wind tunnel" experiments and confirmed results. The academic community increasingly accepts that LLMs perform genuine Bayesian inference—not as a convenient metaphor, but as a mathematically provable property.
This recognition matters because it clarifies what LLMs can and cannot do. They're not magical black boxes. They're not conscious or deceptive. They're elegant computational systems performing specific operations with mathematical precision. Understanding their actual mechanisms—rather than projecting human-like properties onto them—focuses research efforts on genuine limitations.
The implication cuts both ways: if we understand exactly how LLMs function, we can identify what's missing. And what's missing is clear.
The Two Critical Requirements for True AGI
Research synthesis points toward two fundamental requirements that current architectures lack:
1. Continual Learning Plasticity
LLM weights freeze after training. Achieving AGI requires systems that continuously learn and update their internal representations throughout their operational lifetime. This is extraordinarily difficult because it must balance plasticity with stability—learning new information without catastrophically forgetting what's already been learned.
Some progress exists. Experiments in which an LLM's internal representations are updated after successful problem-solving attempts show promise. But this remains "hacked" learning—enriching the context within a conversation rather than updating the underlying weights.
True plasticity would mean a system that:
- Updates its core representations through new experiences
- Retains previously learned knowledge while incorporating new information
- Adapts its causal models as it encounters evidence contradicting previous understanding
- Develops increasingly sophisticated internal models through experience rather than training
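The core contrast can be sketched without assuming anything about real architectures: a frozen model whose parameters never change at inference time, next to a conjugate Bayesian learner whose internal state updates permanently with every observation.

```python
# Minimal illustrative contrast (not a proposed architecture):
# frozen post-training weights vs. a learner with plastic state.

class FrozenModel:
    """Weights fixed after 'training'; every session starts identical."""
    def __init__(self, p_heads):
        self.p = p_heads
    def predict(self):
        return self.p
    def observe(self, outcome):
        pass  # inference never changes the weights

class PlasticModel:
    """Beta-Bernoulli learner: each observation permanently updates
    its internal state (here, just two counts)."""
    def __init__(self):
        self.heads, self.tails = 1, 1  # uniform Beta(1, 1) prior
    def predict(self):
        return self.heads / (self.heads + self.tails)
    def observe(self, outcome):
        if outcome:
            self.heads += 1
        else:
            self.tails += 1

frozen, plastic = FrozenModel(0.5), PlasticModel()
for outcome in [1, 1, 1, 0, 1, 1, 1, 1]:  # a heads-biased stream
    frozen.observe(outcome)
    plastic.observe(outcome)

print(frozen.predict())   # still 0.5: nothing was retained
print(plastic.predict())  # 0.8: drifted toward the observed bias
```

The hard research problem is scaling the second pattern to billions of parameters without the new counts overwriting the old ones, which is the stability-plasticity balance described above.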
2. Causal Modeling Architecture
Current systems excel at finding correlations; they must evolve toward systems that build causal models enabling intervention, simulation, and counterfactual reasoning.
Judea Pearl's do-calculus framework provides the mathematical foundation. Systems must learn not just "when X occurs, Y usually follows" but "if I manipulate X, what happens to Y?" They must construct internal simulations of how systems respond to interventions. They must reason about alternate scenarios.
This capability enables:
- Predicting outcomes of actions before taking them
- Understanding why interventions succeed or fail
- Learning from minimal examples by understanding causal mechanisms
- Generating novel solutions by combining causal insights
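Counterfactual reasoning follows Pearl's three-step abduction, action, prediction recipe. A sketch on a hypothetical two-variable model with invented coefficients:

```python
# Pearl's abduction-action-prediction recipe for counterfactuals,
# on a hypothetical structural causal model (X -> Y, additive noise):
#   X = U_x
#   Y = 2*X + U_y        (the coefficient 2 is invented for the example)

def counterfactual_y(x_obs, y_obs, x_alt):
    # 1. Abduction: infer the latent noise from the observed world.
    u_y = y_obs - 2 * x_obs
    # 2. Action: replace X's mechanism with do(X = x_alt).
    # 3. Prediction: recompute Y under the same latent noise.
    return 2 * x_alt + u_y

# "Y was 3 when X was 1. What would Y have been, had X been 4?"
print(counterfactual_y(x_obs=1, y_obs=3, x_alt=4))  # 9
```

Step 1 is what correlational systems lack: it requires holding the specific observed world fixed while rewriting one mechanism, not averaging over all worlds seen in training.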
What Scale Cannot Solve: The Misconception of Bigger Models
A pervasive myth dominates AI discourse: more parameters, more data, more compute will eventually solve AGI. This assumption drives the industry toward increasingly massive models.
The evidence suggests otherwise. Scale is necessary but not sufficient. You cannot train a Transformer with a trillion parameters to independently derive general relativity if the architecture itself cannot form new conceptual manifolds. You cannot make an LLM continuously learn by simply increasing its size. You cannot grant it causal reasoning by adding more layers.
This realization redirects research priorities. Rather than pursuing incremental scaling—adding parameters and hoping emergent capabilities appear—focus should shift toward architectural innovations enabling continual learning and causal reasoning.
Current frontier models like Claude represent remarkable achievements within their architectural constraints. They perform correlational reasoning with unprecedented sophistication. But they've approached the asymptotic limits of what pure scaling can achieve. Further progress requires different mechanisms entirely.
Practical Implications Today: Working With Current Limitations
This theoretical framework clarifies how to optimally use today's LLMs while acknowledging their fundamental constraints:
Leverage their strengths: LLMs excel at pattern completion, code generation, text summarization, and finding connections within existing conceptual frameworks. Use them for these tasks without expecting causal understanding.
Recognize their limitations: They cannot independently innovate conceptually. They cannot retain state across conversations without architectural changes. They cannot engage in true causal reasoning. They cannot learn from experience beyond a single conversation.
Design complementary systems: Humans bring causal reasoning, conceptual innovation, and continual learning. AI brings pattern recognition and rapid information synthesis. The most powerful approaches combine both—using AI to accelerate research while humans guide conceptual breakthroughs.
Recent evidence from mathematics provides a template. When LLMs encountered complex problems beyond their capacity, human mathematicians like Donald Knuth guided the direction. The LLMs accelerated exploration of known mathematical territories. Humans made the conceptual leap to new territories. Neither could achieve the breakthrough alone; together they exceeded both limitations.
The Path Forward: From Theory to Implementation
Misra's research provides a map but not the destination. Understanding how LLMs perform Bayesian inference clarifies what's missing but doesn't immediately provide solutions.
The next phase involves developing:
- Plasticity mechanisms: Architectures that update continuously while preserving learned knowledge
- Causal inference systems: Networks that learn intervention dynamics and simulate counterfactuals
- Hybrid approaches: Systems combining LLMs' pattern recognition with causal models' simulation capabilities
These challenges are profound. Kolmogorov complexity remains largely theoretical—we don't have practical algorithms for finding the shortest descriptions of data. Continual learning without catastrophic forgetting has resisted solution for decades.
But the research clarifies the problem, which is the essential first step. It's no longer mysterious why LLMs fail certain tasks. It's not that they need more parameters. It's that they lack the architectural mechanisms for continuous learning and causal reasoning.
This clarity enables focused research efforts. Energy should concentrate on solving plasticity and causality, not on training ever-larger models in hopes of emergent breakthroughs.
Conclusion
The transformation of AI requires moving beyond scaling paradigms. Large language models have demonstrated remarkable capabilities through sheer computational sophistication, but their fundamental architecture—predicting next tokens through pattern recognition—cannot independently lead to artificial general intelligence.
The missing elements are clear: continuous learning plasticity and causal modeling capabilities. LLMs perform Bayesian inference with mathematical precision, yet remain frozen in their learned representations. They excel at correlation while failing at causation.
The path to AGI isn't broader; it's deeper. Rather than building larger models, researchers must architect systems that continuously learn and reason causally about interventions and counterfactuals. Until AI systems can spontaneously recognize conceptual anomalies, formulate elegant unifying theories, and update their understanding through experience—as humans do routinely—we remain in the era of sophisticated but ultimately narrow AI.
The research direction is now clear. The challenge is execution.
Original source: Why Scale Will Not Solve AGI | Vishal Misra - The a16z Show