Beyond Deep Learning: François Chollet's Vision for AGI and Symbolic AI
Key Insights
- Symbolic learning offers a more efficient alternative to deep learning by creating minimal, concise models that generalize better with less data
- ARC-AGI benchmarks have become crucial indicators of AI progress, showing when reasoning models and agentic capabilities emerge
- Verifiable domains like coding can be fully automated today, while non-verifiable domains face slower progress
- Ndea's approach aims to build AGI that matches human-level learning efficiency using program synthesis and "symbolic descent"
- AGI timeline prediction: Around 2030, possibly with the release of ARC 6 or ARC 7
The Case for Alternative Approaches to AI Development
The artificial intelligence landscape is dominated by a single approach: scaling up deep learning models with increasingly massive computational resources. However, François Chollet, creator of the ARC-AGI benchmark and founder of Ndea, argues that this path, while productive today, may not represent the optimal long-term direction for AI development.
The mainstream AI industry continues pouring billions into the large language model stack because the returns are demonstrable and immediate. Every major breakthrough—from GPT-3 to ChatGPT to reasoning models—has validated the deep learning approach. This creates a natural gravitational pull toward scaling and refinement rather than exploration of fundamentally different architectures. Yet Chollet contends this monoculture in AI research is counterproductive. "I personally don't think that machine learning or AI in 50 years is still going to be built on this stack," he explains. "I think this is a stack that is very nice; maybe it even gets us to AGI, but it's not as efficient as it should be."
His reasoning centers on a principle from information theory: the minimum description length principle states that the model most likely to generalize is the shortest. Deep learning, by its nature, produces parametric curves—complex mathematical functions with billions or trillions of parameters. Even when these models work extraordinarily well, they may not represent the most compressed, elegant solution to a problem. This distinction matters profoundly for the future of AI. If AI development continues along the current path indefinitely, we might build systems capable of AGI but doing so inefficiently, requiring exponentially more compute and energy than an optimized approach would demand.
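The minimum description length intuition can be made concrete with a toy contrast (the models and data here are invented for illustration): two "models" fit the same training data perfectly, but only the one with the shorter description generalizes beyond it.

```python
# Toy illustration of the minimum description length intuition:
# two models fit the same training data, but the shorter description
# generalizes while the memorized one does not.
train = [(0, 0), (1, 2), (2, 4)]

# Model A: a concise rule -- its description length stays constant
# no matter how much data it explains.
def rule(x):
    return 2 * x

# Model B: a lookup table -- its description length grows with the data,
# and it says nothing about unseen inputs.
table = {x: y for x, y in train}
def memorized(x):
    return table.get(x, 0)  # arbitrary fallback off the training set

# Both models fit the training data exactly...
assert all(rule(x) == y for x, y in train)
assert all(memorized(x) == y for x, y in train)

# ...but only the shorter model extrapolates correctly.
print(rule(10), memorized(10))
```

The analogy to deep learning: a network with billions of parameters is closer to Model B, a vast description that happens to fit, while the compressed symbolic model Chollet advocates is closer to Model A.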
Chollet's work at Ndea represents an attempt to "leapfrog directly to optimality" by exploring symbolic learning as an alternative foundation. This isn't about rejecting current progress—reasoning models and agentic coding agents represent genuine breakthroughs—but rather about ensuring that long-term AI development trends toward efficiency and sustainability. The stakes are high enough to justify exploring alternatives even if success probability is modest. As Chollet notes, "If you have a big idea, and it has a very low chance of success, but if it works, it's going to be big, and no one else is going to be working on it—it's not something popular—if you don't do it, no one else will do it."
Understanding Ndea: Building a New Machine Learning Paradigm
Ndea represents a radically different vision for machine learning infrastructure. Rather than training parametric models through gradient descent, Ndea pursues program synthesis research—but not in the way many researchers understand that term. The distinction is crucial. When people hear "program synthesis," they often think of code generation or AI coding agents. Ndea operates at a much lower foundational level, attempting to replace the entire learning substrate beneath such applications.
The core innovation involves substituting the parametric curve—the mathematical function at the heart of all deep learning—with a symbolic model deliberately designed to be as small and concise as possible. Instead of using gradient descent to adjust billions of parameters, Ndea employs "symbolic descent," an algorithm adapted for searching through symbolic spaces rather than continuously varying parameter spaces. The goal is to create a machine learning engine that produces "extremely concise symbolic models" of input data, models that compress the essential patterns into their simplest possible form.
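Ndea has not published the details of "symbolic descent," so the following is only a hypothetical sketch of the general shape by analogy with gradient descent: iteratively propose discrete symbolic models (here, tiny expression trees over `add` and `mul`), score them on fit plus a size penalty that favors concise programs, and keep improvements. All names and the search strategy are invented for illustration.

```python
import random

# Hypothetical sketch: search over a discrete space of expression trees
# instead of a continuous parameter space. A size penalty in the loss
# favors the concise symbolic models the text describes.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def evaluate(expr, x):
    if expr == "x":
        return x
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    return OPS[op](evaluate(left, x), evaluate(right, x))

def size(expr):
    # Description length proxy: number of nodes in the expression tree.
    if expr == "x" or isinstance(expr, int):
        return 1
    return 1 + size(expr[1]) + size(expr[2])

def random_expr(depth=2):
    if depth == 0 or random.random() < 0.3:
        return random.choice(["x", random.randint(0, 3)])
    return (random.choice(list(OPS)), random_expr(depth - 1), random_expr(depth - 1))

def loss(expr, data):
    return sum((evaluate(expr, x) - y) ** 2 for x, y in data) + 0.1 * size(expr)

def symbolic_descent(data, steps=2000, seed=0):
    # Crude stand-in for the real algorithm: accept proposals that reduce
    # loss. A serious system would mutate and guide the search, not resample.
    random.seed(seed)
    best = random_expr()
    for _ in range(steps):
        cand = random_expr(depth=3)
        if loss(cand, data) < loss(best, data):
            best = cand
    return best

data = [(x, 3 * x + 1) for x in range(5)]
model = symbolic_descent(data)
print(model, loss(model, data))
```

The point of the sketch is the shape of the problem, not the solution: the search space is discrete and astronomically large, which is exactly why, as discussed below, naive exploration is intractable.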
This approach yields several theoretical advantages. First, because symbolic models are inherently more compressed, they should require less training data to achieve competent performance. Current deep learning systems often need massive datasets, partly because they're fitting high-dimensional curves rather than finding compressed representations. Second, symbolic models run far more efficiently at inference time. A model encoded as a short program or set of logical rules executes faster and consumes less energy than running neural network computations across billions of parameters. Third, smaller, more transparent models should generalize better to novel situations and compose more effectively when combined with other models.
Chollet describes this as rebuilding "the whole stack on top of different foundations." The LLM stack represents layers of abstraction built on gradient descent and parametric learning. Ndea aims to build a different stack, where the foundational layer uses symbolic methods instead. This doesn't mean rejecting all the insights from deep learning—it means finding a more fundamental substrate that deep learning might ultimately sit on top of, rather than vice versa.
The practical challenge is immense: how do you search through the astronomical space of possible symbolic models efficiently? Brute force exploration is computationally intractable. Ndea's answer involves using deep learning guidance to navigate the symbolic search space—employing the current tools as helpers to discover what replaces them. This mirrors the approach used in AlphaGo, where neural networks guided Monte Carlo tree search through game states. The research team spent approximately half a year establishing robust foundations, recognizing that committing prematurely to incomplete foundations would doom the entire effort. Building a compounding system, where new capabilities build on previous discoveries, requires patience and rigor at the base level.
How Reasoning Models Revealed the Limits and Potential of Current AI
The emergence of reasoning models in late 2024 represented a crucial inflection point in AI development, and it was the ARC-AGI benchmark that first signaled this shift to the broader research community. Base language models had long performed extremely poorly on ARC V1, scoring below 10% accuracy. Even GPT-3 scored zero. As models scaled from billions to trillions of parameters, base LLM performance on ARC remained stubbornly low—scaled up 50,000x, yet accuracy barely budged. This stagnation suggested that brute-force scaling of pre-training alone would never crack the benchmark.
Then, in late 2024, OpenAI released the o1 preview model, followed in December by the announcement of o3, both demonstrating reasoning capabilities absent from previous models. Performance on ARC V1 jumped dramatically in a sudden step-function change. The benchmark had finally found something that worked: not because the models had more parameters, but because they could actually reason about problems rather than pattern-matching. This moment illustrated a key insight: ARC's value lies in detecting capability transitions that other benchmarks miss. When something genuinely new emerges in frontier AI, ARC tends to reveal it clearly.
The subsequent release of ARC V2, maintaining the same format but adding compositional complexity to reasoning chains, told a different story. Early reasoning models performed poorly on V2, but within months—almost simultaneous with the breakthrough of agentic coding—performance saturated rapidly. This saturation revealed something important: the models didn't necessarily develop higher fluid intelligence. Rather, frontier labs deliberately targeted ARC V2, using a new post-training paradigm to brute-force solutions.
The paradigm works like this: generate new tasks similar to those in the benchmark, solve them using program induction with the reasoning model, verify solutions (possible because ARC provides verifiable reward signals), fine-tune the model on successful reasoning chains, and repeat millions of times. This reinforcement learning loop, powered by abundant compute resources, effectively mines the entire solution space. The models become more useful not because they're smarter, but because they're better trained through this iterative post-training process combined with learned models of code execution—understanding how to track variable values through program execution.
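The loop described above can be sketched schematically. Everything here is a hypothetical stand-in: `generate_task` plays the role of synthesizing benchmark-like tasks, `attempt_solution` the role of the reasoning model, and the fine-tuning step is omitted. Only the structure, generate, attempt, verify, keep verified chains, repeat, reflects the text.

```python
import random

def generate_task(rng):
    # Stand-in for synthesizing an ARC-like task with a known answer:
    # here, "compute a*x + b" for hidden a, b.
    a, b = rng.randint(1, 5), rng.randint(0, 5)
    x = rng.randint(0, 9)
    return {"x": x, "target": a * x + b}

def attempt_solution(task, rng):
    # Stand-in for the model proposing a solution via trial and error.
    guess_a, guess_b = rng.randint(1, 5), rng.randint(0, 5)
    return {"answer": guess_a * task["x"] + guess_b}

def verify(task, attempt):
    # The crucial ingredient: an automatic, objective reward signal.
    return attempt["answer"] == task["target"]

def post_training_loop(iterations=10000, seed=0):
    rng = random.Random(seed)
    successful_chains = []
    for _ in range(iterations):
        task = generate_task(rng)
        attempt = attempt_solution(task, rng)
        if verify(task, attempt):
            # In the real paradigm, the model is fine-tuned on these
            # verified reasoning chains and the loop repeats at scale.
            successful_chains.append((task, attempt))
    return successful_chains

chains = post_training_loop()
print(len(chains), "verified solutions collected")
```

Because verification is automatic, the loop scales with compute rather than with human annotation, which is the asymmetry the rest of this section explores.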
This distinction—between intelligence and knowledge, between capability and competency—proves vital. An 18-year-old human isn't getting smarter as they age, but through experience and training they become vastly more competent. The same principle applies to AI systems. Current reasoning models don't possess higher fluid IQ; they possess better training, specifically training through trial-and-error in verifiable environments with true reward signals. This explains the extraordinary product-market fit of coding agents today, completely transforming software engineering, while progress in non-verifiable domains remains slow.
The ARC-AGI Benchmark Series: Measuring Intelligence as Learning Efficiency
François Chollet's definition of AGI differs fundamentally from the industry standard. Most define AGI as a system capable of automating most economically valuable tasks. Chollet rejects this framing as being about automation, not intelligence. His definition centers on learning efficiency: AGI is a system that can approach any new problem, task, or domain, understand it, model it, and become competent with the same efficiency as a human. This means requiring similar amounts of training data and computational resources—very little by current standards, since humans are remarkably data-efficient learners.
The industry is likely to achieve the first definition of AGI before the second. Current technology can already fully automate at or beyond human level in any domain with verifiable rewards, with programming as the prime example. Achieving true general intelligence—human-level learning efficiency for arbitrary tasks—probably requires different technology and different approaches. This insight motivated Chollet's development of the ARC-AGI benchmark series.
The original ARC-AGI V1 involved pattern recognition and reasoning tasks where the AI must infer abstract transformation rules from examples. The benchmark's design drew from Chollet's earlier experiences around 2016-2017 conducting research at Google Brain on training deep learning models for reasoning tasks. He discovered that gradient descent struggled fundamentally with learning reasoning-style algorithms. The models wouldn't discover elegant solutions; instead, they'd overfit to surface patterns in token sequences. This gap between theoretical capability (deep learning is Turing-complete) and practical learning (gradient descent can't find certain solutions) inspired the benchmark concept.
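The structure of an ARC-style task can be shown with a minimal invented example. The format, a few input/output grid pairs plus a held-out test input, mirrors the public ARC task structure; the specific grids and the transformation rule ("mirror each row") are made up for illustration.

```python
# A minimal ARC-style task: infer the transformation from the training
# pairs, then apply it to the test input. Grids are lists of color codes.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0, 0]], "output": [[0, 0, 3]]},
    ],
    "test": {"input": [[0, 5], [0, 7]]},
}

def mirror(grid):
    # Candidate program: reflect the grid horizontally.
    return [list(reversed(row)) for row in grid]

# A solver must confirm its candidate rule on every training pair...
assert all(mirror(p["input"]) == p["output"] for p in task["train"])

# ...before committing to a prediction on the test input.
prediction = mirror(task["test"]["input"])
print(prediction)  # [[5, 0], [7, 0]]
```

The difficulty of real ARC tasks is that the transformation rule is novel each time, so it cannot be memorized, only inferred from two or three examples.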
ARC V1 achieved its purpose: it remained difficult for base models until reasoning models arrived, providing a clear signal of capability emergence. ARC V2 added compositional complexity, allowing the research community to track progress as both reasoning and agentic coding capabilities developed. The rapid saturation of V2 demonstrated the power of the new post-training paradigm—but also revealed that performance on a known benchmark might not reflect genuine progress in fluid intelligence when massive resources target that specific benchmark.
This recognition led to ARC V3, a fundamental redesign shifting from measuring passive modeling ability to measuring "agentic intelligence." Instead of presenting data and asking the AI to extract patterns, V3 drops an agent into a novel environment—essentially a video game—without instructions, goals, or control information. The agent must figure out everything through exploration and experimentation. It must discover how the environment responds to its actions, infer what goals exist or might exist, and efficiently solve challenges—all while every action is counted toward an efficiency score.
To create these benchmark environments, Chollet's team established an entire video game studio with dedicated game developers, building custom game engines and over 250 original games. Each game takes roughly 10 minutes for a human to understand and solve. The games are deliberately original, avoiding borrowing from existing games or leveraging external knowledge like physics or cultural symbols. This prevents AI systems from relying on pre-training knowledge and forces them to genuinely learn from interaction.
The crucial distinction: in Atari games or Dota 2, the training environment matches the test environment, allowing models to memorize strategies. In ARC V3, agents face entirely novel environments at test time, games they've never encountered during training. This eliminates memorization and truly tests learning efficiency. Current frontier models don't perform well on ARC V3, revealing significant gaps in genuine agentic intelligence.
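The ARC V3 evaluation loop can be sketched abstractly: the agent receives no instructions, only observations, and every action counts against its efficiency score. The environment below (find a hidden target position on a line) is a trivial invented stand-in, not a real ARC V3 game.

```python
# Sketch of an agentic evaluation loop in the spirit of ARC V3: the agent
# must discover the goal through interaction, and fewer actions = better.
class ToyEnvironment:
    def __init__(self, target):
        self.target = target
        self.position = 0
        self.actions_taken = 0

    def step(self, action):  # action: -1 or +1
        self.actions_taken += 1
        self.position += action
        done = self.position == self.target
        return {"position": self.position}, done

def explore(env, max_actions=100):
    # A blind agent: it knows neither the goal nor the action semantics,
    # so it must experiment and observe the consequences.
    for _ in range(max_actions):
        obs, done = env.step(+1)
        if done:
            return env.actions_taken  # efficiency score: lower is better
    return None

env = ToyEnvironment(target=5)
score = explore(env)
print(score)  # 5 actions for a target at distance 5
```

Scoring by action count is what turns the benchmark into a measure of learning efficiency rather than eventual success: an agent that eventually wins after thousands of wasted actions scores poorly.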
Chollet's vision extends beyond V3. ARC V4 will emphasize continual learning and curriculum progression over longer timescales, with fewer games but greater compositional complexity requiring reuse of previous knowledge. ARC V5, which excites him most, will introduce "invention"—the ability to create new tools, concepts, or approaches to solve problems. The benchmark series will continue evolving as AI capabilities advance, always targeting the residual gap between human and artificial learning efficiency. He predicts the AGI moment—when no meaningful difference remains between human and frontier AI learning efficiency—could occur around 2030 or the early 2030s, possibly coinciding with ARC 6 or ARC 7.
Verifiable Versus Non-Verifiable Domains: The Constraint Shaping AI Progress
A crucial insight emerges from recent AI progress: the distinction between domains with verifiable rewards and those without fundamentally constrains development trajectories. In domains where solutions can be formally verified—where correctness is objective and automatable—AI systems have achieved breakthroughs. In domains requiring subjective evaluation, progress stalls.
Programming represents the paradigmatic verifiable domain. Code either compiles or doesn't, tests pass or fail, bugs exist or don't. This formal verification enables the reinforcement learning loops powering coding agents. The system can generate thousands of attempted solutions, automatically verify which ones work, and train on successful attempts. This creates exponentially larger and denser training data from a small human effort to define the problem.
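The generate-and-verify loop that makes code a verifiable domain can be shown in miniature. The candidate implementations below are invented stand-ins for model outputs; the unit test plays the role of the objective reward.

```python
# Minimal sketch of automated verification for code: generate many
# candidates, run the tests, keep only what passes. No human judgment
# is needed anywhere in the loop.
candidates = [
    "def absval(x): return x",                     # wrong for negatives
    "def absval(x): return -x",                    # wrong for positives
    "def absval(x): return x if x >= 0 else -x",   # correct
]

def passes_tests(source):
    scope = {}
    try:
        exec(source, scope)
        f = scope["absval"]
        return f(3) == 3 and f(-3) == 3 and f(0) == 0
    except Exception:
        return False

verified = [c for c in candidates if passes_tests(c)]
print(len(verified))  # only the correct implementation survives
```

Because the filter is automatic, the same loop works for thousands of attempts per problem, which is what turns a small human effort (defining the tests) into an exponentially larger pool of verified training data.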
Mathematics will likely experience similar transformation for identical reasons. Theorems are either proven or not; proof verification is algorithmically checkable. As AI systems master mathematics, they'll solve open problems currently requiring human researchers. Other formally verifiable domains—game-playing with clear rules, logical reasoning, constraint satisfaction—will all see automation breakthroughs.
Non-verifiable domains face profound challenges. Writing essays, creative work, strategic business decisions, ethical judgments—these lack objective verification mechanisms. Training progress requires human expert annotation, a costly, slow process. An LLM trained on human-written essays has only the limited training data those humans produced. Unlike code where you can generate and verify millions of attempts, essay writing can't use automated verification loops. The model must rely heavily on training data, essentially operationalizing the human knowledge encoded in that data.
This creates a fundamental asymmetry in AI development. Verifiable domains can undergo explosive progress through automated reward signals and post-training loops. Non-verifiable domains remain limited by human annotation capacity. Companies and researchers naturally focus resources where returns are highest. This dynamic, while rational from a business perspective, creates concerning imbalances in AI capabilities.
The path forward requires either transforming non-verifiable domains into verifiable ones, creating harnesses or formal frameworks that enable automated verification, or developing entirely different paradigms that don't depend on automated rewards. Harnesses represent the immediate practical solution: research teams engineer problem-specific frameworks that structure a domain so that solutions can be formally verified. This requires human expertise in the domain but then unlocks automated improvement loops. Teams like Poetiq and Confluence Labs have demonstrated this approach, achieving breakthrough results on ARC V2 by engineering harnesses that allow LLMs to tackle it systematically.
However, the existence of harnesses highlights how far we remain from true AGI. Real general intelligence wouldn't need humans to engineer domain-specific solutions. True AGI would identify appropriate verification mechanisms independently. The harness approach represents human intelligence compensating for artificial intelligence's remaining gaps—a transition step, not a destination.
The Future of AI: Optimal Development Versus Efficient Scaling
Chollet articulates a crucial distinction that shapes his entire research vision: the difference between what works and what's optimal. Deep learning works extraordinarily well. Scaling produces impressive results. This success breeds confidence in the approach's sufficiency. But sufficiency differs from optimality. An inefficient solution that works remains inefficient, and inefficiency compounds across systems.
His prediction about AGI's eventual code base size illustrates this philosophy. He believes that retrospectively, when AGI is achieved, it will turn out to be built from less than 10,000 lines of code. Moreover, if this codebase had been known in the 1980s, AGI would have been achievable then with 1980s compute. This suggests that the secret to AGI isn't a scaling discovery—it's a conceptual breakthrough about fundamental principles of intelligence.
This differs radically from Douglas Lenat's CYC project, though the comparison is instructive. CYC attempted to handcraft symbolic knowledge, requiring human engineers for every improvement. It embodied the opposite of what Chollet advocates: a system bound by human labor bottlenecks. The deep learning revolution's strength was removing humans from the improvement loop. You could scale capabilities by adding training data and compute, with minimal human involvement beyond infrastructure maintenance.
Ndea's vision combines these insights. The core symbolic learning engine—the 10,000-line codebase—would be stable and elegant. But it would operate on a knowledge base built through automated learning processes, scaling through training data and compute without human engineers needing to hand-code improvements. This combines symbolic elegance with parametric scalability, the best of both approaches.
Chollet emphasizes that building AGI cannot be achieved by adding layers atop the current stack. It requires returning to foundations, reconsidering what the learning substrate itself should be. The entire field collapsed into deep learning, much as it previously collapsed into support vector machines in the early 2000s. Researchers in the 1990s were actively discouraged from pursuing neural networks; the consensus dismissed them as failed approaches. This historical collapse represents a loss. Many potentially viable approaches never received serious exploration because resources concentrated on the dominant paradigm.
For researchers considering alternative approaches, Chollet recommends exploring the broader history of AI research, particularly papers from the 1970s and 1980s when researchers pursued diverse directions. Genetic algorithms represent one underexplored approach with significant potential. The key criterion: does the approach scale without human bottlenecks? If improvements require human researchers spending time and effort, the approach won't work; capability gains will be bounded by human investment. But if the system improves through scaled resources—more data, more compute, automated processes—then you have a potentially viable direction.
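Genetic algorithms illustrate the "scales without human bottlenecks" criterion well: improvement comes from selection, crossover, and mutation over generations, driven only by compute and an automatic fitness function. The sketch below is a generic textbook GA maximizing the number of 1-bits in a string, not anything specific to Chollet's proposals.

```python
import random

# A minimal genetic algorithm on bitstrings. Fitness = number of 1-bits,
# so the known optimum is all ones. Once the fitness function is defined,
# no further human effort is needed: more generations = more improvement.
def evolve(length=20, pop_size=30, generations=60, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    fitness = sum  # count the 1-bits

    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # selection: keep top half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)
            child = a[:cut] + b[cut:]             # single-point crossover
            if rng.random() < 0.2:                # occasional point mutation
                i = rng.randrange(length)
                child[i] ^= 1
            children.append(child)
        pop = parents + children                  # elitism: parents survive
    return max(pop, key=fitness)

best = evolve()
print(sum(best))  # should approach the optimum of 20
```

Whether such approaches scale to open-ended intelligence is exactly the open question; the sketch only shows the property Chollet highlights, that the improvement loop contains no human.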
This framework explains why simple analogy with scaling laws isn't sufficient. An approach might show promise at small scale but lack the mathematical properties enabling scaling without human involvement. The ideal research direction combines elegant principles, mathematical scalability, and removal of human bottlenecks from the improvement loop. Most historical AI approaches fail on at least one dimension. Ndea's bet is that symbolic learning meets all three criteria.
Practical Implications for AI Development Today
While Ndea pursues long-term alternatives to deep learning, immediate implications affect current AI use and development. The distinction between domains with verifiable rewards versus those without explains why coding agents represent such a dramatic breakthrough while general language model capabilities remain constrained by training data availability.
For individuals and organizations, the practical lesson is clear: AI progress is accelerating and won't be stopped. The relevant question isn't whether to engage with AI progress but how to leverage it effectively. Those with deep expertise in specific domains can use AI tools far more effectively than novices. A skilled programmer using coding agents becomes extraordinarily productive. A mathematician employing AI mathematical reasoning tools can focus on high-level insights while computational work accelerates. A domain expert can engineer harnesses and frameworks enabling AI to work within their field.
This transforms the relationship between humans and AI. Rather than competition, the optimal stance involves leverage. Each person should develop deep expertise in their domain, then use AI tools to multiply their effectiveness. The expertise becomes more valuable as AI tools advance, not less valuable. A mediocre programmer replaced by coding agents loses their edge, but a skilled programmer leveraging coding agents becomes superhuman in productivity.
For researchers starting new organizations or projects, Chollet's advice reflects years of experience building Keras into Python's dominant deep learning library. The winning formula combines several elements: obsessive focus on usability and intuitive APIs, comprehensive documentation that teaches domain knowledge not just tool mechanics, and aggressive community building. One powerful tactic: hire your power users. Identify your most enthusiastic community members and bring them onto the team. They understand user needs deeply and care about the project's success.
The open-source explosion around AI tools demonstrates this principle. Projects that prioritize accessibility and community grow exponentially. Those treating tools as secondary considerations to research papers stagnate. For individuals launching projects that unexpectedly gain traction, the challenge shifts from building community to managing growth. Sustaining momentum while maintaining quality requires putting people in place who care about the project. Eventually, successful projects transcend their creators, becoming living ecosystems with their own momentum.
Conclusion
François Chollet's vision challenges the AI industry to think beyond immediate scaling success toward long-term optimality. While deep learning dominates current development and produces remarkable results, Chollet argues that the next 50 years of AI will trend toward fundamentally different approaches built on more efficient principles. The ARC-AGI benchmark series serves as a persistent reminder of the gap between current capabilities and genuine general intelligence—the ability to learn new tasks with human-level efficiency.
His work at Ndea represents a moonshot bet on symbolic learning as a more optimal foundation for AI. The probability of success may be modest, but the stakes justify the effort, particularly because few others are exploring this direction. For the AI field broadly, Chollet advocates for diversity of approaches, warning against monoculture dominance in research directions. The presence of alternatives pursuing different paths makes the entire field more robust and innovative.
As AGI approaches—Chollet predicts around 2030—the approach we take matters profoundly. Do we build AGI efficiently or inefficiently? Do we understand the fundamental principles of intelligence or merely scale existing approaches until they work? These questions shape not just AI development but the sustainability and efficiency of AI systems in the coming decades. By maintaining alternative research directions today, we preserve optionality for a more optimal AI future.
Original source: François Chollet: ARC-AGI-3, Beyond Deep Learning & A New Approach To ML