Discover why cheaper AI models outperform expensive frontier models in detecting vulnerabilities. The real moat? The system, not the model.
AI Cybersecurity: Why Cheaper Models Outperform Frontier AI in Finding Vulnerabilities
Key Takeaways
- The myth of scale: Anthropic's Mythos frontier model finds critical vulnerabilities, but smaller models costing roughly 1/100th as much achieve identical results on the same security tasks
- Jagged frontier effect: AI capabilities don't scale smoothly with model size—rankings reshuffle across different cybersecurity tasks, with no single dominant model
- Architecture beats AI: The real competitive advantage lies in the system design, pipeline architecture, and integration into development workflows, not the underlying model
- Economics of scale: Deploying hundreds of affordable models broadly outperforms deploying a single expensive frontier model sparingly—AISLE proved this with 180+ validated CVEs across 30+ projects
- The path forward: Focus on building robust cybersecurity systems with reliable detection, triage, and patching capabilities rather than chasing the most advanced (and expensive) models
The Frontier Model Hype vs. Reality
When Anthropic announced Mythos, the cybersecurity industry buzzed with excitement. A frontier model that autonomously chains vulnerabilities, escalating from user access to complete machine control across every major operating system and web browser? It seemed to validate a compelling narrative: securing software requires the biggest, most expensive models.
Anthropic's commitment to Project Glasswing—a $100M credit commitment plus $4M in donations—reinforced this premise. The results were undeniably impressive: discovering a 27-year-old bug in OpenBSD, a 16-year-old vulnerability in FFmpeg, and a multi-vulnerability privilege escalation chain in the Linux kernel. Mythos didn't just find these bugs; it constructed exploits autonomously, revealing security weaknesses that had persisted for decades.
But here's where the narrative breaks down.
The Jagged Frontier: Why Cheaper Models Win
AISLE decided to test the same 17-year-old FreeBSD remote code execution bug that Mythos identified against models costing a fraction of Mythos's price. The results shattered the scaling assumption: every single cheaper model found the vulnerability. All eight of them. A 3.6 billion parameter model costing just $0.11 per million tokens spotted the same critical flaw that Anthropic positioned as requiring a restricted, limited-access frontier model.
This finding wasn't isolated. When researchers tested 25 models across every major AI laboratory on false-positive detection—a task that rewards precision—the results inverted the expected hierarchy entirely. Smaller, cheaper open models outperformed most frontier models. The scaling curve ran backward: cheaper models produced dramatically fewer false positives than Claude Sonnet 4.5, GPT-4.1, and every Anthropic model through Opus 4.5.
Welcome to the jagged frontier. AI cybersecurity capability does not scale smoothly with model size, price, or generation. Instead, performance reshuffles across different tasks. Consider the evidence:
Task 1: Complex vulnerability chains. GPT-OSS-120b recovered the full 27-year-old OpenBSD SACK chain in a single call and proposed the correct mitigation—a perfect score. The same model, however, failed a basic Java data flow analysis task, revealing that size alone doesn't guarantee breadth of competence.
Task 2: Vulnerability scoring. Qwen3 32B achieved a perfect CVSS 9.8 assessment on the FreeBSD vulnerability. When presented with the same SACK code, it then declared it "robust to such scenarios"—a complete contradiction. No single model dominates across all security assessment types.
Task 3: Identifying patched code. Only one model correctly identified patched code as safe in all three test runs. Most models flagged the patched code on every attempt, fabricating elaborate bypass arguments about signed integers in unsigned fields. This task demands specificity that neither frontier nor cheap models consistently provide—except one outlier that nobody expected.
The pattern is clear: rankings reshuffle. The best model for vulnerability detection might be mediocre at exploit construction. The strongest exploit builder might struggle with triage. The champion at false-positive reduction might fail at Java analysis. No unified hierarchy exists.
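One way to exploit this reshuffling rather than fight it is to poll a small panel of cheap models and take a majority verdict, so no single model's blind spot decides the outcome. A minimal sketch, with model behaviors stubbed out and all names hypothetical:

```python
from collections import Counter

# Hypothetical sketch: exploit the jagged frontier by polling several cheap
# models and voting, instead of trusting any single model's verdict.
# These callables are stubs standing in for real model API calls.

def model_a(snippet: str) -> str:
    return "vulnerable"  # strong at raw detection

def model_b(snippet: str) -> str:
    return "safe"        # conservative: fewer false positives

def model_c(snippet: str) -> str:
    return "vulnerable"

def majority_verdict(snippet: str, models) -> str:
    """Return the most common verdict across a panel of cheap models."""
    votes = Counter(m(snippet) for m in models)
    return votes.most_common(1)[0][0]

print(majority_verdict("memcpy(dst, src, len);", [model_a, model_b, model_c]))
# With the stub votes above, "vulnerable" wins 2 of 3
```

The panel can be reweighted per task as rankings reshuffle, which is exactly the adaptability a static single-model deployment lacks.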
The System is the Moat, Not the Model
This jaggedness fundamentally changes competitive strategy in AI cybersecurity. If capabilities don't scale predictably with model size or cost, then the real differentiation isn't the model—it's the system built around it.
The economics tell the story. A thousand adequate detectives searching everywhere find more bugs than one brilliant detective who must guess where to look. The same principle applies to AI vulnerability detection: cheap models deployed across every codebase consistently outperform expensive models deployed sparingly.
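The detective analogy can be made concrete with a back-of-the-envelope calculation. All numbers below are illustrative assumptions except the $0.11 per million tokens figure from the FreeBSD experiment; the point is the shape of the arithmetic, not the exact values:

```python
# Toy budget model: at a fixed spend, how much code can each strategy
# scan, and how many real bugs would each be expected to surface?
# All constants are illustrative assumptions.

BUDGET_USD = 1_000.0
CHEAP_COST_PER_MTOK = 0.11      # cheap model price cited in the article
FRONTIER_COST_PER_MTOK = 11.0   # assumed ~100x the cheap price
TOKENS_PER_FILE = 4_000         # assumed average analysis cost per file
BUGS_PER_1K_FILES = 2.0         # assumed base rate of real vulnerabilities

def files_scanned(cost_per_mtok: float) -> float:
    tokens = BUDGET_USD / cost_per_mtok * 1_000_000
    return tokens / TOKENS_PER_FILE

def expected_bugs(cost_per_mtok: float, hit_rate: float) -> float:
    """Expected true findings = files scanned x base rate x hit rate."""
    return files_scanned(cost_per_mtok) / 1_000 * BUGS_PER_1K_FILES * hit_rate

# Grant the frontier model double the per-file hit rate; the cheap
# model's 100x coverage still dominates at a fixed budget.
cheap = expected_bugs(CHEAP_COST_PER_MTOK, hit_rate=0.4)
frontier = expected_bugs(FRONTIER_COST_PER_MTOK, hit_rate=0.8)
print(f"cheap: {cheap:.0f} expected bugs, frontier: {frontier:.0f}")
```

Under these assumptions the cheap-model strategy surfaces roughly 50x as many bugs, because a 100x coverage advantage overwhelms even a 2x per-file accuracy deficit.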
AISLE proved this at scale. Running their analyzer on pull requests to catch vulnerabilities before they ship, they discovered and validated 180+ CVEs across 30+ projects. That includes 15 vulnerabilities in OpenSSL and 5 in curl—critical infrastructure software that secures the internet. The OpenSSL Chief Technology Officer praised the quality of the reports. These weren't hallucinations or missed vulnerabilities. They were accurate, actionable security findings.
What's remarkable is that Anthropic's own technical description of Mythos's methodology describes a scaffold nearly identical to AISLE's approach: containerization for isolation, file scanning for surface exploration, crash oracles for validation, surface ranking for prioritization, and systematic validation. The system architecture is nearly interchangeable with what others run using cheaper models.
This distinction matters because it reveals where the actual competitive moat exists in AI cybersecurity infrastructure. The moat is not Mythos. It's not frontier-grade intelligence. It's the scaffold—the pipeline architecture, the integration points with development workflows, the maintainer trust, the infrastructure that identifies which vulnerabilities matter most and routes them to the right teams for patching.
Breaking Down the Cybersecurity Pipeline
Modern AI cybersecurity isn't a single task. It's a modular pipeline with distinct stages, each with different scaling properties:
1. Scanning: Cast the widest net to identify potential vulnerabilities. This stage benefits from broad coverage, not model sophistication. Cheaper models excel here because they're fast, deployable at scale, and cover more code faster.
2. Detection: Distinguish actual vulnerabilities from false positives. This stage is already commoditizing. Most models can detect known patterns. The advantage goes to systems that can deploy detection everywhere, not just where expensive models can run.
3. Triage: Assess severity, exploitability, and impact. This demands specificity and context. Different models excel at different aspects—some score CVSS ratings accurately, others contextualize business risk better. Hybrid approaches combining multiple models shine here.
4. Patching: Generate or recommend fixes. This requires creativity and understanding of codebase semantics. Here, frontier models like Mythos do separate from cheaper alternatives, suggesting this is where scale and sophistication matter most.
5. Exploitation: Construct proof-of-concept exploits that demonstrate real-world risk. This is where Mythos truly excels: it executed a 15-round RPC payload delivery that no cheaper model replicated. For offensive security research, frontier capability matters here. For defensive operations focused on patching vulnerabilities before exploitation, it's less critical.
The insight is crucial: you don't need frontier capability for every stage. You need the right capability for each stage, deployed with the right architecture.
The Economics of Abundance vs. Scarcity
The old playbook assumed scarcity: frontier models are expensive, so deploy them sparingly on the most critical tasks. Run Mythos against the highest-risk codebases. Hope it catches the worst vulnerabilities.
The jagged frontier suggests the opposite: abundance thinking works better. Deploy hundreds of cheaper models across thousands of codebases, pull requests, and open-source projects. Cast an enormous net. Let the system catch what an expensive model would miss because it was looking elsewhere.
This inversion reflects what AISLE demonstrated empirically. By running their analyzer on pull requests across 30+ projects, they achieved broader coverage than any single expensive model could. The cost per CVE discovered dropped dramatically. The detection rate improved because they weren't limited by the throughput constraint of expensive model inference.
The comparison is instructive: Anthropic's $100M commitment to Project Glasswing funds Mythos access and research. AISLE's findings came from systematically deploying cheaper models at scale. Both found serious vulnerabilities. Only one required restricting access to a frontier model. The other scaled to the entire open-source ecosystem.
Why Frontier Models Matter (and Where They Don't)
This analysis shouldn't be read as saying frontier models like Mythos are worthless for cybersecurity. In specific domains, they excel:
Offensive security research: If you're trying to discover novel zero-days before attackers do, frontier capability matters. Mythos's ability to autonomously construct multi-stage exploits, chain vulnerabilities across systems, and think creatively about attack surfaces is genuinely differentiated. For this mission, the investment in frontier models makes sense.
Supply chain security at massive scale: If you're securing infrastructure that thousands of organizations depend on, frontier models offer incremental advantages in catching the most sophisticated attack chains.
Exploit mitigation research: Understanding how to block exploit techniques requires the kind of exploit creativity Mythos demonstrates.
But here's the crucial distinction: Project Glasswing's mission is defensive, not offensive. It aims to find and patch vulnerabilities before attackers exploit them. For this mission, frontier capability is overkill. What matters is reliable discovery, accurate triage, fast patching, and integration into development workflows where maintainers trust the findings enough to act on them.
The economics reverse when you shift from "find the most sophisticated zero-day" to "find the critical vulnerability we can patch next week." At scale, cheaper models find more patchable vulnerabilities per dollar spent.
The Real Bottleneck: Building the System
If the model is interchangeable, what separates successful AI cybersecurity systems from failures? Four things:
1. Pipeline architecture: The scaffold matters more than any individual component. How are vulnerabilities scanned? How is detection layered? How does triage prioritize what's worth fixing? How does the system integrate with development workflows? The companies succeeding in this space have designed these pipelines carefully.
2. Maintainer trust: Developers won't patch vulnerabilities if they don't trust the tools reporting them. False positives erode trust faster than accurate findings rebuild it. The system must prove accuracy, consistency, and business-aligned severity scoring. AISLE's success reflects, in part, the trust they've built—the OpenSSL CTO praised their reports because they were accurate and actionable.
3. Integration into workflows: A vulnerability report that sits in a separate dashboard gets ignored. Integration matters: findings that appear in pull requests, that block merges automatically when critical, that route to the right team with the right context. The system that integrates deepest into development processes wins.
4. Continuous improvement: Jaggedness means no static hierarchy. Models improve. New vulnerabilities emerge. The winning systems are those that can adapt—swapping models, adjusting pipeline stages, retraining on emerging threats. Static deployments fall behind.
These factors barely depend on frontier models. They depend on engineering, integration, and operational maturity—all achievable with cheaper models when the system is designed well.
The Path Forward: Build Systems, Not Dependencies
The implications for cybersecurity leaders are significant. The frontier model narrative creates pressure to invest heavily in expensive, restricted-access models. The data suggests a different path: invest in building robust systems that intelligently deploy models at the right scale and cost for each task.
This doesn't mean ignoring frontier models entirely. Use them for the exploitation and novel vulnerability discovery tasks where they add value. But supplement them with cheaper models deployed broadly for scanning, detection, and triage. Design the pipeline so capabilities complement each other rather than depending on a single frontier model.
The winners in AI cybersecurity won't be those with exclusive access to the most powerful models. They'll be the organizations that build the most resilient, integrated, and economically efficient systems. They'll be the teams that understand the jagged frontier and exploit it rather than fighting it.
Open-source projects will benefit most from this reframing. Instead of waiting for Project Glasswing or other frontier model initiatives to audit their code, they can deploy their own instances of cheaper models in continuous integration pipelines. Instead of hoping Mythos discovers their vulnerabilities, they can systematically scan with tools that work with models they can afford to run indefinitely.
The model inside the system is increasingly interchangeable. The system itself—the architecture, the integration, the workflow—is the differentiator. Build the system.
Conclusion
The jagged frontier of AI cybersecurity challenges the assumption that capability scales with model size and cost. Anthropic's Mythos frontier model impressed with its autonomous vulnerability discovery, but AISLE's empirical evidence proves that cheaper models achieve identical results on the same tasks. The real competitive advantage lies not in the model but in the system: the pipeline architecture, the integration into development workflows, and the maintainer trust. Organizations seeking to maximize security outcomes should deploy abundant, cheaper models broadly rather than rare, expensive ones sparingly. The future of AI-driven cybersecurity belongs to those who build robust systems, not those who chase the most expensive frontier models.
Original source: The Jagged Frontier of AI Security