AI Agent Routing: Why Architecture Beats Model Choice in 2026
Key Takeaways
- Router-first design can cut AI costs by 90% or more: with intelligent routing, 70-80% of non-coding agent traffic runs on cheap local models instead of expensive LLMs
- Architecture > Model selection: The biggest mistake teams make is choosing the model first. Routing decisions should come before model selection
- Three-layer system works best: Skill classifiers identify tasks, routers assign tiers, and model selectors pick the cheapest option that meets confidence thresholds
- Async inference changes the game: Batch inference costs roughly two orders of magnitude less than real-time inference for non-urgent tasks
- Coinbase cut AI spend nearly in half while token usage grew, which suggests that better defaults and routing beat cost controls
The Router-First Philosophy: Why Model Choice Comes Last
Most teams building AI agents start with the same mistake: they pick the expensive model first and build the system around it. This backwards approach leaves money on the table and creates unnecessary technical debt.
The real insight, popularized recently by Coinbase CEO Brian Armstrong, is that intelligent routing matters far more than model selection. When you design your system correctly, 70-80% of traffic never touches expensive large language models. Instead, it runs on local models that cost essentially nothing per call, or on asynchronous batch models that cut the cost of that work by over 90%.
The difference between teams that control AI costs and teams that don't isn't friction or spending alerts—it's better defaults, smarter routing, and strategic caching. Engineers should absolutely have the freedom to choose any model they want, but the system defaults determine what actually gets used at scale.
Coinbase's recent achievement illustrates this perfectly. While token usage grew exponentially, they cut their AI spend nearly in half. How? Not through draconian restrictions, but through deliberate architectural choices that made cheap options the default path for most requests.
Understanding the Three-Layer Routing Architecture
AI agent routing isn't a monolithic black box. It's a three-layer system in which each layer solves a different problem.
The Skill Classifier is the first layer. It takes raw user requests and translates them into concrete operations. When someone asks your agent to "draft a reply," "summarize a repository," or "run a database migration," the skill classifier identifies which operation is actually needed. This is pure intent recognition—a language understanding problem. The classifier doesn't make routing decisions; it just labels what the user actually wants.
The Router is the second layer, and it's where the real economics happen. The router receives the classifier's label and makes one critical decision: which tier of compute should handle this operation? The router doesn't read the full prompt. Instead, it examines a few key features: task complexity, required context size, and the historical success rate for similar operations. This is a scheduling problem, not a language problem. The router's job is to send simple tasks to cheap models and complex tasks to capable ones.
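The router's decision can be sketched as a small function over the features listed above. This is a minimal illustration, not a production policy: the thresholds, the 0-to-1 complexity scale, and the `needs_immediate_answer` field are assumptions introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOCAL = "local"
    ASYNC = "async"
    REALTIME = "real-time"


@dataclass(frozen=True)
class TaskFeatures:
    """Features the router inspects instead of the full prompt."""
    operation: str                # label produced by the skill classifier
    complexity: float             # 0.0 (trivial) to 1.0 (hard); hypothetical scale
    context_tokens: int           # required context size
    local_success_rate: float     # historical success of similar operations on the local tier
    needs_immediate_answer: bool  # assumption: supplied by the caller


def route(f: TaskFeatures) -> Tier:
    """Pick the cheapest tier that is likely to handle the task."""
    # Simple, small-context tasks with a good track record stay local.
    # All three thresholds are illustrative and would be tuned from data.
    if f.complexity < 0.3 and f.context_tokens <= 8_000 and f.local_success_rate >= 0.9:
        return Tier.LOCAL
    # Anything else that can wait goes to the batch queue.
    if not f.needs_immediate_answer:
        return Tier.ASYNC
    return Tier.REALTIME
```

Note that `route` never sees the prompt text, only the label and a handful of numbers, which is what makes it a scheduling decision rather than a language one.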
The Model Selector is the third layer. Once the router decides on a tier (local, async, or real-time), the model selector picks the cheapest model within that tier that meets the confidence threshold. This separation matters because it lets you A/B test different models against the same operation without rewriting classifier logic or router rules.
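A model selector along these lines can be very small. The sketch below assumes each candidate carries a measured confidence for the operation at hand; the model names and costs in `CATALOG` are placeholders, not real products or prices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelOption:
    name: str             # hypothetical model identifier
    tier: str             # "local", "async", or "real-time"
    cost_per_call: float  # illustrative relative cost, not real pricing
    confidence: float     # measured success rate for this operation, 0.0 to 1.0


def select_model(options: list[ModelOption], tier: str, threshold: float) -> Optional[ModelOption]:
    """Return the cheapest model in `tier` whose confidence meets `threshold`."""
    eligible = [m for m in options if m.tier == tier and m.confidence >= threshold]
    # None signals that the caller should escalate to a more capable tier.
    return min(eligible, key=lambda m: m.cost_per_call, default=None)


CATALOG = [
    ModelOption("small-local-a", "local", 0.0, 0.82),
    ModelOption("small-local-b", "local", 0.0001, 0.93),
    ModelOption("batch-large", "async", 0.01, 0.97),
]
```

Swapping a model in or out for an A/B test means editing `CATALOG`; the classifier and router stay untouched.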
This three-layer approach works because the classifier and router solve completely different problems. The classifier answers "what is the user asking?" The router answers "where should this run?" When teams conflate these two functions and bury model choice inside the prompt, they lose the ability to experiment with different models and they lock expensive model choices into their defaults.
Why Asynchronous Inference Changes Agent Economics
Here's the insight that reshapes everything: local compute costs virtually nothing, and asynchronous batch inference costs roughly two orders of magnitude less than real-time inference. Once you acknowledge this cost difference, the strategic question becomes much narrower: What fraction of work actually needs to return an answer in real time?
The answer, surprisingly, is: very little.
The key enabler is queueing. Most agent tasks don't need instant responses. Drafting a reply can wait a few minutes. Summarizing a repository doesn't require sub-second latency. Creating a diligence memo doesn't demand immediate return. Running an evaluator on yesterday's traces is explicitly designed to happen overnight. When your system can queue work instead of demanding everything be real-time, you unlock massive cost savings.
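As a rough sketch of what queueing means in code, the class below accumulates deferrable tasks and hands them to a batch processor in groups. `BatchQueue` and `fake_batch_model` are hypothetical names for this example; a real system would use a durable queue and an actual batch inference call.

```python
from collections import deque
from typing import Callable


class BatchQueue:
    """Accumulates deferrable tasks and processes them in groups."""

    def __init__(self, process_batch: Callable[[list[str]], list[str]], batch_size: int = 3):
        self._pending: deque[str] = deque()
        self._process_batch = process_batch
        self._batch_size = batch_size
        self.results: list[str] = []

    def submit(self, task: str) -> None:
        """Queue a task; flush automatically once a full batch has accumulated."""
        self._pending.append(task)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send everything pending as one batch (for example, from a nightly job)."""
        if not self._pending:
            return
        batch = list(self._pending)
        self._pending.clear()
        self.results.extend(self._process_batch(batch))


def fake_batch_model(tasks: list[str]) -> list[str]:
    # Stand-in for a real batch inference call.
    return [f"done: {t}" for t in tasks]
```

The caller gets no result at submit time, which is exactly the trade being made: latency is given up in exchange for batching.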
This queueing mindset rewires how you think about agent architecture. Instead of asking "which expensive model should handle this?" you ask "can this wait?" If the answer is yes, and for most agent tasks it is, you've found a cost reduction of roughly 100x for that task.
Real-time inference is expensive because it demands always-on compute and rapid responses, which leaves little room for batching. Async batch inference is cheap because you can accumulate requests, process them in parallel, and amortize fixed costs. The economic difference is so stark that it should shape your entire routing strategy.
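The arithmetic behind these claims is easy to check. The function below is a back-of-the-envelope sketch: it assumes every request costs the same on a given tier, treats local inference as free, and takes "two orders of magnitude" literally as a factor of 100. The traffic shares in the usage note are illustrative, not measured.

```python
def blended_cost(share_local: float, share_async: float,
                 realtime_cost: float = 1.0,
                 async_discount: float = 100.0,
                 local_cost: float = 0.0) -> float:
    """Average cost per request, relative to running everything in real time."""
    share_realtime = 1.0 - share_local - share_async
    if share_realtime < -1e-9:
        raise ValueError("shares must not sum to more than 1")
    share_realtime = max(share_realtime, 0.0)  # absorb floating-point noise
    return (share_local * local_cost
            + share_async * realtime_cost / async_discount
            + share_realtime * realtime_cost)
```

With 75% of traffic local and 15% async, `blended_cost(0.75, 0.15)` comes to about 0.10 of the all-real-time cost, which is roughly a 90% reduction. Nearly all of the remaining cost comes from the 10% that still runs in real time, so that share is the number worth driving down.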
Synchronous and Asynchronous Feedback Loops: The Complete System
A sophisticated routing system doesn't stay static—it learns from every request and outcome. The best agents use two feedback mechanisms operating on different timescales, each catching different failure modes.
Synchronous failure-mode signals operate in real time as requests arrive. A specialized predictor examines each incoming task and annotates it with five risk features:
- Missing repository context: does the agent have the code it needs?
- Long dependency chains: how many steps does this task require?
- Risky migrations: could this operation cause problems?
- Security-sensitive prompts: is this high-risk from a security perspective?
- High-consequence writes: does this touch critical data?
When the predictor detects a known-hard pattern, it signals the router immediately. Known-difficult tasks get routed to more capable models before they have a chance to fail.
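One way to represent the five risk features is a small record of boolean flags that the router consults before committing to a cheap model. This sketch assumes the predictor has already produced the flags; the rule that any single flag forces escalation, and the capability labels `"low"` and `"high"`, are simplifications chosen for the example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RiskFlags:
    """The five risk annotations attached to an incoming task."""
    missing_repo_context: bool = False
    long_dependency_chain: bool = False
    risky_migration: bool = False
    security_sensitive: bool = False
    high_consequence_write: bool = False

    def raised(self) -> list[str]:
        """Names of the flags that are set, in declaration order."""
        return [f.name for f in fields(self) if getattr(self, f.name)]


def apply_risk_signal(default_capability: str, flags: RiskFlags) -> str:
    """Send a task to the most capable models when any known-hard pattern is present."""
    return "high" if flags.raised() else default_capability
```

A weighted score over the flags would be a natural refinement once there is enough trace data to fit the weights.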
Nightly closed-loop feedback operates on a longer timescale. A batch evaluator scores all the traces from the previous day overnight—examining what succeeded, what failed, and why. This nightly analysis updates the router's weights and decision boundaries, discovering failure modes that the synchronous predictor missed. Because this evaluation runs on asynchronous inference, the feedback loop itself costs nearly zero.
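A minimal version of the nightly update might look like the following. It assumes the batch evaluator has already reduced each trace to an `(operation, tier, succeeded)` tuple, and it uses a simple exponential blend with a hypothetical `learning_rate`; the article does not specify how the router's weights are actually updated.

```python
from collections import defaultdict


def nightly_update(prior: dict[tuple[str, str], float],
                   traces: list[tuple[str, str, bool]],
                   learning_rate: float = 0.2) -> dict[tuple[str, str], float]:
    """Blend yesterday's observed success rates into the router's stored rates."""
    counts: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])
    for operation, tier, succeeded in traces:
        key = (operation, tier)
        counts[key][0] += int(succeeded)  # successes
        counts[key][1] += 1               # attempts

    updated = dict(prior)  # leave the caller's dictionary untouched
    for key, (wins, total) in counts.items():
        observed = wins / total
        old = prior.get(key, observed)  # no prior yet: start from the observation
        updated[key] = (1 - learning_rate) * old + learning_rate * observed
    return updated
```

The resulting rates can feed straight back into the router as the historical success rate for similar operations, so an operation that starts failing on the local tier drifts toward a more capable one over a few nights.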
Together, these mechanisms create a learning system. The synchronous layer prevents predictable failures. The nightly layer discovers new failure patterns and adapts the router. Neither mechanism adds significant cost because the evaluation itself is cheap—it's just another async batch job, not real-time LLM inference on every request.
How Skill Distillation Unlocks Local Model Capability
Once you've optimized your routing architecture, the next insight is that most agent tasks don't actually require massive language models. This is where skill distillation becomes powerful.
Skill distillation is the process of teaching smaller, cheaper local models to handle specific agent operations (drafting replies, summarizing code, analyzing documents) that would otherwise go to a frontier model such as Claude or GPT-4. Once distilled, those local models become genuinely competent at their narrow tasks, and can match or even exceed larger models on those exact operations.
The scale impact is enormous. If you distill your most common operations into local models, and you've already architected your system around intelligent routing, you reach the point where 70-80% of non-coding agent traffic runs on local models. That's not 70-80% of the cheap tasks; it's 70-80% of all non-coding traffic, including many tasks that initially seem like they'd need expensive models.
The economic implication is straightforward: local model inference has a marginal cost close to zero per call. There's no per-token billing and no API round trip; you pay only for the hardware you run it on. When three-quarters of your agent traffic runs on local models, your average per-request cost can drop by an order of magnitude or more.
Designing Systems Around Routing, Not Models
The architectural principle that binds everything together is this: design your system around routing, not around models. Pick your models last.
Most teams do the opposite. They choose a foundation model first—Claude, GPT-4, Llama—and then build everything else around it. The system becomes a monument to that one model choice. When something goes wrong, they blame the model. When costs are too high, they blame the pricing. When latency matters, they blame the inference speed. But the real culprit is architecture.
The routing-first approach inverts this logic. You start by understanding your task distribution: what fraction of requests are simple vs. complex? What fraction can be async vs. real-time? Which operations are high-consequence and which are low-consequence? You design a routing system that automatically navigates this landscape, allocating each request to the cheapest compute tier that can handle it reliably.
Only after you've built that system do you ask: "which models should we use at each tier?" And the answer emerges naturally. Local models for simple, well-defined tasks. Async capable models for medium-complexity requests. Real-time capable models for complex or high-consequence work. The models become implementation details, not architectural decisions.
This reframing explains why Coinbase could cut AI spend nearly in half while growing token usage exponentially. They didn't restrict engineers or add friction. They changed defaults, improved routing, and added caching—all architectural improvements that made cheap options the natural path for most requests.
The Practical Implications for Your Agents
If you're building AI agents in 2026, the tactical implications are clear:
First, audit your current task distribution. What fraction of your requests genuinely require real-time inference? What fraction can queue? What tasks could be handled by cheaper, specialized models? Most teams discover that 60-80% of traffic could move to cheaper tiers without quality loss.
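An audit like this can start as a few lines over a request log. The sketch assumes each logged request has already been labeled with two booleans, `simple` and `needs_realtime`; those field names are hypothetical, and how you derive them is the real work.

```python
from collections import Counter


def audit(requests: list[dict]) -> dict[str, float]:
    """Return the share of requests whose cheapest viable tier is local, async, or real-time."""
    def cheapest_tier(r: dict) -> str:
        if r["simple"]:
            return "local"
        if not r["needs_realtime"]:
            return "async"
        return "real-time"

    counts = Counter(cheapest_tier(r) for r in requests)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tier: counts[tier] / total for tier in ("local", "async", "real-time")}
```

Comparing this breakdown with where requests actually run today shows how much traffic is sitting on a more expensive tier than it needs.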
Second, invest in your classifier and router before picking expensive models. These are force multipliers. A better classifier catches more intent patterns and routes more traffic to cheap models. A better router learns which models succeed on which tasks. These components shape your entire cost structure.
Third, implement synchronous and asynchronous feedback loops. Monitor failures and successes at both timescales. The synchronous signals keep bad requests from wasting expensive compute. The nightly feedback loop discovers new opportunities to shift traffic to cheaper options.
Fourth, distill your common operations into local models. This isn't a one-time effort—it's an ongoing process of teaching cheaper models to handle more tasks. Each operation you distill is another percentage point of traffic that doesn't need expensive models.
Finally, treat model selection as a last decision, not a first one. You'll naturally end up using multiple models at different tiers. That's good. It means your system is correctly optimized. The expensive models are reserved for the 10-20% of traffic that genuinely needs them.
Conclusion
The most important decision in building AI agents isn't which model to use. It's how to route requests through a system of models at different cost and capability tiers. By prioritizing architecture over model choice, implementing intelligent routing, and leveraging asynchronous inference for the majority of tasks, teams can cut AI costs by 90% or more.
Start with routing. Build feedback loops. Distill capabilities into local models. Pick your models last. This sequence—the opposite of how most teams approach the problem—is what separates companies that control AI spend from those controlled by it.
Original source: Most AI Work Can Wait