Beyond Full AI: Why 65% of My Agent's Workflow Runs as Code
Key Takeaways
- Full AI automation isn't always better: Six months of optimization revealed that 65% of workflow nodes perform better as deterministic code rather than LLM calls
- Blueprints beat prompts: Workflow blueprints—directed graphs specifying when to use LLM vs. code—provide superior control and reliability compared to skill-based prompting
- Strategic LLM usage: Limiting AI to ambiguous tasks (research, synthesis, planning) while handling predictable tasks with code creates faster, more accurate systems
- Hybrid architecture wins: The Stripe minion pattern divides work: deterministic code for predictable scenarios, LLMs for genuinely ambiguous decisions
- Measurable improvement: Separating concerns reduced errors, improved speed, and enabled better error handling across 86% of workflows
The Problem With Fully Agentic Systems
When I first built my AI agent, the approach seemed logical: let the LLM handle everything. Why limit its capabilities? The system could delegate any task to an AI model, which would confidently work through the workflow step by step.
The reality was different. While LLMs could progress through tasks, they weren't always accurate. They would hallucinate connections, miss edge cases, and sometimes deviate from intended workflows entirely. I'd added tools to constrain what the LLM could call, limiting its ability to go off-track. I even added a Discovery tool to help the AI find relevant tools. Progress, yes, but not enough to justify the complexity and costs.
The fundamental issue: I was asking AI to make decisions about when to use which tools, on top of actually using the tools themselves. That's two layers of potential error.
The Stripe Minion Architecture: A Better Framework
Everything changed when I discovered Stripe's minion architecture. Their insight was elegant and transformative: deterministic code should handle the predictable; LLMs should tackle only the genuinely ambiguous.
This distinction sounds simple but reshapes how you build AI systems. Instead of asking "Can an LLM do this task?" ask "Does this task require judgment about ambiguous situations, or is it just executing known logic?"
I restructured my system around blueprints—workflow charts written in code. Each blueprint specifies:
- Nodes: Discrete steps in the workflow
- Transitions: Paths between nodes based on conditions
- Trigger conditions: Rules for matching specific tasks
- Error handling: Explicit recovery paths when things fail
A typical blueprint looks like this:
extract_domain (code) → attio_find (code) → harmonic_enrich (code)
→ generate_summary (LLM, 1 turn) → notion_prepend (code)
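The pipeline above can be expressed as plain data. A minimal Python sketch follows; the dict schema and the `run` hook are my assumptions for illustration, not the author's actual format:

```python
# A minimal sketch of the blueprint above as data. The node specs and the
# example step function are illustrative, not the author's implementation.

def extract_domain(ctx):
    # Deterministic step: pull the domain out of an email address.
    ctx["domain"] = ctx["email"].split("@")[-1]
    return ctx

blueprint = {
    "nodes": {
        "extract_domain": {"kind": "code", "run": extract_domain},
        "attio_find": {"kind": "code"},
        "harmonic_enrich": {"kind": "code"},
        "generate_summary": {"kind": "llm", "max_turns": 1},
        "notion_prepend": {"kind": "code"},
    },
    "edges": [
        ("extract_domain", "attio_find"),
        ("attio_find", "harmonic_enrich"),
        ("harmonic_enrich", "generate_summary"),
        ("generate_summary", "notion_prepend"),
    ],
}

# Only one of the five nodes is agentic.
llm_nodes = [name for name, spec in blueprint["nodes"].items()
             if spec["kind"] == "llm"]
```

Keeping the blueprint as data means the system, not the model, decides where the single LLM node sits in the graph.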
Notice that only one step—generate_summary—involves the LLM. The other four steps are pure deterministic code. This is fundamentally different from the "skill-based" approach where you tell the LLM what tools exist and let it choose.
The difference matters: A skill tells the LLM what to do. A blueprint tells the system when to involve the LLM at all. That shift in responsibility from the model to the architecture is where reliability comes from.
How Workflows Distribute Across the Stack
To understand the impact, I categorized my workflows by how much they lean on code vs. AI:
Pure Code Workflows (29% of all workflows)
Deal pipeline updates, chat message routing, and email categorization don't need AI judgment at all. These are entirely deterministic. Patterns like "if customer status changes, update pipeline" or "if email contains keyword, route to channel" execute perfectly well as code. No LLM calls. No hallucination risk. No latency or cost. They simply work.
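A rule like "if email contains keyword, route to channel" needs nothing more than a lookup. A minimal sketch, where the channel names and keywords are hypothetical:

```python
# Pure-code routing: no LLM call, no hallucination risk, no latency.
# Keywords and channel names are hypothetical examples.

ROUTES = {
    "invoice": "#finance",
    "outage": "#oncall",
    "refund": "#support",
}

def route_email(subject: str) -> str:
    """Route an email to a channel by keyword match."""
    lowered = subject.lower()
    for keyword, channel in ROUTES.items():
        if keyword in lowered:
            return channel
    return "#inbox"  # default when no rule matches
```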
Mostly Code, Occasional LLM (36% of all workflows)
Company research, newsletter processing, and person research workflows need the LLM for specific, bounded tasks: extracting key facts from documents or synthesizing information from multiple sources. But 67-91% of each workflow runs as code. The LLM sees only what it needs—a carefully prepared chunk of text to summarize, a structured list to categorize—and processes it in one to three turns with constrained tools. The code handles fetching data, formatting it, and applying results. The LLM handles understanding.
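That shape, code on both sides of a single constrained call, can be sketched like this; the `fetch`, `llm`, and `store` hooks are hypothetical placeholders, not the author's actual API:

```python
# Mostly-code workflow sketch: deterministic code fetches, prepares, and
# persists; only the injected `llm` callable is non-deterministic.

def research_company(name: str, fetch, llm, store) -> str:
    """Run a company-research step where only `llm` is non-deterministic."""
    raw = fetch(name)                          # code: hit APIs for raw data
    chunk = "\n".join(raw["facts"][:20])       # code: prepare a bounded input
    summary = llm(                             # LLM: one constrained turn
        f"Summarize these facts about {name} in 3 bullet points:\n{chunk}"
    )
    store(name, summary)                       # code: apply the result
    return summary
```

Injecting the LLM as a plain callable keeps the one ambiguous step swappable and testable while the surrounding code stays deterministic.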
Genuinely Hybrid Workflows (21% of all workflows)
Blog post writing, document analysis, and bug fix investigation are harder to separate. These workflows make multiple LLM calls that iterate toward quality. The code sets up conditions and constraints, but the creative or analytical work genuinely needs the model's flexibility. These are rightly hybrid—the LLM has meaningful input on direction, and multiple passes refine the output.
Fully Agentic (14% of all workflows)
Only data transformations and error investigations remain fully agentic. Counterintuitively, these tend to be coding tasks rather than decision-making. When the LLM needs to write new code to handle an unexpected error, or transform data in a novel way, it needs freedom to explore. Blueprints constrain it too much. These workflows let it work.
Why This Architecture Works Better
The shift from "AI does everything" to "AI does specific things well" created measurable improvements:
Reliability: Deterministic code doesn't hallucinate. When 65% of your workflow nodes run as code, hallucination risk disappears from those nodes entirely. The errors that remain are concentrated in the steps that still call the LLM, chiefly the 14% of workflows that are fully agentic, where some experimentation is expected.
Speed: Code execution is faster than LLM inference. By moving predictable work to code, each workflow runs faster. The LLM only processes the work that truly needs its reasoning.
Cost: LLM API calls have per-token costs. Using the model only for research, synthesis, and planning—not for every decision point—reduces per-workflow costs significantly.
Debuggability: When something fails in your workflow, is it the code or the AI? If 65% is code, you know where to look first. Deterministic systems fail for clear, fixable reasons. Agentic systems fail mysteriously.
Error Handling: Blueprints let you specify explicit error paths. "If this node fails, retry twice, then escalate to manual review." You can't do that when every step is a prompt asking the LLM to handle failures creatively.
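The quoted policy, retry twice then escalate, is a few lines of ordinary code once the error path is explicit in the blueprint. In this sketch the `escalate` hook is a hypothetical stand-in for a manual-review queue:

```python
# Explicit error path for one blueprint node: bounded retries, then
# escalation to a human instead of asking an LLM to improvise a recovery.
# The `escalate` hook is a hypothetical stand-in for a review queue.

def run_with_retries(node, ctx, retries: int = 2, escalate=print):
    """Run one node; after `retries` extra attempts, escalate to manual review."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return node(ctx)
        except Exception as exc:
            last_error = exc
    escalate(f"manual review needed: {last_error}")
    return None
```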
The Blueprint Structure: Directed Graphs of Mixed Logic
Each blueprint is a directed graph where nodes are either deterministic (code) or agentic (LLM), and transitions between nodes branch based on conditions.
This structure forces you to be explicit about:
- What information each node needs
- What conditions determine the next step
- Where humans need to intervene
- What constitutes success or failure
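A minimal executor for such a graph fits in a few lines. In this sketch each node returns its updated context plus the name of the next node, with `None` ending the run; the example nodes are illustrative, not the author's implementation:

```python
# Minimal directed-graph executor: each node is a function returning
# (updated_context, next_node_name); None ends the run. Example nodes
# below are illustrative only.

def run_blueprint(nodes: dict, start: str, ctx: dict) -> dict:
    """Walk the graph; every transition is an explicit condition in code."""
    current = start
    while current is not None:
        ctx, current = nodes[current](ctx)
    return ctx

nodes = {
    "check": lambda ctx: (ctx, "enrich" if ctx.get("incomplete") else "format"),
    "enrich": lambda ctx: ({**ctx, "enriched": True}, "format"),
    "format": lambda ctx: ({**ctx, "done": True}, None),
}
```

The branch in `check` is exactly the kind of transition condition the blueprint makes explicit: the code decides whether the enrichment step runs at all.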
For example, a company research workflow might look like:
- Fetch company data (code): Hit APIs for basic information
- Check if data is complete (code): Evaluate what's missing
- If incomplete, research online (LLM, 1-2 turns): Synthesis task
- Format results (code): Standardize output
- Store in database (code): Persistence
Steps 1, 2, 4, and 5 are deterministic. Only step 3 involves the LLM, and it's tightly scoped: "Research this company and answer these specific questions." Not "Do whatever you think is needed."
From Scaffolding to Permanent Architecture?
The blueprints, tools, and skills might feel like temporary scaffolding. With each new model release, capabilities expand. Tasks that required deterministic code six months ago might not tomorrow. The 14% currently fully agentic might become 25% with more capable models.
But the underlying principle—that systems work better when code and AI have distinct responsibilities—likely remains true. The percentage of work that's code might shift, but the idea that code should handle predictable work and AI should handle ambiguous work seems foundational.
As models improve, you're not abandoning blueprints. You're reclassifying nodes. A task that once required careful code scaffolding might move to a simple LLM call. But it still happens inside the blueprint, not as free-form agentic exploration.
Conclusion
I started by asking AI to do everything and ended up with a system where it does less than half the work. Paradoxically, the system is more capable, more reliable, and more useful because of that constraint.
The shift from "fully agentic" to "hybrid with blueprints" is the most practical improvement I've made to my AI workflow. It's the difference between a system that feels magical but fragile, and one that feels engineered and dependable. If you're building AI agents, sketch out your workflows as blueprints, separate the deterministic from the genuinely ambiguous, and let AI focus on what it's actually good at: handling the unexpected, synthesizing information, and navigating genuine uncertainty.
Original source: Is AI Doing Less & Less?