Building Software with AI Agents: Lessons from a Million-Line Codebase
Key Takeaways
- Complete Agent-Generated Development: A production software product shipped with zero manually-written code across application logic, tests, CI configuration, and documentation using Codex agents
- Radical Productivity Gains: The team shipped in roughly one-tenth the development time of traditional manual coding, averaging 3.5 pull requests per engineer per day
- Redefined Engineering Roles: Engineers shifted from writing code to designing systems, creating abstractions, and building feedback loops that enable agent autonomy
- Repository as System of Record: Structured documentation in the codebase became the primary knowledge source, replacing scattered information across platforms and communication channels
- Architectural Discipline Over Micromanagement: Enforcing strict boundaries and invariants through automated linting allowed rapid development without sacrificing code coherence
- Continuous Garbage Collection: Regular automated cleanup processes address drift and maintain code quality without requiring weekly manual intervention
The Foundation: Starting with Zero Code
Over five months beginning in late August 2025, a team of seven engineers built a fully functional software product that shipped to real users without writing a single line of code by hand. This wasn't an academic exercise—the product has hundreds of internal daily users and external alpha testers. It deploys, breaks, and gets fixed just like any traditional software project, except every line came from Codex.
The initial repository scaffold included standard development infrastructure: folder structure, CI configuration, formatting rules, package management, and application framework. Even the AGENTS.md file directing how agents should work was generated by Codex itself using GPT-5. From the very first commit, the repository was optimized for agent navigation rather than human developers.
The scale grew quickly. Five months later, the repository contained roughly one million lines of code distributed across application logic, infrastructure, tooling, documentation, and internal utilities. The team opened and merged approximately 1,500 pull requests—averaging 3.5 per engineer per day. Remarkably, throughput increased as the team grew from three to seven engineers, suggesting that the system's efficiency improved with scale rather than degrading.
This outcome required a fundamental shift in how engineering teams approach development. The constraint was intentional: by committing to zero manual coding, the team forced themselves to build the infrastructure, tools, and feedback loops necessary for agents to operate reliably at scale. The philosophy became simple but powerful: humans steer, agents execute.
Redefining Engineering Work: From Code to Systems
The absence of human coding didn't mean less engineering work—it meant fundamentally different work. Early progress moved slower than expected, but not because Codex lacked capability. The repository was underspecified. The agent had insufficient tools, abstractions, and internal structure to make meaningful progress toward high-level goals.
This realization reshaped the team's mission. Their primary responsibility shifted from writing code to enabling agents to do useful work. In practice, this meant working depth-first: breaking larger goals into specific building blocks (design, code, review, test), prompting agents to construct those blocks, and using completed components to unlock more complex work. When something failed, the fix was never "try again." Instead, engineers asked: "What capability is missing, and how do we make it legible and enforceable for the agent?"
The workflow pattern became standardized: an engineer describes a task in a prompt, runs the agent, and lets it open a pull request. To drive that PR to completion, Codex reviews its own changes locally, requests agent reviews both locally and in the cloud, responds to feedback from humans and other agents, and iterates until all agent reviewers approve. This creates a surprisingly effective loop where the agent becomes its own primary code reviewer.
Humans review pull requests but aren't required to. Over time, the team pushed nearly all review effort toward agent-to-agent evaluation. This freed human attention for higher-level decisions: validating that the built system actually solved the intended problem and ensuring architectural coherence.
The shift represents a profound change in engineering culture. Rather than measuring productivity by lines of code written, success became measured by shipping features, fixing bugs, and maintaining system health. Engineers spent their time designing systems, specifying intent, and building feedback loops—work that requires deep thinking but not typing.
Making Applications Legible to Agents
As code throughput increased, a new bottleneck emerged: human QA capacity. The fixed constraint was human time and attention. To overcome this, the team made critical application components directly legible to Codex—the UI, logs, metrics, and trace data that developers normally interpret themselves.
The application became bootable per git worktree, allowing Codex to launch an isolated instance for each change. The Chrome DevTools Protocol was wired into the agent runtime with skills for DOM snapshots, screenshots, and navigation. This enabled Codex to reproduce bugs, validate fixes, and reason about UI behavior directly without human interpretation.
The same approach applied to observability tooling. Logs, metrics, and traces were exposed to Codex through a local observability stack that's ephemeral for each worktree. The agent operates on a fully isolated application version with its own logs and metrics, torn down once the task completes. Using LogQL for logs and PromQL for metrics, agents can execute on prompts like "ensure service startup completes under 800ms" or "no span in these four critical journeys exceeds two seconds."
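A prompt like "ensure service startup completes under 800ms" implies the agent can turn an SLO into a query plus a pass/fail check. A minimal sketch of the check half, operating on the real Prometheus HTTP API response shape (the metric name in the query is an assumption):

```python
from typing import Any

# A PromQL query an agent might issue for the 800ms startup goal;
# the metric name is hypothetical.
STARTUP_P95 = 'histogram_quantile(0.95, rate(service_startup_seconds_bucket[5m]))'

def within_threshold(prom_response: dict[str, Any], max_seconds: float) -> bool:
    """Check every series in a Prometheus instant-query response against a bound.

    `prom_response` follows the Prometheus HTTP API shape:
    {"status": "success", "data": {"resultType": "vector", "result": [...]}}.
    """
    if prom_response.get("status") != "success":
        raise ValueError("metrics query failed")
    result = prom_response["data"]["result"]
    # Each vector sample carries its value as [timestamp, "string-float"].
    return all(float(sample["value"][1]) <= max_seconds for sample in result)
```

A boolean outcome like this is what lets the agent close its own loop: it can keep iterating on a fix until the check passes, without a human reading dashboards.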
With this infrastructure in place, agent work sessions sometimes run continuously for six hours or more—often while human engineers sleep. The agent validates its own work against real metrics and logs, dramatically reducing the need for human QA verification.
This approach—making system state machine-readable and actionable—fundamentally changed what agents could accomplish. Rather than working from abstract specifications, agents reasoned directly from the live system's behavior.
Repository Knowledge as the System of Record
One of the earliest and most important discoveries was about context management. The team learned that giving Codex "a map, not a 1,000-page instruction manual" worked far better.
Large instruction files created multiple problems. They crowded out the actual task, code, and relevant documentation, forcing agents to either miss critical constraints or optimize for the wrong ones. When everything is marked "important," nothing is. Agents pattern-matched locally rather than navigating intentionally. Monolithic instruction files deteriorated instantly—rules became stale while humans stopped maintaining them, creating what one engineer described as "an attractive nuisance."
The solution was structural: treat the traditional AGENTS.md file not as an encyclopedia but as a table of contents. The repository's actual knowledge base lives in a structured docs/ directory treated as the source of truth. A brief AGENTS.md file (roughly 100 lines) injected into context serves primarily as a map with pointers to deeper documentation.
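The article doesn't reproduce the file, but a table-of-contents AGENTS.md in this spirit might look like the following (headings and paths are illustrative, not the team's actual layout):

```markdown
# AGENTS.md — a map, not a manual

Start here, then follow pointers. Do not inline content from these docs.

- Architecture overview: docs/architecture/overview.md
- Domain layering rules: docs/architecture/layering.md
- Quality grades by domain: docs/quality/grades.md
- Active execution plans: docs/plans/
- Agent-first operating principles: docs/principles.md
```

The point is that every entry is a pointer, so the file stays short and stable while the referenced documents evolve with the code.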
The knowledge structure included several components. Design documentation was catalogued and indexed with verification status and core beliefs defining agent-first operating principles. Architecture documentation provided a top-level map of domains and package layering. A quality document graded each product domain and architectural layer, tracking gaps over time. Plans were treated as first-class artifacts—ephemeral lightweight plans for small changes and detailed execution plans with progress logs and decision records for complex work, all versioned and co-located in the repository.
This structure enables progressive disclosure. Agents start with a small, stable entry point and learn where to look next rather than being overwhelmed immediately. Mechanical enforcement kept the system honest. Dedicated linters and CI jobs validated that documentation stayed up-to-date, cross-linked correctly, and reflected actual code behavior. A recurring "doc-gardening" agent scanned for stale documentation and opened fix-up pull requests automatically.
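One of the mechanical checks described, validating that docs stay cross-linked correctly, can be sketched as a small linter that verifies every relative markdown link resolves to a real file (the link-matching regex and report shape are assumptions, not the team's implementation):

```python
import re
from pathlib import Path

# Matches [text](target) and strips any #anchor from the target.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#]+)(?:#[^)]*)?\)")

def broken_links(doc: Path) -> list[str]:
    """Return relative link targets in one markdown file that don't resolve."""
    broken = []
    for target in LINK_RE.findall(doc.read_text()):
        if "://" in target:  # skip external URLs
            continue
        if not (doc.parent / target).resolve().exists():
            broken.append(target)
    return broken

def lint_docs(root: Path) -> dict[str, list[str]]:
    """Map each markdown file under `root` to its broken relative links."""
    report = {}
    for doc in root.rglob("*.md"):
        if bad := broken_links(doc):
            report[str(doc.relative_to(root))] = bad
    return report
```

Run in CI, an empty report means the documentation graph is intact; a "doc-gardening" agent could consume the non-empty report as its work queue.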
The benefit extended beyond agent efficiency. This approach created a system of record that worked like effective onboarding for new team members, whether human or agent. Knowledge lived in one discoverable place, versioned with code, and mechanically validated.
Agent Legibility as the Design Principle
As the codebase evolved, a counterintuitive principle emerged: the repository was optimized first and foremost for Codex's legibility, not human readability. This doesn't mean the code was ugly or unmaintainable—rather, design decisions prioritized making the full business domain reasonably understood by an agent reading the repository.
From the agent's perspective, anything not accessible in-context while running effectively doesn't exist. Knowledge in Google Docs, chat threads, or people's heads is invisible to the system. Only repository-local, versioned artifacts—code, markdown, schemas, executable plans—exist in the agent's world.
This realization drove continuous investment in pushing context into the repository. That Slack discussion aligning the team on an architectural pattern? If it isn't discoverable to agents, it's as lost as it would be to a new hire joining three months later. Giving Codex more context meant organizing information so agents could reason over it, not overwhelming them with ad-hoc instructions.
This framing clarified many engineering tradeoffs. The team favored dependencies and abstractions that agents could fully internalize and reason about. Technologies often described as "boring"—predictable, stable, well-represented in training data—were easier for agents to model. In some cases, it was cheaper to have agents reimplement functionality subsets than work around opaque upstream library behavior. For instance, rather than importing a generic concurrency-limiting package, they implemented their own helper tightly integrated with OpenTelemetry instrumentation, with 100% test coverage and predictable behavior.
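The concurrency-limiting helper mentioned above isn't shown in the source. A minimal sketch of the shape such a helper might take, with an injectable `on_event` hook standing in for the OpenTelemetry instrumentation (the hook signature and class name are assumptions):

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

class ConcurrencyLimiter:
    """Bound the number of coroutines running at once.

    `on_event` stands in for the OpenTelemetry span hooks the article
    alludes to; by default it does nothing.
    """

    def __init__(self, limit: int,
                 on_event: Callable[[str, int], None] = lambda *_: None):
        self._sem = asyncio.Semaphore(limit)
        self._active = 0
        self._on_event = on_event

    async def run(self, coro: Awaitable[T]) -> T:
        async with self._sem:
            self._active += 1
            self._on_event("acquired", self._active)
            try:
                return await coro
            finally:
                self._active -= 1
                self._on_event("released", self._active)
```

A helper this small is trivially testable to full coverage, and every line of its behavior is visible in-context to an agent, which is exactly the tradeoff the team describes against opaque upstream packages.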
Pulling more system functionality into forms agents could inspect, validate, and modify directly increased leverage not just for Codex but for other agents working on the codebase. This represents a fundamental shift in software architecture thinking: code should be legible to AI systems that will eventually modify it.
Enforcing Architecture Through Automated Discipline
Documentation alone couldn't maintain a fully agent-generated codebase. The team enforced invariants rather than micromanaging implementations, letting agents ship fast without undermining the foundation.
The application was built around a rigid architectural model. Each business domain was divided into fixed layers with strictly validated dependency directions and limited permissible edges. Within a domain like App Settings, code could only depend "forward" through specific layers: Types → Config → Repo → Service → Runtime → UI. Cross-cutting concerns (authentication, connectors, telemetry, feature flags) entered through single explicit interfaces called Providers. Anything else was disallowed and mechanically enforced through custom linters.
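The source doesn't show the linter, but the core of a dependency-direction check is simple: assign each layer a rank and reject any import that points "forward" toward the UI. The edge representation and remediation wording below are illustrative:

```python
# Layer order from the article: code may only depend backward toward Types.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {layer: i for i, layer in enumerate(LAYERS)}

def check_edges(edges: list[tuple[str, str]]) -> list[str]:
    """Validate (importer_layer, imported_layer) pairs; return violations.

    A module may depend only on layers at or below its own rank, so an
    importer must rank >= the layer it imports.
    """
    violations = []
    for importer, imported in edges:
        if RANK[importer] < RANK[imported]:
            violations.append(
                f"{importer} -> {imported}: illegal forward dependency; "
                f"move the shared code down to `{importer}` or below, or "
                "route the cross-cutting concern through a Provider interface"
            )
    return violations
```

In a real linter the edges would be extracted from import statements per domain; the check itself stays this small, which is why it can be enforced on every pull request.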
This is the kind of architecture normally seen only in organizations with hundreds of engineers. With coding agents, it became an early prerequisite. The constraints enable speed without decay or architectural drift. In human-first workflows, such strict rules might feel pedantic. With agents, they become multipliers—once encoded, they apply everywhere simultaneously.
Beyond structural rules, the team enforced "taste invariants" through custom linting. They statically enforced structured logging, naming conventions for schemas and types, file size limits, and platform-specific reliability requirements. Critically, custom lints included remediation instructions in error messages, injecting helpful context directly into agent workflows.
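Two of those taste invariants, structured logging and file size limits, can be sketched as a lint whose findings embed the remediation text, so the guidance lands directly in the agent's context on failure. The specific rules, limit, and doc path here are assumptions:

```python
import re
from pathlib import Path

PRINT_RE = re.compile(r"^\s*print\(", re.MULTILINE)
MAX_LINES = 500  # hypothetical file-size invariant

def lint_file(path: Path) -> list[str]:
    """Apply two illustrative 'taste invariants' to one Python source file.

    Each finding carries its own remediation instructions, mirroring the
    team's practice of injecting fix-it context into agent workflows.
    """
    text = path.read_text()
    findings = []
    if PRINT_RE.search(text):
        findings.append(
            f"{path}: bare print() found. Remediation: use the structured "
            "logger so output carries trace context and is queryable via LogQL."
        )
    if text.count("\n") + 1 > MAX_LINES:
        findings.append(
            f"{path}: exceeds {MAX_LINES} lines. Remediation: split the file "
            "by responsibility per the layering guidance in docs/."
        )
    return findings
```

The error message is the interface: an agent that sees the finding also sees exactly what to do about it, without a round-trip to a human.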
The team remained explicit about where constraints matter and where they don't. This resembles leading large platform organizations: enforce boundaries centrally, allow autonomy locally. Care deeply about boundaries, correctness, and reproducibility. Within those boundaries, grant teams—or agents—significant freedom in expressing solutions.
The resulting code doesn't always match human stylistic preferences, and that's intentionally acceptable. As long as output is correct, maintainable, and legible to future agent runs, it meets the standard. Human taste fed back continuously. Review comments, refactoring pull requests, and user-facing bugs were captured as documentation updates or encoded into tooling. When documentation fell short, rules were promoted directly into code.
Merge Philosophy in High-Throughput Systems
As Codex throughput increased, many conventional engineering norms became counterproductive. The repository operated with minimal blocking merge gates. Pull requests remained short-lived. Test flakes were often addressed with follow-up runs rather than blocking progress indefinitely.
In a system where agent throughput far exceeds human attention, corrections are cheap and waiting is expensive. This tradeoff would be irresponsible in low-throughput environments but became correct here. The team embraced rapid iteration: ship early, validate through real-world usage, and fix issues quickly when they appear.
This required rethinking merge practices. Traditional engineering emphasizes pre-merge validation because fixing issues post-merge is expensive in human time. When agents can ship corrections faster than humans can clear review queues, the economics flip entirely.
The True Scope of Agent Generation
When the team describes the codebase as agent-generated, they mean comprehensively: everything. Agents produce product code and tests, CI configuration and release tooling, internal developer tools, documentation and design history, evaluation harnesses, review comments and responses, scripts managing the repository itself, and production dashboard definitions.
Humans remain in the loop but work at a different abstraction layer. They prioritize work, translate user feedback into acceptance criteria, and validate outcomes. When agents struggle, it signals missing tools, guardrails, or documentation. Humans feed these signals back by having Codex write the fixes.
Agents use standard development tools directly. They pull review feedback, respond inline, push updates, and often squash-merge their own pull requests. The development workflow integrated agents as first-class participants rather than treating them as experimental tools.
Progressive Automation: From Task to Feature
As more development loop steps were encoded—testing, validation, review, feedback handling, recovery—the repository reached a meaningful threshold. Codex can now drive new features end-to-end.
Given a single prompt, the agent validates the current codebase state, reproduces reported bugs, records demonstration videos, implements fixes, validates through application testing, records resolution videos, opens pull requests, responds to feedback, detects and remediates build failures, and escalates only when human judgment is required. It can merge changes when all conditions are met.
This autonomy depends heavily on specific repository structure and tooling and shouldn't be assumed to generalize without similar investment—at least not yet. But it demonstrates the potential: as systems become sufficiently well-specified, agents can drive increasingly complete workflows.
Managing Entropy: Garbage Collection for Code
Full agent autonomy introduced novel problems. Codex replicates existing patterns—even uneven or suboptimal ones. Over time, this inevitably created drift. Initially, humans addressed this manually, spending 20% of their week ("Friday cleanup") addressing what they called "AI slop." It didn't scale.
Instead, the team encoded "golden principles"—opinionated, mechanical rules keeping the codebase legible for future agent runs. Examples included preferring shared utility packages over hand-rolled helpers to centralize invariants, and validating data boundaries rather than probing "YOLO-style" without type safety.
On regular cadences, background Codex tasks scan for deviations, update quality grades, and open targeted refactoring pull requests. Most can be reviewed in under a minute and automerged. This functions like garbage collection: technical debt compounds like high-interest loans, and continuous small payments prevent painful bursts.
Human taste is captured once, then enforced continuously on every line of code. Bad patterns are caught and resolved daily rather than spreading for weeks. This approach—treating code quality as a continuous process rather than periodic cleanup—fundamentally changed how the team managed a large codebase.
Ongoing Learning and Future Challenges
This strategy worked well through internal launch and OpenAI adoption. Building real products for real users anchored investments in reality and guided long-term maintainability decisions.
What remains unknown is how architectural coherence evolves over years in fully agent-generated systems. The team is still learning where human judgment adds maximum leverage and how to encode that judgment so it compounds. They also don't know how systems evolve as models become increasingly capable.
What's become clear: building software still demands discipline, but discipline appears more in scaffolding than code. Tooling, abstractions, and feedback loops maintaining codebase coherence grow increasingly important.
The team's hardest challenges now center on designing environments, feedback loops, and control systems helping agents accomplish their mission: building and maintaining complex, reliable software at scale. As agents take on larger portions of the software lifecycle, these questions matter more than ever.
Conclusion
The experiment of building production software with zero manually-written code reveals a fundamental truth about the future of software engineering: the bottleneck is shifting from code production to system design. Teams that invest in clear architecture, comprehensive documentation, and mechanical enforcement of principles will enable agents to operate with remarkable efficiency. The most successful engineering teams won't be those writing the most code, but those designing the best environments for AI agents to work effectively. As you consider where to invest your effort in your own organization, focus on building systems, not just features. The ability to design environments and feedback loops will determine who thrives as AI agents become integral to software development.
Original source: Harness engineering: leveraging Codex in an agent-first world