Why both sides are wrong about human-in-the-loop
There are two loud camps in agentic coding right now. I think they're both wrong.
Camp one says you need a human reviewing every line of AI-generated code before it ships. It sounds responsible and "serious" - yet it leaves reviewers increasingly swamped.
Camp two says let the agents do everything. Steve Yegge's Gas Town epitomizes this idea - 20-30 simultaneous agents with specialized roles, plus an audacious vision called "The Wasteland" that federates thousands of these teams together.
Somewhere between the swamp and the wasteland, there's a pragmatic path being paved: the human stays in the loop for every change - not wading through diffs, but charting the course. Exercising taste and judgment, and staying close enough to the ground to sense when things start to drift off course.
Code review is dead
The "review everything" camp has strong advocates. Addy Osmani argued that "the human engineer remains firmly in control... reviewing and understanding every line of AI-generated code."
I humbly beg to differ. I argued in December that the era of mandatory code reviews was over, and the case is getting stronger every day.
DORA's 2025 report found that AI-assisted teams produce 98% more PRs while review time increases 91%. Something's got to give. And beyond the throughput problem, most of what code review is for can now be done better by machines.
What does a reviewer actually check? Maintainability - complexity, file size, dead code - is largely enforceable with static analysis and linter rules. Adherence to team and project standards is increasingly handled by AGENTS.md files and rule configurations; AI reviewers cross-reference these too, resulting in very high compliance. Catching bugs, security vulnerabilities, swallowed errors - this is where AI code review tools shine.
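The machine-checkable slice of that list can be wired into a simple gate. Here's a minimal sketch in Python - the tool commands are purely illustrative (ruff, mypy, pytest stand in for whatever your stack actually uses), and the runner itself is a toy, not Agent Validator's implementation:

```python
import subprocess
import sys

# Illustrative deterministic checks; substitute your project's real tools.
CHECKS = [
    ("lint", [sys.executable, "-m", "ruff", "check", "."]),
    ("types", [sys.executable, "-m", "mypy", "src/"]),
    ("tests", [sys.executable, "-m", "pytest", "-q"]),
]

def run_checks(checks):
    """Run each command; return (name, output) pairs for the ones that failed.

    The captured output is what you'd feed back to the agent as context.
    """
    failures = []
    for name, cmd in checks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((name, result.stdout + result.stderr))
    return failures
```

The point is that everything in this category is deterministic: a command either exits zero or it doesn't, so no human attention is needed to enforce it.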
Developers routinely miss 30-40% of defects in code review. Under time pressure, that climbs. Reviews of large diffs degrade fast - tracking state across hundreds of lines of unfamiliar code exceeds working memory. The subtle bugs survive, which is precisely the class of bug that AI agents are most likely to introduce.
And there's a waste problem. Senior engineers - the ones whose review is most valuable - are also the most valuable for system design, architecture, and mentorship. Stop using your most expensive people as a quality gate for work that machines can now handle on most PRs.
The tooling is already here. Greptile reports an 82% bug catch rate in benchmarks across real-world PRs, hitting 100% on high-severity bugs. The Linux Foundation told CodeRabbit it catches 90% or more of customer-facing bugs and cut their review time in half. On Martian's independent benchmark - the only one not run by a vendor - Qodo scored 64.3%, ten points ahead of the next competitor. These numbers aren't perfect, and every vendor's own benchmark conveniently ranks themselves first. But AI review tools are already catching bugs that human reviewers routinely miss.
To be clear: I'm not saying human review is without value. A senior engineer reviewing a tricky concurrency change or a security-critical auth flow - that's time well spent. What I am saying is that mandatory human review on every PR costs more than it's worth. The five-day queues, the rubber-stamping, the false sense of security from reviews that miss defects - for the majority of changes shipping today, that's a bad trade.
You don't need a colony
On the other end of the spectrum, Steve Yegge is betting that the future belongs to multi-agent colonies. He describes Gas Town, an orchestrator that coordinates 20-30 parallel Claude Code instances with specialized roles - each making autonomous decisions while a merge queue tries to keep their parallel work from colliding. The Wasteland extends this further: a federated network linking thousands of Gas Towns together, with a shared work board and a reputation system. He concludes: "Colonies are going to win. Factories are going to win. Automation is going to win."
Perhaps. But nature also spent millions of years evolving the coordination mechanisms that make colonies work. We've had coding agents for about eighteen months. I'm not ready to hand over my project to a colony.
At project scale, failures compound. You spin up a colony, hand it a project, and don't see output until everything is built. One agent makes a wrong assumption early - say, the data model - and every agent building on top of it inherits that assumption silently, confidently. Glen Rhodes calls this "silent drift": Agent A's flawed output becomes Agent B's unvalidated input, and so on. When you finally review the output, you're not fixing one thing. You're triaging across a dozen features trying to figure out which assumption was the root cause, which agent silently reinterpreted a constraint, and which agents started depending on each other in ways nobody planned for. The more agents you run in parallel, the more surfaces for drift - and untangling it is its own expensive project.
To be clear, I'm not arguing against using multiple agents - features like "teams and swarms" can be valuable for feature-level implementation. I'm arguing against agents running autonomously at project scale, where coordination complexity and drift management become the dominant problems.
What the human actually does
No spec is perfect. No matter how sharp your specification is, there will be gaps - edge cases you didn't think of, design decisions you left implicit, constraints you carry in your head but didn't write down. The agent will fill every one of those gaps with its best guess, silently, confidently. AI review tools will catch some of those - the ones that manifest as code-level issues. But the assumptions that are contextually wrong, aesthetically off, or architecturally misguided? Those pass every automated check.
This is why you need a human checkpoint after each change, before continuing. Not to review the code line by line - but to exercise judgment that no tool can automate. Those judgments require being close enough to the work to have an informed opinion.
Taste. When a feature comes back, I look at it - but not the way a traditional code reviewer does. Taste lives in using the thing, not reading the diff. An agent can pass every test and still produce something that feels wrong. Osmani puts it well: "Run the application, click through the UI, use the feature yourself. When higher stakes are involved, read more code and add extra checks. And despite moving fast, fix ugly code when you see it rather than letting the mess accumulate."
Architectural coherence. I'm not reviewing implementation details - I'm asking whether this feature's approach is coherent with the system's direction. Does this create a dependency I'll regret? Does this abstraction make the next three features easier or harder? OpenAI's harness engineering team had one engineer whose role was exactly this - maintaining architectural vision while "shipping code he doesn't read line by line."
Bullshit detection. Sometimes I look at clean, passing, well-structured code and still have a sense that something is off. Maybe the agent solved the wrong problem. Maybe it made an assumption that's invisible in the code but obvious if you know the domain.
Learning and steering. Each cycle teaches me something that sharpens the next spec. And the next thing I planned may no longer be the right next thing to build - I can only know that if I've engaged with what just shipped.
Underneath all of it is accountability. As an old IBM training slide put it: "A computer can never be held accountable." Being held accountable is your job as the human in the loop. No matter how much AI contributed, a human must own the result.
You can't exercise taste over thirty parallel agents' output - the volume overwhelms the intuition. You can't maintain architectural coherence when thirty streams are making independent micro-decisions. You can't smell that something is off when you're distant from all of it. And you can't learn from the last feature to sharpen the next spec when everything is running at once. The colony model doesn't just skip the human checkpoint - it makes it structurally impossible.
Codagent Update
If you want a taste of the possible future, check out Gas Town. But if you're doing serious engineering work today, Codagent is building open-source tools for that.
One of them is Agent Validator, a configurable "feedback loop" runner. It doesn't provide taste, steering, or bullshit detection - that's still the human's job - but it will catch bugs, enforce standards, and check for security holes. It runs a hybrid validation pipeline: deterministic checks (your tests, linters, type-checkers, security scanners) combined with cross-agent AI review.
Agent Validator runs during development, not after a PR is opened. The agent writes code, Agent Validator runs checks and cross-agent reviews, the agent reads failures and iterates - in a tight loop until everything passes. By the time you see the result, it's already been put through "the gauntlet".
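The loop itself is simple in shape. Here's a sketch of the control flow - `run_validation` and `ask_agent_to_fix` are hypothetical stand-ins for whatever your validator and coding agent actually expose, not Agent Validator's real API:

```python
def validation_loop(run_validation, ask_agent_to_fix, max_rounds=5):
    """Iterate until the validation pipeline passes or we give up.

    run_validation()           -> list of failure messages ([] means green)
    ask_agent_to_fix(failures) -> asks the agent to revise the code
    Returns (passed, rounds_used).
    """
    for round_num in range(1, max_rounds + 1):
        failures = run_validation()
        if not failures:
            return True, round_num   # green: ready for the human checkpoint
        ask_agent_to_fix(failures)   # agent reads the failures and iterates
    return False, max_rounds         # stuck: escalate to the human
```

The property that matters is the exit condition: the loop hands off to a human either when everything passes or when the agent stops making progress - it never spins forever or fails silently.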
I also recommend pairing this with an AI-based PR review tool. The commercial options - CodeRabbit, Greptile, Qodo - are effective but not free, typically ~$25/seat/month or more (with tight review quotas). If that's out of budget, Agent Validator already gives you cross-agent review for cheap (or even free, if you have an underutilized subscription to Codex or Copilot). It's open-source and works with whatever CLI tools you already have. Check out https://github.com/Codagent-AI/agent-validator to get started.
Next issue: I'll dig into AI code review in more depth: a review of the commercial PR-based tools, as well as my own evaluation of local cross-agent review comparing Claude, Codex, and GitHub Copilot. One agent is leading the pack and one is already ruled out. The results weren't what I expected - more on that next time.

