In The Middle Agentic Path, I promised I would dig into code review agents in more depth. This is that deep dive.
But code review agents are only one part of the story. There is a big difference between asking an agent to run a workflow and having a workflow that runs the agent.
You can write the clearest instruction in the world: implement the task, run tests and linters, review the diff, fix failures, re-review, open the PR, wait for CI, address review comments. The agent may do most of it. But "most" is not a reliable engineering workflow.
This issue is about the validation layer I use now: local and PR code review agents, and the runner that takes your workflow from "vibes" to serious engineering.
My Validation Workflow
"Make it work, make it right, make it fast"
It starts with spec-driven development. I spend a lot of time here, producing a detailed, thorough spec that includes an overview, multiple spec files, a design doc, and a task breakdown.
Then, I don't just say "implement this spec." Here is the full workflow for implementing a new feature:
First, loop over every task file in the change. For each task, start a fresh agent that implements it using TDD.
Then Agent Validator runs static checks (linter, type checks, build, tests, etc.) and a local code review agent.
The local code review agent checks for bugs, security issues, error handling gaps, and task compliance — did the agent actually implement what was in the spec?
If Agent Validator finds failures, the same implementation session fixes them, commits, and runs Validator again in a loop.
The reviewer doesn't have to be the same CLI as the implementor. For example, you could implement in Claude Code and have Codex CLI review, or vice versa.
Once every task has gone through that loop, the "lead" agent reviews all assumptions made by the implementor agent and makes corrections where needed — surfacing anything it's not sure about back up to me at the end.
Next, the lead agent runs a "simplify" skill - reviewing the whole diff for readability, duplication, and code reuse, then making improvements.
Finally, Agent Validator runs again to address any issues introduced from the assumption corrections and simplifications.
But the validations don't stop there. After the PR is created, two PR review agents run, catching things that the local review agents missed.
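If it helps to see the shape of that loop, here's a rough sketch in TypeScript. The names (WorkflowSteps, validateUntilClean, and friends) are invented for illustration; this isn't Agent Validator's or Agent Runner's actual API.

```typescript
// A sketch only: these names are invented for illustration, not Agent
// Validator's or Agent Runner's actual API.

type Findings = { passed: boolean; issues: string[] };

interface WorkflowSteps {
  implementTask(taskFile: string): Promise<void>;    // fresh agent per task, TDD
  validate(): Promise<Findings>;                      // static checks + local code review agent
  fix(issues: string[]): Promise<void>;               // same implementation session fixes and commits
  reviewAssumptions(): Promise<void>;                 // lead agent checks the implementor's assumptions
  simplify(): Promise<void>;                          // "simplify" skill over the whole diff
  openPrAndAwaitReviews(): Promise<void>;             // PR, CI, PR review agents
}

async function runFeatureWorkflow(tasks: string[], steps: WorkflowSteps) {
  // Validate-fix loop, reused after implementation and after cleanup.
  const validateUntilClean = async () => {
    let findings = await steps.validate();
    while (!findings.passed) {
      await steps.fix(findings.issues);
      findings = await steps.validate();
    }
  };

  for (const task of tasks) {
    await steps.implementTask(task);    // fresh implementor agent, TDD
    await validateUntilClean();         // linter, types, build, tests + local review
  }

  await steps.reviewAssumptions();      // lead agent pass over the whole change
  await steps.simplify();               // readability, duplication, code reuse
  await validateUntilClean();           // catch anything the cleanup introduced
  await steps.openPrAndAwaitReviews();  // PR review agents as the final net
}
```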
Whew - that must be tedious getting through every step of the whole workflow, right?
Not at all. (It runs fully autonomously while I'm doing something else.)
But how often does the agent actually run all those steps without skipping a single validation, or stopping with "ready to create PR when you are"?
Every single time. (Even on days when it's feeling lazy.)
How do I get the whole workflow to run consistently without babysitting the agent? I'll get to that at the end. But first, let me explain what code review agents actually are, and how the ones I tested performed.
Feedback Controls
In Harnesses Explained, I talked about feedback controls as a core part of the outer harness: post-action observers that catch issues after the agent acts.
There are two kinds. Some feedback controls are deterministic: linters, type checks, tests, security scanners. They are fast, computational, and boring in the best way. If the type checker says the code doesn't compile, there isn't much to debate.
The other kind is inferential. This is where code review agents fit. They are slower and messier because they use a model to make a judgment: did this change hide a bug, miss an error path, violate the spec, or create a risk that normal checks don't know how to express? Agent Validator runs both types locally.
PR review agents sit later in the workflow, after the branch becomes a PR, but they are the same kind of harness component: post-action observers trying to catch issues before I trust the change.
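To make the two kinds concrete, here's a small sketch, with invented names, of what they look like as post-action observers: a deterministic control that shells out to the type checker, and an inferential one that hands the diff to a model.

```typescript
// Invented names, not any tool's real API. The point is the split:
// deterministic controls run real tools; inferential controls ask a model.

import { execSync } from "node:child_process";

type Finding = { source: string; message: string };

interface FeedbackControl {
  name: string;
  run(diff: string): Promise<Finding[]>;
}

// Deterministic: fast, computational, nothing to debate.
const typeCheck: FeedbackControl = {
  name: "typecheck",
  async run() {
    try {
      execSync("npx tsc --noEmit", { stdio: "pipe" });
      return [];
    } catch (err) {
      return [{ source: "typecheck", message: String(err) }];
    }
  },
};

// Inferential: slower and messier, because a model makes the judgment.
// `askModel` stands in for whatever CLI or API actually runs the review.
function codeReview(askModel: (prompt: string) => Promise<Finding[]>): FeedbackControl {
  return {
    name: "code-review",
    run: (diff) =>
      askModel(`Review this diff for bugs, security issues, error-handling gaps, and spec compliance:\n${diff}`),
  };
}
```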
How Do Code Review Agents Work?
Sometimes I come across code and think, "who the heck wrote this?", only to check the git blame and realize the author was me. The situation is not that different with LLMs. Even asking the same model to review its own code in a fresh context window helps a lot. Cross-model review catches even more issues. That's the rationale for reviewing locally, before the PR even exists.
The basic loop is the same everywhere: gather the code and surrounding context, ask a model to judge it against a prompt or policy, and turn likely problems into comments or structured findings. What varies is when they run and what context they get.
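A minimal sketch of that loop, assuming a stand-in askModel call and made-up field names, looks something like this:

```typescript
// Made-up names and a stand-in askModel call; real tools vary in what
// context they gather and where findings end up.

type ReviewFinding = {
  file: string;
  line: number;
  severity: "info" | "warning" | "error";
  message: string;
};

interface ReviewContext {
  diff: string;                       // the change under review
  spec?: string;                      // task or spec text, when available
  priorFindings?: ReviewFinding[];    // output from earlier review rounds
}

async function reviewChange(
  ctx: ReviewContext,
  askModel: (prompt: string) => Promise<string>,
): Promise<ReviewFinding[]> {
  // 1. Gather the code and surrounding context into one prompt.
  const prompt = [
    "Review this change for bugs, security issues, error-handling gaps, and spec compliance.",
    ctx.spec ? `Spec:\n${ctx.spec}` : "",
    ctx.priorFindings?.length ? `Already reported:\n${JSON.stringify(ctx.priorFindings)}` : "",
    `Diff:\n${ctx.diff}`,
    "Respond with a JSON array of {file, line, severity, message}.",
  ].join("\n\n");

  // 2. Ask the model to judge the change against the prompt.
  const raw = await askModel(prompt);

  // 3. Turn likely problems into structured findings.
  //    (A real implementation would validate this instead of trusting it.)
  return JSON.parse(raw) as ReviewFinding[];
}
```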
In Agent Validator, the review runs before the PR exists. It can look at the local diff, the task or spec, and the output from the previous round of reviews. Then it returns structured findings back into the same implementation loop, so the implementor can fix them before moving on to the next task.
On the other hand, many specialized tools run review agents after the PR is opened. The PR diff is the trigger and the place where comments get reported, but some tools also pull in repository context, prior PR history, project rules, or static analysis results before asking the model to judge the change.
Local and PR review agents are doing the same basic thing: context-aware, model-judged review. Sometimes they are wrong. But when they get the right context, they catch problems that can't be caught by deterministic checks alone.
So how well do they actually work? I evaluated both: local code review agents in Agent Validator, and PR review agents on real PRs. If you don't care about the eval details, skip to Putting it All Together for the payoff.
Validating the Validator
I built an eval fixture with 56 planted issues across TypeScript, Python, and Go — bugs, security holes, swallowed errors — and ran Agent Validator's local code review agents against them with different models, adapters, and prompt strategies. I'll cut to the chase.
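For context on the numbers below, scoring works roughly like this: match each run's findings against the planted issues, then compute recall and precision from the overlap. A crude sketch, with invented names and a deliberately naive matcher:

```typescript
// Invented names and a deliberately naive matcher; the real fixture and
// matching logic are more involved.

type PlantedIssue = { id: string; file: string; line: number };
type Finding = { file: string; line: number; message: string };

// A finding "hits" a planted issue if it lands in the same file within a
// couple of lines of where the issue was planted.
const matches = (issue: PlantedIssue, finding: Finding) =>
  issue.file === finding.file && Math.abs(issue.line - finding.line) <= 2;

function score(planted: PlantedIssue[], findings: Finding[]) {
  const found = planted.filter((issue) => findings.some((f) => matches(issue, f)));
  const truePositives = findings.filter((f) => planted.some((issue) => matches(issue, f)));

  return {
    recall: found.length / planted.length,                                   // planted issues the run caught
    precision: findings.length ? truePositives.length / findings.length : 1, // findings that were real
  };
}
```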
I expected the story to be about models. Use the strongest model, get the best results. That's not what happened. The harness mattered: Claude Code CLI with Sonnet timed out at 300 seconds, while the same Sonnet through Copilot CLI finished reliably and produced the best code-quality recall of anything I tested. Prompt strategy mattered too — for some model/adapter combos, one or two broad prompts beat three specialized ones.
The best setup was a two-agent hybrid through Copilot: Sonnet for code quality, GPT-5.3 for a combined security and error-handling pass. I ran a second round in May with newer and cheaper models — they didn't beat the April baseline. The more expensive GPT-5.5 sometimes found more, but not reliably enough to recommend.
My current default: Codex CLI with GPT-5.3 running a single combined prompt covering code quality, security, and error handling. One pass, strong precision, usually under 30 seconds. The Copilot two-pass hybrid has better recall, but Copilot is moving to token-based billing on June 1, so I can't recommend it on cost until I measure real token spend under the new system.
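Purely as an illustration of what that default amounts to (the field names are mine, not Agent Validator's schema), it's a single-pass review config along these lines:

```typescript
// Purely illustrative field names, not Agent Validator's schema.
const localReviewConfig = {
  adapter: "codex-cli",        // which CLI runs the review
  model: "gpt-5.3",
  passes: [
    {
      name: "combined",
      focus: ["code quality", "security", "error handling"],
      timeoutSeconds: 120,     // generous headroom over the ~30s a pass usually takes
    },
  ],
} as const;
```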
Reviewing the Reviewers
I also tried several PR review agents to see which ones work best for me. Caveat up front: this is a small sample set from Agent Runner and Agent Validator PRs, not a full evaluation.
The result was clear enough for my purposes: CodeRabbit and Copilot found the most meaningful issues. Greptile and Qodo found some real bugs too, but in this eval they found fewer real issues and more findings that looked weak, stale, or misread.
The best example was agent-validator#134, which added a shared trust ledger so validation state could carry across worktrees. Copilot found the important fault lines: trust being forced too broadly, brittle merge parsing, premature lock-state handling, and scope metadata that could mix checks and reviews. CodeRabbit found a real test reliability bug where process.chdir could leak global process state across tests.
That pattern held across PRs. CodeRabbit tended to find broad practical implementation risks. Copilot was strongest on workflow invariants - the kind of bugs where the harness trusts the wrong thing or fails silently. Greptile and Qodo occasionally found something real, but produced more weak or misread findings.
Greptile and Qodo are well respected, and both score well on other benchmarks. But on my Codagent projects, Copilot and CodeRabbit gave me the review layer I actually want after the PR opens. Agent Validator catches many issues locally and gives faster feedback, which keeps bugs from compounding. Copilot and CodeRabbit act as a safety net for the issues the earlier reviews missed.
Putting it All Together
Have you ever tried telling an agent to loop over every task in a spec, start a fresh implementor for each one with TDD, run a validator with static checks and a code review agent in parallel, fix failures and re-run the validator in a loop until everything passes, then do an assumptions review across the whole change, simplify the diff, run the validator one more time, create a PR, wait for CI and PR review agents, and fix what they find?
I have. How often did it do everything I asked? About as often as my teenager does when I ask him to clean his room but vacuum before he dusts, then switch the laundry but fold what's in the dryer first, then load the dishwasher but rinse everything and keep the good knives out, then take out the trash but separate the recycling (and don't stop until it's all complete).
But that workflow I just described? Unlike my kid, it completes every step, every time. Without ever asking me if I'm "ready to proceed" to the next step (or taking a break to scroll on social media).
The key is Agent Runner, which I introduced last time as the missing coordination layer in the outer harness.
Claude and Codex have task lists. They can follow instructions. They can even keep a checklist. But complex workflows with nested loops are a different problem altogether.
The usual answer is to put the workflow inside the agent's context and shout: NEVER skip steps, execute ALL steps in order, do NOT stop early. That helps, but it is still prompt-following. The workflow lives inside the context window, which makes it a suggestion, not a guarantee.
Agent Runner flips that around. Instead of the agent calling my code when it happens to remember, my code tells the agent what to do next. It is a deterministic workflow layer outside the agent's context window.
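To show the shape of that flip without spoiling next issue, here's a sketch with invented names, not Agent Runner's real API: the workflow is plain data plus a loop in ordinary code, and the agent only ever sees the prompt for the step it's on.

```typescript
// A sketch with invented names, not Agent Runner's real API. The workflow
// is plain data plus a loop in ordinary code; the agent only ever sees the
// prompt for the current step.

type StepResult = { passed: boolean; feedback?: string };

interface Agent {
  run(prompt: string): Promise<string>;   // stand-in for "spawn a CLI session with this prompt"
}

interface WorkflowStep {
  name: string;
  run(agent: Agent): Promise<StepResult>;
  onFailure?: "retry-with-feedback" | "halt";
}

async function runWorkflow(steps: WorkflowStep[], agent: Agent, maxAttempts = 5) {
  for (const step of steps) {
    let result = await step.run(agent);
    let attempts = 1;

    // The runner, not the agent, decides whether to loop or move on.
    while (!result.passed && step.onFailure === "retry-with-feedback" && attempts < maxAttempts) {
      await agent.run(`Fix the following, commit, and stop:\n${result.feedback ?? ""}`);
      result = await step.run(agent);
      attempts++;
    }

    if (!result.passed) throw new Error(`Step "${step.name}" did not pass`);
  }
}
```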
So I don't have to remember what comes next, or babysit the workflow to make sure each validation loop actually runs. Agent Runner does that for me. It runs those steps automatically and autonomously, in sequence. While I'm focused entirely on something else, like planning the next change or taking out that garbage and recycling.
PR creation is part of the workflow. Waiting for CI is part of the workflow. Addressing review comments is part of the workflow. If a validation step fails, the fix loop is part of the workflow too.
So when I come back to the task, I can trust that it's already been verified for correctness and quality. I can instead focus on what I do best: steering the ship.
Next Time
Agent Runner 0.1 drops this month! Next time, I'll show you what the workflow looks like in code.

