The machine around the model

The much-anticipated second edition is here (quite possibly the best issue yet)! Today we're going to talk all about "spec-driven development": what it is and isn't. (Is it waterfall? Does it require "writing"?)

But first, what is harness engineering again? That was the topic of my first installment, but it's such a new discipline that it's worth defining a little more comprehensively.

One of the best descriptions I've seen comes from The AI Forum's Harness Engineering: Building the OS for Autonomous Agents:

“While the AI model is the 'engine,' the harness is the 'rest of the car,' including the steering, brakes, lane boundaries, and maintenance schedules that keep the system on track. Harness engineering focuses on building a structured system around an LLM to improve reliability. Instead of changing the models, you control the environments in which they operate.”

Anatomy of a coding agent harness

A point of confusion I keep seeing: is the coding agent (e.g. Claude Code) the harness? It's certainly a big part of it. Here's how I see it:

  • Agent harness: the complete system that wraps a model to make it reliable. The model is a reasoning engine; the harness is an engineering system. “Agent = Model + Harness. If you're not the model, you're the harness.”

  • Coding agent harness: a specific type of agent harness built for software development. Two parts:

    • Coding agent: a product that wraps a foundation model for software development (Claude Code, Codex CLI, Gemini CLI, Cursor). Provides tool systems, context management, sandboxing, and the interaction model that lets the model write and modify code. A coding agent is itself a form of agent harness, engineered by the vendor.

    • Engineering layer: everything developers build around the coding agent. The coding agent handles a single session; the engineering layer handles the full lifecycle of a software change. Its patterns fall into three categories: information architecture (how you organize and feed knowledge), coordination & control (how you orchestrate and verify), and infrastructure (execution environments and tool design). I'll dig into each of these in future issues.

I've also been distilling what I see as the 19 patterns of harness engineering. We won't dig into them today, but here's a diagram showing where they fit. More on individual patterns in future issues.

What is spec-driven development?

A friend sent me a video the other day tearing into spec-driven development: "it’s just Waterfall in markdown," the usual. And I get it. The name alone makes experienced engineers flinch. But I’ve been using SDD daily for months now, and the broader numbers back up my experience: Red Hat reports a 40% reduction in code reviews, and undirected AI coding has been shown to increase code complexity by 41%, a gap that specs aim to close by forcing clarity before the agent starts.

SDD has been popping up a lot recently, so let’s talk about what it actually is and isn’t.

If you’ve used planning mode in Claude Code, or any agent’s "think before you code" equivalent, you’ve touched the edge of spec-driven development. Planning mode is the agent reasoning about how to implement: "I’ll create this file, modify that function, use this pattern." That reasoning is ephemeral (it dies with the session) and implementation-focused. The agent is planning its own work based on its own interpretation of your intent. It doesn’t significantly narrow the gap between what you meant and what gets built.

SDD is the human reasoning about what to build and why: intent, requirements, constraints, and acceptance criteria, captured as durable artifacts that persist in the repo across sessions and can be consumed by any agent or team member.

In practice, a spec is a markdown file, or maybe three. Red Hat recommends separating "what-specs" (goals, user stories, success criteria) from "how-specs" (constraints, security standards, testing requirements) into modular files. Alex Cloudstar’s starting point is even simpler: one markdown doc per feature covering what, why, technical constraints, and definition of done. Not a 200-page requirements document.
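As a rough sketch of what that one-file shape can look like (the feature and every heading here are invented for illustration, not taken from Red Hat's or Cloudstar's actual templates):

```markdown
# Feature: webhook notifications

## What
Send an HTTP POST to a user-configured URL when a build finishes.

## Why
Users currently poll the dashboard; webhooks remove that busywork.

## Technical constraints
- Reuse the existing job queue; no new infrastructure.
- Sign payloads with HMAC-SHA256.

## Definition of done
- Delivery is retried up to 3 times with exponential backoff.
- Failures after the final retry are surfaced in the UI.
```

A dozen lines, skimmable in thirty seconds, and every line is a decision the agent no longer has to guess at.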

But here’s the part that matters most, and Cloudstar puts it well: the value comes from "specification-writing thinking, not tool complexity." SDD is about thinking, not writing. Experienced developers hear "write a spec" and immediately picture formal documentation nobody reads, outdated before it’s finished, existing to satisfy a process rather than clarify thinking. This is not that.

The best approach is to have the agent interview you, a pattern pioneered by Superpowers, Jesse Vincent's agentic skills framework. Instead of writing the spec yourself, the agent asks you questions one at a time, about one capability at a time. What should happen when a webhook delivery fails? What’s the retry policy? Should users configure notification channels, or is that a later feature? It forces you to confront edge cases and design decisions you wouldn’t have surfaced on your own. That’s where the real value is: not in the document, but in the thinking the process demands. And don't worry, you’re not the one doing the writing. The agent produces the artifacts based on your answers. You review them, and that review step matters: it’s where you catch misunderstandings and hallucinated requirements. But the heavy lifting is the conversation, not the typing.

Red Hat’s Rich Naszcyniec has a good name for the problem this solves: the "encoding/decoding gap." The space between what you intend and what the AI produces. Natural language is a lossy channel for developer intent. You say "clean, reactive dashboard for monitoring cluster health" and the AI hears something close but not quite: it picks a deprecated charting library, skips authentication, and uses patterns that don’t match your codebase. The vibe was right, but the execution was off-key. A spec narrows that gap. Not because it’s magic, but because the thinking process behind it forces clarity before you prompt.

Now, the obvious objection. "Isn’t this just Waterfall?"

No. Cloudstar makes the distinction precisely: Waterfall’s failure "was caused by prohibitively high costs discovering specifications were wrong, not from specifications being inherently bad." You’d write a 200-page requirements document, hand it to a development team, and not see deliverables until the next quarter. By then, requirements had drifted, the market had shifted, and the spec was fiction. Specs didn’t fail us; the economics around specs did.

In SDD, the feedback loop is minutes long. Write a spec, hand it to an agent, review the output. If the spec was wrong, and it will be, you update it and regenerate. You’re editing a markdown file and re-running a prompt, not re-planning a quarter.

And the distinction goes deeper than speed. Waterfall specs were static, frozen artifacts handed over the wall. SDD specs, done right, are the opposite. Red Hat calls it "spec co-evolution": when an agent encounters a failure, it doesn’t just fix the immediate problem; it suggests an update to the relevant spec so the same class of error doesn’t recur. Waterfall specs degraded over time. SDD specs improve over time. Same artifact type, opposite trajectory.

My Take

SDD works, but only at the right scope. A spec for "add webhook notifications with retry logic" is small enough to evaluate in one pass. You read the output. It works or it doesn't. You adjust. The loop stays tight.

Now try speccing "build the notifications subsystem": webhooks, email digests, in-app alerts, user preferences, delivery tracking, retry logic, rate limiting. You skip the human review checkpoints, hand the whole thing to agents, and don't see output until everything is built. One wrong assumption early, say, the data model, silently compounds through every feature built on top of it. And when you finally review the output (it will be wrong), you're not fixing one thing. You're triaging across six features trying to figure out which assumption was the root cause, which interactions between features were underspecified, and which parts the agent silently reinterpreted. Untangling that is its own expensive project. Sound familiar? Big batch, late feedback: that's Waterfall by definition. It's running on faster hardware, but it's the same spiral.

There's also a case for prototyping before you spec: vibe before you spec, if you will. You can't write a good spec for something you don't understand yet. Sometimes the fastest path to a good spec is building a rough prototype, learning what you didn't know you didn't know, throwing it away, and then speccing from real understanding. And even after a proper spec-driven implementation, if the output reveals that your spec was fundamentally wrong (wrong data model, wrong abstraction boundary, wrong assumptions about how users will actually use the thing), the right move is to throw it away and re-spec from scratch rather than trying to patch your way to something coherent. Erik Bernhardsson calls this the Rule of 3: your first two attempts to solve a problem will fail because you misunderstood the problem; the third time it works. The difference now is that each attempt costs you an afternoon, not a quarter.

I also think SDD's bad reputation is partly the tooling's fault. The most prominent tools can push you toward project-scale specification from the very first step. GitHub Spec Kit runs a five-artifact pipeline: assessment, PRD, technical spec, ADR, tasks. One practitioner generated 1,300 lines of markdown just to display a date. GSD (Get Shit Done) spawns four parallel researcher agents plus a plan-checker verification loop per phase. A single bug fix can consume 100+ agent spawns. These tools have their place, but for most day-to-day work, that's a lot of ceremony before you've written a line of code, and it's no surprise that some people look at the output and say, "this is just Waterfall in markdown." I use OpenSpec specifically because it's lighter: just markdown, scoped to a single change, no backlog planning or milestone ceremonies. More on this next time.

Codagent Update

So what does SDD look like when the tooling stays out of the way? That's the question I've been trying to answer with Agent Skills. What sets it apart from other skills libraries is the end-to-end arc: from structured planning through autonomous execution, connected by a chain of constrained artifacts. The execution pipeline (Agent Runner) is still in development, but the planning workflow is available now, and that's where most of the value lives.

The planning stage is collaborative: you and the agent, thinking together. The skills walk you through the full arc: /propose honestly evaluates whether the idea is worth building, then /spec moves into interview-driven discovery where the agent asks you questions one capability at a time, capturing behavioral contracts and testable scenarios. From there, the agent brainstorms architectural approaches through dialogue and produces a design with decisions and trade-offs. Finally, it decomposes the change into right-sized, independently implementable tasks, each self-contained with its own goal, context, and done criteria. /review lets you verify that the artifacts are a cohesive whole before moving on. You're in the loop the whole time, approving artifacts and catching bad assumptions early.

The execution stage is autonomous. The agent dispatches a fresh subagent per task, runs Agent Validator as a quality gate, and commits on success. Then it handles the last mile: push, wait for CI, fix failures, repeat, with hard limits so it doesn't loop forever. It only comes back to you if something breaks that it can't fix.
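The shape of that loop matters more than any one tool, so here's a minimal sketch of it in Python. Every function name here (`dispatch_subagent`, `validate`, and so on) is a hypothetical stand-in passed as a parameter, not the actual Agent Runner API:

```python
# Hedged sketch of the autonomous execution stage described above.
# All callables are hypothetical stand-ins injected by the caller;
# the real Agent Runner internals are not shown here.

MAX_CI_ATTEMPTS = 3  # hard limit so the push/fix loop can't run forever


def execute(tasks, dispatch_subagent, validate, commit, push, wait_for_ci, fix_failure):
    for task in tasks:
        result = dispatch_subagent(task)        # fresh subagent per task
        if not validate(result):                # quality gate (Agent Validator)
            return f"needs human help: {task}"  # escalate instead of guessing
        commit(result)                          # commit on success
    push()
    for _attempt in range(MAX_CI_ATTEMPTS):     # push, wait for CI, fix, repeat
        if wait_for_ci():
            return "done"
        fix_failure()
        push()
    return "needs human help: CI still failing"
```

The two escape hatches are the whole design: a failed quality gate or an exhausted retry budget hands control back to the human rather than letting the agent thrash.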

For managing spec documents, I recommend OpenSpec. OpenSpec defines what the artifacts are and the process for keeping them current as your codebase evolves. Agent Skills defines the process for how each artifact gets created: the interviews, the brainstorming, the review gates. They're complementary: OpenSpec is the "what," Agent Skills is the "how." You can use Agent Skills on its own, but together you get the spec co-evolution piece, specs that stay accurate as living documents rather than rotting the moment implementation starts.

This maps directly to what I said above: SDD's value comes from the thinking, not the format. Agent Skills enforces that thinking as a process. You can't skip straight to implementation because the implement skill reads from task files, which read from the design, which read from the specs, which read from the proposal. Each artifact constrains the next.
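That dependency chain can be sketched as a simple gate, where no artifact may be created until everything upstream of it exists. The file names here are illustrative, not the actual Agent Skills layout:

```python
# Hedged sketch of the artifact chain: each stage refuses to start
# unless the artifacts it depends on already exist. File names are
# invented for illustration.
import os

CHAIN = ["proposal.md", "specs.md", "design.md", "tasks.md"]


def can_create(artifact, root="."):
    """An artifact may be created only if every upstream artifact exists."""
    idx = CHAIN.index(artifact)
    return all(os.path.exists(os.path.join(root, name)) for name in CHAIN[:idx])
```

So with only a proposal on disk, you can start the specs, but asking for a design is refused: the process itself enforces the thinking order.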

Specs aren't the point; thinking is the point, and specs are just what the thinking produces. The tools should make that thinking easier, not bury it under ceremony. Agent Skills is available now as a plugin for Claude Code or Cursor (other agents coming soon); check out codagent.dev to get started. Next time: more on why the right unit of work is a feature, not a project.
