When I started this newsletter, "harness engineering" was a term just starting to crop up. Now it's a household term in the community, and there's a lot of great material on it - most of it on building agentic systems with frameworks like LangGraph, or how coding agent tools like Claude Code and Codex work under the hood.
But I'm writing for people coding with agents, not necessarily building them. So why should you care about harness engineering? You're not building the harness, you're just using it - right?
Not quite. The coding agent is only half of the harness equation. The other half is everything you bring to it: context, skills, specs, feedback loops, steering. Birgitta Böckeler calls this the "outer harness" (and it's such an important piece that I'm assigning her post as required reading after you finish this).
In his AI Daily Brief, Nathaniel Whittemore dubbed this split the "inner" vs "outer" harness, and I'll use these terms throughout. First, the inner harness - the core architecture and what it provides. Then the outer harness, and the piece I think is still missing.
What the coding agent gives you
In the world of coding, the inner harness is the coding agent itself - Claude Code, Codex, etc. Akshay Pachaar's thread is the clearest overview I've seen of how inner harnesses work. Based on my synthesis of that and a few other sources, the inner harness decomposes into seven components:
Orchestration loop. The core driver - ReAct, plan-and-execute, or some hybrid. Charriere (The Great Convergence) calls this the commoditizing essence of modern agents: goal + tools + loop until done. It's what everyone means when they say "agent."
Tool interface. File reads and edits, shell, grep, web search, MCP adapters - Millidge's analogy: tools are device drivers.
Context management. What's in the working window right now - compaction, trimming, just-in-time retrieval. The inner harness decides what the model sees and when.
State & session persistence. What outlives a single turn or crash - checkpoints, session APIs, or git history as state.
Guardrails & permissions. What the loop is allowed to do without asking - tool scoping, sandboxing, human-in-the-loop escape hatches.
Subagent orchestration. Spawning isolated helpers with fresh context - fork, teammate, worktree - so subagent output doesn't drown the parent's context.
Extension surfaces. The seams the outer harness plugs into - hook events, plugin slots (skills/, agents/, commands/), MCP servers, auto-loaded briefing files (CLAUDE.md, AGENTS.md), session lifecycle events. This is the only component in the inner harness that exists for the outer harness; everything else on the list is internal.
These seven are my distillation of several overlapping schemas - Pachaar's 12-component list, Osmani's behavioral breakdown, Aetna Labs' three-layer model, Anthropic's Managed Agents framing - which all map onto the same core set.
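The first component - goal + tools + loop until done - can be sketched in a few lines. Everything here (`call_model`, the tool table) is a stand-in for a real model API and tool interface, not any specific vendor's:

```python
def call_model(history):
    """Stand-in for an LLM call: returns a tool request or a final answer.

    A real implementation would send `history` to a model API; here we
    hard-code one tool call followed by completion, to show the loop shape.
    """
    if not any(msg["role"] == "tool" for msg in history):
        return {"type": "tool_call", "tool": "read_file", "args": {"path": "README.md"}}
    return {"type": "final", "text": "done"}

TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",  # stand-in tool
}

def run_agent(goal, max_turns=10):
    history = [{"role": "user", "content": goal}]
    for _ in range(max_turns):            # loop until done (or turn budget)
        action = call_model(history)
        if action["type"] == "final":     # model decides the task is complete
            return action["text"]
        result = TOOLS[action["tool"]](**action["args"])  # tool interface
        history.append({"role": "tool", "content": result})
    raise RuntimeError("turn budget exhausted")

print(run_agent("summarize the README"))  # → done
```

Everything else on the list - context management, persistence, guardrails - hangs off this loop.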
The discipline of inner harness engineering is headed in a clear direction: "The field is moving toward thinner harnesses as models improve."
Anthropic is publicly deleting harness machinery as newer models handle what scaffolding used to do. Garry Tan is teaching the same principle at YC: "thin harness, fat skills" - keep the harness to ~200 lines and push everything else into reusable skill files the model loads on demand.
The inner harness is commoditizing (everyone ships roughly the same components) and thinning (control logic is being removed). If that's true, the interesting question is what you layer on top - which is the outer harness.
The harness only you can provide
"A well-built outer harness serves two goals: it increases the probability that the agent gets it right in the first place, and it provides a feedback loop that self-corrects as many issues as possible before they even reach human eyes." - Böckeler

If the inner harness provides a set of core capabilities, the outer harness is everything you bring to it. Böckeler's framework breaks it into two categories: feedforward controls and feedback controls.
Feedforward controls, or "guides", are everything that shapes behavior before the agent acts, with the goal of preventing mistakes before they happen. They come in several flavors:
Guidance - CLAUDE.md files, architecture docs, coding conventions. Either auto-loaded by the agent or indexed so the agent reads them on demand when relevant.
Skills - reusable procedures the agent activates based on their description matching the task.
Specs - instructions the human explicitly tells the agent to read and follow. (I wrote about spec-driven development in Think Before You Prompt.)
On the other side are feedback controls - post-action observers "optimised for LLM consumption." (She calls these "sensors.") Deterministic feedback comes from tools with fixed, repeatable outputs: linters, type checkers, test runners, build scripts. LLM-based feedback uses a second model to evaluate what the first model produced: code reviewers, spec-compliance checkers, evaluator agents - or the agent itself closing what Osmani calls the "self-verification loop" by observing its own output through a browser or screenshot tool.
Deterministic feedback catches what rules can express; LLM-based feedback catches what only judgment can - architectural drift, spec misinterpretation, subtle regressions. Boris Cherny, creator of Claude Code, noted that giving the model a way to verify its work improves quality by 2–3×. The practitioner heuristic "hooks over prompts for reliability" is a statement about preferring feedback over feedforward - feedback doesn't depend on the agent's attention. My Agent Validator tool is a configurable feedback loop runner for both types - deterministic checks and LLM-based reviews.
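A minimal sketch of the deterministic half of that feedback loop: run the sensors, collect failures as model-readable text. The check commands here are illustrative, not a prescribed toolchain:

```python
import subprocess

# Hypothetical feedback-control runner: after the agent edits code, run
# deterministic sensors and hand any failures back as LLM-readable feedback.
DETERMINISTIC_CHECKS = [
    ("type check", ["mypy", "."]),
    ("lint",       ["ruff", "check", "."]),
    ("tests",      ["pytest", "-q"]),
]

def run_checks(checks):
    """Run each check; collect failures formatted for model consumption."""
    feedback = []
    for name, cmd in checks:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            feedback.append(f"[{name} FAILED]\n{proc.stdout}{proc.stderr}")
    return feedback  # empty list means all sensors passed
```

Anything in the returned list goes straight back into the agent's next turn - the failure output is the prompt.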
Two other pieces round out the outer harness: persistent memory and codebase preparation. Without cross-session recall, every conversation starts cold - the agent re-learns your codebase, your conventions, your past mistakes. And agents perform dramatically better on clean, well-structured code - the outer harness isn't only what you configure, it's also what you've already cleaned up.
All four connect through the steering loop: "Whenever an issue happens multiple times, the [harness] should be improved to make the issue less probable to occur in the future, or even prevent it." When something goes wrong, you can improve a feedforward control (prevent it next time), add a feedback control (catch it next time), save it to memory (so the agent doesn't repeat it across sessions), or clean up the code that confused the agent in the first place - or some combination of the above.
The human's job is the steering loop - channeling what goes wrong into better feedforward, feedback, memory, and code. I wrote about what the human actually does in issue #3.
This is becoming one of the main functions of the human software engineering role - cultivating the harness.
The missing coordination layer
The inner harness should thin. That's a sign of progress - models handling more, vendors deleting scaffolding that's no longer needed. But the harness overall can't thin. Not yet.
The thin-harness thesis depends on models that keep getting better at following instructions. But every model upgrade changes how the model follows them - prompts that worked on Opus 4.6 break on 4.7, workflows tuned for one model need retuning for the next. When coordination lives in-context, every upgrade is a migration. And not everyone can afford the frontier model anyway.
And look at how much of the benchmark movement comes from harness changes, not just model changes. Blitzy beats GPT-5.4 on SWE-bench Pro (66.5% vs 57.7%). LLMCompiler's plan-and-execute runs 3.6× faster than ReAct. Verification loops show 2–3× quality lifts. The harness is the lever - and in each case, external orchestration is a key part of what moved the numbers.
So as the inner harness sheds structure, something in the outer harness - your harness - needs to pick up the slack. But the current tools for that have a fundamental problem.
Feedforward lives in-context, which means it degrades as the window fills. BMAD-METHOD ships workflow files full of defensive ALL CAPS: "NEVER skip steps," "Execute ALL steps in exact order," "do NOT stop because of milestones." Superpowers does the same thing with <HARD-GATE> tags. They shout because shouting is the only enforcement mechanism a prompt-based framework has. The workflow lives inside the agent's context, which makes it a suggestion, not a guarantee.
How well does that shouting work? The ECC project's chief-of-staff agent docs estimate that LLMs ignore prompt instructions roughly 20% of the time. And failures compound. Even at 99% per-step success - a far more generous estimate - a 20-step process lands at 82% end-to-end.
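The compounding arithmetic is easy to verify:

```python
# End-to-end success of a multi-step process is per-step reliability
# raised to the number of steps. The 80% and 99% figures are the two
# per-step estimates discussed above.

def end_to_end_success(per_step: float, steps: int) -> float:
    return per_step ** steps

print(f"{end_to_end_success(0.80, 20):.1%}")  # ~1.2%: a 20% ignore rate is fatal
print(f"{end_to_end_success(0.99, 20):.1%}")  # ~81.8%: even 99% leaks badly
```

At a 20% per-step failure rate, a 20-step workflow almost never survives intact.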
Feedback mitigates this - but only if it actually runs every time. The current solution is hooks - and hooks help. But hooks are callbacks: the agent acts, your code reacts. You get execution points like "before every tool call" and "after every stop," not control sequences like "run steps A, B, C in this order and don't proceed until each one passes."
What's missing is the inverse - not the agent calling your code, but your code telling the agent what to do next. A deterministic workflow layer that lives outside the agent's context window - where sequencing is enforced by the system, not suggested in prompts.
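A sketch of what that inversion could look like - the sequence lives in code, the agent only ever sees one step at a time, and a deterministic gate decides whether to advance. All names here are hypothetical, not any shipping tool's API:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    prompt: str                   # what the agent is asked to do
    gate: Callable[[], bool]      # deterministic check; must pass to advance

def run_workflow(steps: List[Step], run_agent_step, max_retries: int = 3):
    """Drive the agent step by step: sequencing is enforced here, not prompted.

    `run_agent_step` is a stand-in for invoking the coding agent on one
    instruction. The agent cannot skip or reorder steps - the loop owns that.
    """
    for step in steps:
        for _ in range(max_retries):
            run_agent_step(step.prompt)   # agent acts on exactly one step
            if step.gate():               # the system verifies, not the model
                break
        else:
            raise RuntimeError(f"step '{step.name}' failed after {max_retries} tries")
```

The ALL CAPS disappears because it has nothing left to do: "NEVER skip steps" becomes a property of the loop, not a plea in the prompt.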
The research catches up
I've shown you where the industry is today, and what I think is missing. But what does the research say? This month, we finally have the first academic paper focused specifically on harness engineering.
Zhou, Zhang et al. published "Externalization in LLM Agents" - a 50-page academic review, 21 authors, multiple institutions. If you haven't seen it yet, it's been making the rounds. What they found lines up exactly with the gap described above.
The paper argues that when coordination lives inside the agent's context as prompts and instructions, every multi-step action becomes what they call "a fragile prompt-following exercise." Their fix isn't better prompts - it's moving coordination out of the model entirely. "Multi-step interactions need coordination: who acts next, what state transitions are allowed, when a task is complete or has failed. Protocols externalize these sequencing rules into explicit state machines or event streams, removing them from the model's inferential burden."
The model does what it's good at - reasoning, judgment, adaptation. The workflow layer does what it's good at - sequencing, validation, enforcement. Neither tries to do the other's job.
What the paper calls the "externalized interaction" protocol - a deterministic workflow layer that coordinates agents without living inside their context - is the gap I described above. Their paper names it. I'm building the solution - a free, open-source tool called Agent Runner. Releasing soon.

