Agent Harness Engineering

Source addyosmani.com/blog/agent-harness-engineering Published May 19, 2026

Harness engineering — the practice of building and iterating on the scaffolding around a model — is where the real leverage lies in making coding agents reliable and capable, often more so than the choice of model itself.

A coding agent is the model plus everything built around it — the harness — and treating that scaffolding as a real artifact is the core of harness engineering.
Every agent mistake should be treated as a permanent signal that adds a specific rule or constraint, creating a ratchet where failures reduce over time.
The most impactful harness components are filesystem and Git for durable state, bash for general-purpose tool execution, sandboxes for safe environments, and memory files for continual learning.
Hooks enforce constraints automatically (e.g., blocking destructive commands, running tests after edits) and make feedback loops nearly free in the common case.
As models improve, the space of interesting harness problems shifts rather than shrinks — better models unlock harder tasks with new failure modes that require fresh scaffolding.

What is a harness?

A harness is every piece of code, configuration, and execution logic that isn't the model itself. A raw model is not an agent — it becomes one once a harness gives it state, tool execution, feedback loops, and enforceable constraints. Concretely, a harness includes system prompts, tool definitions, bundled infrastructure (filesystem, sandbox, browser), orchestration logic (subagent spawning, handoffs, model routing), hooks and middleware, and observability tooling.

The equation is simple: coding agent = AI model(s) + harness. The debate over the left-hand side is loud, but most of the actual leverage sits on the right. A decent model with a great harness beats a great model with a bad one — a pattern the author has observed repeatedly in their own work.

The "skill issue" reframe

When an agent does something dumb, the default reaction is to blame the model and wait for a better version. The harness engineering mindset rejects that. Failures are usually legible: the agent didn't know a convention, so you add it to AGENTS.md; it ran a destructive command, so you add a hook that blocks it; it got lost in a long task, so you split it into a planner and executor.

A striking data point: on Terminal Bench 2.0, Claude Opus 4.6 scores far lower inside Claude Code than the same model in a custom harness. One team moved from Top 30 to Top 5 by changing only the harness. The gap between what today's models can do and what you see them doing is largely a harness gap.

Designing a harness from behavior

The most useful framing for designing a harness is to start from the behavior you want and derive the component that delivers it. Every harness component should have a specific job — if you can't name the behavior it exists to deliver, it probably shouldn't be there.

Key components include: the filesystem and Git for durable state; bash for general-purpose tool execution; sandboxes for safe, scalable execution; memory files like AGENTS.md for continual learning through context injection; and techniques to battle context rot such as compaction, tool-call offloading, and progressive disclosure of skills. For long-horizon work, patterns like Ralph Loops (re-injecting the original prompt into a fresh context) and planner/generator/evaluator splits help maintain coherence.

Harnesses as living systems

Every component in a harness encodes an assumption about what the model can't do on its own. As models improve, the space of interesting harness problems moves — the ceiling rises and new failure modes appear. The discipline is to remove scaffolding when the model makes it redundant and add new scaffolding for the tasks that have just become reachable.

The shift toward Harness-as-a-Service means developers now configure a runtime rather than build one from scratch. The harness is a living system, not a config file set up once. The "best" harness isn't necessarily the one the model was trained inside; it's the one designed for your specific task and failure history.

Read this at any depth.

Install Depth and pick your level — Glance for a sentence, Summary for the gist, Read for the full take. Free daily quota, no signup needed.

Add to Chrome

9 views