Harness Engineering for AI Agents

The framing most teams start with when building AI agents is wrong. They spend time picking the right model, crafting elaborate system prompts, and tuning temperature, and then wonder why the system behaves erratically in production. The model is rarely the problem. The harness is.

By harness I mean the runtime infrastructure that wraps an LLM to make it behave as an agent: the code that decides what goes into the context window, which tools the model can call, what state persists between steps, and what happens when something goes wrong. This is unglamorous work. There are no leaderboards for harness design. But in practice, harness quality predicts agent reliability far better than model quality does.

The Anatomy of an Agent Harness

At the lowest level, an AI agent is a loop. On each iteration, the model receives an observation, produces an action, and the action changes the environment, which produces a new observation. The harness is everything outside the model that makes this loop work.

The harness divides into four layers, and each one encodes decisions that determine what behaviors are possible:

Observation layer: decides what the model sees (raw tool results, filtered summaries, retrieved documents, conversation history).
Action layer: validates, routes, and executes what the model decides to do.
Memory layer: manages what persists across steps (working context, episodic logs, semantic retrieval).
Eval and guardrails layer: detects whether the agent is making progress, stuck, or about to do something harmful.

An agent with a poorly designed observation layer will hallucinate. An agent with a poorly designed action layer will take irreversible actions it should not. Neither of these is a model failure.
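To make the layering concrete, here is a minimal sketch of the loop with each layer marked. Everything in it is an assumption made for illustration: the `Harness` class, the convention that the model returns a bare tool name, the `"done"` sentinel, and the 500-character cap are not taken from any real framework.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    """Hypothetical four-layer harness around a model callable."""
    model: Callable[[str], str]              # context in, tool name out
    tools: dict[str, Callable[[], str]]      # action layer: the allowed actions
    max_steps: int = 10                      # guardrail: hard step limit
    log: list[tuple[str, str]] = field(default_factory=list)  # memory layer

    def observe(self, raw: str) -> str:
        # Observation layer: decide what the model sees (here, a crude cap).
        return raw[:500]

    def run(self, task: str) -> str:
        observation = self.observe(task)
        for _ in range(self.max_steps):
            action = self.model(observation)
            if action == "done":
                return "goal_reached"
            if action not in self.tools:
                # Action layer: reject anything outside the declared action space.
                observation = self.observe(f"[TOOL ERROR] unknown tool: {action}")
                continue
            result = self.tools[action]()
            self.log.append((action, result))
            observation = self.observe(f"[TOOL RESULT] {action}:\n{result}")
        return "max_steps"
```

Swapping the model for a scripted callable, for example `lambda ctx: next(script)` over an iterator of tool names, is enough to exercise the harness without any LLM in the loop, which is a useful property in its own right.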

Observation Engineering

The context window is not a dump. It is the only channel through which a model perceives the world, which means what you put in it is the most consequential engineering decision you make in an agent system.

Consider a typical coding agent with access to a shell. After each command, the harness appends stdout to the context. This works fine for five turns. By turn twenty, the model is reading through thousands of tokens of logs to find the one error that matters. Attention is diluted, not because the model degraded, but because the harness accumulated noise without a policy for what to discard.

Good observation engineering means treating context as a budget:

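from dataclasses import dataclass

@dataclass
class ToolResult:  # assumed shape, inferred from the fields the snippets here use
    tool_name: str
    output: str = ""
    error: str = ""

def format_observation(result: ToolResult, budget: int = 2000) -> str:
    if len(result.output) <= budget:
        return result.output
    # Over budget: keep the head and tail, and say how much was dropped.
    head = result.output[:budget // 3]
    tail = result.output[-(budget // 3):]
    truncated = len(result.output) - 2 * (budget // 3)
    return f"{head}\n\n[... {truncated} chars truncated ...]\n\n{tail}"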

The code is trivial. The decision behind it is not. You are explicitly encoding a belief that the beginning and end of a tool output are more signal-dense than the middle. For shell commands that is usually right. For JSON API responses it may not be. The point is that the harness makes this choice, and if you do not make it deliberately, it defaults to "dump everything," which is not a neutral choice.

OpenAI's engineering team captured this precisely when describing the Codex harness: give the agent a map, not a 1,000-page instruction manual. [1] Context management is one of the biggest challenges in making agents effective at complex tasks, and the solution is always the same: give the model a small, stable entry point and progressively disclose detail only when needed.

A subtler issue is context poisoning: when an early observation contains incorrect or malformed information that the model anchors on for the rest of the session, even after correct data appears later. LLMs give disproportionate weight to early context. A robust harness annotates and validates tool results before they enter the window:

def safe_observation(result: ToolResult) -> str:
    if result.error:
        return (
            f"[TOOL ERROR] {result.tool_name} failed: {result.error}\n"
            f"Do not assume previous state."
        )
    if not result.output:
        return f"[TOOL RESULT] {result.tool_name} returned no output."
    return f"[TOOL RESULT] {result.tool_name}:\n{format_observation(result)}"

The annotations look redundant when you read them as a human. They are not redundant to the model.

Action Space Design

Tool schema is interface design, and the same principles apply. A tool with an ambiguous name or an underspecified parameter schema will be misused, not because the model is wrong, but because you gave it insufficient information to be right.

Compare these two definitions for a file-writing tool:

// Underspecified
{
  "name": "write_file",
  "parameters": {
    "path": { "type": "string" },
    "content": { "type": "string" }
  }
}

// Specified
{
  "name": "write_file",
  "description": "Write content to a file. Overwrites if it exists. Use append_file to add to existing content.",
  "parameters": {
    "path": {
      "type": "string",
      "description": "Absolute path. The parent directory must exist."
    },
    "content": {
      "type": "string",
      "description": "Full content to write, UTF-8 encoded."
    }
  }
}

The second schema is three times longer. It is also far less likely to cause a silent overwrite the model did not intend. The description explicitly names the complementary tool, append_file, which forces you to define that tool too, meaning you have thought through the action space properly.

Granularity is the other critical dimension. A coarse-grained tool like run_bash_command gives the model flexibility but gives you no observability into what it is doing, and makes it impossible to implement meaningful safety checks. Fine-grained tools (read_file, write_file, list_directory) are more work to define but make the action space auditable.

A useful rule of thumb: each tool should do exactly one thing describable in a sentence without the word "and." If you find yourself writing "reads a file and processes the result," that is two tools.
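Fine-grained tools pay off when the action layer checks every call before running it. The sketch below is one way that could look, assuming a hypothetical registry that maps each tool name to its required parameter names and an implementation. The `dispatch` function and the audit-record shape are illustrative, not any framework's API.

```python
from typing import Any, Callable

# Hypothetical registry entry: (required parameter names, implementation).
ToolSpec = tuple[set[str], Callable[..., str]]


def dispatch(
    registry: dict[str, ToolSpec],
    name: str,
    args: dict[str, Any],
    audit: list[dict[str, Any]],
) -> str:
    """Validate a tool call against the registry, record it, then execute it."""
    if name not in registry:
        return f"[TOOL ERROR] unknown tool {name!r}. Available: {sorted(registry)}"
    required, impl = registry[name]
    missing = required - set(args)
    extra = set(args) - required
    if missing or extra:
        return (
            f"[TOOL ERROR] {name}: missing={sorted(missing)} "
            f"unexpected={sorted(extra)}"
        )
    audit.append({"tool": name, "args": dict(args)})  # one auditable record per call
    return impl(**args)
```

Rejected calls come back as annotated observations rather than exceptions, so the model gets a chance to correct itself, and only calls that actually ran land in the audit list.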

State and Memory Architecture

An LLM has no internal state between calls. Every turn, the harness reconstructs the model's world from scratch. This means the memory architecture determines what the agent can remember and, crucially, what it can forget.

There are three distinct kinds of memory in an agent system, and conflating them is a common source of subtle bugs:

| Type | What it stores | Typical implementation |
| --- | --- | --- |
| Working memory | The current context window | In-memory, rebuilt each turn |
| Episodic memory | A log of what happened this session | Append-only store, summarized on retrieval |
| Semantic memory | Reusable facts and domain knowledge | Vector database, retrieved by relevance |

Most agent frameworks default to "everything goes in the context window," which works for short tasks and degrades quietly for long ones. A well-designed harness has an explicit policy for each type:

When does a tool result get stored to episodic memory vs. appended to context?
When does a retrieved document get added to context vs. summarized first?
What gets discarded at the end of a session, and what persists?
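One possible answer to the first question, as a sketch: log every tool result to episodic memory, and inline it into the context only when it is short. The `Memory` class, the `route_result` function, and the 500-character threshold are all assumptions chosen for illustration. The right policy depends on the task.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    context: list[str] = field(default_factory=list)   # working memory
    episodic: list[str] = field(default_factory=list)  # append-only session log


def route_result(memory: Memory, tool_name: str, output: str, inline_limit: int = 500) -> None:
    """Hypothetical policy: log everything, inline only what is short."""
    entry_id = len(memory.episodic)
    memory.episodic.append(f"{tool_name}: {output}")
    if len(output) <= inline_limit:
        memory.context.append(f"[TOOL RESULT] {tool_name}:\n{output}")
    else:
        # Long output: leave a pointer so the full text can be fetched later.
        memory.context.append(
            f"[TOOL RESULT] {tool_name}: {len(output)} chars stored as episodic entry {entry_id}"
        )
```

The value is less in this particular rule than in having the rule written down in one place, where it can be read, tested, and changed.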

A concrete example of what good externalized memory looks like comes from OpenAI's Codex long-horizon work. [2] A 25-hour uninterrupted run producing roughly 30,000 lines of code relied not on a large context window, but on four markdown files maintained by the harness throughout execution: a prompt document locking in the original specification, a plan file with milestone checkpoints, an implementation runbook, and a real-time documentation log. The agent would read and update these at each checkpoint, maintaining coherence across a session far longer than any context window. This is episodic memory implemented with no specialized infrastructure.

Getting memory policies wrong does not cause the agent to fail visibly. It causes subtly worse decisions over time, which is much harder to diagnose than an outright crash.

Long-Horizon Execution

The checkpoint-driven model changes the structure of the execution loop itself. Rather than a simple observe-act cycle, long-horizon tasks need a loop that validates progress, records state, and can recover from failure without losing the thread. The structure that emerged from both the Codex [2] and Anthropic [3] harness work is a milestone loop: implement one milestone, validate it, then either checkpoint the result or send it to repair.

Each milestone is a small, independently verifiable unit of work. The validate step runs quality checks (tests, linting, type checking) and either promotes the result to a checkpoint or routes it to repair. Crucially, the externalized memory store is updated at every checkpoint, so if the agent loses context or needs to restart, it can reconstruct where it left off from the files rather than from the context window.
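A rough sketch of that loop, under stated assumptions: `implement`, `validate`, and `repair` are hypothetical callables standing in for model calls and quality checks, and the checkpoint store is a plain text file with one completed milestone per line. This is not the Codex or Anthropic implementation, only the shape of the idea.

```python
from pathlib import Path
from typing import Callable


def run_milestones(
    milestones: list[str],
    implement: Callable[[str], str],      # hypothetical: produce work for a milestone
    validate: Callable[[str], bool],      # hypothetical: tests, lint, type checks
    repair: Callable[[str, str], str],    # hypothetical: revise work that failed
    progress_file: Path,
    max_repairs: int = 2,
) -> bool:
    """Run milestones in order, checkpointing each validated one to disk."""
    done = set(progress_file.read_text().splitlines()) if progress_file.exists() else set()
    for milestone in milestones:
        if milestone in done:
            continue  # already checkpointed by an earlier run
        work = implement(milestone)
        attempts = 0
        while not validate(work):
            if attempts == max_repairs:
                return False  # escalate instead of repairing forever
            work = repair(milestone, work)
            attempts += 1
        with progress_file.open("a") as f:
            f.write(milestone + "\n")  # checkpoint: state lives outside the context
    return True
```

Because completed milestones are read back from the file at startup, a restarted run resumes at the first unfinished milestone instead of relying on whatever survived in the context window.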

Anthropic's harness work [3] introduced a related pattern for quality-sensitive tasks: a generator-evaluator split, where a separate agent is responsible solely for judging output rather than producing it. The motivation is that self-evaluation is structurally biased. Models consistently approve their own outputs even when quality is poor. A standalone evaluator, calibrated with explicit criteria and few-shot examples, produces sharper feedback. The tradeoff is cost and latency, but for tasks where output quality matters more than speed, the separation pays for itself.
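The control flow of a generator-evaluator split is small enough to sketch. The version below assumes two hypothetical callables, `generate` and `evaluate`, which in practice would wrap separate model calls with separate prompts. The function name and signatures are illustrative and not drawn from Anthropic's write-up.

```python
from typing import Callable


def generate_with_review(
    task: str,
    generate: Callable[[str, str], str],               # (task, feedback) -> draft
    evaluate: Callable[[str, str], tuple[bool, str]],  # (task, draft) -> (accepted, feedback)
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Alternate a generator and a separate evaluator until accepted or out of rounds."""
    feedback = ""
    draft = ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        accepted, feedback = evaluate(task, draft)
        if accepted:
            return draft, True
    return draft, False  # caller decides what to do with an unaccepted draft
```

The `max_rounds` cap is where the cost and latency tradeoff becomes an explicit number rather than an accident of the prompt.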

A Taxonomy of Harness Failures

When an agent misbehaves in production, it is useful to have a vocabulary for the failure. Most issues fall into a small number of categories:

Action loops. The model takes an action, observes its result, and decides to take the same action again. This happens when the harness does not give the model a mechanism to detect that progress has stalled. Fix: include a monotonic step counter in every observation, and implement loop detection that terminates or escalates after N identical consecutive actions.
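A minimal sketch of that fix, assuming actions are a tool name plus JSON-serializable arguments. The `LoopDetector` class is hypothetical. It treats two actions as identical only when both the name and the arguments match exactly.

```python
import json
from collections import deque
from typing import Any


class LoopDetector:
    """Flags N identical consecutive actions (tool name plus arguments)."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self.recent: deque[str] = deque(maxlen=max_repeats)
        self.step = 0  # monotonic counter to include in every observation

    def record(self, tool: str, args: dict[str, Any]) -> bool:
        """Record one action; return True when the last N actions are identical."""
        self.step += 1
        # Canonical form so argument order does not hide a repeat.
        self.recent.append(json.dumps([tool, args], sort_keys=True))
        return len(self.recent) == self.max_repeats and len(set(self.recent)) == 1
```

When `record` returns True, the harness can terminate, escalate to a human, or inject an observation telling the model that its last few actions were identical.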

Over-tooling. The harness exposes so many tools that the model spends reasoning capacity deciding which one to call rather than reasoning about the problem. I have seen implementations with forty-plus tools that could have been collapsed to eight without any loss of capability. Each additional tool increases the action space and the probability that the model selects something unnecessary or harmful. Design for the minimum set of tools that covers the task.

Under-observability. The harness does not log enough to reconstruct what happened in a failed session. Since agent behavior is determined by the sequence of observations received, a log that captures only model outputs is useless for debugging. The full trace must be preserved: every observation, every action, and the exact context passed to the model at each step. This is not optional for production systems. Tools like LangSmith and Langfuse exist to solve exactly this problem, which should tell you something about how often teams skip it.
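Even without a dedicated tracing product, the minimum viable version is a few lines. The sketch below writes one JSON line per step to any text sink. The `log_step` function and its field names are assumptions for illustration and do not mirror the LangSmith or Langfuse data models.

```python
import json
import time
from typing import Any, TextIO


def log_step(
    sink: TextIO,
    step: int,
    context: str,
    action: dict[str, Any],
    observation: str,
) -> None:
    """Append one step as a JSON line: the exact context, the action, and its result."""
    record = {
        "step": step,
        "timestamp": time.time(),
        "context": context,          # exactly what the model was shown
        "action": action,            # what the model asked for
        "observation": observation,  # what came back
    }
    sink.write(json.dumps(record) + "\n")
    sink.flush()  # keep the trace usable even if the process dies mid-run
```

Storing the full context at every step is expensive, but it is the only field that lets you replay a failed step exactly as the model saw it.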

Irreversibility without confirmation. Some tools have effects that cannot be undone. A harness that does not distinguish reversible from irreversible actions, and does not enforce a confirmation step for the latter, is one hallucination away from a serious incident.
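A confirmation gate can be sketched in a few lines. The tool names in `IRREVERSIBLE` are hypothetical examples, and `confirm` stands in for whatever approval path the deployment uses (a human prompt, a policy service). A real harness would more likely declare reversibility on each tool definition than keep a separate set.

```python
from typing import Any, Callable

# Hypothetical classification, for illustration only.
IRREVERSIBLE = {"delete_file", "send_email", "drop_table"}


def execute_guarded(
    tool: str,
    args: dict[str, Any],
    run: Callable[[str, dict[str, Any]], str],
    confirm: Callable[[str, dict[str, Any]], bool],
) -> str:
    """Run reversible tools directly; require confirmation for irreversible ones."""
    if tool in IRREVERSIBLE and not confirm(tool, args):
        return f"[TOOL ERROR] {tool} was not confirmed and did not run."
    return run(tool, args)
```

The refusal is returned as an observation, so the model learns that the action did not happen instead of assuming it did.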

Evaluation Is a Harness Concern

You cannot evaluate an agent by inspecting only its final output. An agent that produced the right answer by reasoning incorrectly and getting lucky will fail on the next variation of the same task.

Evaluation requires instrumenting the harness. At minimum, a harness should produce:

Step-level traces: the observation and action at each turn, with timing
Tool call logs: which tools were called, with what arguments, and what they returned
Termination signals: why the agent stopped (goal reached, max steps, error, timeout)

With these in place, you can run the agent on a benchmark, compute metrics (tool call accuracy, mean steps to completion, recovery rate from injected errors) and iterate on harness design with real feedback.
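As a sketch of what computing those metrics might look like, assume each benchmark run has been reduced to a small summary record. The `RunSummary` fields are hypothetical, and "tool call accuracy" is defined here as the share of calls that passed validation, which is only one of several reasonable definitions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunSummary:
    """Hypothetical per-run record derived from the harness trace."""
    steps: int
    termination: str        # "goal_reached", "max_steps", "error", or "timeout"
    tool_calls: int
    valid_tool_calls: int   # calls that passed validation (assumed definition)
    injected_errors: int = 0
    recovered_errors: int = 0


def summarize(runs: list[RunSummary]) -> dict[str, float]:
    """Aggregate benchmark metrics; ratios are 0.0 when the denominator is zero."""
    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    completed = [r for r in runs if r.termination == "goal_reached"]
    return {
        "success_rate": ratio(len(completed), len(runs)),
        "mean_steps_to_completion": mean(r.steps for r in completed) if completed else 0.0,
        "tool_call_accuracy": ratio(
            sum(r.valid_tool_calls for r in runs), sum(r.tool_calls for r in runs)
        ),
        "recovery_rate": ratio(
            sum(r.recovered_errors for r in runs), sum(r.injected_errors for r in runs)
        ),
    }
```

Mean steps is computed over successful runs only, so a harness change that makes failures stop earlier does not show up as an improvement.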

The ReAct paper [4] introduced the reasoning-action interleaving that underpins most modern agent frameworks. What it describes as "thought" traces are, in effect, harness-level instrumentation baked into the prompt: the model writes its reasoning aloud so the evaluator can read it. This costs tokens but makes the agent's internal state legible. Without it, you can observe what the model did but not why, which is not enough to debug reliably.

What Comes Next

Every component in a harness encodes an assumption about what the model cannot do on its own. [3] That is worth writing down explicitly, because those assumptions have a shelf life.

Anthropic's harness work found that sprint-based task decomposition, which was once necessary to keep agents coherent across long sessions, became unnecessary overhead when Opus 4.6 was released with better planning and context handling. The harness that was right for one model generation became the wrong harness for the next. Teams that had those assumptions written down adapted quickly. Teams that had baked them silently into their scaffolding had a harder time.

The practical implication is that harness design is not a one-time exercise. As model capabilities shift, the right places to invest in scaffolding shift too. What the harness compensates for today (context management, self-evaluation bias, loop detection) may require no compensation at all in two years. What the harness ignores today (multi-agent coordination, long-range planning, ambiguous goal disambiguation) may become the dominant failure mode.

The model is a component. The harness is the system. Build it like it will need to change, because it will.

References

[1] OpenAI. Harness engineering: leveraging Codex in an agent-first world. 2025.
[2] OpenAI. Run long-horizon tasks with Codex. 2025.
[3] Anthropic. Harness design for long-running apps. 2025.
[4] Yao, S. et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
[5] Wang, L. et al. A Survey on Large Language Model based Autonomous Agents. Frontiers of Computer Science, 2024.