Where Knowledge Lives: Composing an Agent Workflow

December 14, 2025

Most discussions about AI-assisted engineering treat tooling as a choice problem. Which CLI? Which IDE? Which MCP server? Which memory layer? The framing suggests the difficulty is selecting the right product from a crowded shelf. After rebuilding my own workflow three times in roughly a year, I am increasingly convinced that the choice problem is the easy part. The hard problem is placement: deciding where a given piece of knowledge or capability should live across an already-chosen stack.

This essay is not a survey of tools. There are better ones for that. It is an attempt to write down the heuristics I now use to place things — what goes in AGENTS.md, what becomes a skill, what justifies a Model Context Protocol (MCP) server, what belongs in long-term memory, and what should not be encoded anywhere because the model already does it well enough on its own.

The illusion of a single agent

When most engineers say "the agent," they mean a single conversational surface — Claude Code, Cursor, Codex, the Anthropic API. That framing collapses several distinct layers of decision-making into one undifferentiated thing. In practice, every working agent setup is a small distributed system, and like every distributed system, the interesting properties live in the interfaces between components.

[Figure: the five layers a request passes through. Runtime: CLI / IDE harness (claude code · cursor · codex). Project: AGENTS.md / CLAUDE.md (always loaded · this repo only). Capabilities: skills · subagents (on-demand · portable). Connections: MCP servers · tools (side-effects · external state). Memory: files · vector store · transcript (across sessions).]

I think of the stack as five layers: the runtime that hosts the loop, the project instructions that tell the model what this codebase is, the skills that give it portable procedures, the MCP servers that let it touch the world, and the memory that persists between sessions. The key insight is that each layer has distinct lifetime, scope, and cost characteristics, and putting the wrong kind of knowledge in the wrong layer is the most common cause of an agent setup that "works in demos and fails in practice."

The placement function

Whenever I am about to encode a new piece of knowledge or capability into my workflow, I ask four questions. They form an informal placement function that, embarrassingly, I now run almost reflexively:

  • Scope. Does this apply to one repo, one task, one user, or anyone running this code?
  • Lifetime. Is it true for this session, this week, or indefinitely?
  • Cost. Does it belong in every prompt, or only when needed?
  • Determinism. Is it a fact (the model should know it), a procedure (the model should do it), or a capability (the model needs to call out to it)?

A piece of knowledge with narrow scope, long lifetime, low marginal cost, and pure-fact determinism is a perfect AGENTS.md candidate. A piece with broad scope, long lifetime, a cost too high to pay in every prompt, and procedural determinism is a skill. A piece with broad scope, ephemeral lifetime, and capability-level determinism is an MCP server. None of this is theoretically deep. It is, however, what I wish someone had handed me a year ago.
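The four questions can be written down as a small function. What follows is a minimal sketch, not a tool I actually run: the `Knowledge` fields, the `Kind` enum, and the order of the checks are all assumptions made for illustration, and real placement decisions are fuzzier than a handful of booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    FACT = "fact"              # the model should know it
    PROCEDURE = "procedure"    # the model should do it
    CAPABILITY = "capability"  # the model needs to call out to it


@dataclass(frozen=True)
class Knowledge:
    """Hypothetical answers to the placement questions for one item."""
    general: bool        # would the model answer well with no scaffolding?
    kind: Kind           # determinism: fact, procedure, or capability
    repo_specific: bool  # scope: true only for this repository
    long_lived: bool     # lifetime: outlives the current session
    always_needed: bool  # cost: worth paying for in every prompt


def place(k: Knowledge) -> str:
    """Return the layer a piece of knowledge should live in."""
    if k.general:
        return "nowhere"       # the model already has it
    if k.kind is Kind.CAPABILITY:
        return "MCP server"    # live state the agent must go and fetch
    if not k.long_lived:
        return "prompt"        # one-off task context
    if k.kind is Kind.PROCEDURE:
        return "skill"         # loaded on demand, portable
    if k.repo_specific and k.always_needed:
        return "AGENTS.md"     # always-loaded project fact
    return "memory file"       # persistent, but not a repo-wide rule


if __name__ == "__main__":
    pnpm_rule = Knowledge(False, Kind.FACT, True, True, True)
    print(place(pnpm_rule))
```

The ordering encodes one opinion: "does the model already know this?" is asked first, so that nothing gets encoded merely because a layer exists to hold it.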

Layer 1: the runtime

The runtime is the layer people fight about most and the one that matters least. Whether you use Claude Code, Cursor, Codex, or the SDK directly, you are picking a harness: observation policy, action validation, context formatting, retry behavior. These differ in real ways, but the differences are mostly about ergonomics, not about what the agent can fundamentally do.

The one thing worth being deliberate about at this layer is which loop you live inside. I now keep two distinct workflows:

  1. Tight loop: an editor-integrated agent where I am driving every turn, reviewing each diff. Cursor, Claude in VS Code, anything inline.
  2. Loose loop: a CLI agent that I hand a non-trivial task and let run for minutes or hours, often in parallel branches. Claude Code, Codex CLI, the Claude Agent SDK.

The mistake is using the wrong loop for the task. Tight-loop agents waste the model on autocomplete-grade work that is faster to type. Loose-loop agents amplify ambiguous specifications into expensive failures. The decision is not about features; it is about how much variance you can absorb and how much you will read carefully.

Layer 2: AGENTS.md, the project's nervous system

AGENTS.md (and its sibling CLAUDE.md) is the file an agent reads at the start of every session in the project. The convention is now well-established enough that there is a shared spec at agents.md, and most major tools respect it. I treat this file as the canonical answer to one question: what does a competent contributor need to know about this repository before touching anything?

What this means in practice is that AGENTS.md should be the densest, most boring, most factual document in the repo. Not because facts are interesting, but because every token in it is paid for on every single session.

# Repository conventions

- Package manager: pnpm. Never `npm install`.
- Tests: `pnpm test`, hits a real Postgres in `docker-compose.test.yml`.
  Mocked DB tests are not allowed (see incident #482).
- Migrations live in `db/migrations`, generated via `pnpm db:gen`.
- All API routes return `{ data, error }`. Never throw across the boundary.
- We do not use barrel files (`index.ts` re-exports). Import from source.

A few things worth noting about this style. There are no explanations of why the conventions exist beyond a one-line pointer to an incident. There are no examples. There is no reasoning. The model already knows what pnpm is and how to write a migration. The only thing it does not know is the unwritten rules of this specific codebase, and that is exactly what AGENTS.md is for.

The two failure modes I see most often are bloat and aspiration. Bloat: teams turn AGENTS.md into a wiki, fifteen pages of architecture diagrams the model has to wade through to find the one rule it needed. Aspiration: teams write down rules they wish were true ("we always write tests first") rather than rules that actually hold. The model takes both of these literally, and both produce subtly wrong work.

A useful test: if a rule in AGENTS.md is also enforced by a linter or a pre-commit hook, delete it from the file. The hook will catch the violation at no cost to the context window.
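The "every token is paid for on every session" point is easy to make concrete. The sketch below is a rough estimate only: the four-characters-per-token ratio is a common rule of thumb rather than a real tokenizer, and the function names are made up for this example.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; a real tokenizer will give different numbers."""
    return int(len(text) / chars_per_token)


def weekly_cost(agents_md: str, sessions_per_week: int) -> int:
    """Approximate tokens spent per week just re-reading the project file."""
    return estimate_tokens(agents_md) * sessions_per_week


def section_sizes(markdown: str) -> list[tuple[str, int]]:
    """Estimated tokens per heading-delimited section, largest first.

    Any line starting with '#' is treated as a heading, so a '#' comment
    inside a code sample would be miscounted as one. Good enough for
    spotting which part of the file has turned into a wiki.
    """
    sections: list[tuple[str, list[str]]] = [("(preamble)", [])]
    for line in markdown.splitlines():
        if line.startswith("#"):
            sections.append((line.lstrip("#").strip(), []))
        else:
            sections[-1][1].append(line)
    sizes = [(title, estimate_tokens("\n".join(body))) for title, body in sections]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    from pathlib import Path

    path = Path("AGENTS.md")  # assumed to sit in the current directory
    if path.exists():
        text = path.read_text(encoding="utf-8")
        print("approx. tokens per session:", estimate_tokens(text))
        for title, size in section_sizes(text)[:3]:
            print(f"  {title}: ~{size} tokens")
```

The absolute numbers matter less than the ranking: the largest sections are the first candidates to move into a skill or delete.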

Layer 3: skills, the procedural library

If AGENTS.md is what is true here, a skill is how we do this thing. Skills are reusable, self-contained procedures the agent loads on demand — typically a folder with a SKILL.md describing when to use it, plus any scripts, prompts, or reference material the procedure needs. Anthropic recently formalized this as Agent Skills, but the underlying pattern is older: it is essentially a callable runbook.

The reason to factor knowledge into a skill rather than into AGENTS.md comes down to the cost question above. AGENTS.md pays its tokens always; a skill pays them only when invoked. So procedural knowledge that is occasionally relevant — running a security review, rotating a credential, generating release notes — does not belong in the project file. It belongs in a skill that the model picks up only when the situation calls for it.

A useful litmus test: a skill should be triggerable. The opening of every skill I write specifies, with examples, exactly when to load it:

---
name: deprecate-endpoint
description: Use when removing a public API endpoint. Handles the
  sunset header, changelog entry, client SDK update, and the 90-day
  deprecation notice on our status page.
---

## When to use
- Removing or renaming a route under `/api/v1/...`
- Changing a response shape in a backwards-incompatible way

## Procedure
1. ...

The triggering criteria matter more than the procedure itself. A skill the model never loads might as well not exist; a skill the model loads in the wrong situation is worse than not having one. Spending real effort on the description field — making it specific, naming the symptoms that should cause the model to reach for it — is the highest-leverage thing you can do here.
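One way to keep descriptions honest is a small lint that runs over every SKILL.md. This is a sketch under stated assumptions: the frontmatter parser handles only the simple `key: value` shape with indented continuation lines shown above (it is not a YAML parser), and the checks, including the "## When to use" heading, encode my own conventions rather than anything a skills specification requires.

```python
def parse_frontmatter(skill_md: str) -> dict[str, str]:
    """Parse a leading '---' block of simple `key: value` lines.

    Indented lines are treated as continuations of the previous value.
    """
    lines = skill_md.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    key = None
    for line in lines[1:]:
        if line.strip() == "---":
            break
        if line[:1].isspace() and key is not None:
            fields[key] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            fields[key] = value.strip()
    return fields


def lint_skill(skill_md: str) -> list[str]:
    """Return a list of problems that would make the skill hard to trigger."""
    fields = parse_frontmatter(skill_md)
    problems: list[str] = []
    if not fields.get("name"):
        problems.append("missing name")
    description = fields.get("description", "")
    if not description:
        problems.append("missing description")
    elif "when" not in description.lower():
        problems.append("description does not say when to use the skill")
    if "## When to use" not in skill_md:
        problems.append("no 'When to use' section")
    return problems


if __name__ == "__main__":
    vague = "---\nname: release-helper\ndescription: Release helper\n---\n"
    for problem in lint_skill(vague):
        print("-", problem)
```

A lint like this cannot tell whether a description is good, only whether it is obviously missing the trigger. The real test is still watching whether the model reaches for the skill at the right moment.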

A second observation: skills are the right place to put taste. AGENTS.md codifies rules; skills codify judgment. The deprecation skill above is not a rule, it is a sequence of value-laden choices about what we owe our API consumers. Encoding judgment as a procedure the model can replay is one of the genuinely new things this generation of tooling allows.

Layer 4: MCP, the rest of the world

The Model Context Protocol is, conceptually, very small. It is a standard wire protocol for exposing tools, resources, and prompts to an agent. It exists for one reason: before MCP, every agent had to ship a bespoke integration for every external system, and the combinatorial explosion was killing everyone.

The placement question for MCP is: does this knowledge live somewhere the agent has to go to get it? If yes — Linear tickets, GitHub issues, a deployment dashboard, your design system in Figma — that is an MCP server. If no, you are encoding state that should be a fact (AGENTS.md) or a procedure (a skill).

The mistake I see most often is people standing up MCP servers as a wrapper around static reference material. If your "Slack MCP" exists so the model can read pinned messages that explain the team's coding conventions, those conventions belong in the repo, not behind a network round-trip. MCP is for live state, where the answer can change between when you write it down and when the agent needs it. For everything else, the local filesystem is faster, cheaper, and more reliable.

A subtle point about MCP that does not get enough attention: every connected server expands the action space the model has to reason over, and in my experience the cost grows faster than the server count. I had a setup with eleven MCP servers connected at one point. The model spent visible reasoning effort on tool selection, occasionally chose tools that were technically applicable but worse than the obvious local-filesystem alternative, and overall behaved less well than a setup with three servers. The right number is the smallest number that covers the live-state surface of your work.

Layer 5: memory

The memory layer is the most underdeveloped part of the stack and the one where the right answer is most contested. The options range from "no persistent memory at all" (every session starts cold) to "a vector database the agent searches at every turn" with everything in between.

My current setup is deliberately boring: a structured set of files in a known directory, written by the agent, read by the agent, organized by topic. No vector database, no embedding pipeline, no retrieval cleverness. The agent decides what is worth remembering, writes a small markdown file, and updates an index. It then reads only the index on each session, fetching specific files as needed.
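To show how little machinery this takes, here is a minimal sketch of the files-plus-index pattern. The directory layout, the `INDEX.md` name, and the one-line-per-topic index format are assumptions for illustration, not a description of my actual files; topics are assumed to be short slugs and summaries single lines.

```python
from pathlib import Path

INDEX_NAME = "INDEX.md"  # hypothetical file name for the index


def read_index(root: Path) -> dict[str, str]:
    """Return {topic: one-line summary}; this is all a session reads up front."""
    index = root / INDEX_NAME
    if not index.exists():
        return {}
    entries: dict[str, str] = {}
    for line in index.read_text(encoding="utf-8").splitlines():
        if line.startswith("- ") and ".md: " in line:
            name, _, summary = line[2:].partition(".md: ")
            entries[name] = summary
    return entries


def remember(root: Path, topic: str, summary: str, body: str) -> Path:
    """Write (or overwrite) one topic file and refresh its index line."""
    root.mkdir(parents=True, exist_ok=True)
    note = root / f"{topic}.md"
    note.write_text(f"# {topic}\n\n{body}\n", encoding="utf-8")
    entries = read_index(root)
    entries[topic] = summary
    index_text = "".join(
        f"- {name}.md: {text}\n" for name, text in sorted(entries.items())
    )
    (root / INDEX_NAME).write_text(index_text, encoding="utf-8")
    return note


def recall(root: Path, topic: str) -> str:
    """Fetch one full note, only when the index suggests it is relevant."""
    return (root / f"{topic}.md").read_text(encoding="utf-8")


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "memory"
        remember(root, "testing", "Prefer real HTTP fixtures", "Details go here.")
        print(read_index(root))
```

The index is the only thing loaded every session, so it has the same cost profile as AGENTS.md and deserves the same discipline: one line per topic, nothing more.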

This works well for the reason described in the previous post on harness engineering: externalized memory is mostly about giving the model a stable, navigable workspace, and a filesystem already does that better than anything more sophisticated. A vector store becomes interesting only when the volume of stored knowledge exceeds what an index can summarize, which for individual workflows is essentially never.

The real risk in the memory layer is staleness. A memory file written six months ago is a fact about the world as it was, not as it is, and the model will treat it as ground truth unless explicitly told not to. I now treat every memory entry as a hypothesis the model should verify before acting on, and I write that policy into the system prompt rather than relying on the model to figure it out. The cost is occasionally a few extra tool calls; the benefit is an agent that does not confidently apply yesterday's solution to a problem it no longer fits.
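The same policy can be reinforced mechanically by labeling each entry with its age before it reaches the model. A minimal sketch follows, assuming each entry records the date it was written; the `MemoryEntry` type, the label wording, and the 30-day threshold are all illustrative choices, not a recommendation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MemoryEntry:
    topic: str
    text: str
    written_on: date


def frame_for_prompt(entry: MemoryEntry, today: date, stale_after_days: int = 30) -> str:
    """Render an entry as a dated hypothesis rather than a bare fact."""
    age_days = (today - entry.written_on).days
    if age_days > stale_after_days:
        label = "likely stale, verify before acting"
    else:
        label = "hypothesis, verify before acting"
    return f"[memory:{entry.topic} | age_days={age_days} | {label}] {entry.text}"


if __name__ == "__main__":
    entry = MemoryEntry(
        topic="testing",
        text="Integration tests here use real HTTP fixtures.",
        written_on=date(2025, 6, 1),
    )
    print(frame_for_prompt(entry, today=date(2025, 12, 14)))
```

Every entry carries the "verify" label regardless of age; the threshold only changes how strongly the warning is worded.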

A decision table

Distilled into a single reference, the placement function looks like this:

| Knowledge shape | Where it lives | Why |
|---|---|---|
| Repo-specific fact | AGENTS.md | Always relevant, narrow scope |
| Reusable procedure | skill | Loaded on demand, portable |
| Live external state | MCP server | Changes between sessions |
| Cross-session preference | memory file | Persistent, user-scoped |
| One-off task context | prompt | Ephemeral, narrow |
| General knowledge | nowhere | The model already has it |

The last row is the one I most often have to remind myself of. A surprising fraction of "agent engineering" effort goes into encoding things the model already knows perfectly well, which adds noise without adding capability. If you can ask the model a question and get a good answer with no scaffolding, that knowledge does not need to live in your stack.

A worked example

To make this less abstract, here is a recent task I ran end to end and where each piece of context came from. The job was to add a new payment provider to a service I work on.

The CLI I used (Claude Code) provided the loop. The project's AGENTS.md told the agent that we use a strategy pattern for providers and that all of them implement an IPaymentProvider interface defined at a specific path — fact, narrow scope, every session. A skill called add-payment-provider walked through the seven-step procedure (interface stub, factory registration, integration test, sandbox credentials, feature flag, dashboard entry, runbook update) — procedure, occasional, portable. The Linear MCP let the agent read the ticket and the linked design doc — live external state. My memory directory contributed a single relevant file noting that I prefer the integration tests in this service to use real HTTP fixtures via nock rather than mocked clients, because we got burned by a divergence months ago — preference, persistent, user-scoped. Nothing in the task required the agent to know what HMAC is or how OAuth works; the model brought that on its own.

The whole thing took about forty minutes of mostly-autonomous execution. I read the diffs, ran the tests locally, pushed a PR. The interesting observation is that no single layer of the stack was doing very much work. Each one contributed a small, well-scoped piece, and the agent assembled them into a coherent change. The effort I had spent earlier — figuring out which layer each piece belonged in — paid for itself on this run and will pay for itself on every subsequent payment-provider integration too.

What still does not work

I want to be honest about the gaps, because the marketing copy in this space is unbearable.

The composition story across layers is poor. There is no good answer to "how do I install this skill into this project for this user," and most teams end up with skills that live in three different places and partially shadow each other. The MCP ecosystem has the same problem at a larger scale: discoverability is bad, version pinning is bad, and there is no reasonable way to tell which servers a given agent should have available for a given task without manually toggling them.

Memory is not solved. The filesystem-and-index approach I described works, but it works because I am the only writer and reader. The moment you want shared memory across a team, you hit the same problems distributed systems have always had — invalidation, conflict resolution, access control — with the additional complication that one of the readers is a model that will confidently act on stale information.

Evaluation is essentially nonexistent for personal workflows. We can measure agent performance on benchmarks, but I have no good way to measure whether my setup is getting better or worse over time. Most of the placement decisions I described above are calibrated by feel, and feel is famously unreliable. The benchmark someone eventually publishes for "is this individual workflow well-composed" will reshape the space, and I cannot wait for it.

The point

The shape of this work has changed in the last year. A year ago, the bottleneck was capability: could the model do the thing at all? Today, for most engineering tasks I care about, the bottleneck is composition: how do I assemble a setup such that the model has the facts it needs, the procedures it can replay, the connections it can call out to, and the persistent context that makes the next session faster than this one?

That assembly is the actual job. The tools are commodities; the placement is taste. If you only take one thing from this, let it be the question I now ask before adding anything to my stack: what is the smallest, longest-lived, most boring layer that this knowledge can live in? Pushing things down toward boring almost always pays off. Pushing them up toward clever almost never does.


References

  1. The AGENTS.md working group. agents.md — a simple, open format for guiding coding agents. 2025.
  2. Anthropic. Introducing the Model Context Protocol. 2024.
  3. Model Context Protocol. Specification and reference servers. 2024–.
  4. Anthropic. Claude Code documentation. 2025.
  5. Anthropic. Equipping agents for the real world with Agent Skills. 2025.
  6. Willison, S. Things I've learned about building coding agents. Ongoing.
  7. OpenAI. Codex CLI repository. 2025.