A deep-dive on how to assemble context for an LLM agent

Open Claude Code, Cursor, or any other agentic coding CLI. Ask it to fix a bug. Watch it run one tool call, then another, then another. Under the hood, on every single LLM turn, the CLI is making a decision you never see: what exactly to send the model as the conversation history. The whole transcript, just the last few messages, a summary of the earlier ones, the original task plus the latest file read — each produces a different agent. The CLI has made the decision for you, baked in at compile time, invisible unless you read the source.

This article is a tour of the design space of context-assembly strategies for LLM agents — the invisible decisions a harness makes about what to put in the prompt on every turn. We build a taxonomy of the choices (where compression happens, what gets compressed, when it fires), survey how production CLIs actually solve the problem in source, and run a small experiment on four representative strategies against a bug-fix fixture.

The vehicle for the experiment is pi — an agent framework that exposes context-assembly as a function you write, which makes swapping strategies trivial. The article ends with a small algorithmic proposal — age-aware tool-result truncation — the cheapest one-line strategy we found that holds the line on cost without losing what the agent is currently reasoning about.

Why context is the agent’s central control problem

Three pieces make up any agentic coding setup: the model (the LLM that generates tokens), the agent (the control loop that repeatedly calls the model inside an environment), and the harness (the software around it that manages context, tools, prompts, state, and control flow). Sebastian Raschka has a good walkthrough of how these compose into a working coding agent. The point that matters here: a lot of what looks like “model quality” in a coding session is really context quality — and context quality is the harness’s job.

The agent’s loop, stripped to its essentials (sketched in code just after this list), is:

  1. Send a message history to the LLM.
  2. Get back either a text response or tool calls.
  3. If tool calls: run them, append the results to the history, go to 1.
  4. If text: done.
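A minimal sketch of that loop in TypeScript. The Message and ToolCall shapes and the callLLM/runTool parameters are stand-ins for whatever provider client and tool dispatcher the harness uses, not any particular framework’s API:

type Message = { role: "user" | "assistant" | "tool"; content: string };
type ToolCall = { name: string; args: Record<string, unknown> };
type LlmResponse = { text: string; toolCalls: ToolCall[] };

async function agentLoop(
  history: Message[],
  callLLM: (messages: Message[]) => Promise<LlmResponse>,
  runTool: (call: ToolCall) => Promise<string>,
): Promise<string> {
  while (true) {
    // 1. Send the message history to the LLM.
    const response = await callLLM(history);
    history.push({ role: "assistant", content: response.text });

    // 4. Text only: done.
    if (response.toolCalls.length === 0) return response.text;

    // 3. Tool calls: run them, append the results to the history, go to 1.
    for (const call of response.toolCalls) {
      history.push({ role: "tool", content: await runTool(call) });
    }
  }
}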

Now think about what happens to step 1 over a real session. The user’s prompt is small — a few hundred tokens at most. The system prompt is fixed and bounded. Almost all of the context the model sees on any given turn is tool calls and their results. A file read returns 2,000 characters of source code, a test run returns 1,500 characters of failure output, an ls returns 500. The agent may take thirty turns, each appending another batch of these to the history. By turn thirty you’re sending ~100,000 tokens of tool output alone per call, every call. Tool I/O dominates context size — and therefore dominates cost, latency, and how much room is left for the model to actually reason.

This is why most context-assembly strategies are, at heart, tool-output handling strategies: when to keep tool results verbatim, when to truncate them, when to replace them with stubs, when to fold a batch of older ones into a summary. A few strategies operate on whole-message granularity (sliding windows, full-history compaction), but the bulk of the design space — and almost all of the cost-difference signal in this article — comes from how aggressively each strategy treats the megabytes of tool output the agent accumulates per session.

If we simply leave the history to grow unchecked, we run into four problems:

  • Cost. You pay for every input token, every turn. Linear-in-history cost on a thirty-turn run means the last few turns are the most expensive calls you make.
  • Latency. Providers stream output, but they don’t stream input. A 100k-token prompt takes noticeable wall-clock time to submit and tokenize before the first response token arrives.
  • Context window size limitations. Gemini 2.5 Flash caps at 1M tokens; Claude Sonnet at 1M; most others at 200k. Hit the ceiling and the call fails outright.
  • Long-context degradation. Even inside the window, models attend more to the beginning and the end of the prompt than to the middle (Liu et al., 2023). For an agent, that means the earliest turns (goal, system prompt) and the most recent turns (latest tool result) are well-attended, while the middle — stale tool outputs, abandoned fix attempts, half-completed reasoning — gets diluted regardless of how much window remains.

Every agent has to make a choice about how to handle these pressures. That choice — explicit or implicit — is its context-assembly strategy. Some strategies ignore the pressures entirely (send everything, every turn) and let the user pay. Some rewrite the history aggressively. Most CLIs pick one point on this spectrum and ship it.

A context-assembly strategy isn’t a single decision — it’s a stack of interacting choices:

  • What gets dropped or rewritten?
  • When — every turn, or only at thresholds?
  • Where in the stack — tool layer, strategy layer, or both?
  • How aggressively — in tokens, characters, messages?
  • What do you summarize — older history, whole conversation, certain tool types?
  • How do you summarize — freeform, structured template, multi-round chained?
  • What about cache — does your transformation respect the prefix or invalidate it every turn?

Each is a real choice, and the combinatorics get worse fast. A strategy that’s perfect for short single-bug sessions may be catastrophic on multi-bug exploratory tasks; a tool-layer cap that helps on small files actively hides bugs in larger ones; a summarization template that captures a single bug’s fix may lose track when there are four.

Throughout the article we’ll lean on a small set of numbers to talk about each strategy concretely. These aren’t benchmark scores — the fixture is deliberately small and we run only 3 trials (k=3) per strategy, so nothing here is a formal evaluation. Think of the numbers as a shared vocabulary: a way to point at how each strategy behaves, where its failure modes show up, and where the cost goes when it goes. The two metrics we’ll come back to most often are pass rate and cost, with a few diagnostic numbers underneath that help explain why a given strategy ends up cheap or expensive.

| Metric | What it tells us | How we measure it |
| --- | --- | --- |
| Pass rate | Did the task get done? Without success, the rest of the numbers don’t really matter. | Binary per run; 3 trials (k=3) per strategy. |
| Median cost | What a typical run costs. | Sum of per-call costs; reported as median across the three trials. |
| Worst-case cost | What a bad run looks like. Worth flagging because picking by median can hide a $4 catastrophe that happens 1 run in 20. | Max cost across the three trials. |
| Turn count | Stand-in for latency — more LLM calls means more wall-clock time. | Number of assistant turns per run. |
| Prompt size per turn | Diagnostic. Explains why cost is what it is. | input_tokens + cached_tokens per LLM call. |
| Cache hit ratio | Diagnostic. A small per-turn prompt can still cost like a big one if the strategy invalidates the cache every turn. | cached_tokens / total_tokens per call. |

When you read the tables later, the natural order to look at things in is: pass rate first (a strategy that doesn’t reliably fix the bug isn’t really in the running), then worst-case cost (it tells you about the failure mode the median averages away), then the diagnostics if you want to understand why.

A taxonomy of strategies

With the metrics established, let’s look at the design space they’ll be applied to. The strategies that show up in production CLIs fall into a handful of families along two axes: where the compression happens and what gets compressed.

The simplest strategy is to do nothing — send the entire conversation verbatim on every LLM call. We call this baseline, and it’s the obvious starting point: zero implementation cost, perfect cache stability, and on short sessions it’s competitive with everything fancier. Until you hit the model’s context window, your budget, or long-context degradation starts costing you accuracy on the things that matter, baseline is fine.

Every other strategy is a way to implement compression. Compression can happen in three places:

  • Per-turn. Apply a transformation to the message list on every LLM call, shaping each turn’s prompt as the conversation grows.
  • At thresholds only. Leave the conversation alone until it crosses some size limit, then fire a one-time operation (typically summarization) on the older portion and freeze the result for the rest of the run.
  • At the tool level. Cap or rewrite a tool’s return value before it ever enters the conversation history. Lives one layer below the other two and composes with them — we cover it in detail further down.

Between the two strategy-layer modes, neither dominates, and both are often combined. Per-turn is cheaper (no extra LLM call), cache-stable from turn 1, and honest — a marker inserted into the prompt like …[truncated 1500 chars] tells the model something’s missing, where a summary can look complete even when it omits the bug. At-thresholds bounds the conversation (compaction shrinks the message list, so multi-round versions can run arbitrarily long), pays zero compression cost on short sessions, and recovers attention budget — the post-compaction prompt is small enough that every position is well-attended again. As a rule of thumb: short sessions favor per-turn, long sessions favor at-thresholds, and any production CLI worth shipping ends up combining both.

One strategy sidesteps compression by simply dropping: the sliding window, which keeps only the last K messages and discards the rest. It’s cache-hostile by construction, and the dropped messages are gone forever — making it the most aggressively lossy strategy in the lineup. On multi-bug tasks it’s catastrophic — earlier fixes scroll out of the window, the agent re-encounters them as unfamiliar code, undoes them, and loops. That’s why we don’t include sliding window in our experiments: we mention it below where it illustrates cache-hostility as a concept, but no production-relevant comparison would include it.

Since every remaining strategy has to compress somehow, the next question is what to compress. Five common patterns:

  • Drop old turns. Keep only the last N messages. Classic sliding window. Variant: pin the original user prompt at the head.
  • Drop old tool results. Keep the tool calls (preserving the reasoning trace) but drop or truncate their outputs.
  • Truncate tool outputs. Keep every message, but cap each tool result at a max character count — either at the strategy layer (re-applied each turn) or inside the tool’s implementation (capped once at execution time, then stored capped).
  • Summarize old turns. Once history exceeds a threshold, invoke another LLM call to produce a compact summary that stands in for the dropped turns. This is what opencode and pi’s own coding-agent both do on overflow.
  • Retrieve on demand. Keep a full log, embed each turn, and on each call include only the top-k most relevant turns to the current goal. No one ships this in a real CLI yet that we’re aware of.

These patterns aren’t mutually exclusive — most production strategies stack two or three, e.g. truncate every tool result, then summarize older turns, then drop pre-summary content.

Cache stability

One property dominates cost regardless of which layer compression happens at, and is therefore extremely important to get right. That property is cache stability. Every modern LLM API charges different rates for tokens it has seen recently versus tokens it’s seeing for the first time. Gemini’s implicit prefix cache, Anthropic’s cache_control blocks, and OpenAI’s prompt cache all work the same way structurally: the provider hashes the leading byte sequence of your request, looks for a match against recent requests, and if it finds one, charges you a much lower rate for those tokens (typically 10–25% of the uncached price; details differ by provider). The cached portion has to be a prefix — a contiguous identical sequence starting at byte 0. The first byte that differs invalidates everything after it.

A strategy is cache-stable if it mutates the prefix at most a bounded number of times in known places. Concretely: a strategy may transform a position once — for example, truncate a tool result that just aged out of the recent-K window, or replace older history with a frozen summary on compaction — but once transformed, the bytes at that position never change again. Each transformation costs one cache-write at that position; every turn after that lands on a cache hit. The strictest version is baseline, where every message stays bit-identical from the moment it’s appended and never transforms at all. A strategy doesn’t have to be this rigid to be cache-friendly: as long as transformations happen at known points and stay frozen after, the cache amortizes cheaply.

An agent’s conversation grows monotonically: user → tool calls → tool results → assistant → tool calls → tool results → assistant. By turn 30 you’re sending ~100K tokens of mostly-stable history per call, every call. If your strategy keeps the prefix stable, those 100K tokens are mostly cache hits and you pay full rate only on the few hundred new tokens at the tail. If your strategy mutates the prefix every turn, those same 100K tokens are all uncached, and your bill scales 4–10× higher with no behavioral benefit.

A simple cache-stable example is a cap on tool results: every turn, the strategy walks the conversation and truncates each tool result’s text to 500 chars. Because the rule is deterministic and the underlying tool-result text doesn’t change, the truncated version is bit-identical at the same position on every subsequent call.
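In code, that rule is a few lines. A sketch over a simplified message shape (not any framework’s real types):

type Msg = { role: "user" | "assistant" | "toolResult"; text: string };

// Deterministic per-turn cap: the same stored tool result always produces the
// same truncated bytes, so the already-sent prefix keeps matching the cache.
function truncateToolResults(messages: Msg[], cap = 500): Msg[] {
  return messages.map((m) =>
    m.role === "toolResult" && m.text.length > cap
      ? { ...m, text: m.text.slice(0, cap) + `…[truncated ${m.text.length - cap} chars]` }
      : m,
  );
}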

An example of the opposite — what we call cache-hostile — is a sliding window: it re-arranges what lands at each position every turn, so cache hits collapse past the seed prompt and every token is billed at the uncached rate. There’s no “after this, frozen” — there’s continuous churn. The difference between cache-stable and cache-hostile is the dominant cost factor across runs.

The corollary is that there are really only two cache-friendly shapes a strategy can take: append-only (only ever modify the tail — baseline and the truncate variants both fit here, since once a tool result is truncated to N chars those N chars never change) or freeze-once (make one big rewrite — typically compaction — and never touch it again, so the post-rewrite prefix becomes a new stable frozen prefix). Anything else — periodic re-compaction, dynamic-window eviction keyed to the current turn, in-place summary rewriting — mutates the prefix turn over turn and is cache-hostile by default. Production CLIs do re-compact periodically without paying that cost, but only by leaving prior summaries frozen, appending new summary blocks rather than rewriting old ones, and using explicit cache hints (Anthropic’s cache_control breakpoints; OpenAI’s prompt_cache_key, which groups requests rather than marking a position) so the provider keeps landing requests on the stable prefix.

Lossy storage vs. lossy view

Standing separately is the question of where compression happens — and therefore what gets stored in the conversation. Two options, same observable effect on the model:

  • At the tool layer (lossy storage). The tool caps or rewrites its return value before the result enters the conversation. The compressed text is what gets appended to history and stays there forever. opencode’s read_file does this — the rest of the section on tool-output design covers it later. Once the tool returns a 50KB slice, the rest of the file isn’t anywhere local; to recover it the agent has to call the tool again with a different offset.
  • At the strategy layer (lossy view, lossless storage). The tool returns its full output. The full text is appended to history. Every turn, a per-turn hook (in pi this is called transformContext, covered later when we get to our implementation) re-derives a compressed view over the full history for that LLM call only — truncating, summarizing, stubbing, dropping, whatever the strategy does. The conversation log retains every byte forever; only the model’s per-turn view is reduced. All four strategies we benchmark (baseline, truncate-500, age-truncate-500-keep-3, compact-at-12000-structured) work this way.

The difference doesn’t show up in the LLM’s prompt — both designs produce the same text. It shows up in what stays on disk:

  • Recoverability. Strategy-layer compression is reversible: swap the strategy mid-run (or replay the log later with a different strategy) and the full text comes back. Tool-layer compression is irreversible without another tool call.
  • Compositionality. Strategy-layer lets you experiment with different views over the same underlying log. Tool-layer freezes the data permanently in its first form.
  • Runtime cost. Strategy-layer does compression work on every turn (cheap, but non-zero). Tool-layer does it once.

This split is reflected in how agent frameworks expose hooks. Tool-layer compression doesn’t need a framework hook — tools are just functions you write, so capping at the tool layer means putting the cap in the tool’s implementation. The strategy layer is different: it runs on every turn against a moving target (the growing conversation), so the framework has to expose an entry point for it.

Our experiment uses pure strategy-layer compression precisely so the full conversation is preserved — every run records the unmodified tool output (via a logger we added on top of pi’s event stream — pi keeps the conversation in memory but doesn’t persist anything by itself), and the strategy variants are pure replays over the same data.

Tool output design: the other half of the picture

Everything so far has operated at the context-assembly layer: transformContext runs on a message list that’s already in hand. But there’s a parallel design space one layer down: what the tools themselves choose to return. A tool that dumps raw output makes your strategy do all the work. A tool that bounds its own output makes your strategy’s job smaller — sometimes vanishingly so.

Two dimensions matter at the tool layer:

  • Per-call cap. A maximum size the tool will ever return in a single call. opencode’s read_file caps at ~50KB total / 2,000 chars per line. Claude Code’s read tool caps at 256KB on the file side and 25K tokens on the rendered output. Beyond the cap, content is omitted from the return value and the agent doesn’t see it unless it asks again.
  • Pagination. Whether the agent can ask for the next slice. Both opencode and Claude Code accept offset / limit parameters on their read tool, so a 200KB file becomes four sequential read_file calls instead of one truncated read. The conversation ends up with four small, cache-stable tool results rather than one big partially-truncated one.

The interaction with the context-assembly layer is direct: if your tools self-bound, your strategy has less work to do. opencode keeps the entire conversation in context (no message-level dropping or per-conversation truncation in their default path) and gets away with it because each tool result is already small by construction. Running a baseline strategy on top of opencode’s tools would behave very differently than running baseline on top of a tool that returns 1MB of raw text — even though it’s the same baseline.

A tool that paginates is, in a real sense, doing lossless compression: nothing is permanently dropped, the rest of the file is still on disk, and the agent can fetch on demand. A tool that just caps without pagination is doing lossy compression: anything past the cap is invisible until the tool’s contract changes.
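To make the distinction concrete, here is a sketch of a self-bounding, paginated read tool. The cap, parameter names, and marker text are illustrative, not opencode’s or Claude Code’s actual values:

import fs from "node:fs";

// Tool-layer bounding: return at most `limit` lines starting at `offset`, and
// tell the model how to ask for the next slice. Nothing is lost; the rest of
// the file stays on disk, fetchable with another call.
function readFilePaged(filePath: string, offset = 0, limit = 500): string {
  const lines = fs.readFileSync(filePath, "utf8").split("\n");
  const slice = lines.slice(offset, offset + limit).join("\n");
  const remaining = Math.max(0, lines.length - (offset + limit));
  return remaining > 0
    ? `${slice}\n…[${remaining} more lines; call again with offset=${offset + limit}]`
    : slice;
}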

How real CLIs handle context

Before we pin down which strategies we’ll measure, it’s worth seeing how production CLIs actually solve the problem. Each one picks a specific blend of the patterns from the taxonomy, sometimes informed by what their target provider’s API exposes. Here’s a quick survey of what’s visible in source.

Claude Code

In March 2026 Anthropic shipped @anthropic-ai/claude-code v2.1.88 with a ~60 MB source map, exposing ~512K lines of TypeScript. Several teardowns (Straiker, Karan Prasad) and a verbatim prompt archive (Piebald-AI/claude-code-system-prompts) appeared shortly after. That made Claude Code by far the most empirically grounded reference point in this article — it’s the only major closed-source agentic CLI whose implementation we can actually read.

Cache stability is one of the things Claude Code clearly cares about most — the leak shows multiple deliberate, mutually reinforcing moves to keep the prefix bit-stable across requests.

The most visible of those moves is a static / dynamic prompt split. Claude Code’s prompt is structured as a fixed-position layout where the early blocks — system prompt, tool descriptions, workspace summary, CLAUDE.md contents — are bit-identical across every request in the session. They’re explicitly marked with Anthropic’s cache_control: { type: "ephemeral" } to tell the API that this prefix is cacheable. The variable blocks come after, in known positions. Schematically, every request looks like the first example below; the second shows how the shape changes once compaction fires:

// Early in a session, before history has crossed the compaction threshold.
await client.messages.create({
  model: "claude-...",
  system: [
    // ─── STATIC: bit-identical across turns ─────────────────
    { type: "text", text: SYSTEM_PROMPT },                    // ~5KB, never changes
    { type: "text", text: TOOL_DESCRIPTIONS },                // ~8KB, never changes
    { type: "text", text: workspaceSummary },                 // computed once at session start
    { type: "text", text: CLAUDE_MD_CONTENTS,
      cache_control: { type: "ephemeral" } },                 // 👈 cache breakpoint #1
                                                              // (everything above this point is cached)
  ],
  messages: [
    // ─── frozen prefix: just the seed user message ─────
    { role: "user", content: SEED_USER_MESSAGE,
      cache_control: { type: "ephemeral" } },                 // 👈 cache breakpoint #2 (on the seed)
    // ─── tail: every turn so far, appended ─────────────
    ...allTurnsSoFar,
  ],
});
// After compaction has fired at least once: the older portion of the
// conversation has been replaced by a frozen summary, and breakpoint #2
// has shifted forward to land on it.
await client.messages.create({
  model: "claude-...",
  system: [
    // ─── STATIC: bit-identical across turns ─────────────────
    { type: "text", text: SYSTEM_PROMPT },                    // ~5KB, never changes
    { type: "text", text: TOOL_DESCRIPTIONS },                // ~8KB, never changes
    { type: "text", text: workspaceSummary },                 // computed once at session start
    { type: "text", text: CLAUDE_MD_CONTENTS,
      cache_control: { type: "ephemeral" } },                 // 👈 cache breakpoint #1
  ],
  messages: [
    // ─── frozen prefix: bit-stable from compaction onward ─────
    { role: "user", content: SEED_USER_MESSAGE },             // first user message in the run
    { role: "assistant", content: FROZEN_SUMMARY,
      cache_control: { type: "ephemeral" } },                 // 👈 cache breakpoint #2 (last frozen item)
    // ─── tail: appended each turn since compaction ───────────
    ...recentKMessages,
  ],
});

Two breakpoints, in fixed positions; only the tokens appended at the tail are billed at the uncached rate. Why two and not one? Because each cache_control marker creates an independent cache entry — letting different parts of the prompt invalidate at different rates (the static block changes ~never, the post-summary section changes once per compaction, the tail every turn) and giving you fallback hits when one entry’s TTL expires before the others.

The shape transition between the two tabs is the only place the breakpoint #2 prefix changes during a session. When compaction fires, two things happen at once: the older portion of the tail collapses into the new FROZEN_SUMMARY block, and breakpoint #2 shifts forward from “on the seed” to “on the frozen summary.” That single transition costs one cache invalidation — the API has to write a new entry at the new boundary — but every turn after lands on the new, longer cached prefix.

What about the next compaction, and the one after that? Regardless of how compaction repeats, the request always carries the same two cache breakpoints (cache_control is per-request, so only the markers in the current call matter). Breakpoint #1 stays anchored at the end of CLAUDE.md; breakpoint #2 sits on whatever the most recent stable item is at the moment of the request. You don’t keep adding breakpoints as the session goes — Anthropic’s 4-per-request cap is a budget, not a counter that grows with session length.

What can differ across compaction events is how the frozen content is laid out behind that single breakpoint #2 marker. Two reasonable designs:

  • Rotating (Claude Code’s choice). There’s only ever one FROZEN_SUMMARY slot. On each subsequent compaction event the previous summary plus the new tail get re-summarized into a fresh single block, replacing the old one. The bytes at the FROZEN_SUMMARY position change — so breakpoint #2’s cache entry has to be re-written each time. One cache invalidation per compaction event, but the prompt stays compact ([seed + summary + recent] shape, regardless of session length).
  • Chained. Each new summary is appended after the previous frozen summaries; breakpoint #2 moves forward to land on the newest one. The prefix grows — [seed, summary_1, summary_2, ..., recent] — but each previous summary stays byte-stable, so older cache entries (still in the provider’s pool from earlier requests) can serve as fallback hits even though they’re no longer marked in the current request. More cache-friendly, but the prompt grows linearly with compaction count, so you’d eventually have to compact the chain itself.

Claude Code’s rotating pick trades occasional cache invalidations for prompt compactness. The math favors compactness because compactions are rare relative to turns — you might fire one every few dozen turns, eat one cache invalidation, then ride the new entry for several thousand cached tokens until the next compaction. The cost: each re-summarization is a lossy operation on top of an already-lossy summary, so detail compounds away over a long session.

Anthropic’s prompt-caching documentation has the full mechanics. The short version: each cache_control directive creates a new entry in the cache, anchored at byte 0 and ending at the marker’s position. So the two-breakpoint example above writes two nested entries — one ending at the CLAUDE.md boundary, one ending at the frozen summary. On the next request, the API tries to land the longest cached prefix first and falls back to shorter ones if the longer one no longer matches. This lets different parts of the prompt invalidate at different rates and keeps you landing partial hits instead of all-or-nothing.

Anthropic’s inline-marker pattern is the cleanest version of this among the major providers. OpenAI is purely automatic — the API decides where to write entries (~5-minute TTL); you can group requests with prompt_cache_key but you can’t mark a position. Gemini offers both: an automatic implicit cache, plus an explicit cachedContents API where you pre-create a cached resource with a configurable TTL and reference it by name on subsequent calls (different ergonomics from Anthropic’s in-request markers). Anthropic lets you mark up to 4 byte positions inline per request, with opt-in 1-hour TTL at a higher write surcharge. We come back to the OpenAI/Codex flavor — automatic + prompt_cache_key — in the next section.

Every transformation Claude Code does on its prompt is a pure function of well-defined state (turn count, tool call args, message position) and never depends on transient signals like the current branch’s HEAD hash. Anything that drifts turn over turn would force the prefix to mutate on every request and collapse the cache. That’s how a 30-turn coding session can pay near-baseline cost despite sending ~100K tokens per call.

Per-tool aging policies

Claude Code keeps a hardcoded list of tool names — named COMPACTABLE_TOOLS in the leak — that are subject to per-turn aging. In our taxonomy this is a per-turn strategy at the strategy layer, despite the “compactable” naming (which suggests at-thresholds compaction — a separate mechanism Claude Code also has). Tools not on the list are exempt: their results are kept verbatim forever.

The interesting move here is to step away from the usual two extremes for handling tool output: always verbatim (what most strategies default to) and always capped or paginated (what opencode’s read_file does at the tool layer). Aging adds a third option that combines both — keep recent results verbatim, age older ones down. Same tool, different treatment depending on how stale the result is. That solves a problem either extreme creates on its own: always verbatim lets stale 50KB blobs accumulate forever; always capped can cut off the file the agent just opened (which is exactly the truncate-500 failure mode shown below).

Claude Code takes the idea one step further: not just when to age but how to age. Each tool on the list gets a different rule, tuned to how that tool’s output retains value over time:

| Tool | Aging rule |
| --- | --- |
| read_file | Once older than the last 3 reads, replace the result with a one-line stub naming the path |
| list_files | Once older than the last 2 listings, truncate to 200 chars |
| write_file | Always keep verbatim |
| run_tests | Always keep verbatim |

A small clarification on what “older than the last 3 reads” actually means: it counts calls of the same tool, not turns. To make that concrete, suppose the agent’s call history so far is:

turn 1:  read_file(api.ts)         ← 1st read
turn 2:  list_files(./src)
turn 3:  read_file(api.ts)         ← 2nd read
turn 4:  run_tests()
turn 5:  read_file(storage.ts)     ← 3rd read
turn 6:  read_file(api.ts)         ← 4th read
turn 7:  read_file(serializer.ts)  ← 5th read

After turn 7, looking only at read_file calls (the ones at turns 1, 3, 5, 6, 7), the 3 most recent are turns 5, 6, and 7. So:

  • Reads at turns 1 and 3 → stubbed.
  • Reads at turns 5, 6, 7 → kept verbatim.

If turn 8 is run_tests() (not a read), nothing changes. The moment turn 9 is another read_file, the read at turn 5 ages out — it becomes the 4th-most-recent read — and gets stubbed. Each tool is independently ranked by call recency, and the top-K of each rank stay verbatim.
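Here is a sketch of that ranking rule (our reconstruction of the shape, not Claude Code’s actual code). For simplicity it stubs every aged result, where the real policy varies per tool: stub old reads, hard-truncate old listings, exempt writes and test runs entirely:

type ToolResult = { tool: string; args: string; text: string };

// Rank each tool's results by call recency; keep the K most recent results of
// each tool verbatim and replace the rest with a stub. Tools absent from
// `keepPerTool` are never aged.
function agePerTool(
  results: ToolResult[],
  keepPerTool: Record<string, number>, // e.g. { read_file: 3, list_files: 2 }
): ToolResult[] {
  const seenSoFar: Record<string, number> = {};
  return results
    .slice()
    .reverse() // walk newest -> oldest so we can count per-tool recency
    .map((r) => {
      const rank = (seenSoFar[r.tool] = (seenSoFar[r.tool] ?? 0) + 1);
      if (rank <= (keepPerTool[r.tool] ?? Infinity)) return r;
      return { ...r, text: `[stale ${r.tool}(${r.args}) result omitted; call again to refresh]` };
    })
    .reverse(); // restore original order
}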

The reasoning behind each row, in plain terms:

  • A stale list_files is almost pure noise — once the agent has moved past exploring the directory, the old listing has near-zero ongoing value. So it’s truncated hard and fast.
  • A stale read_file result is more nuanced: the agent might want that file again. Instead of truncating its text, Claude Code replaces it with a stub that names the path; the agent can re-fetch by calling read_file with the same args. Lossless in the sense that nothing’s gone, just deferred.
  • A stale write_file represents an action the agent took — it modified that file. Forgetting you wrote something is a recipe for re-writing it differently the next turn. Kept verbatim.
  • A stale run_tests carries authoritative test-suite state that the agent often re-checks against. Kept verbatim.

The shape is the same as our age-truncate-500-keep-3 — cache-stable, age-keyed, deterministic. The generalization is per-tool granularity instead of one uniform rule for all tool results. On long sessions where many tools are called many times, the granularity pays off because each tool’s eviction matches its actual value-retention curve. On our 10–20 turn runs the agent calls each tool only a handful of times, so the nuance doesn’t have room to manifest — a uniform age-truncate already captures most of the savings.

The structured-summary template

The last Claude Code design worth pulling out is the summarization prompt itself — the system prompt sent to the model when compaction fires. The leak ships it as system-prompt-context-compaction-summary.md. Our compact-at-12000-structured strategy reuses a paraphrased version of this exact prompt — same five sections, same constraints — so the summary it produces has the same shape Claude Code’s does:

You produce continuation summaries for coding agents that have run out of context.

Output the summary wrapped in <summary></summary> tags, with the following five sections
as level-2 markdown headings, in order:

## Task Overview — what the user asked for, in one or two sentences.
## Current State — files created, modified, or analyzed, listed with their full paths;
                   state of the test suite; open work.
## Important Discoveries — key facts the agent learned, including approaches that did
                           NOT work and why.
## Next Steps — the immediate action the continuing agent should take.
## Context to Preserve — user preferences, promises made, constraints that must not
                         be violated.

Be specific. Cite exact filenames. No filler. No conversational framing.

Three details about this design make it work:

  • Section ordering mirrors a human handover. Task Overview → Current State → Discoveries → Next Steps → Context to Preserve is roughly how an engineer briefs a colleague taking over a task: what’s the goal, where are we, what did we learn, what’s next, what shouldn’t I drop on the floor. The post-compaction agent reads it the same way.
  • “Approaches that did NOT work” is in Discoveries. Without that explicit prompt, summaries tend to focus on what was achieved and quietly drop the dead ends — leading the post-compaction agent to retry the same failed approaches and burn turns. This single phrase prevents a specific failure mode that freeform summary prompts occasionally hit.
  • “Cite exact filenames. No filler. No conversational framing.” The template forces the summary to be actionable rather than narrative. Filenames are also bit-stable across regenerations, which preserves cache hits if a multi-round implementation later re-summarizes incrementally.

Three lines of constraint added to a vanilla “summarize the transcript” prompt, and the resulting summary is sharper, smaller, and easier for the post-compaction agent to act on. The lesson that generalizes: the summarization prompt itself is part of the strategy — not a detail you can leave to the model’s defaults.
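As a sketch of where the template plugs in (the summarize callback stands for whatever one-shot LLM call the harness exposes; the message shape is simplified):

type Msg = { role: string; text: string };

// One-shot compaction: fold everything between the seed message and the last
// `keepRecent` messages into a single frozen summary block.
async function compactOnce(
  messages: Msg[],
  keepRecent: number,
  template: string, // the five-section summarization prompt shown above
  summarize: (systemPrompt: string, transcript: string) => Promise<string>,
): Promise<Msg[]> {
  const seed = messages[0];
  const older = messages.slice(1, -keepRecent);
  const recent = messages.slice(-keepRecent);
  const transcript = older.map((m) => `${m.role}: ${m.text}`).join("\n");
  const summary = await summarize(template, transcript);
  // [seed, frozen summary, ...recent]: the summary is never rewritten, so the
  // post-compaction prefix is byte-stable from the next turn onward.
  return [seed, { role: "assistant", text: summary }, ...recent];
}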

Codex

Codex takes a much simpler approach to the same problem. Both Claude Code and Codex care deeply about cache stability — they just achieve it with different amounts of machinery. Claude Code makes it an explicit contract: cache_control breakpoints in the request, deterministic eviction, hand-tuned prompt structure. Codex relies on append-only history plus OpenAI’s prefix cache — the API decides on its own where to write cache entries and for how long, and Codex just doesn’t get in its way.

In our taxonomy’s terms, Codex’s steady-state behavior is the do-nothing strategy — send the conversation verbatim, append-only, no per-turn rewriting, prefix grows monotonically. The interesting additions kick in only as fallbacks. Two of them work in series:

  • At-thresholds compaction is the primary fallback — proactive. It fires when the conversation crosses a Codex-controlled threshold set below the model’s actual context window, so it triggers before the API would reject. Mechanism: an extra LLM call summarizes the older portion, and the conversation is rebuilt as [summary, recent...]. Costs one summarizer call but keeps a coherent stand-in for the dropped detail.
  • Emergency trim-from-start is the panic button — reactive. It fires only if a normal request still returns ContextWindowExceeded after compaction has already happened (which can occur if the post-compaction tail has grown back, or if even compaction’s own output is too large). Mechanism: drop the oldest message, retry; drop the next-oldest, retry; loop until the request fits. No LLM call, but items are gone entirely with no summary, and each retry is a wasted billed request.

During the bulk of any session — before the first compaction fires — Codex’s behavior is identity at the strategy layer. The fallbacks above are what stops a do-nothing approach from crashing in long-running sessions, not how it handles the steady-state cost. (Codex also has tool-layer caps, covered separately below.)

Strategy-layer details for the curious. Cache stability rides on more than just the append-only prefix — Codex sets a per-conversation cache key (prompt_cache_key = conversation_id) to scope OpenAI’s cache to the session. Compaction is implemented in its own module; its summarization prompt is a structured template — framed as a “handoff summary for another LLM,” same conceptual move as Claude Code’s 5-section template above, just less rigidly structured. Both production CLIs have learned the same lesson: a freeform “summarize the transcript” prompt isn’t enough; you want a contract on what the summary must include. The post-compaction tail is capped at 20K tokens. The emergency trim path carries an explicit rationale comment: “to preserve cache (prefix-based) and keep recent messages intact.”

Tool-layer details. Codex’s two tool-layer mechanisms run at execution time, so the conversation only ever stores the already-truncated version — same architectural slot as opencode’s read_file 50KB cap, the lossy-storage flavor of compression covered in the Tool output design section earlier. First, the shell tool has a 1 MiB hard output cap; above that the model has to contrive pagination via sed -n '...p' itself. Second, recorded shell output passes through a middle-truncation TruncationPolicy — middle-truncation keeps the first N bytes and last M bytes verbatim and replaces the middle stretch with a ...[truncated K bytes]... marker, on the bet that for shell output the command echo at the head and the exit status at the tail are the bytes that carry signal.
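Middle-truncation itself is a few lines. A sketch, with character counts standing in for Codex’s byte limits and marker format:

// Keep the head and tail verbatim (for shell output, the command echo and the
// exit status) and replace the middle stretch with a marker.
function truncateMiddle(text: string, head = 2_000, tail = 2_000): string {
  if (text.length <= head + tail) return text;
  const dropped = text.length - head - tail;
  return `${text.slice(0, head)}\n…[truncated ${dropped} chars]…\n${text.slice(-tail)}`;
}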

opencode

opencode keeps the entire conversation in context without per-conversation truncation. Compression happens at two layers:

  • Tool layer. Tools self-bound at execution time: read_file caps at ~50KB / 2,000 chars per line, grep matches paginate, and so on. The conversation accumulates many small results rather than a few huge ones.
  • Strategy layer. When the conversation crosses a threshold, opencode runs multi-round summarization (similar shape to Claude Code and Codex) to compact older history.

The split: bounded tools handle most of the size pressure, compaction picks up what’s left. opencode’s bet is that careful per-tool design lets the strategy layer stay light.

pi-coding-agent

The bare Agent (what our experiment uses) ships no default — that’s what made it convenient to compare strategies cleanly. The higher-level pi-coding-agent package built on top of Agent ships multi-round summarization on overflow, similar in shape to opencode and Codex. Tools are defined per-agent; the bundled read_file doesn’t impose its own cap.

The strategies we implement

Now that we’ve seen what production CLIs actually do, we can pick a small set of representative strategies to measure. Every strategy in this article is benchmarked against baseline — an identity strategy where nothing is dropped or rewritten, and the prefix grows monotonically across turns, so cache hits are at their theoretical maximum.

Each strategy runs on the same workload — same fixture (described below), same model, same prompts. For each strategy we run the agent 3 times (k=3) — because even at temperature=0, Gemini Flash trajectories diverge run-to-run, so we collect 3 data points per strategy to compute median/spread rather than bet on a single trial. We measure pass rate, total cost, turns, peak prompt size, and how many of the input tokens were billed at the cached vs. uncached rate.

We’ll walk through one fixture in detail — a single-file bug-fix task we’ll call the main fixture — and report the per-strategy results, transcripts, and per-turn diffs for it. We picked one canonical instance per pattern rather than running parameter sweeps; the goal is to teach the shapes, not benchmark every setting. Each row below corresponds to a pattern from the production CLIs above:

| pattern | strategy | mode |
| --- | --- | --- |
| No transformation (control) | baseline | — |
| Truncate tool outputs (uniform) | truncate-500 | per-turn |
| Truncate tool outputs (age-aware) | age-truncate-500-keep-3 | per-turn |
| Summarize older turns (structured) | compact-at-12000-structured | at thresholds |

A note on naming. Each strategy’s name follows a <family>-<param>[-<modifier>] shape. So age-truncate-500-keep-3 reads as: family = age-truncate (age-aware truncation), 500 = the character cap on older results, keep-3 = the 3 most recent tool results pass through verbatim. Likewise compact-at-12000-structured is: family = compact, at-12000 = fires when the conversation crosses 12,000 chars, structured = uses the structured-summary template (vs a freeform variant). Each numeric token has its meaning prefixed by the family knob it parameterizes.

The remaining patterns from the taxonomy — drop old turns (sliding-N), retrieve on demand, replace old reads with stubs (Claude Code’s COMPACTABLE_TOOLS shape), freeform summaries — are intentionally out of scope. Sliding-N and retrieval because they’re cache-hostile by construction (the prefix mutates every turn). Stub-old-reads and freeform compaction because they’re variations on the patterns we do measure (stub-old-reads is a per-tool form of age-truncate; freeform compaction is the same shape as structured compaction with a different summarizer prompt). And production-grade multi-round compaction because it’s a separate engineering problem (which threshold, what to keep verbatim, whether to chain summaries) that warrants its own article.

The tables below regroup the same set by mode — the axis the article’s cost analysis turns on. (Baseline, which applies no transformation, sits outside both groups.)

Per-turn (transformation applied on every LLM call).

| strategy | what it drops | implementation |
| --- | --- | --- |
| truncate-500 | text past 500 chars in every tool result | map over tool results |
| age-truncate-500-keep-3 | text past 500 chars in older tool results only | position-aware truncate |

At thresholds (fires once when the conversation crosses a size limit, then freezes).

| strategy | what it drops | implementation |
| --- | --- | --- |
| compact-at-12000-structured | all turns before a fixed split point, replaced by an LLM summary using Claude Code’s 5-section template | one LLM call, frozen summary |

For each strategy in detail — what it does, the implementation in one line, cache behavior, cost shape, when to use it, and who ships it in production — see the strategy reference below.

Strategy reference: all strategies side-by-side.

baseline
  • mode: none (reference)
  • what it does: transformContext returns the message list unchanged. The model sees the entire conversation on every call.
  • implementation: messages => messages — the identity transform; three lines of code.
  • cache behavior: Maximally cache-friendly. The prompt prefix grows monotonically — every byte from position 0 is identical across turns.
  • cost shape: Linear in conversation length. Per-turn cost grows with each appended message; the last few turns of a 30-turn run are the most expensive of the run.
  • when to use it: Short sessions (under ~20 turns) where the conversation fits in budget and window. Also: any session where the cost of losing information would exceed paying for the full prompt.
  • who ships it: Every CLI ships baseline implicitly when no strategy is configured. Default in pi (when transformContext is omitted), opencode, and Claude Code before compaction fires.

truncate-500
  • mode: per-turn
  • what it does: Keep every message, but cap each tool result’s text at 500 characters with a …[truncated K chars] marker. Uniform cap, applied to every tool result regardless of age.
  • implementation: Walk the message list; for each tool result whose text exceeds 500 chars, replace the tail with the marker.
  • cache behavior: Cache-stable. The cap is deterministic — once truncated to 500 chars, a result stays exactly 500 chars on every subsequent turn.
  • cost shape: Smaller prompts than baseline, cache preserved. Cheap in the median when it works. Failure mode: if a bug lives past character 500 of a file the agent reads, the agent never sees it. Passes 0/3 on the main fixture and 0/3 on the multi-file fixture.
  • when to use it: Almost never at this aggressive a cap — it actively hides bugs. Demonstrates that the uniform truncation pattern is dangerous without an age qualifier.
  • who ships it: Variants of this pattern with larger caps (opencode caps tool output at ~2,000 chars at the tool layer); 500 is the failure-mode example we added.

age-truncate-500-keep-3
  • mode: per-turn
  • what it does: Same 500-char cap as truncate-500, but applied only to older tool results. The 3 most recent tool results are kept verbatim regardless of length.
  • implementation: Index tool results oldest → newest. Truncate text past 500 chars on all but the last 3. Recent results pass through unchanged.
  • cache behavior: Cache-stable. The decision depends only on a tool result’s position in the message list, which is monotonic — once a result becomes “old enough”, it stays truncated.
  • cost shape: Caps cumulative growth without hiding the file the agent is currently looking at. Ties baseline on cost, passes 3/3. On the multi-file fixture: 50% cheaper than baseline at the same pass rate; on the four-bug fixture: the only non-baseline strategy that passes 3/3.
  • when to use it: The article’s default recommendation for short coding-agent sessions (10–50 turns). Avoids truncate-500’s “hide the bug” failure while still tightening the middle.
  • who ships it: Not directly. A generalization of Claude Code’s per-tool eviction policy. We propose it explicitly because it’s the lightest strategy satisfying cache stability, working-set preservation, and bounded growth.

compact-at-12000-structured
  • mode: at thresholds
  • what it does: Track conversation size. When it crosses 12,000 chars, fire one extra LLM call to summarize the older portion using Claude Code’s 5-section template (Task Overview / Current State / Important Discoveries / Next Steps / Context to Preserve), wrapped in <summary> tags. Replace history with [first user message, frozen summary, recent messages]. Fires once per run; the summary is frozen forever.
  • implementation: Count chars across messages. If past the threshold, slice off the older portion, call the summarizer model with the structured prompt, save the result. From that turn on, return the trimmed-and-summarized list every time.
  • cache behavior: Cache-stable after the first compaction (the summary is bit-stable once generated). Before compaction, identity. The single transition is the only place the prefix changes shape.
  • cost shape: The extra LLM call costs something. On short sessions: ties baseline ($0.016 on the main fixture) because the post-compaction prompt is small enough that the savings absorb the summarization cost. The structured template is shorter and more actionable than freeform alternatives.
  • when to use it: Sessions that reliably cross the threshold. The CC template is a strict improvement over freeform compaction — same infrastructure, more focused prompt.
  • who ships it: Paraphrased from system-prompt-context-compaction-summary.md in the leaked Claude Code source. opencode and pi-coding-agent both run compaction at thresholds (production versions are multi-round; ours is single-shot to isolate first-fire behavior).

How pi wires them

The strategies above are agent-framework-agnostic — they describe what to do with the message list. To actually run them in our experiment we need a place to plug them in. We use pi because, unlike most agentic CLIs, it exposes context assembly as a first-class extension point — a function you write — which makes strategies trivially swappable for comparison.

Pi is structured so that, on every iteration of the agentic loop — right before sending the message history to the LLM, after the latest tool results have been appended to that history — the agent calls two user-overridable hooks between “the current transcript” and “what the LLM actually sees”:

new Agent({
  initialState: { systemPrompt, model, tools, thinkingLevel: "off" },
  // Structural layer: prune, summarize, or inject messages.
  transformContext: async (messages) => { /* ...your logic... */ },
  // Mapping layer: filter or translate custom message types.
  convertToLlm: (messages) => messages.filter(/* ... */),
});

transformContext is the hook that decides what conversation the agent should have. It takes the full conversation as AgentMessage[] (pi’s name for the unified message type that covers user, assistant, and tool-result messages) and returns a (possibly modified) AgentMessage[]. Same type in, same type out. This is where every strategy in the taxonomy above lives: cap each tool result to N chars (truncate-500), cap only older ones (age-truncate-500-keep-3), summarize at threshold (compact-at-12000-structured), and so on.

convertToLlm is the hook that packages that conversation for the wire. It runs on the output of transformContext and translates the internal AgentMessage[] into the provider-specific Message[] that actually gets sent to Anthropic, Gemini, or OpenAI — filtering custom message types the provider doesn’t understand, fixing content blocks for models that don’t support attachments, and so on.

For this article we leave convertToLlm at its default (the identity filter for standard message roles) and focus entirely on transformContext. In pi’s architecture, a context-assembly strategy is just a function:

type Strategy = (messages: AgentMessage[]) => Promise<AgentMessage[]>;

That signature is all the extension surface there is. Because it’s code (not a config blob), a strategy can do arbitrary work: call another LLM to summarize old turns, embed past messages and retrieve by similarity, read files from disk, maintain state across turns via a closure. The strategies we walked through above span from three lines (baseline) to a few dozen (compact-at-N-structured), but they all share this same shape — and they’re all swappable by passing a different function to the same hook.
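To make that shape concrete, here is roughly what age-truncate-500-keep-3 looks like as such a function. A sketch with pi’s AgentMessage reduced to a role and a text field (the real type carries more structure):

// Simplified stand-in for pi's AgentMessage; not the real type.
type AgentMessage = { role: string; text: string };

const ageTruncate500Keep3 = async (messages: AgentMessage[]): Promise<AgentMessage[]> => {
  // Index tool results oldest -> newest; the last 3 are protected.
  const toolResultIdxs = messages.flatMap((m, i) => (m.role === "toolResult" ? [i] : []));
  const protectedIdxs = new Set(toolResultIdxs.slice(-3));

  return messages.map((m, i) => {
    if (m.role !== "toolResult" || protectedIdxs.has(i) || m.text.length <= 500) return m;
    return { ...m, text: m.text.slice(0, 500) + `…[truncated ${m.text.length - 500} chars]` };
  });
};

Passing a function like this as transformContext in the Agent constructor above is the entire integration.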

To make this concrete, here’s what pi does on a single turn — picking up mid-session, after the conversation history already holds the user’s seed prompt, several rounds of LLM responses, and a stack of tool results from earlier turns:

  1. transformContext runs over the full conversation history — including any 50KB tool results sitting in there untouched — and produces the message list to send. The strategy decides what to do with each piece: pass through (baseline), truncate uniformly to 500 chars (truncate-500), truncate only older results (age-truncate-500-keep-3), fold older history into a summary (compact-at-12000-structured), and so on. Pi sends the resulting list to the LLM. The LLM sees the strategy’s view, not the original.
  2. The LLM responds — text, tool-call intents, or both.
  3. If the LLM emitted tool calls, pi executes each one. Each tool returns its full output (e.g., 50KB of file content). Pi appends both the LLM’s response and each tool result to the conversation history.
  4. Loop back to step 1.
  5. Repeat until the LLM emits a response with no tool calls — that’s the agent’s signal to stop.

Crucially, the original full tool outputs never leave the conversation log on the pi side. They’re invisible to the LLM (because the strategy is summarizing or truncating them out of the prompt), but they’re recoverable — swap the strategy mid-run or replay the log later, and the full text comes back. This is the “lossless storage” property from the previous section, made concrete.

One detail worth being explicit about: in pi, the bare Agent class ships no default strategy. If you instantiate new Agent({...}) without supplying transformContext, you get the baseline identity behavior — the entire conversation is sent every turn. The higher-level pi-coding-agent package built on top of Agent does ship a default (multi-round summarization on overflow, similar shape to opencode). We use the bare Agent for these experiments so that every strategy in the comparison is something we wrote explicitly, with nothing built-in to control for.

Experiment setup

The fixture we’re going to use to test different strategies on is the service-and-storage layer of a TODO web app. Two classes do the work: TaskStore keeps an in-memory list of tasks, and Api is a thin dispatch layer that takes request objects and routes them to the store:

export type ApiRequest =
  | { action: "add"; payload: { title: string } }
  | { action: "complete"; payload: { id: unknown } }
  | { action: "get"; payload: { id: unknown } }
  | { action: "list" };

export class Api {
  constructor(private store: TaskStore) {}

  handle(request: ApiRequest): ApiResponse {
    switch (request.action) {
      case "complete": {
        // 👇 The bug. `payload.id` is typed `unknown` and arrives as a string
        //    when the request comes from JSON. The cast silences TypeScript
        //    but does no runtime coercion — so `markComplete("1")` reaches
        //    `t.id === id` where t.id is a number, and the lookup misses.
        const found = this.store.markComplete(request.payload.id as number);
        return { ok: true, data: { completed: found } };
      }
      // …other cases…
    }
  }
}
export class TaskStore {
  private tasks: Task[] = [];

  markComplete(id: number): boolean {
    // 👇 Strict equality. If `id` arrives as a string ("1"), this returns
    //    undefined even when a Task with id 1 exists. Combined with the
    //    missing coercion in api.ts, this is what breaks the test.
    const task = this.tasks.find((t) => t.id === id);
    if (!task) return false;
    task.completed = true;
    return true;
  }
  // …add, get, list, clear…
}

The bug spans src/api.ts (the dispatcher) and src/storage.ts (the typed store) — the cast in one file plus the strict equality in the other is what produces the failing test.

The fix is one Number() call at the API boundary. Shallow as the bug is, it sits past character 500 of api.ts — which is exactly why truncate-500 will fail catastrophically below: the agent never gets to see the line that needs fixing.
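Concretely, the expected fix is something like a single changed line in the complete case of api.ts: coerce before the lookup instead of casting.

// before: this.store.markComplete(request.payload.id as number);
const found = this.store.markComplete(Number(request.payload.id));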

Our fixture is extremely basic compared to the serious open evaluations every model lab reports against — SWE-bench (and its Verified / Live variants) for full-repo bug fixes, τ-bench for tool-use correctness, TerminalBench for shell tasks, BigCodeBench for realistic library-using code, Aider’s polyglot benchmark for multi-language editing — but it suits our purpose better. Those benchmarks all hold the harness fixed and vary the model, producing one number per model: useful for ranking, opaque about why. This article does the inverse: model fixed (Gemini 2.5 Flash), harness strategy varied, on a fixture deliberately small enough that you can read every transcript end-to-end. Once you can see those mechanisms in a 13-turn transcript you can reason about what they’d do in a 200-turn one.

The tests in test/tasklist.test.ts exercise the full surface end-to-end: add tasks, list them, mark complete via a JSON-encoded request, fetch by id. 3 source files, 1 failing test, single-file fix.

// the failing test, abridged
import { test } from "node:test";
import assert from "node:assert/strict";
import { Api, parseRequest } from "../src/api.ts";
import { TaskStore } from "../src/storage.ts";

test("complete via API with a JSON payload marks the task completed", () => {
  const store = new TaskStore();
  const api = new Api(store);
  api.handle({ action: "add", payload: { title: "buy milk" } });

  // Clients serialize ids as strings (JSON over HTTP, URL path params, etc.).
  const raw = JSON.stringify({ action: "complete", payload: { id: "1" } });
  const request = parseRequest(raw);
  const res = api.handle(request);

  assert.equal(res.ok, true);
  if (!res.ok) return;
  assert.deepEqual(res.data, { completed: true });
});

// …also: "add creates a task with an id", "list returns all tasks"

The test suite drives the API through a JSON round-trip — JSON.stringify({...}) followed by parseRequest() — instead of calling handle() with a typed object directly. That round-trip is what makes the bug reachable: it’s the only way payload.id arrives at runtime as a string. The “user” of the library is this test suite; the agent is dropped in, given four tools (read_file, write_file, list_files, run_tests), and asked to make the suite go green.

We implemented the four tools used in the harness ourselves as thin wrappers. Each one is a few lines of Node fs plus a JSON schema registered with pi’s AgentTool interface. They’re deliberately uncapped: read_file returns the entire file content with no per-call cap, no pagination, no per-line limit; write_file is a plain overwrite; list_files returns the full directory listing. If read_file capped output at the tool layer (the way opencode’s does), truncate-500 and age-truncate-500-keep-3 would behave indistinguishably on small files — the tool’s cap would be doing the work the strategy is supposed to be doing. Keeping the tools minimal forces every observable difference in the experiment to come from the context-assembly strategy alone.

Below you can see the four tool bodies, stripped to their execute paths (schemas, labels, and workdir resolution elided for clarity):

// no per-call cap, no pagination, no per-line limit
async (_id, args) => {
  const abs = resolveInWorkdir(workdir, args.path);
  const contents = fs.readFileSync(abs, "utf8");
  return textResult(contents);
}
// plain overwrite — no diff, no validation, no edit-tolerance policy
async (_id, args) => {
  const abs = resolveInWorkdir(workdir, args.path);
  fs.mkdirSync(path.dirname(abs), { recursive: true });
  fs.writeFileSync(abs, args.content, "utf8");
  return textResult(`wrote ${args.content.length} bytes to ${args.path}`);
}
// recursive walk; returns every file path joined by newlines
async (_id, args) => {
  const rel = args.path ?? ".";
  const abs = resolveInWorkdir(workdir, rel);
  const entries: string[] = [];
  const walk = (dir: string) => {
    for (const name of fs.readdirSync(dir)) {
      const full = path.join(dir, name);
      if (fs.statSync(full).isDirectory()) {
        if (name === "node_modules" || name === ".git") continue;
        walk(full);
      } else {
        entries.push(path.relative(workdir.root, full));
      }
    }
  };
  walk(abs);
  entries.sort();
  return textResult(entries.join("\n") || "(empty)");
}
// shells out to `node --test`; returns full stdout + stderr + exit code
async () => {
  const testFiles = fs.readdirSync(path.join(workdir.root, "test"))
    .filter((n) => n.endsWith(".test.ts"))
    .map((n) => path.join("test", n));
  const result = spawnSync(
    "node",
    ["--experimental-strip-types", "--test", ...testFiles],
    { cwd: workdir.root, encoding: "utf8", timeout: 30_000 },
  );
  return textResult(
    `exit_code: ${result.status ?? -1}\n` +
    `--- stdout ---\n${result.stdout}\n` +
    `--- stderr ---\n${result.stderr}`,
  );
}

The chart below plots the four strategies turn-by-turn — one representative per shape we want to teach: baseline (the control — no transformation), age-truncate-500-keep-3 (age-aware per-turn truncation), compact-at-12000-structured (at-thresholds compaction with the Claude Code template), and truncate-500 (the catastrophic failure mode — uniform truncation aggressive enough to hide bugs). Toggle between three metrics:

  • prompt size — total input tokens sent that turn, counting both new and cached tokens (input_tokens + cached_tokens from the provider’s usage report). This is the measure of how much the model has to read on that turn.
  • cumulative cost — running sum, in USD, of every per-turn bill (Gemini Flash prompt + cache + output) up to and including that turn.
  • cache hit ratio — for that turn, cached_tokens / (cached_tokens + input_tokens). 1.0 means every input token came from the prefix cache; 0.0 means nothing was cached and you paid full rate on the entire prompt. We pull both numbers directly from the provider’s per-call usage breakdown and compute the ratio per turn.
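
For concreteness, here is how all three metrics fall out of a single per-call usage report. The field names mirror the description above; the Usage and Rates shapes are illustrative, not the provider SDK's types.

// illustrative shapes; field names follow the text above, not the provider SDK
type Usage = { input_tokens: number; cached_tokens: number; output_tokens: number };
type Rates = { input: number; cached: number; output: number }; // USD per token

const promptSize = (u: Usage) => u.input_tokens + u.cached_tokens;                        // what the model reads that turn
const cacheHitRatio = (u: Usage) => u.cached_tokens / (u.cached_tokens + u.input_tokens); // 1.0 = fully cached prefix
const turnCost = (u: Usage, r: Rates) =>
  u.input_tokens * r.input + u.cached_tokens * r.cached + u.output_tokens * r.output;
const cumulativeCost = (turns: Usage[], r: Rates) =>
  turns.reduce((sum, u) => sum + turnCost(u, r), 0);                                      // running USD sum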
[Interactive chart: bug-01 · four strategies side-by-side; toggle between the three metrics above]

You can also step through any of the four runs turn-by-turn below. A few terms first.

A turn is one LLM call. The cycle around each turn:

  1. transformContext runs over the agent’s accumulated conversation history.
  2. The LLM is invoked with the result.
  3. The LLM emits a response — text and/or tool-call intents.
  4. The agent dispatches the tool calls, runs each tool, and appends the results to the history.

That ends the turn. The next turn is the next LLM call. By turn N, the conversation has grown to roughly 1 + 2(N-1) messages — the seed prompt plus an alternating LLM-response/tool-result pair per prior turn.
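
The sketch below is that cycle in code. It is schematic, not pi's actual API: callLLM, runTool, and the message shape are stand-ins, but it shows where transformContext sits in the loop.

// schematic only; the message shape and the two declared helpers are stand-ins, not pi's API
type AgentMessage =
  | { role: "user" | "assistant"; text: string }
  | { role: "toolResult"; text: string };

declare function callLLM(prompt: AgentMessage[]): Promise<{ text: string; toolCalls: object[] }>;
declare function runTool(call: object): Promise<string>;

async function runTurn(
  history: AgentMessage[],
  transformContext: (h: AgentMessage[]) => Promise<AgentMessage[]>,
) {
  const prompt = await transformContext(history);                     // 1. strategy shapes the prompt
  const response = await callLLM(prompt);                             // 2. one LLM call = one turn
  history.push({ role: "assistant", text: response.text });           // 3. response appended
  for (const call of response.toolCalls) {
    history.push({ role: "toolResult", text: await runTool(call) });  // 4. tool results appended
  }
  return history;                                                     // next turn's input is this, grown by 2+ messages
}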

Click any Turn N in the left navigator to focus on it. The widget shows the moment just before that turn’s LLM call: the Before strategy side is everything accumulated through turn N-1’s tool results (turn N’s response hasn’t happened yet at this point); the After strategy side is what transformContext produced from that input — the prompt the LLM actually saw. The default Diff view colors what changed: red lines were dropped by the strategy, green lines were added or replaced, gray lines align across both sides. Switch to Cards for a structured per-message view. For baseline, the two sides are byte-identical — that’s the control case. For the other three strategies, the diff is the article’s central question made literal.

[Interactive widget: step through each run turn by turn. Shown: baseline · ✗ fail · $0.0614 · 27 LLM calls, with the turn navigator on the left and the before/after diff on the right. For the turn shown the diff reports "No changes — the strategy returned the conversation unchanged for this turn"; both sides hold only the seed prompt: "A test in test/tasklist.test.ts is failing. Find the bug in the source code under src/, fix it, and make the whole test suite pass."]

For each run we copy the fixture to an isolated scratch directory, give the agent the four tools, and let it work until it stops. verify() runs node --test one more time and records whether the test suite is green.
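
verify() is nothing more than the run_tests tool re-run outside the agent. A minimal sketch of what it does (the function name and workdir argument are ours):

// re-run `node --test` against the scratch directory and report green/red from the exit code
import { spawnSync } from "node:child_process";
import fs from "node:fs";
import path from "node:path";

function verify(workdirRoot: string): boolean {
  const testFiles = fs
    .readdirSync(path.join(workdirRoot, "test"))
    .filter((n) => n.endsWith(".test.ts"))
    .map((n) => path.join("test", n));
  const result = spawnSync(
    "node",
    ["--experimental-strip-types", "--test", ...testFiles],
    { cwd: workdirRoot, encoding: "utf8", timeout: 30_000 },
  );
  return result.status === 0; // green iff the runner exits cleanly
}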

Model: Gemini 2.5 Flash with temperature = 0 (injected via pi’s onPayload hook, since Agent doesn’t expose temperature directly). Even at zero, Flash is not fully deterministic in practice — batched inference and floating-point noise cause the agent’s trajectory to diverge across runs, which is why we run k=3 per cell instead of k=1.

What we measure per run:

  • pass — did the test suite go green at the end?
  • turns — number of assistant turns (i.e., LLM calls).
  • cost — dollars, summed across every call in the run (Gemini Flash prompt + output + cache).
  • peak prompt — the largest prompt size (new + cached tokens) the agent ever sent.
  • new input tokens — uncached prompt tokens, summed. The ones you pay full price for.
  • cached input tokens — tokens served from Gemini’s implicit prefix cache.

The cached column is the cache-boundary signal: a high value means the strategy is preserving the prefix across turns; a low value means it’s invalidating it.

Results per strategy

The numbers below are means across the 3 trials (k=3) per strategy; ± values are the sample standard deviation (so 13±3 means a mean of 13 turns with σ ≈ 3 across the runs). The new and cached columns are also per-run means — what a typical single trial paid in uncached vs. cached input tokens.

strategy                      n   pass  turns  cost          peak prompt  new     cached
baseline                      3   3/3   13±3   $0.016±0.006  6,705±1,830  27,164  18,224
truncate-500                  3   0/3   12±6   $0.098±0.012  5,851±3,187  27,167  14,050
age-truncate-500-keep-3       3   3/3   13±3   $0.017±0.006  5,744±1,444  29,082  11,394
compact-at-12000-structured   3   3/3   12±4   $0.016±0.007  5,212±1,170  22,834  14,462

Three things jump out.

truncate-500 fails catastrophically — and expensively. 0/3 passed, at roughly 6× the cost of baseline. The bug in api.ts sits past character 500 of the file the agent reads, so a 500-char cap on tool results literally cuts off the buggy code. The agent reads the file, sees a complete-looking import block and type definitions, trusts it, fails to find the bug, tries random edits, and burns money. Truncation that’s too aggressive isn’t just lossy — it’s misleadingly lossy, because the agent has no way to know it’s missing the relevant span. This is the article’s sharpest anti-pattern.

age-truncate-500-keep-3 is the best balance. 3/3 pass, $0.017 avg cost (essentially tied with baseline), cache hits preserved (11K cached tokens). The strategy’s idea — keep the last K tool results verbatim, truncate older ones — avoids the “hide the bug” failure mode (the agent always sees fresh reads in full) while still tightening the middle of the conversation. It is also deterministic and position-fixed, so it passes the cache-stability rule. If you need a default, this shape is it.

Compaction also works. compact-at-12000-structured passes 3/3 at $0.016 — essentially tied with baseline despite the extra LLM call needed to produce the summary, because the post-compaction prompt is small enough that the savings absorb the summarization cost. The threshold matters: fire too early (before the bug’s diagnosis is settled in the conversation) and the summary captures exploration but not the resolution; fire too late and you’ve already paid most of the bill. 12,000 chars hits a sweet spot for this fixture; production CLIs use multi-round compaction with adaptive thresholds to handle the general case.

The bigger picture across the four: three of four strategies pass 3/3 at essentially the same cost. That’s itself a finding: on a 10–20 turn bug-fix task, if your strategy is cache-stable and doesn’t destroy information the agent is actively using, you can pick more-or-less anything. Cost and reliability diverge sharply only when the strategy violates one of those two principles — which truncate-500 does (it destroys the working set) and the rest don’t.

A proposed custom algorithm: age-aware tool-result truncation

Most CLIs and reference implementations treat context management as either “drop old stuff” or “summarize old stuff”. Both have problems we’ve seen in the data. Dropping old stuff (sliding window) kills the cache. Summarizing old stuff (compaction) costs an extra LLM call and is sensitive to the summary’s quality — if the summary misses the bug’s diagnosis, the agent loses its thread.

There’s a third option that our data suggests is underused: truncate the tail of old tool results, keep the recent ones verbatim, change nothing else. Concretely:

export function makeAgeAwareTruncate(
  { keepRecent, maxChars }: { keepRecent: number; maxChars: number },
) {
  return async function(messages: AgentMessage[]) {
    const resultIndices = messages
      .map((m, i) => (m.role === "toolResult" ? i : -1))
      .filter((i) => i !== -1);
    const keepFromIndex =
      resultIndices.length > keepRecent
        ? resultIndices[resultIndices.length - keepRecent]
        : -1;

    return messages.map((msg, i) => {
      if (msg.role !== "toolResult") return msg;
      if (i >= keepFromIndex) return msg;  // recent — leave verbatim
      return truncateTextBlocks(msg, maxChars);
    });
  };
}
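
truncateTextBlocks is the one helper the snippet leans on without showing. Below is a minimal sketch of what it has to do, assuming tool results carry their payload in a plain text field (our simplification; pi's real message blocks may differ), followed by the configuration behind the experiment's age-truncate-500-keep-3 cell.

// sketch of the helper under the { role, text } assumption; the marker wording is ours.
// what matters is determinism: the same old message renders identically on every later turn,
// so the truncated prefix stays cache-stable
function truncateTextBlocks(msg: AgentMessage, maxChars: number): AgentMessage {
  if (msg.text.length <= maxChars) return msg;
  return {
    ...msg,
    text: msg.text.slice(0, maxChars) + `\n[truncated ${msg.text.length - maxChars} chars]`,
  };
}

// the configuration used for age-truncate-500-keep-3 in the results table
const ageTruncate500Keep3 = makeAgeAwareTruncate({ keepRecent: 3, maxChars: 500 });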

Why this works well:

  • It respects the cache-boundary rule. The truncation decision is a pure function of message position and text length — deterministic, stable across turns. A tool result at position 5, once truncated, stays truncated identically on every later call. The prefix cache hits like baseline’s.
  • It never hides the bug. The last K tool results — the ones the agent’s current reasoning depends on — are never touched. If the agent just read a file, it sees the whole file.
  • It’s cheap. Pure in-process work; no extra LLM call, unlike compaction.
  • It’s boring. ~15 lines of code, one clear invariant. It doesn’t need a threshold tuning exercise (small K like 3 works across tasks). It doesn’t have a “what should go in the summary?” subproblem.

The data on the main fixture supports the claim: 3/3 pass, $0.017 — essentially tied with baseline despite running the cap on every turn. The cap doesn’t hurt the agent because the recent end of the conversation — the reads and test output it is actively reasoning about — stays intact. The intuition: the cap bounds how much stale tool output any turn’s prompt can carry, while the recent working set passes through verbatim, so nothing the agent currently needs gets hidden.

It won’t dominate compaction on genuinely long conversations — beyond some horizon, even truncated old tool results crowd out useful context — but for the 10–50 turn range that covers most single-task coding sessions, it’s the best default we found.

How this differs from opencode’s two-layer approach

opencode tackles the same problem from the opposite direction: it caps tool output at the tool layer (read_file returns at most ~50KB / 2,000 chars per line, grep paginates) and lets compaction handle long sessions. age-aware tool-result truncation lives at the strategy layer instead. Two substantive differences come out of that choice:

  • Storage model. opencode does lossy storage — once a tool returns its capped result, the rest of the file isn’t anywhere local; the agent has to re-call read_file with offset/limit to recover more. age-truncate does lossy view, lossless storage — the conversation log keeps every tool’s full output forever, and the strategy re-derives a compressed view on each turn. Swap the strategy mid-run and the full bytes come back without re-calling tools. Easier to experiment with; bigger on disk.
  • The recent working set. opencode’s tool-layer cap is uniform: the file you just opened is capped at 50KB too. If a bug lives past the cap, the agent has to ask for the next slice — exactly the failure mode truncate-500 demonstrated in miniature on our fixture (which uniformly caps at 500 chars). opencode mitigates this with a much larger cap and explicit pagination. age-truncate inverts the discipline: the K most recent results pass through verbatim regardless of size, and only older results get capped. So a 100KB read you just did is fully visible; the same 100KB read from 10 turns ago has been truncated to 500 chars.

The two approaches are complementary, not competing. opencode’s tool-layer caps bound any single huge dump (a 1MB log file, a stack trace from a runaway test); age-truncate-500-keep-3 prevents the count of preserved-verbatim results from growing unbounded across turns. A production stack would do both: bound each tool’s output at the tool layer (so single dumps stay reasonable), then run age-aware truncation at the strategy layer (so the working set doesn’t accumulate), and reserve compaction for the genuine long-session case.
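
A sketch of that combined stack, with illustrative numbers (the 50KB figure echoes opencode's read cap; none of this is opencode's or pi's actual code):

// tool layer: bound any single dump before it enters the conversation log
const TOOL_OUTPUT_CAP = 50_000; // chars, roughly opencode-sized; illustrative

function capToolOutput(raw: string): string {
  if (raw.length <= TOOL_OUTPUT_CAP) return raw;
  return raw.slice(0, TOOL_OUTPUT_CAP) + "\n[truncated; request the next slice for more]";
}

// strategy layer: the 3 newest results stay verbatim, older ones shrink to 500 chars
const strategy = makeAgeAwareTruncate({ keepRecent: 3, maxChars: 500 });

// compaction then stays reserved for sessions that outgrow both layers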

Age-truncate ran without tool-layer caps in our experiment only because we deliberately left the tools uncapped to isolate the strategy’s effect. In production you’d want both layers.