Context compounding: why every Claude Code message costs more than the last

Every turn re-sends the whole conversation as input, so verbose tool output paid for once is paid for again on every later message. The mechanism behind a runaway AI coding bill, explained plainly.

Grant Unwin · Founder, Lineman

Short answer: Context compounding is why every Claude Code message can cost more than the last. Each turn re-sends the entire conversation, including all earlier tool output, to the model as input. Anything bulky you read once gets re-billed on every following turn, so cost grows with the length of the session even when your prompts stay short.

Most people picture an AI coding bill as "I asked it to do things, and each thing cost money." The reality is odder than that, and pricier: the history is what you keep paying for. Understanding why is about the most useful thing you can know if you want to bring it down.

The per-turn re-billing model

Large language models are stateless between requests. The model keeps no memory of your conversation, so the client re-sends the whole thing on every turn. That means each turn is billed on:

system prompt + CLAUDE.md + every previous message + every previous tool result + your new prompt → then it generates a reply.

So the input you pay for grows on every turn. Your prompt might be ten words, but the model still reads everything that came before it.

Turn | What's in the input | Relative input size
1 | system + first prompt | small
3 | + 2 turns of replies and tool output | growing
10 | + everything from turns 1–9 | large
20 | + everything from turns 1–19 | very large

The replies themselves are usually small. What bloats that growing input is tool output: the files, logs, and search results pulled in along the way.
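
To make that growth concrete, here is a minimal Python sketch of the re-billing model. The function name and every number in it (a 2,000-token fixed prefix, 20-token prompts, 3,000 tokens of replies and tool output added per turn) are illustrative assumptions, not measurements from a real session.

```python
def input_tokens_by_turn(fixed_tokens: int, prompt_tokens: int,
                         growth_per_turn: int, turns: int) -> list[int]:
    """Return the input size billed on each turn of a session.

    fixed_tokens:    system prompt + CLAUDE.md, sent on every turn.
    prompt_tokens:   size of each new user prompt.
    growth_per_turn: reply + tool output that each turn adds to history.
    All values are hypothetical; real sessions vary turn by turn.
    """
    sizes = []
    history = 0
    for _ in range(turns):
        # Each request carries the fixed prefix, all history, and the new prompt.
        sizes.append(fixed_tokens + history + prompt_tokens)
        # After the turn, the prompt, reply, and tool results join the history.
        history += prompt_tokens + growth_per_turn
    return sizes


sizes = input_tokens_by_turn(fixed_tokens=2_000, prompt_tokens=20,
                             growth_per_turn=3_000, turns=20)
for turn in (1, 3, 10, 20):
    print(f"turn {turn:>2}: {sizes[turn - 1]:>6} input tokens")
print(f"total input over the session: {sum(sizes)} tokens")
```

Under these assumptions the prompt never grows past 20 tokens, yet the per-turn input climbs linearly and the session total climbs roughly with the square of the number of turns.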

A worked example

Say turn 2 reads a 5,000-token file. You use three functions from it and move on, but the file doesn't move on with you. It sits in context. On turn 3 you pay for it again. On turn 4, again. By turn 11 you've paid for those 5,000 tokens roughly nine more times, for a file you were done with back on turn 2. Multiply that across every file read, test run, and search in a long session and you've got the bill.
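
The arithmetic in that example can be checked with a few lines of Python. The helper name is hypothetical, and the sketch assumes the file stays in context untouched (no /compact or /clear) between the turn that read it and the current turn.

```python
def rebilled_tokens(item_tokens: int, read_turn: int, current_turn: int) -> int:
    """Tokens re-sent for one item after the turn that first read it.

    Assumes the item is re-sent once on every turn after read_turn,
    up to and including current_turn.
    """
    resends = max(0, current_turn - read_turn)
    return item_tokens * resends


# The 5,000-token file read on turn 2, counted through turn 11.
extra = rebilled_tokens(item_tokens=5_000, read_turn=2, current_turn=11)
print(extra)  # 45000: nine re-sends of 5,000 tokens each
```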

The prompt-caching nuance (where most explanations get it wrong)

Now the caveat most write-ups skip, stated carefully so it actually holds up. Claude Code uses prompt caching. When the start of your context is unchanged from the previous turn, those tokens are billed at the cache-read rate, which is roughly a tenth (≈0.1×) of the normal input price (Anthropic pricing). So compounding really is cheaper than the naive "full price every turn" arithmetic suggests.

It doesn't make the problem go away, though, for two reasons:

  1. Cached context is still billed every turn, at 10% rather than 0%. Across a long session, ten cents on the dollar paid a few hundred times still mounts up, and you're paying it for tokens the model mostly doesn't need.
  2. It still occupies the window. Cached or not, that output takes up room in the context window. As the window fills with low-signal text, there's less space for what matters, and models tend to reason less reliably the more noise they have to wade through.

The cheapest token is the one that never enters context in the first place. Caching discounts repetition; it doesn't undo it.
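
A rough comparison shows what the discount does and doesn't buy. In this sketch the 0.1 multiplier comes from the figure quoted above, while the $3 per million input tokens price is a placeholder assumption, not a quoted rate. It also assumes every re-send is a cache hit and ignores any cost of writing to the cache, so treat the output as an optimistic bound.

```python
CACHE_READ_MULTIPLIER = 0.1  # from the text: cache reads cost about a tenth of normal input


def carry_cost(tokens: int, resends: int, price_per_mtok: float, cached: bool) -> float:
    """Dollar cost of re-sending `tokens` on `resends` later turns.

    price_per_mtok is a hypothetical input price per million tokens.
    When cached is True, every re-send is billed at the cache-read rate.
    """
    multiplier = CACHE_READ_MULTIPLIER if cached else 1.0
    return tokens * resends * multiplier * price_per_mtok / 1_000_000


# The 5,000-token file from the worked example, carried for nine more turns.
full = carry_cost(5_000, resends=9, price_per_mtok=3.0, cached=False)
discounted = carry_cost(5_000, resends=9, price_per_mtok=3.0, cached=True)
print(f"uncached: ${full:.4f}  cached: ${discounted:.4f}")
```

The cached figure is a tenth of the uncached one, but it is still above zero, and it scales with both the size of the item and the number of turns it lingers.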

Why tool output is the worst offender

Stack the properties up and tool output is uniquely bad for compounding. A single read or log can dwarf a whole exchange of prompts and replies, almost all of it is noise the model didn't need, and it persists in context and re-bills (even if cheaply) on every later turn. Reasoning is small and changes each turn; tool output is big and just sits there. That's why, on our benchmarks, it comes to over half of a typical bill.

What actually stops the compounding

Three things, in increasing order of durability:

  • /compact summarises the running transcript, shedding 60–80% of the active context while keeping continuity.
  • /clear wipes it entirely at a task boundary.
  • Never letting the bulk in. The durable fix is to compress tool output down to a task-relevant summary before it ever lands in context. The discarded bulk is then neither billed (not even at the cache-read rate) nor taking up the window; only the short summary is. That's the part Lineman automates: it intercepts data-heavy tool calls and hands the model the signal instead of the dump.
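
As a toy illustration of the third option, the sketch below filters a test log down to its failure lines before it would be handed to a model. This is not how Lineman works internally; the keyword pattern, the function names, and the four-characters-per-token estimate are all assumptions chosen to keep the example short.

```python
import re

# Hypothetical markers of "signal" in a test log; a real filter would be task-aware.
FAILURE_PATTERN = re.compile(r"\b(FAILED|ERROR|Traceback|AssertionError)\b")


def estimate_tokens(text: str) -> int:
    """Very rough token estimate, assuming about 4 characters per token."""
    return len(text) // 4


def compress_test_log(log: str, max_lines: int = 20) -> str:
    """Keep only lines that look like failures, plus a note on what was dropped."""
    lines = log.splitlines()
    kept = [line for line in lines if FAILURE_PATTERN.search(line)][:max_lines]
    omitted = len(lines) - len(kept)
    kept.append(f"[{omitted} of {len(lines)} lines omitted]")
    return "\n".join(kept)


# A made-up log: 500 passing tests and one failure.
log = "\n".join(
    [f"test_case_{i} PASSED" for i in range(500)]
    + ["test_parse FAILED", "AssertionError: expected 3, got 4"]
)
summary = compress_test_log(log)
print(summary)
print(estimate_tokens(log), "->", estimate_tokens(summary), "estimated tokens")
```

A keyword filter like this can drop context the model actually needed, which is why the summary records how many lines were omitted. The point is only that the saving is made once, at the source, rather than discounted on every later turn.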

If you're staring at a bill right now, start by finding where your tokens go, then work through the practical ways to cut usage. But the mechanism above is the part worth keeping in your head: in a long agentic session, most of your bill isn't the work itself, it's re-paying to carry everything the session has piled up.

Frequently asked questions

What is context compounding?
Context compounding is the way AI coding costs grow as a session goes on: every turn re-sends the whole conversation, including all earlier tool output, as input tokens. Anything bulky you read once is re-billed on every subsequent message, so cost climbs with conversation length even when your prompts stay short.

Why does tool output use so many tokens?
Tool output is large, low-signal, and sticky. A single file read or test log can be thousands of tokens, most of which the model never needs to reason about, and once it's in context it's re-billed every turn until you clear or compact. That combination is why, on our benchmarks, tool output comes to over half of a typical bill.

Does a bigger context window cost more?
A larger maximum window doesn't cost more by itself. You pay for the tokens you actually use, not the window size. But a bigger window makes it easy to keep more in context, and every token in context is re-billed each turn, so in practice large windows tend to raise cost unless you actively manage what's in them.


Grant is the founder of Lineman, where he works on cutting the token cost of agentic coding. He writes about how AI coding tools bill, where the spend actually goes, and how to reduce it without losing output quality.
