← All news
Product

7 Ways to Reduce Tokens in Long LLM Agent Chats

Cut token usage in multi-turn LLM agent chats with 7 proven techniques. Lineman shows you how to reduce costs while keeping output quality intact.

The Lineman team

Long multi-turn agent sessions burn through tokens faster than most engineers expect. Each turn re-sends the entire conversation history, and bulky tool outputs—file reads, build logs, search results—compound the cost on every subsequent message. Lineman gives you a way to intercept that data-heavy work and keep your context window focused on reasoning instead of re-reading the same files over and over.

This guide breaks down seven techniques that cut token usage in long agent workflows without degrading output quality. You'll learn which levers to pull manually, which can be automated, and how to diagnose where your tokens actually go.

Quick guide: 7 token reduction techniques for LLM agent chats

  1. Automatic tool-output compression: The most effective technique for cutting token spend in data-heavy agent workflows
  2. Context clearing at task boundaries: A manual reset that prevents stale context from compounding
  3. Mid-session compaction: Condenses active context when clearing isn't an option
  4. Model routing: Delegates mechanical tasks to smaller models at a fraction of the cost
  5. Prompt discipline: Trims system prompts and instructions to remove per-turn bloat
  6. Selective file loading: Loads only the code sections the agent needs instead of entire files
  7. Memory and retrieval strategies: Stores prior context externally and retrieves it on demand

How we chose these token reduction techniques

Engineering teams running long multi-turn agent sessions share a common problem: context compounding. Every token in your window gets re-billed as input on each turn, so a 50-turn session can cost 50 times more than you'd expect from the raw output alone. The techniques here address that mechanic directly.

We selected these seven levers based on:

  • Measured impact on token spend: Each technique has documented savings in real coding workflows, not theoretical benchmarks
  • Output quality retention: Cutting tokens means nothing if your agent starts producing worse code—we prioritized techniques that maintain baseline quality
  • Workflow fit: Some techniques require manual discipline every session; others run automatically—you'll see which is which
  • Applicability to multi-turn agents: Generic LLM tips often ignore the compounding effect of long sessions—these techniques address it specifically
  • Compatibility with common tooling: These techniques work across Claude Code, GPT-based agents, and similar agentic coding environments

The 7 token reduction techniques for long LLM agent chats

1. Automatic tool-output compression: The most effective technique for agent workflows

Tool output is the largest single contributor to token costs in most agent sessions. File reads, build logs, test results, and search outputs enter your context window in full, then get re-billed on every subsequent turn. On Lineman's benchmarks, tool output accounts for over half of a typical bill.

Automatic compression intercepts these bulky outputs before they reach your context window. Instead of sending 2,000 lines of test logs to your frontier model, a compression layer distills it down to the relevant failures and stack traces. Because the bulk never enters context, it's never billed—not once and not on any later turn.

Lineman handles this automatically by routing data-heavy tool calls through a smaller model that produces a compact, task-relevant summary. On Lineman's data, this approach cuts 40%+ of tokens while holding output quality at 98.3% of baseline. You keep prompting exactly as you do now while the largest cost driver gets handled in the background.

Automatic tool-output compression features

  • Intercepts file reads, logs, and search results: The bulkiest tool outputs get compressed before they hit your context, so you pay for a distilled version instead of the full dump
  • Language-agnostic processing: Works across Python, TypeScript, Go, Rust, and other languages without configuration changes
  • Sub-2-second latency: Compression happens fast enough that you won't notice it in your workflow—Lineman runs on CPU-only inference to keep processing quick
  • Real-time savings visibility: You can see exactly how many tokens Lineman trimmed on each request, so you're never guessing about the impact
  • No workflow changes required: Lineman installs in minutes and works inside Claude Code without requiring you to change how you prompt or interact with your agent
  • Quality retention guarantees: Benchmarked at 98.3% baseline output quality retention, so compression doesn't come at the cost of worse code

Automatic tool-output compression pros and cons

Pros:

  • Addresses the largest cost driver (tool output) without requiring manual effort each session
  • Compounds savings across every turn in a long session, since compressed outputs are re-billed at their smaller size
  • Works automatically in the background—no discipline or habit changes required from you

Cons:

  • Requires initial setup and API key configuration, though installation takes only a few minutes
  • Adds a processing step between your agent and tool outputs, which could be noticeable in latency-sensitive workflows—though Lineman's sub-2-second latency keeps this minimal
  • Some edge cases with highly specialized output formats may need manual review initially

2. Context clearing at task boundaries: A manual reset for fresh starts

When you switch tasks without clearing context, your agent drags along every file read and log dump from the previous task. That stale data gets re-billed on every turn of your new task, even though it's no longer relevant.

Running a clear command (/clear in Claude Code) wipes the accumulated context and gives you a fresh start. This is a manual technique that requires you to remember to do it at task boundaries, but the savings are immediate and significant.

Context clearing features

  • Immediate context reset: Wipes accumulated tool outputs, file reads, and conversation history from your active session
  • Built into most agent interfaces: Claude Code offers /clear natively, and similar commands exist in other agentic environments
  • Zero cost to implement: No tooling or configuration required—just discipline to remember the command at task boundaries

Context clearing pros and cons

Pros:

  • Eliminates re-billing of stale context from previous tasks
  • No tooling or setup required—works with built-in agent commands
  • Gives you full control over when to reset your session

Cons:

  • Requires manual discipline every session—you have to remember to clear at task boundaries
  • Clears all context, including potentially useful information you might want to retain
  • Doesn't help mid-task when you can't afford to lose accumulated context

3. Mid-session compaction: Condense without clearing

Sometimes you're 30 turns into a debugging session and clearing context would mean losing critical state. Compaction commands (/compact in Claude Code) condense your active context by summarizing older turns while preserving the most recent and relevant information.

This technique sheds 60–80% of active context on Lineman's benchmarks, giving you room to continue the session without hitting context limits or paying for a bloated window.

Mid-session compaction features

  • Selective summarization: Older turns get condensed while recent context stays intact for continuity
  • 60–80% context reduction: Significant savings on long sessions where clearing isn't an option
  • Preserves session continuity: Unlike clearing, compaction keeps your agent aware of what happened earlier in the task

Mid-session compaction pros and cons

Pros:

  • Reduces context size without losing session history entirely
  • Works mid-task when clearing would disrupt your workflow
  • Available as a built-in command in Claude Code and similar tools

Cons:

  • Summarization may lose some details from earlier turns that could be relevant later
  • Still requires manual invocation—you have to remember to compact on long sessions
  • Less effective than clearing if you've accumulated large amounts of irrelevant context

4. Model routing: Match the model to the task

Frontier models like Claude Opus or GPT-4 cost significantly more per token than smaller models. On a per-token basis, Sonnet costs about a fifth of Opus ($3/$15 vs $5/$25 per million input/output, June 2026). For mechanical tasks—formatting, simple refactors, boilerplate generation—the smaller model produces equivalent output at a fraction of the cost.

Model routing means using your expensive frontier model only for genuinely hard reasoning tasks, and delegating mechanical work to smaller, cheaper models. This is a multiplier on every token you spend, so it's often the single biggest saving after tool-output compression.

Model routing features

  • Cost asymmetry exploitation: Routes mechanical tasks to models that cost 5-10x less per token
  • Preserves quality on hard tasks: Your frontier model still handles complex reasoning, architecture decisions, and subtle bugs
  • Works with Lineman's automatic delegation: Lineman routes appropriate tasks to smaller models automatically, so you don't have to decide on each request

Model routing pros and cons

Pros:

  • Massive cost reduction on mechanical tasks without quality loss
  • Can be automated through tools like Lineman that handle routing decisions
  • Compounds with other techniques—cheaper tokens are still subject to context compounding, so cutting their cost matters

Cons:

  • Requires judgment about which tasks are "mechanical" vs "genuinely hard"—automated routing helps, but some edge cases need manual override
  • Manual routing requires switching models mid-session, which can disrupt flow
  • Some tasks that seem mechanical may benefit from frontier model reasoning in subtle ways

5. Prompt discipline: Trim your system prompts and CLAUDE.md

Your CLAUDE.md file (or equivalent system prompt configuration) gets re-sent on nearly every turn. A 500-token system prompt doesn't sound like much, but over a 50-turn session, you're paying for it 50 times. The compounding adds up.

Trim your system prompts to durable rules only. Say what you need once, avoid verbose explanations of preferences, and remove anything that's nice-to-have rather than essential. Small per-turn savings compound across a whole session.

Prompt discipline features

  • Reduce per-turn overhead: Every token you cut from your system prompt saves that token on every subsequent turn
  • Focus on durable rules: Keep instructions that apply across all tasks; remove session-specific or situational guidance
  • Use /context to diagnose: Check what's filling your window and identify system prompt bloat

Prompt discipline pros and cons

Pros:

  • Reduces fixed overhead on every turn of every session
  • Forces clarity in your instructions, which often improves agent behavior
  • No tooling required—just editing your configuration files

Cons:

  • Requires upfront effort to audit and trim your system prompts
  • Over-trimming can degrade agent behavior if you remove important context
  • Benefits are incremental compared to tool-output compression—this is a supporting technique, not a primary lever

6. Selective file loading: Load only what you need

When your agent reads a 3,000-line file to understand a 50-line function, you pay for the full 3,000 lines on every turn until you clear context. Selective loading means pointing your agent at specific functions, classes, or line ranges instead of entire files.

This requires more precise prompting—instead of "read the user service file," you'd say "read lines 142-195 of user_service.py." The tradeoff is extra effort in your prompts for significantly reduced context bloat.

Selective file loading features

  • Line-range precision: Load only the code section relevant to your current task
  • Function or class targeting: Some agents support loading specific symbols rather than full files
  • Reduced re-billing overhead: Smaller file reads mean smaller re-billing costs on subsequent turns

Selective file loading pros and cons

Pros:

  • Directly reduces the size of file-read outputs entering your context
  • Works with any agent that supports file reading—no additional tooling required
  • Gives you precise control over what context your agent has access to

Cons:

  • Requires you to know which lines or functions are relevant before asking the agent to read them
  • More verbose prompting compared to "just read the file"
  • Agent may miss important context if you load too selectively

7. Memory and retrieval strategies: Store context externally

Instead of keeping all prior context in the active window, you can store it externally and retrieve it on demand. This is the approach behind RAG (retrieval-augmented generation) systems: maintain a knowledge base, query it when relevant, and inject only the retrieved chunks into context.

For agent workflows, this means storing summaries of previous sessions, code snippets, or architectural decisions in a searchable format. Your agent queries this store when it needs historical context, rather than carrying everything in the active window.

Memory and retrieval features

  • External knowledge storage: Prior context lives outside the context window and gets retrieved on demand
  • Query-based injection: Only relevant chunks enter context, not the entire knowledge base
  • Cross-session continuity: Useful information persists between sessions without bloating each individual session's context

Memory and retrieval pros and cons

Pros:

  • Enables very long workflows without context window limits
  • Prior learnings persist across sessions and team members
  • Retrieval can be tuned to inject only high-relevance context

Cons:

  • Requires infrastructure setup—vector databases, embedding pipelines, retrieval logic
  • Retrieval quality depends on how well you've indexed and chunked your knowledge base
  • Adds latency for the retrieval step, which may slow down interactive workflows

Comparison table: Token reduction techniques for LLM agent chats

TechniqueAutomation levelToken savingsSetup time
Automatic tool-output compression (Lineman)Fully automatic40%+Minutes
Context clearingManual per-session100% (full reset)None
Mid-session compactionManual per-session60–80%None
Model routingManual or automatic5–10x cost reductionMinutes to hours
Prompt disciplineOne-time setup5–15%Hours
Selective file loadingManual per-requestVariableNone
Memory and retrievalAutomatic after setupVariableDays to weeks

How do I diagnose where my tokens are going in long agent sessions?

Run /context in Claude Code to see the breakdown of your current window. You'll see how much of your context is consumed by system prompts, conversation history, and tool outputs. Most engineers are surprised to find that tool output—file reads, logs, and search results—accounts for the majority of their context consumption.

Watch this breakdown over several turns. If tool output is the largest contributor, automatic compression will give you the biggest savings. If conversation history is compounding quickly, more aggressive clearing or compaction will help. If your system prompt is larger than expected, prompt discipline is your first lever.

Lineman shows you real-time savings statistics, so you can see exactly how many tokens get trimmed on each request. This visibility helps you understand which technique is having the most impact on your specific workflow.

When should I use manual techniques vs automatic compression?

The manual techniques—clearing, compaction, selective loading—require discipline every session. They work, but they depend on you remembering to do them. The first three levers are things you have to remember every session, which means they're easy to skip when you're focused on the code.

Automatic tool-output compression works in the background without requiring you to change your prompting behavior. You keep prompting exactly as you do now while the largest cost driver gets handled automatically. This is why Lineman focuses on the data-heavy tool calls: they're the biggest contributor to cost, and automating their compression removes the discipline requirement.

The most effective approach combines both: use automatic compression to handle the bulk of your token spend, then layer in manual techniques (clearing at task boundaries, compaction on long sessions) for additional savings when appropriate.

Why Lineman is the most effective technique for cutting tokens in long agent chats

Tool output drives the majority of token costs in multi-turn agent sessions. File reads, build logs, test results, and search outputs enter your context in full, then compound on every subsequent turn. Lineman intercepts this data-heavy work and delivers a distilled version that keeps your main model focused on reasoning.

On Lineman's benchmarks, automatic tool-output compression cuts 40%+ of tokens while maintaining 98.3% baseline output quality. This directly counters context compounding by ensuring that bulky outputs never enter your context window in the first place. Because the bulk never enters context, it's never billed—not once and not on any later turn.

Lineman installs in minutes inside Claude Code with no workflow changes required. You get real-time visibility into your savings, sub-2-second latency on each compressed request, and the confidence that your output quality stays intact. Give your AI coding assistant a sidekick and see the difference in your next long session.

FAQs about LLM token optimization

What is context compounding and why does it matter for token costs?

Context compounding happens because LLMs are stateless—every turn re-sends the entire conversation history as input. A 50-turn session means your context gets re-billed 50 times, not once.

This is why tool outputs are the worst offenders: a 2,000-token file read on turn 5 gets re-billed on turns 6, 7, 8, and every turn after. Lineman addresses this by compressing tool outputs before they enter context, so the compounding effect applies to a smaller number.

How much can I realistically save on token costs in long agent sessions?

On Lineman's benchmarks, automatic tool-output compression delivers 40%+ token savings with 98.3% baseline quality retention. Combined with manual techniques like context clearing and model routing, engineering teams typically see 50–70% total cost reduction on data-heavy workflows.

Your actual savings depend on your workflow: sessions with lots of file reads, build logs, and test outputs will see the largest gains from automatic compression.

Do token reduction techniques affect output quality?

The goal is cutting tokens without degrading quality. Lineman's compression approach specifically targets data that's irrelevant to the current task—extracting the signal from bulky tool outputs while discarding the noise.

On Lineman's data, output quality retention sits at 98.3% of baseline across benchmarks. Manual techniques like context clearing don't affect quality at all—they just remove stale data that was no longer relevant anyway.

Can I use multiple token reduction techniques together?

Yes, and you should. Automatic tool-output compression handles the largest cost driver without requiring manual effort. Layer in context clearing at task boundaries, compaction on long sessions, and model routing for mechanical tasks to maximize savings.

These techniques compound: if Lineman cuts 40% of your tool-output tokens and model routing cuts your per-token cost by 5x on mechanical tasks, the combined savings are significant.

How does Lineman compare to just clearing context more often?

Context clearing removes all accumulated data, which works if you're switching tasks. But mid-task, clearing means losing context you still need. Lineman compresses tool outputs without removing them—you keep the relevant information in a smaller footprint.

The other difference is discipline: clearing requires you to remember to do it. Lineman runs automatically on every tool call that produces bulky output.

What types of tool outputs benefit most from compression?

File reads, build logs, test results, and codebase search outputs are the biggest wins. These outputs are often thousands of tokens but contain only a few hundred tokens of relevant information for the current task.

Lineman extracts the relevant failures from test logs, the pertinent code sections from file reads, and the matching results from searches—discarding the bulk that would otherwise fill your context window.

Related