Aspects of the technology described in this paper are patent-pending.
Frontier large language models (LLMs) power today's AI coding assistants, but a significant portion of their token consumption goes to mechanical, data-processing tasks -- summarizing files, filtering search results, triaging build output -- that require no deep reasoning. This paper presents our research into task-specific model routing, a technique that classifies coding assistant workloads by cognitive complexity and delegates data-heavy, low-reasoning tasks to small, specialized models while reserving frontier models for tasks that require genuine intelligence.
Our key findings: 27-58% token cost reduction on files ranging from 250 to 2,000 lines with no measurable degradation in task quality; sub-2-second latency per delegated task on CPU-only inference, with no GPU required; an average quality score of 86/100 for our recommended 8B model under an automated evaluation framework; and improved performance on structured extraction tasks when chain-of-thought reasoning is disabled, across all models tested.
AI coding assistants have transformed software development. Developers interact with frontier models -- Claude, GPT-4, and others -- that can reason about code architecture, debug complex issues, and generate production-quality implementations. But this power comes at a cost: frontier model inference is expensive, and much of that expense is wasted.
Consider a typical AI-assisted coding session. A developer asks the assistant to understand a codebase, fix a bug, and implement a feature. The assistant reads files, searches for references, analyzes build output, and filters search results. Each of these operations sends tokens through the frontier model. A 2,000-line source file costs approximately $0.06 in input tokens alone on a model like Claude Sonnet. Over a productive session involving 50 file reads, that's $3.00 spent purely on reading -- before any reasoning, code generation, or decision-making occurs.
The insight behind our research is that these tasks fall into two fundamentally different categories: high-reasoning tasks that require the full capability of a frontier model (code generation, architectural decisions, multi-step debugging), and mechanical tasks that require data processing but not deep reasoning (file summarization, search filtering, build output triage, error classification).
The mechanical category often accounts for the majority of token throughput in a coding session, yet these tasks share key properties: they have high input volume, require structured output in a predictable format, tolerate some information loss, and do not require chain-of-thought reasoning.
Our research explores whether a small, cheap model -- running at roughly 1/100th the per-token cost of a frontier model -- can handle these mechanical tasks at acceptable quality, and what infrastructure is needed to route tasks effectively between models.
The AI industry has developed several approaches to reducing inference costs. Prompt caching stores frequently used prompts to avoid re-processing, but does not reduce the fundamental token volume of data-heavy tasks. Context window optimization (RAG, chunking, sliding windows) reduces what is sent to the model but does not change which model sees it. Model distillation creates smaller versions of large models but requires significant training infrastructure. Mixture-of-Experts architectures route different tokens to different components within a single model; our approach operates at a higher level, routing entire tasks to different models.
The concept of using multiple models with different capabilities is not new. Ensemble methods, cascading classifiers, and speculative decoding all involve multiple models cooperating. Our contribution is applying this principle to the specific domain of AI coding assistants, where the task taxonomy is well-defined and the cost asymmetry between model tiers is extreme.
Our architecture leverages the Model Context Protocol (MCP), an open standard for connecting AI models to external tools and data sources. MCP provides a standardized interface through which a primary model can invoke secondary models as tools, making multi-model routing a natural extension of existing tool-use patterns.
The core of our approach is a principled framework for determining which tasks can be safely delegated to a smaller model. We define four properties that characterize delegable tasks:
| Property | Description | Example |
|---|---|---|
| High-volume input | Task involves processing hundreds or thousands of lines | Reading a 1,500-line source file |
| Low-reasoning requirement | Requires extraction or classification, not planning | Summarizing exports and structure |
| Structured output | Response format is predictable and verifiable | JSON with known fields |
| Loss tolerance | Some information loss is acceptable | A summary capturing 90% of key facts |

Applying these properties to common coding-assistant tasks yields the following classification:

| Task Category | Input Volume | Reasoning | Delegable? |
|---|---|---|---|
| File summarization | High | Low | Yes |
| Search result filtering | Medium-High | Low | Yes |
| Build output triage | High | Low | Yes |
| Error classification | Medium | Low | Yes |
| Content compression | High | Low | Yes |
| Code generation | Low-Medium | High | No |
| Architecture decisions | Low | High | No |
| Bug diagnosis | Variable | High | No |
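
The classification in the table above can be sketched as a small predicate. This is an illustrative reading of the four properties, not the authors' implementation: the `Level` enum, the `TaskProfile` fields, and the rule that a task needs at least medium input volume are all assumptions chosen to reproduce the table's "Delegable?" column.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical profile of a task along the four properties."""
    name: str
    input_volume: Level       # how much data the task must read
    reasoning: Level          # how much planning or judgment it needs
    structured_output: bool   # response format is predictable and verifiable
    loss_tolerant: bool       # some information loss is acceptable


def is_delegable(task: TaskProfile) -> bool:
    """Assumed rule: reasoning must be low and the other three properties must hold."""
    return (
        task.reasoning == Level.LOW
        and task.input_volume >= Level.MEDIUM
        and task.structured_output
        and task.loss_tolerant
    )


file_summary = TaskProfile("file_summarization", Level.HIGH, Level.LOW, True, True)
error_triage = TaskProfile("error_classification", Level.MEDIUM, Level.LOW, True, True)
code_gen = TaskProfile("code_generation", Level.MEDIUM, Level.HIGH, False, False)

for task in (file_summary, error_triage, code_gen):
    print(f"{task.name}: delegable={is_delegable(task)}")
```

Treating low reasoning as a hard requirement mirrors the table, where every high-reasoning task is non-delegable regardless of input volume.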
A key design principle that emerged from our research: the secondary model should act exclusively as a compressor, filter, or classifier -- never as a reasoner. This bright line dramatically simplifies the system. The primary model trusts the secondary model to reduce data, not to make decisions. This asymmetric trust relationship is critical for maintaining overall quality while achieving cost savings.
Our architecture follows a delegation pattern where the frontier model acts as orchestrator and explicitly decides when to invoke the secondary model. The primary model sees the secondary model as a tool -- one of many available actions it can take. This leverages the primary model's existing tool-use capabilities for routing, without requiring a separate routing layer.
Developer <-> Primary LLM (Frontier) <-> [MCP Protocol] <-> Task Router <-> Secondary LLM (Small)
The system comprises five key components: a Task Router that receives requests and dispatches to the secondary model; Prompt Templates optimized for small models as tightly scoped single-turn prompts; a Response Shaper using regex-based parsing (more reliable than JSON mode with small models); Authority Framing that directs the primary model not to re-read source material; and a Fallback Mechanism that transparently reverts to the primary model on failure.

The secondary model can be deployed in either of two topologies:

| Topology | Description | Use Case |
|---|---|---|
| Co-located | Secondary model on the same machine | Development, privacy-sensitive |
| Disaggregated | Secondary model as a cloud service | Production, team environments |
Both topologies present the same interface. A strategy pattern abstracts the routing, with automatic fallback from disaggregated to co-located mode if the cloud service is unreachable.
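One way to picture the strategy pattern with topology fallback is the sketch below. The class name `SecondaryModelClient`, the choice of `ConnectionError` and `TimeoutError` as "unreachable" signals, and the two stand-in backends are assumptions for illustration only.

```python
from typing import Callable

# A backend is any callable that takes a prompt and returns the model's text.
Backend = Callable[[str], str]


class SecondaryModelClient:
    """Strategy wrapper: prefer the cloud backend, fall back to the local one."""

    def __init__(self, disaggregated: Backend, co_located: Backend) -> None:
        self._disaggregated = disaggregated
        self._co_located = co_located

    def complete(self, prompt: str) -> str:
        try:
            return self._disaggregated(prompt)
        except (ConnectionError, TimeoutError):
            # Cloud service unreachable: same interface, different topology.
            return self._co_located(prompt)


def unreachable_cloud(prompt: str) -> str:
    raise ConnectionError("cloud service unreachable")  # simulated outage


def local_model(prompt: str) -> str:
    return f"local:{prompt}"  # stand-in for a real local inference call


client = SecondaryModelClient(unreachable_cloud, local_model)
print(client.complete("summarize"))
```

Because callers only see `complete`, the Task Router does not need to know which topology served a given request.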
A practical finding: small models (1.7B-14B parameters) running structured extraction tasks perform adequately on CPU-only inference. On an Apple M4 Pro (24GB), our recommended 8B parameter model processes requests at approximately 25 tokens/second, completing most tasks in under 2 seconds.
We developed a three-mode evaluation framework: Fast Mode for deterministic structural validation (under 5 seconds, no external dependencies); Full Mode using a frontier model as automated judge with 3-shot median scoring per dimension; and Compare Mode for statistical A/B testing with Cohen's d effect size and bootstrap confidence intervals.
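The statistics named above can be sketched with the Python standard library. This is a minimal illustration, assuming a pooled-standard-deviation form of Cohen's d and a percentile bootstrap on the difference of means; the paper does not specify which variants it uses, and the sample scores are invented for the example.

```python
import random
import statistics


def median_of_shots(shots):
    """Full Mode: collapse repeated judge scores for one dimension to their median."""
    return statistics.median(shots)


def cohens_d(baseline, candidate):
    """Effect size of candidate vs. baseline using the pooled standard deviation."""
    n1, n2 = len(baseline), len(candidate)
    v1, v2 = statistics.variance(baseline), statistics.variance(candidate)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(candidate) - statistics.mean(baseline)) / pooled_sd


def bootstrap_ci(baseline, candidate, resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the difference in mean score."""
    rng = random.Random(seed)  # seeded so repeated runs agree
    diffs = []
    for _ in range(resamples):
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(candidate, k=len(candidate))
        diffs.append(statistics.mean(c) - statistics.mean(b))
    diffs.sort()
    lower = diffs[int(resamples * alpha / 2)]
    upper = diffs[int(resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Invented scores for two prompt variants, for illustration only.
baseline_scores = [82, 85, 88, 84, 86]
candidate_scores = [86, 89, 90, 87, 91]
print("d =", round(cohens_d(baseline_scores, candidate_scores), 2))
print("95% CI for mean difference:", bootstrap_ci(baseline_scores, candidate_scores))
```

Note that `cohens_d` divides by the pooled standard deviation, so it is undefined when every score in both samples is identical.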
Each task type defines a weighted rubric with dimensions specific to that task. Weights are integers summing to 100.
| Dimension | Weight | What It Measures |
|---|---|---|
| Completeness | 30 | Key symbols, module purpose, relationships captured |
| Accuracy | 30 | No hallucinated names, types, or incorrect facts |
| Conciseness | 20 | Summary length proportionate to input |
| Structure | 20 | Required output fields present |
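
The weighted rubric above reduces to a weighted average. The sketch below assumes each dimension is scored on a 0-100 scale and that the composite is the weight-normalized sum; the function name and the example scores are hypothetical.

```python
# Weights taken from the rubric table; they must sum to 100.
SUMMARY_RUBRIC = {
    "completeness": 30,
    "accuracy": 30,
    "conciseness": 20,
    "structure": 20,
}


def composite_score(rubric, dimension_scores):
    """Combine per-dimension scores (assumed 0-100) into one 0-100 composite."""
    if sum(rubric.values()) != 100:
        raise ValueError("rubric weights must sum to 100")
    if set(rubric) != set(dimension_scores):
        raise ValueError("scores must cover exactly the rubric dimensions")
    return sum(rubric[d] * dimension_scores[d] for d in rubric) / 100


# Hypothetical judge output for one summary.
scores = {"completeness": 90, "accuracy": 80, "conciseness": 85, "structure": 100}
print(composite_score(SUMMARY_RUBRIC, scores))  # 88.0
```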

The framework enforces the following quality gates:

| Gate | Threshold | Enforcement |
|---|---|---|
| Structural validation | All required fields | Blocks any merge |
| Average quality score | >= 70/100 | Blocks merge to main |
| Per-dimension regression | No drop > 5 points | Blocks merge to main |
| Composite regression | No drop > 3 points | Alerts and blocks |
| Token budget | Output < 70% of input | Warning only |
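
As a rough sketch, the gates in the table above could be checked as follows. The `RunResult` structure and the `check_gates` function are hypothetical, and the sketch treats a single composite score as a stand-in for the "average quality score" gate.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical summary of one benchmark run."""
    structurally_valid: bool
    dimension_scores: dict
    composite: float
    input_tokens: int
    output_tokens: int


def check_gates(current: RunResult, baseline: RunResult):
    """Return (blocking_failures, warnings) using the thresholds from the table."""
    blocking, warnings = [], []
    if not current.structurally_valid:
        blocking.append("structural validation")
    if current.composite < 70:
        blocking.append("average quality score")
    for dim, base_score in baseline.dimension_scores.items():
        if base_score - current.dimension_scores.get(dim, 0) > 5:
            blocking.append(f"per-dimension regression: {dim}")
    if baseline.composite - current.composite > 3:
        blocking.append("composite regression")
    # Token budget: output should stay below 70% of input (warning only).
    if current.output_tokens * 100 >= current.input_tokens * 70:
        warnings.append("token budget")
    return blocking, warnings


baseline = RunResult(True, {"accuracy": 90, "completeness": 85}, 88, 1000, 300)
current = RunResult(True, {"accuracy": 83, "completeness": 85}, 84, 1000, 800)
print(check_gates(current, baseline))
```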

Token consumption with and without routing, by file size:

| File Size | Without Routing | With Routing | Savings | Turn Reduction |
|---|---|---|---|---|
| 250 lines | 87K tokens | 64K tokens | 27% | 4 -> 1 turn |
| 1,000 lines | 125K tokens | 52K tokens | 58% | 6 -> 1 turn |
| 2,000 lines | 148K tokens | 105K tokens | 29% | 7 -> 2 turns |
The routing layer cost is roughly constant at about 53K tokens regardless of file size. Savings peak for files in the 500-1,500 line range.
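The savings column follows from a simple ratio, sketched here with the token counts from the table. Only the helper name `savings_pct` is introduced for illustration.

```python
# (tokens without routing, tokens with routing), keyed by file size in lines.
MEASUREMENTS = {
    250: (87_000, 64_000),
    1_000: (125_000, 52_000),
    2_000: (148_000, 105_000),
}


def savings_pct(without_routing: int, with_routing: int) -> float:
    """Percentage of tokens saved by routing."""
    return 100 * (without_routing - with_routing) / without_routing


for lines, (without, with_routing) in MEASUREMENTS.items():
    print(f"{lines} lines: {savings_pct(without, with_routing):.1f}% saved")
```

Because the table's token counts are rounded to the nearest thousand, percentages recomputed this way can differ from the published column by about a point.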

Quality and speed of the four models that passed benchmarking:

| Size class | Parameters | Quality | Speed | Notes |
|---|---|---|---|---|
| Small | 1.7B | 73/100 | 99 tok/s | Fastest, adequate for simple tasks |
| Medium | 8B | 86/100 | 25 tok/s | Best balance |
| Medium-large | 12B | 90/100 | 17 tok/s | Highest overall scores |
| Large | 14B | 88/100 | 14 tok/s | Best at code understanding |
Notable negative results: a 4B candidate had a 0% success rate, and one 14B candidate from a different family was inconsistent, scoring 0 on some medium-difficulty tasks. These results underscore the importance of empirical benchmarking over assumptions based on parameter count.
Disabling chain-of-thought reasoning improves performance for structured extraction tasks across all models tested. The likely explanation: chain-of-thought introduces unnecessary deliberation for tasks that are fundamentally pattern-matching and extraction, adding latency and occasionally leading the model to overthink straightforward tasks.
Small models frequently produce invalid JSON but reliably follow structural patterns. Regex-based extraction from labeled plain-text output proved significantly more reliable than JSON parsing for structured data.
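A minimal sketch of regex-based response shaping follows. The labeled-line format (`PURPOSE:`, `EXPORTS:`, `DEPENDENCIES:`) is a hypothetical prompt convention chosen for the example; the paper does not specify the actual output format.

```python
import re

# Hypothetical convention: the prompt asks the small model for "LABEL: value" lines.
FIELD_RE = re.compile(r"^(PURPOSE|EXPORTS|DEPENDENCIES):\s*(.*)$", re.MULTILINE)
LIST_FIELDS = ("exports", "dependencies")


def shape_response(raw: str) -> dict:
    """Pull known fields out of free-form model output, ignoring any chatter."""
    fields = {m.group(1).lower(): m.group(2).strip() for m in FIELD_RE.finditer(raw)}
    for key in LIST_FIELDS:
        if key in fields:
            # Comma-separated values become lists; empty items are dropped.
            fields[key] = [item.strip() for item in fields[key].split(",") if item.strip()]
    return fields


raw_output = (
    "Sure! Here is the summary:\n"
    "PURPOSE: HTTP retry helper\n"
    "EXPORTS: retry, Backoff\n"
    "Let me know if you need more detail."
)
print(shape_response(raw_output))
```

Unlike a strict JSON parser, this tolerates preambles and trailing commentary, which is the failure mode the finding describes.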
Each response includes a directive to the primary model not to re-read source material. Without this, frontier models frequently re-read files after receiving summaries, negating the token savings.
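Authority framing can be as simple as appending a fixed directive to every tool response. The directive wording and the `frame_response` helper below are illustrative assumptions, not the text used in the actual system.

```python
# Hypothetical directive text; the real wording is not given in this paper.
AUTHORITY_DIRECTIVE = (
    "This summary is authoritative. Do not re-read the source file "
    "unless a detail you need is missing from it."
)


def frame_response(summary: str, source_path: str) -> str:
    """Wrap a secondary-model summary with the directive for the primary model."""
    return f"Summary of {source_path}:\n{summary}\n\n{AUTHORITY_DIRECTIVE}"


print(frame_response("Exports retry() and Backoff.", "src/net/retry.py"))
```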
Consolidating all task types behind a single tool with a task_type discriminator reduced schema token overhead by 68% (1,642 to 524 tokens). This compounds over an entire session.
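The consolidated tool might look like the sketch below: one tool definition with an enumerated `task_type` field, plus a dispatcher. The `name`, `description`, and `inputSchema` keys follow the MCP tool definition shape; the tool name, the task type identifiers, and the handler are hypothetical.

```python
# Hypothetical identifiers for the task types named in this paper.
TASK_TYPES = [
    "file_summary",
    "search_filter",
    "build_triage",
    "error_classification",
    "content_compression",
]

# One tool schema instead of one per task type keeps schema tokens low.
DELEGATE_TOOL = {
    "name": "delegate",
    "description": "Run a mechanical data-processing task on the secondary model.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "task_type": {"type": "string", "enum": TASK_TYPES},
            "content": {"type": "string"},
        },
        "required": ["task_type", "content"],
    },
}


def handle_delegate(arguments: dict, handlers: dict) -> str:
    """Dispatch on the task_type discriminator."""
    task_type = arguments.get("task_type")
    if task_type not in handlers:
        raise ValueError(f"unsupported task_type: {task_type!r}")
    return handlers[task_type](arguments["content"])


# Stand-in handler; a real one would call the secondary model.
handlers = {"file_summary": lambda text: f"{len(text.splitlines())} lines summarized"}
print(handle_delegate({"task_type": "file_summary", "content": "a\nb\nc"}, handlers))
```

Adding a task type then means adding one enum value and one handler rather than a whole new tool schema.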
Every delegated task has an automatic fallback. The developer never experiences a failure due to the optimization layer -- at worst, they lose cost savings for that request.
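A fallback wrapper along these lines would give that guarantee. This sketch assumes one particular interpretation: on any failure or invalid result, the original content is handed back unchanged so the primary model processes it itself. The function and parameter names are hypothetical.

```python
from typing import Callable, Tuple


def delegate_with_fallback(
    content: str,
    run_secondary: Callable[[str], str],
    is_valid: Callable[[str], bool],
) -> Tuple[str, bool]:
    """Return (payload, delegated); payload is the raw content when delegation fails."""
    try:
        result = run_secondary(content)
    except Exception:
        # Deliberately broad: no failure in the optimization layer should reach the developer.
        return content, False
    if not is_valid(result):
        return content, False
    return result, True


def broken_model(text: str) -> str:
    raise RuntimeError("model crashed")  # simulated secondary-model failure


print(delegate_with_fallback("raw build log", broken_model, lambda r: bool(r)))
print(delegate_with_fallback("raw build log", lambda t: "2 errors", lambda r: bool(r)))
```

The boolean flag lets the caller record how often the fallback fires, which is useful for tracking lost savings.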
Files above approximately 3,000 lines require chunked processing, introducing coordination overhead and potential information loss at chunk boundaries. Our current implementation covers 7 core task types, of which 2 are fully benchmarked; each new task type requires its own prompt engineering, fixtures, and rubrics. Two of the six models tested were unsuitable, underscoring that model selection requires empirical validation.
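Chunked processing could be sketched as below. The overlap between consecutive chunks is an assumption introduced here as one possible way to soften boundary loss; the paper does not say how its implementation splits files, and the default sizes are illustrative.

```python
def chunk_lines(lines, max_lines=3000, overlap=50):
    """Split lines into chunks of at most max_lines, sharing `overlap` lines between neighbors."""
    if max_lines <= 0 or not 0 <= overlap < max_lines:
        raise ValueError("need max_lines > 0 and 0 <= overlap < max_lines")
    chunks = []
    start = 0
    step = max_lines - overlap
    while start < len(lines):
        chunks.append(lines[start:start + max_lines])
        if start + max_lines >= len(lines):
            break  # this chunk already reaches the end of the file
        start += step
    return chunks


source = [f"line {i}" for i in range(7000)]
chunks = chunk_lines(source)
print([len(c) for c in chunks])
```

Each chunk would then be summarized separately and the partial summaries merged, which is where the coordination overhead mentioned above comes from.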
Future research directions include adaptive routing using learned task complexity classifiers, multi-model cascading across capability levels, cross-session learning from accumulated benchmark data, and expanding the task taxonomy beyond the current 7 core types.
Task-specific model routing is a practical and effective technique for reducing the token cost of AI coding assistants. By classifying workloads into high-reasoning and mechanical categories, and delegating mechanical tasks to small, specialized models, we achieved 27-58% token savings while maintaining an average quality score of 86/100 with our recommended 8B model under our benchmark framework.
The key contributions of this research are: a principled task classification framework based on four measurable properties; an architecture for multi-model routing leveraging existing tool-use protocols (MCP); a rigorous evaluation methodology combining deterministic validation, LLM-as-judge scoring, and statistical significance testing; and empirical findings including the counterintuitive result that disabling chain-of-thought reasoning improves structured extraction performance.
These results suggest that the AI coding assistant industry can achieve significant cost reductions without requiring better models -- only smarter routing of the models already available.