The first public benchmark ranking AI coding tools by token efficiency: how few tokens they consume to complete software engineering tasks without sacrificing quality.
Inspired by Tokenomics, SWE-Effi, and AI Agents That Matter. All results are reproducible and open.
Metrics: input tokens, output tokens, quality preservation, compression latency, and cost savings, measured on real software engineering tasks from SWE-bench and DevGPT.
Composite Score = token savings % × quality preservation rate × latency factor. An independent judge LLM evaluates output quality against an uncompressed baseline.
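The formula above can be sketched in a few lines of Python. This is only an illustration: the function and parameter names are hypothetical, and because the page does not define how the latency factor is derived, the sketch assumes it is a plain multiplier between 0 and 1 supplied by the caller. The sample call uses the Wristband row of the leaderboard (16.1% savings, 100% quality) with an assumed latency factor of 1.0, which is the value the published composite of 16.1 implies.

```python
def composite_score(token_savings_pct: float,
                    quality_rate: float,
                    latency_factor: float = 1.0) -> float:
    """Composite = token savings % × quality preservation rate × latency factor.

    token_savings_pct: percentage in [0, 100]
    quality_rate:      fraction in [0, 1]
    latency_factor:    ASSUMED to be a multiplier in [0, 1]; the source
                       does not say how it is computed.
    """
    if not 0.0 <= quality_rate <= 1.0:
        raise ValueError("quality_rate must be a fraction between 0 and 1")
    if not 0.0 <= latency_factor <= 1.0:
        raise ValueError("latency_factor must be between 0 and 1")
    return token_savings_pct * quality_rate * latency_factor


# Leaderboard rows, with the latency factor assumed to be 1.0.
wristband = composite_score(16.1, 1.0, latency_factor=1.0)
baseline = composite_score(0.0, 1.0)
print(round(wristband, 1), round(baseline, 1))
```

Because the score is a product, a tool that saves many tokens but degrades quality is penalized in proportion to the quality it loses.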
Submit your own tool or model for benchmarking. All results include full metadata and per-task data for independent verification.
| # | Tool / Model | Composite | Token Savings | Quality | Input Tokens | Output Tokens | Cost Saved | Latency |
|---|---|---|---|---|---|---|---|---|
| 1 | Wristband Pipeline (balanced): any model (compression is model-agnostic) + Wristband v0.2 | 16.1 | 16.1% | 100% | 64K | 0 | $0.04 | 1.3ms |
| 2 | No compression (baseline): any model | 0.0 | 0.0% | 100% | 76K | 0 | $0.00 | 0ms |
Tools on the Pareto frontier are highlighted: for each of them, no other tool is both cheaper and higher quality. Points below the frontier are dominated by at least one tool on it.
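For readers who want to check dominance themselves, here is a minimal Python sketch of a two-axis Pareto filter over cost and quality. It uses the standard definition of dominance (no worse on both axes and strictly better on at least one), which may differ in edge cases from how the leaderboard highlights entries. The `Entry` type, the function names, and the sample data are all hypothetical; the entries are made up for illustration and are not leaderboard results.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    cost: float     # lower is better
    quality: float  # higher is better


def dominates(a: Entry, b: Entry) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.cost <= b.cost and a.quality >= b.quality
    strictly_better = a.cost < b.cost or a.quality > b.quality
    return no_worse and strictly_better


def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    """Keep every entry that no other entry dominates."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


# Made-up entries for illustration only.
entries = [
    Entry("tool_a", cost=1.00, quality=0.95),
    Entry("tool_b", cost=0.80, quality=0.90),
    Entry("tool_c", cost=1.10, quality=0.85),  # dominated by tool_a and tool_b
]
print([e.name for e in pareto_frontier(entries)])  # ['tool_a', 'tool_b']
```

The pairwise comparison is quadratic in the number of entries, which is fine for a leaderboard-sized list.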
Pareto frontier chart - coming soon
An independent judge LLM (a different model from the one being evaluated) scores each compressed output against the uncompressed baseline.
| Score | Meaning |
|---|---|
| 10 | Identical or equivalent to baseline |
| 8-9 | Minor differences, same conclusion/output |
| 6-7 | Noticeable differences, still mostly correct |
| 4-5 | Significant quality loss, partially correct |
| 1-3 | Fundamentally wrong or broken |
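The rubric above maps directly onto a small lookup, sketched here in Python. The function name `rubric_meaning` is hypothetical, and the sketch covers only the score-to-label mapping from the table; how individual judge scores are aggregated into the quality preservation rate is not specified on this page, so it is left out.

```python
def rubric_meaning(score: int) -> str:
    """Return the rubric label for a judge score from 1 to 10."""
    if not 1 <= score <= 10:
        raise ValueError("judge scores range from 1 to 10")
    if score == 10:
        return "Identical or equivalent to baseline"
    if score >= 8:
        return "Minor differences, same conclusion/output"
    if score >= 6:
        return "Noticeable differences, still mostly correct"
    if score >= 4:
        return "Significant quality loss, partially correct"
    return "Fundamentally wrong or broken"


for s in (10, 8, 5, 2):
    print(s, rubric_meaning(s))
```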
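Token counts use tiktoken's cl100k_base encoding. This matches billing for OpenAI models that use cl100k_base; other providers, including Anthropic, use their own tokenizers, so counts for their models are approximations.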
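A minimal sketch of counting tokens with this encoding, in Python, assuming the third-party `tiktoken` package is installed. The helper names `count_tokens` and `token_savings_pct` are hypothetical, and the savings formula (relative reduction against the uncompressed baseline) is an assumption about how the leaderboard derives its Token Savings column, not a documented definition.

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens in text using a tiktoken encoding."""
    import tiktoken  # third-party dependency: pip install tiktoken

    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))


def token_savings_pct(baseline_tokens: int, compressed_tokens: int) -> float:
    """ASSUMED definition: percentage reduction relative to the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - compressed_tokens) / baseline_tokens


# Example with precomputed counts. With tiktoken installed, the counts would
# come from count_tokens(baseline_text) and count_tokens(compressed_text).
print(round(token_savings_pct(1000, 850), 1))  # 15.0
```

Keeping the import inside `count_tokens` lets the savings arithmetic run even where tiktoken is unavailable.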
An independent LLM judge (a different model from the task model) scores the compressed output against the uncompressed baseline on a 1-10 rubric. For code tasks, automated test suite verification supplements the judge.
Quality-Adjusted Token Efficiency (the Composite Score) = token_savings_% × quality_preservation_rate × latency_factor
This leaderboard builds on the research community's call for cost-aware evaluation of AI coding tools.