Open Benchmark

Tokenomics Leaderboard

The first public benchmark ranking AI coding tools by token efficiency - how few tokens they consume to accomplish software engineering tasks, without sacrificing quality.

Inspired by Tokenomics, SWE-Effi, and AI Agents That Matter. All results are reproducible and open.

What We Measure

Input tokens, output tokens, quality preservation, compression latency, and cost savings - across real software engineering tasks from SWE-bench and DevGPT.

How We Score

Composite Score = token savings % × quality preservation rate × latency factor. An independent judge LLM evaluates output quality against an uncompressed baseline.

Open Submissions

Submit your own tool or model for benchmarking. All results include full metadata and per-task data for independent verification.

Rankings

# | Tool / Model | Composite | Token Savings | Quality | Input Tokens | Output Tokens | Cost Saved | Latency
1 | Wristband Pipeline (balanced): any model (compression is model-agnostic) + Wristband v0.2 | 16.1 | 16.1% | 100% | 64K | 0 | $0.04 | 1.3 ms
2 | No compression (baseline): any model | 0.0 | 0.0% | 100% | 76K | 0 | $0.00 | 0 ms

Pareto Frontier

Tools on the Pareto frontier are highlighted: for each of them, no other tool is both cheaper and higher quality. Tools off the frontier are dominated, meaning some other tool does better on cost and quality at once.
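As a rough illustration of how frontier membership could be computed, the sketch below uses the standard definition of dominance (at least as good on both axes and strictly better on one). That definition is an assumption; the leaderboard's wording could also be read as requiring a strict improvement on both axes. The `Entry` type and its field names are hypothetical, and the token figures are the rounded values from the rankings table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical leaderboard row: lower cost is better, higher quality is better."""
    name: str
    cost: float      # e.g. total input tokens or dollars
    quality: float   # e.g. quality preservation in percent


def dominates(a: Entry, b: Entry) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.cost <= b.cost and a.quality >= b.quality
    strictly_better = a.cost < b.cost or a.quality > b.quality
    return no_worse and strictly_better


def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    """Keep every entry that no other entry dominates."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


# Rounded figures taken from the rankings table (64K and 76K input tokens).
rows = [
    Entry("Wristband Pipeline (balanced)", cost=64_000, quality=100.0),
    Entry("No compression (baseline)", cost=76_000, quality=100.0),
]
frontier_names = [e.name for e in pareto_frontier(rows)]
```

Under this definition the baseline is dominated because it costs more tokens at the same quality. Under a stricter reading that requires improvement on both axes, it would stay on the frontier.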

Pareto frontier chart - coming soon

Metrics Reference

Primary Metrics

Composite Score
Quality-Adjusted Token Efficiency
Token Savings
% reduction vs uncompressed baseline
Quality Preserved
% of tasks scoring ≥ 8/10
Input Tokens
Total input tokens across all tasks
Output Tokens
Total output tokens across all tasks
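
The two percentage metrics above could be computed roughly as follows. This is a minimal sketch, not the benchmark's actual code: the function names are hypothetical, and the text does not say whether savings are measured on input tokens only or on input plus output, so the caller decides which counts to pass in.

```python
def token_savings_pct(baseline_tokens: int, compressed_tokens: int) -> float:
    """Percent reduction in tokens relative to the uncompressed baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - compressed_tokens) / baseline_tokens


def quality_preserved_pct(judge_scores: list[int], threshold: int = 8) -> float:
    """Percent of tasks whose judge score meets the threshold (8/10 by default)."""
    if not judge_scores:
        raise ValueError("judge_scores must not be empty")
    passed = sum(1 for score in judge_scores if score >= threshold)
    return 100.0 * passed / len(judge_scores)
```

For example, four tasks scored 10, 9, 8, and 7 give a quality-preserved rate of 75%, since only the last falls below the 8/10 threshold.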

Secondary Metrics

Cost Savings
Dollar savings at current API pricing
Compression Ratio
original_tokens / compressed_tokens
Compression Latency
Time spent in compression stage
Wall Clock Time
End-to-end time including compression
Lossless Rate
% of tasks where original can be reconstructed
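
Compression ratio and token savings describe the same reduction in two ways, and cost savings follow from the number of tokens removed. The sketch below shows the arithmetic under stated assumptions: the function names are hypothetical, and the per-million-token price is a caller-supplied placeholder, not a price taken from this benchmark.

```python
def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
    """original_tokens / compressed_tokens, as defined in the metrics list."""
    if compressed_tokens <= 0:
        raise ValueError("compressed_tokens must be positive")
    return original_tokens / compressed_tokens


def savings_pct_from_ratio(ratio: float) -> float:
    """Convert a compression ratio to percent token savings: 100 * (1 - 1/ratio)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 100.0 * (1.0 - 1.0 / ratio)


def cost_saved_usd(tokens_saved: int, usd_per_million_tokens: float) -> float:
    """Dollar savings for a given (assumed) price per million tokens."""
    return tokens_saved * usd_per_million_tokens / 1_000_000
```

A ratio of 1.25 corresponds to 20% savings, and a ratio of 1.0 (no compression) corresponds to 0%.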

Quality Scoring Rubric

An independent judge LLM (a different model from the one being evaluated) scores each compressed output against the uncompressed baseline.

Score | Meaning
10 | Identical or equivalent to baseline
8-9 | Minor differences, same conclusion/output
6-7 | Noticeable differences, still mostly correct
4-5 | Significant quality loss, partially correct
1-3 | Fundamentally wrong or broken

Methodology

Datasets

SWE-bench Lite - Real GitHub issues from popular Python repositories. Measures whether the tool can still fix bugs after compression. (300 tasks)
DevGPT - Real developer-LLM conversations with code snippets. Measures real-world compression ratios on actual developer prompts. (29,778 tasks)

Token Counting

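Token counts use tiktoken's cl100k_base encoding, the tokenizer for OpenAI's GPT-4 and GPT-3.5 Turbo models. Other providers, including Anthropic, use their own tokenizers, so counts for their models approximate billed tokens rather than match them exactly.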
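
A minimal sketch of counting tokens this way is shown below. The `count_tokens` helper and its `encode` parameter are hypothetical; the parameter exists only so a different tokenizer can be swapped in. The default path uses the real `tiktoken.get_encoding` call and needs the third-party tiktoken package installed.

```python
from typing import Callable, Optional, Sequence


def count_tokens(text: str,
                 encode: Optional[Callable[[str], Sequence]] = None) -> int:
    """Return the number of tokens in text.

    With no encoder supplied, falls back to tiktoken's cl100k_base encoding.
    """
    if encode is None:
        # Imported lazily so the helper still works with a custom encoder
        # when tiktoken is not installed (pip install tiktoken).
        import tiktoken
        # Note: by default, encode() raises ValueError if the text contains
        # special-token strings such as "<|endoftext|>".
        encode = tiktoken.get_encoding("cl100k_base").encode
    return len(encode(text))
```

Calling `count_tokens(prompt)` before and after compression gives the two counts needed for the token savings metric.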

Quality Scoring

Independent LLM judge (different model from task model) scores compressed output vs uncompressed baseline on a 1-10 rubric. For code tasks, automated test suite verification supplements the judge.

Composite Formula

Quality-Adjusted Token Efficiency = token_savings_% × quality_preservation_rate × latency_factor
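
The formula can be sketched in code as below. This is an illustration under assumptions, not the benchmark's implementation: the page does not define how the latency factor is derived, so it is treated here as a caller-supplied multiplier, and the quality preservation rate is assumed to be a fraction between 0 and 1. The top-ranked row (16.1% savings, 100% quality, composite 16.1) is consistent with a latency factor of about 1.0.

```python
def composite_score(token_savings_pct: float,
                    quality_preservation_rate: float,
                    latency_factor: float = 1.0) -> float:
    """Quality-Adjusted Token Efficiency.

    token_savings_pct: percent reduction vs. baseline (e.g. 16.1)
    quality_preservation_rate: fraction of tasks scoring >= 8/10 (assumed 0..1)
    latency_factor: multiplier whose derivation is not specified in the text
    """
    if not 0.0 <= quality_preservation_rate <= 1.0:
        raise ValueError("quality_preservation_rate must be in [0, 1]")
    return token_savings_pct * quality_preservation_rate * latency_factor
```

Because the terms are multiplied, a tool that saves many tokens but preserves quality on only half the tasks scores half as much as one with the same savings and full quality.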

Standing on Shoulders

This leaderboard builds on the research community's call for cost-aware evaluation of AI coding tools.

AI Agents That Matter (Princeton, 2024)

Proposed cost-controlled evaluation methodology for LLM agents

SWE-Effi (2025)

Defined EuTB/EuCB metrics: Effectiveness under Token/Cost Budget

Tokenomics (2025)

Coined 'Tokenomics': the study of token consumption in LLM agent systems