Open Benchmark

Tokenomics Leaderboard

The first public benchmark ranking AI coding tools by token efficiency - how few tokens they consume to accomplish software engineering tasks, without sacrificing quality.

Inspired by Tokenomics, SWE-Effi, and AI Agents That Matter. All results are reproducible and open.

What We Measure

Input tokens, output tokens, quality preservation, compression latency, and cost savings - across real software engineering tasks from SWE-bench and DevGPT.

How We Score

Composite Score = token savings % × quality preservation rate × latency factor. An independent judge LLM evaluates output quality against an uncompressed baseline.

Open Submissions

Submit your own tool or model for benchmarking. All results include full metadata and per-task data for independent verification.

Rankings

#	Tool / Model	Composite	Token Savings	Quality	Input Tokens	Output Tokens	Cost Saved	Latency
1	Wristband Pipeline (balanced) - any model (compression is model-agnostic) + Wristband v0.2	16.1	16.1%	100%	64K	0	$0.04	1.3ms
2	No compression (baseline) - any model	0.0	0.0%	100%	76K	0	$0.00	0ms

Pareto Frontier

Tools on the Pareto frontier are highlighted - no other tool is both cheaper AND higher quality. Points below the frontier are dominated.

Pareto frontier chart - coming soon
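
Until the chart is available, the frontier can be computed directly from per-tool cost and quality figures. The following is a minimal sketch, not the leaderboard's own code: the `Entry` type, the function names, and the sample tools are hypothetical, and it assumes the standard dominance rule (another tool is at least as good on both axes and strictly better on one), which is slightly stricter than "both cheaper AND higher quality".

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    cost: float     # lower is better (e.g. dollars or tokens per task)
    quality: float  # higher is better (e.g. quality preservation rate)


def dominates(a: Entry, b: Entry) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.cost <= b.cost and a.quality >= b.quality
    strictly_better = a.cost < b.cost or a.quality > b.quality
    return no_worse and strictly_better


def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    """Keep every entry that no other entry dominates (input order preserved)."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


# Hypothetical entries, for illustration only (not leaderboard data).
sample = [
    Entry("tool_a", cost=1.0, quality=0.90),
    Entry("tool_b", cost=2.0, quality=0.95),
    Entry("tool_c", cost=2.5, quality=0.90),  # dominated by tool_a and tool_b
]
frontier = pareto_frontier(sample)
```

The pairwise check is quadratic in the number of entries, which is fine for a leaderboard-sized list.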

Metrics Reference

Primary Metrics

Composite Score

Quality-Adjusted Token Efficiency

Token Savings

% reduction vs uncompressed baseline

Quality Preserved

% of tasks scoring ≥ 8/10

Input Tokens

Total input tokens across all tasks

Output Tokens

Total output tokens across all tasks

Secondary Metrics

Cost Savings

Dollar savings at current API pricing

Compression Ratio

original_tokens / compressed_tokens

Compression Latency

Time spent in compression stage

Wall Clock Time

End-to-end time including compression

Lossless Rate

% of tasks where the original input can be reconstructed from the compressed form
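
As a rough illustration of how the token-based metrics relate to each other, the sketch below computes compression ratio, token savings, and cost savings from a pair of token counts. The function names are hypothetical, and the price per million tokens is an assumed parameter: the page does not state which pricing it uses.

```python
def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
    """original_tokens / compressed_tokens, as defined under Compression Ratio."""
    if compressed_tokens <= 0:
        raise ValueError("compressed_tokens must be positive")
    return original_tokens / compressed_tokens


def token_savings_pct(original_tokens: int, compressed_tokens: int) -> float:
    """Percentage reduction relative to the uncompressed baseline."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 100.0 * (original_tokens - compressed_tokens) / original_tokens


def cost_saved_usd(original_tokens: int, compressed_tokens: int,
                   usd_per_million_tokens: float) -> float:
    """Dollar savings at an assumed (hypothetical) price per million tokens."""
    saved_tokens = original_tokens - compressed_tokens
    return saved_tokens * usd_per_million_tokens / 1_000_000


# Illustrative numbers only: 1,000 tokens compressed to 800.
ratio = compression_ratio(1000, 800)       # 1.25
savings = token_savings_pct(1000, 800)     # 20.0
dollars = cost_saved_usd(1000, 800, 3.0)   # assumes $3 per million tokens
```

Ratio and savings carry the same information: savings equals 100 × (1 − 1/ratio), so a ratio of 1.25 corresponds to a 20% reduction.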

Quality Scoring Rubric

An independent judge LLM (a different model from the one being evaluated) scores each compressed output against the uncompressed baseline.

Score	Meaning
10	Identical or equivalent to baseline
8-9	Minor differences, same conclusion/output
6-7	Noticeable differences, still mostly correct
4-5	Significant quality loss, partially correct
1-3	Fundamentally wrong or broken
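
One way to apply the rubric programmatically is sketched below: it maps a judge score to its rubric band and computes the Quality Preserved percentage using the ≥ 8/10 threshold from the metrics reference. The names `RUBRIC`, `rubric_meaning`, and `quality_preserved_pct` are hypothetical, and the sample scores are invented for illustration.

```python
# (low, high, meaning) bands copied from the rubric table above.
RUBRIC = [
    (10, 10, "Identical or equivalent to baseline"),
    (8, 9, "Minor differences, same conclusion/output"),
    (6, 7, "Noticeable differences, still mostly correct"),
    (4, 5, "Significant quality loss, partially correct"),
    (1, 3, "Fundamentally wrong or broken"),
]


def rubric_meaning(score: int) -> str:
    """Return the rubric description for an integer judge score from 1 to 10."""
    for low, high, meaning in RUBRIC:
        if low <= score <= high:
            return meaning
    raise ValueError(f"score out of range: {score}")


def quality_preserved_pct(scores: list[int], threshold: int = 8) -> float:
    """Percentage of tasks whose judge score meets the threshold (8/10 by default)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return 100.0 * sum(s >= threshold for s in scores) / len(scores)


# Invented scores for four tasks: three of them reach the threshold.
example_scores = [10, 9, 7, 8]
preserved = quality_preserved_pct(example_scores)  # 75.0
```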

Methodology

Datasets

SWE-bench Lite - Real GitHub issues from popular Python repositories. Measures whether the tool can still fix bugs after compression. (300 tasks)

DevGPT - Real developer-LLM conversations with code snippets. Measures real-world compression ratios on actual developer prompts. (29,778 tasks)

Token Counting

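tiktoken cl100k_base encoding, the tokenizer used by OpenAI's GPT-4 and GPT-3.5 Turbo models. Providers with their own tokenizers, such as Anthropic, may bill different counts, so totals for those models are approximations.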
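
A minimal sketch of counting tokens this way is shown below. `tiktoken.get_encoding("cl100k_base")` and the encoding's `encode` method are part of the tiktoken library; the helper names `total_tokens` and `cl100k_encoder` are hypothetical. The encoder is passed in as a callable so the counting logic can be exercised without tiktoken installed.

```python
from typing import Callable, Iterable, Sequence


def total_tokens(texts: Iterable[str],
                 encode: Callable[[str], Sequence]) -> int:
    """Sum token counts over all texts using the supplied encode function."""
    return sum(len(encode(text)) for text in texts)


def cl100k_encoder() -> Callable[[str], Sequence]:
    """Return the cl100k_base encode function (requires `pip install tiktoken`)."""
    import tiktoken  # third-party; imported lazily so the helper above stays stdlib-only

    return tiktoken.get_encoding("cl100k_base").encode


# Stand-in encoder for illustration: whitespace splitting, NOT real tokenization.
approx = total_tokens(["fix the bug", "add a test"], str.split)

# With tiktoken installed, real counts would come from:
#   total_tokens(prompts, cl100k_encoder())
```

tiktoken may need to fetch the encoding's vocabulary file the first time it is used, so a first run can require network access.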

Quality Scoring

Independent LLM judge (different model from task model) scores compressed output vs uncompressed baseline on a 1-10 rubric. For code tasks, automated test suite verification supplements the judge.

Composite Formula

Quality-Adjusted Token Efficiency = token_savings_% × quality_preservation_rate × latency_factor
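
The formula can be sketched as a single function. This is an illustration under stated assumptions rather than the leaderboard's scoring code: the page does not define how `latency_factor` is derived, so it is taken here as a plain multiplier supplied by the caller, and the function name is hypothetical.

```python
def composite_score(token_savings_pct: float,
                    quality_preservation_rate: float,
                    latency_factor: float) -> float:
    """Quality-Adjusted Token Efficiency.

    token_savings_pct: percentage reduction vs the baseline (e.g. 16.1).
    quality_preservation_rate: fraction of tasks preserved, in [0, 1].
    latency_factor: multiplier whose derivation is NOT specified on the page;
        treated here as an opaque input (assumption).
    """
    if not 0.0 <= quality_preservation_rate <= 1.0:
        raise ValueError("quality_preservation_rate must be in [0, 1]")
    return token_savings_pct * quality_preservation_rate * latency_factor


# Row 1 of the rankings: 16.1% savings and 100% quality give a composite of
# 16.1 if the latency factor is 1.0 (an inferred value, not a published one).
score = composite_score(16.1, 1.0, 1.0)
```

Because the three terms are multiplied, a tool with zero token savings scores 0.0 regardless of quality, which matches the baseline row.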

Standing on Shoulders

This leaderboard builds on the research community's call for cost-aware evaluation of AI coding tools.

AI Agents That Matter - Princeton, 2024

Proposed cost-controlled evaluation methodology for LLM agents

SWE-Effi - 2025

Defined the EuTB/EuCB metrics (Effectiveness under Token/Cost Budget)

Tokenomics - 2025

Coined 'Tokenomics', the study of token consumption in LLM agent systems