SWE-bench Pro

Head-to-head cost comparison on real-world software engineering tasks. Each task is run twice, once with Lineman and once without, to measure the change in token cost and to check whether solution quality is preserved.

9 tasks · 9 valid · model: sonnet · commit 43d0725

Cost saved: -20.9% (negative savings: the Lineman total is 20.9% higher than the baseline total)
Baseline total: $4.0353
Lineman total: $4.8797
Wins / Losses: 5W / 4L
Avg runs / task: 1.0

Task (repository, commit)	Baseline	Lineman	Delta	Quality	Turns	Runs
qutebrowser/qutebrowser f631cd44	$0.6922	$0.4086	-41.0%	--	22	1
gravitational/teleport 3fa69043	$0.4847	$0.3426	-29.3%	--	20	1
NodeBB/NodeBB 51d8f3b1	$0.4201	$0.3467	-17.5%	--	21	1
internetarchive/openlibrary 4a5d2a7d	$0.1594	$0.1387	-13.0%	--	7	1
ansible/ansible f327e65d	$0.3118	$0.2767	-11.3%	--	14	1
qutebrowser/qutebrowser f91ace96	$0.5492	$0.6195	+12.8%	--	32	1
qutebrowser/qutebrowser c580ebf0	$0.1576	$0.2035	+29.1%	--	7	1
navidrome/navidrome 7073d18b	$1.1184	$2.1628	+93.4%	--	90	1
ansible/ansible a26c325b	$0.1419	$0.3806	+168.2%	--	22	1

The Quality column shows the baseline result followed by the Lineman result. ✓ = resolved, ✗ = not resolved, - = not yet evaluated. An entry of "--" means neither run has been evaluated yet.
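The summary figures at the top of the page can be recomputed from the per-task rows. The sketch below is one plausible way to do that, not the report's actual code: the `ROWS` list is copied from the table, while the function name `summarize`, the rule that a win is a task where Lineman cost strictly less than baseline, and the reading of "Cost saved" as the negated aggregate delta are all assumptions.

```python
# (task, baseline_usd, lineman_usd), copied from the table above.
ROWS = [
    ("qutebrowser/qutebrowser f631cd44", 0.6922, 0.4086),
    ("gravitational/teleport 3fa69043", 0.4847, 0.3426),
    ("NodeBB/NodeBB 51d8f3b1", 0.4201, 0.3467),
    ("internetarchive/openlibrary 4a5d2a7d", 0.1594, 0.1387),
    ("ansible/ansible f327e65d", 0.3118, 0.2767),
    ("qutebrowser/qutebrowser f91ace96", 0.5492, 0.6195),
    ("qutebrowser/qutebrowser c580ebf0", 0.1576, 0.2035),
    ("navidrome/navidrome 7073d18b", 1.1184, 2.1628),
    ("ansible/ansible a26c325b", 0.1419, 0.3806),
]


def summarize(rows):
    """Aggregate per-task costs into the headline numbers (hypothetical helper)."""
    baseline_total = sum(baseline for _, baseline, _ in rows)
    lineman_total = sum(lineman for _, _, lineman in rows)
    # Aggregate delta uses the same convention as the table: negative = cheaper.
    delta_pct = (lineman_total - baseline_total) / baseline_total * 100
    # Assumption: a win is a strictly cheaper Lineman run; ties count as neither.
    wins = sum(1 for _, baseline, lineman in rows if lineman < baseline)
    losses = sum(1 for _, baseline, lineman in rows if lineman > baseline)
    return {
        "baseline_total": baseline_total,
        "lineman_total": lineman_total,
        "delta_pct": delta_pct,
        "cost_saved_pct": -delta_pct,  # assumed meaning of "Cost saved"
        "wins": wins,
        "losses": losses,
    }


summary = summarize(ROWS)
print(f"Baseline total: ${summary['baseline_total']:.4f}")
print(f"Lineman total:  ${summary['lineman_total']:.4f}")
print(f"Cost saved:     {summary['cost_saved_pct']:+.1f}%")
print(f"Wins / Losses:  {summary['wins']}W / {summary['losses']}L")
```

Under these assumptions the totals come out to $4.0353 and $4.8797, the cost saved to -20.9%, and the record to 5W / 4L, matching the summary. The two most expensive losses (navidrome and ansible a26c325b) account for most of the aggregate increase, which is why five wins out of nine still yield a negative overall saving.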

Methodology

Each SWE-bench Pro task is solved twice by the same Claude model in an isolated environment: once as a baseline (standard Claude Code), and once with the Lineman MCP server active. Both runs receive identical instructions and the same git checkout.

Cost is measured in USD using Anthropic's published token prices at the time of the run. Delta is the Lineman cost minus the baseline cost, divided by the baseline cost and expressed as a percentage. A negative delta means the Lineman run was cheaper; a positive delta means it cost more.
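As a rough illustration of the two calculations described here, the sketch below converts token counts to USD and computes a per-task delta. The function names, the usage categories, and especially the prices are hypothetical placeholders chosen for easy arithmetic; they are not Anthropic's actual prices and not the benchmark's real code.

```python
def cost_usd(token_usage, usd_per_million_tokens):
    """Convert token counts to USD given per-million-token prices.

    Both arguments are dicts keyed by usage category (for example
    "input" and "output"). The categories and prices are supplied by
    the caller; nothing here encodes a real price list.
    """
    return sum(
        count / 1_000_000 * usd_per_million_tokens[category]
        for category, count in token_usage.items()
    )


def delta_pct(baseline_usd, lineman_usd):
    """Percentage change of the Lineman cost relative to the baseline.

    Negative means Lineman was cheaper; positive means it cost more.
    """
    if baseline_usd <= 0:
        raise ValueError("baseline cost must be positive")
    return (lineman_usd - baseline_usd) / baseline_usd * 100


# Placeholder prices (NOT real): $2 per million input, $10 per million output.
example_prices = {"input": 2.0, "output": 10.0}
example_usage = {"input": 1_000_000, "output": 500_000}
print(cost_usd(example_usage, example_prices))  # 7.0

# Per-task deltas for two rows of the results table.
print(f"{delta_pct(0.6922, 0.4086):+.1f}%")  # -41.0%
print(f"{delta_pct(0.1419, 0.3806):+.1f}%")  # +168.2%
```

Because the delta is relative to the baseline, a cheap task can show a large percentage swing from a small absolute difference: the +168.2% row above corresponds to about $0.24.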

Quality is evaluated by running the SWE-bench test harness against each solution. A task is marked resolved when all required tests pass.
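The resolution rule and the Quality column encoding can be sketched as follows. This is an illustrative assumption about how the check might be expressed, not the SWE-bench harness itself: `is_resolved`, `quality_symbol`, and `quality_cell` are hypothetical names, and treating a missing test result or an empty required-test list as unresolved is a design choice made for this sketch.

```python
from typing import Iterable, Mapping, Optional


def is_resolved(required_tests: Iterable[str], results: Mapping[str, bool]) -> bool:
    """Return True only if every required test is present and passed.

    Assumption: a required test with no recorded result counts as a
    failure, and an empty required list is treated as unresolved.
    """
    tests = list(required_tests)
    return bool(tests) and all(results.get(name, False) for name in tests)


def quality_symbol(resolved: Optional[bool]) -> str:
    """Map an evaluation outcome to the symbols used in the legend."""
    if resolved is None:
        return "-"  # not yet evaluated
    return "✓" if resolved else "✗"


def quality_cell(baseline: Optional[bool], lineman: Optional[bool]) -> str:
    """Baseline symbol followed by Lineman symbol, as in the Quality column."""
    return quality_symbol(baseline) + quality_symbol(lineman)


# Hypothetical test names, for illustration only.
required = ["test_a", "test_b"]
print(is_resolved(required, {"test_a": True, "test_b": True}))   # True
print(is_resolved(required, {"test_a": True, "test_b": False}))  # False
print(quality_cell(None, None))   # --
print(quality_cell(True, False))  # ✓✗
```

With this encoding, every row in the current table reads "--", meaning neither the baseline nor the Lineman solution has been evaluated yet.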

Generated: 4/12/2026, 4:54:25 PM · commit 43d0725