SWE-bench Pro

Head-to-head cost comparison on real-world software engineering tasks. Each task runs with and without Lineman to measure token savings while preserving solution quality.

10 tasks · 9 valid · model: sonnet · commit 43d0725

Cost saved: -20.9% (the Lineman total is 20.9% higher than the baseline total)
Baseline total: $4.0353
Lineman total: $4.8797
Wins / Losses: 5W / 4L
Avg runs / task: 0.9
| Repository | Baseline | Lineman | Delta | Quality | Turns | Runs |
| --- | --- | --- | --- | --- | --- | --- |
| qutebrowser/qutebrowser (f631cd44) | $0.6922 | $0.4086 | -41.0% | | 22 | 1 |
| gravitational/teleport (3fa69043) | $0.4847 | $0.3426 | -29.3% | | 20 | 1 |
| NodeBB/NodeBB (51d8f3b1) | $0.4201 | $0.3467 | -17.5% | | 21 | 1 |
| internetarchive/openlibrary (4a5d2a7d) | $0.1594 | $0.1387 | -13.0% | | 7 | 1 |
| ansible/ansible (f327e65d) | $0.3118 | $0.2767 | -11.3% | | 14 | 1 |
| NodeBB/NodeBB-04998908ba6721d64eba79ae3b65a351dcfbc5b5-vnan | timeout | timeout | | | | 0 |
| qutebrowser/qutebrowser (f91ace96) | $0.5492 | $0.6195 | +12.8% | | 32 | 1 |
| qutebrowser/qutebrowser (c580ebf0) | $0.1576 | $0.2035 | +29.1% | | 7 | 1 |
| navidrome/navidrome-7073d18b54da7e53274d11c9e2baef1242e8769e | $1.1184 | $2.1628 | +93.4% | | 90 | 1 |
| ansible/ansible (a26c325b) | $0.1419 | $0.3806 | +168.2% | | 22 | 1 |

The Quality column shows the baseline result first, then the Lineman result. Each result is one of: resolved, not resolved, or not yet evaluated (-).
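The headline counts can be re-derived from the per-task rows. The sketch below is not the dashboard's own code: it assumes that a "win" means the Lineman run cost less than the baseline run, that a task is "valid" when both runs produced a cost, and that the average runs figure divides by all tasks including the timed-out one. These assumptions are consistent with the figures reported above; the variable names are illustrative.

```python
# (baseline_usd, lineman_usd, runs) per task, copied from the results table.
# None marks the task where both runs timed out and no cost was recorded.
ROWS = [
    (0.6922, 0.4086, 1),
    (0.4847, 0.3426, 1),
    (0.4201, 0.3467, 1),
    (0.1594, 0.1387, 1),
    (0.3118, 0.2767, 1),
    (None, None, 0),
    (0.5492, 0.6195, 1),
    (0.1576, 0.2035, 1),
    (1.1184, 2.1628, 1),
    (0.1419, 0.3806, 1),
]

# Assumption: a task is valid only if both runs reported a cost.
valid = [(b, l) for b, l, _ in ROWS if b is not None and l is not None]

# Assumption: a win is a task where Lineman was cheaper than the baseline.
wins = sum(1 for b, l in valid if l < b)
losses = sum(1 for b, l in valid if l > b)

# Assumption: the average is taken over all tasks, not only valid ones.
avg_runs = sum(r for _, _, r in ROWS) / len(ROWS)

print(f"{len(ROWS)} tasks, {len(valid)} valid")   # 10 tasks, 9 valid
print(f"{wins}W / {losses}L")                     # 5W / 4L
print(f"avg runs / task: {avg_runs:.1f}")         # avg runs / task: 0.9
```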

Methodology

Each SWE-bench Pro task is attempted twice by the same Claude model in an isolated environment: once as a baseline (standard Claude Code), and once with the Lineman MCP server active. Both runs receive identical instructions and the same git checkout.

Cost is measured in USD using Anthropic's published token prices at the time of the run. Delta is the percentage change in cost from the baseline run to the Lineman run, so a negative delta indicates savings.
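As a worked example of the delta definition, here is a minimal sketch. The function name `delta_pct` is hypothetical and not taken from the benchmark code. It is applied to one row of the results and to the reported totals.

```python
def delta_pct(baseline_usd: float, lineman_usd: float) -> float:
    """Percentage change of the Lineman cost relative to the baseline cost.

    Negative values mean the Lineman run was cheaper.
    """
    return (lineman_usd - baseline_usd) / baseline_usd * 100.0


# Single task: qutebrowser/qutebrowser f631cd44
print(f"{delta_pct(0.6922, 0.4086):+.1f}%")   # -41.0%

# Aggregate over the nine valid tasks, using the reported totals
print(f"{delta_pct(4.0353, 4.8797):+.1f}%")   # +20.9%
```

The aggregate delta of +20.9% is the same quantity the summary reports as a "Cost saved" of -20.9%, with the sign flipped.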

Quality is evaluated by running the SWE-bench test harness against each solution. A task is marked resolved when all required tests pass.
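The "all required tests pass" rule can be sketched as below. This only illustrates the rule as stated. The function name, argument shapes, and test identifiers are hypothetical and do not reflect the SWE-bench harness's actual interface.

```python
def is_resolved(required_tests: list[str], passed_tests: set[str]) -> bool:
    """Return True only if every required test is among the passing tests."""
    return all(test in passed_tests for test in required_tests)


# Hypothetical test identifiers, for illustration only.
required = ["tests/test_a.py::test_one", "tests/test_a.py::test_two"]

print(is_resolved(required, {"tests/test_a.py::test_one",
                             "tests/test_a.py::test_two"}))   # True
print(is_resolved(required, {"tests/test_a.py::test_one"}))   # False
```

Note that with an empty list of required tests this sketch returns True. A real harness would presumably treat that case as an error rather than a resolved task.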

Generated: 4/12/2026, 4:54:25 PM · commit 43d0725