DevTools & AI Coding benchmarks
How performance is measured at the frontier · as of 2026-06-15 · 4 benchmarks
Agentic coding
SWE-bench Verified
% resolved · Share of real GitHub issues resolved end-to-end.
↑ higher is better · SWE-bench
- #01 71.8%
- #02 Moatless Tools + Claude 4 Sonnet 70.8%
- #03 OpenHands + Claude 4 Sonnet 70.4%
- #04 SWE-agent + Claude 4 Sonnet 66.6%
- #05 65.8%
- #06 65.4%
- #07 SWE-agent + Claude 3.7 Sonnet w/ Review Heavy 62.4%
- #08 OpenHands + 4x Scaled (2024-02-03) 60.8%
- #09 58.2%
- #10 OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) 53.0%
- #11 Composio SWE-Kit (2024-10-25) 48.6%
- #12 47.2%
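As a rough illustration of how a "% resolved" figure relates to raw counts, here is a minimal sketch. It assumes the 500-task size of the SWE-bench Verified split; the function name and the example count are hypothetical and are not taken from the leaderboard's own tooling.

```python
def percent_resolved(resolved: int, total: int = 500) -> float:
    """Share of benchmark issues resolved, as a percentage.

    `total` defaults to 500, the assumed size of the SWE-bench Verified split.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100.0 * resolved / total


# Illustrative count only: 359 resolved issues out of 500.
print(f"{percent_resolved(359):.1f}%")  # 71.8%
```

Under that assumption each resolved issue is worth 0.2 percentage points, so a gap such as 70.8% versus 70.4% would correspond to two issues.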
- #01 88.0%
- #02 86.7%
- #03 84.9%
- #04 gemini-2.5-pro-preview-06-05 (32k think) 83.1%
- #05 81.3%
- #06 81.3%
- #07 79.6%
- #08 gemini-2.5-pro-preview-06-05 (default think) 79.1%
- #09 78.2%
- #10 Gemini 2.5 Pro Preview 05-06 76.9%
- #11 76.9%
- #12 DeepSeek-V3.2-Exp (Reasoner) 74.2%
CodeClash Elo
Elo · Head-to-head: models write bots that compete in programming-game arenas.
↑ higher is better · CodeClash
- #01 1,385
- #02 1,366
- #03 1,343
- #04 1,224
- #05 1,199
- #06 1,124
- #07 1,006
- #08 952
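To make the Elo gaps above easier to read, the sketch below applies the standard Elo formulas. It assumes the conventional 400-point logistic scale and a K-factor of 32; whether CodeClash uses these exact parameters or this update rule is not stated in the source, so treat both functions as hypothetical helpers.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model.

    `scale` of 400 is the conventional value, assumed here.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one game.

    `score_a` is 1 for a win by A, 0.5 for a draw, 0 for a loss.
    The K-factor of 32 is an assumption, not a documented CodeClash value.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Ratings taken from ranks #01 and #02 in the list above.
print(f"{expected_score(1385, 1366):.2f}")  # roughly 0.53
```

Under these assumptions a 19-point lead is close to a coin flip, while the 433-point gap between the first and last entries implies an expected score above 0.9 for the leader.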
Economics
Blended cost
$ / 1M tokens · Published price, 3:1 input:output blend; lower is cheaper.
↓ lower is better · LiteLLM (open pricing)
- #01 $0.32
- #02 $0.42
- #03 $0.85
- #04 $3.44
- #05 $3.44
- #06 $3.50
- #07 $4.50
- #08 $6.00
- #09 $6.00
- #10 $10.00
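The 3:1 input:output blend can be read as a weighted average of the two published prices. The sketch below shows one plausible way to compute it; the function name and the example prices are hypothetical, and the page's exact rounding is an assumption.

```python
def blended_cost(input_price: float, output_price: float,
                 input_parts: float = 3.0, output_parts: float = 1.0) -> float:
    """Weighted average price per 1M tokens for a given input:output ratio.

    Defaults to the 3:1 input:output blend described above.
    """
    total_parts = input_parts + output_parts
    if total_parts <= 0:
        raise ValueError("ratio parts must sum to a positive number")
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Hypothetical prices: $3.00 per 1M input tokens, $15.00 per 1M output tokens.
print(f"${blended_cost(3.00, 15.00):.2f}")  # $6.00
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still rank well on this metric.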