Provider-official SWE-Bench Verified results manually verified from public provider pages.
Source · Provider official evals
Version · 2026-Q2 public provider cards
Scores · 6
Test details
Visible tradeoffs
This is an objective signal, so it is mainly about measurable task performance rather than public taste.
source · Provider official evals
metric · Resolved tasks (%)
judge · Objective
direction · higher better
group id · provider_official_swe_bench_verified_2026_q2
domain · Coding
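The metric above, "Resolved tasks (%)", can be read as the share of benchmark tasks a model resolves. A minimal sketch of that arithmetic follows; the function name `resolved_rate` and the counts in the usage line are illustrative assumptions, not figures taken from any provider page.

```python
def resolved_rate(resolved: int, total: int) -> float:
    """Return resolved tasks as a percentage of all tasks (higher is better)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100.0 * resolved / total


# Illustrative counts only: 400 resolved out of 500 tasks.
example = resolved_rate(400, 500)
```

Because the direction is "higher better", two scores are only comparable when both were computed over the same task set.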
What it measures vs what it misses
✓ Measures
Single-pass resolution on verified software engineering tasks.
✗ Misses
Long-horizon collaboration and iterative codebase work.
Why this counts
It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.
Same-test rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
What it misses
It does not fully capture repo-scale iteration, IDE ergonomics, or long debugging loops.
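The same-test rule can be sketched as a percentile computed only over rows that share one group id. The page does not state its percentile formula, so the "share of same-group scores at or below this one" definition used here is an assumption, as are the names `Score` and `percentile_in_group` and the extra row from a hypothetical other group. Only the five scores visible in this section are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    group_id: str
    value: float  # resolved tasks, in percent


GROUP = "provider_official_swe_bench_verified_2026_q2"

# The five scores visible in this section, plus one row from a
# hypothetical other group that must never enter the comparison.
SCORES = [
    Score("Claude Opus 4.7", GROUP, 87.6),
    Score("Claude Opus 4.6", GROUP, 80.8),
    Score("Gemini 3.1 Pro", GROUP, 80.6),
    Score("Claude Sonnet 4.6", GROUP, 79.6),
    Score("Gemini 3 Pro Preview", GROUP, 76.2),
    Score("Some Model", "hypothetical_other_group", 99.0),
]


def percentile_in_group(scores, model, group_id):
    """Percent of same-group scores at or below the model's score.

    Assumed definition; rows from other groups are filtered out first.
    """
    group = [s for s in scores if s.group_id == group_id]
    target = next((s for s in group if s.model == model), None)
    if target is None:
        raise KeyError(f"{model!r} has no score in group {group_id!r}")
    at_or_below = sum(1 for s in group if s.value <= target.value)
    return 100.0 * at_or_below / len(group)
```

Under this definition the 99.0 row from the other group has no effect on any percentile in this group, which is the point of the rule: a score is only ranked against results from the same benchmark and version.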
Leaderboard · this benchmark version
#1 · Claude Opus 4.7
OFF · Apr 29, 2026
Official company result · manually verified
Verified against Anthropic's public Claude Opus 4.7 system card. Uses the published SWE-Bench Verified figure. Checked 2026-04-29. Verification: manual_public_page_verification.
87.6%
#2 · Claude Opus 4.6
OFF · Apr 29, 2026
Official company result · manually verified · Background only
Verified against Anthropic's public Claude Opus 4.6 system card. Uses the published SWE-Bench Verified figure. Checked 2026-04-29. Verification: manual_public_page_verification.
80.8%
#3 · Gemini 3.1 Pro
OFF · Apr 29, 2026
Official company result · manually verified
Verified against the current Gemini Pro public model page, which showed no visible page date when checked. Uses the published SWE-Bench Verified figure for Gemini 3.1 Pro. Checked 2026-04-29. Verification: manual_public_page_verification.
80.6%
#4 · Claude Sonnet 4.6
OFF · Apr 29, 2026
Official company result · manually verified
Verified against Anthropic's public Claude Sonnet 4.6 system card. Uses the published SWE-Bench Verified figure. Checked 2026-04-29. Verification: manual_public_page_verification.
79.6%
#5 · Gemini 3 Pro Preview
OFF · Apr 29, 2026
Official company result · manually verified · Background only
Verified against the current Gemini Pro public model page, which showed no visible page date when checked. Uses the published SWE-Bench Verified figure for Gemini 3 Pro Preview. Checked 2026-04-29. Verification: manual_public_page_verification.
76.2%
#6 · Gemini 2.5 Pro
OFF · Mar 25, 2025
Official company result · manually verified
Verified against Google's March 25, 2025 Gemini thinking models update post. Uses the published SWE-Bench Verified figure for Gemini 2.5 Pro with Google's stated custom agent setup. Checked 2026-04-29. Verification: manual_public_page_verification.