Benchmarks · /benchmarks/provider-swe-bench-verified

SWE-Bench Verified

Name: SWE-Bench Verified
Creator: Provider official evals

Provider-official SWE-Bench Verified results manually verified from public provider pages.

Source · Provider official evals
Version · 2026-Q2 public provider cards
Scores · 6

Test details

Visible tradeoffs: This is an objective signal, so it mainly reflects measurable task performance rather than public taste.

source: Provider official evals
metric: Resolved tasks (%)
judge: Objective
direction: higher better
group id: provider_official_swe_bench_verified_2026_q2
domain: Coding

What it measures vs what it misses

✓ Measures

Single-pass resolution on verified software engineering tasks.

✗ Misses

Long-horizon collaboration and iterative codebase work.

Why this counts: It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Same-test rule: This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.

What it misses: It does not fully capture repo-scale iteration, IDE ergonomics, or long debugging loops.
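The same-test rule can be illustrated with a small sketch. It assumes the percentile is the share of the other models in the group that score strictly lower. That assumption reproduces the percentiles displayed for this group, but the page does not state its formula, so treat `group_percentiles` as a hypothetical helper rather than the site's actual method.

```python
from typing import Dict


def group_percentiles(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank-based percentile inside a single benchmark/version group.

    Assumption (not confirmed by the page): percentile is the share of the
    other models in the same group that score strictly lower, with higher
    scores treated as better.
    """
    n = len(scores)
    if n < 2:
        # A lone model has nothing to be compared against.
        return {name: 100.0 for name in scores}
    return {
        name: 100 * sum(other < score for other in scores.values()) / (n - 1)
        for name, score in scores.items()
    }


# Scores from the provider_official_swe_bench_verified_2026_q2 group.
group = {
    "Claude Opus 4.7": 87.6,
    "Claude Opus 4.6": 80.8,
    "Gemini 3.1 Pro": 80.6,
    "Claude Sonnet 4.6": 79.6,
    "Gemini 3 Pro Preview": 76.2,
    "Gemini 2.5 Pro": 63.8,
}

percentiles = group_percentiles(group)
for name, pct in percentiles.items():
    print(f"{name}: {pct:.0f}%")
```

Because the percentile depends only on rank within the group, adding or removing a model changes every other model's percentile, which is why the figure cannot be compared across benchmark versions.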

Leaderboard · this benchmark version

#1 · Claude Opus 4.7

OFF · Apr 29, 2026

Official company result · manual verified

Raw row drilldown (source, percentile, eligibility)

Source URL: https://anthropic.com/claude-opus-4-7-system-card
Percentile: 100%
Last updated: recent
Eligibility: headline eligible
Identity: exact (1.00)

Verified against Anthropic's public Claude Opus 4.7 system card on April 29, 2026. Uses the published SWE-Bench Verified figure. Verification: manual_public_page_verification.

87.6%

#2 · Claude Opus 4.6

OFF · Apr 29, 2026

Official company result · manual verified · Background only

Raw row drilldown (source, percentile, eligibility)

Source URL: https://anthropic.com/claude-opus-4-6-system-card
Percentile: 80%
Last updated: recent
Eligibility: historical_model
Identity: exact (1.00)

Verified against Anthropic's public Claude Opus 4.6 system card on April 29, 2026. Uses the published SWE-Bench Verified figure. Verification: manual_public_page_verification.

80.8%

#3 · Gemini 3.1 Pro

OFF · Apr 29, 2026

Official company result · manual verified

Raw row drilldown (source, percentile, eligibility)

Source URL: https://deepmind.google/models/gemini/pro/
Percentile: 60%
Last updated: recent
Eligibility: headline eligible
Identity: exact (1.00)

Verified against the current Gemini Pro public model page. Page did not expose a visible page date on April 29, 2026. Uses the published SWE-Bench Verified figure for Gemini 3.1 Pro. Checked 2026-04-29. Verification: manual_public_page_verification.

80.6%

#4 · Claude Sonnet 4.6

OFF · Apr 29, 2026

Official company result · manual verified

Raw row drilldown (source, percentile, eligibility)

Source URL: https://anthropic.com/claude-sonnet-4-6-system-card
Percentile: 40%
Last updated: recent
Eligibility: headline eligible
Identity: exact (1.00)

Verified against Anthropic's public Claude Sonnet 4.6 system card on April 29, 2026. Uses the published SWE-Bench Verified figure. Verification: manual_public_page_verification.

79.6%

#5 · Gemini 3 Pro Preview

OFF · Apr 29, 2026

Official company result · manual verified · Background only

Raw row drilldown (source, percentile, eligibility)

Source URL: https://deepmind.google/models/gemini/pro/
Percentile: 20%
Last updated: recent
Eligibility: preview_model
Identity: exact (1.00)

Verified against the current Gemini Pro public model page. Page did not expose a visible page date on April 29, 2026. Uses the published SWE-Bench Verified figure for Gemini 3 Pro Preview. Checked 2026-04-29. Verification: manual_public_page_verification.

76.2%

#6 · Gemini 2.5 Pro

OFF · Mar 25, 2025

Official company result · manual verified

Raw row drilldown (source, percentile, eligibility)

Source URL: https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-model-thinking-updates-march-2025/
Percentile: 0%
Last updated: archived
Eligibility: headline eligible
Identity: exact (1.00)

Verified against Google's March 25, 2025 Gemini thinking models update post. Uses the published SWE-Bench Verified figure for Gemini 2.5 Pro with Google's stated custom agent setup. Checked 2026-04-29. Verification: manual_public_page_verification.

63.8%
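The eligibility field on each row can be used to separate headline results from background ones. The sketch below assumes that rows marked `historical_model` or `preview_model` (the ones tagged "Background only") are excluded from headline comparisons. The page does not define that rule, and the `Row` type and `headline_rows` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Row:
    model: str
    score: float       # Resolved tasks (%)
    eligibility: str   # value copied from the row drilldown


# Rows copied from the leaderboard above.
ROWS = [
    Row("Claude Opus 4.7", 87.6, "headline eligible"),
    Row("Claude Opus 4.6", 80.8, "historical_model"),
    Row("Gemini 3.1 Pro", 80.6, "headline eligible"),
    Row("Claude Sonnet 4.6", 79.6, "headline eligible"),
    Row("Gemini 3 Pro Preview", 76.2, "preview_model"),
    Row("Gemini 2.5 Pro", 63.8, "headline eligible"),
]


def headline_rows(rows: Iterable[Row]) -> List[Row]:
    """Keep only headline-eligible rows, best score first.

    Assumption: any eligibility value other than "headline eligible"
    marks a background-only row.
    """
    eligible = (r for r in rows if r.eligibility == "headline eligible")
    return sorted(eligible, key=lambda r: r.score, reverse=True)


headline = [r.model for r in headline_rows(ROWS)]
print(headline)
```

Under this assumption four of the six rows remain, and the background rows stay available for context without affecting a headline ranking.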

