GPT-5
Closest option
GPT-5 has direct evidence on part of this preset, but not enough to clear the exact-match floor.
The current evidence supports a shortlist, not a single winner.
Direct matches stay strict; strong models with indirect data still surface below. Open a row for its scores, source links, and caveats.
Claude Fable 5 is strongest on Reasoning / math / science and Long context for this preset.
The visible evidence mix still leans on weaker or split signals, especially around Document understanding, source verification state, and any backfilled or relay evidence still in play.
Recent data-parser or model-matching changes shifted the data from Artificial Analysis, Vals AI, and Arena.
Claude Opus 4.8 is strongest on Reasoning / math / science and Long context for this preset.
The visible evidence mix still leans on weaker or split signals, especially around Document understanding, source verification state, and any backfilled or relay evidence still in play.
Recent data-parser or model-matching changes shifted the data from Artificial Analysis, Vals AI, and Arena.
Claude Opus 4.7 is strongest on Reasoning / math / science and Document understanding for this preset.
Provider-official evidence is self-reported company data. It can support recency, but it is not third-party test data.
The current lead depends partly on official company results, so share it with a warning until independent coverage deepens.
Some visible coverage is coming from provider-official source links while independent coverage catches up.
GPT-5.5 is strongest on Long context and Document understanding for this preset.
Provider-official evidence is self-reported company data. It can support recency, but it is not third-party test data.
The current lead depends partly on official company results, so share it with a warning until independent coverage deepens.
Some visible coverage is coming from provider-official source links while independent coverage catches up.
Gemini 3.1 Pro is strongest on Long context and Reasoning / math / science for this preset.
Provider-official evidence is self-reported company data. It can support recency, but it is not third-party test data.
The current lead depends partly on official company results, so share it with a warning until independent coverage deepens.
Some visible coverage is coming from provider-official source links while independent coverage catches up.
No source link clears the minimum source requirement.
Known current model
OpenAI · 100% visible · 100% direct · 0% indirect
GPT-5 has direct evidence on part of this preset, but not enough to clear the exact-match floor.
xAI · 100% visible · 100% direct · 0% indirect
Grok 4.20 has direct evidence on part of this preset, but not enough to clear the exact-match floor.
Anthropic · 100% visible · 100% direct · 0% indirect
Claude Haiku 4.5 has direct evidence on part of this preset, but not enough to clear the exact-match floor.
Google · 100% visible · 100% direct · 0% indirect
Gemini 3.5 Flash has direct evidence on part of this preset, but not enough to clear the exact-match floor.
OpenAI · 100% visible · 100% direct · 0% indirect
GPT-5.4 mini has direct evidence on part of this preset, but not enough to clear the exact-match floor.
OpenAI · 100% visible · 100% direct · 0% indirect
GPT-5.4 nano has direct evidence on part of this preset, but not enough to clear the exact-match floor.
xAI · 100% visible · 100% direct · 0% indirect
Grok 4 has direct evidence on part of this preset, but not enough to clear the exact-match floor.
xAI · 100% visible · 100% direct · 0% indirect
Grok 4.3 has direct evidence on part of this preset, but not enough to clear the exact-match floor.
xAI · 100% visible · 100% direct · 0% indirect
Grok 4.1 Fast has direct evidence on part of this preset, but not enough to clear the exact-match floor.
Google · 100% visible · 100% direct · 0% indirect
Gemini 3 Pro Preview has direct evidence on part of this preset, but not enough to clear the exact-match floor.
Google · 100% visible · 100% direct · 0% indirect
Gemini 2.5 Pro has direct evidence on part of this preset, but not enough to clear the exact-match floor.
Google · 100% visible · 100% direct · 0% indirect
Gemini 3 Flash has direct evidence on part of this preset, but not enough to clear the exact-match floor.
The product keeps parser and mapping ambiguity visible instead of silently guessing.
80 benchmark rows were added, 4 removed, and 16276 existing rows changed value or evaluation date. Window: 2026-06-20T23:37:10Z -> 2026-06-24T03:37:55Z.
28 benchmark rows were added, 0 removed, and 5949 existing rows changed value or evaluation date. Window: 2026-06-20T23:37:17Z -> 2026-06-24T03:38:09Z.
The saved raw source snapshot changed relative to the previous run. Window: 2026-06-20T23:37:24Z -> 2026-06-24T03:38:25Z.
The saved raw source snapshot changed relative to the previous run. Window: 2026-06-20T23:37:34Z -> 2026-06-24T03:38:36Z.
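The added / removed / changed counts reported above can be read as a diff between two saved snapshots. The following is a minimal sketch of that idea, not the product's actual implementation: the `Row` type, the `(model_id, benchmark_id)` key, and the `diff_snapshots` function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One benchmark row (hypothetical shape, assumed for this sketch)."""
    value: float
    evaluated_on: str  # ISO date string


# Assumed identity key for a row: (model_id, benchmark_id).
Key = tuple[str, str]


def diff_snapshots(previous: dict[Key, Row], current: dict[Key, Row]) -> dict[str, int]:
    """Count rows added, removed, and changed between two snapshots.

    A row counts as "changed" when its key exists in both snapshots
    but its value or evaluation date differs.
    """
    added = current.keys() - previous.keys()
    removed = previous.keys() - current.keys()
    shared = current.keys() & previous.keys()
    changed = [key for key in shared if current[key] != previous[key]]
    return {"added": len(added), "removed": len(removed), "changed": len(changed)}
```

Keying rows on a stable identifier rather than a display name means a renamed model is not miscounted as one removal plus one addition.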
Added comparison-table homepage, same-test normalization, per-cell source links, source pages, and custom-ranking preview.
Documented comparability rules, raw-vs-normalized behavior, and why unlike metrics are never averaged by default.
Stable model and creator IDs are now the preferred external identity keys when available.
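The comparability rule noted in the entries above (unlike metrics are never averaged by default) could be enforced with a guard like the one below. This is only a sketch under stated assumptions: the `average_same_metric` function and the `(metric_name, value)` input shape are hypothetical and not taken from the product.

```python
from statistics import mean


def average_same_metric(scores: list[tuple[str, float]]) -> float:
    """Average scores only when every entry uses the same metric.

    `scores` is a list of (metric_name, value) pairs (assumed shape).
    Mixing metrics raises an error instead of silently averaging.
    """
    if not scores:
        raise ValueError("no scores to average")
    metrics = {metric for metric, _ in scores}
    if len(metrics) > 1:
        raise ValueError(f"refusing to average unlike metrics: {sorted(metrics)}")
    return mean(value for _, value in scores)
```

Raising an error by default puts the decision to combine unlike metrics in the caller's hands, so a mixed average never shows up as a plain number.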
If this date looks stale, you may be seeing an older build or cached deploy.
best open model for long-context research · Research assistant
compare gpt-5, claude opus, gemini pro · Everyday chatbot
gpt-5 vs claude opus · Everyday chatbot
benchmark controversy for livebench coding · Coding copilot
what changed this week · Everyday chatbot
open model gpt-5 · Open-weight shortlist