UAB · Unbiased AI Bench · AI model rankings with source links.
Every score links back to its source.
Guide
Live · updated continuously
Home · guided decision

Find the right AI model,
with evidence you can check.

Tell us the task. Get a model shortlist, then inspect the source links, raw scores, dates, and warnings before you trust it.
Default job · model selection
Presets · 6 visible scenarios
Supporting evidence · always available
Step 1 · Ask the job

Describe the job in one line.

Results update live while you type. Use the button only when you want the app to apply suggested preset and filter changes from your wording.

Query

Live update · Preset · Everyday chatbot · All public sources

Current question: Everyday chatbot with all public sources
Use case

General-purpose chat quality with decent reasoning and enough context to feel useful day to day.

Sources to include

All public sources can include official company results while independent results catch up.

Current scoring recipe: coverage, recency, and included sources
This preset weights chat / text, reasoning / math / science, and long context, with a 60% coverage floor and a 120-day recency window. Official company results can contribute when they are clearly labeled. Copied, historical, and demo evidence stay out unless you explicitly allow them.
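The recipe above can be read as a filter applied before any scoring. The sketch below is a minimal illustration under stated assumptions, not the site's actual implementation: the `EvidenceRow` fields, the evidence-kind labels, and the reading of "coverage" as the share of preset domains with at least one eligible row are all hypothetical. Only the 60% floor, the 120-day window, the three domains, and the default exclusion of copied, historical, and demo evidence come from the text.

```python
from dataclasses import dataclass
from datetime import date, timedelta

PRESET_DOMAINS = ("chat text", "reasoning math science", "long context")
COVERAGE_FLOOR = 0.60                 # from the recipe text
RECENCY_WINDOW = timedelta(days=120)  # from the recipe text
# Assumed labels for evidence kinds that stay out by default.
EXCLUDED_KINDS = {"copied", "historical", "demo"}


@dataclass(frozen=True)
class EvidenceRow:
    model: str
    domain: str
    kind: str        # assumed labels, e.g. "independent", "official", "demo"
    evaluated: date


def eligible_rows(rows, today, allow_excluded=False):
    """Keep rows inside the recency window and from allowed evidence kinds."""
    kept = []
    for row in rows:
        if today - row.evaluated > RECENCY_WINDOW:
            continue  # older than the recency window
        if row.kind in EXCLUDED_KINDS and not allow_excluded:
            continue  # excluded unless explicitly allowed
        kept.append(row)
    return kept


def coverage(rows, model):
    """Share of preset domains with at least one row for the model."""
    covered = {
        r.domain for r in rows
        if r.model == model and r.domain in PRESET_DOMAINS
    }
    return len(covered) / len(PRESET_DOMAINS)


def clears_floor(rows, model, today):
    """True when eligible evidence covers at least the floor share of domains."""
    return coverage(eligible_rows(rows, today), model) >= COVERAGE_FLOOR
```

Under this reading, a model with fresh evidence in two of the three domains reaches about 67% coverage and clears the floor, while a model with evidence in only one domain does not.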
Step 2 · Recommendation

Best evidence-backed choices for Everyday chatbot with all public sources: Gemini 3.1 Pro, GPT-5, and Grok 4.

Data version May 13, 2026 · No blocked sources excluded · All public sources
Visible tradeoffs: The current evidence supports a shortlist, not a single winner.
Current top picks: Gemini 3.1 Pro, GPT-5, Grok 4
Answer type: Top picks, not one winner
Coverage: 100% visible · 100% verified
Preset: Everyday chatbot
Sources: All public sources
Latest strong evidence: May 13, 2026
Preview · Official source data included · Model family view
Why these finalists made the cut · top reasons behind the current answer
  • Current shortlist: Gemini 3.1 Pro, GPT-5, and Grok 4.
  • Gemini 3.1 Pro is the strongest exact-match option still visible.
  • Gemini 3.1 Pro currently leads the fit score at 82.0, but the evidence is still too mixed for a single headline winner.
What to pressure test · where the current answer is still fragile
  • No single winner: The current public evidence is only strong enough to support a shortlist, not one winner.
  • Strongest alternative · GPT-5: GPT-5 is strongest on Reasoning / math / science and Chat / text for this preset.
  • Evidence risk: The current lead depends partly on official company results, so share it with a warning until independent coverage deepens.
Step 3 · Pressure test the call

Read the argument before you commit

What would flip the answer

  • If you tighten benchmark spread: Gemini 3.1 Pro still holds if you care more about aligned evidence than upside.
  • If you tighten recency: Gemini 3.1 Pro remains viable because the visible evidence is still fairly fresh.
  • If you require open-weight: No open-weight model currently clears the same evidence floor.
  • If cost and speed matter more: No clearly cheaper alternative currently clears the same evidence floor.

Why this is not a clean win

  • The current evidence supports a shortlist, not a single winner.
  • GPT-5 remains close enough that a different scoring recipe can still flip the public answer.

Evidence trail

  • Top picks: Current shortlist: Gemini 3.1 Pro, GPT-5, and Grok 4
  • Evidence 1: Strongest exact-match option: Gemini 3.1 Pro
  • Evidence 2: Strongest indirect contender: None visible
  • Evidence 3: Best open-weight finalist: No open-weight finalist in the current source data
Decision Buckets

Exact leaders first, then indirect and missing-coverage cases

The guide now keeps exact-match winners strict while still surfacing strong models that would previously disappear.

Primary bucket

Exact-match leaders

#1

Gemini 3.1 Pro

Google · frontier · 100% visible · 67% exact · 0% indirect

Preview · Official source data included · Model family view
Visible tradeoffs: 27.1% benchmark spread · 100% freshness · exact direct
Fit score: 82.0
Strongest evidence: long context · reasoning math science

Gemini 3.1 Pro is strongest on Long context and Reasoning / math / science for this preset.

The current lead depends partly on official company results, so share it with a warning until independent coverage deepens.

Some visible coverage is coming from provider-official source links while independent coverage catches up.

Verified rows: 6
Manual checks: 2
Relay rows: 0
Backfilled rows: 0
Headline lane: May 13, 2026
Context lane: No extra context
Base score is the weighted mean of preset-domain benchmark fit (85.4).
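A weighted mean of per-domain fit can be sketched as follows. This is an illustration only: the page does not publish the per-domain fit values or the preset weights, so the numbers and the `weighted_mean` helper below are hypothetical placeholders. The step from base score to final fit score is not shown because the source does not describe it.

```python
def weighted_mean(values, weights):
    """Weighted mean over the domains present in both mappings.

    Domains without a fit value are skipped and the remaining weights are
    renormalized; that handling of missing coverage is an assumption.
    """
    shared = [d for d in weights if d in values]
    total_weight = sum(weights[d] for d in shared)
    if total_weight == 0:
        raise ValueError("no overlapping domains with positive weight")
    return sum(values[d] * weights[d] for d in shared) / total_weight


# Hypothetical weights and per-domain fit values, for illustration only.
weights = {"chat text": 5, "reasoning math science": 3, "long context": 2}
fit = {"chat text": 80.0, "reasoning math science": 90.0, "long context": 70.0}

base_score = weighted_mean(fit, weights)
print(round(base_score, 1))  # 81.0
```

Because the weights are renormalized over the domains that have a value, a model with missing coverage is averaged only over what is visible, which is one reason a coverage floor matters before scores are compared.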
#2

GPT-5

OpenAI · frontier · 100% visible · 100% exact · 0% indirect

Visible tradeoffs: 33.3% benchmark spread · 95% freshness · exact direct
Fit score: 49.6
Strongest evidence: reasoning math science · chat text

GPT-5 is strongest on Reasoning / math / science and Chat / text for this preset.

The visible evidence mix still leans on weaker or split signals, especially around Long context, source verification state, and any backfilled or relay evidence still in play.

Verified rows: 6
Manual checks: 0
Relay rows: 0
Backfilled rows: 0
Headline lane: May 13, 2026
Context lane: No extra context
Base score is the weighted mean of preset-domain benchmark fit (44.1).
#3

Grok 4

xAI · premium · 67% visible · 67% exact · 0% indirect

Visible tradeoffs: 91.1% benchmark spread · 77.5% freshness · exact direct
Fit score: 41.4
Strongest evidence: chat text · reasoning math science

Grok 4 is strongest on Chat / text and Reasoning / math / science for this preset.

The visible evidence mix still leans on weaker or split signals, especially around Reasoning / math / science, source verification state, and any backfilled or relay evidence still in play.

Verified rows: 3
Manual checks: 0
Relay rows: 0
Backfilled rows: 0
Headline lane: May 13, 2026
Context lane: No extra context
Base score is the weighted mean of preset-domain benchmark fit (52.3).
#4

Qwen3 235B A22B

Qwen · mid · 67% visible · 67% exact · 0% indirect

Visible tradeoffs: 58.9% benchmark spread · 77.5% freshness · exact direct
Fit score: 35.4
Strongest evidence: chat text · reasoning math science

Qwen3 235B A22B is strongest on Chat / text and Reasoning / math / science for this preset.

The visible evidence mix still leans on weaker or split signals, especially around Reasoning / math / science, source verification state, and any backfilled or relay evidence still in play.

Verified rows: 2
Manual checks: 0
Relay rows: 0
Backfilled rows: 0
Headline lane: May 13, 2026
Context lane: No extra context
Base score is the weighted mean of preset-domain benchmark fit (40.5).
#5

Llama 4 Maverick

Meta · mid · 67% visible · 67% exact · 0% indirect

Visible tradeoffs: 43.9% benchmark spread · 100% freshness · exact direct
Fit score: 22.8
Strongest evidence: chat text · reasoning math science

Llama 4 Maverick is strongest on Chat / text and Reasoning / math / science for this preset.

The visible evidence mix still leans on weaker or split signals, especially around Reasoning / math / science, source verification state, and any backfilled or relay evidence still in play.

Verified rows: 2
Manual checks: 0
Relay rows: 0
Backfilled rows: 0
Headline lane: May 13, 2026
Context lane: No extra context
Base score is the weighted mean of preset-domain benchmark fit (25.2).
Tertiary bucket

Tracked but under-benchmarked

These models are in the official registry, but the current benchmark surface still has missing or indirect coverage.

Gemini 3 Pro Preview

Google · 100% visible · 100% exact · 0% indirect

Gemini 3 Pro Preview has direct evidence on part of this preset, but not enough to clear the exact-match floor.

chat text · reasoning math science

Gemini 2.5 Pro

Google · 67% visible · 67% exact · 0% indirect

Gemini 2.5 Pro has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Long context.
chat text · reasoning math science

GPT-5.4

OpenAI · 67% visible · 67% exact · 0% indirect

GPT-5.4 has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Long context.
reasoning math science · chat text

Claude Opus 4.7

Anthropic · 67% visible · 67% exact · 0% indirect

Claude Opus 4.7 has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Long context.
reasoning math science · chat text

Claude Sonnet 4.6

Anthropic · 67% visible · 67% exact · 0% indirect

Claude Sonnet 4.6 has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Long context.
chat text · reasoning math science

Gemini 3.1 Flash-Lite Preview

Google · 67% visible · 67% exact · 0% indirect

Gemini 3.1 Flash-Lite Preview has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Long context.
chat text · reasoning math science

GPT-5.5

OpenAI · 67% visible · 67% exact · 0% indirect

GPT-5.5 has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Long context.
chat text · reasoning math science

Claude Haiku 4.5

Anthropic · 33% visible · 33% exact · 0% indirect

Claude Haiku 4.5 has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Reasoning / math / science and Long context.
chat text

DeepSeek Chat

DeepSeek · 33% visible · 33% exact · 0% indirect

DeepSeek Chat has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Reasoning / math / science and Long context.
chat text

DeepSeek Reasoner

DeepSeek · 33% visible · 33% exact · 0% indirect

DeepSeek Reasoner has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Reasoning / math / science and Long context.
chat text

DeepSeek V3.2 Exp

DeepSeek · 33% visible · 33% exact · 0% indirect

DeepSeek V3.2 Exp has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Reasoning / math / science and Long context.
chat text

Gemini 2.5 Flash

Google · 33% visible · 33% exact · 0% indirect

Gemini 2.5 Flash has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Reasoning / math / science and Long context.
chat text

Shareable claims with evidence

The product should generate public claims worth checking, not just filter state.

What changed this week

alert
7 review items still need manual judgment

The product keeps parser and mapping ambiguity visible instead of silently guessing.

models
Artificial Analysis moved via real benchmark movement

0 benchmark rows were added, 0 removed, and 134 existing rows changed value or evaluation date. Window: 2026-05-13T01:05:56Z -> 2026-05-13T01:19:35Z.

Evidence window: 2026-05-13T01:05:56Z -> 2026-05-13T01:19:35Z

product
Initial glass-box matrix release

Added matrix homepage, comparable-group normalization, per-cell receipts, source pages, and custom composite preview.

Evidence window: 2026-04-16

models
Methodology contract published

Documented comparability rules, raw-vs-normalized behavior, and why unlike metrics are never averaged by default.

Evidence window: 2026-04-16

models
Artificial Analysis ID rule adopted

Stable model and creator IDs are now the preferred external identity keys when available.

Evidence window: 2026-04-15

models
BridgeBench parser fallback added

Added alternate selectors for category headers after leaderboard markup drift.

Evidence window: 2026-04-15

disagreement
Gemini 2.5 Pro is still a split decision

Cross-benchmark spread sits at 100.0 points, which means rankings still depend heavily on which visible benchmark slices you weight most.

disagreement
Gemini 3.1 Pro is still a split decision

Cross-benchmark spread sits at 100.0 points, which means rankings still depend heavily on which visible benchmark slices you weight most.
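One plausible reading of "cross-benchmark spread" is the gap between a model's best and worst normalized score across the visible benchmarks. The sketch below assumes that definition and a 0 to 100 normalized scale; neither is confirmed by the page, and the benchmark names and scores are made up for illustration.

```python
def benchmark_spread(normalized_scores):
    """Return max minus min of a model's normalized benchmark scores."""
    if not normalized_scores:
        raise ValueError("need at least one score")
    values = list(normalized_scores.values())
    return max(values) - min(values)


# Hypothetical normalized scores (0-100) for one model.
scores = {"benchmark_a": 100.0, "benchmark_b": 0.0, "benchmark_c": 62.5}
print(benchmark_spread(scores))  # 100.0
```

Under this reading, a spread of 100.0 points would mean the model sits at the top of one visible slice and at the bottom of another, so its rank depends almost entirely on which slices are weighted.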

Data version

Read this before trusting a headline.

Data version: May 13, 2026
Model list checked: 9 providers · 800 tracked models
Page refreshed: May 18, 2026

If this date looks stale, you may be seeing an older build or cached deploy.

Quick routes

Jump straight to the page you need.

These shortcuts resolve into public URLs instead of hidden state. Use them to open a recommendation page, compare workspace, head-to-head page, disagreement page, change log feed, or a specific model, benchmark, or source.

  • Resolve a recommendation into a public artifact · "best open model for long-context research" · Research assistant
  • Send a shortlist into compare mode · "compare gpt-5, claude opus, gemini pro" · Everyday chatbot
  • Open a head-to-head debate page · "gpt-5 vs claude opus" · Everyday chatbot
  • Open a disagreement artifact · "benchmark controversy for livebench coding" · Coding copilot
  • Open the latest public movement · "what changed this week" · Everyday chatbot
  • Jump straight to an entity page · "open model gpt-5" · Open-weight shortlist
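Resolving a shortcut into a public URL, rather than hidden client state, amounts to encoding the query and preset into the address. The sketch below is hypothetical throughout: the base URL, the `/guide` path, and the `q` and `preset` parameter names are placeholders, since the page does not document its route format.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://example.invalid"  # placeholder host, not the real site


def shortcut_url(query, preset, path="/guide"):
    """Encode a query and preset into a shareable URL (assumed route shape)."""
    return f"{BASE_URL}{path}?{urlencode({'q': query, 'preset': preset})}"


def read_shortcut(url):
    """Recover the query and preset from a URL built by shortcut_url."""
    params = parse_qs(urlsplit(url).query)
    return params["q"][0], params["preset"][0]


url = shortcut_url("gpt-5 vs claude opus", "Everyday chatbot")
print(url)
print(read_shortcut(url))
```

Keeping the full state in the query string means anyone opening the link sees the same question and preset, which is what makes a recommendation checkable by someone other than the person who asked.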