Home · guided decision

Find the right AI model,
with evidence you can check.

Tell us the task. Get a model shortlist, then inspect the source links, raw scores, dates, and warnings before you trust it.

Default job · model selection
Presets · 6 visible scenarios
Supporting evidence · always available

Step 1 · Ask the job

Describe the job in one line.

Results update live while you type. Use the button only when you want the app to apply suggested preset and filter changes from your wording.

Query

Live update · Preset · Everyday chatbot · All public sources

Current question: Everyday chatbot with all public sources

Use case

General-purpose chat quality with decent reasoning and enough context to feel useful day to day.

Sources to include

All public sources can include official company results while independent results catch up.

Access model

Primary filters

Current scoring recipe: coverage, recency, and included sources

This preset weights three domains (Chat / text, Reasoning / math / science, and Long context) with a 60% coverage floor and a 120-day recency window. Official company results can contribute when they are clearly labeled. Copied, historical, and demo evidence stays out unless you explicitly allow it.
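One way to read this recipe is as a filter followed by a weighted mean. The sketch below is a minimal illustration, not the app's actual implementation: the `EvidenceRow` fields, the source-kind labels, the equal domain weights, the 0-100 score scale, and the `fit_score` helper are all assumptions made for the example. Only the 60% coverage floor, the 120-day window, and the three domains come from the recipe text.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed labels for evidence kinds that stay out by default.
EXCLUDED_KINDS = {"copied", "historical", "demo"}
# Equal weights are an assumption; the real weights are not published here.
DOMAIN_WEIGHTS = {"chat_text": 1.0, "reasoning_math_science": 1.0, "long_context": 1.0}
COVERAGE_FLOOR = 0.60
RECENCY_WINDOW = timedelta(days=120)


@dataclass
class EvidenceRow:
    domain: str      # one of the preset domains
    score: float     # normalized fit for that domain (hypothetical 0-100 scale)
    evaluated: date  # evaluation date of the benchmark row
    kind: str        # e.g. "independent", "official", "copied"


def fit_score(rows, today):
    """Return (coverage, score); score is None when coverage is below the floor."""
    fresh = [
        r for r in rows
        if r.kind not in EXCLUDED_KINDS
        and today - r.evaluated <= RECENCY_WINDOW
        and r.domain in DOMAIN_WEIGHTS
    ]
    # Group surviving scores by domain so each domain is averaged first.
    by_domain = {}
    for r in fresh:
        by_domain.setdefault(r.domain, []).append(r.score)
    coverage = len(by_domain) / len(DOMAIN_WEIGHTS)
    if coverage < COVERAGE_FLOOR:
        return coverage, None
    total_weight = sum(DOMAIN_WEIGHTS[d] for d in by_domain)
    weighted = sum(DOMAIN_WEIGHTS[d] * sum(s) / len(s) for d, s in by_domain.items())
    return coverage, weighted / total_weight
```

Under these assumptions, a model with fresh evidence in two of the three domains has about 67% coverage and clears the floor, while a model with evidence in only one domain (about 33%) gets no score.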

Step 2 · Recommendation

Best evidence-backed choices for Everyday chatbot with all public sources: Gemini 3.1 Pro, GPT-5, and Grok 4.

Data version May 13, 2026 · No blocked sources excluded · All public sources

Visible tradeoffs: The current evidence supports a shortlist, not a single winner.

Current top picks: Gemini 3.1 Pro, GPT-5, Grok 4

Answer type: Top picks, not one winner

Coverage: 100% visible · 100% verified

Preset: Everyday chatbot

Sources: All public sources

Latest strong evidence: May 13, 2026

Preview · Official source data included · Model family view

Why these finalists made the cut: top reasons behind the current answer

Current shortlist: Gemini 3.1 Pro, GPT-5, and Grok 4.
Gemini 3.1 Pro is the strongest exact-match option still visible.
Gemini 3.1 Pro currently leads the fit score at 82.0, but the evidence is still too mixed for a single headline winner.

What to pressure test: where the current answer is still fragile

No single winner: The current public evidence is only strong enough to support a shortlist, not one winner.
Strongest alternative · GPT-5: GPT-5 is strongest on Reasoning / math / science and Chat / text for this preset.
Evidence risk: The current lead depends partly on official company results, so share it with a warning until independent coverage deepens.

Step 3 · Pressure test the call

Read the argument before you commit.

What would flip the answer

If you tighten benchmark spread: Gemini 3.1 Pro still holds if you care more about aligned evidence than upside.
If you tighten recency: Gemini 3.1 Pro remains viable because the visible evidence is still fairly fresh.
If you require open-weight: No open-weight model currently clears the same evidence floor.
If cost and speed matter more: No clearly cheaper alternative currently clears the same evidence floor.

Why this is not a clean win

The current evidence supports a shortlist, not a single winner.
GPT-5 remains close enough that a different scoring recipe can still flip the public answer.

Evidence trail

Top picks: Current shortlist: Gemini 3.1 Pro, GPT-5, and Grok 4
Evidence 1: Strongest exact-match option: Gemini 3.1 Pro
Evidence 2: Strongest indirect contender: None visible
Evidence 3: Best open-weight finalist: No open-weight finalist in the current source data

Decision Buckets

Exact leaders first, then indirect and missing-coverage cases

The guide now keeps exact-match winners strict while still surfacing strong models that would previously disappear.

Primary bucket

Exact-match leaders

Gemini 3.1 Pro

Google · frontier · 100% visible · 67% exact · 0% indirect

Preview · Official source data included · Model family view

Visible tradeoffs: 27.1% benchmark spread · 100% freshness · exact direct

Fit score: 82.0

Strongest evidence: long context · reasoning math science

Gemini 3.1 Pro is strongest on Long context and Reasoning / math / science for this preset.

The current lead depends partly on official company results, so share it with a warning until independent coverage deepens.

Some visible coverage is coming from provider-official source links while independent coverage catches up.

Verified rows: 6
Manual checks: 2
Relay rows: 0
Backfilled rows: 0

Headline lane: May 13, 2026
Context lane: No extra context

Base score is the weighted mean of preset-domain benchmark fit (85.4).

GPT-5

OpenAI · frontier · 100% visible · 100% exact · 0% indirect

Visible tradeoffs: 33.3% benchmark spread · 95% freshness · exact direct

Fit score: 49.6

Strongest evidence: reasoning math science · chat text

GPT-5 is strongest on Reasoning / math / science and Chat / text for this preset.

The visible evidence mix still leans on weaker or split signals, especially around Long context, source verification state, and any backfilled or relay evidence still in play.

Verified rows: 6
Manual checks: 0
Relay rows: 0
Backfilled rows: 0

Headline lane: May 13, 2026
Context lane: No extra context

Base score is the weighted mean of preset-domain benchmark fit (44.1).

Grok 4

xAI · premium · 67% visible · 67% exact · 0% indirect

Visible tradeoffs: 91.1% benchmark spread · 77.5% freshness · exact direct

Fit score: 41.4

Strongest evidence: chat text · reasoning math science

Grok 4 is strongest on Chat / text and Reasoning / math / science for this preset.

The visible evidence mix still leans on weaker or split signals, especially around Reasoning / math / science, source verification state, and any backfilled or relay evidence still in play.

Verified rows: 3
Manual checks: 0
Relay rows: 0
Backfilled rows: 0

Headline lane: May 13, 2026
Context lane: No extra context

Base score is the weighted mean of preset-domain benchmark fit (52.3).

Qwen3 235B A22B

Qwen · mid · 67% visible · 67% exact · 0% indirect

Visible tradeoffs: 58.9% benchmark spread · 77.5% freshness · exact direct

Fit score: 35.4

Strongest evidence: chat text · reasoning math science

Qwen3 235B A22B is strongest on Chat / text and Reasoning / math / science for this preset.

The visible evidence mix still leans on weaker or split signals, especially around Reasoning / math / science, source verification state, and any backfilled or relay evidence still in play.

Verified rows: 2
Manual checks: 0
Relay rows: 0
Backfilled rows: 0

Headline lane: May 13, 2026
Context lane: No extra context

Base score is the weighted mean of preset-domain benchmark fit (40.5).

Llama 4 Maverick

Meta · mid · 67% visible · 67% exact · 0% indirect

Visible tradeoffs: 43.9% benchmark spread · 100% freshness · exact direct

Fit score: 22.8

Strongest evidence: chat text · reasoning math science

Llama 4 Maverick is strongest on Chat / text and Reasoning / math / science for this preset.

The visible evidence mix still leans on weaker or split signals, especially around Reasoning / math / science, source verification state, and any backfilled or relay evidence still in play.

Verified rows: 2
Manual checks: 0
Relay rows: 0
Backfilled rows: 0

Headline lane: May 13, 2026
Context lane: No extra context

Base score is the weighted mean of preset-domain benchmark fit (25.2).

Tertiary bucket

Tracked but under-benchmarked

These models are in the official registry, but the current benchmark surface still has missing or indirect coverage.

Gemini 3 Pro Preview

Google · 100% visible · 100% exact · 0% indirect

Gemini 3 Pro Preview has direct evidence on part of this preset, but not enough to clear the exact-match floor.

chat text · reasoning math science

Gemini 2.5 Pro

Google · 67% visible · 67% exact · 0% indirect

Gemini 2.5 Pro has direct evidence on part of this preset, but not enough to clear the exact-match floor.

Missing benchmark coverage in Long context.

chat text · reasoning math science

GPT-5.4

OpenAI · 67% visible · 67% exact · 0% indirect

GPT-5.4 has direct evidence on part of this preset, but not enough to clear the exact-match floor.

Missing benchmark coverage in Long context.

reasoning math science · chat text

Claude Opus 4.7

Anthropic · 67% visible · 67% exact · 0% indirect

Claude Opus 4.7 has direct evidence on part of this preset, but not enough to clear the exact-match floor.

Missing benchmark coverage in Long context.

reasoning math science · chat text

Claude Sonnet 4.6

Anthropic · 67% visible · 67% exact · 0% indirect

Claude Sonnet 4.6 has direct evidence on part of this preset, but not enough to clear the exact-match floor.

Missing benchmark coverage in Long context.

chat text · reasoning math science

Gemini 3.1 Flash-Lite Preview

Google · 67% visible · 67% exact · 0% indirect

Gemini 3.1 Flash-Lite Preview has direct evidence on part of this preset, but not enough to clear the exact-match floor.

Missing benchmark coverage in Long context.

chat text · reasoning math science

GPT-5.5

OpenAI · 67% visible · 67% exact · 0% indirect

GPT-5.5 has direct evidence on part of this preset, but not enough to clear the exact-match floor.

Missing benchmark coverage in Long context.

chat text · reasoning math science

Claude Haiku 4.5

Anthropic · 33% visible · 33% exact · 0% indirect

Claude Haiku 4.5 has direct evidence on part of this preset, but not enough to clear the exact-match floor.

Missing benchmark coverage in Reasoning / math / science, Long context.

chat text

DeepSeek Chat

DeepSeek · 33% visible · 33% exact · 0% indirect

DeepSeek Chat has direct evidence on part of this preset, but not enough to clear the exact-match floor.

Missing benchmark coverage in Reasoning / math / science, Long context.

chat text

DeepSeek Reasoner

DeepSeek · 33% visible · 33% exact · 0% indirect

DeepSeek Reasoner has direct evidence on part of this preset, but not enough to clear the exact-match floor.

Missing benchmark coverage in Reasoning / math / science, Long context.

chat text

DeepSeek V3.2 Exp

DeepSeek · 33% visible · 33% exact · 0% indirect

DeepSeek V3.2 Exp has direct evidence on part of this preset, but not enough to clear the exact-match floor.

Missing benchmark coverage in Reasoning / math / science, Long context.

chat text

Gemini 2.5 Flash

Google · 33% visible · 33% exact · 0% indirect

Gemini 2.5 Flash has direct evidence on part of this preset, but not enough to clear the exact-match floor.

Missing benchmark coverage in Reasoning / math / science, Long context.

chat text

What changed this week

alert

7 review items still need manual judgment

The product keeps parser and mapping ambiguity visible instead of silently guessing.

models

Artificial Analysis data changed because of real benchmark movement

0 benchmark rows were added, 0 removed, and 134 existing rows changed value or evaluation date. Window: 2026-05-13T01:05:56Z -> 2026-05-13T01:19:35Z.

Evidence window: 2026-05-13T01:05:56Z -> 2026-05-13T01:19:35Z

product

Initial glass-box matrix release

Added matrix homepage, comparable-group normalization, per-cell receipts, source pages, and custom composite preview.

Evidence window: 2026-04-16

models

Methodology contract published

Documented comparability rules, raw-vs-normalized behavior, and why unlike metrics are never averaged by default.

Evidence window: 2026-04-16

models

Artificial Analysis ID rule adopted

Stable model and creator IDs are now the preferred external identity keys when available.

Evidence window: 2026-04-15

models

BridgeBench parser fallback added

Added alternate selectors for category headers after leaderboard markup drift.

Evidence window: 2026-04-15

disagreement

Gemini 2.5 Pro is still a split decision

Cross-benchmark spread sits at 100.0 points, which means rankings still depend heavily on which visible benchmark slices you weight most.

disagreement

Gemini 3.1 Pro is still a split decision

Cross-benchmark spread sits at 100.0 points, which means rankings still depend heavily on which visible benchmark slices you weight most.
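The spread figure can be read as the gap between a model's best and worst visible benchmark results. A minimal sketch, assuming scores are already normalized to a 0-100 scale and that spread is simply the maximum minus the minimum (the exact formula is not stated here); the function name and sample values are hypothetical:

```python
def cross_benchmark_spread(normalized_scores):
    """Gap in points between the best and worst normalized score (0-100 scale)."""
    if not normalized_scores:
        raise ValueError("need at least one score")
    return max(normalized_scores) - min(normalized_scores)


# Hypothetical values: top of one benchmark slice, bottom of another.
print(cross_benchmark_spread([100.0, 55.0, 0.0]))  # 100.0
```

Under that reading, a spread of 100.0 points would mean the model sits at the top of at least one visible slice and at the bottom of another, which is why the ranking depends so heavily on the weights.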

Watchlists

Followed items reopen from their canonical URL first. Bundle export still works, but the durable state is the link itself (the href plus deterministic latest-delta links), not a compare preset rebuilt from local data.

No watchlists yet. Follow a recommendation card or compare set.

Saved compare views

Save a compare workspace to keep a shortlist around.

Workspace bundle

Portable bundles stay link-native. Use them to preview a shared workspace, reopen the same compare URLs on another device, or import the snapshot without reconstructing intent from loose local fields.

Current workspace: 0 saved compare views · 0 watches · 0 pinned compare models

Preview or import a shared bundle

Data version

Read this before trusting a headline.

Data version May 13, 2026
Model list checked: 9 providers · 800 tracked models
Page refreshed May 18, 2026

If this date looks stale, you may be seeing an older build or cached deploy.

Quick routes

Jump straight to the page you need.

These shortcuts resolve into public URLs instead of hidden state. Use them to open a recommendation page, compare workspace, head-to-head page, disagreement page, change log feed, or a specific model, benchmark, or source.

Resolve a recommendation into a public artifact · best open model for long-context research · Research assistant

Send a shortlist into compare mode · compare gpt-5, claude opus, gemini pro · Everyday chatbot

Open a head-to-head debate page · gpt-5 vs claude opus · Everyday chatbot

Open a disagreement artifact · benchmark controversy for livebench coding · Coding copilot

Open the latest public movement · what changed this week · Everyday chatbot

Jump straight to an entity page · open model gpt-5 · Open-weight shortlist
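Because every shortcut resolves to a public URL, the full state of a route can live in the link itself. A minimal sketch of that idea, assuming a placeholder host (`https://example.com`) and hypothetical path and parameter names (`/recommend`, `q`, `preset`), none of which are taken from the app:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://example.com"  # placeholder, not the app's real host


def route_url(path, query, preset):
    """Encode a quick route as a shareable URL with no hidden state."""
    return f"{BASE_URL}{path}?{urlencode({'q': query, 'preset': preset})}"


def read_route(url):
    """Recover the query and preset from the URL alone."""
    params = parse_qs(urlsplit(url).query)
    return params["q"][0], params["preset"][0]


url = route_url("/recommend", "gpt-5 vs claude opus", "Everyday chatbot")
print(url)
print(read_route(url))
```

Anyone who opens the link gets the same query and preset back, with no dependence on local storage or session state.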
