
Where AI rankings
disagree.

See where public benchmark sources disagree, which model each source favors, and why one leaderboard can mislead.

Public sources · 9
Open disputes · 7
Goal · show disagreement

Data version

Read this before trusting a headline.

Data version: May 13, 2026
Model list checked: 9 providers · 800 tracked models
Page refreshed: May 18, 2026

If this date looks stale, you may be seeing an older build or cached deploy.

Where AI rankings disagree

Rankings get less tidy but more honest
when disagreement stays visible.

This view shows where public sources refuse to tell the same story. A wide score range is not noise to hide. It is the main fact.


Arena (AR)
LiveBench (LB)
Artificial Analysis (AA)
BridgeBench (BB)
Terminal-Bench
LLMBase
Scale Labs (SL)
OpenCompass (OC)
MTEB

Where the rankings split

score range across tests

[Chart per model: score range on a 0 to 100 axis]

Gemini 2.5 Pro · Google · 56.5% · cross-benchmark spread 100%
Gemini 3.1 Pro · Google · 28.3% · cross-benchmark spread 100%
GPT-5.4 · OpenAI · 63% · cross-benchmark spread 100%
GPT-5.4 mini · OpenAI · 50% · cross-benchmark spread 98.1%
GPT-5.5 · OpenAI · 41.3% · cross-benchmark spread 96.7%
Grok 4.20 · xAI · 30.4% · cross-benchmark spread 93.7%
Claude Sonnet 4.6 · Anthropic · 47.8% · cross-benchmark spread 93.6%
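The page does not say how cross-benchmark spread is calculated. As a rough sketch only, one plausible reading is the gap between a model's highest and lowest score across sources, assuming every score is already on the same 0 to 100 scale. The function name, source names, and scores below are hypothetical and are not taken from this page's data.

```python
def spread(scores_by_source: dict[str, float]) -> float:
    """Gap between the highest and lowest score one model gets across sources.

    Assumption: scores are already normalized to a shared 0-100 scale.
    This is an illustrative definition, not the site's documented formula.
    """
    values = list(scores_by_source.values())
    if len(values) < 2:
        return 0.0  # a single source cannot disagree with itself
    return max(values) - min(values)


# Made-up scores for illustration only.
example = {"source_a": 90.0, "source_b": 40.0, "source_c": 65.0}
print(spread(example))  # 50.0
```

Under this reading, a spread near 100 means at least one source places the model near the top of its scale while another places it near the bottom, which is the kind of disagreement this view is meant to keep visible.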

Source honesty scorecard

Not a moral rating. A quick check on how inspectable each source is when you need to dispute the headline number.

9 of 9 sources in the current registry

Benchmark and eval counts reflect what this app currently tracks for each source, not the source's full external catalog.


Arena verified	11	793	no	May 13, 2026	0
LiveBench verified	6	773	yes	May 13, 2026	0
Artificial Analysis verified	7	638	yes	May 13, 2026	1
BridgeBench verified	5	122	no	May 13, 2026	0
Scale Labs verified	8	98	no	May 13, 2026	0
Terminal-Bench verified	1	31	no	May 13, 2026	0
OpenCompass verified	1	15	no	May 13, 2026	0
MTEB verified	1	11	no	May 13, 2026	0
LLMBase relay	0	0	no	May 13, 2026	0
