UAB · Unbiased AI Bench
AI model rankings with source links.
Every score links back to its source.
BS pushback
Live · updated continuously
Benchmarks · /benchmarks/bridgebench-pushback

BridgeBench's BS benchmark, which measures whether a model pushes back on false premises instead of bluffing.
Source · BridgeBench
Version · bridgebench snapshot 2026-05-13
Scores · 19

Passport

Verified but aging
This is a rubric-judged signal: more structured than arena-style preference voting, but still dependent on the scoring rubric.
Source · BridgeBench
Metric · Pushback rate (%)
Judge · Rubric
Direction · higher is better
Group id · bridgebench_pushback_2026_04
Domain · Professional reasoning

What it measures vs what it misses

✓ Measures

Resistance to confidently accepting bogus assumptions in expert-style prompts.

✗ Misses

Coding execution quality. Latency and cost.

Comparable-group rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
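As a rough illustration of the comparable-group rule, the sketch below computes a percentile only among rows that share the same group id. The percentile definition (share of models in the group scoring at or below the given model) and the function name `group_percentiles` are assumptions made for illustration; the page does not state how its percentile is calculated. The second group id in the usage lines is hypothetical.

```python
from collections import defaultdict


def group_percentiles(rows):
    """Return {(group_id, model): percentile} for (group_id, model, score) rows.

    Assumed definition: percentile = 100 * (models in the same group with
    score <= this model's score) / (models in the group). Higher is better.
    """
    by_group = defaultdict(list)
    for group_id, model, score in rows:
        by_group[group_id].append((model, score))

    result = {}
    for group_id, entries in by_group.items():
        scores = [score for _, score in entries]
        for model, score in entries:
            at_or_below = sum(1 for s in scores if s <= score)
            result[(group_id, model)] = 100.0 * at_or_below / len(scores)
    return result


# Usage: four scores from this leaderboard plus one row from a
# hypothetical second group, which must not affect the first group.
rows = [
    ("bridgebench_pushback_2026_04", "Claude Opus 4.6", 95.0),
    ("bridgebench_pushback_2026_04", "Claude Sonnet 4.6", 91.5),
    ("bridgebench_pushback_2026_04", "GPT-5.4", 91.5),
    ("bridgebench_pushback_2026_04", "deepseek-v4-pro", 36.0),
    ("hypothetical_other_group", "Claude Opus 4.6", 10.0),
]
percentiles = group_percentiles(rows)
```

Because the grouping key is the full group id, a model's low score in one benchmark version never drags down its percentile in another, which is the point of the rule.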

Leaderboard · this benchmark version

#1 · Claude Opus 4.6 · BB · Undated · 95%
#2 · qwen3.6-max-preview · BB · Undated · 94.5%
#3 · Claude Sonnet 4.6 · BB · Undated · 91.5%
#4 · GPT-5.4 · BB · Undated · 91.5%
#5 · Grok 4.3 · BB · Undated · 91%
#6 · mimo-v2.5-pro · BB · Undated · 88%
#7 · GPT-5.5 · BB · Undated · 88%
#8 · Grok 4.20 · BB · Undated · 82.5%
#9 · GPT-5.4 mini · BB · Undated · 78.5%
#10 · Claude Opus 4.7 · BB · Undated · 75.5%
#11 · mimo-v2.5 · BB · Undated · 73.5%
#12 · kimi-k2.6 · BB · Undated · 69.5%
#13 · Gemini 3.1 Pro Preview · BB · Undated · 66.5%
#14 · Kimi K2.5 · BB · Undated · 65.5%
#15 · glm-5v-turbo · BB · Undated · 65.5%
#16 · minimax-m2.7 · BB · Undated · 47%
#17 · glm-5.1 · BB · Undated · 36.5%
#18 · Nemotron 3 Nano Omni 30B A3B Reasoning · BB · Undated · 36%
#19 · deepseek-v4-pro · BB · Undated · 36%