
BS pushback

The BridgeBench BS benchmark measures whether a model pushes back on false premises instead of bluffing.

Source · BridgeBench
Version · bridgebench snapshot 2026-05-13
Scores · 19

Passport

Verified but aging

This is a rubric-judged signal, so it is more structured than arena taste but still depends on the scoring rubric.

source

BridgeBench

metric

Pushback rate (%)

judge

Rubric

direction

higher better

group id

bridgebench_pushback_2026_04

domain

Professional reasoning
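As a rough illustration of the metric named in the passport, the sketch below turns per-prompt rubric verdicts into a pushback rate in percent. The source does not publish the rubric or the aggregation formula, so the `JudgedResponse` type, its `pushed_back` field, and the simple "share of prompts where the model pushed back" definition are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    """One model answer after rubric judging (hypothetical structure)."""
    prompt_id: str
    pushed_back: bool  # assumed verdict: True if the false premise was challenged


def pushback_rate(responses: list[JudgedResponse]) -> float:
    """Return the percentage of judged responses that pushed back.

    Assumes every prompt counts equally; the real benchmark may weight
    prompts or use graded rubric scores instead of a yes/no verdict.
    """
    if not responses:
        raise ValueError("no judged responses to aggregate")
    flagged = sum(1 for r in responses if r.pushed_back)
    return 100.0 * flagged / len(responses)
```

Under this assumed definition, a model that challenges the premise in three of four prompts gets a rate of 75, and a higher value is better, matching the direction stated in the passport.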

What it measures vs what it misses

✓ Measures

Resistance to confidently accepting bogus assumptions in expert-style prompts.

✗ Misses

Coding execution quality. Latency and cost.

Comparable-group rule

This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
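The comparable-group rule can be sketched as a percentile computed separately per group id, so a score is never ranked against models from another benchmark or version. The percentile formula here (share of other models in the group that score lower, with ties counted as half) is an assumption, since the source does not state how its percentile is derived. The first group uses four scores from this page's leaderboard; the second group and its model are hypothetical and exist only to show the isolation.

```python
from collections import defaultdict


def percentiles_by_group(rows):
    """Map (group_id, model) to a percentile within its own group.

    rows: iterable of (group_id, model, score) where higher is better.
    Assumed formula: 100 * (lower-scoring peers + half of tied peers) / peers.
    A model alone in its group is given 100.0 by convention.
    """
    groups = defaultdict(dict)
    for group_id, model, score in rows:
        groups[group_id][model] = score

    result = {}
    for group_id, scores in groups.items():
        peers = len(scores) - 1
        for model, score in scores.items():
            if peers == 0:
                result[(group_id, model)] = 100.0
                continue
            below = sum(1 for s in scores.values() if s < score)
            tied = sum(1 for s in scores.values() if s == score) - 1  # exclude self
            result[(group_id, model)] = 100.0 * (below + 0.5 * tied) / peers
    return result


rows = [
    ("bridgebench_pushback_2026_04", "Claude Opus 4.6", 95.0),
    ("bridgebench_pushback_2026_04", "Claude Sonnet 4.6", 91.5),
    ("bridgebench_pushback_2026_04", "GPT-5.4", 91.5),
    ("bridgebench_pushback_2026_04", "deepseek-v4-pro", 36.0),
    ("hypothetical_other_group", "model-x", 10.0),  # invented for illustration
]
percentiles = percentiles_by_group(rows)
```

Because `model-x` sits in a different group, its low score does not affect the percentiles of the four BridgeBench entries, which is the behavior the rule describes.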

Leaderboard · this benchmark version

#1 · Claude Opus 4.6 · BB · Undated · 95%
#2 · qwen3.6-max-preview · BB · Undated · 94.5%
#3 · Claude Sonnet 4.6 · BB · Undated · 91.5%
#4 · GPT-5.4 · BB · Undated · 91.5%
#5 · Grok 4.3 · BB · Undated · 91%
#6 · mimo-v2.5-pro · BB · Undated · 88%
#7 · GPT-5.5 · BB · Undated · 88%
#8 · Grok 4.20 · BB · Undated · 82.5%
#9 · GPT-5.4 mini · BB · Undated · 78.5%
#10 · Claude Opus 4.7 · BB · Undated · 75.5%
#11 · mimo-v2.5 · BB · Undated · 73.5%
#12 · kimi-k2.6 · BB · Undated · 69.5%
#13 · Gemini 3.1 Pro Preview · BB · Undated · 66.5%
#14 · Kimi K2.5 · BB · Undated · 65.5%
#15 · glm-5v-turbo · BB · Undated · 65.5%
#16 · minimax-m2.7 · BB · Undated · 47%
#17 · glm-5.1 · BB · Undated · 36.5%
#18 · Nemotron 3 Nano Omni 30B A3B Reasoning · BB · Undated · 36%
#19 · deepseek-v4-pro · BB · Undated · 36%