Visible tradeoffs
This is a rubric-judged signal, so it is more structured than arena taste but still depends on the scoring rubric.
source: Scale Labs
metric: Score (%)
judge: Rubric
direction: higher better
group id: scale_tutorbench_current
domain: Reasoning / math / science
What it measures vs what it misses
✓ Measures
How well a model tutors through multi-step academic problems. Instruction quality, pedagogy, and reasoning support on teaching-style prompts.
✗ Misses
Live classroom preference. Latency and cost.
Why this counts
It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.
Same-test rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
What it misses
It still misses product usability, latency, and whether the model stays correct in messy real workflows.
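The same-test rule can be illustrated with a minimal sketch. It assumes each result row carries a group id (such as scale_tutorbench_current) and a score where higher is better. The field names, the helper name `percentile_within_group`, and the "share of the group scoring at or below" definition of percentile are assumptions for illustration, not the site's actual method.

```python
from collections import defaultdict


def percentile_within_group(rows):
    """Return {(group_id, model): percentile}, computed per group only.

    Assumption: percentile = share of models in the same group whose
    score is at or below this model's score (higher score is better).
    """
    by_group = defaultdict(list)
    for row in rows:
        by_group[row["group_id"]].append(row)

    result = {}
    for group_id, members in by_group.items():
        scores = [m["score"] for m in members]
        for m in members:
            at_or_below = sum(1 for s in scores if s <= m["score"])
            result[(group_id, m["model"])] = 100.0 * at_or_below / len(scores)
    return result
```

Because each group is ranked in isolation, a model's percentile in one benchmark/version group says nothing about its standing in another.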
Fallback benchmark identity is visible for context but excluded from default ranking.
Identity
benchmark proxy (0.58)
Manually verified from the official Scale Labs TutorBench leaderboard. Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-mini-to-gpt-5.
Fallback benchmark identity is visible for context but excluded from default ranking.
Identity
benchmark proxy (0.58)
Manually verified from the official Scale Labs TutorBench leaderboard. Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-nano-to-gpt-5.
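The exclusion of fallback identities from default ranking can be sketched as a simple filter before sorting. The `is_fallback` flag, the `default_ranking` helper, the model names, and the scores below are all hypothetical placeholders. How the site actually stores or applies the benchmark proxy value is not stated here.

```python
def default_ranking(entries, include_fallback=False):
    """Sort entries by score, highest first.

    Entries flagged as fallback (backfilled from another model's result)
    are dropped unless include_fallback is True.
    """
    kept = [e for e in entries if include_fallback or not e["is_fallback"]]
    return sorted(kept, key=lambda e: e["score"], reverse=True)


# Illustrative data only; these are not real leaderboard scores.
entries = [
    {"model": "model-a", "score": 61.0, "is_fallback": False},
    {"model": "model-b", "score": 72.0, "is_fallback": True},
    {"model": "model-c", "score": 55.0, "is_fallback": False},
]
```

With the default arguments the backfilled entry stays out of the ranking. Passing `include_fallback=True` brings it back for context.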