Verified but aging
This is an objective signal, so it is mainly about measurable task performance rather than public taste.
source: LiveBench
metric: Score (%)
judge: Objective
direction: higher better
group id: livebench_reasoning_2026_01_08
domain: Reasoning / math / science
What it measures vs what it misses
✓ Measures
Theory-of-mind, logic, spatial, and navigation-heavy reasoning correctness.
✗ Misses
Human preference, tool use, and coding execution quality.
Why this counts
It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.
Same-test rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
What it misses
It still misses product usability, latency, and whether the model stays correct in messy real workflows.
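The same-test rule can be illustrated with a short sketch. Everything here is an assumption for illustration: the model names, the scores, the second group id, and the percentile definition itself (share of models in the group scoring at or below the target). The page does not state how its percentile is actually computed.

```python
# Hypothetical rows: model names, scores, and "other_benchmark_v1" are invented.
GROUP = "livebench_reasoning_2026_01_08"

rows = [
    {"model": "model_a", "group_id": GROUP, "score": 80.0},
    {"model": "model_b", "group_id": GROUP, "score": 60.0},
    {"model": "model_c", "group_id": GROUP, "score": 40.0},
    {"model": "model_d", "group_id": GROUP, "score": 70.0},
    {"model": "model_a", "group_id": "other_benchmark_v1", "score": 10.0},
    {"model": "model_b", "group_id": "other_benchmark_v1", "score": 90.0},
]


def group_percentile(rows, model, group_id):
    """Percentile of `model` among models in one benchmark/version group.

    Assumed definition: percentage of models in the group whose score is
    at or below the target's score (higher score is better).
    """
    # Only scores from the exact group are compared; other groups are ignored.
    scores = {r["model"]: r["score"] for r in rows if r["group_id"] == group_id}
    if model not in scores:
        raise KeyError(f"{model!r} has no score in group {group_id!r}")
    target = scores[model]
    at_or_below = sum(1 for s in scores.values() if s <= target)
    return 100.0 * at_or_below / len(scores)


if __name__ == "__main__":
    for name in ("model_a", "model_b"):
        print(name, group_percentile(rows, name, GROUP))
```

In this made-up data, model_a ranks at the top of the reasoning group but at the bottom of the other group, which is why a percentile from one group should not be read as a universal score.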