Visible tradeoffs
This is an objective signal, so it is mainly about measurable task performance rather than public taste.
source: Artificial Analysis
metric: Score (%)
judge: Objective
direction: higher better
group id: aa_terminalbench_hard_current
domain: Coding
What it measures vs what it misses
✓ Measures
Hard multi-step terminal task completion.
✗ Misses
Adjacent capabilities, subjective preference, latency, and cost.
Why this counts
It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.
Same-test rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
What it misses
It does not fully capture repo-scale iteration, IDE ergonomics, or long debugging loops.
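The same-test rule can be illustrated with a minimal Python sketch. It assumes a percentile is the share of models in the same group scoring at or below the target; the page does not state the exact formula, so treat that as an assumption. The function name, the row layout, the model names, the scores, and the second group id `other_benchmark_v1` are all hypothetical.

```python
def percentile_within_group(rows, model, group_id):
    """Percentile rank of `model` among rows that share `group_id`.

    Assumed definition (not confirmed by the source): the share of models
    in the group scoring at or below the target, times 100.
    """
    # Keep only scores from the exact benchmark/version group.
    scores = {r["model"]: r["score"] for r in rows if r["group_id"] == group_id}
    if model not in scores:
        raise KeyError(f"{model!r} has no score in group {group_id!r}")
    target = scores[model]
    at_or_below = sum(1 for s in scores.values() if s <= target)
    return 100.0 * at_or_below / len(scores)


# Hypothetical rows: names and scores are made up for illustration.
rows = [
    {"model": "model-a", "group_id": "aa_terminalbench_hard_current", "score": 40.0},
    {"model": "model-b", "group_id": "aa_terminalbench_hard_current", "score": 30.0},
    {"model": "model-c", "group_id": "aa_terminalbench_hard_current", "score": 20.0},
    {"model": "model-d", "group_id": "aa_terminalbench_hard_current", "score": 10.0},
    {"model": "model-a", "group_id": "other_benchmark_v1", "score": 5.0},
]

print(percentile_within_group(rows, "model-b", "aa_terminalbench_hard_current"))
```

Filtering by group id before ranking is what keeps the result from becoming a universal score: the `other_benchmark_v1` row never enters the comparison for `aa_terminalbench_hard_current`.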
Leaderboard · this benchmark version
#1 · Claude Fable 5
AA · Jun 24, 2026
Source label: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
Fallback benchmark identity is visible for context but excluded from default ranking.
Identity
benchmark proxy (0.58)
Parsed from Artificial Analysis public leaderboard field `terminalbenchHard`. Backfilled from Claude Opus 4.1 via approved benchmark identity mapping map-claude-opus-4-to-4-1.