Visible tradeoffsThis is an objective signal, so it is mainly about measurable task performance rather than public taste.
source
Artificial Analysis
metric
Score (%)
judge
Objective
direction
higher better
group id
aa_tau2_telecom_current
domain
Search / tool use
What it measures vs what it misses
✓ Measures
Tool-use behavior in a telecom task environment.
✗ Misses
Adjacent capabilities, subjective preference, latency, and cost.
Why this countsIt matters when the model must browse, call tools, and recover useful answers from external systems.Same-test ruleThis percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.What it missesIt does not fully capture production agent orchestration, cost ceilings, or safety policy behavior.
Fallback benchmark identity is visible for context but excluded from default ranking.
Identity
benchmark proxy (0.58)
Parsed from Artificial Analysis public leaderboard field `tau2`. Backfilled from Claude Opus 4.1 via approved benchmark identity mapping map-claude-opus-4-to-4-1.