Text Arena
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #126
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,370 (CI 1,364 - 1,375)
- Percentile: 61.5% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: overall. Source rank: #153. Votes: 13693. Organization: alibaba. License: Apache 2.0.
Text Arena · Creative Writing
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #136
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,324 (CI 1,310 - 1,338)
- Percentile: 58.2% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: creative_writing. Source rank: #166. Votes: 1797. Organization: alibaba. License: Apache 2.0.
Text Arena · English
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #123
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,387 (CI 1,379 - 1,395)
- Percentile: 62.5% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: english. Source rank: #148. Votes: 6636. Organization: alibaba. License: Apache 2.0.
Text Arena · Exclude Ties
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #127
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,337 (CI 1,329 - 1,345)
- Percentile: 61.2% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: exclude_ties. Source rank: #154. Votes: 9759. Organization: alibaba. License: Apache 2.0.
Text Arena · Hard Prompts
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #129
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,384 (CI 1,377 - 1,392)
- Percentile: 60.6% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: hard_prompts. Source rank: #155. Votes: 6691. Organization: alibaba. License: Apache 2.0.
Text Arena · Hard Prompts English
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #130
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,397 (CI 1,387 - 1,407)
- Percentile: 60.2% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: hard_prompts_english. Source rank: #156. Votes: 3497. Organization: alibaba. License: Apache 2.0.
Text Arena · Instruction Following
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #126
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,359 (CI 1,349 - 1,369)
- Percentile: 61.5% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: instruction_following. Source rank: #153. Votes: 3517. Organization: alibaba. License: Apache 2.0.
Text Arena · Longer Query
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #131
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,370 (CI 1,359 - 1,381)
- Percentile: 57.2% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: longer_query. Source rank: #157. Votes: 2835. Organization: alibaba. License: Apache 2.0.
Text Arena · Multi Turn
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #137
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,351 (CI 1,338 - 1,363)
- Percentile: 57.9% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: multi_turn. Source rank: #164. Votes: 2296. Organization: alibaba. License: Apache 2.0.
Text Arena · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #117
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,368 (CI 1,362 - 1,374)
- Percentile: 64.3% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: overall. Source rank: #141. Votes: 13693. Organization: alibaba. License: Apache 2.0.
Text Arena · Creative Writing · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #137
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,312 (CI 1,298 - 1,326)
- Percentile: 57.9% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: creative_writing. Source rank: #165. Votes: 1797. Organization: alibaba. License: Apache 2.0.
Text Arena · English · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #105
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,393 (CI 1,385 - 1,401)
- Percentile: 68% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: english. Source rank: #126. Votes: 6636. Organization: alibaba. License: Apache 2.0.
Text Arena · Exclude Ties · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #119
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,333 (CI 1,326 - 1,341)
- Percentile: 63.7% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: exclude_ties. Source rank: #143. Votes: 9759. Organization: alibaba. License: Apache 2.0.
Text Arena · Hard Prompts · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #116
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,370 (CI 1,363 - 1,378)
- Percentile: 64.6% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: hard_prompts. Source rank: #141. Votes: 6691. Organization: alibaba. License: Apache 2.0.
Text Arena · Hard Prompts English · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #111
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,387 (CI 1,377 - 1,397)
- Percentile: 66% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: hard_prompts_english. Source rank: #134. Votes: 3497. Organization: alibaba. License: Apache 2.0.
Text Arena · Instruction Following · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #120
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,343 (CI 1,333 - 1,353)
- Percentile: 63.4% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: instruction_following. Source rank: #147. Votes: 3517. Organization: alibaba. License: Apache 2.0.
Text Arena · Longer Query · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #122
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,353 (CI 1,342 - 1,364)
- Percentile: 60.2% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: longer_query. Source rank: #150. Votes: 2835. Organization: alibaba. License: Apache 2.0.
Text Arena · Multi Turn · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #133
verified runtime · exact alias · Background only
Raw row drilldown
- Source: Arena
- Raw value: 1,346 (CI 1,333 - 1,359)
- Percentile: 59.1% (inside its fair comparison set)
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-next-80b-a3b-thinking`. Category: multi_turn. Source rank: #160. Votes: 2296. Organization: alibaba. License: Apache 2.0.
Instruction following
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #67
verified runtime · exact alias · Background only
Raw row drilldown
- Source: LiveBench
- Raw value: 41.5%
- Percentile: 38.9% (inside its fair comparison set)
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Category: IF. Tasks scored: 4.
Language
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #88
verified runtime · exact alias · Background only
Raw row drilldown
- Source: LiveBench
- Raw value: 56.3%
- Percentile: 19.4% (inside its fair comparison set)
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Category: Language. Tasks scored: 3.
Paraphrase
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #70
verified runtime · exact alias · Background only
Raw row drilldown
- Source: LiveBench
- Raw value: 33.2%
- Percentile: 36.1% (inside its fair comparison set)
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Task: paraphrase. Category: IF.
Simplify
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #68
verified runtime · exact alias · Background only
Raw row drilldown
- Source: LiveBench
- Raw value: 39.1%
- Percentile: 38% (inside its fair comparison set)
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Task: simplify. Category: IF.
Story generation
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #63
verified runtime · exact alias · Background only
Raw row drilldown
- Source: LiveBench
- Raw value: 48.3%
- Percentile: 42.6% (inside its fair comparison set)
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Task: story_generation. Category: IF.
Summarize
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #65
verified runtime · exact alias · Background only
Raw row drilldown
- Source: LiveBench
- Raw value: 45.6%
- Percentile: 40.7% (inside its fair comparison set)
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Task: summarize. Category: IF.
Connections
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #90
verified runtime · exact alias · Background only
Raw row drilldown
- Source: LiveBench
- Raw value: 70.2%
- Percentile: 18.5% (inside its fair comparison set)
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Task: connections. Category: Language.
Plot unscrambling
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #84
verified runtime · exact alias · Background only
Raw row drilldown
- Source: LiveBench
- Raw value: 40.8%
- Percentile: 23.1% (inside its fair comparison set)
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Task: plot_unscrambling. Category: Language.
Typos
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #97
verified runtime · exact alias · Background only
Raw row drilldown
- Source: LiveBench
- Raw value: 58%
- Percentile: 10.3% (inside its fair comparison set)
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Task: typos. Category: Language.