Text Arena
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #97
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,399
- Percentile: 70.5%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: overall. Source rank: #119. Votes: 8994. Organization: alibaba. License: Apache 2.0.
70.5% percentile inside its fair comparison set · Raw benchmark value: 1,399 · CI 1,393 - 1,406
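Each card carries a provenance line with the same "Label: value." layout, so it can be turned into a record mechanically. The following is a minimal sketch, not the site's own parser: the function name `parse_provenance` and the lower-cased key names are assumptions made for illustration, and values are kept as strings.

```python
import re

def parse_provenance(line: str) -> dict:
    """Hypothetical helper: split a card's provenance line into fields."""
    # The dataset row name, when present, is wrapped in backticks.
    row = re.search(r"`([^`]+)`", line)
    fields = {"row": row.group(1) if row else None}
    # Each field reads "Label: value." and ends at a period followed by a
    # space or the end of the line, so "Apache 2.0" is not cut at "2.".
    for label, value in re.findall(r"([A-Z][A-Za-z ]*): (.+?)\.(?= |$)", line):
        fields[label.lower().replace(" ", "_")] = value
    return fields

line = (
    "Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. "
    "Category: overall. Source rank: #119. Votes: 8994. "
    "Organization: alibaba. License: Apache 2.0."
)
record = parse_provenance(line)
print(record["category"], record["source_rank"], record["votes"])
```

The same pattern also handles the LiveBench provenance lines further down, which have no backticked row name; for those the `row` key is `None`.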
Text Arena · Creative Writing
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #91
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,373
- Percentile: 72.1%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: creative_writing. Source rank: #114. Votes: 1078. Organization: alibaba. License: Apache 2.0.
72.1% percentile inside its fair comparison set · Raw benchmark value: 1,373 · CI 1,355 - 1,391
Text Arena · English
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #108
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,405
- Percentile: 67.1%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: english. Source rank: #130. Votes: 4372. Organization: alibaba. License: Apache 2.0.
67.1% percentile inside its fair comparison set · Raw benchmark value: 1,405 · CI 1,396 - 1,414
Text Arena · Exclude Ties
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #99
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,380
- Percentile: 69.8%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: exclude_ties. Source rank: #121. Votes: 6352. Organization: alibaba. License: Apache 2.0.
69.8% percentile inside its fair comparison set · Raw benchmark value: 1,380 · CI 1,371 - 1,390
Text Arena · Hard Prompts
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #100
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,418
- Percentile: 69.5%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: hard_prompts. Source rank: #121. Votes: 3845. Organization: alibaba. License: Apache 2.0.
69.5% percentile inside its fair comparison set · Raw benchmark value: 1,418 · CI 1,409 - 1,427
Text Arena · Hard Prompts English
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #99
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,428
- Percentile: 69.8%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: hard_prompts_english. Source rank: #119. Votes: 2009. Organization: alibaba. License: Apache 2.0.
69.8% percentile inside its fair comparison set · Raw benchmark value: 1,428 · CI 1,415 - 1,441
Text Arena · Instruction Following
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #99
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,386
- Percentile: 69.8%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: instruction_following. Source rank: #122. Votes: 2099. Organization: alibaba. License: Apache 2.0.
69.8% percentile inside its fair comparison set · Raw benchmark value: 1,386 · CI 1,373 - 1,398
Text Arena · Longer Query
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #104
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,401
- Percentile: 66.1%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: longer_query. Source rank: #129. Votes: 1623. Organization: alibaba. License: Apache 2.0.
66.1% percentile inside its fair comparison set · Raw benchmark value: 1,401 · CI 1,386 - 1,415
Text Arena · Multi Turn
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #106
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,393
- Percentile: 67.5%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: multi_turn. Source rank: #129. Votes: 1403. Organization: alibaba. License: Apache 2.0.
67.5% percentile inside its fair comparison set · Raw benchmark value: 1,393 · CI 1,378 - 1,409
Text Arena · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #80
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,414
- Percentile: 75.7%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: overall. Source rank: #95. Votes: 8994. Organization: alibaba. License: Apache 2.0.
75.7% percentile inside its fair comparison set · Raw benchmark value: 1,414 · CI 1,407 - 1,420
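The interval on this card (CI 1,407 - 1,420) can be set against the one on the default Text Arena overall card (CI 1,393 - 1,406). The sketch below parses both strings and checks whether the ranges overlap. The helper names `parse_ci` and `intervals_overlap` are hypothetical, and a non-overlap check is only an informal reading of the two cards, not a significance test.

```python
def parse_ci(text: str) -> tuple[int, int]:
    """Hypothetical helper: turn 'CI 1,393 - 1,406' into (1393, 1406)."""
    low, high = text.removeprefix("CI ").split(" - ")
    return int(low.replace(",", "")), int(high.replace(",", ""))

def intervals_overlap(a: tuple[int, int], b: tuple[int, int]) -> bool:
    # Two closed intervals overlap when each starts before the other ends.
    return a[0] <= b[1] and b[0] <= a[1]

default_ci = parse_ci("CI 1,393 - 1,406")   # Text Arena, overall
no_style_ci = parse_ci("CI 1,407 - 1,420")  # Text Arena, No Style Control
print(intervals_overlap(default_ci, no_style_ci))  # False
```

Requires Python 3.9 or later for `str.removeprefix`.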
Text Arena · Creative Writing · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #75
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,386
- Percentile: 77.1%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: creative_writing. Source rank: #89. Votes: 1078. Organization: alibaba. License: Apache 2.0.
77.1% percentile inside its fair comparison set · Raw benchmark value: 1,386 · CI 1,368 - 1,404
Text Arena · English · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #73
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,427
- Percentile: 77.8%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: english. Source rank: #86. Votes: 4372. Organization: alibaba. License: Apache 2.0.
77.8% percentile inside its fair comparison set · Raw benchmark value: 1,427 · CI 1,418 - 1,436
Text Arena · Exclude Ties · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #80
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,400
- Percentile: 75.7%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: exclude_ties. Source rank: #95. Votes: 6352. Organization: alibaba. License: Apache 2.0.
75.7% percentile inside its fair comparison set · Raw benchmark value: 1,400 · CI 1,390 - 1,409
Text Arena · Hard Prompts · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #83
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,416
- Percentile: 74.8%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: hard_prompts. Source rank: #100. Votes: 3845. Organization: alibaba. License: Apache 2.0.
74.8% percentile inside its fair comparison set · Raw benchmark value: 1,416 · CI 1,406 - 1,425
Text Arena · Hard Prompts English · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #72
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,430
- Percentile: 78.1%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: hard_prompts_english. Source rank: #87. Votes: 2009. Organization: alibaba. License: Apache 2.0.
78.1% percentile inside its fair comparison set · Raw benchmark value: 1,430 · CI 1,417 - 1,443
Text Arena · Instruction Following · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #91
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,385
- Percentile: 72.3%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: instruction_following. Source rank: #109. Votes: 2099. Organization: alibaba. License: Apache 2.0.
72.3% percentile inside its fair comparison set · Raw benchmark value: 1,385 · CI 1,373 - 1,398
Text Arena · Longer Query · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #89
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,397
- Percentile: 71.1%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: longer_query. Source rank: #109. Votes: 1623. Organization: alibaba. License: Apache 2.0.
71.1% percentile inside its fair comparison set · Raw benchmark value: 1,397 · CI 1,383 - 1,412
Text Arena · Multi Turn · No Style Control
AR · Chat / text · Human
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #84
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,406
- Percentile: 74.3%
- Last updated: recent
- Eligibility: benchmark_derived_model
Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: multi_turn. Source rank: #101. Votes: 1403. Organization: alibaba. License: Apache 2.0.
74.3% percentile inside its fair comparison set · Raw benchmark value: 1,406 · CI 1,390 - 1,421
Instruction following
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #69
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: LiveBench
- Raw value: 40.6%
- Percentile: 37%
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Category: IF. Tasks scored: 4.
37% percentile inside its fair comparison set · Raw benchmark value: 40.6%
Language
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #67
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: LiveBench
- Raw value: 69.5%
- Percentile: 38.9%
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Category: Language. Tasks scored: 3.
38.9% percentile inside its fair comparison set · Raw benchmark value: 69.5%
Paraphrase
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #66
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: LiveBench
- Raw value: 37.6%
- Percentile: 39.8%
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Task: paraphrase. Category: IF.
39.8% percentile inside its fair comparison set · Raw benchmark value: 37.6%
Simplify
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #69
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: LiveBench
- Raw value: 38.1%
- Percentile: 37%
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Task: simplify. Category: IF.
37% percentile inside its fair comparison set · Raw benchmark value: 38.1%
Story generation
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #70
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: LiveBench
- Raw value: 39.2%
- Percentile: 36.1%
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Task: story_generation. Category: IF.
36.1% percentile inside its fair comparison set · Raw benchmark value: 39.2%
Summarize
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #61
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: LiveBench
- Raw value: 47.6%
- Percentile: 44.4%
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Task: summarize. Category: IF.
44.4% percentile inside its fair comparison set · Raw benchmark value: 47.6%
Connections
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #72
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: LiveBench
- Raw value: 83%
- Percentile: 34.3%
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Task: connections. Category: Language.
34.3% percentile inside its fair comparison set · Raw benchmark value: 83%
Plot unscrambling
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #65
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: LiveBench
- Raw value: 47.6%
- Percentile: 40.7%
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Task: plot_unscrambling. Category: Language.
40.7% percentile inside its fair comparison set · Raw benchmark value: 47.6%
Typos
LB · Chat / text · Objective
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.
Rank #48
verified runtime · exact alias · Background only
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: LiveBench
- Raw value: 78%
- Percentile: 63.6%
- Last updated: archived
- Eligibility: benchmark_derived_model
Derived from the official LiveBench website leaderboard table. Task: typos. Category: Language.
63.6% percentile inside its fair comparison set · Raw benchmark value: 78%