Model profile · Qwen

qwen3-235b-a22b-thinking-2507

Open weights · mid · registry tag 2026 benchmark-derived

Thin verified coverage

Reads as thin verified coverage across the resolved source data.

Visible coverage: 21.9%
Verified coverage: 21.9%
Spread: n/a
Last verified: Jun 20, 2026

text · code · 1 alias · 27 official source links

Data version

Current snapshot.

Data version: Jun 20, 2026
Model list checked: 9 providers · 1081 tracked models
Page refreshed: Jul 5, 2026

The registry snapshot and page stamp are shown so a stale deploy is visible at a glance.
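
To make the staleness check concrete, here is a minimal Python sketch that measures the gap between the data version and the page refresh stamp shown above. The function name `snapshot_lag_days` and the format string are illustrative assumptions, not part of the registry.

```python
from datetime import datetime

# Assumed display format of the stamps on this page, e.g. "Jun 20, 2026".
STAMP_FORMAT = "%b %d, %Y"

def snapshot_lag_days(data_version: str, page_refreshed: str) -> int:
    """Return the number of days between the data version and the page refresh."""
    data_date = datetime.strptime(data_version, STAMP_FORMAT)
    page_date = datetime.strptime(page_refreshed, STAMP_FORMAT)
    return (page_date - data_date).days

lag = snapshot_lag_days("Jun 20, 2026", "Jul 5, 2026")
print(lag)  # 15
```

A large gap would suggest the page was redeployed without a fresh registry snapshot; what counts as "large" is a judgment call the page does not specify.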

Source-linked scores by benchmark

Each row keeps the benchmark source, source type, raw metric, and percentile inside its fair comparison set.
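
As a rough illustration of what one such row carries, the sketch below models it as a Python dataclass. The class name `BenchmarkRow` and all field names are hypothetical; the values are copied from the Text Arena overall row on this page.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class BenchmarkRow:
    """One source-linked score (field names are illustrative assumptions)."""
    benchmark: str      # display name of the benchmark
    source: str         # e.g. "Arena" or "LiveBench"
    source_type: str    # "Human" or "Objective"
    raw_value: float    # raw metric as reported by the source
    percentile: float   # percentile inside the fair comparison set
    rank: int           # rank shown on the page
    ci: Optional[Tuple[float, float]] = None  # confidence interval, if reported

text_arena = BenchmarkRow(
    benchmark="Text Arena",
    source="Arena",
    source_type="Human",
    raw_value=1399.0,
    percentile=70.5,
    rank=97,
    ci=(1393.0, 1406.0),
)
print(text_arena.benchmark, text_arena.percentile)
```

LiveBench rows on this page report no confidence interval, which is why `ci` is optional in this sketch.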

Thin verified coverage. This model currently reads as thin verified coverage across the resolved source data.

Chat / text · 27 benchmarks · 61.9%

Text Arena

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #97

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,399
Percentile: 70.5%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: overall. Source rank: #119. Votes: 8994. Organization: alibaba. License: Apache 2.0.
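
The provenance note above packs several fields into one sentence. A minimal Python sketch of splitting such a note into key/value pairs follows; the function name `parse_provenance` and the splitting rule are assumptions for illustration, not the registry's actual parser.

```python
import re

# The key/value portion of the note, copied from the row above.
NOTE = ("Category: overall. Source rank: #119. Votes: 8994. "
        "Organization: alibaba. License: Apache 2.0.")

def parse_provenance(note: str) -> dict:
    """Split 'Key: value. Key: value.' text into a dict of strings."""
    fields = {}
    # Split only on a period followed by whitespace and a new capitalized
    # "Key:", so the period inside "Apache 2.0" is left alone.
    for part in re.split(r"\.\s+(?=[A-Z][\w ]*:)", note.rstrip(".")):
        key, _, value = part.partition(": ")
        fields[key] = value
    return fields

print(parse_provenance(NOTE))
```

All values stay as strings here; converting `Votes` to an integer or stripping the `#` from the source rank is left to the caller.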

70.5% percentile inside its fair comparison set

Raw benchmark value: 1,399 (CI 1,393 - 1,406)

Text Arena · Creative Writing

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #91

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,373
Percentile: 72.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: creative_writing. Source rank: #114. Votes: 1078. Organization: alibaba. License: Apache 2.0.

72.1% percentile inside its fair comparison set

Raw benchmark value: 1,373 (CI 1,355 - 1,391)

Text Arena · English

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #108

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,405
Percentile: 67.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: english. Source rank: #130. Votes: 4372. Organization: alibaba. License: Apache 2.0.

67.1% percentile inside its fair comparison set

Raw benchmark value: 1,405 (CI 1,396 - 1,414)

Text Arena · Exclude Ties

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #99

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,380
Percentile: 69.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: exclude_ties. Source rank: #121. Votes: 6352. Organization: alibaba. License: Apache 2.0.

69.8% percentile inside its fair comparison set

Raw benchmark value: 1,380 (CI 1,371 - 1,390)

Text Arena · Hard Prompts

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #100

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,418
Percentile: 69.5%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: hard_prompts. Source rank: #121. Votes: 3845. Organization: alibaba. License: Apache 2.0.

69.5% percentile inside its fair comparison set

Raw benchmark value: 1,418 (CI 1,409 - 1,427)

Text Arena · Hard Prompts English

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #99

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,428
Percentile: 69.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: hard_prompts_english. Source rank: #119. Votes: 2009. Organization: alibaba. License: Apache 2.0.

69.8% percentile inside its fair comparison set

Raw benchmark value: 1,428 (CI 1,415 - 1,441)

Text Arena · Instruction Following

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #99

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,386
Percentile: 69.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: instruction_following. Source rank: #122. Votes: 2099. Organization: alibaba. License: Apache 2.0.

69.8% percentile inside its fair comparison set

Raw benchmark value: 1,386 (CI 1,373 - 1,398)

Text Arena · Longer Query

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #104

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,401
Percentile: 66.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: longer_query. Source rank: #129. Votes: 1623. Organization: alibaba. License: Apache 2.0.

66.1% percentile inside its fair comparison set

Raw benchmark value: 1,401 (CI 1,386 - 1,415)

Text Arena · Multi Turn

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #106

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,393
Percentile: 67.5%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: multi_turn. Source rank: #129. Votes: 1403. Organization: alibaba. License: Apache 2.0.

67.5% percentile inside its fair comparison set

Raw benchmark value: 1,393 (CI 1,378 - 1,409)

Text Arena · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #80

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,414
Percentile: 75.7%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: overall. Source rank: #95. Votes: 8994. Organization: alibaba. License: Apache 2.0.

75.7% percentile inside its fair comparison set

Raw benchmark value: 1,414 (CI 1,407 - 1,420)
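
Because Arena rows report a confidence interval, one rough way to judge whether two scores differ meaningfully is to check whether their intervals overlap. The sketch below is a heuristic under that assumption, not a formal significance test, and the helper name `intervals_overlap` is hypothetical. The intervals are taken from the overall, English, and overall No Style Control rows on this page.

```python
from typing import Tuple

Interval = Tuple[float, float]

def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True when closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

overall = (1393, 1406)           # Text Arena (style-controlled)
overall_no_style = (1407, 1420)  # Text Arena · No Style Control
english = (1396, 1414)           # Text Arena · English

print(intervals_overlap(overall, overall_no_style))  # False
print(intervals_overlap(overall, english))           # True
```

On this reading, the style-controlled and No Style Control overall scores are separated by more than their reported intervals, while the overall and English scores are not.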

Text Arena · Creative Writing · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #75

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,386
Percentile: 77.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: creative_writing. Source rank: #89. Votes: 1078. Organization: alibaba. License: Apache 2.0.

77.1% percentile inside its fair comparison set

Raw benchmark value: 1,386 (CI 1,368 - 1,404)

Text Arena · English · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #73

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,427
Percentile: 77.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: english. Source rank: #86. Votes: 4372. Organization: alibaba. License: Apache 2.0.

77.8% percentile inside its fair comparison set

Raw benchmark value: 1,427 (CI 1,418 - 1,436)

Text Arena · Exclude Ties · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #80

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,400
Percentile: 75.7%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: exclude_ties. Source rank: #95. Votes: 6352. Organization: alibaba. License: Apache 2.0.

75.7% percentile inside its fair comparison set

Raw benchmark value: 1,400 (CI 1,390 - 1,409)

Text Arena · Hard Prompts · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #83

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,416
Percentile: 74.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: hard_prompts. Source rank: #100. Votes: 3845. Organization: alibaba. License: Apache 2.0.

74.8% percentile inside its fair comparison set

Raw benchmark value: 1,416 (CI 1,406 - 1,425)

Text Arena · Hard Prompts English · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #72

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,430
Percentile: 78.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: hard_prompts_english. Source rank: #87. Votes: 2009. Organization: alibaba. License: Apache 2.0.

78.1% percentile inside its fair comparison set

Raw benchmark value: 1,430 (CI 1,417 - 1,443)

Text Arena · Instruction Following · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #91

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,385
Percentile: 72.3%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: instruction_following. Source rank: #109. Votes: 2099. Organization: alibaba. License: Apache 2.0.

72.3% percentile inside its fair comparison set

Raw benchmark value: 1,385 (CI 1,373 - 1,398)

Text Arena · Longer Query · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #89

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,397
Percentile: 71.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: longer_query. Source rank: #109. Votes: 1623. Organization: alibaba. License: Apache 2.0.

71.1% percentile inside its fair comparison set

Raw benchmark value: 1,397 (CI 1,383 - 1,412)

Text Arena · Multi Turn · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #84

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,406
Percentile: 74.3%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: multi_turn. Source rank: #101. Votes: 1403. Organization: alibaba. License: Apache 2.0.

74.3% percentile inside its fair comparison set

Raw benchmark value: 1,406 (CI 1,390 - 1,421)

Instruction following

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #69

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 40.6%
Percentile: 37%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Category: IF. Tasks scored: 4.

37% percentile inside its fair comparison set

Raw benchmark value: 40.6%

Language

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #67

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 69.5%
Percentile: 38.9%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Category: Language. Tasks scored: 3.

38.9% percentile inside its fair comparison set

Raw benchmark value: 69.5%

Paraphrase

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #66

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 37.6%
Percentile: 39.8%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: paraphrase. Category: IF.

39.8% percentile inside its fair comparison set

Raw benchmark value: 37.6%

Simplify

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #69

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 38.1%
Percentile: 37%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: simplify. Category: IF.

37% percentile inside its fair comparison set

Raw benchmark value: 38.1%

Story generation

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #70

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 39.2%
Percentile: 36.1%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: story_generation. Category: IF.

36.1% percentile inside its fair comparison set

Raw benchmark value: 39.2%

Summarize

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #61

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 47.6%
Percentile: 44.4%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: summarize. Category: IF.

44.4% percentile inside its fair comparison set

Raw benchmark value: 47.6%

Connections

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #72

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 83%
Percentile: 34.3%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: connections. Category: Language.

34.3% percentile inside its fair comparison set

Raw benchmark value: 83%

Plot unscrambling

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #65

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 47.6%
Percentile: 40.7%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: plot_unscrambling. Category: Language.

40.7% percentile inside its fair comparison set

Raw benchmark value: 47.6%

Typos

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #48

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 78%
Percentile: 63.6%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: typos. Category: Language.

63.6% percentile inside its fair comparison set

Raw benchmark value: 78%

Coding · 9 benchmarks · 30.4%

Text Arena · Coding

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #100

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,442
Percentile: 69.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: coding. Source rank: #121. Votes: 1610. Organization: alibaba. License: Apache 2.0.

69.1% percentile inside its fair comparison set

Raw benchmark value: 1,442 (CI 1,427 - 1,456)

Text Arena · Coding · No Style Control

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #85

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,424
Percentile: 73.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: coding. Source rank: #102. Votes: 1610. Organization: alibaba. License: Apache 2.0.

73.8% percentile inside its fair comparison set

Raw benchmark value: 1,424 (CI 1,409 - 1,438)

Agentic coding

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #101

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 6.7%
Percentile: 7.4%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Category: Agentic Coding. Tasks scored: 3.

7.4% percentile inside its fair comparison set

Raw benchmark value: 6.7%

Coding

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #79

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 69%
Percentile: 27.8%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Category: Coding. Tasks scored: 2.

27.8% percentile inside its fair comparison set

Raw benchmark value: 69%

JavaScript

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #101

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 5%
Percentile: 10.2%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: javascript. Category: Agentic Coding.

10.2% percentile inside its fair comparison set

Raw benchmark value: 5%

TypeScript

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #96

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 5%
Percentile: 12.1%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: typescript. Category: Agentic Coding.

12.1% percentile inside its fair comparison set

Raw benchmark value: 5%

Python

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #104

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 10%
Percentile: 8.3%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: python. Category: Agentic Coding.

8.3% percentile inside its fair comparison set

Raw benchmark value: 10%

Code generation

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #97

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 66.2%
Percentile: 13.9%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: code_generation. Category: Coding.

13.9% percentile inside its fair comparison set

Raw benchmark value: 66.2%

Code completion

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #60

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 71.7%
Percentile: 50.9%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: code_completion. Category: Coding.

50.9% percentile inside its fair comparison set

Raw benchmark value: 71.7%

Reasoning / math / science · 12 benchmarks · 43.3%

Text Arena · Math

AR · Reasoning / math / science · Human

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #101

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,398
Percentile: 68.2%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: math. Source rank: #124. Votes: 489. Organization: alibaba. License: Apache 2.0.

68.2% percentile inside its fair comparison set

Raw benchmark value: 1,398 (CI 1,373 - 1,422)

Text Arena · Math · No Style Control

AR · Reasoning / math / science · Human

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #79

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,412
Percentile: 75.2%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: math. Source rank: #95. Votes: 489. Organization: alibaba. License: Apache 2.0.

75.2% percentile inside its fair comparison set

Raw benchmark value: 1,412 (CI 1,388 - 1,437)

Mathematics

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #66

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 73.4%
Percentile: 39.8%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Category: Mathematics. Tasks scored: 4.

39.8% percentile inside its fair comparison set

Raw benchmark value: 73.4%

Reasoning

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #69

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 59.4%
Percentile: 37%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Category: Reasoning. Tasks scored: 4.

37% percentile inside its fair comparison set

Raw benchmark value: 59.4%

AMPS Hard

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #88

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 86%
Percentile: 22.2%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: AMPS_Hard. Category: Mathematics.

22.2% percentile inside its fair comparison set

Raw benchmark value: 86%

Integrals with game

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #56

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 40%
Percentile: 49.1%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: integrals_with_game. Category: Mathematics.

49.1% percentile inside its fair comparison set

Raw benchmark value: 40%

Math competition

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #73

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 86.3%
Percentile: 35.2%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: math_comp. Category: Mathematics.

35.2% percentile inside its fair comparison set

Raw benchmark value: 86.3%

Olympiad

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #70

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 81.3%
Percentile: 36.1%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: olympiad. Category: Mathematics.

36.1% percentile inside its fair comparison set

Raw benchmark value: 81.3%

Theory of mind

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #72

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 53.8%
Percentile: 38%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: theory_of_mind. Category: Reasoning.

38% percentile inside its fair comparison set

Raw benchmark value: 53.8%

Zebra puzzle

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #63

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 37.8%
Percentile: 42.1%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: zebra_puzzle. Category: Reasoning.

42.1% percentile inside its fair comparison set

Raw benchmark value: 37.8%

Spatial

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #71

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 86%
Percentile: 36.1%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: spatial. Category: Reasoning.

36.1% percentile inside its fair comparison set

Raw benchmark value: 86%

Logic with navigation

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #70

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 60%
Percentile: 40.7%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: logic_with_navigation. Category: Reasoning.

40.7% percentile inside its fair comparison set

Raw benchmark value: 60%

Professional reasoning · 23 benchmarks · 71.5%

Text Arena · Expert

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena expert leaderboard.

Rank #48

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,463
Percentile: 82.9%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: expert. Source rank: #62. Votes: 410. Organization: alibaba. License: Apache 2.0.

82.9% percentile inside its fair comparison set

Raw benchmark value: 1,463 (CI 1,435 - 1,492)

Text Arena · Industry Business And Management And Financial Operations

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_business_and_management_and_financial_operations leaderboard.

Rank #95

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,402
Percentile: 70.4%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_business_and_management_and_financial_operations. Source rank: #115. Votes: 1545. Organization: alibaba. License: Apache 2.0.

70.4% percentile inside its fair comparison set

Raw benchmark value: 1,402 (CI 1,387 - 1,417)

Text Arena · Industry Entertainment And Sports And Media

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_entertainment_and_sports_and_media leaderboard.

Rank #108

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,351
Percentile: 66.9%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_entertainment_and_sports_and_media. Source rank: #132. Votes: 1542. Organization: alibaba. License: Apache 2.0.

66.9% percentile inside its fair comparison set

Raw benchmark value: 1,351 (CI 1,337 - 1,366)

Text Arena · Industry Legal And Government

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_legal_and_government leaderboard.

Rank #118

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,390
Percentile: 60.7%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_legal_and_government. Source rank: #144. Votes: 539. Organization: alibaba. License: Apache 2.0.

60.7% percentile inside its fair comparison set

Raw benchmark value: 1,390 (CI 1,365 - 1,414)

Text Arena · Industry Life And Physical And Social Science

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_life_and_physical_and_social_science leaderboard.

Rank #104

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,411
Percentile: 68.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_life_and_physical_and_social_science. Source rank: #126. Votes: 1514. Organization: alibaba. License: Apache 2.0.

68.1% percentile inside its fair comparison set

Raw benchmark value: 1,411 (CI 1,395 - 1,426)

Text Arena · Industry Mathematical

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_mathematical leaderboard.

Rank #72

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,425
Percentile: 76.9%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_mathematical. Source rank: #87. Votes: 491. Organization: alibaba. License: Apache 2.0.

76.9% percentile inside its fair comparison set

Raw benchmark value: 1,425 (CI 1,400 - 1,450)

Text Arena · Industry Medicine And Healthcare

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_medicine_and_healthcare leaderboard.

Rank #87

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,433
Percentile: 70.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_medicine_and_healthcare. Source rank: #107. Votes: 521. Organization: alibaba. License: Apache 2.0.

70.8% percentile inside its fair comparison set

Raw benchmark value: 1,433 (CI 1,407 - 1,459)

Text Arena · Industry Software And IT Services

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_software_and_it_services leaderboard.

Rank #99

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,433
Percentile: 69.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_software_and_it_services. Source rank: #119. Votes: 2992. Organization: alibaba. License: Apache 2.0.

69.8% percentile inside its fair comparison set

Raw benchmark value: 1,433 (CI 1,422 - 1,443)

Text Arena · Industry Writing And Literature And Language

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_writing_and_literature_and_language leaderboard.

Rank #98

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,378
Percentile: 70.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_writing_and_literature_and_language. Source rank: #120. Votes: 1869. Organization: alibaba. License: Apache 2.0.

70.1% percentile inside its fair comparison set

Raw benchmark value: 1,378 (CI 1,364 - 1,392)

Text Arena · Expert · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena expert leaderboard.

Rank #32

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,456
Percentile: 88.7%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: expert. Source rank: #40. Votes: 410. Organization: alibaba. License: Apache 2.0.

88.7% percentile inside its fair comparison set

Raw benchmark value: 1,456 (CI 1,428 - 1,485)

Text Arena · Industry Business And Management And Financial Operations · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_business_and_management_and_financial_operations leaderboard.

Rank #66

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,411
Percentile: 79.6%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_business_and_management_and_financial_operations. Source rank: #80. Votes: 1545. Organization: alibaba. License: Apache 2.0.

79.6% percentile inside its fair comparison set

Raw benchmark value: 1,411 (CI 1,396 - 1,426)

Text Arena · Industry Entertainment And Sports And Media · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_entertainment_and_sports_and_media leaderboard.

Rank #85

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,370
Percentile: 74%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_entertainment_and_sports_and_media. Source rank: #103. Votes: 1542. Organization: alibaba. License: Apache 2.0.

74% percentile inside its fair comparison set

Raw benchmark value: 1,370 (CI 1,355 - 1,385)

Text Arena · Industry Legal And Government · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_legal_and_government leaderboard.

Rank #91

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,408
Percentile: 69.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_legal_and_government. Source rank: #110. Votes: 539. Organization: alibaba. License: Apache 2.0.

69.8% percentile inside its fair comparison set

Raw benchmark value: 1,408 (CI 1,383 - 1,432)

Text Arena · Industry Life And Physical And Social Science · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_life_and_physical_and_social_science leaderboard.

Rank #70

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,432
Percentile: 78.6%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_life_and_physical_and_social_science. Source rank: #84. Votes: 1514. Organization: alibaba. License: Apache 2.0.

78.6% percentile inside its fair comparison set

Raw benchmark value: 1,432 (CI 1,417 - 1,447)

Text Arena · Industry Mathematical · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_mathematical leaderboard.

Rank #50

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,435
Percentile: 84.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_mathematical. Source rank: #60. Votes: 491. Organization: alibaba. License: Apache 2.0.

84.1% percentile inside its fair comparison set

Raw benchmark value: 1,435 (CI 1,410 - 1,460)

Text Arena · Industry Medicine And Healthcare · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_medicine_and_healthcare leaderboard.

Rank #48

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,450
Percentile: 84.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_medicine_and_healthcare. Source rank: #54. Votes: 521. Organization: alibaba. License: Apache 2.0.

84.1% percentile inside its fair comparison set

Raw benchmark value: 1,450 (CI 1,424 - 1,475)

Text Arena · Industry Software And IT Services · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_software_and_it_services leaderboard.

Rank #80

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,430
Percentile: 75.7%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_software_and_it_services. Source rank: #97. Votes: 2992. Organization: alibaba. License: Apache 2.0.

75.7% percentile inside its fair comparison set

Raw benchmark value: 1,430 (CI 1,419 - 1,441)

Text Arena · Industry Writing And Literature And Language · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_writing_and_literature_and_language leaderboard.

Rank #80

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,388
Percentile: 75.6%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_writing_and_literature_and_language. Source rank: #98. Votes: 1869. Organization: alibaba. License: Apache 2.0.

75.6% percentile inside its fair comparison set

Raw benchmark value: 1,388 (CI 1,374 - 1,401)

Data analysis

LB · Professional reasoning · Objective

Structured data manipulation and table reasoning accuracy.

Rank #62

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 52.2%
Percentile: 43.5%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Category: Data Analysis. Tasks scored: 3.

43.5% percentile inside its fair comparison set

Raw benchmark value: 52.2%

Overall

LB · Professional reasoning · Objective

Average objective performance across LiveBench's current public category mix.

Rank #78

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 53%
Percentile: 28.7%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Category averages included: 7.

28.7% percentile inside its fair comparison set

Raw benchmark value: 53%

Consecutive events

LB · Professional reasoning · Objective

Objective consecutive events score in LiveBench.

Rank #73

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 7.2%
Percentile: 33.3%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: consecutive_events. Category: Data Analysis.

33.3% percentile inside its fair comparison set

Raw benchmark value: 7.2%

Table join

LB · Professional reasoning · Objective

Objective table join score in LiveBench.

Rank #10

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 49.3%
Percentile: 91.7%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: tablejoin. Category: Data Analysis.

91.7% percentile inside its fair comparison set

Raw benchmark value: 49.3%

Table reformat

LB · Professional reasoning · Objective

Objective table reformat score in LiveBench.

Rank #26

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 100%
Percentile: 100%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: tablereformat. Category: Data Analysis.

100% percentile inside its fair comparison set

Raw benchmark value: 100%
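
The Professional reasoning headline and the Data analysis score can be cross-checked against the rows listed above. The following is a minimal sketch, assuming both figures are unweighted means rounded to one decimal place; the page does not state its aggregation rule, and the variable and function names are illustrative only.

```python
# Percentiles of the 23 Professional reasoning rows, copied in page order.
row_percentiles = [
    82.9, 70.4, 66.9, 60.7, 68.1, 76.9, 70.8, 69.8, 70.1,  # Arena, default
    88.7, 79.6, 74.0, 69.8, 78.6, 84.1, 84.1, 75.7, 75.6,  # Arena, No Style Control
    43.5, 28.7, 33.3, 91.7, 100.0,                          # LiveBench
]

# Raw LiveBench scores of the three Data Analysis tasks listed above.
data_analysis_tasks = {
    "consecutive_events": 7.2,
    "tablejoin": 49.3,
    "tablereformat": 100.0,
}


def mean_1dp(values):
    """Unweighted mean rounded to one decimal (assumed aggregation rule)."""
    values = list(values)
    return round(sum(values) / len(values), 1)


category_headline = mean_1dp(row_percentiles)
data_analysis_score = mean_1dp(data_analysis_tasks.values())

print(category_headline, data_analysis_score)
```

Under that assumption the sketch yields 71.5 and 52.2, which agree with the 71.5% category headline and the 52.2% raw value on the Data analysis row.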

Multilingual · 12 benchmarks · 69.6%

Text Arena · Chinese

AR · Multilingual · Human

Observed user preference in Arena's Text Arena chinese leaderboard.

Rank #77

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,449
Percentile: 74.2%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: chinese. Source rank: #93. Votes: 412. Organization: alibaba. License: Apache 2.0.

74.2% percentile inside its fair comparison set

Raw benchmark value: 1,449 (CI 1,420 - 1,479)

Text Arena · German

AR · Multilingual · Human

Observed user preference in Arena's Text Arena german leaderboard.

Rank #78

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,397
Percentile: 67.5%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: german. Source rank: #97. Votes: 204. Organization: alibaba. License: Apache 2.0.

67.5% percentile inside its fair comparison set

Raw benchmark value: 1,397 (CI 1,357 - 1,438)

Text Arena · Japanese

AR · Multilingual · Human

Observed user preference in Arena's Text Arena japanese leaderboard.

Rank #57

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,375
Percentile: 72.4%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: japanese. Source rank: #75. Votes: 403. Organization: alibaba. License: Apache 2.0.

72.4% percentile inside its fair comparison set

Raw benchmark value: 1,375 (CI 1,344 - 1,406)

Text Arena · Korean

AR · Multilingual · Human

Observed user preference in Arena's Text Arena korean leaderboard.

Rank #83

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,348
Percentile: 60.6%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: korean. Source rank: #103. Votes: 200. Organization: alibaba. License: Apache 2.0.

60.6% percentile inside its fair comparison set

Raw benchmark value: 1,348 (CI 1,306 - 1,390)

Text Arena · Russian

AR · Multilingual · Human

Observed user preference in Arena's Text Arena russian leaderboard.

Rank #83

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,403
Percentile: 71.6%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: russian. Source rank: #102. Votes: 438. Organization: alibaba. License: Apache 2.0.

71.6% percentile inside its fair comparison set

Raw benchmark value: 1,403 (CI 1,377 - 1,429)

Text Arena · Spanish

AR · Multilingual · Human

Observed user preference in Arena's Text Arena spanish leaderboard.

Rank #100

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,382
Percentile: 53.7%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: spanish. Source rank: #122. Votes: 154. Organization: alibaba. License: Apache 2.0.

53.7% percentile inside its fair comparison set

Raw benchmark value: 1,382 (CI 1,338 - 1,427)

Text Arena · Chinese · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena chinese leaderboard.

Rank #43

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,473
Percentile: 85.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: chinese. Source rank: #53. Votes: 412. Organization: alibaba. License: Apache 2.0.

85.8% percentile inside its fair comparison set

Raw benchmark value: 1,473 (CI 1,443 - 1,502)

Text Arena · German · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena german leaderboard.

Rank #66

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,406
Percentile: 72.6%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: german. Source rank: #82. Votes: 204. Organization: alibaba. License: Apache 2.0.

72.6% percentile inside its fair comparison set

Raw benchmark value: 1,406 (CI 1,366 - 1,446)

Text Arena · Japanese · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena japanese leaderboard.

Rank #44

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,384
Percentile: 78.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: japanese. Source rank: #57. Votes: 403. Organization: alibaba. License: Apache 2.0.

78.8% percentile inside its fair comparison set

Raw benchmark value: 1,384 (CI 1,353 - 1,414)

Text Arena · Korean · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena korean leaderboard.

Rank #70

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,360
Percentile: 66.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: korean. Source rank: #85. Votes: 200. Organization: alibaba. License: Apache 2.0.

66.8% percentile inside its fair comparison set

Raw benchmark value: 1,360 (CI 1,319 - 1,402)

Text Arena · Russian · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena russian leaderboard.

Rank #80

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,399
Percentile: 72.7%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: russian. Source rank: #100. Votes: 438. Organization: alibaba. License: Apache 2.0.

72.7% percentile inside its fair comparison set

Raw benchmark value: 1,399 (CI 1,372 - 1,425)

Text Arena · Spanish · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena spanish leaderboard.

Rank #89

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,390
Percentile: 58.9%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: spanish. Source rank: #107. Votes: 154. Organization: alibaba. License: Apache 2.0.

58.9% percentile inside its fair comparison set

Raw benchmark value: 1,390 (CI 1,344 - 1,436)
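
Each language above appears twice, once on the default leaderboard and once with No Style Control. The following is a minimal sketch of how the paired rows could be compared, using the scores and CIs exactly as listed. Interval overlap is used here only as a rough descriptive check, not as a significance test, and the names are illustrative.

```python
# language: ((score, ci_low, ci_high) default, (score, ci_low, ci_high) No Style Control)
rows = {
    "chinese":  ((1449, 1420, 1479), (1473, 1443, 1502)),
    "german":   ((1397, 1357, 1438), (1406, 1366, 1446)),
    "japanese": ((1375, 1344, 1406), (1384, 1353, 1414)),
    "korean":   ((1348, 1306, 1390), (1360, 1319, 1402)),
    "russian":  ((1403, 1377, 1429), (1399, 1372, 1425)),
    "spanish":  ((1382, 1338, 1427), (1390, 1344, 1436)),
}


def intervals_overlap(a, b):
    """True when the two (score, low, high) intervals share at least one point."""
    return max(a[1], b[1]) <= min(a[2], b[2])


# Score change when style control is turned off, per language.
deltas = {lang: nsc[0] - default[0] for lang, (default, nsc) in rows.items()}
overlaps = {lang: intervals_overlap(default, nsc) for lang, (default, nsc) in rows.items()}

for lang in rows:
    print(f"{lang:9s} delta={deltas[lang]:+d} overlap={overlaps[lang]}")
```

With the listed values, five of the six languages score higher without style control (Russian is 4 points lower), and every pair of intervals overlaps.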

Source links and registry checks

official · Arena · Jun 20, 2026 (26 source links)

official · LiveBench · Jun 20, 2026 (1 source link)

Chat / text27 benchmarks61.9%

Text Arena

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #97

verified runtimeexact aliasBackground only

Raw row drilldownsource row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,399
Percentile: 70.5%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: overall. Source rank: #119. Votes: 8994. Organization: alibaba. License: Apache 2.0.

70.5% percentile inside its fair comparison set

1,399Raw benchmark valueCI 1,393 - 1,406

Text Arena · Creative Writing

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #91

verified runtimeexact aliasBackground only

Raw row drilldownsource row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,373
Percentile: 72.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: creative_writing. Source rank: #114. Votes: 1078. Organization: alibaba. License: Apache 2.0.

72.1% percentile inside its fair comparison set

1,373Raw benchmark valueCI 1,355 - 1,391

Text Arena · English

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #108

verified runtimeexact aliasBackground only

Raw row drilldownsource row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,405
Percentile: 67.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: english. Source rank: #130. Votes: 4372. Organization: alibaba. License: Apache 2.0.

67.1% percentile inside its fair comparison set

1,405Raw benchmark valueCI 1,396 - 1,414

Text Arena · Exclude Ties

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #99

verified runtimeexact aliasBackground only

Raw row drilldownsource row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,380
Percentile: 69.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: exclude_ties. Source rank: #121. Votes: 6352. Organization: alibaba. License: Apache 2.0.

69.8% percentile inside its fair comparison set

1,380Raw benchmark valueCI 1,371 - 1,390

Text Arena · Hard Prompts

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #100

verified runtimeexact aliasBackground only

Raw row drilldownsource row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,418
Percentile: 69.5%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: hard_prompts. Source rank: #121. Votes: 3845. Organization: alibaba. License: Apache 2.0.

69.5% percentile inside its fair comparison set

1,418Raw benchmark valueCI 1,409 - 1,427

Text Arena · Hard Prompts English

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #99

verified runtimeexact aliasBackground only

Raw row drilldownsource row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,428
Percentile: 69.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: hard_prompts_english. Source rank: #119. Votes: 2009. Organization: alibaba. License: Apache 2.0.

69.8% percentile inside its fair comparison set

1,428Raw benchmark valueCI 1,415 - 1,441

Text Arena · Instruction Following

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #99

verified runtimeexact aliasBackground only

Raw row drilldownsource row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,386
Percentile: 69.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: instruction_following. Source rank: #122. Votes: 2099. Organization: alibaba. License: Apache 2.0.

69.8% percentile inside its fair comparison set

1,386Raw benchmark valueCI 1,373 - 1,398

Text Arena · Longer Query

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #104

verified runtimeexact aliasBackground only

Raw row drilldownsource row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,401
Percentile: 66.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: longer_query. Source rank: #129. Votes: 1623. Organization: alibaba. License: Apache 2.0.

66.1% percentile inside its fair comparison set

1,401Raw benchmark valueCI 1,386 - 1,415

Text Arena · Multi Turn

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #106

verified runtimeexact aliasBackground only

Raw row drilldownsource row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,393
Percentile: 67.5%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: multi_turn. Source rank: #129. Votes: 1403. Organization: alibaba. License: Apache 2.0.

67.5% percentile inside its fair comparison set

1,393Raw benchmark valueCI 1,378 - 1,409

Text Arena · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #80

verified runtimeexact aliasBackground only

Raw row drilldownsource row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,414
Percentile: 75.7%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: overall. Source rank: #95. Votes: 8994. Organization: alibaba. License: Apache 2.0.

75.7% percentile inside its fair comparison set

1,414Raw benchmark valueCI 1,407 - 1,420

Text Arena · Creative Writing · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #75

verified runtimeexact aliasBackground only

Raw row drilldownsource row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,386
Percentile: 77.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: creative_writing. Source rank: #89. Votes: 1078. Organization: alibaba. License: Apache 2.0.

77.1% percentile inside its fair comparison set

1,386Raw benchmark valueCI 1,368 - 1,404

Text Arena · English · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #73

verified runtimeexact aliasBackground only

Raw row drilldownsource row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,427
Percentile: 77.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: english. Source rank: #86. Votes: 4372. Organization: alibaba. License: Apache 2.0.

77.8% percentile inside its fair comparison set

1,427Raw benchmark valueCI 1,418 - 1,436

Text Arena · Exclude Ties · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #80

verified runtimeexact aliasBackground only

Raw row drilldownsource row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,400
Percentile: 75.7%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: exclude_ties. Source rank: #95. Votes: 6352. Organization: alibaba. License: Apache 2.0.

75.7% percentile inside its fair comparison set

1,400 · Raw benchmark value · CI 1,390 - 1,409

Text Arena · Hard Prompts · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #83

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,416
Percentile: 74.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: hard_prompts. Source rank: #100. Votes: 3845. Organization: alibaba. License: Apache 2.0.

74.8% percentile inside its fair comparison set

1,416 · Raw benchmark value · CI 1,406 - 1,425

Text Arena · Hard Prompts English · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #72

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,430
Percentile: 78.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: hard_prompts_english. Source rank: #87. Votes: 2009. Organization: alibaba. License: Apache 2.0.

78.1% percentile inside its fair comparison set

1,430 · Raw benchmark value · CI 1,417 - 1,443

Text Arena · Instruction Following · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #91

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,385
Percentile: 72.3%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: instruction_following. Source rank: #109. Votes: 2099. Organization: alibaba. License: Apache 2.0.

72.3% percentile inside its fair comparison set

1,385 · Raw benchmark value · CI 1,373 - 1,398

Text Arena · Longer Query · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #89

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,397
Percentile: 71.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: longer_query. Source rank: #109. Votes: 1623. Organization: alibaba. License: Apache 2.0.

71.1% percentile inside its fair comparison set

1,397 · Raw benchmark value · CI 1,383 - 1,412

Text Arena · Multi Turn · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #84

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,406
Percentile: 74.3%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: multi_turn. Source rank: #101. Votes: 1403. Organization: alibaba. License: Apache 2.0.

74.3% percentile inside its fair comparison set

1,406 · Raw benchmark value · CI 1,390 - 1,421

Instruction following

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #69

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 40.6%
Percentile: 37%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Category: IF. Tasks scored: 4.

37% percentile inside its fair comparison set

40.6% · Raw benchmark value

Language

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #67

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 69.5%
Percentile: 38.9%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Category: Language. Tasks scored: 3.

38.9% percentile inside its fair comparison set

69.5% · Raw benchmark value

Paraphrase

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #66

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 37.6%
Percentile: 39.8%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: paraphrase. Category: IF.

39.8% percentile inside its fair comparison set

37.6% · Raw benchmark value

Simplify

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #69

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 38.1%
Percentile: 37%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: simplify. Category: IF.

37% percentile inside its fair comparison set

38.1% · Raw benchmark value

Story generation

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #70

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 39.2%
Percentile: 36.1%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: story_generation. Category: IF.

36.1% percentile inside its fair comparison set

39.2% · Raw benchmark value

Summarize

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #61

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 47.6%
Percentile: 44.4%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: summarize. Category: IF.

44.4% percentile inside its fair comparison set

47.6% · Raw benchmark value

Connections

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #72

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 83%
Percentile: 34.3%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: connections. Category: Language.

34.3% percentile inside its fair comparison set

83% · Raw benchmark value

Plot unscrambling

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #65

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 47.6%
Percentile: 40.7%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: plot_unscrambling. Category: Language.

40.7% percentile inside its fair comparison set

47.6% · Raw benchmark value

Typos

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #48

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 78%
Percentile: 63.6%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: typos. Category: Language.

63.6% percentile inside its fair comparison set

78% · Raw benchmark value

Coding · 9 benchmarks · 30.4%

Text Arena · Coding

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #100

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,442
Percentile: 69.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: coding. Source rank: #121. Votes: 1610. Organization: alibaba. License: Apache 2.0.

69.1% percentile inside its fair comparison set

1,442 · Raw benchmark value · CI 1,427 - 1,456

Text Arena · Coding · No Style Control

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #85

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,424
Percentile: 73.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: coding. Source rank: #102. Votes: 1610. Organization: alibaba. License: Apache 2.0.

73.8% percentile inside its fair comparison set

1,424 · Raw benchmark value · CI 1,409 - 1,438

Agentic coding

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #101

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 6.7%
Percentile: 7.4%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Category: Agentic Coding. Tasks scored: 3.

7.4% percentile inside its fair comparison set

6.7% · Raw benchmark value

Coding

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #79

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 69%
Percentile: 27.8%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Category: Coding. Tasks scored: 2.

27.8% percentile inside its fair comparison set

69% · Raw benchmark value

JavaScript

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #101

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 5%
Percentile: 10.2%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: javascript. Category: Agentic Coding.

10.2% percentile inside its fair comparison set

5% · Raw benchmark value

TypeScript

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #96

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 5%
Percentile: 12.1%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: typescript. Category: Agentic Coding.

12.1% percentile inside its fair comparison set

5% · Raw benchmark value

Python

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #104

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 10%
Percentile: 8.3%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: python. Category: Agentic Coding.

8.3% percentile inside its fair comparison set

10% · Raw benchmark value

Coding generation

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #97

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 66.2%
Percentile: 13.9%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: code_generation. Category: Coding.

13.9% percentile inside its fair comparison set

66.2% · Raw benchmark value

Coding completion

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #60

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 71.7%
Percentile: 50.9%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: code_completion. Category: Coding.

50.9% percentile inside its fair comparison set

71.7% · Raw benchmark value

Reasoning / math / science · 12 benchmarks · 43.3%

Text Arena · Math

AR · Reasoning / math / science · Human

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #101

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,398
Percentile: 68.2%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: math. Source rank: #124. Votes: 489. Organization: alibaba. License: Apache 2.0.

68.2% percentile inside its fair comparison set

1,398 · Raw benchmark value · CI 1,373 - 1,422

Text Arena · Math · No Style Control

AR · Reasoning / math / science · Human

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #79

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,412
Percentile: 75.2%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: math. Source rank: #95. Votes: 489. Organization: alibaba. License: Apache 2.0.

75.2% percentile inside its fair comparison set

1,412 · Raw benchmark value · CI 1,388 - 1,437

Mathematics

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #66

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 73.4%
Percentile: 39.8%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Category: Mathematics. Tasks scored: 4.

39.8% percentile inside its fair comparison set

73.4% · Raw benchmark value

Reasoning

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #69

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 59.4%
Percentile: 37%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Category: Reasoning. Tasks scored: 4.

37% percentile inside its fair comparison set

59.4% · Raw benchmark value

AMPS Hard

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #88

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 86%
Percentile: 22.2%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: AMPS_Hard. Category: Mathematics.

22.2% percentile inside its fair comparison set

86% · Raw benchmark value

Integrals with game

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #56

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 40%
Percentile: 49.1%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: integrals_with_game. Category: Mathematics.

49.1% percentile inside its fair comparison set

40% · Raw benchmark value

Math competition

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #73

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 86.3%
Percentile: 35.2%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: math_comp. Category: Mathematics.

35.2% percentile inside its fair comparison set

86.3% · Raw benchmark value

Olympiad

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #70

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 81.3%
Percentile: 36.1%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: olympiad. Category: Mathematics.

36.1% percentile inside its fair comparison set

81.3% · Raw benchmark value

Theory of mind

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #72

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 53.8%
Percentile: 38%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: theory_of_mind. Category: Reasoning.

38% percentile inside its fair comparison set

53.8% · Raw benchmark value

Zebra puzzle

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #63

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 37.8%
Percentile: 42.1%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: zebra_puzzle. Category: Reasoning.

42.1% percentile inside its fair comparison set

37.8% · Raw benchmark value

Spatial

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #71

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 86%
Percentile: 36.1%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: spatial. Category: Reasoning.

36.1% percentile inside its fair comparison set

86% · Raw benchmark value

Logic with navigation

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #70

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 60%
Percentile: 40.7%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: logic_with_navigation. Category: Reasoning.

40.7% percentile inside its fair comparison set

60% · Raw benchmark value
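
The percentage shown next to each section name appears to be the plain average of the per-benchmark percentiles listed under it. The page does not state the formula, so the following is a minimal sketch that assumes an unweighted mean; the helper name `section_score` is hypothetical, and the percentiles are copied from the Coding and Reasoning / math / science cards above.

```python
from statistics import fmean

# Percentiles (%) copied from the cards above, in page order.
SECTION_PERCENTILES = {
    "Coding": [69.1, 73.8, 7.4, 27.8, 10.2, 12.1, 8.3, 13.9, 50.9],
    "Reasoning / math / science": [
        68.2, 75.2, 39.8, 37.0, 22.2, 49.1,
        35.2, 36.1, 38.0, 42.1, 36.1, 40.7,
    ],
}


def section_score(percentiles):
    """Unweighted mean of benchmark percentiles (assumed aggregation)."""
    return fmean(percentiles)


for name, values in SECTION_PERCENTILES.items():
    print(f"{name}: {len(values)} benchmarks, {section_score(values):.1f}%")
```

Under that assumption the two sections come out at 30.4% and 43.3%, matching the section headers. The same check was not run for sections whose cards are not all listed at this point.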

Professional reasoning · 23 benchmarks · 71.5%

Text Arena · Expert

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena expert leaderboard.

Rank #48

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,463
Percentile: 82.9%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: expert. Source rank: #62. Votes: 410. Organization: alibaba. License: Apache 2.0.

82.9% percentile inside its fair comparison set

1,463 · Raw benchmark value · CI 1,435 - 1,492

Text Arena · Industry Business And Management And Financial Operations

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_business_and_management_and_financial_operations leaderboard.

Rank #95

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,402
Percentile: 70.4%
Last updated: recent
Eligibility: benchmark_derived_model

70.4% percentile inside its fair comparison set

1,402 · Raw benchmark value · CI 1,387 - 1,417

Text Arena · Industry Entertainment And Sports And Media

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_entertainment_and_sports_and_media leaderboard.

Rank #108

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,351
Percentile: 66.9%
Last updated: recent
Eligibility: benchmark_derived_model

66.9% percentile inside its fair comparison set

1,351 · Raw benchmark value · CI 1,337 - 1,366

Text Arena · Industry Legal And Government

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_legal_and_government leaderboard.

Rank #118

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,390
Percentile: 60.7%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_legal_and_government. Source rank: #144. Votes: 539. Organization: alibaba. License: Apache 2.0.

60.7% percentile inside its fair comparison set

1,390 · Raw benchmark value · CI 1,365 - 1,414

Text Arena · Industry Life And Physical And Social Science

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_life_and_physical_and_social_science leaderboard.

Rank #104

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,411
Percentile: 68.1%
Last updated: recent
Eligibility: benchmark_derived_model

68.1% percentile inside its fair comparison set

1,411 · Raw benchmark value · CI 1,395 - 1,426

Text Arena · Industry Mathematical

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_mathematical leaderboard.

Rank #72

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,425
Percentile: 76.9%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_mathematical. Source rank: #87. Votes: 491. Organization: alibaba. License: Apache 2.0.

76.9% percentile inside its fair comparison set

1,425 · Raw benchmark value · CI 1,400 - 1,450

Text Arena · Industry Medicine And Healthcare

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_medicine_and_healthcare leaderboard.

Rank #87

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,433
Percentile: 70.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_medicine_and_healthcare. Source rank: #107. Votes: 521. Organization: alibaba. License: Apache 2.0.

70.8% percentile inside its fair comparison set

1,433 · Raw benchmark value · CI 1,407 - 1,459

Text Arena · Industry Software And It Services

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_software_and_it_services leaderboard.

Rank #99

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,433
Percentile: 69.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_software_and_it_services. Source rank: #119. Votes: 2992. Organization: alibaba. License: Apache 2.0.

69.8% percentile inside its fair comparison set

1,433 · Raw benchmark value · CI 1,422 - 1,443

Text Arena · Industry Writing And Literature And Language

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_writing_and_literature_and_language leaderboard.

Rank #98

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,378
Percentile: 70.1%
Last updated: recent
Eligibility: benchmark_derived_model

70.1% percentile inside its fair comparison set

1,378 · Raw benchmark value · CI 1,364 - 1,392

Text Arena · Expert · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena expert leaderboard.

Rank #32

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,456
Percentile: 88.7%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: expert. Source rank: #40. Votes: 410. Organization: alibaba. License: Apache 2.0.

88.7% percentile inside its fair comparison set

1,456 · Raw benchmark value · CI 1,428 - 1,485

Text Arena · Industry Business And Management And Financial Operations · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_business_and_management_and_financial_operations leaderboard.

Rank #66

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,411
Percentile: 79.6%
Last updated: recent
Eligibility: benchmark_derived_model

79.6% percentile inside its fair comparison set

1,411 · Raw benchmark value · CI 1,396 - 1,426

Text Arena · Industry Entertainment And Sports And Media · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_entertainment_and_sports_and_media leaderboard.

Rank #85

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,370
Percentile: 74%
Last updated: recent
Eligibility: benchmark_derived_model

74% percentile inside its fair comparison set

1,370 · Raw benchmark value · CI 1,355 - 1,385

Text Arena · Industry Legal And Government · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_legal_and_government leaderboard.

Rank #91

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,408
Percentile: 69.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_legal_and_government. Source rank: #110. Votes: 539. Organization: alibaba. License: Apache 2.0.

69.8% percentile inside its fair comparison set

1,408 · Raw benchmark value · CI 1,383 - 1,432

Text Arena · Industry Life And Physical And Social Science · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_life_and_physical_and_social_science leaderboard.

Rank #70

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,432
Percentile: 78.6%
Last updated: recent
Eligibility: benchmark_derived_model

78.6% percentile inside its fair comparison set

1,432 · Raw benchmark value · CI 1,417 - 1,447

Text Arena · Industry Mathematical · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_mathematical leaderboard.

Rank #50

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,435
Percentile: 84.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_mathematical. Source rank: #60. Votes: 491. Organization: alibaba. License: Apache 2.0.

84.1% percentile inside its fair comparison set

1,435 · Raw benchmark value · CI 1,410 - 1,460

Text Arena · Industry Medicine And Healthcare · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_medicine_and_healthcare leaderboard.

Rank #48

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,450
Percentile: 84.1%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_medicine_and_healthcare. Source rank: #54. Votes: 521. Organization: alibaba. License: Apache 2.0.

84.1% percentile inside its fair comparison set

1,450 · Raw benchmark value · CI 1,424 - 1,475

Text Arena · Industry Software And It Services · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_software_and_it_services leaderboard.

Rank #80

verified runtime · exact alias · Background only

Raw row drilldown · source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,430
Percentile: 75.7%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: industry_software_and_it_services. Source rank: #97. Votes: 2992. Organization: alibaba. License: Apache 2.0.

75.7% percentile inside its fair comparison set

1,430 · Raw benchmark value · CI 1,419 - 1,441

Text Arena · Industry Writing And Literature And Language · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_writing_and_literature_and_language leaderboard.

Rank #80

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,388
Percentile: 75.6%
Last updated: recent
Eligibility: benchmark_derived_model

75.6% percentile inside its fair comparison set

Raw benchmark value: 1,388 (CI 1,374 - 1,401)

Data analysis

LB · Professional reasoning · Objective

Structured data manipulation and table reasoning accuracy.

Rank #62

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 52.2%
Percentile: 43.5%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Category: Data Analysis. Tasks scored: 3.

43.5% percentile inside its fair comparison set

Raw benchmark value: 52.2%

Overall

LB · Professional reasoning · Objective

Average objective performance across LiveBench's current public category mix.

Rank #78

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 53%
Percentile: 28.7%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Category averages included: 7.

28.7% percentile inside its fair comparison set

Raw benchmark value: 53%

Consecutive events

LB · Professional reasoning · Objective

Objective consecutive events score in LiveBench.

Rank #73

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 7.2%
Percentile: 33.3%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: consecutive_events. Category: Data Analysis.

33.3% percentile inside its fair comparison set

Raw benchmark value: 7.2%

Table join

LB · Professional reasoning · Objective

Objective table join score in LiveBench.

Rank #10

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 49.3%
Percentile: 91.7%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: tablejoin. Category: Data Analysis.

91.7% percentile inside its fair comparison set

Raw benchmark value: 49.3%

Table reformat

LB · Professional reasoning · Objective

Objective table reformat score in LiveBench.

Rank #26

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 100%
Percentile: 100%
Last updated: archived
Eligibility: benchmark_derived_model

Derived from the official LiveBench website leaderboard table. Task: tablereformat. Category: Data Analysis.

100% percentile inside its fair comparison set

Raw benchmark value: 100%
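The three Data Analysis task rows above (consecutive_events, tablejoin, tablereformat) appear to roll up into the Data Analysis category score. A minimal sketch, assuming the category value is the unweighted mean of its task scores; this is an assumption that happens to reproduce the figures on this page, not a documented LiveBench formula, and `category_mean` is a hypothetical helper name:

```python
# Task scores copied from the LiveBench rows on this page (percent).
task_scores = {
    "consecutive_events": 7.2,
    "tablejoin": 49.3,
    "tablereformat": 100.0,
}


def category_mean(scores):
    """Unweighted mean of the task scores, rounded to one decimal place.

    Assumption: every task carries equal weight in the category score.
    """
    return round(sum(scores.values()) / len(scores), 1)


data_analysis = category_mean(task_scores)
print(data_analysis)  # 52.2, the same value shown in the Data Analysis row
```

The low consecutive_events score pulls the category well below the table join and table reformat results, which is why the category percentile (43.5%) sits far under the two table-task percentiles.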

Multilingual · 12 benchmarks · 69.6%

Text Arena · Chinese

AR · Multilingual · Human

Observed user preference in Arena's Text Arena chinese leaderboard.

Rank #77

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,449
Percentile: 74.2%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: chinese. Source rank: #93. Votes: 412. Organization: alibaba. License: Apache 2.0.

74.2% percentile inside its fair comparison set

Raw benchmark value: 1,449 (CI 1,420 - 1,479)

Text Arena · German

AR · Multilingual · Human

Observed user preference in Arena's Text Arena german leaderboard.

Rank #78

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,397
Percentile: 67.5%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: german. Source rank: #97. Votes: 204. Organization: alibaba. License: Apache 2.0.

67.5% percentile inside its fair comparison set

Raw benchmark value: 1,397 (CI 1,357 - 1,438)

Text Arena · Japanese

AR · Multilingual · Human

Observed user preference in Arena's Text Arena japanese leaderboard.

Rank #57

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,375
Percentile: 72.4%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: japanese. Source rank: #75. Votes: 403. Organization: alibaba. License: Apache 2.0.

72.4% percentile inside its fair comparison set

Raw benchmark value: 1,375 (CI 1,344 - 1,406)

Text Arena · Korean

AR · Multilingual · Human

Observed user preference in Arena's Text Arena korean leaderboard.

Rank #83

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,348
Percentile: 60.6%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: korean. Source rank: #103. Votes: 200. Organization: alibaba. License: Apache 2.0.

60.6% percentile inside its fair comparison set

Raw benchmark value: 1,348 (CI 1,306 - 1,390)

Text Arena · Russian

AR · Multilingual · Human

Observed user preference in Arena's Text Arena russian leaderboard.

Rank #83

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,403
Percentile: 71.6%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: russian. Source rank: #102. Votes: 438. Organization: alibaba. License: Apache 2.0.

71.6% percentile inside its fair comparison set

Raw benchmark value: 1,403 (CI 1,377 - 1,429)

Text Arena · Spanish

AR · Multilingual · Human

Observed user preference in Arena's Text Arena spanish leaderboard.

Rank #100

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,382
Percentile: 53.7%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: spanish. Source rank: #122. Votes: 154. Organization: alibaba. License: Apache 2.0.

53.7% percentile inside its fair comparison set

Raw benchmark value: 1,382 (CI 1,338 - 1,427)

Text Arena · Chinese · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena chinese leaderboard.

Rank #43

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,473
Percentile: 85.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: chinese. Source rank: #53. Votes: 412. Organization: alibaba. License: Apache 2.0.

85.8% percentile inside its fair comparison set

Raw benchmark value: 1,473 (CI 1,443 - 1,502)

Text Arena · German · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena german leaderboard.

Rank #66

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,406
Percentile: 72.6%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: german. Source rank: #82. Votes: 204. Organization: alibaba. License: Apache 2.0.

72.6% percentile inside its fair comparison set

Raw benchmark value: 1,406 (CI 1,366 - 1,446)

Text Arena · Japanese · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena japanese leaderboard.

Rank #44

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,384
Percentile: 78.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: japanese. Source rank: #57. Votes: 403. Organization: alibaba. License: Apache 2.0.

78.8% percentile inside its fair comparison set

Raw benchmark value: 1,384 (CI 1,353 - 1,414)

Text Arena · Korean · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena korean leaderboard.

Rank #70

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,360
Percentile: 66.8%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: korean. Source rank: #85. Votes: 200. Organization: alibaba. License: Apache 2.0.

66.8% percentile inside its fair comparison set

Raw benchmark value: 1,360 (CI 1,319 - 1,402)

Text Arena · Russian · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena russian leaderboard.

Rank #80

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,399
Percentile: 72.7%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: russian. Source rank: #100. Votes: 438. Organization: alibaba. License: Apache 2.0.

72.7% percentile inside its fair comparison set

Raw benchmark value: 1,399 (CI 1,372 - 1,425)

Text Arena · Spanish · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena spanish leaderboard.

Rank #89

verified runtime · exact alias · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,390
Percentile: 58.9%
Last updated: recent
Eligibility: benchmark_derived_model

Parsed from Arena leaderboard dataset row `qwen3-235b-a22b-thinking-2507`. Category: spanish. Source rank: #107. Votes: 154. Organization: alibaba. License: Apache 2.0.

58.9% percentile inside its fair comparison set

Raw benchmark value: 1,390 (CI 1,344 - 1,436)
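Each language above appears twice, once in the default leaderboard and once under No Style Control, with vote counts between 154 and 438 and correspondingly wide confidence intervals. A rough sketch of how the two rows can be compared, using the scores and intervals copied from this page; the `intervals_overlap` and `compare` helpers are hypothetical names, and interval overlap is only a coarse heuristic rather than a significance test:

```python
# (score, ci_low, ci_high) copied from the Multilingual rows on this page.
rows = {
    "chinese":  {"default": (1449, 1420, 1479), "no_style_control": (1473, 1443, 1502)},
    "german":   {"default": (1397, 1357, 1438), "no_style_control": (1406, 1366, 1446)},
    "japanese": {"default": (1375, 1344, 1406), "no_style_control": (1384, 1353, 1414)},
    "korean":   {"default": (1348, 1306, 1390), "no_style_control": (1360, 1319, 1402)},
    "russian":  {"default": (1403, 1377, 1429), "no_style_control": (1399, 1372, 1425)},
    "spanish":  {"default": (1382, 1338, 1427), "no_style_control": (1390, 1344, 1436)},
}


def intervals_overlap(a, b):
    """True when the two closed intervals (a[1], a[2]) and (b[1], b[2]) intersect."""
    return a[1] <= b[2] and b[1] <= a[2]


def compare(table):
    """Per language: score shift under No Style Control, and whether the CIs overlap."""
    summary = {}
    for lang, pair in table.items():
        default, nsc = pair["default"], pair["no_style_control"]
        summary[lang] = {
            "delta": nsc[0] - default[0],
            "overlap": intervals_overlap(default, nsc),
        }
    return summary


summary = compare(rows)
for lang, s in summary.items():
    print(f"{lang:9s} delta={s['delta']:+d} overlap={s['overlap']}")
```

On these figures every pair of intervals overlaps, so the score shifts between the two views (largest for Chinese, slightly negative for Russian) are best read as indicative rather than conclusive.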