Model profile · xAI

Grok 4.20

Closed weights · premium · registry tag 2026 flagship

Thin verified coverage

Reads as thin verified coverage across the resolved source data.

Visible coverage: 52.1%
Verified coverage: 52.1%
Spread: 100%
Last verified: Jun 20, 2026

43% bench fit

text · code · vision · search · document · 16 aliases · 45 official source links

Data version

Current snapshot.

Data version: Jun 20, 2026
Model list checked: 9 providers · 1081 tracked models
Page refreshed: Jul 5, 2026

The registry snapshot and page stamp are shown so a stale deploy is visible at a glance.
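
A minimal sketch of the kind of staleness check these two stamps make possible, assuming the dates are parsed exactly as printed on this page. The function names and the 30-day threshold are hypothetical and are not part of the registry.

```python
from datetime import datetime, timedelta

DATE_FORMAT = "%b %d, %Y"  # matches stamps such as "Jun 20, 2026"


def snapshot_age(data_version: str, page_refreshed: str) -> timedelta:
    """Return how far the data snapshot lags behind the page refresh."""
    data_date = datetime.strptime(data_version, DATE_FORMAT)
    page_date = datetime.strptime(page_refreshed, DATE_FORMAT)
    return page_date - data_date


def is_stale(data_version: str, page_refreshed: str, max_age_days: int = 30) -> bool:
    """Hypothetical rule: flag the snapshot when the lag exceeds max_age_days."""
    return snapshot_age(data_version, page_refreshed) > timedelta(days=max_age_days)


age = snapshot_age("Jun 20, 2026", "Jul 5, 2026")
print(age.days)  # 15
```

With the stamps shown here the snapshot trails the page refresh by 15 days, so it would pass the assumed 30-day threshold.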

Source-linked scores by benchmark

Each row keeps the benchmark source, source type, raw metric, and percentile inside its fair comparison set.
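
One way to picture such a row is as a small record. The sketch below is illustrative only: the class name, field names, and the `weakest` helper are assumptions, not the registry's schema, and the sample values are copied from rows on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRow:
    benchmark: str     # e.g. "Intelligence Index"
    source: str        # e.g. "Artificial Analysis"
    source_type: str   # e.g. "Combined", "Objective", "Human"
    raw_value: str     # kept as text because units differ per benchmark
    percentile: float  # 0-100, within the row's fair comparison set


def weakest(rows: list[BenchmarkRow]) -> BenchmarkRow:
    """Return the row with the lowest percentile."""
    return min(rows, key=lambda row: row.percentile)


rows = [
    BenchmarkRow("Intelligence Index", "Artificial Analysis", "Combined", "22", 72.9),
    BenchmarkRow("AA-Omniscience non-hallucination", "Artificial Analysis", "Objective", "3.1%", 2.7),
    BenchmarkRow("Text Arena", "Arena", "Human", "1,474", 96.0),
]
print(weakest(rows).benchmark)  # AA-Omniscience non-hallucination
```

Keeping the raw value as text avoids forcing scores, percentages, prices, and latencies into one numeric scale; only the percentile is comparable across rows.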


Chat / text · 37 benchmarks · 76.3%

Intelligence Index

AA · Chat / text · Combined

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #108 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

Verified runtime · exact alias

Source: Artificial Analysis
Raw value: 22
Percentile: 72.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `intelligenceIndex`.


AA-Omniscience accuracy

AA · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #44 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

Verified runtime · exact alias

Source: Artificial Analysis
Raw value: 26.6%
Percentile: 85.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `omniscienceAccuracy`.


AA-Omniscience non-hallucination

AA · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #291 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

Verified runtime · exact alias

Source: Artificial Analysis
Raw value: 3.1%
Percentile: 2.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `omniscienceNonHallucination`.


IFBench

AA · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #95 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

Verified runtime · exact alias

Source: Artificial Analysis
Raw value: 49.3%
Percentile: 70.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `ifbench`.


Blended price

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #221 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

Verified runtime · exact alias

Source: Artificial Analysis
Raw value: $3 /1M tokens
Percentile: 21%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `price1mBlended0To3To1`.


Input price

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #231 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

Verified runtime · exact alias

Source: Artificial Analysis
Raw value: $2 /1M input tokens
Percentile: 17.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `price1mInputTokens`.


Output price

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #214 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

Verified runtime · exact alias

Source: Artificial Analysis
Raw value: $6 /1M output tokens
Percentile: 23.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `price1mOutputTokens`.
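
The three price rows appear to be related. The field name `price1mBlended0To3To1` suggests a 3:1 input-to-output weighting; that reading is an assumption, but it is consistent with the listed values, since (3 × $2 + 1 × $6) / 4 = $3. A minimal sketch under that assumption, with a hypothetical function name:

```python
def blended_price(input_price: float, output_price: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average price per 1M tokens (the 3:1 weights are an assumption)."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


print(blended_price(2.0, 6.0))  # 3.0
```

A workload with a different input-to-output mix can pass its own weights to estimate an effective price.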


Output Speed

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #21 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

Verified runtime · exact alias

Source: Artificial Analysis
Raw value: 226.5 tokens/s
Percentile: 90.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `medianOutputTokensPerSecond`.


Time to first token

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #180 · Source label: Grok 4.20 0309 v2 (Reasoning)

Verified runtime · exact alias

Source: Artificial Analysis
Raw value: 12.92s
Percentile: 14.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `medianTimeToFirstTokenSeconds`.


Time to first answer token

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #104 · Source label: Grok 4.20 0309 v2 (Reasoning)

Verified runtime · exact alias

Source: Artificial Analysis
Raw value: 12.92s
Percentile: 51%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `medianTimeToFirstAnswerTokenSeconds`.


Text Arena

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #14 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: Arena
Raw value: 1,474
Percentile: 96%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: overall. Source rank: #18. Votes: 42370. Organization: xai. License: Proprietary.

Raw value CI: 1,470 - 1,479

Text Arena · Creative Writing

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #11 · Source label: grok-4.20-beta1

Verified runtime · exact alias

Source: Arena
Raw value: 1,460
Percentile: 96.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: creative_writing. Source rank: #14. Votes: 4030. Organization: xai. License: Proprietary.

Raw value CI: 1,450 - 1,470

Text Arena · English

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #16 · Source label: grok-4.20-beta1

Verified runtime · exact alias

Source: Arena
Raw value: 1,480
Percentile: 95.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: english. Source rank: #18. Votes: 12881. Organization: xai. License: Proprietary.

Raw value CI: 1,474 - 1,487

Text Arena · Exclude Ties

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #17 · Source label: grok-4.20-beta1

Verified runtime · exact alias

Source: Arena
Raw value: 1,483
Percentile: 95.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: exclude_ties. Source rank: #19. Votes: 20478. Organization: xai. License: Proprietary.

Raw value CI: 1,477 - 1,489

Text Arena · Hard Prompts

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #20 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: Arena
Raw value: 1,491
Percentile: 94.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: hard_prompts. Source rank: #25. Votes: 27360. Organization: xai. License: Proprietary.

Raw value CI: 1,486 - 1,496

Text Arena · Hard Prompts English

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #20 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: Arena
Raw value: 1,489
Percentile: 94.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: hard_prompts_english. Source rank: #26. Votes: 13613. Organization: xai. License: Proprietary.

Raw value CI: 1,483 - 1,496

Text Arena · Instruction Following

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #29 · Source label: grok-4.20-beta1

Verified runtime · exact alias

Source: Arena
Raw value: 1,449
Percentile: 91.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: instruction_following. Source rank: #37. Votes: 8420. Organization: xai. License: Proprietary.

Raw value CI: 1,442 - 1,456

Text Arena · Longer Query

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #31 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: Arena
Raw value: 1,465
Percentile: 90.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: longer_query. Source rank: #39. Votes: 17283. Organization: xai. License: Proprietary.

Raw value CI: 1,459 - 1,471

Text Arena · Multi Turn

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #18 · Source label: grok-4.20-beta1

Verified runtime · exact alias

Source: Arena
Raw value: 1,482
Percentile: 94.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: multi_turn. Source rank: #22. Votes: 4401. Organization: xai. License: Proprietary.

Raw value CI: 1,472 - 1,492

Text Arena · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #21 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: Arena
Raw value: 1,453
Percentile: 93.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: overall. Source rank: #26. Votes: 42370. Organization: xai. License: Proprietary.

Raw value CI: 1,449 - 1,458

Text Arena · Creative Writing · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #24 · Source label: grok-4.20-multi-agent-beta-0309

Verified runtime · exact alias

Source: Arena
Raw value: 1,437
Percentile: 92.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: creative_writing. Source rank: #29. Votes: 6804. Organization: xai. License: Proprietary.

Raw value CI: 1,429 - 1,445

Text Arena · English · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #21 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: Arena
Raw value: 1,457
Percentile: 93.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: english. Source rank: #26. Votes: 20156. Organization: xai. License: Proprietary.

Raw value CI: 1,452 - 1,463

Text Arena · Exclude Ties · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #21 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: Arena
Raw value: 1,453
Percentile: 93.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: exclude_ties. Source rank: #26. Votes: 32293. Organization: xai. License: Proprietary.

Raw value CI: 1,447 - 1,458

Text Arena · Hard Prompts · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #31 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: Arena
Raw value: 1,453
Percentile: 90.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: hard_prompts. Source rank: #38. Votes: 27360. Organization: xai. License: Proprietary.

Raw value CI: 1,448 - 1,458

Text Arena · Hard Prompts English · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #35 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: Arena
Raw value: 1,454
Percentile: 89.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: hard_prompts_english. Source rank: #42. Votes: 13613. Organization: xai. License: Proprietary.

Raw value CI: 1,448 - 1,460

Text Arena · Instruction Following · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #38 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: Arena
Raw value: 1,421
Percentile: 88.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: instruction_following. Source rank: #46. Votes: 14000. Organization: xai. License: Proprietary.

Raw value CI: 1,415 - 1,427

Text Arena · Longer Query · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #41 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: Arena
Raw value: 1,436
Percentile: 86.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: longer_query. Source rank: #49. Votes: 17283. Organization: xai. License: Proprietary.

Raw value CI: 1,430 - 1,442

Text Arena · Multi Turn · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #25 · Source label: grok-4.20-multi-agent-beta-0309

Verified runtime · exact alias

Source: Arena
Raw value: 1,456
Percentile: 92.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: multi_turn. Source rank: #32. Votes: 6754. Organization: xai. License: Proprietary.

Raw value CI: 1,448 - 1,464

Instruction following

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #21 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: LiveBench
Raw value: 63.4%
Percentile: 81.5%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: IF. Tasks scored: 4.


Language

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #30 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: LiveBench
Raw value: 77.7%
Percentile: 73.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Language. Tasks scored: 3.


Paraphrase

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #32 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: LiveBench
Raw value: 58.9%
Percentile: 71.3%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: paraphrase. Category: IF.


Simplify

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #18 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: LiveBench
Raw value: 60.7%
Percentile: 84.3%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: simplify. Category: IF.


Story generation

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #19 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: LiveBench
Raw value: 67.9%
Percentile: 83.3%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: story_generation. Category: IF.


Summarize

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #23 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: LiveBench
Raw value: 66.1%
Percentile: 79.6%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: summarize. Category: IF.


Connections

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #9 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: LiveBench
Raw value: 100%
Percentile: 100%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: connections. Category: Language.


Plot unscrambling

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #61 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: LiveBench
Raw value: 51.2%
Percentile: 44.4%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: plot_unscrambling. Category: Language.


Typos

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #24 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: LiveBench
Raw value: 82%
Percentile: 86%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: typos. Category: Language.


Coding · 21 benchmarks · 47%

Terminal-Bench Hard

AA · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #117 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

Verified runtime · exact alias

Source: Artificial Analysis
Raw value: 16.7%
Percentile: 61.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `terminalbenchHard`.


SciCode

AA · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #154 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

Verified runtime · exact alias

Source: Artificial Analysis
Raw value: 32.8%
Percentile: 58.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `scicode`.


Code Arena

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #42 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: Arena
Raw value: 1,385
Percentile: 43.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: overall. Source rank: #50. Votes: 9535. Organization: xai. License: Proprietary.

Raw value CI: 1,378 - 1,391

WebDev Arena

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #42 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: Arena
Raw value: 1,385
Percentile: 43.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: webdev. Source rank: #50. Votes: 9535. Organization: xai. License: Proprietary.

Raw value CI: 1,378 - 1,391

Code Arena · Webdev Html

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #45 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: Arena
Raw value: 1,370
Percentile: 39.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: webdev-html. Source rank: #53. Votes: 1155. Organization: xai. License: Proprietary.

Raw value CI: 1,352 - 1,388

Code Arena · Webdev React

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #35 · Source label: grok-4.20-beta-0309-reasoning

Verified runtime · exact alias

Source: Arena
Raw value: 1,381
Percentile: 42.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: webdev-react. Source rank: #44. Votes: 8351. Organization: xai. License: Proprietary.

Raw value CI: 1,374 - 1,388

LiveCodeBench

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #22 · Source label: grok/grok-4.20-0309-reasoning

Verified runtime · exact alias

Source: Vals AI
Raw value: 84.3%
Percentile: 76.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: lcb; provider: xAI.

Raw value CI: 82.3% - 86.3%

SWE-bench Verified

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #29 · Source label: grok/grok-4.20-0309-reasoning

Verified runtime · exact alias

Source: Vals AI
Raw value: 72.2%
Percentile: 48.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: swebench; provider: xAI.

Raw value CI: 68.3% - 76.1%

Terminal-Bench 2.1

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #23 · Source label: grok/grok-4.20-0309-reasoning

Verified runtime · exact alias

Source: Vals AI
Raw value: 44.2%
Percentile: 18.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: terminal-bench-2-1; provider: xAI.

Raw value CI: 42.3% - 46.1%

Vibe Code Bench v1.1

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #42 · Source label: grok/grok-4.20-0309-reasoning

Verified runtime · exact alias

Source: Vals AI
Raw value: 4.1%
Percentile: 16.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: vibe-code; provider: xAI.

Raw value CI: 0% - 8.1%

Text Arena · Coding

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #21 · Source label: grok-4.20-multi-agent-beta-0309

Verified runtime · exact alias

Source: Arena
Raw value: 1,512
Percentile: 93.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: coding. Source rank: #27. Votes: 11388. Organization: xai. License: Proprietary.

Raw value CI: 1,506 - 1,519

Text Arena · Coding · No Style Control

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #38 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,460
Percentile: 88.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: coding. Source rank: #47. Votes: 11388. Organization: xai. License: Proprietary.

88.4% percentile inside its fair comparison set

Raw benchmark value: 1,460 (CI 1,453 - 1,466)
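As a rough reading aid for the Arena rows: the sketch below assumes the scores sit on an Elo-like scale, where a 400-point gap corresponds to 10:1 odds. That scale is an assumption here, since this page does not describe how Arena fits its ratings. The opponent rating is hypothetical, and scores from the style-controlled and No Style Control leaderboards should not be compared with each other this way.

```python
def expected_win_probability(rating_a: float, rating_b: float) -> float:
    """Elo-style expected score of A against B (assumed 400-point, base-10 scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


model_rating = 1512.0           # raw value from the Text Arena · Coding row
hypothetical_opponent = 1450.0  # assumed rating, not a leaderboard entry

p = expected_win_probability(model_rating, hypothetical_opponent)
print(f"expected win share vs. hypothetical opponent: {p:.3f}")
```

Under this assumption, equal ratings give an expected win share of 0.5, and gaps of a few points (such as the width of the confidence intervals shown on these rows) move that share only slightly.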

IOI

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #10 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 30.2%
Percentile: 79.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: ioi; provider: xAI.

79.5% percentile inside its fair comparison set

Raw benchmark value: 30.2% (CI 15.5% - 44.8%)
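The confidence intervals on the Vals AI coding rows above differ a lot in width, which affects how much weight each raw value can bear. The sketch below simply computes the half-width of each interval from the numbers shown on those rows. The row labels are the Vals slugs, and the helper names are illustrative rather than part of any Vals AI tooling.

```python
# (Vals slug, raw value %, CI low %, CI high %) copied from the rows above
rows = [
    ("terminal-bench-2-1", 44.2, 42.3, 46.1),
    ("vibe-code", 4.1, 0.0, 8.1),
    ("ioi", 30.2, 15.5, 44.8),
]


def ci_half_width(low: float, high: float) -> float:
    """Half the distance between the CI bounds, in percentage points."""
    return (high - low) / 2


def summarize(rows):
    """Map each slug to its CI half-width, rounded to two decimals."""
    return {slug: round(ci_half_width(low, high), 2) for slug, _, low, high in rows}


summary = summarize(rows)
for slug, half_width in summary.items():
    print(f"{slug}: +/- {half_width} points")
```

The IOI interval is by far the widest of the three, so its raw value is the least precisely pinned down.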

HiL-Bench

SL · Coding · Rubric

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #6

verified runtime · exact direct

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 8%
Percentile: 0%
Last updated: recent
Eligibility: headline eligible

0% percentile inside its fair comparison set

Raw benchmark value: 8%

Agentic coding

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #60 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 43.3%
Percentile: 47.2%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Agentic Coding. Tasks scored: 3.

47.2% percentile inside its fair comparison set

Raw benchmark value: 43.3%

Coding

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #89 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 66.1%
Percentile: 18.5%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Coding. Tasks scored: 2.

18.5% percentile inside its fair comparison set

Raw benchmark value: 66.1%

JavaScript

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #83 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 25%
Percentile: 29.6%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: javascript. Category: Agentic Coding.

29.6% percentile inside its fair comparison set

Raw benchmark value: 25%

TypeScript

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #57 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 35%
Percentile: 56.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: typescript. Category: Agentic Coding.

56.1% percentile inside its fair comparison set

Raw benchmark value: 35%

Python

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #39 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 70%
Percentile: 77.8%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: python. Category: Agentic Coding.

77.8% percentile inside its fair comparison set

Raw benchmark value: 70%

Coding generation

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #98 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 64.8%
Percentile: 10.2%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: code_generation. Category: Coding.

10.2% percentile inside its fair comparison set

Raw benchmark value: 64.8%

Coding completion

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #76 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 67.4%
Percentile: 36.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: code_completion. Category: Coding.

36.1% percentile inside its fair comparison set

Raw benchmark value: 67.4%
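Each row above reports a percentile "inside its fair comparison set", but the page does not state the formula behind it. The sketch below shows one common definition, offered only as an assumption: the share of the other models in the set that score strictly lower. The function name and the scores are hypothetical and are not taken from any leaderboard.

```python
def percentile_in_set(score: float, other_scores: list[float]) -> float:
    """Percent of the other models in the set scoring strictly below `score`."""
    if not other_scores:
        raise ValueError("comparison set needs at least one other model")
    below = sum(1 for s in other_scores if s < score)
    return 100.0 * below / len(other_scores)


# Hypothetical scores for the other models in a comparison set
others = [12.0, 25.0, 31.0, 40.0, 55.0]

print(percentile_in_set(8.0, others))   # lowest score in the set
print(percentile_in_set(35.0, others))  # three of five others score lower
```

Under this definition the percentile depends only on the ordering within the set, so a high raw value can still land at a low percentile when most of the set scores higher, and the lowest-scoring model in a set sits at 0% whatever its raw value.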

Reasoning / math / science · 18 benchmarks · 74.7%

Humanity's Last Exam

AA · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #28 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 24.2%
Percentile: 92.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `hle`.

92.7% percentile inside its fair comparison set

Raw benchmark value: 24.2%

GPQA

AA · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #82 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 77.6%
Percentile: 78.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `gpqa`.

78.3% percentile inside its fair comparison set

Raw benchmark value: 77.6%

CritPt

AA · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #76 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 0.3%
Percentile: 75.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `critpt`.

75.2% percentile inside its fair comparison set

Raw benchmark value: 0.3%

ProofBench

VALS-AI · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #25 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 14%
Percentile: 31.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: proof_bench; provider: xAI.

31.4% percentile inside its fair comparison set

Raw benchmark value: 14% (CI 7.2% - 20.8%)

GPQA Diamond

VALS-AI · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #17 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 88.6%
Percentile: 82%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: gpqa; provider: xAI.

82% percentile inside its fair comparison set

Raw benchmark value: 88.6% (CI 85.5% - 91.8%)

MMLU Pro

VALS-AI · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #26 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 86.3%
Percentile: 71.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: mmlu_pro; provider: xAI.

71.9% percentile inside its fair comparison set

Raw benchmark value: 86.3% (CI 85.6% - 86.9%)

Text Arena · Math

AR · Reasoning / math / science · Human

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #19 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,470
Percentile: 94.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: math. Source rank: #23. Votes: 2399. Organization: xai. License: Proprietary.

94.3% percentile inside its fair comparison set

Raw benchmark value: 1,470 (CI 1,457 - 1,482)

Text Arena · Math · No Style Control

AR · Reasoning / math / science · Human

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #24 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,460
Percentile: 92.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: math. Source rank: #28. Votes: 2399. Organization: xai. License: Proprietary.

92.7% percentile inside its fair comparison set

Raw benchmark value: 1,460 (CI 1,447 - 1,472)

Mathematics

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #20 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 87.1%
Percentile: 82.4%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Mathematics. Tasks scored: 4.

82.4% percentile inside its fair comparison set

Raw benchmark value: 87.1%

Reasoning

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #39 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 75.3%
Percentile: 64.8%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Reasoning. Tasks scored: 4.

64.8% percentile inside its fair comparison set

Raw benchmark value: 75.3%

AMPS Hard

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #24 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 98%
Percentile: 94.4%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: AMPS_Hard. Category: Mathematics.

94.4% percentile inside its fair comparison set

Raw benchmark value: 98%

Integrals with game

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #11 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 81%
Percentile: 91.7%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: integrals_with_game. Category: Mathematics.

91.7% percentile inside its fair comparison set

Raw benchmark value: 81%

Math competition

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #36 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 94.1%
Percentile: 74.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: math_comp. Category: Mathematics.

74.1% percentile inside its fair comparison set

Raw benchmark value: 94.1%

Olympiad

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #81 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 75.1%
Percentile: 25.9%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: olympiad. Category: Mathematics.

25.9% percentile inside its fair comparison set

Raw benchmark value: 75.1%

Theory of mind

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #51 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 65.4%
Percentile: 55.6%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: theory_of_mind. Category: Reasoning.

55.6% percentile inside its fair comparison set

Raw benchmark value: 65.4%

Zebra puzzle

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #38 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 67.8%
Percentile: 65.4%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: zebra_puzzle. Category: Reasoning.

65.4% percentile inside its fair comparison set

Raw benchmark value: 67.8%

Spatial

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #26 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 98%
Percentile: 88.9%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: spatial. Category: Reasoning.

88.9% percentile inside its fair comparison set

Raw benchmark value: 98%

Logic with navigation

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #26 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 70%
Percentile: 83.3%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: logic_with_navigation. Category: Reasoning.

83.3% percentile inside its fair comparison set

Raw benchmark value: 70%

Professional reasoning · 31 benchmarks · 69.6%

APEX-Agents-AA

AA · Professional reasoning · Objective

Long-horizon agentic task completion.

Rank #18 · Source label: Grok 4.20 0309 (Reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 14.2%
Percentile: 29.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `apexAgents`.

29.2% percentile inside its fair comparison set

Raw benchmark value: 14.2%

Vals Index

VALS-AI · Professional reasoning · Combined

Weighted model performance across economically relevant Vals tasks.

Rank #25 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 39
Percentile: 7.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: vals_index; provider: xAI.

7.7% percentile inside its fair comparison set

Raw benchmark value: 39 (CI 38 - 41)

LegalBench

VALS-AI · Professional reasoning · Objective

Academic legal reasoning tasks.

Rank #66 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 77.7%
Percentile: 27.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: legal_bench; provider: xAI.

27.8% percentile inside its fair comparison set

Raw benchmark value: 77.7% (CI 76.8% - 78.7%)

Finance Agent v2

VALS-AI · Professional reasoning · Objective

Core financial analyst tasks for agentic models.

Rank #25 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 28.5%
Percentile: 4%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: fabv2; provider: xAI.

4% percentile inside its fair comparison set

Raw benchmark value: 28.5% (CI 27.9% - 29.1%)

TaxEval v2

VALS-AI · Professional reasoning · Objective

Answer quality on tax questions and responses.

Rank #24 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 74.1%
Percentile: 74.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: tax_eval_v2; provider: xAI.

74.7% percentile inside its fair comparison set

Raw benchmark value: 74.1% (CI 72.4% - 75.8%)

MedCode

VALS-AI · Professional reasoning · Objective

Medical billing support and coding tasks.

Rank #48 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 32.2%
Percentile: 7.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: medcode; provider: xAI.

7.8% percentile inside its fair comparison set

Raw benchmark value: 32.2% (CI 28% - 36.3%)

MedScribe

VALS-AI · Professional reasoning · Objective

Administrative documentation support for doctors.

Rank #50 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 63.4%
Percentile: 2%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: medscribe; provider: xAI.

2% percentile inside its fair comparison set

Raw benchmark value: 63.4% (CI 59.3% - 67.5%)

Text Arena · Expert

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena expert leaderboard.

Rank #27 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,484
Percentile: 90.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: expert. Source rank: #34. Votes: 3775. Organization: xai. License: Proprietary.

90.5% percentile inside its fair comparison set

Raw benchmark value: 1,484 (CI 1,474 - 1,495)

Text Arena · Industry Business And Management And Financial Operations

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_business_and_management_and_financial_operations leaderboard.

Rank #16 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,470
Percentile: 95.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: industry_business_and_management_and_financial_operations. Source rank: #20. Votes: 5150. Organization: xai. License: Proprietary.

95.3% percentile inside its fair comparison set

Raw benchmark value: 1,470 (CI 1,461 - 1,479)

Text Arena · Industry Entertainment And Sports And Media

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_entertainment_and_sports_and_media leaderboard.

Rank #17 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,447
Percentile: 95%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: industry_entertainment_and_sports_and_media. Source rank: #20. Votes: 8927. Organization: xai. License: Proprietary.

95% percentile inside its fair comparison set

Raw benchmark value: 1,447 (CI 1,439 - 1,454)

Text Arena · Industry Legal And Government

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_legal_and_government leaderboard.

Rank #16 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,478
Percentile: 95%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: industry_legal_and_government. Source rank: #22. Votes: 3282. Organization: xai. License: Proprietary.

95% percentile inside its fair comparison set

Raw benchmark value: 1,478 (CI 1,467 - 1,489)

Text Arena · Industry Life And Physical And Social Science

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_life_and_physical_and_social_science leaderboard.

Rank #24 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,483
Percentile: 92.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: industry_life_and_physical_and_social_science. Source rank: #29. Votes: 6746. Organization: xai. License: Proprietary.

92.9% percentile inside its fair comparison set

Raw benchmark value: 1,483 (CI 1,475 - 1,491)

Text Arena · Industry Mathematical

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_mathematical leaderboard.

Rank #28 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,464
Percentile: 91.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: industry_mathematical. Source rank: #34. Votes: 2321. Organization: xai. License: Proprietary.

91.2% percentile inside its fair comparison set

Raw benchmark value: 1,464 (CI 1,450 - 1,477)

Text Arena · Industry Medicine And Healthcare

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_medicine_and_healthcare leaderboard.

Rank #12 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,495
Percentile: 96.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: industry_medicine_and_healthcare. Source rank: #14. Votes: 1953. Organization: xai. License: Proprietary.

96.3% percentile inside its fair comparison set

Raw benchmark value: 1,495 (CI 1,481 - 1,509)

Text Arena · Industry Software And IT Services

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_software_and_it_services leaderboard.

Rank #18 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,506
Percentile: 94.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: industry_software_and_it_services. Source rank: #22. Votes: 16466. Organization: xai. License: Proprietary.

94.8% percentile inside its fair comparison set

Raw benchmark value: 1,506 (CI 1,500 - 1,512)

Text Arena · Industry Writing And Literature And Language

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_writing_and_literature_and_language leaderboard.

Rank #24 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,451
Percentile: 92.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: industry_writing_and_literature_and_language. Source rank: #30. Votes: 6118. Organization: xai. License: Proprietary.

92.9% percentile inside its fair comparison set

Raw benchmark value: 1,451 (CI 1,443 - 1,459)

Text Arena · Expert · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena expert leaderboard.

Rank #39 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,449
Percentile: 86.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: expert. Source rank: #47. Votes: 3775. Organization: xai. License: Proprietary.

86.2% percentile inside its fair comparison set

Raw benchmark value: 1,449 (CI 1,439 - 1,460)

Text Arena · Industry Business And Management And Financial Operations · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_business_and_management_and_financial_operations leaderboard.

Rank #33 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,434
Percentile: 89.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: industry_business_and_management_and_financial_operations. Source rank: #38. Votes: 8332. Organization: xai. License: Proprietary.

89.9% percentile inside its fair comparison set

Raw benchmark value: 1,434 (CI 1,426 - 1,441)

Text Arena · Industry Entertainment And Sports And Media · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_entertainment_and_sports_and_media leaderboard.

Rank #21 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,428
Percentile: 93.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: industry_entertainment_and_sports_and_media. Source rank: #25. Votes: 8927. Organization: xai. License: Proprietary.

93.8% percentile inside its fair comparison set

Raw benchmark value: 1,428 (CI 1,420 - 1,435)

Text Arena · Industry Legal And Government · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_legal_and_government leaderboard.

Rank #29 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,454
Percentile: 90.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: industry_legal_and_government. Source rank: #35. Votes: 3282. Organization: xai. License: Proprietary.

90.6% percentile inside its fair comparison set

Raw benchmark value: 1,454 (CI 1,443 - 1,465)

Text Arena · Industry Life And Physical And Social Science · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_life_and_physical_and_social_science leaderboard.

Rank #31 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,457
Percentile: 90.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: industry_life_and_physical_and_social_science. Source rank: #37. Votes: 6746. Organization: xai. License: Proprietary.

90.7% percentile inside its fair comparison set

Raw benchmark value: 1,457 (CI 1,449 - 1,465)

Text Arena · Industry Mathematical · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_mathematical leaderboard.

Rank #34 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,449
Percentile: 89.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: industry_mathematical. Source rank: #41. Votes: 2321. Organization: xai. License: Proprietary.

89.3% percentile inside its fair comparison set

Raw benchmark value: 1,449 (CI 1,436 - 1,462)

Text Arena · Industry Medicine And Healthcare · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_medicine_and_healthcare leaderboard.

Rank #35 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,457
Percentile: 88.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: industry_medicine_and_healthcare. Source rank: #38. Votes: 2979. Organization: xai. License: Proprietary.

88.5% percentile inside its fair comparison set

Raw benchmark value: 1,457 · CI 1,445 - 1,469

Text Arena · Industry Software And IT Services · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_software_and_it_services leaderboard.

Rank #31 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,464
Percentile: 90.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: industry_software_and_it_services. Source rank: #38. Votes: 16466. Organization: xai. License: Proprietary.

90.8% percentile inside its fair comparison set

Raw benchmark value: 1,464 · CI 1,458 - 1,470

Text Arena · Industry Writing And Literature And Language · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_writing_and_literature_and_language leaderboard.

Rank #25 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,436
Percentile: 92.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: industry_writing_and_literature_and_language. Source rank: #32. Votes: 10281. Organization: xai. License: Proprietary.

92.6% percentile inside its fair comparison set

Raw benchmark value: 1,436 · CI 1,429 - 1,443

SAGE

VALS-AI · Professional reasoning · Objective

Student Assessment with Generative Evaluation.

Rank #31 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 38.2%
Percentile: 33.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: sage; provider: xAI.

33.3% percentile inside its fair comparison set

Raw benchmark value: 38.2% · CI 31.5% - 45%

Data analysis

LB · Professional reasoning · Objective

Structured data manipulation and table reasoning accuracy.

Rank #42 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 62.9%
Percentile: 62%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Data Analysis. Tasks scored: 3.

62% percentile inside its fair comparison set

Raw benchmark value: 62.9%

Overall

LB · Professional reasoning · Objective

Average objective performance across LiveBench's current public category mix.

Rank #44 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 68%
Percentile: 60.2%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category averages included: 7.

60.2% percentile inside its fair comparison set

Raw benchmark value: 68%

Consecutive events

LB · Professional reasoning · Objective

Objective consecutive events score in LiveBench.

Rank #40 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 50.8%
Percentile: 63.9%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: consecutive_events. Category: Data Analysis.

63.9% percentile inside its fair comparison set

Raw benchmark value: 50.8%

Table join

LB · Professional reasoning · Objective

Objective table join score in LiveBench.

Rank #78 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 37.8%
Percentile: 28.7%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: tablejoin. Category: Data Analysis.

28.7% percentile inside its fair comparison set

Raw benchmark value: 37.8%

Table reformat

LB · Professional reasoning · Objective

Objective table reformat score in LiveBench.

Rank #25 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 100%
Percentile: 100%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: tablereformat. Category: Data Analysis.

100% percentile inside its fair comparison set

Raw benchmark value: 100%
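The Data analysis card reports "Tasks scored: 3", and its 62.9% is consistent with an unweighted mean of the three task scores listed above (consecutive events, table join, table reformat). The sketch below checks that arithmetic; the averaging rule is an assumption, since the page does not state how LiveBench combines tasks, and `category_score` is a hypothetical name.

```python
def category_score(task_scores):
    """Unweighted mean of task scores in percent (assumed aggregation rule)."""
    if not task_scores:
        raise ValueError("need at least one task score")
    return sum(task_scores.values()) / len(task_scores)


# Task scores copied from the three LiveBench task cards above.
data_analysis_tasks = {
    "consecutive_events": 50.8,
    "tablejoin": 37.8,
    "tablereformat": 100.0,
}

score = category_score(data_analysis_tasks)
print(f"Data analysis: {score:.1f}%")
```

Under this reading, the perfect table reformat score offsets the weak table join score, so the category figure alone hides a wide spread between tasks.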

Search / tool use · 3 benchmarks · 73.6%

Tau2-Bench Telecom

AA · Search / tool use · Objective

It matters when the model must browse, call tools, and recover useful answers from external systems.

Rank #112 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 59.9%
Percentile: 64.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `tau2`.

64.1% percentile inside its fair comparison set

Raw benchmark value: 59.9%

Search Arena

AR · Search / tool use · Human

It matters when the model must browse, call tools, and recover useful answers from external systems.

Rank #6 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,208
Percentile: 83.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: overall. Source rank: #6. Votes: 56367. Organization: xai. License: Proprietary.

83.3% percentile inside its fair comparison set

Raw benchmark value: 1,208 · CI 1,202 - 1,213

Search Arena · No Style Control

AR · Search / tool use · Human

It matters when the model must browse, call tools, and recover useful answers from external systems.

Rank #9 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,206
Percentile: 73.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: overall. Source rank: #9. Votes: 56367. Organization: xai. License: Proprietary.

73.3% percentile inside its fair comparison set

Raw benchmark value: 1,206 · CI 1,201 - 1,212

Long context · 2 benchmarks · 52.2%

Long Context Reasoning

AA · Long context · Objective

It checks whether long-context claims survive contact with retrieval, memory, or long-document tasks.

Rank #202 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 17.3%
Percentile: 36.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `lcr`.

36.2% percentile inside its fair comparison set

Raw benchmark value: 17.3%

CorpFin v2

VALS-AI · Long context · Objective

It checks whether long-context claims survive contact with retrieval, memory, or long-document tasks.

Rank #29 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 63.7%
Percentile: 68.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: corp_fin_v2; provider: xAI.

68.2% percentile inside its fair comparison set

Raw benchmark value: 63.7% · CI 61.8% - 65.5%
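The section header figures for Search / tool use (73.6%) and Long context (52.2%) are consistent with an unweighted mean of the card percentiles in each section. The page does not state the aggregation rule, so the following Python sketch is a check under that assumption; `section_mean` is a hypothetical name.

```python
def section_mean(percentiles):
    """Unweighted mean of card percentiles (assumed section aggregation)."""
    if not percentiles:
        raise ValueError("need at least one percentile")
    return sum(percentiles) / len(percentiles)


# (card percentiles, header figure) copied from the two sections above.
sections = {
    "Search / tool use": ([64.1, 83.3, 73.3], 73.6),
    "Long context": ([36.2, 68.2], 52.2),
}

for name, (percentiles, header) in sections.items():
    mean = section_mean(percentiles)
    print(f"{name}: mean {mean:.1f}% vs header {header}%")
```

A two-card section such as Long context moves sharply with either card, so its header figure is less stable than one averaged over many cards.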

Vision understanding · 18 benchmarks · 74.8%

MMMU-Pro

AA · Vision understanding · Objective

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #51 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 64.9%
Percentile: 63%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `mmmuPro`.

63% percentile inside its fair comparison set

Raw benchmark value: 64.9%

Vision Arena

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #19 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,251
Percentile: 83.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: overall. Source rank: #25. Votes: 16615. Organization: xai. License: Proprietary.

83.5% percentile inside its fair comparison set

Raw benchmark value: 1,251 · CI 1,244 - 1,258

Vision Arena · Creative Writing Vision

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #7 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,280
Percentile: 89.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: creative_writing_vision. Source rank: #9. Votes: 904. Organization: xai. License: Proprietary.

89.1% percentile inside its fair comparison set

Raw benchmark value: 1,280 · CI 1,259 - 1,301

Vision Arena · Diagram

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #22 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,271
Percentile: 70%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: diagram. Source rank: #29. Votes: 4362. Organization: xai. License: Proprietary.

70% percentile inside its fair comparison set

Raw benchmark value: 1,271 · CI 1,261 - 1,282

Vision Arena · English

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #19 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,252
Percentile: 83.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: english. Source rank: #24. Votes: 6998. Organization: xai. License: Proprietary.

83.5% percentile inside its fair comparison set

Raw benchmark value: 1,252 · CI 1,243 - 1,262

Vision Arena · Entity Recognition

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #9 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,251
Percentile: 75%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: entity_recognition. Source rank: #8. Votes: 78. Organization: xai. License: Proprietary.

75% percentile inside its fair comparison set

Raw benchmark value: 1,251 · CI 1,187 - 1,316
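Interval width tracks vote count in the Arena rows above: the Entity Recognition row (78 votes) spans 1,187 to 1,316, while the overall Vision Arena row (16,615 votes) spans 1,244 to 1,258. A small Python sketch for flagging wide intervals follows; the `ArenaRow` type and the 20-point threshold are hypothetical and not part of the page or of Arena's methodology.

```python
from dataclasses import dataclass


@dataclass
class ArenaRow:
    category: str
    score: int
    ci_low: int
    ci_high: int
    votes: int

    @property
    def half_width(self) -> float:
        """Half the distance between the interval bounds, in rating points."""
        return (self.ci_high - self.ci_low) / 2


# Values copied from the two Vision Arena cards referenced above.
rows = [
    ArenaRow("overall", 1251, 1244, 1258, 16615),
    ArenaRow("entity_recognition", 1251, 1187, 1316, 78),
]

MAX_HALF_WIDTH = 20  # hypothetical cutoff, chosen only for illustration

noisy = [row.category for row in rows if row.half_width > MAX_HALF_WIDTH]
for row in rows:
    print(f"{row.category}: +/-{row.half_width} points from {row.votes} votes")
print("wide intervals:", noisy)
```

Both rows share the same raw value of 1,251, so the interval is the only field on the card that separates a well-sampled score from a thinly sampled one.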

Vision Arena · Homework

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #26 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,272
Percentile: 63.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: homework. Source rank: #33. Votes: 2387. Organization: xai. License: Proprietary.

63.2% percentile inside its fair comparison set

Raw benchmark value: 1,272 · CI 1,259 - 1,285

Vision Arena · Humor

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #10 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,262
Percentile: 81.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: humor. Source rank: #13. Votes: 530. Organization: xai. License: Proprietary.

81.6% percentile inside its fair comparison set

Raw benchmark value: 1,262 · CI 1,235 - 1,288

Vision Arena · OCR

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #23 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,258
Percentile: 68.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: ocr. Source rank: #30. Votes: 11893. Organization: xai. License: Proprietary.

68.6% percentile inside its fair comparison set

Raw benchmark value: 1,258 · CI 1,251 - 1,265

Vision Arena · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #22 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,257
Percentile: 80.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: overall. Source rank: #28. Votes: 16615. Organization: xai. License: Proprietary.

80.7% percentile inside its fair comparison set

Raw benchmark value: 1,257 · CI 1,250 - 1,264

Vision Arena · Creative Writing Vision · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #8 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,296
Percentile: 87.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: creative_writing_vision. Source rank: #10. Votes: 904. Organization: xai. License: Proprietary.

87.3% percentile inside its fair comparison set

Raw benchmark value: 1,296 · CI 1,275 - 1,316

Vision Arena · Diagram · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #26 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,263
Percentile: 64.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: diagram. Source rank: #32. Votes: 4362. Organization: xai. License: Proprietary.

64.3% percentile inside its fair comparison set

Raw benchmark value: 1,263 · CI 1,253 - 1,274

Vision Arena · English · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #21 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,261
Percentile: 81.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: english. Source rank: #26. Votes: 6998. Organization: xai. License: Proprietary.

81.7% percentile inside its fair comparison set

Raw benchmark value: 1,261 · CI 1,252 - 1,271

Vision Arena · Entity Recognition · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #9 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,282
Percentile: 75%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: entity_recognition. Source rank: #10. Votes: 78. Organization: xai. License: Proprietary.

75% percentile inside its fair comparison set

Raw benchmark value: 1,282 · CI 1,217 - 1,346

Vision Arena · Homework · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #27 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,274
Percentile: 61.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: homework. Source rank: #34. Votes: 2387. Organization: xai. License: Proprietary.

61.8% percentile inside its fair comparison set

Raw benchmark value: 1,274 · CI 1,261 - 1,287

Vision Arena · Humor · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #11 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,278
Percentile: 79.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: humor. Source rank: #14. Votes: 530. Organization: xai. License: Proprietary.

79.6% percentile inside its fair comparison set

Raw benchmark value: 1,278 · CI 1,252 - 1,305

Vision Arena · OCR · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #25 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,259
Percentile: 65.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: ocr. Source rank: #31. Votes: 11893. Organization: xai. License: Proprietary.

65.7% percentile inside its fair comparison set

Raw benchmark value: 1,259 · CI 1,252 - 1,266

MMMU Pro

VALS-AI · Vision understanding · Objective

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #17 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 83.5%
Percentile: 72.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: mmmu; provider: xAI.

72.4% percentile inside its fair comparison set

Raw benchmark value: 83.5% · CI 81.7% - 85.2%

Document understanding · 4 benchmarks · 14.6%

Document Arena

AR · Document understanding · Human

It matters when the job is reading PDFs, tables, forms, or mixed-layout documents rather than plain chat.

Rank #18 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,437
Percentile: 29.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: overall. Source rank: #21. Votes: 14105. Organization: xai. License: Proprietary.

29.2% percentile inside its fair comparison set

Raw benchmark value: 1,437 · CI 1,429 - 1,445

Document Arena · No Style Control

AR · Document understanding · Human

It matters when the job is reading PDFs, tables, forms, or mixed-layout documents rather than plain chat.

Rank #22 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,410
Percentile: 12.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: overall. Source rank: #25. Votes: 14105. Organization: xai. License: Proprietary.

12.5% percentile inside its fair comparison set

Raw benchmark value: 1,410 · CI 1,402 - 1,417

Vals Multimodal Index

VALS-AI · Document understanding · Combined

It matters when the job is reading PDFs, tables, forms, or mixed-layout documents rather than plain chat.

Rank #19 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 39
Percentile: 5.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: vals_multimodal_index; provider: xAI.

5.3% percentile inside its fair comparison set

Raw benchmark value: 39 · CI 38 - 40

MortgageTax

VALS-AI · Document understanding · Objective

It matters when the job is reading PDFs, tables, forms, or mixed-layout documents rather than plain chat.

Rank #54 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 45.4%
Percentile: 11.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: mortgage_tax; provider: xAI.

11.7% percentile inside its fair comparison set

Raw benchmark value: 45.4% · CI 43.4% - 47.3%

Multilingual · 16 benchmarks · 91.1%

Text Arena · Chinese

AR · Multilingual · Human

Observed user preference in Arena's Text Arena chinese leaderboard.

Rank #15 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,513
Percentile: 95.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: chinese. Source rank: #19. Votes: 1398. Organization: xai. License: Proprietary.

95.3% percentile inside its fair comparison set

Raw benchmark value: 1,513 · CI 1,496 - 1,530

Text Arena · French

AR · Multilingual · Human

Observed user preference in Arena's Text Arena french leaderboard.

Rank #13 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,500
Percentile: 94.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: french. Source rank: #15. Votes: 1386. Organization: xai. License: Proprietary.

94.4% percentile inside its fair comparison set

Raw benchmark value: 1,500 · CI 1,481 - 1,518

Text Arena · German

AR · Multilingual · Human

Observed user preference in Arena's Text Arena german leaderboard.

Rank #7 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,487
Percentile: 97.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: german. Source rank: #9. Votes: 437. Organization: xai. License: Proprietary.

97.5% percentile inside its fair comparison set

Raw benchmark value: 1,487 · CI 1,458 - 1,515

Text Arena · Japanese

AR · Multilingual · Human

Observed user preference in Arena's Text Arena japanese leaderboard.

Rank #10 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,469
Percentile: 95.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: japanese. Source rank: #12. Votes: 207. Organization: xai. License: Proprietary.

95.6% percentile inside its fair comparison set

Raw benchmark value: 1,469 · CI 1,426 - 1,511

Text Arena · Korean

AR · Multilingual · Human

Observed user preference in Arena's Text Arena korean leaderboard.

Rank #13 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,444
Percentile: 94.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: korean. Source rank: #14. Votes: 634. Organization: xai. License: Proprietary.

94.2% percentile inside its fair comparison set

Raw benchmark value: 1,444 · CI 1,418 - 1,470

Text Arena · Russian

AR · Multilingual · Human

Observed user preference in Arena's Text Arena russian leaderboard.

Rank #12 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,483
Percentile: 96.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: russian. Source rank: #15. Votes: 4628. Organization: xai. License: Proprietary.

96.2% percentile inside its fair comparison set

Raw benchmark value: 1,483 · CI 1,473 - 1,493

Text Arena · Spanish

AR · Multilingual · Human

Observed user preference in Arena's Text Arena spanish leaderboard.

Rank #12 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,471
Percentile: 94.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: spanish. Source rank: #14. Votes: 880. Organization: xai. License: Proprietary.

94.9% percentile inside its fair comparison set

Raw benchmark value: 1,471 · CI 1,450 - 1,492

Text Arena · Chinese · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena chinese leaderboard.

Rank #32 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,483
Percentile: 89.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: chinese. Source rank: #37. Votes: 2241. Organization: xai. License: Proprietary.

89.5% percentile inside its fair comparison set

Raw benchmark value: 1,483 · CI 1,469 - 1,497

Text Arena · French · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena french leaderboard.

Rank #17 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,474
Percentile: 92.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: french. Source rank: #21. Votes: 1386. Organization: xai. License: Proprietary.

92.6% percentile inside its fair comparison set

Raw benchmark value: 1,474 · CI 1,456 - 1,492

Text Arena · German · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena german leaderboard.

Rank #13 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,462
Percentile: 94.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: german. Source rank: #16. Votes: 437. Organization: xai. License: Proprietary.

94.9% percentile inside its fair comparison set

Raw benchmark value: 1,462 · CI 1,434 - 1,490

Text Arena · Japanese · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena japanese leaderboard.

Rank #13 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,441
Percentile: 94.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: japanese. Source rank: #17. Votes: 207. Organization: xai. License: Proprietary.

94.1% percentile inside its fair comparison set

Raw benchmark value: 1,441 · CI 1,398 - 1,483

Text Arena · Korean · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena korean leaderboard.

Rank #14 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,428
Percentile: 93.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: korean. Source rank: #18. Votes: 634. Organization: xai. License: Proprietary.

93.8% percentile inside its fair comparison set

Raw benchmark value: 1,428 · CI 1,403 - 1,454

Text Arena · Russian · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena russian leaderboard.

Rank #17 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,461
Percentile: 94.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: russian. Source rank: #21. Votes: 4628. Organization: xai. License: Proprietary.

94.5% percentile inside its fair comparison set

Raw benchmark value: 1,461 · CI 1,452 - 1,471

Text Arena · Spanish · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena spanish leaderboard.

Rank #32 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,446
Percentile: 85.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: spanish. Source rank: #38. Votes: 1302. Organization: xai. License: Proprietary.

85.5% percentile inside its fair comparison set

Raw benchmark value: 1,446 · CI 1,428 - 1,464

Vision Arena · Chinese

AR · Multilingual · Human

Observed user preference in Arena's Vision Arena chinese leaderboard.

Rank #22 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,298
Percentile: 72.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: chinese. Source rank: #27. Votes: 950. Organization: xai. License: Proprietary.

72.7% percentile inside its fair comparison set

Raw benchmark value: 1,298 (CI 1,274 - 1,322)

Vision Arena · Chinese · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Vision Arena Chinese leaderboard.

Rank #23 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,303
Percentile: 71.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: chinese. Source rank: #28. Votes: 950. Organization: xai. License: Proprietary.

71.4% percentile inside its fair comparison set

Raw benchmark value: 1,303 (CI 1,279 - 1,327)
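
The Arena provenance sentences in these rows follow a regular "Key: value." pattern, so they can be split into fields mechanically. The sketch below is only an illustration of that idea: the function name `parse_arena_provenance` and the output key names are hypothetical, and the pattern is inferred from the rows shown on this page rather than from any documented Arena format.

```python
import re


def parse_arena_provenance(line: str) -> dict:
    """Split an Arena provenance sentence into labelled fields.

    Hypothetical helper: the "Key: value." layout is inferred from
    the rows on this page, not from a published schema.
    """
    fields = {}

    # The dataset row name is wrapped in backticks.
    row = re.search(r"dataset row `([^`]+)`", line)
    if row:
        fields["row"] = row.group(1)

    # Every other field looks like "Key: value." with a capitalised key.
    for key, value in re.findall(r"([A-Z][A-Za-z ]+): ([^.]+)\.", line):
        fields[key.strip().lower().replace(" ", "_")] = value.strip()

    # Convert the numeric fields so they can be compared or sorted.
    if "source_rank" in fields:
        fields["source_rank"] = int(fields["source_rank"].lstrip("#"))
    if "votes" in fields:
        fields["votes"] = int(fields["votes"].replace(",", ""))
    return fields


example = (
    "Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. "
    "Category: russian. Source rank: #21. Votes: 4628. "
    "Organization: xai. License: Proprietary."
)
record = parse_arena_provenance(example)
print(record)
```

Keeping the vote count as an integer makes it easy to see why some rows have wider intervals: the Korean row rests on 634 votes, the Russian row on 4628.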

Source links and registry checks

official · xAI models and pricing · Jun 20, 2026
official · Arena · Jun 20, 2026 (42 links)
official · Artificial Analysis · Jun 20, 2026
official · LiveBench · Jun 20, 2026


Chat / text · 37 benchmarks · 76.3%

Intelligence Index

AA · Chat / text · Combined

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #108 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 22
Percentile: 72.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `intelligenceIndex`.

72.9% percentile inside its fair comparison set

Raw benchmark value: 22

AA-Omniscience accuracy

AA · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #44 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 26.6%
Percentile: 85.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `omniscienceAccuracy`.

85.6% percentile inside its fair comparison set

Raw benchmark value: 26.6%

AA-Omniscience non-hallucination

AA · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #291 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 3.1%
Percentile: 2.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `omniscienceNonHallucination`.

2.7% percentile inside its fair comparison set

Raw benchmark value: 3.1%

IFBench

AA · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #95 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 49.3%
Percentile: 70.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `ifbench`.

70.2% percentile inside its fair comparison set

Raw benchmark value: 49.3%

Blended price

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #221 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: $3 /1M tokens
Percentile: 21%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `price1mBlended0To3To1`.

21% percentile inside its fair comparison set

Raw benchmark value: $3 /1M tokens

Input price

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #231 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: $2 /1M input tokens
Percentile: 17.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `price1mInputTokens`.

17.8% percentile inside its fair comparison set

Raw benchmark value: $2 /1M input tokens

Output price

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #214 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: $6 /1M output tokens
Percentile: 23.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `price1mOutputTokens`.

23.6% percentile inside its fair comparison set

Raw benchmark value: $6 /1M output tokens
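
The three price rows appear to be related: the blended field name `price1mBlended0To3To1` hints at a 3:1 input-to-output weighting, and that weighting reproduces the listed $3 from the $2 input and $6 output prices. The sketch below works through the arithmetic. The 3:1 ratio is an assumption read off the field name, not something this page states, and `blended_price` is a hypothetical helper.

```python
def blended_price(
    input_price: float,
    output_price: float,
    input_weight: float = 3.0,   # assumed 3 parts input ...
    output_weight: float = 1.0,  # ... to 1 part output
) -> float:
    """Weighted average price per 1M tokens under an assumed token mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Prices per 1M tokens as listed in the rows above.
blended = blended_price(input_price=2.0, output_price=6.0)
print(f"${blended:g} /1M tokens")  # (3 * 2 + 1 * 6) / 4
```

A workload with a different input-to-output mix would see a different effective price, so the weights are worth adjusting before using the blended figure for budgeting.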

Output Speed

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #21 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 226.5 tokens/s
Percentile: 90.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `medianOutputTokensPerSecond`.

90.5% percentile inside its fair comparison set

Raw benchmark value: 226.5 tokens/s

Time to first token

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #180 · Source label: Grok 4.20 0309 v2 (Reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 12.92s
Percentile: 14.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `medianTimeToFirstTokenSeconds`.

14.8% percentile inside its fair comparison set

Raw benchmark value: 12.92s

Time to first answer token

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #104 · Source label: Grok 4.20 0309 v2 (Reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 12.92s
Percentile: 51%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `medianTimeToFirstAnswerTokenSeconds`.

51% percentile inside its fair comparison set

Raw benchmark value: 12.92s
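
The speed and latency rows can be combined into a rough end-to-end estimate: time to first token plus output length divided by output speed. The sketch below is only a back-of-envelope illustration. It assumes a constant streaming rate, and it mixes source labels, since the 226.5 tokens/s figure comes from the Non-reasoning row while the 12.92s figure comes from the Reasoning row, so the result should not be read as a measured value. `estimated_response_seconds` is a hypothetical helper.

```python
def estimated_response_seconds(
    output_tokens: int,
    ttft_seconds: float = 12.92,      # median time to first token (Reasoning row)
    tokens_per_second: float = 226.5,  # median output speed (Non-reasoning row)
) -> float:
    """Rough response time: wait for the first token, then stream the rest."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_seconds + output_tokens / tokens_per_second


estimate = estimated_response_seconds(500)
print(f"{estimate:.2f} s for a 500-token answer")
```

Under these assumptions the wait for the first token dominates short answers, which is consistent with the low time-to-first-token percentile sitting next to a high output-speed percentile.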

Text Arena

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #14 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,474
Percentile: 96%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: overall. Source rank: #18. Votes: 42370. Organization: xai. License: Proprietary.

96% percentile inside its fair comparison set

Raw benchmark value: 1,474 (CI 1,470 - 1,479)

Text Arena · Creative Writing

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #11 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,460
Percentile: 96.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: creative_writing. Source rank: #14. Votes: 4030. Organization: xai. License: Proprietary.

96.9% percentile inside its fair comparison set

Raw benchmark value: 1,460 (CI 1,450 - 1,470)

Text Arena · English

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #16 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,480
Percentile: 95.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: english. Source rank: #18. Votes: 12881. Organization: xai. License: Proprietary.

95.4% percentile inside its fair comparison set

Raw benchmark value: 1,480 (CI 1,474 - 1,487)

Text Arena · Exclude Ties

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #17 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,483
Percentile: 95.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: exclude_ties. Source rank: #19. Votes: 20478. Organization: xai. License: Proprietary.

95.1% percentile inside its fair comparison set

Raw benchmark value: 1,483 (CI 1,477 - 1,489)

Text Arena · Hard Prompts

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #20 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,491
Percentile: 94.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: hard_prompts. Source rank: #25. Votes: 27360. Organization: xai. License: Proprietary.

94.2% percentile inside its fair comparison set

Raw benchmark value: 1,491 (CI 1,486 - 1,496)

Text Arena · Hard Prompts English

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #20 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,489
Percentile: 94.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: hard_prompts_english. Source rank: #26. Votes: 13613. Organization: xai. License: Proprietary.

94.1% percentile inside its fair comparison set

Raw benchmark value: 1,489 (CI 1,483 - 1,496)

Text Arena · Instruction Following

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #29 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,449
Percentile: 91.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: instruction_following. Source rank: #37. Votes: 8420. Organization: xai. License: Proprietary.

91.4% percentile inside its fair comparison set

Raw benchmark value: 1,449 (CI 1,442 - 1,456)

Text Arena · Longer Query

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #31 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,465
Percentile: 90.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: longer_query. Source rank: #39. Votes: 17283. Organization: xai. License: Proprietary.

90.1% percentile inside its fair comparison set

Raw benchmark value: 1,465 (CI 1,459 - 1,471)

Text Arena · Multi Turn

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #18 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,482
Percentile: 94.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: multi_turn. Source rank: #22. Votes: 4401. Organization: xai. License: Proprietary.

94.7% percentile inside its fair comparison set

Raw benchmark value: 1,482 (CI 1,472 - 1,492)

Text Arena · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #21 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,453
Percentile: 93.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: overall. Source rank: #26. Votes: 42370. Organization: xai. License: Proprietary.

93.8% percentile inside its fair comparison set

Raw benchmark value: 1,453 (CI 1,449 - 1,458)
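
The CI ranges give a quick way to judge whether two Arena scores are distinguishable. The sketch below compares the overall Text Arena row (1,474, CI 1,470 - 1,479) with the No Style Control row (1,453, CI 1,449 - 1,458) using a simple interval-overlap check. This is a rough heuristic rather than a formal significance test, and the `Interval` type and `overlaps` helper are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Closed confidence interval as printed in a benchmark row."""
    low: float
    high: float

    @property
    def half_width(self) -> float:
        return (self.high - self.low) / 2


def overlaps(a: Interval, b: Interval) -> bool:
    """True when the two closed intervals share at least one point."""
    return a.low <= b.high and b.low <= a.high


text_arena = Interval(1470, 1479)        # overall row
no_style_control = Interval(1449, 1458)  # No Style Control row

separated = not overlaps(text_arena, no_style_control)
print("intervals overlap:", not separated)
print("half-widths:", text_arena.half_width, no_style_control.half_width)
```

Here the intervals do not overlap, which suggests the gap between the two overall scores is larger than the sampling noise the intervals describe. The same check applied to rows with few votes, where intervals are much wider, will often report an overlap.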

Text Arena · Creative Writing · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #24 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,437
Percentile: 92.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: creative_writing. Source rank: #29. Votes: 6804. Organization: xai. License: Proprietary.

92.9% percentile inside its fair comparison set

Raw benchmark value: 1,437 (CI 1,429 - 1,445)

Text Arena · English · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #21 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,457
Percentile: 93.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: english. Source rank: #26. Votes: 20156. Organization: xai. License: Proprietary.

93.8% percentile inside its fair comparison set

Raw benchmark value: 1,457 (CI 1,452 - 1,463)

Text Arena · Exclude Ties · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #21 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,453
Percentile: 93.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: exclude_ties. Source rank: #26. Votes: 32293. Organization: xai. License: Proprietary.

93.8% percentile inside its fair comparison set

Raw benchmark value: 1,453 (CI 1,447 - 1,458)

Text Arena · Hard Prompts · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #31 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,453
Percentile: 90.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: hard_prompts. Source rank: #38. Votes: 27360. Organization: xai. License: Proprietary.

90.8% percentile inside its fair comparison set

Raw benchmark value: 1,453 (CI 1,448 - 1,458)

Text Arena · Hard Prompts English · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #35 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,454
Percentile: 89.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: hard_prompts_english. Source rank: #42. Votes: 13613. Organization: xai. License: Proprietary.

89.5% percentile inside its fair comparison set

Raw benchmark value: 1,454 (CI 1,448 - 1,460)

Text Arena · Instruction Following · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #38 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,421
Percentile: 88.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: instruction_following. Source rank: #46. Votes: 14000. Organization: xai. License: Proprietary.

88.6% percentile inside its fair comparison set

Raw benchmark value: 1,421 (CI 1,415 - 1,427)

Text Arena · Longer Query · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #41 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,436
Percentile: 86.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: longer_query. Source rank: #49. Votes: 17283. Organization: xai. License: Proprietary.

86.8% percentile inside its fair comparison set

Raw benchmark value: 1,436 (CI 1,430 - 1,442)

Text Arena · Multi Turn · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #25 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,456
Percentile: 92.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: multi_turn. Source rank: #32. Votes: 6754. Organization: xai. License: Proprietary.

92.6% percentile inside its fair comparison set

Raw benchmark value: 1,456 (CI 1,448 - 1,464)

Instruction following

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #21 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 63.4%
Percentile: 81.5%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: IF. Tasks scored: 4.

81.5% percentile inside its fair comparison set

Raw benchmark value: 63.4%

Language

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #30 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 77.7%
Percentile: 73.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Language. Tasks scored: 3.

73.1% percentile inside its fair comparison set

Raw benchmark value: 77.7%

Paraphrase

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #32 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 58.9%
Percentile: 71.3%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: paraphrase. Category: IF.

71.3% percentile inside its fair comparison set

Raw benchmark value: 58.9%

Simplify

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #18 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 60.7%
Percentile: 84.3%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: simplify. Category: IF.

84.3% percentile inside its fair comparison set

Raw benchmark value: 60.7%

Story generation

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #19 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 67.9%
Percentile: 83.3%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: story_generation. Category: IF.

83.3% percentile inside its fair comparison set

Raw benchmark value: 67.9%

Summarize

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #23 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 66.1%
Percentile: 79.6%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: summarize. Category: IF.

79.6% percentile inside its fair comparison set

Raw benchmark value: 66.1%

Connections

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #9 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 100%
Percentile: 100%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: connections. Category: Language.

100% percentile inside its fair comparison set

Raw benchmark value: 100%

Plot unscrambling

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #61 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 51.2%
Percentile: 44.4%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: plot_unscrambling. Category: Language.

44.4% percentile inside its fair comparison set

Raw benchmark value: 51.2%

Typos

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #24 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 82%
Percentile: 86%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: typos. Category: Language.

86% percentile inside its fair comparison set

Raw benchmark value: 82%

Coding · 21 benchmarks · 47%

Terminal-Bench Hard

AA · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #117 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 16.7%
Percentile: 61.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `terminalbenchHard`.

61.6% percentile inside its fair comparison set

Raw benchmark value: 16.7%

SciCode

AA · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #154 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 32.8%
Percentile: 58.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `scicode`.

58.4% percentile inside its fair comparison set

Raw benchmark value: 32.8%

Code Arena

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #42 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,385
Percentile: 43.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: overall. Source rank: #50. Votes: 9535. Organization: xai. License: Proprietary.

43.8% percentile inside its fair comparison set

Raw benchmark value: 1,385 (CI 1,378 - 1,391)

WebDev Arena

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #42 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,385
Percentile: 43.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: webdev. Source rank: #50. Votes: 9535. Organization: xai. License: Proprietary.

43.8% percentile inside its fair comparison set

Raw benchmark value: 1,385 (CI 1,378 - 1,391)

Code Arena · Webdev Html

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #45 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,370
Percentile: 39.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: webdev-html. Source rank: #53. Votes: 1155. Organization: xai. License: Proprietary.

39.7% percentile inside its fair comparison set

Raw benchmark value: 1,370 (CI 1,352 - 1,388)

Code Arena · Webdev React

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #35 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,381
Percentile: 42.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: webdev-react. Source rank: #44. Votes: 8351. Organization: xai. License: Proprietary.

42.4% percentile inside its fair comparison set

Raw benchmark value: 1,381 (CI 1,374 - 1,388)

LiveCodeBench

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #22 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 84.3%
Percentile: 76.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: lcb; provider: xAI.

76.7% percentile inside its fair comparison set

Raw benchmark value: 84.3% (CI 82.3% - 86.3%)

SWE-bench Verified

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #29 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 72.2%
Percentile: 48.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: swebench; provider: xAI.

48.1% percentile inside its fair comparison set

Raw benchmark value: 72.2% (CI 68.3% - 76.1%)

Terminal-Bench 2.1

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #23 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 44.2%
Percentile: 18.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: terminal-bench-2-1; provider: xAI.

18.5% percentile inside its fair comparison set

Raw benchmark value: 44.2% (CI 42.3% - 46.1%)

Vibe Code Bench v1.1

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #42 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 4.1%
Percentile: 16.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: vibe-code; provider: xAI.

16.3% percentile inside its fair comparison set

Raw benchmark value: 4.1% (CI 0% - 8.1%)

Text Arena · Coding

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #21 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,512
Percentile: 93.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: coding. Source rank: #27. Votes: 11388. Organization: xai. License: Proprietary.

93.8% percentile inside its fair comparison set

Raw benchmark value: 1,512 (CI 1,506 - 1,519)

Text Arena · Coding · No Style Control

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #38 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,460
Percentile: 88.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: coding. Source rank: #47. Votes: 11388. Organization: xai. License: Proprietary.

88.4% percentile inside its fair comparison set

Raw benchmark value: 1,460 (CI 1,453 - 1,466)

IOI

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #10 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 30.2%
Percentile: 79.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: ioi; provider: xAI.

79.5% percentile inside its fair comparison set

Raw benchmark value: 30.2% (CI 15.5% - 44.8%)

HiL-Bench

SL · Coding · Rubric

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #6

verified runtime · exact direct

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 8%
Percentile: 0%
Last updated: recent
Eligibility: headline eligible

0% percentile inside its fair comparison set

Raw benchmark value: 8%

Agentic coding

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #60 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 43.3%
Percentile: 47.2%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Agentic Coding. Tasks scored: 3.

47.2% percentile inside its fair comparison set

Raw benchmark value: 43.3%

Coding

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #89 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 66.1%
Percentile: 18.5%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Coding. Tasks scored: 2.

18.5% percentile inside its fair comparison set

Raw benchmark value: 66.1%

JavaScript

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #83 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 25%
Percentile: 29.6%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: javascript. Category: Agentic Coding.

29.6% percentile inside its fair comparison set

Raw benchmark value: 25%

TypeScript

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #57 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 35%
Percentile: 56.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: typescript. Category: Agentic Coding.

56.1% percentile inside its fair comparison set

Raw benchmark value: 35%

Python

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #39 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 70%
Percentile: 77.8%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: python. Category: Agentic Coding.

77.8% percentile inside its fair comparison set

Raw benchmark value: 70%

Coding generation

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #98 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 64.8%
Percentile: 10.2%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: code_generation. Category: Coding.

10.2% percentile inside its fair comparison set

Raw benchmark value: 64.8%

Coding completion

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #76 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 67.4%
Percentile: 36.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: code_completion. Category: Coding.

36.1% percentile inside its fair comparison set

Raw benchmark value: 67.4%

Reasoning / math / science · 18 benchmarks · 74.7%

Humanity's Last Exam

AA · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #28 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 24.2%
Percentile: 92.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `hle`.

92.7% percentile inside its fair comparison set

Raw benchmark value: 24.2%

GPQA

AA · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #82 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 77.6%
Percentile: 78.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `gpqa`.

78.3% percentile inside its fair comparison set

Raw benchmark value: 77.6%

CritPt

AA · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #76 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 0.3%
Percentile: 75.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `critpt`.

75.2% percentile inside its fair comparison set

Raw benchmark value: 0.3%

ProofBench

VALS-AI · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #25 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 14%
Percentile: 31.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: proof_bench; provider: xAI.

31.4% percentile inside its fair comparison set

Raw benchmark value: 14% (CI 7.2% - 20.8%)

GPQA Diamond

VALS-AI · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #17 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 88.6%
Percentile: 82%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: gpqa; provider: xAI.

82% percentile inside its fair comparison set

Raw benchmark value: 88.6% (CI 85.5% - 91.8%)

MMLU Pro

VALS-AI · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #26 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 86.3%
Percentile: 71.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: mmlu_pro; provider: xAI.

71.9% percentile inside its fair comparison set

Raw benchmark value: 86.3% (CI 85.6% - 86.9%)

Text Arena · Math

AR · Reasoning / math / science · Human

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #19 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,470
Percentile: 94.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: math. Source rank: #23. Votes: 2399. Organization: xai. License: Proprietary.

94.3% percentile inside its fair comparison set

Raw benchmark value: 1,470 (CI 1,457 - 1,482)

Text Arena · Math · No Style Control

AR · Reasoning / math / science · Human

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #24 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,460
Percentile: 92.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: math. Source rank: #28. Votes: 2399. Organization: xai. License: Proprietary.

92.7% percentile inside its fair comparison set

Raw benchmark value: 1,460 (CI 1,447 - 1,472)

Mathematics

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #20 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 87.1%
Percentile: 82.4%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Mathematics. Tasks scored: 4.

82.4% percentile inside its fair comparison set

Raw benchmark value: 87.1%

Reasoning

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #39 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 75.3%
Percentile: 64.8%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Reasoning. Tasks scored: 4.

64.8% percentile inside its fair comparison set

Raw benchmark value: 75.3%

AMPS Hard

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #24 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 98%
Percentile: 94.4%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: AMPS_Hard. Category: Mathematics.

94.4% percentile inside its fair comparison set

Raw benchmark value: 98%

Integrals with game

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #11 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 81%
Percentile: 91.7%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: integrals_with_game. Category: Mathematics.

91.7% percentile inside its fair comparison set

Raw benchmark value: 81%

Math competition

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #36 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 94.1%
Percentile: 74.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: math_comp. Category: Mathematics.

74.1% percentile inside its fair comparison set

Raw benchmark value: 94.1%

Olympiad

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #81 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 75.1%
Percentile: 25.9%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: olympiad. Category: Mathematics.

25.9% percentile inside its fair comparison set

Raw benchmark value: 75.1%

Theory of mind

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #51 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 65.4%
Percentile: 55.6%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: theory_of_mind. Category: Reasoning.

55.6% percentile inside its fair comparison set

Raw benchmark value: 65.4%

Zebra puzzle

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #38 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 67.8%
Percentile: 65.4%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: zebra_puzzle. Category: Reasoning.

65.4% percentile inside its fair comparison set

Raw benchmark value: 67.8%

Spatial

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #26 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 98%
Percentile: 88.9%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: spatial. Category: Reasoning.

88.9% percentile inside its fair comparison set

Raw benchmark value: 98%

Logic with navigation

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #26 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 70%
Percentile: 83.3%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: logic_with_navigation. Category: Reasoning.

83.3% percentile inside its fair comparison set

Raw benchmark value: 70%
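The 74.7% shown in the Reasoning / math / science header is consistent with the unweighted mean of the 18 percentiles listed in the cards above. The sketch below checks that arithmetic; that the page actually aggregates categories this way is an assumption, and the function name `category_score` is hypothetical.

```python
# Percentiles copied from the 18 Reasoning / math / science cards, in page order.
percentiles = [
    92.7,  # Humanity's Last Exam
    78.3,  # GPQA
    75.2,  # CritPt
    31.4,  # ProofBench
    82.0,  # GPQA Diamond
    71.9,  # MMLU Pro
    94.3,  # Text Arena · Math
    92.7,  # Text Arena · Math · No Style Control
    82.4,  # Mathematics
    64.8,  # Reasoning
    94.4,  # AMPS Hard
    91.7,  # Integrals with game
    74.1,  # Math competition
    25.9,  # Olympiad
    55.6,  # Theory of mind
    65.4,  # Zebra puzzle
    88.9,  # Spatial
    83.3,  # Logic with navigation
]


def category_score(values):
    """Unweighted arithmetic mean, rounded to one decimal (assumed aggregation)."""
    return round(sum(values) / len(values), 1)


print(len(percentiles), category_score(percentiles))  # prints: 18 74.7
```

Because the mean is unweighted, a single weak card such as Olympiad (25.9%) or ProofBench (31.4%) pulls the category figure down as much as any strong card lifts it.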

Professional reasoning · 31 benchmarks · 69.6%

APEX-Agents-AA

AA · Professional reasoning · Objective

Long-horizon agentic task completion.

Rank #18 · Source label: Grok 4.20 0309 (Reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 14.2%
Percentile: 29.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `apexAgents`.

29.2% percentile inside its fair comparison set

Raw benchmark value: 14.2%

Vals Index

VALS-AI · Professional reasoning · Combined

Weighted model performance across economically relevant Vals tasks.

Rank #25 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 39
Percentile: 7.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: vals_index; provider: xAI.

7.7% percentile inside its fair comparison set

Raw benchmark value: 39 (CI 38 - 41)

LegalBench

VALS-AI · Professional reasoning · Objective

Academic legal reasoning tasks.

Rank #66 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 77.7%
Percentile: 27.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: legal_bench; provider: xAI.

27.8% percentile inside its fair comparison set

Raw benchmark value: 77.7% (CI 76.8% - 78.7%)

Finance Agent v2

VALS-AI · Professional reasoning · Objective

Core financial analyst tasks for agentic models.

Rank #25 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 28.5%
Percentile: 4%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: fabv2; provider: xAI.

4% percentile inside its fair comparison set

Raw benchmark value: 28.5% (CI 27.9% - 29.1%)

TaxEval v2

VALS-AI · Professional reasoning · Objective

Answer quality on tax questions and responses.

Rank #24 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 74.1%
Percentile: 74.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: tax_eval_v2; provider: xAI.

74.7% percentile inside its fair comparison set

Raw benchmark value: 74.1% (CI 72.4% - 75.8%)

MedCode

VALS-AI · Professional reasoning · Objective

Medical billing support and coding tasks.

Rank #48 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 32.2%
Percentile: 7.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: medcode; provider: xAI.

7.8% percentile inside its fair comparison set

Raw benchmark value: 32.2% (CI 28% - 36.3%)

MedScribe

VALS-AI · Professional reasoning · Objective

Administrative documentation support for doctors.

Rank #50 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 63.4%
Percentile: 2%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: medscribe; provider: xAI.

2% percentile inside its fair comparison set

Raw benchmark value: 63.4% (CI 59.3% - 67.5%)

Text Arena · Expert

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena expert leaderboard.

Rank #27 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,484
Percentile: 90.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: expert. Source rank: #34. Votes: 3775. Organization: xai. License: Proprietary.

90.5% percentile inside its fair comparison set

Raw benchmark value: 1,484 (CI 1,474 - 1,495)

Text Arena · Industry Business And Management And Financial Operations

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_business_and_management_and_financial_operations leaderboard.

Rank #16 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,470
Percentile: 95.3%
Last updated: recent
Eligibility: headline eligible

95.3% percentile inside its fair comparison set

Raw benchmark value: 1,470 (CI 1,461 - 1,479)

Text Arena · Industry Entertainment And Sports And Media

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_entertainment_and_sports_and_media leaderboard.

Rank #17 · Source label: grok-4.20-beta-0309-reasoning

verified runtimeexact alias

Raw row drilldownsource row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,447
Percentile: 95%
Last updated: recent
Eligibility: headline eligible

95% percentile inside its fair comparison set

Raw benchmark value: 1,447 (CI 1,439 - 1,454)

Text Arena · Industry Legal And Government

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_legal_and_government leaderboard.

Rank #16 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,478
Percentile: 95%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: industry_legal_and_government. Source rank: #22. Votes: 3282. Organization: xai. License: Proprietary.

95% percentile inside its fair comparison set

Raw benchmark value: 1,478 (CI 1,467 - 1,489)

Text Arena · Industry Life And Physical And Social Science

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_life_and_physical_and_social_science leaderboard.

Rank #24 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,483
Percentile: 92.9%
Last updated: recent
Eligibility: headline eligible

92.9% percentile inside its fair comparison set

Raw benchmark value: 1,483 (CI 1,475 - 1,491)

Text Arena · Industry Mathematical

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_mathematical leaderboard.

Rank #28 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,464
Percentile: 91.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: industry_mathematical. Source rank: #34. Votes: 2321. Organization: xai. License: Proprietary.

91.2% percentile inside its fair comparison set

Raw benchmark value: 1,464 (CI 1,450 - 1,477)

Text Arena · Industry Medicine And Healthcare

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_medicine_and_healthcare leaderboard.

Rank #12 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,495
Percentile: 96.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: industry_medicine_and_healthcare. Source rank: #14. Votes: 1953. Organization: xai. License: Proprietary.

96.3% percentile inside its fair comparison set

Raw benchmark value: 1,495 (CI 1,481 - 1,509)

Text Arena · Industry Software And IT Services

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_software_and_it_services leaderboard.

Rank #18 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,506
Percentile: 94.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: industry_software_and_it_services. Source rank: #22. Votes: 16466. Organization: xai. License: Proprietary.

94.8% percentile inside its fair comparison set

Raw benchmark value: 1,506 (CI 1,500 - 1,512)

Text Arena · Industry Writing And Literature And Language

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_writing_and_literature_and_language leaderboard.

Rank #24 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,451
Percentile: 92.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: industry_writing_and_literature_and_language. Source rank: #30. Votes: 6118. Organization: xai. License: Proprietary.

92.9% percentile inside its fair comparison set

Raw benchmark value: 1,451 (CI 1,443 - 1,459)

Text Arena · Expert · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena expert leaderboard.

Rank #39 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,449
Percentile: 86.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: expert. Source rank: #47. Votes: 3775. Organization: xai. License: Proprietary.

86.2% percentile inside its fair comparison set

Raw benchmark value: 1,449 (CI 1,439 - 1,460)

Text Arena · Industry Business And Management And Financial Operations · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_business_and_management_and_financial_operations leaderboard.

Rank #33 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,434
Percentile: 89.9%
Last updated: recent
Eligibility: headline eligible

89.9% percentile inside its fair comparison set

Raw benchmark value: 1,434 (CI 1,426 - 1,441)

Text Arena · Industry Entertainment And Sports And Media · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_entertainment_and_sports_and_media leaderboard.

Rank #21 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,428
Percentile: 93.8%
Last updated: recent
Eligibility: headline eligible

93.8% percentile inside its fair comparison set

Raw benchmark value: 1,428 (CI 1,420 - 1,435)

Text Arena · Industry Legal And Government · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_legal_and_government leaderboard.

Rank #29 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,454
Percentile: 90.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: industry_legal_and_government. Source rank: #35. Votes: 3282. Organization: xai. License: Proprietary.

90.6% percentile inside its fair comparison set

Raw benchmark value: 1,454 (CI 1,443 - 1,465)

Text Arena · Industry Life And Physical And Social Science · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_life_and_physical_and_social_science leaderboard.

Rank #31 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,457
Percentile: 90.7%
Last updated: recent
Eligibility: headline eligible

90.7% percentile inside its fair comparison set

Raw benchmark value: 1,457 · CI 1,449 - 1,465

Text Arena · Industry Mathematical · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_mathematical leaderboard.

Rank #34 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,449
Percentile: 89.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: industry_mathematical. Source rank: #41. Votes: 2321. Organization: xai. License: Proprietary.

89.3% percentile inside its fair comparison set

Raw benchmark value: 1,449 · CI 1,436 - 1,462

Text Arena · Industry Medicine And Healthcare · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_medicine_and_healthcare leaderboard.

Rank #35 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,457
Percentile: 88.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: industry_medicine_and_healthcare. Source rank: #38. Votes: 2979. Organization: xai. License: Proprietary.

88.5% percentile inside its fair comparison set

Raw benchmark value: 1,457 · CI 1,445 - 1,469

Text Arena · Industry Software And IT Services · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_software_and_it_services leaderboard.

Rank #31 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,464
Percentile: 90.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: industry_software_and_it_services. Source rank: #38. Votes: 16466. Organization: xai. License: Proprietary.

90.8% percentile inside its fair comparison set

Raw benchmark value: 1,464 · CI 1,458 - 1,470

Text Arena · Industry Writing And Literature And Language · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_writing_and_literature_and_language leaderboard.

Rank #25 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,436
Percentile: 92.6%
Last updated: recent
Eligibility: headline eligible

92.6% percentile inside its fair comparison set

Raw benchmark value: 1,436 · CI 1,429 - 1,443

SAGE

VALS-AI · Professional reasoning · Objective

Student Assessment with Generative Evaluation.

Rank #31 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 38.2%
Percentile: 33.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: sage; provider: xAI.

33.3% percentile inside its fair comparison set

Raw benchmark value: 38.2% · CI 31.5% - 45%

Data analysis

LB · Professional reasoning · Objective

Structured data manipulation and table reasoning accuracy.

Rank #42 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 62.9%
Percentile: 62%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Data Analysis. Tasks scored: 3.

62% percentile inside its fair comparison set

Raw benchmark value: 62.9%

Overall

LB · Professional reasoning · Objective

Average objective performance across LiveBench's current public category mix.

Rank #44 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 68%
Percentile: 60.2%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category averages included: 7.

60.2% percentile inside its fair comparison set

Raw benchmark value: 68%

Consecutive events

LB · Professional reasoning · Objective

Objective consecutive events score in LiveBench.

Rank #40 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 50.8%
Percentile: 63.9%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: consecutive_events. Category: Data Analysis.

63.9% percentile inside its fair comparison set

Raw benchmark value: 50.8%

Table join

LB · Professional reasoning · Objective

Objective table join score in LiveBench.

Rank #78 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 37.8%
Percentile: 28.7%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: tablejoin. Category: Data Analysis.

28.7% percentile inside its fair comparison set

Raw benchmark value: 37.8%

Table reformat

LB · Professional reasoning · Objective

Objective table reformat score in LiveBench.

Rank #25 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 100%
Percentile: 100%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: tablereformat. Category: Data Analysis.

100% percentile inside its fair comparison set

Raw benchmark value: 100%
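The Data analysis row reports "Tasks scored: 3", and the three LiveBench task rows listed here (consecutive events, table join, table reformat) all belong to the Data Analysis category. As a rough consistency check, the sketch below assumes the category score is an unweighted mean of those three task scores. The page does not state the formula, so this is an assumption, and the variable names are illustrative only.

```python
from statistics import mean

# Task scores (percent) copied from the three LiveBench Data Analysis rows.
task_scores = {
    "consecutive_events": 50.8,
    "tablejoin": 37.8,
    "tablereformat": 100.0,
}

# Assumption: the category score is the plain average of its task scores.
category_score = round(mean(task_scores.values()), 1)

print(f"Data analysis (recomputed): {category_score}%")
```

Under that assumption the recomputed value agrees with the 62.9% shown in the Data analysis row.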

Search / tool use · 3 benchmarks · 73.6%

Tau2-Bench Telecom

AA · Search / tool use · Objective

It matters when the model must browse, call tools, and recover useful answers from external systems.

Rank #112 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 59.9%
Percentile: 64.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `tau2`.

64.1% percentile inside its fair comparison set

Raw benchmark value: 59.9%

Search Arena

AR · Search / tool use · Human

It matters when the model must browse, call tools, and recover useful answers from external systems.

Rank #6 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,208
Percentile: 83.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: overall. Source rank: #6. Votes: 56367. Organization: xai. License: Proprietary.

83.3% percentile inside its fair comparison set

Raw benchmark value: 1,208 · CI 1,202 - 1,213

Search Arena · No Style Control

AR · Search / tool use · Human

It matters when the model must browse, call tools, and recover useful answers from external systems.

Rank #9 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,206
Percentile: 73.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: overall. Source rank: #9. Votes: 56367. Organization: xai. License: Proprietary.

73.3% percentile inside its fair comparison set

Raw benchmark value: 1,206 · CI 1,201 - 1,212

Long context · 2 benchmarks · 52.2%

Long Context Reasoning

AA · Long context · Objective

It checks whether long-context claims survive contact with retrieval, memory, or long-document tasks.

Rank #202 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 17.3%
Percentile: 36.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `lcr`.

36.2% percentile inside its fair comparison set

Raw benchmark value: 17.3%

CorpFin v2

VALS-AI · Long context · Objective

It checks whether long-context claims survive contact with retrieval, memory, or long-document tasks.

Rank #29 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 63.7%
Percentile: 68.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: corp_fin_v2; provider: xAI.

68.2% percentile inside its fair comparison set

Raw benchmark value: 63.7% · CI 61.8% - 65.5%
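The percentages in the Search / tool use (73.6%) and Long context (52.2%) section headers look consistent with an unweighted mean of the per-benchmark percentiles listed under each header. The page does not document how the section figure is aggregated, so the sketch below is only a check under that assumption, with illustrative names.

```python
from statistics import mean

# Percentiles copied from the benchmark rows under each section header.
section_percentiles = {
    "Search / tool use": [64.1, 83.3, 73.3],  # Tau2-Bench Telecom, Search Arena, Search Arena (No Style Control)
    "Long context": [36.2, 68.2],             # Long Context Reasoning, CorpFin v2
}

# Assumption: a section figure is the plain average of its benchmark percentiles.
section_scores = {
    name: round(mean(values), 1) for name, values in section_percentiles.items()
}

for name, score in section_scores.items():
    print(f"{name}: {score}%")
```

Because the displayed percentiles are already rounded to one decimal place, a recomputed section figure may differ from the header in the last digit.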

Vision understanding · 18 benchmarks · 74.8%

MMMU-Pro

AA · Vision understanding · Objective

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #51 · Source label: Grok 4.20 0309 v2 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 64.9%
Percentile: 63%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `mmmuPro`.

63% percentile inside its fair comparison set

Raw benchmark value: 64.9%

Vision Arena

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #19 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,251
Percentile: 83.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: overall. Source rank: #25. Votes: 16615. Organization: xai. License: Proprietary.

83.5% percentile inside its fair comparison set

Raw benchmark value: 1,251 · CI 1,244 - 1,258

Vision Arena · Creative Writing Vision

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #7 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,280
Percentile: 89.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: creative_writing_vision. Source rank: #9. Votes: 904. Organization: xai. License: Proprietary.

89.1% percentile inside its fair comparison set

Raw benchmark value: 1,280 · CI 1,259 - 1,301

Vision Arena · Diagram

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #22 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,271
Percentile: 70%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: diagram. Source rank: #29. Votes: 4362. Organization: xai. License: Proprietary.

70% percentile inside its fair comparison set

Raw benchmark value: 1,271 · CI 1,261 - 1,282

Vision Arena · English

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #19 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,252
Percentile: 83.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: english. Source rank: #24. Votes: 6998. Organization: xai. License: Proprietary.

83.5% percentile inside its fair comparison set

Raw benchmark value: 1,252 · CI 1,243 - 1,262

Vision Arena · Entity Recognition

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #9 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,251
Percentile: 75%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: entity_recognition. Source rank: #8. Votes: 78. Organization: xai. License: Proprietary.

75% percentile inside its fair comparison set

Raw benchmark value: 1,251 · CI 1,187 - 1,316

Vision Arena · Homework

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #26 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,272
Percentile: 63.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: homework. Source rank: #33. Votes: 2387. Organization: xai. License: Proprietary.

63.2% percentile inside its fair comparison set

Raw benchmark value: 1,272 · CI 1,259 - 1,285

Vision Arena · Humor

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #10 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,262
Percentile: 81.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: humor. Source rank: #13. Votes: 530. Organization: xai. License: Proprietary.

81.6% percentile inside its fair comparison set

Raw benchmark value: 1,262 · CI 1,235 - 1,288

Vision Arena · OCR

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #23 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,258
Percentile: 68.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: ocr. Source rank: #30. Votes: 11893. Organization: xai. License: Proprietary.

68.6% percentile inside its fair comparison set

Raw benchmark value: 1,258 · CI 1,251 - 1,265

Vision Arena · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #22 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,257
Percentile: 80.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: overall. Source rank: #28. Votes: 16615. Organization: xai. License: Proprietary.

80.7% percentile inside its fair comparison set

Raw benchmark value: 1,257 · CI 1,250 - 1,264

Vision Arena · Creative Writing Vision · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #8 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,296
Percentile: 87.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: creative_writing_vision. Source rank: #10. Votes: 904. Organization: xai. License: Proprietary.

87.3% percentile inside its fair comparison set

Raw benchmark value: 1,296 · CI 1,275 - 1,316

Vision Arena · Diagram · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #26 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,263
Percentile: 64.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: diagram. Source rank: #32. Votes: 4362. Organization: xai. License: Proprietary.

64.3% percentile inside its fair comparison set

Raw benchmark value: 1,263 · CI 1,253 - 1,274

Vision Arena · English · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #21 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,261
Percentile: 81.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: english. Source rank: #26. Votes: 6998. Organization: xai. License: Proprietary.

81.7% percentile inside its fair comparison set

Raw benchmark value: 1,261 · CI 1,252 - 1,271

Vision Arena · Entity Recognition · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #9 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,282
Percentile: 75%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: entity_recognition. Source rank: #10. Votes: 78. Organization: xai. License: Proprietary.

75% percentile inside its fair comparison set

Raw benchmark value: 1,282 · CI 1,217 - 1,346

Vision Arena · Homework · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #27 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,274
Percentile: 61.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: homework. Source rank: #34. Votes: 2387. Organization: xai. License: Proprietary.

61.8% percentile inside its fair comparison set

Raw benchmark value: 1,274 · CI 1,261 - 1,287

Vision Arena · Humor · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #11 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,278
Percentile: 79.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: humor. Source rank: #14. Votes: 530. Organization: xai. License: Proprietary.

79.6% percentile inside its fair comparison set

Raw benchmark value: 1,278 · CI 1,252 - 1,305

Vision Arena · OCR · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #25 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,259
Percentile: 65.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: ocr. Source rank: #31. Votes: 11893. Organization: xai. License: Proprietary.

65.7% percentile inside its fair comparison set

Raw benchmark value: 1,259 · CI 1,252 - 1,266

MMMU Pro

VALS-AI · Vision understanding · Objective

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #17 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 83.5%
Percentile: 72.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: mmmu; provider: xAI.

72.4% percentile inside its fair comparison set

Raw benchmark value: 83.5% · CI 81.7% - 85.2%

Document understanding · 4 benchmarks · 14.6%

Document Arena

AR · Document understanding · Human

It matters when the job is reading PDFs, tables, forms, or mixed-layout documents rather than plain chat.

Rank #18 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,437
Percentile: 29.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: overall. Source rank: #21. Votes: 14105. Organization: xai. License: Proprietary.

29.2% percentile inside its fair comparison set

Raw benchmark value: 1,437 · CI 1,429 - 1,445

Document Arena · No Style Control

AR · Document understanding · Human

It matters when the job is reading PDFs, tables, forms, or mixed-layout documents rather than plain chat.

Rank #22 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,410
Percentile: 12.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: overall. Source rank: #25. Votes: 14105. Organization: xai. License: Proprietary.

12.5% percentile inside its fair comparison set

Raw benchmark value: 1,410 · CI 1,402 - 1,417

Vals Multimodal Index

VALS-AI · Document understanding · Combined

It matters when the job is reading PDFs, tables, forms, or mixed-layout documents rather than plain chat.

Rank #19 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 39
Percentile: 5.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: vals_multimodal_index; provider: xAI.

5.3% percentile inside its fair comparison set

Raw benchmark value: 39 · CI 38 - 40

MortgageTax

VALS-AI · Document understanding · Objective

It matters when the job is reading PDFs, tables, forms, or mixed-layout documents rather than plain chat.

Rank #54 · Source label: grok/grok-4.20-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 45.4%
Percentile: 11.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: mortgage_tax; provider: xAI.

11.7% percentile inside its fair comparison set

Raw benchmark value: 45.4% · CI 43.4% - 47.3%

Multilingual · 16 benchmarks · 91.1%

Text Arena · Chinese

AR · Multilingual · Human

Observed user preference in Arena's Text Arena chinese leaderboard.

Rank #15 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,513
Percentile: 95.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: chinese. Source rank: #19. Votes: 1398. Organization: xai. License: Proprietary.

95.3% percentile inside its fair comparison set

Raw benchmark value: 1,513 · CI 1,496 - 1,530

Text Arena · French

AR · Multilingual · Human

Observed user preference in Arena's Text Arena french leaderboard.

Rank #13 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,500
Percentile: 94.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: french. Source rank: #15. Votes: 1386. Organization: xai. License: Proprietary.

94.4% percentile inside its fair comparison set

Raw benchmark value: 1,500 · CI 1,481 - 1,518

Text Arena · German

AR · Multilingual · Human

Observed user preference in Arena's Text Arena german leaderboard.

Rank #7 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,487
Percentile: 97.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: german. Source rank: #9. Votes: 437. Organization: xai. License: Proprietary.

97.5% percentile inside its fair comparison set

Raw benchmark value: 1,487 · CI 1,458 - 1,515

Text Arena · Japanese

AR · Multilingual · Human

Observed user preference in Arena's Text Arena japanese leaderboard.

Rank #10 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,469
Percentile: 95.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: japanese. Source rank: #12. Votes: 207. Organization: xai. License: Proprietary.

95.6% percentile inside its fair comparison set

Raw benchmark value: 1,469 · CI 1,426 - 1,511

Text Arena · Korean

AR · Multilingual · Human

Observed user preference in Arena's Text Arena korean leaderboard.

Rank #13 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,444
Percentile: 94.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: korean. Source rank: #14. Votes: 634. Organization: xai. License: Proprietary.

94.2% percentile inside its fair comparison set

Raw benchmark value: 1,444 · CI 1,418 - 1,470

Text Arena · Russian

AR · Multilingual · Human

Observed user preference in Arena's Text Arena russian leaderboard.

Rank #12 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,483
Percentile: 96.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: russian. Source rank: #15. Votes: 4628. Organization: xai. License: Proprietary.

96.2% percentile inside its fair comparison set

Raw benchmark value: 1,483 · CI 1,473 - 1,493

Text Arena · Spanish

AR · Multilingual · Human

Observed user preference in Arena's Text Arena spanish leaderboard.

Rank #12 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,471
Percentile: 94.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: spanish. Source rank: #14. Votes: 880. Organization: xai. License: Proprietary.

94.9% percentile inside its fair comparison set

Raw benchmark value: 1,471 · CI 1,450 - 1,492

Text Arena · Chinese · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena chinese leaderboard.

Rank #32 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,483
Percentile: 89.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: chinese. Source rank: #37. Votes: 2241. Organization: xai. License: Proprietary.

89.5% percentile inside its fair comparison set

Raw benchmark value: 1,483 · CI 1,469 - 1,497

Text Arena · French · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena french leaderboard.

Rank #17 · Source label: grok-4.20-multi-agent-beta-0309

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,474
Percentile: 92.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-multi-agent-beta-0309`. Category: french. Source rank: #21. Votes: 1386. Organization: xai. License: Proprietary.

92.6% percentile inside its fair comparison set

Raw benchmark value: 1,474 · CI 1,456 - 1,492

Text Arena · German · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena german leaderboard.

Rank #13 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,462
Percentile: 94.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: german. Source rank: #16. Votes: 437. Organization: xai. License: Proprietary.

94.9% percentile inside its fair comparison set

Raw benchmark value: 1,462 · CI 1,434 - 1,490

Text Arena · Japanese · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena japanese leaderboard.

Rank #13 · Source label: grok-4.20-beta1

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,441
Percentile: 94.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta1`. Category: japanese. Source rank: #17. Votes: 207. Organization: xai. License: Proprietary.

94.1% percentile inside its fair comparison set

Raw benchmark value: 1,441 · CI 1,398 - 1,483

Text Arena · Korean · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena korean leaderboard.

Rank #14 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,428
Percentile: 93.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: korean. Source rank: #18. Votes: 634. Organization: xai. License: Proprietary.

93.8% percentile inside its fair comparison set

Raw benchmark value: 1,428 · CI 1,403 - 1,454

Text Arena · Russian · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena russian leaderboard.

Rank #17 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,461
Percentile: 94.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: russian. Source rank: #21. Votes: 4628. Organization: xai. License: Proprietary.

94.5% percentile inside its fair comparison set

Raw benchmark value: 1,461 · CI 1,452 - 1,471

Text Arena · Spanish · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Spanish leaderboard.

Rank #32 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,446
Percentile: 85.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: spanish. Source rank: #38. Votes: 1302. Organization: xai. License: Proprietary.

85.5% percentile inside its fair comparison set

Raw benchmark value: 1,446 · CI 1,428 - 1,464

Vision Arena · Chinese

AR · Multilingual · Human

Observed user preference in Arena's Vision Arena Chinese leaderboard.

Rank #22 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,298
Percentile: 72.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: chinese. Source rank: #27. Votes: 950. Organization: xai. License: Proprietary.

72.7% percentile inside its fair comparison set

Raw benchmark value: 1,298 · CI 1,274 - 1,322

Vision Arena · Chinese · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Vision Arena Chinese leaderboard.

Rank #23 · Source label: grok-4.20-beta-0309-reasoning

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,303
Percentile: 71.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `grok-4.20-beta-0309-reasoning`. Category: chinese. Source rank: #28. Votes: 950. Organization: xai. License: Proprietary.

71.4% percentile inside its fair comparison set

Raw benchmark value: 1,303 · CI 1,279 - 1,327
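
Each Text Arena row in this section pairs a raw score with a confidence interval and a vote count. As a rough illustration of how those fields relate, the sketch below copies the five Text Arena rows shown above and computes the half-width of each interval, ordered by votes. The `ArenaRow` class and `half_width` name are hypothetical helpers for this example, not part of any Arena or registry API, and the sketch makes no claim about how Arena derives its intervals.

```python
from dataclasses import dataclass


@dataclass
class ArenaRow:
    """One score card, copied by hand from the rows above (hypothetical helper)."""
    category: str
    score: int
    ci_low: int
    ci_high: int
    votes: int

    @property
    def half_width(self) -> float:
        # Half the distance between the CI bounds shown on the card.
        return (self.ci_high - self.ci_low) / 2


# Values as displayed on the Text Arena cards in this section.
rows = [
    ArenaRow("german", 1462, 1434, 1490, 437),
    ArenaRow("japanese", 1441, 1398, 1483, 207),
    ArenaRow("korean", 1428, 1403, 1454, 634),
    ArenaRow("russian", 1461, 1452, 1471, 4628),
    ArenaRow("spanish", 1446, 1428, 1464, 1302),
]

# Order from fewest to most votes to see how interval width changes.
by_votes = sorted(rows, key=lambda r: r.votes)

for row in by_votes:
    print(f"{row.category:<9} votes={row.votes:>5}  "
          f"score={row.score}  half-width={row.half_width}")
```

For these five rows the interval narrows as the vote count grows, so the Japanese score (207 votes) carries the widest interval and the Russian score (4628 votes) the narrowest.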