Model profile · OpenAI

GPT-5.4

Closed weights · frontier · registry tag 2026 flagship

Visible tradeoffs

This model currently reads as visible tradeoffs across the resolved source data.

Visible coverage: 65.1%
Verified coverage: 60.3%
Spread: 99.5%
Last verified: Jun 20, 2026

72% bench fit

text · code · vision · document · search · 17 aliases · 45 official source links

Data version

Current snapshot.

Data version: Jun 20, 2026
Model list checked: 9 providers · 1081 tracked models
Page refreshed: Jul 5, 2026

The registry snapshot and page stamp are shown so a stale deploy is visible at a glance.
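
As a minimal sketch of how the two stamps could be compared, the snippet below computes the gap between the data version and the page refresh shown above. The function name and the 30-day threshold are illustrative assumptions; this page does not define a staleness rule.

```python
from datetime import date

# Dates copied from the snapshot stamps above.
DATA_VERSION = date(2026, 6, 20)
PAGE_REFRESHED = date(2026, 7, 5)

# Hypothetical threshold; the page does not specify one.
MAX_LAG_DAYS = 30


def snapshot_lag_days(data_version: date, page_refreshed: date) -> int:
    """Days between the registry snapshot and the page refresh."""
    return (page_refreshed - data_version).days


lag = snapshot_lag_days(DATA_VERSION, PAGE_REFRESHED)
is_stale = lag > MAX_LAG_DAYS
print(f"snapshot lag: {lag} days, stale: {is_stale}")
```

With the stamps above the lag is 15 days, which would pass the assumed threshold.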

Source-linked scores by benchmark

Each row keeps the benchmark source, source type, raw metric, and percentile inside its fair comparison set.
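
The page does not state how the percentile is derived from a rank. One reading, sketched below under that assumption, is the share of the comparison set ranked at or below the model. The `BenchmarkRow` type, the function name, and the set size of 210 are hypothetical; the set size in particular is not shown on this page and is chosen only because it would reproduce the 0.5% reported for rank #210 in the Time to first token row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRow:
    """Hypothetical container for the fields each row keeps."""
    benchmark: str
    source: str
    source_type: str
    raw_value: str
    percentile: float


def percentile_in_set(rank: int, set_size: int) -> float:
    """Assumed formula: percent of the set ranked at or below this rank."""
    if not 1 <= rank <= set_size:
        raise ValueError("rank must be between 1 and set_size")
    return 100.0 * (set_size - rank + 1) / set_size


# Assumed set size of 210; rank taken from the Time to first token row.
ttft = BenchmarkRow(
    benchmark="Time to first token",
    source="Artificial Analysis",
    source_type="Speed / cost",
    raw_value="122.24s",
    percentile=round(percentile_in_set(rank=210, set_size=210), 1),
)
print(ttft.percentile)
```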

Chat / text · 38 benchmarks · 76.3%

Intelligence Index

AA · Chat / text · Combined

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #67 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 28
Percentile: 83.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `intelligenceIndex`.

83.3% percentile inside its fair comparison set

Raw benchmark value: 28

AA-Omniscience accuracy

AA · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #22 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 36.8%
Percentile: 93%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `omniscienceAccuracy`.

93% percentile inside its fair comparison set

Raw benchmark value: 36.8%

AA-Omniscience non-hallucination

AA · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #190 · Source label: GPT-5.4 (xhigh)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 11.4%
Percentile: 36.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `omniscienceNonHallucination`.

36.6% percentile inside its fair comparison set

Raw benchmark value: 11.4%

IFBench

AA · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #100 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 48.4%
Percentile: 68.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `ifbench`.

68.6% percentile inside its fair comparison set

Raw benchmark value: 48.4%

Blended price

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #246 · Source label: GPT-5.4 (low)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: $5.9 /1M tokens
Percentile: 11.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `price1mBlended0To3To1`.

11.2% percentile inside its fair comparison set

Raw benchmark value: $5.9 /1M tokens

Input price

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #244 · Source label: GPT-5.4 (low)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: $2.6 /1M input tokens
Percentile: 12%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `price1mInputTokens`.

12% percentile inside its fair comparison set

Raw benchmark value: $2.6 /1M input tokens

Output price

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #257 · Source label: GPT-5.4 (low)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: $15.8 /1M output tokens
Percentile: 7.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `price1mOutputTokens`.

7.2% percentile inside its fair comparison set

Raw benchmark value: $15.8 /1M output tokens
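
The three price rows above appear to fit together: the field name `price1mBlended0To3To1` hints at a 3:1 input-to-output weighting, and that weighting reproduces the reported blended figure from the input and output prices. The sketch below works through the arithmetic; the 3:1 ratio is an inference from the field name, not something this page states, and the function name is illustrative.

```python
INPUT_PRICE = 2.6    # USD per 1M input tokens, from the Input price row
OUTPUT_PRICE = 15.8  # USD per 1M output tokens, from the Output price row


def blended_price(input_price: float, output_price: float,
                  input_parts: float = 3.0, output_parts: float = 1.0) -> float:
    """Weighted average price per 1M tokens (assumed 3:1 input:output mix)."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


blended = blended_price(INPUT_PRICE, OUTPUT_PRICE)
print(f"${blended:.1f} /1M tokens")  # prints "$5.9 /1M tokens"
```

A workload with a different input-to-output mix would land elsewhere between the two prices, so the blended figure is best read as a reference point.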

Output Speed

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #79 · Source label: GPT-5.4 (low)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 121 tokens/s
Percentile: 62.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `medianOutputTokensPerSecond`.

62.9% percentile inside its fair comparison set

Raw benchmark value: 121 tokens/s

Time to first token

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #210 · Source label: GPT-5.4 (xhigh)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 122.24s
Percentile: 0.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `medianTimeToFirstTokenSeconds`.

0.5% percentile inside its fair comparison set

Raw benchmark value: 122.24s

Time to first answer token

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #210 · Source label: GPT-5.4 (xhigh)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 122.24s
Percentile: 0.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `medianTimeToFirstAnswerTokenSeconds`.

0.5% percentile inside its fair comparison set

Raw benchmark value: 122.24s

Openness Index

AA · Chat / text · Combined

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #187 · Source label: GPT-5 (minimal)

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 6
Percentile: 7.5%
Last updated: recent
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Parsed from Artificial Analysis public leaderboard field `opennessBreakdown.opennessIndex`. Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

7.5% percentile inside its fair comparison set

Raw benchmark value: 6
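
The eligibility field above separates rows that count toward the default ranking from rows shown only for context. A minimal sketch of that filter follows, assuming an exact match on the `headline eligible` label; the `Row` type and function name are hypothetical, and the three sample rows are copied from this page.

```python
from dataclasses import dataclass

HEADLINE = "headline eligible"


@dataclass(frozen=True)
class Row:
    """Hypothetical row shape: only the fields the filter needs."""
    benchmark: str
    percentile: float
    eligibility: str


rows = [
    Row("Intelligence Index", 83.3, HEADLINE),
    Row("IFBench", 68.6, HEADLINE),
    Row(
        "Openness Index",
        7.5,
        "Fallback benchmark identity is visible for context "
        "but excluded from default ranking.",
    ),
]


def default_ranking_rows(all_rows: list[Row]) -> list[Row]:
    """Keep only rows whose eligibility label marks them headline eligible."""
    return [row for row in all_rows if row.eligibility == HEADLINE]


ranked = default_ranking_rows(rows)
print([row.benchmark for row in ranked])
```

Under this assumption the backfilled Openness Index row stays visible in the full list but drops out of `ranked`.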

Text Arena

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,478
Percentile: 97.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: overall. Source rank: #11. Votes: 40959. Organization: openai. License: Proprietary.

97.5% percentile inside its fair comparison set

Raw benchmark value: 1,478 · CI 1,474 - 1,482
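
To give a feel for what a gap in Arena scores might mean, the sketch below converts a rating difference into an expected head-to-head win rate. It assumes the scores behave like Elo-scale ratings (base 10, scale 400), which this page does not state, and the rival rating of 1,450 is made up purely for illustration.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected share of wins for A under an Elo-style logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


GPT_5_4_HIGH = 1478.0        # Text Arena overall score from the row above
HYPOTHETICAL_RIVAL = 1450.0  # invented rating, not taken from any leaderboard

p = expected_win_rate(GPT_5_4_HIGH, HYPOTHETICAL_RIVAL)
print(f"expected win rate: {p:.3f}")
```

Under these assumptions a gap of a few dozen points corresponds to a modest edge, only slightly better than a coin flip.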

Text Arena · Creative Writing

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #18 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,447
Percentile: 94.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: creative_writing. Source rank: #24. Votes: 6611. Organization: openai. License: Proprietary.

94.7% percentile inside its fair comparison set

Raw benchmark value: 1,447 · CI 1,439 - 1,455

Text Arena · English

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #14 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,481
Percentile: 96%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: english. Source rank: #16. Votes: 19173. Organization: openai. License: Proprietary.

96% percentile inside its fair comparison set

Raw benchmark value: 1,481 · CI 1,476 - 1,487

Text Arena · Exclude Ties

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #10 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,488
Percentile: 97.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: exclude_ties. Source rank: #12. Votes: 31419. Organization: openai. License: Proprietary.

97.2% percentile inside its fair comparison set

Raw benchmark value: 1,488 · CI 1,482 - 1,493

Text Arena · Hard Prompts

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #10 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,500
Percentile: 97.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: hard_prompts. Source rank: #12. Votes: 26098. Organization: openai. License: Proprietary.

97.2% percentile inside its fair comparison set

Raw benchmark value: 1,500 · CI 1,495 - 1,505

Text Arena · Hard Prompts English

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #16 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,498
Percentile: 95.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: hard_prompts_english. Source rank: #19. Votes: 12813. Organization: openai. License: Proprietary.

95.4% percentile inside its fair comparison set

Raw benchmark value: 1,498 · CI 1,492 - 1,505

Text Arena · Instruction Following

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #10 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,477
Percentile: 97.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: instruction_following. Source rank: #12. Votes: 13343. Organization: openai. License: Proprietary.

97.2% percentile inside its fair comparison set

Raw benchmark value: 1,477 · CI 1,471 - 1,483

Text Arena · Longer Query

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #13 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,486
Percentile: 96.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: longer_query. Source rank: #17. Votes: 16701. Organization: openai. License: Proprietary.

96.1% percentile inside its fair comparison set

Raw benchmark value: 1,486 · CI 1,480 - 1,492

Text Arena · Multi Turn

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,492
Percentile: 97.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: multi_turn. Source rank: #11. Votes: 7580. Organization: openai. License: Proprietary.

97.5% percentile inside its fair comparison set

Raw benchmark value: 1,492 · CI 1,485 - 1,500

Text Arena · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,470
Percentile: 97.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: overall. Source rank: #12. Votes: 40959. Organization: openai. License: Proprietary.

97.5% percentile inside its fair comparison set

Raw benchmark value: 1,470 · CI 1,466 - 1,474

Text Arena · Creative Writing · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #17 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,442
Percentile: 95%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: creative_writing. Source rank: #21. Votes: 6611. Organization: openai. License: Proprietary.

95% percentile inside its fair comparison set

Raw benchmark value: 1,442 · CI 1,434 - 1,451

Text Arena · English · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #14 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,470
Percentile: 96%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: english. Source rank: #17. Votes: 19173. Organization: openai. License: Proprietary.

96% percentile inside its fair comparison set

Raw benchmark value: 1,470 · CI 1,464 - 1,475

Text Arena · Exclude Ties · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,475
Percentile: 97.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: exclude_ties. Source rank: #12. Votes: 31419. Organization: openai. License: Proprietary.

97.5% percentile inside its fair comparison set

Raw benchmark value: 1,475 · CI 1,469 - 1,480

Text Arena · Hard Prompts · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #4 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,487
Percentile: 99.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: hard_prompts. Source rank: #6. Votes: 26098. Organization: openai. License: Proprietary.

99.1% percentile inside its fair comparison set

Raw benchmark value: 1,487 · CI 1,482 - 1,493

Text Arena · Hard Prompts English · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #10 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,482
Percentile: 97.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: hard_prompts_english. Source rank: #12. Votes: 12813. Organization: openai. License: Proprietary.

97.2% percentile inside its fair comparison set

Raw benchmark value: 1,482 · CI 1,476 - 1,489

Text Arena · Instruction Following · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,473
Percentile: 97.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: instruction_following. Source rank: #11. Votes: 13343. Organization: openai. License: Proprietary.

97.5% percentile inside its fair comparison set

Raw benchmark value: 1,473 · CI 1,467 - 1,479

Text Arena · Longer Query · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #14 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,476
Percentile: 95.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: longer_query. Source rank: #18. Votes: 16701. Organization: openai. License: Proprietary.

95.7% percentile inside its fair comparison set

Raw benchmark value: 1,476 · CI 1,470 - 1,482

Text Arena · Multi Turn · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #8 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,480
Percentile: 97.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: multi_turn. Source rank: #11. Votes: 7580. Organization: openai. License: Proprietary.

97.8% percentile inside its fair comparison set

Raw benchmark value: 1,480 · CI 1,472 - 1,487

Instruction following

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #16 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 65%
Percentile: 86.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: IF. Tasks scored: 4.

86.1% percentile inside its fair comparison set

Raw benchmark value: 65%

Language

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #9 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 82.6%
Percentile: 92.6%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Language. Tasks scored: 3.

92.6% percentile inside its fair comparison set

Raw benchmark value: 82.6%

Paraphrase

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #20 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 62.4%
Percentile: 82.4%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: paraphrase. Category: IF.

82.4% percentile inside its fair comparison set

Raw benchmark value: 62.4%

Simplify

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #25 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 59.2%
Percentile: 77.8%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: simplify. Category: IF.

77.8% percentile inside its fair comparison set

Raw benchmark value: 59.2%

Story generation

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #8 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 73%
Percentile: 93.5%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: story_generation. Category: IF.

93.5% percentile inside its fair comparison set

Raw benchmark value: 73%

Summarize

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #25 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 65.2%
Percentile: 77.8%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: summarize. Category: IF.

77.8% percentile inside its fair comparison set

Raw benchmark value: 65.2%

Connections

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #16 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 99%
Percentile: 86.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: connections. Category: Language.

86.1% percentile inside its fair comparison set

Raw benchmark value: 99%

Plot unscrambling

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #10 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 65.9%
Percentile: 91.7%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: plot_unscrambling. Category: Language.

91.7% percentile inside its fair comparison set

Raw benchmark value: 65.9%

Typos

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #16 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 82%
Percentile: 86%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: typos. Category: Language.

86% percentile inside its fair comparison set

Raw benchmark value: 82%

Coding · 27 benchmarks · 74.9%

Terminal-Bench Hard

AA · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #26 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 37.9%
Percentile: 91.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `terminalbenchHard`.

91.7% percentile inside its fair comparison set

Raw benchmark value: 37.9%

SciCode

AA · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #15 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 47.1%
Percentile: 96.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `scicode`.

96.2% percentile inside its fair comparison set

Raw benchmark value: 47.1%

Coding Index

AA · Coding · Combined

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #4 · Source label: GPT-5.4 (xhigh)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 71
Percentile: 96%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `codingIndex`.

96% percentile inside its fair comparison set

Raw benchmark value: 71

Agentic Index

AA · Coding · Combined

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #5 · Source label: GPT-5.4 (xhigh)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 41
Percentile: 91.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `agenticIndex`.

91.3% percentile inside its fair comparison set

Raw benchmark value: 41

Code Arena

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #20 · Source label: gpt-5.4-high (codex-harness)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,457
Percentile: 74%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high (codex-harness)`. Category: overall. Source rank: #25. Votes: 1482. Organization: openai. License: Proprietary.

74% percentile inside its fair comparison set

Raw benchmark value: 1,457 · CI 1,440 - 1,474

WebDev Arena

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #20 · Source label: gpt-5.4-high (codex-harness)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,457
Percentile: 74%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high (codex-harness)`. Category: webdev. Source rank: #25. Votes: 1482. Organization: openai. License: Proprietary.

74% percentile inside its fair comparison set

Raw benchmark value: 1,457 · CI 1,440 - 1,474

Code Arena · Webdev Html

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #17 · Source label: gpt-5.4-medium (codex-harness)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,473
Percentile: 78.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-medium (codex-harness)`. Category: webdev-html. Source rank: #23. Votes: 165. Organization: openai. License: Proprietary.

78.1% percentile inside its fair comparison set

Raw benchmark value: 1,473 · CI 1,425 - 1,520

Code Arena · Webdev React

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #20 · Source label: gpt-5.4-high (codex-harness)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,449
Percentile: 67.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high (codex-harness)`. Category: webdev-react. Source rank: #25. Votes: 1322. Organization: openai. License: Proprietary.

67.8% percentile inside its fair comparison set

Raw benchmark value: 1,449 · CI 1,431 - 1,467
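
The Arena rows on this page report both a vote count and a confidence interval, and the intervals are visibly wider where votes are scarce. The sketch below lines up four of those rows to make the pattern explicit. The values are copied from this page; the helper name is illustrative, and no claim is made about how Arena computes its intervals.

```python
# (label, votes, ci_low, ci_high), copied from Arena rows on this page.
ARENA_ROWS = [
    ("Text Arena", 40959, 1474, 1482),
    ("Code Arena", 1482, 1440, 1474),
    ("Code Arena · Webdev React", 1322, 1431, 1467),
    ("Code Arena · Webdev Html", 165, 1425, 1520),
]


def ci_width(low: int, high: int) -> int:
    """Width of a confidence interval in rating points."""
    return high - low


widths = {label: ci_width(low, high) for label, _, low, high in ARENA_ROWS}

for label, votes, _, _ in ARENA_ROWS:
    print(f"{label}: {votes} votes, CI width {widths[label]}")
```

The 165-vote Webdev Html row spans 95 points against 8 for the 40,959-vote Text Arena row, so its rank deserves a more cautious reading.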

Code Migration

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #7 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 35%
Percentile: 72.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: code-migration; provider: OpenAI.

72.7% percentile inside its fair comparison set

Raw benchmark value: 35% · CI 26.9% - 43%

LiveCodeBench

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #23 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 84.1%
Percentile: 75.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: lcb; provider: OpenAI.

75.6% percentile inside its fair comparison set

Raw benchmark value: 84.1% · CI 82.1% - 86.2%

ProgramBench

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #5 · Source label: openai/gpt-5.4-2026-03-05-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 0.5%
Percentile: 90%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: programbench; provider: OpenAI.

90% percentile inside its fair comparison set

Raw benchmark value: 0.5% · CI 0% - 1.5%

SWE-bench Verified

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #8 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 78.2%
Percentile: 87%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: swebench; provider: OpenAI.

87% percentile inside its fair comparison set

Raw benchmark value: 78.2% (CI 74.6% - 81.8%)

Vibe Code Bench v1.1

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #5 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 67.4%
Percentile: 91.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: vibe-code; provider: OpenAI.

91.8% percentile inside its fair comparison set

Raw benchmark value: 67.4% (CI 57.9% - 76.9%)

Text Arena · Coding

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #12 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,521
Percentile: 96.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: coding. Source rank: #16. Votes: 10857. Organization: openai. License: Proprietary.

96.6% percentile inside its fair comparison set

Raw benchmark value: 1,521 (CI 1,514 - 1,527)
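To give a feel for what an Arena confidence interval of this width means, the sketch below converts a rating gap into an expected head-to-head preference rate. It assumes Arena scores sit on a conventional Elo-style scale (base 10, 400-point divisor); the page itself does not state the scale, so treat the formula as an assumption rather than Arena's documented method. The function name `expected_win_rate` is hypothetical.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected share of pairwise votes won by A over B.

    Assumes a conventional Elo-style scale (base 10, divisor 400).
    This scale is an assumption, not something the page confirms.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# CI bounds from the Text Arena · Coding row: 1,514 to 1,527.
ci_low, ci_high = 1514.0, 1527.0

# How different would the two ends of the interval be in head-to-head terms?
print(f"{expected_win_rate(ci_high, ci_low):.3f}")
```

Under that assumption, the two ends of the interval are only a couple of points of win rate apart, which is why rows with similar Arena scores are hard to separate.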

Text Arena · Coding · No Style Control

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #11 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,495
Percentile: 96.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: coding. Source rank: #15. Votes: 10857. Organization: openai. License: Proprietary.

96.9% percentile inside its fair comparison set

Raw benchmark value: 1,495 (CI 1,488 - 1,502)

IOI

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #2 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 67.8%
Percentile: 97.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: ioi; provider: OpenAI.

97.7% percentile inside its fair comparison set

Raw benchmark value: 67.8% (CI 48.5% - 87.2%)
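The confidence intervals on the Vals AI rows vary widely in width, and the IOI interval is far wider than the others. A minimal Python sketch for flagging such rows follows. The `ScoredRow` type, the `is_noisy` helper, and the 10-point cutoff are all hypothetical, and the page does not state the confidence level behind these intervals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredRow:
    name: str
    value: float    # raw benchmark value, in percent
    ci_low: float   # lower CI bound, in percent
    ci_high: float  # upper CI bound, in percent

    @property
    def half_width(self) -> float:
        """Half the CI width, in percentage points."""
        return (self.ci_high - self.ci_low) / 2


def is_noisy(row: ScoredRow, max_half_width: float = 10.0) -> bool:
    """Flag rows whose CI half-width exceeds a cutoff (hypothetical threshold)."""
    return row.half_width > max_half_width


# Values copied from the LiveCodeBench, SWE-bench Verified, and IOI rows.
rows = [
    ScoredRow("LiveCodeBench", 84.1, 82.1, 86.2),
    ScoredRow("SWE-bench Verified", 78.2, 74.6, 81.8),
    ScoredRow("IOI", 67.8, 48.5, 87.2),
]

for row in rows:
    print(f"{row.name}: +/-{row.half_width:.2f} pts, noisy={is_noisy(row)}")
```

With this cutoff only the IOI row is flagged, so its high percentile deserves more caution than the tighter LiveCodeBench or SWE-bench Verified figures.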

Code Arena · Image To Webdev

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #13 · Source label: gpt-5.4

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,435
Percentile: 29.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: image_to_webdev. Source rank: #16. Votes: 1220. Organization: openai. License: Proprietary.

29.4% percentile inside its fair comparison set

Raw benchmark value: 1,435 (CI 1,417 - 1,453)

HiL-Bench

SL · Coding · Rubric

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #5

verified runtime · exact direct

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 9.3%
Percentile: 20%
Last updated: recent
Eligibility: headline eligible

20% percentile inside its fair comparison set

Raw benchmark value: 9.3%

Terminal-Bench 2.0

OFF · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #2

Official company result · manual verified · manual verified

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Provider official evals
Raw value: 75.1%
Percentile: 83.3%
Last updated: aging
Eligibility: headline eligible

Verified against OpenAI's GPT-5.4 launch page dated March 5, 2026. Uses the published Terminal-Bench 2.0 figure. Checked 2026-04-29. Verification: manual_public_page_verification.

83.3% percentile inside its fair comparison set

Raw benchmark value: 75.1%
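The rank and percentile figures on these cards look consistent with a simple rank-based formula, percentile = (N - rank) / (N - 1), where N is the size of the fair comparison set. This is an inference from the displayed numbers, not a rule stated on the page, and N is never shown. The set sizes below are hypothetical values chosen only because they reproduce the displayed percentiles, and ties could change the picture.

```python
def percentile_from_rank(rank: int, set_size: int) -> float:
    """Percentage of the other models in the set that rank below `rank`.

    Assumed formula (not confirmed by the source): (N - rank) / (N - 1) * 100,
    so rank 1 maps to 100% and the last rank maps to 0%.
    """
    if set_size < 2 or not 1 <= rank <= set_size:
        raise ValueError("rank must lie in 1..set_size and set_size must be >= 2")
    return round((set_size - rank) / (set_size - 1) * 100, 1)


# Hypothetical set sizes that reproduce figures shown on this page.
print(percentile_from_rank(7, 23))  # code-migration row shows 72.7%
print(percentile_from_rank(5, 6))   # HiL-Bench row shows 20%
print(percentile_from_rank(2, 7))   # Terminal-Bench 2.0 (official) row shows 83.3%
```

If the formula holds, a good rank in a small set can still yield a low percentile, which is the pattern the HiL-Bench row shows.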

Agentic coding

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #54 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 46.7%
Percentile: 50.9%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Agentic Coding. Tasks scored: 3.

50.9% percentile inside its fair comparison set

Raw benchmark value: 46.7%

Coding

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #28 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 77.5%
Percentile: 75%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Coding. Tasks scored: 2.

75% percentile inside its fair comparison set

Raw benchmark value: 77.5%

JavaScript

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #30 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 45%
Percentile: 73.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: javascript. Category: Agentic Coding.

73.1% percentile inside its fair comparison set

Raw benchmark value: 45%

TypeScript

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #26 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 45%
Percentile: 76.6%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: typescript. Category: Agentic Coding.

76.6% percentile inside its fair comparison set

Raw benchmark value: 45%

Python

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #74 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 50%
Percentile: 32.4%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: python. Category: Agentic Coding.

32.4% percentile inside its fair comparison set

Raw benchmark value: 50%
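The LiveBench Agentic coding row reports "Tasks scored: 3", and the JavaScript, TypeScript, and Python task rows all carry the same gpt-5.4-high source label. Assuming the category score is an unweighted mean of its task scores (the page does not state the aggregation rule), the numbers line up, as this short check suggests:

```python
from statistics import mean

# Task-level values from the JavaScript, TypeScript, and Python rows (percent).
agentic_tasks = {"javascript": 45.0, "typescript": 45.0, "python": 50.0}

# Unweighted mean, rounded to one decimal like the displayed category value.
category_score = round(mean(agentic_tasks.values()), 1)
print(category_score)  # 46.7, matching the Agentic coding row
```

The same check is less meaningful where task rows come from different source labels, for example a category that mixes gpt-5.4-high and gpt-5.4-xhigh rows.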

Coding generation

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #53 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 74.6%
Percentile: 52.8%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: code_generation. Category: Coding.

52.8% percentile inside its fair comparison set

Raw benchmark value: 74.6%

Coding completion

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #24 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 76.1%
Percentile: 78.7%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: code_completion. Category: Coding.

78.7% percentile inside its fair comparison set

Raw benchmark value: 76.1%

Terminal-Bench 2.0

TERMINAL-BENCH · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #10 · Source label: gpt-5

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Terminal-Bench
Raw value: 49.6%
Percentile: 73.3%
Last updated: archived
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Parsed from the public Terminal-Bench 2.0 verified leaderboard. Collapse policy: highest verified score per canonical model. Selected agent: Codex CLI (0.53.0). Display model: GPT-5. Integration method: API. Agent URL: https://developers.openai.com/codex/cli/. Reported stderr: 1.478 percentage points. Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

73.3% percentile inside its fair comparison set

Raw benchmark value: 49.6%
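The note on this row mentions a collapse policy ("highest verified score per canonical model") and a reported stderr of 1.478 percentage points. The sketch below shows one way such a collapse could work, followed by a rough interval from the stderr. The `LeaderboardRow` type, both helper functions, and the two extra rows are hypothetical, and the 1.96 multiplier assumes a normal approximation that the source does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeaderboardRow:
    canonical_model: str
    agent: str
    score: float    # percent
    stderr: float   # percentage points
    verified: bool


def collapse_highest_verified(rows: list[LeaderboardRow]) -> dict[str, LeaderboardRow]:
    """Keep the highest-scoring verified row for each canonical model."""
    best: dict[str, LeaderboardRow] = {}
    for row in rows:
        if not row.verified:
            continue  # unverified rows never win the collapse
        current = best.get(row.canonical_model)
        if current is None or row.score > current.score:
            best[row.canonical_model] = row
    return best


def normal_interval(score: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate interval score +/- z * stderr (normal approximation, assumed)."""
    return (round(score - z * stderr, 1), round(score + z * stderr, 1))


rows = [
    LeaderboardRow("gpt-5", "Codex CLI (0.53.0)", 49.6, 1.478, True),    # from this row
    LeaderboardRow("gpt-5", "hypothetical-agent-a", 44.0, 1.5, True),    # made-up row
    LeaderboardRow("gpt-5", "hypothetical-agent-b", 52.0, 1.5, False),   # made-up, unverified
]

selected = collapse_highest_verified(rows)["gpt-5"]
print(selected.agent, normal_interval(selected.score, selected.stderr))
```

Under these assumptions the 49.6% figure carries an interval of roughly 46.7% to 52.5%. Note that the row is backfilled from GPT-5, so it describes the proxy model rather than GPT-5.4 itself.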

Reasoning / math / science · 22 benchmarks · 85.3%

Humanity's Last Exam

AA · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #91 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 10.6%
Percentile: 75.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `hle`.

75.7% percentile inside its fair comparison set

Raw benchmark value: 10.6%

GPQA

AA · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #107 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 74.8%
Percentile: 71.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `gpqa`.

71.7% percentile inside its fair comparison set

Raw benchmark value: 74.8%

CritPt

AA · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #61 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 0.6%
Percentile: 80.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `critpt`.

80.1% percentile inside its fair comparison set

Raw benchmark value: 0.6%

ProofBench

VALS-AI · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #3 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 56%
Percentile: 94.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: proof_bench; provider: OpenAI.

94.3% percentile inside its fair comparison set

Raw benchmark value: 56% (CI 46.3% - 65.7%)

GPQA Diamond

VALS-AI · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #9 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 91.7%
Percentile: 93.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: gpqa; provider: OpenAI.

93.3% percentile inside its fair comparison set

Raw benchmark value: 91.7% (CI 87.9% - 95.4%)

MMLU Pro

VALS-AI · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #14 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 87.5%
Percentile: 85.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: mmlu_pro; provider: OpenAI.

85.4% percentile inside its fair comparison set

Raw benchmark value: 87.5% (CI 86.7% - 88.3%)

Text Arena · Math

AR · Reasoning / math / science · Human

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #5 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,503
Percentile: 98.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: math. Source rank: #6. Votes: 2285. Organization: openai. License: Proprietary.

98.7% percentile inside its fair comparison set

Raw benchmark value: 1,503 (CI 1,490 - 1,515)

Text Arena · Math · No Style Control

AR · Reasoning / math / science · Human

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #5 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,497
Percentile: 98.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: math. Source rank: #6. Votes: 2285. Organization: openai. License: Proprietary.

98.7% percentile inside its fair comparison set

Raw benchmark value: 1,497 (CI 1,484 - 1,510)

TutorBench

SL · Reasoning / math / science · Rubric

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #1 · Source label: gpt-5.4-pro-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 56.6%
Percentile: 100%
Last updated: recent
Eligibility: headline eligible

100% percentile inside its fair comparison set

Raw benchmark value: 56.6%

MultiNRC

SL · Reasoning / math / science · Rubric

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #2

verified runtime · exact direct

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 62.3%
Percentile: 90%
Last updated: recent
Eligibility: headline eligible

90% percentile inside its fair comparison set

Raw benchmark value: 62.3%

EnigmaEval

SL · Reasoning / math / science · Rubric

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #5 · Source label: gpt-5

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 64%
Percentile: 80%
Last updated: aging
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

80% percentile inside its fair comparison set

Raw benchmark value: 64%

Humanity's Last Exam

OFF · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #5

Official company result · manual verified · manual verified

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Provider official evals
Raw value: 39.8%
Percentile: 42.9%
Last updated: aging
Eligibility: headline eligible

Verified against OpenAI's GPT-5.4 launch page dated March 5, 2026. Uses the published no-tools Humanity's Last Exam figure. Checked 2026-04-29. Verification: manual_public_page_verification.

42.9% percentile inside its fair comparison set

Raw benchmark value: 39.8%

Mathematics

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #12 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 90%
Percentile: 89.8%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Mathematics. Tasks scored: 4.

89.8% percentile inside its fair comparison set

Raw benchmark value: 90%

Reasoning

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 85.7%
Percentile: 92.6%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Reasoning. Tasks scored: 4.

92.6% percentile inside its fair comparison set

Raw benchmark value: 85.7%

AMPS Hard

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #22 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 98%
Percentile: 94.4%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: AMPS_Hard. Category: Mathematics.

94.4% percentile inside its fair comparison set

Raw benchmark value: 98%

Integrals with game

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #16 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 77%
Percentile: 86.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: integrals_with_game. Category: Mathematics.

86.1% percentile inside its fair comparison set

Raw benchmark value: 77%

Math competition

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #29 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 94.1%
Percentile: 74.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: math_comp. Category: Mathematics.

74.1% percentile inside its fair comparison set

Raw benchmark value: 94.1%

Olympiad

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #22 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 88.8%
Percentile: 80.6%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: olympiad. Category: Mathematics.

80.6% percentile inside its fair comparison set

Raw benchmark value: 88.8%

Theory of mind

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #1 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 84.6%
Percentile: 100%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: theory_of_mind. Category: Reasoning.

100% percentile inside its fair comparison set

Raw benchmark value: 84.6%

Zebra puzzle

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #16 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 90%
Percentile: 86%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: zebra_puzzle. Category: Reasoning.

86% percentile inside its fair comparison set

Raw benchmark value: 90%

Spatial

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #24 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 98%
Percentile: 88.9%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: spatial. Category: Reasoning.

88.9% percentile inside its fair comparison set

Raw benchmark value: 98%

Logic with navigation

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #30 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 68%
Percentile: 73.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: logic_with_navigation. Category: Reasoning.

73.1% percentile inside its fair comparison set

Raw benchmark value: 68%

Professional reasoning · 34 benchmarks · 85.7%

GDPval-AA

AA · Professional reasoning · Rubric

Agentic performance on economically valuable work tasks.

Rank #5 · Source label: GPT-5.4 (xhigh)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 1,401
Percentile: 91.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `gdpvalBreakdown.elo`.

91.3% percentile inside its fair comparison set

Raw benchmark value: 1,401

APEX-Agents-AA

AA · Professional reasoning · Objective

Long-horizon agentic task completion.

Rank #3 · Source label: GPT-5.4 (xhigh)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 33.3%
Percentile: 91.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `apexAgents`.

91.7% percentile inside its fair comparison set

Raw benchmark value: 33.3%

SkillsBench

VALS-AI · Professional reasoning · Objective

Applied professional skills tasks.

Rank #5 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 51.7%
Percentile: 60%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: skillsbench; provider: OpenAI.

60% percentile inside its fair comparison set

Raw benchmark value: 51.7% (CI 42.5% - 61%)

Harvey's Legal Agent Benchmark

VALS-AI · Professional reasoning · Objective

Completing legal work with documents, spreadsheets, presentations, and file-system tools.

Rank #14 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 0%
Percentile: 15.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: hlab; provider: OpenAI.

15.4% percentile inside its fair comparison set

Raw benchmark value: 0% (CI 0% - 0%)

LegalBench

VALS-AI · Professional reasoning · Objective

Academic legal reasoning tasks.

Rank #6 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 86%
Percentile: 94.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: legal_bench; provider: OpenAI.

94.4% percentile inside its fair comparison set

Raw benchmark value: 86% (CI 85.2% - 86.9%)

TaxEval v2

VALS-AI · Professional reasoning · Objective

Answer quality on tax questions and responses.

Rank #27 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 74%
Percentile: 72.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: tax_eval_v2; provider: OpenAI.

72.5% percentile inside its fair comparison set

Raw benchmark value: 74% (CI 72.3% - 75.7%)

MedCode

VALS-AI · Professional reasoning · Objective

Medical billing support and coding tasks.

Rank #24 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 41.3%
Percentile: 54.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: medcode; provider: OpenAI.

54.9% percentile inside its fair comparison set

Raw benchmark value: 41.3% (CI 37.1% - 45.5%)

MedScribe

VALS-AI · Professional reasoning · Objective

Administrative documentation support for doctors.

Rank #25 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 77.5%
Percentile: 52%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: medscribe; provider: OpenAI.

52% percentile inside its fair comparison set

Raw benchmark value: 77.5% (CI 71% - 84%)

Text Arena · Expert

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena expert leaderboard.

Rank #5 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,524
Percentile: 98.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: expert. Source rank: #7. Votes: 3597. Organization: openai. License: Proprietary.

98.5% percentile inside its fair comparison set

Raw benchmark value: 1,524 (CI 1,513 - 1,534)

Text Arena · Industry Business And Management And Financial Operations

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_business_and_management_and_financial_operations leaderboard.

Rank #10 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,483
Percentile: 97.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_business_and_management_and_financial_operations. Source rank: #13. Votes: 8167. Organization: openai. License: Proprietary.

97.2% percentile inside its fair comparison set

Raw benchmark value: 1,483 (CI 1,475 - 1,490)

Text Arena · Industry Entertainment And Sports And Media

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_entertainment_and_sports_and_media leaderboard.

Rank #14 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,447
Percentile: 96%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_entertainment_and_sports_and_media. Source rank: #17. Votes: 8353. Organization: openai. License: Proprietary.

96% percentile inside its fair comparison set

Raw benchmark value: 1,447 (CI 1,440 - 1,455)

Text Arena · Industry Legal And Government

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_legal_and_government leaderboard.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,489
Percentile: 97.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_legal_and_government. Source rank: #10. Votes: 3184. Organization: openai. License: Proprietary.

97.3% percentile inside its fair comparison set

Raw benchmark value: 1,489 (CI 1,477 - 1,500)

Text Arena · Industry Life And Physical And Social Science

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_life_and_physical_and_social_science leaderboard.

Rank #19 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,489
Percentile: 94.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_life_and_physical_and_social_science. Source rank: #22. Votes: 6649. Organization: openai. License: Proprietary.

94.4% percentile inside its fair comparison set

Raw benchmark value: 1,489 (CI 1,481 - 1,497)

Text Arena · Industry Mathematical

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_mathematical leaderboard.

Rank #7 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,503
Percentile: 98.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_mathematical. Source rank: #8. Votes: 2211. Organization: openai. License: Proprietary.

98.1% percentile inside its fair comparison set

Raw benchmark value: 1,503 (CI 1,490 - 1,517)

Text Arena · Industry Medicine And Healthcare

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_medicine_and_healthcare leaderboard.

Rank #38 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,473
Percentile: 87.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_medicine_and_healthcare. Source rank: #48. Votes: 2988. Organization: openai. License: Proprietary.

87.5% percentile inside its fair comparison set

Raw benchmark value: 1,473 (CI 1,461 - 1,485)

Text Arena · Industry Software And It Services

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_software_and_it_services leaderboard.

Rank #17 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,506
Percentile: 95.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_software_and_it_services. Source rank: #21. Votes: 15917. Organization: openai. License: Proprietary.

95.1% percentile inside its fair comparison set

Raw benchmark value: 1,506 · CI 1,500 - 1,512

Text Arena · Industry Writing And Literature And Language

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_writing_and_literature_and_language leaderboard.

Rank #8 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,469
Percentile: 97.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_writing_and_literature_and_language. Source rank: #10. Votes: 9946. Organization: openai. License: Proprietary.

97.8% percentile inside its fair comparison set

Raw benchmark value: 1,469 · CI 1,462 - 1,476

Text Arena · Expert · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena expert leaderboard.

Rank #5 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,512
Percentile: 98.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: expert. Source rank: #6. Votes: 3597. Organization: openai. License: Proprietary.

98.5% percentile inside its fair comparison set

Raw benchmark value: 1,512 · CI 1,501 - 1,523

Text Arena · Industry Business And Management And Financial Operations · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_business_and_management_and_financial_operations leaderboard.

Rank #4 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,476
Percentile: 99.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_business_and_management_and_financial_operations. Source rank: #6. Votes: 8167. Organization: openai. License: Proprietary.

99.1% percentile inside its fair comparison set

Raw benchmark value: 1,476 · CI 1,468 - 1,483

Text Arena · Industry Entertainment And Sports And Media · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_entertainment_and_sports_and_media leaderboard.

Rank #13 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,443
Percentile: 96.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_entertainment_and_sports_and_media. Source rank: #16. Votes: 8353. Organization: openai. License: Proprietary.

96.3% percentile inside its fair comparison set

Raw benchmark value: 1,443 · CI 1,436 - 1,451

Text Arena · Industry Legal And Government · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_legal_and_government leaderboard.

Rank #7 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,485
Percentile: 98%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_legal_and_government. Source rank: #8. Votes: 3184. Organization: openai. License: Proprietary.

98% percentile inside its fair comparison set

Raw benchmark value: 1,485 · CI 1,473 - 1,496

Text Arena · Industry Life And Physical And Social Science · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_life_and_physical_and_social_science leaderboard.

Rank #16 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,478
Percentile: 95.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_life_and_physical_and_social_science. Source rank: #20. Votes: 6649. Organization: openai. License: Proprietary.

95.4% percentile inside its fair comparison set

Raw benchmark value: 1,478 · CI 1,470 - 1,486

Text Arena · Industry Mathematical · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_mathematical leaderboard.

Rank #7 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,498
Percentile: 98.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_mathematical. Source rank: #8. Votes: 2211. Organization: openai. License: Proprietary.

98.1% percentile inside its fair comparison set

Raw benchmark value: 1,498 · CI 1,484 - 1,511

Text Arena · Industry Medicine And Healthcare · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_medicine_and_healthcare leaderboard.

Rank #34 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,457
Percentile: 88.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_medicine_and_healthcare. Source rank: #37. Votes: 2988. Organization: openai. License: Proprietary.

88.8% percentile inside its fair comparison set

Raw benchmark value: 1,457 · CI 1,446 - 1,469

Text Arena · Industry Software And It Services · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_software_and_it_services leaderboard.

Rank #11 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,490
Percentile: 96.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_software_and_it_services. Source rank: #14. Votes: 15917. Organization: openai. License: Proprietary.

96.9% percentile inside its fair comparison set

Raw benchmark value: 1,490 · CI 1,484 - 1,495

Text Arena · Industry Writing And Literature And Language · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_writing_and_literature_and_language leaderboard.

Rank #12 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,461
Percentile: 96.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_writing_and_literature_and_language. Source rank: #15. Votes: 9946. Organization: openai. License: Proprietary.

96.6% percentile inside its fair comparison set

Raw benchmark value: 1,461 · CI 1,454 - 1,468

SAGE

VALS-AI · Professional reasoning · Objective

Student Assessment with Generative Evaluation.

Rank #23 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 43.3%
Percentile: 51.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: sage; provider: OpenAI.

51.1% percentile inside its fair comparison set

Raw benchmark value: 43.3% · CI 37.2% - 49.4%

PRBench Legal

SL · Professional reasoning · Rubric

Applied legal reasoning on professional-domain tasks.

Rank #7

verified runtime · exact direct

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 44.4%
Percentile: 50%
Last updated: recent
Eligibility: headline eligible

50% percentile inside its fair comparison set

Raw benchmark value: 44.4%

Data analysis

LB · Professional reasoning · Objective

Structured data manipulation and table reasoning accuracy.

Rank #12 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 77%
Percentile: 89.8%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Data Analysis. Tasks scored: 3.

89.8% percentile inside its fair comparison set

Raw benchmark value: 77%

Overall

LB · Professional reasoning · Objective

Average objective performance across LiveBench's current public category mix.

Rank #13 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 75.1%
Percentile: 88.9%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category averages included: 7.

88.9% percentile inside its fair comparison set

Raw benchmark value: 75.1%

Consecutive events

LB · Professional reasoning · Objective

Objective consecutive events score in LiveBench.

Rank #18 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 81.4%
Percentile: 84.3%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: consecutive_events. Category: Data Analysis.

84.3% percentile inside its fair comparison set

Raw benchmark value: 81.4%

Table join

LB · Professional reasoning · Objective

Objective table join score in LiveBench.

Rank #8 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 49.8%
Percentile: 93.5%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: tablejoin. Category: Data Analysis.

93.5% percentile inside its fair comparison set

Raw benchmark value: 49.8%

Table reformat

LB · Professional reasoning · Objective

Objective table reformat score in LiveBench.

Rank #20 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 100%
Percentile: 100%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: tablereformat. Category: Data Analysis.

100% percentile inside its fair comparison set

Raw benchmark value: 100%

Poker Agent

VALS-AI · Professional reasoning · Objective

Agent profit in poker-style strategic play.

Rank #3 · Source label: openai/gpt-5-2025-08-07

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 1,103.2 score
Percentile: 93.8%
Last updated: archived
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Parsed from Vals AI BenchmarkView overall scores. Vals slug: poker_agent; provider: unknown. Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

93.8% percentile inside its fair comparison set

Raw benchmark value: 1,103.2 score · CI 1,103.2 score - 1,103.2 score

Search / tool use · 4 benchmarks · 54.1%

Tau2-Bench Telecom

AA · Search / tool use · Objective

It matters when the model must browse, call tools, and recover useful answers from external systems.

Rank #156 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 35.1%
Percentile: 49.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `tau2`.

49.8% percentile inside its fair comparison set

Raw benchmark value: 35.1%

Search Arena

AR · Search / tool use · Human

It matters when the model must browse, call tools, and recover useful answers from external systems.

Rank #13 · Source label: gpt-5.4-search

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,193
Percentile: 60%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-search`. Category: overall. Source rank: #14. Votes: 56204. Organization: openai. License: Proprietary.

60% percentile inside its fair comparison set

Raw benchmark value: 1,193 · CI 1,187 - 1,200

Search Arena · No Style Control

AR · Search / tool use · Human

It matters when the model must browse, call tools, and recover useful answers from external systems.

Rank #14 · Source label: gpt-5.4-search

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,195
Percentile: 56.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-search`. Category: overall. Source rank: #14. Votes: 56204. Organization: openai. License: Proprietary.

56.7% percentile inside its fair comparison set

Raw benchmark value: 1,195 · CI 1,189 - 1,200

BrowseComp

OFF · Search / tool use · Objective

It matters when the model must browse, call tools, and recover useful answers from external systems.

Rank #4

Official company result · manual verified · manual verified

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Provider official evals
Raw value: 82.7%
Percentile: 50%
Last updated: aging
Eligibility: headline eligible

Verified against OpenAI's GPT-5.4 launch page dated March 5, 2026. Uses the published BrowseComp figure. Checked 2026-04-29. Verification: manual_public_page_verification.

50% percentile inside its fair comparison set

Raw benchmark value: 82.7%

Long context · 2 benchmarks · 73.1%

Long Context Reasoning

AA · Long context · Objective

It checks whether long-context claims survive contact with retrieval, memory, or long-document tasks.

Rank #99 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 47.3%
Percentile: 68.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `lcr`.

68.9% percentile inside its fair comparison set

Raw benchmark value: 47.3%

CorpFin v2

VALS-AI · Long context · Objective

It checks whether long-context claims survive contact with retrieval, memory, or long-document tasks.

Rank #21 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 65.3%
Percentile: 77.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: corp_fin_v2; provider: OpenAI.

77.3% percentile inside its fair comparison set

Raw benchmark value: 65.3% · CI 63.4% - 67.1%

Vision understanding · 25 benchmarks · 84.2%

MMMU-Pro

AA · Vision understanding · Objective

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #27 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 70.6%
Percentile: 80.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `mmmuPro`.

80.7% percentile inside its fair comparison set

Raw benchmark value: 70.6%

Vision Arena

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #8 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,283
Percentile: 93.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: overall. Source rank: #10. Votes: 12727. Organization: openai. License: Proprietary.

93.6% percentile inside its fair comparison set

Raw benchmark value: 1,283 · CI 1,275 - 1,291

Vision Arena · Captioning

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #13 · Source label: gpt-5-chat

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,200
Percentile: 57.7%
Last updated: recent
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Parsed from Arena leaderboard dataset row `gpt-5-chat`. Category: captioning. Source rank: #13. Votes: 399. Organization: openai. License: Proprietary. Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

57.7% percentile inside its fair comparison set

Raw benchmark value: 1,200 · CI 1,170 - 1,230

Vision Arena · Creative Writing Vision

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #18 · Source label: gpt-5.4

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,253
Percentile: 69.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: creative_writing_vision. Source rank: #22. Votes: 675. Organization: openai. License: Proprietary.

69.1% percentile inside its fair comparison set

Raw benchmark value: 1,253 · CI 1,229 - 1,277

Vision Arena · Diagram

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #5 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,315
Percentile: 94.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: diagram. Source rank: #7. Votes: 3446. Organization: openai. License: Proprietary.

94.3% percentile inside its fair comparison set

Raw benchmark value: 1,315 · CI 1,304 - 1,327

Vision Arena · English

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #10 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,278
Percentile: 91.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: english. Source rank: #12. Votes: 5394. Organization: openai. License: Proprietary.

91.7% percentile inside its fair comparison set

Raw benchmark value: 1,278 · CI 1,267 - 1,289

Vision Arena · Entity Recognition

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #6 · Source label: gpt-5-high

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,257
Percentile: 87.5%
Last updated: recent
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Parsed from Arena leaderboard dataset row `gpt-5-high`. Category: entity_recognition. Source rank: #6. Votes: 434. Organization: openai. License: Proprietary. Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

87.5% percentile inside its fair comparison set

Raw benchmark value: 1,257 · CI 1,224 - 1,289

Vision Arena · Homework

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #5 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,323
Percentile: 94.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: homework. Source rank: #7. Votes: 2005. Organization: openai. License: Proprietary.

94.1% percentile inside its fair comparison set

Raw benchmark value: 1,323 · CI 1,308 - 1,337

Vision Arena · Humor

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #14 · Source label: gpt-5.4

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,254
Percentile: 73.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: humor. Source rank: #17. Votes: 412. Organization: openai. License: Proprietary.

73.5% percentile inside its fair comparison set

Raw benchmark value: 1,254 · CI 1,224 - 1,284

Vision Arena · Ocr

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #7 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,303
Percentile: 91.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: ocr. Source rank: #9. Votes: 9292. Organization: openai. License: Proprietary.

91.4% percentile inside its fair comparison set

Raw benchmark value: 1,303 · CI 1,295 - 1,311

Vision Arena · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #6 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,301
Percentile: 95.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: overall. Source rank: #8. Votes: 12727. Organization: openai. License: Proprietary.

95.4% percentile inside its fair comparison set

Raw benchmark value: 1,301 · CI 1,293 - 1,308

Vision Arena · Captioning · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #14 · Source label: gpt-5-chat

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,215
Percentile: 53.8%
Last updated: recent
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Parsed from Arena leaderboard dataset row `gpt-5-chat`. Category: captioning. Source rank: #14. Votes: 399. Organization: openai. License: Proprietary. Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

53.8% percentile inside its fair comparison set

Raw benchmark value: 1,215 · CI 1,185 - 1,245

Vision Arena · Creative Writing Vision · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #14 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,274
Percentile: 76.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: creative_writing_vision. Source rank: #18. Votes: 643. Organization: openai. License: Proprietary.

76.4% percentile inside its fair comparison set

Raw benchmark value: 1,274 · CI 1,250 - 1,299

Vision Arena · Diagram · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #5 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,320
Percentile: 94.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: diagram. Source rank: #7. Votes: 3446. Organization: openai. License: Proprietary.

94.3% percentile inside its fair comparison set

Raw benchmark value: 1,320 · CI 1,308 - 1,332

Vision Arena · English · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #6 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,297
Percentile: 95.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: english. Source rank: #8. Votes: 5394. Organization: openai. License: Proprietary.

95.4% percentile inside its fair comparison set

Raw benchmark value: 1,297 · CI 1,286 - 1,308

Vision Arena · Entity Recognition · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #11 · Source label: gpt-5-high

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,261
Percentile: 71.9%
Last updated: recent
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Parsed from Arena leaderboard dataset row `gpt-5-high`. Category: entity_recognition. Source rank: #11. Votes: 434. Organization: openai. License: Proprietary. Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

71.9% percentile inside its fair comparison set

Raw benchmark value: 1,261 · CI 1,233 - 1,290

Vision Arena · Homework · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #6 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,328
Percentile: 92.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: homework. Source rank: #7. Votes: 2005. Organization: openai. License: Proprietary.

92.6% percentile inside its fair comparison set

Raw benchmark value: 1,328 · CI 1,313 - 1,342

Vision Arena · Humor · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #15 · Source label: gpt-5.4

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,270
Percentile: 71.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: humor. Source rank: #18. Votes: 412. Organization: openai. License: Proprietary.

71.4% percentile inside its fair comparison set

Raw benchmark value: 1,270 · CI 1,240 - 1,301

Vision Arena · Ocr · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #4 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,314
Percentile: 95.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: ocr. Source rank: #6. Votes: 9292. Organization: openai. License: Proprietary.

95.7% percentile inside its fair comparison set

Raw benchmark value: 1,314 · CI 1,306 - 1,322

MMMU Pro

VALS-AI · Vision understanding · Objective

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #7 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 87.5%
Percentile: 91.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: mmmu; provider: OpenAI.

91.4% percentile inside its fair comparison set

Raw benchmark value: 87.5% · CI 86% - 89.1%

VTB

SL · Vision understanding · Rubric

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #1 · Source label: gpt-5.4-2026-03-05 (reasoning effort = high)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 29.2%
Percentile: 100%
Last updated: recent
Eligibility: headline eligible

100% percentile inside its fair comparison set

Raw benchmark value: 29.2%

VISTA

SL · Vision understanding · Rubric

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #3 · Source label: gpt-5

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 79%
Percentile: 92.9%
Last updated: aging
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

92.9% percentile inside its fair comparison set

Raw benchmark value: 79%

MMMU-Pro

OFF · Vision understanding · Objective

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #2

Official company result · manual verified · manual verified

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Provider official evals
Raw value: 81.2%
Percentile: 100%
Last updated: aging
Eligibility: headline eligible

Verified against OpenAI's GPT-5.4 launch page dated March 5, 2026. Uses the published no-tools MMMU Pro figure. Checked 2026-04-29. Verification: manual_public_page_verification.

100% percentile inside its fair comparison set

Raw benchmark value: 81.2%

Vision Arena · Creative Writing

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #12 · Source label: gpt-5-chat

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,232
Percentile: 68.8%
Last updated: archived
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Parsed from Arena leaderboard dataset row `gpt-5-chat`. Category: creative_writing. Source rank: #12. Votes: 1518. Organization: openai. License: Proprietary. Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

68.8% percentile inside its fair comparison set

Raw benchmark value: 1,232 · CI 1,216 - 1,248

Vision Arena · Creative Writing · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #11 · Source label: gpt-5-chat

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,239
Percentile: 71.9%
Last updated: archived
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

71.9% percentile inside its fair comparison set

Raw benchmark value: 1,239 · CI 1,224 - 1,255

Document understanding · 4 benchmarks · 80.7%

Document Arena

AR · Document understanding · Human

It matters when the job is reading PDFs, tables, forms, or mixed-layout documents rather than plain chat.

Rank #8 · Source label: gpt-5.4

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,472
Percentile: 70.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: overall. Source rank: #11. Votes: 24400. Organization: openai. License: Proprietary.

70.8% percentile inside its fair comparison set

Raw benchmark value: 1,472 · CI 1,465 - 1,479
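The Arena provenance line in this row packs several fields into one period-delimited string. As a minimal sketch, assuming the `Key: value.` layout shown on this page stays stable, the fields can be pulled out with a regular expression. The function name `parse_arena_provenance` and the dictionary keys are hypothetical and are not part of any registry or Arena API.

```python
import re

# Hypothetical helper: splits an Arena provenance line, as rendered on this
# page, into typed fields. The pattern is inferred from the page text only.
PROVENANCE_RE = re.compile(
    r"dataset row `(?P<row>[^`]+)`\. "
    r"Category: (?P<category>\w+)\. "
    r"Source rank: #(?P<source_rank>\d+)\. "
    r"Votes: (?P<votes>\d+)\. "
    r"Organization: (?P<organization>\w+)\. "
    r"License: (?P<license>\w+)\."
)

def parse_arena_provenance(line: str) -> dict:
    match = PROVENANCE_RE.search(line)
    if match is None:
        raise ValueError("line does not match the assumed provenance layout")
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared or sorted.
    fields["source_rank"] = int(fields["source_rank"])
    fields["votes"] = int(fields["votes"])
    return fields

line = (
    "Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: overall. "
    "Source rank: #11. Votes: 24400. Organization: openai. License: Proprietary."
)
record = parse_arena_provenance(line)
print(record["row"], record["source_rank"], record["votes"])
```

Keeping `source_rank` as its own field makes the gap between the source rank (#11) and the page's own rank (#8) explicit; the page rank appears to be computed inside the fair comparison set rather than copied from the source.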

Document Arena · No Style Control

AR · Document understanding · Human

It matters when the job is reading PDFs, tables, forms, or mixed-layout documents rather than plain chat.

Rank #6 · Source label: gpt-5.4

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,474
Percentile: 79.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: overall. Source rank: #9. Votes: 24400. Organization: openai. License: Proprietary.

79.2% percentile inside its fair comparison set

Raw benchmark value: 1,474 · CI 1,467 - 1,480

MortgageTax

VALS-AI · Document understanding · Objective

It matters when the job is reading PDFs, tables, forms, or mixed-layout documents rather than plain chat.

Rank #13 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 68.3%
Percentile: 80%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: mortgage_tax; provider: OpenAI.

80% percentile inside its fair comparison set

Raw benchmark value: 68.3% · CI 66.5% - 70.1%

Multimodal mix

OC · Document understanding · Objective

It matters when the job is reading PDFs, tables, forms, or mixed-layout documents rather than plain chat.

Rank #3 · Source label: gpt-5

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: OpenCompass
Raw value: 75.4%
Percentile: 92.9%
Last updated: recent
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

92.9% percentile inside its fair comparison set

Raw benchmark value: 75.4%
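The Document understanding figure of 80.7% is consistent with an unweighted mean of the four row percentiles in this section, including the backfilled Multimodal mix row. This is an observation from the numbers on the page, not a documented formula; the sketch below simply checks the arithmetic and, for contrast, shows the mean over headline-eligible rows only. The tuple layout is an assumption made for illustration.

```python
# Percentiles copied from the four Document understanding rows.
# The boolean marks whether the row is "headline eligible" on the page.
rows = [
    ("Document Arena", 70.8, True),
    ("Document Arena · No Style Control", 79.2, True),
    ("MortgageTax", 80.0, True),
    ("Multimodal mix", 92.9, False),  # backfilled from GPT-5
]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

mean_all = mean(p for _, p, _ in rows)
mean_eligible = mean(p for _, p, eligible in rows if eligible)

print(round(mean_all, 1))       # 80.7, the figure shown for the section
print(round(mean_eligible, 1))  # 76.7, eligible rows only
```

The four-point gap between the two means shows how much a single backfilled row can move a small category.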

Safety · 1 benchmark · 84.6%

MASK

SL · Safety · Rubric

Whether a model stays honest instead of covertly optimizing against the user.

Rank #3

verified runtime · exact direct

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 91.7%
Percentile: 84.6%
Last updated: recent
Eligibility: headline eligible

84.6% percentile inside its fair comparison set

Raw benchmark value: 91.7%

Embeddings / retrieval · 1 benchmark · 100%

Retrieval

MTEB · Embeddings / retrieval · Retrieval

It is one of the few direct signals for retrieval stacks, where embedding quality matters more than chat style.

Rank #2 · Source label: gpt-5

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: MTEB
Raw value: 58.8 ndcg
Percentile: 100%
Last updated: aging
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Embedding endpoint score. Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

100% percentile inside its fair comparison set

Raw benchmark value: 58.8 ndcg

Multilingual · 16 benchmarks · 94.6%

Text Arena · Chinese

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Chinese leaderboard.

Rank #18 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,508
Percentile: 94.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: chinese. Source rank: #22. Votes: 2204. Organization: openai. License: Proprietary.

94.2% percentile inside its fair comparison set

Raw benchmark value: 1,508 · CI 1,494 - 1,522

Text Arena · French

AR · Multilingual · Human

Observed user preference in Arena's Text Arena French leaderboard.

Rank #14 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,499
Percentile: 94%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: french. Source rank: #16. Votes: 1454. Organization: openai. License: Proprietary.

94% percentile inside its fair comparison set

Raw benchmark value: 1,499 · CI 1,481 - 1,518

Text Arena · German

AR · Multilingual · Human

Observed user preference in Arena's Text Arena German leaderboard.

Rank #10 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,478
Percentile: 96.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: german. Source rank: #13. Votes: 677. Organization: openai. License: Proprietary.

96.2% percentile inside its fair comparison set

Raw benchmark value: 1,478 · CI 1,454 - 1,502

Text Arena · Japanese

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Japanese leaderboard.

Rank #8 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,476
Percentile: 96.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: japanese. Source rank: #10. Votes: 371. Organization: openai. License: Proprietary.

96.6% percentile inside its fair comparison set

Raw benchmark value: 1,476 · CI 1,443 - 1,510

Text Arena · Korean

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Korean leaderboard.

Rank #19 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,428
Percentile: 91.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: korean. Source rank: #25. Votes: 682. Organization: openai. License: Proprietary.

91.3% percentile inside its fair comparison set

Raw benchmark value: 1,428 · CI 1,403 - 1,453

Text Arena · Russian

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Russian leaderboard.

Rank #8 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,494
Percentile: 97.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: russian. Source rank: #10. Votes: 4509. Organization: openai. License: Proprietary.

97.6% percentile inside its fair comparison set

Raw benchmark value: 1,494 · CI 1,484 - 1,504

Text Arena · Spanish

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Spanish leaderboard.

Rank #13 · Source label: gpt-5.4

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,469
Percentile: 94.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: spanish. Source rank: #15. Votes: 1216. Organization: openai. License: Proprietary.

94.4% percentile inside its fair comparison set

Raw benchmark value: 1,469 · CI 1,450 - 1,488

Text Arena · Chinese · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Chinese leaderboard.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,519
Percentile: 97.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: chinese. Source rank: #13. Votes: 2204. Organization: openai. License: Proprietary.

97.3% percentile inside its fair comparison set

Raw benchmark value: 1,519 · CI 1,506 - 1,533

Text Arena · French · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena French leaderboard.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,494
Percentile: 96.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: french. Source rank: #11. Votes: 1454. Organization: openai. License: Proprietary.

96.3% percentile inside its fair comparison set

Raw benchmark value: 1,494 · CI 1,475 - 1,512

Text Arena · German · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena German leaderboard.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,476
Percentile: 96.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: german. Source rank: #11. Votes: 677. Organization: openai. License: Proprietary.

96.6% percentile inside its fair comparison set

Raw benchmark value: 1,476 · CI 1,453 - 1,500

Text Arena · Japanese · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Japanese leaderboard.

Rank #8 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,481
Percentile: 96.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: japanese. Source rank: #8. Votes: 371. Organization: openai. License: Proprietary.

96.6% percentile inside its fair comparison set

Raw benchmark value: 1,481 · CI 1,447 - 1,515

Text Arena · Korean · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Korean leaderboard.

Rank #11 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,441
Percentile: 95.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: korean. Source rank: #12. Votes: 682. Organization: openai. License: Proprietary.

95.2% percentile inside its fair comparison set

Raw benchmark value: 1,441 · CI 1,417 - 1,466

Text Arena · Russian · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Russian leaderboard.

Rank #7 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,486
Percentile: 97.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: russian. Source rank: #8. Votes: 4509. Organization: openai. License: Proprietary.

97.9% percentile inside its fair comparison set

Raw benchmark value: 1,486 · CI 1,477 - 1,496

Text Arena · Spanish · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Spanish leaderboard.

Rank #18 · Source label: gpt-5.4

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,460
Percentile: 92.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: spanish. Source rank: #20. Votes: 1216. Organization: openai. License: Proprietary.

92.1% percentile inside its fair comparison set

Raw benchmark value: 1,460 · CI 1,441 - 1,478

Vision Arena · Chinese

AR · Multilingual · Human

Observed user preference in Arena's Vision Arena Chinese leaderboard.

Rank #14 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,311
Percentile: 83.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: chinese. Source rank: #17. Votes: 752. Organization: openai. License: Proprietary.

83.1% percentile inside its fair comparison set

Raw benchmark value: 1,311 · CI 1,284 - 1,338

Vision Arena · Chinese · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Vision Arena Chinese leaderboard.

Rank #6 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,350
Percentile: 93.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: chinese. Source rank: #8. Votes: 752. Organization: openai. License: Proprietary.

93.5% percentile inside its fair comparison set

Raw benchmark value: 1,350 · CI 1,324 - 1,377
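The per-language Text Arena rows report both a vote count and a confidence interval, and the intervals widen as votes drop. As a rough sketch, assuming the half-width shrinks approximately with the square root of the vote count (a common statistical rule of thumb, not something this page or Arena states), the product of half-width and `sqrt(votes)` should stay roughly constant. The values below are copied from the seven style-controlled language rows; the helper name is hypothetical.

```python
import math

# (language, votes, CI low, CI high) from the style-controlled Text Arena rows.
rows = [
    ("Chinese", 2204, 1494, 1522),
    ("French", 1454, 1481, 1518),
    ("German", 677, 1454, 1502),
    ("Japanese", 371, 1443, 1510),
    ("Korean", 682, 1403, 1453),
    ("Russian", 4509, 1484, 1504),
    ("Spanish", 1216, 1450, 1488),
]

def scaled_half_width(votes: int, low: int, high: int) -> float:
    # If half-width ~ k / sqrt(votes), this product is roughly the constant k.
    return (high - low) / 2 * math.sqrt(votes)

products = {name: scaled_half_width(v, lo, hi) for name, v, lo, hi in rows}
for name, value in products.items():
    print(f"{name}: {value:.0f}")
```

On these rows the products fall in a fairly narrow band, which suggests that low-vote rows such as Japanese (371 votes) mainly differ in sample size and should be compared with correspondingly more caution.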

Source links and registry checks

official · OpenAI models docs · Jun 20, 2026

official · Arena (42 links) · Jun 20, 2026

official · Artificial Analysis · Jun 20, 2026

official · LiveBench · Jun 20, 2026

Chat / text · 38 benchmarks · 76.3%

Intelligence Index

AA · Chat / text · Combined

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #67 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 28
Percentile: 83.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `intelligenceIndex`.

83.3% percentile inside its fair comparison set

Raw benchmark value: 28

AA-Omniscience accuracy

AA · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #22 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 36.8%
Percentile: 93%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `omniscienceAccuracy`.

93% percentile inside its fair comparison set

Raw benchmark value: 36.8%

AA-Omniscience non-hallucination

AA · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #190 · Source label: GPT-5.4 (xhigh)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 11.4%
Percentile: 36.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `omniscienceNonHallucination`.

36.6% percentile inside its fair comparison set

Raw benchmark value: 11.4%

IFBench

AA · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #100 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 48.4%
Percentile: 68.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `ifbench`.

68.6% percentile inside its fair comparison set

Raw benchmark value: 48.4%

Blended price

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #246 · Source label: GPT-5.4 (low)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: $5.9 /1M tokens
Percentile: 11.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `price1mBlended0To3To1`.

11.2% percentile inside its fair comparison set

Raw benchmark value: $5.9 /1M tokens

Input price

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #244 · Source label: GPT-5.4 (low)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: $2.6 /1M input tokens
Percentile: 12%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `price1mInputTokens`.

12% percentile inside its fair comparison set

Raw benchmark value: $2.6 /1M input tokens

Output price

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #257 · Source label: GPT-5.4 (low)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: $15.8 /1M output tokens
Percentile: 7.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `price1mOutputTokens`.

7.2% percentile inside its fair comparison set

Raw benchmark value: $15.8 /1M output tokens
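The three price rows can be cross-checked against each other. The source field name `price1mBlended0To3To1` suggests a 3:1 input-to-output token weighting; treating that reading as an assumption, a minimal sketch of the blend looks like this. The function name and default weights are hypothetical, and the inputs are the rounded figures shown on this page.

```python
def blended_price(input_price: float, output_price: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average price per 1M tokens for an assumed input:output mix."""
    total = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total

# Prices per 1M tokens from the Input price and Output price rows.
price = blended_price(2.6, 15.8)
print(round(price, 1))  # 5.9, matching the Blended price row
```

Changing the weights shows how sensitive the blended figure is to workload shape: an output-heavy mix moves the result toward the $15.8 output price.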

Output Speed

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #79 · Source label: GPT-5.4 (low)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 121 tokens/s
Percentile: 62.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `medianOutputTokensPerSecond`.

62.9% percentile inside its fair comparison set

Raw benchmark value: 121 tokens/s

Time to first token

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #210 · Source label: GPT-5.4 (xhigh)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 122.24s
Percentile: 0.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `medianTimeToFirstTokenSeconds`.

0.5% percentile inside its fair comparison set

Raw benchmark value: 122.24s

Time to first answer token

AA · Chat / text · Speed / cost

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #210 · Source label: GPT-5.4 (xhigh)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 122.24s
Percentile: 0.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `medianTimeToFirstAnswerTokenSeconds`.

0.5% percentile inside its fair comparison set

Raw benchmark value: 122.24s

Openness Index

AA · Chat / text · Combined

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #187 · Source label: GPT-5 (minimal)

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 6
Percentile: 7.5%
Last updated: recent
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Parsed from Artificial Analysis public leaderboard field `opennessBreakdown.opennessIndex`. Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

7.5% percentile inside its fair comparison set

Raw benchmark value: 6

Text Arena

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,478
Percentile: 97.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: overall. Source rank: #11. Votes: 40959. Organization: openai. License: Proprietary.

97.5% percentile inside its fair comparison set

Raw benchmark value: 1,478 · CI 1,474 - 1,482
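Arena scores are easier to read as head-to-head expectations than as raw points. The page does not state the rating scale, so the sketch below assumes the conventional Elo logistic curve (base 10, 400-point scale); under that assumption it converts a score gap into an expected win rate. The function name is hypothetical.

```python
def expected_win_rate(score_a: float, score_b: float, scale: float = 400.0) -> float:
    """Expected win rate of A over B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / scale))

# Upper vs lower bound of the confidence interval reported for this row.
p = expected_win_rate(1482, 1474)
print(f"{p:.3f}")  # 0.512
```

Under this assumption the full 8-point width of the interval corresponds to only about a one-percentage-point shift in expected win rate, so small score gaps between neighboring ranks carry little practical weight.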

Text Arena · Creative Writing

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #18 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,447
Percentile: 94.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: creative_writing. Source rank: #24. Votes: 6611. Organization: openai. License: Proprietary.

94.7% percentile inside its fair comparison set

Raw benchmark value: 1,447 · CI 1,439 - 1,455

Text Arena · English

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #14 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,481
Percentile: 96%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: english. Source rank: #16. Votes: 19173. Organization: openai. License: Proprietary.

96% percentile inside its fair comparison set

Raw benchmark value: 1,481 · CI 1,476 - 1,487

Text Arena · Exclude Ties

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #10 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,488
Percentile: 97.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: exclude_ties. Source rank: #12. Votes: 31419. Organization: openai. License: Proprietary.

97.2% percentile inside its fair comparison set

Raw benchmark value: 1,488 · CI 1,482 - 1,493

Text Arena · Hard Prompts

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #10 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,500
Percentile: 97.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: hard_prompts. Source rank: #12. Votes: 26098. Organization: openai. License: Proprietary.

97.2% percentile inside its fair comparison set

Raw benchmark value: 1,500 · CI 1,495 - 1,505

Text Arena · Hard Prompts English

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #16 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,498
Percentile: 95.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: hard_prompts_english. Source rank: #19. Votes: 12813. Organization: openai. License: Proprietary.

95.4% percentile inside its fair comparison set

Raw benchmark value: 1,498 · CI 1,492 - 1,505

Text Arena · Instruction Following

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #10 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,477
Percentile: 97.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: instruction_following. Source rank: #12. Votes: 13343. Organization: openai. License: Proprietary.

97.2% percentile inside its fair comparison set

Raw benchmark value: 1,477 · CI 1,471 - 1,483

Text Arena · Longer Query

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #13 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,486
Percentile: 96.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: longer_query. Source rank: #17. Votes: 16701. Organization: openai. License: Proprietary.

96.1% percentile inside its fair comparison set

Raw benchmark value: 1,486 · CI 1,480 - 1,492

Text Arena · Multi Turn

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,492
Percentile: 97.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: multi_turn. Source rank: #11. Votes: 7580. Organization: openai. License: Proprietary.

97.5% percentile inside its fair comparison set

Raw benchmark value: 1,492 · CI 1,485 - 1,500

Text Arena · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,470
Percentile: 97.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: overall. Source rank: #12. Votes: 40959. Organization: openai. License: Proprietary.

97.5% percentile inside its fair comparison set

Raw benchmark value: 1,470 · CI 1,466 - 1,474

Text Arena · Creative Writing · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #17 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,442
Percentile: 95%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: creative_writing. Source rank: #21. Votes: 6611. Organization: openai. License: Proprietary.

95% percentile inside its fair comparison set

Raw benchmark value: 1,442 · CI 1,434 - 1,451

Text Arena · English · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #14 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,470
Percentile: 96%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: english. Source rank: #17. Votes: 19173. Organization: openai. License: Proprietary.

96% percentile inside its fair comparison set

Raw benchmark value: 1,470 · CI 1,464 - 1,475

Text Arena · Exclude Ties · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,475
Percentile: 97.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: exclude_ties. Source rank: #12. Votes: 31419. Organization: openai. License: Proprietary.

97.5% percentile inside its fair comparison set

Raw benchmark value: 1,475 · CI 1,469 - 1,480

Text Arena · Hard Prompts · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #4 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,487
Percentile: 99.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: hard_prompts. Source rank: #6. Votes: 26098. Organization: openai. License: Proprietary.

99.1% percentile inside its fair comparison set

Raw benchmark value: 1,487 · CI 1,482 - 1,493

Text Arena · Hard Prompts English · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #10 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,482
Percentile: 97.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: hard_prompts_english. Source rank: #12. Votes: 12813. Organization: openai. License: Proprietary.

97.2% percentile inside its fair comparison set

Raw benchmark value: 1,482 · CI 1,476 - 1,489

Text Arena · Instruction Following · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,473
Percentile: 97.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: instruction_following. Source rank: #11. Votes: 13343. Organization: openai. License: Proprietary.

97.5% percentile inside its fair comparison set

Raw benchmark value: 1,473 · CI 1,467 - 1,479

Text Arena · Longer Query · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #14 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,476
Percentile: 95.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: longer_query. Source rank: #18. Votes: 16701. Organization: openai. License: Proprietary.

95.7% percentile inside its fair comparison set

Raw benchmark value: 1,476 · CI 1,470 - 1,482

Text Arena · Multi Turn · No Style Control

AR · Chat / text · Human

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #8 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,480
Percentile: 97.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: multi_turn. Source rank: #11. Votes: 7580. Organization: openai. License: Proprietary.

97.8% percentile inside its fair comparison set

Raw benchmark value: 1,480 · CI 1,472 - 1,487

Instruction following

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #16 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 65%
Percentile: 86.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: IF. Tasks scored: 4.

86.1% percentile inside its fair comparison set

Raw benchmark value: 65%

Language

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #9 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 82.6%
Percentile: 92.6%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Language. Tasks scored: 3.

92.6% percentile inside its fair comparison set

Raw benchmark value: 82.6%

Paraphrase

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #20 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 62.4%
Percentile: 82.4%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: paraphrase. Category: IF.

82.4% percentile inside its fair comparison set

Raw benchmark value: 62.4%

Simplify

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #25 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 59.2%
Percentile: 77.8%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: simplify. Category: IF.

77.8% percentile inside its fair comparison set

Raw benchmark value: 59.2%

Story generation

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #8 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 73%
Percentile: 93.5%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: story_generation. Category: IF.

93.5% percentile inside its fair comparison set

Raw benchmark value: 73%

Summarize

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #25 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 65.2%
Percentile: 77.8%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: summarize. Category: IF.

77.8% percentile inside its fair comparison set

Raw benchmark value: 65.2%

Connections

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #16 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 99%
Percentile: 86.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: connections. Category: Language.

86.1% percentile inside its fair comparison set

Raw benchmark value: 99%

Plot unscrambling

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #10 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 65.9%
Percentile: 91.7%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: plot_unscrambling. Category: Language.

91.7% percentile inside its fair comparison set

Raw benchmark value: 65.9%

Typos

LB · Chat / text · Objective

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Rank #16 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 82%
Percentile: 86%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: typos. Category: Language.

86% percentile inside its fair comparison set

Raw benchmark value: 82%

Coding · 27 benchmarks · 74.9%

Terminal-Bench Hard

AA · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #26 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 37.9%
Percentile: 91.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `terminalbenchHard`.

91.7% percentile inside its fair comparison set

Raw benchmark value: 37.9%

SciCode

AA · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #15 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 47.1%
Percentile: 96.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `scicode`.

96.2% percentile inside its fair comparison set

Raw benchmark value: 47.1%

Coding Index

AA · Coding · Combined

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #4 · Source label: GPT-5.4 (xhigh)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 71
Percentile: 96%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `codingIndex`.

96% percentile inside its fair comparison set

Raw benchmark value: 71

Agentic Index

AA · Coding · Combined

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #5 · Source label: GPT-5.4 (xhigh)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 41
Percentile: 91.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `agenticIndex`.

91.3% percentile inside its fair comparison set

Raw benchmark value: 41

Code Arena

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #20 · Source label: gpt-5.4-high (codex-harness)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,457
Percentile: 74%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high (codex-harness)`. Category: overall. Source rank: #25. Votes: 1482. Organization: openai. License: Proprietary.

74% percentile inside its fair comparison set

Raw benchmark value: 1,457 · CI 1,440 - 1,474

WebDev Arena

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #20 · Source label: gpt-5.4-high (codex-harness)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,457
Percentile: 74%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high (codex-harness)`. Category: webdev. Source rank: #25. Votes: 1482. Organization: openai. License: Proprietary.

74% percentile inside its fair comparison set

Raw benchmark value: 1,457 · CI 1,440 - 1,474

Code Arena · Webdev Html

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #17 · Source label: gpt-5.4-medium (codex-harness)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,473
Percentile: 78.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-medium (codex-harness)`. Category: webdev-html. Source rank: #23. Votes: 165. Organization: openai. License: Proprietary.

78.1% percentile inside its fair comparison set

Raw benchmark value: 1,473 · CI 1,425 - 1,520
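The interval on this row (1,425 - 1,520, from 165 votes) is much wider than the intervals on Arena rows backed by thousands of votes. The sketch below compares interval widths for three rows copied from this page. The helper `ci_width` and the tuple layout are hypothetical, and the page does not state how Arena derives its intervals.

```python
def ci_width(low, high):
    """Width of a reported confidence interval, in rating points."""
    return high - low

# (category, CI low, CI high, votes), copied from rows on this page.
rows = [
    ("webdev-html", 1425, 1520, 165),
    ("webdev-react", 1431, 1467, 1322),
    ("english", 1464, 1475, 19173),
]

# Widest interval first, so thinly sampled rows stand out.
for category, low, high, votes in sorted(rows, key=lambda r: ci_width(r[1], r[2]), reverse=True):
    print(f"{category}: width {ci_width(low, high)} from {votes} votes")
```

A wide interval means the rank on that row is less settled than the point value alone suggests.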

Code Arena · Webdev React

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #20 · Source label: gpt-5.4-high (codex-harness)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,449
Percentile: 67.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high (codex-harness)`. Category: webdev-react. Source rank: #25. Votes: 1322. Organization: openai. License: Proprietary.

67.8% percentile inside its fair comparison set

Raw benchmark value: 1,449 · CI 1,431 - 1,467

Code Migration

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #7 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 35%
Percentile: 72.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: code-migration; provider: OpenAI.

72.7% percentile inside its fair comparison set

Raw benchmark value: 35% · CI 26.9% - 43%

LiveCodeBench

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #23 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 84.1%
Percentile: 75.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: lcb; provider: OpenAI.

75.6% percentile inside its fair comparison set

Raw benchmark value: 84.1% · CI 82.1% - 86.2%

ProgramBench

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #5 · Source label: openai/gpt-5.4-2026-03-05-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 0.5%
Percentile: 90%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: programbench; provider: OpenAI.

90% percentile inside its fair comparison set

Raw benchmark value: 0.5% · CI 0% - 1.5%

SWE-bench Verified

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #8 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 78.2%
Percentile: 87%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: swebench; provider: OpenAI.

87% percentile inside its fair comparison set

Raw benchmark value: 78.2% · CI 74.6% - 81.8%

Vibe Code Bench v1.1

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #5 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 67.4%
Percentile: 91.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: vibe-code; provider: OpenAI.

91.8% percentile inside its fair comparison set

Raw benchmark value: 67.4% · CI 57.9% - 76.9%

Text Arena · Coding

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #12 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,521
Percentile: 96.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: coding. Source rank: #16. Votes: 10857. Organization: openai. License: Proprietary.

96.6% percentile inside its fair comparison set

Raw benchmark value: 1,521 · CI 1,514 - 1,527

Text Arena · Coding · No Style Control

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #11 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,495
Percentile: 96.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: coding. Source rank: #15. Votes: 10857. Organization: openai. License: Proprietary.

96.9% percentile inside its fair comparison set

Raw benchmark value: 1,495 · CI 1,488 - 1,502

IOI

VALS-AI · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #2 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 67.8%
Percentile: 97.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: ioi; provider: OpenAI.

97.7% percentile inside its fair comparison set

Raw benchmark value: 67.8% · CI 48.5% - 87.2%

Code Arena · Image To Webdev

AR · Coding · Human

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #13 · Source label: gpt-5.4

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,435
Percentile: 29.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: image_to_webdev. Source rank: #16. Votes: 1220. Organization: openai. License: Proprietary.

29.4% percentile inside its fair comparison set

Raw benchmark value: 1,435 · CI 1,417 - 1,453

HiL-Bench

SL · Coding · Rubric

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #5

verified runtime · exact direct

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 9.3%
Percentile: 20%
Last updated: recent
Eligibility: headline eligible

20% percentile inside its fair comparison set

Raw benchmark value: 9.3%

Terminal-Bench 2.0

OFF · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #2

Official company result · manual verified · manual verified

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Provider official evals
Raw value: 75.1%
Percentile: 83.3%
Last updated: aging
Eligibility: headline eligible

Verified against OpenAI's GPT-5.4 launch page dated March 5, 2026. Uses the published Terminal-Bench 2.0 figure. Checked 2026-04-29. Verification: manual_public_page_verification.

83.3% percentile inside its fair comparison set

Raw benchmark value: 75.1%

Agentic coding

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #54 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 46.7%
Percentile: 50.9%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Agentic Coding. Tasks scored: 3.

50.9% percentile inside its fair comparison set

Raw benchmark value: 46.7%

Coding

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #28 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 77.5%
Percentile: 75%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Coding. Tasks scored: 2.

75% percentile inside its fair comparison set

Raw benchmark value: 77.5%

JavaScript

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #30 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 45%
Percentile: 73.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: javascript. Category: Agentic Coding.

73.1% percentile inside its fair comparison set

Raw benchmark value: 45%

TypeScript

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #26 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 45%
Percentile: 76.6%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: typescript. Category: Agentic Coding.

76.6% percentile inside its fair comparison set

Raw benchmark value: 45%

Python

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #74 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 50%
Percentile: 32.4%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: python. Category: Agentic Coding.

32.4% percentile inside its fair comparison set

Raw benchmark value: 50%

Coding generation

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #53 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 74.6%
Percentile: 52.8%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: code_generation. Category: Coding.

52.8% percentile inside its fair comparison set

Raw benchmark value: 74.6%

Coding completion

LB · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #24 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 76.1%
Percentile: 78.7%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: code_completion. Category: Coding.

78.7% percentile inside its fair comparison set

Raw benchmark value: 76.1%

Terminal-Bench 2.0

TERMINAL-BENCH · Coding · Objective

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Rank #10 · Source label: gpt-5

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Terminal-Bench
Raw value: 49.6%
Percentile: 73.3%
Last updated: archived
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

73.3% percentile inside its fair comparison set

Raw benchmark value: 49.6%
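This row is a fallback identity that stays visible for context but is excluded from default ranking, whereas most rows are marked "headline eligible". The sketch below shows that filter, assuming eligibility is stored as the string shown in each drilldown. The record layout and the name `headline_rows` are hypothetical; the two records copy the Terminal-Bench 2.0 rows on this page.

```python
HEADLINE = "headline eligible"

# Two Terminal-Bench 2.0 rows from this page, as simple records.
rows = [
    {"benchmark": "Terminal-Bench 2.0", "source": "Provider official evals",
     "raw_value": 75.1, "eligibility": HEADLINE},
    {"benchmark": "Terminal-Bench 2.0", "source": "Terminal-Bench",
     "raw_value": 49.6,
     "eligibility": "Fallback benchmark identity is visible for context "
                    "but excluded from default ranking."},
]

def headline_rows(records):
    """Keep only rows whose eligibility string marks them as headline eligible."""
    return [r for r in records if r["eligibility"] == HEADLINE]

for r in headline_rows(rows):
    print(r["source"], r["raw_value"])
```

Under this assumption only the provider-published 75.1% figure would feed a default ranking; the 49.6% row backfilled from `gpt-5` would be shown but not counted.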

Reasoning / math / science · 22 benchmarks · 85.3%

Humanity's Last Exam

AA · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #91 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 10.6%
Percentile: 75.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `hle`.

75.7% percentile inside its fair comparison set

Raw benchmark value: 10.6%

GPQA

AA · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #107 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 74.8%
Percentile: 71.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `gpqa`.

71.7% percentile inside its fair comparison set

Raw benchmark value: 74.8%

CritPt

AA · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #61 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 0.6%
Percentile: 80.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `critpt`.

80.1% percentile inside its fair comparison set

Raw benchmark value: 0.6%

ProofBench

VALS-AI · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #3 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 56%
Percentile: 94.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: proof_bench; provider: OpenAI.

94.3% percentile inside its fair comparison set

Raw benchmark value: 56% · CI 46.3% - 65.7%

GPQA Diamond

VALS-AI · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #9 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 91.7%
Percentile: 93.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: gpqa; provider: OpenAI.

93.3% percentile inside its fair comparison set

Raw benchmark value: 91.7% · CI 87.9% - 95.4%

MMLU Pro

VALS-AI · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #14 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 87.5%
Percentile: 85.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: mmlu_pro; provider: OpenAI.

85.4% percentile inside its fair comparison set

Raw benchmark value: 87.5% · CI 86.7% - 88.3%

Text Arena · Math

AR · Reasoning / math / science · Human

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #5 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,503
Percentile: 98.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: math. Source rank: #6. Votes: 2285. Organization: openai. License: Proprietary.

98.7% percentile inside its fair comparison set

Raw benchmark value: 1,503 · CI 1,490 - 1,515

Text Arena · Math · No Style Control

AR · Reasoning / math / science · Human

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #5 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,497
Percentile: 98.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: math. Source rank: #6. Votes: 2285. Organization: openai. License: Proprietary.

98.7% percentile inside its fair comparison set

Raw benchmark value: 1,497 · CI 1,484 - 1,510

TutorBench

SL · Reasoning / math / science · Rubric

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #1 · Source label: gpt-5.4-pro-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 56.6%
Percentile: 100%
Last updated: recent
Eligibility: headline eligible

100% percentile inside its fair comparison set

Raw benchmark value: 56.6%

MultiNRC

SL · Reasoning / math / science · Rubric

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #2

verified runtime · exact direct

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 62.3%
Percentile: 90%
Last updated: recent
Eligibility: headline eligible

90% percentile inside its fair comparison set

Raw benchmark value: 62.3%

EnigmaEval

SL · Reasoning / math / science · Rubric

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #5 · Source label: gpt-5

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 64%
Percentile: 80%
Last updated: aging
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

80% percentile inside its fair comparison set

Raw benchmark value: 64%

Humanity's Last Exam

OFF · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #5

Official company result · manual verified · manual verified

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Provider official evals
Raw value: 39.8%
Percentile: 42.9%
Last updated: aging
Eligibility: headline eligible

Verified against OpenAI's GPT-5.4 launch page dated March 5, 2026. Uses the published no-tools Humanity's Last Exam figure. Checked 2026-04-29. Verification: manual_public_page_verification.

42.9% percentile inside its fair comparison set

Raw benchmark value: 39.8%

Mathematics

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #12 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 90%
Percentile: 89.8%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Mathematics. Tasks scored: 4.

89.8% percentile inside its fair comparison set

Raw benchmark value: 90%

Reasoning

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 85.7%
Percentile: 92.6%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Reasoning. Tasks scored: 4.

92.6% percentile inside its fair comparison set

Raw benchmark value: 85.7%

AMPS Hard

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #22 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 98%
Percentile: 94.4%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: AMPS_Hard. Category: Mathematics.

94.4% percentile inside its fair comparison set

Raw benchmark value: 98%

Integrals with game

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #16 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 77%
Percentile: 86.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: integrals_with_game. Category: Mathematics.

86.1% percentile inside its fair comparison set

Raw benchmark value: 77%

Math competition

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #29 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 94.1%
Percentile: 74.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: math_comp. Category: Mathematics.

74.1% percentile inside its fair comparison set

Raw benchmark value: 94.1%

Olympiad

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #22 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 88.8%
Percentile: 80.6%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: olympiad. Category: Mathematics.

80.6% percentile inside its fair comparison set

Raw benchmark value: 88.8%

Theory of mind

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #1 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 84.6%
Percentile: 100%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: theory_of_mind. Category: Reasoning.

100% percentile inside its fair comparison set

Raw benchmark value: 84.6%

Zebra puzzle

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #16 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 90%
Percentile: 86%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: zebra_puzzle. Category: Reasoning.

86% percentile inside its fair comparison set

Raw benchmark value: 90%

Spatial

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #24 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 98%
Percentile: 88.9%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: spatial. Category: Reasoning.

88.9% percentile inside its fair comparison set

Raw benchmark value: 98%

Logic with navigation

LB · Reasoning / math / science · Objective

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Rank #30 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 68%
Percentile: 73.1%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: logic_with_navigation. Category: Reasoning.

73.1% percentile inside its fair comparison set

Raw benchmark value: 68%

Professional reasoning · 34 benchmarks · 85.7%

GDPval-AA

AA · Professional reasoning · Rubric

Agentic performance on economically valuable work tasks.

Rank #5 · Source label: GPT-5.4 (xhigh)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 1,401
Percentile: 91.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `gdpvalBreakdown.elo`.

91.3% percentile inside its fair comparison set

Raw benchmark value: 1,401

APEX-Agents-AA

AA · Professional reasoning · Objective

Long-horizon agentic task completion.

Rank #3 · Source label: GPT-5.4 (xhigh)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 33.3%
Percentile: 91.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `apexAgents`.

91.7% percentile inside its fair comparison set

Raw benchmark value: 33.3%

SkillsBench

VALS-AI · Professional reasoning · Objective

Applied professional skills tasks.

Rank #5 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 51.7%
Percentile: 60%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: skillsbench; provider: OpenAI.

60% percentile inside its fair comparison set

Raw benchmark value: 51.7% (CI 42.5% - 61%)

Harvey's Legal Agent Benchmark

VALS-AI · Professional reasoning · Objective

Completing legal work with documents, spreadsheets, presentations, and file-system tools.

Rank #14 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 0%
Percentile: 15.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: hlab; provider: OpenAI.

15.4% percentile inside its fair comparison set

Raw benchmark value: 0% (CI 0% - 0%)

LegalBench

VALS-AI · Professional reasoning · Objective

Academic legal reasoning tasks.

Rank #6 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 86%
Percentile: 94.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: legal_bench; provider: OpenAI.

94.4% percentile inside its fair comparison set

Raw benchmark value: 86% (CI 85.2% - 86.9%)

TaxEval v2

VALS-AI · Professional reasoning · Objective

Answer quality on tax questions and responses.

Rank #27 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 74%
Percentile: 72.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: tax_eval_v2; provider: OpenAI.

72.5% percentile inside its fair comparison set

Raw benchmark value: 74% (CI 72.3% - 75.7%)

MedCode

VALS-AI · Professional reasoning · Objective

Medical billing support and coding tasks.

Rank #24 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 41.3%
Percentile: 54.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: medcode; provider: OpenAI.

54.9% percentile inside its fair comparison set

Raw benchmark value: 41.3% (CI 37.1% - 45.5%)

MedScribe

VALS-AI · Professional reasoning · Objective

Administrative documentation support for doctors.

Rank #25 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 77.5%
Percentile: 52%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: medscribe; provider: OpenAI.

52% percentile inside its fair comparison set

Raw benchmark value: 77.5% (CI 71% - 84%)

Text Arena · Expert

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena expert leaderboard.

Rank #5 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,524
Percentile: 98.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: expert. Source rank: #7. Votes: 3597. Organization: openai. License: Proprietary.

98.5% percentile inside its fair comparison set

Raw benchmark value: 1,524 (CI 1,513 - 1,534)

Text Arena · Industry Business And Management And Financial Operations

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_business_and_management_and_financial_operations leaderboard.

Rank #10 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,483
Percentile: 97.2%
Last updated: recent
Eligibility: headline eligible

97.2% percentile inside its fair comparison set

Raw benchmark value: 1,483 (CI 1,475 - 1,490)

Text Arena · Industry Entertainment And Sports And Media

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_entertainment_and_sports_and_media leaderboard.

Rank #14 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,447
Percentile: 96%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_entertainment_and_sports_and_media. Source rank: #17. Votes: 8353. Organization: openai. License: Proprietary.

96% percentile inside its fair comparison set

Raw benchmark value: 1,447 (CI 1,440 - 1,455)

Text Arena · Industry Legal And Government

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_legal_and_government leaderboard.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,489
Percentile: 97.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_legal_and_government. Source rank: #10. Votes: 3184. Organization: openai. License: Proprietary.

97.3% percentile inside its fair comparison set

Raw benchmark value: 1,489 (CI 1,477 - 1,500)

Text Arena · Industry Life And Physical And Social Science

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_life_and_physical_and_social_science leaderboard.

Rank #19 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,489
Percentile: 94.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_life_and_physical_and_social_science. Source rank: #22. Votes: 6649. Organization: openai. License: Proprietary.

94.4% percentile inside its fair comparison set

Raw benchmark value: 1,489 (CI 1,481 - 1,497)

Text Arena · Industry Mathematical

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_mathematical leaderboard.

Rank #7 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,503
Percentile: 98.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_mathematical. Source rank: #8. Votes: 2211. Organization: openai. License: Proprietary.

98.1% percentile inside its fair comparison set

Raw benchmark value: 1,503 (CI 1,490 - 1,517)

Text Arena · Industry Medicine And Healthcare

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_medicine_and_healthcare leaderboard.

Rank #38 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,473
Percentile: 87.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_medicine_and_healthcare. Source rank: #48. Votes: 2988. Organization: openai. License: Proprietary.

87.5% percentile inside its fair comparison set

Raw benchmark value: 1,473 (CI 1,461 - 1,485)

Text Arena · Industry Software And IT Services

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_software_and_it_services leaderboard.

Rank #17 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,506
Percentile: 95.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_software_and_it_services. Source rank: #21. Votes: 15917. Organization: openai. License: Proprietary.

95.1% percentile inside its fair comparison set

Raw benchmark value: 1,506 (CI 1,500 - 1,512)

Text Arena · Industry Writing And Literature And Language

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_writing_and_literature_and_language leaderboard.

Rank #8 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,469
Percentile: 97.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_writing_and_literature_and_language. Source rank: #10. Votes: 9946. Organization: openai. License: Proprietary.

97.8% percentile inside its fair comparison set

Raw benchmark value: 1,469 (CI 1,462 - 1,476)

Text Arena · Expert · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena expert leaderboard.

Rank #5 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,512
Percentile: 98.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: expert. Source rank: #6. Votes: 3597. Organization: openai. License: Proprietary.

98.5% percentile inside its fair comparison set

Raw benchmark value: 1,512 (CI 1,501 - 1,523)

Text Arena · Industry Business And Management And Financial Operations · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_business_and_management_and_financial_operations leaderboard.

Rank #4 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,476
Percentile: 99.1%
Last updated: recent
Eligibility: headline eligible

99.1% percentile inside its fair comparison set

Raw benchmark value: 1,476 (CI 1,468 - 1,483)

Text Arena · Industry Entertainment And Sports And Media · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_entertainment_and_sports_and_media leaderboard.

Rank #13 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,443
Percentile: 96.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_entertainment_and_sports_and_media. Source rank: #16. Votes: 8353. Organization: openai. License: Proprietary.

96.3% percentile inside its fair comparison set

Raw benchmark value: 1,443 (CI 1,436 - 1,451)

Text Arena · Industry Legal And Government · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_legal_and_government leaderboard.

Rank #7 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,485
Percentile: 98%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_legal_and_government. Source rank: #8. Votes: 3184. Organization: openai. License: Proprietary.

98% percentile inside its fair comparison set

Raw benchmark value: 1,485 (CI 1,473 - 1,496)

Text Arena · Industry Life And Physical And Social Science · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_life_and_physical_and_social_science leaderboard.

Rank #16 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,478
Percentile: 95.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_life_and_physical_and_social_science. Source rank: #20. Votes: 6649. Organization: openai. License: Proprietary.

95.4% percentile inside its fair comparison set

Raw benchmark value: 1,478 (CI 1,470 - 1,486)

Text Arena · Industry Mathematical · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_mathematical leaderboard.

Rank #7 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,498
Percentile: 98.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_mathematical. Source rank: #8. Votes: 2211. Organization: openai. License: Proprietary.

98.1% percentile inside its fair comparison set

Raw benchmark value: 1,498 (CI 1,484 - 1,511)

Text Arena · Industry Medicine And Healthcare · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_medicine_and_healthcare leaderboard.

Rank #34 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,457
Percentile: 88.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_medicine_and_healthcare. Source rank: #37. Votes: 2988. Organization: openai. License: Proprietary.

88.8% percentile inside its fair comparison set

Raw benchmark value: 1,457 (CI 1,446 - 1,469)

Text Arena · Industry Software And IT Services · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_software_and_it_services leaderboard.

Rank #11 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,490
Percentile: 96.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_software_and_it_services. Source rank: #14. Votes: 15917. Organization: openai. License: Proprietary.

96.9% percentile inside its fair comparison set

Raw benchmark value: 1,490 (CI 1,484 - 1,495)

Text Arena · Industry Writing And Literature And Language · No Style Control

AR · Professional reasoning · Human

Observed user preference in Arena's Text Arena industry_writing_and_literature_and_language leaderboard.

Rank #12 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,461
Percentile: 96.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: industry_writing_and_literature_and_language. Source rank: #15. Votes: 9946. Organization: openai. License: Proprietary.

96.6% percentile inside its fair comparison set

Raw benchmark value: 1,461 (CI 1,454 - 1,468)

SAGE

VALS-AI · Professional reasoning · Objective

Student Assessment with Generative Evaluation.

Rank #23 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 43.3%
Percentile: 51.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: sage; provider: OpenAI.

51.1% percentile inside its fair comparison set

Raw benchmark value: 43.3% (CI 37.2% - 49.4%)

PRBench Legal

SL · Professional reasoning · Rubric

Applied legal reasoning on professional-domain tasks.

Rank #7

verified runtime · exact direct

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 44.4%
Percentile: 50%
Last updated: recent
Eligibility: headline eligible

50% percentile inside its fair comparison set

Raw benchmark value: 44.4%

Data analysis

LB · Professional reasoning · Objective

Structured data manipulation and table reasoning accuracy.

Rank #12 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 77%
Percentile: 89.8%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category: Data Analysis. Tasks scored: 3.

89.8% percentile inside its fair comparison set

Raw benchmark value: 77%

Overall

LB · Professional reasoning · Objective

Average objective performance across LiveBench's current public category mix.

Rank #13 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 75.1%
Percentile: 88.9%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Category averages included: 7.

88.9% percentile inside its fair comparison set

Raw benchmark value: 75.1%

Consecutive events

LB · Professional reasoning · Objective

Objective consecutive events score in LiveBench.

Rank #18 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 81.4%
Percentile: 84.3%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: consecutive_events. Category: Data Analysis.

84.3% percentile inside its fair comparison set

Raw benchmark value: 81.4%

Table join

LB · Professional reasoning · Objective

Objective table join score in LiveBench.

Rank #8 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 49.8%
Percentile: 93.5%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: tablejoin. Category: Data Analysis.

93.5% percentile inside its fair comparison set

Raw benchmark value: 49.8%

Table reformat

LB · Professional reasoning · Objective

Objective table reformat score in LiveBench.

Rank #20 · Source label: gpt-5.4-xhigh

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: LiveBench
Raw value: 100%
Percentile: 100%
Last updated: archived
Eligibility: headline eligible

Derived from the official LiveBench website leaderboard table. Task: tablereformat. Category: Data Analysis.

100% percentile inside its fair comparison set

Raw benchmark value: 100%

Poker Agent

VALS-AI · Professional reasoning · Objective

Agent profit in poker-style strategic play.

Rank #3 · Source label: openai/gpt-5-2025-08-07

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 1,103.2 score
Percentile: 93.8%
Last updated: archived
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Parsed from Vals AI BenchmarkView overall scores. Vals slug: poker_agent; provider: unknown. Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

93.8% percentile inside its fair comparison set

Raw benchmark value: 1,103.2 score (CI 1,103.2 score - 1,103.2 score)

Search / tool use · 4 benchmarks · 54.1%

Tau2-Bench Telecom

AA · Search / tool use · Objective

It matters when the model must browse, call tools, and recover useful answers from external systems.

Rank #156 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 35.1%
Percentile: 49.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `tau2`.

49.8% percentile inside its fair comparison set

Raw benchmark value: 35.1%

Search Arena

AR · Search / tool use · Human

It matters when the model must browse, call tools, and recover useful answers from external systems.

Rank #13 · Source label: gpt-5.4-search

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,193
Percentile: 60%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-search`. Category: overall. Source rank: #14. Votes: 56204. Organization: openai. License: Proprietary.

60% percentile inside its fair comparison set

Raw benchmark value: 1,193 (CI 1,187 - 1,200)

Search Arena · No Style Control

AR · Search / tool use · Human

It matters when the model must browse, call tools, and recover useful answers from external systems.

Rank #14 · Source label: gpt-5.4-search

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,195
Percentile: 56.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-search`. Category: overall. Source rank: #14. Votes: 56204. Organization: openai. License: Proprietary.

56.7% percentile inside its fair comparison set

Raw benchmark value: 1,195 (CI 1,189 - 1,200)

BrowseComp

OFF · Search / tool use · Objective

It matters when the model must browse, call tools, and recover useful answers from external systems.

Rank #4

Official company result · manual verified

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Provider official evals
Raw value: 82.7%
Percentile: 50%
Last updated: aging
Eligibility: headline eligible

Verified against OpenAI's GPT-5.4 launch page dated March 5, 2026. Uses the published BrowseComp figure. Checked 2026-04-29. Verification: manual_public_page_verification.

50% percentile inside its fair comparison set

Raw benchmark value: 82.7%

Long context · 2 benchmarks · 73.1%

Long Context Reasoning

AA · Long context · Objective

It checks whether long-context claims survive contact with retrieval, memory, or long-document tasks.

Rank #99 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 47.3%
Percentile: 68.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `lcr`.

68.9% percentile inside its fair comparison set

Raw benchmark value: 47.3%

CorpFin v2

VALS-AI · Long context · Objective

It checks whether long-context claims survive contact with retrieval, memory, or long-document tasks.

Rank #21 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 65.3%
Percentile: 77.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: corp_fin_v2; provider: OpenAI.

77.3% percentile inside its fair comparison set

Raw benchmark value: 65.3% (CI 63.4% - 67.1%)

Vision understanding · 25 benchmarks · 84.2%

MMMU-Pro

AA · Vision understanding · Objective

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #27 · Source label: GPT-5.4 (Non-reasoning)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Artificial Analysis
Raw value: 70.6%
Percentile: 80.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Artificial Analysis public leaderboard field `mmmuPro`.

80.7% percentile inside its fair comparison set

Raw benchmark value: 70.6%

Vision Arena

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #8 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,283
Percentile: 93.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: overall. Source rank: #10. Votes: 12727. Organization: openai. License: Proprietary.

93.6% percentile inside its fair comparison set

Raw benchmark value: 1,283 (CI 1,275 - 1,291)

Vision Arena · Captioning

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #13 · Source label: gpt-5-chat

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,200
Percentile: 57.7%
Last updated: recent
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

57.7% percentile inside its fair comparison set

Raw benchmark value: 1,200 (CI 1,170 - 1,230)

Vision Arena · Creative Writing Vision

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #18 · Source label: gpt-5.4

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,253
Percentile: 69.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: creative_writing_vision. Source rank: #22. Votes: 675. Organization: openai. License: Proprietary.

69.1% percentile inside its fair comparison set

1,253 · Raw benchmark value · CI 1,229 - 1,277

Vision Arena · Diagram

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #5 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,315
Percentile: 94.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: diagram. Source rank: #7. Votes: 3446. Organization: openai. License: Proprietary.

94.3% percentile inside its fair comparison set

1,315 · Raw benchmark value · CI 1,304 - 1,327

Vision Arena · English

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #10 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,278
Percentile: 91.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: english. Source rank: #12. Votes: 5394. Organization: openai. License: Proprietary.

91.7% percentile inside its fair comparison set

1,278 · Raw benchmark value · CI 1,267 - 1,289

Vision Arena · Entity Recognition

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #6 · Source label: gpt-5-high

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,257
Percentile: 87.5%
Last updated: recent
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

87.5% percentile inside its fair comparison set

1,257 · Raw benchmark value · CI 1,224 - 1,289

Vision Arena · Homework

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #5 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,323
Percentile: 94.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: homework. Source rank: #7. Votes: 2005. Organization: openai. License: Proprietary.

94.1% percentile inside its fair comparison set

1,323 · Raw benchmark value · CI 1,308 - 1,337

Vision Arena · Humor

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #14 · Source label: gpt-5.4

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,254
Percentile: 73.5%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: humor. Source rank: #17. Votes: 412. Organization: openai. License: Proprietary.

73.5% percentile inside its fair comparison set

1,254 · Raw benchmark value · CI 1,224 - 1,284

Vision Arena · OCR

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #7 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,303
Percentile: 91.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: ocr. Source rank: #9. Votes: 9292. Organization: openai. License: Proprietary.

91.4% percentile inside its fair comparison set

1,303 · Raw benchmark value · CI 1,295 - 1,311

Vision Arena · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #6 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,301
Percentile: 95.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: overall. Source rank: #8. Votes: 12727. Organization: openai. License: Proprietary.

95.4% percentile inside its fair comparison set

1,301 · Raw benchmark value · CI 1,293 - 1,308

Vision Arena · Captioning · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #14 · Source label: gpt-5-chat

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,215
Percentile: 53.8%
Last updated: recent
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

53.8% percentile inside its fair comparison set

1,215 · Raw benchmark value · CI 1,185 - 1,245

Vision Arena · Creative Writing Vision · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #14 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,274
Percentile: 76.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: creative_writing_vision. Source rank: #18. Votes: 643. Organization: openai. License: Proprietary.

76.4% percentile inside its fair comparison set

1,274 · Raw benchmark value · CI 1,250 - 1,299

Vision Arena · Diagram · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #5 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,320
Percentile: 94.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: diagram. Source rank: #7. Votes: 3446. Organization: openai. License: Proprietary.

94.3% percentile inside its fair comparison set

1,320 · Raw benchmark value · CI 1,308 - 1,332

Vision Arena · English · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #6 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,297
Percentile: 95.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: english. Source rank: #8. Votes: 5394. Organization: openai. License: Proprietary.

95.4% percentile inside its fair comparison set

1,297 · Raw benchmark value · CI 1,286 - 1,308

Vision Arena · Entity Recognition · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #11 · Source label: gpt-5-high

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,261
Percentile: 71.9%
Last updated: recent
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

71.9% percentile inside its fair comparison set

1,261 · Raw benchmark value · CI 1,233 - 1,290

Vision Arena · Homework · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #6 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,328
Percentile: 92.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: homework. Source rank: #7. Votes: 2005. Organization: openai. License: Proprietary.

92.6% percentile inside its fair comparison set

1,328 · Raw benchmark value · CI 1,313 - 1,342

Vision Arena · Humor · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #15 · Source label: gpt-5.4

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,270
Percentile: 71.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: humor. Source rank: #18. Votes: 412. Organization: openai. License: Proprietary.

71.4% percentile inside its fair comparison set

1,270 · Raw benchmark value · CI 1,240 - 1,301

Vision Arena · OCR · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #4 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,314
Percentile: 95.7%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: ocr. Source rank: #6. Votes: 9292. Organization: openai. License: Proprietary.

95.7% percentile inside its fair comparison set

1,314 · Raw benchmark value · CI 1,306 - 1,322
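Each Arena row reports a confidence interval next to its score, which gives one way to compare a style-controlled leaderboard with its "No Style Control" counterpart. The sketch below is a rough reading aid, not a significance test: `parse_ci` and `intervals_overlap` are hypothetical helpers, and the interval strings are copied from the Vision Arena overall and Humor rows above.

```python
def parse_ci(text):
    """Parse a string such as 'CI 1,275 - 1,291' into a (low, high) tuple of floats.

    Thousands separators and percent signs are stripped, so 'CI 86% - 89.1%'
    parses as well.
    """
    body = text.replace("CI", "", 1).strip()
    low_text, high_text = body.split(" - ")
    clean = lambda s: float(s.strip().replace(",", "").replace("%", ""))
    return clean(low_text), clean(high_text)

def intervals_overlap(a, b):
    """Return True when closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Vision Arena overall: style-controlled vs. No Style Control.
overall = parse_ci("CI 1,275 - 1,291")
overall_nsc = parse_ci("CI 1,293 - 1,308")

# Vision Arena · Humor: style-controlled vs. No Style Control.
humor = parse_ci("CI 1,224 - 1,284")
humor_nsc = parse_ci("CI 1,240 - 1,301")

print(intervals_overlap(overall, overall_nsc))
print(intervals_overlap(humor, humor_nsc))
```

With these inputs the overall intervals do not overlap, while the Humor intervals do; the Humor rows also carry far fewer votes (412), which is consistent with their wider intervals.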

MMMU Pro

VALS-AI · Vision understanding · Objective

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #7 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 87.5%
Percentile: 91.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: mmmu; provider: OpenAI.

91.4% percentile inside its fair comparison set

87.5% · Raw benchmark value · CI 86% - 89.1%

VTB

SL · Vision understanding · Rubric

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #1 · Source label: gpt-5.4-2026-03-05 (reasoning effort = high)

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 29.2%
Percentile: 100%
Last updated: recent
Eligibility: headline eligible

100% percentile inside its fair comparison set

29.2% · Raw benchmark value

VISTA

SL · Vision understanding · Rubric

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #3 · Source label: gpt-5

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 79%
Percentile: 92.9%
Last updated: aging
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

92.9% percentile inside its fair comparison set

79% · Raw benchmark value

MMMU-Pro

OFF · Vision understanding · Objective

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #2

Official company result · manual verified

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Provider official evals
Raw value: 81.2%
Percentile: 100%
Last updated: aging
Eligibility: headline eligible

Verified against OpenAI's GPT-5.4 launch page dated March 5, 2026. Uses the published no-tools MMMU Pro figure. Checked 2026-04-29. Verification: manual_public_page_verification.

100% percentile inside its fair comparison set

81.2% · Raw benchmark value

Vision Arena · Creative Writing

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #12 · Source label: gpt-5-chat

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,232
Percentile: 68.8%
Last updated: archived
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

68.8% percentile inside its fair comparison set

1,232 · Raw benchmark value · CI 1,216 - 1,248

Vision Arena · Creative Writing · No Style Control

AR · Vision understanding · Human

It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.

Rank #11 · Source label: gpt-5-chat

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,239
Percentile: 71.9%
Last updated: archived
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

71.9% percentile inside its fair comparison set

1,239 · Raw benchmark value · CI 1,224 - 1,255

Document understanding · 4 benchmarks · 80.7%

Document Arena

AR · Document understanding · Human

It matters when the job is reading PDFs, tables, forms, or mixed-layout documents rather than plain chat.

Rank #8 · Source label: gpt-5.4

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,472
Percentile: 70.8%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: overall. Source rank: #11. Votes: 24400. Organization: openai. License: Proprietary.

70.8% percentile inside its fair comparison set

1,472 · Raw benchmark value · CI 1,465 - 1,479
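The Arena source notes follow a fixed sentence pattern (dataset row, category, source rank, votes, organization, license), so they can be split into fields mechanically. The sketch below assumes that pattern holds for every Arena note; `NOTE_PATTERN` and `parse_arena_note` are hypothetical names, and the sample note is the Document Arena one above.

```python
import re

# Pattern inferred from the notes on this page; other sources use other wording.
NOTE_PATTERN = re.compile(
    r"Parsed from Arena leaderboard dataset row `(?P<row>[^`]+)`\. "
    r"Category: (?P<category>\w+)\. "
    r"Source rank: #(?P<source_rank>\d+)\. "
    r"Votes: (?P<votes>\d+)\. "
    r"Organization: (?P<organization>[^.]+)\. "
    r"License: (?P<license>[^.]+)\."
)

def parse_arena_note(note):
    """Return the note's fields as a dict, or None if the note does not match."""
    match = NOTE_PATTERN.fullmatch(note.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Numeric fields are converted so they can be compared and sorted.
    fields["source_rank"] = int(fields["source_rank"])
    fields["votes"] = int(fields["votes"])
    return fields

note = (
    "Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: overall. "
    "Source rank: #11. Votes: 24400. Organization: openai. License: Proprietary."
)
print(parse_arena_note(note))
```

The page's own rank for this row (#8) and the source rank in the note (#11) are separate fields, so the sketch keeps the latter under its own `source_rank` key rather than treating the two as interchangeable.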

Document Arena · No Style Control

AR · Document understanding · Human

It matters when the job is reading PDFs, tables, forms, or mixed-layout documents rather than plain chat.

Rank #6 · Source label: gpt-5.4

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,474
Percentile: 79.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: overall. Source rank: #9. Votes: 24400. Organization: openai. License: Proprietary.

79.2% percentile inside its fair comparison set

1,474 · Raw benchmark value · CI 1,467 - 1,480

MortgageTax

VALS-AI · Document understanding · Objective

It matters when the job is reading PDFs, tables, forms, or mixed-layout documents rather than plain chat.

Rank #13 · Source label: openai/gpt-5.4-2026-03-05

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Vals AI
Raw value: 68.3%
Percentile: 80%
Last updated: recent
Eligibility: headline eligible

Parsed from Vals AI BenchmarkView overall scores. Vals slug: mortgage_tax; provider: OpenAI.

80% percentile inside its fair comparison set

68.3% · Raw benchmark value · CI 66.5% - 70.1%

Multimodal mix

OC · Document understanding · Objective

It matters when the job is reading PDFs, tables, forms, or mixed-layout documents rather than plain chat.

Rank #3 · Source label: gpt-5

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: OpenCompass
Raw value: 75.4%
Percentile: 92.9%
Last updated: recent
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

92.9% percentile inside its fair comparison set

75.4% · Raw benchmark value

Safety · 1 benchmark · 84.6%

MASK

SL · Safety · Rubric

It tests whether a model stays honest instead of covertly optimizing against the user.

Rank #3

verified runtime · exact direct

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Scale Labs
Raw value: 91.7%
Percentile: 84.6%
Last updated: recent
Eligibility: headline eligible

84.6% percentile inside its fair comparison set

91.7% · Raw benchmark value

Embeddings / retrieval · 1 benchmark · 100%

Retrieval

MTEB · Embeddings / retrieval · Retrieval

It is one of the few direct signals for retrieval stacks, where embedding quality matters more than chat style.

Rank #2 · Source label: gpt-5

backfilled · proxy backfilled · Background only

Raw row drilldown: source row, percentile, last updated, eligibility

Source: MTEB
Raw value: 58.8 ndcg
Percentile: 100%
Last updated: aging
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.

Embedding endpoint score. Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-to-gpt-5.

100% percentile inside its fair comparison set

58.8 ndcg · Raw benchmark value

Multilingual · 16 benchmarks · 94.6%

Text Arena · Chinese

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Chinese leaderboard.

Rank #18 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,508
Percentile: 94.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: chinese. Source rank: #22. Votes: 2204. Organization: openai. License: Proprietary.

94.2% percentile inside its fair comparison set

1,508 · Raw benchmark value · CI 1,494 - 1,522

Text Arena · French

AR · Multilingual · Human

Observed user preference in Arena's Text Arena French leaderboard.

Rank #14 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,499
Percentile: 94%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: french. Source rank: #16. Votes: 1454. Organization: openai. License: Proprietary.

94% percentile inside its fair comparison set

1,499 · Raw benchmark value · CI 1,481 - 1,518

Text Arena · German

AR · Multilingual · Human

Observed user preference in Arena's Text Arena German leaderboard.

Rank #10 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,478
Percentile: 96.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: german. Source rank: #13. Votes: 677. Organization: openai. License: Proprietary.

96.2% percentile inside its fair comparison set

1,478 · Raw benchmark value · CI 1,454 - 1,502

Text Arena · Japanese

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Japanese leaderboard.

Rank #8 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,476
Percentile: 96.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: japanese. Source rank: #10. Votes: 371. Organization: openai. License: Proprietary.

96.6% percentile inside its fair comparison set

1,476 · Raw benchmark value · CI 1,443 - 1,510

Text Arena · Korean

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Korean leaderboard.

Rank #19 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,428
Percentile: 91.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: korean. Source rank: #25. Votes: 682. Organization: openai. License: Proprietary.

91.3% percentile inside its fair comparison set

1,428 · Raw benchmark value · CI 1,403 - 1,453

Text Arena · Russian

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Russian leaderboard.

Rank #8 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,494
Percentile: 97.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: russian. Source rank: #10. Votes: 4509. Organization: openai. License: Proprietary.

97.6% percentile inside its fair comparison set

1,494 · Raw benchmark value · CI 1,484 - 1,504

Text Arena · Spanish

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Spanish leaderboard.

Rank #13 · Source label: gpt-5.4

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,469
Percentile: 94.4%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: spanish. Source rank: #15. Votes: 1216. Organization: openai. License: Proprietary.

94.4% percentile inside its fair comparison set

1,469 · Raw benchmark value · CI 1,450 - 1,488

Text Arena · Chinese · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Chinese leaderboard.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,519
Percentile: 97.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: chinese. Source rank: #13. Votes: 2204. Organization: openai. License: Proprietary.

97.3% percentile inside its fair comparison set

1,519 · Raw benchmark value · CI 1,506 - 1,533

Text Arena · French · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena French leaderboard.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,494
Percentile: 96.3%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: french. Source rank: #11. Votes: 1454. Organization: openai. License: Proprietary.

96.3% percentile inside its fair comparison set

1,494 · Raw benchmark value · CI 1,475 - 1,512

Text Arena · German · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena German leaderboard.

Rank #9 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,476
Percentile: 96.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: german. Source rank: #11. Votes: 677. Organization: openai. License: Proprietary.

96.6% percentile inside its fair comparison set

1,476 · Raw benchmark value · CI 1,453 - 1,500

Text Arena · Japanese · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Japanese leaderboard.

Rank #8 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,481
Percentile: 96.6%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: japanese. Source rank: #8. Votes: 371. Organization: openai. License: Proprietary.

96.6% percentile inside its fair comparison set

1,481 · Raw benchmark value · CI 1,447 - 1,515

Text Arena · Korean · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Korean leaderboard.

Rank #11 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,441
Percentile: 95.2%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: korean. Source rank: #12. Votes: 682. Organization: openai. License: Proprietary.

95.2% percentile inside its fair comparison set

1,441 · Raw benchmark value · CI 1,417 - 1,466

Text Arena · Russian · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Russian leaderboard.

Rank #7 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,486
Percentile: 97.9%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: russian. Source rank: #8. Votes: 4509. Organization: openai. License: Proprietary.

97.9% percentile inside its fair comparison set

1,486 · Raw benchmark value · CI 1,477 - 1,496

Text Arena · Spanish · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Text Arena Spanish leaderboard.

Rank #18 · Source label: gpt-5.4

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,460
Percentile: 92.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4`. Category: spanish. Source rank: #20. Votes: 1216. Organization: openai. License: Proprietary.

92.1% percentile inside its fair comparison set

1,460 · Raw benchmark value · CI 1,441 - 1,478

Vision Arena · Chinese

AR · Multilingual · Human

Observed user preference in Arena's Vision Arena Chinese leaderboard.

Rank #14 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility

Source: Arena
Raw value: 1,311
Percentile: 83.1%
Last updated: recent
Eligibility: headline eligible

Parsed from Arena leaderboard dataset row `gpt-5.4-high`. Category: chinese. Source rank: #17. Votes: 752. Organization: openai. License: Proprietary.

83.1% percentile inside its fair comparison set

1,311 · Raw benchmark value · CI 1,284 - 1,338

Vision Arena · Chinese · No Style Control

AR · Multilingual · Human

Observed user preference in Arena's Vision Arena Chinese leaderboard.

Rank #6 · Source label: gpt-5.4-high

verified runtime · exact alias

Raw row drilldown: source row, percentile, last updated, eligibility