Humanity's Last Exam

Name: Humanity's Last Exam
Creator: Provider official evals

Provider-official HLE results manually verified from public model-card pages.

Source · Provider official evals
Version · 2026-Q2 public provider cards
Scores · 8

Test details

Visible tradeoffs: This is an objective signal, so it is mainly about measurable task performance rather than public taste.

Source · Provider official evals
Metric · Accuracy (%)
Judge · Objective
Direction · higher is better
Group ID · provider_official_hle_2026_q2
Domain · Reasoning / math / science

What it measures vs what it misses

✓ Measures

Hard academic reasoning and knowledge on the full HLE set.

✗ Misses

Independent replication, user preference, and deployment ergonomics.

Why this counts: It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Same-test rule: This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.

What it misses: It still misses product usability, latency, and whether the model stays correct in messy real workflows.
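The same-test rule can be illustrated with a short sketch. The page does not state how the percentile is computed; the rank-based rule used here, 100 × (n − rank) / (n − 1) within a single group, is an assumption that happens to reproduce the percentiles displayed for this eight-score group. The function name `group_percentiles` is hypothetical, the scores are the accuracy figures listed in the leaderboard, and ties are not handled.

```python
def group_percentiles(scores: dict[str, float]) -> dict[str, float]:
    """Rank-based percentile inside one benchmark/version group.

    Assumed rule (not confirmed by the page): the top score gets 100,
    the bottom score gets 0, and ranks in between are spaced evenly.
    Higher accuracy is treated as better. Ties are not handled.
    """
    n = len(scores)
    if n == 0:
        return {}
    if n == 1:
        # Assumption: a lone entry is reported as the top of its group.
        return {name: 100.0 for name in scores}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {
        name: round(100.0 * (n - 1 - position) / (n - 1), 1)
        for position, name in enumerate(ordered)
    }


# Accuracy (%) values from the provider_official_hle_2026_q2 group.
HLE_SCORES = {
    "Claude Opus 4.7": 46.9,
    "Gemini 3.1 Pro": 44.4,
    "GPT-5.5": 41.4,
    "Claude Opus 4.6": 40.0,
    "GPT-5.4": 39.8,
    "Gemini 3 Pro Preview": 37.5,
    "Claude Sonnet 4.6": 33.2,
    "Gemini 2.5 Pro": 21.6,
}

if __name__ == "__main__":
    for model, pct in group_percentiles(HLE_SCORES).items():
        print(f"{model}: {pct}%")
```

Because the rule depends only on rank within the group, adding or removing a model changes every other model's percentile, which is why the figure should not be read as a universal score.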

Leaderboard · this benchmark version

#1 · Claude Opus 4.7

OFF · Apr 29, 2026

Official company result · manually verified

Raw row drilldown (source, percentile, eligibility)

Source URL: https://anthropic.com/claude-opus-4-7-system-card
Percentile: 100%
Last updated: recent
Eligibility: headline eligible
Identity: exact (1.00)

Verified against Anthropic's public Claude Opus 4.7 system card. Uses the published no-tools Humanity's Last Exam figure. Checked 2026-04-29. Verification: manual_public_page_verification.

46.9%

#2 · Gemini 3.1 Pro

OFF · Apr 29, 2026

Official company result · manually verified

Raw row drilldown (source, percentile, eligibility)

Source URL: https://deepmind.google/models/gemini/pro/
Percentile: 85.7%
Last updated: recent
Eligibility: headline eligible
Identity: exact (1.00)

Verified against the current Gemini Pro public model page, which showed no visible page date when checked. Uses the published no-tools Humanity's Last Exam figure for Gemini 3.1 Pro. Checked 2026-04-29. Verification: manual_public_page_verification.

44.4%

#3 · GPT-5.5

OFF · Apr 23, 2026

Official company result · manually verified

Raw row drilldown (source, percentile, eligibility)

Source URL: https://openai.com/index/introducing-gpt-5-5/
Percentile: 71.4%
Last updated: recent
Eligibility: headline eligible
Identity: exact (1.00)

Verified against OpenAI's GPT-5.5 launch page dated April 23, 2026. Uses the published no-tools Humanity's Last Exam figure. Checked 2026-04-29. Verification: manual_public_page_verification.

41.4%

#4 · Claude Opus 4.6

OFF · Apr 29, 2026

Official company result · manually verified · Background only

Raw row drilldown (source, percentile, eligibility)

Source URL: https://anthropic.com/claude-opus-4-6-system-card
Percentile: 57.1%
Last updated: recent
Eligibility: historical_model
Identity: exact (1.00)

Verified against Anthropic's public Claude Opus 4.6 system card. Uses the published no-tools Humanity's Last Exam figure. Checked 2026-04-29. Verification: manual_public_page_verification.

40%

#5 · GPT-5.4

OFF · Mar 5, 2026

Official company result · manually verified

Raw row drilldown (source, percentile, eligibility)

Source URL: https://openai.com/index/introducing-gpt-5-4/
Percentile: 42.9%
Last updated: aging
Eligibility: headline eligible
Identity: exact (1.00)

Verified against OpenAI's GPT-5.4 launch page dated March 5, 2026. Uses the published no-tools Humanity's Last Exam figure. Checked 2026-04-29. Verification: manual_public_page_verification.

39.8%

#6 · Gemini 3 Pro Preview

OFF · Apr 29, 2026

Official company result · manually verified · Background only

Raw row drilldown (source, percentile, eligibility)

Source URL: https://deepmind.google/models/gemini/pro/
Percentile: 28.6%
Last updated: recent
Eligibility: preview_model
Identity: exact (1.00)

Verified against the current Gemini Pro public model page, which showed no visible page date when checked. Uses the published no-tools Humanity's Last Exam figure for Gemini 3 Pro Preview. Checked 2026-04-29. Verification: manual_public_page_verification.

37.5%

#7 · Claude Sonnet 4.6

OFF · Apr 29, 2026

Official company result · manually verified

Raw row drilldown (source, percentile, eligibility)

Source URL: https://anthropic.com/claude-sonnet-4-6-system-card
Percentile: 14.3%
Last updated: recent
Eligibility: headline eligible
Identity: exact (1.00)

Verified against Anthropic's public Claude Sonnet 4.6 system card. Uses the published no-tools Humanity's Last Exam figure. Checked 2026-04-29. Verification: manual_public_page_verification.

33.2%

#8 · Gemini 2.5 Pro

OFF · Jun 27, 2025

Official company result · manually verified

Raw row drilldown (source, percentile, eligibility)

Source URL: https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Pro-Model-Card.pdf
Percentile: 0%
Last updated: archived
Eligibility: headline eligible
Identity: exact (1.00)

Verified against the public Gemini 2.5 Pro model card PDF dated June 27, 2025. Uses the published no-tools Humanity's Last Exam figure. Checked 2026-04-29. Verification: manual_public_page_verification.

21.6%
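
The eligibility labels in the rows above suggest how a headline-only view might be derived from this group. The sketch below is a minimal illustration under one assumption the page implies but does not state: rows marked "Background only" (eligibility `historical_model` or `preview_model`) are left out of headline comparisons, while rows marked "headline eligible" are kept. The `Row` type and the `headline_rows` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    accuracy: float   # Accuracy (%), published no-tools HLE figure
    eligibility: str  # label as shown in the row drilldown


# Rows copied from the leaderboard for provider_official_hle_2026_q2.
ROWS = [
    Row("Claude Opus 4.7", 46.9, "headline eligible"),
    Row("Gemini 3.1 Pro", 44.4, "headline eligible"),
    Row("GPT-5.5", 41.4, "headline eligible"),
    Row("Claude Opus 4.6", 40.0, "historical_model"),
    Row("GPT-5.4", 39.8, "headline eligible"),
    Row("Gemini 3 Pro Preview", 37.5, "preview_model"),
    Row("Claude Sonnet 4.6", 33.2, "headline eligible"),
    Row("Gemini 2.5 Pro", 21.6, "headline eligible"),
]


def headline_rows(rows: list[Row]) -> list[Row]:
    """Keep only rows labeled headline eligible, preserving their order.

    Assumption: historical and preview rows are background context only.
    """
    return [row for row in rows if row.eligibility == "headline eligible"]


if __name__ == "__main__":
    for rank, row in enumerate(headline_rows(ROWS), start=1):
        print(f"#{rank} {row.model}: {row.accuracy}%")
```

Note that the percentiles displayed on the page are computed over all eight rows, including the background-only ones, so a headline-only view like this would not share those percentile values.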
