Provider-official HLE results manually verified from public model-card pages.
Source · Provider official evals
Version · 2026-Q2 public provider cards
Scores · 8
Test details
Visible tradeoffs
This is an objective signal, so it is mainly about measurable task performance rather than public taste.
source · Provider official evals
metric · Accuracy (%)
judge · Objective
direction · higher better
group id · provider_official_hle_2026_q2
domain · Reasoning / math / science
What it measures vs what it misses
✓ Measures
Hard academic reasoning and knowledge on the full HLE set.
✗ Misses
Independent replication, user preference, and deployment ergonomics.
Why this counts
It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.
Same-test rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
What it misses
It still misses product usability, latency, and whether the model stays correct in messy real workflows.
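The same-test rule can be illustrated with a short sketch. The page does not state how its percentile is computed, so the formula below (share of other models in the group that a model matches or beats) is an assumption, and `percentile_within_group` is a hypothetical helper name. The input holds only the seven scores visible in this section for group `provider_official_hle_2026_q2`; the eighth entry's score is not shown here and is left out rather than guessed.

```python
from typing import Dict


def percentile_within_group(scores: Dict[str, float]) -> Dict[str, float]:
    """Percent of the *other* models in the same group that each model
    matches or beats. Assumed formula, not the site's documented method."""
    n = len(scores)
    if n < 2:
        # A single model has nothing to be compared against.
        return {name: 100.0 for name in scores}
    result = {}
    for name, value in scores.items():
        matched_or_beaten = sum(
            1 for other, v in scores.items() if other != name and v <= value
        )
        result[name] = round(100.0 * matched_or_beaten / (n - 1), 1)
    return result


# Accuracy (%) values as listed on this page for one benchmark/version group.
hle_2026_q2 = {
    "Claude Opus 4.7": 46.9,
    "Gemini 3.1 Pro": 44.4,
    "GPT-5.5": 41.4,
    "Claude Opus 4.6": 40.0,
    "GPT-5.4": 39.8,
    "Gemini 3 Pro Preview": 37.5,
    "Claude Sonnet 4.6": 33.2,
}

if __name__ == "__main__":
    for model, pct in percentile_within_group(hle_2026_q2).items():
        print(f"{model}: {pct}")
```

Because the ranking is computed only over the dictionary passed in, adding scores from a different benchmark or version would change every percentile, which is why the value is not a universal score.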
Leaderboard · this benchmark version
#1 · Claude Opus 4.7
OFF · Apr 29, 2026
Official company result · manually verified
Verified against Anthropic's public Claude Opus 4.7 system card. Uses the published no-tools Humanity's Last Exam figure. Checked 2026-04-29. Verification: manual_public_page_verification.
46.9%
#2 · Gemini 3.1 Pro
OFF · Apr 29, 2026
Official company result · manually verified
Verified against the current Gemini Pro public model page. Page did not expose a visible page date on April 29, 2026. Uses the published no-tools Humanity's Last Exam figure for Gemini 3.1 Pro. Checked 2026-04-29. Verification: manual_public_page_verification.
44.4%
#3 · GPT-5.5
OFF · Apr 23, 2026
Official company result · manually verified
Verified against OpenAI's GPT-5.5 launch page dated April 23, 2026. Uses the published no-tools Humanity's Last Exam figure. Checked 2026-04-29. Verification: manual_public_page_verification.
41.4%
#4 · Claude Opus 4.6
OFF · Apr 29, 2026
Official company result · manually verified · Background only
Verified against Anthropic's public Claude Opus 4.6 system card. Uses the published no-tools Humanity's Last Exam figure. Checked 2026-04-29. Verification: manual_public_page_verification.
40%
#5 · GPT-5.4
OFF · Mar 5, 2026
Official company result · manually verified
Verified against OpenAI's GPT-5.4 launch page dated March 5, 2026. Uses the published no-tools Humanity's Last Exam figure. Checked 2026-04-29. Verification: manual_public_page_verification.
39.8%
#6 · Gemini 3 Pro Preview
OFF · Apr 29, 2026
Official company result · manually verified · Background only
Verified against the current Gemini Pro public model page. Page did not expose a visible page date on April 29, 2026. Uses the published no-tools Humanity's Last Exam figure for Gemini 3 Pro Preview. Checked 2026-04-29. Verification: manual_public_page_verification.
37.5%
#7 · Claude Sonnet 4.6
OFF · Apr 29, 2026
Official company result · manually verified
Verified against Anthropic's public Claude Sonnet 4.6 system card. Uses the published no-tools Humanity's Last Exam figure. Checked 2026-04-29. Verification: manual_public_page_verification.
33.2%
#8 · Gemini 2.5 Pro
OFF · Jun 27, 2025
Official company result · manually verified
Verified against the public Gemini 2.5 Pro model card PDF dated June 27, 2025. Uses the published no-tools Humanity's Last Exam figure. Checked 2026-04-29. Verification: manual_public_page_verification.