Provider-official MRCR v2 long-context results manually verified from public provider pages.
Source · Provider official evals
Version · 2026-Q2 public provider cards
Scores · 2
Test details
Visible tradeoffs
This is an objective signal, so it is mainly about measurable task performance rather than public taste.
source · Provider official evals
metric · Accuracy (%)
judge · Objective
direction · higher better
group id · provider_official_mrcr_v2_2026_q2
domain · Long context
What it measures vs what it misses
✓ Measures
Needle-style long-context recall and sustained retrieval under long windows.
✗ Misses
Real workflow synthesis quality and multi-document judgment.
Why this counts
It checks whether long-context claims survive contact with retrieval, memory, or long-document tasks.
Same-test rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
What it misses
It does not guarantee good synthesis quality once real documents, tools, and latency constraints are involved.
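As a rough illustration of the same-test rule, the sketch below ranks models only against others that share a group id. The page does not publish its percentile formula, so the definition used here (share of models in the same group scoring at or below a given model) is an assumption, and the group ids, model names, and scores are hypothetical placeholders, not results from this leaderboard.

```python
from collections import defaultdict


def percentile_within_group(records):
    """Return {(group_id, model): percentile}, comparing only inside each group.

    Assumption: percentile = 100 * (models in the same group scoring at or
    below this model) / (models in the group). The source page does not
    state the formula it actually uses.
    """
    by_group = defaultdict(list)
    for group_id, model, score in records:
        by_group[group_id].append((model, score))

    result = {}
    for group_id, entries in by_group.items():
        scores = [score for _, score in entries]
        for model, score in entries:
            at_or_below = sum(1 for s in scores if s <= score)
            result[(group_id, model)] = 100.0 * at_or_below / len(scores)
    return result


# Hypothetical records: (group id, model, accuracy %). All values are made up.
records = [
    ("group_a", "model_x", 80.0),
    ("group_a", "model_y", 60.0),
    ("group_b", "model_x", 40.0),
]
ranks = percentile_within_group(records)
```

In this sketch, model_x tops group_b even though its score there is lower than either score in group_a, which is why a percentile from one benchmark/version group cannot be read as a universal score.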
Leaderboard · this benchmark version
#1 · Gemini 3.1 Pro
OFF · Apr 29, 2026
Official company result · manual verified
Verified against the current Gemini Pro public model page. Page did not expose a visible page date on April 29, 2026. Uses the published MRCR v2 8-needle 128k average for Gemini 3.1 Pro. Checked 2026-04-29. Verification: manual_public_page_verification.
84.9%
#2 · Gemini 3 Pro Preview
OFF · Apr 29, 2026
Official company result · manual verified · Background only
Verified against the current Gemini Pro public model page. Page did not expose a visible page date on April 29, 2026. Uses the published MRCR v2 8-needle 128k average for Gemini 3 Pro Preview. Checked 2026-04-29. Verification: manual_public_page_verification.