#1 Gemini 3.1 Pro
Google · frontier · 100% visible · 67% exact · 0% indirect
Preview · Official source data included · Model family view
Visible tradeoffs: 27.1% benchmark spread · 100% freshness · exact direct
Fit score: 82.0
Strongest evidence: long context · reasoning math science
Gemini 3.1 Pro is strongest on Long context and Reasoning / math / science for this preset.
The current lead depends partly on official company results, so share it with a warning until independent coverage deepens.
Some visible coverage is coming from provider-official source links while independent coverage catches up.
- Verified rows: 6
- Manual checks: 2
- Relay rows: 0
- Backfilled rows: 0
- Headline lane: May 13, 2026
- Context lane: No extra context
Base score is the weighted mean of preset-domain benchmark fit (85.4).
#2 GPT-5
OpenAI · frontier · 100% visible · 100% exact · 0% indirect
Visible tradeoffs: 33.3% benchmark spread · 95% freshness · exact direct
Fit score: 49.6
Strongest evidence: reasoning math science · chat text
GPT-5 is strongest on Reasoning / math / science and Chat / text for this preset.
The visible evidence mix still leans on weaker or split signals, especially around Long context, source verification state, and any backfilled or relay evidence still in play.
- Verified rows: 6
- Manual checks: 0
- Relay rows: 0
- Backfilled rows: 0
- Headline lane: May 13, 2026
- Context lane: No extra context
Base score is the weighted mean of preset-domain benchmark fit (44.1).
#3 Grok 4
xAI · premium · 67% visible · 67% exact · 0% indirect
Visible tradeoffs: 91.1% benchmark spread · 77.5% freshness · exact direct
Fit score: 41.4
Strongest evidence: chat text · reasoning math science
Grok 4 is strongest on Chat / text and Reasoning / math / science for this preset.
The visible evidence mix still leans on weaker or split signals, especially around Reasoning / math / science, source verification state, and any backfilled or relay evidence still in play.
- Verified rows: 3
- Manual checks: 0
- Relay rows: 0
- Backfilled rows: 0
- Headline lane: May 13, 2026
- Context lane: No extra context
Base score is the weighted mean of preset-domain benchmark fit (52.3).
#4 Qwen3 235B A22B
Qwen · mid · 67% visible · 67% exact · 0% indirect
Visible tradeoffs: 58.9% benchmark spread · 77.5% freshness · exact direct
Fit score: 35.4
Strongest evidence: chat text · reasoning math science
Qwen3 235B A22B is strongest on Chat / text and Reasoning / math / science for this preset.
The visible evidence mix still leans on weaker or split signals, especially around Reasoning / math / science, source verification state, and any backfilled or relay evidence still in play.
- Verified rows: 2
- Manual checks: 0
- Relay rows: 0
- Backfilled rows: 0
- Headline lane: May 13, 2026
- Context lane: No extra context
Base score is the weighted mean of preset-domain benchmark fit (40.5).
#5 Llama 4 Maverick
Meta · mid · 67% visible · 67% exact · 0% indirect
Visible tradeoffs: 43.9% benchmark spread · 100% freshness · exact direct
Fit score: 22.8
Strongest evidence: chat text · reasoning math science
Llama 4 Maverick is strongest on Chat / text and Reasoning / math / science for this preset.
The visible evidence mix still leans on weaker or split signals, especially around Reasoning / math / science, source verification state, and any backfilled or relay evidence still in play.
- Verified rows: 2
- Manual checks: 0
- Relay rows: 0
- Backfilled rows: 0
- Headline lane: May 13, 2026
- Context lane: No extra context
Base score is the weighted mean of preset-domain benchmark fit (25.2).
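
Each card describes its base score as a weighted mean of preset-domain benchmark fit. The page does not publish the per-domain fit values or the preset weights, so the sketch below only illustrates the arithmetic. The domain keys, weights, and fit values are all hypothetical, and renormalizing over the domains that actually have a fit value is an assumption rather than documented behavior. The fit score shown on each card differs from its base score, so some further adjustment presumably applies; it is not described here and is not modeled.

```python
def weighted_mean(fits: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-domain fit values.

    Assumption: domains without a fit value are skipped and the remaining
    weights are renormalized. The page does not confirm this behavior.
    """
    covered = [d for d in fits if d in weights]
    total_weight = sum(weights[d] for d in covered)
    if total_weight == 0:
        raise ValueError("no covered domain has a non-zero weight")
    return sum(fits[d] * weights[d] for d in covered) / total_weight


# Hypothetical inputs: none of these numbers come from the leaderboard.
example_fits = {
    "long_context": 90.0,
    "reasoning_math_science": 80.0,
    "chat_text": 70.0,
}
example_weights = {
    "long_context": 0.5,
    "reasoning_math_science": 0.3,
    "chat_text": 0.2,
}

base_score = weighted_mean(example_fits, example_weights)
print(round(base_score, 1))  # 83.0 for these made-up inputs
```

With these invented numbers the result is 90 × 0.5 + 80 × 0.3 + 70 × 0.2 = 83.0. A model with no visible result in one domain would, under the renormalization assumption, be scored only on the domains it covers.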