GDPval-AA
AA · Professional reasoning · Rubric
Agentic performance on economically valuable work tasks.
Rank #8 · Source label: Gemini 3.5 Flash (high)
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Artificial Analysis
- Raw value: 1,349
- Percentile: 84.8%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Artificial Analysis public leaderboard field `gdpvalBreakdown.elo`.
84.8% percentile inside its fair comparison set · Raw benchmark value: 1,349
APEX-Agents-AA
AA · Professional reasoning · Objective
Long-horizon agentic task completion.
Rank #1 · Source label: Gemini 3.5 Flash (high)
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Artificial Analysis
- Raw value: 47.1%
- Percentile: 100%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Artificial Analysis public leaderboard field `apexAgents`.
100% percentile inside its fair comparison set · Raw benchmark value: 47.1%
Legal Research Bench
VALS-AI · Professional reasoning · Objective
Applied legal research tasks.
Rank #5 · Source label: google/gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Vals AI
- Raw value: 30.8%
- Percentile: 66.7%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Vals AI BenchmarkView overall scores. Vals slug: legal_research; provider: Google.
66.7% percentile inside its fair comparison set · Raw benchmark value: 30.8% · CI 24.5% - 37.1%
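The page does not say how a percentile is derived from a rank inside a fair comparison set. A minimal sketch of one possible reading follows, assuming the rule `(set_size - rank) / (set_size - 1)`; both that rule and the set size of 13 used below are assumptions for illustration, not values published by the source. The function name `percentile_from_rank` is likewise hypothetical.

```python
def percentile_from_rank(rank: int, set_size: int) -> float:
    """Return a 0-100 percentile under the ASSUMED rule
    (set_size - rank) / (set_size - 1); rank 1 is the best entry."""
    if set_size < 1 or not 1 <= rank <= set_size:
        raise ValueError("rank must lie between 1 and set_size")
    if set_size == 1:
        return 100.0  # a lone entry is trivially at the top
    return 100.0 * (set_size - rank) / (set_size - 1)


# Rank #5 in a hypothetical comparison set of 13 entries.
print(round(percentile_from_rank(5, 13), 1))  # 66.7
# Rank #1 always maps to 100 under this rule.
print(percentile_from_rank(1, 13))  # 100.0
```

Under this assumed rule, rank #1 always maps to 100%, which matches the cards on this page that show rank #1, but the actual comparison-set sizes are not given here and would need to be confirmed against the source.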
SkillsBench
VALS-AI · Professional reasoning · Objective
Applied professional skills tasks.
Rank #4 · Source label: google/gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Vals AI
- Raw value: 52.7%
- Percentile: 70%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Vals AI BenchmarkView overall scores. Vals slug: skillsbench; provider: Google.
70% percentile inside its fair comparison set · Raw benchmark value: 52.7% · CI 44% - 61.5%
Public Benefits Bench
VALS-AI · Professional reasoning · Objective
Answering SNAP benefits questions across the public-benefits lifecycle.
Rank #7 · Source label: google/gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Vals AI
- Raw value: 59.5%
- Percentile: 45.5%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Vals AI BenchmarkView overall scores. Vals slug: public-benefits-bench; provider: Google.
45.5% percentile inside its fair comparison set · Raw benchmark value: 59.5% · CI 57% - 62%
Vals Index
VALS-AI · Professional reasoning · Combined
Weighted model performance across economically relevant Vals tasks.
Rank #6 · Source label: google/gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Vals AI
- Raw value: 63
- Percentile: 80.8%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Vals AI BenchmarkView overall scores. Vals slug: vals_index; provider: Google.
80.8% percentile inside its fair comparison set · Raw benchmark value: 63 · CI 60 - 66
Harvey's Legal Agent Benchmark
VALS-AI · Professional reasoning · Objective
Completing legal work with documents, spreadsheets, presentations, and file-system tools.
Rank #8 · Source label: google/gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Vals AI
- Raw value: 2.5%
- Percentile: 46.2%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Vals AI BenchmarkView overall scores. Vals slug: hlab; provider: Google.
46.2% percentile inside its fair comparison set · Raw benchmark value: 2.5% · CI 0.9% - 4.1%
LegalBench
VALS-AI · Professional reasoning · Objective
Academic legal reasoning tasks.
Rank #29 · Source label: google/gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Vals AI
- Raw value: 83.6%
- Percentile: 68.9%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Vals AI BenchmarkView overall scores. Vals slug: legal_bench; provider: Google.
68.9% percentile inside its fair comparison set · Raw benchmark value: 83.6% · CI 81.9% - 85.3%
Finance Agent v2
VALS-AI · Professional reasoning · Objective
Core financial analyst tasks for agentic models.
Rank #1 · Source label: google/gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Vals AI
- Raw value: 57.9%
- Percentile: 100%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Vals AI BenchmarkView overall scores. Vals slug: fabv2; provider: Google.
100% percentile inside its fair comparison set · Raw benchmark value: 57.9% · CI 57.4% - 58.3%
TaxEval v2
VALS-AI · Professional reasoning · Objective
Answer quality on tax questions and responses.
Rank #21 · Source label: google/gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Vals AI
- Raw value: 74.4%
- Percentile: 78%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Vals AI BenchmarkView overall scores. Vals slug: tax_eval_v2; provider: Google.
78% percentile inside its fair comparison set · Raw benchmark value: 74.4% · CI 72.7% - 76%
MedCode
VALS-AI · Professional reasoning · Objective
Medical billing support and coding tasks.
Rank #4 · Source label: google/gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Vals AI
- Raw value: 55.8%
- Percentile: 94.1%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Vals AI BenchmarkView overall scores. Vals slug: medcode; provider: Google.
94.1% percentile inside its fair comparison set · Raw benchmark value: 55.8% · CI 51.7% - 60%
MedScribe
VALS-AI · Professional reasoning · Objective
Administrative documentation support for doctors.
Rank #29 · Source label: google/gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Vals AI
- Raw value: 76.6%
- Percentile: 44%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Vals AI BenchmarkView overall scores. Vals slug: medscribe; provider: Google.
44% percentile inside its fair comparison set · Raw benchmark value: 76.6% · CI 72.8% - 80.3%
Text Arena · Expert
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena expert leaderboard.
Rank #10 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,509
- Percentile: 96.7%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: expert. Source rank: #12. Votes: 1040. Organization: google. License: Proprietary.
96.7% percentile inside its fair comparison set · Raw benchmark value: 1,509 · CI 1,490 - 1,528
Text Arena · Industry Business And Management And Financial Operations
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena industry_business_and_management_and_financial_operations leaderboard.
Rank #28 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,457
- Percentile: 91.5%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: industry_business_and_management_and_financial_operations. Source rank: #39. Votes: 2099. Organization: google. License: Proprietary.
91.5% percentile inside its fair comparison set · Raw benchmark value: 1,457 · CI 1,443 - 1,470
Text Arena · Industry Entertainment And Sports And Media
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena industry_entertainment_and_sports_and_media leaderboard.
Rank #13 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,448
- Percentile: 96.3%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: industry_entertainment_and_sports_and_media. Source rank: #16. Votes: 2054. Organization: google. License: Proprietary.
96.3% percentile inside its fair comparison set · Raw benchmark value: 1,448 · CI 1,434 - 1,462
Text Arena · Industry Legal And Government
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena industry_legal_and_government leaderboard.
Rank #20 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,474
- Percentile: 93.6%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: industry_legal_and_government. Source rank: #26. Votes: 790. Organization: google. License: Proprietary.
93.6% percentile inside its fair comparison set · Raw benchmark value: 1,474 · CI 1,452 - 1,496
Text Arena · Industry Life And Physical And Social Science
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena industry_life_and_physical_and_social_science leaderboard.
Rank #13 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,495
- Percentile: 96.3%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: industry_life_and_physical_and_social_science. Source rank: #16. Votes: 1638. Organization: google. License: Proprietary.
96.3% percentile inside its fair comparison set · Raw benchmark value: 1,495 · CI 1,480 - 1,510
Text Arena · Industry Mathematical
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena industry_mathematical leaderboard.
Rank #5 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,509
- Percentile: 98.7%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: industry_mathematical. Source rank: #6. Votes: 644. Organization: google. License: Proprietary.
98.7% percentile inside its fair comparison set · Raw benchmark value: 1,509 · CI 1,484 - 1,533
Text Arena · Industry Medicine And Healthcare
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena industry_medicine_and_healthcare leaderboard.
Rank #15 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,490
- Percentile: 95.3%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: industry_medicine_and_healthcare. Source rank: #17. Votes: 751. Organization: google. License: Proprietary.
95.3% percentile inside its fair comparison set · Raw benchmark value: 1,490 · CI 1,467 - 1,513
Text Arena · Industry Software And IT Services
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena industry_software_and_it_services leaderboard.
Rank #23 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,499
- Percentile: 93.2%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: industry_software_and_it_services. Source rank: #31. Votes: 4221. Organization: google. License: Proprietary.
93.2% percentile inside its fair comparison set · Raw benchmark value: 1,499 · CI 1,489 - 1,509
Text Arena · Industry Writing And Literature And Language
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena industry_writing_and_literature_and_language leaderboard.
Rank #13 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,462
- Percentile: 96.3%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: industry_writing_and_literature_and_language. Source rank: #17. Votes: 2335. Organization: google. License: Proprietary.
96.3% percentile inside its fair comparison set · Raw benchmark value: 1,462 · CI 1,449 - 1,474
Text Arena · Expert · No Style Control
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena expert leaderboard.
Rank #6 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,508
- Percentile: 98.2%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: expert. Source rank: #8. Votes: 1040. Organization: google. License: Proprietary.
98.2% percentile inside its fair comparison set · Raw benchmark value: 1,508 · CI 1,489 - 1,527
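The two Text Arena expert cards report nearly identical raw values (1,509 with CI 1,490 - 1,528, and 1,508 with CI 1,489 - 1,527 under No Style Control). As a rough sketch of how such intervals might be compared, the helper below checks whether two closed intervals overlap. The name `intervals_overlap` is hypothetical, and interval overlap is only a quick heuristic, not a formal significance test and not something the source page computes.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """Return True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# CI bounds copied from the two Text Arena expert cards.
style_controlled = (1490, 1528)
no_style_control = (1489, 1527)

print(intervals_overlap(style_controlled, no_style_control))  # True
```

Overlapping intervals suggest that the one-point difference in raw value should not be read as a meaningful gap between the two settings.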
Text Arena · Industry Business And Management And Financial Operations · No Style Control
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena industry_business_and_management_and_financial_operations leaderboard.
Rank #15 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,456
- Percentile: 95.6%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: industry_business_and_management_and_financial_operations. Source rank: #19. Votes: 2099. Organization: google. License: Proprietary.
95.6% percentile inside its fair comparison set · Raw benchmark value: 1,456 · CI 1,443 - 1,470
Text Arena · Industry Entertainment And Sports And Media · No Style Control
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena industry_entertainment_and_sports_and_media leaderboard.
Rank #7 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,454
- Percentile: 98.1%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: industry_entertainment_and_sports_and_media. Source rank: #9. Votes: 2054. Organization: google. License: Proprietary.
98.1% percentile inside its fair comparison set · Raw benchmark value: 1,454 · CI 1,440 - 1,468
Text Arena · Industry Legal And Government · No Style Control
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena industry_legal_and_government leaderboard.
Rank #8 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,483
- Percentile: 97.7%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: industry_legal_and_government. Source rank: #9. Votes: 790. Organization: google. License: Proprietary.
97.7% percentile inside its fair comparison set · Raw benchmark value: 1,483 · CI 1,461 - 1,504
Text Arena · Industry Life And Physical And Social Science · No Style Control
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena industry_life_and_physical_and_social_science leaderboard.
Rank #5 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,497
- Percentile: 98.8%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: industry_life_and_physical_and_social_science. Source rank: #6. Votes: 1638. Organization: google. License: Proprietary.
98.8% percentile inside its fair comparison set · Raw benchmark value: 1,497 · CI 1,482 - 1,512
Text Arena · Industry Mathematical · No Style Control
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena industry_mathematical leaderboard.
Rank #2 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,513
- Percentile: 99.7%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: industry_mathematical. Source rank: #3. Votes: 644. Organization: google. License: Proprietary.
99.7% percentile inside its fair comparison set · Raw benchmark value: 1,513 · CI 1,488 - 1,537
Text Arena · Industry Medicine And Healthcare · No Style Control
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena industry_medicine_and_healthcare leaderboard.
Rank #5 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,489
- Percentile: 98.6%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: industry_medicine_and_healthcare. Source rank: #7. Votes: 751. Organization: google. License: Proprietary.
98.6% percentile inside its fair comparison set · Raw benchmark value: 1,489 · CI 1,466 - 1,511
Text Arena · Industry Software And IT Services · No Style Control
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena industry_software_and_it_services leaderboard.
Rank #6 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,491
- Percentile: 98.5%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: industry_software_and_it_services. Source rank: #9. Votes: 4221. Organization: google. License: Proprietary.
98.5% percentile inside its fair comparison set · Raw benchmark value: 1,491 · CI 1,482 - 1,501
Text Arena · Industry Writing And Literature And Language · No Style Control
AR · Professional reasoning · Human
Observed user preference in Arena's Text Arena industry_writing_and_literature_and_language leaderboard.
Rank #9 · Source label: gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Arena
- Raw value: 1,466
- Percentile: 97.5%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Arena leaderboard dataset row `gemini-3.5-flash`. Category: industry_writing_and_literature_and_language. Source rank: #11. Votes: 2335. Organization: google. License: Proprietary.
97.5% percentile inside its fair comparison set · Raw benchmark value: 1,466 · CI 1,454 - 1,479
SAGE
VALS-AI · Professional reasoning · Objective
Student Assessment with Generative Evaluation.
Rank #12 · Source label: google/gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Vals AI
- Raw value: 49.9%
- Percentile: 75.6%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Vals AI BenchmarkView overall scores. Vals slug: sage; provider: Google.
75.6% percentile inside its fair comparison set · Raw benchmark value: 49.9% · CI 43.1% - 56.6%
Public Benefits Bench v1
VALS-AI · Professional reasoning · Objective
Answering public-benefits questions across the benefits lifecycle.
Rank #5 · Source label: google/gemini-3.5-flash
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: Vals AI
- Raw value: 58%
- Percentile: 66.7%
- Last updated: recent
- Eligibility: headline eligible
Parsed from Vals AI BenchmarkView overall scores. Vals slug: public-benefits-bench-v1; provider: Google.
66.7% percentile inside its fair comparison set · Raw benchmark value: 58% · CI 55.5% - 60.5%
Data analysis
LB · Professional reasoning · Objective
Structured data manipulation and table reasoning accuracy.
Rank #38 · Source label: gemini-3.5-flash-high
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: LiveBench
- Raw value: 64.9%
- Percentile: 65.7%
- Last updated: archived
- Eligibility: headline eligible
Derived from the official LiveBench website leaderboard table. Category: Data Analysis. Tasks scored: 3.
65.7% percentile inside its fair comparison set · Raw benchmark value: 64.9%
Overall
LB · Professional reasoning · Objective
Average objective performance across LiveBench's current public category mix.
Rank #14 · Source label: gemini-3.5-flash-high
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: LiveBench
- Raw value: 75%
- Percentile: 88%
- Last updated: archived
- Eligibility: headline eligible
Derived from the official LiveBench website leaderboard table. Category averages included: 7.
88% percentile inside its fair comparison set · Raw benchmark value: 75%
Consecutive events
LB · Professional reasoning · Objective
Objective consecutive events score in LiveBench.
Rank #43 · Source label: gemini-3.5-flash-high
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: LiveBench
- Raw value: 48%
- Percentile: 61.1%
- Last updated: archived
- Eligibility: headline eligible
Derived from the official LiveBench website leaderboard table. Task: consecutive_events. Category: Data Analysis.
61.1% percentile inside its fair comparison set · Raw benchmark value: 48%
Table join
LB · Professional reasoning · Objective
Objective table join score in LiveBench.
Rank #7 · Source label: gemini-3.5-flash-high
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: LiveBench
- Raw value: 50.5%
- Percentile: 94.4%
- Last updated: archived
- Eligibility: headline eligible
Derived from the official LiveBench website leaderboard table. Task: tablejoin. Category: Data Analysis.
94.4% percentile inside its fair comparison set · Raw benchmark value: 50.5%
Table reformat
LB · Professional reasoning · Objective
Objective table reformat score in LiveBench.
Rank #82 · Source label: gemini-3.5-flash-high
verified runtime · exact alias
Raw row drilldown: source row, percentile, last updated, eligibility
- Source: LiveBench
- Raw value: 96.1%
- Percentile: 30.6%
- Last updated: archived
- Eligibility: headline eligible
Derived from the official LiveBench website leaderboard table. Task: tablereformat. Category: Data Analysis.
30.6% percentile inside its fair comparison set · Raw benchmark value: 96.1%