UAB
Live · updated continuously
Benchmark quality scoring

Benchmarks get scored too.

Quality scores expose contamination risk, independence, reproducibility, sample size, transparency, and task realism.
Benchmarks · 204
API · /api/benchmark-quality
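The page names six factors but does not publish how they are combined. The following is a minimal sketch only, assuming each factor is a 0-100 sub-score where higher is better and that the composite is a weighted mean. The `QualityFactors` class, the `composite_score` function, the field names, and the equal weights are all hypothetical and are not taken from the site.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class QualityFactors:
    """Six sub-scores on an assumed 0-100 scale, higher is better."""
    contamination_resistance: float  # assumed orientation: higher = lower contamination risk
    independence: float
    reproducibility: float
    sample_size: float
    transparency: float
    task_realism: float


# Hypothetical equal weights; the real weighting is not stated on the page.
EQUAL_WEIGHTS = {f.name: 1.0 for f in fields(QualityFactors)}


def composite_score(factors, weights=EQUAL_WEIGHTS):
    """Return the weighted mean of the sub-scores, rounded to one decimal."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(getattr(factors, name) * w for name, w in weights.items())
    return round(weighted_sum / total_weight, 1)


if __name__ == "__main__":
    # Independence, reproducibility, and task realism use values that appear
    # in the listing; the other three sub-scores are made up for illustration.
    example = QualityFactors(
        contamination_resistance=80,
        independence=85,
        reproducibility=80,
        sample_size=75,
        transparency=80,
        task_realism=70,
    )
    print(composite_score(example))
```

Because three of the six inputs are invented and the weights are assumed, this sketch is not expected to reproduce the published scores.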
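| Benchmark | Source | Score | Independence | Reproducibility | Task realism |
| --- | --- | --- | --- | --- | --- |
| Reasoning | livebench | 84.2 | 85 | 80 | 70 |
| Agentic coding | livebench | 84.2 | 85 | 80 | 90 |
| JavaScript | livebench | 84.2 | 85 | 80 | 90 |
| TypeScript | livebench | 84.2 | 85 | 80 | 90 |
| Python | livebench | 84.2 | 85 | 80 | 90 |
| Agentic Index | artificial-analysis | 84.2 | 85 | 80 | 90 |
| Terminal-Bench Hard | artificial-analysis | 84.2 | 85 | 80 | 90 |
| APEX-Agents-AA | artificial-analysis | 84.2 | 85 | 80 | 90 |
| Terminal-Bench 2.0 | terminal-bench | 84.2 | 85 | 80 | 90 |
| Text Arena | arena | 80.8 | 85 | 80 | 70 |
| Code Arena | arena | 80.8 | 85 | 80 | 70 |
| Vision Arena | arena | 80.8 | 85 | 80 | 70 |
| WebDev Arena | arena | 80.8 | 85 | 80 | 70 |
| Search Arena | arena | 80.8 | 85 | 80 | 70 |
| Document Arena | arena | 80.8 | 85 | 80 | 70 |
| Text-to-Image Arena | arena | 80.8 | 85 | 80 | 70 |
| Image Edit Arena | arena | 80.8 | 85 | 80 | 70 |
| Text-to-Video Arena | arena | 80.8 | 85 | 80 | 70 |
| Image-to-Video Arena | arena | 80.8 | 85 | 80 | 70 |
| Overall | livebench | 80.8 | 85 | 80 | 70 |
| Coding | livebench | 80.8 | 85 | 80 | 70 |
| Mathematics | livebench | 80.8 | 85 | 80 | 70 |
| Data analysis | livebench | 80.8 | 85 | 80 | 70 |
| Language | livebench | 80.8 | 85 | 80 | 70 |
| Instruction following | livebench | 80.8 | 85 | 80 | 70 |
| Theory of mind | livebench | 80.8 | 85 | 80 | 70 |
| Zebra puzzle | livebench | 80.8 | 85 | 80 | 70 |
| Spatial | livebench | 80.8 | 85 | 80 | 70 |
| Logic with navigation | livebench | 80.8 | 85 | 80 | 70 |
| Coding generation | livebench | 80.8 | 85 | 80 | 70 |
| Coding completion | livebench | 80.8 | 85 | 80 | 70 |
| AMPS Hard | livebench | 80.8 | 85 | 80 | 70 |
| Integrals with game | livebench | 80.8 | 85 | 80 | 70 |
| Math competition | livebench | 80.8 | 85 | 80 | 70 |
| Olympiad | livebench | 80.8 | 85 | 80 | 70 |
| Consecutive events | livebench | 80.8 | 85 | 80 | 70 |
| Table join | livebench | 80.8 | 85 | 80 | 70 |
| Table reformat | livebench | 80.8 | 85 | 80 | 70 |
| Connections | livebench | 80.8 | 85 | 80 | 70 |
| Plot unscrambling | livebench | 80.8 | 85 | 80 | 70 |
| Typos | livebench | 80.8 | 85 | 80 | 70 |
| Paraphrase | livebench | 80.8 | 85 | 80 | 70 |
| Simplify | livebench | 80.8 | 85 | 80 | 70 |
| Story generation | livebench | 80.8 | 85 | 80 | 70 |
| Summarize | livebench | 80.8 | 85 | 80 | 70 |
| Intelligence Index | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Time to first token | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Coding Index | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Openness Index | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Output Speed | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Time to first answer token | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Blended price | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Input price | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Output price | artificial-analysis | 80.8 | 85 | 80 | 70 |
| GPQA | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Humanity's Last Exam | artificial-analysis | 80.8 | 85 | 80 | 70 |
| CritPt | artificial-analysis | 80.8 | 85 | 80 | 70 |
| SciCode | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Tau2-Bench Telecom | artificial-analysis | 80.8 | 85 | 80 | 70 |
| AA-Omniscience accuracy | artificial-analysis | 80.8 | 85 | 80 | 70 |
| AA-Omniscience non-hallucination | artificial-analysis | 80.8 | 85 | 80 | 70 |
| IFBench | artificial-analysis | 80.8 | 85 | 80 | 70 |
| GDPval-AA | artificial-analysis | 80.8 | 85 | 80 | 70 |
| MMMU-Pro | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Finance Agent v2 | vals-ai | 80.8 | 85 | 80 | 90 |
| Harvey's Legal Agent Benchmark | vals-ai | 80.8 | 85 | 80 | 90 |
| Terminal-Bench 2.1 | vals-ai | 80.8 | 85 | 80 | 90 |
| Poker Agent | vals-ai | 80.8 | 85 | 80 | 90 |
| Long Context Reasoning | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Text to Image | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Image Editing | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Text to Video | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Image to Video | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Text to Speech | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Text-to-Image Arena · Photorealistic | arena | 80 | 85 | 80 | 90 |
| Vals Index | vals-ai | 77.5 | 85 | 80 | 70 |
| Vals Multimodal Index | vals-ai | 77.5 | 85 | 80 | 70 |
| LegalBench | vals-ai | 77.5 | 85 | 80 | 70 |
| CorpFin v2 | vals-ai | 77.5 | 85 | 80 | 70 |
| MortgageTax | vals-ai | 77.5 | 85 | 80 | 70 |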
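
Rows like those listed above can be summarized with a few lines of code. The sketch below copies five rows by hand, reading the columns as Benchmark, Source, Score, Independence, Reproducibility, and Task realism. The `Row` type and both helper functions are illustrative names, not part of the site or its API.

```python
from collections import defaultdict
from typing import NamedTuple


class Row(NamedTuple):
    benchmark: str
    source: str
    score: float
    independence: int
    reproducibility: int
    task_realism: int


# A small hand-copied sample of the listing, not the full set.
ROWS = [
    Row("Reasoning", "livebench", 84.2, 85, 80, 70),
    Row("Terminal-Bench 2.0", "terminal-bench", 84.2, 85, 80, 90),
    Row("Text Arena", "arena", 80.8, 85, 80, 70),
    Row("Finance Agent v2", "vals-ai", 80.8, 85, 80, 90),
    Row("LegalBench", "vals-ai", 77.5, 85, 80, 70),
]


def mean_score_by_source(rows):
    """Average the quality score of the given rows, grouped by source."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.source].append(row.score)
    return {src: round(sum(vals) / len(vals), 2) for src, vals in grouped.items()}


def with_min_task_realism(rows, threshold):
    """Return benchmark names whose task realism is at least `threshold`."""
    return [row.benchmark for row in rows if row.task_realism >= threshold]


if __name__ == "__main__":
    print(mean_score_by_source(ROWS))
    print(with_min_task_realism(ROWS, 90))
```

The averages describe only the sampled rows, so they should not be read as per-source figures for the whole listing.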