Benchmark quality scoring
Benchmarks get scored too.
Each quality score reflects six factors: contamination risk, independence, reproducibility, sample size, transparency, and task realism. The table lists the score for each benchmark alongside three of those factors.
| Benchmark | Source | Quality score | Independence | Reproducibility | Task realism |
|---|---|---:|---:|---:|---:|
| Reasoning | livebench | 84.2 | 85 | 80 | 70 |
| Agentic coding | livebench | 84.2 | 85 | 80 | 90 |
| JavaScript | livebench | 84.2 | 85 | 80 | 90 |
| TypeScript | livebench | 84.2 | 85 | 80 | 90 |
| Python | livebench | 84.2 | 85 | 80 | 90 |
| Agentic Index | artificial-analysis | 84.2 | 85 | 80 | 90 |
| Terminal-Bench Hard | artificial-analysis | 84.2 | 85 | 80 | 90 |
| APEX-Agents-AA | artificial-analysis | 84.2 | 85 | 80 | 90 |
| Terminal-Bench 2.0 | terminal-bench | 84.2 | 85 | 80 | 90 |
| Text Arena | arena | 80.8 | 85 | 80 | 70 |
| Code Arena | arena | 80.8 | 85 | 80 | 70 |
| Vision Arena | arena | 80.8 | 85 | 80 | 70 |
| WebDev Arena | arena | 80.8 | 85 | 80 | 70 |
| Search Arena | arena | 80.8 | 85 | 80 | 70 |
| Document Arena | arena | 80.8 | 85 | 80 | 70 |
| Text-to-Image Arena | arena | 80.8 | 85 | 80 | 70 |
| Image Edit Arena | arena | 80.8 | 85 | 80 | 70 |
| Text-to-Video Arena | arena | 80.8 | 85 | 80 | 70 |
| Image-to-Video Arena | arena | 80.8 | 85 | 80 | 70 |
| Overall | livebench | 80.8 | 85 | 80 | 70 |
| Coding | livebench | 80.8 | 85 | 80 | 70 |
| Mathematics | livebench | 80.8 | 85 | 80 | 70 |
| Data analysis | livebench | 80.8 | 85 | 80 | 70 |
| Language | livebench | 80.8 | 85 | 80 | 70 |
| Instruction following | livebench | 80.8 | 85 | 80 | 70 |
| Theory of mind | livebench | 80.8 | 85 | 80 | 70 |
| Zebra puzzle | livebench | 80.8 | 85 | 80 | 70 |
| Spatial | livebench | 80.8 | 85 | 80 | 70 |
| Logic with navigation | livebench | 80.8 | 85 | 80 | 70 |
| Coding generation | livebench | 80.8 | 85 | 80 | 70 |
| Coding completion | livebench | 80.8 | 85 | 80 | 70 |
| AMPS Hard | livebench | 80.8 | 85 | 80 | 70 |
| Integrals with game | livebench | 80.8 | 85 | 80 | 70 |
| Math competition | livebench | 80.8 | 85 | 80 | 70 |
| Olympiad | livebench | 80.8 | 85 | 80 | 70 |
| Consecutive events | livebench | 80.8 | 85 | 80 | 70 |
| Table join | livebench | 80.8 | 85 | 80 | 70 |
| Table reformat | livebench | 80.8 | 85 | 80 | 70 |
| Connections | livebench | 80.8 | 85 | 80 | 70 |
| Plot unscrambling | livebench | 80.8 | 85 | 80 | 70 |
| Typos | livebench | 80.8 | 85 | 80 | 70 |
| Paraphrase | livebench | 80.8 | 85 | 80 | 70 |
| Simplify | livebench | 80.8 | 85 | 80 | 70 |
| Story generation | livebench | 80.8 | 85 | 80 | 70 |
| Summarize | livebench | 80.8 | 85 | 80 | 70 |
| Intelligence Index | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Time to first token | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Coding Index | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Openness Index | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Output Speed | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Time to first answer token | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Blended price | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Input price | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Output price | artificial-analysis | 80.8 | 85 | 80 | 70 |
| GPQA | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Humanity's Last Exam | artificial-analysis | 80.8 | 85 | 80 | 70 |
| CritPt | artificial-analysis | 80.8 | 85 | 80 | 70 |
| SciCode | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Tau2-Bench Telecom | artificial-analysis | 80.8 | 85 | 80 | 70 |
| AA-Omniscience accuracy | artificial-analysis | 80.8 | 85 | 80 | 70 |
| AA-Omniscience non-hallucination | artificial-analysis | 80.8 | 85 | 80 | 70 |
| IFBench | artificial-analysis | 80.8 | 85 | 80 | 70 |
| GDPval-AA | artificial-analysis | 80.8 | 85 | 80 | 70 |
| MMMU-Pro | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Finance Agent v2 | vals-ai | 80.8 | 85 | 80 | 90 |
| Harvey's Legal Agent Benchmark | vals-ai | 80.8 | 85 | 80 | 90 |
| Terminal-Bench 2.1 | vals-ai | 80.8 | 85 | 80 | 90 |
| Poker Agent | vals-ai | 80.8 | 85 | 80 | 90 |
| Long Context Reasoning | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Text to Image | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Image Editing | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Text to Video | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Image to Video | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Text to Speech | artificial-analysis | 80.8 | 85 | 80 | 70 |
| Text-to-Image Arena · Photorealistic | arena | 80 | 85 | 80 | 90 |
| Vals Index | vals-ai | 77.5 | 85 | 80 | 70 |
| Vals Multimodal Index | vals-ai | 77.5 | 85 | 80 | 70 |
| LegalBench | vals-ai | 77.5 | 85 | 80 | 70 |
| CorpFin v2 | vals-ai | 77.5 | 85 | 80 | 70 |
| MortgageTax | vals-ai | 77.5 | 85 | 80 | 70 |
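
To work with these rows programmatically, the table can be read into typed records. The following is a minimal sketch, not the site's own tooling: the `BenchmarkQuality` class, the `parse_row` helper, and the realism threshold of 90 are all hypothetical names and values chosen for illustration. Only the three sample rows come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkQuality:
    """One table row. Class and field names are illustrative assumptions."""
    benchmark: str
    source: str
    score: float
    independence: float
    reproducibility: float
    task_realism: float


def parse_row(line: str) -> BenchmarkQuality:
    """Parse one pipe-delimited table row into a record."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    if len(cells) != 6:
        raise ValueError(f"expected 6 cells, got {len(cells)}: {line!r}")
    name, source, *numbers = cells
    return BenchmarkQuality(name, source, *(float(n) for n in numbers))


# Three rows copied verbatim from the table above.
ROWS = [
    "| Reasoning | livebench | 84.2 | 85 | 80 | 70 |",
    "| Terminal-Bench 2.0 | terminal-bench | 84.2 | 85 | 80 | 90 |",
    "| LegalBench | vals-ai | 77.5 | 85 | 80 | 70 |",
]

records = [parse_row(row) for row in ROWS]

# Keep benchmarks whose task realism meets a hypothetical threshold of 90.
realistic = [r.benchmark for r in records if r.task_realism >= 90]
print(realistic)  # ['Terminal-Bench 2.0']
```

The same records can be sorted or grouped by `source` or `score`. The sketch does not attempt to reproduce how the quality score is derived from its factors, since the table does not state that formula.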