Eval source registry
The registry keeps operating notes for every benchmark source: reliability, update cadence, licensing status, and known weaknesses.
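As a rough illustration, one registry row could be modeled as a small typed record. This is only a sketch: the names `SourceEntry`, `Cadence`, `Status`, and `parse_row` are hypothetical and not taken from the app. The allowed cadence and status values are the ones that appear in the table in this document.

```python
from dataclasses import dataclass
from enum import Enum


class Cadence(Enum):
    CONTINUOUS = "continuous"
    RELEASE_BASED = "release-based"


class Status(Enum):
    VERIFIED = "verified"
    RELAY = "relay"


@dataclass(frozen=True)
class SourceEntry:
    """One registry row (hypothetical shape, mirrors the table columns)."""
    source: str
    cadence: Cadence
    status: Status
    licensing: str
    weaknesses: str


def parse_row(row: str) -> SourceEntry:
    """Parse one pipe-delimited table row into a SourceEntry.

    Assumes no cell contains a literal '|' character.
    """
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    if len(cells) != 5:
        raise ValueError(f"expected 5 cells, got {len(cells)}")
    source, cadence, status, licensing, weaknesses = cells
    # Enum lookup by value raises ValueError on an unknown cadence or status,
    # so a typo in the registry fails loudly instead of passing through.
    return SourceEntry(source, Cadence(cadence), Status(status), licensing, weaknesses)
```

Parsing into enums rather than free strings means a new cadence or status has to be added deliberately before a row that uses it will load.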
| Source | Cadence | Status | Licensing | Notes and weaknesses |
|---|---|---|---|---|
| Arena | continuous | verified | CC-BY-4.0 public leaderboard dataset | Preference data captures direct user taste, but it can drift with prompt mix and community participation. |
| LiveBench | release-based | verified | public source links retained; license notes need source-by-source review | Scores are best interpreted within a specific LiveBench release rather than compared across unrelated suites. |
| Artificial Analysis | continuous | verified | public source links retained; license notes need source-by-source review | Useful because it reports price and speed alongside capability rather than folding them into it. |
| Vals AI | continuous | verified | public source links retained; license notes need source-by-source review | The app ingests the public Vals benchmark pages server-side and uses the published overall score for each benchmark. Archived Vals benchmark pages are skipped so current comparisons are not mixed with retired task surfaces. |
| Terminal-Bench | release-based | verified | public source links retained; license notes need source-by-source review | Useful for measuring long-horizon CLI execution, not just patch generation in isolated repos. Scores can still vary materially with the agent scaffold, tool setup, and runtime harness. |
| LLMBase | continuous | relay | public source links retained; license notes need source-by-source review | The app ingests only rows that match the exact same model, and only from public LLMBase pages that carry benchmark data. |
| Scale Labs | release-based | verified | public source links retained; license notes need source-by-source review | Rubric-based scores need visible judging rules and prompt disclosures to be trusted. |
| OpenCompass | continuous | verified | public source links retained; license notes need source-by-source review | Breadth helps coverage, but you still need to inspect which datasets were actually included. |
| MTEB | continuous | verified | public source links retained; license notes need source-by-source review | MTEB scores are domain-specific and should never be averaged into text-arena scores by default. |
| Provider official evals | release-based | verified | provider page source links; verify quote scope before redistribution | This layer is additive and explicitly labeled so provider-official results cannot silently replace third-party evidence. |
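
Three of the notes above describe ingestion rules: archived Vals pages are skipped, LLMBase rows are kept only on an exact model match, and provider-official results sit in their own labeled layer. A minimal sketch of how those rules might combine follows. Everything here is an assumption for illustration: the `Row` shape, the `keep_row` and `merge_rows` helpers, the `known_models` set, and the reading of "exact" as plain string equality on the model identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One ingested result row (hypothetical shape)."""
    source: str
    model: str
    benchmark: str
    score: float
    archived: bool = False


def keep_row(row: Row, known_models: set) -> bool:
    """Apply the per-source filters described in the registry notes."""
    # Archived Vals pages are skipped so retired task surfaces stay out.
    if row.source == "Vals AI" and row.archived:
        return False
    # LLMBase rows must match a known model exactly; no fuzzy matching.
    if row.source == "LLMBase" and row.model not in known_models:
        return False
    return True


def merge_rows(rows, known_models):
    """Group rows by (model, benchmark), keeping provider results separate."""
    merged = {}
    for row in rows:
        if not keep_row(row, known_models):
            continue
        is_provider = row.source == "Provider official evals"
        layer = "provider_official" if is_provider else "third_party"
        slot = merged.setdefault(
            (row.model, row.benchmark),
            {"third_party": [], "provider_official": []},
        )
        # Provider rows are appended to their own layer, never overwriting
        # third-party evidence for the same model and benchmark.
        slot[layer].append(row)
    return merged
```

Keeping the two layers as separate lists, rather than picking one winning score per cell, is what makes the provider layer additive: a consumer can always see whether third-party evidence exists next to a provider-reported number.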