Eval source registry

Every benchmark source has operating notes.

The registry catalogs reliability notes, update cadence, known weaknesses, and licensing status for each benchmark source.

Sources · 10
API · /api/sources/registry

Source	Cadence	Status	Licensing	Weaknesses
Arena	continuous	verified	CC-BY-4.0 public leaderboard dataset	Preference data captures direct user taste, but it can drift with prompt mix and community participation.
LiveBench	release-based	verified	public source links retained; license notes need source-by-source review	Best interpreted inside specific LiveBench releases rather than across unrelated suites.
Artificial Analysis	continuous	verified	public source links retained; license notes need source-by-source review	Useful because it reports price and speed alongside capability rather than folding them into the capability score.
Vals AI	continuous	verified	public source links retained; license notes need source-by-source review	The app ingests the public Vals benchmark pages server-side and uses the published overall score for each benchmark. Archived Vals benchmark pages are skipped so current comparisons are not mixed with retired task surfaces.
Terminal-Bench	release-based	verified	public source links retained; license notes need source-by-source review	Useful for measuring long-horizon CLI execution, not just patch generation in isolated repos. Scores can still vary materially with the agent scaffold, tool setup, and runtime harness.
LLMBase	continuous	relay	public source links retained; license notes need source-by-source review	The app ingests only exact same-model rows from benchmark-bearing public LLMBase pages.
Scale Labs	release-based	verified	public source links retained; license notes need source-by-source review	Rubric-based scores need visible judging rules and prompt disclosures to be trusted.
OpenCompass	continuous	verified	public source links retained; license notes need source-by-source review	Breadth helps coverage, but you still need to inspect which datasets were actually included.
MTEB	continuous	verified	public source links retained; license notes need source-by-source review	MTEB scores are domain-specific and should not be averaged into text arenas by default.
Provider official evals	release-based	verified	provider page source links; verify quote scope before redistribution	This layer is additive and explicitly labeled so provider-official results cannot silently replace third-party evidence.
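
As a rough illustration, the table above can be loaded into typed records so that rows needing license review, or rows that are not verified, can be listed mechanically. This is a minimal sketch, not the app's implementation: `SourceEntry`, `parse_registry`, and `needs_license_review` are hypothetical names, and because the response shape of `/api/sources/registry` is not documented here, the sketch works on tab-separated text with the same five columns. The sample rows are excerpts from the table.

```python
from __future__ import annotations

from dataclasses import dataclass

# Column names as they appear in the registry table header.
EXPECTED_COLUMNS = ["Source", "Cadence", "Status", "Licensing", "Weaknesses"]


@dataclass(frozen=True)
class SourceEntry:
    """One registry row (hypothetical type, mirrors the table columns)."""
    source: str
    cadence: str
    status: str
    licensing: str
    weaknesses: str


def parse_registry(tsv_text: str) -> list[SourceEntry]:
    """Parse tab-separated registry text into SourceEntry records."""
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("registry text is empty")
    header, *rows = lines
    if [cell.strip() for cell in header.split("\t")] != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {header!r}")
    entries = []
    for row in rows:
        cells = [cell.strip() for cell in row.split("\t")]
        if len(cells) != len(EXPECTED_COLUMNS):
            raise ValueError(f"expected 5 columns, got {len(cells)}: {row!r}")
        entries.append(SourceEntry(*cells))
    return entries


def needs_license_review(entry: SourceEntry) -> bool:
    """Assumption: the phrase below marks rows whose license is unresolved."""
    return "need source-by-source review" in entry.licensing


# Three rows excerpted from the table above.
SAMPLE = "\n".join(
    "\t".join(cells)
    for cells in [
        EXPECTED_COLUMNS,
        ["Arena", "continuous", "verified",
         "CC-BY-4.0 public leaderboard dataset",
         "Preference data can still drift with prompt mix and community participation."],
        ["LLMBase", "continuous", "relay",
         "public source links retained; license notes need source-by-source review",
         "This app only ingests exact same-model rows from benchmark-bearing public LLMBase pages."],
        ["MTEB", "continuous", "verified",
         "public source links retained; license notes need source-by-source review",
         "MTEB scores are domain-specific."],
    ]
)

if __name__ == "__main__":
    for entry in parse_registry(SAMPLE):
        flag = "review license" if needs_license_review(entry) else "license noted"
        print(f"{entry.source}: {entry.cadence}, {entry.status}, {flag}")
```

Matching on the licensing phrase is a convenience for this sketch; a structured license field, if the API exposes one, would be a sturdier signal.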