UAB
Live · updated continuously
Eval source registry

Every benchmark source has operating notes.

The registry catalogs reliability notes, update cadence, known weaknesses, and licensing status for each benchmark source.
Sources · 10
API · /api/sources/registry
| Source | Cadence | Status | Licensing | Weaknesses |
| --- | --- | --- | --- | --- |
| Arena | continuous | verified | CC-BY-4.0 public leaderboard dataset | Preference data is useful because it captures direct user taste. Preference data can still drift with prompt mix and community participation. |
| LiveBench | release-based | verified | public source links retained; license notes need source-by-source review | Best interpreted inside specific LiveBench releases rather than across unrelated suites. |
| Artificial Analysis | continuous | verified | public source links retained; license notes need source-by-source review | Useful because it keeps price and speed beside capability rather than inside it. |
| Vals AI | continuous | verified | public source links retained; license notes need source-by-source review | The app ingests the public Vals benchmark pages server-side and uses the published overall score for each benchmark. Archived Vals benchmark pages are skipped so current comparisons are not mixed with retired task surfaces. |
| Terminal-Bench | release-based | verified | public source links retained; license notes need source-by-source review | Useful for measuring long-horizon CLI execution, not just patch generation in isolated repos. Scores can still vary materially with the agent scaffold, tool setup, and runtime harness. |
| LLMBase | continuous | relay | public source links retained; license notes need source-by-source review | This app only ingests exact same-model rows from benchmark-bearing public LLMBase pages. |
| Scale Labs | release-based | verified | public source links retained; license notes need source-by-source review | Rubric-based scores need visible judging rules and prompt disclosures to be trusted. |
| OpenCompass | continuous | verified | public source links retained; license notes need source-by-source review | Breadth helps coverage, but you still need to inspect which datasets were actually included. |
| MTEB | continuous | verified | public source links retained; license notes need source-by-source review | MTEB scores are domain-specific and should never be averaged into text arenas by default. |
| Provider official evals | release-based | verified | provider page source links; verify quote scope before redistribution | This layer is additive and explicitly labeled so provider-official results cannot silently replace third-party evidence. |
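
The registry columns can be treated as a simple record type. The following is a minimal sketch, not the app's actual schema: the `SourceEntry` class, its field names, and the helper functions are hypothetical, and the response format of `/api/sources/registry` is not documented here. The three sample rows are copied from the table above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SourceEntry:
    """One registry row. Field names are assumptions mirroring the table columns."""
    name: str
    cadence: str    # "continuous" or "release-based" in the table
    status: str     # "verified" or "relay" in the table
    licensing: str
    weaknesses: str


# Phrase used in the Licensing column for sources still awaiting review.
REVIEW_MARKER = "need source-by-source review"


def needs_license_review(entry: SourceEntry) -> bool:
    """True when the licensing note says a per-source review is still pending."""
    return REVIEW_MARKER in entry.licensing


def with_status(entries: Iterable[SourceEntry], status: str) -> List[SourceEntry]:
    """Return the entries whose status column equals the given value."""
    return [e for e in entries if e.status == status]


# Three rows copied from the table, enough to exercise the helpers.
ENTRIES = [
    SourceEntry(
        "Arena", "continuous", "verified",
        "CC-BY-4.0 public leaderboard dataset",
        "Preference data can still drift with prompt mix and community participation.",
    ),
    SourceEntry(
        "LLMBase", "continuous", "relay",
        "public source links retained; license notes need source-by-source review",
        "This app only ingests exact same-model rows from benchmark-bearing public LLMBase pages.",
    ),
    SourceEntry(
        "MTEB", "continuous", "verified",
        "public source links retained; license notes need source-by-source review",
        "MTEB scores are domain-specific and should never be averaged into text arenas by default.",
    ),
]

relay_names = [e.name for e in with_status(ENTRIES, "relay")]
pending_review = [e.name for e in ENTRIES if needs_license_review(e)]
```

With these sample rows, `relay_names` contains only LLMBase, and `pending_review` lists LLMBase and MTEB, since Arena already carries an explicit CC-BY-4.0 note.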