Methodology

We do not mix unrelated scores.

Raw scores stay raw. Percentiles are computed only when the test, metric, judge, and version all match. We verify the source, infer only when the source data is explicit, and refuse to guess at unsupported matches.

Layer 1 · raw source records
Layer 2 · verified and labeled records
Layer 3 · model-family-aware recommendations and custom rankings

Verify the source

Fetch the published leaderboard or dataset, save the snapshot, and keep the public source link. We verify the source URL, content type, and snapshot hash before we treat anything as a measurement.

Source URL preserved · Snapshot hash logged · Capture time stored
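As a rough illustration of the verification step, the sketch below hashes the saved bytes and stores the URL, content type, and capture time alongside the digest. The `Snapshot` type, the function names, and the choice of SHA-256 are assumptions made for this example; the text only says that a snapshot hash is logged.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record of one captured source (names are illustrative)."""
    source_url: str     # public link, kept exactly as published
    content_type: str   # e.g. "text/csv", recorded with the bytes
    sha256: str         # hex digest of the exact bytes that were saved
    captured_at: str    # UTC capture time in ISO 8601 form


def record_snapshot(source_url: str, content_type: str, body: bytes) -> Snapshot:
    """Hash the saved bytes and keep the URL and capture time with them."""
    digest = hashlib.sha256(body).hexdigest()  # SHA-256 is an assumption
    captured_at = datetime.now(timezone.utc).isoformat()
    return Snapshot(source_url, content_type, digest, captured_at)


def matches_snapshot(snapshot: Snapshot, body: bytes) -> bool:
    """Return True only if the bytes hash to the logged digest."""
    return hashlib.sha256(body).hexdigest() == snapshot.sha256
```

A later read of the same source can then be checked with `matches_snapshot` before any row from it is treated as a measurement.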

Parse carefully

Parse the source into structured records. We infer only when aliases, mappings, and anomalies are explicit enough to support the match. If not, the item stays open instead of being silently guessed.

Parser version attached · Anomalies logged · Manual review opened
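One way to picture the "infer only when explicit" rule is an exact lookup against a written-down alias table, with everything else left open for review. The table contents, the `ParseResult` type, and the version string below are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass, field

# Hypothetical alias table: each entry is an explicit, auditable mapping rule.
ALIASES = {
    "example-model-v1": "example-model",
    "Example Model (v1)": "example-model",
}


@dataclass
class ParseResult:
    parser_version: str
    resolved: dict = field(default_factory=dict)      # raw name -> canonical ID
    open_reviews: list = field(default_factory=list)  # raw names with no explicit rule


def resolve_names(raw_names, aliases=ALIASES, parser_version="0.1.0"):
    """Map raw names through explicit aliases; leave everything else open."""
    result = ParseResult(parser_version=parser_version)
    for name in raw_names:
        if name in aliases:
            result.resolved[name] = aliases[name]
        else:
            # No fuzzy matching: an unmapped name is queued for manual review.
            result.open_reviews.append(name)
    return result
```

The design choice here is that a near-miss spelling is never matched by similarity; it stays in `open_reviews` until someone adds an explicit rule.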

Label the source

Attach benchmark metadata: judge type, metric family, direction, benchmark version, modality, fair comparison set, and source type. Copied rows, official-company results, human-checked rows, historical inserts, and demo fixtures all stay labeled.

Judge type kept · Source type kept · Fair comparison set kept
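The labels listed above could be carried on each record roughly as follows. This is a sketch under assumptions: the field names, the enum values, and the `comparison_key` helper are illustrative, and the key uses only the four fields the text says must line up (test, metric, judge, version).

```python
from dataclasses import dataclass
from enum import Enum


class SourceType(Enum):
    # The five source types named in the text; the string values are assumed.
    COPIED = "copied"
    OFFICIAL_COMPANY = "official-company"
    HUMAN_CHECKED = "human-checked"
    HISTORICAL_INSERT = "historical-insert"
    DEMO_FIXTURE = "demo-fixture"


@dataclass(frozen=True)
class LabeledRecord:
    model_id: str
    raw_score: float          # stays raw; never overwritten by a normalized value
    benchmark: str
    benchmark_version: str
    metric_family: str
    judge_type: str
    direction: str            # "higher_is_better" or "lower_is_better"
    modality: str
    fair_comparison_set: str
    source_type: SourceType

    def comparison_key(self) -> tuple:
        """Records are treated as comparable only when every field here matches."""
        return (self.benchmark, self.benchmark_version,
                self.metric_family, self.judge_type)
```

Because the record is frozen, a label such as `source_type` cannot be changed after the row is created, which keeps an official-company row from being relabeled later in the pipeline.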

Normalize locally

Normalize only inside the same test setup. The product does not flatten unrelated units into one global score.

Within-group only · No universal scalar · Coverage gaps remain visible
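Within-group normalization might look like the sketch below, which groups records by test, version, metric, and judge and ranks each score only against its own group. The dictionary keys and the percentile definition (share of other group members strictly beaten) are assumptions for illustration, not the product's stated formula; the sketch also assumes every record in a group shares one direction.

```python
from collections import defaultdict


def percentile_ranks(records):
    """Return {(group_key, model_id): percentile}, computed inside each group.

    Each record is a dict with keys: model_id, score, benchmark, version,
    metric, judge, and direction ("higher" or "lower").
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["benchmark"], rec["version"], rec["metric"], rec["judge"])
        groups[key].append(rec)

    out = {}
    for key, members in groups.items():
        if len(members) < 2:
            # A lone score gets no percentile, so the coverage gap stays visible.
            continue
        for rec in members:
            if rec["direction"] == "higher":
                beaten = sum(1 for o in members if o["score"] < rec["score"])
            else:
                beaten = sum(1 for o in members if o["score"] > rec["score"])
            out[(key, rec["model_id"])] = 100.0 * beaten / (len(members) - 1)
    return out
```

No value produced here is comparable across group keys, so there is nothing to sum or average into a single global score.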

Publish the readout

Expose secondary metrics that describe the shape of the visible source data. Product recommendations can rank a reviewed model family rather than a single raw model ID, but raw scores stay attached to their exact source labels. Preview rows stay preview-labeled, official-company rows stay labeled, and neither is silently rewritten as third-party consensus.

Coverage · Spread · Last updated · Open reviews · Model-family grouping
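Two of the secondary metrics, coverage and spread, can be sketched for a single comparison group as below. The definitions are assumptions chosen for the example (coverage as the share of expected models with a visible score, spread as max minus min); the text names these metrics without defining them.

```python
def readout(scores, expected_models):
    """Summarize one comparison group.

    scores: {model_id: raw_score} for the rows actually present.
    expected_models: model IDs the group is expected to cover.
    """
    present = [m for m in expected_models if m in scores]
    values = [scores[m] for m in present]
    return {
        # Assumed definition: fraction of expected models with a score.
        "coverage": len(present) / len(expected_models) if expected_models else 0.0,
        # Assumed definition: range of the visible raw scores.
        "spread": (max(values) - min(values)) if values else None,
        # Missing models are listed rather than filled in.
        "missing": [m for m in expected_models if m not in scores],
    }
```

For example, `readout({"a": 10.0, "b": 30.0}, ["a", "b", "c", "d"])` reports half coverage and names `c` and `d` as missing instead of imputing values for them.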

State the limits

We verify public source links, snapshot hashes, parser outputs, and human-check metadata. We infer only when the mapping rule is explicit enough to audit later. We do not guess aliases, source type, modality, or pricing bands from vibes.

Verified · Inferred · Never guessed