
TutorBench

Name: TutorBench
Creator: Scale Labs

Scale Labs' tutoring evaluation, covering math, science, and the quality of instructional guidance.

Source · Scale Labs
Version · scale-labs snapshot 2026-06-24
Scores · 11

Test details

Visible tradeoffs: This is a rubric-judged signal, so it is more structured than arena taste but still depends on the scoring rubric.

Source: Scale Labs
Metric: Score (%)
Judge: Rubric
Direction: higher better
Group ID: scale_tutorbench_current
Domain: Reasoning / math / science

What it measures vs what it misses

✓ Measures

How well a model tutors through multi-step academic problems. Instruction quality, pedagogy, and reasoning support on teaching-style prompts.

✗ Misses

Live classroom preference. Latency and cost.

Why this counts: It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Same-test rule: This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.

What it misses: It still misses product usability, latency, and whether the model stays correct in messy real workflows.
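The same-test rule can be made concrete with a small sketch. The page does not state how the percentile is computed, so the formula here is an assumption: each model's percentile is 100 × (n − 1 − number of models with a strictly higher score) / (n − 1), with tied scores sharing the best position. The function name `group_percentiles` is hypothetical. Under that assumption, the scores listed in the leaderboard on this page reproduce the percentiles shown in each row.

```python
# Scores (%) copied from the leaderboard on this page, in rank order.
SCORES = {
    "GPT-5.4": 56.6,
    "Gemini 2.5 Pro": 55.7,
    "GPT-5": 55.3,
    "GPT-5.4 mini": 55.3,  # backfilled from GPT-5
    "GPT-5.4 nano": 55.3,  # backfilled from GPT-5
    "GPT-5.1": 54.1,
    "Gemini 3 Pro Preview": 53.7,
    "Claude Opus 4.6": 53.6,
    "Gemini 3.1 Pro": 53.0,
    "Gemini 3.1 Flash-Lite Preview": 51.5,
    "Llama 4 Maverick": 40.2,
}


def group_percentiles(scores):
    """Percentile within one benchmark/version group (assumed formula).

    Ties share the best position because only strictly higher
    scores count as "ahead".
    """
    n = len(scores)
    if n < 2:
        # A single-model group has nothing to compare against.
        return {name: 100 for name in scores}
    result = {}
    for name, score in scores.items():
        ahead = sum(1 for other in scores.values() if other > score)
        result[name] = round(100 * (n - 1 - ahead) / (n - 1))
    return result


PERCENTILES = group_percentiles(SCORES)
for name, pct in PERCENTILES.items():
    print(f"{name}: {pct}%")
```

Because the denominator is the size of this one group, adding or removing a model changes every percentile, which is why the number should not be compared across benchmarks or versions.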

Leaderboard · this benchmark version

#1 · GPT-5.4

SL · Apr 29, 2026

Source label: gpt-5.4-pro-2026-03-05

verified runtime · exact alias

Raw row drilldown: source, percentile, eligibility

Source URL: https://labs.scale.com/leaderboard/tutorbench
Percentile: 100%
Last updated: recent
Eligibility: headline eligible
Identity: provider alias (0.92)

56.6%

#2 · Gemini 2.5 Pro

SL · Apr 29, 2026

verified runtime · exact direct

Raw row drilldown: source, percentile, eligibility

Source URL: https://labs.scale.com/leaderboard/tutorbench
Percentile: 90%
Last updated: recent
Eligibility: headline eligible
Identity: exact (1.00)

55.7%

#3 · GPT-5

SL · Apr 29, 2026

verified runtime · exact direct

Raw row drilldown: source, percentile, eligibility

Source URL: https://labs.scale.com/leaderboard/tutorbench
Percentile: 80%
Last updated: recent
Eligibility: headline eligible
Identity: exact (1.00)

Manually verified from the official Scale Labs TutorBench leaderboard.

55.3%

#4 · GPT-5.4 mini

SL · Apr 29, 2026

Source label: gpt-5

backfilled · proxy backfilled · Background only

Raw row drilldown: source, percentile, eligibility

Source URL: https://labs.scale.com/leaderboard/tutorbench
Percentile: 80%
Last updated: recent
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.
Identity: benchmark proxy (0.58)

Manually verified from the official Scale Labs TutorBench leaderboard. Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-mini-to-gpt-5.

55.3%

#5 · GPT-5.4 nano

SL · Apr 29, 2026

Source label: gpt-5

backfilled · proxy backfilled · Background only

Raw row drilldown: source, percentile, eligibility

Source URL: https://labs.scale.com/leaderboard/tutorbench
Percentile: 80%
Last updated: recent
Eligibility: Fallback benchmark identity is visible for context but excluded from default ranking.
Identity: benchmark proxy (0.58)

Manually verified from the official Scale Labs TutorBench leaderboard. Backfilled from GPT-5 via approved benchmark identity mapping map-gpt-5-4-nano-to-gpt-5.

55.3%

#6 · GPT-5.1

SL · Apr 29, 2026

Source label: gpt-5.1-thinking

verified runtime · exact alias · Background only

Raw row drilldown: source, percentile, eligibility

Source URL: https://labs.scale.com/leaderboard/tutorbench
Percentile: 50%
Last updated: recent
Eligibility: historical_model
Identity: provider alias (0.92)

54.1%

#7 · Gemini 3 Pro Preview

SL · Apr 29, 2026

verified runtime · exact direct · Background only

Raw row drilldown: source, percentile, eligibility

Source URL: https://labs.scale.com/leaderboard/tutorbench
Percentile: 40%
Last updated: recent
Eligibility: preview_model
Identity: exact (1.00)

53.7%

#8 · Claude Opus 4.6

SL · Apr 29, 2026

verified runtime · exact direct · Background only

Raw row drilldown: source, percentile, eligibility

Source URL: https://labs.scale.com/leaderboard/tutorbench
Percentile: 30%
Last updated: recent
Eligibility: historical_model
Identity: exact (1.00)

53.6%

#9 · Gemini 3.1 Pro

SL · Apr 29, 2026

verified runtime · exact direct

Raw row drilldown: source, percentile, eligibility

Source URL: https://labs.scale.com/leaderboard/tutorbench
Percentile: 20%
Last updated: recent
Eligibility: headline eligible
Identity: exact (1.00)

53%

#10 · Gemini 3.1 Flash-Lite Preview

SL · Apr 29, 2026

Source label: gemini-3.1-flash-lite-preview

verified runtime · exact alias · Background only

Raw row drilldown: source, percentile, eligibility

Source URL: https://labs.scale.com/leaderboard/tutorbench
Percentile: 10%
Last updated: recent
Eligibility: preview_model
Identity: provider alias (0.92)

51.5%

#11 · Llama 4 Maverick

SL · Apr 29, 2026

verified runtime · exact direct

Raw row drilldown: source, percentile, eligibility

Source URL: https://labs.scale.com/leaderboard/tutorbench
Percentile: 0%
Last updated: recent
Eligibility: headline eligible
Identity: exact (1.00)

40.2%
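The eligibility labels in the rows above suggest how a default ranking could be derived from the full list. The sketch below assumes that only rows labeled "headline eligible" enter the default ranking. The page states the exclusion explicitly only for fallback-identity rows, so treating "historical_model" and "preview_model" rows the same way is an assumption based on their "Background only" badge. The `Row` type, the `default_ranking` function, and the shortened fallback label are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    score: float      # Score (%), higher is better
    eligibility: str  # label copied from the row's Eligibility field


# Rows copied from the leaderboard above. The fallback label is
# shortened from the page's full sentence (hypothetical short form).
ROWS = [
    Row("GPT-5.4", 56.6, "headline eligible"),
    Row("Gemini 2.5 Pro", 55.7, "headline eligible"),
    Row("GPT-5", 55.3, "headline eligible"),
    Row("GPT-5.4 mini", 55.3, "fallback benchmark identity"),
    Row("GPT-5.4 nano", 55.3, "fallback benchmark identity"),
    Row("GPT-5.1", 54.1, "historical_model"),
    Row("Gemini 3 Pro Preview", 53.7, "preview_model"),
    Row("Claude Opus 4.6", 53.6, "historical_model"),
    Row("Gemini 3.1 Pro", 53.0, "headline eligible"),
    Row("Gemini 3.1 Flash-Lite Preview", 51.5, "preview_model"),
    Row("Llama 4 Maverick", 40.2, "headline eligible"),
]


def default_ranking(rows):
    """Keep headline-eligible rows only, sorted by score, best first."""
    eligible = [r for r in rows if r.eligibility == "headline eligible"]
    return sorted(eligible, key=lambda r: r.score, reverse=True)


DEFAULT = [r.model for r in default_ranking(ROWS)]
print(DEFAULT)
```

Under this assumption, five of the eleven rows remain in the default ranking, and the backfilled GPT-5.4 mini and nano rows stay visible only as context.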

