What it measures vs what it misses
✓ Measures
End-to-end task success on hard terminal workflows that require planning, editing, debugging, and execution.
✗ Misses
IDE-native workflows, code review quality, and non-terminal product engineering work.
Why this counts
It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than on hand-picked marketing examples.
Same-test rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
What it misses
It does not fully capture repo-scale iteration, IDE ergonomics, or long debugging loops.
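The same-test rule can be illustrated with a minimal sketch. It assumes the percentile is the share of models in the same benchmark/version group that score at or below the given model; the page does not state its exact formula, so treat that definition as an assumption. The model names, group labels, and scores below are hypothetical.

```python
# Sketch of a within-group percentile. Assumption: percentile = percent of
# models in the same benchmark/version group scoring at or below the target.
# All names and scores here are hypothetical illustration data.

RESULTS = [
    # (model, benchmark/version group, score)
    ("model-a", "bench-x/v1", 0.50),
    ("model-b", "bench-x/v1", 0.30),
    ("model-c", "bench-x/v1", 0.70),
    ("model-a", "bench-x/v2", 0.10),
    ("model-d", "bench-x/v2", 0.90),
]


def percentile_within_group(results, model, group):
    """Return the percentile of `model` among models evaluated in `group` only."""
    # Keep only scores from the exact group; other versions are ignored.
    scores = {m: s for (m, g, s) in results if g == group}
    if model not in scores:
        raise KeyError(f"{model!r} has no result in group {group!r}")
    target = scores[model]
    at_or_below = sum(1 for s in scores.values() if s <= target)
    return 100.0 * at_or_below / len(scores)


if __name__ == "__main__":
    # The same model gets a different percentile in each group.
    for group in ("bench-x/v1", "bench-x/v2"):
        pct = percentile_within_group(RESULTS, "model-a", group)
        print(f"{group}: {pct:.1f}")
```

Because each group is filtered before ranking, a percentile from one benchmark version cannot be compared with a percentile from another.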
TerminalBench task registry
Categories
software-engineering (26), system-administration (9), data-science (8), scientific-computing (8), security (8), debugging (5), file-operations (5), data-processing (4), mathematics (4), model-training (4), machine-learning (3), data-querying (1), games (1), optimization (1), personal-assistant (1), video-processing (1)
Difficulty
medium (55), hard (30), easy (4)
Top tags
coding (18), file-operations (10), system (10), security (9), data-processing (7), software-engineering (7), compilation (4), machine-learning (4), version-control (4), biology (3), cloning (3), data-science (3), sys-admin (3), web (3), build-tools (2), corewars (2), gaming (2), git (2), image-processing (2), images (2)
adaptive-rejection-sampler
scientific-computing · medium
applied-statistics, adaptive-rejection-sampling, Bayesian-inference, simulation
bn-fit-modify
scientific-computing · hard
bayesian-network, stats
break-filter-js-from-html
security · medium
security
build-cython-ext
debugging · medium
coding, dependency, compilation
build-pmars
software-engineering · medium
build-tools, compilation, debian, gaming, pmars, corewars
build-pov-ray
software-engineering · medium
build-tools, compilation, graphics, ray-tracing, legacy-software, research
caffe-cifar-10
machine-learning · medium
cnn, caffe
cancel-async-tasks
software-engineering · hard
async, concurrency, python
chess-best-move
games · medium
circuit-fibsqrt
software-engineering · hard
software-engineering
cobol-modernization
software-engineering · easy
coding
code-from-image
software-engineering · medium
ocr
compile-compcert
system-administration · medium
compilation, compilers
configure-git-webserver
system-administration · hard
system, version-control, web
constraints-scheduling
personal-assistant · medium
calendar, scheduling, constraint-satisfaction, ics-parsing, temporal-reasoning
count-dataset-tokens
model-training · medium
machine-learning, data, datasets, tokenization, huggingface
crack-7z-hash
security · medium
decrypt, security, file-operations
custom-memory-heap-crash
debugging · medium
cpp, memory-management, debugging
db-wal-recovery
file-operations · medium
database, encryption, recovery
distribution-search
machine-learning · medium
coding, statistics, machine-learning
dna-assembly
scientific-computing · hard
biology, cloning
dna-insert
scientific-computing · medium
biology, cloning
extract-elf
file-operations · medium
extract-moves-from-video
file-operations · hard
file-operations, web, video-processing
feal-differential-cryptanalysis
mathematics · hard
software-engineering
feal-linear-cryptanalysis
mathematics · hard
software-engineering
filter-js-from-html
security · medium
security
financial-document-processor
data-processing · medium
ocr, image-processing, financial, file-operations
fix-code-vulnerability
security · hard
security, code-vulnerability, common-weakness-enumeration
fix-git
software-engineering · easy
coding, version-control
fix-ocaml-gc
software-engineering · hard
troubleshooting
gcode-to-text
file-operations · medium
file-operations
git-leak-recovery
software-engineering · medium
git, security
git-multibranch
system-administration · medium
system, version-control, web
gpt2-codegolf
software-engineering · hard
headless-terminal
software-engineering · medium
bash, terminal
hf-model-inference
data-science · medium
api, coding, data-processing, data-science
install-windows-3.11
system-administration · hard
virtualization, qemu, windows-3.11, vnc, sys-admin, retro-computing
kv-store-grpc
software-engineering · medium
coding, file-operations, system
large-scale-text-editing
file-operations · medium
text-editing, large-scale-text-manipulation, vim, vim-macros
largest-eigenval
mathematics · medium
coding, optimization, constraint, numerical-approximation
llm-inference-batching-scheduler
machine-learning · hard
batching, inference, performance-optimization, scheduling
log-summary-date-ranges
data-processing · medium
log-analysis, report-generation, data-processing
mailman
system-administration · medium
email-server, mailing-list
make-doom-for-mips
software-engineering · hard
software-engineering
make-mips-interpreter
software-engineering · hard
software-engineering
mcmc-sampling-stan
data-science · hard
R, stan, bayesian-statistics, mcmc
merge-diff-arc-agi-task
debugging · medium
git, coding
model-extraction-relu-logits
mathematics · hard
security
modernize-scientific-stack
scientific-computing · medium
python-migration, scientific-computing, legacy-modernization, climate-science, data-processing
mteb-leaderboard
data-science · medium
retrieval, mteb
mteb-retrieve
data-science · medium
data-processing, data-science, mteb
multi-source-data-merger
data-processing · medium
data-processing, etl, schema-mapping, conflict-resolution, pandas, parquet
nginx-request-logging
system-administration · medium
web-server
openssl-selfsigned-cert
security · medium
coding, file-operations, security, system
overfull-hbox
debugging · easy
latex, document-processing, combinatorial-optimization
password-recovery
security · hard
system, file-operations, troubleshooting
path-tracing
software-engineering · hard
images
path-tracing-reverse
software-engineering · hard
images
polyglot-c-py
software-engineering · medium
coding
polyglot-rust-c
software-engineering · hard
coding, no-verified-solution
portfolio-optimization
optimization · medium
c-programming, python-extension, optimization
protein-assembly
scientific-computing · hard
biology, cloning, proteins
prove-plus-comm
software-engineering · easy
coding
pypi-server
software-engineering · medium
coding, system
pytorch-model-cli
model-training · medium
coding, C, pytorch
pytorch-model-recovery
model-training · medium
coding, pytorch, machine-learning
qemu-alpine-ssh
system-administration · medium
sys-admin
qemu-startup
system-administration · medium
sys-admin
query-optimize
data-science · medium
query-optimization, sql-query
raman-fitting
scientific-computing · medium
coding, fitting, analysis, physics
regex-chess
software-engineering · hard
software-engineering
regex-log
data-processing · medium
regex, string-parsing, log-analysis
reshard-c4-data
data-science · medium
coding, data-processing, file-operations
rstan-to-pystan
data-science · medium
pystan, rstan, gaussian-process
sam-cell-seg
data-science · hard
image-processing, machine-learning, histopathology
sanitize-git-repo
security · medium
security, system, version-control
schemelike-metacircular-eval
software-engineering · medium
software-engineering
sparql-university
data-querying · hard
knowledge-graph, sparql-query, information-retrieval
sqlite-db-truncate
debugging · medium
file-operations
sqlite-with-gcov
system-administration · medium
software-installation, system
torch-pipeline-parallelism
software-engineering · hard
system
torch-tensor-parallelism
software-engineering · hard
system
train-fasttext
model-training · hard
data-processing, data-science
tune-mjcf
scientific-computing · medium
mujoco, physics, simulation, numerical-optimization
video-processing
video-processing · hard
video-processing
vulnerable-secret
security · medium
security, file-operations
winning-avg-corewars
software-engineering · medium
pmars, corewars, gaming
write-compressor
software-engineering · hard
coding