Enable terminalbench CI smoke runs by neubig · Pull Request #491 · OpenHands/benchmarks

neubig · 2026-03-08T15:12:24Z

Summary

Fix the Terminal-Bench Harbor defaults used by the benchmarks repo (dataset terminal-bench@2.0) and add --n-limit passthrough for CI smoke runs.
Update the Terminal-Bench docs and tests, and expose terminalbench in the benchmarks dispatch workflow.
Record the Harbor package/dataset gotchas in AGENTS.md.

Details

Harbor's installable package is harbor, not harbor-bench.
I validated the Harbor registry entry locally: terminal-bench version 2.0 currently exposes 89 tasks.
This change keeps the smoke-run path aligned with the evaluation-side terminalbench support.

Testing

make build
uv run pre-commit run --files benchmarks/terminalbench/config.py benchmarks/terminalbench/run_infer.py benchmarks/terminalbench/README.md tests/test_terminalbench.py .github/workflows/run-eval.yml
uv run pytest tests/test_terminalbench.py

Evidence

Verification link: View conversation

Follow-up investigation: the previously cited terminalbench smoke run did not complete end-to-end, so this PR is being moved back to draft pending real live-run evidence.

$ gh run view 22823734279 --repo OpenHands/software-agent-sdk --json status,conclusion,displayTitle,url
{"conclusion":"success","displayTitle":"Run Eval (terminalbench) Smoke test for OpenHands/benchmarks#490","status":"completed","url":"https://github.com/OpenHands/software-agent-sdk/actions/runs/22823734279"}

$ gh run view 22823745521 --repo OpenHands/evaluation --json status,conclusion,displayTitle,url
{"conclusion":"success","displayTitle":"Eval Job (terminalbench) Smoke test for OpenHands/benchmarks#490","status":"completed","url":"https://github.com/OpenHands/evaluation/actions/runs/22823745521"}

$ # Datadog pod logs for eval-22823745521-claude-son* service:python
[2026-03-08 15:11:16 UTC] Benchmark: terminalbench
[2026-03-08 15:11:16 UTC] Dispatching terminalbench build for SDK commit: 77c68ccfd7bdffb27be88e8793f76cafc45faf9d
[2026-03-08 15:11:17 UTC] ERROR: Benchmarks build dispatch failed (status 404): {"message":"Not Found","documentation_url":"https://docs.github.com/rest/actions/workflows#create-a-workflow-dispatch-event","status":"404"}
[2026-03-08 15:11:17 UTC] Deleted temporary branch: dispatch-22823745521

The GitHub Actions runs only proved that the workflow dispatch/deploy path was reachable. Datadog shows the orchestration failed before the evaluation phase, so there was no completed benchmark run, no uploaded results archive, and no Slack success notification.
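
As a rough illustration of why a green workflow conclusion is not sufficient evidence on its own, the sketch below combines the `conclusion` field reported by `gh run view` with a scan of the pod log for `ERROR:` entries. The function names and the assumed log format (a bracketed timestamp followed by the message) are inferred from the excerpt above and are not code from this PR; the error line in the sample is abbreviated.

```python
import re

# Assumed log format, inferred from the excerpt above: "[<timestamp>] <message>"
LOG_LINE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)$")


def find_errors(log_text):
    """Return (timestamp, message) pairs for lines whose message starts with 'ERROR:'."""
    errors = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("msg").startswith("ERROR:"):
            errors.append((match.group("ts"), match.group("msg")))
    return errors


def is_live_run_evidence(conclusion, log_text):
    """Hypothetical check: require a successful workflow AND an error-free pod log."""
    return conclusion == "success" and not find_errors(log_text)


# Abbreviated copy of the Datadog excerpt quoted in this PR description.
SAMPLE_LOG = """\
[2026-03-08 15:11:16 UTC] Benchmark: terminalbench
[2026-03-08 15:11:17 UTC] ERROR: Benchmarks build dispatch failed (status 404)
[2026-03-08 15:11:17 UTC] Deleted temporary branch: dispatch-22823745521
"""

print(find_errors(SAMPLE_LOG))
print(is_live_run_evidence("success", SAMPLE_LOG))
```

With the sample log, the workflow conclusion is `success` but the check still rejects the run, which matches the situation described above.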

Likely root cause: OpenHands/evaluation currently derives the benchmark build workflow name as build-{benchmark}-images.yml, which becomes build-terminalbench-images.yml. That workflow file does not exist on OpenHands/benchmarks (including branch openhands/terminalbench-ci-490), so the dispatch returns HTTP 404.
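
A minimal sketch of the suspected failure mode, assuming the naming pattern described above. The helper names and the list of existing workflow files are hypothetical (only `run-eval.yml` is mentioned in this PR; `build-swebench-images.yml` is a made-up example), and the actual OpenHands/evaluation code may differ.

```python
def build_workflow_filename(benchmark):
    """Derive the build workflow filename using the pattern build-{benchmark}-images.yml."""
    return f"build-{benchmark}-images.yml"


def missing_build_workflows(benchmarks, existing_workflows):
    """Return derived workflow filenames that are absent from the target repo.

    Dispatching a workflow file that does not exist is what would produce
    the HTTP 404 seen in the pod logs.
    """
    existing = set(existing_workflows)
    return [
        build_workflow_filename(name)
        for name in benchmarks
        if build_workflow_filename(name) not in existing
    ]


# Hypothetical workflow listing for the benchmarks repo, for illustration only.
existing = ["build-swebench-images.yml", "run-eval.yml"]
print(missing_build_workflows(["swebench", "terminalbench"], existing))
```

A pre-dispatch check along these lines would surface the missing file before the dispatch call fails, instead of relying on the 404 from the GitHub API.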

Checklist

CI passing
Tests are minimal and pass
No unnecessary code
Evidence from live run (with conversation link if available)
All review comments resolved
Documentation updated (if applicable)

Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot

🟢 Good taste - Clean, focused fix that solves real problems (wrong package/dataset names) and adds practical CI smoke test support. Tests appropriately validate command construction without requiring full Harbor integration. No fundamental issues found.

all-hands-bot

🟢 Good taste - Clean, focused fix that solves real problems (wrong package/dataset names harbor-bench → harbor, terminal-bench-2 → terminal-bench@2.0) and adds practical CI smoke test support with --n-limit. Tests appropriately validate command construction without requiring full Harbor integration. Evidence provided shows successful smoke runs. No fundamental issues found.

Verdict: ✅ Worth merging

Key insight: Pragmatic fix that solves real integration issues with minimal, well-tested code and proper documentation.

Enable terminalbench CI smoke runs

6fbdb66

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai bot mentioned this pull request Mar 8, 2026

Make terminalbench work with CI pipeline #490

Open

all-hands-bot approved these changes Mar 8, 2026

neubig marked this pull request as draft March 9, 2026 03:07

neubig marked this pull request as ready for review March 9, 2026 17:44

all-hands-bot approved these changes Mar 9, 2026

neubig marked this pull request as draft March 10, 2026 12:50

neubig wants to merge 1 commit into main from openhands/terminalbench-ci-490


3 participants