Enable terminalbench CI smoke runs #491

Draft
neubig wants to merge 1 commit into main from openhands/terminalbench-ci-490

neubig (Contributor) commented Mar 8, 2026

Summary

  • fix the Terminal-Bench Harbor defaults used by the benchmarks repo (terminal-bench@2.0) and add --n-limit passthrough for CI smoke runs
  • update Terminal-Bench docs/tests and expose terminalbench in the benchmarks dispatch workflow
  • record the Harbor package/dataset gotchas in AGENTS.md
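
As a rough illustration of the `--n-limit` passthrough described above, here is a minimal sketch of a wrapper that forwards the limit only when it is set. The function names, the `harbor run --dataset` spelling, and the Harbor-side flag used for the limit are all assumptions made for illustration; the PR description only confirms the `terminal-bench@2.0` dataset id and the `--n-limit` option.

```python
import argparse
from typing import List, Optional

# ASSUMPTION: placeholder for whatever flag the Harbor CLI actually accepts.
HARBOR_LIMIT_FLAG = "--n-limit"


def build_harbor_command(dataset: str, n_limit: Optional[int] = None) -> List[str]:
    """Build an argv list for a Harbor run (hypothetical helper).

    The limit flag is appended only when a limit was requested, so full
    runs keep the same command line as before.
    """
    cmd = ["harbor", "run", "--dataset", dataset]  # ASSUMPTION: illustrative spelling
    if n_limit is not None:
        if n_limit < 1:
            raise ValueError("n_limit must be a positive integer")
        cmd += [HARBOR_LIMIT_FLAG, str(n_limit)]
    return cmd


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Parse the wrapper's own options (hypothetical CLI surface)."""
    parser = argparse.ArgumentParser(description="Terminal-Bench smoke-run wrapper (sketch)")
    parser.add_argument("--dataset", default="terminal-bench@2.0")
    parser.add_argument("--n-limit", type=int, default=None)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["--n-limit", "1"])
    print(build_harbor_command(args.dataset, args.n_limit))
```

Returning a list rather than a shell string keeps the command easy to assert on in unit tests without invoking Harbor.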

Details

  • Harbor's installable package is harbor, not harbor-bench.
  • I validated the Harbor registry entry locally: terminal-bench version 2.0 currently exposes 89 tasks.
  • This change keeps the smoke-run path aligned with the evaluation-side terminalbench support.

Testing

  • make build
  • uv run pre-commit run --files benchmarks/terminalbench/config.py benchmarks/terminalbench/run_infer.py benchmarks/terminalbench/README.md tests/test_terminalbench.py .github/workflows/run-eval.yml
  • uv run pytest tests/test_terminalbench.py

Evidence

Verification link: View conversation

Follow-up investigation: the previously cited terminalbench smoke run did not complete end-to-end, so this PR is being moved back to draft pending real live-run evidence.

$ gh run view 22823734279 --repo OpenHands/software-agent-sdk --json status,conclusion,displayTitle,url
{"conclusion":"success","displayTitle":"Run Eval (terminalbench) Smoke test for OpenHands/benchmarks#490","status":"completed","url":"https://github.com/OpenHands/software-agent-sdk/actions/runs/22823734279"}

$ gh run view 22823745521 --repo OpenHands/evaluation --json status,conclusion,displayTitle,url
{"conclusion":"success","displayTitle":"Eval Job (terminalbench) Smoke test for OpenHands/benchmarks#490","status":"completed","url":"https://github.com/OpenHands/evaluation/actions/runs/22823745521"}

$ # Datadog pod logs for eval-22823745521-claude-son* service:python
[2026-03-08 15:11:16 UTC] Benchmark: terminalbench
[2026-03-08 15:11:16 UTC] Dispatching terminalbench build for SDK commit: 77c68ccfd7bdffb27be88e8793f76cafc45faf9d
[2026-03-08 15:11:17 UTC] ERROR: Benchmarks build dispatch failed (status 404): {"message":"Not Found","documentation_url":"https://docs.github.com/rest/actions/workflows#create-a-workflow-dispatch-event","status":"404"}
[2026-03-08 15:11:17 UTC] Deleted temporary branch: dispatch-22823745521

The GitHub Actions runs only proved that the workflow dispatch/deploy path was reachable. Datadog shows the orchestration failed before the evaluation phase, so there was no completed benchmark run, no uploaded results archive, and no Slack success notification.
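
The gap between "workflow succeeded" and "evaluation completed" can be sketched as a small check that combines both signals. This is only an illustration: `evaluation_completed` is a hypothetical helper, and the `ERROR:` substring match is an assumption based on the log format shown above, not an existing tool.

```python
import json
from typing import List


def evaluation_completed(run_json: str, log_lines: List[str]) -> bool:
    """Return True only if the workflow succeeded AND the pod logs show no error.

    A green workflow run alone is not treated as evidence of a finished
    benchmark, because the orchestration can fail after dispatch.
    """
    run = json.loads(run_json)
    workflow_ok = run.get("status") == "completed" and run.get("conclusion") == "success"
    log_errors = [line for line in log_lines if "ERROR:" in line]
    return workflow_ok and not log_errors


# Inputs abbreviated from the gh output and Datadog log lines quoted above.
run_json = '{"conclusion": "success", "status": "completed"}'
log_lines = [
    "[2026-03-08 15:11:16 UTC] Benchmark: terminalbench",
    "[2026-03-08 15:11:17 UTC] ERROR: Benchmarks build dispatch failed (status 404)",
]

print(evaluation_completed(run_json, log_lines))
```

With the inputs above the workflow status is green, but the error line in the logs makes the check report that the evaluation did not complete.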

Likely root cause: OpenHands/evaluation currently derives the benchmark build workflow name as build-{benchmark}-images.yml, which becomes build-terminalbench-images.yml. That workflow file does not exist on OpenHands/benchmarks (including branch openhands/terminalbench-ci-490), so the dispatch returns HTTP 404.
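
To make the suspected failure mode concrete, here is a minimal sketch of the name derivation together with a pre-dispatch existence check. Only the `build-{benchmark}-images.yml` pattern comes from the analysis above; the function names and the example workflow listing are hypothetical, and this is not the actual code in OpenHands/evaluation.

```python
from typing import Set


def build_workflow_name(benchmark: str) -> str:
    """Derive the build workflow filename from the benchmark name."""
    return f"build-{benchmark}-images.yml"


def check_dispatch_target(benchmark: str, existing_workflows: Set[str]) -> str:
    """Fail early, with a readable message, instead of waiting for an HTTP 404.

    existing_workflows is assumed to be the set of workflow filenames
    present on the target branch (how it is obtained is out of scope here).
    """
    name = build_workflow_name(benchmark)
    if name not in existing_workflows:
        raise FileNotFoundError(
            f"workflow {name!r} not found; dispatch would return HTTP 404"
        )
    return name


# Hypothetical listing: the real set of workflow files is not given in this PR.
existing = {"build-example-images.yml"}

try:
    check_dispatch_target("terminalbench", existing)
except FileNotFoundError as exc:
    print(exc)
```

A check like this would surface the missing file before the dispatch call, rather than leaving it to be discovered in pod logs afterwards.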

Checklist

  • CI passing
  • Tests are minimal and pass
  • No unnecessary code
  • Evidence from live run (with conversation link if available)
  • All review comments resolved
  • Documentation updated (if applicable)

Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot (Collaborator) left a comment

🟢 Good taste - Clean, focused fix that solves real problems (wrong package/dataset names) and adds practical CI smoke test support. Tests appropriately validate command construction without requiring full Harbor integration. No fundamental issues found.

neubig marked this pull request as draft on March 9, 2026, 03:07
neubig marked this pull request as ready for review on March 9, 2026, 17:44

all-hands-bot (Collaborator) left a comment

🟢 Good taste - Clean, focused fix that solves real problems (wrong package/dataset names: harbor-bench → harbor, terminal-bench-2 → terminal-bench@2.0) and adds practical CI smoke test support with --n-limit. Tests appropriately validate command construction without requiring full Harbor integration. Evidence provided shows successful smoke runs. No fundamental issues found.

Verdict: ✅ Worth merging

Key insight: Pragmatic fix that solves real integration issues with minimal, well-tested code and proper documentation.

neubig marked this pull request as draft on March 10, 2026, 12:50