Expose terminalbench in run-eval workflow #2360

Draft
neubig wants to merge 1 commit into main from openhands/terminalbench-ci-490

Conversation

neubig (Contributor) commented Mar 8, 2026

Summary

  • add terminalbench to the manual run-eval workflow choices
  • update the internal run-eval skill docs so agents know the new benchmark option exists
  • verify the cross-repo dispatch path against the corresponding evaluation/benchmarks feature branches

Details

  • The smoke run used benchmark=terminalbench, eval_limit=5, sdk_ref=main, eval_branch=openhands/terminalbench-ci-490, and benchmarks_branch=openhands/terminalbench-ci-490.
  • This PR pairs with matching workflow/input changes in OpenHands/evaluation and benchmark-side Harbor fixes in OpenHands/benchmarks.
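
For reference, the smoke-run inputs listed above map onto a GitHub `workflow_dispatch` request roughly as sketched below. This is an illustration only: the workflow file name `run-eval.yml`, the dispatch `ref`, and the helper `build_dispatch_request` are assumptions that this PR does not confirm, and the request is only constructed, never sent.

```python
import json

API_ROOT = "https://api.github.com"


def build_dispatch_request(repo, workflow_file, ref, inputs):
    """Return (url, json_body) for GitHub's 'create a workflow dispatch event' endpoint."""
    url = f"{API_ROOT}/repos/{repo}/actions/workflows/{workflow_file}/dispatches"
    # Input values are sent as strings here (an assumption about how the workflow reads them).
    body = {"ref": ref, "inputs": {key: str(value) for key, value in inputs.items()}}
    return url, json.dumps(body, sort_keys=True)


url, body = build_dispatch_request(
    repo="OpenHands/software-agent-sdk",
    workflow_file="run-eval.yml",  # assumed file name for the run-eval workflow
    ref="openhands/terminalbench-ci-490",  # assumed: the branch that adds the new choice
    inputs={
        "benchmark": "terminalbench",
        "eval_limit": 5,
        "sdk_ref": "main",
        "eval_branch": "openhands/terminalbench-ci-490",
        "benchmarks_branch": "openhands/terminalbench-ci-490",
    },
)
print(url)
print(body)
```

Building the payload separately from sending it makes it easy to review exactly which inputs a smoke run would use before anything is dispatched.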

Testing

Evidence

Verification link: View conversation

Follow-up investigation: the previously cited terminalbench smoke run did not complete end-to-end, so this PR is being moved back to draft pending real live-run evidence.

$ gh run view 22823734279 --repo OpenHands/software-agent-sdk --json status,conclusion,displayTitle,url
{"conclusion":"success","displayTitle":"Run Eval (terminalbench) Smoke test for OpenHands/benchmarks#490","status":"completed","url":"https://github.com/OpenHands/software-agent-sdk/actions/runs/22823734279"}

$ gh run view 22823745521 --repo OpenHands/evaluation --json status,conclusion,displayTitle,url
{"conclusion":"success","displayTitle":"Eval Job (terminalbench) Smoke test for OpenHands/benchmarks#490","status":"completed","url":"https://github.com/OpenHands/evaluation/actions/runs/22823745521"}

$ # Datadog pod logs for eval-22823745521-claude-son* service:python
[2026-03-08 15:11:16 UTC] Benchmark: terminalbench
[2026-03-08 15:11:16 UTC] Dispatching terminalbench build for SDK commit: 77c68ccfd7bdffb27be88e8793f76cafc45faf9d
[2026-03-08 15:11:17 UTC] ERROR: Benchmarks build dispatch failed (status 404): {"message":"Not Found","documentation_url":"https://docs.github.com/rest/actions/workflows#create-a-workflow-dispatch-event","status":"404"}
[2026-03-08 15:11:17 UTC] Deleted temporary branch: dispatch-22823745521

The GitHub Actions runs only proved that the workflow dispatch/deploy path was reachable. Datadog shows the orchestration failed before the evaluation phase, so there was no completed benchmark run, no uploaded results archive, and no Slack success notification.
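
The gap between a green Actions run and a failed orchestration can be checked mechanically. The sketch below scans the Datadog lines quoted above for an `ERROR:` entry; the helper names `first_orchestration_error` and `run_completed` are hypothetical, and the regular expression assumes the log format shown in this description.

```python
import re

# Datadog lines copied from the excerpt in this PR description.
LOG_LINES = [
    '[2026-03-08 15:11:16 UTC] Benchmark: terminalbench',
    '[2026-03-08 15:11:16 UTC] Dispatching terminalbench build for SDK commit: 77c68ccfd7bdffb27be88e8793f76cafc45faf9d',
    '[2026-03-08 15:11:17 UTC] ERROR: Benchmarks build dispatch failed (status 404): {"message":"Not Found","documentation_url":"https://docs.github.com/rest/actions/workflows#create-a-workflow-dispatch-event","status":"404"}',
    '[2026-03-08 15:11:17 UTC] Deleted temporary branch: dispatch-22823745521',
]

# Assumed format: "ERROR: <message> (status <code>)".
ERROR_PATTERN = re.compile(r"ERROR: (?P<message>.*?) \(status (?P<status>\d+)\)")


def first_orchestration_error(lines):
    """Return (message, status) for the first ERROR line, or None if there is none."""
    for line in lines:
        match = ERROR_PATTERN.search(line)
        if match:
            return match.group("message"), int(match.group("status"))
    return None


def run_completed(actions_conclusion, lines):
    """Treat a run as complete only if Actions succeeded and the pod logged no error."""
    return actions_conclusion == "success" and first_orchestration_error(lines) is None


print(first_orchestration_error(LOG_LINES))
print(run_completed("success", LOG_LINES))
```

With the lines above, the Actions conclusion alone would report success, while the combined check reports that the run did not complete.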

Likely root cause: OpenHands/evaluation currently derives the benchmark build workflow name as build-{benchmark}-images.yml, which becomes build-terminalbench-images.yml. That workflow file does not exist on OpenHands/benchmarks (including branch openhands/terminalbench-ci-490), so the dispatch returns HTTP 404.
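
A minimal sketch of the suspected failure mode, assuming the naming rule described above. The functions `derive_build_workflow` and `check_build_workflow` and the sample workflow listing are hypothetical; they are not taken from OpenHands/evaluation or OpenHands/benchmarks.

```python
def derive_build_workflow(benchmark):
    """Apply the build-{benchmark}-images.yml naming rule described above."""
    return f"build-{benchmark}-images.yml"


def check_build_workflow(benchmark, available_workflows):
    """Fail early, with a clear message, if the derived workflow file is missing."""
    workflow = derive_build_workflow(benchmark)
    if workflow not in available_workflows:
        raise LookupError(
            f"{workflow} not found; dispatching it would return HTTP 404"
        )
    return workflow


# Hypothetical listing of workflow files on the benchmarks branch.
available = {"build-example-images.yml"}

try:
    check_build_workflow("terminalbench", available)
except LookupError as error:
    print(error)
```

A pre-flight check of this kind would surface the missing workflow before the temporary dispatch branch is created, rather than after the dispatch call fails.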

Checklist

  • CI passing
  • Tests are minimal and pass
  • No unnecessary code
  • Evidence from live run (with conversation link if available)
  • All review comments resolved
  • Documentation updated (if applicable)

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions bot (Contributor) commented Mar 8, 2026

API breakage checks (Griffe)

Result: Passed

Action log

github-actions bot (Contributor) commented Mar 8, 2026

Agent server REST API breakage checks (OpenAPI)

Result: Failed

Log excerpt (first 1000 characters)
{"asctime": "2026-03-08 15:13:03,092", "levelname": "WARNING", "name": "openhands.agent_server.config", "filename": "config.py", "lineno": 173, "message": "\u26a0\ufe0f OH_SECRET_KEY was not defined. Secrets will not be persisted between restarts."}
::error title=openhands-agent-server REST API::Breaking REST API change detected without MINOR version bump (1.12.0 -> 1.12.0).

Breaking REST API changes detected compared to baseline release:
- the 'file' request property type/format changed from 'string'/'' to 'string'/'binary'
/home/runner/work/software-agent-sdk/software-agent-sdk/.venv/lib/python3.13/site-packages/litellm/llms/custom_httpx/async_client_cleanup.py:66: DeprecationWarning: There is no current event loop
  loop = asyncio.get_event_loop()

Action log

all-hands-bot (Collaborator) left a comment

Taste Rating: 🟢 Good taste

Simple, straightforward configuration change. Adds terminalbench to workflow choices and updates skill docs. Follows existing patterns, tested with smoke run. LGTM! 🚀

neubig marked this pull request as draft March 9, 2026 03:07
neubig marked this pull request as ready for review March 9, 2026 17:44

all-hands-bot (Collaborator) left a comment

Taste Rating: 🟢 Good taste

Simple, straightforward configuration change. Adds terminalbench to workflow choices and updates skill docs. Follows existing patterns, tested with smoke run. LGTM! 🚀

neubig marked this pull request as draft March 10, 2026 12:50