feat: WAA eval pipeline fixes, instrumentation, and demo-conditioned evaluation #35

Open

abrichr wants to merge 14 commits into main from fix/pool-automation
Conversation


@abrichr abrichr commented Feb 19, 2026

Summary

End-to-end improvements to the WAA evaluation pipeline, from infrastructure fixes through instrumentation to the first demo-conditioned evaluation results.

Infrastructure & Pool Automation

  • Fix WAA probe IP, add QMP support, add pool-auto command
  • Use waa-auto Docker image instead of broken windowsarena/winarena
  • Replace fragile streaming SSH with docker exec -d + tail -f --pid
  • Add dedicated evaluate server with socat proxy
  • Add 120s timeout to OpenAI API calls

Agent Fixes

  • Handle double_click, right_click, and drag in action parser (was silently terminating tasks)
  • Fix coordinate normalization: auto-detect actual screen size from screenshot via PIL instead of hardcoded 1920x1200 (actual VM is 1280x720)
  • Update system prompt to reference screenshot dimensions instead of hardcoded resolution

Visualization & Instrumentation

  • Add agent instrumentation: plan_result, parse_strategy, token_usage, agent_think_ms, env_execute_ms per step
  • Add comparison viewer (openadapt-evals compare) for side-by-side ZS vs DC replay
  • Enhance viewer with Agent Thinking panel, heatmap overlay, keyboard shortcuts, timeline
  • Simplify click marker positioning (percentage-based in img-wrapper div)
  • Add backward-compatible divergence check for old vs new coordinate data

Evaluation Results

Test plan

  • 216 tests pass (uv run pytest tests/ --ignore=tests/test_api_agent_ml.py)
  • Mock eval passes (openadapt-evals mock --tasks 1)
  • Pool create -> wait -> run end-to-end on Azure
  • 6 live WAA evals completed with correct coordinate storage
  • Viewers render correctly for both old and new data
  • Deploy WAA /evaluate endpoint for actual pass/fail scoring

🤖 Generated with Claude Code

abrichr and others added 13 commits February 18, 2026 18:52
The `while True: pass` loop burned an entire CPU core during recording.
Replace with `time.sleep(0.5)` to yield CPU while waiting for Ctrl+C.
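The shape of the fix can be sketched as follows (a minimal illustration, not the recorder's actual code; the real loop waits for Ctrl+C rather than a predicate):

```python
import time

def wait_until(stop, poll_interval: float = 0.5) -> None:
    """Poll a stop condition at a low rate instead of spinning.

    `while True: pass` pegs a CPU core; sleeping between checks yields
    the core while still reacting to the stop condition (or Ctrl+C,
    raised as KeyboardInterrupt during the sleep) within ~poll_interval.
    """
    while not stop():
        time.sleep(poll_interval)
```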

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Call recorder.wait_for_ready() before entering the wait loop
- Use recorder.is_recording check and 1s sleep to match CLI behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The third WAA task requires .docx files in Documents. The script now
creates empty report.docx, meeting_notes.docx, and proposal.docx
before recording that task, and cleans up any Archive folder from
previous runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change "Press Ctrl+C" to "press Ctrl 3 times" (matches stop sequence)
- Clarify wormhole send instructions (each send blocks until received)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The DOCKER_SETUP_SCRIPT builds waa-auto:latest (based on dockurr/windows:latest
which can auto-download Windows ISO) but WAA_START_SCRIPT and setup-waa were
starting windowsarena/winarena:latest which uses the old dockurr/windows v0.00
that cannot download the ISO, causing "ISO file not found" error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three bugs prevented pool-run from working:

1. WAA probe used 172.30.0.2 (QEMU guest IP) but Docker port-forwards
   to localhost — pool-wait timed out every time. Changed to localhost
   in pool.py and vm_monitor.py.

2. dockurr/windows base image doesn't configure QMP (QEMU Machine
   Protocol). WAA client needs QMP on port 7200 for VM status. Added
   ARGUMENTS env var to inject -qmp flag into QEMU startup.

3. Config defaults had Standard_D2_v3 (8GB, OOMs) and old
   windowsarena/winarena image. Fixed to D8ds_v5 and waa-auto.

Also adds:
- pool-auto command: single oa-vm pool-auto --workers N --tasks M
  chains create → wait → run
- /evaluate endpoint injection in waa_deploy Dockerfile
- Handle WAA server wrapping 404 in 500 responses (live.py)
- openai dependency for API agents
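For the QMP fix (item 2), the injection can be sketched like this — the `ARGUMENTS` env var and port follow the commit message, but the exact QEMU listener string and the helper are illustrative assumptions:

```python
QMP_PORT = 7200  # port the WAA client polls for VM status (per commit message)

# dockurr/windows appends ARGUMENTS to its QEMU invocation, so setting it
# on the container opens a QMP server without patching the base image.
docker_env = {
    "ARGUMENTS": f"-qmp tcp:0.0.0.0:{QMP_PORT},server,nowait",
}

def docker_env_flags(env: dict) -> list:
    """Render env vars as `docker run -e KEY=VALUE` flag pairs."""
    flags = []
    for key, value in env.items():
        flags += ["-e", f"{key}={value}"]
    return flags
```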

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion

Replace fragile streaming SSH with docker exec -d (detached) for
starting benchmarks. Logs stream via tail -f --pid which auto-exits
when the benchmark finishes. On SSH drop, reconnects and resumes.
Also adds 120s timeout to OpenAI API calls to prevent infinite hangs.
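The two remote commands can be sketched as builders (container name, log path, and helper names here are placeholders, not the real pool.py code):

```python
import shlex

def detached_start_command(container: str, benchmark_cmd: str) -> str:
    """`docker exec -d` returns immediately, so the benchmark keeps
    running inside the container even if the SSH session drops."""
    return f"docker exec -d {container} sh -c {shlex.quote(benchmark_cmd)}"

def log_stream_command(container: str, log_path: str, pid: int) -> str:
    """`tail -f --pid=PID` exits on its own once the benchmark process
    dies, so after an SSH drop the caller can reconnect and simply
    re-run this command to resume streaming."""
    return f"docker exec {container} tail -f --pid={pid} {log_path}"
```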

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WAA's run.py ignores --tasks and runs all 154 tasks based on
worker_id/num_workers. Fix by creating a subset test JSON with
only the requested number of tasks and passing it via
--test_all_meta_path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a standalone evaluate server (port 5050) that runs inside the WAA
Docker container and has direct access to WAA evaluator modules. This
avoids needing to patch the WAA Flask server's /evaluate endpoint.

- Add evaluate_server.py and start_with_evaluate.sh
- Add evaluate_url config to WAALiveConfig
- Set up socat proxy (5051→5050) for Docker bridge networking
- Add SSH tunnel for evaluate port
- Simplify Dockerfile

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ments

Instrumentation (captures richer data per step):
- Propagate agent logs (LLM response, parse strategy, demo info,
  loop detection, memory) from ApiAgent to execution trace
- Add per-step timing (agent_think_ms, env_execute_ms)
- Capture token counts from OpenAI/Anthropic API responses

Viewer enhancements (viewer.py):
- Agent Thinking panel showing LLM response, memory, parse strategy
- Action timeline bar color-coded by action type
- Click heatmap overlay showing click frequency hotspots
- Click marker using raw pixel coords for correct positioning

Comparison viewer (new):
- comparison_viewer.py generates side-by-side HTML comparisons
- Synchronized step slider, click markers, action diffs
- First-divergence detection, action type distribution charts
- CLI 'compare' command for generating comparisons
- Demo prompts and initial eval results for 3 WAA tasks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_parse_computer_action() only handled click, type, press, hotkey, and
scroll. Any other action (double_click, right_click, drag) fell through
to the default return of type="done", which prematurely terminated the
task. This caused the demo-conditioned notepad eval to stop after 1 step
when the agent correctly issued computer.double_click() to open Notepad.

Also add a warning log when an unrecognized action falls through,
and update viewer regexes to handle double_click/right_click coordinates.
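The structure of the fix can be sketched as below (a simplified stand-in for `_parse_computer_action()`; the action-string format and return shape are assumptions):

```python
import re

# Coordinate-taking actions that previously fell through to the default
# "done" branch and silently ended the task.
_COORD_ACTIONS = {"click", "double_click", "right_click"}

def parse_computer_action(line: str) -> dict:
    m = re.match(r"computer\.(\w+)\((.*)\)", line.strip())
    if not m:
        return {"type": "done"}
    name, args = m.groups()
    if name in _COORD_ACTIONS:
        x, y = (float(v) for v in args.split(","))
        return {"type": name, "x": x, "y": y}
    if name == "drag":
        x1, y1, x2, y2 = (float(v) for v in args.split(","))
        return {"type": "drag", "from": (x1, y1), "to": (x2, y2)}
    # Warn instead of silently terminating on unrecognized actions.
    print(f"warning: unrecognized action {name!r}, treating as no-op")
    return {"type": "noop"}
```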

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dcoded config

WAALiveConfig defaulted to 1920x1200 but actual VM screen is 1280x720.
This caused stored action.x/y to be normalized against the wrong resolution.
Now detects real dimensions from the screenshot via PIL, uses them for
viewport, denormalization, window_rect, and drag coordinates. Viewers use
a divergence check for backward compatibility with old data.
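The detection step can be sketched as follows (assumes Pillow is installed; helper names are illustrative, not the config's actual API):

```python
import io
from PIL import Image

def detect_screen_size(screenshot_png: bytes) -> tuple:
    """Read the VM's actual resolution from a screenshot instead of
    trusting a hardcoded config value (1920x1200 vs the real 1280x720)."""
    with Image.open(io.BytesIO(screenshot_png)) as img:
        return img.size  # (width, height)

def normalize(x: int, y: int, size: tuple) -> tuple:
    """Normalize pixel coordinates against the detected resolution, so
    stored action.x/action.y are resolution-independent."""
    w, h = size
    return x / w, y / h
```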

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ZS vs demo-conditioned on 3 WAA tasks (GPT-5.1). DC agent signals
completion on 2/3 tasks (Settings: 11 steps, Notepad: 8 steps) while
ZS hits max steps on all 3. Includes Playwright screenshots of
comparison viewers and step-by-step screenshots.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abrichr abrichr changed the title from "fix(pool): fix WAA pool automation end-to-end" to "feat: WAA eval pipeline fixes, instrumentation, and demo-conditioned evaluation" on Feb 22, 2026
Replace inline 25-line Dockerfile in pool.py with SCP of waa_deploy/
build context. This eliminates drift between the inline and full
Dockerfile, and ensures evaluate_server.py + Flask are included in the
container image. Adds evaluate server health check during pool-wait.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>