feat: WAA eval pipeline fixes, instrumentation, and demo-conditioned evaluation (#35)
Conversation
The `while True: pass` loop burned an entire CPU core during recording. Replace it with `time.sleep(0.5)` to yield the CPU while waiting for Ctrl+C.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Call `recorder.wait_for_ready()` before entering the wait loop
- Use `recorder.is_recording` check and 1s sleep to match CLI behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
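A sketch of the resulting wait loop, assuming the `recorder` API named in these two commits:

```python
import time

def wait_for_stop(recorder):
    # Block until recording is ready, then poll instead of busy-waiting:
    # a bare `while True: pass` pins an entire CPU core.
    recorder.wait_for_ready()
    try:
        while recorder.is_recording:
            time.sleep(1.0)  # yield the CPU; matches the CLI's 1s poll
    except KeyboardInterrupt:
        pass  # Ctrl+C ends the wait loop
```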
The third WAA task requires .docx files in Documents. The script now creates empty report.docx, meeting_notes.docx, and proposal.docx before recording that task, and cleans up any Archive folder from previous runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
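A sketch of that setup step, assuming the Documents directory is passed in as a `pathlib.Path`:

```python
from pathlib import Path
import shutil

def prepare_docx_task(documents: Path) -> None:
    # The third WAA task expects these .docx files to exist in Documents.
    for name in ("report.docx", "meeting_notes.docx", "proposal.docx"):
        (documents / name).touch()  # creates an empty file if missing
    # Remove any Archive folder left over from a previous run.
    archive = documents / "Archive"
    if archive.is_dir():
        shutil.rmtree(archive)
```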
- Change "Press Ctrl+C" to "press Ctrl 3 times" (matches stop sequence) - Clarify wormhole send instructions (each send blocks until received) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The DOCKER_SETUP_SCRIPT builds `waa-auto:latest` (based on `dockurr/windows:latest`, which can auto-download the Windows ISO), but WAA_START_SCRIPT and setup-waa were starting `windowsarena/winarena:latest`, which uses the old dockurr/windows v0.00 that cannot download the ISO, causing an "ISO file not found" error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three bugs prevented pool-run from working:

1. The WAA probe used 172.30.0.2 (the QEMU guest IP), but Docker port-forwards to localhost, so pool-wait timed out every time. Changed to localhost in pool.py and vm_monitor.py (see the sketch after this list).
2. The dockurr/windows base image doesn't configure QMP (QEMU Machine Protocol), which the WAA client needs on port 7200 for VM status. Added an ARGUMENTS env var to inject the -qmp flag into QEMU startup.
3. Config defaults had Standard_D2_v3 (8GB, OOMs) and the old windowsarena/winarena image. Fixed to D8ds_v5 and waa-auto.

Also adds:

- pool-auto command: a single `oa-vm pool-auto --workers N --tasks M` chains create → wait → run
- /evaluate endpoint injection in the waa_deploy Dockerfile
- Handling for the WAA server wrapping 404s in 500 responses (live.py)
- openai dependency for API agents

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
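A minimal sketch of the probe fix, assuming a simple HTTP health check; the `/probe` path and port 5000 are illustrative, not the actual WAA endpoint:

```python
import requests

def waa_ready(port: int = 5000, timeout: float = 2.0) -> bool:
    # Probe through the Docker port-forward on localhost; the QEMU guest IP
    # (172.30.0.2) is unreachable from the host, so probing it always
    # timed out during pool-wait.
    try:
        return requests.get(f"http://localhost:{port}/probe", timeout=timeout).ok
    except requests.RequestException:
        return False
```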
…tion

Replace fragile streaming SSH with `docker exec -d` (detached) for starting benchmarks. Logs stream via `tail -f --pid`, which auto-exits when the benchmark finishes. If the SSH connection drops, it reconnects and resumes. Also adds a 120s timeout to OpenAI API calls to prevent infinite hangs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
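The timeout fix, sketched against the openai Python SDK (the model name is illustrative):

```python
from openai import OpenAI

# All requests through this client fail fast after 120s instead of
# hanging indefinitely on a stalled connection.
client = OpenAI(timeout=120.0)

resp = client.chat.completions.create(
    model="gpt-5.1",  # illustrative; the PR's evals use GPT-5.1
    messages=[{"role": "user", "content": "ping"}],
)
```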
WAA's run.py ignores `--tasks` and runs all 154 tasks based on worker_id/num_workers. Fix by creating a subset test JSON with only the requested number of tasks and passing it via `--test_all_meta_path`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
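A sketch of the subset-file workaround; the `{domain: [task_id, ...]}` layout of the meta JSON is an assumption:

```python
import json

def write_task_subset(full_meta_path: str, out_path: str, n: int) -> str:
    # Assumed layout of WAA's test_all meta JSON: {domain: [task_id, ...]}.
    with open(full_meta_path) as f:
        meta = json.load(f)
    subset, taken = {}, 0
    for domain, task_ids in meta.items():
        if taken >= n:
            break
        keep = task_ids[: n - taken]
        subset[domain] = keep
        taken += len(keep)
    with open(out_path, "w") as f:
        json.dump(subset, f, indent=2)
    return out_path  # pass this via --test_all_meta_path
```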
Add a standalone evaluate server (port 5050) that runs inside the WAA Docker container and has direct access to WAA evaluator modules (sketched below). This avoids needing to patch the WAA Flask server's /evaluate endpoint.

- Add evaluate_server.py and start_with_evaluate.sh
- Add evaluate_url config to WAALiveConfig
- Set up socat proxy (5051→5050) for Docker bridge networking
- Add SSH tunnel for evaluate port
- Simplify Dockerfile

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
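A minimal sketch of what `evaluate_server.py` could look like; `run_waa_evaluator` is a hypothetical stand-in for the real evaluator import:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_waa_evaluator(task_config: dict) -> dict:
    # Hypothetical stand-in: inside the container this would import and
    # call the actual WAA evaluator modules directly.
    raise NotImplementedError

@app.post("/evaluate")
def evaluate():
    # Direct access to the evaluators avoids patching the main WAA
    # Flask server's /evaluate endpoint.
    return jsonify(run_waa_evaluator(request.get_json()))

if __name__ == "__main__":
    # socat proxies 5051 -> 5050 across the Docker bridge network.
    app.run(host="0.0.0.0", port=5050)
```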
…ments

Instrumentation (captures richer data per step):

- Propagate agent logs (LLM response, parse strategy, demo info, loop detection, memory) from ApiAgent to execution trace
- Add per-step timing (agent_think_ms, env_execute_ms; sketched below)
- Capture token counts from OpenAI/Anthropic API responses

Viewer enhancements (viewer.py):

- Agent Thinking panel showing LLM response, memory, parse strategy
- Action timeline bar color-coded by action type
- Click heatmap overlay showing click frequency hotspots
- Click marker using raw pixel coords for correct positioning

Comparison viewer (new):

- comparison_viewer.py generates side-by-side HTML comparisons
- Synchronized step slider, click markers, action diffs
- First-divergence detection, action type distribution charts
- CLI `compare` command for generating comparisons
- Demo prompts and initial eval results for 3 WAA tasks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
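A sketch of how the per-step timing fields could be captured; the `agent.act`/`env.execute` names are assumptions about the loop's shape:

```python
import time

def run_step(agent, env, obs):
    # Time the LLM call and the environment execution separately so the
    # trace records agent_think_ms and env_execute_ms for every step.
    t0 = time.perf_counter()
    action = agent.act(obs)
    agent_think_ms = (time.perf_counter() - t0) * 1000.0

    t1 = time.perf_counter()
    next_obs = env.execute(action)
    env_execute_ms = (time.perf_counter() - t1) * 1000.0

    return next_obs, {"agent_think_ms": agent_think_ms,
                      "env_execute_ms": env_execute_ms}
```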
`_parse_computer_action()` only handled click, type, press, hotkey, and scroll. Any other action (double_click, right_click, drag) fell through to the default return of type="done", which prematurely terminated the task. This caused the demo-conditioned Notepad eval to stop after one step when the agent correctly issued `computer.double_click()` to open Notepad. Also adds a warning log when an unrecognized action falls through, and updates the viewer regexes to handle double_click/right_click coordinates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
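A simplified sketch of the fixed dispatch; the `(name, kwargs)` calling convention is an assumption:

```python
import logging

logger = logging.getLogger(__name__)

# Actions the parser now handles explicitly; double_click, right_click,
# and drag previously fell through to the "done" default.
HANDLED = {"click", "double_click", "right_click", "drag",
           "type", "press", "hotkey", "scroll"}

def parse_computer_action(name: str, kwargs: dict) -> dict:
    if name in HANDLED:
        return {"type": name, **kwargs}
    # Unknown actions still fall through to "done", but now loudly,
    # so a silent one-step termination is visible in the logs.
    logger.warning("Unrecognized computer action: %s(%r)", name, kwargs)
    return {"type": "done"}
```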
…dcoded config

WAALiveConfig defaulted to 1920x1200, but the actual VM screen is 1280x720, so stored action.x/y were normalized against the wrong resolution. Now the real dimensions are detected from the screenshot via PIL and used for the viewport, denormalization, window_rect, and drag coordinates. Viewers use a divergence check for backward compatibility with old data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
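A minimal sketch of the detection, using PIL's `Image.size`:

```python
from io import BytesIO
from PIL import Image

def detect_viewport(screenshot_png: bytes) -> tuple[int, int]:
    # Trust the screenshot, not the configured default: the config said
    # 1920x1200 while the VM actually renders at 1280x720.
    with Image.open(BytesIO(screenshot_png)) as img:
        return img.size  # (width, height)
```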
Zero-shot (ZS) vs demo-conditioned (DC) on 3 WAA tasks (GPT-5.1). The DC agent signals completion on 2/3 tasks (Settings: 11 steps, Notepad: 8 steps) while ZS hits max steps on all 3. Includes Playwright screenshots of the comparison viewers and step-by-step screenshots.
Replace the inline 25-line Dockerfile in pool.py with an SCP of the waa_deploy/ build context. This eliminates drift between the inline and full Dockerfiles, and ensures evaluate_server.py and Flask are included in the container image. Also adds an evaluate-server health check during pool-wait (sketched below).
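One way that health check might look; the TCP connect probe and port are assumptions (5051 is the socat-proxied evaluate port from the earlier commit):

```python
import socket

def evaluate_server_healthy(host: str, port: int = 5051,
                            timeout: float = 3.0) -> bool:
    # Cheap TCP connect check run during pool-wait; assumes the evaluate
    # server is reachable through the socat proxy on port 5051.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```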
Summary
End-to-end improvements to the WAA evaluation pipeline, from infrastructure fixes through instrumentation to the first demo-conditioned evaluation results.
Infrastructure & Pool Automation
- New `pool-auto` command chaining create → wait → run
- `waa-auto` Docker image instead of the broken `windowsarena/winarena`
- `docker exec -d` + `tail -f --pid` for benchmark startup and log streaming

Agent Fixes
- Handle `double_click`, `right_click`, and `drag` in the action parser (unrecognized actions were silently terminating tasks)

Visualization & Instrumentation
- Richer trace data: `plan_result`, `parse_strategy`, `token_usage`, `agent_think_ms`, `env_execute_ms` per step
- Comparison viewer (`openadapt-evals compare`) for side-by-side ZS vs DC replay

Evaluation Results
- `docs/eval_results/2026-02-21_zero_shot_vs_demo_conditioned.md`

Test plan
- Unit tests (`uv run pytest tests/ --ignore=tests/test_api_agent_ml.py`)
- Mock run (`openadapt-evals mock --tasks 1`)
- `/evaluate` endpoint for actual pass/fail scoring

🤖 Generated with Claude Code