feat: WAA eval pipeline fixes, instrumentation, and demo-conditioned evaluation (#35)
Conversation
The `while True: pass` loop burned an entire CPU core during recording. Replace it with `time.sleep(0.5)` to yield the CPU while waiting for Ctrl+C.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Call `recorder.wait_for_ready()` before entering the wait loop
- Use `recorder.is_recording` check and 1s sleep to match CLI behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
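A sketch of the resulting wait loop, assuming the `recorder` API named in these two commits:

```python
import time

def wait_for_stop(recorder):
    # Block until recording is ready, then poll instead of busy-waiting:
    # a bare `while True: pass` pins an entire CPU core.
    recorder.wait_for_ready()
    try:
        while recorder.is_recording:
            time.sleep(1.0)  # yield the CPU; matches the CLI's 1s poll
    except KeyboardInterrupt:
        pass  # Ctrl+C ends the wait loop
```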
The third WAA task requires .docx files in Documents. The script now creates empty report.docx, meeting_notes.docx, and proposal.docx before recording that task, and cleans up any Archive folder from previous runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
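A sketch of that setup step, assuming the Documents directory is passed in as a `pathlib.Path`:

```python
from pathlib import Path
import shutil

def prepare_docx_task(documents: Path) -> None:
    # The third WAA task expects these .docx files to exist in Documents.
    for name in ("report.docx", "meeting_notes.docx", "proposal.docx"):
        (documents / name).touch()  # creates an empty file if missing
    # Remove any Archive folder left over from a previous run.
    archive = documents / "Archive"
    if archive.is_dir():
        shutil.rmtree(archive)
```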
- Change "Press Ctrl+C" to "press Ctrl 3 times" (matches stop sequence) - Clarify wormhole send instructions (each send blocks until received) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The DOCKER_SETUP_SCRIPT builds `waa-auto:latest` (based on `dockurr/windows:latest`, which can auto-download the Windows ISO), but WAA_START_SCRIPT and setup-waa were starting `windowsarena/winarena:latest`, which uses the old dockurr/windows v0.00 that cannot download the ISO, causing an "ISO file not found" error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three bugs prevented pool-run from working:

1. The WAA probe used 172.30.0.2 (the QEMU guest IP), but Docker port-forwards to localhost, so pool-wait timed out every time. Changed to localhost in pool.py and vm_monitor.py (see the sketch after this list).
2. The dockurr/windows base image doesn't configure QMP (QEMU Machine Protocol), which the WAA client needs on port 7200 for VM status. Added an ARGUMENTS env var to inject the -qmp flag into QEMU startup.
3. Config defaults had Standard_D2_v3 (8GB, OOMs) and the old windowsarena/winarena image. Fixed to D8ds_v5 and waa-auto.

Also adds:

- pool-auto command: a single `oa-vm pool-auto --workers N --tasks M` chains create → wait → run
- /evaluate endpoint injection in the waa_deploy Dockerfile
- Handling for the WAA server wrapping 404s in 500 responses (live.py)
- openai dependency for API agents

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
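A minimal sketch of the probe fix, assuming a simple HTTP health check; the `/probe` path and port 5000 are illustrative, not the actual WAA endpoint:

```python
import requests

def waa_ready(port: int = 5000, timeout: float = 2.0) -> bool:
    # Probe through the Docker port-forward on localhost; the QEMU guest IP
    # (172.30.0.2) is unreachable from the host, so probing it always
    # timed out during pool-wait.
    try:
        return requests.get(f"http://localhost:{port}/probe", timeout=timeout).ok
    except requests.RequestException:
        return False
```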
…tion

Replace fragile streaming SSH with `docker exec -d` (detached) for starting benchmarks. Logs stream via `tail -f --pid`, which auto-exits when the benchmark finishes. If the SSH connection drops, it reconnects and resumes. Also adds a 120s timeout to OpenAI API calls to prevent infinite hangs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
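The timeout fix, sketched against the openai Python SDK (the model name is illustrative):

```python
from openai import OpenAI

# All requests through this client fail fast after 120s instead of
# hanging indefinitely on a stalled connection.
client = OpenAI(timeout=120.0)

resp = client.chat.completions.create(
    model="gpt-5.1",  # illustrative; the PR's evals use GPT-5.1
    messages=[{"role": "user", "content": "ping"}],
)
```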
WAA's run.py ignores `--tasks` and runs all 154 tasks based on worker_id/num_workers. Fix by creating a subset test JSON with only the requested number of tasks and passing it via `--test_all_meta_path`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
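A sketch of the subset-file workaround; the `{domain: [task_id, ...]}` layout of the meta JSON is an assumption:

```python
import json

def write_task_subset(full_meta_path: str, out_path: str, n: int) -> str:
    # Assumed layout of WAA's test_all meta JSON: {domain: [task_id, ...]}.
    with open(full_meta_path) as f:
        meta = json.load(f)
    subset, taken = {}, 0
    for domain, task_ids in meta.items():
        if taken >= n:
            break
        keep = task_ids[: n - taken]
        subset[domain] = keep
        taken += len(keep)
    with open(out_path, "w") as f:
        json.dump(subset, f, indent=2)
    return out_path  # pass this via --test_all_meta_path
```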
Add a standalone evaluate server (port 5050) that runs inside the WAA Docker container and has direct access to WAA evaluator modules (sketched below). This avoids needing to patch the WAA Flask server's /evaluate endpoint.

- Add evaluate_server.py and start_with_evaluate.sh
- Add evaluate_url config to WAALiveConfig
- Set up socat proxy (5051→5050) for Docker bridge networking
- Add SSH tunnel for evaluate port
- Simplify Dockerfile

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
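A minimal sketch of what `evaluate_server.py` could look like; `run_waa_evaluator` is a hypothetical stand-in for the real evaluator import:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_waa_evaluator(task_config: dict) -> dict:
    # Hypothetical stand-in: inside the container this would import and
    # call the actual WAA evaluator modules directly.
    raise NotImplementedError

@app.post("/evaluate")
def evaluate():
    # Direct access to the evaluators avoids patching the main WAA
    # Flask server's /evaluate endpoint.
    return jsonify(run_waa_evaluator(request.get_json()))

if __name__ == "__main__":
    # socat proxies 5051 -> 5050 across the Docker bridge network.
    app.run(host="0.0.0.0", port=5050)
```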
…ments

Instrumentation (captures richer data per step):

- Propagate agent logs (LLM response, parse strategy, demo info, loop detection, memory) from ApiAgent to execution trace
- Add per-step timing (agent_think_ms, env_execute_ms; sketched below)
- Capture token counts from OpenAI/Anthropic API responses

Viewer enhancements (viewer.py):

- Agent Thinking panel showing LLM response, memory, parse strategy
- Action timeline bar color-coded by action type
- Click heatmap overlay showing click frequency hotspots
- Click marker using raw pixel coords for correct positioning

Comparison viewer (new):

- comparison_viewer.py generates side-by-side HTML comparisons
- Synchronized step slider, click markers, action diffs
- First-divergence detection, action type distribution charts
- CLI `compare` command for generating comparisons
- Demo prompts and initial eval results for 3 WAA tasks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
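A sketch of how the per-step timing fields could be captured; the `agent.act`/`env.execute` names are assumptions about the loop's shape:

```python
import time

def run_step(agent, env, obs):
    # Time the LLM call and the environment execution separately so the
    # trace records agent_think_ms and env_execute_ms for every step.
    t0 = time.perf_counter()
    action = agent.act(obs)
    agent_think_ms = (time.perf_counter() - t0) * 1000.0

    t1 = time.perf_counter()
    next_obs = env.execute(action)
    env_execute_ms = (time.perf_counter() - t1) * 1000.0

    return next_obs, {"agent_think_ms": agent_think_ms,
                      "env_execute_ms": env_execute_ms}
```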
`_parse_computer_action()` only handled click, type, press, hotkey, and scroll. Any other action (double_click, right_click, drag) fell through to the default return of type="done", which prematurely terminated the task. This caused the demo-conditioned Notepad eval to stop after one step when the agent correctly issued `computer.double_click()` to open Notepad. Also adds a warning log when an unrecognized action falls through, and updates the viewer regexes to handle double_click/right_click coordinates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
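A simplified sketch of the fixed dispatch; the `(name, kwargs)` calling convention is an assumption:

```python
import logging

logger = logging.getLogger(__name__)

# Actions the parser now handles explicitly; double_click, right_click,
# and drag previously fell through to the "done" default.
HANDLED = {"click", "double_click", "right_click", "drag",
           "type", "press", "hotkey", "scroll"}

def parse_computer_action(name: str, kwargs: dict) -> dict:
    if name in HANDLED:
        return {"type": name, **kwargs}
    # Unknown actions still fall through to "done", but now loudly,
    # so a silent one-step termination is visible in the logs.
    logger.warning("Unrecognized computer action: %s(%r)", name, kwargs)
    return {"type": "done"}
```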
…dcoded config

WAALiveConfig defaulted to 1920x1200, but the actual VM screen is 1280x720, so stored action.x/y were normalized against the wrong resolution. Now the real dimensions are detected from the screenshot via PIL and used for the viewport, denormalization, window_rect, and drag coordinates. Viewers use a divergence check for backward compatibility with old data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
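A minimal sketch of the detection, using PIL's `Image.size`:

```python
from io import BytesIO
from PIL import Image

def detect_viewport(screenshot_png: bytes) -> tuple[int, int]:
    # Trust the screenshot, not the configured default: the config said
    # 1920x1200 while the VM actually renders at 1280x720.
    with Image.open(BytesIO(screenshot_png)) as img:
        return img.size  # (width, height)
```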
Zero-shot (ZS) vs demo-conditioned (DC) on 3 WAA tasks (GPT-5.1). The DC agent signals completion on 2/3 tasks (Settings: 11 steps, Notepad: 8 steps) while ZS hits max steps on all 3. Includes Playwright screenshots of the comparison viewers and step-by-step screenshots.
Replace the inline 25-line Dockerfile in pool.py with an SCP of the waa_deploy/ build context. This eliminates drift between the inline and full Dockerfiles, and ensures evaluate_server.py and Flask are included in the container image. Also adds an evaluate-server health check during pool-wait (sketched below).
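One way that health check might look; the TCP connect probe and port are assumptions (5051 is the socat-proxied evaluate port from the earlier commit):

```python
import socket

def evaluate_server_healthy(host: str, port: int = 5051,
                            timeout: float = 3.0) -> bool:
    # Cheap TCP connect check run during pool-wait; assumes the evaluate
    # server is reachable through the socat proxy on port 5051.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```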
Summary
End-to-end improvements to the WAA evaluation pipeline, from infrastructure fixes through instrumentation to the first demo-conditioned evaluation results.
Infrastructure & Pool Automation
- New `pool-auto` command chaining create → wait → run
- `waa-auto` Docker image instead of the broken `windowsarena/winarena`
- `docker exec -d` + `tail -f --pid` for benchmark startup and log streaming

Agent Fixes
- Handle `double_click`, `right_click`, and `drag` in the action parser (unrecognized actions were silently terminating tasks)

Visualization & Instrumentation
- Richer trace data: `plan_result`, `parse_strategy`, `token_usage`, `agent_think_ms`, `env_execute_ms` per step
- Comparison viewer (`openadapt-evals compare`) for side-by-side ZS vs DC replay

Evaluation Results
- `docs/eval_results/2026-02-21_zero_shot_vs_demo_conditioned.md`

Test plan
- Unit tests (`uv run pytest tests/ --ignore=tests/test_api_agent_ml.py`)
- Mock run (`openadapt-evals mock --tasks 1`)
- `/evaluate` endpoint for actual pass/fail scoring

🤖 Generated with Claude Code