Merged
31 commits
639f005
fix(recording): replace busy-wait loop with time.sleep
abrichr Feb 16, 2026
428fd9c
fix: add wait_for_ready() and match CLI recording loop pattern
abrichr Feb 17, 2026
e9bdc9a
fix: auto-create dummy .docx files for archive task
abrichr Feb 17, 2026
a155e48
fix: update stop instructions and clarify wormhole send flow
abrichr Feb 17, 2026
a20b4b1
fix(pool): use waa-auto image instead of broken windowsarena/winarena
abrichr Feb 18, 2026
79017ef
fix(pool): fix WAA probe IP, add QMP support, add pool-auto command
abrichr Feb 19, 2026
a7886c3
fix(pool): use docker exec -d + tail -f for resilient benchmark execu…
abrichr Feb 19, 2026
ef0816f
fix(pool): limit tasks with --test_all_meta_path subset JSON
abrichr Feb 19, 2026
2e06004
feat(pool): add dedicated evaluate server with socat proxy
abrichr Feb 20, 2026
8d85f12
feat(viz): add instrumentation, comparison viewer, and viewer enhance…
abrichr Feb 20, 2026
51207c4
fix(agent): handle double_click, right_click, and drag in action parser
abrichr Feb 20, 2026
171cc73
fix(coords): detect actual screen size from screenshot instead of har…
abrichr Feb 21, 2026
2e86190
docs: add Feb 21 eval results with comparison screenshots
abrichr Feb 22, 2026
37b91f7
fix(pool): consolidate Dockerfiles and deploy evaluate server
abrichr Feb 22, 2026
6f11163
fix(evaluate): add cache_dir to MockEnv for WAA file getters
abrichr Feb 22, 2026
0112901
feat(setup): implement WAA task setup config array processing
abrichr Feb 22, 2026
7ebbc00
feat(cli): add eval-suite command for automated full-cycle evaluation
abrichr Feb 23, 2026
8d5ae30
fix(agent): improve eval reliability with 6 targeted fixes
abrichr Feb 23, 2026
3485e15
fix(agent): pass through raw a11y tree without filtering
abrichr Feb 23, 2026
81fd758
feat(agent): add Qwen3-VL agent with normalized coordinates and think…
abrichr Feb 23, 2026
831d8f9
fix(agent): align training and inference prompt formats
abrichr Feb 23, 2026
2fc8546
fix(pool): resolve merge conflict in WAA startup script
abrichr Feb 23, 2026
0b185eb
feat(agent): add ClaudeComputerUseAgent with screenshot/wait loop fix
abrichr Feb 23, 2026
a6accc1
docs: add eval suite v2 results — 6/6 tasks scored 1.00
abrichr Feb 23, 2026
3f0791e
feat(pool): add pool-pause and pool-resume for deallocate/resume life…
abrichr Feb 23, 2026
04a7e58
feat(scripts): add WAA API recording, VLM annotation, and DC eval sub…
abrichr Feb 24, 2026
b73d470
feat(infra): add golden image support, ACR pull, and pool lifecycle i…
abrichr Feb 24, 2026
01395ea
chore: update beads local state
abrichr Feb 24, 2026
1c9cf20
fix: address review findings — drag action type, screenshot error han…
abrichr Feb 24, 2026
3034a2f
ci: add test workflow for PR checks
abrichr Feb 24, 2026
d2ea481
fix(ci): install dev extras for pytest in test workflow
abrichr Feb 24, 2026
2 changes: 1 addition & 1 deletion .beads/.local_version
@@ -1 +1 @@
-0.47.1
+0.49.0
Binary file modified .beads/beads.db
Binary file not shown.
16 changes: 8 additions & 8 deletions .beads/issues.jsonl

Large diffs are not rendered by default.

7 changes: 0 additions & 7 deletions .beads/sync-state.json

This file was deleted.

32 changes: 32 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,32 @@
name: test

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install uv
        uses: astral-sh/setup-uv@v4

      - name: Install dependencies
        run: uv sync --extra dev

      - name: Run tests
        run: |
          uv run pytest tests/ -q \
            --ignore=tests/test_api_agent_ml.py \
            -k "not (test_demo_format_and_persistence or test_synthetic_demos)"
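
To reproduce the workflow's test selection locally, the "Run tests" step can be rebuilt as an argument list. This is a minimal sketch, not part of the PR: the helper name `build_pytest_command` is hypothetical, and only the paths and the `-k` expression are copied from the workflow above. It prints the command rather than executing it, so it does not assume `uv` is installed.

```python
import shlex

# Values copied from the workflow's "Run tests" step.
IGNORED_FILES = ["tests/test_api_agent_ml.py"]
DESELECT_EXPR = "not (test_demo_format_and_persistence or test_synthetic_demos)"


def build_pytest_command(test_dir: str = "tests/") -> list[str]:
    """Return the argv list equivalent to the CI pytest invocation.

    Hypothetical helper for illustration only.
    """
    cmd = ["uv", "run", "pytest", test_dir, "-q"]
    # One --ignore flag per file that CI skips at collection time.
    cmd += [f"--ignore={path}" for path in IGNORED_FILES]
    # -k deselects individual tests by keyword expression.
    cmd += ["-k", DESELECT_EXPR]
    return cmd


# Print a shell-safe version of the command for copy and paste.
print(shlex.join(build_pytest_command()))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it; keeping the `-k` expression as a single list element avoids the shell-quoting issues of the multi-line YAML form.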
7 changes: 7 additions & 0 deletions .gitignore
@@ -22,3 +22,10 @@ demo_library/index.json

# Live benchmark state (changes during execution)
benchmark_live.json

# Vim swap files
*.swp
*.swo

# Cost reports (generated during evaluation runs)
cost_report.json
41 changes: 0 additions & 41 deletions cost_report.json

This file was deleted.

58 changes: 58 additions & 0 deletions demo_prompts/0c9dda13-428c-492b-900b-f48562111f93-WOS.txt
@@ -0,0 +1,58 @@
DEMONSTRATION:
Task: Create a new folder named 'Archive' in the Documents folder and move all .docx files into it.

Step 1:
Observation: The BEFORE image shows a Windows PowerShell window open with various log entries. The taskbar is visible at the bottom with icons for different applications. The red marker indicates interaction with the 'Search' icon on the taskbar.
Intent: The user is attempting to open File Explorer to access the Documents folder.
Action: TYPE("e")
Result: The AFTER image shows that File Explorer has opened, displaying the Home directory with quick access to folders like Desktop, Documents, and Downloads.

Step 2:
Observation: The BEFORE image shows the File Explorer application open to the Home directory. Key visible UI elements include: the 'Quick access' section with folders like Desktop and Documents, the 'Documents' folder in the navigation pane, the 'Search Home' bar at the top, and the 'New' button in the toolbar.
Intent: The user is navigating to the Documents folder to create a new folder named 'Archive'.
Action: CLICK(0.166, 0.569)
Result: The AFTER image shows that the File Explorer window now displays the contents of the Documents folder, including several files and folders.

Step 3:
Observation: The application is File Explorer. The current panel displays the contents of the 'Documents' folder. Key visible UI elements include the navigation pane on the left, the 'New' button in the toolbar, the search bar at the top right, and a list of files and folders in the main area. The red marker indicates a right-click action in the main area.
Intent: The user is attempting to create a new folder in the 'Documents' directory.
Action: CLICK(0.345, 0.649)
Result: A context menu has appeared, offering options such as 'View', 'Sort by', 'Group by', 'New', and others.

Step 4:
Observation: The application is Windows File Explorer, currently displaying the 'Documents' folder. Key visible UI elements include the navigation pane on the left, the file list in the center, a context menu with options like 'View', 'Sort by', 'Group by', and 'New', and a highlighted 'Folder' option in the context menu.
Intent: The user is creating a new folder named 'Archive' in the 'Documents' folder.
Action: TYPE("Archive")
Result: A new folder named 'Archive' has been created in the 'Documents' folder.

Step 5:
Observation: The application is File Explorer, displaying the 'Documents' folder. Key visible UI elements include the navigation pane on the left, the toolbar with options like 'New' and 'Sort' at the top, the file list in the center showing various files and folders, and the search bar at the top right.
Intent: The user is preparing to move .docx files into the newly created 'Archive' folder.
Action: TYPE("")
Result: The file 'meeting_notes' is now selected in the file list.

Step 6:
Observation: The application is File Explorer, currently displaying the 'Documents' folder. Key visible UI elements include the navigation pane on the left, the file list in the center, the 'New' button at the top left, and the search bar at the top right. The file 'meeting_notes' is selected.
Intent: The user is selecting multiple .docx files to move them into the 'Archive' folder.
Action: CLICK(0.284, 0.524)
Result: The file 'proposal' is now selected along with 'meeting_notes', indicating multiple selection.

Step 7:
Observation: The application is File Explorer, currently displaying the 'Documents' folder. Key visible UI elements include the navigation pane on the left, the toolbar with options like 'New' and 'Sort' at the top, the file list in the center showing files and folders, and the search bar at the top right. The files 'meeting_notes' and 'proposal' are selected.
Intent: The user is attempting to select all .docx files to move them to the 'Archive' folder.
Action: TYPE("")
Result: The file 'report' is now also selected, along with 'meeting_notes' and 'proposal'.

Step 8:
Observation: The application is File Explorer, currently displaying the 'Documents' folder. Key visible UI elements include the navigation pane on the left, the toolbar with options like 'New' and 'Sort', the file list showing items such as 'meeting_notes', 'proposal', and 'report', and the 'Archive' folder. The red marker indicates a click on the 'Archive' folder.
Intent: The user intends to open the 'Archive' folder to move the selected .docx files into it.
Action: DOUBLE_CLICK(0.283, 0.610)
Result: The 'Documents' folder view is now empty, indicating that the user has navigated into the 'Archive' folder.

Step 9:
Observation: The application is File Explorer. The current panel is the 'Archive' folder within the 'Documents' directory. Key visible UI elements include the navigation bar at the top showing 'Documents > Archive', the 'New' button on the toolbar, the empty file list area, and the sidebar with folder shortcuts.
Intent: The user is pasting the previously copied or cut .docx files into the 'Archive' folder.
Action: TYPE("")
Result: The .docx files should appear in the 'Archive' folder, populating the previously empty file list area.

---
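
The demo prompt files share a simple line-oriented layout: a `DEMONSTRATION:` header, a `Task:` line, then `Step N:` blocks with `Observation`, `Intent`, `Action`, and `Result` fields, closed by `---`. The sketch below shows one way such a file could be parsed into structured steps. It is an illustration under that assumed layout, not code from this PR; `DemoStep`, `Demo`, and `parse_demo` are hypothetical names, and the sample text is abridged from the file above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DemoStep:
    index: int
    observation: str = ""
    intent: str = ""
    action: str = ""
    result: str = ""


@dataclass
class Demo:
    task: str = ""
    steps: list[DemoStep] = field(default_factory=list)


_STEP_RE = re.compile(r"^Step (\d+):\s*$")
_FIELD_RE = re.compile(r"^(Task|Observation|Intent|Action|Result):\s*(.*)$")


def parse_demo(text: str) -> Demo:
    """Parse one demo prompt file into a Demo (hypothetical helper)."""
    demo = Demo()
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, the header, and the closing separator.
        if not line or line in ("DEMONSTRATION:", "---"):
            continue
        step_match = _STEP_RE.match(line)
        if step_match:
            current = DemoStep(index=int(step_match.group(1)))
            demo.steps.append(current)
            continue
        field_match = _FIELD_RE.match(line)
        if field_match is None:
            continue  # Unrecognized line: ignored in this sketch.
        name, value = field_match.groups()
        if name == "Task":
            demo.task = value
        elif current is not None:
            setattr(current, name.lower(), value)
    return demo


# Abridged sample in the same layout as the file above.
SAMPLE = """DEMONSTRATION:
Task: Create a new folder named 'Archive' in the Documents folder and move all .docx files into it.

Step 2:
Observation: The BEFORE image shows the File Explorer application open to the Home directory.
Intent: The user is navigating to the Documents folder to create a new folder named 'Archive'.
Action: CLICK(0.166, 0.569)
Result: The AFTER image shows that the File Explorer window now displays the contents of the Documents folder.

---
"""

demo = parse_demo(SAMPLE)
print(demo.task)
print(demo.steps[0].index, demo.steps[0].action)
```

Field names are matched only at the start of a line, so colons inside an observation (for example "Key visible UI elements include: ...") do not split the field.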
40 changes: 40 additions & 0 deletions demo_prompts/366de66e-cbae-4d72-b042-26390db2b145-WOS.txt
@@ -0,0 +1,40 @@
DEMONSTRATION:
Task: Open Notepad, create a new file named 'draft.txt', type 'This is a draft.', and save it to the Documents folder.

Step 1:
Observation: The BEFORE image shows the Windows PowerShell application open. The taskbar is visible at the bottom with various application icons. The Start menu is not open.
Intent: The user is attempting to open Notepad by searching for it.
Action: TYPE("notepad")
Result: The AFTER image shows the Start menu open with search results for 'notepad'. The Notepad application is listed as the best match.

Step 2:
Observation: The BEFORE image shows the Start menu with search results for 'notepad'. Key UI elements include the search bar at the top, 'Best match' section with 'notepad' listed, 'Apps' section with 'Notepad', and options like 'Open', 'Run as administrator', and 'Open file location' on the right.
Intent: The user intends to open Notepad to create a new text file.
Action: TYPE("Thisisadraft.")
Result: The AFTER image shows the Notepad application open with a 'Save As' dialog. The text 'This is a draft.' is typed in the 'File name' field.

Step 3:
Observation: The window is titled 'Save As' within the Notepad application. The current panel shows the Documents folder. Key visible UI elements include the 'File name' field with 'This is a draft.' typed in it, the 'Save as type' dropdown set to 'Text documents (*.txt)', the 'Save' button, and the 'Cancel' button.
Intent: The user is attempting to save the file with the specified name and content.
Action: CLICK(0.294, 0.532)
Result: There is no visible change between the BEFORE and AFTER images.

Step 4:
Observation: The application window is 'Save As' dialog in Notepad. Key visible UI elements include the 'File name' input field at the bottom, 'Save as type' dropdown next to it, 'Save' button at the bottom right, 'Cancel' button next to 'Save', and the file list in the main area showing existing files.
Intent: The user is entering the desired file name to save the document as 'draft.txt'.
Action: TYPE("draft.txt")
Result: A 'Confirm Save As' dialog appeared, indicating that 'draft.txt' already exists and asking if the user wants to replace it.

Step 5:
Observation: The application window is 'Save As' with a 'Confirm Save As' dialog open. Key UI elements include the dialog box with the message 'draft.txt already exists. Do you want to replace it?', and buttons labeled 'Yes' and 'No'. The red marker indicates interaction with the 'Yes' button.
Intent: The user intends to confirm overwriting the existing 'draft.txt' file.
Action: TYPE("")
Result: The 'Confirm Save As' dialog is closed, and the focus returns to the previous application window.

Step 6:
Observation: The application window is Windows PowerShell. The current panel displays a series of log entries. Key visible UI elements include the title bar with the application name at the top, a series of log entries in the main panel, a tab bar with a '+' button for new tabs, and a taskbar at the bottom with various application icons.
Intent: The user is attempting to type a command or text into the PowerShell window.
Action: TYPE("")
Result: The expected result is that the text or command typed by the user will appear in the PowerShell window at the location of the cursor.

---
34 changes: 34 additions & 0 deletions demo_prompts/37e10fc4-b4c5-4b02-a65c-bfae8bc51d3f-wos.txt
@@ -0,0 +1,34 @@
DEMONSTRATION:
Task: Turn off notifications for my system in the settings.

Step 1:
Observation: The application is Windows PowerShell. The taskbar is visible at the bottom of the screen. Key UI elements include the Start button on the left, several application icons in the taskbar, and the system tray on the right.
Intent: The user is trying to open the Start menu to access system settings.
Action: CLICK(0.263, 0.958)
Result: The Start menu is now open, displaying pinned applications and a search bar.

Step 2:
Observation: The Start menu is open, displaying pinned applications and a search bar. Key visible UI elements include: 'Search for apps, settings, and documents' bar at the top, 'Pinned' section with app icons like Edge, Word, and Excel, 'Settings' icon in the Pinned section, 'Recommended' section with recent documents, and 'All apps' button on the right.
Intent: The user is trying to access the system settings to turn off notifications.
Action: CLICK(0.335, 0.518)
Result: The Settings application is now open, displaying the Home page with options like System, Bluetooth & devices, and Network & internet on the left sidebar.

Step 3:
Observation: The application is 'Settings'. The current panel is 'Home'. Key visible UI elements include: 'Find a setting' search bar at the top, 'Home' highlighted in the left sidebar, 'System' option below 'Home' in the sidebar, a notification about backing up to Microsoft account in the main area, and 'Recommended settings' section at the bottom.
Intent: The user is navigating to the 'System' settings to access notification settings.
Action: CLICK(0.349, 0.311)
Result: The 'System' panel is now open, displaying options such as 'Display', 'Sound', 'Notifications', 'Focus', and others.

Step 4:
Observation: The application is 'Settings'. The current panel is 'System'. Key visible UI elements include 'Display' at the top, 'Sound' below it, 'Notifications' with a red marker indicating interaction, 'Focus', and 'Power & battery' further down.
Intent: The user is attempting to access the notifications settings to turn off notifications.
Action: CLICK(0.576, 0.533)
Result: The screen now displays the 'Notifications' settings page, showing options like 'Notifications' toggle, 'Do not disturb', and 'Set priority notifications'.

Step 5:
Observation: The application is 'Settings'. The current panel is 'System > Notifications'. Key visible UI elements include: 'Notifications' toggle at the top right, 'Do not disturb' toggle below it, 'Turn on do not disturb automatically' option below that, 'Set priority notifications' option further down, and 'Focus' option below that.
Intent: The user is performing this action to turn off notifications for the system.
Action: TYPE("")
Result: The 'Notifications' toggle is expected to change from 'On' to 'Off'.

---
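
The `Action:` lines in these demos take the forms `CLICK(x, y)`, `DOUBLE_CLICK(x, y)`, and `TYPE("text")`, with coordinates that appear to be fractions of the screen width and height. The sketch below parses such a string and maps the coordinates to pixels. It is an illustration under that assumption, not the PR's action parser: `parse_action` and `to_pixels` are hypothetical names, and the 1280x720 screen size in the usage lines is an arbitrary example value.

```python
import re

_ACTION_RE = re.compile(r"^(?P<name>[A-Z_]+)\((?P<args>.*)\)$")


def parse_action(action: str):
    """Split an action string into (name, payload).

    The payload is a str for TYPE and an (x, y) float tuple otherwise.
    Hypothetical helper covering only the forms seen in the demos.
    """
    match = _ACTION_RE.match(action.strip())
    if match is None:
        raise ValueError(f"unrecognized action: {action!r}")
    name = match.group("name")
    args = match.group("args").strip()
    if name == "TYPE":
        # The demos wrap typed text in double quotes; it may be empty.
        if len(args) >= 2 and args[0] == args[-1] == '"':
            return name, args[1:-1]
        raise ValueError(f"TYPE expects a quoted string: {action!r}")
    x_str, y_str = args.split(",")
    return name, (float(x_str), float(y_str))


def to_pixels(x_norm: float, y_norm: float, width: int, height: int):
    """Map normalized [0, 1] coordinates to pixel coordinates.

    Assumes x and y are fractions of the screen width and height;
    results are clamped so they stay inside the screen.
    """
    x = min(max(round(x_norm * width), 0), width - 1)
    y = min(max(round(y_norm * height), 0), height - 1)
    return x, y


name, (x, y) = parse_action("CLICK(0.576, 0.533)")
print(name, to_pixels(x, y, 1280, 720))
print(parse_action('TYPE("")'))
```

Several steps in the demos record `TYPE("")`; this sketch returns an empty string for them rather than treating them as errors.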