From a0f82d5fb6cf976c80ea7405c40bf8c310c719f1 Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Tue, 2 Jun 2026 02:35:19 +0200
Subject: [PATCH 1/4] chore: set up beads workflow

---
 .agents/skills/README.md                      |  13 +
 .../skills/agentv-core-development/SKILL.md   |  77 ++
 .agents/skills/agentv-git-workflow/SKILL.md   |  88 +++
 .agents/skills/agentv-grader-changes/SKILL.md |  51 ++
 .../skills/agentv-release-publishing/SKILL.md |  31 +
 .../agentv-testing-verification/SKILL.md      |  72 ++
 .claude/skills                                |   1 +
 .gitignore                                    |  11 +-
 AGENTS.md                                     | 674 +++---------------
 biome.json                                    |   1 +
 10 files changed, 459 insertions(+), 560 deletions(-)
 create mode 100644 .agents/skills/README.md
 create mode 100644 .agents/skills/agentv-core-development/SKILL.md
 create mode 100644 .agents/skills/agentv-git-workflow/SKILL.md
 create mode 100644 .agents/skills/agentv-grader-changes/SKILL.md
 create mode 100644 .agents/skills/agentv-release-publishing/SKILL.md
 create mode 100644 .agents/skills/agentv-testing-verification/SKILL.md
 create mode 120000 .claude/skills

diff --git a/.agents/skills/README.md b/.agents/skills/README.md
new file mode 100644
index 00000000..e71459fd
--- /dev/null
+++ b/.agents/skills/README.md
@@ -0,0 +1,13 @@
+# AgentV Coding Agent Skills
+
+This directory contains repo-local skills that teach coding agents how to work with AgentV. They are shared across compatible tools through `.agents/skills`, with `.claude/skills` symlinked here for Claude compatibility.
+
+## Skills
+
+| Skill | Description |
+| ----- | ----------- |
+| [agentv-core-development](agentv-core-development/) | Core design principles, TypeScript conventions, naming, wire-format rules, docs expectations, and project structure. |
+| [agentv-testing-verification](agentv-testing-verification/) | AgentV test strategy, CLI verification, grader e2e checks, browser verification, and pre-push behavior. |
+| [agentv-git-workflow](agentv-git-workflow/) | Beads/GitHub collaboration, worktrees, issue claiming, draft PRs, and merge cleanup. |
+| [agentv-grader-changes](agentv-grader-changes/) | Grader type conventions, live eval verification, baseline updates, and score-range checks. |
+| [agentv-release-publishing](agentv-release-publishing/) | Versioning, release workflow, and package publishing. |
diff --git a/.agents/skills/agentv-core-development/SKILL.md b/.agents/skills/agentv-core-development/SKILL.md
new file mode 100644
index 00000000..076ca3b8
--- /dev/null
+++ b/.agents/skills/agentv-core-development/SKILL.md
@@ -0,0 +1,77 @@
+---
+name: agentv-core-development
+description: Use when changing AgentV core, SDK, CLI, Studio APIs, config schemas, docs, examples, or any cross-process wire format. Covers design principles, TypeScript conventions, naming, snake_case boundaries, and documentation updates.
+---
+
+# AgentV Core Development
+
+AgentV is a TypeScript monorepo for a declarative AI agent evaluation framework.
+
+## Goals
+
+- Declarative YAML eval definitions.
+- Structured, type-safe grading.
+- Multi-objective scoring for correctness, latency, cost, and safety.
+- Optimization-ready primitives without speculative built-ins.
+
+## Design Principles
+
+- Keep core lightweight and extensible through plugins.
+- Built-ins should be universal primitives: deterministic, stateless, single-purpose, and broadly useful.
+- Prefer composition over new features. If existing primitives cover a need, document the pattern instead of adding code.
+- Research peer frameworks before adding a new capability, and choose the lowest common denominator.
+- Apply YAGNI to implementation size, not just feature selection. Audit existing primitives before adding knobs, modes, precedence rules, or new invariants.
+- New fields must be optional and non-breaking.
+- Design for AI agents: intuitive primitives, self-documenting modules, concise extension recipes in file headers, and no dead speculative infrastructure.
+
+If you notice existing overengineering while working, create a Beads issue titled `cleanup: simplify X` with current behavior, simpler model, migration notes, and code links. Do not widen the current PR unless asked.
+
+## Stack
+
+- TypeScript 5.x targeting ES2022 and Node 20+.
+- Bun for all package and script operations.
+- Bun workspaces, tsup, Biome, Vitest, Vercel AI SDK, Zod.
+
+## Project Structure
+
+- `packages/core/`: evaluation engine, providers, grading, registry, programmatic API.
+- `packages/eval/`: lightweight assertion SDK.
+- `apps/cli/`: command-line interface published as `agentv`.
+- `apps/studio/`: Studio frontend.
+- `apps/web/`: documentation site.
+- `examples/`: documentation and integration coverage.
+
+## TypeScript
+
+- Prefer inference over explicit types when clear.
+- Use `async`/`await`.
+- Prefer named exports.
+- Keep modules cohesive.
+- Update stale file headers when behavior changes.
+
+## Project vs Benchmark
+
+- `Project`: top-level Studio container around a registered workspace directory. Modelled by `ProjectEntry` / `ProjectRegistry` and stored in `~/.agentv/projects.yaml`.
+- `Benchmark`: curated eval suite designed to measure a capability. Example benchmark directories should keep that name.
+- Legacy `~/.agentv/benchmarks.yaml` migration and per-run `benchmark.json` artifacts are separate concepts.
+
+When in doubt: if it holds runs/traces/experiments, it is a project. If it is a curated eval suite, it is a benchmark.
+
+## Wire Format
+
+Everything crossing a process boundary uses `snake_case`. Internal TypeScript uses `camelCase`. Translate at the boundary only.
+
+Snake case surfaces include YAML, JSONL result files, artifact output, HTTP responses, CLI JSON, and anything consumed by non-TS tooling. Camel case surfaces are TypeScript variables, parameters, type members, and in-memory shapes.
+
+Use paired wire/internal interfaces and converters, following `packages/core/src/projects.ts`. Do not dump TS objects directly to YAML or JSON responses.
+
+Treat existing camelCase on disk or in responses as a bug when touching that path.
+
+## Documentation
+
+When functionality changes, update:
+
+- Docs site under `apps/web/src/content/docs/`.
+- Skills if YAML schema, grader types, or CLI commands changed.
+- Examples that exercise changed behavior.
+- README only when the high-level pointer changes.
diff --git a/.agents/skills/agentv-git-workflow/SKILL.md b/.agents/skills/agentv-git-workflow/SKILL.md
new file mode 100644
index 00000000..666e6891
--- /dev/null
+++ b/.agents/skills/agentv-git-workflow/SKILL.md
@@ -0,0 +1,88 @@
+---
+name: agentv-git-workflow
+description: Use when starting, claiming, committing, pushing, opening, updating, reviewing, merging, or cleaning up AgentV work. Covers Beads as canonical task memory, GitHub as collaboration surface, worktrees, draft PRs, issue workflow, and merge cleanup.
+---
+
+# AgentV Git Workflow
+
+## Tracking Model
+
+- Beads is the canonical task tracker and agent memory: task state, dependencies, discoveries, and durable project knowledge.
+- GitHub is the collaboration surface: draft PRs, reviews, CI, merge coordination, and communication with other parties.
+- Interpret "do not use external issue trackers" as "do not create a second private task brain." The rule does not prohibit GitHub collaboration.
+- Runtime orchestration should stay lightweight: Beads tracks coordination state, tmux/Codex wrappers run agents, and git worktrees provide isolation. Use `ep-spawn-agent` for generic worktree + tmux spawning when it fits. Do not introduce Gastown/AO unless the missing value is specifically their spawning or dashboard ergonomics.
+
+Use Beads instead of markdown TODO lists:
+
+```bash
+bd ready --json
+bd create "Issue title" --description="Detailed context" -t <bug|feature|task> -p <0-4> --json
+bd update <id> --claim --json
+bd close <id> --reason "Completed" --json
+bd remember "durable project insight"
+bd dolt push
+```
+
+Until a `bead-start` helper exists, the manual Beads-first launch flow is:
+
+```bash
+bd list
+bd show <id>
+bd update <id> --status in_progress
+git fetch origin
+git worktree add ../agentv.worktrees/<id> -b work/<id> origin/main
+cd ../agentv.worktrees/<id>
+codex-eng
+bd close <id>
+```
+
+Follow-up automation is tracked in `agentv-9gh`: create Beads glue around `ep-spawn-agent`, not a parallel spawner. The helper should mark a bead in progress, pass the bead id through `EP_TASK_ID` or an equivalent identifier, let `ep-spawn-agent` handle worktree + tmux startup, and write a session note back to the bead.
+
+## Worktrees
+
+For feature, bug fix, or non-trivial repo changes, work from a dedicated sibling worktree based on latest `origin/main`. Keep the primary checkout clean; do not do feature work in the main folder.
+
+```bash
+git fetch origin
+git worktree add ../agentv.worktrees/<type>-<short-desc> -b <type>/<issue-or-topic>-<short-desc> origin/main
+cd ../agentv.worktrees/<type>-<short-desc>
+bun install
+cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env
+```
+
+AgentV worktrees live in sibling `../agentv.worktrees/`, not `.worktrees/` inside the repo and not the primary checkout.
+
+After checking out a branch or PR, run `bun install` if `package.json` or `bun.lock` may have changed.
+
+## GitHub Issues
+
+When working from a GitHub issue, claim it on the project board before starting work. If it is already `In Progress`, do not duplicate the work.
+
+Use `AGENT_ID` from `.env`; in this environment default to `devbox2-codex` if unset.
+
+## Draft PRs
+
+After the first meaningful commit, push and open a draft PR. Continue pushing meaningful checkpoints.
+
+```bash
+git push -u origin HEAD
+gh pr create --draft --title "<type>(scope): summary" --body "Refs <beads-id-or-github-issue>"
+```
+
+Do not push directly to `main`.
+
+## PR Readiness
+
+Keep the PR in draft until verification evidence is complete: unit tests, test plan evidence, manual red/green UAT for user-facing changes, CI green, no conflicts, and a final review pass when warranted.
+
+## Merge and Cleanup
+
+Use squash merge only:
+
+```bash
+gh pr merge <PR_NUMBER> --squash --delete-branch
+```
+
+After squash merge, do not continue pushing to the old branch. Start follow-up fixes from fresh `main`.
+
+Before ending a session, sync Beads, push committed code, and confirm the branch is up to date with its remote.
diff --git a/.agents/skills/agentv-grader-changes/SKILL.md b/.agents/skills/agentv-grader-changes/SKILL.md
new file mode 100644
index 00000000..1e9900fa
--- /dev/null
+++ b/.agents/skills/agentv-grader-changes/SKILL.md
@@ -0,0 +1,51 @@
+---
+name: agentv-grader-changes
+description: Use when adding, modifying, renaming, parsing, or verifying AgentV graders/evaluators, assertion types, scoring behavior, thresholds, baseline files, or eval output shape.
+---
+
+# AgentV Grader Changes
+
+## Type System
+
+Grader types are kebab-case everywhere:
+
+- YAML config: `llm-grader`, `is-json`, `execution-metrics`.
+- Internal `EvaluatorKind`.
+- Output `scores[].type`.
+- Registry keys.
+
+Source of truth: `EVALUATOR_KIND_VALUES` in `packages/core/src/evaluation/types.ts`.
+
+Snake_case aliases can be accepted for backward compatibility through `normalizeGraderType()` in `grader-parser.ts`. SDK-facing `AssertionType` in `packages/eval/src/assertion.ts` must stay in sync.
+
+## Verification
+
+Unit tests are not enough for grader changes.
+
+1. Ensure `.env` exists in the worktree.
+2. Run an actual eval with a real example file:
+
+```bash
+bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id <test-id>
+```
+
+3. Inspect JSONL output:
+   - correct `scores[].type`
+   - expected score calculation
+   - assertions have `text`, `passed`, and optional `evidence`
+
+4. Update `*.baseline.jsonl` files when output format changes.
+
+`--dry-run` is useful for harness plumbing but returns mock scores and cannot validate grading quality.
+
+## Score Range Checks
+
+For manual e2e score guardrails:
+
+```bash
+bun apps/cli/src/cli.ts eval examples/path/to/suite.eval.yaml --target azure \
+  --out examples/path/to/suite.results.jsonl
+bun scripts/check-grader-scores.ts
+```
+
+Add `<eval-stem>.grader-scores.yaml` next to an eval when a new suite needs score-range assertions.
diff --git a/.agents/skills/agentv-release-publishing/SKILL.md b/.agents/skills/agentv-release-publishing/SKILL.md
new file mode 100644
index 00000000..3317905e
--- /dev/null
+++ b/.agents/skills/agentv-release-publishing/SKILL.md
@@ -0,0 +1,31 @@
+---
+name: agentv-release-publishing
+description: Use when changing AgentV versioning, release automation, package publishing, npm package configuration, or release docs.
+---
+
+# AgentV Release and Publishing
+
+## Versioning
+
+Git commit history is the changelog. Use GitHub Actions for releases; do not publish manually from a local machine.
+
+## Standard Release Flow
+
+1. Run the Release workflow with `channel=next` and the desired bump. It creates `x.y.z-next.1`, commits, tags, and pushes.
+2. The Publish workflow publishes to npm `next`.
+3. Run the Release workflow with `channel=finalize`. It strips the prerelease suffix.
+4. The Publish workflow publishes to npm `latest`.
+
+## Direct Stable Release
+
+Run the Release workflow with `channel=stable` and the desired bump. The Publish workflow then publishes to npm `latest`.
+
+## Local Scripts
+
+`bun scripts/release.ts` can inspect version state locally, but do not run `bun run publish` or `bun run publish:next` locally. npm publish uses OIDC trusted publishing from GitHub Actions.
+
+## Packages
+
+- `packages/core/` publishes `@agentv/core`.
+- `apps/cli/` publishes `agentv`.
+- tsup bundles workspace dependencies with `noExternal: ["@agentv/core"]`.
diff --git a/.agents/skills/agentv-testing-verification/SKILL.md b/.agents/skills/agentv-testing-verification/SKILL.md
new file mode 100644
index 00000000..50fde676
--- /dev/null
+++ b/.agents/skills/agentv-testing-verification/SKILL.md
@@ -0,0 +1,72 @@
+---
+name: agentv-testing-verification
+description: Use when testing, verifying, debugging checks, changing CLI behavior, grader behavior, Studio UI/API behavior, docs site visuals, examples, or preparing an AgentV PR for review.
+---
+
+# AgentV Testing and Verification
+
+## Pre-Push
+
+The repo uses `prek` pre-push hooks. Do not manually run the full pre-push suite before pushing unless diagnosing a failure. Push to the feature branch and let the hook run:
+
+- `bun run build`
+- `bun run typecheck`
+- `bun run lint`
+- `bun run test`
+- `bun run validate:examples`
+
+Manual equivalent:
+
+```bash
+bunx prek run --all-files --hook-stage pre-push
+```
+
+## CLI Testing
+
+Never use global `agentv` for functional testing. Use current source:
+
+```bash
+bun apps/cli/src/cli.ts <args>
+```
+
+If changes touch `packages/core/`, run `bun run build` first because the CLI imports `@agentv/core` from compiled `dist`.
+
+For built output use `bun apps/cli/dist/cli.js <args>` or `bun agentv <args>`, but only after building.
+
+## Studio UI
+
+`agentv studio` serves `apps/studio/dist/`. Rebuild before UAT or screenshots:
+
+```bash
+cd apps/studio && bun run build
+```
+
+## Docs Browser E2E
+
+Use `agent-browser` for docs site verification. Always pass `--session <name>` and do not use `--headed`.
+
+If session launch hangs with EAGAIN on ARM64, pre-start Chrome with CDP and use `agent-browser --cdp 9222`.
+
+## Agent Provider Evals
+
+Limit coding-agent provider eval concurrency to 3 targets at a time for `claude`, `claude-sdk`, `codex`, `copilot`, `copilot-sdk`, `pi`, and `pi-cli`. Lightweight LLM-only targets can use higher concurrency.
+
+## Writing Tests
+
+- Test new or changed behavior only.
+- Prefer one test per distinct behavior.
+- Avoid tests for obvious one-line behavior unless it is a regression risk.
+- Regression tests matter more than broad happy-path duplication.
+- Tests are executable contracts; update them when behavior promises change.
+
+## Completion Checklist
+
+Before marking a branch ready:
+
+- Ensure `.env` exists in a worktree when evals or LLM-dependent tests may run.
+- Run targeted tests while developing and rely on pre-push for the full suite.
+- Complete manual red/green UAT for user-facing behavior before review readiness.
+- Verify adjacent behavior where the change touches shared parsing, scoring, config, or UI paths.
+- For scoring/grader changes, run at least one real eval with a live provider when feasible.
+- For Studio UX/API changes, verify with browser testing.
+- Document verification evidence in the PR.
diff --git a/.claude/skills b/.claude/skills
new file mode 120000
index 00000000..2b7a412b
--- /dev/null
+++ b/.claude/skills
@@ -0,0 +1 @@
+../.agents/skills
\ No newline at end of file
diff --git a/.gitignore b/.gitignore
index fe8ad9cf..c3665720 100644
--- a/.gitignore
+++ b/.gitignore
@@ -25,8 +25,10 @@ examples/**/*.results.jsonl
 agent-orchestrator.yaml
 
 # Agent configuration and activity logs
-.agents/
-.claude/
+.agents/*
+!.agents/skills/
+.claude/*
+!.claude/skills
 .opencode/
 .ao/
 
@@ -35,3 +37,8 @@ agent-orchestrator.yaml
 .runtime/
 .logs/
 state.json
+
+# Beads / Dolt files (added by bd init)
+.dolt/
+*.db
+.beads-credential-key
diff --git a/AGENTS.md b/AGENTS.md
index 366d1381..5bc5c8aa 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,614 +1,172 @@
 # AgentV Repository Guidelines
 
-This is a TypeScript monorepo for AgentV - an AI agent evaluation framework.
+This is a TypeScript monorepo for AgentV, an AI agent evaluation framework.
 
-## High-Level Goals
-
-AgentV aims to provide a robust, declarative framework for evaluating AI agents.
-- **Declarative Definitions**: Define tasks, expected outcomes, and rubrics in simple YAML files.
-- **Structured Evaluation**: Use "Rubric as Object" (Google ADK style) for deterministic, type-safe grading.
-- **Multi-Objective Scoring**: Measure correctness, latency, cost, and safety in a single run.
-- **Optimization Ready**: Designed to support future automated hyperparameter tuning and candidate generation.
+## Load Skills First
 
-## Design Principles
-
-These principles guide all feature decisions. **Follow these when proposing or implementing changes.**
-
-### 1. Lightweight Core, Plugin Extensibility
-AgentV's core should remain minimal. Complex or domain-specific logic belongs in plugins, not built-in features.
-
-**Extension points (prefer these over adding built-ins):**
-- `code-grader` scripts for custom evaluation logic
-- `llm-grader` graders with custom prompt files for domain-specific LLM grading
-- CLI wrappers that consume AgentV's JSON/JSONL output for post-processing (aggregation, comparison, reporting)
-
-**Ask yourself:** "Can this be achieved with existing primitives + a plugin or wrapper?" If yes, it should not be a built-in. This includes adding config overrides to existing graders — if a niche provider needs custom tool-name matching, that's a code-grader, not a new config field.
-
-### 2. Built-ins for Primitives Only
-Built-in graders provide **universal primitives** that users compose. A primitive is:
-- Stateless and deterministic
-- Has a single, clear responsibility
-- Cannot be trivially composed from other primitives
-- Needed by the majority of users
-
-If a feature serves a niche use case or adds conditional logic, it belongs in a plugin.
-
-### 3. Maximize Feature Surface Through Composition
-The goal is to achieve the **maximum feature surface with the minimum primitives** due to high reusability. Before proposing a new feature, enumerate which existing primitives could achieve the same outcome when composed:
-
-- **Oracle validation** is not a feature — it's a `cli` provider target that runs a reference solution through the same evaluators.
-- **Snapshot MCP for benchmarks** is not a feature — it's frozen data in the workspace template + `before_all`/`after_all` hooks to start/stop the server.
-- **Harness variant comparison** is not a feature — it's target hooks with different `before_each` setup scripts.
-- **Skill evaluation** is not a feature — it's `tool-trajectory` + `execution-metrics` + `rubric` composed via `composite`.
-
-**If existing primitives cover it, document the pattern instead of building a feature.** New primitives are justified only when the composition is impossible, not merely when it's undocumented.
-
-### 4. Align with Industry Standards
-Before adding features, research how peer frameworks solve the problem. Prefer the **lowest common denominator** that covers most use cases. Novel features without industry precedent require strong justification and should default to plugin implementation.
-
-### 5. YAGNI — You Aren't Gonna Need It
-Don't build features until there's a concrete need. Before adding a new capability, ask: "Is there real demand for this today, or am I anticipating future needs?" Numeric thresholds, extra tracking fields, and configurable knobs should be omitted until users actually request them. Start with the simplest version (e.g., boolean over numeric range) and extend later if needed.
-
-**YAGNI applies to *how* you meet a real request, not just *whether* to meet it.** The common failure mode is not "I built X and nobody wanted it." It's "someone asked for X and I built a bigger X than they asked for." Guard against that with these habits:
-
-1. **Audit existing primitives before adding new ones.** When an issue asks for capability Y, the first question is not "how do I build Y?" — it's **"what does the codebase already do that addresses Y?"** Grep for existing functions, endpoints, and config shapes. Many requests are satisfied by a behavior that already exists and just needs to be surfaced, configured, or exercised differently.
-2. **Treat issue language as a hint, not a spec.** Issues describe problems *and* implementations. "We need a discovery root" is one implementation of "we need the registry to update live." When an issue lists multiple acceptable approaches (or its acceptance criteria don't actually require the implementation it names), pick the one with the least code surface. Summarize the acceptance criteria in your own words, strip out implementation nouns ("discovery root," "watcher," "registry reload"), then match them against existing primitives before designing anything new.
-3. **Prefer data/config changes over new mechanisms.** If the observable effect is "this list should be editable at runtime," prefer "re-read the file per request" over "add a watcher + a new field + a precedence rule + a new endpoint." Config-driven beats code-driven when both are sufficient.
-4. **Stop when scope doubles.** If an implementation's surface area grows more than ~2× the starting estimate (extra types, extra endpoints, extra invariants), that's a red flag to re-plan, not a sign to push through. Pause and ask: "What would the smallest possible version look like? Does the issue actually require more than that?"
-5. **If you are about to add a second mode, two-layer precedence, or an invariant between two optional fields, stop.** `source: manual | discovered`, "pinned wins over discovered," `excluded_paths` filtering the discovered set — every one of these is a sign that you're in complexity territory that a simpler data model would have avoided.
-
-**Call out existing overengineering.** If, while working on a task, you notice a *current* feature in the repo that looks overengineered relative to what it's used for (multiple modes, optional precedence rules, dead-looking extensibility scaffolding), flag it — don't silently fix it. Open a tracking issue titled "cleanup: simplify X" that lists: the observable behavior today, the simpler model that would cover it, and the migration notes. Link to the code. Do not widen your current PR to absorb the cleanup unless the user asks.
-
-### 6. Non-Breaking Extensions
-New fields should be optional. Existing configurations must continue working unchanged.
-
-### 7. AI-First Design
-AI agents are the primary users of AgentV—not humans reading docs. Design for AI comprehension and composability.
-
-**Skills over rigid commands:**
-- Use Claude Code skills (or agent skill standards) to teach AI *how* to create evals, not step-by-step CLI instructions
-- Skills should cover most use cases; rigid commands trade off AI intelligence
-- Only prescribe exact steps where there's an established best practice
-
-**Intuitive primitives:**
-- Expose simple, single-purpose primitives that AI can combine flexibly
-- Avoid monolithic commands that do multiple things
-- SDK internals should be intuitive enough for AI to modify when needed
-
-**Self-documenting code:**
-- File headers should explain what the file does, how it works, and how to extend it — no need to read other files to understand this one
-- Don't reference external projects, PRs, or issues in code comments; make everything standalone
-- Prefer data-driven patterns (static mappings, config tables) over conditional chains — AI can extend a mapping by adding an entry, but has to trace logic to extend an if/else tree
-- No dead code or speculative infrastructure; if it's unused, delete it
-- When a module has an extension point, include a short recipe in the header (e.g., "To add a new provider: 1. Create a matcher, 2. Add it to the mapping")
-- When changing a module's behavior, update its file header to match. Stale headers are worse than no headers.
-
-**Scope:** Applies to skills, repo structure, documentation, SDK design, and source code — anything AI might need to reason about or extend.
-
-## Tech Stack & Tools
-- **Language:** TypeScript 5.x targeting ES2022
-- **Runtime:** Bun (use `bun` for all package and script operations)
-- **Monorepo:** Bun workspaces
-- **Bundler:** tsup (TypeScript bundler)
-- **Linter/Formatter:** Biome
-- **Testing:** Vitest
-- **LLM Framework:** Vercel AI SDK
-- **Validation:** Zod
-
-## Project Structure
-- `packages/core/` - Evaluation engine, providers, grading
-  - `src/evaluation/registry/` - Extensible grader registry (EvaluatorRegistry, assertion discovery)
-  - `src/evaluation/providers/provider-registry.ts` - Provider plugin registry
-  - `src/evaluation/evaluate.ts` - `evaluate()` programmatic API
-  - `src/evaluation/config.ts` - `defineConfig()` for typed agentv.config.ts
-- `packages/eval/` - Lightweight assertion SDK (`defineAssertion`, `defineCodeGrader`)
-- `apps/cli/` - Command-line interface (published as `agentv`)
-  - `src/commands/create/` - Scaffold commands (`agentv create assertion/eval`)
-- `examples/features/sdk-*` - SDK usage examples (custom assertion, programmatic API, config file)
-
-## Working Style
-
-### Worktree Setup
-- For any feature, bug fix, or non-trivial repo change, work from a dedicated git worktree based on the latest `origin/main`.
-- Before starting implementation, run `git fetch origin` and verify your worktree `HEAD` is based on the current `origin/main` commit.
-- Do not implement from the primary checkout, from a stale local `main`, or from a branch created off an outdated base.
-- Default setup:
-```bash
-git fetch origin
-git worktree add ../agentv.worktrees/<type>-<short-desc> -b <type>/<issue-or-topic>-<short-desc> origin/main
-cd ../agentv.worktrees/<type>-<short-desc>
-```
-- If you discover you are not on a fresh worktree from the latest `origin/main`, stop and fix that first before changing code.
-
-### Planning
-- Use plan mode for any non-trivial task (5+ steps or architectural decisions).
-- If something goes sideways, STOP and re-plan immediately — don't keep pushing a broken approach.
-- For non-trivial changes, pause and ask: "Is there a more elegant solution?" before diving in.
-- Check in with the user before starting implementation on ambiguous tasks.
-- Prefer automation: execute the requested work without extra confirmation unless blocked by missing information, safety concerns, or an irreversible/destructive action the user has not approved.
-
-### Subagent Strategy
-- Use subagents aggressively to keep the main context window clean.
-- Subagents for: research, file exploration, running tests, code review.
-- For complex problems, throw more subagents at it — parallelize where possible.
-- Name subagents descriptively.
-- Before declaring a repo change complete or opening/finalizing a PR, complete manual e2e verification first (see E2E Checklist), **then** spawn a subagent for a final code review pass. E2E must pass before code review — if e2e fails, fix the issue before investing time in review. The user may explicitly skip the review step.
-
-### Autonomous Bug Fixes
-- When you spot a bug, just fix it. Don't ask for hand-holding.
-- Point at logs, errors, failing tests — then resolve them.
-- Only ask when there's genuine ambiguity about intent.
-- Fix failing CI tests without being told.
-
-### Simplicity
-- Every change should be as simple as possible. Import existing code; don't reinvent.
-- Find root causes and fix them directly. No shotgun debugging.
-
-### Progress Updates
-- Provide high-level status updates at natural milestones.
-- When scope changes mid-task, communicate the shift and adjust the plan.
-- Use parallel tool calls when applicable, especially for independent reads, checks, and validation steps.
-
-### PR & Commit Titles
-- Prefer conventional commit style for branch-facing titles: `type(scope): summary`.
-- Use the repository's normal types where they fit, such as `feat`, `fix`, `chore`, `refactor`, `docs`, and `test`.
-- Use the most relevant module or product area as `scope`, such as `studio`, `cli`, `results`, or `evals`.
-- Do not prefix PR titles with `[codex]` unless the user explicitly requests it.
-
-## TypeScript Guidelines
-- Target ES2022 with Node 20+
-- Prefer type inference over explicit types
-- Use `async/await` for async operations
-- Prefer named exports
-- Keep modules cohesive
-
-## Naming Convention: "Project" vs "Benchmark"
-
-These two words have distinct, non-interchangeable meanings in this codebase. Get them right when adding new symbols, docs, or example dirs:
-
-- **Project** — the top-level container Studio organises around: a registered workspace directory (`.agentv/` + run artifacts + traces + experiments). Lives in `~/.agentv/projects.yaml`. Modelled by `ProjectEntry` / `ProjectRegistry` in `packages/core/src/projects.ts`. Matches the terminology used by Phoenix, Langfuse, Braintrust, W&B Weave, and LangSmith.
-- **Benchmark** — a curated *eval suite* designed to measure something specific (academic ML sense: MMLU, HumanEval, SWE-bench). Example dirs use this sense: `examples/showcase/multi-model-benchmark/`, `examples/showcase/offline-grader-benchmark/`, `examples/features/benchmark-tooling/`. Do not rename these — they are correctly named.
-
-The legacy registry file `~/.agentv/benchmarks.yaml` is auto-migrated to `projects.yaml` on first load by `migrateLegacyBenchmarksFile()`. The unrelated per-run `benchmark.json` artifact (Agent Skills compatibility output) is a third, separate concept — also keep that name.
-
-When in doubt: if the thing holds runs / traces / experiments, it's a **project**. If it's a curated set of eval cases meant to measure capability, it's a **benchmark**.
-
-## Wire Format Convention
-
-**Everything that crosses a process boundary uses `snake_case` keys. Internal TypeScript uses `camelCase`. Translate at the boundary — never in the middle.**
-
-The rule is blanket: if the key is going to disk, to a user's editor, into a JSON response, or onto a CLI, it's snake_case. There is no "well this file is internal-ish" carve-out. If in doubt, snake_case.
-
-### snake_case surfaces
-- All YAML files on disk: `*.eval.yaml`, `agentv.config.yaml`, `projects.yaml`, `studio/config.yaml`, any future YAML we add.
-- JSONL result files (`test_id`, `token_usage`, `duration_ms`).
-- Artifact-writer output (`pass_rate`, `tests_run`, `total_tool_calls`).
-- HTTP response bodies from `agentv serve` / Studio (`added_at`, `pass_rate`, `project_id`).
-- CLI JSON output (`agentv results summary`, `results failures`, `results show`).
-- Anything consumed by non-TS tooling (Python, jq pipelines, external dashboards).
-
-### camelCase surfaces
-- TypeScript source: all variables, parameters, fields, type members.
-- Internal in-memory shapes passed between TS modules.
-
-### Translate only at the boundary
-Define a second interface for the wire shape and convert in one place — don't smear snake_case through TS internals.
-
-```typescript
-// Wire shape — snake_case, matches what hits disk / the network
-interface ProjectEntryYaml {
-  id: string;
-  name: string;
-  path: string;
-  added_at: string;
-  last_opened_at: string;
-}
-
-// Internal shape — camelCase, what every TS call site sees
-interface ProjectEntry {
-  id: string;
-  name: string;
-  path: string;
-  addedAt: string;
-  lastOpenedAt: string;
-}
-
-function fromYaml(e: ProjectEntryYaml): ProjectEntry {
-  return { id: e.id, name: e.name, path: e.path, addedAt: e.added_at, lastOpenedAt: e.last_opened_at };
-}
-
-function toYaml(e: ProjectEntry): ProjectEntryYaml {
-  return { id: e.id, name: e.name, path: e.path, added_at: e.addedAt, last_opened_at: e.lastOpenedAt };
-}
-```
+Keep this file as bootstrap context. Detailed AgentV playbooks live in committed skills under `.agents/skills/`, following the Phoenix-style repo skill layout. `.claude/skills` is a symlink to the same directory for Claude compatibility.
 
-Yes, this is two interfaces and two functions per entity. That's the price of keeping TS idiomatic while staying faithful to the wire contract. Don't skip it — dumping TS objects directly to YAML leaks `addedAt`-style camelCase onto disk and breaks jq/Python consumers.
+Before non-trivial work, load the relevant skill:
 
-### Anti-patterns
-- `writeFileSync(path, stringifyYaml(tsObject))` — dumps TS field names verbatim. Wrong.
-- `interface Foo { testId: string; ... }` for a JSON response body — `test_id`, always.
-- Accepting both `testId` and `test_id` on input "for back-compat" when nothing is shipped yet. Just snake_case.
+- `agentv-core-development`: core design principles, TypeScript conventions, naming, snake_case wire formats, docs, examples, and repo structure.
+- `agentv-testing-verification`: CLI testing, Studio/browser verification, grader e2e checks, pre-push hooks, and PR readiness evidence.
+- `agentv-git-workflow`: Beads/GitHub workflow, worktrees, issue claiming, draft PRs, pushing, merging, and cleanup.
+- `agentv-grader-changes`: grader/evaluator type changes, score output, baselines, live eval verification, and score-range checks.
+- `agentv-release-publishing`: versioning, release automation, and package publishing.
 
-### Existing divergences
-If you spot a camelCase key already on disk or in a response (e.g. a legacy endpoint), treat it as a bug: migrate it to snake_case in the same PR where you touch that code path. Don't grandfather it in.
+## Always-On Rules
 
-**Reading back:** `parseJsonlResults()` in `artifact-writer.ts` converts snake_case → camelCase when reading JSONL into TypeScript. `fromYaml` / `toYaml` in `packages/core/src/projects.ts` is the model for YAML boundaries.
+- Use Bun for all package and script operations.
+- Run Python scripts with `uv run <script.py>`.
+- Internal TypeScript uses `camelCase`; anything crossing a process boundary uses `snake_case`. Translate at the boundary.
+- Keep AgentV core lightweight. Prefer existing primitives, plugins, examples, and docs over new built-ins.
+- Do not use the globally installed `agentv` for CLI testing. Use `bun apps/cli/src/cli.ts <args>`; rebuild first when `packages/core/` changes.
+- For Studio UI verification, rebuild `apps/studio/dist/` before UAT or screenshots.
+- For non-trivial repo changes, work in a fresh sibling worktree under `../agentv.worktrees/` based on the latest `origin/main`. Keep the primary checkout clean; do not do feature work in the main folder.
+- Never push directly to `main`. Push feature branches and open/update draft PRs.
+- Use conventional commit and PR titles: `type(scope): summary`.
+- Do not create markdown TODO lists or memory files. Beads is the canonical task tracker and agent memory.
 
-**Why:** Aligns with skill-creator (claude-plugins-official) and broader Python/JSON ecosystem conventions where snake_case is the standard wire format.
+## Key Paths
 
-## Testing & Verification
+- `packages/core/`: evaluation engine, providers, grading, registry, programmatic API.
+- `packages/eval/`: lightweight assertion SDK.
+- `apps/cli/`: CLI published as `agentv`.
+- `apps/studio/`: Studio frontend.
+- `apps/web/`: documentation site.
+- `examples/`: documentation and integration coverage.
+- `.agents/skills/`: committed coding-agent skills.
 
-### Pre-Push Hooks (Automated)
+<!-- BEGIN BEADS INTEGRATION v:1 profile:full hash:f65d5d33 -->
+## Issue Tracking with bd (beads)
 
-The repository uses [prek](https://github.com/nickel-lang/prek) (`@j178/prek`) for pre-push hooks that automatically run build, typecheck, lint, and tests before pushing. **Do not manually run these checks before pushing** — just push to the feature branch and let the pre-push hook validate.
+**IMPORTANT**: This project uses **bd (beads)** for ALL issue tracking. Do NOT use markdown TODOs, task lists, or other tracking methods.
 
-**Setup (automatic):**
-The hooks are installed automatically when you run `bun install` via the `prepare` script. To manually install:
-```bash
-bunx prek install -t pre-push
-```
+### Why bd?
 
-**What runs on push:**
-- `bun run build` - Build all packages
-- `bun run typecheck` - TypeScript type checking
-- `bun run lint` - Biome linting
-- `bun run test` - All tests
-- `bun run validate:examples` - Validate example eval YAML files against the agentv schema
+- Dependency-aware: Track blockers and relationships between issues
+- Git-friendly: Dolt-powered version control with native sync
+- Agent-optimized: JSON output, ready work detection, discovered-from links
+- Prevents duplicate tracking systems and confusion
 
-If any check fails, the push is blocked until the issues are fixed.
+### Quick Start
 
-**Manual run (without pushing):**
-```bash
-bunx prek run --all-files --hook-stage pre-push
-```
-
-### Functional Testing (CLI)
-
-When functionally testing changes to the AgentV CLI, **NEVER** use `agentv` directly as it may run the globally installed version (bun or npm). Instead:
-
-- **From TypeScript source (preferred):** `bun apps/cli/src/cli.ts <args>` — always runs current CLI code, no build step needed. **Exception:** changes inside `packages/core/` require `bun run build` first, because the CLI imports `@agentv/core` from its compiled `dist/`, not from TypeScript source.
-- **From built dist:** `bun apps/cli/dist/cli.js <args>` — requires `bun run build` first, can be stale
-- **From repository root:** `bun agentv <args>` — runs the locally built version (also requires build)
-
-**Prefer running from source** (`src/cli.ts`) during development. The dist build can silently serve stale code if you forget to rebuild after changes. After pulling changes that touch `packages/core/`, always run `bun run build` before CLI testing.
-
-**Studio frontend exception — rebuild `apps/dashboard/dist/` before UAT.** Running `agentv studio` from source (`bun apps/cli/src/cli.ts studio ...`) only reloads the CLI and backend routes from source. The Studio web UI (React/Tailwind bundle) is served as static assets from `apps/dashboard/dist/`, which is build output and does **not** recompile on change. If you are testing Studio UI changes — especially post-merge on `main` or after pulling — rebuild the frontend first:
+**Check for ready work:**
 
 ```bash
-cd apps/dashboard && bun run build
+bd ready --json
 ```
 
-Skipping this step silently serves the previous bundle, so you'll see the old UI even though your source edits and the backend API are live. This has burned at least one post-merge UAT; always rebuild before screenshotting or driving Studio with `agent-browser`.
-
-### Browser E2E Testing (Docs Site)
-
-Use `agent-browser` for visual verification of docs site changes. Environment-specific rules:
-
-- **Always use `--session <name>`** — isolates browser instances; close with `agent-browser --session <name> close` when done
-- **Never use `--headed`** — no display server available; headless (default) works correctly
-
-**Troubleshooting: `--session` hangs with EAGAIN on ARM64**
-
-If `agent-browser --session <name> open <url>` consistently fails with "Resource temporarily unavailable" or times out, Chrome is taking longer to start than the client's retry window. Workaround: pre-start Chrome manually and use `--cdp`:
+**Create new issues:**
 
 ```bash
-nohup chromium --headless=new --remote-debugging-port=9222 \
-  --no-first-run --disable-background-networking --disable-default-apps \
-  --disable-sync --ozone-platform=headless --window-size=1280,720 \
-  --user-data-dir=/tmp/ab-chrome > /tmp/chrome.log 2>&1 &
-curl -s http://localhost:9222/json/version  # verify ready
-
-agent-browser --cdp 9222 open <url>
-agent-browser --cdp 9222 screenshot output.png
+bd create "Issue title" --description="Detailed context" -t bug|feature|task -p 0-4 --json
+bd create "Issue title" --description="What this issue is about" -p 1 --deps discovered-from:bd-123 --json
 ```
 
-### Agent Provider Eval Concurrency
-
-When running evals against agent provider targets (claude, claude-sdk, codex, copilot, copilot-sdk, pi, pi-cli), **limit concurrency to 3 targets at a time**. Each agent provider spawns heavyweight subprocesses (CLI binaries, SDK sessions) that consume significant memory and CPU. Running more than 3 in parallel can exhaust system resources.
+**Claim and update:**
 
 ```bash
-# Good: batch targets in groups of 2-3
-bun apps/cli/src/cli.ts eval my.EVAL.yaml --target claude &
-bun apps/cli/src/cli.ts eval my.EVAL.yaml --target codex &
-wait
-bun apps/cli/src/cli.ts eval my.EVAL.yaml --target copilot &
-bun apps/cli/src/cli.ts eval my.EVAL.yaml --target pi &
-wait
+bd update <id> --claim --json
+bd update bd-42 --priority 1 --json
 ```
 
-This does not apply to lightweight LLM-only targets (azure, openai, gemini, openrouter) which can run with higher concurrency.
-
-### Writing Tests
+**Complete work:**
 
-Tests should be lean and focused on what matters. Follow these principles:
-
-- **Only test new or changed behavior.** Don't write tests for existing behavior that's already covered by the 1600+ core tests. If you fix a bug, test the fix and its edge cases — not the surrounding module.
-- **One test per distinct behavior.** Don't write separate tests for trivially different inputs that exercise the same code path.
-- **No tests for obvious code.** If a function returns `undefined` for missing input and that's a one-line null check, you don't need a test for it unless it's a regression risk.
-- **Regression tests > comprehensive tests.** A test that would have caught the bug is worth more than five tests that exercise happy paths.
-- **Tests are executable contracts.** When a module's behavioral contract changes, the tests must reflect the new contract — not just the happy path. If you change what a function promises, update its tests to assert the new promise.
-
-### Verifying Grader Changes
-
-Unit tests alone are insufficient for grader changes. After implementing or modifying graders:
-
-1. **Copy `.env` to the worktree** if running in a git worktree (e2e tests need environment variables):
-   ```bash
-   cp /path/to/main/.env .env
-   ```
-   ```powershell
-   Copy-Item D:/path/to/main/.env .env
-   ```
-   Do not claim e2e or grader verification results unless this preflight has passed.
-
-2. **Run an actual eval** with a real example file:
-   ```bash
-   bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id <test-id>
-   ```
-
-3. **Inspect the results JSONL** to verify:
-   - The correct grader type is invoked (check `scores[].type`)
-   - Scores are calculated as expected
-   - Assertions array reflects the evaluation logic (each entry has `text`, `passed`, optional `evidence`)
-
-4. **Update baseline files** if output format changes (e.g., type name renames). Baseline files live alongside eval YAML files as `*.baseline.jsonl` and contain expected `scores[].type` values. There are 30+ baseline files across `examples/`.
-
-5. **Note:** `--dry-run` returns schema-valid mock responses for both agent output and grader evaluation (score=1, empty assertions/checks). Built-in LLM graders run without parse errors but scores are meaningless. Use it for end-to-end harness testing including grader plumbing.
-
-### Checking Grader Score Ranges (manual e2e)
-
-`scripts/check-grader-scores.ts` is a post-processor that asserts each grader's score on each test case falls within an expected range. Run it manually after an eval to catch grader regressions (false positives / false negatives) before merging.
-
-**Workflow:**
 ```bash
-# 1. Run the eval, writing results to a sibling *.results.jsonl file
-bun apps/cli/src/cli.ts eval examples/path/to/suite.eval.yaml --target azure \
-  --out examples/path/to/suite.results.jsonl
-
-# 2. Assert all expected score ranges pass
-bun scripts/check-grader-scores.ts
+bd close bd-42 --reason "Completed" --json
 ```
 
-The script auto-discovers `examples/**/*.grader-scores.yaml`, locates the sibling `*.results.jsonl` (same stem), and exits non-zero if any score is out of range.
-
-**To add score checks for a new eval:**
-1. Create `<eval-stem>.grader-scores.yaml` next to the eval YAML.
-2. Add entries for each `(test_id, grader, range)` you care about — `grader` must match a `scores[].name` value in the JSONL output, and `range.min`/`range.max` default to 0/1 if omitted.
-3. Run the eval with `--out <eval-stem>.results.jsonl`, then run the script.
-
-See `examples/red-team/archetypes/coding-agent/suites/screenshot-pii-upload.grader-scores.yaml` for a concrete example.
-
-### Completing Work — E2E Checklist
-
-Before marking any branch as ready for review, complete this checklist:
-
-1. **Preflight:** If in a git worktree, ensure `.env` exists in the worktree root.
-   ```bash
-   cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env
-   ```
-   Without this, any eval run or LLM-dependent test will fail with missing API key errors.
-
-2. **Run unit tests**: `bun run test` — all must pass.
-
-3. **⚠️ BLOCKING: Manual red/green UAT — must complete before steps 4-5:**
-   Unit tests passing is NOT sufficient. Every change must be manually verified from the end user's perspective. Do NOT skip this step or proceed to step 4 until red/green evidence is documented.
-
-   - **Red (before your changes):** Run the scenario on `main` (or the code state before your changes). Confirm the bug or missing feature is observable from the CLI / user-facing output. Capture the output.
-   - **Green (with your changes):** Run the identical scenario with your branch. Confirm the fix or feature works correctly from the end user's perspective. Capture the output.
-   - **Document both** red and green results in the PR description or comments so reviewers can see the before/after evidence.
-
-   For grader changes, this means running a real eval (not `--dry-run`) and inspecting the output JSONL. For CLI/UX changes, this means running the CLI command and verifying the console output.
-
-4. **Verify no regressions** in areas adjacent to your changes (e.g., if you changed grader parsing, run an eval that exercises different grader types).
-
-5. **Live eval verification**: For changes affecting scoring, thresholds, or grader behavior, run at least one real eval with a live provider (not `--dry-run`) and verify the output JSONL has correct scores, verdicts, and execution status.
+### Issue Types
 
-6. **Studio UX verification**: For changes affecting config, scoring display, or studio API, use `agent-browser` to verify the studio UI still renders and functions correctly (settings page loads, pass/fail indicators are correct, config saves work).
+- `bug` - Something broken
+- `feature` - New functionality
+- `task` - Work item (tests, docs, refactoring)
+- `epic` - Large feature with subtasks
+- `chore` - Maintenance (dependencies, tooling)
 
-7. **Mark PR as ready** only after steps 1-6 have been completed AND red/green UAT evidence is included in the PR.
+### Priorities
 
-## Documentation Updates
+- `0` - Critical (security, data loss, broken builds)
+- `1` - High (major features, important bugs)
+- `2` - Medium (default, nice-to-have)
+- `3` - Low (polish, optimization)
+- `4` - Backlog (future ideas)
 
-When making changes to functionality:
+### Workflow for AI Agents
 
-1. **Docs site** (`apps/web/src/content/docs/`): Update human-readable documentation on agentv.dev. This is the comprehensive reference.
+1. **Check ready work**: `bd ready` shows unblocked issues
+2. **Claim your task atomically**: `bd update <id> --claim`
+3. **Work on it**: Implement, test, document
+4. **Discover new work?** Create linked issue:
+   - `bd create "Found bug" --description="Details about what was found" -p 1 --deps discovered-from:<parent-id>`
+5. **Complete**: `bd close <id> --reason "Done"`
 
-2. **Skill files** (`plugins/agentv-dev/skills/agentv-eval-builder/`): Update the AI-focused reference card if the change affects YAML schema, grader types, or CLI commands. Keep concise — link to docs site for details.
+### Quality
+- Use `--acceptance` and `--design` fields when creating issues
+- Use `--validate` to check description completeness
 
-3. **Examples** (`examples/`): Update any example code, scripts, or eval YAML files that exercise the changed functionality. Examples are both documentation and integration tests.
+### Lifecycle
+- `bd defer <id>` / `bd supersede <id>` for issue management
+- `bd stale` / `bd orphans` / `bd lint` for hygiene
+- `bd human <id>` to flag for human decisions
+- `bd formula list` / `bd mol pour <name>` for structured workflows
 
-4. **README.md**: Keep minimal. Links point to agentv.dev.
+### Auto-Sync
 
-## Grader Type System
+bd automatically syncs via Dolt:
 
-Grader types use **kebab-case** everywhere (matching promptfoo convention):
+- Each write auto-commits to Dolt history
+- Use `bd dolt push`/`bd dolt pull` for remote sync
+- No manual export/import needed!
 
-- **YAML config:** `type: llm-grader`, `type: is-json`, `type: execution-metrics`
-- **Internal TypeScript:** `EvaluatorKind = 'llm-grader' | 'is-json' | ...`
-- **Output `scores[].type`:** `"llm-grader"`, `"is-json"`
-- **Registry keys:** `registry.register('llm-grader', ...)`
+### Important Rules
 
-**Source of truth:** `EVALUATOR_KIND_VALUES` array in `packages/core/src/evaluation/types.ts`
+- ✅ Use bd for ALL task tracking
+- ✅ Always use `--json` flag for programmatic use
+- ✅ Link discovered work with `discovered-from` dependencies
+- ✅ Check `bd ready` before asking "what should I work on?"
+- ❌ Do NOT create markdown TODO lists
+- ❌ Do NOT use external issue trackers
+- ❌ Do NOT duplicate tracking systems
 
-**Backward compatibility:** Snake_case is accepted in YAML (`llm_judge` → `llm-grader`) via `normalizeGraderType()` in `grader-parser.ts`. Single-word types (`contains`, `equals`, `regex`, `latency`, `cost`) have no separator and are unchanged.
+For more details, see README.md and docs/QUICKSTART.md.
 
-**Two type definitions exist:**
-- `EvaluatorKind` in `packages/core/src/evaluation/types.ts` — internal, canonical
-- `AssertionType` in `packages/eval/src/assertion.ts` — SDK-facing, must stay in sync
+## Session Completion
 
-## Git Workflow
+**When ending a work session**, you MUST complete ALL steps below. Work is NOT complete until `git push` succeeds.
 
-### Commit Convention
+**MANDATORY WORKFLOW:**
 
-Follow conventional commits: `type(scope): description`
-
-Types: `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore`
-
-### Issue Workflow
-
-When working on a GitHub issue, **ALWAYS** follow this workflow:
-
-1. **Claim the issue** — prevents other agents from duplicating work by stamping Agent ID and setting status on the project board:
-   ```bash
-   # Load AGENT_ID from .env; if not set, ask the user or default to <harness>-<model>
-   # Harness = the coding tool (claude-code, opencode, codex-cli, cursor, etc.)
-   # Model = the LLM (opus, sonnet, o3, etc.)
-   # Examples: "claude-code-opus", "opencode-sonnet", "cursor-o3", "codex-cli-o3"
-   # In this local dev environment, default to "devbox2-codex" unless the user specifies another AGENT_ID.
-   # Do NOT use hostname or machine name.
-   source .env 2>/dev/null
-   if [ -z "$AGENT_ID" ]; then
-     echo "AGENT_ID is not set. Ask the user for an agent identifier, or default to devbox2-codex in this environment (otherwise use <harness>-<model>)."
-   fi
-
-   # Check if already claimed via project board status
-   ITEM_ID=$(gh project item-list 1 --owner EntityProcess --format json | jq -r '.items[] | select(.content.number == <number> and .content.repository == "EntityProcess/agentv") | .id')
-   CURRENT_STATUS=$(gh project item-list 1 --owner EntityProcess --format json | jq -r '.items[] | select(.content.number == <number> and .content.repository == "EntityProcess/agentv") | .status')
-   [ "$CURRENT_STATUS" = "In Progress" ] && echo "SKIP — already claimed" && exit 1
-
-   # Update project roadmap: ensure the issue is on the AgentV OSS board,
-   # then set status to "In Progress" and stamp Agent ID
-   if [ -z "$ITEM_ID" ] || [ "$ITEM_ID" = "null" ]; then
-     ITEM_ID=$(gh project item-add 1 --owner EntityProcess --url "https://github.com/EntityProcess/agentv/issues/<number>" --format json | jq -r '.id')
-   fi
-   if [ -n "$ITEM_ID" ]; then
-     gh project item-edit --project-id PVT_kwDOAIbbRc4BSmjF --id "$ITEM_ID" --field-id PVTSSF_lADOAIbbRc4BSmjFzhAFomw --single-select-option-id c3991b20
-     gh project item-edit --project-id PVT_kwDOAIbbRc4BSmjF --id "$ITEM_ID" --field-id PVTF_lADOAIbbRc4BSmjFzhAHSnk --text "$AGENT_ID"
-   fi
-   ```
-   If the issue has project board status "In Progress", **do not work on it** — pick a different issue.
-
-2. **Update local `main` to the latest `origin/main`** before branching:
+1. **File issues for remaining work** - Create issues for anything that needs follow-up
+2. **Run quality gates** (if code changed) - Tests, linters, builds
+3. **Update issue status** - Close finished work, update in-progress items
+4. **PUSH TO REMOTE** - This is MANDATORY:
    ```bash
-   git checkout main
-   git pull --ff-only origin main
+   git pull --rebase
+   bd dolt push
+   git push
+   git status  # MUST show "up to date with origin"
    ```
+5. **Clean up** - Clear stashes, prune remote branches
+6. **Verify** - All changes committed AND pushed
+7. **Hand off** - Provide context for next session
 
-3. **Create a worktree** with a feature branch:
-   ```bash
-   git worktree add agentv.worktrees/<branch-name> -b <type>/<issue-number>-<short-description>
-   cd agentv.worktrees/<branch-name>
-   bun install
-   cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env
-   # Example: git worktree add agentv.worktrees/feat/42-add-new-embedder -b feat/42-add-new-embedder
-   ```
-
-   The feature branch must be based on the freshly updated `main`, not a stale local checkout.
-
-4. **After your first commit, push and open a draft PR immediately:**
-   ```bash
-   git push -u origin <branch-name>
-   gh pr create --draft --title "<type>(scope): description" --body "Closes #<issue-number>"
-   ```
-   Do NOT wait until implementation is complete. The draft PR is a handoff artifact — if the session is interrupted, the user or another agent can pick up where you left off.
-
-5. **Implement the changes.** Commit and push incrementally as you work. Every meaningful checkpoint (feature compiles, tests pass, new behavior added) should be pushed to the draft PR so progress is visible and recoverable.
-
-6. **Complete E2E verification** (see "Completing Work — E2E Checklist") — this is BLOCKING. Do NOT mark the PR ready for review until every step of the E2E checklist has passed and evidence is documented in the PR body. Specifically:
-   1. Run unit tests.
-   2. Execute every test plan item from the issue/PR checklist, mark each `[x]`, and paste CLI output as evidence.
-   3. Manual red/green UAT with before/after evidence.
-   4. **After e2e passes**, spawn a final subagent code review pass and address or call out any findings — **unless the change is focused** (single-responsibility, well-tested, no architectural impact), in which case this step may be skipped. Do NOT run the code review before e2e — if e2e fails you'll need to fix it first, which invalidates the review.
-   5. CI pipeline passes (all checks green).
-   6. No merge conflicts with `main`.
-
-7. **Only after verification is complete**:
-   - Mark the draft PR ready for review, or
-   - Merge directly if the change is low risk and the repo policy allows it
-
-8. **After merge, clean up local state**:
-   - Delete the local feature branch
-   - Remove the local worktree created for the issue
-   - Confirm the primary checkout is back on an up-to-date `main`
-
-**IMPORTANT:** Never push directly to `main`. Always use branches and PRs.
-
-### Tracker Conventions
-
-- The roadmap project is the source of truth for prioritization and claim status — use it, not labels.
-- Issues in the roadmap are prioritized; issues outside it are not.
-- `bug` marks defects.
-- Issues without `bug` are non-bug work by default.
-- `core`, `wui`, and `tui` are area labels.
-- Keep issue bodies focused on the handoff contract: objective, design latitude, acceptance signals, non-goals, and related links.
-- Do not put priority metadata in issue bodies.
-
-### Pull Requests
-
-**Always use squash merge** when merging PRs to main. This keeps the commit history clean with one commit per feature/fix.
-
-```bash
-# Using GitHub CLI to squash merge a PR
-gh pr merge <PR_NUMBER> --squash --delete-branch
-
-# Or with auto-merge enabled
-gh pr merge <PR_NUMBER> --squash --auto
-```
-
-Do NOT use regular merge or rebase merge, as these create noisy commit history with intermediate commits.
-
-### After Squash Merge
-
-Once a PR is squash-merged, its source branch diverges from main. **Do NOT** try to push additional commits from that branch—you will get merge conflicts.
-
-For follow-up fixes:
-```bash
-git checkout main
-git pull origin main
-git checkout -b fix/<short-description>
-# Apply fixes on the fresh branch
-```
-
-### Plans and Worktrees
-
-#### Plans
-
-Design documents and implementation plans are stored in `docs/plans/` inside the worktree (not the main repo). Save plans to the worktree so they are committed on the feature branch and visible in the draft PR.
-
-**Path warning:** When working in a worktree, use paths relative to the worktree root (e.g., `docs/plans/plan.md`). Do NOT prefix with the worktree directory from the main repo (e.g., `agentv.worktrees/feat/xxx/docs/plans/plan.md`) — this creates accidental nested directories inside the worktree.
-
-Plans are temporary working materials. **Before merging the PR**, delete the plan file and incorporate any user-relevant details into the official documentation.
-
-#### Git Worktrees
-
-Use the sibling `../agentv.worktrees/` directory for all AgentV worktrees. This overrides any generic skill or default preference for `.worktrees/` or `worktrees/` inside the repository. Do not create new AgentV worktrees inside the repository root.
-
-After creating a worktree, always run setup:
-```bash
-bun install                                    # worktrees do NOT share node_modules
-cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env    # required for e2e tests and LLM operations
-```
-Both steps are required before running builds, tests, or evals in the worktree.
-
-### After Checking Out an Existing Branch or PR
-
-Whenever you `git checkout`, `gh pr checkout`, `git pull`, or otherwise switch to a ref that may have changed `package.json` / `bun.lock`, run `bun install` before building, testing, or pushing. The pre-push hook builds all workspaces — if dependencies are stale, the push fails with errors like `Cannot find module 'recharts'` even though the source change is unrelated. `bun install` is cheap when already up-to-date, so run it by default after any ref switch.
-
-## Version Management
-
-This project uses a simple release script for version bumping. The git commit history serves as the changelog.
-
-### Releasing a new version
-
-Use the **GitHub Actions workflows** — do not publish manually from a local machine.
-
-**Standard flow (pre-release → stable):**
-1. Run the [Release workflow](https://github.com/EntityProcess/agentv/actions/workflows/release.yml) with `channel=next` (and desired bump: patch/minor/major). This bumps the version to `x.y.z-next.1`, commits, tags, and pushes.
-2. The [Publish workflow](https://github.com/EntityProcess/agentv/actions/workflows/publish.yml) triggers automatically and publishes to npm `next`.
-3. Run the [Release workflow](https://github.com/EntityProcess/agentv/actions/workflows/release.yml) with `channel=finalize`. This strips the `-next.N` suffix (e.g. `4.12.0-next.1` → `4.12.0`), commits, tags, and pushes.
-4. The Publish workflow triggers automatically and publishes to npm `latest`.
+**CRITICAL RULES:**
+- Work is NOT complete until `git push` succeeds
+- NEVER stop before pushing - that leaves work stranded locally
+- NEVER say "ready to push when you are" - YOU must push
+- If push fails, resolve and retry until it succeeds
 
-**Direct stable release (skip pre-release):**
-1. Run the Release workflow with `channel=stable` (and bump).
-2. Publish workflow auto-publishes to npm `latest`.
+<!-- END BEADS INTEGRATION -->
 
-The release script (`bun scripts/release.ts`) is what the Release workflow calls; it can also be run locally for non-publishing tasks (e.g. inspecting version state), but **do not run `bun run publish` or `bun run publish:next` locally** — npm publish uses OIDC trusted publishing which only works in GitHub Actions.
+## AgentV Beads Workflow Overrides
 
-## Package Publishing
-- Core package (`packages/core/`) - Core evaluation engine and grading logic (published as `@agentv/core`)
-- CLI package (`apps/cli/`) is published as `agentv` on npm
-- Uses tsup with `noExternal: ["@agentv/core"]` to bundle workspace dependencies
-- Install command: `bun install -g agentv` (preferred) or `npm install -g agentv`
+The Beads block above is managed by `bd setup codex`. For this repository, keep these local rules in addition to the generated Beads workflow:
 
-## Python Scripts
-When running Python scripts, always use: `uv run <script.py>`
+- Beads is the canonical task tracker and agent memory for this project: it is the working brain for task state, dependencies, discoveries, and durable project knowledge.
+- GitHub is the team collaboration surface: use it for draft PRs, reviews, CI, merge coordination, and communication with other parties.
+- Interpret the generated "do not use external issue trackers" rule as "do not create a second private task brain." It does not replace this repo's GitHub PR, review, CI, and team communication workflow.
+- After the first meaningful commit for Beads-backed work, push the branch and open a draft PR. Continue pushing incremental commits to that draft PR so work is visible and recoverable before merge.
+- Before ending a work session, sync Beads with `bd dolt push`, push committed code with `git push`, and confirm the branch is up to date with its remote.
+- Do not create markdown TODO lists or separate memory files. Use `bd create` for follow-up work and `bd remember "insight"` for durable project memory.
diff --git a/biome.json b/biome.json
index 5e9695a9..bd201cd5 100644
--- a/biome.json
+++ b/biome.json
@@ -38,6 +38,7 @@
       "**/test-output/**",
       "**/__tmp_*/**",
       "**/.agentv/**",
+      ".beads/**",
       ".claude/**",
       ".opencode/**",
       ".entire/**",

From 1aacf9ec2122aa5a43f3ba210d8c23b4ee31dee0 Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Tue, 2 Jun 2026 05:29:31 +0200
Subject: [PATCH 2/4] docs: adopt beads-only orchestration workflow

---
 .agents/skills/README.md                      |   4 +-
 .agents/skills/agentv-git-workflow/SKILL.md   |  96 +++++++++---
 .../skills/beads-epic-delivery-loop/SKILL.md  | 125 ++++++++++++++++
 .../beads-execplan-issue-creator/SKILL.md     | 141 ++++++++++++++++++
 AGENTS.md                                     |  12 +-
 5 files changed, 348 insertions(+), 30 deletions(-)
 create mode 100644 .agents/skills/beads-epic-delivery-loop/SKILL.md
 create mode 100644 .agents/skills/beads-execplan-issue-creator/SKILL.md

diff --git a/.agents/skills/README.md b/.agents/skills/README.md
index e71459fd..b64ca17a 100644
--- a/.agents/skills/README.md
+++ b/.agents/skills/README.md
@@ -8,6 +8,8 @@ This directory contains repo-local skills that teach coding agents how to work w
 | ----- | ----------- |
 | [agentv-core-development](agentv-core-development/) | Core design principles, TypeScript conventions, naming, wire-format rules, docs expectations, and project structure. |
 | [agentv-testing-verification](agentv-testing-verification/) | AgentV test strategy, CLI verification, grader e2e checks, browser verification, and pre-push behavior. |
-| [agentv-git-workflow](agentv-git-workflow/) | Beads/GitHub collaboration, worktrees, issue claiming, draft PRs, and merge cleanup. |
+| [agentv-git-workflow](agentv-git-workflow/) | Beads-first decentralized orchestration, worktrees, existing PR takeover, draft PRs, and merge cleanup. |
+| [beads-execplan-issue-creator](beads-execplan-issue-creator/) | Convert approved plans into dependency-aware bead epics/tasks with acceptance criteria, verification, and invariants. |
+| [beads-epic-delivery-loop](beads-epic-delivery-loop/) | Execute a bead epic end-to-end with select, claim, implement, verify, review, commit, close, and repeat loops. |
 | [agentv-grader-changes](agentv-grader-changes/) | Grader type conventions, live eval verification, baseline updates, and score-range checks. |
 | [agentv-release-publishing](agentv-release-publishing/) | Versioning, release workflow, and package publishing. |
diff --git a/.agents/skills/agentv-git-workflow/SKILL.md b/.agents/skills/agentv-git-workflow/SKILL.md
index 666e6891..e11615fe 100644
--- a/.agents/skills/agentv-git-workflow/SKILL.md
+++ b/.agents/skills/agentv-git-workflow/SKILL.md
@@ -1,64 +1,94 @@
 ---
 name: agentv-git-workflow
-description: Use when starting, claiming, committing, pushing, opening, updating, reviewing, merging, or cleaning up AgentV work. Covers Beads as canonical task memory, GitHub as collaboration surface, worktrees, draft PRs, issue workflow, and merge cleanup.
+description: Use when starting, claiming, committing, pushing, opening, updating, reviewing, merging, or cleaning up AgentV work. Covers Beads as decentralized orchestration, GitHub as collaboration surface, worktrees, draft PRs, existing PR takeover, and merge cleanup.
 ---
 
 # AgentV Git Workflow
 
 ## Tracking Model
 
-- Beads is the canonical task tracker and agent memory: task state, dependencies, discoveries, and durable project knowledge.
+- Beads is the decentralized orchestration layer: task state, ownership, dependencies, discoveries, and durable project knowledge live in the bead graph.
 - GitHub is the collaboration surface: draft PRs, reviews, CI, merge coordination, and communication with other parties.
-- Interpret "do not use external issue trackers" as "do not create a second private task brain." It does not replace GitHub collaboration.
-- Runtime orchestration should stay lightweight: Beads tracks coordination state, tmux/Codex wrappers run agents, and git worktrees provide isolation. Use `ep-spawn-agent` for generic worktree + tmux spawning when it fits. Do not introduce Gastown/AO unless the missing value is specifically their spawning or dashboard ergonomics.
+- Interpret "do not use external issue trackers" as "do not create a second private task brain." GitHub PRs still handle code review and merge state.
+- Runtime stays lightweight: Beads tracks durable coordination state, `ep-spawn-agent` or manual worktree setup launches disposable workers, and git worktrees provide isolation.
 
 Use Beads instead of markdown TODO lists:
 
 ```bash
 bd ready --json
-bd create "Issue title" --description="Detailed context" -t bug|feature|task -p 0-4 --json
+bd show <id> --json
+bd create "Issue title" --description="Detailed context" -t bug|feature|task|chore|epic -p 0-4 --json
 bd update <id> --claim --json
+bd update <id> --status in_progress --json
 bd close <id> --reason "Completed" --json
 bd remember "durable project insight"
 bd dolt push
 ```
 
-Until a `bead-start` helper exists, the manual Beads-first launch flow is:
+## Starting New Bead Work
+
+Prefer a bead-aware launcher when available:
 
 ```bash
-bd list
-bd show <id>
-bd update <id> --status in_progress
-git fetch origin
-git worktree add ../agentv.worktrees/<id> -b work/<id> origin/main
-cd ../agentv.worktrees/<id>
-codex-eng
-bd close <id>
+ep-spawn-agent <bead-id>
 ```
 
-Follow-up automation is tracked in `agentv-9gh`: create Beads glue around `ep-spawn-agent`, not a parallel spawner. The helper should mark a bead in progress, pass the bead id through `EP_TASK_ID` or an equivalent identifier, let `ep-spawn-agent` handle worktree + tmux startup, and write a session note back to the bead.
+The launcher should:
 
-## Worktrees
+1. read the bead with `bd show <bead-id> --json`;
+2. claim or mark it in progress;
+3. create a fresh sibling worktree from latest `origin/main`;
+4. launch the agent with bead context;
+5. write the session/worktree/branch note back to the bead.
 
-For feature, bug fix, or non-trivial repo changes, work from a dedicated sibling worktree based on latest `origin/main`. Keep the primary checkout clean; do not do feature work in the main folder.
+Manual fallback:
 
 ```bash
+bd show <id> --json
+bd update <id> --claim --json
+bd update <id> --status in_progress --json
 git fetch origin
-git worktree add ../agentv.worktrees/<type>-<short-desc> -b <type>/<issue-or-topic>-<short-desc> origin/main
-cd ../agentv.worktrees/<type>-<short-desc>
+git worktree add ../agentv.worktrees/<id> -b work/<id> origin/main
+cd ../agentv.worktrees/<id>
 bun install
 cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env
+codex-eng
 ```
 
+## Worktrees
+
+For feature, bug fix, or non-trivial repo changes, work from a dedicated sibling worktree based on latest `origin/main`. Keep the primary checkout clean; do not do feature work in the main folder.
+
 AgentV worktrees live in sibling `../agentv.worktrees/`, not `.worktrees/` inside the repo and not the primary checkout.
 
 After checking out a branch or PR, run `bun install` if `package.json` or `bun.lock` may have changed.
 
-## GitHub Issues
+## Existing PR Takeover
 
-When working from a GitHub issue, claim it on the project board before work. If already `In Progress`, do not duplicate work.
+When continuing an existing PR, keep the PR branch as the source of truth for code and use Beads for durable task state/handoff.
 
-Use `AGENT_ID` from `.env`; in this environment default to `devbox2-codex` if unset.
+1. Inspect the PR first:
+
+   ```bash
+   gh pr view <number> --json number,title,state,isDraft,headRefName,headRefOid,baseRefName,mergeStateStatus,reviewDecision,statusCheckRollup,url
+   gh pr checks <number> --watch=false
+   ```
+
+2. Check out the PR branch. If Git reports the branch is already used by another worktree, do not force it; `cd` into that existing worktree instead.
+
+   ```bash
+   gh pr checkout <number>
+   # or: cd /path/to/existing/worktree
+   ```
+
+3. Create a bead for the continuation if one is not already provided; otherwise update the existing bead. Reference the PR number in the bead description or notes.
+
+   ```bash
+   bd create "Continue PR <number>: <summary>" --description="Current state, requested changes, and handoff context" -t task -p 1 --json
+   bd note <id> "Working tree: <path>; PR: https://github.com/EntityProcess/agentv/pull/<number>"
+   ```
+
+4. Push focused commits to the existing PR branch. Do not create a second PR for the same work.
 
 ## Draft PRs
 
@@ -66,7 +96,8 @@ After the first meaningful commit, push and open a draft PR. Continue pushing me
 
 ```bash
 git push -u origin HEAD
-gh pr create --draft --title "<type>(scope): summary" --body "Refs <beads-id-or-github-issue>"
+gh pr create --draft --title "<type>(scope): summary" --body "Refs <bead-id>"
+bd note <bead-id> "Draft PR: <url>"
 ```
 
 Do not push directly to `main`.
@@ -75,6 +106,14 @@ Do not push directly to `main`.
 
 Keep draft until verification evidence is complete: unit tests, test plan evidence, manual red/green UAT for user-facing changes, CI green, no conflicts, and final review pass when warranted.
 
+Before marking ready:
+
+```bash
+gh pr checks <number> --watch=false
+gh pr view <number> --json isDraft,mergeStateStatus,reviewDecision,statusCheckRollup
+bd note <bead-id> "Verification complete: <summary>"
+```
+
 ## Merge and Cleanup
 
 Use squash merge only:
@@ -85,4 +124,13 @@ gh pr merge <PR_NUMBER> --squash --delete-branch
 
 After squash merge, do not continue pushing to the old branch. Start follow-up fixes from fresh `main`.
 
-Before ending a session, sync Beads, push committed code, and confirm the branch is up to date with its remote.
+Before ending a session:
+
+```bash
+git status
+bd dolt push
+git push
+git status
+```
+
+Work is not complete until both Beads state and git commits are pushed.
diff --git a/.agents/skills/beads-epic-delivery-loop/SKILL.md b/.agents/skills/beads-epic-delivery-loop/SKILL.md
new file mode 100644
index 00000000..cc8b4b78
--- /dev/null
+++ b/.agents/skills/beads-epic-delivery-loop/SKILL.md
@@ -0,0 +1,125 @@
+---
+name: beads-epic-delivery-loop
+description: Use when executing a top-level Beads epic end-to-end for AgentV. Iterates over unblocked tasks (claim, implement, test, review, commit, close), repeating until completion or a hard stop condition.
+---
+
+# Beads Epic Delivery Loop
+
+Execute a top-level epic by repeatedly selecting unblocked work, implementing only the selected scope, verifying, reviewing, committing, and closing tasks.
+
+## Inputs
+
+- `EPIC_ID` required: top-level epic to execute.
+- `PLAN_FILE` optional but recommended: source plan with acceptance and architecture context.
+
+## Required Rules
+
+- Use `bd ... --json` for issue-tracking operations.
+- Keep statuses accurate: `open` -> `in_progress` -> `closed`, or `blocked` with a clear reason.
+- Do not work outside `EPIC_ID` and its descendants.
+- Do not close a bead until acceptance criteria and verification are satisfied.
+- Do not skip review before close.
+- Stop on hard blockers instead of inventing scope.
+
+## High-Level Loop
+
+1. Read the epic and plan context.
+2. Select the next incomplete unblocked sub-epic or task in deterministic order.
+3. Claim and mark the selected task in progress.
+4. Gather only the relevant task, dependency, plan, and repo context.
+5. Implement the scoped slice.
+6. Run task-specific verification first, then broader required checks.
+7. Review against plan/spec and perform a code review pass.
+8. Commit focused changes.
+9. Push the branch and update or create the draft PR when appropriate.
+10. Close the bead with a completion reason.
+11. Repeat until the epic completes or a stop condition is reached.
+
+## Deterministic Selection
+
+Select work in a fixed order, using creation order (oldest first) to break ties:
+
+1. Load the epic:
+
+   ```bash
+   bd show <EPIC_ID> --json
+   bd children <EPIC_ID> --json
+   ```
+
+2. Work through incomplete unblocked sub-epics before direct top-level tasks.
+3. Within the active scope, list children, filter to non-epic open tasks with satisfied dependencies, and pick the oldest.
+4. If open tasks remain but none are executable, stop with `blocked_waiting_on_dependencies`.
+5. If no open tasks remain in the active scope, mark the scope complete and advance.
+
+## Task Execution Pattern
+
+For each selected task:
+
+```bash
+bd update <TASK_ID> --claim --json
+bd update <TASK_ID> --status in_progress --json
+bd show <TASK_ID> --json
+```
+
+Then:
+
+- implement only the scoped task;
+- avoid opportunistic unrelated refactors;
+- run verification named in the bead or plan;
+- inspect changed files with `git diff`;
+- fix deviations before committing;
+- create a focused conventional commit;
+- push the branch;
+- update PR notes if a PR exists;
+- close the bead after acceptance is met.
+
+Close shape:
+
+```bash
+bd close <TASK_ID> --reason "Completed: <short evidence summary>" --json
+```
+
+## Discovery Handling
+
+When discovering follow-up work:
+
+```bash
+bd create "<follow-up title>" \
+  --description "<what was discovered, why it matters, and suggested next step>" \
+  -t bug|feature|task|chore \
+  -p 0-4 \
+  --deps discovered-from:<TASK_ID> \
+  --json
+```
+
+Keep the current task open if its declared acceptance criteria are not complete. Do not widen the current PR unless the follow-up is required for the selected task.
+
+## Stop Conditions
+
+Stop immediately on:
+
+- blocked dependency or missing prerequisite;
+- failing verification that cannot be resolved within the current task scope;
+- unclear plan/spec that would make implementation unsafe;
+- inconsistent Beads state preventing deterministic selection;
+- merge conflicts or PR/CI failures that require separate focused work.
+
+When stopping, leave a bead note:
+
+```bash
+bd note <TASK_ID> "Stopped: <reason>; current branch/worktree: <path>; next action: <action>"
+bd dolt push
+```
+
+## Completion Output
+
+Return one summary with:
+
+1. epic ID and plan file;
+2. tasks completed in order;
+3. commits created;
+4. PR URL or branch;
+5. checks run and results;
+6. new beads created from discoveries;
+7. blocked tasks or stop reason;
+8. next recommended action.
diff --git a/.agents/skills/beads-execplan-issue-creator/SKILL.md b/.agents/skills/beads-execplan-issue-creator/SKILL.md
new file mode 100644
index 00000000..162ddd56
--- /dev/null
+++ b/.agents/skills/beads-execplan-issue-creator/SKILL.md
@@ -0,0 +1,141 @@
+---
+name: beads-execplan-issue-creator
+description: Use when converting an approved implementation plan or ExecPlan into dependency-aware Beads epics/issues for AgentV work. Creates a durable task graph with acceptance criteria, verification commands, invariants, and parallelization notes.
+---
+
+# Beads ExecPlan Issue Creator
+
+Convert one approved plan into high-quality Beads tracking in two passes:
+
+1. Create epics and issues with explicit hierarchy and true blocker dependencies.
+2. Review and polish the created graph so implementers can execute from fresh context with minimal ambiguity.
+
+## Inputs
+
+- `PLAN_FILE`: approved implementation plan path.
+- `ROOT_EPIC_ID` optional: existing epic to attach work under.
+
+If `ROOT_EPIC_ID` is omitted, create a root epic from the plan title and purpose.
+
+## Rules
+
+- Use `bd ... --json` for tracking operations.
+- Use `--dry-run` before large `bd create` bursts when the command supports it.
+- Keep plan markdown as planning input; Beads becomes the execution source of truth.
+- Prefer fewer high-confidence beads over many vague beads.
+- Ask for clarification when dependency edges or scope boundaries are ambiguous enough to risk incorrect work.
+- Do not serialize independent work. Use dependencies only for true blockers.
+
+## Parse The Plan
+
+Extract:
+
+- plan title and purpose;
+- milestones or phases;
+- concrete implementation steps;
+- validation and acceptance criteria;
+- interfaces, dependencies, and invariants;
+- idempotence, recovery, and safety constraints;
+- explicit non-goals.
+
+## Build The Graph Before Creating
+
+Model:
+
+- root epic for the whole plan;
+- child milestone epics when the plan has major phases;
+- implementation issues under the relevant epic;
+- blocker dependencies only where work cannot start without another bead;
+- parallelization notes where independent tracks can run concurrently.
+
+## Create Beads
+
+Use clear descriptions. For epics, include:
+
+```markdown
+## Context
+<why this epic exists>
+
+## Success Criteria
+- <verifiable outcome>
+
+## Dependencies and Parallelization
+- Blocked by: <ids or none>
+- Can run in parallel with: <ids or none>
+```
+
+For tasks/features/chores/bugs, include:
+
+```markdown
+## Context
+<why this work exists>
+
+## Detailed Design
+<technical approach and boundaries>
+
+## Acceptance Criteria
+- <observable behavior>
+
+## Verification
+- <command or explicit test path>
+
+## Parallelization Notes
+- Blocked by: <ids or none>
+- Parallel with: <ids or none>
+
+## Invariants
+- <must remain true>
+```
+
+Command shape:
+
+```bash
+bd create "<title>" \
+  --description "<well-structured description>" \
+  -t epic|feature|task|chore|bug \
+  -p 0-4 \
+  --parent <optional-parent-id> \
+  --deps discovered-from:<root-epic-id>[,<true-blocker-id>...] \
+  --json
+```
+
+## Review And Polish Pass
+
+After creation:
+
+```bash
+bd show <ROOT_EPIC_ID> --json
+bd children <ROOT_EPIC_ID> --json
+bd list --json
+```
+
+Check every created bead for:
+
+- clear title with actor/outcome/scope;
+- complete description sections;
+- specific acceptance criteria;
+- concrete verification commands;
+- correct dependency direction;
+- no accidental dependency cycles;
+- no unnecessary serialization;
+- enough context for a fresh worker to execute.
+
+Polish with:
+
+```bash
+bd update <ID> --title "<better title>" --description "<polished description>" --json
+bd dep add <ISSUE_ID> <BLOCKER_ID> --json
+bd dep remove <ISSUE_ID> <BLOCKER_ID> --json
+```
+
+## Output
+
+Return:
+
+1. root epic ID;
+2. created epics and tasks;
+3. dependency summary;
+4. parallel work lanes;
+5. verification strategy;
+6. any ambiguities or human decisions needed;
+7. recommended first `bd ready --json --parent <ROOT_EPIC_ID>` command.
diff --git a/AGENTS.md b/AGENTS.md
index 5bc5c8aa..644c2e95 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -10,7 +10,9 @@ Before non-trivial work, load the relevant skill:
 
 - `agentv-core-development`: core design principles, TypeScript conventions, naming, snake_case wire formats, docs, examples, and repo structure.
 - `agentv-testing-verification`: CLI testing, Studio/browser verification, grader e2e checks, pre-push hooks, and PR readiness evidence.
-- `agentv-git-workflow`: Beads/GitHub workflow, worktrees, issue claiming, draft PRs, pushing, merging, and cleanup.
+- `agentv-git-workflow`: Beads-first decentralized orchestration, worktrees, PR handoff, pushing, merging, and cleanup.
+- `beads-execplan-issue-creator`: convert approved ExecPlans into dependency-aware bead epics/tasks.
+- `beads-epic-delivery-loop`: execute a bead epic by selecting, claiming, implementing, verifying, committing, and closing tasks in dependency order.
 - `agentv-grader-changes`: grader/evaluator type changes, score output, baselines, live eval verification, and score-range checks.
 - `agentv-release-publishing`: versioning, release automation, and package publishing.
 
@@ -25,7 +27,7 @@ Before non-trivial work, load the relevant skill:
 - For non-trivial repo changes, work in a fresh sibling worktree under `../agentv.worktrees/` based on latest `origin/main`. Keep the primary checkout clean; do not do feature work in the main folder.
 - Never push directly to `main`. Push feature branches and open/update draft PRs.
 - Use conventional commit and PR titles: `type(scope): summary`.
-- Do not create markdown TODO lists or memory files. Beads is the canonical task tracker and agent memory.
+- Do not create markdown TODO lists or memory files. Beads is the canonical decentralized task graph, coordination state, and agent memory.
 
 ## Key Paths
 
@@ -164,9 +166,9 @@ For more details, see README.md and docs/QUICKSTART.md.
 
 The Beads block above is managed by `bd setup codex`. For this repository, keep these local rules in addition to the generated Beads workflow:
 
-- Beads is the canonical task tracker and agent memory for this project: it is the working brain for task state, dependencies, discoveries, and durable project knowledge.
-- GitHub is the team collaboration surface: use it for draft PRs, reviews, CI, merge coordination, and communication with other parties.
-- Interpret the generated "do not use external issue trackers" rule as "do not create a second private task brain." It does not replace this repo's GitHub PR, review, CI, and team communication workflow.
+- Beads is AgentV's decentralized orchestration layer: the bead graph is the source of truth for task state, ownership, dependencies, discoveries, and durable project memory.
+- GitHub is the collaboration surface: use it for draft PRs, reviews, CI, merge coordination, and communication with other parties. It does not replace Beads as the local task graph.
+- Use `ep-spawn-agent` or the manual worktree flow in `agentv-git-workflow` to launch bead-scoped workers. Sessions are disposable; bead state is durable.
 - After the first meaningful commit for Beads-backed work, push the branch and open a draft PR. Continue pushing incremental commits to that draft PR so work is visible and recoverable before merge.
 - Before ending a work session, sync Beads with `bd dolt push`, push committed code with `git push`, and confirm the branch is up to date with its remote.
 - Do not create markdown TODO lists or separate memory files. Use `bd create` for follow-up work and `bd remember "insight"` for durable project memory.

From 81ce1af699942d5eeaef7bcf0c46a07b25da77b2 Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Tue, 2 Jun 2026 05:35:14 +0200
Subject: [PATCH 3/4] docs: clarify bead launcher fallback

---
 .agents/skills/agentv-git-workflow/SKILL.md | 8 +++++---
 AGENTS.md                                   | 2 +-
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/.agents/skills/agentv-git-workflow/SKILL.md b/.agents/skills/agentv-git-workflow/SKILL.md
index e11615fe..3008aa94 100644
--- a/.agents/skills/agentv-git-workflow/SKILL.md
+++ b/.agents/skills/agentv-git-workflow/SKILL.md
@@ -10,7 +10,7 @@ description: Use when starting, claiming, committing, pushing, opening, updating
 - Beads is the decentralized orchestration layer: task state, ownership, dependencies, discoveries, and durable project knowledge live in the bead graph.
 - GitHub is the collaboration surface: draft PRs, reviews, CI, merge coordination, and communication with other parties.
 - Interpret "do not use external issue trackers" as "do not create a second private task brain." GitHub PRs still handle code review and merge state.
-- Runtime stays lightweight: Beads tracks durable coordination state, `ep-spawn-agent` or manual worktree setup launches disposable workers, and git worktrees provide isolation.
+- Runtime stays lightweight: Beads tracks durable coordination state, the repo-standard bead launcher creates disposable worktree sessions, and git worktrees provide isolation. Use manual worktree setup only as a fallback when the launcher is unavailable or broken.
 
 Use Beads instead of markdown TODO lists:
 
@@ -27,12 +27,14 @@ bd dolt push
 
 ## Starting New Bead Work
 
-Prefer a bead-aware launcher when available:
+Use the repo-standard bead launcher:
 
 ```bash
 ep-spawn-agent <bead-id>
 ```
 
+Until a dedicated `bead-start` wrapper exists, `ep-spawn-agent <bead-id>` is the default launch path. During normal work, always use it rather than picking among launch modes.
+
 The launcher should:
 
 1. read the bead with `bd show <bead-id> --json`;
@@ -41,7 +43,7 @@ The launcher should:
 4. launch the agent with bead context;
 5. write the session/worktree/branch note back to the bead.
 
-Manual fallback:
+Manual fallback only when the launcher is unavailable or broken:
 
 ```bash
 bd show <id> --json
diff --git a/AGENTS.md b/AGENTS.md
index 644c2e95..987a7948 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -168,7 +168,7 @@ The Beads block above is managed by `bd setup codex`. For this repository, keep
 
 - Beads is AgentV's decentralized orchestration layer: the bead graph is the source of truth for task state, ownership, dependencies, discoveries, and durable project memory.
 - GitHub is the collaboration surface: use it for draft PRs, reviews, CI, merge coordination, and communication with other parties. It does not replace Beads as the local task graph.
-- Use `ep-spawn-agent` or the manual worktree flow in `agentv-git-workflow` to launch bead-scoped workers. Sessions are disposable; bead state is durable.
+- Use the repo-standard bead launcher (`ep-spawn-agent <bead-id>` until a `bead-start` wrapper exists) to launch bead-scoped workers. The manual worktree flow in `agentv-git-workflow` is only a fallback when the launcher is unavailable or broken. Sessions are disposable; bead state is durable.
 - After the first meaningful commit for Beads-backed work, push the branch and open a draft PR. Continue pushing incremental commits to that draft PR so work is visible and recoverable before merge.
 - Before ending a work session, sync Beads with `bd dolt push`, push committed code with `git push`, and confirm the branch is up to date with its remote.
 - Do not create markdown TODO lists or separate memory files. Use `bd create` for follow-up work and `bd remember "insight"` for durable project memory.

From 4c5ed1425ba3deccb4164bcd29259795ae87b8c2 Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Tue, 2 Jun 2026 05:41:09 +0200
Subject: [PATCH 4/4] docs: add durable agent guardrails

---
 .../skills/agentv-core-development/SKILL.md   |  8 ++++++++
 .agents/skills/agentv-git-workflow/SKILL.md   | 19 ++++++++++++++++++-
 .../agentv-testing-verification/SKILL.md      |  6 ++++++
 AGENTS.md                                     |  8 ++++++++
 4 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/.agents/skills/agentv-core-development/SKILL.md b/.agents/skills/agentv-core-development/SKILL.md
index 076ca3b8..89f820da 100644
--- a/.agents/skills/agentv-core-development/SKILL.md
+++ b/.agents/skills/agentv-core-development/SKILL.md
@@ -41,6 +41,14 @@ If you notice existing overengineering while working, create a Beads issue title
 - `apps/web/`: documentation site.
 - `examples/`: documentation and integration coverage.
 
+## Code Editing Discipline
+
+- Revise existing files in place when the feature belongs there; avoid creating `*-v2`, `*-new`, `*-improved`, or similarly duplicative files.
+- New files are appropriate for genuinely new modules, skills, examples, or docs, but do not create throwaway variants as a substitute for understanding the existing code.
+- Avoid broad script-based rewrites of source code. For code changes, prefer targeted edits after reading enough context; scripts are acceptable for mechanical verification, generated outputs, or narrow non-code maintenance where risk is low.
+- Do not delete files or folders without explicit permission. If cleanup is needed, ask or use a reversible alternative.
+- If you are unsure how a third-party library or API is currently meant to be used, consult its current official docs before changing the integration.
+
 ## TypeScript
 
 - Prefer inference over explicit types when clear.
diff --git a/.agents/skills/agentv-git-workflow/SKILL.md b/.agents/skills/agentv-git-workflow/SKILL.md
index 3008aa94..e064c2a9 100644
--- a/.agents/skills/agentv-git-workflow/SKILL.md
+++ b/.agents/skills/agentv-git-workflow/SKILL.md
@@ -57,6 +57,23 @@ cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env
 codex-eng
 ```
 
+## Beads Viewer
+
+`bv` is an optional viewer that provides graph and kanban views of the Beads graph. Agents must never run bare `bv`: it opens the interactive TUI and blocks the session. Use robot-mode commands only:
+
+```bash
+bv --robot-next
+bv --robot-triage
+bv --robot-plan
+bv --robot-graph
+```
+
+In worktrees where `.beads` is not present, point `bv` at the canonical project Beads directory (the path below is specific to this environment; substitute your own checkout path):
+
+```bash
+bv --db /home/entity/projects/EntityProcess/agentv/.beads --robot-triage
+```
+
 ## Worktrees
 
 For feature, bug fix, or non-trivial repo changes, work from a dedicated sibling worktree based on latest `origin/main`. Keep the primary checkout clean; do not do feature work in the main folder.
@@ -102,7 +119,7 @@ gh pr create --draft --title "<type>(scope): summary" --body "Refs <bead-id>"
 bd note <bead-id> "Draft PR: <url>"
 ```
 
-Do not push directly to `main`.
+Do not push directly to `main`. The default branch is `main`; do not use or document `master` for AgentV workflows.
 
 ## PR Readiness
 
diff --git a/.agents/skills/agentv-testing-verification/SKILL.md b/.agents/skills/agentv-testing-verification/SKILL.md
index 50fde676..469401a4 100644
--- a/.agents/skills/agentv-testing-verification/SKILL.md
+++ b/.agents/skills/agentv-testing-verification/SKILL.md
@@ -47,6 +47,10 @@ Use `agent-browser` for docs site verification. Always pass `--session <name>` a
 
 If session launch hangs with EAGAIN on ARM64, pre-start Chrome with CDP and use `agent-browser --cdp 9222`.
 
+## Browser Safety In Tests
+
+Automated tests should not unexpectedly open a graphical browser. For browser-dependent behavior, prefer headless `agent-browser` verification or explicit opt-in test hooks. If adding code that can launch a browser, guard it behind environment checks or explicit user action.
+
 ## Agent Provider Evals
 
 Limit coding-agent provider eval concurrency to 3 targets at a time for `claude`, `claude-sdk`, `codex`, `copilot`, `copilot-sdk`, `pi`, and `pi-cli`. Lightweight LLM-only targets can use higher concurrency.
@@ -58,6 +62,8 @@ Limit coding-agent provider eval concurrency to 3 targets at a time for `claude`
 - Avoid tests for obvious one-line behavior unless it is a regression risk.
 - Regression tests matter more than broad happy-path duplication.
 - Tests are executable contracts; update them when behavior promises change.
+- Use table-driven tests when multiple cases exercise the same behavior.
+- Use temporary directories/helpers for filesystem tests; do not write persistent test artifacts into the repo.
 
 ## Completion Checklist
 
diff --git a/AGENTS.md b/AGENTS.md
index 987a7948..8773bc45 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -29,6 +29,14 @@ Before non-trivial work, load the relevant skill:
 - Use conventional commit and PR titles: `type(scope): summary`.
 - Do not create markdown TODO lists or memory files. Beads is the canonical decentralized task graph, coordination state, and agent memory.
 
+## Safety Guardrails
+
+- The user is in charge. If an explicit user instruction conflicts with repo habits, follow the user unless it would be unsafe or impossible.
+- Do not delete files or folders without explicit permission. This includes temporary files you created unless the user already approved that cleanup.
+- Never run destructive cleanup/reset commands such as `git reset --hard`, `git clean -fd`, or broad `rm -rf` unless the user gives the exact command and explicitly confirms the irreversible consequences.
+- Prefer non-destructive recovery: inspect with `git status` / `git diff`, move aside, stash, or ask before overwriting work.
+- Do not push directly to `main`; all code changes land through branches and PRs.
+
 ## Key Paths
 
 - `packages/core/`: evaluation engine, providers, grading, registry, programmatic API.