diff --git a/.agents/skills/README.md b/.agents/skills/README.md
new file mode 100644
index 00000000..62c5ee34
--- /dev/null
+++ b/.agents/skills/README.md
@@ -0,0 +1,15 @@
+# AgentV Coding Agent Skills
+
+This directory contains repo-local skills that teach coding agents how to work with AgentV. They are shared across compatible tools through `.agents/skills`, with `.claude/skills` symlinked here for Claude compatibility.
+
+## Skills
+
+| Skill | Description |
+| ----- | ----------- |
+| [agentv-core-development](agentv-core-development/) | Core design principles, TypeScript conventions, naming, wire-format rules, docs expectations, and project structure. |
+| [agentv-testing-verification](agentv-testing-verification/) | AgentV test strategy, CLI verification, grader e2e checks, browser verification, and pre-push behavior. |
+| [agentv-git-workflow](agentv-git-workflow/) | AO-first session/worktree/PR lifecycle, GitHub collaboration, manual fallback worktrees, existing PR takeover, and merge cleanup. |
+| [beads-execplan-issue-creator](beads-execplan-issue-creator/) | Optional when explicitly assigned: convert approved plans into dependency-aware bead epics/tasks with acceptance criteria, verification, and invariants. |
+| [beads-epic-delivery-loop](beads-epic-delivery-loop/) | Optional when explicitly assigned: execute a bead epic end-to-end without spawning unmanaged agents. |
+| [agentv-grader-changes](agentv-grader-changes/) | Grader type conventions, live eval verification, baseline updates, and score-range checks. |
+| [agentv-release-publishing](agentv-release-publishing/) | Versioning, release workflow, and package publishing. |
diff --git a/.agents/skills/agentv-core-development/SKILL.md b/.agents/skills/agentv-core-development/SKILL.md
new file mode 100644
index 00000000..cd3fce47
--- /dev/null
+++ b/.agents/skills/agentv-core-development/SKILL.md
@@ -0,0 +1,85 @@
+---
+name: agentv-core-development
+description: Use when changing AgentV core, SDK, CLI, Studio APIs, config schemas, docs, examples, or any cross-process wire format. Covers design principles, TypeScript conventions, naming, snake_case boundaries, and documentation updates.
+---
+
+# AgentV Core Development
+
+AgentV is a TypeScript monorepo for a declarative AI agent evaluation framework.
+
+## Goals
+
+- Declarative YAML eval definitions.
+- Structured, type-safe grading.
+- Multi-objective scoring for correctness, latency, cost, and safety.
+- Optimization-ready primitives without speculative built-ins.
+
+## Design Principles
+
+- Keep core lightweight and extensible through plugins.
+- Built-ins should be universal primitives: deterministic, stateless, single-purpose, and broadly useful.
+- Prefer composition over new features. If existing primitives cover a need, document the pattern instead of adding code.
+- Research peer frameworks before adding a new capability, and choose the lowest common denominator.
+- Apply YAGNI to implementation size, not just feature selection. Audit existing primitives before adding knobs, modes, precedence rules, or new invariants.
+- New fields must be optional and non-breaking.
+- Design for AI agents: intuitive primitives, self-documenting modules, concise extension recipes in file headers, and no dead speculative infrastructure.
+
+If you notice existing overengineering while working, flag it through the active AO/GitHub workflow (for example, open a GitHub issue or report it in the PR) and describe the current behavior, a simpler model, migration notes, and code links. Do not widen the current PR unless asked.
+
+## Stack
+
+- TypeScript 5.x targeting ES2022 and Node 20+.
+- Bun for all package and script operations.
+- Bun workspaces, tsup, Biome, Vitest, Vercel AI SDK, Zod.
+
+## Project Structure
+
+- `packages/core/`: evaluation engine, providers, grading, registry, programmatic API.
+- `packages/eval/`: lightweight assertion SDK.
+- `apps/cli/`: command-line interface published as `agentv`.
+- `apps/studio/`: Studio frontend.
+- `apps/web/`: documentation site.
+- `examples/`: documentation and integration coverage.
+
+## Code Editing Discipline
+
+- Revise existing files in place when the feature belongs there; avoid creating `*-v2`, `*-new`, `*-improved`, or similarly duplicative files.
+- New files are appropriate for genuinely new modules, skills, examples, or docs, but do not create throwaway variants as a substitute for understanding the existing code.
+- Avoid broad script-based rewrites of source code. For code changes, prefer targeted edits after reading enough context; scripts are acceptable for mechanical verification, generated outputs, or narrow non-code maintenance where risk is low.
+- Do not delete files or folders without explicit permission. If cleanup is needed, ask or use a reversible alternative.
+- If using a third-party library/API and you are not sure about current usage, consult current official docs before changing the integration.
+
+## TypeScript
+
+- Prefer inference over explicit types when clear.
+- Use `async`/`await`.
+- Prefer named exports.
+- Keep modules cohesive.
+- Update stale file headers when behavior changes.
+
+## Project vs Benchmark
+
+- `Project`: top-level Studio container around a registered workspace directory. Modelled by `ProjectEntry` / `ProjectRegistry` and stored in `~/.agentv/projects.yaml`.
+- `Benchmark`: curated eval suite designed to measure a capability. Example benchmark directories should keep that name.
+- Legacy `~/.agentv/benchmarks.yaml` migration and per-run `benchmark.json` artifacts are separate concepts.
+
+When in doubt: if it holds runs/traces/experiments, it is a project. If it is a curated eval suite, it is a benchmark.
+
+## Wire Format
+
+Everything crossing a process boundary uses `snake_case`. Internal TypeScript uses `camelCase`. Translate at the boundary only.
+
+Snake case surfaces include YAML, JSONL result files, artifact output, HTTP responses, CLI JSON, and anything consumed by non-TS tooling. Camel case surfaces are TypeScript variables, parameters, type members, and in-memory shapes.
+
+Use paired wire/internal interfaces and converters, following `packages/core/src/projects.ts`. Do not dump TS objects directly to YAML or JSON responses.
+
+Treat existing camelCase on disk or in responses as a bug when touching that path.
+
+## Documentation
+
+When functionality changes, update:
+
+- Docs site under `apps/web/src/content/docs/`.
+- Skills if YAML schema, grader types, or CLI commands changed.
+- Examples that exercise changed behavior.
+- README only when the high-level pointer changes.
diff --git a/.agents/skills/agentv-git-workflow/SKILL.md b/.agents/skills/agentv-git-workflow/SKILL.md
new file mode 100644
index 00000000..97c64898
--- /dev/null
+++ b/.agents/skills/agentv-git-workflow/SKILL.md
@@ -0,0 +1,95 @@
+---
+name: agentv-git-workflow
+description: Use when starting, claiming, committing, pushing, opening, updating, reviewing, merging, or cleaning up AgentV work. Covers AO-first session/worktree/PR lifecycle, GitHub collaboration, manual fallback worktrees, existing PR takeover, and merge cleanup.
+---
+
+# AgentV Git Workflow
+
+## Tracking Model
+
+- AO (Composio Agent Orchestrator) is the orchestration layer for live coding work: assignment, worker ownership, status, worktree lifecycle, PR claiming, and visualization.
+- GitHub is the external collaboration surface: PRs, reviews, CI, merge coordination, issues, and human-visible handoff.
+- Beads (`bd`) is optional, durable planning/backlog context, used only when explicitly assigned by the user/AO. Do not use Beads as routine live execution tracking in AO-managed sessions.
+- Do not create competing task trackers, markdown TODO ledgers, unmanaged agent sessions, or duplicate PRs.
+
+## AO-Managed Sessions
+
+When `AO_SESSION_ID` is present or the task says it is an AO worker session:
+
+1. Acknowledge and report status with AO commands (`ao acknowledge`, `ao report working`, `ao report fixing-ci`, `ao report addressing-reviews`, `ao report needs-input`).
+2. Use the AO-provided worktree and branch unless AO/user instructs otherwise.
+3. For an existing PR, run `ao session claim-pr <number-or-url>` before editing. If claim or checkout indicates another AO session/worktree owns the branch, coordinate instead of forcing checkout.
+4. Push focused commits to the claimed PR branch and report PR milestones with `ao report pr-created --pr-url <url>`, `draft-pr-created`, or `ready-for-review` as appropriate.
+5. Do not invoke `ep-spawn-agent`, launch sub-agents, create extra worktrees, or create Beads tasks for live tracking unless AO/user explicitly asks.
+
+## ep-spawn-agent Verdict
+
+`ep-spawn-agent` is disabled for normal AgentV work under AO. It may only be used in a non-AO environment or with explicit AO/user instruction for a Beads experiment. In AO-managed sessions it conflicts with AO ownership, visualization, worktree, and PR lifecycle, so prefer AO workers/harnesses instead.
+
+## Manual Fallback Outside AO
+
+For feature, bug fix, or non-trivial repo changes outside AO, work from a dedicated sibling worktree based on latest `origin/main`:
+
+```bash
+git fetch origin
+git worktree add ../agentv.worktrees/<type>-<short-desc> -b <type>/<short-desc> origin/main
+cd ../agentv.worktrees/<type>-<short-desc>
+bun install
+cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env 2>/dev/null || true
+```
+
+Keep the primary checkout clean. Do not push directly to `main`.
+
+## Existing PR Takeover
+
+1. Inspect the PR first:
+
+   ```bash
+   gh pr view <number> --json number,title,state,isDraft,headRefName,headRefOid,baseRefName,mergeStateStatus,reviewDecision,statusCheckRollup,url
+   gh pr checks <number> --watch=false
+   ```
+
+2. In AO, claim with `ao session claim-pr <number-or-url>` and use the resulting worktree/branch. If the branch is already used by another worktree, do not force it; coordinate or `cd` into the existing worktree only when that is the safe continuation path.
+
+3. Outside AO, check out the PR branch manually:
+
+   ```bash
+   gh pr checkout <number>
+   # or: cd /path/to/existing/worktree
+   ```
+
+4. Push focused commits to the existing PR branch. Do not create a second PR for the same work.
+
+## PRs and Pushing
+
+After the first meaningful commit, push and open or update a PR. In AO, prefer the PR lifecycle requested by the orchestrator; otherwise open a draft PR for in-progress work.
+
+```bash
+git push -u origin HEAD
+gh pr create --draft --title "<type>(scope): summary" --body "<summary and verification plan>"
+```
+
+Use conventional commit and PR titles: `type(scope): summary`.
+
+## PR Readiness
+
+Keep the PR in draft until verification evidence is complete: unit tests, test plan evidence, manual red/green UAT for user-facing changes, green CI, no conflicts, and a final review pass when warranted.
+
+Before marking ready:
+
+```bash
+gh pr checks <number> --watch=false
+gh pr view <number> --json isDraft,mergeStateStatus,reviewDecision,statusCheckRollup
+```
+
+## Merge and Cleanup
+
+Merge only when you are explicitly responsible for merging, and use a squash merge:
+
+```bash
+gh pr merge <PR_NUMBER> --squash --delete-branch
+```
+
+After squash merge, do not continue pushing to the old branch. Start follow-up fixes from fresh `main`.
+
+Before ending a session, ensure committed work is pushed and report the current state through AO when running under AO.
diff --git a/.agents/skills/agentv-grader-changes/SKILL.md b/.agents/skills/agentv-grader-changes/SKILL.md
new file mode 100644
index 00000000..1e9900fa
--- /dev/null
+++ b/.agents/skills/agentv-grader-changes/SKILL.md
@@ -0,0 +1,51 @@
+---
+name: agentv-grader-changes
+description: Use when adding, modifying, renaming, parsing, or verifying AgentV graders/evaluators, assertion types, scoring behavior, thresholds, baseline files, or eval output shape.
+---
+
+# AgentV Grader Changes
+
+## Type System
+
+Grader types are kebab-case everywhere:
+
+- YAML config: `llm-grader`, `is-json`, `execution-metrics`.
+- Internal `EvaluatorKind`.
+- Output `scores[].type`.
+- Registry keys.
+
+Source of truth: `EVALUATOR_KIND_VALUES` in `packages/core/src/evaluation/types.ts`.
+
+Snake_case aliases can be accepted for backward compatibility through `normalizeGraderType()` in `grader-parser.ts`. SDK-facing `AssertionType` in `packages/eval/src/assertion.ts` must stay in sync.
+
+## Verification
+
+Unit tests are not enough for grader changes.
+
+1. Ensure `.env` exists in the worktree.
+2. Run an actual eval with a real example file:
+
+```bash
+bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id <test-id>
+```
+
+3. Inspect JSONL output:
+   - correct `scores[].type`
+   - expected score calculation
+   - assertions have `text`, `passed`, and optional `evidence`
+
+4. Update `*.baseline.jsonl` files when output format changes.
+
+`--dry-run` is useful for harness plumbing but returns mock scores and cannot validate grading quality.
+
+## Score Range Checks
+
+For manual e2e score guardrails:
+
+```bash
+bun apps/cli/src/cli.ts eval examples/path/to/suite.eval.yaml --target azure \
+  --out examples/path/to/suite.results.jsonl
+bun scripts/check-grader-scores.ts
+```
+
+Add `<eval-stem>.grader-scores.yaml` next to an eval when a new suite needs score-range assertions.
diff --git a/.agents/skills/agentv-release-publishing/SKILL.md b/.agents/skills/agentv-release-publishing/SKILL.md
new file mode 100644
index 00000000..3317905e
--- /dev/null
+++ b/.agents/skills/agentv-release-publishing/SKILL.md
@@ -0,0 +1,31 @@
+---
+name: agentv-release-publishing
+description: Use when changing AgentV versioning, release automation, package publishing, npm package configuration, or release docs.
+---
+
+# AgentV Release and Publishing
+
+## Versioning
+
+Git commit history is the changelog. Use GitHub Actions for releases; do not publish manually from a local machine.
+
+## Standard Release Flow
+
+1. Run the Release workflow with `channel=next` and the desired bump. It creates `x.y.z-next.1`, commits, tags, and pushes.
+2. The Publish workflow publishes to npm `next`.
+3. Run the Release workflow with `channel=finalize`. It strips the prerelease suffix.
+4. The Publish workflow publishes to npm `latest`.
+
+## Direct Stable Release
+
+Run the Release workflow with `channel=stable` and the desired bump. The Publish workflow then publishes to npm `latest`.
+
+## Local Scripts
+
+`bun scripts/release.ts` can inspect version state locally, but do not run `bun run publish` or `bun run publish:next` locally. npm publish uses OIDC trusted publishing from GitHub Actions.
+
+## Packages
+
+- `packages/core/` publishes `@agentv/core`.
+- `apps/cli/` publishes `agentv`.
+- tsup bundles workspace dependencies with `noExternal: ["@agentv/core"]`.
diff --git a/.agents/skills/agentv-testing-verification/SKILL.md b/.agents/skills/agentv-testing-verification/SKILL.md
new file mode 100644
index 00000000..469401a4
--- /dev/null
+++ b/.agents/skills/agentv-testing-verification/SKILL.md
@@ -0,0 +1,78 @@
+---
+name: agentv-testing-verification
+description: Use when testing, verifying, debugging checks, changing CLI behavior, grader behavior, Studio UI/API behavior, docs site visuals, examples, or preparing an AgentV PR for review.
+---
+
+# AgentV Testing and Verification
+
+## Pre-Push
+
+The repo uses `prek` pre-push hooks. Do not manually run the full pre-push suite before pushing unless diagnosing a failure. Push to the feature branch and let the hook run:
+
+- `bun run build`
+- `bun run typecheck`
+- `bun run lint`
+- `bun run test`
+- `bun run validate:examples`
+
+Manual equivalent:
+
+```bash
+bunx prek run --all-files --hook-stage pre-push
+```
+
+## CLI Testing
+
+Never use global `agentv` for functional testing. Use current source:
+
+```bash
+bun apps/cli/src/cli.ts <args>
+```
+
+If changes touch `packages/core/`, run `bun run build` first because the CLI imports `@agentv/core` from compiled `dist`.
+
+For built output use `bun apps/cli/dist/cli.js <args>` or `bun agentv <args>`, but only after building.
+
+## Studio UI
+
+`agentv studio` serves `apps/studio/dist/`. Rebuild before UAT or screenshots:
+
+```bash
+cd apps/studio && bun run build
+```
+
+## Docs Browser E2E
+
+Use `agent-browser` for docs site verification. Always pass `--session <name>` and do not use `--headed`.
+
+If session launch hangs with EAGAIN on ARM64, pre-start Chrome with CDP and use `agent-browser --cdp 9222`.
+
+## Browser Safety In Tests
+
+Automated tests should not unexpectedly open a graphical browser. For browser-dependent behavior, prefer headless `agent-browser` verification or explicit opt-in test hooks. If adding code that can launch a browser, guard it behind environment checks or explicit user action.
+
+## Agent Provider Evals
+
+Limit coding-agent provider eval concurrency to 3 targets at a time for `claude`, `claude-sdk`, `codex`, `copilot`, `copilot-sdk`, `pi`, and `pi-cli`. Lightweight LLM-only targets can use higher concurrency.
+
+## Writing Tests
+
+- Test new or changed behavior only.
+- Prefer one test per distinct behavior.
+- Avoid tests for obvious one-line behavior unless it is a regression risk.
+- Regression tests matter more than broad happy-path duplication.
+- Tests are executable contracts; update them when behavior promises change.
+- Use table-driven tests when multiple cases exercise the same behavior.
+- Use temporary directories/helpers for filesystem tests; do not write persistent test artifacts into the repo.
+
+## Completion Checklist
+
+Before marking a branch ready:
+
+- Ensure `.env` exists in the worktree when evals or LLM-dependent tests may run.
+- Run targeted tests while developing and rely on pre-push for the full suite.
+- Complete manual red/green UAT for user-facing behavior before review readiness.
+- Verify adjacent behavior where the change touches shared parsing, scoring, config, or UI paths.
+- For scoring/grader changes, run at least one real eval with a live provider when feasible.
+- For Studio UX/API changes, verify with browser testing.
+- Document verification evidence in the PR.
diff --git a/.agents/skills/beads-epic-delivery-loop/SKILL.md b/.agents/skills/beads-epic-delivery-loop/SKILL.md
new file mode 100644
index 00000000..6a5d8082
--- /dev/null
+++ b/.agents/skills/beads-epic-delivery-loop/SKILL.md
@@ -0,0 +1,129 @@
+---
+name: beads-epic-delivery-loop
+description: Optional; use only when the user/AO explicitly assigns a Beads epic for AgentV. Iterates unblocked tasks with claim, implement, test, review, commit, close, and repeat without spawning unmanaged agents.
+---
+
+# Beads Epic Delivery Loop
+
+This is an optional Beads execution playbook. In AO-managed sessions, do not use it unless the user or AO explicitly assigns a Beads epic. AO remains the live orchestration layer for session ownership, worktrees, PR claiming, status, and visualization.
+
+Execute a top-level epic by repeatedly selecting unblocked work, implementing only the selected scope, verifying, reviewing, committing, and closing tasks.
+
+## Inputs
+
+- `EPIC_ID` required: top-level epic to execute.
+- `PLAN_FILE` optional but recommended: source plan with acceptance and architecture context.
+
+## Required Rules
+
+- Use `bd ... --json` for Beads operations only after explicit assignment.
+- Do not invoke `ep-spawn-agent`, launch unmanaged agents, or create extra worktrees inside an AO-managed session.
+- Do not let Beads claims/status override AO session or PR ownership.
+- Keep statuses accurate: `open` -> `in_progress` -> `closed`, or `blocked` with a clear reason.
+- Do not work outside `EPIC_ID` and its descendants.
+- Do not close a bead until acceptance criteria and verification are satisfied.
+- Do not skip review before close.
+- Stop on hard blockers instead of inventing scope.
+
+## High-Level Loop
+
+1. Read the epic and plan context.
+2. Select the next incomplete unblocked sub-epic or task in deterministic order.
+3. Claim and mark the selected task in progress.
+4. Gather only the relevant task, dependency, plan, and repo context.
+5. Implement the scoped slice.
+6. Run task-specific verification first, then broader required checks.
+7. Review against plan/spec and perform a code review pass.
+8. Commit focused changes.
+9. Push the branch and update or create the draft PR when appropriate.
+10. Close the bead with a completion reason.
+11. Repeat until the epic completes or a stop condition is reached.
+
+## Deterministic Selection
+
+Use creation order to break ties:
+
+1. Load the epic:
+
+   ```bash
+   bd show <EPIC_ID> --json
+   bd children <EPIC_ID> --json
+   ```
+
+2. Prefer incomplete unblocked sub-epics before direct top-level tasks.
+3. Within the active scope, list children, filter to non-epic open tasks with satisfied dependencies, and pick the oldest.
+4. If open tasks remain but none are executable, stop with `blocked_waiting_on_dependencies`.
+5. If no open tasks remain in the active scope, mark the scope complete and advance.
+
+## Task Execution Pattern
+
+For each selected task:
+
+```bash
+bd update <TASK_ID> --claim --json
+bd update <TASK_ID> --status in_progress --json
+bd show <TASK_ID> --json
+```
+
+Then:
+
+- implement only the scoped task;
+- avoid opportunistic unrelated refactors;
+- run verification named in the bead or plan;
+- inspect changed files with `git diff`;
+- fix deviations before committing;
+- create a focused conventional commit;
+- push the branch;
+- update PR notes if a PR exists;
+- close the bead after acceptance is met.
+
+Close shape:
+
+```bash
+bd close <TASK_ID> --reason "Completed: <short evidence summary>" --json
+```
+
+## Discovery Handling
+
+When discovering follow-up work:
+
+```bash
+bd create "<follow-up title>" \
+  --description "<what was discovered, why it matters, and suggested next step>" \
+  -t <bug|feature|task|chore> \
+  -p <0-4> \
+  --deps discovered-from:<TASK_ID> \
+  --json
+```
+
+Keep the current task open if its declared acceptance criteria are not complete. Do not widen the current PR unless the follow-up is required for the selected task.
+
+## Stop Conditions
+
+Stop immediately on:
+
+- blocked dependency or missing prerequisite;
+- failing verification that cannot be resolved within the current task scope;
+- unclear plan/spec that would make implementation unsafe;
+- inconsistent Beads state preventing deterministic selection;
+- merge conflicts or PR/CI failures that require separate focused work.
+
+When stopping, leave a bead note:
+
+```bash
+bd note <TASK_ID> "Stopped: <reason>; current branch/worktree: <path>; next action: <action>"
+bd dolt push
+```
+
+## Completion Output
+
+Return one summary with:
+
+1. epic ID and plan file;
+2. tasks completed in order;
+3. commits created;
+4. PR URL or branch;
+5. checks run and results;
+6. new beads created from discoveries;
+7. blocked tasks or stop reason;
+8. next recommended action.
diff --git a/.agents/skills/beads-execplan-issue-creator/SKILL.md b/.agents/skills/beads-execplan-issue-creator/SKILL.md
new file mode 100644
index 00000000..89efe464
--- /dev/null
+++ b/.agents/skills/beads-execplan-issue-creator/SKILL.md
@@ -0,0 +1,145 @@
+---
+name: beads-execplan-issue-creator
+description: Optional; use only when the user/AO explicitly asks to convert an approved implementation plan or ExecPlan into dependency-aware Beads epics/issues for AgentV work.
+---
+
+# Beads ExecPlan Issue Creator
+
+This is an optional Beads planning playbook. In AO-managed sessions, do not use it unless the user or AO explicitly assigns Beads planning. AO remains the live orchestration layer; GitHub remains the external collaboration record.
+
+Convert one approved plan into high-quality Beads tracking in two passes:
+
+1. Create epics and issues with explicit hierarchy and true blocker dependencies.
+2. Review and polish the created graph so implementers can execute from fresh context with minimal ambiguity.
+
+## Inputs
+
+- `PLAN_FILE`: approved implementation plan path.
+- `ROOT_EPIC_ID` optional: existing epic to attach work under.
+
+If `ROOT_EPIC_ID` is omitted, create a root epic from the plan title and purpose.
+
+## Rules
+
+- Use `bd ... --json` for Beads operations only after explicit assignment.
+- Do not create Beads as a parallel live tracker for AO-managed work.
+- Do not invoke `ep-spawn-agent`, launch unmanaged agents, or create extra worktrees.
+- Use `--dry-run` before large `bd create` bursts when the command supports it.
+- Keep plan markdown as planning input; Beads can become durable backlog/planning context for the explicitly assigned Beads scope, but AO remains the live execution source of truth.
+- Prefer fewer high-confidence beads over many vague beads.
+- Ask for clarification when dependency edges or scope boundaries are ambiguous enough to risk incorrect work.
+- Do not serialize independent work. Use dependencies only for true blockers.
+
+## Parse The Plan
+
+Extract:
+
+- plan title and purpose;
+- milestones or phases;
+- concrete implementation steps;
+- validation and acceptance criteria;
+- interfaces, dependencies, and invariants;
+- idempotence, recovery, and safety constraints;
+- explicit non-goals.
+
+## Build The Graph Before Creating
+
+Model:
+
+- root phase epic;
+- child milestone epics when the plan has major phases;
+- implementation issues under the relevant epic;
+- blocker dependencies only where work cannot start without another bead;
+- parallelization notes where independent tracks can run concurrently.
+
+## Create Beads
+
+Use clear descriptions. For epics, include:
+
+```markdown
+## Context
+<why this epic exists>
+
+## Success Criteria
+- <verifiable outcome>
+
+## Dependencies and Parallelization
+- Blocked by: <ids or none>
+- Can run in parallel with: <ids or none>
+```
+
+For tasks/features/chores/bugs, include:
+
+```markdown
+## Context
+<why this work exists>
+
+## Detailed Design
+<technical approach and boundaries>
+
+## Acceptance Criteria
+- <observable behavior>
+
+## Verification
+- <command or explicit test path>
+
+## Parallelization Notes
+- Blocked by: <ids or none>
+- Parallel with: <ids or none>
+
+## Invariants
+- <must remain true>
+```
+
+Command shape:
+
+```bash
+bd create "<title>" \
+  --description "<well-structured description>" \
+  -t <epic|feature|task|chore|bug> \
+  -p <0-4> \
+  --parent <optional-parent-id> \
+  --deps discovered-from:<root-epic-id>[,<true-blocker-id>...] \
+  --json
+```
+
+## Review And Polish Pass
+
+After creation:
+
+```bash
+bd show <ROOT_EPIC_ID> --json
+bd children <ROOT_EPIC_ID> --json
+bd list --json
+```
+
+Check every created bead for:
+
+- clear title with actor/outcome/scope;
+- complete description sections;
+- specific acceptance criteria;
+- concrete verification commands;
+- correct dependency direction;
+- no accidental dependency cycles;
+- no unnecessary serialization;
+- enough context for a fresh worker to execute.
+
+Polish with:
+
+```bash
+bd update <ID> --title "<better title>" --description "<polished description>" --json
+bd dep add <ISSUE_ID> <BLOCKER_ID> --json
+bd dep remove <ISSUE_ID> <BLOCKER_ID> --json
+```
+
+## Output
+
+Return:
+
+1. root epic ID;
+2. created epics and tasks;
+3. dependency summary;
+4. parallel work lanes;
+5. verification strategy;
+6. any ambiguities or human decisions needed;
+7. recommended first `bd ready --json --parent <ROOT_EPIC_ID>` command.
diff --git a/.claude/skills b/.claude/skills
new file mode 120000
index 00000000..2b7a412b
--- /dev/null
+++ b/.claude/skills
@@ -0,0 +1 @@
+../.agents/skills
\ No newline at end of file
diff --git a/.gitignore b/.gitignore
index fe8ad9cf..c3665720 100644
--- a/.gitignore
+++ b/.gitignore
@@ -25,8 +25,10 @@ examples/**/*.results.jsonl
 agent-orchestrator.yaml
 
 # Agent configuration and activity logs
-.agents/
-.claude/
+.agents/*
+!.agents/skills/
+.claude/*
+!.claude/skills
 .opencode/
 .ao/
 
@@ -35,3 +37,8 @@ agent-orchestrator.yaml
 .runtime/
 .logs/
 state.json
+
+# Beads / Dolt files (added by bd init)
+.dolt/
+*.db
+.beads-credential-key
diff --git a/AGENTS.md b/AGENTS.md
index 366d1381..378a40f6 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,614 +1,62 @@
 # AgentV Repository Guidelines
 
-This is a TypeScript monorepo for AgentV - an AI agent evaluation framework.
-
-## High-Level Goals
-
-AgentV aims to provide a robust, declarative framework for evaluating AI agents.
-- **Declarative Definitions**: Define tasks, expected outcomes, and rubrics in simple YAML files.
-- **Structured Evaluation**: Use "Rubric as Object" (Google ADK style) for deterministic, type-safe grading.
-- **Multi-Objective Scoring**: Measure correctness, latency, cost, and safety in a single run.
-- **Optimization Ready**: Designed to support future automated hyperparameter tuning and candidate generation.
-
-## Design Principles
-
-These principles guide all feature decisions. **Follow these when proposing or implementing changes.**
-
-### 1. Lightweight Core, Plugin Extensibility
-AgentV's core should remain minimal. Complex or domain-specific logic belongs in plugins, not built-in features.
-
-**Extension points (prefer these over adding built-ins):**
-- `code-grader` scripts for custom evaluation logic
-- `llm-grader` graders with custom prompt files for domain-specific LLM grading
-- CLI wrappers that consume AgentV's JSON/JSONL output for post-processing (aggregation, comparison, reporting)
-
-**Ask yourself:** "Can this be achieved with existing primitives + a plugin or wrapper?" If yes, it should not be a built-in. This includes adding config overrides to existing graders — if a niche provider needs custom tool-name matching, that's a code-grader, not a new config field.
-
-### 2. Built-ins for Primitives Only
-Built-in graders provide **universal primitives** that users compose. A primitive is:
-- Stateless and deterministic
-- Has a single, clear responsibility
-- Cannot be trivially composed from other primitives
-- Needed by the majority of users
-
-If a feature serves a niche use case or adds conditional logic, it belongs in a plugin.
-
-### 3. Maximize Feature Surface Through Composition
-The goal is to achieve the **maximum feature surface with the minimum primitives** due to high reusability. Before proposing a new feature, enumerate which existing primitives could achieve the same outcome when composed:
-
-- **Oracle validation** is not a feature — it's a `cli` provider target that runs a reference solution through the same evaluators.
-- **Snapshot MCP for benchmarks** is not a feature — it's frozen data in the workspace template + `before_all`/`after_all` hooks to start/stop the server.
-- **Harness variant comparison** is not a feature — it's target hooks with different `before_each` setup scripts.
-- **Skill evaluation** is not a feature — it's `tool-trajectory` + `execution-metrics` + `rubric` composed via `composite`.
-
-**If existing primitives cover it, document the pattern instead of building a feature.** New primitives are justified only when the composition is impossible, not merely when it's undocumented.
-
-### 4. Align with Industry Standards
-Before adding features, research how peer frameworks solve the problem. Prefer the **lowest common denominator** that covers most use cases. Novel features without industry precedent require strong justification and should default to plugin implementation.
-
-### 5. YAGNI — You Aren't Gonna Need It
-Don't build features until there's a concrete need. Before adding a new capability, ask: "Is there real demand for this today, or am I anticipating future needs?" Numeric thresholds, extra tracking fields, and configurable knobs should be omitted until users actually request them. Start with the simplest version (e.g., boolean over numeric range) and extend later if needed.
-
-**YAGNI applies to *how* you meet a real request, not just *whether* to meet it.** The common failure mode is not "I built X and nobody wanted it." It's "someone asked for X and I built a bigger X than they asked for." Guard against that with these habits:
-
-1. **Audit existing primitives before adding new ones.** When an issue asks for capability Y, the first question is not "how do I build Y?" — it's **"what does the codebase already do that addresses Y?"** Grep for existing functions, endpoints, and config shapes. Many requests are satisfied by a behavior that already exists and just needs to be surfaced, configured, or exercised differently.
-2. **Treat issue language as a hint, not a spec.** Issues describe problems *and* implementations. "We need a discovery root" is one implementation of "we need the registry to update live." When an issue lists multiple acceptable approaches (or its acceptance criteria don't actually require the implementation it names), pick the one with the least code surface. Summarize the acceptance criteria in your own words, strip out implementation nouns ("discovery root," "watcher," "registry reload"), then match them against existing primitives before designing anything new.
-3. **Prefer data/config changes over new mechanisms.** If the observable effect is "this list should be editable at runtime," prefer "re-read the file per request" over "add a watcher + a new field + a precedence rule + a new endpoint." Config-driven beats code-driven when both are sufficient.
-4. **Stop when scope doubles.** If an implementation's surface area grows more than ~2× the starting estimate (extra types, extra endpoints, extra invariants), that's a red flag to re-plan, not a sign to push through. Pause and ask: "What would the smallest possible version look like? Does the issue actually require more than that?"
-5. **If you are about to add a second mode, two-layer precedence, or an invariant between two optional fields, stop.** `source: manual | discovered`, "pinned wins over discovered," `excluded_paths` filtering the discovered set — every one of these is a sign that you're in complexity territory that a simpler data model would have avoided.
-
-**Call out existing overengineering.** If, while working on a task, you notice a *current* feature in the repo that looks overengineered relative to what it's used for (multiple modes, optional precedence rules, dead-looking extensibility scaffolding), flag it — don't silently fix it. Open a tracking issue titled "cleanup: simplify X" that lists: the observable behavior today, the simpler model that would cover it, and the migration notes. Link to the code. Do not widen your current PR to absorb the cleanup unless the user asks.
-
-### 6. Non-Breaking Extensions
-New fields should be optional. Existing configurations must continue working unchanged.
-
-### 7. AI-First Design
-AI agents are the primary users of AgentV—not humans reading docs. Design for AI comprehension and composability.
-
-**Skills over rigid commands:**
-- Use Claude Code skills (or agent skill standards) to teach AI *how* to create evals, not step-by-step CLI instructions
-- Skills should cover most use cases; rigid commands sacrifice AI intelligence for predictability
-- Only prescribe exact steps where there's an established best practice
-
-**Intuitive primitives:**
-- Expose simple, single-purpose primitives that AI can combine flexibly
-- Avoid monolithic commands that do multiple things
-- SDK internals should be intuitive enough for AI to modify when needed
-
-**Self-documenting code:**
-- File headers should explain what the file does, how it works, and how to extend it — no need to read other files to understand this one
-- Don't reference external projects, PRs, or issues in code comments; make everything standalone
-- Prefer data-driven patterns (static mappings, config tables) over conditional chains — AI can extend a mapping by adding an entry, but has to trace logic to extend an if/else tree
-- No dead code or speculative infrastructure; if it's unused, delete it
-- When a module has an extension point, include a short recipe in the header (e.g., "To add a new provider: 1. Create a matcher, 2. Add it to the mapping")
-- When changing a module's behavior, update its file header to match. Stale headers are worse than no headers.
-
-**Scope:** Applies to skills, repo structure, documentation, SDK design, and source code — anything AI might need to reason about or extend.
-
-## Tech Stack & Tools
-- **Language:** TypeScript 5.x targeting ES2022
-- **Runtime:** Bun (use `bun` for all package and script operations)
-- **Monorepo:** Bun workspaces
-- **Bundler:** tsup (TypeScript bundler)
-- **Linter/Formatter:** Biome
-- **Testing:** Vitest
-- **LLM Framework:** Vercel AI SDK
-- **Validation:** Zod
-
-## Project Structure
-- `packages/core/` - Evaluation engine, providers, grading
-  - `src/evaluation/registry/` - Extensible grader registry (EvaluatorRegistry, assertion discovery)
-  - `src/evaluation/providers/provider-registry.ts` - Provider plugin registry
-  - `src/evaluation/evaluate.ts` - `evaluate()` programmatic API
-  - `src/evaluation/config.ts` - `defineConfig()` for typed agentv.config.ts
-- `packages/eval/` - Lightweight assertion SDK (`defineAssertion`, `defineCodeGrader`)
-- `apps/cli/` - Command-line interface (published as `agentv`)
-  - `src/commands/create/` - Scaffold commands (`agentv create assertion/eval`)
-- `examples/features/sdk-*` - SDK usage examples (custom assertion, programmatic API, config file)
-
-## Working Style
-
-### Worktree Setup
-- For any feature, bug fix, or non-trivial repo change, work from a dedicated git worktree based on the latest `origin/main`.
-- Before starting implementation, run `git fetch origin` and verify your worktree `HEAD` is based on the current `origin/main` commit.
-- Do not implement from the primary checkout, from a stale local `main`, or from a branch created off an outdated base.
-- Default setup:
-```bash
-git fetch origin
-git worktree add ../agentv.worktrees/<type>-<short-desc> -b <type>/<issue-or-topic>-<short-desc> origin/main
-cd ../agentv.worktrees/<type>-<short-desc>
-```
-- If you discover you are not on a fresh worktree from the latest `origin/main`, stop and fix that first before changing code.
-
-### Planning
-- Use plan mode for any non-trivial task (5+ steps or architectural decisions).
-- If something goes sideways, STOP and re-plan immediately — don't keep pushing a broken approach.
-- For non-trivial changes, pause and ask: "Is there a more elegant solution?" before diving in.
-- Check in with the user before starting implementation on ambiguous tasks.
-- Prefer automation: execute the requested work without extra confirmation unless blocked by missing information, safety concerns, or an irreversible/destructive action the user has not approved.
-
-### Subagent Strategy
-- Use subagents aggressively to keep the main context window clean.
-- Use subagents for research, file exploration, running tests, and code review.
-- For complex problems, throw more subagents at them and parallelize where possible.
-- Name subagents descriptively.
-- Before declaring a repo change complete or opening/finalizing a PR, complete manual e2e verification first (see E2E Checklist), **then** spawn a subagent for a final code review pass. E2E must pass before code review — if e2e fails, fix the issue before investing time in review. The user may explicitly skip the review step.
-
-### Autonomous Bug Fixes
-- When you spot a bug, just fix it. Don't ask for hand-holding.
-- Point at logs, errors, failing tests — then resolve them.
-- Only ask when there's genuine ambiguity about intent.
-- Fix failing CI tests without being told.
-
-### Simplicity
-- Every change should be as simple as possible. Import existing code; don't reinvent.
-- Find root causes and fix them directly. No shotgun debugging.
-
-### Progress Updates
-- Provide high-level status updates at natural milestones.
-- When scope changes mid-task, communicate the shift and adjust the plan.
-- Use parallel tool calls when applicable, especially for independent reads, checks, and validation steps.
-
-### PR & Commit Titles
-- Prefer conventional commit style for branch-facing titles: `type(scope): summary`.
-- Use the repository's normal types where they fit, such as `feat`, `fix`, `chore`, `refactor`, `docs`, and `test`.
-- Use the most relevant module or product area as `scope`, such as `studio`, `cli`, `results`, or `evals`.
-- Do not prefix PR titles with `[codex]` unless the user explicitly requests it.
-
-## TypeScript Guidelines
-- Target ES2022 with Node 20+
-- Prefer type inference over explicit types
-- Use `async/await` for async operations
-- Prefer named exports
-- Keep modules cohesive
-
-## Naming Convention: "Project" vs "Benchmark"
-
-These two words have distinct, non-interchangeable meanings in this codebase. Get them right when adding new symbols, docs, or example dirs:
-
-- **Project** — the top-level container Studio organises around: a registered workspace directory (`.agentv/` + run artifacts + traces + experiments). Lives in `~/.agentv/projects.yaml`. Modelled by `ProjectEntry` / `ProjectRegistry` in `packages/core/src/projects.ts`. Matches the terminology used by Phoenix, Langfuse, Braintrust, W&B Weave, and LangSmith.
-- **Benchmark** — a curated *eval suite* designed to measure something specific (academic ML sense: MMLU, HumanEval, SWE-bench). Example dirs use this sense: `examples/showcase/multi-model-benchmark/`, `examples/showcase/offline-grader-benchmark/`, `examples/features/benchmark-tooling/`. Do not rename these — they are correctly named.
-
-The legacy registry file `~/.agentv/benchmarks.yaml` is auto-migrated to `projects.yaml` on first load by `migrateLegacyBenchmarksFile()`. The unrelated per-run `benchmark.json` artifact (Agent Skills compatibility output) is a third, separate concept — also keep that name.
-
-When in doubt: if the thing holds runs / traces / experiments, it's a **project**. If it's a curated set of eval cases meant to measure capability, it's a **benchmark**.
-
-## Wire Format Convention
-
-**Everything that crosses a process boundary uses `snake_case` keys. Internal TypeScript uses `camelCase`. Translate at the boundary — never in the middle.**
-
-The rule is blanket: if the key is going to disk, to a user's editor, into a JSON response, or onto a CLI, it's snake_case. There is no "well this file is internal-ish" carve-out. If in doubt, snake_case.
-
-### snake_case surfaces
-- All YAML files on disk: `*.eval.yaml`, `agentv.config.yaml`, `projects.yaml`, `studio/config.yaml`, any future YAML we add.
-- JSONL result files (`test_id`, `token_usage`, `duration_ms`).
-- Artifact-writer output (`pass_rate`, `tests_run`, `total_tool_calls`).
-- HTTP response bodies from `agentv serve` / Studio (`added_at`, `pass_rate`, `project_id`).
-- CLI JSON output (`agentv results summary`, `results failures`, `results show`).
-- Anything consumed by non-TS tooling (Python, jq pipelines, external dashboards).
-
-### camelCase surfaces
-- TypeScript source: all variables, parameters, fields, type members.
-- Internal in-memory shapes passed between TS modules.
-
-### Translate only at the boundary
-Define a second interface for the wire shape and convert in one place — don't smear snake_case through TS internals.
-
-```typescript
-// Wire shape — snake_case, matches what hits disk / the network
-interface ProjectEntryYaml {
-  id: string;
-  name: string;
-  path: string;
-  added_at: string;
-  last_opened_at: string;
-}
-
-// Internal shape — camelCase, what every TS call site sees
-interface ProjectEntry {
-  id: string;
-  name: string;
-  path: string;
-  addedAt: string;
-  lastOpenedAt: string;
-}
-
-function fromYaml(e: ProjectEntryYaml): ProjectEntry {
-  return { id: e.id, name: e.name, path: e.path, addedAt: e.added_at, lastOpenedAt: e.last_opened_at };
-}
-
-function toYaml(e: ProjectEntry): ProjectEntryYaml {
-  return { id: e.id, name: e.name, path: e.path, added_at: e.addedAt, last_opened_at: e.lastOpenedAt };
-}
-```
-
-Yes, this is two interfaces and two functions per entity. That's the price of keeping TS idiomatic while staying faithful to the wire contract. Don't skip it — dumping TS objects directly to YAML leaks `addedAt`-style camelCase onto disk and breaks jq/Python consumers.
-
-### Anti-patterns
-- `writeFileSync(path, stringifyYaml(tsObject))` — dumps TS field names verbatim. Wrong.
-- `interface Foo { testId: string; ... }` for a JSON response body — `test_id`, always.
-- Accepting both `testId` and `test_id` on input "for back-compat" when nothing is shipped yet. Just snake_case.
-
-### Existing divergences
-If you spot a camelCase key already on disk or in a response (e.g. a legacy endpoint), treat it as a bug: migrate it to snake_case in the same PR where you touch that code path. Don't grandfather it in.
-
-**Reading back:** `parseJsonlResults()` in `artifact-writer.ts` converts snake_case → camelCase when reading JSONL into TypeScript. `fromYaml` / `toYaml` in `packages/core/src/projects.ts` are the model to follow for YAML boundaries.
-
-**Why:** Aligns with skill-creator (claude-plugins-official) and broader Python/JSON ecosystem conventions where snake_case is the standard wire format.
-
-## Testing & Verification
-
-### Pre-Push Hooks (Automated)
-
-The repository uses [prek](https://github.com/j178/prek) (`@j178/prek`) for pre-push hooks that automatically run build, typecheck, lint, and tests before pushing. **Do not manually run these checks before pushing**; just push to the feature branch and let the pre-push hook validate.
-
-**Setup (automatic):**
-The hooks are installed automatically when you run `bun install` via the `prepare` script. To manually install:
-```bash
-bunx prek install -t pre-push
-```
-
-**What runs on push:**
-- `bun run build` - Build all packages
-- `bun run typecheck` - TypeScript type checking
-- `bun run lint` - Biome linting
-- `bun run test` - All tests
-- `bun run validate:examples` - Validate example eval YAML files against the agentv schema
-
-If any check fails, the push is blocked until the issues are fixed.
-
-**Manual run (without pushing):**
-```bash
-bunx prek run --all-files --hook-stage pre-push
-```
-
-### Functional Testing (CLI)
-
-When functionally testing changes to the AgentV CLI, **NEVER** use `agentv` directly as it may run the globally installed version (bun or npm). Instead:
-
-- **From TypeScript source (preferred):** `bun apps/cli/src/cli.ts <args>` — always runs current CLI code, no build step needed. **Exception:** changes inside `packages/core/` require `bun run build` first, because the CLI imports `@agentv/core` from its compiled `dist/`, not from TypeScript source.
-- **From built dist:** `bun apps/cli/dist/cli.js <args>` — requires `bun run build` first, can be stale
-- **From repository root:** `bun agentv <args>` — runs the locally built version (also requires build)
-
-**Prefer running from source** (`src/cli.ts`) during development. The dist build can silently serve stale code if you forget to rebuild after changes. After pulling changes that touch `packages/core/`, always run `bun run build` before CLI testing.
-
-**Studio frontend exception — rebuild `apps/dashboard/dist/` before UAT.** Running `agentv studio` from source (`bun apps/cli/src/cli.ts studio ...`) only reloads the CLI and backend routes from source. The Studio web UI (React/Tailwind bundle) is served as static assets from `apps/dashboard/dist/`, which is build output and does **not** recompile on change. If you are testing Studio UI changes — especially post-merge on `main` or after pulling — rebuild the frontend first:
-
-```bash
-cd apps/dashboard && bun run build
-```
-
-Skipping this step silently serves the previous bundle, so you'll see the old UI even though your source edits and the backend API are live. This has burned at least one post-merge UAT; always rebuild before screenshotting or driving Studio with `agent-browser`.
-
-### Browser E2E Testing (Docs Site)
-
-Use `agent-browser` for visual verification of docs site changes. Environment-specific rules:
-
-- **Always use `--session <name>`** — isolates browser instances; close with `agent-browser --session <name> close` when done
-- **Never use `--headed`** — no display server available; headless (default) works correctly
-
-**Troubleshooting: `--session` hangs with EAGAIN on ARM64**
-
-If `agent-browser --session <name> open <url>` consistently fails with "Resource temporarily unavailable" or times out, Chrome is taking longer to start than the client's retry window. Workaround: pre-start Chrome manually and use `--cdp`:
-
-```bash
-nohup chromium --headless=new --remote-debugging-port=9222 \
-  --no-first-run --disable-background-networking --disable-default-apps \
-  --disable-sync --ozone-platform=headless --window-size=1280,720 \
-  --user-data-dir=/tmp/ab-chrome > /tmp/chrome.log 2>&1 &
-curl -s http://localhost:9222/json/version  # verify ready
-
-agent-browser --cdp 9222 open <url>
-agent-browser --cdp 9222 screenshot output.png
-```
-
-### Agent Provider Eval Concurrency
-
-When running evals against agent provider targets (claude, claude-sdk, codex, copilot, copilot-sdk, pi, pi-cli), **limit concurrency to 3 targets at a time**. Each agent provider spawns heavyweight subprocesses (CLI binaries, SDK sessions) that consume significant memory and CPU. Running more than 3 in parallel can exhaust system resources.
-
-```bash
-# Good: batch targets in groups of 2-3
-bun apps/cli/src/cli.ts eval my.eval.yaml --target claude &
-bun apps/cli/src/cli.ts eval my.eval.yaml --target codex &
-wait
-bun apps/cli/src/cli.ts eval my.eval.yaml --target copilot &
-bun apps/cli/src/cli.ts eval my.eval.yaml --target pi &
-wait
-```
-
-This does not apply to lightweight LLM-only targets (azure, openai, gemini, openrouter) which can run with higher concurrency.
-
-### Writing Tests
-
-Tests should be lean and focused on what matters. Follow these principles:
-
-- **Only test new or changed behavior.** Don't write tests for existing behavior that's already covered by the 1600+ core tests. If you fix a bug, test the fix and its edge cases — not the surrounding module.
-- **One test per distinct behavior.** Don't write separate tests for trivially different inputs that exercise the same code path.
-- **No tests for obvious code.** If a function returns `undefined` for missing input and that's a one-line null check, you don't need a test for it unless it's a regression risk.
-- **Regression tests > comprehensive tests.** A test that would have caught the bug is worth more than five tests that exercise happy paths.
-- **Tests are executable contracts.** When a module's behavioral contract changes, the tests must reflect the new contract — not just the happy path. If you change what a function promises, update its tests to assert the new promise.
-
-### Verifying Grader Changes
-
-Unit tests alone are insufficient for grader changes. After implementing or modifying graders:
-
-1. **Copy `.env` to the worktree** if running in a git worktree (e2e tests need environment variables):
-   ```bash
-   cp /path/to/main/.env .env
-   ```
-   ```powershell
-   Copy-Item D:/path/to/main/.env .env
-   ```
-   Do not claim e2e or grader verification results unless this preflight has passed.
-
-2. **Run an actual eval** with a real example file:
-   ```bash
-   bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id <test-id>
-   ```
-
-3. **Inspect the results JSONL** to verify:
-   - The correct grader type is invoked (check `scores[].type`)
-   - Scores are calculated as expected
-   - Assertions array reflects the evaluation logic (each entry has `text`, `passed`, optional `evidence`)
-
-4. **Update baseline files** if output format changes (e.g., type name renames). Baseline files live alongside eval YAML files as `*.baseline.jsonl` and contain expected `scores[].type` values. There are 30+ baseline files across `examples/`.
-
-5. **Note:** `--dry-run` returns schema-valid mock responses for both agent output and grader evaluation (score=1, empty assertions/checks). Built-in LLM graders run without parse errors, but the scores are meaningless. Use it to test the end-to-end harness, including grader plumbing, not to validate scoring.
-
-### Checking Grader Score Ranges (manual e2e)
-
-`scripts/check-grader-scores.ts` is a post-processor that asserts each grader's score on each test case falls within an expected range. Run it manually after an eval to catch grader regressions (false positives / false negatives) before merging.
-
-**Workflow:**
-```bash
-# 1. Run the eval, writing results to a sibling *.results.jsonl file
-bun apps/cli/src/cli.ts eval examples/path/to/suite.eval.yaml --target azure \
-  --out examples/path/to/suite.results.jsonl
-
-# 2. Assert all expected score ranges pass
-bun scripts/check-grader-scores.ts
-```
-
-The script auto-discovers `examples/**/*.grader-scores.yaml`, locates the sibling `*.results.jsonl` (same stem), and exits non-zero if any score is out of range.
-
-**To add score checks for a new eval:**
-1. Create `<eval-stem>.grader-scores.yaml` next to the eval YAML.
-2. Add entries for each `(test_id, grader, range)` you care about — `grader` must match a `scores[].name` value in the JSONL output, and `range.min`/`range.max` default to 0/1 if omitted.
-3. Run the eval with `--out <eval-stem>.results.jsonl`, then run the script.
-
-See `examples/red-team/archetypes/coding-agent/suites/screenshot-pii-upload.grader-scores.yaml` for a concrete example.
-
-### Completing Work — E2E Checklist
-
-Before marking any branch as ready for review, complete this checklist:
-
-1. **Preflight:** If in a git worktree, ensure `.env` exists in the worktree root.
-   ```bash
-   cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env
-   ```
-   Without this, any eval run or LLM-dependent test will fail with missing API key errors.
-
-2. **Run unit tests**: `bun run test` — all must pass.
-
-3. **⚠️ BLOCKING: Manual red/green UAT — must complete before steps 4-5:**
-   Unit tests passing is NOT sufficient. Every change must be manually verified from the end user's perspective. Do NOT skip this step or proceed to step 4 until red/green evidence is documented.
-
-   - **Red (before your changes):** Run the scenario on `main` (or the code state before your changes). Confirm the bug or missing feature is observable from the CLI / user-facing output. Capture the output.
-   - **Green (with your changes):** Run the identical scenario with your branch. Confirm the fix or feature works correctly from the end user's perspective. Capture the output.
-   - **Document both** red and green results in the PR description or comments so reviewers can see the before/after evidence.
-
-   For grader changes, this means running a real eval (not `--dry-run`) and inspecting the output JSONL. For CLI/UX changes, this means running the CLI command and verifying the console output.
-
-4. **Verify no regressions** in areas adjacent to your changes (e.g., if you changed grader parsing, run an eval that exercises different grader types).
-
-5. **Live eval verification**: For changes affecting scoring, thresholds, or grader behavior, run at least one real eval with a live provider (not `--dry-run`) and verify the output JSONL has correct scores, verdicts, and execution status.
-
-6. **Studio UX verification**: For changes affecting config, scoring display, or studio API, use `agent-browser` to verify the studio UI still renders and functions correctly (settings page loads, pass/fail indicators are correct, config saves work).
-
-7. **Mark PR as ready** only after steps 1-6 have been completed AND red/green UAT evidence is included in the PR.
-
-## Documentation Updates
-
-When making changes to functionality:
-
-1. **Docs site** (`apps/web/src/content/docs/`): Update human-readable documentation on agentv.dev. This is the comprehensive reference.
-
-2. **Skill files** (`plugins/agentv-dev/skills/agentv-eval-builder/`): Update the AI-focused reference card if the change affects YAML schema, grader types, or CLI commands. Keep concise — link to docs site for details.
-
-3. **Examples** (`examples/`): Update any example code, scripts, or eval YAML files that exercise the changed functionality. Examples are both documentation and integration tests.
-
-4. **README.md**: Keep minimal. Links point to agentv.dev.
-
-## Grader Type System
-
-Grader types use **kebab-case** everywhere (matching promptfoo convention):
-
-- **YAML config:** `type: llm-grader`, `type: is-json`, `type: execution-metrics`
-- **Internal TypeScript:** `EvaluatorKind = 'llm-grader' | 'is-json' | ...`
-- **Output `scores[].type`:** `"llm-grader"`, `"is-json"`
-- **Registry keys:** `registry.register('llm-grader', ...)`
-
-**Source of truth:** `EVALUATOR_KIND_VALUES` array in `packages/core/src/evaluation/types.ts`
-
-**Backward compatibility:** Snake_case is accepted in YAML (`llm_judge` → `llm-grader`) via `normalizeGraderType()` in `grader-parser.ts`. Single-word types (`contains`, `equals`, `regex`, `latency`, `cost`) have no separator and are unchanged.
-
-**Two type definitions exist:**
-- `EvaluatorKind` in `packages/core/src/evaluation/types.ts` — internal, canonical
-- `AssertionType` in `packages/eval/src/assertion.ts` — SDK-facing, must stay in sync
-
-## Git Workflow
-
-### Commit Convention
-
-Follow conventional commits: `type(scope): description`
-
-Types: `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore`
-
-### Issue Workflow
-
-When working on a GitHub issue, **ALWAYS** follow this workflow:
-
-1. **Claim the issue** — prevents other agents from duplicating work by stamping Agent ID and setting status on the project board:
-   ```bash
-   # Load AGENT_ID from .env; if not set, ask the user or default to <harness>-<model>
-   # Harness = the coding tool (claude-code, opencode, codex-cli, cursor, etc.)
-   # Model = the LLM (opus, sonnet, o3, etc.)
-   # Examples: "claude-code-opus", "opencode-sonnet", "cursor-o3", "codex-cli-o3"
-   # In this local dev environment, default to "devbox2-codex" unless the user specifies another AGENT_ID.
-   # Do NOT use hostname or machine name.
-   source .env 2>/dev/null
-   if [ -z "$AGENT_ID" ]; then
-     echo "AGENT_ID is not set. Ask the user for an agent identifier, or default to devbox2-codex in this environment (otherwise use <harness>-<model>)."
-   fi
-
-   # Check if already claimed via project board status
-   ITEM_ID=$(gh project item-list 1 --owner EntityProcess --format json | jq -r '.items[] | select(.content.number == <number> and .content.repository == "EntityProcess/agentv") | .id')
-   CURRENT_STATUS=$(gh project item-list 1 --owner EntityProcess --format json | jq -r '.items[] | select(.content.number == <number> and .content.repository == "EntityProcess/agentv") | .status')
-   [ "$CURRENT_STATUS" = "In Progress" ] && echo "SKIP — already claimed" && exit 1
-
-   # Update project roadmap: ensure the issue is on the AgentV OSS board,
-   # then set status to "In Progress" and stamp Agent ID
-   if [ -z "$ITEM_ID" ] || [ "$ITEM_ID" = "null" ]; then
-     ITEM_ID=$(gh project item-add 1 --owner EntityProcess --url "https://github.com/EntityProcess/agentv/issues/<number>" --format json | jq -r '.id')
-   fi
-   if [ -n "$ITEM_ID" ]; then
-     gh project item-edit --project-id PVT_kwDOAIbbRc4BSmjF --id "$ITEM_ID" --field-id PVTSSF_lADOAIbbRc4BSmjFzhAFomw --single-select-option-id c3991b20
-     gh project item-edit --project-id PVT_kwDOAIbbRc4BSmjF --id "$ITEM_ID" --field-id PVTF_lADOAIbbRc4BSmjFzhAHSnk --text "$AGENT_ID"
-   fi
-   ```
-   If the issue has project board status "In Progress", **do not work on it** — pick a different issue.
-
-2. **Update local `main` to the latest `origin/main`** before branching:
-   ```bash
-   git checkout main
-   git pull --ff-only origin main
-   ```
-
-3. **Create a worktree** with a feature branch:
-   ```bash
-   git worktree add ../agentv.worktrees/<branch-name> -b <type>/<issue-number>-<short-description>
-   cd ../agentv.worktrees/<branch-name>
-   bun install
-   cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env
-   # Example: git worktree add ../agentv.worktrees/feat/42-add-new-embedder -b feat/42-add-new-embedder
-   ```
-
-   The feature branch must be based on the freshly updated `main`, not a stale local checkout.
-
-4. **After your first commit, push and open a draft PR immediately:**
-   ```bash
-   git push -u origin <branch-name>
-   gh pr create --draft --title "<type>(scope): description" --body "Closes #<issue-number>"
-   ```
-   Do NOT wait until implementation is complete. The draft PR is a handoff artifact — if the session is interrupted, the user or another agent can pick up where you left off.
-
-5. **Implement the changes.** Commit and push incrementally as you work. Every meaningful checkpoint (feature compiles, tests pass, new behavior added) should be pushed to the draft PR so progress is visible and recoverable.
-
-6. **Complete E2E verification** (see "Completing Work — E2E Checklist") — this is BLOCKING. Do NOT mark the PR ready for review until every step of the E2E checklist has passed and evidence is documented in the PR body. Specifically:
-   1. Run unit tests.
-   2. Execute every test plan item from the issue/PR checklist, mark each `[x]`, and paste CLI output as evidence.
-   3. Manual red/green UAT with before/after evidence.
-   4. **After e2e passes**, spawn a final subagent code review pass and address or call out any findings — **unless the change is focused** (single-responsibility, well-tested, no architectural impact), in which case this step may be skipped. Do NOT run the code review before e2e — if e2e fails you'll need to fix it first, which invalidates the review.
-   5. CI pipeline passes (all checks green).
-   6. No merge conflicts with `main`.
-
-7. **Only after verification is complete**:
-   - Mark the draft PR ready for review, or
-   - Merge directly if the change is low risk and the repo policy allows it
-
-8. **After merge, clean up local state**:
-   - Delete the local feature branch
-   - Remove the local worktree created for the issue
-   - Confirm the primary checkout is back on an up-to-date `main`
-
-**IMPORTANT:** Never push directly to `main`. Always use branches and PRs.
-
-### Tracker Conventions
-
-- The roadmap project is the source of truth for prioritization and claim status — use it, not labels.
-- Issues in the roadmap are prioritized; issues outside it are not.
-- `bug` marks defects.
-- Issues without `bug` are non-bug work by default.
-- `core`, `wui`, and `tui` are area labels.
-- Keep issue bodies focused on the handoff contract: objective, design latitude, acceptance signals, non-goals, and related links.
-- Do not put priority metadata in issue bodies.
-
-### Pull Requests
-
-**Always use squash merge** when merging PRs to main. This keeps the commit history clean with one commit per feature/fix.
-
-```bash
-# Using GitHub CLI to squash merge a PR
-gh pr merge <PR_NUMBER> --squash --delete-branch
-
-# Or with auto-merge enabled
-gh pr merge <PR_NUMBER> --squash --auto
-```
-
-Do NOT use regular merge or rebase merge, as these create noisy commit history with intermediate commits.
-
-### After Squash Merge
-
-Once a PR is squash-merged, its source branch diverges from main. **Do NOT** try to push additional commits from that branch—you will get merge conflicts.
-
-For follow-up fixes:
-```bash
-git checkout main
-git pull origin main
-git checkout -b fix/<short-description>
-# Apply fixes on the fresh branch
-```
-
-### Plans and Worktrees
-
-#### Plans
-
-Design documents and implementation plans are stored in `docs/plans/` inside the worktree (not the main repo). Save plans to the worktree so they are committed on the feature branch and visible in the draft PR.
-
-**Path warning:** When working in a worktree, use paths relative to the worktree root (e.g., `docs/plans/plan.md`). Do NOT prefix with the worktree directory from the main repo (e.g., `agentv.worktrees/feat/xxx/docs/plans/plan.md`) — this creates accidental nested directories inside the worktree.
-
-Plans are temporary working materials. **Before merging the PR**, delete the plan file and incorporate any user-relevant details into the official documentation.
-
-#### Git Worktrees
-
-Use the sibling `../agentv.worktrees/` directory for all AgentV worktrees. This overrides any generic skill or default preference for `.worktrees/` or `worktrees/` inside the repository. Do not create new AgentV worktrees inside the repository root.
-
-After creating a worktree, always run setup:
-```bash
-bun install                                    # worktrees do NOT share node_modules
-cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env    # required for e2e tests and LLM operations
-```
-Both steps are required before running builds, tests, or evals in the worktree.
-
-### After Checking Out an Existing Branch or PR
-
-Whenever you `git checkout`, `gh pr checkout`, `git pull`, or otherwise switch to a ref that may have changed `package.json` / `bun.lock`, run `bun install` before building, testing, or pushing. The pre-push hook builds all workspaces — if dependencies are stale, the push fails with errors like `Cannot find module 'recharts'` even though the source change is unrelated. `bun install` is cheap when already up-to-date, so run it by default after any ref switch.
-
-## Version Management
-
-This project uses a simple release script for version bumping. The git commit history serves as the changelog.
-
-### Releasing a new version
-
-Use the **GitHub Actions workflows** — do not publish manually from a local machine.
-
-**Standard flow (pre-release → stable):**
-1. Run the [Release workflow](https://github.com/EntityProcess/agentv/actions/workflows/release.yml) with `channel=next` (and desired bump: patch/minor/major). This bumps the version to `x.y.z-next.1`, commits, tags, and pushes.
-2. The [Publish workflow](https://github.com/EntityProcess/agentv/actions/workflows/publish.yml) triggers automatically and publishes to npm `next`.
-3. Run the [Release workflow](https://github.com/EntityProcess/agentv/actions/workflows/release.yml) with `channel=finalize`. This strips the `-next.N` suffix (e.g. `4.12.0-next.1` → `4.12.0`), commits, tags, and pushes.
-4. The Publish workflow triggers automatically and publishes to npm `latest`.
-
-**Direct stable release (skip pre-release):**
-1. Run the Release workflow with `channel=stable` (and bump).
-2. The Publish workflow triggers automatically and publishes to npm `latest`.
-
-The release script (`bun scripts/release.ts`) is what the Release workflow calls. It can also be run locally for non-publishing tasks (e.g. inspecting version state), but **do not run `bun run publish` or `bun run publish:next` locally**: npm publish uses OIDC trusted publishing, which only works in GitHub Actions.
-
-## Package Publishing
-- Core package (`packages/core/`): core evaluation engine and grading logic, published as `@agentv/core`.
-- CLI package (`apps/cli/`): published as `agentv` on npm.
-- The CLI build uses tsup with `noExternal: ["@agentv/core"]` to bundle workspace dependencies.
-- Install command: `bun install -g agentv` (preferred) or `npm install -g agentv`.
-
-## Python Scripts
-When running Python scripts, always use: `uv run <script.py>`
+This is a TypeScript monorepo for AgentV, an AI agent evaluation framework.
+
+## Load Skills First
+
+Keep this file as bootstrap context. Detailed AgentV playbooks live in committed skills under `.agents/skills/`, following the Phoenix-style repo skill layout. `.claude/skills` is a symlink to the same directory for Claude compatibility.
+
+Before non-trivial work, load the relevant skill:
+
+- `agentv-core-development`: core design principles, TypeScript conventions, naming, snake_case wire formats, docs, examples, and repo structure.
+- `agentv-testing-verification`: CLI testing, Studio/browser verification, grader e2e checks, pre-push hooks, and PR readiness evidence.
+- `agentv-git-workflow`: AO-first session/worktree/PR lifecycle, GitHub collaboration, pushing, merging, and cleanup.
+- `beads-execplan-issue-creator`: optional, only when the user/AO explicitly assigns Beads planning; convert approved ExecPlans into dependency-aware bead epics/tasks.
+- `beads-epic-delivery-loop`: optional, only when the user/AO explicitly assigns Beads execution; execute a bead epic without spawning unmanaged agents.
+- `agentv-grader-changes`: grader/evaluator type changes, score output, baselines, live eval verification, and score-range checks.
+- `agentv-release-publishing`: versioning, release automation, and package publishing.
+
+## Always-On Rules
+
+- Use Bun for all package and script operations.
+- Run Python scripts with `uv run <script.py>`.
+- Internal TypeScript uses `camelCase`; anything crossing a process boundary uses `snake_case`. Translate at the boundary.
+- Keep AgentV core lightweight. Prefer existing primitives, plugins, examples, and docs over new built-ins.
+- Do not use global `agentv` for CLI testing. Use `bun apps/cli/src/cli.ts <args>`; rebuild first when `packages/core/` changes.
+- For Studio UI verification, rebuild `apps/studio/dist/` before UAT or screenshots.
+- In AO-managed sessions, use the AO-provided worktree/session/PR lifecycle. Do not create a second worktree, session, tracker, or PR unless AO/user explicitly asks.
+- Outside AO, use a fresh sibling worktree under `../agentv.worktrees/` based on latest `origin/main` for non-trivial repo changes.
+- Never push directly to `main`. Push feature branches and open/update PRs.
+- Use conventional commit and PR titles: `type(scope): summary`.
+- Do not create competing task trackers or memory files. AO is the orchestration layer for live work; GitHub is the external collaboration record. Use Beads only when explicitly assigned.
+
+## Safety Guardrails
+
+- The user is in charge. If an explicit user instruction conflicts with repo habits, follow the user unless it would be unsafe or impossible.
+- Do not delete files or folders without explicit permission. This includes temporary files you created unless the user already approved that cleanup.
+- Never run destructive cleanup/reset commands such as `git reset --hard`, `git clean -fd`, or broad `rm -rf` unless the user gives the exact command and explicitly confirms the irreversible consequences.
+- Prefer non-destructive recovery: inspect with `git status` / `git diff`, move aside, stash, or ask before overwriting work.
+- Do not push directly to `main`; all code changes land through branches and PRs.
+
+## Key Paths
+
+- `packages/core/`: evaluation engine, providers, grading, registry, programmatic API.
+- `packages/eval/`: lightweight assertion SDK.
+- `apps/cli/`: CLI published as `agentv`.
+- `apps/studio/`: Studio frontend.
+- `apps/web/`: documentation site.
+- `examples/`: examples that double as documentation and integration coverage.
+- `.agents/skills/`: committed coding-agent skills.
+
+## Orchestration and Tracking
+
+AO (Composio Agent Orchestrator) is AgentV's orchestration layer for live coding work. In an AO-managed session:
+
+- Treat the AO session as the source of truth for assignment, status, worker ownership, worktree lifecycle, PR claiming, and visualization.
+- Report progress with `ao acknowledge`, `ao report working`, `ao report fixing-ci`, `ao report addressing-reviews`, `ao report needs-input`, and PR milestone reports as appropriate.
+- When taking over an existing PR, run `ao session claim-pr <number-or-url>` first. If AO or git shows another session/worktree owns it, coordinate instead of forcing checkout or creating a duplicate PR.
+- Do not spawn unmanaged coding agents, invoke `ep-spawn-agent`, create ad-hoc worktrees, or maintain a parallel live task tracker unless AO/user explicitly delegates that work.
+- GitHub remains the external collaboration surface for PRs, reviews, CI, issues, and human-visible handoff.
+- Beads (`bd`) may exist for durable planning or backlog records, but it is not the routine live execution tracker in AO-managed sessions. Use it only when explicitly assigned, and never let Beads state override AO session/PR ownership.
+
+Outside AO, follow the repo git workflow skill for manual worktree and PR handling.
diff --git a/biome.json b/biome.json
index 5e9695a9..bd201cd5 100644
--- a/biome.json
+++ b/biome.json
@@ -38,6 +38,7 @@
       "**/test-output/**",
       "**/__tmp_*/**",
       "**/.agentv/**",
+      ".beads/**",
       ".claude/**",
       ".opencode/**",
       ".entire/**",