EntityProcess · christso · Jun 2, 2026
diff --git a/.agents/skills/README.md b/.agents/skills/README.md
@@ -0,0 +1,15 @@
+# AgentV Coding Agent Skills
+
+This directory contains repo-local skills that teach coding agents how to work with AgentV. They are shared across compatible tools through `.agents/skills`, with `.claude/skills` symlinked here for Claude compatibility.
+
+## Skills
+
+| Skill | Description |
+| ----- | ----------- |
+| [agentv-core-development](agentv-core-development/) | Core design principles, TypeScript conventions, naming, wire-format rules, docs expectations, and project structure. |
+| [agentv-testing-verification](agentv-testing-verification/) | AgentV test strategy, CLI verification, grader e2e checks, browser verification, and pre-push behavior. |
+| [agentv-git-workflow](agentv-git-workflow/) | AO-first session/worktree/PR lifecycle, GitHub collaboration, manual fallback worktrees, existing PR takeover, and merge cleanup. |
+| [beads-execplan-issue-creator](beads-execplan-issue-creator/) | Optional when explicitly assigned: convert approved plans into dependency-aware bead epics/tasks with acceptance criteria, verification, and invariants. |
+| [beads-epic-delivery-loop](beads-epic-delivery-loop/) | Optional when explicitly assigned: execute a bead epic end-to-end without spawning unmanaged agents. |
+| [agentv-grader-changes](agentv-grader-changes/) | Grader type conventions, live eval verification, baseline updates, and score-range checks. |
+| [agentv-release-publishing](agentv-release-publishing/) | Versioning, release workflow, and package publishing. |
diff --git a/.agents/skills/agentv-core-development/SKILL.md b/.agents/skills/agentv-core-development/SKILL.md
@@ -0,0 +1,85 @@
+---
+name: agentv-core-development
+description: Use when changing AgentV core, SDK, CLI, Studio APIs, config schemas, docs, examples, or any cross-process wire format. Covers design principles, TypeScript conventions, naming, snake_case boundaries, and documentation updates.
+---
+
+# AgentV Core Development
+
+AgentV is a TypeScript monorepo for a declarative AI agent evaluation framework.
+
+## Goals
+
+- Declarative YAML eval definitions.
+- Structured, type-safe grading.
+- Multi-objective scoring for correctness, latency, cost, and safety.
+- Optimization-ready primitives without speculative built-ins.
+
+## Design Principles
+
+- Keep core lightweight and extensible through plugins.
+- Built-ins should be universal primitives: deterministic, stateless, single-purpose, and broadly useful.
+- Prefer composition over new features. If existing primitives cover a need, document the pattern instead of adding code.
+- Research peer frameworks before adding a new capability, and choose the lowest common denominator.
+- Apply YAGNI to implementation size, not just feature selection. Audit existing primitives before adding knobs, modes, precedence rules, or new invariants.
+- New fields must be optional and non-breaking.
+- Design for AI agents: intuitive primitives, self-documenting modules, concise extension recipes in file headers, and no dead speculative infrastructure.
+
+If you notice existing overengineering while working, flag it through the active AO/GitHub workflow (for example, open a GitHub issue or report it in the PR) and include the current behavior, a simpler model, migration notes, and code links. Do not widen the current PR unless asked.
+
+## Stack
+
+- TypeScript 5.x targeting ES2022 and Node 20+.
+- Bun for all package and script operations.
+- Bun workspaces, tsup, Biome, Vitest, Vercel AI SDK, Zod.
+
+## Project Structure
+
+- `packages/core/`: evaluation engine, providers, grading, registry, programmatic API.
+- `packages/eval/`: lightweight assertion SDK.
+- `apps/cli/`: command-line interface published as `agentv`.
+- `apps/studio/`: Studio frontend.
+- `apps/web/`: documentation site.
+- `examples/`: documentation and integration coverage.
+
+## Code Editing Discipline
+
+- Revise existing files in place when the feature belongs there; avoid creating `*-v2`, `*-new`, `*-improved`, or similarly duplicative files.
+- New files are appropriate for genuinely new modules, skills, examples, or docs, but do not create throwaway variants as a substitute for understanding the existing code.
+- Avoid broad script-based rewrites of source code. For code changes, prefer targeted edits after reading enough context; scripts are acceptable for mechanical verification, generated outputs, or narrow non-code maintenance where risk is low.
+- Do not delete files or folders without explicit permission. If cleanup is needed, ask or use a reversible alternative.
+- If using a third-party library/API and you are not sure about current usage, consult current official docs before changing the integration.
+
+## TypeScript
+
+- Prefer inference over explicit types when clear.
+- Use `async`/`await`.
+- Prefer named exports.
+- Keep modules cohesive.
+- Update stale file headers when behavior changes.
+
+## Project vs Benchmark
+
+- `Project`: top-level Studio container around a registered workspace directory. Modelled by `ProjectEntry` / `ProjectRegistry` and stored in `~/.agentv/projects.yaml`.
+- `Benchmark`: curated eval suite designed to measure a capability. Example benchmark directories should keep that name.
+- Legacy `~/.agentv/benchmarks.yaml` migration and per-run `benchmark.json` artifacts are separate concepts.
+
+When in doubt: if it holds runs/traces/experiments, it is a project. If it is a curated eval suite, it is a benchmark.
+
+## Wire Format
+
+Everything crossing a process boundary uses `snake_case`. Internal TypeScript uses `camelCase`. Translate at the boundary only.
+
+Snake case surfaces include YAML, JSONL result files, artifact output, HTTP responses, CLI JSON, and anything consumed by non-TS tooling. Camel case surfaces are TypeScript variables, parameters, type members, and in-memory shapes.
+
+Use paired wire/internal interfaces and converters, following `packages/core/src/projects.ts`. Do not dump TS objects directly to YAML or JSON responses.
+
+Treat existing camelCase on disk or in responses as a bug when touching that path.
+
+## Documentation
+
+When functionality changes, update:
+
+- Docs site under `apps/web/src/content/docs/`.
+- Skills if YAML schema, grader types, or CLI commands changed.
+- Examples that exercise changed behavior.
+- README only when the high-level pointer changes.
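
A minimal sketch of the Wire Format rule in `agentv-core-development/SKILL.md` above: paired wire/internal interfaces with converters at the boundary. The `RunSummary` shapes and the `toWire`/`fromWire` names are hypothetical and exist only for illustration; the pattern to follow in the repo is the one in `packages/core/src/projects.ts`, which is not shown in this diff.

```typescript
// Hypothetical shapes for illustration only; not taken from packages/core.

// Wire shape: snake_case, as written to YAML, JSONL, or HTTP responses.
interface RunSummaryWire {
  run_id: string;
  duration_ms: number;
  passed_count: number;
}

// Internal shape: camelCase, used everywhere inside TypeScript.
interface RunSummary {
  runId: string;
  durationMs: number;
  passedCount: number;
}

// Translate only at the process boundary, one field at a time,
// so a renamed internal field can never leak onto the wire.
function toWire(summary: RunSummary): RunSummaryWire {
  return {
    run_id: summary.runId,
    duration_ms: summary.durationMs,
    passed_count: summary.passedCount,
  };
}

function fromWire(wire: RunSummaryWire): RunSummary {
  return {
    runId: wire.run_id,
    durationMs: wire.duration_ms,
    passedCount: wire.passed_count,
  };
}

// Serialize the wire shape, never the internal object.
const serialized = JSON.stringify(
  toWire({ runId: "r1", durationMs: 1200, passedCount: 3 }),
);
```

Explicit field-by-field converters are more verbose than a generic key-case transform, but they keep the wire contract visible in one place and fail type checking when either side changes.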
diff --git a/.agents/skills/agentv-git-workflow/SKILL.md b/.agents/skills/agentv-git-workflow/SKILL.md
@@ -0,0 +1,95 @@
+---
+name: agentv-git-workflow
+description: Use when starting, claiming, committing, pushing, opening, updating, reviewing, merging, or cleaning up AgentV work. Covers AO-first session/worktree/PR lifecycle, GitHub collaboration, manual fallback worktrees, existing PR takeover, and merge cleanup.
+---
+
+# AgentV Git Workflow
+
+## Tracking Model
+
+- AO (Composio Agent Orchestrator) is the orchestration layer for live coding work: assignment, worker ownership, status, worktree lifecycle, PR claiming, and visualization.
+- GitHub is the external collaboration surface: PRs, reviews, CI, merge coordination, issues, and human-visible handoff.
+- Beads (`bd`) is optional durable planning/backlog context only when explicitly assigned by the user/AO. Do not use Beads as routine live execution tracking in AO-managed sessions.
+- Do not create competing task trackers, markdown TODO ledgers, unmanaged agent sessions, or duplicate PRs.
+
+## AO-Managed Sessions
+
+When `AO_SESSION_ID` is present or the task says it is an AO worker session:
+
+1. Acknowledge and report status with AO commands (`ao acknowledge`, `ao report working`, `ao report fixing-ci`, `ao report addressing-reviews`, `ao report needs-input`).
+2. Use the AO-provided worktree and branch unless AO/user instructs otherwise.
+3. For an existing PR, run `ao session claim-pr <number-or-url>` before editing. If claim or checkout indicates another AO session/worktree owns the branch, coordinate instead of forcing checkout.
+4. Push focused commits to the claimed PR branch and report PR milestones with `ao report pr-created --pr-url <url>`, `draft-pr-created`, or `ready-for-review` as appropriate.
+5. Do not invoke `ep-spawn-agent`, launch sub-agents, create extra worktrees, or create Beads tasks for live tracking unless AO/user explicitly asks.
+
+## ep-spawn-agent Verdict
+
+`ep-spawn-agent` is disabled for normal AgentV work under AO. It may only be used in a non-AO environment or with explicit AO/user instruction for a Beads experiment. In AO-managed sessions it conflicts with AO ownership, visualization, worktree, and PR lifecycle, so prefer AO workers/harnesses instead.
+
+## Manual Fallback Outside AO
+
+For feature, bug fix, or non-trivial repo changes outside AO, work from a dedicated sibling worktree based on latest `origin/main`:
+
+```bash
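+# Refresh origin/main before branching.
+git fetch origin
+# Create a new branch in a sibling worktree, based on the latest origin/main.
+git worktree add ../agentv.worktrees/<type>-<short-desc> -b <type>/<short-desc> origin/main
+cd ../agentv.worktrees/<type>-<short-desc>
+bun install
+# Copy .env from the primary checkout (the first entry in the worktree list), if it has one.
+cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env 2>/dev/null || true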
+```
+
+Keep the primary checkout clean. Do not push directly to `main`.
+
+## Existing PR Takeover
+
+1. Inspect the PR first:
+
+   ```bash
+   gh pr view <number> --json number,title,state,isDraft,headRefName,headRefOid,baseRefName,mergeStateStatus,reviewDecision,statusCheckRollup,url
+   gh pr checks <number> --watch=false
+   ```
+
+2. In AO, claim with `ao session claim-pr <number-or-url>` and use the resulting worktree/branch. If the branch is already used by another worktree, do not force it; coordinate or `cd` into the existing worktree only when that is the safe continuation path.
+
+3. Outside AO, check out the PR branch manually:
+
+   ```bash
+   gh pr checkout <number>
+   # or: cd /path/to/existing/worktree
+   ```
+
+4. Push focused commits to the existing PR branch. Do not create a second PR for the same work.
+
+## PRs and Pushing
+
+After the first meaningful commit, push and open or update a PR. In AO, prefer the PR lifecycle requested by the orchestrator; otherwise open a draft PR for in-progress work.
+
+```bash
+git push -u origin HEAD
+gh pr create --draft --title "<type>(scope): summary" --body "<summary and verification plan>"
+```
+
+Use conventional commit and PR titles: `type(scope): summary`.
+
+## PR Readiness
+
+Keep the PR in draft until verification evidence is complete: unit tests, test plan evidence, manual red/green UAT for user-facing changes, CI green, no conflicts, and final review pass when warranted.
+
+Before marking ready:
+
+```bash
+gh pr checks <number> --watch=false
+gh pr view <number> --json isDraft,mergeStateStatus,reviewDecision,statusCheckRollup
+```
+
+## Merge and Cleanup
+
+Merge only when you are explicitly responsible for merging, and use a squash merge:
+
+```bash
+gh pr merge <PR_NUMBER> --squash --delete-branch
+```
+
+After squash merge, do not continue pushing to the old branch. Start follow-up fixes from fresh `main`.
+
+Before ending a session, ensure committed work is pushed and report the current state through AO when running under AO.
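
A small sketch of the `type(scope): summary` title convention from `agentv-git-workflow/SKILL.md` above. `isConventionalTitle` is a hypothetical helper, and the exact character rules (lowercase type, required scope, non-empty summary) are assumptions; the skill only states the overall shape.

```typescript
// Hypothetical helper; the accepted characters are an assumption, not repo policy.
// Shape checked: lowercase type, parenthesized scope, colon, space, non-empty summary.
const TITLE_PATTERN = /^[a-z]+\([a-z0-9-]+\): \S.*$/;

function isConventionalTitle(title: string): boolean {
  return TITLE_PATTERN.test(title);
}

// Example inputs: one title that follows the convention and one that does not.
const accepted = isConventionalTitle("fix(cli): handle missing .env");
const rejected = isConventionalTitle("Fix stuff");
```

Under these assumptions `accepted` is `true` and `rejected` is `false`.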
diff --git a/.agents/skills/agentv-grader-changes/SKILL.md b/.agents/skills/agentv-grader-changes/SKILL.md
@@ -0,0 +1,51 @@
+---
+name: agentv-grader-changes
+description: Use when adding, modifying, renaming, parsing, or verifying AgentV graders/evaluators, assertion types, scoring behavior, thresholds, baseline files, or eval output shape.
+---
+
+# AgentV Grader Changes
+
+## Type System
+
+Grader types are kebab-case everywhere:
+
+- YAML config: `llm-grader`, `is-json`, `execution-metrics`.
+- Internal `EvaluatorKind`.
+- Output `scores[].type`.
+- Registry keys.
+
+Source of truth: `EVALUATOR_KIND_VALUES` in `packages/core/src/evaluation/types.ts`.
+
+Snake_case aliases can be accepted for backward compatibility through `normalizeGraderType()` in `grader-parser.ts`. SDK-facing `AssertionType` in `packages/eval/src/assertion.ts` must stay in sync.
+
+## Verification
+
+Unit tests are not enough for grader changes.
+
+1. Ensure `.env` exists in the worktree.
+2. Run an actual eval with a real example file:
+
+   ```bash
+   bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id <test-id>
+   ```
+
+3. Inspect JSONL output:
+   - correct `scores[].type`
+   - expected score calculation
+   - assertions have `text`, `passed`, and optional `evidence`
+
+4. Update `*.baseline.jsonl` files when output format changes.
+
+`--dry-run` is useful for harness plumbing but returns mock scores and cannot validate grading quality.
+
+## Score Range Checks
+
+For manual e2e score guardrails:
+
+```bash
+bun apps/cli/src/cli.ts eval examples/path/to/suite.eval.yaml --target azure \
+  --out examples/path/to/suite.results.jsonl
+bun scripts/check-grader-scores.ts
+```
+
+Add `<eval-stem>.grader-scores.yaml` next to an eval when a new suite needs score-range assertions.
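
A sketch of the alias handling described under Type System in `agentv-grader-changes/SKILL.md` above: snake_case input is accepted, and kebab-case is what gets stored and emitted. `normalizeKind` is a hypothetical stand-in for `normalizeGraderType()`, whose real implementation is not shown in this diff, and `KNOWN_KINDS` holds only the three types the skill names, not the full `EVALUATOR_KIND_VALUES` list.

```typescript
// Illustrative subset; the source of truth is EVALUATOR_KIND_VALUES
// in packages/core/src/evaluation/types.ts.
const KNOWN_KINDS = ["llm-grader", "is-json", "execution-metrics"] as const;
type Kind = (typeof KNOWN_KINDS)[number];

// Hypothetical stand-in for normalizeGraderType(); the real function may differ.
function normalizeKind(raw: string): Kind {
  // Accept snake_case aliases by mapping underscores to hyphens.
  const candidate = raw.trim().replace(/_/g, "-");
  const match = KNOWN_KINDS.find((kind) => kind === candidate);
  if (match === undefined) {
    throw new Error(`Unknown grader type: ${raw}`);
  }
  return match;
}

// A legacy snake_case alias and a canonical name resolve to the same kind.
const fromAlias = normalizeKind("llm_grader");
const fromCanonical = normalizeKind("llm-grader");
```

Normalizing once at parse time means the registry keys and output `scores[].type` only ever see the kebab-case form.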
diff --git a/.agents/skills/agentv-release-publishing/SKILL.md b/.agents/skills/agentv-release-publishing/SKILL.md
@@ -0,0 +1,31 @@
+---
+name: agentv-release-publishing
+description: Use when changing AgentV versioning, release automation, package publishing, npm package configuration, or release docs.
+---
+
+# AgentV Release and Publishing
+
+## Versioning
+
+Git commit history is the changelog. Use GitHub Actions for releases; do not publish manually from a local machine.
+
+## Standard Release Flow
+
+1. Run the Release workflow with `channel=next` and the desired bump. It creates `x.y.z-next.1`, commits, tags, and pushes.
+2. The Publish workflow publishes to npm `next`.
+3. Run the Release workflow with `channel=finalize`. It strips the prerelease suffix.
+4. The Publish workflow publishes to npm `latest`.
+
+## Direct Stable Release
+
+Run the Release workflow with `channel=stable` and the desired bump. The Publish workflow publishes to npm `latest`.
+
+## Local Scripts
+
+`bun scripts/release.ts` can inspect version state locally, but do not run `bun run publish` or `bun run publish:next` locally. npm publish uses OIDC trusted publishing from GitHub Actions.
+
+## Packages
+
+- `packages/core/` publishes `@agentv/core`.
+- `apps/cli/` publishes `agentv`.
+- tsup bundles workspace dependencies with `noExternal: ["@agentv/core"]`.
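
To make the `finalize` step in `agentv-release-publishing/SKILL.md` above concrete, here is a sketch of what "strips the prerelease suffix" means for a version string. `stripPrerelease` is a hypothetical helper written for this note; the Release workflow's actual logic is not part of this diff.

```typescript
// Hypothetical helper; not the Release workflow's real implementation.
// Turns a prerelease version such as x.y.z-next.1 into x.y.z.
function stripPrerelease(version: string): string {
  const match = /^(\d+\.\d+\.\d+)(?:-[0-9A-Za-z.-]+)?$/.exec(version);
  const core = match?.[1];
  if (core === undefined) {
    throw new Error(`Not a semver-like version: ${version}`);
  }
  return core;
}

// A `next` prerelease and an already-stable version both end up stable.
const finalized = stripPrerelease("1.4.0-next.1");
const unchanged = stripPrerelease("1.4.0");
```

Here `finalized` and `unchanged` are both `"1.4.0"`.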
diff --git a/.agents/skills/agentv-testing-verification/SKILL.md b/.agents/skills/agentv-testing-verification/SKILL.md
@@ -0,0 +1,78 @@
+---
+name: agentv-testing-verification
+description: Use when testing, verifying, debugging checks, changing CLI behavior, grader behavior, Studio UI/API behavior, docs site visuals, examples, or preparing an AgentV PR for review.
+---
+
+# AgentV Testing and Verification
+
+## Pre-Push
+
+The repo uses `prek` pre-push hooks. Do not manually run the full pre-push suite before pushing unless diagnosing a failure. Push to the feature branch and let the hook run these checks:
+
+- `bun run build`
+- `bun run typecheck`
+- `bun run lint`
+- `bun run test`
+- `bun run validate:examples`
+
+Manual equivalent:
+
+```bash
+bunx prek run --all-files --hook-stage pre-push
+```
+
+## CLI Testing
+
+Never use global `agentv` for functional testing. Use current source:
+
+```bash
+bun apps/cli/src/cli.ts <args>
+```
+
+If changes touch `packages/core/`, run `bun run build` first because the CLI imports `@agentv/core` from compiled `dist`.
+
+For built output use `bun apps/cli/dist/cli.js <args>` or `bun agentv <args>`, but only after building.
+
+## Studio UI
+
+`agentv studio` serves `apps/studio/dist/`. Rebuild before UAT or screenshots:
+
+```bash
+cd apps/studio && bun run build
+```
+
+## Docs Browser E2E
+
+Use `agent-browser` for docs site verification. Always pass `--session <name>` and do not use `--headed`.
+
+If session launch hangs with EAGAIN on ARM64, pre-start Chrome with CDP and use `agent-browser --cdp 9222`.
+
+## Browser Safety In Tests
+
+Automated tests should not unexpectedly open a graphical browser. For browser-dependent behavior, prefer headless `agent-browser` verification or explicit opt-in test hooks. If adding code that can launch a browser, guard it behind environment checks or explicit user action.
+
+## Agent Provider Evals
+
+Limit coding-agent provider eval concurrency to 3 targets at a time for `claude`, `claude-sdk`, `codex`, `copilot`, `copilot-sdk`, `pi`, and `pi-cli`. Lightweight LLM-only targets can use higher concurrency.
+
+## Writing Tests
+
+- Test new or changed behavior only.
+- Prefer one test per distinct behavior.
+- Avoid tests for obvious one-line behavior unless it is a regression risk.
+- Regression tests matter more than broad happy-path duplication.
+- Tests are executable contracts; update them when behavior promises change.
+- Use table-driven tests when multiple cases exercise the same behavior.
+- Use temporary directories/helpers for filesystem tests; do not write persistent test artifacts into the repo.
+
+## Completion Checklist
+
+Before marking a branch ready:
+
+- Ensure `.env` exists in the worktree when evals or LLM-dependent tests may run.
+- Run targeted tests while developing and rely on pre-push for the full suite.
+- Complete manual red/green UAT for user-facing behavior before review readiness.
+- Verify adjacent behavior where the change touches shared parsing, scoring, config, or UI paths.
+- For scoring/grader changes, run at least one real eval with a live provider when feasible.
+- For Studio UX/API changes, verify with browser testing.
+- Document verification evidence in the PR.
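
A sketch of the guard described under Browser Safety In Tests in `agentv-testing-verification/SKILL.md` above. `shouldLaunchBrowser` and the `EXAMPLE_ALLOW_BROWSER` variable are hypothetical, and treating `CI=true` or `NODE_ENV=test` as "automated" is an assumption based on common convention, not on AgentV settings.

```typescript
// Hypothetical guard; the env var names are assumptions, not AgentV configuration.
type Env = Record<string, string | undefined>;

function shouldLaunchBrowser(env: Env, userRequested: boolean): boolean {
  // Never open a graphical browser in CI or under a test runner.
  if (env["CI"] === "true" || env["NODE_ENV"] === "test") {
    return false;
  }
  // Otherwise require an explicit user action or an explicit opt-in variable.
  return userRequested || env["EXAMPLE_ALLOW_BROWSER"] === "1";
}

// Automated context: blocked even if the caller asks for a browser.
const inCi = shouldLaunchBrowser({ CI: "true" }, true);
// Interactive context with an explicit user action: allowed.
const interactive = shouldLaunchBrowser({}, true);
```

In real code the first argument would typically be `process.env`; passing the environment in as a parameter keeps the guard easy to cover with a table-driven test.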