`.agents/skills/README.md` (new file, +15 lines)

# AgentV Coding Agent Skills

This directory contains repo-local skills that teach coding agents how to work with AgentV. They are shared across compatible tools through `.agents/skills`, with `.claude/skills` symlinked here for Claude compatibility.

## Skills

| Skill | Description |
| ----- | ----------- |
| [agentv-core-development](agentv-core-development/) | Core design principles, TypeScript conventions, naming, wire-format rules, docs expectations, and project structure. |
| [agentv-testing-verification](agentv-testing-verification/) | AgentV test strategy, CLI verification, grader e2e checks, browser verification, and pre-push behavior. |
| [agentv-git-workflow](agentv-git-workflow/) | Beads-first decentralized orchestration, worktrees, existing PR takeover, draft PRs, and merge cleanup. |
| [beads-execplan-issue-creator](beads-execplan-issue-creator/) | Convert approved plans into dependency-aware bead epics/tasks with acceptance criteria, verification, and invariants. |
| [beads-epic-delivery-loop](beads-epic-delivery-loop/) | Execute a bead epic end-to-end with select, claim, implement, verify, review, commit, close, and repeat loops. |
| [agentv-grader-changes](agentv-grader-changes/) | Grader type conventions, live eval verification, baseline updates, and score-range checks. |
| [agentv-release-publishing](agentv-release-publishing/) | Versioning, release workflow, and package publishing. |

`.agents/skills/agentv-core-development/SKILL.md` (new file, +85 lines)

---
name: agentv-core-development
description: Use when changing AgentV core, SDK, CLI, Studio APIs, config schemas, docs, examples, or any cross-process wire format. Covers design principles, TypeScript conventions, naming, snake_case boundaries, and documentation updates.
---

# AgentV Core Development

AgentV is a TypeScript monorepo for a declarative AI agent evaluation framework.

## Goals

- Declarative YAML eval definitions.
- Structured, type-safe grading.
- Multi-objective scoring for correctness, latency, cost, and safety.
- Optimization-ready primitives without speculative built-ins.

## Design Principles

- Keep core lightweight and extensible through plugins.
- Built-ins should be universal primitives: deterministic, stateless, single-purpose, and broadly useful.
- Prefer composition over new features. If existing primitives cover a need, document the pattern instead of adding code.
- Research peer frameworks before adding a new capability, and choose the lowest common denominator.
- Apply YAGNI to implementation size, not just feature selection. Audit existing primitives before adding knobs, modes, precedence rules, or new invariants.
- New fields must be optional and non-breaking.
- Design for AI agents: intuitive primitives, self-documenting modules, concise extension recipes in file headers, and no dead speculative infrastructure.

If you notice existing overengineering while working, create a Beads issue titled `cleanup: simplify X` with current behavior, simpler model, migration notes, and code links. Do not widen the current PR unless asked.

## Stack

- TypeScript 5.x targeting ES2022 and Node 20+.
- Bun for all package and script operations.
- Bun workspaces, tsup, Biome, Vitest, Vercel AI SDK, Zod.

## Project Structure

- `packages/core/`: evaluation engine, providers, grading, registry, programmatic API.
- `packages/eval/`: lightweight assertion SDK.
- `apps/cli/`: command-line interface published as `agentv`.
- `apps/studio/`: Studio frontend.
- `apps/web/`: documentation site.
- `examples/`: documentation and integration coverage.

## Code Editing Discipline

- Revise existing files in place when the feature belongs there; avoid creating `*-v2`, `*-new`, `*-improved`, or similarly duplicative files.
- New files are appropriate for genuinely new modules, skills, examples, or docs, but do not create throwaway variants as a substitute for understanding the existing code.
- Avoid broad script-based rewrites of source code. For code changes, prefer targeted edits after reading enough context; scripts are acceptable for mechanical verification, generated outputs, or narrow non-code maintenance where risk is low.
- Do not delete files or folders without explicit permission. If cleanup is needed, ask or use a reversible alternative.
- If you are using a third-party library or API and are unsure of its current usage, consult its current official docs before changing the integration.

## TypeScript

- Prefer inference over explicit types when clear.
- Use `async`/`await`.
- Prefer named exports.
- Keep modules cohesive.
- Update stale file headers when behavior changes.

## Project vs Benchmark

- `Project`: top-level Studio container around a registered workspace directory. Modelled by `ProjectEntry` / `ProjectRegistry` and stored in `~/.agentv/projects.yaml`.
- `Benchmark`: curated eval suite designed to measure a capability. Example benchmark directories should keep that name.
- Legacy `~/.agentv/benchmarks.yaml` migration and per-run `benchmark.json` artifacts are separate concepts.

When in doubt: if it holds runs/traces/experiments, it is a project. If it is a curated eval suite, it is a benchmark.

## Wire Format

Everything crossing a process boundary uses `snake_case`. Internal TypeScript uses `camelCase`. Translate at the boundary only.

Snake case surfaces include YAML, JSONL result files, artifact output, HTTP responses, CLI JSON, and anything consumed by non-TS tooling. Camel case surfaces are TypeScript variables, parameters, type members, and in-memory shapes.

Use paired wire/internal interfaces and converters, following `packages/core/src/projects.ts`. Do not dump TS objects directly to YAML or JSON responses.

Treat existing camelCase on disk or in responses as a bug when touching that path.
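A minimal sketch of the paired wire/internal pattern follows. The `ExampleProject` types and their fields are hypothetical illustrations, not the real shapes in `packages/core/src/projects.ts`. The point is that snake_case exists only on the wire type, and the two converters are the only places that know both spellings.

```typescript
// Hypothetical types: field names are illustrative, not the real project schema.

// Wire shape: what is written to YAML/JSON or returned over HTTP (snake_case).
interface ExampleProjectWire {
  id: string;
  display_name: string;
  added_at: string; // ISO-8601 timestamp
}

// Internal shape: what TypeScript code passes around (camelCase).
interface ExampleProject {
  id: string;
  displayName: string;
  addedAt: string;
}

// Boundary in: called right after parsing a file or request body.
function fromWire(wire: ExampleProjectWire): ExampleProject {
  return { id: wire.id, displayName: wire.display_name, addedAt: wire.added_at };
}

// Boundary out: called right before serializing to YAML/JSON.
function toWire(project: ExampleProject): ExampleProjectWire {
  return { id: project.id, display_name: project.displayName, added_at: project.addedAt };
}
```

Because the converters are explicit, renaming an internal field cannot silently change the on-disk format.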

## Documentation

When functionality changes, update:

- Docs site under `apps/web/src/content/docs/`.
- Skills if YAML schema, grader types, or CLI commands changed.
- Examples that exercise changed behavior.
- README only when the high-level pointer changes.

`.agents/skills/agentv-git-workflow/SKILL.md` (new file, +155 lines)

---
name: agentv-git-workflow
description: Use when starting, claiming, committing, pushing, opening, updating, reviewing, merging, or cleaning up AgentV work. Covers Beads as decentralized orchestration, GitHub as collaboration surface, worktrees, draft PRs, existing PR takeover, and merge cleanup.
---

# AgentV Git Workflow

## Tracking Model

- Beads is the decentralized orchestration layer: task state, ownership, dependencies, discoveries, and durable project knowledge live in the bead graph.
- GitHub is the collaboration surface: draft PRs, reviews, CI, merge coordination, and communication with other parties.
- Interpret "do not use external issue trackers" as "do not create a second private task brain." GitHub PRs still handle code review and merge state.
- Runtime stays lightweight: Beads tracks durable coordination state, the repo-standard bead launcher creates disposable worktree sessions, and git worktrees provide isolation. Use manual worktree setup only as a fallback when the launcher is unavailable or broken.

Use Beads instead of markdown TODO lists:

```bash
bd ready --json
bd show <id> --json
bd create "Issue title" --description="Detailed context" -t <type> -p <priority> --json  # type: bug|feature|task|chore|epic; priority: 0-4
bd update <id> --claim --json
bd update <id> --status in_progress --json
bd close <id> --reason "Completed" --json
bd remember "durable project insight"
bd dolt push
```

## Starting New Bead Work

Use the repo-standard bead launcher:

```bash
ep-spawn-agent <bead-id>
```

Until a dedicated `bead-start` wrapper exists, `ep-spawn-agent <bead-id>` is the default launch path. Do not choose between multiple launch modes during normal work.

The launcher should:

1. read the bead with `bd show <bead-id> --json`;
2. claim or mark it in progress;
3. create a fresh sibling worktree from latest `origin/main`;
4. launch the agent with bead context;
5. write the session/worktree/branch note back to the bead.
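The launcher contract can be made concrete as a pure function that plans the commands it would run. This is a hypothetical sketch, not the real `ep-spawn-agent` implementation: it reuses the `bd` and `git` commands from the manual fallback, records the note with `bd note` as used elsewhere in this skill, and leaves the agent launch (step 4) out because it is tool-specific.

```typescript
// Hypothetical sketch: the real launcher may differ in order, flags, and note text.
type Command = string[]; // argv, e.g. ["bd", "show", "av-1", "--json"]

function planBeadLaunch(beadId: string, worktreeRoot = "../agentv.worktrees"): Command[] {
  const worktree = `${worktreeRoot}/${beadId}`;
  const branch = `work/${beadId}`;
  return [
    // 1. read the bead
    ["bd", "show", beadId, "--json"],
    // 2. claim it and mark it in progress
    ["bd", "update", beadId, "--claim", "--json"],
    ["bd", "update", beadId, "--status", "in_progress", "--json"],
    // 3. fresh sibling worktree from latest origin/main
    ["git", "fetch", "origin"],
    ["git", "worktree", "add", worktree, "-b", branch, "origin/main"],
    // 4. launching the agent is tool-specific and omitted here
    // 5. write the worktree/branch note back to the bead
    ["bd", "note", beadId, `Working tree: ${worktree}; branch: ${branch}`],
  ];
}
```

Returning argv arrays instead of shell strings keeps bead IDs from being interpreted by a shell.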

Use the manual fallback only when the launcher is unavailable or broken:

```bash
bd show <id> --json
bd update <id> --claim --json
bd update <id> --status in_progress --json
git fetch origin
git worktree add ../agentv.worktrees/<id> -b work/<id> origin/main
cd ../agentv.worktrees/<id>
bun install
# Copy .env from the primary checkout (the first entry in the porcelain listing).
cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env
codex-eng
```
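The `cp` line in the manual fallback relies on `git worktree list --porcelain` listing the main worktree first, as a `worktree <path>` line. A TypeScript equivalent of the `head -1 | sed` pipeline, as a sketch (the function name is hypothetical):

```typescript
// Mirrors: git worktree list --porcelain | head -1 | sed 's/worktree //'
// Porcelain output is a series of blank-line-separated records; the first
// record is the main worktree and begins with "worktree <absolute path>".
function primaryWorktreePath(porcelain: string): string | undefined {
  const firstLine = porcelain.split("\n", 1)[0];
  return firstLine.startsWith("worktree ") ? firstLine.slice("worktree ".length) : undefined;
}
```

In a script, the input would come from `execFileSync("git", ["worktree", "list", "--porcelain"], { encoding: "utf8" })` in `node:child_process`.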

## Beads Viewer

`bv` is optional graph/kanban visibility for the Beads graph. For agents, never run bare `bv` because it opens the interactive TUI and blocks the session. Use robot-mode commands only:

```bash
bv --robot-next
bv --robot-triage
bv --robot-plan
bv --robot-graph
```

In worktrees where `.beads` is not present, point `bv` at the canonical project Beads directory:

```bash
bv --db /home/entity/projects/EntityProcess/agentv/.beads --robot-triage
```

## Worktrees

For feature, bug fix, or non-trivial repo changes, work from a dedicated sibling worktree based on latest `origin/main`. Keep the primary checkout clean; do not do feature work in the main folder.

AgentV worktrees live in sibling `../agentv.worktrees/`, not `.worktrees/` inside the repo and not the primary checkout.

After checking out a branch or PR, run `bun install` if `package.json` or `bun.lock` may have changed.

## Existing PR Takeover

When continuing an existing PR, keep the PR branch as the source of truth for code and use Beads for durable task state/handoff.

1. Inspect the PR first:

```bash
gh pr view <number> --json number,title,state,isDraft,headRefName,headRefOid,baseRefName,mergeStateStatus,reviewDecision,statusCheckRollup,url
gh pr checks <number> --watch=false
```

2. Check out the PR branch. If Git reports the branch is already used by another worktree, do not force it; `cd` into that existing worktree instead.

```bash
gh pr checkout <number>
# or: cd /path/to/existing/worktree
```

3. Create or update a bead for the continuation if one is not already provided. Reference the PR number in the bead description or notes.

```bash
bd create "Continue PR <number>: <summary>" --description="Current state, requested changes, and handoff context" -t task -p 1 --json
bd note <id> "Working tree: <path>; PR: https://github.com/EntityProcess/agentv/pull/<number>"
```

4. Push focused commits to the existing PR branch. Do not create a second PR for the same work.

## Draft PRs

After the first meaningful commit, push and open a draft PR. Continue pushing meaningful checkpoints.

```bash
git push -u origin HEAD
gh pr create --draft --title "<type>(scope): summary" --body "Refs <bead-id>"
bd note <bead-id> "Draft PR: <url>"
```

Do not push directly to `main`. The default branch is `main`; do not use or document `master` for AgentV workflows.

## PR Readiness

Keep the PR in draft until verification evidence is complete: unit tests, test plan evidence, manual red/green UAT for user-facing changes, green CI, no merge conflicts, and a final review pass when warranted.

Before marking ready:

```bash
gh pr checks <number> --watch=false
gh pr view <number> --json isDraft,mergeStateStatus,reviewDecision,statusCheckRollup
bd note <bead-id> "Verification complete: <summary>"
```

## Merge and Cleanup

Use squash merge only:

```bash
gh pr merge <PR_NUMBER> --squash --delete-branch
```

After squash merge, do not continue pushing to the old branch. Start follow-up fixes from fresh `main`.

Before ending a session:

```bash
git status
bd dolt push
git push
git status
```

Work is not complete until both Beads state and git commits are pushed.

`.agents/skills/agentv-grader-changes/SKILL.md` (new file, +51 lines)

---
name: agentv-grader-changes
description: Use when adding, modifying, renaming, parsing, or verifying AgentV graders/evaluators, assertion types, scoring behavior, thresholds, baseline files, or eval output shape.
---

# AgentV Grader Changes

## Type System

Grader types are kebab-case everywhere:

- YAML config: `llm-grader`, `is-json`, `execution-metrics`.
- Internal `EvaluatorKind`.
- Output `scores[].type`.
- Registry keys.

Source of truth: `EVALUATOR_KIND_VALUES` in `packages/core/src/evaluation/types.ts`.

Snake_case aliases can be accepted for backward compatibility through `normalizeGraderType()` in `grader-parser.ts`. SDK-facing `AssertionType` in `packages/eval/src/assertion.ts` must stay in sync.
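A minimal sketch of alias normalization, assuming a snake_case alias differs from its canonical kind only by `_` versus `-`. The real `normalizeGraderType()` in `grader-parser.ts` may use an explicit alias table instead, and `KINDS` below is an illustrative subset, not the full `EVALUATOR_KIND_VALUES`.

```typescript
// Illustrative subset; the real source of truth is EVALUATOR_KIND_VALUES.
const KINDS = ["llm-grader", "is-json", "execution-metrics"] as const;
type Kind = (typeof KINDS)[number];

function isKind(value: string): value is Kind {
  return (KINDS as readonly string[]).includes(value);
}

// Hypothetical normalizer: accepts snake_case aliases, always returns kebab-case.
function normalizeKind(raw: string): Kind {
  const candidate = raw.trim().replace(/_/g, "-");
  if (!isKind(candidate)) {
    throw new Error(`Unknown grader type: ${raw}`);
  }
  return candidate;
}
```

Normalizing once at parse time means registry lookups and `scores[].type` only ever see the kebab-case form.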

## Verification

Unit tests are not enough for grader changes.

1. Ensure `.env` exists in the worktree.
2. Run an actual eval with a real example file:

```bash
bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id <test-id>
```

3. Inspect the JSONL output:
   - correct `scores[].type`
   - expected score calculation
   - assertions have `text`, `passed`, and optional `evidence`

4. Update `*.baseline.jsonl` files when output format changes.
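The structural parts of step 3 can be scripted; checking the score calculation itself still needs a human. The sketch below checks one JSONL line. It assumes each line has a top-level `scores` array whose entries carry `type` and a nested `assertions` array. That nesting is an assumption beyond the field names listed above, so compare against a real results file first.

```typescript
// Assumed result shape (hypothetical nesting; verify against real output).
interface AssertionResult { text: string; passed: boolean; evidence?: unknown }
interface ScoreEntry { type: string; score: number; assertions?: AssertionResult[] }

const KEBAB = /^[a-z0-9]+(-[a-z0-9]+)*$/;

// Returns the problems found in one JSONL line; an empty array means it looks well-formed.
function checkResultLine(line: string): string[] {
  const record = JSON.parse(line) as { scores?: ScoreEntry[] };
  if (!Array.isArray(record.scores)) return ["missing scores array"];
  const problems: string[] = [];
  for (const [i, s] of record.scores.entries()) {
    if (!KEBAB.test(s.type)) problems.push(`scores[${i}].type is not kebab-case: ${s.type}`);
    for (const [j, a] of (s.assertions ?? []).entries()) {
      if (typeof a.text !== "string") problems.push(`scores[${i}].assertions[${j}] missing text`);
      if (typeof a.passed !== "boolean") problems.push(`scores[${i}].assertions[${j}] missing passed`);
    }
  }
  return problems;
}
```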

`--dry-run` is useful for harness plumbing but returns mock scores and cannot validate grading quality.

## Score Range Checks

For manual e2e score guardrails:

```bash
bun apps/cli/src/cli.ts eval examples/path/to/suite.eval.yaml --target azure \
--out examples/path/to/suite.results.jsonl
bun scripts/check-grader-scores.ts
```

Add `<eval-stem>.grader-scores.yaml` next to an eval when a new suite needs score-range assertions.
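At its core, a score-range check compares each observed score against a declared range. The sketch below assumes a hypothetical range shape of `{ test_id, min, max }` per entry (snake_case, as a file on disk would be). The real `*.grader-scores.yaml` schema and the logic in `scripts/check-grader-scores.ts` may differ. YAML and JSONL parsing are left out so the sketch stays dependency-free.

```typescript
// Hypothetical range entry as it might look after parsing <eval-stem>.grader-scores.yaml.
interface ScoreRange { test_id: string; min: number; max: number }

// Returns one message per test whose observed score is missing or outside its range.
function findOutOfRange(ranges: ScoreRange[], observed: Map<string, number>): string[] {
  const failures: string[] = [];
  for (const r of ranges) {
    const score = observed.get(r.test_id);
    if (score === undefined) {
      failures.push(`${r.test_id}: no score in results`);
    } else if (score < r.min || score > r.max) {
      failures.push(`${r.test_id}: ${score} not in [${r.min}, ${r.max}]`);
    }
  }
  return failures;
}
```

Treating a missing score as a failure keeps a renamed or skipped test from passing the guardrail silently.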

`.agents/skills/agentv-release-publishing/SKILL.md` (new file, +31 lines)

---
name: agentv-release-publishing
description: Use when changing AgentV versioning, release automation, package publishing, npm package configuration, or release docs.
---

# AgentV Release and Publishing

## Versioning

Git commit history is the changelog. Use GitHub Actions for releases; do not publish manually from a local machine.

## Standard Release Flow

1. Run the Release workflow with `channel=next` and the desired bump. It creates `x.y.z-next.1`, commits, tags, and pushes.
2. The Publish workflow publishes to the npm `next` dist-tag.
3. Run the Release workflow with `channel=finalize`. It strips the prerelease suffix.
4. The Publish workflow publishes to the npm `latest` dist-tag.
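The version strings in this flow can be illustrated with two small pure functions. This is a hypothetical sketch of the `x.y.z-next.1` convention, not the logic in `scripts/release.ts`. It assumes the `next` run starts from a stable version and ignores later `next.N` increments, which this section does not describe.

```typescript
type Bump = "major" | "minor" | "patch";

// Parse "x.y.z" or "x.y.z-next.N" into numeric parts (simplified; not full semver).
function parseVersion(version: string): [number, number, number] {
  const match = /^(\d+)\.(\d+)\.(\d+)(?:-next\.\d+)?$/.exec(version);
  if (!match) throw new Error(`Unsupported version: ${version}`);
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// channel=next from a stable version: apply the bump and start at next.1.
function toNext(stable: string, bump: Bump): string {
  if (stable.includes("-")) throw new Error(`Expected a stable version: ${stable}`);
  const [major, minor, patch] = parseVersion(stable);
  const base =
    bump === "major" ? `${major + 1}.0.0`
    : bump === "minor" ? `${major}.${minor + 1}.0`
    : `${major}.${minor}.${patch + 1}`;
  return `${base}-next.1`;
}

// channel=finalize: strip the prerelease suffix.
function finalize(prerelease: string): string {
  const [major, minor, patch] = parseVersion(prerelease);
  return `${major}.${minor}.${patch}`;
}
```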

## Direct Stable Release

Run the Release workflow with `channel=stable` and the desired bump. The Publish workflow publishes to the npm `latest` dist-tag.

## Local Scripts

`bun scripts/release.ts` can inspect version state locally, but do not run `bun run publish` or `bun run publish:next` locally. npm publish uses OIDC trusted publishing from GitHub Actions.

## Packages

- `packages/core/` publishes `@agentv/core`.
- `apps/cli/` publishes `agentv`.
- tsup bundles workspace dependencies with `noExternal: ["@agentv/core"]`.