15 changes: 15 additions & 0 deletions .agents/skills/README.md
@@ -0,0 +1,15 @@
# AgentV Coding Agent Skills

This directory contains repo-local skills that teach coding agents how to work with AgentV. They are shared across compatible tools through `.agents/skills`, with `.claude/skills` symlinked here for Claude compatibility.

## Skills

| Skill | Description |
| ----- | ----------- |
| [agentv-core-development](agentv-core-development/) | Core design principles, TypeScript conventions, naming, wire-format rules, docs expectations, and project structure. |
| [agentv-testing-verification](agentv-testing-verification/) | AgentV test strategy, CLI verification, grader e2e checks, browser verification, and pre-push behavior. |
| [agentv-git-workflow](agentv-git-workflow/) | AO-first session/worktree/PR lifecycle, GitHub collaboration, manual fallback worktrees, existing PR takeover, and merge cleanup. |
| [beads-execplan-issue-creator](beads-execplan-issue-creator/) | Optional when explicitly assigned: convert approved plans into dependency-aware bead epics/tasks with acceptance criteria, verification, and invariants. |
| [beads-epic-delivery-loop](beads-epic-delivery-loop/) | Optional when explicitly assigned: execute a bead epic end-to-end without spawning unmanaged agents. |
| [agentv-grader-changes](agentv-grader-changes/) | Grader type conventions, live eval verification, baseline updates, and score-range checks. |
| [agentv-release-publishing](agentv-release-publishing/) | Versioning, release workflow, and package publishing. |
85 changes: 85 additions & 0 deletions .agents/skills/agentv-core-development/SKILL.md
@@ -0,0 +1,85 @@
---
name: agentv-core-development
description: Use when changing AgentV core, SDK, CLI, Studio APIs, config schemas, docs, examples, or any cross-process wire format. Covers design principles, TypeScript conventions, naming, snake_case boundaries, and documentation updates.
---

# AgentV Core Development

AgentV is a TypeScript monorepo for a declarative AI agent evaluation framework.

## Goals

- Declarative YAML eval definitions.
- Structured, type-safe grading.
- Multi-objective scoring for correctness, latency, cost, and safety.
- Optimization-ready primitives without speculative built-ins.

## Design Principles

- Keep core lightweight and extensible through plugins.
- Built-ins should be universal primitives: deterministic, stateless, single-purpose, and broadly useful.
- Prefer composition over new features. If existing primitives cover a need, document the pattern instead of adding code.
- Research peer frameworks before adding a new capability, and choose the lowest common denominator among their approaches.
- Apply YAGNI to implementation size, not just feature selection. Audit existing primitives before adding knobs, modes, precedence rules, or new invariants.
- New fields must be optional and non-breaking.
- Design for AI agents: intuitive primitives, self-documenting modules, concise extension recipes in file headers, and no dead speculative infrastructure.
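
The "optional and non-breaking" rule can be illustrated with a minimal sketch. The config shape and the `weight` field below are hypothetical and are not taken from AgentV's real schema; the point is only that older documents keep parsing because readers apply a default.

```typescript
// Hypothetical config shape, for illustration only; not AgentV's real schema.
interface GraderConfigV1 {
  type: string;
  threshold: number;
}

// A later revision adds `weight` as an optional field, so V1 documents stay valid.
interface GraderConfigV2 extends GraderConfigV1 {
  weight?: number;
}

// Readers apply a default instead of requiring the new field.
function effectiveWeight(config: GraderConfigV2): number {
  return config.weight ?? 1;
}

const oldDocument: GraderConfigV2 = { type: "is-json", threshold: 0.5 };
const newDocument: GraderConfigV2 = { type: "is-json", threshold: 0.5, weight: 2 };
```

Making `weight` required instead would reject `oldDocument`, which is the breaking change the rule guards against.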

If you notice existing overengineering while working, flag it through the active AO/GitHub workflow (for example, open a GitHub issue or report it in the PR) with current behavior, simpler model, migration notes, and code links. Do not widen the current PR unless asked.

## Stack

- TypeScript 5.x targeting ES2022 and Node 20+.
- Bun for all package and script operations.
- Bun workspaces, tsup, Biome, Vitest, Vercel AI SDK, Zod.

## Project Structure

- `packages/core/`: evaluation engine, providers, grading, registry, programmatic API.
- `packages/eval/`: lightweight assertion SDK.
- `apps/cli/`: command-line interface published as `agentv`.
- `apps/studio/`: Studio frontend.
- `apps/web/`: documentation site.
- `examples/`: documentation and integration coverage.

## Code Editing Discipline

- Revise existing files in place when the feature belongs there; avoid creating `*-v2`, `*-new`, `*-improved`, or similarly duplicative files.
- New files are appropriate for genuinely new modules, skills, examples, or docs, but do not create throwaway variants as a substitute for understanding the existing code.
- Avoid broad script-based rewrites of source code. For code changes, prefer targeted edits after reading enough context; scripts are acceptable for mechanical verification, generated outputs, or narrow non-code maintenance where risk is low.
- Do not delete files or folders without explicit permission. If cleanup is needed, ask or use a reversible alternative.
- If using a third-party library/API and you are not sure about current usage, consult current official docs before changing the integration.

## TypeScript

- Prefer inference over explicit types when clear.
- Use `async`/`await`.
- Prefer named exports.
- Keep modules cohesive.
- Update stale file headers when behavior changes.

## Project vs Benchmark

- `Project`: top-level Studio container around a registered workspace directory. Modelled by `ProjectEntry` / `ProjectRegistry` and stored in `~/.agentv/projects.yaml`.
- `Benchmark`: curated eval suite designed to measure a capability. Example benchmark directories should keep that name.
- Legacy `~/.agentv/benchmarks.yaml` migration and per-run `benchmark.json` artifacts are separate concepts.

When in doubt: if it holds runs/traces/experiments, it is a project. If it is a curated eval suite, it is a benchmark.

## Wire Format

Everything crossing a process boundary uses `snake_case`. Internal TypeScript uses `camelCase`. Translate at the boundary only.

`snake_case` surfaces include YAML, JSONL result files, artifact output, HTTP responses, CLI JSON, and anything consumed by non-TS tooling. `camelCase` surfaces are TypeScript variables, parameters, type members, and in-memory shapes.

Use paired wire/internal interfaces and converters, following `packages/core/src/projects.ts`. Do not dump TS objects directly to YAML or JSON responses.

Treat existing camelCase on disk or in responses as a bug when touching that path.
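
A minimal sketch of the paired wire/internal pattern follows. The interface names, field names, and converter names are hypothetical; the real shapes live in `packages/core/src/projects.ts` and may differ.

```typescript
// Hypothetical shapes for illustration; see packages/core/src/projects.ts for the real ones.

// Wire shape: what is written to YAML or returned over HTTP (snake_case).
interface ProjectEntryWireSketch {
  id: string;
  display_name: string;
  workspace_path: string;
  created_at: string;
}

// Internal shape: what TypeScript code passes around (camelCase).
interface ProjectEntrySketch {
  id: string;
  displayName: string;
  workspacePath: string;
  createdAt: string;
}

// Translate at the boundary only: internal -> wire when writing.
function toWire(entry: ProjectEntrySketch): ProjectEntryWireSketch {
  return {
    id: entry.id,
    display_name: entry.displayName,
    workspace_path: entry.workspacePath,
    created_at: entry.createdAt,
  };
}

// Translate at the boundary only: wire -> internal when reading.
function fromWire(wire: ProjectEntryWireSketch): ProjectEntrySketch {
  return {
    id: wire.id,
    displayName: wire.display_name,
    workspacePath: wire.workspace_path,
    createdAt: wire.created_at,
  };
}

const exampleWire = toWire({
  id: "demo",
  displayName: "Demo project",
  workspacePath: "/tmp/demo",
  createdAt: "2024-01-01T00:00:00Z",
});
```

Explicit converters are more verbose than a generic key-renaming helper, but they keep each wire field visible in code review and let the two shapes diverge when needed.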

## Documentation

When functionality changes, update:

- Docs site under `apps/web/src/content/docs/`.
- Skills if YAML schema, grader types, or CLI commands changed.
- Examples that exercise changed behavior.
- README only when the high-level pointer changes.
95 changes: 95 additions & 0 deletions .agents/skills/agentv-git-workflow/SKILL.md
@@ -0,0 +1,95 @@
---
name: agentv-git-workflow
description: Use when starting, claiming, committing, pushing, opening, updating, reviewing, merging, or cleaning up AgentV work. Covers AO-first session/worktree/PR lifecycle, GitHub collaboration, manual fallback worktrees, existing PR takeover, and merge cleanup.
---

# AgentV Git Workflow

## Tracking Model

- AO (Composio Agent Orchestrator) is the orchestration layer for live coding work: assignment, worker ownership, status, worktree lifecycle, PR claiming, and visualization.
- GitHub is the external collaboration surface: PRs, reviews, CI, merge coordination, issues, and human-visible handoff.
- Beads (`bd`) is optional durable planning/backlog context only when explicitly assigned by the user/AO. Do not use Beads as routine live execution tracking in AO-managed sessions.
- Do not create competing task trackers, markdown TODO ledgers, unmanaged agent sessions, or duplicate PRs.

## AO-Managed Sessions

When `AO_SESSION_ID` is present or the task says it is an AO worker session:

1. Acknowledge and report status with AO commands (`ao acknowledge`, `ao report working`, `ao report fixing-ci`, `ao report addressing-reviews`, `ao report needs-input`).
2. Use the AO-provided worktree and branch unless AO/user instructs otherwise.
3. For an existing PR, run `ao session claim-pr <number-or-url>` before editing. If claim or checkout indicates another AO session/worktree owns the branch, coordinate instead of forcing checkout.
4. Push focused commits to the claimed PR branch and report PR milestones with `ao report pr-created --pr-url <url>`, `draft-pr-created`, or `ready-for-review` as appropriate.
5. Do not invoke `ep-spawn-agent`, launch sub-agents, create extra worktrees, or create Beads tasks for live tracking unless AO/user explicitly asks.

## ep-spawn-agent Verdict

`ep-spawn-agent` is disabled for normal AgentV work under AO. It may only be used in a non-AO environment or with explicit AO/user instruction for a Beads experiment. In AO-managed sessions it conflicts with AO ownership, visualization, worktree, and PR lifecycle, so prefer AO workers/harnesses instead.

## Manual Fallback Outside AO

For feature, bug fix, or non-trivial repo changes outside AO, work from a dedicated sibling worktree based on latest `origin/main`:

```bash
git fetch origin
git worktree add ../agentv.worktrees/<type>-<short-desc> -b <type>/<short-desc> origin/main
cd ../agentv.worktrees/<type>-<short-desc>
bun install
cp "$(git worktree list --porcelain | head -1 | sed 's/worktree //')/.env" .env 2>/dev/null || true
```

Keep the primary checkout clean. Do not push directly to `main`.

## Existing PR Takeover

1. Inspect the PR first:

```bash
gh pr view <number> --json number,title,state,isDraft,headRefName,headRefOid,baseRefName,mergeStateStatus,reviewDecision,statusCheckRollup,url
gh pr checks <number> --watch=false
```

2. In AO, claim with `ao session claim-pr <number-or-url>` and use the resulting worktree/branch. If the branch is already used by another worktree, do not force it; coordinate or `cd` into the existing worktree only when that is the safe continuation path.

3. Outside AO, check out the PR branch manually:

```bash
gh pr checkout <number>
# or: cd /path/to/existing/worktree
```

4. Push focused commits to the existing PR branch. Do not create a second PR for the same work.

## PRs and Pushing

After the first meaningful commit, push and open or update a PR. In AO, prefer the PR lifecycle requested by the orchestrator; otherwise open a draft PR for in-progress work.

```bash
git push -u origin HEAD
gh pr create --draft --title "<type>(scope): summary" --body "<summary and verification plan>"
```

Use conventional commit and PR titles: `type(scope): summary`.
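
As a rough illustration, a title can be shape-checked with a regular expression. This sketch assumes the scope is optional (as in the Conventional Commits convention) and that scopes are lowercase alphanumerics with hyphens; neither assumption is stated in this skill, and the function name is hypothetical.

```typescript
// Shape check only: lowercase type, optional parenthesised scope, colon, space, summary.
// The allowed scope characters are an assumption made for this sketch.
const TITLE_PATTERN = /^[a-z]+(\([a-z0-9-]+\))?: \S.*$/;

function isConventionalTitle(title: string): boolean {
  return TITLE_PATTERN.test(title);
}

const titleChecks = [
  isConventionalTitle("feat(cli): add output flag"), // matches type(scope): summary
  isConventionalTitle("Update stuff"), // no type prefix
];
```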

## PR Readiness

Keep the PR in draft until verification evidence is complete: unit tests, test plan evidence, manual red/green UAT for user-facing changes, CI green, no conflicts, and final review pass when warranted.

Before marking ready:

```bash
gh pr checks <number> --watch=false
gh pr view <number> --json isDraft,mergeStateStatus,reviewDecision,statusCheckRollup
```

## Merge and Cleanup

Merge only when you are explicitly responsible for merging, and use a squash merge:

```bash
gh pr merge <PR_NUMBER> --squash --delete-branch
```

After squash merge, do not continue pushing to the old branch. Start follow-up fixes from fresh `main`.

Before ending a session, ensure committed work is pushed and report the current state through AO when running under AO.
51 changes: 51 additions & 0 deletions .agents/skills/agentv-grader-changes/SKILL.md
@@ -0,0 +1,51 @@
---
name: agentv-grader-changes
description: Use when adding, modifying, renaming, parsing, or verifying AgentV graders/evaluators, assertion types, scoring behavior, thresholds, baseline files, or eval output shape.
---

# AgentV Grader Changes

## Type System

Grader types are kebab-case everywhere:

- YAML config: `llm-grader`, `is-json`, `execution-metrics`.
- Internal `EvaluatorKind`.
- Output `scores[].type`.
- Registry keys.

Source of truth: `EVALUATOR_KIND_VALUES` in `packages/core/src/evaluation/types.ts`.

Snake_case aliases can be accepted for backward compatibility through `normalizeGraderType()` in `grader-parser.ts`. SDK-facing `AssertionType` in `packages/eval/src/assertion.ts` must stay in sync.
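
The alias handling can be sketched as below. This is a hypothetical stand-in, not the actual `normalizeGraderType()` implementation, and the kind list is only the subset named in this skill; `EVALUATOR_KIND_VALUES` remains the source of truth.

```typescript
// Illustrative subset of kinds named in this skill; the authoritative list is EVALUATOR_KIND_VALUES.
const KNOWN_KINDS = ["llm-grader", "is-json", "execution-metrics"] as const;
type KindSketch = (typeof KNOWN_KINDS)[number];

// Hypothetical stand-in for normalizeGraderType(): accepts snake_case aliases
// and returns the canonical kebab-case kind, or undefined for unknown input.
function normalizeKindSketch(raw: string): KindSketch | undefined {
  const candidate = raw.trim().replace(/_/g, "-");
  return KNOWN_KINDS.find((kind) => kind === candidate);
}

const normalizedAlias = normalizeKindSketch("llm_grader");
```

Normalizing once at parse time means the registry, `EvaluatorKind`, and `scores[].type` only ever see the kebab-case form.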

## Verification

Unit tests are not enough for grader changes.

1. Ensure `.env` exists in the worktree.
2. Run an actual eval with a real example file:

```bash
bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id <test-id>
```

3. Inspect JSONL output:
- correct `scores[].type`
- expected score calculation
- assertions have `text`, `passed`, and optional `evidence`

4. Update `*.baseline.jsonl` files when output format changes.

`--dry-run` is useful for harness plumbing but returns mock scores and cannot validate grading quality.
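
The JSONL inspection in step 3 can be partly automated. The sketch below assumes each result line is a JSON object with a `scores` array and that assertions are nested under each score; that nesting and the helper name are assumptions, so adjust the paths to the actual output.

```typescript
const KEBAB_CASE = /^[a-z0-9]+(-[a-z0-9]+)*$/;

// Returns human-readable problems for one JSONL line; an empty list means the line looks right.
// Assumes assertions are nested under each score, which may not match the real output shape.
function checkResultLine(line: string): string[] {
  const problems: string[] = [];
  const record = JSON.parse(line) as { scores?: unknown };
  if (!Array.isArray(record.scores)) return ["missing scores array"];

  for (const score of record.scores as Array<Record<string, unknown>>) {
    if (typeof score.type !== "string" || !KEBAB_CASE.test(score.type)) {
      problems.push(`score type is not kebab-case: ${String(score.type)}`);
    }
    const assertions = Array.isArray(score.assertions) ? score.assertions : [];
    for (const assertion of assertions as Array<Record<string, unknown>>) {
      if (typeof assertion.text !== "string") problems.push("assertion missing text");
      if (typeof assertion.passed !== "boolean") problems.push("assertion missing passed");
      // `evidence` is optional, so its absence is not reported.
    }
  }
  return problems;
}

const sampleProblems = checkResultLine(
  '{"scores":[{"type":"llm_grader","assertions":[{"text":"mentions refund policy"}]}]}',
);
```

A check like this covers the output shape only; the score calculation still needs to be compared against expectations by hand.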

## Score Range Checks

For manual e2e score guardrails:

```bash
bun apps/cli/src/cli.ts eval examples/path/to/suite.eval.yaml --target azure \
--out examples/path/to/suite.results.jsonl
bun scripts/check-grader-scores.ts
```

Add `<eval-stem>.grader-scores.yaml` next to an eval when a new suite needs score-range assertions.
31 changes: 31 additions & 0 deletions .agents/skills/agentv-release-publishing/SKILL.md
@@ -0,0 +1,31 @@
---
name: agentv-release-publishing
description: Use when changing AgentV versioning, release automation, package publishing, npm package configuration, or release docs.
---

# AgentV Release and Publishing

## Versioning

Git commit history is the changelog. Use GitHub Actions for releases; do not publish manually from a local machine.

## Standard Release Flow

1. Run the Release workflow with `channel=next` and the desired bump. It creates `x.y.z-next.1`, commits, tags, and pushes.
2. The Publish workflow publishes to npm under the `next` tag.
3. Run the Release workflow with `channel=finalize`. It strips the prerelease suffix.
4. The Publish workflow publishes to npm under the `latest` tag.
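
The version arithmetic in this flow can be sketched as follows. This is not the logic in `scripts/release.ts`; the function names and the `major`/`minor`/`patch` bump values are assumptions made for illustration.

```typescript
// Bump names are an assumption; the skill only says "desired bump".
type Bump = "major" | "minor" | "patch";

function parseStable(version: string): [number, number, number] {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!match) throw new Error(`not a stable x.y.z version: ${version}`);
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

function bumpStable(version: string, bump: Bump): string {
  const [major, minor, patch] = parseStable(version);
  if (bump === "major") return `${major + 1}.0.0`;
  if (bump === "minor") return `${major}.${minor + 1}.0`;
  return `${major}.${minor}.${patch + 1}`;
}

// channel=next from a stable version: bump, then start the prerelease counter at 1.
function startNext(version: string, bump: Bump): string {
  return `${bumpStable(version, bump)}-next.1`;
}

// channel=finalize: strip the prerelease suffix.
function finalizeVersion(version: string): string {
  return version.replace(/-next\.\d+$/, "");
}

const nextVersion = startNext("1.4.2", "minor");
const finalVersion = finalizeVersion(nextVersion);
```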

## Direct Stable Release

Run the Release workflow with `channel=stable` and the desired bump. The Publish workflow then publishes to npm under the `latest` tag.

## Local Scripts

`bun scripts/release.ts` can inspect version state locally, but do not run `bun run publish` or `bun run publish:next` locally. npm publish uses OIDC trusted publishing from GitHub Actions.

## Packages

- `packages/core/` publishes `@agentv/core`.
- `apps/cli/` publishes `agentv`.
- tsup bundles workspace dependencies with `noExternal: ["@agentv/core"]`.
78 changes: 78 additions & 0 deletions .agents/skills/agentv-testing-verification/SKILL.md
@@ -0,0 +1,78 @@
---
name: agentv-testing-verification
description: Use when testing, verifying, debugging checks, changing CLI behavior, grader behavior, Studio UI/API behavior, docs site visuals, examples, or preparing an AgentV PR for review.
---

# AgentV Testing and Verification

## Pre-Push

The repo uses `prek` pre-push hooks. Do not manually run the full pre-push suite before pushing unless diagnosing a failure. Push to the feature branch and let the hook run:

- `bun run build`
- `bun run typecheck`
- `bun run lint`
- `bun run test`
- `bun run validate:examples`

Manual equivalent:

```bash
bunx prek run --all-files --hook-stage pre-push
```

## CLI Testing

Never use global `agentv` for functional testing. Use current source:

```bash
bun apps/cli/src/cli.ts <args>
```

If changes touch `packages/core/`, run `bun run build` first because the CLI imports `@agentv/core` from compiled `dist`.

For built output use `bun apps/cli/dist/cli.js <args>` or `bun agentv <args>`, but only after building.

## Studio UI

`agentv studio` serves `apps/studio/dist/`. Rebuild before UAT or screenshots:

```bash
cd apps/studio && bun run build
```

## Docs Browser E2E

Use `agent-browser` for docs site verification. Always pass `--session <name>` and do not use `--headed`.

If session launch hangs with EAGAIN on ARM64, pre-start Chrome with CDP and use `agent-browser --cdp 9222`.

## Browser Safety In Tests

Automated tests should not unexpectedly open a graphical browser. For browser-dependent behavior, prefer headless `agent-browser` verification or explicit opt-in test hooks. If adding code that can launch a browser, guard it behind environment checks or explicit user action.
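
One way to express such a guard is sketched below. The function name and the `AGENTV_NO_BROWSER` variable are hypothetical and not part of AgentV; `CI` is the variable most CI providers set.

```typescript
type EnvSketch = Record<string, string | undefined>;

// Decide whether code may launch a graphical browser.
function shouldOpenBrowser(env: EnvSketch, userRequested: boolean): boolean {
  if (!userRequested) return false; // require explicit user action
  if (env.CI) return false; // any non-empty CI value counts as an automated run
  if (env.AGENTV_NO_BROWSER === "1") return false; // hypothetical opt-out variable
  return true;
}

const allowedInCi = shouldOpenBrowser({ CI: "true" }, true);
```

Passing the environment in as an argument, rather than reading `process.env` inside the function, keeps the guard easy to test without touching global state.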

## Agent Provider Evals

Limit coding-agent provider eval concurrency to 3 targets at a time for `claude`, `claude-sdk`, `codex`, `copilot`, `copilot-sdk`, `pi`, and `pi-cli`. Lightweight LLM-only targets can use higher concurrency.
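
A minimal sketch of applying the limit is shown below. The helper names are hypothetical, and the limit for lightweight targets is passed in by the caller because this skill does not specify a value.

```typescript
const CODING_AGENT_TARGETS = new Set([
  "claude",
  "claude-sdk",
  "codex",
  "copilot",
  "copilot-sdk",
  "pi",
  "pi-cli",
]);
const CODING_AGENT_LIMIT = 3;

// Coding-agent targets are capped at 3; other targets use the caller's limit.
function concurrencyLimitFor(target: string, lightweightLimit: number): number {
  return CODING_AGENT_TARGETS.has(target) ? CODING_AGENT_LIMIT : lightweightLimit;
}

// Split targets into batches so that at most `size` run at the same time.
function batchTargets(targets: string[], size: number = CODING_AGENT_LIMIT): string[][] {
  const batches: string[][] = [];
  for (let i = 0; i < targets.length; i += size) {
    batches.push(targets.slice(i, i + size));
  }
  return batches;
}

const agentBatches = batchTargets([...CODING_AGENT_TARGETS]);
```

Running the seven listed targets this way yields three sequential batches, none larger than three.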

## Writing Tests

- Test new or changed behavior only.
- Prefer one test per distinct behavior.
- Avoid tests for obvious one-line behavior unless it is a regression risk.
- Regression tests matter more than broad happy-path duplication.
- Tests are executable contracts; update them when behavior promises change.
- Use table-driven tests when multiple cases exercise the same behavior.
- Use temporary directories/helpers for filesystem tests; do not write persistent test artifacts into the repo.
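
A table-driven test might look like the sketch below. The `clampScore` function is hypothetical, and the plain loop stands in for a test runner; in Vitest the same table could be fed to `it.each`.

```typescript
// Hypothetical function under test.
function clampScore(value: number): number {
  return Math.min(1, Math.max(0, value));
}

// One row per case; every row exercises the same behavior.
const clampCases = [
  { name: "below range", input: -0.5, expected: 0 },
  { name: "inside range", input: 0.25, expected: 0.25 },
  { name: "above range", input: 1.5, expected: 1 },
];

// Collect the names of failing rows so a failure points at the exact case.
const clampFailures = clampCases
  .filter(({ input, expected }) => clampScore(input) !== expected)
  .map(({ name }) => name);
```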

## Completion Checklist

Before marking a branch ready:

- Ensure `.env` exists in the worktree when evals or LLM-dependent tests may run.
- Run targeted tests while developing and rely on pre-push for the full suite.
- Complete manual red/green UAT for user-facing behavior before review readiness.
- Verify adjacent behavior where the change touches shared parsing, scoring, config, or UI paths.
- For scoring/grader changes, run at least one real eval with a live provider when feasible.
- For Studio UX/API changes, verify with browser testing.
- Document verification evidence in the PR.