Merged
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -2,11 +2,13 @@

## Unreleased

Post-v1 maintenance hardening only. No installer behavior change. No intended change to the canonical reusable orientation behavior beyond maintenance markers.
Post-v1 maintenance hardening only. No installer behavior change.

- Add a deterministic shared-rule drift guard for canonical `skills/codebase-orient/SKILL.md` vs the bootstrap embedded shared-rule snapshot in `skills/install-codebase-orient/SKILL.md`, including explicit shared-block markers, a validation script, and a GitHub Actions check that now guards those shared blocks against future accidental drift.
- Align the bootstrap embedded shared-rule snapshot to canonical wording so future bootstrap-generated project-local skills receive the synchronized shared-rule content.
- Freeze the former `docs/V1_RELEASE_PLAN.md` contents as `docs/releases/v1.0-validation-record.md` and keep `docs/V1_RELEASE_PLAN.md` as a short compatibility pointer now that `v1.0.0` has shipped.
- Add an initial local/manual Codex behavioral eval scaffold: an inspectable prompt corpus plus a PowerShell wrapper over a Node runner that uses isolated disposable fixtures, stores traces outside the repo by default, and currently emits evidence summaries for one validated single-case vertical slice with proxy-only invocation evidence from local `codex exec --json` traces.
- Clarify that the reusable canonical skill currently has an explicit tuned framework-probe section for SvelteKit, while other frameworks still rely on the generic discovery order unless later live-fire or eval evidence justifies dedicated tuned probes.

## 1.0.0 - 2026-05-24

28 changes: 28 additions & 0 deletions README.md
@@ -201,6 +201,34 @@ It tells the agent to:
- report hidden risks such as stale docs, source-of-truth drift, generated-vs-source mismatches, and lifecycle traps.

It is meant for broad or unfamiliar work. It is explicitly meant to be skipped for tiny, known, single-file edits.
The reusable canonical skill currently includes an explicit tuned framework-probe section for SvelteKit. Other frameworks currently use the generic discovery order unless later live-fire or eval evidence justifies a dedicated tuned probe section. The separate Claude Code bootstrap skill has its own bootstrap-specific discovery helpers and should not be treated as identical framework coverage.

## Local behavioral evals

This repo now includes a small local/manual Codex eval scaffold for behavioral checks on `codebase-orient`.

- Prompt corpus: `evals/codebase-orient-behavioral-cases.json`
- Runner: `scripts/run-behavioral-evals.ps1`
- Default artifact location: `../codebase-orient-behavioral-eval-artifacts/`

The PowerShell entrypoint is a thin wrapper over a dependency-free Node core, so maintainer use requires `node` to be available on `PATH`.

Run the validated one-case vertical slice:

```powershell
powershell -ExecutionPolicy Bypass -File .\scripts\run-behavioral-evals.ps1 --case-id explicit-dry-run-unfamiliar
```

The current vertical slice executes one selected `single` case per invocation. `explicit-dry-run-unfamiliar` is the only fresh end-to-end case validated so far. The corpus also contains additional designed cases, including two-pass scenarios, not all of which the current vertical slice executes or supports yet. No representative multi-case subset command is currently implemented.
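
As a rough illustration of the one-case-per-invocation rule, the sketch below picks a case by `id` and rejects anything whose `execution` is not `single`. The runner's actual internals are not shown in this change, so `selectSingleCase` and `sampleCorpus` are hypothetical names, and the corpus entries are trimmed to the fields the sketch needs.

```javascript
// Hypothetical helper, not the runner's real code: select exactly one case
// and refuse execution modes other than "single".
function selectSingleCase(corpus, caseId) {
  const match = corpus.find((entry) => entry.id === caseId);
  if (!match) {
    throw new Error(`Unknown case id: ${caseId}`);
  }
  if (match.execution !== "single") {
    throw new Error(
      `Case ${caseId} uses execution "${match.execution}", which this sketch does not support`
    );
  }
  return match;
}

// Trimmed stand-in for evals/codebase-orient-behavioral-cases.json.
const sampleCorpus = [
  { id: "explicit-dry-run-unfamiliar", execution: "single", sandbox: "read-only" },
  { id: "behavior-no-date-only-churn", execution: "two-pass-rerun", sandbox: "workspace-write" },
];

const selected = selectSingleCase(sampleCorpus, "explicit-dry-run-unfamiliar");
console.log(selected.sandbox);
```

Failing loudly on a two-pass case, rather than silently running only its first pass, keeps the evidence summary from overstating what was exercised.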

The runner keeps disposable fixtures and raw traces outside the repository by default. It isolates the skill under test into a temporary `USERPROFILE\.agents\skills\codebase-orient` home so the eval uses the repo's current canonical skill content instead of a stale user-level install. The current vertical slice emits a structured evidence summary for maintainer review; it is not yet a representative multi-case or automatic pass/fail gate.
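
The isolation step could look roughly like the following sketch, which copies a skill directory into a fresh temporary home laid out as `.agents/skills/codebase-orient`. This is an assumption-laden illustration, not the runner's implementation: `createIsolatedSkillHome` and `childEnvFor` are hypothetical helpers, the placeholder `SKILL.md` stands in for the repo's canonical skill, and it assumes the child process resolves user-level skills from `USERPROFILE`. It needs a Node version that provides `fs.cpSync`.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: copy a skill directory into a disposable home at
// <home>/.agents/skills/codebase-orient.
function createIsolatedSkillHome(skillSourceDir) {
  const home = fs.mkdtempSync(path.join(os.tmpdir(), "orient-eval-home-"));
  const target = path.join(home, ".agents", "skills", "codebase-orient");
  fs.mkdirSync(path.dirname(target), { recursive: true });
  fs.cpSync(skillSourceDir, target, { recursive: true });
  return { home, target };
}

// Assumption: the process under test looks up user-level skills via USERPROFILE.
function childEnvFor(home) {
  return { ...process.env, USERPROFILE: home };
}

// Demo with a placeholder skill directory instead of the real repo content.
const demoSource = fs.mkdtempSync(path.join(os.tmpdir(), "orient-skill-src-"));
fs.writeFileSync(path.join(demoSource, "SKILL.md"), "# placeholder skill\n");

const isolated = createIsolatedSkillHome(demoSource);
const copied = fs.existsSync(path.join(isolated.target, "SKILL.md"));
const childEnv = childEnvFor(isolated.home);
console.log(copied, childEnv.USERPROFILE === isolated.home);

// The home is disposable: remove it once the run is over.
fs.rmSync(isolated.home, { recursive: true, force: true });
fs.rmSync(demoSource, { recursive: true, force: true });
```

Passing the overridden environment only to the child process leaves the maintainer's own user-level install untouched.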

Observable limits are intentional and documented:

- `codex exec --json` provides deterministic filesystem and command-event evidence.
- The proven dry-run case ran under a `read-only` sandbox, so its no-write result deterministically proves no files were written during that constrained run only. It does not by itself prove voluntary no-write compliance under a writable sandbox.
- The local traces from this scaffold did not expose a dedicated skill-selected event type.
- Invocation and skip behavior are therefore reported as proxy evidence, not direct proof, based on the JSONL agent-message stream plus outcome traces.
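
To make the proxy-versus-proof distinction concrete, here is a minimal sketch of how a skill mention might be derived from a JSONL trace and compared with a case's `expected_proxy_skill_mention`. The event schema of `codex exec --json` is not shown in this change, so the sketch deliberately avoids assuming field names and scans every string value in each parsed line. That is broader than the agent-message stream described above, and an echoed prompt containing the skill name would also count. `proxySkillMention`, `compareWithExpectation`, and the sample trace lines are all hypothetical.

```javascript
// Collect every string value nested anywhere inside a parsed JSON value.
function collectStrings(value, out = []) {
  if (typeof value === "string") {
    out.push(value);
  } else if (Array.isArray(value)) {
    value.forEach((item) => collectStrings(item, out));
  } else if (value && typeof value === "object") {
    Object.values(value).forEach((item) => collectStrings(item, out));
  }
  return out;
}

// Hypothetical check: does any parsed line mention the skill name?
// The result is labelled "proxy" because a mention is not a selection event.
function proxySkillMention(jsonlText, skillName) {
  let mentioned = false;
  let unparsedLines = 0;
  for (const line of jsonlText.split(/\r?\n/)) {
    if (line.trim() === "") continue;
    let event;
    try {
      event = JSON.parse(line);
    } catch {
      unparsedLines += 1; // keep going, but surface the count to the reviewer
      continue;
    }
    if (collectStrings(event).some((text) => text.includes(skillName))) {
      mentioned = true;
    }
  }
  return { evidence: "proxy", mentioned, unparsedLines };
}

function compareWithExpectation(testCase, result) {
  return {
    id: testCase.id,
    evidence: result.evidence,
    matchesExpectation: result.mentioned === testCase.expected_proxy_skill_mention,
  };
}

// Made-up trace lines for illustration only; not real codex output.
const sampleTrace = [
  JSON.stringify({ kind: "message", body: "Using codebase-orient in dry-run mode." }),
  "not json",
  JSON.stringify({ kind: "command", argv: ["git", "status"] }),
].join("\n");

const result = proxySkillMention(sampleTrace, "codebase-orient");
const summary = compareWithExpectation(
  { id: "explicit-dry-run-unfamiliar", expected_proxy_skill_mention: true },
  result
);
console.log(summary);
```

Carrying the `evidence: "proxy"` label through to the summary keeps a reviewer from reading a text match as proof that the skill was selected.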

## Running it again later

102 changes: 102 additions & 0 deletions evals/codebase-orient-behavioral-cases.json
@@ -0,0 +1,102 @@
[
{
"id": "explicit-dry-run-unfamiliar",
"category": "explicit-invocation",
"execution": "single",
"fixture": "basic-readme",
"sandbox": "read-only",
"prompt": "Use $codebase-orient in dry-run mode. Orient this unfamiliar repo and report proposed docs/ai changes without writing files.",
"expected_proxy_skill_mention": true,
"notes": "Deterministic checks: no docs/ai writes, no source edits, no commits. Skill invocation remains proxy evidence because the JSONL stream did not expose a dedicated skill-selected event in the exploratory probe."
},
{
"id": "implicit-unfamiliar-repo",
"category": "implicit-invocation",
"execution": "single",
"fixture": "basic-readme",
"sandbox": "read-only",
"prompt": "This is an unfamiliar repo. Please orient it before we plan broad changes, but keep this as report-only and do not write files.",
"expected_proxy_skill_mention": true,
"notes": "Proxy-only invocation evidence. Deterministic checks still verify no writes and no commits."
},
{
"id": "implicit-multifile-planning",
"category": "implicit-invocation",
"execution": "single",
"fixture": "basic-readme",
"sandbox": "read-only",
"prompt": "Before planning multi-file work, scan this repo, map the important surfaces, and report what docs/ai updates would be needed without writing files.",
"expected_proxy_skill_mention": true,
"notes": "Proxy-only invocation evidence."
},
{
"id": "implicit-stale-docs-refresh",
"category": "implicit-invocation",
"execution": "single",
"fixture": "basic-readme",
"sandbox": "read-only",
"prompt": "docs/ai is stale or missing. Refresh your repo understanding from source first, but only report proposed docs/ai changes in this run.",
"expected_proxy_skill_mention": true,
"notes": "Proxy-only invocation evidence."
},
{
"id": "implicit-handoff-prep",
"category": "implicit-invocation",
"execution": "single",
"fixture": "basic-readme",
"sandbox": "read-only",
"prompt": "Prepare an agent handoff for this repo. First orient the codebase, then report what docs/ai outputs would be produced, but do not write files.",
"expected_proxy_skill_mention": true,
"notes": "Proxy-only invocation evidence."
},
{
"id": "negative-readme-typo",
"category": "negative-invocation",
"execution": "single",
"fixture": "readme-typo",
"sandbox": "read-only",
"prompt": "Fix the typo in README.md by changing 'teh' to 'the'. Only touch that one line.",
"expected_proxy_skill_mention": false,
"notes": "Proxy-only skip evidence. Deterministic checks still verify no docs/ai writes and no commits."
},
{
"id": "negative-known-one-file-fix",
"category": "negative-invocation",
"execution": "single",
"fixture": "one-file-fix",
"sandbox": "read-only",
"prompt": "Fix the obvious typo in src/app.js where userNmae should be userName. Keep the change scoped to that file.",
"expected_proxy_skill_mention": false,
"notes": "Proxy-only skip evidence."
},
{
"id": "behavior-dry-run-no-writes",
"category": "behavioral-contract",
"execution": "single",
"fixture": "basic-readme",
"sandbox": "read-only",
"prompt": "Use $codebase-orient in dry-run mode. Report the docs/ai changes you would make, but do not write files.",
"expected_proxy_skill_mention": true,
"notes": "Deterministic checks: docs/ai absent after run, source unchanged, no commits."
},
{
"id": "behavior-no-date-only-churn",
"category": "behavioral-contract",
"execution": "two-pass-rerun",
"fixture": "basic-readme",
"sandbox": "workspace-write",
"prompt": "Use $codebase-orient to orient this repo normally.",
"expected_proxy_skill_mention": true,
"notes": "Pass 1 creates docs/ai. Pass 2 reruns against the now-current fixture. Deterministic check: docs/ai hashes remain unchanged on the second pass."
},
{
"id": "behavior-stale-docs-source-drift",
"category": "behavioral-contract",
"execution": "two-pass-source-drift",
"fixture": "basic-readme",
"sandbox": "workspace-write",
"prompt": "Use $codebase-orient to orient this repo normally.",
"expected_proxy_skill_mention": true,
"notes": "Pass 1 creates docs/ai. The harness then mutates README.md outside Codex. Pass 2 reruns and checks that at least one docs/ai file changes. This is outcome/proxy evidence for stale-cache verification, not direct proof of why the model changed it."
}
]
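
The two-pass cases above describe hash-based deterministic checks: `docs/ai` hashes unchanged on a rerun, and at least one `docs/ai` file changed after source drift. A minimal sketch of such a comparison follows. The current vertical slice does not yet run these cases, so this is only one way the check could be written; `hashTree` and `changedPaths` are hypothetical names, and the demo uses a temporary directory rather than a real fixture.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Map each file under rootDir (by forward-slash relative path) to its SHA-256.
function hashTree(rootDir) {
  const hashes = {};
  if (!fs.existsSync(rootDir)) return hashes;
  const walk = (dir) => {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        walk(full);
      } else if (entry.isFile()) {
        const rel = path.relative(rootDir, full).split(path.sep).join("/");
        hashes[rel] = crypto.createHash("sha256").update(fs.readFileSync(full)).digest("hex");
      }
    }
  };
  walk(rootDir);
  return hashes;
}

// Paths that were added, removed, or modified between two snapshots.
function changedPaths(before, after) {
  const all = new Set([...Object.keys(before), ...Object.keys(after)]);
  return [...all].filter((p) => before[p] !== after[p]).sort();
}

// Demo: a stand-in docs/ai directory inside a disposable temp folder.
const workdir = fs.mkdtempSync(path.join(os.tmpdir(), "orient-hash-demo-"));
const docsAi = path.join(workdir, "docs", "ai");
fs.mkdirSync(docsAi, { recursive: true });
fs.writeFileSync(path.join(docsAi, "overview.md"), "first pass\n");

const pass1 = hashTree(docsAi);
const unchanged = changedPaths(pass1, hashTree(docsAi)); // rerun with no edits

fs.writeFileSync(path.join(docsAi, "overview.md"), "second pass\n");
const drifted = changedPaths(pass1, hashTree(docsAi)); // content was rewritten

console.log(unchanged.length, drifted);
fs.rmSync(workdir, { recursive: true, force: true });
```

Hashing file contents rather than comparing modification times means a rewrite that produces identical bytes counts as unchanged, which is what a no-churn check would want.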