Shaelz · May 24, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,11 +2,13 @@
 
 ## Unreleased
 
-Post-v1 maintenance hardening only. No installer behavior change. No intended change to the canonical reusable orientation behavior beyond maintenance markers.
+Post-v1 maintenance hardening only. No installer behavior change.
 
 - Add a deterministic shared-rule drift guard for canonical `skills/codebase-orient/SKILL.md` vs the bootstrap embedded shared-rule snapshot in `skills/install-codebase-orient/SKILL.md`, including explicit shared-block markers, a validation script, and a GitHub Actions check that now guards those shared blocks against future accidental drift.
 - Align the bootstrap embedded shared-rule snapshot to canonical wording so future bootstrap-generated project-local skills receive the synchronized shared-rule content.
 - Freeze the former `docs/V1_RELEASE_PLAN.md` contents as `docs/releases/v1.0-validation-record.md` and keep `docs/V1_RELEASE_PLAN.md` as a short compatibility pointer now that `v1.0.0` has shipped.
+- Add an initial local/manual Codex behavioral eval scaffold: an inspectable prompt corpus plus a PowerShell wrapper over a Node runner that uses isolated disposable fixtures, stores traces outside the repo by default, and currently emits evidence summaries for one validated single-case vertical slice with proxy-only invocation evidence from local `codex exec --json` traces.
+- Clarify that the reusable canonical skill currently has an explicit tuned framework-probe section for SvelteKit, while other frameworks still rely on the generic discovery order unless later live-fire or eval evidence justifies dedicated tuned probes.
 
 ## 1.0.0 - 2026-05-24
 

diff --git a/README.md b/README.md
@@ -201,6 +201,34 @@ It tells the agent to:
 - report hidden risks such as stale docs, source-of-truth drift, generated-vs-source mismatches, and lifecycle traps.
 
 It is meant for broad or unfamiliar work. It is explicitly meant to be skipped for tiny, known, single-file edits.
+The reusable canonical skill currently includes an explicit tuned framework-probe section for SvelteKit. Other frameworks currently use the generic discovery order unless later live-fire or eval evidence justifies a dedicated tuned probe section. The separate Claude Code bootstrap skill has its own bootstrap-specific discovery helpers and should not be assumed to provide identical framework coverage.
+
+## Local behavioral evals
+
+This repo now includes a small local/manual Codex eval scaffold for behavioral checks on `codebase-orient`.
+
+- Prompt corpus: `evals/codebase-orient-behavioral-cases.json`
+- Runner: `scripts/run-behavioral-evals.ps1`
+- Default artifact location: `../codebase-orient-behavioral-eval-artifacts/`
+
+The PowerShell entrypoint is a thin wrapper over a dependency-free Node core, so maintainer use requires `node` to be available on `PATH`.
+
+Run the validated one-case vertical slice:
+
+```powershell
+powershell -ExecutionPolicy Bypass -File .\scripts\run-behavioral-evals.ps1 --case-id explicit-dry-run-unfamiliar
+```
+
+The current vertical slice executes one selected `single` case per invocation. `explicit-dry-run-unfamiliar` is the one fresh end-to-end case validated so far. The corpus also contains additional designed cases, including two-pass scenarios, which the current vertical slice does not yet fully execute or support. No representative multi-case subset command is currently implemented.
+
+The runner keeps disposable fixtures and raw traces outside the repository by default. It isolates the skill under test into a temporary `USERPROFILE\.agents\skills\codebase-orient` home so the eval uses the repo's current canonical skill content instead of a stale user-level install. The current vertical slice emits a structured evidence summary for maintainer review; it is not yet a representative multi-case run or an automatic pass/fail gate.
+
+Observable limits are intentional and documented:
+
+- `codex exec --json` provides deterministic filesystem and command-event evidence.
+- The proven dry-run case ran under a `read-only` sandbox, so its no-write result proves only that no files were written during that constrained run. It does not by itself prove voluntary no-write compliance under a writable sandbox.
+- The local traces from this scaffold did not expose a dedicated skill-selected event type.
+- Invocation and skip behavior are therefore reported as proxy evidence, not direct proof, based on the JSONL agent-message stream plus outcome traces.
 
 ## Running it again later
 

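The README text above says the current slice runs one selected `single` case per invocation and does not yet support the two-pass cases. A minimal sketch of that selection rule follows. It is not the actual runner, which is not part of this diff: `selectCase` and `sampleCorpus` are hypothetical names, and only the `id`, `execution`, and `sandbox` fields are taken from the corpus file.

```javascript
// Hypothetical helper; the real Node runner is not shown in this diff.
// Field names (id, execution, sandbox) match the corpus JSON.
function selectCase(corpus, caseId) {
  const found = corpus.find((c) => c.id === caseId);
  if (!found) {
    throw new Error(`Unknown case id: ${caseId}`);
  }
  // Assumption drawn from the README: only `single` cases are runnable today.
  if (found.execution !== "single") {
    throw new Error(
      `Case ${caseId} uses execution "${found.execution}", which this sketch does not support`
    );
  }
  return found;
}

// Two entries trimmed from the corpus for illustration.
const sampleCorpus = [
  { id: "explicit-dry-run-unfamiliar", execution: "single", sandbox: "read-only" },
  { id: "behavior-no-date-only-churn", execution: "two-pass-rerun", sandbox: "workspace-write" },
];

console.log(selectCase(sampleCorpus, "explicit-dry-run-unfamiliar").sandbox); // prints "read-only"
```

Failing loudly on an unknown id or an unsupported execution mode keeps a mistyped `--case-id` from looking like a clean run.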
diff --git a/evals/codebase-orient-behavioral-cases.json b/evals/codebase-orient-behavioral-cases.json
@@ -0,0 +1,102 @@
+[
+  {
+    "id": "explicit-dry-run-unfamiliar",
+    "category": "explicit-invocation",
+    "execution": "single",
+    "fixture": "basic-readme",
+    "sandbox": "read-only",
+    "prompt": "Use $codebase-orient in dry-run mode. Orient this unfamiliar repo and report proposed docs/ai changes without writing files.",
+    "expected_proxy_skill_mention": true,
+    "notes": "Deterministic checks: no docs/ai writes, no source edits, no commits. Skill invocation remains proxy evidence because the JSONL stream did not expose a dedicated skill-selected event in the exploratory probe."
+  },
+  {
+    "id": "implicit-unfamiliar-repo",
+    "category": "implicit-invocation",
+    "execution": "single",
+    "fixture": "basic-readme",
+    "sandbox": "read-only",
+    "prompt": "This is an unfamiliar repo. Please orient it before we plan broad changes, but keep this as report-only and do not write files.",
+    "expected_proxy_skill_mention": true,
+    "notes": "Proxy-only invocation evidence. Deterministic checks still verify no writes and no commits."
+  },
+  {
+    "id": "implicit-multifile-planning",
+    "category": "implicit-invocation",
+    "execution": "single",
+    "fixture": "basic-readme",
+    "sandbox": "read-only",
+    "prompt": "Before planning multi-file work, scan this repo, map the important surfaces, and report what docs/ai updates would be needed without writing files.",
+    "expected_proxy_skill_mention": true,
+    "notes": "Proxy-only invocation evidence."
+  },
+  {
+    "id": "implicit-stale-docs-refresh",
+    "category": "implicit-invocation",
+    "execution": "single",
+    "fixture": "basic-readme",
+    "sandbox": "read-only",
+    "prompt": "docs/ai is stale or missing. Refresh your repo understanding from source first, but only report proposed docs/ai changes in this run.",
+    "expected_proxy_skill_mention": true,
+    "notes": "Proxy-only invocation evidence."
+  },
+  {
+    "id": "implicit-handoff-prep",
+    "category": "implicit-invocation",
+    "execution": "single",
+    "fixture": "basic-readme",
+    "sandbox": "read-only",
+    "prompt": "Prepare an agent handoff for this repo. First orient the codebase, then report what docs/ai outputs would be produced, but do not write files.",
+    "expected_proxy_skill_mention": true,
+    "notes": "Proxy-only invocation evidence."
+  },
+  {
+    "id": "negative-readme-typo",
+    "category": "negative-invocation",
+    "execution": "single",
+    "fixture": "readme-typo",
+    "sandbox": "read-only",
+    "prompt": "Fix the typo in README.md by changing 'teh' to 'the'. Only touch that one line.",
+    "expected_proxy_skill_mention": false,
+    "notes": "Proxy-only skip evidence. Deterministic checks still verify no docs/ai writes and no commits."
+  },
+  {
+    "id": "negative-known-one-file-fix",
+    "category": "negative-invocation",
+    "execution": "single",
+    "fixture": "one-file-fix",
+    "sandbox": "read-only",
+    "prompt": "Fix the obvious typo in src/app.js where userNmae should be userName. Keep the change scoped to that file.",
+    "expected_proxy_skill_mention": false,
+    "notes": "Proxy-only skip evidence."
+  },
+  {
+    "id": "behavior-dry-run-no-writes",
+    "category": "behavioral-contract",
+    "execution": "single",
+    "fixture": "basic-readme",
+    "sandbox": "read-only",
+    "prompt": "Use $codebase-orient in dry-run mode. Report the docs/ai changes you would make, but do not write files.",
+    "expected_proxy_skill_mention": true,
+    "notes": "Deterministic checks: docs/ai absent after run, source unchanged, no commits."
+  },
+  {
+    "id": "behavior-no-date-only-churn",
+    "category": "behavioral-contract",
+    "execution": "two-pass-rerun",
+    "fixture": "basic-readme",
+    "sandbox": "workspace-write",
+    "prompt": "Use $codebase-orient to orient this repo normally.",
+    "expected_proxy_skill_mention": true,
+    "notes": "Pass 1 creates docs/ai. Pass 2 reruns against the now-current fixture. Deterministic check: docs/ai hashes remain unchanged on the second pass."
+  },
+  {
+    "id": "behavior-stale-docs-source-drift",
+    "category": "behavioral-contract",
+    "execution": "two-pass-source-drift",
+    "fixture": "basic-readme",
+    "sandbox": "workspace-write",
+    "prompt": "Use $codebase-orient to orient this repo normally.",
+    "expected_proxy_skill_mention": true,
+    "notes": "Pass 1 creates docs/ai. The harness then mutates README.md outside Codex. Pass 2 reruns and checks that at least one docs/ai file changes. This is outcome/proxy evidence for stale-cache verification, not direct proof of why the model changed it."
+  }
+]
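
The notes on the two-pass cases describe deterministic checks on `docs/ai` file hashes: unchanged after a rerun, and at least one changed after source drift. One way a harness could implement that comparison is sketched below. This is an illustration under assumptions, not the repository's runner: `snapshotHashes` and `changedPaths` are hypothetical names, and SHA-256 is an assumed choice of hash.

```javascript
const crypto = require("crypto");
const fs = require("fs");
const path = require("path");

// Hypothetical helper: maps each file under rootDir (by relative path)
// to the SHA-256 hex digest of its bytes.
function snapshotHashes(rootDir) {
  const hashes = {};
  if (!fs.existsSync(rootDir)) return hashes; // absent directory gives an empty snapshot
  const walk = (dir) => {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        walk(full);
      } else if (entry.isFile()) {
        // Normalize separators so snapshots compare equal across platforms.
        const rel = path.relative(rootDir, full).split(path.sep).join("/");
        hashes[rel] = crypto.createHash("sha256").update(fs.readFileSync(full)).digest("hex");
      }
    }
  };
  walk(rootDir);
  return hashes;
}

// Returns the sorted relative paths that were added, removed, or modified.
function changedPaths(before, after) {
  const all = new Set([...Object.keys(before), ...Object.keys(after)]);
  return [...all].filter((p) => before[p] !== after[p]).sort();
}
```

Under these assumptions, a `two-pass-rerun` check would expect `changedPaths` to return an empty array between pass 1 and pass 2, while a `two-pass-source-drift` check would expect at least one path. The same snapshot taken before and after a dry run would cover the "docs/ai absent after run" check, since a missing directory yields an empty snapshot.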