docs(evidence): commit evidence/ artifacts referenced by README by Protocol-zero-0 · Pull Request #41 · Protocol-zero-0/evolution-kernel

Protocol-zero-0 · 2026-05-26T10:14:24Z

Why

The README has pointed to `evidence/` since v1.0 as "checked-in artifacts of runs anyone can reproduce", but the directory never landed. The link has been dead for readers of v1.0, v1.1.0, v1.1.1, and v1.1.2. This is cleanup ahead of publicizing the project.

What lands

`evidence/README.md` — top-level orientation; new reproducibility note pointing out the demo run used `claude_cli` (currently on main, shipping in v1.2.0); everything else reproducible against v1.1.2 today.
`evidence/demo-target-run-2026-05-14/` — full ledger of a real, self-halted kernel run against `examples/demo_target`. 47 ledger files + per-run README + config + raw `run.log`. Demonstrates planner reading history, evaluator rejecting useless patches, hard-stop firing ($0.11 / 3 runs / 276 tokens).
`evidence/repo-snapshot-2026-05-13/` — architecture, test transcript, and 5-minute terminal walkthrough captured at v0.3.0.

Stats: 54 files, 1132 insertions.

What's NOT in this PR

No code changes
No version bump
Paths inside the historical `evolution-cli.yml` are absolute and pre-bundled (captured on 2026-05-14). They are kept as-is because the file is a historical artifact, not a runnable example. The reproducibility note explains the situation.

Test plan

CI green on this PR
After merge, `evidence/` link in README points at real files

The README has pointed to `evidence/` since v1.0 ("checked-in artifacts of runs anyone can reproduce"), but the directory itself never landed: the link was dead for readers of v1.0 / v1.1.0 / v1.1.1 / v1.1.2.

This commit ships the two anchor pieces:

- `demo-target-run-2026-05-14/`: full ledger of a real, self-halted kernel run against examples/demo_target. 47 ledger files + README + config + raw run.log. Demonstrates planner reading history, evaluator rejecting useless patches, hard-stop firing ($0.11 / 3 runs / 276 tokens).
- `repo-snapshot-2026-05-13/`: architecture, test transcript, and 5-minute terminal walkthrough captured at v0.3.0.

Adds a reproducibility note pointing out that the demo run used `claude_cli` (currently on main, shipping in v1.2.0); everything else in evidence/ is reproducible against v1.1.2.

No code change. No version bump.

Copilot

Pull request overview

Adds the missing evidence/ directory that the main README has referenced since v1.0, checking in a curated set of reproducibility artifacts (repo snapshot + a real demo-target run ledger) intended to make the project’s claims auditable via concrete files.

Changes:

Add evidence/README.md as a top-level index/orientation for checked-in evidence.
Add evidence/repo-snapshot-2026-05-13/ snapshot docs (architecture summary, walkthrough, test transcript).
Add evidence/demo-target-run-2026-05-14/ case study + full per-run ledger artifacts and config used for the demo run.

Reviewed changes

Copilot reviewed 53 out of 54 changed files in this pull request and generated 11 comments.

File	Description
evidence/README.md	Index for evidence artifacts and reproducibility notes
evidence/repo-snapshot-2026-05-13/architecture.md	Repo architecture snapshot write-up (as of 2026-05-13)
evidence/repo-snapshot-2026-05-13/test-output.txt	Captured test transcript artifact
evidence/repo-snapshot-2026-05-13/walkthrough.md	Copy/paste walkthrough for inspecting repo anatomy
evidence/demo-target-run-2026-05-14/README.md	Narrative case study + reproduction steps for demo-target run
evidence/demo-target-run-2026-05-14/evolution-cli.yml	Config used for the recorded demo-target run
evidence/demo-target-run-2026-05-14/ledger/.evolution_state.json	Persisted run state summary for the demo run
evidence/demo-target-run-2026-05-14/ledger/accepted/current_commit.txt	Accepted-branch pointer recorded in the ledger
evidence/demo-target-run-2026-05-14/ledger/failed/0001-summary.json	Failure summary for run 0001
evidence/demo-target-run-2026-05-14/ledger/failed/0002-summary.json	Failure summary for run 0002
evidence/demo-target-run-2026-05-14/ledger/failed/0003-summary.json	Failure summary for run 0003
evidence/demo-target-run-2026-05-14/ledger/halted/20260514T041157Z.json	Halt reason + totals recorded at stop time
evidence/demo-target-run-2026-05-14/ledger/runs/0001/candidate_commit.txt	Candidate commit SHA record for run 0001
evidence/demo-target-run-2026-05-14/ledger/runs/0001/config.json	Per-run config snapshot (run 0001)
evidence/demo-target-run-2026-05-14/ledger/runs/0001/decision.json	Governor decision artifact (run 0001)
evidence/demo-target-run-2026-05-14/ledger/runs/0001/evaluation.json	Evaluator output artifact (run 0001)
evidence/demo-target-run-2026-05-14/ledger/runs/0001/evaluator_input.json	Evaluator input artifact (run 0001)
evidence/demo-target-run-2026-05-14/ledger/runs/0001/executor_input.json	Executor input artifact (run 0001)
evidence/demo-target-run-2026-05-14/ledger/runs/0001/executor_output.json	Executor output summary (run 0001)
evidence/demo-target-run-2026-05-14/ledger/runs/0001/executor_output.stdout.txt	Raw executor tool stdout capture (run 0001)
evidence/demo-target-run-2026-05-14/ledger/runs/0001/goal.json	Goal snapshot (run 0001)
evidence/demo-target-run-2026-05-14/ledger/runs/0001/observation.json	Evidence observation artifact (run 0001)
evidence/demo-target-run-2026-05-14/ledger/runs/0001/patch.diff	Patch diff artifact (run 0001)
evidence/demo-target-run-2026-05-14/ledger/runs/0001/plan.json	Planner output plan artifact (run 0001)
evidence/demo-target-run-2026-05-14/ledger/runs/0001/planner_input.json	Planner input artifact (run 0001)
evidence/demo-target-run-2026-05-14/ledger/runs/0001/reflection.json	Reflection/history entry artifact (run 0001)
evidence/demo-target-run-2026-05-14/ledger/runs/0002/candidate_commit.txt	Candidate commit SHA record for run 0002
evidence/demo-target-run-2026-05-14/ledger/runs/0002/config.json	Per-run config snapshot (run 0002)
evidence/demo-target-run-2026-05-14/ledger/runs/0002/decision.json	Governor decision artifact (run 0002)
evidence/demo-target-run-2026-05-14/ledger/runs/0002/evaluation.json	Evaluator output artifact (run 0002)
evidence/demo-target-run-2026-05-14/ledger/runs/0002/evaluator_input.json	Evaluator input artifact (run 0002)
evidence/demo-target-run-2026-05-14/ledger/runs/0002/executor_input.json	Executor input artifact (run 0002)
evidence/demo-target-run-2026-05-14/ledger/runs/0002/executor_output.json	Executor output summary (run 0002)
evidence/demo-target-run-2026-05-14/ledger/runs/0002/executor_output.stdout.txt	Raw executor tool stdout capture (run 0002)
evidence/demo-target-run-2026-05-14/ledger/runs/0002/goal.json	Goal snapshot (run 0002)
evidence/demo-target-run-2026-05-14/ledger/runs/0002/observation.json	Evidence observation artifact (run 0002)
evidence/demo-target-run-2026-05-14/ledger/runs/0002/patch.diff	Patch diff artifact (run 0002)
evidence/demo-target-run-2026-05-14/ledger/runs/0002/plan.json	Planner output plan artifact (run 0002)
evidence/demo-target-run-2026-05-14/ledger/runs/0002/planner_input.json	Planner input artifact (run 0002)
evidence/demo-target-run-2026-05-14/ledger/runs/0002/reflection.json	Reflection/history entry artifact (run 0002)
evidence/demo-target-run-2026-05-14/ledger/runs/0003/candidate_commit.txt	Candidate commit SHA record for run 0003
evidence/demo-target-run-2026-05-14/ledger/runs/0003/config.json	Per-run config snapshot (run 0003)
evidence/demo-target-run-2026-05-14/ledger/runs/0003/decision.json	Governor decision artifact (run 0003)
evidence/demo-target-run-2026-05-14/ledger/runs/0003/evaluation.json	Evaluator output artifact (run 0003)
evidence/demo-target-run-2026-05-14/ledger/runs/0003/evaluator_input.json	Evaluator input artifact (run 0003)
evidence/demo-target-run-2026-05-14/ledger/runs/0003/executor_input.json	Executor input artifact (run 0003)
evidence/demo-target-run-2026-05-14/ledger/runs/0003/executor_output.json	Executor output summary (run 0003)
evidence/demo-target-run-2026-05-14/ledger/runs/0003/executor_output.stdout.txt	Raw executor tool stdout capture (run 0003)
evidence/demo-target-run-2026-05-14/ledger/runs/0003/goal.json	Goal snapshot (run 0003)
evidence/demo-target-run-2026-05-14/ledger/runs/0003/observation.json	Evidence observation artifact (run 0003)
evidence/demo-target-run-2026-05-14/ledger/runs/0003/patch.diff	Patch diff artifact (run 0003; empty)
evidence/demo-target-run-2026-05-14/ledger/runs/0003/plan.json	Planner output plan artifact (run 0003)
evidence/demo-target-run-2026-05-14/ledger/runs/0003/planner_input.json	Planner input artifact (run 0003)
evidence/demo-target-run-2026-05-14/ledger/runs/0003/reflection.json	Reflection/history entry artifact (run 0003)
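
The per-run rows above follow one fixed set of artifact names, so a reader auditing the ledger could check each run directory against that set. The following is a minimal sketch, not part of the kernel: the function name `missing_run_artifacts` is hypothetical, and the expected file names are simply copied from the rows listed above.

```python
from pathlib import Path

# Artifact names copied from the per-run rows in the file summary above.
EXPECTED_RUN_FILES = frozenset({
    "candidate_commit.txt", "config.json", "decision.json", "evaluation.json",
    "evaluator_input.json", "executor_input.json", "executor_output.json",
    "executor_output.stdout.txt", "goal.json", "observation.json",
    "patch.diff", "plan.json", "planner_input.json", "reflection.json",
})


def missing_run_artifacts(ledger_dir):
    """Map each run id under <ledger_dir>/runs to the expected files it lacks.

    Hypothetical helper for auditing a ledger; runs with nothing missing
    are omitted from the result.
    """
    report = {}
    runs_dir = Path(ledger_dir) / "runs"
    for run_dir in sorted(p for p in runs_dir.iterdir() if p.is_dir()):
        present = {f.name for f in run_dir.iterdir() if f.is_file()}
        missing = sorted(EXPECTED_RUN_FILES - present)
        if missing:
            report[run_dir.name] = missing
    return report
```

Calling `missing_run_artifacts("evidence/demo-target-run-2026-05-14/ledger")` from a clone should return an empty dict if every run directory carries the full set.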

Comments suppressed due to low confidence (4)

evidence/repo-snapshot-2026-05-13/walkthrough.md:35
The “Expected last line” shows “83 passed in ~25s”, but the checked-in transcript ends with “83 passed in 41.30s”. Consider updating the expected timing (or removing it) to avoid confusing readers who compare output.

evidence/repo-snapshot-2026-05-13/walkthrough.md:62
`ls roles/` and `cat roles/planner.py` will fail in the current repo layout because the role scripts live under `evolution_kernel/roles/` (or can be referenced via `bundled:` in config). Either update the paths or add a note to check out the v0.3.0 snapshot commit before running these commands.

evidence/repo-snapshot-2026-05-13/walkthrough.md:53
The expected `wc -l evolution_kernel/governor.py` output is hard-coded to 673 lines, but the current file is 674 lines. If this walkthrough is meant to be copy-pasteable on current main, consider making this an approximate check (or pinning to a specific tag/commit).

evidence/repo-snapshot-2026-05-13/architecture.md:38
This snapshot lists role scripts under a top-level `roles/` directory and says role I/O lives at `roles/*.{input,output}.json`, but the current repo layout uses `evolution_kernel/roles/*` for the bundled scripts and the governor writes per-run artifacts like `planner_input.json` / `plan.json` directly under each run dir. Either pin the document to a specific commit/tag where `roles/` existed, or update the paths so the file references remain navigable today.
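
One way to act on the timing comment above is to compare only the pass count in the pytest summary line and ignore the wall-clock figure. This is a rough sketch under that assumption; `parse_pytest_summary` and `same_pass_count` are hypothetical names, and the regular expression only targets summary lines of the shape quoted in the comments.

```python
import re

# Matches e.g. "83 passed in 41.30s", "83 passed, 2 warnings in 41.30s",
# and the walkthrough's approximate form "83 passed in ~25s".
_SUMMARY = re.compile(r"(\d+) passed(?:, .*)? in ~?([\d.]+)s")


def parse_pytest_summary(line):
    """Return (passed_count, seconds) from a summary line, or None if absent."""
    match = _SUMMARY.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


def same_pass_count(expected_line, actual_line):
    """True when both lines parse and report the same number of passed tests."""
    expected = parse_pytest_summary(expected_line)
    actual = parse_pytest_summary(actual_line)
    if expected is None or actual is None:
        return False
    return expected[0] == actual[0]
```

With this check, the walkthrough's "83 passed in ~25s" and the transcript's "83 passed in 41.30s" compare as equal, while a run reporting 99 passed would not.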


+
+It exists alongside the README's "See it in action" section, which describes an *illustrative* GSM8K scenario (a design target, not a checked-in run). The artifacts here are different: they describe **what's actually inside this repo today**, anchored to specific files, tests, commits, and a real end-to-end run anyone can verify in under ten minutes.
+
+> **Reproducibility note.** The 2026-05-14 demo run used the `claude_cli` LLM provider, which is merged on `main` and will ship in **v1.2.0**. Until then, reproduce from a `git clone` of this repo. Everything else in this directory (architecture snapshot, walkthrough, ledger structure) is reproducible against v1.1.2 today.


+|---|---|
+| [`README.md`](demo-target-run-2026-05-14/README.md) | Narrated case study: setup, run-by-run timeline, why a self-halted run is stronger evidence than a successful one, exact reproduction commands. |
+| [`evolution-cli.yml`](demo-target-run-2026-05-14/evolution-cli.yml) | The exact config used (the new `claude_cli` provider; $10 budget; src/ scope). |
+| [`run.log`](demo-target-run-2026-05-14/run.log) | Raw kernel stdout/stderr from the run. |


+| File | What it shows |
+|---|---|
+| [`architecture.md`](repo-snapshot-2026-05-13/architecture.md) | The 8-module, ~1,800-LOC runtime; the 5 role scripts; the 4-stage PR history that built it. With exact file paths and line counts. |
+| [`test-output.txt`](repo-snapshot-2026-05-13/test-output.txt) | Raw transcript of `python3 -m pytest tests/ -v`. **83 tests pass in ~25s, no network calls.** (Counted before PR7a/PR7b landed; latest count is 99 — re-run the suite to verify.) |


+| Coding agent | `claude-code` (also `claude -p`, with `--permission-mode bypassPermissions` inside the kernel-managed worktree) |
+| Hard stops | `max_iterations: 10`, `max_consecutive_failures: 3`, `max_total_usd: 10.00`, `max_total_tokens: 1,000,000` |
+
+Full config: [`evolution-cli.yml`](evolution-cli.yml). Raw kernel stdout/stderr: [`run.log`](run.log).
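
To make the hard-stop row above concrete, here is a minimal sketch of how four such limits could be evaluated after each run. It is not the kernel's `hard_stops.py`: the class names, the function `first_tripped_stop`, the check order, and the `>=` comparison are all assumptions made for illustration. Only the limit names and values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    # Values taken from the "Hard stops" row above.
    max_iterations: int = 10
    max_consecutive_failures: int = 3
    max_total_usd: float = 10.00
    max_total_tokens: int = 1_000_000


@dataclass(frozen=True)
class Totals:
    # Hypothetical running totals a governor might track.
    iterations: int
    consecutive_failures: int
    total_usd: float
    total_tokens: int


def first_tripped_stop(totals, limits=Limits()):
    """Return the name of the first limit reached, or None to keep looping."""
    checks = [
        ("max_iterations", totals.iterations >= limits.max_iterations),
        ("max_consecutive_failures",
         totals.consecutive_failures >= limits.max_consecutive_failures),
        ("max_total_usd", totals.total_usd >= limits.max_total_usd),
        ("max_total_tokens", totals.total_tokens >= limits.max_total_tokens),
    ]
    for name, tripped in checks:
        if tripped:
            return name
    return None
```

Fed the totals reported for the demo run (3 runs, $0.11, 276 tokens) and assuming all three runs counted as consecutive failures, this sketch reports `max_consecutive_failures`; the authoritative halt reason is the one recorded under `ledger/halted/`.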


+## How to reproduce
+
+```bash
+git clone https://github.com/Protocol-zero-0/evolution-kernel.git
+cd evolution-kernel
+pip install -e .                          # (or set PYTHONPATH=$PWD if PEP 668 blocks)
+
+# Prepare a fresh demo target outside the repo
+cp -r examples/demo_target /tmp/demo-target
+bash /tmp/demo-target/setup.sh
+
+# Re-run with the same config used here
+PYTHONPATH=$PWD evolution-kernel \
+    --config evidence/demo-target-run-2026-05-14/evolution-cli.yml \
+    --repo   /tmp/demo-target \
+    --ledger /tmp/ek-rerun \
+    --loop


+| [`roles/planner.py`](../../roles/planner.py) | new `_call_claude_cli(prompt, model)` + `provider == "claude_cli"` branch |
+| [`roles/evaluator.py`](../../roles/evaluator.py) | inline `claude_cli` branch in `_call_llm` |
+| [`roles/goal_evaluator.py`](../../roles/goal_evaluator.py) | new helper + branch |
+| [`roles/strategist.py`](../../roles/strategist.py) | new helper + branch |
+| [`roles/executor.sh`](../../roles/executor.sh) | `claude -p` now invoked with `--permission-mode bypassPermissions` (governor's worktree is the trust boundary) |


+
+All four `_call_claude_cli` helpers shell out via `claude -p --model <m> --output-format json`, parse the JSON envelope, and return real `total_cost_usd` and `input_tokens + output_tokens`. The kernel's existing `max_total_usd` and `max_total_tokens` cost guards continue to work without modification — they just see a different LLM client underneath.
+
+All 99 existing tests still pass with these additions (see [`repo-snapshot-2026-05-13/test-output.txt`](../repo-snapshot-2026-05-13/test-output.txt) — that snapshot was taken before this PR's edits, but the suite was re-run after each edit during this session; final result: `99 passed in 25.93s`).
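
The helper behaviour described above (shell out, parse the JSON envelope, return cost and token totals) can be sketched roughly as follows. This is not the repository's `_call_claude_cli`: the function names are hypothetical, passing the prompt on stdin is an assumption, and so is the envelope layout (a top-level `result` and `total_cost_usd`, with token counts nested under `usage`).

```python
import json
import subprocess


def parse_claude_envelope(raw):
    """Extract (text, cost_usd, tokens) from a JSON envelope string.

    Assumed layout: {"result": ..., "total_cost_usd": ...,
    "usage": {"input_tokens": ..., "output_tokens": ...}}.
    Missing fields fall back to empty/zero values.
    """
    envelope = json.loads(raw)
    usage = envelope.get("usage", {})
    tokens = int(usage.get("input_tokens", 0)) + int(usage.get("output_tokens", 0))
    cost_usd = float(envelope.get("total_cost_usd", 0.0))
    return envelope.get("result", ""), cost_usd, tokens


def call_claude_cli(prompt, model):
    """Run `claude -p --model <m> --output-format json` and parse its stdout.

    Assumes the prompt is read from stdin; raises CalledProcessError
    if the CLI exits non-zero.
    """
    proc = subprocess.run(
        ["claude", "-p", "--model", model, "--output-format", "json"],
        input=prompt,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_claude_envelope(proc.stdout)
```

Keeping the parsing separate from the subprocess call means the cost and token accounting can be tested without network access or an installed CLI.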


+  planner:   ["python3", "/home/ubuntu/work/protocol-zero/evolution-kernel/roles/planner.py"]
+  executor:  ["bash",    "/home/ubuntu/work/protocol-zero/evolution-kernel/roles/executor.sh"]
+  evaluator: ["python3", "/home/ubuntu/work/protocol-zero/evolution-kernel/roles/evaluator.py"]
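
Because the recorded role commands above carry absolute paths from the capture machine, a reader adapting the config would need to rebase them onto a local checkout. The sketch below shows one way to do that; `rebase_command` is a hypothetical helper, and it only swaps the recorded prefix, so it does not account for role scripts having moved within the repo since the capture.

```python
from pathlib import PurePosixPath

# Prefix copied from the recorded config lines above.
RECORDED_ROOT = PurePosixPath("/home/ubuntu/work/protocol-zero/evolution-kernel")


def rebase_command(command, local_root, recorded_root=RECORDED_ROOT):
    """Rewrite arguments under recorded_root so they point under local_root.

    Arguments that are not paths below recorded_root (e.g. "python3")
    are returned unchanged.
    """
    rebased = []
    for arg in command:
        try:
            relative = PurePosixPath(arg).relative_to(recorded_root)
        except ValueError:
            rebased.append(arg)
            continue
        rebased.append(str(PurePosixPath(local_root) / relative))
    return rebased
```

For example, the recorded planner command rebased onto `/tmp/ek` becomes `["python3", "/tmp/ek/roles/planner.py"]`.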


+### 1. Verify the kernel works — run the test suite (~25s)
+
+```bash
+python3 -m pytest tests/ -v
+```


+## Runtime — 8 Python modules, ~1,800 LOC total
+
+```
+evolution_kernel/__init__.py        4 LOC
+evolution_kernel/cli.py           327 LOC   — argument parsing, --loop dispatch
+evolution_kernel/config.py        368 LOC   — evolution.yml schema + validation
+evolution_kernel/governor.py      673 LOC   — closed-loop orchestrator (zero LLM calls)
+evolution_kernel/hard_stops.py    132 LOC   — max_iterations / max_total_usd / max_total_tokens
+evolution_kernel/observer.py      100 LOC   — runs evidence_sources, normalizes output
+evolution_kernel/sandbox.py       100 LOC   — process sandbox (PR7a, work-in-progress)
+evolution_kernel/scope.py          70 LOC   — allowed_paths enforcement
+                                ─────────
+                                1,774 LOC
+```
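
As a quick arithmetic check on the listing above, the per-module counts can be summed directly. The figures are copied from the listing; the dictionary name is only illustrative.

```python
# Per-module line counts copied from the listing above.
MODULE_LOC = {
    "evolution_kernel/__init__.py": 4,
    "evolution_kernel/cli.py": 327,
    "evolution_kernel/config.py": 368,
    "evolution_kernel/governor.py": 673,
    "evolution_kernel/hard_stops.py": 132,
    "evolution_kernel/observer.py": 100,
    "evolution_kernel/sandbox.py": 100,
    "evolution_kernel/scope.py": 70,
}

total_loc = sum(MODULE_LOC.values())
print(f"{len(MODULE_LOC)} modules, {total_loc} LOC")  # 8 modules, 1774 LOC
```

The sum matches the 1,774 LOC total in the listing, which the heading rounds to ~1,800.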


Copilot AI review requested due to automatic review settings May 26, 2026 10:14

Copilot started reviewing on behalf of Protocol-zero-0 May 26, 2026 10:15 View session

Protocol-zero-0 merged commit b3a5d66 into main May 26, 2026
4 checks passed

Protocol-zero-0 deleted the chore/post-v112-evidence-and-docs branch May 26, 2026 10:15
