docs(evidence): commit evidence/ artifacts referenced by README #41
Merged
The README has pointed to `evidence/` since v1.0 ("checked-in artifacts
of runs anyone can reproduce"), but the directory itself never landed, so
the link has been dead for readers of v1.0, v1.1.0, v1.1.1, and v1.1.2.
This commit ships the two anchor pieces:
- `demo-target-run-2026-05-14/` — full ledger of a real, self-halted
kernel run against examples/demo_target. 47 ledger files + README +
config + raw run.log. Demonstrates planner reading history, evaluator
rejecting useless patches, hard-stop firing — $0.11 / 3 runs / 276
tokens.
- `repo-snapshot-2026-05-13/` — architecture, test transcript, and
5-minute terminal walkthrough captured at v0.3.0.
Adds a reproducibility note pointing out that the demo run used
`claude_cli` (currently on main, shipping in v1.2.0); everything else
in evidence/ is reproducible against v1.1.2.
No code change. No version bump.
Pull request overview
Adds the missing evidence/ directory that the main README has referenced since v1.0, checking in a curated set of reproducibility artifacts (repo snapshot + a real demo-target run ledger) intended to make the project’s claims auditable via concrete files.
Changes:
- Add `evidence/README.md` as a top-level index/orientation for checked-in evidence.
- Add `evidence/repo-snapshot-2026-05-13/` snapshot docs (architecture summary, walkthrough, test transcript).
- Add `evidence/demo-target-run-2026-05-14/` case study + full per-run ledger artifacts and config used for the demo run.
Reviewed changes
Copilot reviewed 53 out of 54 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| evidence/README.md | Index for evidence artifacts and reproducibility notes |
| evidence/repo-snapshot-2026-05-13/architecture.md | Repo architecture snapshot write-up (as of 2026-05-13) |
| evidence/repo-snapshot-2026-05-13/test-output.txt | Captured test transcript artifact |
| evidence/repo-snapshot-2026-05-13/walkthrough.md | Copy/paste walkthrough for inspecting repo anatomy |
| evidence/demo-target-run-2026-05-14/README.md | Narrative case study + reproduction steps for demo-target run |
| evidence/demo-target-run-2026-05-14/evolution-cli.yml | Config used for the recorded demo-target run |
| evidence/demo-target-run-2026-05-14/ledger/.evolution_state.json | Persisted run state summary for the demo run |
| evidence/demo-target-run-2026-05-14/ledger/accepted/current_commit.txt | Accepted-branch pointer recorded in the ledger |
| evidence/demo-target-run-2026-05-14/ledger/failed/0001-summary.json | Failure summary for run 0001 |
| evidence/demo-target-run-2026-05-14/ledger/failed/0002-summary.json | Failure summary for run 0002 |
| evidence/demo-target-run-2026-05-14/ledger/failed/0003-summary.json | Failure summary for run 0003 |
| evidence/demo-target-run-2026-05-14/ledger/halted/20260514T041157Z.json | Halt reason + totals recorded at stop time |
| evidence/demo-target-run-2026-05-14/ledger/runs/0001/candidate_commit.txt | Candidate commit SHA record for run 0001 |
| evidence/demo-target-run-2026-05-14/ledger/runs/0001/config.json | Per-run config snapshot (run 0001) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0001/decision.json | Governor decision artifact (run 0001) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0001/evaluation.json | Evaluator output artifact (run 0001) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0001/evaluator_input.json | Evaluator input artifact (run 0001) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0001/executor_input.json | Executor input artifact (run 0001) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0001/executor_output.json | Executor output summary (run 0001) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0001/executor_output.stdout.txt | Raw executor tool stdout capture (run 0001) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0001/goal.json | Goal snapshot (run 0001) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0001/observation.json | Evidence observation artifact (run 0001) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0001/patch.diff | Patch diff artifact (run 0001) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0001/plan.json | Planner output plan artifact (run 0001) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0001/planner_input.json | Planner input artifact (run 0001) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0001/reflection.json | Reflection/history entry artifact (run 0001) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0002/candidate_commit.txt | Candidate commit SHA record for run 0002 |
| evidence/demo-target-run-2026-05-14/ledger/runs/0002/config.json | Per-run config snapshot (run 0002) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0002/decision.json | Governor decision artifact (run 0002) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0002/evaluation.json | Evaluator output artifact (run 0002) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0002/evaluator_input.json | Evaluator input artifact (run 0002) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0002/executor_input.json | Executor input artifact (run 0002) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0002/executor_output.json | Executor output summary (run 0002) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0002/executor_output.stdout.txt | Raw executor tool stdout capture (run 0002) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0002/goal.json | Goal snapshot (run 0002) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0002/observation.json | Evidence observation artifact (run 0002) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0002/patch.diff | Patch diff artifact (run 0002) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0002/plan.json | Planner output plan artifact (run 0002) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0002/planner_input.json | Planner input artifact (run 0002) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0002/reflection.json | Reflection/history entry artifact (run 0002) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0003/candidate_commit.txt | Candidate commit SHA record for run 0003 |
| evidence/demo-target-run-2026-05-14/ledger/runs/0003/config.json | Per-run config snapshot (run 0003) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0003/decision.json | Governor decision artifact (run 0003) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0003/evaluation.json | Evaluator output artifact (run 0003) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0003/evaluator_input.json | Evaluator input artifact (run 0003) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0003/executor_input.json | Executor input artifact (run 0003) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0003/executor_output.json | Executor output summary (run 0003) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0003/executor_output.stdout.txt | Raw executor tool stdout capture (run 0003) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0003/goal.json | Goal snapshot (run 0003) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0003/observation.json | Evidence observation artifact (run 0003) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0003/patch.diff | Patch diff artifact (run 0003; empty) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0003/plan.json | Planner output plan artifact (run 0003) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0003/planner_input.json | Planner input artifact (run 0003) |
| evidence/demo-target-run-2026-05-14/ledger/runs/0003/reflection.json | Reflection/history entry artifact (run 0003) |
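
The per-run layout in the table above is regular enough to check mechanically. The following is a minimal sketch, not part of the PR: it assumes a ledger directory with a `runs/` subdirectory as listed above, and the function name `missing_run_artifacts` is hypothetical. The expected file names are copied from the table.

```python
from pathlib import Path

# Per-run artifact names, copied from the file summary table above.
EXPECTED_RUN_FILES = (
    "candidate_commit.txt",
    "config.json",
    "decision.json",
    "evaluation.json",
    "evaluator_input.json",
    "executor_input.json",
    "executor_output.json",
    "executor_output.stdout.txt",
    "goal.json",
    "observation.json",
    "patch.diff",
    "plan.json",
    "planner_input.json",
    "reflection.json",
)


def missing_run_artifacts(ledger_dir):
    """Map each run id under <ledger_dir>/runs to the expected files it lacks.

    Hypothetical helper for illustration; raises FileNotFoundError if the
    runs/ directory does not exist.
    """
    runs_dir = Path(ledger_dir) / "runs"
    missing = {}
    for run_dir in sorted(p for p in runs_dir.iterdir() if p.is_dir()):
        absent = [name for name in EXPECTED_RUN_FILES if not (run_dir / name).is_file()]
        if absent:
            missing[run_dir.name] = absent
    return missing
```

Called as `missing_run_artifacts("evidence/demo-target-run-2026-05-14/ledger")` from a checkout, an empty dict would mean every run directory carries all 14 listed artifacts.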
Comments suppressed due to low confidence (4)
evidence/repo-snapshot-2026-05-13/walkthrough.md:35
- The “Expected last line” shows “83 passed in ~25s”, but the checked-in transcript ends with “83 passed in 41.30s”. Consider updating the expected timing (or removing it) to avoid confusing readers who compare output.
evidence/repo-snapshot-2026-05-13/walkthrough.md:62
- `ls roles/` and `cat roles/planner.py` will fail in the current repo layout because the role scripts live under `evolution_kernel/roles/` (or can be referenced via `bundled:` in config). Either update the paths or add a note to check out the v0.3.0 snapshot commit before running these commands.
evidence/repo-snapshot-2026-05-13/walkthrough.md:53
- The expected `wc -l evolution_kernel/governor.py` output is hard-coded to 673 lines, but the current file is 674 lines. If this walkthrough is meant to be copy-pasteable on current main, consider making this an approximate check (or pinning to a specific tag/commit).
evidence/repo-snapshot-2026-05-13/architecture.md:38
- This snapshot lists role scripts under a top-level `roles/` directory and says role I/O lives at `roles/*.{input,output}.json`, but the current repo layout uses `evolution_kernel/roles/*` for the bundled scripts and the governor writes per-run artifacts like `planner_input.json`/`plan.json` directly under each run dir. Either pin the document to a specific commit/tag where `roles/` existed, or update the paths so the file references remain navigable today.
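
One way to act on the timing suggestion is to compare only the pass count and ignore the wall-clock figure. This is a rough sketch under that assumption; `passed_count` is a hypothetical helper, and the pattern only targets summary lines shaped like the ones quoted above.

```python
import re

# Matches the tail of a pytest summary such as "83 passed in 41.30s" or
# "83 passed, 2 warnings in 41.30s".
_SUMMARY = re.compile(r"(\d+) passed\b.*? in ([\d.]+)s")


def passed_count(transcript):
    """Return the pass count from the last matching summary line, or None."""
    matches = _SUMMARY.findall(transcript)
    if not matches:
        return None
    count, _seconds = matches[-1]  # the timing is deliberately ignored
    return int(count)
```

A walkthrough check written this way would accept both "83 passed in ~25s"-style expectations and the 41.30s transcript, since only the count is compared.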
It exists alongside the README's "See it in action" section, which describes an *illustrative* GSM8K scenario (a design target, not a checked-in run). The artifacts here are different: they describe **what's actually inside this repo today**, anchored to specific files, tests, commits, and a real end-to-end run anyone can verify in under ten minutes.

> **Reproducibility note.** The 2026-05-14 demo run used the `claude_cli` LLM provider, which is merged on `main` and will ship in **v1.2.0**. Until then, reproduce from a `git clone` of this repo. Everything else in this directory (architecture snapshot, walkthrough, ledger structure) is reproducible against v1.1.2 today.

|---|---|
| [`README.md`](demo-target-run-2026-05-14/README.md) | Narrated case study: setup, run-by-run timeline, why a self-halted run is stronger evidence than a successful one, exact reproduction commands. |
| [`evolution-cli.yml`](demo-target-run-2026-05-14/evolution-cli.yml) | The exact config used (the new `claude_cli` provider; $10 budget; src/ scope). |
| [`run.log`](demo-target-run-2026-05-14/run.log) | Raw kernel stdout/stderr from the run. |

| File | What it shows |
|---|---|
| [`architecture.md`](repo-snapshot-2026-05-13/architecture.md) | The 8-module, ~1,800-LOC runtime; the 5 role scripts; the 4-stage PR history that built it. With exact file paths and line counts. |
| [`test-output.txt`](repo-snapshot-2026-05-13/test-output.txt) | Raw transcript of `python3 -m pytest tests/ -v`. **83 tests pass in ~25s, no network calls.** (Counted before PR7a/PR7b landed; latest count is 99 — re-run the suite to verify.) |

| Coding agent | `claude-code` (also `claude -p`, with `--permission-mode bypassPermissions` inside the kernel-managed worktree) |
| Hard stops | `max_iterations: 10`, `max_consecutive_failures: 3`, `max_total_usd: 10.00`, `max_total_tokens: 1,000,000` |

Full config: [`evolution-cli.yml`](evolution-cli.yml). Raw kernel stdout/stderr: [`run.log`](run.log).
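
To make the four hard stops concrete, here is a minimal sketch of how such limits could be evaluated after each iteration. It is not the kernel's `hard_stops.py`: the `RunTotals` counters, the `halt_reason` function, the check order, and the "fires when reached" comparison are all assumptions for illustration. Only the limit names and values come from the table row above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardStops:
    """Limits as quoted in the table row above."""
    max_iterations: int = 10
    max_consecutive_failures: int = 3
    max_total_usd: float = 10.00
    max_total_tokens: int = 1_000_000


@dataclass
class RunTotals:
    """Hypothetical running counters; the kernel's real state schema may differ."""
    iterations: int = 0
    consecutive_failures: int = 0
    total_usd: float = 0.0
    total_tokens: int = 0


def halt_reason(limits, totals):
    """Return the name of the first limit that has been reached, else None."""
    checks = (
        ("max_iterations", totals.iterations, limits.max_iterations),
        ("max_consecutive_failures", totals.consecutive_failures,
         limits.max_consecutive_failures),
        ("max_total_usd", totals.total_usd, limits.max_total_usd),
        ("max_total_tokens", totals.total_tokens, limits.max_total_tokens),
    )
    for name, value, limit in checks:
        if value >= limit:  # assumption: a limit fires when reached, not only when exceeded
            return name
    return None
```

With the totals from the PR description (3 runs, $0.11, 276 tokens) and assuming the three failed runs were consecutive, this sketch would report `max_consecutive_failures` well before the cost or token budgets come into play.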
Comment on lines +100 to +116
## How to reproduce

```bash
git clone https://github.com/Protocol-zero-0/evolution-kernel.git
cd evolution-kernel
pip install -e . # (or set PYTHONPATH=$PWD if PEP 668 blocks)

# Prepare a fresh demo target outside the repo
cp -r examples/demo_target /tmp/demo-target
bash /tmp/demo-target/setup.sh

# Re-run with the same config used here
PYTHONPATH=$PWD evolution-kernel \
--config evidence/demo-target-run-2026-05-14/evolution-cli.yml \
--repo /tmp/demo-target \
--ledger /tmp/ek-rerun \
--loop
```
Comment on lines +133 to +137
| [`roles/planner.py`](../../roles/planner.py) | new `_call_claude_cli(prompt, model)` + `provider == "claude_cli"` branch |
| [`roles/evaluator.py`](../../roles/evaluator.py) | inline `claude_cli` branch in `_call_llm` |
| [`roles/goal_evaluator.py`](../../roles/goal_evaluator.py) | new helper + branch |
| [`roles/strategist.py`](../../roles/strategist.py) | new helper + branch |
| [`roles/executor.sh`](../../roles/executor.sh) | `claude -p` now invoked with `--permission-mode bypassPermissions` (governor's worktree is the trust boundary) |

All four `_call_claude_cli` helpers shell out via `claude -p --model <m> --output-format json`, parse the JSON envelope, and return real `total_cost_usd` and `input_tokens + output_tokens`. The kernel's existing `max_total_usd` and `max_total_tokens` cost guards continue to work without modification — they just see a different LLM client underneath.

All 99 existing tests still pass with these additions (see [`repo-snapshot-2026-05-13/test-output.txt`](../repo-snapshot-2026-05-13/test-output.txt) — that snapshot was taken before this PR's edits, but the suite was re-run after each edit during this session; final result: `99 passed in 25.93s`).
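
The helper pattern described above (shell out, parse the envelope, return cost and tokens) might look roughly like the sketch below. This is not the repo's `_call_claude_cli`. The function names are hypothetical, and the envelope layout (a top-level `result` string, a top-level `total_cost_usd`, and token counts nested under `usage`) is an assumption that should be checked against real CLI output.

```python
import json
import subprocess


def parse_envelope(raw):
    """Extract reply text, cost, and total tokens from a JSON envelope string.

    Assumed layout: {"result": str, "total_cost_usd": float,
                     "usage": {"input_tokens": int, "output_tokens": int}}
    """
    envelope = json.loads(raw)
    usage = envelope.get("usage", {})
    return {
        "text": envelope.get("result", ""),
        "cost_usd": float(envelope.get("total_cost_usd", 0.0)),
        "tokens": int(usage.get("input_tokens", 0)) + int(usage.get("output_tokens", 0)),
    }


def call_claude_cli(prompt, model):
    """Run the CLI with the flags quoted above and parse its stdout."""
    proc = subprocess.run(
        ["claude", "-p", prompt, "--model", model, "--output-format", "json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit
    )
    return parse_envelope(proc.stdout)
```

Keeping the parsing separate from the subprocess call means the cost and token accounting can be tested against a canned envelope without invoking the CLI or spending budget.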
Comment on lines +36 to +38
planner: ["python3", "/home/ubuntu/work/protocol-zero/evolution-kernel/roles/planner.py"]
executor: ["bash", "/home/ubuntu/work/protocol-zero/evolution-kernel/roles/executor.sh"]
evaluator: ["python3", "/home/ubuntu/work/protocol-zero/evolution-kernel/roles/evaluator.py"]
Comment on lines +21 to +25
### 1. Verify the kernel works — run the test suite (~25s)

```bash
python3 -m pytest tests/ -v
```
Comment on lines +7 to +20
## Runtime — 8 Python modules, ~1,800 LOC total

```
evolution_kernel/__init__.py 4 LOC
evolution_kernel/cli.py 327 LOC — argument parsing, --loop dispatch
evolution_kernel/config.py 368 LOC — evolution.yml schema + validation
evolution_kernel/governor.py 673 LOC — closed-loop orchestrator (zero LLM calls)
evolution_kernel/hard_stops.py 132 LOC — max_iterations / max_total_usd / max_total_tokens
evolution_kernel/observer.py 100 LOC — runs evidence_sources, normalizes output
evolution_kernel/sandbox.py 100 LOC — process sandbox (PR7a, work-in-progress)
evolution_kernel/scope.py 70 LOC — allowed_paths enforcement
─────────
1,774 LOC
```
Why
The README has pointed to `evidence/` since v1.0 as "checked-in artifacts of runs anyone can reproduce", but the directory never landed, leaving a dead link for readers of v1.0, v1.1.0, v1.1.1, and v1.1.2. This is cleanup ahead of publicizing the project.
What lands
Stats: 54 files, 1132 insertions.
What's NOT in this PR
Test plan