40 changes: 40 additions & 0 deletions evidence/README.md
@@ -0,0 +1,40 @@
## Evidence

This directory contains **reproducible, checked-in evidence** that Evolution Kernel is real, runnable code with an auditable design — not just a README narrative.

It exists alongside the README's "See it in action" section, which describes an *illustrative* GSM8K scenario (a design target, not a checked-in run). The artifacts here are different: they describe **what's actually inside this repo today**, anchored to specific files, tests, commits, and a real end-to-end run anyone can verify in under ten minutes.

> **Reproducibility note.** The 2026-05-14 demo run used the `claude_cli` LLM provider, which is merged on `main` and will ship in **v1.2.0**. Until then, reproduce from a `git clone` of this repo. Everything else in this directory (architecture snapshot, walkthrough, ledger structure) is reproducible against v1.1.2 today.

---

## What's here

### [`demo-target-run-2026-05-14/`](demo-target-run-2026-05-14/)

A real end-to-end run of the kernel against [`examples/demo_target`](../examples/demo_target/), driven entirely through the local Claude Code CLI (`claude -p`) — **no `ANTHROPIC_API_KEY` required**. The kernel halted itself after 3 consecutive rejections, spending $0.11 of a $10 budget. The self-stop is the point: this run demonstrates the planner reading history and changing direction, the evaluator rejecting "syntactically correct but useless" patches, and the hard-stop logic firing on the kernel's own terms.

| File | What it shows |
|---|---|
| [`README.md`](demo-target-run-2026-05-14/README.md) | Narrated case study: setup, run-by-run timeline, why a self-halted run is stronger evidence than a successful one, exact reproduction commands. |
| [`evolution-cli.yml`](demo-target-run-2026-05-14/evolution-cli.yml) | The exact config used (the new `claude_cli` provider; $10 budget; src/ scope). |
| [`run.log`](demo-target-run-2026-05-14/run.log) | Raw kernel stdout/stderr from the run. |
| [`ledger/`](demo-target-run-2026-05-14/ledger/) | Complete audit trail: per-run plan/patch/evaluation/decision/reflection JSON (3 runs × 12 files each), plus `.evolution_state.json`, `halted/`, `failed/`. Total: 47 files, 232 KB. |

### [`repo-snapshot-2026-05-13/`](repo-snapshot-2026-05-13/)

A point-in-time snapshot of the repository as of 2026-05-13, after stages 0–3 were merged and v0.3.0 was tagged:

| File | What it shows |
|---|---|
| [`architecture.md`](repo-snapshot-2026-05-13/architecture.md) | The 8-module, ~1,800-LOC runtime; the 5 role scripts; the 4-stage PR history that built it. With exact file paths and line counts. |
| [`test-output.txt`](repo-snapshot-2026-05-13/test-output.txt) | Raw transcript of `python3 -m pytest tests/ -v`. **83 tests pass in ~25s, no network calls.** (Counted before PR7a/PR7b landed; latest count is 99 — re-run the suite to verify.) |
| [`walkthrough.md`](repo-snapshot-2026-05-13/walkthrough.md) | A 5-minute terminal walkthrough — commands you can copy-paste to see the kernel's anatomy directly: config schema, role protocol, scope enforcement, ledger structure. No API key required. |

---

## What about the GSM8K story in the README?

The "$34 / overnight / 51.8% → 96.2%" GSM8K story in the main README is a **design narrative** — a description of what a complete, well-targeted run on a real benchmark looks like, given the kernel's current capabilities. It is *not* a checked-in run artifact, and the math_solver_harness it references is not part of this repository. We've added an "Illustrative scenario" banner to the README's hero section to make this boundary explicit.

A real, reproducible run on a non-toy target is on the roadmap — when one is produced, its full ledger will be deposited here under a new dated subfolder, the same way [`demo-target-run-2026-05-14/`](demo-target-run-2026-05-14/) is laid out.
141 changes: 141 additions & 0 deletions evidence/demo-target-run-2026-05-14/README.md
@@ -0,0 +1,141 @@
## Demo target run — 2026-05-14

A real end-to-end run of Evolution Kernel against [`examples/demo_target`](../../examples/demo_target/), driven entirely through the local Claude Code CLI (`claude -p`) — no `ANTHROPIC_API_KEY` required. The full ledger is in [`ledger/`](ledger/).

**The kernel halted itself after 3 consecutive rejections, spending $0.11 of a $10 budget. That self-stop is the point.** Read on for why this is a stronger demonstration than a successful run would have been.

---

## Configuration

| | |
|---|---|
| Target repo | [`examples/demo_target`](../../examples/demo_target/), initialized via `setup.sh` to commit `a05ac1f` |
| Mission | "Improve the demo target so its evaluator passes a simple metric check, under strict reproducibility constraints." |
| Mutation scope | `src/` only — every change outside this path would be rejected by `evolution_kernel/scope.py` |
| LLM | `claude-sonnet-4-6` via `claude -p` ([`evolution-cli.yml`](evolution-cli.yml)) — uses the operator's existing Claude Code subscription, no separate API key |
| Coding agent | `claude-code` (also `claude -p`, with `--permission-mode bypassPermissions` inside the kernel-managed worktree) |
| Hard stops | `max_iterations: 10`, `max_consecutive_failures: 3`, `max_total_usd: 10.00`, `max_total_tokens: 1,000,000` |

Full config: [`evolution-cli.yml`](evolution-cli.yml). Raw kernel stdout/stderr: [`run.log`](run.log).
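
The mutation-scope rule in the table above can be illustrated with a minimal sketch. This is not the code in `evolution_kernel/scope.py`; the function names `is_in_scope` and `out_of_scope` are hypothetical, and the sketch only assumes that scope is a list of allowed directory prefixes such as `src/`.

```python
from pathlib import PurePosixPath


def is_in_scope(changed_path: str, allowed_paths: list[str]) -> bool:
    """Return True if changed_path lies under one of the allowed directories.

    Hypothetical helper for illustration; not the kernel's real API.
    """
    path = PurePosixPath(changed_path)
    # Reject absolute paths and parent-directory traversal outright.
    if path.is_absolute() or ".." in path.parts:
        return False
    for allowed in allowed_paths:
        root = PurePosixPath(allowed)  # "src/" normalizes to "src"
        if path == root or root in path.parents:
            return True
    return False


def out_of_scope(changed_files: list[str], allowed_paths: list[str]) -> list[str]:
    """List every changed file that falls outside the allowed paths."""
    return [p for p in changed_files if not is_in_scope(p, allowed_paths)]
```

With `allowed_paths = ["src/"]`, a patch touching `src/feature.py` would pass this check, while one touching `metrics.json` or `scripts/status.sh` would be flagged.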

---

## Outcome — the self-stop, in one screen

```
                        cumulative
run   decision  cost    cost     what happened
─────────────────────────────────────────────────────────────────────────────
0001  REJECT    $0.041  $0.041   created src/feature.py with just a docstring;
                                 evaluator: "no effect on score" (fitness 0.01)
0002  REJECT    $0.043  $0.084   read run-0001 history → added a stub function;
                                 evaluator: "stub still doesn't change metrics" (fitness 0.10)
0003  REJECT    $0.026  $0.110   read run-0002 history → planned to modify the
                                 evaluator output itself; executor stalled,
                                 produced no diff this round (fitness 0.00)
─────────────────────────────────────────────────────────────────────────────
HALT: max_consecutive_failures (3) reached. Final: 3 runs · 276 tokens · $0.110
```

(Each per-run cost is the LLM spend reported by `claude -p --output-format json` for that run; `evolution_kernel/hard_stops.py` sums these into the cumulative total. Source: [`ledger/.evolution_state.json`](ledger/.evolution_state.json), [`ledger/halted/20260514T041157Z.json`](ledger/halted/20260514T041157Z.json), per-run [`ledger/runs/*/evaluation.json`](ledger/runs/).)

---

## Why a "halted after 3 rejects" run is the right kind of evidence

A toy demo where the LLM gets lucky and the metric jumps to 100% would show **one** thing: that something improved. It would not show whether the kernel can *resist* a bad change, or *stop itself* before burning a budget on a dead-end strategy. Those two behaviors are the kernel's actual job. This run demonstrates both:

### 1. Planner reads history and changes direction

`run-0002`'s `planner_input.json` contains a `history` field with one entry — the run-0001 rejection summary. The plan it produces is materially different:

> run-0001 plan: *"Create the file `src/feature.py` with a minimal valid Python module body (e.g., a single docstring or pass statement)"*
>
> run-0002 plan: *"Create `src/feature.py` with exact content: `def feature() -> str: return 'ok'`. Ensure `src/` directory exists. Confirm file is readable and non-empty before returning."*

The planner did not retry the same approach. It saw the rejection reason ("no effect on score") and reached for something more substantive — a callable function rather than just a docstring. By run-0003 it had pivoted further, proposing changes to evaluator-visible behavior. **The history-injection mechanism is doing real work.** ([`ledger/runs/0002/planner_input.json`](ledger/runs/0002/planner_input.json), [`ledger/runs/0002/plan.json`](ledger/runs/0002/plan.json))
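
As a rough illustration of history injection, the sketch below assembles a planner input from prior reflections. It is not the kernel's implementation: `build_planner_input` and the `goal` and `observation` keys are assumptions for illustration. Only the `history` field and the `history.max_entries: 10` setting come from this run's files.

```python
def build_planner_input(goal, observation, reflections, max_entries=10):
    """Assemble a planner input dict with the most recent reflections as history.

    Hypothetical helper: the real planner_input.json may carry more fields.
    """
    # Keep only the newest max_entries reflections so the prompt stays bounded.
    history = list(reflections[-max_entries:]) if max_entries > 0 else []
    return {"goal": goal, "observation": observation, "history": history}


# Run 0002 sees exactly one history entry: the run-0001 rejection summary.
run_0002_input = build_planner_input(
    goal={"objective": "Improve the demo target"},
    observation={"score": 0.5},
    reflections=["run-0001 rejected: no effect on score"],
)
```

Truncating from the front means the planner always sees the most recent rejections, which is what lets it avoid repeating the previous approach.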

### 2. Evaluator rejects "syntactically correct but useless" changes

In runs 0001 and 0002 the executor *did* produce real diffs that compiled and were recorded as candidate commits. A naive system ("did the LLM apply the patch? yes? accept!") would have shipped both of them. The Evolution Kernel evaluator looked at the actual observation (the metric was still `score: 0.5` after the patch) and rejected them anyway, with a one-sentence reason that names the failure mode:

- run-0001 evaluator: *"Adding an empty feature module docstring has no effect on the metrics.json score or evaluator outcome."*
- run-0002 evaluator: *"Adding a stub feature.py with a trivial function does not modify metrics.json or any evaluator logic, so the score remains 0.5 and the metric check will not pass."*

This is the protocol's **`hard_gates_passed` vs `recommendation`** separation in action. Each change technically passed the syntactic safety gate, but the evaluator's *outcome* judgment was still "reject." The `fitness` floats (0.01, 0.10, 0.00) record how far each candidate moved toward the goal, both for the human reader and for the k-branch parallel exploration that ranks sibling branches.
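
The separation can be sketched as a small decision function. This is an assumption-laden illustration, not the governor's code: the function name `decide`, the `"accept"` recommendation value, and the acceptance reason string are hypothetical. The two rejection reasons and the field names are copied from this run's `decision.json` and `evaluation.json` files.

```python
def decide(evaluation, candidate_commit, baseline_commit):
    """Combine the safety gate and the evaluator's judgment into one decision.

    Hypothetical sketch: a candidate is accepted only if it exists, passes
    the hard gates, AND the evaluator recommends acceptance.
    """
    if candidate_commit is None:
        accepted, reason = False, "executor produced no repo changes"
    elif evaluation["hard_gates_passed"] and evaluation["recommendation"] == "accept":
        # The "accept" value and this reason string are assumptions.
        accepted, reason = True, "candidate accepted"
    else:
        accepted, reason = False, "hard gates failed or evaluator rejected candidate"
    return {
        "accepted": accepted,
        "baseline_commit": baseline_commit,
        "candidate_commit": candidate_commit,
        "reason": reason,
        "rollback_target": baseline_commit,
    }
```

Under these assumptions, run 0001's evaluation (`hard_gates_passed: true`, `recommendation: "reject"`) yields a rejection even though the gate passed, which mirrors the behavior described above.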

### 3. Hard stop fires on the kernel's own terms

The `max_consecutive_failures: 3` hard stop is enforced in `evolution_kernel/hard_stops.py`. It triggered exactly when it should — after three back-to-back REJECTs — and the kernel exited with a halted-state JSON dropped at [`ledger/halted/20260514T041157Z.json`](ledger/halted/20260514T041157Z.json) so a future operator (or auditor) can reconstruct what happened from disk alone.

Total spend at halt: **$0.110 of a $10 budget** (~1%). The kernel did not "run away" — it noticed it was stuck and stopped, leaving the next operator with a clean baseline (`evolution/accepted` still pointing at the initial commit `a05ac1f`, no spurious commits merged) and a written audit trail of why.
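
A minimal sketch of how such a check could work, assuming each limit is compared against a running counter with `>=` and checked in a fixed order. The function `check_hard_stops` is hypothetical and is not the contents of `evolution_kernel/hard_stops.py`; the state keys and the halt-reason format match `ledger/.evolution_state.json`.

```python
from typing import Optional


def check_hard_stops(state: dict, limits: dict) -> Optional[str]:
    """Return a halt reason if any configured limit is reached, else None.

    Hypothetical sketch: the comparison operator and check order are assumptions.
    """
    checks = [
        ("max_iterations", state["iterations"]),
        ("max_consecutive_failures", state["consecutive_failures"]),
        ("max_total_usd", state["total_usd"]),
        ("max_total_tokens", state["total_tokens"]),
    ]
    for name, value in checks:
        limit = limits.get(name)
        if limit is not None and value >= limit:
            return f"{name} reached ({limit})"
    return None


limits = {
    "max_iterations": 10,
    "max_consecutive_failures": 3,
    "max_total_usd": 10.00,
    "max_total_tokens": 1_000_000,
}
final_state = {
    "iterations": 3,
    "consecutive_failures": 3,
    "total_usd": 0.11044649999999998,
    "total_tokens": 276,
}
halt_reason = check_hard_stops(final_state, limits)
```

With the final state of this run, only the consecutive-failure limit is reached; the iteration, cost, and token budgets are all far from their caps.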

### 4. Every byte of this run is reconstructable from `ledger/`

Each `runs/<id>/` directory follows the layout described in the main README's "Ledger" section. Concretely, run 0002 contains the 14 files listed here:

```
ledger/runs/0002/
  config.json                  ← evolution.yml snapshot at this iteration
  goal.json                    ← the mission, restated
  observation.json             ← evidence_sources output (metrics.json + status.sh)
  planner_input.json           ← goal + observation + history fed to planner
  plan.json                    ← LLM's plan: summary, steps, allowed_paths, cost, tokens
  executor_input.json          ← plan + worktree path fed to executor
  executor_output.json         ← changed_files: 1, tool: claude-code
  executor_output.stdout.txt   ← raw claude -p text response
  evaluator_input.json         ← goal + patch + observation fed to evaluator
  patch.diff                   ← the actual diff that landed in the sandbox
  candidate_commit.txt         ← the sandbox commit SHA (b42cfc2…)
  evaluation.json              ← hard_gates_passed, recommendation, fitness, reason, cost, tokens
  decision.json                ← governor's final accept/reject + rollback_target
  reflection.json              ← one-line summary that becomes run-0003's history
```

You can scroll through any of these directly in this checked-in `ledger/` directory.
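
For a quicker overview than opening each file, a short script can summarize the ledger. The sketch below assumes only the `runs/<id>/evaluation.json` layout and the `recommendation` and `fitness` fields shown in this directory; the function name `summarize_ledger` is hypothetical, and runs without an `evaluation.json` are simply skipped.

```python
import json
from pathlib import Path


def summarize_ledger(ledger_dir):
    """Return (run_id, recommendation, fitness) for each run that was evaluated."""
    rows = []
    runs_dir = Path(ledger_dir) / "runs"
    for run_dir in sorted(p for p in runs_dir.iterdir() if p.is_dir()):
        eval_path = run_dir / "evaluation.json"
        if not eval_path.is_file():
            continue  # this run never reached the evaluator
        evaluation = json.loads(eval_path.read_text(encoding="utf-8"))
        rows.append(
            (run_dir.name, evaluation.get("recommendation"), evaluation.get("fitness"))
        )
    return rows


if __name__ == "__main__":
    # Run from evidence/demo-target-run-2026-05-14/ to inspect this ledger.
    for run_id, recommendation, fitness in summarize_ledger("ledger"):
        print(run_id, recommendation, fitness)
```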

---

## How to reproduce

```bash
git clone https://github.com/Protocol-zero-0/evolution-kernel.git
cd evolution-kernel
pip install -e . # (or set PYTHONPATH=$PWD if PEP 668 blocks)

# Prepare a fresh demo target outside the repo
cp -r examples/demo_target /tmp/demo-target
bash /tmp/demo-target/setup.sh

# Re-run with the same config used here
PYTHONPATH=$PWD evolution-kernel \
  --config evidence/demo-target-run-2026-05-14/evolution-cli.yml \
  --repo /tmp/demo-target \
  --ledger /tmp/ek-rerun \
  --loop
```

Required tooling:
- Python ≥ 3.10
- An authenticated [Claude Code CLI](https://docs.claude.com/en/docs/claude-code/overview) (`claude --version` must return ≥ 2.x). No `ANTHROPIC_API_KEY` is needed — the kernel's new `claude_cli` provider shells out to `claude -p`.

The exact LLM outputs will differ run-to-run (the Sonnet model isn't deterministic), but the *shape* of the ledger — config, observation, plan, patch, evaluation, decision, reflection for each run, plus a halted/ entry when budget rules fire — is reproducible by construction.

---

## What this run also produced — the `claude_cli` provider

To make this run possible, four Python role scripts gained a third provider option, and the executor script gained a permission flag:

| File | Change |
|---|---|
| [`roles/planner.py`](../../roles/planner.py) | new `_call_claude_cli(prompt, model)` + `provider == "claude_cli"` branch |
| [`roles/evaluator.py`](../../roles/evaluator.py) | inline `claude_cli` branch in `_call_llm` |
| [`roles/goal_evaluator.py`](../../roles/goal_evaluator.py) | new helper + branch |
| [`roles/strategist.py`](../../roles/strategist.py) | new helper + branch |
| [`roles/executor.sh`](../../roles/executor.sh) | `claude -p` now invoked with `--permission-mode bypassPermissions` (governor's worktree is the trust boundary) |

All four `_call_claude_cli` helpers shell out via `claude -p --model <m> --output-format json`, parse the JSON envelope, and return real `total_cost_usd` and `input_tokens + output_tokens`. The kernel's existing `max_total_usd` and `max_total_tokens` cost guards continue to work without modification — they just see a different LLM client underneath.
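
The parsing step can be sketched as follows. This is not the code in `roles/planner.py`: the function name `parse_claude_envelope` is hypothetical, and the envelope shape (a top-level `result` string, `total_cost_usd`, and a `usage` object holding `input_tokens` and `output_tokens`) is an assumption based on the description above. The sample payload is invented for illustration and is not taken from this run.

```python
import json


def parse_claude_envelope(raw: str) -> dict:
    """Extract response text, cost, and token count from a JSON envelope.

    Assumed field names: result, total_cost_usd, usage.input_tokens,
    usage.output_tokens. Missing fields fall back to empty or zero values.
    """
    envelope = json.loads(raw)
    usage = envelope.get("usage") or {}
    tokens = int(usage.get("input_tokens", 0)) + int(usage.get("output_tokens", 0))
    return {
        "text": envelope.get("result", ""),
        "cost_usd": float(envelope.get("total_cost_usd", 0.0)),
        "tokens_used": tokens,
    }


# Invented sample payload, for illustration only.
sample = json.dumps(
    {
        "result": "ok",
        "total_cost_usd": 0.05,
        "usage": {"input_tokens": 12, "output_tokens": 30},
    }
)
parsed = parse_claude_envelope(sample)
```

Returning cost and tokens in a fixed shape is what would let budget guards stay unaware of which LLM client produced the numbers.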

All 99 existing tests still pass with these additions; the final result was `99 passed in 25.93s`. The transcript in [`repo-snapshot-2026-05-13/test-output.txt`](../repo-snapshot-2026-05-13/test-output.txt) was captured before this PR's edits, but the suite was re-run after each edit during this session.
38 changes: 38 additions & 0 deletions evidence/demo-target-run-2026-05-14/evolution-cli.yml
@@ -0,0 +1,38 @@
mission: "Improve the demo target so its evaluator passes a simple metric check, under strict reproducibility constraints."

# Use the local Claude Code CLI (no API key required, runs against the
# logged-in user's subscription via `claude -p`).
llm:
  provider: claude_cli
  model: claude-sonnet-4-6

# Same — executor uses `claude -p` instead of aider.
coding_agent:
  tool: claude-code

history:
  max_entries: 10

evidence_sources:
  - type: file
    path: "metrics.json"
  - type: shell
    command: "bash scripts/status.sh"

mutation_scope:
  allowed_paths:
    - "src/"

# Budget guards. max_total_usd is enforced via the real cost reported by
# `claude -p --output-format json`. $10 is generous; the run should finish
# well under $1 on the toy demo_target.
hard_stops:
  max_iterations: 10
  max_consecutive_failures: 3
  max_total_usd: 10.00
  max_total_tokens: 1000000

roles:
  planner: ["python3", "/home/ubuntu/work/protocol-zero/evolution-kernel/roles/planner.py"]
  executor: ["bash", "/home/ubuntu/work/protocol-zero/evolution-kernel/roles/executor.sh"]
  evaluator: ["python3", "/home/ubuntu/work/protocol-zero/evolution-kernel/roles/evaluator.py"]
@@ -0,0 +1,8 @@
{
  "consecutive_failures": 3,
  "halt_reason": "max_consecutive_failures reached (3)",
  "halted": true,
  "iterations": 3,
  "total_tokens": 276,
  "total_usd": 0.11044649999999998
}
@@ -0,0 +1 @@
a05ac1fa38bdb1ded0721326b9dfadfa3e4214b5
@@ -0,0 +1,7 @@
{
  "accepted": false,
  "baseline_commit": "a05ac1fa38bdb1ded0721326b9dfadfa3e4214b5",
  "candidate_commit": "42fe32b50d2e83e1f768820dc782f500185d3e78",
  "reason": "hard gates failed or evaluator rejected candidate",
  "rollback_target": "a05ac1fa38bdb1ded0721326b9dfadfa3e4214b5"
}
@@ -0,0 +1,7 @@
{
  "accepted": false,
  "baseline_commit": "a05ac1fa38bdb1ded0721326b9dfadfa3e4214b5",
  "candidate_commit": "b42cfc2beb5fa673bed40e8773ee538f156e4a3f",
  "reason": "hard gates failed or evaluator rejected candidate",
  "rollback_target": "a05ac1fa38bdb1ded0721326b9dfadfa3e4214b5"
}
@@ -0,0 +1,7 @@
{
  "accepted": false,
  "baseline_commit": "a05ac1fa38bdb1ded0721326b9dfadfa3e4214b5",
  "candidate_commit": null,
  "reason": "executor produced no repo changes",
  "rollback_target": "a05ac1fa38bdb1ded0721326b9dfadfa3e4214b5"
}
@@ -0,0 +1,8 @@
{
  "consecutive_failures": 3,
  "halted_at": "20260514T041157Z",
  "iterations": 3,
  "reason": "max_consecutive_failures reached (3)",
  "total_tokens": 276,
  "total_usd": 0.11044649999999998
}
@@ -0,0 +1 @@
42fe32b50d2e83e1f768820dc782f500185d3e78
48 changes: 48 additions & 0 deletions evidence/demo-target-run-2026-05-14/ledger/runs/0001/config.json
@@ -0,0 +1,48 @@
{
  "coding_agent": {
    "tool": "claude-code"
  },
  "evidence_sources": [
    {
      "path": "metrics.json",
      "type": "file"
    },
    {
      "command": "bash scripts/status.sh",
      "type": "shell"
    }
  ],
  "hard_stops": {
    "max_consecutive_failures": 3,
    "max_iterations": 10,
    "max_total_tokens": 1000000,
    "max_total_usd": 10.0
  },
  "history": {
    "max_entries": 10
  },
  "llm": {
    "model": "claude-sonnet-4-6",
    "provider": "claude_cli"
  },
  "mission": "Improve the demo target so its evaluator passes a simple metric check, under strict reproducibility constraints.",
  "mutation_scope": {
    "allowed_paths": [
      "src/"
    ]
  },
  "roles": {
    "evaluator": [
      "python3",
      "/home/ubuntu/work/protocol-zero/evolution-kernel/roles/evaluator.py"
    ],
    "executor": [
      "bash",
      "/home/ubuntu/work/protocol-zero/evolution-kernel/roles/executor.sh"
    ],
    "planner": [
      "python3",
      "/home/ubuntu/work/protocol-zero/evolution-kernel/roles/planner.py"
    ]
  }
}
@@ -0,0 +1,7 @@
{
  "accepted": false,
  "baseline_commit": "a05ac1fa38bdb1ded0721326b9dfadfa3e4214b5",
  "candidate_commit": "42fe32b50d2e83e1f768820dc782f500185d3e78",
  "reason": "hard gates failed or evaluator rejected candidate",
  "rollback_target": "a05ac1fa38bdb1ded0721326b9dfadfa3e4214b5"
}
@@ -0,0 +1,12 @@
{
  "cost_usd": 0.036670499999999995,
  "fitness": 0.01,
  "hard_gates_passed": true,
  "metrics": {
    "iterations_observed": 0,
    "score": 0.5
  },
  "reason": "Adding an empty feature module docstring has no effect on the metrics.json score or evaluator outcome.",
  "recommendation": "reject",
  "tokens_used": 77
}
@@ -0,0 +1,12 @@
{
  "baseline_commit": "a05ac1fa38bdb1ded0721326b9dfadfa3e4214b5",
  "candidate_commit": "42fe32b50d2e83e1f768820dc782f500185d3e78",
  "goal": {
    "name": "Improve the demo target so its evaluator passes a simple metric check, under strict reproducibility constraints.",
    "objective": "Improve the demo target so its evaluator passes a simple metric check, under strict reproducibility constraints."
  },
  "observation_path": "/tmp/ek-ledger/runs/0001/observation.json",
  "patch_path": "/tmp/ek-ledger/runs/0001/patch.diff",
  "run_id": "0001",
  "worktree": "/tmp/ek-ledger/worktrees/0001"
}
@@ -0,0 +1,10 @@
{
  "baseline_commit": "a05ac1fa38bdb1ded0721326b9dfadfa3e4214b5",
  "goal": {
    "name": "Improve the demo target so its evaluator passes a simple metric check, under strict reproducibility constraints.",
    "objective": "Improve the demo target so its evaluator passes a simple metric check, under strict reproducibility constraints."
  },
  "plan_path": "/tmp/ek-ledger/runs/0001/plan.json",
  "run_id": "0001",
  "worktree": "/tmp/ek-ledger/worktrees/0001"
}
@@ -0,0 +1,5 @@
{
  "changed_files": 1,
  "tool": "claude-code",
  "summary": "Create src/feature.py to satisfy the evaluator's hard gate check"
}
@@ -0,0 +1 @@
Created `src/feature.py` with a minimal module docstring.
@@ -0,0 +1,4 @@
{
  "name": "Improve the demo target so its evaluator passes a simple metric check, under strict reproducibility constraints.",
  "objective": "Improve the demo target so its evaluator passes a simple metric check, under strict reproducibility constraints."
}