feat: examples/oss_fix_demo/ (real OSS fix via claude CLI) #32
Merged
Lands the LLM-driven companion to examples/quickstart/. The target is a
real published OSS package (python-slugify v8.0.4 — 1,106 LoC, MIT,
cloned by setup.sh at a pinned tag). The executor is `claude -p
--permission-mode acceptEdits` inside a custom Python wrapper that uses
the operator's Claude Pro/Max subscription — no API key, no per-token
charge. Planner and evaluator stay deterministic so the loop can be
driven from a clean install without the anthropic or openai SDKs.
E2E on a developer laptop (2026-05-17):
- setup.sh: ~1s + clone bandwidth
- evolution-kernel --loop: ~48s total wall-clock (3 rounds)
- claude executor (round 1): 34s (10 ruff violations → 0)
- rounds 2-3: halt, since ruff is already clean
- cost: $0 marginal (Pro subscription flat fee)
A real `evolution/accepted` commit landed on the demo target's branch; the
run is captured in examples/oss_fix_demo/README.md.
Realistic division of labor in bots/executor.py: claude does the
semantic edits (F401 "use explicit `as` re-export" pattern); a follow-up
`ruff check --fix && ruff format` postprocess mops up structural
autofixes (I001 import sort). This both raises the accept rate and
mirrors how production teams chain LLMs with deterministic tooling.
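A minimal sketch of that postprocess step, assuming it runs each ruff command in turn and records the exit codes. `run_postprocess`, its record shape, and the use of `sys.executable` in place of `python3` are illustrative choices, not the PR's code. Looping instead of chaining with `&&` means `ruff format` still runs when `ruff check --fix` exits non-zero because some diagnostics could not be autofixed.

```python
import subprocess
import sys
from typing import Callable, Sequence

# Hypothetical defaults mirroring the PR description: autofix, then format.
DEFAULT_COMMANDS: list[list[str]] = [
    [sys.executable, "-m", "ruff", "check", "--fix", "slugify/"],
    [sys.executable, "-m", "ruff", "format", "slugify/"],
]


def run_postprocess(
    commands: Sequence[Sequence[str]] = DEFAULT_COMMANDS,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> list[dict]:
    """Run each command in order and record its exit code.

    Unlike `a && b`, a non-zero exit from one command does not skip the
    next; acceptance is decided later by the evaluator, not here.
    """
    results = []
    for cmd in commands:
        proc = runner(list(cmd), capture_output=True, text=True)
        results.append({"cmd": " ".join(cmd), "exit": proc.returncode})
    return results
```

The `runner` parameter exists only so the loop can be exercised without ruff installed; in the demo it would stay `subprocess.run`.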
setup.sh writes a target-side pyproject.toml with `ignore = ["F403"]`
because python-slugify's star-imports are load-bearing public-API
re-exports — flagging them as a real "bug" would be incorrect.
Files added under examples/oss_fix_demo/:
- setup.sh — clones python-slugify, drops bots/, writes lint config
- evolution.yml — references in-target bots, allowed_paths slugify/
- bots/planner.py — deterministic, snapshots ruff diagnostics into the plan
- bots/executor.py — wraps `claude -p --permission-mode acceptEdits`,
    then runs a `ruff check --fix && ruff format` postprocess
- bots/evaluator.py — re-runs ruff, accept iff exit 0
- README.md — full forensic walk-through with ledger snippets
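The evaluator's rule is small enough to sketch in full. This assumes the evaluator shells out to ruff and reports a verdict dict; the `recommend_promotion` and `exit` keys and the `evaluate` name are hypothetical, since the PR states only "accept iff exit 0".

```python
import subprocess


def evaluate(cmd: list[str]) -> dict:
    """Accept the candidate iff the lint command exits 0."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        # Hypothetical key names; the real evaluator output is not shown.
        "recommend_promotion": proc.returncode == 0,
        "exit": proc.returncode,
    }
```

In the demo the command would be `["python3", "-m", "ruff", "check", "slugify/"]`.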
No kernel changes; 102 baseline tests still pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull request overview
Adds a new examples/oss_fix_demo/ companion demo that runs Evolution Kernel against python-slugify using a Claude CLI executor, with deterministic planner/evaluator scripts and setup documentation.
Changes:
- Adds setup and README for cloning/preparing the python-slugify demo target.
- Adds demo config wiring planner, Claude executor wrapper, evaluator, ruff evidence, and mutation scope.
- Adds bot scripts for planning ruff cleanup, invoking `claude -p`, postprocessing with ruff, and evaluating ruff cleanliness.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| examples/oss_fix_demo/setup.sh | Bootstraps the external OSS target repo and commits demo bots/config. |
| examples/oss_fix_demo/README.md | Documents prerequisites, run steps, measured output, and ledger artifacts. |
| examples/oss_fix_demo/evolution.yml | Defines mission, ruff evidence, allowed paths, hard stops, and role commands. |
| examples/oss_fix_demo/bots/planner.py | Emits a canned ruff-cleanup plan with current diagnostics. |
| examples/oss_fix_demo/bots/executor.py | Wraps Claude CLI execution and applies ruff postprocessing. |
| examples/oss_fix_demo/bots/evaluator.py | Accepts candidates when `ruff check slugify/` is clean. |
```yaml
# Snapshot ruff state at observe-time so the planner sees concrete diagnostics.
evidence_sources:
  - type: shell
    command: "python3 -m ruff check slugify/ || true"
```
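The `|| true` matters: ruff exits 1 whenever it finds violations, and a collector that treats a non-zero exit as failure would discard exactly the diagnostics the planner needs. A sketch of such a collector, assuming it runs the configured string through the shell and fails on non-zero exit (`collect_shell_evidence` is hypothetical, not the kernel's API):

```python
import subprocess


def collect_shell_evidence(command: str) -> dict:
    """Run an evidence command through the shell and capture its stdout.

    check=True raises CalledProcessError on a non-zero exit; the `|| true`
    suffix in the configured command keeps the status at 0 so ruff's
    diagnostics are still captured.
    """
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return {"command": command, "stdout": proc.stdout}
```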
```python
prompt = f"{plan.get('summary', 'improve the codebase')}\n\nSteps:\n{steps}\n\nOnly modify files within: {', '.join(plan.get('allowed_paths', ['slugify/']))}"
```
```python
claude_bin = os.environ.get("EK_CLAUDE_BIN", "claude")
extra_args = os.environ.get("EK_CLAUDE_ARGS", "--permission-mode acceptEdits").split()
```
```markdown
## Want to point this at *your* OSS repo?

Replace the `git clone` line in `setup.sh` with your own URL/tag, point `mutation_scope.allowed_paths` in `evolution.yml` at the right subtree, and rewrite the planner's `summary`/`steps` to describe your goal. The bots are <50 LoC each — easy to fork.
```
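`mutation_scope.allowed_paths` acts as a fence around what a candidate may touch. A sketch of such a check, assuming the kernel compares each changed file against the allowed prefixes by whole path components; the kernel's real implementation is not part of this PR, and `out_of_scope` is hypothetical.

```python
from pathlib import PurePosixPath


def out_of_scope(changed_files: list[str], allowed_paths: list[str]) -> list[str]:
    """Return changed files that fall outside every allowed path prefix."""
    allowed = [PurePosixPath(p) for p in allowed_paths]
    bad = []
    for f in changed_files:
        path = PurePosixPath(f)
        # is_relative_to (Python 3.9+) compares whole components, so
        # "slugify_extra/x.py" is not treated as inside "slugify/".
        if not any(path.is_relative_to(a) for a in allowed):
            bad.append(f)
    return bad
```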
Comment on lines +6 to +7
```sh
# 2. Copy local-deterministic bots/ into it and commit on the target's
# first commit so the role scripts are reachable inside every
```
```markdown
|---|---|---|
| `setup.sh` (clones slugify, commits bots, init repo) | ~1 s + clone bandwidth | \$0 |
| Run 1 — claude executor + ruff postprocess + evaluator | **34 s** (claude alone) / **~48 s** total wall-clock for the 3-round loop | Claude Pro subscription, flat fee — no per-token charge |
| Run 2 — halt: ruff already clean, no changes to make | <1 s | \$0 |
```
Comment on lines +84 to +85
```markdown
| `observation.json` | Ruff snapshot the planner saw |
| `plan.json` | Canned plan + verbatim ruff diagnostics |
```
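A sketch of how a deterministic planner could produce such a `plan.json`: a fixed summary and step list, with the observed ruff output attached verbatim. The `summary`, `steps`, and `allowed_paths` keys match what the executor's prompt reads; the wording, the `diagnostics` key, and the observation's `stdout` key are hypothetical.

```python
import json

# Canned, deterministic content: identical every round (wording hypothetical).
CANNED_SUMMARY = "Clean up ruff diagnostics in slugify/"
CANNED_STEPS = [
    "Resolve each ruff diagnostic listed below",
    "Do not change public behavior",
]


def make_plan(observation: dict) -> dict:
    return {
        "summary": CANNED_SUMMARY,
        "steps": list(CANNED_STEPS),
        "allowed_paths": ["slugify/"],
        # Hypothetical key: the ruff text the planner saw, copied verbatim.
        "diagnostics": observation.get("stdout", ""),
    }


def plan_to_json(plan: dict) -> str:
    # sort_keys keeps the serialized plan byte-stable across runs.
    return json.dumps(plan, indent=2, sort_keys=True)
```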
Comment on lines +24 to +31
```python
# Count remaining violations from the "Found N errors." footer if present.
remaining = 0
for line in proc.stdout.splitlines():
    if line.startswith("Found ") and "error" in line:
        try:
            remaining = int(line.split()[1])
        except (ValueError, IndexError):
            pass
```
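Wrapped as a standalone function, the footer parse can be checked against the shapes `ruff check` prints (`count_remaining` is a hypothetical name). It relies on the plain `Found N errors.` footer; when ruff runs with `--fix` the footer typically reads `Found N errors (M fixed, K remaining).`, and the first number is then the pre-fix total rather than what remains.

```python
def count_remaining(stdout: str) -> int:
    """Return N from a 'Found N error(s).' footer, or 0 if there is none."""
    remaining = 0
    for line in stdout.splitlines():
        if line.startswith("Found ") and "error" in line:
            try:
                remaining = int(line.split()[1])
            except (ValueError, IndexError):
                pass  # Malformed footer: keep the previous value.
    return remaining
```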
Comment on lines +64 to +70
```python
    "tool": "claude-code",
    "exit": proc.returncode,
    "elapsed_seconds": round(elapsed, 2),
    "postprocess": postprocess,
    "changed_files": changed,
    "stdout_tail": proc.stdout[-800:],
    "stderr_tail": proc.stderr[-400:],
```
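A sketch of assembling that record; the keys are taken from the diff, while the `build_executor_output` wrapper and its signature are hypothetical. The `[-800:]` and `[-400:]` slices keep only the last 800 and 400 characters, so a long Claude transcript cannot bloat the ledger entry (for example `runs/0001/executor_output.json`).

```python
import subprocess


def build_executor_output(
    proc: subprocess.CompletedProcess,
    elapsed: float,
    postprocess: list,
    changed: list,
) -> dict:
    return {
        "tool": "claude-code",
        "exit": proc.returncode,
        "elapsed_seconds": round(elapsed, 2),
        "postprocess": postprocess,
        "changed_files": changed,
        # Tails only: enough to debug a failure without storing everything.
        "stdout_tail": proc.stdout[-800:],
        "stderr_tail": proc.stderr[-400:],
    }
```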
Comment on lines +45 to +49
```python
# Mop up any remaining autofix-only diagnostics (import sort, whitespace).
# This is realistic: in real workflows you run the formatter after the LLM.
postprocess = []
for cmd in (
    ["python3", "-m", "ruff", "check", "--fix", "--unsafe-fixes", "slugify/"],
```
Closes #31.
Summary
Measured (2026-05-17, dev laptop)
`runs/0001/decision.json`:
```json
{"accepted": true, "candidate_commit": "bae97a8...", "reason": "hard gates passed and evaluator recommended promotion"}
```
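To check a run's outcome from a script, the decision file can be read directly. A minimal sketch, assuming only the keys shown in the `decision.json` above; `parse_decision` and `read_decision` are hypothetical helpers.

```python
import json
from pathlib import Path


def parse_decision(text: str) -> tuple[bool, str]:
    """Return (accepted, reason) from the contents of a decision.json."""
    data = json.loads(text)
    return bool(data["accepted"]), data.get("reason", "")


def read_decision(run_dir: Path) -> tuple[bool, str]:
    """Read <run_dir>/decision.json, e.g. runs/0001/decision.json."""
    return parse_decision((run_dir / "decision.json").read_text(encoding="utf-8"))
```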
`runs/0001/executor_output.json` (claude's verbatim summary):
Design notes (worth recording)
Test plan
🤖 Generated with Claude Code