
evalbuff: carve-based eval pipeline (delete & rebuild)#487

Merged
jahooma merged 4 commits into main from jahooma/evalbuff-delete-rebuild
Mar 31, 2026

@jahooma jahooma commented Mar 30, 2026

Summary

  • Adds a new eval approach that carves features out of the current codebase (using gpt-5.4 via OpenAI SDK) and has agents rebuild them from natural prompts, instead of replaying git commits
  • carve-features.ts: a two-phase pipeline that first plans carveable features across the codebase, then surgically removes each one, producing diffs and ground truth
  • run-carve-eval.ts: runs N agents in parallel on carved repos, judges against original code, and iterates on docs using the existing doc-optimizer loop
  • Tested end-to-end on this repo: carved cli-init-command, agents scored 5.0 baseline → 5.5 after doc improvement, generated patterns/discover-before-implement.md
  • Also includes the doc and test artifacts from the trial run (AGENTS.md update, generated doc, carve/eval result JSONs)
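
The run-and-judge step described above can be sketched roughly as follows. This is a minimal illustration, not the code in run-carve-eval.ts: the `CarvedFeature` shape, the `Agent` and `Judge` function types, the `chunk`, `mean`, and `evalFeature` helpers, and the choice to average the judge scores are all assumptions made for the example.

```typescript
// Hypothetical shapes: the real carve JSON schema is not shown in this PR.
interface CarvedFeature {
  id: string;          // e.g. "cli-init-command"
  prompt: string;      // natural prompt handed to the rebuilding agent
  groundTruth: string; // original code that the carve removed
}

// Stand-ins for the agent runner and the judge; both are assumptions.
type Agent = (feature: CarvedFeature) => Promise<string>;
type Judge = (attempt: string, groundTruth: string) => Promise<number>;

// Split items into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Arithmetic mean of a non-empty list of scores.
function mean(scores: number[]): number {
  if (scores.length === 0) throw new Error("no scores to average");
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Run the agents at most `parallelism` at a time and average the judge scores.
// Averaging is an assumption; the PR does not say how scores are aggregated.
async function evalFeature(
  feature: CarvedFeature,
  agents: Agent[],
  judge: Judge,
  parallelism: number,
): Promise<number> {
  const scores: number[] = [];
  for (const batch of chunk(agents, parallelism)) {
    const batchScores = await Promise.all(
      batch.map(async (agent) => judge(await agent(feature), feature.groundTruth)),
    );
    scores.push(...batchScores);
  }
  return mean(scores);
}
```

Batching with `Promise.all` keeps the sketch short; a worker-pool style limiter would keep all slots busy when individual runs differ widely in duration.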

Test plan

  • Typecheck passes (npx tsc --noEmit)
  • End-to-end test: bun run evalbuff/src/carve-features.ts --repo . --count 3 produced 3 carved features
  • End-to-end test: bun run evalbuff/src/run-carve-eval.ts --repo . --carve-file carve-2026-03-30.json --feature cli-init-command --parallelism 2 ran full loop successfully
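
For readers unfamiliar with the flags used in the commands above, here is a small sketch of how such `--name value` pairs could be parsed. It is not the parser in run-carve-eval.ts: the `CarveEvalOptions` interface, the `parseCarveEvalArgs` function, the treatment of `--repo` and `--carve-file` as required, and the fallback parallelism of 1 are all hypothetical.

```typescript
// Hypothetical option shape; only the flag names come from the PR's test plan.
interface CarveEvalOptions {
  repo: string;
  carveFile: string;
  feature: string | undefined; // assumed optional
  parallelism: number;
}

// Parse a flat list of `--name value` pairs into typed options.
function parseCarveEvalArgs(argv: string[]): CarveEvalOptions {
  const flags = new Map<string, string>();
  for (let i = 0; i < argv.length; i += 2) {
    const name = argv[i] ?? "";
    const value = argv[i + 1];
    if (!name.startsWith("--") || value === undefined) {
      throw new Error(`bad argument: ${name}`);
    }
    flags.set(name.slice(2), value);
  }

  const repo = flags.get("repo");
  const carveFile = flags.get("carve-file");
  if (!repo || !carveFile) {
    throw new Error("--repo and --carve-file are required");
  }

  // Fallback of 1 is an assumption, not the tool's documented default.
  const parallelism = Number(flags.get("parallelism") ?? "1");
  if (!Number.isInteger(parallelism) || parallelism < 1) {
    throw new Error("--parallelism must be a positive integer");
  }

  return { repo, carveFile, feature: flags.get("feature"), parallelism };
}
```

Passing the arguments from the second end-to-end command would yield `repo` of `.`, `carveFile` of `carve-2026-03-30.json`, `feature` of `cli-init-command`, and `parallelism` of 2.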

🤖 Generated with Claude Code

jahooma and others added 2 commits March 30, 2026 16:36
New approach to evals that carves features out of the current codebase
and has agents rebuild them, instead of replaying git commits. Uses
OpenAI SDK (gpt-5.4) to identify and surgically remove features, then
runs agents in parallel to rebuild from a natural prompt, judges against
original code, and iterates on docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jahooma and others added 2 commits March 30, 2026 17:43
- Switch carve eval inner agents to Claude SDK (sonnet) with 3 parallel runs
- Update carve-features to use gpt-5.4 model
- Remove auto-generated discover-before-implement.md (test artifact)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jahooma jahooma merged commit 869f5c4 into main Mar 31, 2026
34 checks passed
@jahooma jahooma deleted the jahooma/evalbuff-delete-rebuild branch March 31, 2026 17:43