From 739f7d983d168d27093c0abb14ee735c2072a3a8 Mon Sep 17 00:00:00 2001
From: nonoqing
Date: Sat, 27 Jun 2026 11:34:41 +0800
Subject: [PATCH 1/2] fix(prompt): require stash-based reproduction before dismissing test failures

Agents on SWE-bench Pro were declaring success after reasoning that a
failing test was "flaky" or "pre-existing", without actually verifying
this on the unmodified codebase. The ansible-11c177 case is a concrete
example: the agent ran git stash + pytest to confirm the test passed
without its changes, but the real issue was an ordering dependency that
only surfaced in the full test suite.

Add an explicit rule: a test failure may only be dismissed as
pre-existing if `git stash && <run the test> && git stash pop`
reproduces the failure. If the stashed run passes, the failure belongs
to the patch.

Co-Authored-By: Claude Sonnet 4.6
---
 .../assembly/core/src/agentic/agents/prompts/agentic_mode.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/crates/assembly/core/src/agentic/agents/prompts/agentic_mode.md b/src/crates/assembly/core/src/agentic/agents/prompts/agentic_mode.md
index c28f3ff8c..edbdea6ff 100644
--- a/src/crates/assembly/core/src/agentic/agents/prompts/agentic_mode.md
+++ b/src/crates/assembly/core/src/agentic/agents/prompts/agentic_mode.md
@@ -57,6 +57,7 @@ The user will primarily request you perform software engineering tasks. This inc
  - If the task description references specific tests, tracebacks, or reproduction scripts, run those — they were given to you as input.
  - Batch your edits before verifying. Do not run a build after each individual file change — make the related set of changes, then verify once. If you find a problem, fix it and verify again.
  - Treat any failure output as your next signal, not the end state. Do not declare the task done until the last verification you ran is green or every remaining failure is explicitly justified as unrelated to your change.
+ - Do not dismiss a failing test as "flaky" or "pre-existing" unless you can reproduce the failure without your changes: run `git stash && <run the test> && git stash pop` and confirm the test fails on the unmodified codebase. If the stashed run passes, the failure is yours to fix — reasoning that "it seems unrelated" is not sufficient justification.
  - Be careful not to introduce security vulnerabilities such as command injection, XSS, SQL injection, and other OWASP top 10 vulnerabilities. If you notice that you wrote insecure code, immediately fix it.
  - Preserve original evidence when a task involves recovery, forensics, database/WAL files, repository sanitization, binary reverse engineering, or other stateful artifacts. Before running tools that may rewrite metadata or state, copy the relevant originals to a working directory such as `/tmp/work` and experiment on the copy.
  - Before declaring completion, do a final contract check against the user's requested output shape: required files exist at the exact paths, command-line entry points work from a fresh shell, generated directories do not contain extra artifacts, and structured outputs match the expected schema and value formats. When the task may be graded on hidden cases, run at least one low-cost generalization check rather than only validating the visible sample.

From 96d7978f3efa53841d40b50db2ce2d9018479d1f Mon Sep 17 00:00:00 2001
From: nonoqing
Date: Sat, 27 Jun 2026 11:39:44 +0800
Subject: [PATCH 2/2] fix(prompt): fix git stash pop not running on test failure

Using && to chain git stash, test, and git stash pop means a failing
test (non-zero exit) silently skips git stash pop, leaving the patch
stranded in the stash. Rewrite as separate steps with an explicit note
to always run git stash pop regardless of test outcome.

Co-Authored-By: Claude Sonnet 4.6
---
 .../assembly/core/src/agentic/agents/prompts/agentic_mode.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/crates/assembly/core/src/agentic/agents/prompts/agentic_mode.md b/src/crates/assembly/core/src/agentic/agents/prompts/agentic_mode.md
index edbdea6ff..f4fdc3382 100644
--- a/src/crates/assembly/core/src/agentic/agents/prompts/agentic_mode.md
+++ b/src/crates/assembly/core/src/agentic/agents/prompts/agentic_mode.md
@@ -57,7 +57,7 @@ The user will primarily request you perform software engineering tasks. This inc
  - If the task description references specific tests, tracebacks, or reproduction scripts, run those — they were given to you as input.
  - Batch your edits before verifying. Do not run a build after each individual file change — make the related set of changes, then verify once. If you find a problem, fix it and verify again.
  - Treat any failure output as your next signal, not the end state. Do not declare the task done until the last verification you ran is green or every remaining failure is explicitly justified as unrelated to your change.
- - Do not dismiss a failing test as "flaky" or "pre-existing" unless you can reproduce the failure without your changes: run `git stash && <run the test> && git stash pop` and confirm the test fails on the unmodified codebase. If the stashed run passes, the failure is yours to fix — reasoning that "it seems unrelated" is not sufficient justification.
+ - Do not dismiss a failing test as "flaky" or "pre-existing" unless you can reproduce the failure without your changes: run `git stash`, then run the test, then `git stash pop` — always run `git stash pop` as a separate step regardless of the test result, so your changes are never left stranded in the stash. If the test fails on the unmodified codebase, it is pre-existing. If it passes, the failure is yours to fix — reasoning that "it seems unrelated" is not sufficient justification.
  - Be careful not to introduce security vulnerabilities such as command injection, XSS, SQL injection, and other OWASP top 10 vulnerabilities. If you notice that you wrote insecure code, immediately fix it.
  - Preserve original evidence when a task involves recovery, forensics, database/WAL files, repository sanitization, binary reverse engineering, or other stateful artifacts. Before running tools that may rewrite metadata or state, copy the relevant originals to a working directory such as `/tmp/work` and experiment on the copy.
  - Before declaring completion, do a final contract check against the user's requested output shape: required files exist at the exact paths, command-line entry points work from a fresh shell, generated directories do not contain extra artifacts, and structured outputs match the expected schema and value formats. When the task may be graded on hidden cases, run at least one low-cost generalization check rather than only validating the visible sample.
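
The stash, test, pop sequence described in the second patch could be sketched roughly as follows. This is only an illustration, not part of the patch or the prompt file: the names `fails_without_changes`, `run_command`, and `Runner` are hypothetical, and the sketch assumes plain `git stash` (tracked files only) is enough to remove the change under test. The `try`/`finally` plays the role of "always run `git stash pop` as a separate step".

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[Sequence[str]], int]


def run_command(argv: Sequence[str]) -> int:
    """Run a command and return its exit code (non-zero means failure)."""
    return subprocess.run(list(argv)).returncode


def fails_without_changes(test_argv: Sequence[str], run: Runner = run_command) -> bool:
    """Return True if the test fails on the unmodified tree (pre-existing failure)."""
    # `git diff --quiet HEAD` exits 0 when tracked files match HEAD. In that
    # case `git stash` would save nothing, and a later `git stash pop` could
    # restore an unrelated older stash, so refuse to continue.
    if run(["git", "diff", "--quiet", "HEAD"]) == 0:
        raise RuntimeError("no uncommitted changes to stash")
    if run(["git", "stash"]) != 0:
        raise RuntimeError("git stash failed")
    try:
        # Non-zero exit on the stashed tree means the failure is pre-existing.
        return run(test_argv) != 0
    finally:
        # Runs whether the test passed, failed, or raised, so the changes
        # are never left stranded in the stash.
        if run(["git", "stash", "pop"]) != 0:
            raise RuntimeError("git stash pop failed; changes are still in the stash")
```

A call such as `fails_without_changes(["pytest", "tests/test_example.py"])` (the test path is a placeholder) returns `True` only when the failure reproduces without the local changes. A `False` result means the failure belongs to the patch.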