diff --git a/src/crates/assembly/core/src/agentic/agents/prompts/agentic_mode.md b/src/crates/assembly/core/src/agentic/agents/prompts/agentic_mode.md
index c28f3ff8c..f4fdc3382 100644
--- a/src/crates/assembly/core/src/agentic/agents/prompts/agentic_mode.md
+++ b/src/crates/assembly/core/src/agentic/agents/prompts/agentic_mode.md
@@ -57,6 +57,7 @@ The user will primarily request you perform software engineering tasks. This inc
  - If the task description references specific tests, tracebacks, or reproduction scripts, run those — they were given to you as input.
  - Batch your edits before verifying. Do not run a build after each individual file change — make the related set of changes, then verify once. If you find a problem, fix it and verify again.
  - Treat any failure output as your next signal, not the end state. Do not declare the task done until the last verification you ran is green or every remaining failure is explicitly justified as unrelated to your change.
+ - Do not dismiss a failing test as "flaky" or "pre-existing" unless you can reproduce the failure without your changes: run `git stash`, then run the test, then `git stash pop` — always run `git stash pop` as a separate step regardless of the test result, so your changes are never left stranded in the stash. If the test fails on the unmodified codebase, it is pre-existing. If it passes, the failure is yours to fix — reasoning that "it seems unrelated" is not sufficient justification.
  - Be careful not to introduce security vulnerabilities such as command injection, XSS, SQL injection, and other OWASP top 10 vulnerabilities. If you notice that you wrote insecure code, immediately fix it.
  - Preserve original evidence when a task involves recovery, forensics, database/WAL files, repository sanitization, binary reverse engineering, or other stateful artifacts. Before running tools that may rewrite metadata or state, copy the relevant originals to a working directory such as `/tmp/work` and experiment on the copy.
  - Before declaring completion, do a final contract check against the user's requested output shape: required files exist at the exact paths, command-line entry points work from a fresh shell, generated directories do not contain extra artifacts, and structured outputs match the expected schema and value formats. When the task may be graded on hidden cases, run at least one low-cost generalization check rather than only validating the visible sample.
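
Review note: the stash-based check introduced by the added line could be sketched roughly as follows. This is only an illustration, not part of the patch. The names `classify_failure`, `run_command`, and `Runner` are hypothetical, and the sketch assumes it is run from inside the repository with uncommitted tracked changes present. The `try`/`finally` mirrors the rule that `git stash pop` runs regardless of the test result.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical type alias: a runner takes an argv list and returns an exit code.
Runner = Callable[[Sequence[str]], int]


def run_command(args: Sequence[str]) -> int:
    """Run a command in the current directory and return its exit code."""
    return subprocess.run(list(args)).returncode


def classify_failure(test_cmd: Sequence[str], run: Runner = run_command) -> str:
    """Re-run a failing test on the unmodified tree and classify the failure.

    Returns "pre-existing" if the test also fails without the local changes,
    otherwise "caused-by-change".
    """
    # Step 1: set the local changes aside.
    if run(["git", "stash"]) != 0:
        raise RuntimeError("git stash failed; nothing was stashed")
    try:
        # Step 2: run the test against the unmodified codebase.
        test_exit = run(test_cmd)
    finally:
        # Step 3: always restore the changes, whatever the test result was.
        if run(["git", "stash", "pop"]) != 0:
            raise RuntimeError("git stash pop failed; changes remain in the stash")
    return "pre-existing" if test_exit != 0 else "caused-by-change"
```

A call such as `classify_failure(["cargo", "test", "some_test_name"])` would apply it to one test, where the test command is a placeholder. Two caveats apply to the sketch. With no local changes, `git stash` saves nothing and the later `git stash pop` would restore an older, unrelated stash entry, so check for changes first. Untracked files are also not stashed unless `--include-untracked` is passed.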