Skip to content

fix(prompt): require stash-based reproduction before dismissing test failures#1327

Merged
nonoqing merged 3 commits into
GCWing:evals-on-releasefrom
nonoqing:evals-on-release
Jun 27, 2026
Merged

fix(prompt): require stash-based reproduction before dismissing test failures#1327
nonoqing merged 3 commits into
GCWing:evals-on-releasefrom
nonoqing:evals-on-release

Conversation

@nonoqing

Copy link
Copy Markdown
Collaborator

Summary

  • Agents on SWE-bench Pro were declaring success after labeling failing tests as "flaky" or "pre-existing" without proof
  • The ansible-11c177 case: agent ran git stash && pytest test_timeout.py && git stash pop, saw it pass, and concluded the failure was unrelated — but the real issue was incomplete implementation causing ordering dependencies in the full test suite
  • Add an explicit rule: a failure may only be dismissed as pre-existing if git stash && <test> && git stash pop reproduces the failure. If the stashed run passes, the failure belongs to the patch and must be fixed

Test plan

  • Re-run ansible-11c177 and similar cases — agent should fix rather than dismiss the failing test
  • Confirm agents still correctly identify and skip genuinely pre-existing failures (stash run also fails)

🤖 Generated with Claude Code

nonoqing and others added 3 commits June 27, 2026 11:34
…failures

Agents on SWE-bench Pro were declaring success after reasoning that a
failing test was "flaky" or "pre-existing", without actually verifying
this on the unmodified codebase. The ansible-11c177 case is a concrete
example: the agent ran git stash + pytest to confirm the test passed
without its changes, but the real issue was an ordering dependency that
only surfaced in the full test suite.

Add an explicit rule: a test failure may only be dismissed as
pre-existing if `git stash && <test> && git stash pop` reproduces the
failure. If the stashed run passes, the failure belongs to the patch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Using && to chain git stash, test, and git stash pop means a failing
test (non-zero exit) silently skips git stash pop, leaving the patch
stranded in the stash. Rewrite as separate steps with an explicit note
to always run git stash pop regardless of test outcome.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@nonoqing nonoqing merged commit c868868 into GCWing:evals-on-release Jun 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant