Skip to content

Make env server retries idempotent#1565

Open
xeophon wants to merge 3 commits into
mainfrom
df89/zmq-idempotent-retries
Open

Make env server retries idempotent#1565
xeophon wants to merge 3 commits into
mainfrom
df89/zmq-idempotent-retries

Conversation

@xeophon

@xeophon xeophon commented Jun 8, 2026

Copy link
Copy Markdown
Member

Independent PR targeting main.

This PR does not depend on #1564, #1565, or #1566. The three changes were originally opened as a stack, but they touch separate subsystems and can be reviewed and merged independently.

Summary

This makes ZMQ env-server retries reuse one logical request ID and prevents duplicate retry dispatch from executing the same rollout twice.

Changes

  • reuse the same short request ID across retry attempts
  • acknowledge completed responses so the server can drop its response cache
  • replay completed duplicate requests and reroute active duplicate requests
  • fence stale worker responses and stats with worker incarnation IDs

Verification

  • uv run --frozen pytest tests/test_path_utils.py tests/test_save_utils.py tests/test_env_server.py -q
  • uv run --frozen pytest tests/test_v1_runtime_lifecycle.py -q -k 'not test_real_sandbox_base_program_calls_host_callable_tool'
  • uv run --frozen ruff format --check && uv run --frozen ruff check .

Note

Medium Risk
Changes rollout dispatch and retry semantics on the critical client–router–worker path; incorrect idempotency could skip work or double-execute, though incarnation fencing and cache replay are designed to prevent that.

Overview
Makes ZMQ env-server retries idempotent so the same logical rollout is not executed twice after transport failures or duplicate sends.

Client (ZMQEnvClient) keeps one full UUID hex request ID for the whole send_request retry loop (no new ID per attempt). After a response is matched to a pending future, it sends an ack frame so the server can drop cached data; late responses with no pending entry are ignored without acking (so the server can still replay). cancel_all_pending gains notify_server=False for unhealthy transitions so local futures fail without spamming cancel frames.

Router (EnvRouter) adds a TTL- and size-bounded completed-response cache. Duplicate dispatches with the same ID replay from cache or re-bind client_id on an in-flight request instead of re-queuing work. Workers tag responses/stats with a per-process incarnation; stale messages from restarted workers are dropped. Worker IPC paths are only registered once on restart.

Wire format: worker→router responses are now 5 frames (worker id, incarnation, client id, request id, payload). ZMQEnvServer handles client ack payloads via router.ack_request().

New TestIdempotentRetries covers ack behavior, stable request IDs, cache pruning, and IPC path deduplication.

Reviewed by Cursor Bugbot for commit a1e39c5. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Make env server retries idempotent by caching completed responses and acknowledging delivery

  • EnvRouter now maintains a bounded, TTL-based cache (max 10,000 entries, 300s TTL) of completed responses keyed by request ID, enabling cached replay on retry instead of re-executing work.
  • EnvRouter.dispatch_request coalesces duplicate in-flight requests by updating the destination client ID rather than dispatching again, and replays cached responses immediately for already-completed requests.
  • ZMQEnvClient.send_request now generates a single UUID outside the retry loop so all retries share the same request ID, enabling the server to recognize and replay cached results.
  • ZMQEnvClient.receive_loop sends an ack frame to the server after receiving a valid response; the server forwards this to EnvRouter.ack_request to evict the entry from the cache.
  • Workers now include a unique incarnation token in all response and stats frames; the router ignores messages from stale worker incarnations to prevent misrouted responses after restarts.
  • Behavioral Change: on_became_unhealthy no longer sends cancel notifications to the server for each pending request (notify_server=False).

Macroscope summarized a1e39c5.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 08c4233. Configure here.

Comment thread verifiers/serve/client/zmq_env_client.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 08c4233a00

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread verifiers/serve/client/zmq_env_client.py Outdated
Comment thread verifiers/serve/client/zmq_env_client.py Outdated
Comment thread verifiers/serve/client/zmq_env_client.py Outdated
@macroscopeapp

macroscopeapp Bot commented Jun 8, 2026

Copy link
Copy Markdown

Approvability

Verdict: Needs human review

This PR introduces significant changes to the client-server communication protocol including a new ACK mechanism, response caching, worker incarnation tracking, and modified retry semantics. These core runtime behavior changes to the distributed system warrant human review.

You can customize Macroscope's approvability policy. Learn more.

@xeophon xeophon force-pushed the df89/zmq-idempotent-retries branch from 08c4233 to e4bb81b Compare June 8, 2026 11:45
@xeophon xeophon changed the base branch from df89/eval-append-only-results to main June 8, 2026 11:45

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e4bb81b15c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread verifiers/serve/server/env_router.py Outdated
Comment thread verifiers/serve/server/env_router.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant