Make env server retries idempotent #1565
Approvability
Verdict: Needs human review. This PR introduces significant changes to the client-server communication protocol, including a new ACK mechanism, response caching, worker incarnation tracking, and modified retry semantics. These core runtime behavior changes to the distributed system warrant human review.

Independent PR targeting main.
This PR does not depend on #1564, #1565, or #1566. The three changes were originally opened as a stack, but they touch separate subsystems and can be reviewed and merged independently.
Summary
This makes ZMQ env-server retries reuse one logical request ID and prevents duplicate retry dispatch from executing the same rollout twice.
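A minimal sketch of the stable-ID idea, not the PR's actual code: the request ID is generated once, outside the retry loop, so every attempt carries the same ID. The names send_with_retries and send_once, the max_attempts default, and the use of TimeoutError as the transport failure are assumptions made for illustration.

```python
import uuid
from typing import Callable, Optional


def send_with_retries(
    send_once: Callable[[str, bytes], bytes],  # hypothetical transport call
    payload: bytes,
    max_attempts: int = 3,  # assumed value, not taken from the PR
) -> bytes:
    """Retry a request while reusing one logical request ID."""
    # Generated once, outside the loop: every attempt shares this ID, which
    # lets the receiving side recognize a retry as a duplicate.
    request_id = uuid.uuid4().hex
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return send_once(request_id, payload)
        except TimeoutError as exc:
            # Transport failure: try again with the SAME request ID.
            last_error = exc
    raise RuntimeError(
        f"request {request_id} failed after {max_attempts} attempts"
    ) from last_error
```

Generating the ID inside the loop would make each attempt look like new work to the server, which is the double-execution case this PR is meant to prevent.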
Changes
Verification
Note
Medium Risk
Changes rollout dispatch and retry semantics on the critical client–router–worker path; incorrect idempotency could skip work or double-execute, though incarnation fencing and cache replay are designed to prevent that.
Overview
Makes ZMQ env-server retries idempotent so the same logical rollout is not executed twice after transport failures or duplicate sends.
Client (ZMQEnvClient) keeps one full UUID hex request ID for the whole send_request retry loop (no new ID per attempt). After a response is matched to a pending future, it sends an ack frame so the server can drop cached data; late responses with no pending entry are ignored without acking (so the server can still replay). cancel_all_pending gains notify_server=False for unhealthy transitions so local futures fail without spamming cancel frames.
Router (EnvRouter) adds a TTL- and size-bounded completed-response cache. Duplicate dispatches with the same ID replay from cache or re-bind client_id on an in-flight request instead of re-queuing work. Workers tag responses/stats with a per-process incarnation; stale messages from restarted workers are dropped. Worker IPC paths are only registered once on restart.
Wire format: worker→router responses are now 5 frames (worker id, incarnation, client id, request id, payload). ZMQEnvServer handles client ack payloads via router.ack_request().
New TestIdempotentRetries covers ack behavior, stable request IDs, cache pruning, and IPC path deduplication.
Reviewed by Cursor Bugbot for commit a1e39c5.
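The five-frame worker response and the incarnation check described above can be sketched as follows. Only the frame order (worker id, incarnation, client id, request id, payload) comes from the summary; the WorkerResponse and IncarnationFence names, the use of raw bytes for every field, and the register/accept methods are hypothetical and may differ from what EnvRouter actually does.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class WorkerResponse:
    # Field order mirrors the five frames listed in the summary.
    worker_id: bytes
    incarnation: bytes
    client_id: bytes
    request_id: bytes
    payload: bytes


def parse_worker_response(frames: List[bytes]) -> WorkerResponse:
    """Unpack a multipart worker message, rejecting any other frame count."""
    if len(frames) != 5:
        raise ValueError(f"expected 5 frames, got {len(frames)}")
    return WorkerResponse(*frames)


class IncarnationFence:
    """Hypothetical helper: tracks the current incarnation of each worker."""

    def __init__(self) -> None:
        self._current: Dict[bytes, bytes] = {}

    def register(self, worker_id: bytes, incarnation: bytes) -> None:
        # Called when a worker (re)starts; replaces any older incarnation.
        self._current[worker_id] = incarnation

    def accept(self, response: WorkerResponse) -> bool:
        # A message tagged with a superseded incarnation comes from a
        # worker process that has since restarted, so it is dropped.
        return self._current.get(response.worker_id) == response.incarnation
```

With this shape, a response that was still in flight when its worker restarted fails the accept check and never reaches a client.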
Note
Make env server retries idempotent by caching completed responses and acknowledging delivery
- EnvRouter now maintains a bounded, TTL-based cache (max 10,000 entries, 300s TTL) of completed responses keyed by request ID, enabling cached replay on retry instead of re-executing work.
- EnvRouter.dispatch_request coalesces duplicate in-flight requests by updating the destination client ID rather than dispatching again, and replays cached responses immediately for already-completed requests.
- ZMQEnvClient.send_request now generates a single UUID outside the retry loop so all retries share the same request ID, enabling the server to recognize and replay cached results.
- ZMQEnvClient.receive_loop sends an ack frame to the server after receiving a valid response; the server forwards this to EnvRouter.ack_request to evict the entry from the cache.
- Workers include an incarnation token in all response and stats frames; the router ignores messages from stale worker incarnations to prevent misrouted responses after restarts.
- on_became_unhealthy no longer sends cancel notifications to the server for each pending request (notify_server=False).
Macroscope summarized a1e39c5.
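A rough sketch of a completed-response cache with the bounds quoted above (10,000 entries, 300s TTL), where an ack evicts the entry. The class name CompletedResponseCache, its method names, the injectable clock, and the oldest-first eviction order are assumptions for illustration; the PR's EnvRouter may organize this differently.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class CompletedResponseCache:
    """Hypothetical TTL- and size-bounded cache keyed by request ID."""

    def __init__(
        self,
        max_entries: int = 10_000,   # bound quoted in the summary
        ttl_seconds: float = 300.0,  # TTL quoted in the summary
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: "OrderedDict[str, Tuple[float, bytes]]" = OrderedDict()

    def _prune(self) -> None:
        now = self._clock()
        # Entries are kept in insertion order, so the oldest is at the front.
        while self._entries:
            request_id, (stored_at, _) = next(iter(self._entries.items()))
            expired = now - stored_at >= self._ttl
            if not expired and len(self._entries) <= self._max:
                break
            del self._entries[request_id]

    def put(self, request_id: str, payload: bytes) -> None:
        # Re-inserting moves the entry to the back with a fresh timestamp.
        self._entries.pop(request_id, None)
        self._entries[request_id] = (self._clock(), payload)
        self._prune()

    def get(self, request_id: str) -> Optional[bytes]:
        """Return the cached payload for replay, or None if absent."""
        self._prune()
        entry = self._entries.get(request_id)
        return entry[1] if entry is not None else None

    def ack(self, request_id: str) -> None:
        # The client confirmed delivery, so the payload is no longer needed.
        self._entries.pop(request_id, None)
```

A retry that arrives before the ack finds the payload via get and can be answered without re-running the rollout; after the ack, or once the TTL or size bound is hit, the entry is gone.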