fix(remote-capability-client): clean up request map after Execute returns#22533
fix(remote-capability-client): clean up request map after Execute returns#22533nadahalli wants to merge 1 commit into
Conversation
…urns The client's Execute method stored each pending request in requestIDToCallerRequest but never removed it. Cleanup only happened via the expireRequests background ticker, which runs at intervals of cfg.requestTimeout. Any caller that re-invoked Execute within that window with the same workflowExecutionID + reference ID hit "failed to store request: request for ID ... already exists" from storeRequest's duplicate check. Add a defer immediately after storeRequest succeeds so the entry is removed regardless of how Execute exits (success, response error, ctx cancellation). The defer uses a new helper deleteRequest that acquires the mutex symmetrically with storeRequest, and is a no-op when the entry has already been removed (e.g., by expireRequests racing the defer).
|
👋 nadahalli, thanks for creating this pull request! To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team. Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks! |
|
I see you updated files related to
|
|
✅ No conflicts with other open PRs targeting |
There was a problem hiding this comment.
Pull request overview
Risk Rating: HIGH
This PR addresses a correctness issue in the remote executable capability client where Execute() stores an in-flight request in requestIDToCallerRequest but (previously) relied on the periodic expireRequests() reaper to remove it. That could cause retries reusing the same workflowExecutionID + referenceID to fail with a duplicate request ID before the expiry window elapsed.
Changes:
- Add a deferred cleanup in
client.Execute()to remove the request entry fromrequestIDToCallerRequestafterExecute()returns. - Introduce a small
deleteRequest()helper that acquires the same mutex asstoreRequest(). - Add a focused unit test covering
deleteRequest()behavior (removal + idempotency).
Scrupulous human review recommended:
client.Execute()deferred deletion vs. response lifecycle: ensure late responses from capability DON peers (beyond quorum) are handled intentionally and do not cause operational side-effects (notably log volume) or mask real “unknown message ID” conditions.- The interaction between request cleanup and
Receive()behavior (especially logging and message dropping), since it affects production observability and could materially change runtime behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| core/capabilities/remote/executable/client.go | Defers removal of request map entry after Execute() returns; adds deleteRequest() helper. |
| core/capabilities/remote/executable/client_internal_test.go | Adds unit test verifying deleteRequest() removes entries and is idempotent. |
| // Ensure the entry is removed from requestIDToCallerRequest regardless of how Execute exits | ||
| // (success, response error, ctx cancellation). Without this, the entry only goes away when | ||
| // expireRequests() reaps it on its ticker, which means any caller (workflow engine step retry, | ||
| // concurrent call with the same execution ID + reference ID) that re-enters Execute within | ||
| // that window hits "request for ID ... already exists" from storeRequest above. | ||
| defer c.deleteRequest(req.ID()) |
|




Summary
client.Executestored each pending request inrequestIDToCallerRequestbut never removed it. Cleanup only happened via theexpireRequestsbackground ticker.Executewithin the expiry window with the sameworkflowExecutionID+referenceIDhit"failed to store request: request for ID ... already exists"fromstoreRequest's duplicate check.[2]Unknowncapability errors. Observed ~80 occurrences over a 2-week window against the VaultDON capability from confidential-compute workflows.c.deleteRequest(req.ID())immediately afterstoreRequestsucceeds. Runs regardless of howExecuteexits (success, response error, ctx cancellation). Symmetricmutexacquisition withstoreRequest. Idempotent, so it composes safely withexpireRequestsracing the cleanup.Notes for reviewers
client_internal_test.gocoversdeleteRequest(removes entry, idempotent).Test_RemoteExecutionCapability_CapabilityErrorandTest_RemoteExecutableCapability_RandomCapabilityErrorlocally.Executetwice on the same client with identical metadata and asserts both succeed. It hangs because the server-sideExecutealso dedupes by requestID and won't issue a response for an immediate-succession identical call. Production doesn't hit this because there's real time between retries. So the unit test is the right scope; broader behavior is verified by reading the 3-line defer + helper.