fix(remote-capability-client): clean up request map after Execute returns #22533

Open
nadahalli wants to merge 1 commit into develop from tejaswi/fix-remote-capability-client-request-leak

@nadahalli

Summary

  • client.Execute stored each pending request in requestIDToCallerRequest but never removed it. Cleanup only happened via the expireRequests background ticker.
  • Any caller that re-invoked Execute within the expiry window with the same workflowExecutionID + referenceID hit "failed to store request: request for ID ... already exists" from storeRequest's duplicate check.
  • Symptom in prod: workflow-engine step retries reuse the original referenceID, collide with the still-pending map entry from the prior attempt, and surface as [2]Unknown capability errors. Observed ~80 occurrences over a 2-week window against the VaultDON capability from confidential-compute workflows.
  • Fix: defer c.deleteRequest(req.ID()) immediately after storeRequest succeeds. The deferred call runs regardless of how Execute exits (success, response error, ctx cancellation). It acquires the same mutex as storeRequest and is idempotent, so it composes safely with expireRequests racing the cleanup.

Notes for reviewers

  • New focused unit test in client_internal_test.go covers deleteRequest (removes entry, idempotent).
  • Existing E2E tests still pass. The defer adds no behavioral change to the happy path; verified Test_RemoteExecutionCapability_CapabilityError and Test_RemoteExecutableCapability_RandomCapabilityError locally.
  • I considered adding an E2E regression test that calls Execute twice on the same client with identical metadata and asserts both succeed. It hangs because the server-side Execute also dedupes by requestID and won't issue a response for an immediate-succession identical call. Production doesn't hit this because there's real time between retries. So the unit test is the right scope; broader behavior is verified by reading the 3-line defer + helper.

…urns

The client's Execute method stored each pending request in
requestIDToCallerRequest but never removed it. Cleanup only happened
via the expireRequests background ticker, which runs at intervals of
cfg.requestTimeout. Any caller that re-invoked Execute within that
window with the same workflowExecutionID + reference ID hit
"failed to store request: request for ID ... already exists" from
storeRequest's duplicate check.

Add a defer immediately after storeRequest succeeds so the entry is
removed regardless of how Execute exits (success, response error,
ctx cancellation). The defer uses a new helper deleteRequest that
acquires the mutex symmetrically with storeRequest, and is a no-op
when the entry has already been removed (e.g., by expireRequests
racing the defer).
@nadahalli nadahalli requested review from a team as code owners May 19, 2026 14:23
Copilot AI review requested due to automatic review settings May 19, 2026 14:23
@github-actions

👋 nadahalli, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

@github-actions

I see you updated files related to core. Please run make gocs in the root directory to add a changeset, and include at least one of the following tags in its text:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

@github-actions

✅ No conflicts with other open PRs targeting develop

Copilot AI left a comment

Pull request overview

Risk Rating: HIGH

This PR addresses a correctness issue in the remote executable capability client where Execute() stores an in-flight request in requestIDToCallerRequest but (previously) relied on the periodic expireRequests() reaper to remove it. That could cause retries reusing the same workflowExecutionID + referenceID to fail with a duplicate request ID before the expiry window elapsed.

Changes:

  • Add a deferred cleanup in client.Execute() to remove the request entry from requestIDToCallerRequest after Execute() returns.
  • Introduce a small deleteRequest() helper that acquires the same mutex as storeRequest().
  • Add a focused unit test covering deleteRequest() behavior (removal + idempotency).

Scrupulous human review recommended:

  • client.Execute() deferred deletion vs. response lifecycle: ensure late responses from capability DON peers (beyond quorum) are handled intentionally and do not cause operational side-effects (notably log volume) or mask real “unknown message ID” conditions.
  • The interaction between request cleanup and Receive() behavior (especially logging and message dropping), since it affects production observability and could materially change runtime behavior.
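
The late-response concern can be made concrete with a rough sketch. Everything here is hypothetical: the lateAwareClient type, the response type, the register helper, the late counter, and the receive logic are assumptions for illustration, and the real Receive() may log or drop messages differently.

```go
package main

import (
	"fmt"
	"sync"
)

// response is a hypothetical stand-in for a message from a capability DON peer.
type response struct {
	requestID string
	payload   string
}

// lateAwareClient is illustrative only; it is not the real client type.
type lateAwareClient struct {
	mu      sync.Mutex
	pending map[string]chan response
	late    int // responses that arrived after their request was cleaned up
}

// register stores a pending request and returns the channel its caller waits on.
func (c *lateAwareClient) register(id string) chan response {
	ch := make(chan response, 1)
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pending[id] = ch
	return ch
}

func (c *lateAwareClient) deleteRequest(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.pending, id)
}

// receive reports whether the response was handed to a waiting caller.
// A response for an unknown ID is counted rather than logged, so stragglers
// stay visible without adding one log line per peer.
func (c *lateAwareClient) receive(resp response) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	ch, ok := c.pending[resp.requestID]
	if !ok {
		c.late++
		return false
	}
	select {
	case ch <- resp:
		return true
	default:
		return false // the caller already has a buffered response
	}
}

func main() {
	c := &lateAwareClient{pending: make(map[string]chan response)}
	ch := c.register("exec-1/ref-1")

	delivered := c.receive(response{requestID: "exec-1/ref-1", payload: "quorum reached"})
	<-ch
	c.deleteRequest("exec-1/ref-1") // what the deferred cleanup would do

	lateDelivered := c.receive(response{requestID: "exec-1/ref-1", payload: "straggler"})
	fmt.Println(delivered, lateDelivered, c.late)
}
```

With the deferred cleanup in place, peers that answer after Execute has returned take the unknown-ID branch sooner than they would have under expiry-only cleanup, which is why the handling on that branch deserves a deliberate choice.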

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File | Description
core/capabilities/remote/executable/client.go | Defers removal of request map entry after Execute() returns; adds deleteRequest() helper.
core/capabilities/remote/executable/client_internal_test.go | Adds unit test verifying deleteRequest() removes entries and is idempotent.

Comment on lines +249 to +254
// Ensure the entry is removed from requestIDToCallerRequest regardless of how Execute exits
// (success, response error, ctx cancellation). Without this, the entry only goes away when
// expireRequests() reaps it on its ticker, which means any caller (workflow engine step retry,
// concurrent call with the same execution ID + reference ID) that re-enters Execute within
// that window hits "request for ID ... already exists" from storeRequest above.
defer c.deleteRequest(req.ID())

2 participants