fix(agents-server): release pull-wake claim row even when in-memory token is missing#4346
fix(agents-server): release pull-wake claim row even when in-memory token is missing#4346kevin-dp wants to merge 2 commits into
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4346 +/- ##
===========================================
+ Coverage 55.84% 73.92% +18.08%
===========================================
Files 245 48 -197
Lines 24847 5592 -19255
Branches 6878 1718 -5160
===========================================
- Hits 13876 4134 -9742
+ Misses 10957 1445 -9512
+ Partials 14 13 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…oken is missing
The release path in callback-forward was gated by `stillOwnsClaim`, an
in-memory check that fails after server restart or when a newer wake on
the same stream evicts the token. When that happened, the consumer_claims
row stayed at status=active indefinitely and the entity remained stuck at
status=running long after `done` arrived.
Decouple the concerns:
- materializeReleasedClaim runs whenever epoch is defined (DB identity is
sufficient to release the row). Now returns `{ claim, entityCleared }`
where entityCleared is true iff our (consumerId, epoch) was the active
dispatch and we just cleared it.
- updateStatus(idle) and onEntityChanged fire when `entityCleared ||
stillOwnsClaim` — covers happy path, server restart, retry-after-failed-
updateStatus, while still leaving status=running when a newer wake holds
the entity's active dispatch.
- clearStream remains gated by stillOwnsClaim so we never clear another
consumer's token from under it.
Regression tests in test/webhook-forward-routing.test.ts cover the three
failure modes (lost token, evicted token, retry).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rename the describe block and the three tests so they read as positive
behavioral contracts ("releases ... and marks the entity idle when
...") rather than as bug-reproduction narratives. Drop comments that
described the production failure mode now that the names carry the
intent.
Also assert in the newer-wake-takes-over case that wake-2's in-memory
write token is preserved by wake-1's done — protects against future
regressions that might clear the wrong consumer's token from the
shared map.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8c43f69 to
8b288ee
Compare
KyleAMathews
left a comment
There was a problem hiding this comment.
Looks good — the core fix is sound. Decoupling durable claim release from in-memory write-token ownership is the right design correction here, and preserving the stillOwnsClaim gate for in-memory cleanup protects newer wakes from being disturbed. The added router-level regression coverage for held-token, missing-token, and newer-wake-took-over scenarios is also useful.\n\nMinor non-blocking follow-ups worth considering:\n\n1. Add lower-level registry tests around materializeReleasedClaim/entityCleared semantics: entityCleared=true when the active dispatch matches (consumerId, epoch), false when a newer dispatch is active, and false when no matching claim row exists.\n2. Consider the edge case where the first done attempt successfully clears the DB dispatch state but fails before updateStatus('idle'), then the server restarts before a retry. On retry, both entityCleared and stillOwnsClaim may be false, so the entity status may remain running. This seems outside the main bug fixed here, but it may be worth tracking separately if that recovery path matters.\n\nApproving — no blocking concerns from my review.
Summary
Fixes #4340 — pull-wake claims leaking in
consumer_claimswhen the in-memoryClaimWriteTokenStoreno longer holds the consumer's token at the timesendDonearrives.The release path in
callback-forwardwas gated bystillOwnsClaim, an in-memory check. When that check fails — server restart between mint and done, parallel wakes evicting each other's tokens, or a retry after a transientupdateStatusfailure — the entire release block is skipped:materializeReleasedClaimnever runs, the entity stays stuck atstatus='running', and theconsumer_claimsrow staysactiveindefinitely.The steady-state "send one message, wait" path is not affected by this bug: it releases correctly via the runtime's
sendDoneafter theidleTimeoutwindow (default 5 min viapackages/agents/src/bootstrap.ts:146). The bug only fires in the failure modes documented in Test scenarios below.Root cause
packages/agents-server/src/routing/internal-router.tshad all three release actions behind the same in-memory gate:stillOwnsClaimis the right gate for write authorization during the agent run, but it's the wrong gate for releasing the DB row, which is keyed by(consumerId, epoch). The DB primary key is authoritative identity — the in-memory token state is orthogonal.The fix
Three concerns, three gates:
materializeReleasedClaim) — runs wheneverepochis defined.(consumerId, epoch)is the DB primary key; that's enough.onEntityChanged— runs whenentityCleared || stillOwnsClaim.entityClearedis a new return field frommaterializeReleasedClaim, set totrueonly when our(consumerId, epoch)was the active dispatch row and we just nulled it out. The||handles two non-trivial cases: server restart (DB has us active, token is gone) and retry after a failedupdateStatus(state cleared on first attempt, token still held).clearStream) — remains gated bystillOwnsClaimso a newer consumer's token is never cleared out from under it.materializeReleasedClaimAPI changeOnly one production caller (
internal-router.ts); both that caller and the test mock are updated. The.returning()on theentityDispatchStateUPDATE now reports whether our row was actually cleared (vs. a no-op because a newer claim is active).Test scenarios
The fix decouples the three concerns. Below: ✓ = action happens, × = action does NOT happen (and is correct that it doesn't).
entityClearedstillOwnsClaimupdateStatusthrew; same done retried)New tests in
packages/agents-server/test/webhook-forward-routing.test.ts > claim release on done callback (regression for #4340)cover scenarios A–C. Existing tests inserver-claim-write-token.test.tscover D and E and continue to pass after the fix.Verified
materializeReleasedClaim); D fails (updateStatusskipped on retry). Post-fix, all five scenarios produce the documented behavior.active_count: 0 → 1), agent completes, runtime callssendDoneafteridleTimeout, server fires the new release path,active_count: 1 → 0, entity status transitions back toidle.Not addressed in this PR
donecallback is coming. Would need a reaper job or admin command.lease_expires_at: nullissue (Pull-wake: heartbeat path nulls out lease_expires_at on consumer_claims #4341): independent. Without a lease, even a reaper job can't time-out claims safely.Base branch note
This PR targets
fix-pull-wake(#4339), notmain, becausematerializeReleasedClaimwas introduced in #4308 which is part of thefix-pull-wakelineage but not yet inmain. Merge order: this → fix-pull-wake → main.🤖 Generated with Claude Code