Skip to content

fix(agents-server): release pull-wake claim row even when in-memory token is missing#4346

Open
kevin-dp wants to merge 2 commits into
mainfrom
fix-claim-release-after-dispatch
Open

fix(agents-server): release pull-wake claim row even when in-memory token is missing#4346
kevin-dp wants to merge 2 commits into
mainfrom
fix-claim-release-after-dispatch

Conversation

@kevin-dp
Copy link
Copy Markdown
Contributor

@kevin-dp kevin-dp commented May 18, 2026

Summary

Fixes #4340 — pull-wake claims leaking in consumer_claims when the in-memory ClaimWriteTokenStore no longer holds the consumer's token at the time sendDone arrives.

The release path in callback-forward was gated by stillOwnsClaim, an in-memory check. When that check fails — server restart between mint and done, parallel wakes evicting each other's tokens, or a retry after a transient updateStatus failure — the entire release block is skipped: materializeReleasedClaim never runs, the entity stays stuck at status='running', and the consumer_claims row stays active indefinitely.

The steady-state "send one message, wait" path is not affected by this bug: it releases correctly via the runtime's sendDone after the idleTimeout window (default 5 min via packages/agents/src/bootstrap.ts:146). The bug only fires in the failure modes documented in Test scenarios below.

Root cause

packages/agents-server/src/routing/internal-router.ts had all three release actions behind the same in-memory gate:

if (entity && stillOwnsClaim) {
  await materializeReleasedClaim(...)   // DB row release
  await updateStatus(entity.url, 'idle') // entity status
  clearStream(...)                       // in-memory token cleanup
  await onEntityChanged(entity.url)
} else if (stillOwnsClaim) {
  clearStream(...)
} else if (entity) {
  log.info('done ignored for stale claim ...')
}

stillOwnsClaim is the right gate for write authorization during the agent run, but it's the wrong gate for releasing the DB row, which is keyed by (consumerId, epoch). The DB primary key is authoritative identity — the in-memory token state is orthogonal.

The fix

Three concerns, three gates:

  1. DB row release (materializeReleasedClaim) — runs whenever epoch is defined. (consumerId, epoch) is the DB primary key; that's enough.
  2. Entity status → idle + onEntityChanged — runs when entityCleared || stillOwnsClaim. entityCleared is a new return field from materializeReleasedClaim, set to true only when our (consumerId, epoch) was the active dispatch row and we just nulled it out. The || handles two non-trivial cases: server restart (DB has us active, token is gone) and retry after a failed updateStatus (state cleared on first attempt, token still held).
  3. In-memory token cleanup (clearStream) — remains gated by stillOwnsClaim so a newer consumer's token is never cleared out from under it.

materializeReleasedClaim API change

- ): Promise<ConsumerClaim | null> {
+ ): Promise<{ claim: ConsumerClaim | null; entityCleared: boolean }> {

Only one production caller (internal-router.ts); both that caller and the test mock are updated. The .returning() on the entityDispatchState UPDATE now reports whether our row was actually cleared (vs. a no-op because a newer claim is active).

Test scenarios

The fix decouples the three concerns. Below: ✓ = action happens, × = action does NOT happen (and is correct that it doesn't).

Scenario entityCleared stillOwnsClaim DB row released Entity → idle
A. Happy path (mint + done) true true ✓ released ✓ goes idle
B. Server restart (no in-memory token, DB row still active) true false ✓ released ✓ goes idle
C. Newer wake (wake-1 done after wake-2 takes over the stream) false false ✓ wake-1's row released × stays running — wake-2 is in flight
D. Retry (first done's updateStatus threw; same done retried) false true ✓ no-op (already released) ✓ goes idle
E. Legacy stale-done test (test setup never materialized active claim; token evicted) false false no row to release × stays running — newer claim conceptually in flight

New tests in packages/agents-server/test/webhook-forward-routing.test.ts > claim release on done callback (regression for #4340) cover scenarios A–C. Existing tests in server-claim-write-token.test.ts cover D and E and continue to pass after the fix.

Verified

  • Unit tests: deterministic. Pre-fix, B and C fail (zero invocations of materializeReleasedClaim); D fails (updateStatus skipped on retry). Post-fix, all five scenarios produce the documented behavior.
  • Manual run-through against a local desktop + agents-server: send one message, dispatch claims (active_count: 0 → 1), agent completes, runtime calls sendDone after idleTimeout, server fires the new release path, active_count: 1 → 0, entity status transitions back to idle.

Not addressed in this PR

Base branch note

This PR targets fix-pull-wake (#4339), not main, because materializeReleasedClaim was introduced in #4308 which is part of the fix-pull-wake lineage but not yet in main. Merge order: this → fix-pull-wake → main.

🤖 Generated with Claude Code

@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 75.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.92%. Comparing base (1ab43f5) to head (8b288ee).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
packages/agents-server/src/entity-registry.ts 60.00% 2 Missing ⚠️
...kages/agents-server/src/routing/internal-router.ts 85.71% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4346       +/-   ##
===========================================
+ Coverage   55.84%   73.92%   +18.08%     
===========================================
  Files         245       48      -197     
  Lines       24847     5592    -19255     
  Branches     6878     1718     -5160     
===========================================
- Hits        13876     4134     -9742     
+ Misses      10957     1445     -9512     
+ Partials       14       13        -1     
Flag Coverage Δ
packages/agents ?
packages/agents-runtime ?
packages/agents-server 73.92% <75.00%> (-0.01%) ⬇️
packages/agents-server-ui ?
packages/electric-ax ?
typescript 73.92% <75.00%> (+18.08%) ⬆️
unit-tests 73.92% <75.00%> (+18.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

kevin-dp and others added 2 commits May 20, 2026 10:20
…oken is missing

The release path in callback-forward was gated by `stillOwnsClaim`, an
in-memory check that fails after server restart or when a newer wake on
the same stream evicts the token. When that happened, the consumer_claims
row stayed at status=active indefinitely and the entity remained stuck at
status=running long after `done` arrived.

Decouple the concerns:
- materializeReleasedClaim runs whenever epoch is defined (DB identity is
  sufficient to release the row). Now returns `{ claim, entityCleared }`
  where entityCleared is true iff our (consumerId, epoch) was the active
  dispatch and we just cleared it.
- updateStatus(idle) and onEntityChanged fire when `entityCleared ||
  stillOwnsClaim` — covers happy path, server restart, retry-after-failed-
  updateStatus, while still leaving status=running when a newer wake holds
  the entity's active dispatch.
- clearStream remains gated by stillOwnsClaim so we never clear another
  consumer's token from under it.

Regression tests in test/webhook-forward-routing.test.ts cover the three
failure modes (lost token, evicted token, retry).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rename the describe block and the three tests so they read as positive
behavioral contracts ("releases ... and marks the entity idle when
...") rather than as bug-reproduction narratives. Drop comments that
described the production failure mode now that the names carry the
intent.

Also assert in the newer-wake-takes-over case that wake-2's in-memory
write token is preserved by wake-1's done — protects against future
regressions that might clear the wrong consumer's token from the
shared map.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kevin-dp kevin-dp self-assigned this May 20, 2026
@kevin-dp kevin-dp force-pushed the fix-claim-release-after-dispatch branch from 8c43f69 to 8b288ee Compare May 20, 2026 08:24
Copy link
Copy Markdown
Contributor

@KyleAMathews KyleAMathews left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good — the core fix is sound. Decoupling durable claim release from in-memory write-token ownership is the right design correction here, and preserving the stillOwnsClaim gate for in-memory cleanup protects newer wakes from being disturbed. The added router-level regression coverage for held-token, missing-token, and newer-wake-took-over scenarios is also useful.\n\nMinor non-blocking follow-ups worth considering:\n\n1. Add lower-level registry tests around materializeReleasedClaim/entityCleared semantics: entityCleared=true when the active dispatch matches (consumerId, epoch), false when a newer dispatch is active, and false when no matching claim row exists.\n2. Consider the edge case where the first done attempt successfully clears the DB dispatch state but fails before updateStatus('idle'), then the server restarts before a retry. On retry, both entityCleared and stillOwnsClaim may be false, so the entity status may remain running. This seems outside the main bug fixed here, but it may be worth tracking separately if that recovery path matters.\n\nApproving — no blocking concerns from my review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pull-wake: active claim leaks when in-memory write-token is missing (server restart, newer-wake eviction, retry)

4 participants