Backport #2295: fix(world-local,world-postgres): make duplicate hook_created idempotent#2374
Open
github-actions[bot] wants to merge 1 commit into
Open
Backport #2295: fix(world-local,world-postgres): make duplicate hook_created idempotent#2374github-actions[bot] wants to merge 1 commit into
github-actions[bot] wants to merge 1 commit into
Conversation
…nt (#2295) * fix(world-local): make duplicate hook_created idempotent Duplicate processing of the same hook_created — same runId, hookId, and token, e.g. cross-process replay or queue redelivery — was being recorded as a hook_conflict in the event log, which then replayed as a self- conflict HookConflictError. The fix mirrors the existing step_created duplicate-correlation path: when the exclusive token claim fails and the existing claim has the same (runId, hookId), throw EntityConflictError so the runtime's existing concurrent-replay catch path swallows it. Different runId or hookId reusing the same token still produces a real hook_conflict. The persisted token claim already carried hookId; only the read schema was dropping it. The schema now preserves hookId (marked optional for backward compatibility with older claim files). Fixes #2283 * fix(world-postgres): make duplicate hook_created idempotent world-postgres has the same gap as world-local was just fixed for: the duplicate-token check in events.create unconditionally writes a hook_conflict event when an existing hook with the same token is found, even when the existing hook has the same (runId, hookId) as the incoming event. The unique partial index on workflow_events does not catch this because the duplicate path inserts hook_conflict, not hook_created. Mirror the world-local fix: when the existing hook's (runId, hookId) matches the incoming event, throw EntityConflictError so the runtime's existing concurrent-replay catch path swallows it. Different runId or hookId reusing the same token still produces a real hook_conflict. Refs #2283 * test(e2e): add regression test for hook_conflict from same-tick replay race Regression test for #1665 / #2283. A parent workflow awaits 6 child workflows with Promise.all; each child does a tiny step and creates one webhook. Awaited children flatten into the parent run, so all webhook creations land on the same workflow body. When their step resolutions align in the same tick the workflow body is re-walked and each pass submits hook_created with the same deterministic (correlationId, token). Before the world-side idempotency fix, the world wrote hook_conflict events for the duplicates and the workflow failed with HookConflictError. With the fix, duplicates throw EntityConflictError (swallowed by the suspension handler), no hook_conflict events appear in the log, and the webhooks resolve normally. Verified locally against world-local: the test fails reliably (3/3) on the unfixed code and passes reliably (5/5) on the fixed code. * test(e2e): rewrite parallelStepsThenWebhookWorkflow to match the actual #1665 repro The earlier version invoked another 'use workflow' function directly from inside the parent workflow, which is not a valid child-workflow invocation (child workflows must be spawned via start()) and didn't mirror the bug shape on #1665 anyway. Rewrite the workflow as a single 'use workflow' function that exactly mirrors Paolo's minimal repro: await Promise.all([stepA(), stepB()]); using webhook = createWebhook(); await webhook; The for-loop runs N independent iterations of that sequence in series, each disposing its webhook via 'using' before the next, to give the timing-sensitive race multiple chances to fire. The race is hard to force deterministically on fast local dev — but the same (runId, hookId) idempotency invariant is covered deterministically by the new unit tests in world-local and world-postgres. This e2e test serves as a higher-level regression net: its assertions (no hook_conflict event in the log, no HookConflictError-failed run) are correct whether the race fires or not, and will catch any future regression on a run that does hit it. * fix(world-local,world-postgres): recover crash-orphaned hook claims/rows instead of suppressing the retry Addresses review feedback on PR #2295. The original idempotency fix made duplicate same-(runId, hookId) hook_created submissions throw EntityConflictError so the suspension handler's concurrent-replay catch path swallows them. But the claim file (world-local) and hook row (world-postgres) are written before the durable hook_created event, and the writes are not atomic. A process / DB interruption between the claim/hook write and the event write leaves an orphaned claim/hook row; the retry then matched the same (runId, hookId), threw EntityConflictError, got swallowed, and the run was permanently left with no hook_created event in the log. world-local: - Add a per-(runId, hookId) in-process mutex (withHookLock) mirroring the existing withStepLock, so two same-tick concurrent calls serialize on the entity write and the dedup branch never observes an in-flight winner mid-write. - In the dedup branch, when the existing claim is for the same (runId, hookId) we are trying to create, check whether the durable hook entity actually exists on disk: - exists → real duplicate: throw EntityConflictError as before. - missing → orphaned claim from a prior crash: fall through and complete the partial write (write the hook entity with overwrite, then emit hook_created via the outer code path). world-postgres: - In the dedup branch, when the existing hook row matches the incoming (runId, hookId), check whether a hook_created event for this (runId, correlationId) already exists in the event log: - exists → real duplicate: throw EntityConflictError as before. - missing → orphaned hook row from a prior crash between hook INSERT and events INSERT: skip the hook insert (the row is already there) and let the outer code path emit hook_created, completing the partial write. Tests: - world-local: pre-seed an orphaned token claim with no matching hook entity, retry hook_created, assert hook entity and hook_created event both land (no hook_conflict, no EntityConflictError). - world-postgres: pre-seed an orphaned hook row with no matching hook_created event, retry, assert hook_created event lands (no hook_conflict, no EntityConflictError). Both tests fail on the prior implementation (EntityConflictError thrown on retry, exact symptom from the review). * fix(world-local): probe the event log (not the hook entity) to detect duplicate hook_created Addresses follow-up review on PR #2295. The previous dedup branch checked whether the durable hook entity existed on disk. But the hook entity is written before the `hook_created` event, and the two writes are not atomic, so a crash between them leaves both the claim file and the hook entity on disk with no event in the log. The dedup branch then matched on `(runId, hookId)`, found the hook entity, threw EntityConflictError, and the suspension handler swallowed the retry — permanently losing `hook_created` from the event log. The fix mirrors what the world-postgres branch already does: probe the run's event log for an existing `hook_created` event for the same `(runId, correlationId)`. The event is the durable record of a successful hook creation; the claim file and hook entity are partial- write artifacts that may exist without the event. - exists → real duplicate: throw EntityConflictError so the runtime's concurrent-replay catch path swallows it. - missing → orphaned partial write (crash at any point before the event landed): re-write the hook entity (with overwrite: true, in case a stale partial copy exists) and let the outer code path emit the hook_created event. Added a new helper findHookCreatedEvent that runs a filtered paginatedFileSystemQuery with limit:1 over the run's events. Regression test "should recover an orphaned hook entity with no matching hook_created event" added — pre-creates a hook, deletes just the hook_created event from disk to simulate a crash between the entity write and the event write, asserts the retry emits a fresh hook_created event (no hook_conflict, no swallowed EntityConflictError). I verified this test fails on the prior fix (throws `EntityConflictError: Hook "hook_orphan_entity_1" already created`, exactly as pranaygp reported) and passes on this commit. The previous test ("should recover an orphaned hook token claim with no matching hook entity") continues to pass — the event-log probe is a strict superset of the entity probe, since a missing entity always also implies a missing event. * fix(world-local): converge same-hook creation across workers via canonical eventId Addresses follow-up review on PR #2295. The previous fix made the dedup branch probe the event log to decide real-duplicate vs orphan-recovery, but the probe and the recovery write are not a single atomic operation. Two workers sharing a data directory (or two retries that lose `writeExclusive(constraintPath)` back to back) could both pass the probe (each observing no hook_created event yet), both fall through to the recovery write, and both append a hook_created event with a different eventId — producing two events in the log for the same (runId, hookId). The in-process `withHookLock` mutex does not help here because it is process-local and tag-specific. The fix persists `eventId` in the durable token claim file (written by the original `writeExclusive(constraintPath)`). On a same-(runId, hookId) dedup match, retries adopt that canonical eventId and rebuild the event with a deterministic createdAt derived from the eventId (a ULID). The outer event write switches from `writeJSON` (check-then-write, TOCTOU) to `writeExclusive` (O_CREAT|O_EXCL via temp-file + hard-link, atomic across processes). Either worker may win the publish; the other throws EntityConflictError which the runtime's existing concurrent-replay catch path swallows. Net result: exactly one hook_created event per logical creation. Backward compatibility: a claim file written before this commit lacks `eventId`. Retries that read such a claim fall back to the event-log probe + fresh-eventId recovery — the legacy behavior that does not converge across workers but cannot regress for freshly- written claims after upgrade. world-postgres already converges across workers via the partial unique index on workflow_events_entity_creation_unique (runId+correlationId+eventType for hook/step/wait_created): the loser's INSERT raises 23505 which is already translated to EntityConflictError. Regression tests: - world-local: `converges same-hook creation across workers to one event` uses two tagged storage instances sharing one data directory and fires 25 paired Promise.allSettled hook_created calls. Expected 25 hook_created events total; before this fix yielded 50. - world-postgres: `converges same-hook creation across concurrent calls to one event` exercises the same shape against the real Postgres unique index. Already converges; the test is a guard against future regressions to the catch path. Verified the world-local test fails on c7b23e1 with exactly the shape pranaygp reported (50 events for 25 logical creations) and passes on this commit. The earlier orphaned-claim and orphaned- entity recovery tests also continue to pass. * fix(world-local): converge legacy hook claims via recovery-marker sidecar; replace tag-proxy test with real subprocess workers Addresses follow-up review on PR #2295. Two distinct issues, both flagged by pranaygp as P1: 1. The fallback path for token claims written by versions before eventId was persisted inline (legacy claims after upgrade) still permitted the same cross-process corruption the inline fast path was fixed to prevent. Two processes both reading a legacy claim each generated their own eventId, landed their writeExclusive(eventPath) calls at different paths, and appended two hook_created events for the same (runId, hookId). Existing persisted claims after a real upgrade are exactly the state the crash-recovery branch needs to repair, so leaving the legacy path non-convergent is silent corruption, not backward compatibility. 2. The committed cross-worker convergence test used two tagged storage instances sharing one directory as a proxy for separate processes. But tags change the destination filename (events/wrun_X-evnt_Y.worker-a.json vs ...worker-b.json), so two tagged workers can each writeExclusive their own event at different paths and both fulfill. The Map-by-eventId deduplication in the assertion then masked the duplicate publication, so the test passed for the wrong reason. Implementation: - New HookRecoveryMarkerSchema (`{ eventId, hookId, runId }`) and HookRecoveryMarkerPath helper. The marker is a sidecar at hooks/tokens/<hash>.recovery.json, written via writeExclusive so the first cross-process retry pins its candidate eventId as canonical; subsequent retries read the marker and adopt that eventId. Together with the existing writeExclusive(eventPath) in the outer publish, this gives the legacy-fallback path the same single-event convergence guarantee as the inline-eventId fast path. - pinCanonicalEventIdForLegacyClaim() encapsulates the marker write-or-read. A stale marker for a different (runId, hookId) (token-reuse with leaked state) is overwritten best-effort — the common cross-worker race for the same hook still converges; only the narrow stale-token-reuse case loses convergence. - hook_disposed now also deletes the recovery marker when it deletes the token constraint file, preventing a future legacy recovery for a recycled token from latching onto a stale eventId. - The dedup branch unified: existingClaim.eventId for new claims, pinCanonicalEventIdForLegacyClaim() for legacy ones. Removed the now-redundant findHookCreatedEvent helper — the writeExclusive(eventPath) in the outer publish is the authoritative duplicate-vs-orphan detector. Tests: - New test fixture test-fixtures/hook-race-worker.ts (TypeScript, run via child_process.fork with tsx as execPath — tsx is a transitive dev dep via vitest). Each subprocess gets its own createStorage(testDir) so the in-process hookLocks Map cannot serialize across workers. - Replaced the tag-proxy test with "converges same-hook creation across separate OS processes to one event". Spawns workerCount subprocesses, releases them from a barrier into the same hook_created, asserts exactly one fulfilled + (N-1) rejected with EntityConflictError, and asserts directly on the raw events.list() result (no Map dedup) that the number of hook_created entries equals the number of logical creations. - Added "converges same-hook creation across processes when only a legacy token claim exists". Same shape, but pre-seeds the legacy claim format (`{ token, hookId, runId }` with no eventId) before each race. Verified to FAIL on 7ce6655 (both subprocesses fulfill, no convergence) and pass on this commit. - Also verified the new-eventId subprocess test FAILS when the event write is reverted to writeJSON (TOCTOU), confirming it exercises the writeExclusive-based cross-process arbitration. Both prior orphaned-claim / orphaned-entity recovery tests also continue to pass. * fix(world-local): per-lifetime recovery markers, restore event-log probe, fix CI tsx resolution Addresses three P1 review comments on PR #2295. 1. Stale recovery marker leaking across token-reuse lifetimes (pranaygp): The previous marker path used `hashToken(token)` so a stale marker for run A could leak into run B's recovery when the same token was reused after run A terminated through normal lifecycle. `deleteAllHooksForRun()` and tagged `world.clear()` deleted the token constraint and hook entity but NOT the marker sidecar, so the next legacy claim on the same token entered the stale-marker overwrite branch and the workers overwrote it non-atomically, yielding divergent publication. Fix: - Marker path now hashes `(token, runId, hookId)` together (`hookRecoveryMarkerPath` in storage/helpers.ts). Different lifetimes can never share a marker, so the stale-marker overwrite branch is removed entirely. - `hookRecoveryMarkerPath` is moved to helpers.ts and shared across events-storage.ts, hooks-storage.ts, and index.ts. - `deleteAllHooksForRun()` and tagged `world.clear()` now also delete the recovery marker for each hook (disk hygiene; per- lifetime identity makes leaks no longer corrupting). - `hook_disposed` now uses the new per-lifetime marker path too. 2. Duplicate `hook_created` event when a legacy claim's event was already published (VADE bot, also implied by pranaygp's analysis): Removing the event-log probe from the legacy fallback let a post- upgrade retry pin a new canonical eventId via the marker and publish a duplicate event at that path, even when the original pre-upgrade writer had already successfully published the event with its own (different) eventId. Fix: - Restore `findExistingHookCreatedEventId()` (renamed and made to return the eventId for clearer semantics). - Legacy fallback now probes the event log BEFORE pinning the marker; if a matching `hook_created` event already exists, throw `EntityConflictError` so the runtime's concurrent-replay catch path swallows the retry. - Inline-`eventId` fast path does NOT need the probe — the claim itself is the durable convergence key. 3. CI failure: tsx not resolvable under pnpm isolated linking (pranaygp; confirmed by ubuntu/windows unit test 60s timeouts): The previous test hard-coded `node_modules/.bin/tsx` assuming tsx would be hoisted there. But tsx was only a transitive peer dep via vitest, and pnpm's isolated linking does NOT link transitive peer deps into the workspace bin after a fresh install — so neither root nor package-local `.bin/tsx` existed in CI, the subprocess fork never started, and the barrier hung until vitest killed the test. Fix: - Add `tsx` as a direct `devDependency` of `@workflow/world- local` (pinned to 4.20.6 to match the existing transitive resolution). - Resolve via `import.meta.resolve('tsx/package.json')` and read the `bin` field dynamically, so we adapt to wherever pnpm links tsx for this package — not a hard-coded layout. - Lazy-init the resolver (no module-load IIFE) so an absent tsx fails only the convergence tests, not all 376 tests in the file. - Surface a clear error message if resolution fails, calling out the cause (transitive vs direct deps) for future readers. Also: harden the barrier helper so `error` events and pre-ready exits resolve BOTH `readyPromises` and `donePromises`, then `SIGKILL` siblings. Previously a broken child only resolved `donePromises`, leaving `Promise.all(readyPromises)` pending until the per-test timeout (60s in CI). Regression tests added: - `legacy claim whose hook_created event was already published does not append a duplicate event` — pre-seeds a legacy claim AND a pre-existing `hook_created` event with a different eventId, asserts the retry throws EntityConflictError and the log still has exactly the original event. - `converges legacy claim recovery across run lifetimes after token reuse` — runs pranaygp's full lifecycle path: race subprocess workers on run A's legacy claim, terminate run A via `run_completed` (triggers `deleteAllHooksForRun`), reuse the token in a legacy claim for run B, race subprocess workers again, asserts exactly one fulfillment + one `EntityConflictError` per race and exactly one `hook_created` event per run. Both new tests verified to fail on 2c673e4 (after rebuilding): the published-event test throws via duplicate publish instead of EntityConflictError, the token-reuse test sees both run B workers fulfill (2 events instead of 1). The existing orphaned-claim and orphaned-entity recovery tests also continue to pass. CI loop confirmed to be repaired locally by spawning subprocesses via the new resolver and intentionally breaking the worker fixture to verify the helper fails fast (~500ms) instead of hanging at the barrier. * fix(world-local): defer hook entity write until event publish commits Addresses karthikscale3's P1 review comment on PR #2295. The dedup-recovery path used to write the hook entity BEFORE the outer event publish proved whether the attempt was repairing a missing event or just colliding with an already-published `hook_created`. For already-committed duplicates, the event write then throws `EntityConflictError`, but the hook entity had already been overwritten with the retry's payload — leaving the durable hook entity and the event log inconsistent (e.g. the entity reflects the retry's metadata while the event still carries the original). karthikscale3 reproduced this on the prior head by creating `hook_created` with metadata `{ v: "a" }`, then retrying the same `(runId, hookId, token)` with metadata `{ v: "b" }` and `isWebhook: false`: the retry threw `EntityConflictError` but `hooks.get()` returned the retry's payload. Fix: defer the hook entity write until AFTER the outer `writeExclusive(eventPath)` commits. The branch now only captures the entity-to-write and its overwrite options; the actual write happens immediately after the event publish in the shared trailing block. A retry that ends in `EntityConflictError` (the event was already published) now leaves the entity untouched. The first-writer happy path and all recovery paths (orphaned- claim, orphaned-entity, cross-worker convergence, legacy claim, token-reuse across lifetimes) are unaffected — they all reach the event publish successfully, then the entity write runs as before. Regression test `does not mutate an already-committed hook entity when a duplicate hook_created retry collides` added to world-local: runs karthikscale3's exact scenario and asserts the persisted entity still carries the original metadata and isWebhook. Verified to fail on the prior commit (persisted metadata = 0xbb instead of 0xaa) and pass on this commit after rebuilding. Parallel guard test `does not mutate an already-committed hook entity when a duplicate hook_created retry collides` added to world-postgres. Postgres already protected this via `onConflictDoNothing()` on the hook INSERT, but the test guards against a future regression that adds an UPDATE/UPSERT to the dedup path. * refactor(world-local): per-instance in-process locks; drop tsx subprocess test plumbing You were right that the tsx subprocess machinery was overkill for a storage-level convergence test. Replaced with a simple two-instance in-process test that exercises the same cross-process semantics without spawning anything. The trick: `stepLocks` and `hookLocks` were module-level Maps shared by all `createEventsStorage` calls in the same process. Move them inside the function so each `createStorage(dir)` call gets its own lock map. Two storage instances sharing one data directory then behave exactly like two separate OS processes: - independent in-process `hookLocks` Maps (no in-process serialization between them), and - a shared filesystem (so the on-disk `writeExclusive` claim / marker / event publish primitives are the only thing arbitrating convergence). This is also a real architectural improvement — the global lock map was always a leaky abstraction that made unit-test simulation of the cross-process path awkward. Changes: - `stepLocks` and `hookLocks` moved from module scope into `createEventsStorage`. `withStepLock` and `withHookLock` wrappers collapsed into direct `withInProcessLock(map, key, fn)` calls at the two call sites that need them. - The three convergence regression tests in `storage.test.ts` now use `const workerA = createStorage(testDir); const workerB = createStorage(testDir);` and race `Promise.allSettled` of `events.create` from both — no subprocess, no IPC, no barrier helper, no `raceHookCreatedAcrossProcesses`. Same assertions (exactly one fulfillment + N-1 `EntityConflictError` per race, raw `events.list()` shows exactly one `hook_created` per logical creation — no Map dedup) so the regression catches are identical. - Removed: `tsx` devDep, `test-fixtures/hook-race-worker.ts`, `HOOK_RACE_WORKER` / `resolveTsxLoaderUrl` / `TSX_BIN` / `raceHookCreatedAcrossProcesses` and the `fork`/`fileURLToPath` imports they pulled in. Verified (after rebuilding world-local): - All 379 tests pass on macOS in ~1s (was ~6.7s with subprocesses). - Convergence tests confirmed to still catch the bugs: temporarily reverted the `eventId = canonicalEventId` adoption → both workers fulfilled (2 events instead of 1). Temporarily reverted the legacy-claim marker pin → same: both workers fulfilled. - No subprocess machinery means no Windows-specific quirks (cli.mjs shebang, .cmd wrappers, .bin hoisting under pnpm isolated linking, etc.) that produced the Windows CI 60s timeouts. - World-postgres still has its own parallel guard test for the karthikscale3 "no-mutate-on-duplicate" regression; that one exercises real DB concurrency and is unaffected by this change. Full repo `pnpm test` (43 packages) and the `parallelStepsThenWebhookWorkflow` e2e test against world-local both green. * fix(world-local): repair event-first hook orphans from the persisted event; skip #1665 e2e on world-postgres - A crash between the hook_created event publish and the deferred hook entity write left the event committed with the entity missing and unrepairable (retries threw EntityConflictError without materializing the entity). Retries now rebuild the entity from the PERSISTED event's payload — never the retry's eventData — via a race-safe writeExclusive, on both the canonical-eventId collision path and the legacy-claim probe path. - Skip parallelStepsThenWebhookWorkflow e2e on world-postgres: the same-tick replay pattern surfaces a separate pre-existing step_started ordering bug there (#2331). --------- Co-authored-by: Peter Wielander <peter.wielander@vercel.com> Signed-off-by: Nathan Rajlich <n@n8.io>
🦋 Changeset detectedLatest commit: 99df3fc The changes in this PR will be included in the next version bump. This PR includes changesets to release 20 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
Contributor
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Automated backport of #2295 to
stable(backport job run).Triggered manually via
workflow_dispatch.Merge conflicts were resolved by AI (opencode with
anthropic/claude-fable-5). Please review the conflict resolution carefully before merging.