fix(world-local,world-postgres): make duplicate hook_created idempotent#2295
Conversation
🦋 Changeset detectedLatest commit: 8b80f55 The changes in this PR will be included in the next version bump. This PR includes changesets to release 18 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
📊 Benchmark Results
workflow with no steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) | Nitro workflow with 1 step💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Nitro | Express workflow with 10 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Nitro | Express workflow with 25 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express workflow with 50 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express Promise.all with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.all with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.all with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.race with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.race with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.race with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 10 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 25 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 50 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 10 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 25 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 50 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Nitro | Express Stream Benchmarks (includes TTFB metrics)workflow with stream💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) | Nitro stream pipeline with 5 transform steps (1MB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express 10 parallel streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) fan-out fan-in 10 streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Nitro | Express SummaryFastest Framework by WorldWinner determined by most benchmark wins
Fastest World by FrameworkWinner determined by most benchmark wins
Column Definitions
Worlds:
|
🧪 E2E Test Results✅ All tests passed Summary
Details by Category✅ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
✅ 📋 Other
|
There was a problem hiding this comment.
Pull request overview
This PR fixes an idempotency bug in @workflow/world-local where duplicate processing of the same hook_created could incorrectly produce a hook_conflict event and later replay as a self-conflict HookConflictError. It aligns hook_created behavior with the existing step_created duplicate-correlation path by treating same-entity duplicates as benign via EntityConflictError.
Changes:
- Preserve
hookIdwhen reading existing hook token-claim files and use(runId, hookId)to detect duplicate same-hook creation. - On duplicate same-hook claim, throw
EntityConflictErrorinstead of writing ahook_conflictevent; keephook_conflictfor genuine token conflicts. - Add regression tests covering same-hook duplicates vs. real conflicts (same token with different hook/run) and include a patch changeset.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| packages/world-local/src/storage/events-storage.ts | Updates hook token-claim parsing and makes duplicate same-hook hook_created idempotent by throwing EntityConflictError. |
| packages/world-local/src/storage.test.ts | Adds regression tests ensuring same-hook duplicates don’t append hook_conflict, while real conflicts still do. |
| .changeset/fix-world-local-hook-self-conflict.md | Adds a patch changeset documenting the behavior fix for @workflow/world-local. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… duplicate hook_created Addresses follow-up review on PR #2295. The previous dedup branch checked whether the durable hook entity existed on disk. But the hook entity is written before the `hook_created` event, and the two writes are not atomic, so a crash between them leaves both the claim file and the hook entity on disk with no event in the log. The dedup branch then matched on `(runId, hookId)`, found the hook entity, threw EntityConflictError, and the suspension handler swallowed the retry — permanently losing `hook_created` from the event log. The fix mirrors what the world-postgres branch already does: probe the run's event log for an existing `hook_created` event for the same `(runId, correlationId)`. The event is the durable record of a successful hook creation; the claim file and hook entity are partial- write artifacts that may exist without the event. - exists → real duplicate: throw EntityConflictError so the runtime's concurrent-replay catch path swallows it. - missing → orphaned partial write (crash at any point before the event landed): re-write the hook entity (with overwrite: true, in case a stale partial copy exists) and let the outer code path emit the hook_created event. Added a new helper findHookCreatedEvent that runs a filtered paginatedFileSystemQuery with limit:1 over the run's events. Regression test "should recover an orphaned hook entity with no matching hook_created event" added — pre-creates a hook, deletes just the hook_created event from disk to simulate a crash between the entity write and the event write, asserts the retry emits a fresh hook_created event (no hook_conflict, no swallowed EntityConflictError). I verified this test fails on the prior fix (throws `EntityConflictError: Hook "hook_orphan_entity_1" already created`, exactly as pranaygp reported) and passes on this commit. The previous test ("should recover an orphaned hook token claim with no matching hook entity") continues to pass — the event-log probe is a strict superset of the entity probe, since a missing entity always also implies a missing event.
…nical eventId Addresses follow-up review on PR #2295. The previous fix made the dedup branch probe the event log to decide real-duplicate vs orphan-recovery, but the probe and the recovery write are not a single atomic operation. Two workers sharing a data directory (or two retries that lose `writeExclusive(constraintPath)` back to back) could both pass the probe (each observing no hook_created event yet), both fall through to the recovery write, and both append a hook_created event with a different eventId — producing two events in the log for the same (runId, hookId). The in-process `withHookLock` mutex does not help here because it is process-local and tag-specific. The fix persists `eventId` in the durable token claim file (written by the original `writeExclusive(constraintPath)`). On a same-(runId, hookId) dedup match, retries adopt that canonical eventId and rebuild the event with a deterministic createdAt derived from the eventId (a ULID). The outer event write switches from `writeJSON` (check-then-write, TOCTOU) to `writeExclusive` (O_CREAT|O_EXCL via temp-file + hard-link, atomic across processes). Either worker may win the publish; the other throws EntityConflictError which the runtime's existing concurrent-replay catch path swallows. Net result: exactly one hook_created event per logical creation. Backward compatibility: a claim file written before this commit lacks `eventId`. Retries that read such a claim fall back to the event-log probe + fresh-eventId recovery — the legacy behavior that does not converge across workers but cannot regress for freshly- written claims after upgrade. world-postgres already converges across workers via the partial unique index on workflow_events_entity_creation_unique (runId+correlationId+eventType for hook/step/wait_created): the loser's INSERT raises 23505 which is already translated to EntityConflictError. Regression tests: - world-local: `converges same-hook creation across workers to one event` uses two tagged storage instances sharing one data directory and fires 25 paired Promise.allSettled hook_created calls. Expected 25 hook_created events total; before this fix yielded 50. - world-postgres: `converges same-hook creation across concurrent calls to one event` exercises the same shape against the real Postgres unique index. Already converges; the test is a guard against future regressions to the catch path. Verified the world-local test fails on c7b23e1 with exactly the shape pranaygp reported (50 events for 25 logical creations) and passes on this commit. The earlier orphaned-claim and orphaned- entity recovery tests also continue to pass.
…ecar; replace tag-proxy test with real subprocess workers Addresses follow-up review on PR #2295. Two distinct issues, both flagged by pranaygp as P1: 1. The fallback path for token claims written by versions before eventId was persisted inline (legacy claims after upgrade) still permitted the same cross-process corruption the inline fast path was fixed to prevent. Two processes both reading a legacy claim each generated their own eventId, landed their writeExclusive(eventPath) calls at different paths, and appended two hook_created events for the same (runId, hookId). Existing persisted claims after a real upgrade are exactly the state the crash-recovery branch needs to repair, so leaving the legacy path non-convergent is silent corruption, not backward compatibility. 2. The committed cross-worker convergence test used two tagged storage instances sharing one directory as a proxy for separate processes. But tags change the destination filename (events/wrun_X-evnt_Y.worker-a.json vs ...worker-b.json), so two tagged workers can each writeExclusive their own event at different paths and both fulfill. The Map-by-eventId deduplication in the assertion then masked the duplicate publication, so the test passed for the wrong reason. Implementation: - New HookRecoveryMarkerSchema (`{ eventId, hookId, runId }`) and HookRecoveryMarkerPath helper. The marker is a sidecar at hooks/tokens/<hash>.recovery.json, written via writeExclusive so the first cross-process retry pins its candidate eventId as canonical; subsequent retries read the marker and adopt that eventId. Together with the existing writeExclusive(eventPath) in the outer publish, this gives the legacy-fallback path the same single-event convergence guarantee as the inline-eventId fast path. - pinCanonicalEventIdForLegacyClaim() encapsulates the marker write-or-read. A stale marker for a different (runId, hookId) (token-reuse with leaked state) is overwritten best-effort — the common cross-worker race for the same hook still converges; only the narrow stale-token-reuse case loses convergence. - hook_disposed now also deletes the recovery marker when it deletes the token constraint file, preventing a future legacy recovery for a recycled token from latching onto a stale eventId. - The dedup branch unified: existingClaim.eventId for new claims, pinCanonicalEventIdForLegacyClaim() for legacy ones. Removed the now-redundant findHookCreatedEvent helper — the writeExclusive(eventPath) in the outer publish is the authoritative duplicate-vs-orphan detector. Tests: - New test fixture test-fixtures/hook-race-worker.ts (TypeScript, run via child_process.fork with tsx as execPath — tsx is a transitive dev dep via vitest). Each subprocess gets its own createStorage(testDir) so the in-process hookLocks Map cannot serialize across workers. - Replaced the tag-proxy test with "converges same-hook creation across separate OS processes to one event". Spawns workerCount subprocesses, releases them from a barrier into the same hook_created, asserts exactly one fulfilled + (N-1) rejected with EntityConflictError, and asserts directly on the raw events.list() result (no Map dedup) that the number of hook_created entries equals the number of logical creations. - Added "converges same-hook creation across processes when only a legacy token claim exists". Same shape, but pre-seeds the legacy claim format (`{ token, hookId, runId }` with no eventId) before each race. Verified to FAIL on 7ce6655 (both subprocesses fulfill, no convergence) and pass on this commit. - Also verified the new-eventId subprocess test FAILS when the event write is reverted to writeJSON (TOCTOU), confirming it exercises the writeExclusive-based cross-process arbitration. Both prior orphaned-claim / orphaned-entity recovery tests also continue to pass.
…obe, fix CI tsx resolution Addresses three P1 review comments on PR #2295. 1. Stale recovery marker leaking across token-reuse lifetimes (pranaygp): The previous marker path used `hashToken(token)` so a stale marker for run A could leak into run B's recovery when the same token was reused after run A terminated through normal lifecycle. `deleteAllHooksForRun()` and tagged `world.clear()` deleted the token constraint and hook entity but NOT the marker sidecar, so the next legacy claim on the same token entered the stale-marker overwrite branch and the workers overwrote it non-atomically, yielding divergent publication. Fix: - Marker path now hashes `(token, runId, hookId)` together (`hookRecoveryMarkerPath` in storage/helpers.ts). Different lifetimes can never share a marker, so the stale-marker overwrite branch is removed entirely. - `hookRecoveryMarkerPath` is moved to helpers.ts and shared across events-storage.ts, hooks-storage.ts, and index.ts. - `deleteAllHooksForRun()` and tagged `world.clear()` now also delete the recovery marker for each hook (disk hygiene; per- lifetime identity makes leaks no longer corrupting). - `hook_disposed` now uses the new per-lifetime marker path too. 2. Duplicate `hook_created` event when a legacy claim's event was already published (VADE bot, also implied by pranaygp's analysis): Removing the event-log probe from the legacy fallback let a post- upgrade retry pin a new canonical eventId via the marker and publish a duplicate event at that path, even when the original pre-upgrade writer had already successfully published the event with its own (different) eventId. Fix: - Restore `findExistingHookCreatedEventId()` (renamed and made to return the eventId for clearer semantics). - Legacy fallback now probes the event log BEFORE pinning the marker; if a matching `hook_created` event already exists, throw `EntityConflictError` so the runtime's concurrent-replay catch path swallows the retry. - Inline-`eventId` fast path does NOT need the probe — the claim itself is the durable convergence key. 3. CI failure: tsx not resolvable under pnpm isolated linking (pranaygp; confirmed by ubuntu/windows unit test 60s timeouts): The previous test hard-coded `node_modules/.bin/tsx` assuming tsx would be hoisted there. But tsx was only a transitive peer dep via vitest, and pnpm's isolated linking does NOT link transitive peer deps into the workspace bin after a fresh install — so neither root nor package-local `.bin/tsx` existed in CI, the subprocess fork never started, and the barrier hung until vitest killed the test. Fix: - Add `tsx` as a direct `devDependency` of `@workflow/world- local` (pinned to 4.20.6 to match the existing transitive resolution). - Resolve via `import.meta.resolve('tsx/package.json')` and read the `bin` field dynamically, so we adapt to wherever pnpm links tsx for this package — not a hard-coded layout. - Lazy-init the resolver (no module-load IIFE) so an absent tsx fails only the convergence tests, not all 376 tests in the file. - Surface a clear error message if resolution fails, calling out the cause (transitive vs direct deps) for future readers. Also: harden the barrier helper so `error` events and pre-ready exits resolve BOTH `readyPromises` and `donePromises`, then `SIGKILL` siblings. Previously a broken child only resolved `donePromises`, leaving `Promise.all(readyPromises)` pending until the per-test timeout (60s in CI). Regression tests added: - `legacy claim whose hook_created event was already published does not append a duplicate event` — pre-seeds a legacy claim AND a pre-existing `hook_created` event with a different eventId, asserts the retry throws EntityConflictError and the log still has exactly the original event. - `converges legacy claim recovery across run lifetimes after token reuse` — runs pranaygp's full lifecycle path: race subprocess workers on run A's legacy claim, terminate run A via `run_completed` (triggers `deleteAllHooksForRun`), reuse the token in a legacy claim for run B, race subprocess workers again, asserts exactly one fulfillment + one `EntityConflictError` per race and exactly one `hook_created` event per run. Both new tests verified to fail on 2c673e4 (after rebuilding): the published-event test throws via duplicate publish instead of EntityConflictError, the token-reuse test sees both run B workers fulfill (2 events instead of 1). The existing orphaned-claim and orphaned-entity recovery tests also continue to pass. CI loop confirmed to be repaired locally by spawning subprocesses via the new resolver and intentionally breaking the worker fixture to verify the helper fails fast (~500ms) instead of hanging at the barrier.
Addresses karthikscale3's P1 review comment on PR #2295. The dedup-recovery path used to write the hook entity BEFORE the outer event publish proved whether the attempt was repairing a missing event or just colliding with an already-published `hook_created`. For already-committed duplicates, the event write then throws `EntityConflictError`, but the hook entity had already been overwritten with the retry's payload — leaving the durable hook entity and the event log inconsistent (e.g. the entity reflects the retry's metadata while the event still carries the original). karthikscale3 reproduced this on the prior head by creating `hook_created` with metadata `{ v: "a" }`, then retrying the same `(runId, hookId, token)` with metadata `{ v: "b" }` and `isWebhook: false`: the retry threw `EntityConflictError` but `hooks.get()` returned the retry's payload. Fix: defer the hook entity write until AFTER the outer `writeExclusive(eventPath)` commits. The branch now only captures the entity-to-write and its overwrite options; the actual write happens immediately after the event publish in the shared trailing block. A retry that ends in `EntityConflictError` (the event was already published) now leaves the entity untouched. The first-writer happy path and all recovery paths (orphaned- claim, orphaned-entity, cross-worker convergence, legacy claim, token-reuse across lifetimes) are unaffected — they all reach the event publish successfully, then the entity write runs as before. Regression test `does not mutate an already-committed hook entity when a duplicate hook_created retry collides` added to world-local: runs karthikscale3's exact scenario and asserts the persisted entity still carries the original metadata and isWebhook. Verified to fail on the prior commit (persisted metadata = 0xbb instead of 0xaa) and pass on this commit after rebuilding. Parallel guard test `does not mutate an already-committed hook entity when a duplicate hook_created retry collides` added to world-postgres. Postgres already protected this via `onConflictDoNothing()` on the hook INSERT, but the test guards against a future regression that adds an UPDATE/UPSERT to the dedup path.
|
Warning Review the following alerts detected in dependencies. According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.
|
…cess test plumbing
You were right that the tsx subprocess machinery was overkill for a
storage-level convergence test. Replaced with a simple two-instance
in-process test that exercises the same cross-process semantics
without spawning anything.
The trick: `stepLocks` and `hookLocks` were module-level Maps shared
by all `createEventsStorage` calls in the same process. Move them
inside the function so each `createStorage(dir)` call gets its own
lock map. Two storage instances sharing one data directory then
behave exactly like two separate OS processes:
- independent in-process `hookLocks` Maps (no in-process
serialization between them), and
- a shared filesystem (so the on-disk `writeExclusive` claim /
marker / event publish primitives are the only thing arbitrating
convergence).
This is also a real architectural improvement — the global lock map
was always a leaky abstraction that made unit-test simulation of
the cross-process path awkward.
Changes:
- `stepLocks` and `hookLocks` moved from module scope into
`createEventsStorage`. `withStepLock` and `withHookLock` wrappers
collapsed into direct `withInProcessLock(map, key, fn)` calls at
the two call sites that need them.
- The three convergence regression tests in `storage.test.ts` now
use `const workerA = createStorage(testDir); const workerB =
createStorage(testDir);` and race `Promise.allSettled` of
`events.create` from both — no subprocess, no IPC, no barrier
helper, no `raceHookCreatedAcrossProcesses`. Same assertions
(exactly one fulfillment + N-1 `EntityConflictError` per race,
raw `events.list()` shows exactly one `hook_created` per
logical creation — no Map dedup) so the regression catches are
identical.
- Removed: `tsx` devDep, `test-fixtures/hook-race-worker.ts`,
`HOOK_RACE_WORKER` / `resolveTsxLoaderUrl` / `TSX_BIN` /
`raceHookCreatedAcrossProcesses` and the
`fork`/`fileURLToPath` imports they pulled in.
Verified (after rebuilding world-local):
- All 379 tests pass on macOS in ~1s (was ~6.7s with subprocesses).
- Convergence tests confirmed to still catch the bugs: temporarily
reverted the `eventId = canonicalEventId` adoption → both workers
fulfilled (2 events instead of 1). Temporarily reverted the
legacy-claim marker pin → same: both workers fulfilled.
- No subprocess machinery means no Windows-specific quirks
(cli.mjs shebang, .cmd wrappers, .bin hoisting under pnpm
isolated linking, etc.) that produced the Windows CI 60s
timeouts.
- World-postgres still has its own parallel guard test for the
karthikscale3 "no-mutate-on-duplicate" regression; that one
exercises real DB concurrency and is unaffected by this change.
Full repo `pnpm test` (43 packages) and the
`parallelStepsThenWebhookWorkflow` e2e test against world-local
both green.
|
Follow-up on the Windows unit-test failures and the tsx complexity from the previous rounds. @nrajlich called out (correctly) that introducing tsx as a devDep + a subprocess fixture + a barrier helper to run a TypeScript worker file felt like a lot of machinery for what is fundamentally a storage-level convergence test. The Windows CI failure was the symptom — Rather than keep wrestling with platform-specific subprocess plumbing, fdf3be9 takes a different approach: move The three convergence tests now just do: const workerA = createStorage(testDir);
const workerB = createStorage(testDir);
// ... seed legacy claim if needed ...
const results = await Promise.allSettled([
workerA.events.create(runId, { eventType: 'hook_created', correlationId, eventData: { token } }),
workerB.events.create(runId, { eventType: 'hook_created', correlationId, eventData: { token } }),
]);
// Assert exactly one fulfillment + one EntityConflictError, exactly one
// hook_created event in the raw event log.Same assertions (no Map dedup; raw Removed:
The module-level Net: −184 lines, storage test file runs in ~1s (was ~6.7s with subprocesses), no Windows-specific quirks possible. Full |
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
pranaygp
left a comment
There was a problem hiding this comment.
Re-reviewed the latest head. The lock scoping, canonical event ID convergence, and duplicate-entity mutation fixes are good improvements, but two blocking correctness gaps remain; details are inline.
…event; skip #1665 e2e on world-postgres - A crash between the hook_created event publish and the deferred hook entity write left the event committed with the entity missing and unrepairable (retries threw EntityConflictError without materializing the entity). Retries now rebuild the entity from the PERSISTED event's payload — never the retry's eventData — via a race-safe writeExclusive, on both the canonical-eventId collision path and the legacy-claim probe path. - Skip parallelStepsThenWebhookWorkflow e2e on world-postgres: the same-tick replay pattern surfaces a separate pre-existing step_started ordering bug there (#2331).
pranaygp
left a comment
There was a problem hiding this comment.
Requesting changes. The core idempotency fix is sound and each prior round of findings was genuinely addressed, but three things block approval on the current head:
- Event-first orphan is confirmed and unaddressed — see the open thread at
events-storage.ts:1533(#2295 (comment)). I re-verified on99fd2e0: whenwriteExclusive(eventPath)returns false, the code throwsEntityConflictErrorbefore the deferred hook-entity write ever runs, so a crash after event publish but before the entity write leaves the hook permanently unresolvable (HookNotFoundErroron everyhooks.get). The repair must reconstruct the hook from the persisted event payload, not the retry's. - The PR's own #1665 regression fails on world-postgres — see the open thread on
packages/core/e2e/e2e.test.ts(#2295 (comment)). Either fix the duplicatestep_startedpath or scope this PR to #2283 and drop the "Closes #1665" claim. E2E Required Checkis red on the head — including the sveltekit postgres failure above, plus nextjs-webpack (lazyDiscovery disabled) Local Prod/Postgres and sveltekit Vercel Prod. Even if some are flakes, the required check needs to be green.
Two inline notes below; one is a blocker (same bug as #1, different branch), one is non-blocking.
…self-conflict # Conflicts: # packages/world-local/src/storage/events-storage.ts # workbench/example/workflows/99_e2e.ts
pranaygp
left a comment
There was a problem hiding this comment.
Approving — all three blockers from my previous review are resolved at 8b80f55d8:
- Event-first orphan fixed (fec2229):
repairHookEntityFromPersistedEvent()rebuilds the missing hook entity from the persisted event's payload (never the retry's) via a race-safe no-overwritewriteExclusive, on both the canonical-eventId collision path and the legacy-claim probe path. Both regression tests use deliberately different retry metadata, so they also guard against reintroducing the earlier mutation bug. - #1665 scope corrected: the e2e test is skipped on world-postgres (the skip's
WORKFLOW_TARGET_WORLDdetection matches whattests.ymlsets), the separatestep_startedordering bug is tracked in #2331, and the PR body no longer claims to close #1665. - CI is green:
E2E Required Checkand unit tests on both platforms pass on the head.
The architectural follow-up (event log as single source of truth) is tracked in #2339. Nice work converging this.
|
Backport to To resolve manually, push a backport branch and open a PR against git fetch origin stable
git checkout -b backport/pr-2295-to-stable origin/stable
git cherry-pick -S f2a7bdeb0abcf8a5d48c33a35b4b15aeca78cddf # -S signs the commit
# Fix conflicts, then:
git add -A
git cherry-pick --continue
git push -u origin backport/pr-2295-to-stable
gh pr create --base stable --head backport/pr-2295-to-stable \
--title "Backport #2295: <original PR title>" \
--body "Manual backport of #2295 (cherry-pick f2a7bdeb0abc) to \`stable\`." |
|
Backport PR opened against |
Summary
Fixes #2283. Both
@workflow/world-localand@workflow/world-postgreswere turning duplicate processing of the samehook_created(samerunId,hookId, and token — e.g. cross-process replay, queue redelivery, or the snapshot-runtime same-tick replay reported in #1665) into ahook_conflictevent, which then replayed as a self-conflictHookConflictError.The fix in both worlds mirrors the existing
step_createdduplicate-correlation path: when the duplicate-detection branch fires and the existing claim/hook is for the same(runId, hookId), throwEntityConflictErrorinstead of writinghook_conflict. The runtime's existing concurrent-replay catch path (suspension-handler.ts:142) then swallows it as a benign duplicate.The issue reproduces in both worlds for the same reason — neither path distinguished "same hook being re-processed" from "different hook claiming the same token". world-postgres' unique partial index on
workflow_events(runId, correlationId, eventType)for entity-creation events does not catch this case either, because the duplicate branch inserts ahook_conflictevent (differenteventType), not a secondhook_created.Subsequent review rounds hardened the world-local path into a full crash-recovery/convergence protocol: orphaned token claims, orphaned hook entities, event-first orphans, cross-process convergence on a canonical eventId, legacy-claim upgrades via a recovery-marker sidecar, and per-lifetime marker identity. See the review threads for the full history.
Changes
packages/world-local/src/storage/events-storage.tsHookTokenClaimSchemapreserveshookId(the persisted claim always carried it — only the read schema was dropping it, which is the root cause of world-local can turn duplicate same-hook creation into self-conflict #2283) and now also persists the canonicaleventIdso cross-process retries converge on a single event path.hook_createdbranch, whenwriteExclusiveon the token claim fails and the existing claim matches the incoming(runId, hookId), the retry adopts the canonical eventId (from the claim, or — for legacy claims without one — via an exclusive recovery-marker sidecar keyed by(token, runId, hookId)), and the outerwriteExclusive(eventPath)atomically arbitrates publication: the loser throwsEntityConflictError(swallowed by the runtime), the winner repairs any partial write. Cross-hook / cross-run token conflicts still emithook_conflict.writeExclusive.hook_disposed, terminal-run cleanup, taggedclear()).packages/world-postgres/src/storage.tshook_createdbranch, before writing ahook_conflictevent, compare the existing hook's(runId, hookId)(already returned by thegetHookByTokenprepared statement) to the incoming event. On a match, verify thehook_createdevent actually exists for(runId, correlationId): if yes, throwEntityConflictError; if no (crash-orphaned hook row), skip the hook insert and complete the partial write by emittinghook_created. Otherwise fall through to the existinghook_conflictwrite.Unit tests
Regression tests in the
concurrent entity-creation racesdescribe blocks ofpackages/world-local/src/storage.test.tsandpackages/world-postgres/test/storage.test.ts, covering: same-hook duplicate idempotency, cross-hook/cross-run conflicts still emittinghook_conflict, orphaned-claim recovery, orphaned-entity recovery, event-first-orphan repair (entity rebuilt from the persisted event payload, not the retry's), no entity mutation on already-committed duplicates, cross-process convergence to one event (realchild_process.forksubprocess workers), legacy-claim upgrades, and token reuse across run lifetimes.E2E test
parallelStepsThenWebhookWorkflowinworkbench/example/workflows/99_e2e.tsreproduces the user-facing scenario from #1665: N sequential iterations ofawait Promise.all([stepA, stepB])followed bycreateWebhook()+ await. When the two step resolutions align in the same tick, the workflow body is re-walked and each pass submitshook_createdwith the same deterministic(correlationId, token).Note: this test is skipped on
world-postgres(WORKFLOW_TARGET_WORLDincludespostgres). The same-tick replay pattern it stresses also surfaces a separate, pre-existing world-postgres bug — a late concurrentstep_startedcan land afterstep_completedin the event log because the step entity UPDATE and event INSERT are not atomic — which corrupts the run independently of the hook fix (the failing CI job's logs show the duplicatehook_createdinserts were correctly rejected). Tracked in #2331; re-enable on postgres when that lands.Verification
pnpm --filter @workflow/world-local test— 381 tests passpnpm --filter @workflow/world-postgres test— passes (testcontainers Postgres)pnpm vitest run packages/core/e2e/e2e.test.ts -t "parallelStepsThenWebhookWorkflow"— passes against fixed world-local, fails against unfixedpnpm build,pnpm typecheck, fullpnpm test— greenNotes
dist/.HookConflictErrorwithcreateWebhook#1665 is not auto-closed by this PR: the hook self-conflict it reported is fixed on both worlds, but fully verifying the repro end-to-end on world-postgres is blocked by the separate step-lifecycle ordering bug @workflow/world-postgres: late concurrent step_started can land after step_completed in the event log (non-atomic entity UPDATE + event INSERT) #2331.Fixes #2283