
feat(service-automation): persist suspended flow runs for durable resume across restarts #1520

Merged: os-zhuang merged 1 commit into main from claude/lucid-cannon-Hc2sc on Jun 2, 2026

@os-zhuang


Closes #1518.

Problem

service-automation kept suspended flow runs in memory only (engine.ts's private suspendedRuns = new Map()). A flow paused at a long-lived node (approval, wait, screen, …) could not be resumed after the process restarted — the engine registered no sys_* objects, so run state was never persisted.

This blocks durable-pause flows on serverless / hibernating hosts: on the Cloudflare Workers control plane a marketplace-review flow that suspends at an approval node (minutes → days) has its in-memory run evicted; sys_approval_request persists, but resume(runId) then has nothing to continue, so post-approval side-effects never run.

Solution (ADR-0019)

Make suspended-run state durable and rehydratable, behind a pluggable store:

  • SuspendedRunStore interface + two implementations:
    • InMemorySuspendedRunStore — the default; JSON round-trips on save/load so it faithfully mirrors a DB serialization boundary.
    • ObjectStoreSuspendedRunStore — persists to a new sys_automation_run system object via the ObjectQL engine (so it migrates like other sys_* tables and correlates to sys_approval_request.flow_run_id).
  • AutomationServicePlugin registers sys_automation_run via the manifest and auto-enables the DB-backed store when an ObjectQL engine is present (opt out with suspendedRunStore: 'memory'). No engine ⇒ in-memory default stands.
  • Durable suspend/resume — persist on suspend, delete on terminal completion. resume(runId) rehydrates from the store when the run isn't in the in-memory cache (cold boot) and continues from the paused node down the correct branch.
  • Idempotent resume — the suspension is consumed before downstream work runs, plus an in-process guard rejects a concurrent duplicate resume, so a repeated resume after a partial restart can't double-run side effects.
  • Process-unique run ids (random component) so a fresh process doesn't reissue an id that collides with a still-suspended durable run.
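
As a rough illustration of the store contract described above, here is a minimal sketch of an in-memory store that JSON round-trips on save and load. The field names on `SuspendedRun` and the method names on the store are assumptions made for this example; the PR names the types but this description does not show their shapes. The sketch is synchronous for brevity, whereas a DB-backed store would most likely return promises.

```typescript
// Hypothetical shape: the PR exports a SuspendedRun type, but these fields are assumed.
interface SuspendedRun {
  runId: string;
  flowId: string;
  pausedNodeId: string;
  variables: Record<string, unknown>;
}

// Hypothetical method names; kept synchronous so the sketch runs without await.
interface SuspendedRunStore {
  save(run: SuspendedRun): void;
  load(runId: string): SuspendedRun | undefined;
  delete(runId: string): boolean;
  list(): SuspendedRun[];
}

class InMemorySuspendedRunStore implements SuspendedRunStore {
  // Rows are held as JSON strings, so the store never shares object
  // references with callers, just as a database row would not.
  private rows = new Map<string, string>();

  save(run: SuspendedRun): void {
    this.rows.set(run.runId, JSON.stringify(run)); // upsert keyed by runId
  }

  load(runId: string): SuspendedRun | undefined {
    const raw = this.rows.get(runId);
    return raw === undefined ? undefined : (JSON.parse(raw) as SuspendedRun);
  }

  delete(runId: string): boolean {
    return this.rows.delete(runId);
  }

  list(): SuspendedRun[] {
    return Array.from(this.rows.values()).map((raw) => JSON.parse(raw) as SuspendedRun);
  }
}

const store = new InMemorySuspendedRunStore();
const original: SuspendedRun = {
  runId: "run_1",
  flowId: "marketplace_review",
  pausedNodeId: "approval",
  variables: { listing: { id: 7, tags: ["new", "paid"] } },
};
store.save(original);
const loaded = store.load("run_1");
```

Because the loaded value is a fresh copy, anything that would not survive serialization (functions, class instances, `undefined` fields) fails in unit tests rather than only against the real database.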

Acceptance criteria

  • ✅ A flow suspended at an approval-style node survives a full process restart: a fresh engine over the same store resumes from the paused node down the correct approve/reject branch and runs downstream nodes.
  • ✅ Works with the DB-backed store (validated against a fake ObjectQL engine); in-memory store remains the default for unit tests.
  • ✅ Existing service-automation (154 incl. new) + plugin-approvals (30) tests pass unchanged.
  • ✅ variables round-trip correctly, including nested objects/arrays.
  • ✅ Resume is idempotent.
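
The restart and idempotency criteria above can be pictured with a small simulation. This is only a sketch under stated assumptions: `SketchEngine`, `MapStore`, and every method signature here are hypothetical stand-ins, not the real `AutomationEngine` API, and the run is treated as finishing immediately after resume. The run-id format is likewise illustrative; the PR only says ids carry a random component.

```typescript
type Decision = "approve" | "reject";

interface PausedRun {
  runId: string;
  pausedNodeId: string;
  variables: Record<string, unknown>;
}

// Hypothetical durable store shared by both simulated processes.
class MapStore {
  private rows = new Map<string, string>();
  save(run: PausedRun): void {
    this.rows.set(run.runId, JSON.stringify(run));
  }
  load(runId: string): PausedRun | undefined {
    const raw = this.rows.get(runId);
    return raw === undefined ? undefined : (JSON.parse(raw) as PausedRun);
  }
  delete(runId: string): void {
    this.rows.delete(runId);
  }
}

class SketchEngine {
  private cache = new Map<string, PausedRun>(); // in-memory only, lost on restart
  private resuming = new Set<string>(); // guard; only matters once resume is async
  private counter = 0;
  // Random per-process tag so a fresh process cannot reissue an old id.
  private processTag = Math.random().toString(36).slice(2, 10);
  private store: MapStore;
  sideEffects: string[] = [];

  constructor(store: MapStore) {
    this.store = store;
  }

  suspend(pausedNodeId: string, variables: Record<string, unknown>): string {
    this.counter += 1;
    const run: PausedRun = {
      runId: `run_${this.processTag}_${this.counter}`,
      pausedNodeId,
      variables,
    };
    this.cache.set(run.runId, run);
    this.store.save(run); // persist on suspend
    return run.runId;
  }

  resume(runId: string, decision: Decision): boolean {
    if (this.resuming.has(runId)) return false; // concurrent duplicate
    this.resuming.add(runId);
    try {
      // Cache miss means a cold boot: rehydrate from the store.
      const run = this.cache.get(runId) ?? this.store.load(runId);
      if (run === undefined) return false; // unknown or already consumed
      // Consume the suspension before any downstream work runs.
      this.cache.delete(runId);
      this.store.delete(runId);
      this.sideEffects.push(`${run.pausedNodeId}:${decision}`);
      return true;
    } finally {
      this.resuming.delete(runId);
    }
  }
}

const shared = new MapStore();
const before = new SketchEngine(shared);
const runId = before.suspend("approval", { listing: { id: 7 } });

const after = new SketchEngine(shared); // simulated restart: empty cache, same store
const first = after.resume(runId, "approve");
const second = after.resume(runId, "approve"); // duplicate resume finds nothing
```

Consuming the suspension first means a repeated resume finds no stored run and returns without touching side effects, which is the property the idempotency criterion describes.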

Tests

  • engine.test.ts — suspend on one engine, resume on a brand-new engine sharing one store (simulated restart); nested-variable round-trip; idempotent duplicate resume; listSuspendedRunsDurable fallback.
  • suspended-run-store.test.ts — ObjectStoreSuspendedRunStore serialize/deserialize round-trip, upsert, delete/list, and an end-to-end suspend → restart → resume through the DB store.

New exports

SuspendedRun, SuspendedRunStore, StepLogEntry, InMemorySuspendedRunStore, ObjectStoreSuspendedRunStore, SuspendedRunStoreEngine, SysAutomationRun, plus AutomationEngine.setSuspendedRunStore() and listSuspendedRunsDurable().

https://claude.ai/code/session_01SGp45AQZzBp1ftWgDzTRqy


Generated by Claude Code

…ume across restarts (#1518)

Suspended runs lived only in an in-memory Map, so a flow paused at an
approval / wait / screen node could never resume after a process restart
— blocking durable-pause flows on hibernating/serverless hosts.

Back the in-memory map with a pluggable SuspendedRunStore (ADR-0019):

- SuspendedRunStore interface + InMemorySuspendedRunStore (default,
  JSON round-trips) and ObjectStoreSuspendedRunStore, which persists to a
  new sys_automation_run system object via the ObjectQL engine.
- AutomationServicePlugin registers the object and auto-enables the
  DB-backed store when an ObjectQL engine is present (opt out with
  suspendedRunStore: 'memory').
- Persist on suspend, delete on terminal completion; resume() rehydrates
  from the store on a cold boot and continues down the correct branch.
- Idempotent resume: the suspension is consumed before downstream work,
  with an in-process guard against concurrent duplicate resumes.
- Process-unique run ids so they don't collide with runs persisted by a
  previous process lifetime.

Existing service-automation and plugin-approvals tests pass unchanged.

https://claude.ai/code/session_01SGp45AQZzBp1ftWgDzTRqy

@github-actions github-actions Bot added the documentation, tests, tooling, and size/xl labels Jun 2, 2026
@os-zhuang os-zhuang marked this pull request as ready for review June 2, 2026 05:05
@os-zhuang os-zhuang merged commit cf03ef2 into main Jun 2, 2026
10 of 11 checks passed