feat(0.13.0): durable-run substrate#20

Merged
tangletools merged 3 commits into main from feat/durable-runs
May 20, 2026

@tangletools
Contributor

Summary

Production agents need step-level durability: a run must survive worker crashes, deploy rolls, OOM kills, and rate-limit cascades. The transport-level retry in 0.12.1 was the floor; this PR is the substrate.

The model is directly inspired by Absurd (Postgres) and Cloudflare Workflows: split a run into ordered, idempotent steps, persist each result before the next step runs, and replay from the store on resume.
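
To make the checkpoint-and-replay idea concrete, here is a minimal sketch of it, not the code shipped in this PR. `StepLog`, `makeStepRunner`, and `demo` are hypothetical names, and a plain `Map` stands in for a durable store.

```ts
// Hypothetical sketch: none of these names come from the PR.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Stand-in for a durable store: step name -> serialized result.
type StepLog = Map<string, string>;

function makeStepRunner(log: StepLog) {
  return async function step<T extends Json>(name: string, fn: () => Promise<T>): Promise<T> {
    const cached = log.get(name);
    if (cached !== undefined) {
      // Replay: return the persisted result without calling fn again.
      return JSON.parse(cached) as T;
    }
    const result = await fn();
    // Persist before returning, so the next step never starts without this checkpoint.
    log.set(name, JSON.stringify(result));
    return result;
  };
}

// Simulated crash and resume against the same log.
async function demo(): Promise<{ charges: number; receipt: string }> {
  const log: StepLog = new Map();
  let charges = 0;
  const task = async (failAfterPayment: boolean) => {
    const step = makeStepRunner(log);
    const payment = await step('process-payment', async () => {
      charges += 1; // side effect that must happen exactly once
      return { receipt: 'r-1' };
    });
    if (failAfterPayment) throw new Error('simulated worker crash');
    return step('ship', async () => ({ receipt: payment.receipt }));
  };
  await task(true).catch(() => undefined); // first attempt dies after step 1
  const out = await task(false); // resume: step 1 is served from the log
  return { charges, receipt: out.receipt };
}
```

In this sketch the payment callback runs once even though the task body runs twice, which is the property the crash-recovery test in this PR checks.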

Public surface

```ts
const { result } = await runDurable({
  runId: 'chat-session-42', // idempotency key (e.g. session id)
  manifest: { projectId, task, input },
  store: new FileSystemDurableRunStore(path), // or InMemoryDurableRunStore
  taskFn: async (ctx) => {
    const payment = await ctx.step('process-payment', async () => { ... })
    const ship = await ctx.awaitEvent('shipment.packed', { timeoutMs: 60_000 })
    return { payment, tracking: ship.trackingNumber }
  },
})
```

Boundary disciplines — fail loud, no silent shortcuts

  • Step results MUST be JSON-serializable (class instances are rejected at hash time)
  • Step intents MUST be stable across replays; a mismatch throws `DurableRunDivergenceError`
  • Same runId with different inputs throws `DurableRunInputMismatchError`
  • Lease conflict throws `DurableRunLeaseHeldError`
  • `ctx.now()` / `ctx.uuid()` are checkpointed once; replays return the cached values
  • Lease renewal heartbeat runs every leaseMs/3; lease loss aborts the current step
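
The first discipline can be illustrated with a sketch of sorted-key canonicalization that rejects class instances. This is an assumption about how such a check could look, not the PR's `canonicalHash`. `canonicalJson` and `fingerprint` are hypothetical names, and the 32-bit FNV-1a hash is a dependency-free stand-in for whatever hash the package actually uses.

```ts
// Hypothetical sketch: canonicalJson and fingerprint are illustrative only.
function canonicalJson(value: unknown): string {
  if (value === null || typeof value === 'boolean' || typeof value === 'string') {
    return JSON.stringify(value);
  }
  if (typeof value === 'number') {
    if (!Number.isFinite(value)) throw new TypeError('non-finite numbers are not JSON-serializable');
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return '[' + value.map((item) => canonicalJson(item)).join(',') + ']';
  }
  if (typeof value === 'object') {
    // Only plain objects pass; Date, Map, and user classes fail loudly.
    const proto = Object.getPrototypeOf(value);
    if (proto !== Object.prototype && proto !== null) {
      throw new TypeError('class instances are not JSON-serializable');
    }
    const record = value as Record<string, unknown>;
    const keys = Object.keys(record).sort(); // sorted keys remove insertion-order effects
    return '{' + keys.map((k) => JSON.stringify(k) + ':' + canonicalJson(record[k])).join(',') + '}';
  }
  throw new TypeError('unsupported type: ' + typeof value);
}

// Stand-in hash (32-bit FNV-1a over UTF-16 code units); not cryptographic.
function fingerprint(value: unknown): string {
  const text = canonicalJson(value);
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ('0000000' + (hash >>> 0).toString(16)).slice(-8);
}
```

With this shape, `fingerprint({ a: 1, b: 2 })` and `fingerprint({ b: 2, a: 1 })` agree, while `fingerprint(new Date())` throws instead of silently hashing a lossy serialization.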

Ships

  • `DurableRunStore` contract + typed error taxonomy
  • `InMemoryDurableRunStore` (dev)
  • `FileSystemDurableRunStore` (eval harness — one dir per run, append-only steps.jsonl + events.jsonl, atomic-rename run.json + lease.json)
  • `runDurable` wrapper with lease heartbeat + abort propagation
  • `canonicalHash` / `manifestHash` / `stepId` / `deriveWorkerId` helpers

Tests

  • `pnpm typecheck` clean
  • `pnpm test` — 165 / 165 pass (21 new durable tests)
  • Identical test matrix runs against both in-memory + filesystem stores
  • Crash recovery: fail mid-stream, restart, verify completed step NOT re-executed
  • Lease takeover: workerA holds, expires, workerB acquires
  • Concurrent emit race: first-emit-wins enforced
  • awaitEvent + emit concurrent: receives payload mid-flight; replay returns cached
  • Deterministic ctx.now / ctx.uuid stable across replay
  • Manifest hash stable across object insertion order

Follow-ups (separate PRs, queued)

  1. `D1DurableRunStore` + versioned `.sql` migration → Cloudflare prod path
  2. Cloudflare Workflows adapter (each chat turn = a Workflow step, native CF durability)
  3. Wire legal-agent onto the substrate as end-to-end demo

drewstone added 3 commits May 20, 2026 17:02
Production agents need step-level durability — survive worker crashes,
deploy rolls, OOM, transient transport errors, rate-limit cascades.
0.12.1's transport-level retry was the floor; this is the substrate.

The model — inspired by Absurd and Cloudflare Workflows — splits a run
into ordered, idempotent **steps**. Each step's result is persisted before
the next runs. On resume, the runner replays prior steps (returning cached
values without re-execution) until it reaches the first unfinished step.

Surface:

  ctx.step('process-payment', async () => { ... })  // checkpointed
  ctx.awaitEvent('shipment.packed:42')              // race-free
  ctx.emitEvent('shipment.packed:42', payload)      // first-emit-wins
  ctx.now() / ctx.uuid()                            // deterministic
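
One way to read "race-free" and "first-emit-wins" together is sketched below, under stated assumptions. `EventBoard` is a hypothetical in-memory stand-in rather than the store in this PR. Timeouts are omitted, and payloads are assumed to be JSON-serializable.

```ts
// Hypothetical sketch: EventBoard is not part of the PR.
class EventBoard {
  private payloads = new Map<string, string>();
  private waiters = new Map<string, Array<(payload: unknown) => void>>();

  // Returns true only for the first emit of a key; later emits are ignored.
  emit(key: string, payload: unknown): boolean {
    if (this.payloads.has(key)) return false;
    const stored = JSON.stringify(payload);
    this.payloads.set(key, stored);
    // Wake anyone who started waiting before the emit arrived.
    for (const wake of this.waiters.get(key) ?? []) wake(JSON.parse(stored));
    this.waiters.delete(key);
    return true;
  }

  // Resolves immediately if the event already exists, so emit-before-wait is not lost.
  waitFor(key: string): Promise<unknown> {
    const stored = this.payloads.get(key);
    if (stored !== undefined) return Promise.resolve(JSON.parse(stored));
    return new Promise((resolve) => {
      const list = this.waiters.get(key) ?? [];
      list.push(resolve);
      this.waiters.set(key, list);
    });
  }
}
```

Because the payload is stored before waiters are woken, a waiter that arrives before the emit and one that arrives after it both observe the same first payload.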

Concurrency: lease-based exclusivity. One worker per run at a time; lease
renewal on heartbeat; takeover after lease expiry. Committed steps survive.
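
A rough sketch of the lease rule described above, with the clock passed in explicitly so the behavior is easy to follow. `LeaseTable` and `heartbeatIntervalMs` are hypothetical names. The leaseMs/3 interval is taken from the PR description, and everything else is an assumption.

```ts
// Hypothetical sketch: LeaseTable is not the PR's store implementation.
interface Lease {
  workerId: string;
  expiresAt: number; // epoch milliseconds
}

class LeaseTable {
  private leases = new Map<string, Lease>();

  // Acquire or renew. Succeeds when there is no lease, the lease has expired,
  // or the caller already holds it. Otherwise another worker owns the run.
  acquire(runId: string, workerId: string, now: number, leaseMs: number): boolean {
    const current = this.leases.get(runId);
    if (current !== undefined && current.workerId !== workerId && current.expiresAt > now) {
      return false;
    }
    this.leases.set(runId, { workerId, expiresAt: now + leaseMs });
    return true;
  }

  // Current holder, or undefined if the lease is missing or expired.
  holder(runId: string, now: number): string | undefined {
    const current = this.leases.get(runId);
    return current !== undefined && current.expiresAt > now ? current.workerId : undefined;
  }
}

// Renew at a third of the lease length, so two renewals can be missed before expiry.
function heartbeatIntervalMs(leaseMs: number): number {
  return Math.floor(leaseMs / 3);
}
```

In this sketch a worker whose renewal returns false has lost the lease, which is the point where the real runner would abort the current step.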

Boundary disciplines (fail-loud, not silent):
  - Step results MUST be JSON-serializable — class instances rejected
  - Step intents MUST be stable across replays — divergence throws
  - Same runId + different manifest hash → DurableRunInputMismatchError
  - Step input fingerprints hashed canonically (sorted-key JSON)

Ships:
  - DurableRunStore contract + typed error taxonomy
  - InMemoryDurableRunStore (dev)
  - FileSystemDurableRunStore (eval harness — one dir per run, append-only
    steps.jsonl + events.jsonl, atomic-rename run.json + lease.json)
  - runDurable(ctx => ...) wrapper with lease heartbeat + abort-signal
    propagation + crash-safe failure recording
  - canonicalHash + manifestHash + stepId + deriveWorkerId helpers

Follow-ups (separate PRs, queued):
  - D1DurableRunStore + versioned .sql migration (Cloudflare prod path)
  - Cloudflare Workflows adapter (each turn = a Workflow step)
  - Wire one agent (legal) onto the substrate as end-to-end demo

21 new tests covering: fresh runs, replay-skips-fn, manifest mismatch,
step divergence, lease takeover after expiry, awaitEvent race + timeout,
first-emit-wins, deterministic ctx.now/uuid stability across replay. All
21 pass identically against in-memory AND filesystem stores via a shared
test matrix. Total suite: 165 tests, all passing.

Three pieces added on top of the InMemory + FileSystem stores:

1. D1DurableRunStore — the production path for Cloudflare Workers.
   - All operations mapped to D1 prepared statements
   - Lease takeover via conditional UPDATE (atomic under SQLite locking)
   - First-emit-wins enforced by PK (run_id, key) on durable_events
   - Structural D1DatabaseLike interface — zero @cloudflare/workers-types dep
   - Versioned schema.sql ships in the package (durable_schema_info table)

2. better-sqlite3 in devDependencies — drives the D1DurableRunStore
   test matrix against a real SQLite engine. The full 10-test contract
   suite now runs identically against InMemory, FileSystem, and D1
   (real SQL, real UNIQUE constraints, real CASE expressions). 31 total
   tests in the durable matrix.

3. runOnWorkflowStep — Cloudflare Workflows entrypoint adapter that
   converts a WorkflowStep into a DurableContext. Each ctx.step
   delegates to step.do; ctx.awaitEvent delegates to step.waitForEvent;
   ctx.now / ctx.uuid checkpoint through step.do for replay stability.
   Pure structural typing — no cloudflare:workers runtime dep.
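
The adapter idea in item 3 might look roughly like the sketch below. It is an illustration under assumptions, not the shipped `runOnWorkflowStep`. `WorkflowStepLike`, `DurableContextLike`, `Sources`, and `contextFromWorkflowStep` are hypothetical names. The step interface covers only the two members used here, `now()` / `uuid()` are assumed to be async, and the event-type and timeout mapping is a guess.

```ts
// Structural stand-in for the Workflows step object; only the members used here.
interface WorkflowStepLike {
  do<T>(name: string, callback: () => Promise<T>): Promise<T>;
  waitForEvent<T>(name: string, options: { type: string; timeout?: number }): Promise<{ payload: T }>;
}

// Hypothetical context shape; the PR's DurableContext may differ.
interface DurableContextLike {
  step<T>(name: string, fn: () => Promise<T>): Promise<T>;
  awaitEvent<T>(key: string, options?: { timeoutMs?: number }): Promise<T>;
  now(): Promise<number>;
  uuid(): Promise<string>;
}

// Injected sources of nondeterminism, so the sketch needs no runtime globals.
interface Sources {
  clock: () => number;
  makeUuid: () => string;
}

function contextFromWorkflowStep(wf: WorkflowStepLike, sources: Sources): DurableContextLike {
  let nowCalls = 0;
  let uuidCalls = 0;
  return {
    // Each durable step becomes one platform-checkpointed step.
    step<T>(name: string, fn: () => Promise<T>): Promise<T> {
      return wf.do(name, fn);
    },
    async awaitEvent<T>(key: string, options?: { timeoutMs?: number }): Promise<T> {
      // Assumption: the event key doubles as the platform event type.
      const event = await wf.waitForEvent<T>('await:' + key, { type: key, timeout: options?.timeoutMs });
      return event.payload;
    },
    // Call-order suffixes give each now()/uuid() call its own checkpoint name,
    // so a replay reads back the value recorded on the first execution.
    now(): Promise<number> {
      return wf.do('now#' + nowCalls++, async () => sources.clock());
    },
    uuid(): Promise<string> {
      return wf.do('uuid#' + uuidCalls++, async () => sources.makeUuid());
    },
  };
}
```

The call-order naming only stays stable if the task body calls `now()` and `uuid()` in the same order on every replay, which matches the stable-intent discipline the substrate already enforces.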

Deployment patterns (pick one per agent, do not mix):
  A. Plain Worker + D1DurableRunStore — default. Survives isolate
     restart via D1 lease takeover. Use for chat sessions.
  B. Cloudflare Workflows + runOnWorkflowStep — for long tasks
     (minutes to hours), platform-managed retries, dashboard
     observability. Workflows handles the outer durability.

175 tests pass (was 144).
@tangletools tangletools merged commit a48cb65 into main May 20, 2026
1 check passed
@tangletools tangletools deleted the feat/durable-runs branch May 20, 2026 14:14
