feat(0.13.0): durable-run substrate #20
Merged
Production agents need step-level durability — survive worker crashes,
deploy rolls, OOM, transient transport errors, rate-limit cascades.
0.12.1's transport-level retry was the floor; this is the substrate.
The model — inspired by Absurd and Cloudflare Workflows — splits a run
into ordered, idempotent **steps**. Each step's result is persisted before
the next runs. On resume, the runner replays prior steps (returning cached
values without re-execution) until it reaches the first unfinished step.
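The replay rule above can be sketched in a few lines. This is a minimal illustration, not the package's runner: `ReplayContext`, `StepRecord`, and the in-memory `log` array are hypothetical names standing in for the persisted step history of one run.

```ts
// Hypothetical names throughout; an in-memory array stands in for the store.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

interface StepRecord {
  name: string;
  result: Json;
}

class ReplayContext {
  private cursor = 0;
  private readonly log: StepRecord[];

  constructor(log: StepRecord[]) {
    this.log = log;
  }

  async step<T extends Json>(name: string, fn: () => Promise<T>): Promise<T> {
    const index = this.cursor++;
    const prior = this.log[index];
    if (prior !== undefined) {
      // Replay path: the step already committed, so fn is never called.
      if (prior.name !== name) {
        throw new Error(`step ${index} diverged: recorded "${prior.name}", got "${name}"`);
      }
      return prior.result as T;
    }
    // First execution: run fn, then record the result before returning,
    // so the next step cannot start until this one is committed.
    const result = await fn();
    this.log.push({ name, result });
    return result;
  }
}
```

Running the same task function twice against the same `log` executes each step body once; the second run only reads recorded results.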
Surface:
  ctx.step('process-payment', async () => { ... })   // checkpointed
  ctx.awaitEvent('shipment.packed:42')               // race-free
  ctx.emitEvent('shipment.packed:42', payload)       // first-emit-wins
  ctx.now() / ctx.uuid()                             // deterministic
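One way to read "race-free" and "first-emit-wins" together: the payload is stored under its key on the first emit, and a waiter checks the stored payload before it starts waiting, so an event emitted before the wait is not lost. The sketch below assumes that reading; `EventTable`, `emit`, and `waitFor` are hypothetical names, not the package's API.

```ts
// Hypothetical in-memory event table illustrating first-emit-wins.
class EventTable<P> {
  private payloads = new Map<string, P>();
  private waiters = new Map<string, Array<(payload: P) => void>>();

  // Returns true if this call recorded the payload, false if an earlier emit won.
  emit(key: string, payload: P): boolean {
    if (this.payloads.has(key)) return false;
    this.payloads.set(key, payload);
    for (const wake of this.waiters.get(key) ?? []) wake(payload);
    this.waiters.delete(key);
    return true;
  }

  // Resolves with the recorded payload, or rejects after timeoutMs.
  waitFor(key: string, timeoutMs: number): Promise<P> {
    if (this.payloads.has(key)) {
      // The event already happened: no waiting, so there is no window to miss it.
      return Promise.resolve(this.payloads.get(key) as P);
    }
    return new Promise<P>((resolve, reject) => {
      const timer = setTimeout(
        () => reject(new Error(`timed out waiting for "${key}"`)),
        timeoutMs,
      );
      const list = this.waiters.get(key) ?? [];
      list.push((payload) => {
        clearTimeout(timer);
        resolve(payload);
      });
      this.waiters.set(key, list);
    });
  }
}
```

A persistent store would enforce the same rule with a uniqueness constraint on the key rather than a `Map` lookup.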
Concurrency: lease-based exclusivity. One worker per run at a time; lease
renewal on heartbeat; takeover after lease expiry. Committed steps survive.
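The lease rule can be illustrated with a single acquire-or-renew operation. This is a sketch under stated assumptions: `LeaseTable` and its fields are hypothetical, the clock is passed in explicitly so the behaviour is deterministic, and a real store would perform the check and the write as one atomic operation.

```ts
// Hypothetical lease table: one lease per run, held by at most one worker.
interface Lease {
  workerId: string;
  expiresAt: number; // epoch milliseconds
}

class LeaseTable {
  private leases = new Map<string, Lease>();

  // Succeeds when the run has no lease, the lease has expired,
  // or the caller already holds it (heartbeat renewal).
  tryAcquire(runId: string, workerId: string, nowMs: number, ttlMs: number): boolean {
    const current = this.leases.get(runId);
    const heldByOther =
      current !== undefined &&
      current.workerId !== workerId &&
      current.expiresAt > nowMs;
    if (heldByOther) return false;
    this.leases.set(runId, { workerId, expiresAt: nowMs + ttlMs });
    return true;
  }

  holder(runId: string): string | undefined {
    return this.leases.get(runId)?.workerId;
  }
}
```

With a 1000 ms TTL, a worker that heartbeats at 900 ms keeps the lease until 1900 ms; a second worker can take over only once that time has passed.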
Boundary disciplines (fail-loud, not silent):
- Step results MUST be JSON-serializable — class instances rejected
- Step intents MUST be stable across replays — divergence throws
- Same runId + different manifest hash → DurableRunInputMismatchError
- Step input fingerprints hashed canonically (sorted-key JSON)
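As a rough illustration of the first and last rules, the sketch below serializes a value to sorted-key JSON, rejects anything that is not plain JSON data, and hashes the result. The names `canonicalJson` and `fingerprint` are hypothetical and are not the package's `canonicalHash`; SHA-256 is an assumed choice of digest.

```ts
import { createHash } from 'node:crypto';

// Serialize with object keys sorted, so key order cannot change the output.
// Throws on values that are not plain JSON data.
function canonicalJson(value: unknown): string {
  if (value === null || typeof value === 'boolean' || typeof value === 'string') {
    return JSON.stringify(value);
  }
  if (typeof value === 'number') {
    if (!Number.isFinite(value)) throw new TypeError('non-finite numbers are not JSON-serializable');
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return '[' + value.map((item) => canonicalJson(item)).join(',') + ']';
  }
  if (typeof value === 'object') {
    const proto = Object.getPrototypeOf(value);
    if (proto !== Object.prototype && proto !== null) {
      throw new TypeError('class instances are not JSON-serializable');
    }
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return '{' + keys.map((k) => JSON.stringify(k) + ':' + canonicalJson(obj[k])).join(',') + '}';
  }
  // undefined, functions, symbols, bigint
  throw new TypeError(`unsupported type: ${typeof value}`);
}

// Hash of the canonical form (SHA-256 is an assumption of this sketch).
function fingerprint(value: unknown): string {
  return createHash('sha256').update(canonicalJson(value)).digest('hex');
}
```

Two inputs that differ only in key order produce the same fingerprint, while a `Date` or other class instance fails loudly instead of being silently coerced.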
Ships:
- DurableRunStore contract + typed error taxonomy
- InMemoryDurableRunStore (dev)
- FileSystemDurableRunStore (eval harness: one dir per run, append-only
  steps.jsonl + events.jsonl, atomic-rename run.json + lease.json)
- runDurable(ctx => ...) wrapper with lease heartbeat + abort-signal
  propagation + crash-safe failure recording
- canonicalHash + manifestHash + stepId + deriveWorkerId helpers
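The two write patterns named for the filesystem store, append-only JSONL and atomic rename, might look roughly like this with Node's `fs` module. The helper names and record fields are hypothetical; only the file names `steps.jsonl` and `lease.json` come from the description above, and a production store would likely also fsync before renaming.

```ts
import { appendFileSync, mkdtempSync, readFileSync, renameSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Append one JSON record per line; existing lines are never rewritten.
function appendJsonl(file: string, record: unknown): void {
  appendFileSync(file, JSON.stringify(record) + '\n', 'utf8');
}

function readJsonl(file: string): unknown[] {
  return readFileSync(file, 'utf8')
    .split('\n')
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as unknown);
}

// Write to a temp file in the same directory, then rename over the target,
// so readers see either the old file or the new one, never a partial write.
function writeJsonAtomic(file: string, value: unknown): void {
  const tmp = `${file}.tmp`;
  writeFileSync(tmp, JSON.stringify(value), 'utf8');
  renameSync(tmp, file);
}

// Usage: one directory per run (a temp dir here, for illustration only).
const runDir = mkdtempSync(join(tmpdir(), 'durable-run-'));
appendJsonl(join(runDir, 'steps.jsonl'), { index: 0, name: 'process-payment', result: { ok: true } });
appendJsonl(join(runDir, 'steps.jsonl'), { index: 1, name: 'send-receipt', result: null });
writeJsonAtomic(join(runDir, 'lease.json'), { workerId: 'worker-a', expiresAt: 1000 });
writeJsonAtomic(join(runDir, 'lease.json'), { workerId: 'worker-b', expiresAt: 2000 });
```

Append-only files suit step and event history, which only grows; rename suits `run.json` and `lease.json`, which are replaced whole.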
Follow-ups (separate PRs, queued):
- D1DurableRunStore + versioned .sql migration (Cloudflare prod path)
- Cloudflare Workflows adapter (each turn = a Workflow step)
- Wire one agent (legal) onto the substrate as end-to-end demo
21 new tests covering: fresh runs, replay-skips-fn, manifest mismatch,
step divergence, lease takeover after expiry, awaitEvent race + timeout,
first-emit-wins, deterministic ctx.now/uuid stability across replay. All
21 pass identically against in-memory AND filesystem stores via a shared
test matrix. Total suite: 165 tests, all passing.
Three pieces added on top of the InMemory + FileSystem stores:
1. D1DurableRunStore: the production path for Cloudflare Workers.
   - All operations mapped to D1 prepared statements
   - Lease takeover via conditional UPDATE (atomic under SQLite locking)
   - First-emit-wins enforced by PK (run_id, key) on durable_events
   - Structural D1DatabaseLike interface: zero @cloudflare/workers-types dep
   - Versioned schema.sql ships in the package (durable_schema_info table)
2. better-sqlite3 in devDependencies: drives the D1DurableRunStore
   test matrix against a real SQLite engine. The full 10-test contract
   suite now runs identically against InMemory, FileSystem, and D1
   (real SQL, real UNIQUE constraints, real CASE expressions). 31 total
   tests in the durable matrix.
3. runOnWorkflowStep: Cloudflare Workflows entrypoint adapter that
   converts a WorkflowStep into a DurableContext. Each ctx.step
   delegates to step.do; ctx.awaitEvent delegates to step.waitForEvent;
   ctx.now / ctx.uuid checkpoint through step.do for replay stability.
   Pure structural typing: no cloudflare:workers runtime dep.
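The adapter idea in item 3 can be sketched with a structural interface and a test double. Everything here is an assumption for illustration: `WorkflowStepLike` models only a two-argument `do`, `MiniContext` is a cut-down stand-in for DurableContext with an async `now`, and `MemoStep` imitates platform replay by memoizing on step name.

```ts
// Structural stand-in for the Workflows step object; only `do` is modelled.
interface WorkflowStepLike {
  do<T>(name: string, callback: () => Promise<T>): Promise<T>;
}

// Cut-down context for this sketch (hypothetical shape).
interface MiniContext {
  step<T>(name: string, fn: () => Promise<T>): Promise<T>;
  now(): Promise<number>;
}

function toContext(wf: WorkflowStepLike): MiniContext {
  let clockReads = 0;
  return {
    step<T>(name: string, fn: () => Promise<T>): Promise<T> {
      // Each durable step becomes one platform step.
      return wf.do(name, fn);
    },
    now(): Promise<number> {
      // Checkpoint the clock read so a replay sees the original value.
      const name = `now:${clockReads++}`;
      return wf.do(name, async () => Date.now());
    },
  };
}

// Test double: returns the recorded result when a step name repeats.
class MemoStep implements WorkflowStepLike {
  private results = new Map<string, unknown>();

  async do<T>(name: string, callback: () => Promise<T>): Promise<T> {
    if (this.results.has(name)) return this.results.get(name) as T;
    const value = await callback();
    this.results.set(name, value);
    return value;
  }
}
```

Because the interface is structural, the adapter compiles without importing any platform types, which matches the "no runtime dep" goal stated above.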
Deployment patterns (pick one per agent, do not mix):
A. Plain Worker + D1DurableRunStore: the default. Survives isolate
   restart via D1 lease takeover. Use for chat sessions.
B. Cloudflare Workflows + runOnWorkflowStep: for long tasks
   (minutes to hours), platform-managed retries, dashboard
   observability. Workflows handles the outer durability.
175 tests pass (was 144).
Summary
Production agents need step-level durability — survive worker crashes, deploy rolls, OOM, rate-limit cascades. The 0.12.1 transport-level retry was the floor; this PR is the substrate.
The model is directly inspired by Absurd (Postgres) and Cloudflare Workflows: split a run into ordered, idempotent steps, persist each result before the next step runs, and replay from the store on resume.
Public surface
```ts
const { result } = await runDurable({
  runId: 'chat-session-42',                    // idempotency key (e.g. session id)
  manifest: { projectId, task, input },
  store: new FileSystemDurableRunStore(path),  // or InMemoryDurableRunStore
  taskFn: async (ctx) => {
    const payment = await ctx.step('process-payment', async () => { ... })
    const ship = await ctx.awaitEvent('shipment.packed', { timeoutMs: 60_000 })
    return { payment, tracking: ship.trackingNumber }
  },
})
```
Boundary disciplines — fail loud, no silent shortcuts
Ships
Tests
Follow-ups (separate PRs, queued)