feat(0.15.0): cross-worker sandbox-reconnect durability #23
Open
drewstone wants to merge 1 commit into
A 15-minute agentic sandbox turn must survive the Cloudflare worker
isolate dying mid-turn. `runDurableTurn` already replays a *completed*
turn, but an *interrupted* one re-runs from the top, because the
producer's `streamPrompt` generator died with the isolate.

The sandbox container is orchestrator-managed and outlives the worker.
`runReconnectableTurn` checkpoints a `RunHandle`
(`{ kind, sandboxId, sessionId, runId, status, cursor }`) at turn start.
On a retry that finds a `running` handle, a fresh worker calls a
product-supplied `reconnect(handle)` callback (which wires the sandbox
SDK's event-replay endpoint) instead of re-prompting. tcloud products
omit `reconnect` and fall through to a clean re-run.

The handle is checkpointed as a completed step at index 0; the turn runs
at index 1. This reuses the existing `completeStep` JSON-result path
with zero schema change: a completed step is the only shape
`startOrResume` returns to a retry, and the handle must be readable
while the turn step is still `running`.

Tests cover fresh / reconnected / replayed / rerun / reconnect-failure
across the InMemory / FileSystem / D1 store matrix.
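
The step layout described in the commit message (handle at index 0, turn at index 1) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `InMemorySteps`, `StepRecord`, `startStep`, `completedSteps` and `readHandle` are hypothetical names, and the `completeStep` signature shown here is a guess at the shape, not the actual agent-runtime store API.

```typescript
// Hypothetical shapes for illustration; the real store API may differ.
interface RunHandle {
  kind: "sandbox" | "tcloud";
  sandboxId?: string;
  sessionId?: string;
  runId?: string;
  status: string;
  cursor?: string;
}

interface StepRecord {
  index: number;
  status: "running" | "completed";
  resultJson?: string; // JSON result, only present on completed steps
}

const HANDLE_STEP = 0; // handle checkpoint, written as an already-completed step
const TURN_STEP = 1; // the turn itself, which may still be running on a retry

class InMemorySteps {
  private readonly steps = new Map<number, StepRecord>();

  startStep(index: number): void {
    this.steps.set(index, { index, status: "running" });
  }

  completeStep(index: number, result: unknown): void {
    this.steps.set(index, {
      index,
      status: "completed",
      resultJson: JSON.stringify(result),
    });
  }

  // Models the constraint in the text: a retry only gets completed steps back.
  completedSteps(): StepRecord[] {
    return Array.from(this.steps.values()).filter((s) => s.status === "completed");
  }
}

// Reads the handle back from the step-0 checkpoint, if one was written.
function readHandle(store: InMemorySteps): RunHandle | undefined {
  const step = store.completedSteps().find((s) => s.index === HANDLE_STEP);
  if (step === undefined || step.resultJson === undefined) return undefined;
  return JSON.parse(step.resultJson) as RunHandle;
}

// First attempt: checkpoint the handle, then start the turn. The isolate dies here.
const store = new InMemorySteps();
store.completeStep(HANDLE_STEP, { kind: "sandbox", runId: "run-1", status: "running" });
store.startStep(TURN_STEP);

// Retry on a fresh worker: the handle is visible although the turn step is unfinished.
const recovered = readHandle(store);
```

Because the handle lives in its own completed step, no new column or store method is needed to make it readable while the turn step is still `running`.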
Summary

A 15-minute agentic sandbox turn must survive the Cloudflare worker isolate dying mid-turn (deploy roll, CPU limit, OOM). `runDurableTurn` already replays a completed turn, but an interrupted turn re-runs from the top, because the producer's `streamPrompt` generator died with the isolate.

The Tangle sandbox container is orchestrator-managed and outlives the worker. This PR adds `runReconnectableTurn`: it checkpoints a `RunHandle` at turn start so a fresh worker re-attaches to the in-flight sandbox run instead of re-prompting.

- `RunHandle`: `{ kind: 'sandbox' | 'tcloud', sandboxId?, sessionId?, runId?, status, cursor? }`. A pointer to a substrate run that outlives the isolate.
- `runReconnectableTurn`: three resolution paths on a retry:
  - `replayed`: the turn already finished, so the cached text replays.
  - `reconnected`: a `running` handle survived, so it calls the product's `reconnect(handle)` callback.
  - `rerun` / `fresh`: there is no reconnectable handle, so it produces live.
- `reconnect(handle)` is product-supplied substrate glue. Sandbox products wire the SDK's event-replay endpoint (`GET {runtimeUrl}/agents/run/{runId}/events?lastEventId={cursor}`); tcloud products omit it and fall through to a clean re-run.
- The handle is checkpointed as a completed step at index 0, and the turn runs at index 1. This reuses the existing `completeStep` JSON-result path with zero schema change: a completed step is the only shape `startOrResume` returns to a retry, and the handle must be readable while the turn step is still `running`. A new `durable_steps` column would force a migration across all three stores plus a new store method.

This is a thin handle registry, not a second durable-execution framework: the sandbox runtime is the durable engine; agent-runtime just remembers the pointer.
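
The resolution paths listed in the summary can be sketched as a pure decision function. This is a minimal sketch, not the PR's implementation: `resolveTurn`, `Resolution` and `Reconnect` are hypothetical names, the two `status` values are the only ones this description mentions, and treating "no handle at all" as `fresh` versus "handle present but not reconnectable" as `rerun` is an assumption about how the two labels differ.

```typescript
// Field list taken from the PR description; status values are an assumption.
interface RunHandle {
  kind: "sandbox" | "tcloud";
  sandboxId?: string;
  sessionId?: string;
  runId?: string;
  status: "running" | "completed";
  cursor?: string;
}

type Resolution = "fresh" | "replayed" | "reconnected" | "rerun";

// Hypothetical callback type; the real one presumably returns an event stream.
type Reconnect = (handle: RunHandle) => unknown;

function resolveTurn(handle: RunHandle | undefined, reconnect?: Reconnect): Resolution {
  // No checkpointed handle: first attempt, produce live.
  if (handle === undefined) return "fresh";
  // Turn already finished: replay the cached text.
  if (handle.status === "completed") return "replayed";
  // Handle is still running and the product supplied reconnect glue: re-attach.
  if (reconnect !== undefined) return "reconnected";
  // Running handle but no reconnect callback (e.g. tcloud): clean re-run.
  return "rerun";
}

const running: RunHandle = { kind: "sandbox", runId: "run-1", status: "running" };
const done: RunHandle = { kind: "sandbox", runId: "run-1", status: "completed" };
const noopReconnect: Reconnect = () => undefined;

const paths = [
  resolveTurn(undefined, noopReconnect),
  resolveTurn(done, noopReconnect),
  resolveTurn(running, noopReconnect),
  resolveTurn(running),
];
```

Keeping the decision separate from the streaming code makes each path testable without a sandbox or a store.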

Spike findings (`@tangle-network/sandbox@0.1.2`)

Cross-worker attach is feasible. `streamPrompt`'s reconnect uses `executionId` (the run id, carried on the `execution.started` SSE frame's `data`) plus `lastEventId` (the SSE `id:` cursor). The runtime exposes `GET {runtimeUrl}/agents/run/{executionId}/events?lastEventId={cursor}&format=sse`, reachable from any process via the public `SandboxConnection.runtimeUrl` + `authToken`. The SDK does not expose a one-call `resumeRun(executionId)` (its reconnect loop is closure-local), so the raw replay fetch is product-owned, which is exactly why `reconnect` is a product-supplied callback.

Test plan

- `pnpm typecheck` passes
- `pnpm test`: 231/231 pass (18 new in `run-handle.test.ts`)
- `running` handle calls `reconnect`, not `produce`; `completed` handle replays; `running` handle with no `reconnect` falls through to re-run; reconnect-stream failure fails the run (error not swallowed); `register` advances the persisted cursor
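
The replay endpoint from the spike findings, and the cursor it depends on, can be sketched with two small helpers. This assumes only the URL shape quoted above and the standard SSE `id:` field; `buildReplayUrl` and `frameCursor` are hypothetical names, omitting `lastEventId` when no cursor exists is an assumption, and since the description does not say how `authToken` is attached to the request, the sketch leaves authentication out.

```typescript
// Builds the replay URL in the shape quoted in the spike findings.
function buildReplayUrl(runtimeUrl: string, executionId: string, cursor?: string): string {
  const base = runtimeUrl.replace(/\/+$/, ""); // drop trailing slashes
  const path = `${base}/agents/run/${encodeURIComponent(executionId)}/events`;
  // Assumption: with no cursor yet, lastEventId is simply left out.
  const query =
    cursor === undefined
      ? "format=sse"
      : `lastEventId=${encodeURIComponent(cursor)}&format=sse`;
  return `${path}?${query}`;
}

// Returns the last `id:` value in one SSE frame, or undefined if the frame has none.
// Per the SSE format, a single leading space after the colon is not part of the value.
function frameCursor(frame: string): string | undefined {
  let id: string | undefined;
  for (const line of frame.split(/\r?\n/)) {
    if (line.startsWith("id:")) {
      const value = line.slice(3);
      id = value.startsWith(" ") ? value.slice(1) : value;
    }
  }
  return id;
}

// Example: advance the cursor from a received frame, then build the next replay URL.
const cursor = frameCursor("id: 42\ndata: {}");
const replayUrl = buildReplayUrl("https://runtime.example/", "exec-1", cursor);
```

A product's `reconnect(handle)` callback would pass the handle's `runId` and `cursor` into a builder like this before issuing the fetch, and persist each new cursor as frames arrive.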