feat(kiloclaw): add KiloClawRegistry DO and complete instance-keyed identity migration#1706
Conversation
f1a4338 to
06eca4b
Compare
| // subsequent reads converge on the oldest row. The second row is an inert | ||
| // orphan (no DO created for it). The window is milliseconds on a user- | ||
| // initiated action already deduplicated by the frontend's useMutation. | ||
| const existing = await getActiveInstance(userId); |
There was a problem hiding this comment.
WARNING: Check-then-insert can create multiple active personal instances
Now that personal rows get a fresh sandbox_id from instanceId, two concurrent calls can both miss getActiveInstance(userId) and each insert a different active row. That is no longer protected by the old (user_id, sandbox_id) uniqueness path, so a double-submit/retry can leave multiple active personal instances (and duplicate subscription rows, since kiloclaw_subscriptions is only unique on instance_id).
There was a problem hiding this comment.
yea, hence the code comment - but we're going to be supporting multiple instances per user very shortly like this week.
Code Review SummaryStatus: 3 Issues Found | Recommendation: Address before merge Overview
Issue Details (click to expand)No new issues found in the incremental diff. Other Observations (not in diff)Issues found in unchanged code that cannot receive inline comments:
Fix these issues in Kilo Cloud Files Reviewed (7 files)
Reviewed by gpt-5.4-20260305 · 901,410 tokens |
| // leave expired accounts active. | ||
| try { | ||
| await client.stop(row.user_id); | ||
| await client.stop(row.user_id, row.instance_id ?? undefined); |
There was a problem hiding this comment.
WARNING: Lifecycle sweeps are only half-migrated to per-instance state
This stop call is now scoped to row.instance_id, but the Postgres mutations in sweeps 1-4 still key off user_id only (kiloclaw_subscriptions in the suspension paths and kiloclaw_instances in the destruction path). Once a user has multiple subscribed instances, expiring one row here will suspend or destroy unrelated instances as well. Those updates need the same instance_id scoping, with the old user-wide fallback reserved for legacy rows.
There was a problem hiding this comment.
This is only a problem when a user has multiple subscribed instances — which doesn't happen yet.
src/routers/admin-router.ts
Outdated
| try { | ||
| const client = new KiloClawInternalClient(); | ||
| await client.start(input.userId); | ||
| await client.start(input.userId, activeInstance.id); |
There was a problem hiding this comment.
WARNING: Trial reset still starts an arbitrary instance
activeInstance is still whichever undestroyed row the bare LIMIT 1 returns for this user, so threading activeInstance.id here does not guarantee you're restarting the subscription row that was just reset. In the per-instance subscription model this can bring up the wrong DO while leaving the intended instance suspended.
There was a problem hiding this comment.
same, Currently one personal instance per user, so LIMIT 1 returns the right one.
…y through registry Add the KiloClawRegistry Durable Object (SQLite-backed via Drizzle ORM) that indexes instances per owner (user or org). Wire provision, destroy, and catch-all proxy flows through the registry. Enable lazy migration of legacy instances from Postgres on first access. Key changes: - KiloClawRegistry DO with listInstances, createInstance, destroyInstance, resolveDoKey, findInstancesForUser methods - Lazy migration: reads legacy instance from Postgres via Hyperdrive on first listInstances() call, with 60s retry cooldown - Catch-all proxy reads sandboxId from DO status (not middleware) for gateway token derivation — critical for instance-keyed DOs using ki_ sandboxIds - Registry create/destroy are best-effort (non-fatal errors) - resolveRegistryEntry falls back to legacy idFromName(userId) on registry failure - ensureActiveInstance supports org instances with instance-keyed sandboxId derivation - restoreFromPostgres accepts opts.sandboxId for precise multi-instance lookup - tRPC router threads instanceId to worker for all provisions/destroys
Complete the instance-keyed DO migration by threading instanceId through every caller that resolves a KiloClawInstance DO stub: Worker: - All ~30 platform.ts routes now parse ?instanceId= and pass to instanceStubFactory (3-arg calls) - controller.ts handles ki_ sandboxIds via isInstanceKeyedSandboxId to resolve the correct DO key - Snapshot-restore queue message includes optional instanceId; consumer uses it as DO key when present Internal client: - All ~30 instance-scoped methods accept optional instanceId as last parameter, forwarded as ?instanceId= query param Next.js callers: - All tRPC router methods call getActiveInstance(userId) and pass instance?.id to internal client - Admin router methods pass instance.id from DB lookups - Billing cron + autoResumeIfSuspended already had instanceId (verified pre-existing) New exports from @kilocode/worker-utils/instance-id: - isInstanceKeyedSandboxId(sandboxId): boolean - instanceIdFromSandboxId(sandboxId): string
The controller checkin route used instanceId as a placeholder for userId when handling ki_ sandboxIds. This caused PostHog attribution and instance-ready emails to silently fail for instance-keyed DOs. Fix: call stub.getStatus() after auth to read the real userId from the DO, which always stores it during provision.
…row targeting Three identity consistency fixes: 1. ensureActiveInstance personal flow now writes instance-keyed sandboxId (ki_ prefix) matching what the DO stores. Previously wrote sandboxIdFromUserId which diverged from the DO's sandboxIdFromInstanceId, breaking restoreFromPostgres lookups. getActiveInstance/markActiveInstanceDestroyed/renameInstance updated to find rows without filtering by sandboxId format. 2. Lazy migration derives doKey from Postgres row's sandboxId format (ki_ prefix → instanceId, base64url → userId) instead of blindly writing doKey=userId. Prevents wrong DO routing after transient registry create failures. 3. Admin destroy and devNukeAll pass instance.id to markActiveInstanceDestroyed for explicit row targeting instead of legacy sandboxId matching.
…ated_at Add ORDER BY created_at ASC to getActiveInstance so LIMIT 1 always returns the oldest active row. Documents the known race window in ensureActiveInstance where concurrent callers can both insert — the consequence is a benign orphan row since all subsequent reads converge on the same (oldest) row.
Always attempt user registry cleanup on destroy. When getStatus() succeeds and reveals an orgId, also clean up the org registry. When getStatus() fails, log a warning — the org registry entry becomes stale but harmless (points to a destroyed DO with no machineId).
getOrCreateInstanceForBilling searched for the destroyed personal instance by sandboxIdFromUserId, which doesn't match instance-keyed rows (ki_ prefix). Now finds the most recently destroyed personal instance by user_id + organization_id IS NULL without filtering by sandboxId format, matching the pattern used by getActiveInstance.
The access gateway's buildRedirectUrl derived the gateway token from sandboxIdFromUserId(userId), but instance-keyed DOs use sandboxIdFromInstanceId(instanceId) — a different sandboxId. This caused token_mismatch on every new instance's WebSocket auth. Fix: resolve the DO's actual sandboxId via the registry + getStatus() before deriving the token. Falls back to legacy derivation if the DO is unreachable.
Three files with instanceId threading were in the git stash during rebase and not included in the lifecycle threading commit: - billing-lifecycle-cron.ts: pass instance_id to client.stop/destroy - instance-lifecycle.ts: formatting only (already had instanceId) - admin-router.ts: pass activeInstance.id to client.start
Legacy instances have DOs at idFromName(userId), not idFromName(instanceId). Passing instance.id as instanceId to the worker causes it to resolve an empty DO — breaking status, start, stop, destroy, and all lifecycle operations for existing instances after upgrading to this branch. Add workerInstanceId() helper that checks the row's sandboxId: only returns instance.id for ki_-prefixed (instance-keyed) rows. Returns undefined for legacy base64url rows so the worker falls back to idFromName(userId). Applied across all callers: tRPC router, admin router, and admin instances router.
…sandbox_id Two fixes: 1. access-gateway resolveSandboxId: when getStatus() throws but the registry entry exists, derive sandboxId from the entry's doKey (sandboxIdFromInstanceId for UUID doKeys) instead of falling back to sandboxIdFromUserId. Prevents wrong gateway token for instance-keyed DOs when the DO is temporarily unreachable. 2. Admin destroy and devNukeAll queries now select sandbox_id so workerInstanceId() can check the ki_ prefix. Without it, the function always returned undefined, routing to the legacy userId-keyed DO instead of the instance-keyed one.
…ceId) The destroy route only cleaned up registry entries when instanceId was provided. Legacy destroys (no instanceId, DO keyed by userId) skipped the cleanup entirely, leaving stale registry entries that point to destroyed DOs. On re-provision, the proxy picks the stale entry (oldest) instead of the new instance's entry. Fix: when no instanceId is provided, list the user registry's entries and find the one with doKey=userId (the legacy entry), then destroy it by its instanceId from the registry.
0bf7eb9 to
72e49b4
Compare
Registry lifecycle: - Instance DO destroy() now cleans up registry entries on finalization, covering alarm-initiated destroys that bypass the platform route - Production logging on registry create/destroy in platform routes Admin observability: - New GET /api/platform/registry-entries endpoint returns all entries (including destroyed) for a user's registry - Admin instance detail page shows Registry Status card with all entries, doKey, timestamps, and current/active badges - orgId added to PlatformDebugStatusResponse and shown in admin UI Known gap documented: GDPR soft-delete does not clean up registry or Fly resources (deferred to follow-up PR).
…ration status, detailed logging
Three review fixes:
1. Registry admin endpoint now accepts optional orgId param and queries
both personal (user:userId) and org (org:orgId) registries. Admin
instance detail page passes organization_id. AdminKiloclawInstance
type and list/get queries updated to include organization_id.
2. migrated field now sourced from the Registry DO's actual migration
state (listAllInstances returns { entries, migrated }) instead of
hardcoded true.
3. DO finalization registry cleanup now logs success, legacy-path
success, and no-op (already cleaned / never existed) with context
for each case.
Summary
KiloClawRegistryDurable Object — a per-owner SQLite-backed index (via Drizzle ORM) that maps instance IDs to DO keys. Keyed byuser:{userId}ororg:{orgId}. Supports lazy migration from Postgres for legacy instances on first access.idFromName(instanceId)withki_-prefixed sandboxIds. Legacy instances remain atidFromName(userId)and are backfilled into the registry via lazy migration.sandboxIdfrom the DO'sgetStatus()for gateway token derivation, instead of the middleware-derived value. This is critical: instance-keyed DOs derive sandboxId from instanceId, which differs fromsandboxIdFromUserId(). Without this, all new provisions would have gateway token mismatches.instanceIdend-to-end through every lifecycle caller: all ~30 platform routes, all ~30 internal client methods, all tRPC router methods, admin routers, controller heartbeat, and snapshot-restore queue. This makes the PR self-contained — no follow-up PR required before deploy.isInstanceKeyedSandboxId()andinstanceIdFromSandboxId()to@kilocode/worker-utils/instance-idfor reverse-mappingki_sandboxIds to instance UUIDs (used by controller heartbeat).restoreFromPostgresacceptsopts.sandboxIdfor precise multi-instance lookup instead of ambiguousgetActiveInstance(db, userId).ensureActiveInstancesupports org instances with instance-keyed sandboxId derivation (sandboxIdFromInstanceId).Verification
pnpm typecheck(kiloclaw worker) — passpnpm typecheck(root / Next.js) — passpnpm test(kiloclaw) — 48 files, 1125 tests, all passingpnpm lint(kiloclaw) — 0 warnings, 0 errorspnpm format:check— pass (pre-push hook)manual tests - pre-existing instance (i.e. created before this branch)
newly created instance (created with this branch)
Visual Changes
N/A
Reviewer Notes
~/fd-plans/kiloclaw/multi-instance-deviations.md— 24 deviations logged with rationale, including: Drizzle instead of gastown raw SQL, all new provisions instance-keyed (scope expansion), catch-all proxy sandboxId from DO status, ownerKey-as-param pattern, and full lifecycle threading pulled from PR 3 into PR 2.resolveRegistryEntryfallback toidFromName(userId)only helps legacy instances; for instance-keyed DOs it returns "not provisioned" until the registry recovers.getStatus()call — the controller checkin now makes one extra DO RPC to resolve the real userId for PostHog attribution and instance-ready emails. This is on the checkin hot path (~every 60s per instance) butgetStatus()is a lightweight in-memory read.ki_prefix on sandboxIds is the discriminator between legacy (base64url) and instance-keyed identity. All gateway token derivation, controller routing, and metadata recovery depend on it.