feat(kiloclaw): add KiloClawRegistry DO and complete instance-keyed identity migration by pandemicsyn · Pull Request #1706 · Kilo-Org/cloud

pandemicsyn · 2026-03-29T21:35:16Z

Summary

Adds the KiloClawRegistry Durable Object — a per-owner SQLite-backed index (via Drizzle ORM) that maps instance IDs to DO keys. Keyed by user:{userId} or org:{orgId}. Supports lazy migration from Postgres for legacy instances on first access.
All new provisions (personal and org) now create Instance DOs keyed by idFromName(instanceId) with ki_-prefixed sandboxIds. Legacy instances remain at idFromName(userId) and are backfilled into the registry via lazy migration.
The catch-all proxy reads sandboxId from the DO's getStatus() for gateway token derivation, instead of the middleware-derived value. This is critical: instance-keyed DOs derive sandboxId from instanceId, which differs from sandboxIdFromUserId(). Without this, all new provisions would have gateway token mismatches.
Threads instanceId end-to-end through every lifecycle caller: all ~30 platform routes, all ~30 internal client methods, all tRPC router methods, admin routers, controller heartbeat, and snapshot-restore queue. This makes the PR self-contained — no follow-up PR required before deploy.
Adds isInstanceKeyedSandboxId() and instanceIdFromSandboxId() to @kilocode/worker-utils/instance-id for reverse-mapping ki_ sandboxIds to instance UUIDs (used by controller heartbeat).
restoreFromPostgres accepts opts.sandboxId for precise multi-instance lookup instead of ambiguous getActiveInstance(db, userId).
ensureActiveInstance supports org instances with instance-keyed sandboxId derivation (sandboxIdFromInstanceId).

Verification

pnpm typecheck (kiloclaw worker) — pass
pnpm typecheck (root / Next.js) — pass
pnpm test (kiloclaw) — 48 files, 1125 tests, all passing
pnpm lint (kiloclaw) — 0 warnings, 0 errors
pnpm format:check — pass (pre-push hook)

manual tests - pre-existing instance (i.e. created before this branch)

admin ui loads fine, and ops work
kilo ui loads fine, and ops work (tested redeploy, model change)
kilo proxy correctly sets gateway token and proxies traffic to openclaw gateway
openclaw control ui works
openclaw doctor work
destroy works

newly created instance (created with this branch)

admin fine
kilo ui loads and ops work
openclaw control ui works
pairing channel setup works
destroy works

Visual Changes

N/A

Reviewer Notes

Deviations from plan documented at ~/fd-plans/kiloclaw/multi-instance-deviations.md — 24 deviations logged with rationale, including: Drizzle instead of gastown raw SQL, all new provisions instance-keyed (scope expansion), catch-all proxy sandboxId from DO status, ownerKey-as-param pattern, and full lifecycle threading pulled from PR 3 into PR 2.
Registry operations are best-effort — provision and destroy succeed even if the registry is unavailable. The resolveRegistryEntry fallback to idFromName(userId) only helps legacy instances; for instance-keyed DOs it returns "not provisioned" until the registry recovers.
Lazy migration retry has a 60-second cooldown to avoid hammering Hyperdrive during outages.
Controller getStatus() call — the controller checkin now makes one extra DO RPC to resolve the real userId for PostHog attribution and instance-ready emails. This is on the checkin hot path (~every 60s per instance) but getStatus() is a lightweight in-memory read.
The ki_ prefix on sandboxIds is the discriminator between legacy (base64url) and instance-keyed identity. All gateway token derivation, controller routing, and metadata recovery depend on it.

kilo-code-bot · 2026-03-30T17:29:25Z

src/lib/kiloclaw/instance-registry.ts

+  // subsequent reads converge on the oldest row. The second row is an inert
+  // orphan (no DO created for it). The window is milliseconds on a user-
+  // initiated action already deduplicated by the frontend's useMutation.
+  const existing = await getActiveInstance(userId);


WARNING: Check-then-insert can create multiple active personal instances

Now that personal rows get a fresh sandbox_id from instanceId, two concurrent calls can both miss getActiveInstance(userId) and each insert a different active row. That is no longer protected by the old (user_id, sandbox_id) uniqueness path, so a double-submit/retry can leave multiple active personal instances (and duplicate subscription rows, since kiloclaw_subscriptions is only unique on instance_id).

yea, hence the code comment - but we're going to be supporting multiple instances per user very shortly like this week.

kiloclaw/src/routes/platform.ts

kilo-code-bot · 2026-03-30T17:30:02Z

Code Review Summary

Status: 3 Issues Found | Recommendation: Address before merge

Overview

Severity	Count
CRITICAL	0
WARNING	3
SUGGESTION	0

Issue Details (click to expand)

No new issues found in the incremental diff.

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

File	Line	Issue
`src/lib/kiloclaw/billing-lifecycle-cron.ts`	695	Lifecycle sweeps now stop or destroy one instance, but the Postgres suspension and destruction updates still target every row for the user
`src/routers/admin-router.ts`	687	Admin trial reset still starts whichever active instance `LIMIT 1` returns instead of the reset subscription's instance
`src/lib/kiloclaw/instance-registry.ts`	105	Check-then-insert can create multiple active personal instances

Fix these issues in Kilo Cloud

Files Reviewed (7 files)

kiloclaw/src/durable-objects/kiloclaw-instance/index.ts
kiloclaw/src/durable-objects/kiloclaw-registry.ts
kiloclaw/src/routes/platform.ts - previous registry inspection issue fixed
src/app/admin/components/KiloclawInstances/KiloclawInstanceDetail.tsx
src/lib/kiloclaw/kiloclaw-internal-client.ts
src/lib/kiloclaw/types.ts
src/routers/admin-kiloclaw-instances-router.ts - previous destroy/dev-nuke DO targeting issues fixed

_{Reviewed by gpt-5.4-20260305 · 901,410 tokens}

kiloclaw/src/routes/access-gateway.ts

kilo-code-bot · 2026-03-30T19:31:36Z

src/lib/kiloclaw/billing-lifecycle-cron.ts

      // leave expired accounts active.
      try {
-        await client.stop(row.user_id);
+        await client.stop(row.user_id, row.instance_id ?? undefined);


WARNING: Lifecycle sweeps are only half-migrated to per-instance state

This stop call is now scoped to row.instance_id, but the Postgres mutations in sweeps 1-4 still key off user_id only (kiloclaw_subscriptions in the suspension paths and kiloclaw_instances in the destruction path). Once a user has multiple subscribed instances, expiring one row here will suspend or destroy unrelated instances as well. Those updates need the same instance_id scoping, with the old user-wide fallback reserved for legacy rows.

This is only a problem when a user has multiple subscribed instances — which doesn't happen yet.

kilo-code-bot · 2026-03-30T19:31:36Z

src/routers/admin-router.ts

            try {
              const client = new KiloClawInternalClient();
-              await client.start(input.userId);
+              await client.start(input.userId, activeInstance.id);


WARNING: Trial reset still starts an arbitrary instance

activeInstance is still whichever undestroyed row the bare LIMIT 1 returns for this user, so threading activeInstance.id here does not guarantee you're restarting the subscription row that was just reset. In the per-instance subscription model this can bring up the wrong DO while leaving the intended instance suspended.

same, Currently one personal instance per user, so LIMIT 1 returns the right one.

src/routers/admin-kiloclaw-instances-router.ts

…y through registry Add the KiloClawRegistry Durable Object (SQLite-backed via Drizzle ORM) that indexes instances per owner (user or org). Wire provision, destroy, and catch-all proxy flows through the registry. Enable lazy migration of legacy instances from Postgres on first access. Key changes: - KiloClawRegistry DO with listInstances, createInstance, destroyInstance, resolveDoKey, findInstancesForUser methods - Lazy migration: reads legacy instance from Postgres via Hyperdrive on first listInstances() call, with 60s retry cooldown - Catch-all proxy reads sandboxId from DO status (not middleware) for gateway token derivation — critical for instance-keyed DOs using ki_ sandboxIds - Registry create/destroy are best-effort (non-fatal errors) - resolveRegistryEntry falls back to legacy idFromName(userId) on registry failure - ensureActiveInstance supports org instances with instance-keyed sandboxId derivation - restoreFromPostgres accepts opts.sandboxId for precise multi-instance lookup - tRPC router threads instanceId to worker for all provisions/destroys

Complete the instance-keyed DO migration by threading instanceId through every caller that resolves a KiloClawInstance DO stub: Worker: - All ~30 platform.ts routes now parse ?instanceId= and pass to instanceStubFactory (3-arg calls) - controller.ts handles ki_ sandboxIds via isInstanceKeyedSandboxId to resolve the correct DO key - Snapshot-restore queue message includes optional instanceId; consumer uses it as DO key when present Internal client: - All ~30 instance-scoped methods accept optional instanceId as last parameter, forwarded as ?instanceId= query param Next.js callers: - All tRPC router methods call getActiveInstance(userId) and pass instance?.id to internal client - Admin router methods pass instance.id from DB lookups - Billing cron + autoResumeIfSuspended already had instanceId (verified pre-existing) New exports from @kilocode/worker-utils/instance-id: - isInstanceKeyedSandboxId(sandboxId): boolean - instanceIdFromSandboxId(sandboxId): string

The controller checkin route used instanceId as a placeholder for userId when handling ki_ sandboxIds. This caused PostHog attribution and instance-ready emails to silently fail for instance-keyed DOs. Fix: call stub.getStatus() after auth to read the real userId from the DO, which always stores it during provision.

…aram

…row targeting Three identity consistency fixes: 1. ensureActiveInstance personal flow now writes instance-keyed sandboxId (ki_ prefix) matching what the DO stores. Previously wrote sandboxIdFromUserId which diverged from the DO's sandboxIdFromInstanceId, breaking restoreFromPostgres lookups. getActiveInstance/markActiveInstanceDestroyed/renameInstance updated to find rows without filtering by sandboxId format. 2. Lazy migration derives doKey from Postgres row's sandboxId format (ki_ prefix → instanceId, base64url → userId) instead of blindly writing doKey=userId. Prevents wrong DO routing after transient registry create failures. 3. Admin destroy and devNukeAll pass instance.id to markActiveInstanceDestroyed for explicit row targeting instead of legacy sandboxId matching.

…ated_at Add ORDER BY created_at ASC to getActiveInstance so LIMIT 1 always returns the oldest active row. Documents the known race window in ensureActiveInstance where concurrent callers can both insert — the consequence is a benign orphan row since all subsequent reads converge on the same (oldest) row.

Always attempt user registry cleanup on destroy. When getStatus() succeeds and reveals an orgId, also clean up the org registry. When getStatus() fails, log a warning — the org registry entry becomes stale but harmless (points to a destroyed DO with no machineId).

getOrCreateInstanceForBilling searched for the destroyed personal instance by sandboxIdFromUserId, which doesn't match instance-keyed rows (ki_ prefix). Now finds the most recently destroyed personal instance by user_id + organization_id IS NULL without filtering by sandboxId format, matching the pattern used by getActiveInstance.

The access gateway's buildRedirectUrl derived the gateway token from sandboxIdFromUserId(userId), but instance-keyed DOs use sandboxIdFromInstanceId(instanceId) — a different sandboxId. This caused token_mismatch on every new instance's WebSocket auth. Fix: resolve the DO's actual sandboxId via the registry + getStatus() before deriving the token. Falls back to legacy derivation if the DO is unreachable.

Three files with instanceId threading were in the git stash during rebase and not included in the lifecycle threading commit: - billing-lifecycle-cron.ts: pass instance_id to client.stop/destroy - instance-lifecycle.ts: formatting only (already had instanceId) - admin-router.ts: pass activeInstance.id to client.start

Legacy instances have DOs at idFromName(userId), not idFromName(instanceId). Passing instance.id as instanceId to the worker causes it to resolve an empty DO — breaking status, start, stop, destroy, and all lifecycle operations for existing instances after upgrading to this branch. Add workerInstanceId() helper that checks the row's sandboxId: only returns instance.id for ki_-prefixed (instance-keyed) rows. Returns undefined for legacy base64url rows so the worker falls back to idFromName(userId). Applied across all callers: tRPC router, admin router, and admin instances router.

…sandbox_id Two fixes: 1. access-gateway resolveSandboxId: when getStatus() throws but the registry entry exists, derive sandboxId from the entry's doKey (sandboxIdFromInstanceId for UUID doKeys) instead of falling back to sandboxIdFromUserId. Prevents wrong gateway token for instance-keyed DOs when the DO is temporarily unreachable. 2. Admin destroy and devNukeAll queries now select sandbox_id so workerInstanceId() can check the ki_ prefix. Without it, the function always returned undefined, routing to the legacy userId-keyed DO instead of the instance-keyed one.

…ceId) The destroy route only cleaned up registry entries when instanceId was provided. Legacy destroys (no instanceId, DO keyed by userId) skipped the cleanup entirely, leaving stale registry entries that point to destroyed DOs. On re-provision, the proxy picks the stale entry (oldest) instead of the new instance's entry. Fix: when no instanceId is provided, list the user registry's entries and find the one with doKey=userId (the legacy entry), then destroy it by its instanceId from the registry.

Registry lifecycle: - Instance DO destroy() now cleans up registry entries on finalization, covering alarm-initiated destroys that bypass the platform route - Production logging on registry create/destroy in platform routes Admin observability: - New GET /api/platform/registry-entries endpoint returns all entries (including destroyed) for a user's registry - Admin instance detail page shows Registry Status card with all entries, doKey, timestamps, and current/active badges - orgId added to PlatformDebugStatusResponse and shown in admin UI Known gap documented: GDPR soft-delete does not clean up registry or Fly resources (deferred to follow-up PR).

kiloclaw/src/routes/platform.ts

…ration status, detailed logging Three review fixes: 1. Registry admin endpoint now accepts optional orgId param and queries both personal (user:userId) and org (org:orgId) registries. Admin instance detail page passes organization_id. AdminKiloclawInstance type and list/get queries updated to include organization_id. 2. migrated field now sourced from the Registry DO's actual migration state (listAllInstances returns { entries, migrated }) instead of hardcoded true. 3. DO finalization registry cleanup now logs success, legacy-path success, and no-op (already cleaned / never existed) with context for each case.

pandemicsyn force-pushed the florian/feat/registry-do-proxy-cutover branch from f1a4338 to 06eca4b Compare March 30, 2026 17:22

pandemicsyn marked this pull request as ready for review March 30, 2026 17:22

kilo-code-bot bot reviewed Mar 30, 2026

View reviewed changes

kiloclaw/src/routes/access-gateway.ts Show resolved Hide resolved

kilo-code-bot bot reviewed Mar 30, 2026

View reviewed changes

src/routers/admin-kiloclaw-instances-router.ts Show resolved Hide resolved

src/routers/admin-kiloclaw-instances-router.ts Show resolved Hide resolved

jeanduplessis approved these changes Mar 30, 2026

View reviewed changes

St0rmz1 approved these changes Mar 30, 2026

View reviewed changes

pandemicsyn added 13 commits March 31, 2026 07:50

fix(test): update admin destroyFlyMachine assertions for instanceId p…

5334409

…aram

pandemicsyn force-pushed the florian/feat/registry-do-proxy-cutover branch from 0bf7eb9 to 72e49b4 Compare March 31, 2026 12:59

kilo-code-bot bot reviewed Mar 31, 2026

View reviewed changes

kiloclaw/src/routes/platform.ts Outdated Show resolved Hide resolved

pandemicsyn added 2 commits March 31, 2026 08:47

Merge branch 'main' into florian/feat/registry-do-proxy-cutover

fcc0bc5

pandemicsyn enabled auto-merge (squash) March 31, 2026 14:55

pandemicsyn merged commit d392608 into main Mar 31, 2026
32 checks passed

pandemicsyn deleted the florian/feat/registry-do-proxy-cutover branch March 31, 2026 14:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(kiloclaw): add KiloClawRegistry DO and complete instance-keyed identity migration#1706

feat(kiloclaw): add KiloClawRegistry DO and complete instance-keyed identity migration#1706
pandemicsyn merged 16 commits intomainfrom
florian/feat/registry-do-proxy-cutover

pandemicsyn commented Mar 29, 2026 •

edited

Loading

Uh oh!

kilo-code-bot bot Mar 30, 2026

Uh oh!

pandemicsyn Mar 30, 2026 •

edited

Loading

Uh oh!

Uh oh!

kilo-code-bot bot commented Mar 30, 2026 •

edited

Loading

Uh oh!

Uh oh!

kilo-code-bot bot Mar 30, 2026

Uh oh!

pandemicsyn Mar 30, 2026

Uh oh!

kilo-code-bot bot Mar 30, 2026

Uh oh!

pandemicsyn Mar 30, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

pandemicsyn commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Verification

Visual Changes

Reviewer Notes

Uh oh!

kilo-code-bot bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

pandemicsyn Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kilo-code-bot bot commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Summary

Overview

Uh oh!

Uh oh!

kilo-code-bot bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

pandemicsyn Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

kilo-code-bot bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

pandemicsyn Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pandemicsyn commented Mar 29, 2026 •

edited

Loading

pandemicsyn Mar 30, 2026 •

edited

Loading

kilo-code-bot bot commented Mar 30, 2026 •

edited

Loading