Skip to content

feat(kiloclaw): add KiloClawRegistry DO and complete instance-keyed identity migration#1706

Merged
pandemicsyn merged 16 commits intomainfrom
florian/feat/registry-do-proxy-cutover
Mar 31, 2026
Merged

feat(kiloclaw): add KiloClawRegistry DO and complete instance-keyed identity migration#1706
pandemicsyn merged 16 commits intomainfrom
florian/feat/registry-do-proxy-cutover

Conversation

@pandemicsyn
Copy link
Copy Markdown
Contributor

@pandemicsyn pandemicsyn commented Mar 29, 2026

Summary

  • Adds the KiloClawRegistry Durable Object — a per-owner SQLite-backed index (via Drizzle ORM) that maps instance IDs to DO keys. Keyed by user:{userId} or org:{orgId}. Supports lazy migration from Postgres for legacy instances on first access.
  • All new provisions (personal and org) now create Instance DOs keyed by idFromName(instanceId) with ki_-prefixed sandboxIds. Legacy instances remain at idFromName(userId) and are backfilled into the registry via lazy migration.
  • The catch-all proxy reads sandboxId from the DO's getStatus() for gateway token derivation, instead of the middleware-derived value. This is critical: instance-keyed DOs derive sandboxId from instanceId, which differs from sandboxIdFromUserId(). Without this, all new provisions would have gateway token mismatches.
  • Threads instanceId end-to-end through every lifecycle caller: all ~30 platform routes, all ~30 internal client methods, all tRPC router methods, admin routers, controller heartbeat, and snapshot-restore queue. This makes the PR self-contained — no follow-up PR required before deploy.
  • Adds isInstanceKeyedSandboxId() and instanceIdFromSandboxId() to @kilocode/worker-utils/instance-id for reverse-mapping ki_ sandboxIds to instance UUIDs (used by controller heartbeat).
  • restoreFromPostgres accepts opts.sandboxId for precise multi-instance lookup instead of ambiguous getActiveInstance(db, userId).
  • ensureActiveInstance supports org instances with instance-keyed sandboxId derivation (sandboxIdFromInstanceId).

Verification

  • pnpm typecheck (kiloclaw worker) — pass
  • pnpm typecheck (root / Next.js) — pass
  • pnpm test (kiloclaw) — 48 files, 1125 tests, all passing
  • pnpm lint (kiloclaw) — 0 warnings, 0 errors
  • pnpm format:check — pass (pre-push hook)

manual tests - pre-existing instance (i.e. created before this branch)

  • admin ui loads fine, and ops work
  • kilo ui loads fine, and ops work (tested redeploy, model change)
  • kilo proxy correctly sets gateway token and proxies traffic to openclaw gateway
  • openclaw control ui works
  • openclaw doctor work
  • destroy works

newly created instance (created with this branch)

  • admin fine
  • kilo ui loads and ops work
  • openclaw control ui works
  • pairing channel setup works
  • destroy works

Visual Changes

N/A

Reviewer Notes

  • Deviations from plan documented at ~/fd-plans/kiloclaw/multi-instance-deviations.md — 24 deviations logged with rationale, including: Drizzle instead of gastown raw SQL, all new provisions instance-keyed (scope expansion), catch-all proxy sandboxId from DO status, ownerKey-as-param pattern, and full lifecycle threading pulled from PR 3 into PR 2.
  • Registry operations are best-effort — provision and destroy succeed even if the registry is unavailable. The resolveRegistryEntry fallback to idFromName(userId) only helps legacy instances; for instance-keyed DOs it returns "not provisioned" until the registry recovers.
  • Lazy migration retry has a 60-second cooldown to avoid hammering Hyperdrive during outages.
  • Controller getStatus() call — the controller checkin now makes one extra DO RPC to resolve the real userId for PostHog attribution and instance-ready emails. This is on the checkin hot path (~every 60s per instance) but getStatus() is a lightweight in-memory read.
  • The ki_ prefix on sandboxIds is the discriminator between legacy (base64url) and instance-keyed identity. All gateway token derivation, controller routing, and metadata recovery depend on it.

@pandemicsyn pandemicsyn force-pushed the florian/feat/registry-do-proxy-cutover branch from f1a4338 to 06eca4b Compare March 30, 2026 17:22
@pandemicsyn pandemicsyn marked this pull request as ready for review March 30, 2026 17:22
// subsequent reads converge on the oldest row. The second row is an inert
// orphan (no DO created for it). The window is milliseconds on a user-
// initiated action already deduplicated by the frontend's useMutation.
const existing = await getActiveInstance(userId);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Check-then-insert can create multiple active personal instances

Now that personal rows get a fresh sandbox_id from instanceId, two concurrent calls can both miss getActiveInstance(userId) and each insert a different active row. That is no longer protected by the old (user_id, sandbox_id) uniqueness path, so a double-submit/retry can leave multiple active personal instances (and duplicate subscription rows, since kiloclaw_subscriptions is only unique on instance_id).

Copy link
Copy Markdown
Contributor Author

@pandemicsyn pandemicsyn Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, hence the code comment - but we're going to be supporting multiple instances per user very shortly like this week.

@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot bot commented Mar 30, 2026

Code Review Summary

Status: 3 Issues Found | Recommendation: Address before merge

Overview

Severity Count
CRITICAL 0
WARNING 3
SUGGESTION 0
Issue Details (click to expand)

No new issues found in the incremental diff.

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

File Line Issue
src/lib/kiloclaw/billing-lifecycle-cron.ts 695 Lifecycle sweeps now stop or destroy one instance, but the Postgres suspension and destruction updates still target every row for the user
src/routers/admin-router.ts 687 Admin trial reset still starts whichever active instance LIMIT 1 returns instead of the reset subscription's instance
src/lib/kiloclaw/instance-registry.ts 105 Check-then-insert can create multiple active personal instances

Fix these issues in Kilo Cloud

Files Reviewed (7 files)
  • kiloclaw/src/durable-objects/kiloclaw-instance/index.ts
  • kiloclaw/src/durable-objects/kiloclaw-registry.ts
  • kiloclaw/src/routes/platform.ts - previous registry inspection issue fixed
  • src/app/admin/components/KiloclawInstances/KiloclawInstanceDetail.tsx
  • src/lib/kiloclaw/kiloclaw-internal-client.ts
  • src/lib/kiloclaw/types.ts
  • src/routers/admin-kiloclaw-instances-router.ts - previous destroy/dev-nuke DO targeting issues fixed

Reviewed by gpt-5.4-20260305 · 901,410 tokens

// leave expired accounts active.
try {
await client.stop(row.user_id);
await client.stop(row.user_id, row.instance_id ?? undefined);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Lifecycle sweeps are only half-migrated to per-instance state

This stop call is now scoped to row.instance_id, but the Postgres mutations in sweeps 1-4 still key off user_id only (kiloclaw_subscriptions in the suspension paths and kiloclaw_instances in the destruction path). Once a user has multiple subscribed instances, expiring one row here will suspend or destroy unrelated instances as well. Those updates need the same instance_id scoping, with the old user-wide fallback reserved for legacy rows.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only a problem when a user has multiple subscribed instances — which doesn't happen yet.

try {
const client = new KiloClawInternalClient();
await client.start(input.userId);
await client.start(input.userId, activeInstance.id);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Trial reset still starts an arbitrary instance

activeInstance is still whichever undestroyed row the bare LIMIT 1 returns for this user, so threading activeInstance.id here does not guarantee you're restarting the subscription row that was just reset. In the per-instance subscription model this can bring up the wrong DO while leaving the intended instance suspended.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, Currently one personal instance per user, so LIMIT 1 returns the right one.

…y through registry

Add the KiloClawRegistry Durable Object (SQLite-backed via Drizzle ORM)
that indexes instances per owner (user or org). Wire provision, destroy,
and catch-all proxy flows through the registry. Enable lazy migration of
legacy instances from Postgres on first access.

Key changes:
- KiloClawRegistry DO with listInstances, createInstance, destroyInstance,
  resolveDoKey, findInstancesForUser methods
- Lazy migration: reads legacy instance from Postgres via Hyperdrive on
  first listInstances() call, with 60s retry cooldown
- Catch-all proxy reads sandboxId from DO status (not middleware) for
  gateway token derivation — critical for instance-keyed DOs using ki_
  sandboxIds
- Registry create/destroy are best-effort (non-fatal errors)
- resolveRegistryEntry falls back to legacy idFromName(userId) on
  registry failure
- ensureActiveInstance supports org instances with instance-keyed
  sandboxId derivation
- restoreFromPostgres accepts opts.sandboxId for precise multi-instance
  lookup
- tRPC router threads instanceId to worker for all provisions/destroys
Complete the instance-keyed DO migration by threading instanceId
through every caller that resolves a KiloClawInstance DO stub:

Worker:
- All ~30 platform.ts routes now parse ?instanceId= and pass to
  instanceStubFactory (3-arg calls)
- controller.ts handles ki_ sandboxIds via isInstanceKeyedSandboxId
  to resolve the correct DO key
- Snapshot-restore queue message includes optional instanceId;
  consumer uses it as DO key when present

Internal client:
- All ~30 instance-scoped methods accept optional instanceId as
  last parameter, forwarded as ?instanceId= query param

Next.js callers:
- All tRPC router methods call getActiveInstance(userId) and pass
  instance?.id to internal client
- Admin router methods pass instance.id from DB lookups
- Billing cron + autoResumeIfSuspended already had instanceId
  (verified pre-existing)

New exports from @kilocode/worker-utils/instance-id:
- isInstanceKeyedSandboxId(sandboxId): boolean
- instanceIdFromSandboxId(sandboxId): string
The controller checkin route used instanceId as a placeholder for userId
when handling ki_ sandboxIds. This caused PostHog attribution and
instance-ready emails to silently fail for instance-keyed DOs.

Fix: call stub.getStatus() after auth to read the real userId from the
DO, which always stores it during provision.
…row targeting

Three identity consistency fixes:

1. ensureActiveInstance personal flow now writes instance-keyed sandboxId
   (ki_ prefix) matching what the DO stores. Previously wrote
   sandboxIdFromUserId which diverged from the DO's
   sandboxIdFromInstanceId, breaking restoreFromPostgres lookups.
   getActiveInstance/markActiveInstanceDestroyed/renameInstance updated
   to find rows without filtering by sandboxId format.

2. Lazy migration derives doKey from Postgres row's sandboxId format
   (ki_ prefix → instanceId, base64url → userId) instead of blindly
   writing doKey=userId. Prevents wrong DO routing after transient
   registry create failures.

3. Admin destroy and devNukeAll pass instance.id to
   markActiveInstanceDestroyed for explicit row targeting instead of
   legacy sandboxId matching.
…ated_at

Add ORDER BY created_at ASC to getActiveInstance so LIMIT 1 always
returns the oldest active row. Documents the known race window in
ensureActiveInstance where concurrent callers can both insert — the
consequence is a benign orphan row since all subsequent reads converge
on the same (oldest) row.
Always attempt user registry cleanup on destroy. When getStatus()
succeeds and reveals an orgId, also clean up the org registry. When
getStatus() fails, log a warning — the org registry entry becomes
stale but harmless (points to a destroyed DO with no machineId).
getOrCreateInstanceForBilling searched for the destroyed personal
instance by sandboxIdFromUserId, which doesn't match instance-keyed
rows (ki_ prefix). Now finds the most recently destroyed personal
instance by user_id + organization_id IS NULL without filtering by
sandboxId format, matching the pattern used by getActiveInstance.
The access gateway's buildRedirectUrl derived the gateway token from
sandboxIdFromUserId(userId), but instance-keyed DOs use
sandboxIdFromInstanceId(instanceId) — a different sandboxId. This
caused token_mismatch on every new instance's WebSocket auth.

Fix: resolve the DO's actual sandboxId via the registry + getStatus()
before deriving the token. Falls back to legacy derivation if the DO
is unreachable.
Three files with instanceId threading were in the git stash during
rebase and not included in the lifecycle threading commit:
- billing-lifecycle-cron.ts: pass instance_id to client.stop/destroy
- instance-lifecycle.ts: formatting only (already had instanceId)
- admin-router.ts: pass activeInstance.id to client.start
Legacy instances have DOs at idFromName(userId), not
idFromName(instanceId). Passing instance.id as instanceId to the
worker causes it to resolve an empty DO — breaking status, start,
stop, destroy, and all lifecycle operations for existing instances
after upgrading to this branch.

Add workerInstanceId() helper that checks the row's sandboxId: only
returns instance.id for ki_-prefixed (instance-keyed) rows. Returns
undefined for legacy base64url rows so the worker falls back to
idFromName(userId).

Applied across all callers: tRPC router, admin router, and admin
instances router.
…sandbox_id

Two fixes:

1. access-gateway resolveSandboxId: when getStatus() throws but the
   registry entry exists, derive sandboxId from the entry's doKey
   (sandboxIdFromInstanceId for UUID doKeys) instead of falling back
   to sandboxIdFromUserId. Prevents wrong gateway token for
   instance-keyed DOs when the DO is temporarily unreachable.

2. Admin destroy and devNukeAll queries now select sandbox_id so
   workerInstanceId() can check the ki_ prefix. Without it, the
   function always returned undefined, routing to the legacy
   userId-keyed DO instead of the instance-keyed one.
…ceId)

The destroy route only cleaned up registry entries when instanceId was
provided. Legacy destroys (no instanceId, DO keyed by userId) skipped
the cleanup entirely, leaving stale registry entries that point to
destroyed DOs. On re-provision, the proxy picks the stale entry
(oldest) instead of the new instance's entry.

Fix: when no instanceId is provided, list the user registry's entries
and find the one with doKey=userId (the legacy entry), then destroy it
by its instanceId from the registry.
@pandemicsyn pandemicsyn force-pushed the florian/feat/registry-do-proxy-cutover branch from 0bf7eb9 to 72e49b4 Compare March 31, 2026 12:59
Registry lifecycle:
- Instance DO destroy() now cleans up registry entries on finalization,
  covering alarm-initiated destroys that bypass the platform route
- Production logging on registry create/destroy in platform routes

Admin observability:
- New GET /api/platform/registry-entries endpoint returns all entries
  (including destroyed) for a user's registry
- Admin instance detail page shows Registry Status card with all entries,
  doKey, timestamps, and current/active badges
- orgId added to PlatformDebugStatusResponse and shown in admin UI

Known gap documented: GDPR soft-delete does not clean up registry or
Fly resources (deferred to follow-up PR).
…ration status, detailed logging

Three review fixes:

1. Registry admin endpoint now accepts optional orgId param and queries
   both personal (user:userId) and org (org:orgId) registries. Admin
   instance detail page passes organization_id. AdminKiloclawInstance
   type and list/get queries updated to include organization_id.

2. migrated field now sourced from the Registry DO's actual migration
   state (listAllInstances returns { entries, migrated }) instead of
   hardcoded true.

3. DO finalization registry cleanup now logs success, legacy-path
   success, and no-op (already cleaned / never existed) with context
   for each case.
@pandemicsyn pandemicsyn enabled auto-merge (squash) March 31, 2026 14:55
@pandemicsyn pandemicsyn merged commit d392608 into main Mar 31, 2026
32 checks passed
@pandemicsyn pandemicsyn deleted the florian/feat/registry-do-proxy-cutover branch March 31, 2026 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants