feat(agents): add pull-wake runner health check and rename owner_user_id to owner_principal #4339

Open
KyleAMathews wants to merge 12 commits into main from fix-pull-wake

@KyleAMathews

Summary

Adds a health check endpoint for pull-wake runners (GET /_electric/runners/:id/health) and renames owner_user_id → owner_principal across the entire runners system. The health endpoint provides comprehensive diagnostics for debugging unreliable pull-wake dispatch, and the rename aligns the runners table with the principal identity model (URLs, not keys).

Root Cause

The pull-wake dispatch system had no observability — when runners failed to receive or process wakes, there was no way to diagnose what was wrong. Additionally, owner_user_id was a misnomer since principals include agents, services, and system actors, not just users.

Approach

Three-layer health diagnostics:

  1. Client-side (PullWakeRunner): Tracks 16 diagnostic fields — stream connection state, heartbeat results, claim outcomes (claimed/no_work/error), dispatch timestamps, reconnect counts. Reports these to the server via the existing heartbeat POST.

  2. Server-side storage: New diagnostics JSONB column on the runners table. The heartbeat handler persists client-reported diagnostics, making them available via Electric Shape sync for multi-device runner status.

  3. Health endpoint: GET /_electric/runners/:id/health aggregates runner state, client diagnostics, active claims, and dispatch stats. Derives a health status (healthy/degraded/unhealthy) from rules like lease expiry, stream connectivity, and heartbeat success.
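
A minimal sketch of how the status derivation described above could work. The field names (`leaseExpiresAt`, `streamConnected`, `lastHeartbeatOk`), the `deriveHealth` helper, and the mapping of each rule to a severity are assumptions made for illustration; the PR's actual `RunnerHealthResponse` shape and rules are not shown in this description.

```typescript
// Hypothetical sketch: names and rule severities are assumptions, not the PR's code.
type HealthStatus = "healthy" | "degraded" | "unhealthy";

interface RunnerSnapshot {
  leaseExpiresAt: number; // epoch milliseconds
  streamConnected: boolean;
  lastHeartbeatOk: boolean;
}

const SEVERITY: Record<HealthStatus, number> = {
  healthy: 0,
  degraded: 1,
  unhealthy: 2,
};

// Returns whichever status is more severe, so a later rule can never lower it.
function escalate(current: HealthStatus, next: HealthStatus): HealthStatus {
  return SEVERITY[next] > SEVERITY[current] ? next : current;
}

function deriveHealth(
  s: RunnerSnapshot,
  now: number,
): { status: HealthStatus; reasons: string[] } {
  let status: HealthStatus = "healthy";
  const reasons: string[] = [];

  if (s.leaseExpiresAt <= now) {
    status = escalate(status, "unhealthy");
    reasons.push("lease expired");
  }
  if (!s.streamConnected) {
    status = escalate(status, "degraded");
    reasons.push("stream disconnected");
  }
  if (!s.lastHeartbeatOk) {
    status = escalate(status, "degraded");
    reasons.push("last heartbeat failed");
  }
  return { status, reasons };
}
```

Routing every rule through `escalate` keeps the result independent of rule order: an expired lease still yields `unhealthy` even though the two `degraded` rules run after it.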

Principal rename:

  • DB migration expires active claims, clears dispatch state, deletes runners, renames column, adds diagnostics column
  • All callers now send principal URLs (/principal/user%3Aalice), not keys (user:alice)
  • Strict validation via principalKeyFromUrl() — rejects invalid URLs at the API boundary
  • No backward compatibility — clean break, system is under active development
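
To illustrate the key-to-URL convention above, here is a hedged sketch of the conversion and a strict parse. `principalUrlFromKeySketch` and `principalKeyFromUrlSketch` are hypothetical stand-ins written for this example; the real `principalKeyFromUrl()` is not shown in this description and may apply different rules. The `<kind>:<id>` key shape and the canonical-encoding check are assumptions inferred from the `user:alice` example and the note that `/principal/local-desktop` is rejected.

```typescript
// Hypothetical sketch: not the repository's principalKeyFromUrl().
const PRINCIPAL_PREFIX = "/principal/";

// "user:alice" -> "/principal/user%3Aalice"
function principalUrlFromKeySketch(key: string): string {
  return PRINCIPAL_PREFIX + encodeURIComponent(key);
}

// Returns the key for a canonical principal URL, or null if the URL is invalid.
function principalKeyFromUrlSketch(url: string): string | null {
  if (!url.startsWith(PRINCIPAL_PREFIX)) return null;
  const encoded = url.slice(PRINCIPAL_PREFIX.length);
  if (encoded.length === 0 || encoded.includes("/")) return null;

  let key: string;
  try {
    key = decodeURIComponent(encoded);
  } catch {
    return null; // malformed percent-encoding
  }

  // Assumption: a key has the form "<kind>:<id>" with both parts non-empty.
  const sep = key.indexOf(":");
  if (sep <= 0 || sep === key.length - 1) return null;

  // Assumption: only the canonical encoding is accepted.
  if (principalUrlFromKeySketch(key) !== url) return null;
  return key;
}
```

Under these assumptions, a caller that still sends the bare key `user:alice` gets `null` back from the parse and can be rejected at the API boundary instead of being stored.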

Key Invariants

  • owner_principal always contains a canonical principal URL that passes principalKeyFromUrl() validation
  • Auth checks compare owner_principal against ctx.principal.url (both URL form)
  • Health status only escalates (healthy → degraded → unhealthy), never downgrades
  • Heartbeat failures don't crash the runner — they're tracked in diagnostics and reported
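
The last invariant can be sketched as follows. This is an illustration only: the `HeartbeatDiagnostics` fields, `recordHeartbeat`, and `heartbeatOnce` are hypothetical names, and clearing the last error on success is an assumed policy, not something the PR description states.

```typescript
// Hypothetical sketch: field and function names are assumptions for illustration.
interface HeartbeatDiagnostics {
  heartbeatOkCount: number;
  heartbeatErrorCount: number;
  lastHeartbeatError: string | null;
}

type HeartbeatOutcome = { ok: true } | { ok: false; error: unknown };

function newDiagnostics(): HeartbeatDiagnostics {
  return { heartbeatOkCount: 0, heartbeatErrorCount: 0, lastHeartbeatError: null };
}

// Records the outcome in diagnostics instead of letting a failure propagate.
function recordHeartbeat(diag: HeartbeatDiagnostics, outcome: HeartbeatOutcome): void {
  if (outcome.ok) {
    diag.heartbeatOkCount += 1;
    diag.lastHeartbeatError = null; // assumed policy: a success clears the last error
    return;
  }
  diag.heartbeatErrorCount += 1;
  diag.lastHeartbeatError =
    outcome.error instanceof Error ? outcome.error.message : String(outcome.error);
}

// Sends one heartbeat via the supplied `post` function. Never throws and never retries.
async function heartbeatOnce(
  post: (diagnostics: HeartbeatDiagnostics) => Promise<void>,
  diag: HeartbeatDiagnostics,
): Promise<boolean> {
  try {
    await post({ ...diag });
    recordHeartbeat(diag, { ok: true });
    return true;
  } catch (error) {
    recordHeartbeat(diag, { ok: false, error });
    return false;
  }
}
```

Because the failure is only recorded, the next successful heartbeat carries the updated counters to the server, which matches the best-effort, no-retry stance in the non-goals.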

Non-goals

  • No backward compatibility for owner_user_id — callers must update
  • No typed validation of the diagnostics JSONB on write — the schema is enforced by the client type system
  • No retry logic for heartbeat failures — they're best-effort by design

Verification

# Type check
pnpm --filter @electric-ax/agents-server exec tsc --noEmit
pnpm --filter @electric-ax/agents-runtime exec tsc --noEmit

# Tests (12 router tests, 5 pull-wake-runner tests)
pnpm --filter @electric-ax/agents-server exec vitest run test/runners-router.test.ts
pnpm --filter @electric-ax/agents-runtime exec vitest run test/pull-wake-runner.test.ts
pnpm --filter electric-ax exec vitest run test/start.test.ts

Files changed

| File | Change |
| --- | --- |
| drizzle/0007_runner_diagnostics_and_principal.sql | Migration: expire claims, clear dispatch state, delete runners, rename column, add diagnostics |
| agents-runtime/src/pull-wake-runner.ts | Add 16 diagnostic fields, getHealth(), report diagnostics in heartbeat, extract claim helpers |
| agents-server/src/db/schema.ts | Rename ownerUserId → ownerPrincipal, add diagnostics column |
| agents-server/src/electric-agents-types.ts | Rename types, add RunnerHealthResponse/RunnerHealthStatus, type client diagnostics |
| agents-server/src/entity-registry.ts | Rename params, store diagnostics, add getActiveClaimsForRunner/getDispatchStatsForRunner |
| agents-server/src/routing/runners-router.ts | Rename fields, add principal validation, add health endpoint with status derivation |
| agents-server/src/routing/dispatch-policy.ts | Auth check uses owner_principal/ctx.principal.url |
| agents-server/src/utils/server-utils.ts | Shape column allowlist: add owner_principal, diagnostics |
| agents-desktop/src/main.ts | Store principal URLs directly, derive from header |
| electric-ax/src/start.ts | Store principal URL constant, convert env var to URL |
| agents/src/server.ts | Use ownerPrincipal in registration |
| Test files (6) | Updated for principal URLs, added health endpoint edge cases |

🤖 Generated with Claude Code

KyleAMathews and others added 11 commits May 16, 2026 11:31
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dispatch-policy, server-utils, and electric-ax
…URL form, callers convert keys to URLs
…migration, drop authorization fallback

- Use principalKeyFromUrl for proper principal URL validation (rejects /principal/local-desktop)
- Migration expires active claims and clears dispatch state before deleting runners
- Desktop: don't use authorization header as principal source — return undefined and let server derive from ctx.principal.url
- listRunners validates owner_principal query param
…pal keys, complete desktop constant replacement

codecov Bot commented May 16, 2026

❌ 17 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --- | --- | --- | --- |
| 687 | 17 | 670 | 39 |
View the top 3 failed test(s) by shortest run time
test/runtime-dsl.test.ts > N: wake primitives verification > N5: runFinished wake records the finished child on the parent stream
Stack Traces | 15.1s run time
Error: Timeout (15000ms) waiting for entity history on /wake-summary-parent-n4/wake-summary-1
[
  {
    "args": {},
    "entityType": "wake-summary-parent-n4",
    "operation": "insert",
    "type": "entity_created"
  },
  {
    "from": "/principal/system%3Aruntime-dsl-test",
    "operation": "insert",
    "payload": "spawn trio",
    "type": "inbox"
  },
  {
    "key": "run-0",
    "operation": "insert",
    "status": "started",
    "type": "run"
  },
  {
    "key": "step-0",
    "operation": "insert",
    "status": "started",
    "stepNumber": 1,
    "type": "step"
  },
  {
    "key": "msg-0",
    "operation": "insert",
    "status": "streaming",
    "type": "text"
  },
  {
    "delta": "spawned:3",
    "key": "msg-0:0",
    "operation": "insert",
    "runId": "run-0",
    "textId": "msg-0",
    "type": "text_delta"
  },
  {
    "key": "msg-0",
    "operation": "update",
    "status": "completed",
    "type": "text"
  },
  {
    "finishReason": "stop",
    "key": "step-0",
    "operation": "update",
    "status": "completed",
    "stepNumber": 1,
    "type": "step"
  },
  {
    "finishReason": "stop",
    "key": "run-0",
    "operation": "update",
    "status": "completed",
    "type": "run"
  },
  {
    "key": "child:wake-summary-child-n4:wake-summary-1-alpha",
    "manifest": {
      "entityType": "wake-summary-child-n4",
      "entityUrl": "/wake-summary-child-n4/wake-summary-1-alpha",
      "id": "wake-summary-1-alpha",
      "key": "child:wake-summary-child-n4:wake-summary-1-alpha",
      "kind": "child",
      "wake": "runFinished"
    },
    "operation": "insert",
    "type": "manifest"
  },
  {
    "key": "child:wake-summary-child-n4:wake-summary-1-bravo",
    "manifest": {
      "entityType": "wake-summary-child-n4",
      "entityUrl": "/wake-summary-child-n4/wake-summary-1-bravo",
      "id": "wake-summary-1-bravo",
      "key": "child:wake-summary-child-n4:wake-summary-1-bravo",
      "kind": "child",
      "wake": "runFinished"
    },
    "operation": "insert",
    "type": "manifest"
  },
  {
    "key": "child:wake-summary-child-n4:wake-summary-1-charlie",
    "manifest": {
      "entityType": "wake-summary-child-n4",
      "entityUrl": "/wake-summary-child-n4/wake-summary-1-charlie",
      "id": "wake-summary-1-charlie",
      "key": "child:wake-summary-child-n4:wake-summary-1-charlie",
      "kind": "child",
      "wake": "runFinished"
    },
    "operation": "insert",
    "type": "manifest"
  }
]
 ❯ waitForHistory test/runtime-dsl.ts:664:11
 ❯ test/runtime-dsl.test.ts:6261:27
test/runtime-dsl.test.ts > I: peer review coordination > I1: peer review aggregates three reviewer writes through shared state
Stack Traces | 30s run time
Error: Test timed out in 30000ms.
If this is a long-running test, pass a timeout value as the last argument or configure it globally with "testTimeout".
 ❯ test/runtime-dsl.test.ts:5049:3
test/runtime-dsl.test.ts > M: deep researcher coordination > M1: researcher workers start from spawn initialMessage without an extra send
Stack Traces | 30s run time
Error: Test timed out in 30000ms.
If this is a long-running test, pass a timeout value as the last argument or configure it globally with "testTimeout".
 ❯ test/runtime-dsl.test.ts:4915:3
test/runtime-dsl.test.ts > K: wiki coordination > K9: idempotent wiki recreation does not duplicate shared article rows
Stack Traces | 30s run time
Error: Test timed out in 30000ms.
If this is a long-running test, pass a timeout value as the last argument or configure it globally with "testTimeout".
 ❯ test/runtime-dsl.test.ts:5760:3
test/runtime-dsl.test.ts > K: wiki coordination > K3: get_wiki_status reports complete coverage after specialist articles land
Stack Traces | 30s run time
Error: Test timed out in 30000ms.
If this is a long-running test, pass a timeout value as the last argument or configure it globally with "testTimeout".
 ❯ test/runtime-dsl.test.ts:5521:3
test/runtime-dsl.test.ts > I: peer review coordination > I4: peer review with two configured reviewers summarizes only those durable rows
Stack Traces | 30s run time
Error: Test timed out in 30000ms.
If this is a long-running test, pass a timeout value as the last argument or configure it globally with "testTimeout".
 ❯ test/runtime-dsl.test.ts:5181:3
test/runtime-dsl.test.ts > K: wiki coordination > K10: same-topic wiki expansion adds only the missing article and updates later query coverage
Stack Traces | 30s run time
Error: Test timed out in 30000ms.
If this is a long-running test, pass a timeout value as the last argument or configure it globally with "testTimeout".
 ❯ test/runtime-dsl.test.ts:5791:3
test/runtime-dsl.test.ts > I: peer review coordination > I3: peer review with one configured reviewer summarizes only that durable row
Stack Traces | 30s run time
Error: Test timed out in 30000ms.
If this is a long-running test, pass a timeout value as the last argument or configure it globally with "testTimeout".
 ❯ test/runtime-dsl.test.ts:5143:3
test/runtime-dsl.test.ts > J: debate coordination > J1: debate parent reads both sides from shared state before issuing a ruling
Stack Traces | 30s run time
Error: Test timed out in 30000ms.
If this is a long-running test, pass a timeout value as the last argument or configure it globally with "testTimeout".
 ❯ test/runtime-dsl.test.ts:5221:3
test/runtime-dsl.test.ts > K: wiki coordination > K6: repeating create_wiki with the same topic and subtopics is idempotent
Stack Traces | 30s run time
Error: Test timed out in 30000ms.
If this is a long-running test, pass a timeout value as the last argument or configure it globally with "testTimeout".
 ❯ test/runtime-dsl.test.ts:5630:3
test/runtime-dsl.test.ts > J: debate coordination > J3: debate with only one durable side stays partial until the missing side arrives
Stack Traces | 30s run time
Error: Test timed out in 30000ms.
If this is a long-running test, pass a timeout value as the last argument or configure it globally with "testTimeout".
 ❯ test/runtime-dsl.test.ts:5328:3
test/runtime-dsl.test.ts > K: wiki coordination > K4: create_wiki rejects switching the topic on an existing wiki
Stack Traces | 30s run time
Error: Test timed out in 30000ms.
If this is a long-running test, pass a timeout value as the last argument or configure it globally with "testTimeout".
 ❯ test/runtime-dsl.test.ts:5546:3
test/runtime-dsl.test.ts > M: deep researcher coordination > M3: separate researcher entities keep child results isolated across later wakes
Stack Traces | 30s run time
Error: Test timed out in 30000ms.
If this is a long-running test, pass a timeout value as the last argument or configure it globally with "testTimeout".
 ❯ test/runtime-dsl.test.ts:4968:3
test/runtime-dsl.test.ts > K: wiki coordination > K2: repeating create_wiki reuses existing specialists and only spawns missing subtopics
Stack Traces | 30s run time
Error: Test timed out in 30000ms.
If this is a long-running test, pass a timeout value as the last argument or configure it globally with "testTimeout".
 ❯ test/runtime-dsl.test.ts:5434:3
test/runtime-dsl.test.ts > K: wiki coordination > K8: wiki keeps durable child metadata and shared articles carry topic and author details
Stack Traces | 30s run time
Error: Test timed out in 30000ms.
If this is a long-running test, pass a timeout value as the last argument or configure it globally with "testTimeout".
 ❯ test/runtime-dsl.test.ts:5688:3
test/runtime-dsl.test.ts > D: shared state > D9: a setup-registered shared-state effect fires on the first wake write and survives a later wake
Stack Traces | 30s run time
Error: Test timed out in 30000ms.
If this is a long-running test, pass a timeout value as the last argument or configure it globally with "testTimeout".
 ❯ test/runtime-dsl.test.ts:3396:3
test/runtime-dsl.test.ts > K: wiki coordination > K1: wiki specialists accumulate shared articles that a later query can read
Stack Traces | 30.1s run time
Error: Timeout (30000ms) waiting for shared:wiki_article x2 on shared state wiki-wiki-1
[]
 ❯ waitForHistory test/runtime-dsl.ts:664:11
 ❯ test/runtime-dsl.test.ts:5391:5
