Skip to content

[slopfix] fix(pegboard): decrement kv_storage_used metric on actor delete#5356

Open
MasterPtato wants to merge 1 commit into
stack/slopfix-refactor-universaldb-ups-single-multi-node-udb-over-nats-remove-ups-postgres-driver-vuorlwlufrom
stack/slopfix-fix-pegboard-decrement-kv_storage_used-metric-on-actor-delete-uqzqlkws
Open

[slopfix] fix(pegboard): decrement kv_storage_used metric on actor delete#5356
MasterPtato wants to merge 1 commit into
stack/slopfix-refactor-universaldb-ups-single-multi-node-udb-over-nats-remove-ups-postgres-driver-vuorlwlufrom
stack/slopfix-fix-pegboard-decrement-kv_storage_used-metric-on-actor-delete-uqzqlkws

Conversation

@MasterPtato

@MasterPtato MasterPtato commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

No description provided.

@MasterPtato

MasterPtato commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

@claude

claude Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review

Small, well-targeted fix: the pegboard_actor_metrics / pegboard_actor2_metrics workflows periodically record the actor's KV size into a namespace-level KvStorageUsed gauge but never decremented it when the actor (and its KV) is destroyed, so the namespace total drifted upward permanently. This adds a RemoveKvMetrics activity after the lifecycle loop breaks that subtracts the last recorded size.

Correctness

  • The decrement logic mirrors the existing record_kv_metrics delta pattern (namespace::keys::metric::inc with a negative delta), consistent with the existing atomic-add accounting.
  • The if last_kv_storage_size != 0 guard correctly avoids an extra transaction and avoids double-decrementing when the last periodic record_kv_metrics call (which also runs once more on the Destroy iteration, since it isn't gated on !destroy) already observed a shrunk/deleted KV.
  • Verified the activity-state semantics: ActivityCtx::state()'s StateGuard only writes its mutated copy back to the workflow's durable state when the activity returns Ok (run_activity in ctx/workflow.rs); a mid-loop failure in the KV scan (record_kv_metrics) discards the dirty in-memory state entirely rather than partially persisting it, and completed activities are never re-executed on replay. So there's no partial-corruption or double-decrement risk here.
  • ctx.v(2) reused for the new RemoveKvMetricsInput call (same version as the pre-existing GetTsInput call in the loop) is safe, not a bug: compare_version/history-location matching is per call-site position, and reusing a version number at an unrelated, later-appended call site is consistent with existing convention elsewhere in this file tree (e.g. runner_pool.rs, actor/mod.rs).
  • actor/metrics.rs and actor2/metrics.rs are kept in parity, per this repo's convention of maintaining both actor versions side by side. The actor2 version additionally tags all its txns with tx.priority(Priority::Low), including the new remove_kv_metrics txn — consistent with the rest of that file (v1 actor/metrics.rs doesn't use Priority::Low anywhere, so this asymmetry is intentional/pre-existing rather than an oversight in this PR).
  • Verified there are 4 signal-send call sites for metrics::Destroy (3 in actor/mod.rs, 1 in actor2/mod.rs), so the destroy signal path looks complete. Still worth a mental double check: is there any actor-KV-deletion path (crash cleanup, orphan reaping, forced eviction) that deletes KV without routing through one of these Destroy signal sends? If such a path exists, it would reintroduce the same leak this PR fixes, since the decrement only fires when this workflow's loop observes the Destroy signal.

Minor / nit

  • Already-completed workflow instances (actors destroyed before this fix ships) will keep their stale contribution baked into the namespace KvStorageUsed gauge forever, since this code path never runs for finished workflows. Might be worth a one-off backfill/reset note if the drift matters for existing namespaces, but not a blocker for this change.
  • examples/kitchen-sink/scripts/seed-large-dbs.ts: flags/env vars are parsed with Number(flag(...)) || envNum(...), so an explicit --count=0 (or similar) would fall through to the env/default value instead of being honored, since 0 is falsy in JS. Very low severity for a load-generation script with no legitimate zero value, but worth a ??-based check if precise flag semantics ever matter here.
  • The PR also carries a large regenerated examples/kitchen-sink/dist-server/server.mjs (+11.9k/-11.1k lines) and .mjs.map diff. That file is listed in examples/kitchen-sink/.gitignore yet is tracked in git; since the diff reorders/rewrites much more than just the new growDb actor wiring, it'd help reviewers if this rebuild is confirmed intentional/expected as part of this change (vs. stale drift from the base branch), given it dwarfs the actual code change in line count.

Test coverage

  • No test added, and there's no existing coverage for KvStorageUsed lifecycle under engine/packages/pegboard/tests/. Not a regression in coverage, but since this fixes a real metric-drift bug, a test asserting the gauge returns to baseline after actor destroy (via the existing UDB-backed pegboard test harness) would help guard against this regressing again.
  • The new examples/kitchen-sink/src/actors/testing/grow-db.ts and scripts/seed-large-dbs.ts are load-test scaffolding rather than automated tests; that's fine for their stated purpose (seeding large uncompacted DBs for a compaction load test) but they don't exercise the actual metric fix in this PR.

Overall this looks correct and low-risk. Nice fix.

@MasterPtato MasterPtato force-pushed the stack/slopfix-fix-pegboard-decrement-kv_storage_used-metric-on-actor-delete-uqzqlkws branch from db3dfe5 to d0ae940 Compare July 3, 2026 23:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant