Skip to content

cmmd: flock lock + lsof timeout + housekeeper dry-run/metrics#2

Open
NagyVikt wants to merge 1 commit into
cmmd-hardening-part-1from
cmmd-hardening-part-2
Open

cmmd: flock lock + lsof timeout + housekeeper dry-run/metrics#2
NagyVikt wants to merge 1 commit into
cmmd-hardening-part-1from
cmmd-hardening-part-2

Conversation

@NagyVikt
Copy link
Copy Markdown
Contributor

Summary

Stacked on top of #1. Three small but load-bearing safety items the previous PR deferred.

1. ipc::acquire_lock — true flock(2), not path probing

The old code did read-check-then-write: probe the PID with kill(0), remove the file if dead, then write our own PID. Two daemons starting in the same microsecond both saw "lock file absent" and both passed. Replaced with an RAII LockGuard holding an fd and flock(LOCK_EX|LOCK_NB) on it. The kernel guarantees exclusivity; a crash automatically releases the lock when the fd closes. Lock + pid files removed on Drop; fd dropped last so another daemon can't observe the file gone before the kernel releases.

2. Bounded lsof — no more wedging on slow filesystems

New process::memory_holders_with_timeout wraps the blocking lsof +D <memory_root> in tokio::time::timeout(LSOF_TIMEOUT_SEC=5). Timeout returns Err(\"lsof timeout\") so the existing fallback to the name-only process guard kicks in. The sync variant is preserved for non-async callers; the parser is shared and now has its own unit tests.

3. Housekeepers gain dry-run + Prometheus counters

tmux_janitor::cleanup_unattached, orphan_node::reap_orphans, and pressure::check_and_respond now take a dry_run: bool. When true, they report what would be killed/written without taking the action — important because these run every tick unconditionally, with no audit gate above them. Daemon honors new HOUSEKEEPER_DRY_RUN.

New labeled metric: cmmd_housekeeper_actions_total{kind, mode} with kinds {tmux, orphan_node, pressure} and modes {real, dry_run}.

Stacked dependency

Targets cmmd-hardening-part-1 (#1), not main. Merge #1 first, then re-target this to main.

Test plan

  • cargo test --lib — 51 tests pass, 7 new (flock race, double-acquire, lsof parser dedup, housekeeper mode routing)
  • cargo clippy --lib --bins --no-deps — clean
  • cargo build — succeeds
  • Smoke: start two daemons; second one fails fast with "another daemon holds … (pid=N)"
  • Smoke: HOUSEKEEPER_DRY_RUN=true start; verify cmmd_housekeeper_actions_total{mode=\"dry_run\"} increments and no actual signals sent
  • Smoke: simulate slow FS (e.g. mount --bind a slow path); verify lsof timeout fires and tick falls back gracefully

🤖 Generated with Claude Code

Follow-up to the growth-control / budget-cap commit. Three small but
load-bearing safety items the earlier change deferred.

1. ipc::acquire_lock — true flock(2), not path probing.

   The old code did read-check-then-write: if the lock file existed,
   probe the PID with `kill(0)`, remove the file if dead, then write
   our own PID. Two daemons starting in the same microsecond both saw
   "lock file absent" and both passed. Replaced with an RAII
   LockGuard that holds an fd and an `flock(LOCK_EX|LOCK_NB)` on it.
   The kernel guarantees exclusivity, and a crash automatically
   releases the lock when the fd is closed — no stale-file false
   positives. Lock + pid files removed on Drop; fd dropped last so
   another daemon can't observe the file gone before the flock
   releases.

2. process::memory_holders_with_timeout — bounded lsof.

   `lsof +D <memory_root>` is recursive; on a slow filesystem (sshfs,
   network mount) it could hang the whole tick. Added a new async
   variant wrapped in `tokio::time::timeout(LSOF_TIMEOUT_SEC=5)`.
   Timeout returns Err("lsof timeout") so the existing fallback to
   the name-only process guard kicks in. The blocking sync variant
   is preserved for non-async callers; the parser is shared.

3. Housekeepers gain dry-run + Prometheus counters.

   tmux_janitor::cleanup_unattached, orphan_node::reap_orphans, and
   pressure::check_and_respond now take a `dry_run: bool`. When true,
   they report what *would* be killed/written without taking the
   action — important because these run every tick unconditionally,
   with no audit gate above them. Daemon honors HOUSEKEEPER_DRY_RUN.

   New labeled metric: `cmmd_housekeeper_actions_total{kind, mode}`
   with kinds {tmux, orphan_node, pressure} and modes {real, dry_run}.
   Six atomic counters back this; the render layer emits the labels.

51 tests pass (7 new), clippy clean on lib + bins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant