cmmd: flock lock + lsof timeout + housekeeper dry-run/metrics #2
Open
NagyVikt wants to merge 1 commit into
Follow-up to the growth-control / budget-cap commit. Three small but
load-bearing safety items the earlier change deferred.
1. ipc::acquire_lock — true flock(2), not path probing.
The old code did read-check-then-write: if the lock file existed,
probe the PID with `kill(0)`, remove the file if dead, then write
our own PID. Two daemons starting in the same microsecond both saw
"lock file absent" and both passed. Replaced with an RAII
LockGuard that holds an fd and an `flock(LOCK_EX|LOCK_NB)` on it.
The kernel guarantees exclusivity, and a crash automatically
releases the lock when the fd is closed — no stale-file false
positives. Lock + pid files removed on Drop; fd dropped last so
another daemon can't observe the file gone before the flock
releases.
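A minimal sketch of the guard described in item 1, assuming a Unix target. The names `LockGuard::acquire` and the single `path` field are illustrative, not the PR's actual signatures, and the pid file is left out. `flock` is declared directly (with the Linux/macOS values `LOCK_EX = 2`, `LOCK_NB = 4`) so the sketch needs no external crate; the `unsafe extern` form assumes Rust 1.82 or newer.

```rust
use std::fs::{self, File, OpenOptions};
use std::io;
use std::os::raw::c_int;
use std::os::unix::io::AsRawFd;
use std::path::{Path, PathBuf};

// flock(2) from the platform C library; declared by hand for this sketch.
unsafe extern "C" {
    fn flock(fd: c_int, operation: c_int) -> c_int;
}
const LOCK_EX: c_int = 2; // exclusive lock
const LOCK_NB: c_int = 4; // fail instead of blocking

/// Hypothetical RAII guard: holds the open file (and thus the flock)
/// for as long as the value is alive.
pub struct LockGuard {
    path: PathBuf,
    // Fields are dropped after `Drop::drop` returns, so the fd closes
    // (and the kernel releases the lock) only after the file is unlinked.
    _file: File,
}

impl LockGuard {
    pub fn acquire(path: &Path) -> io::Result<LockGuard> {
        let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .truncate(false)
            .open(path)?;
        // Non-blocking exclusive lock; a second holder gets EWOULDBLOCK.
        let rc = unsafe { flock(file.as_raw_fd(), LOCK_EX | LOCK_NB) };
        if rc != 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(LockGuard { path: path.to_path_buf(), _file: file })
    }
}

impl Drop for LockGuard {
    fn drop(&mut self) {
        // Remove the lock file first; ignore errors (it may already be gone).
        let _ = fs::remove_file(&self.path);
    }
}
```

Because the lock lives on the open file description rather than in the file's contents, a crashed process leaves nothing to probe: the kernel drops the lock when the fd goes away, and the next `acquire` succeeds even if the file itself is still on disk.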
2. process::memory_holders_with_timeout — bounded lsof.
`lsof +D <memory_root>` is recursive; on a slow filesystem (sshfs,
network mount) it could hang the whole tick. Added a new async
variant wrapped in `tokio::time::timeout(LSOF_TIMEOUT_SEC=5)`.
Timeout returns Err("lsof timeout") so the existing fallback to
the name-only process guard kicks in. The blocking sync variant
is preserved for non-async callers; the parser is shared.
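The bounded-call pattern in item 2 can be sketched without tokio. This is a std-only analogue, not the PR's code: it swaps `tokio::time::timeout` for a helper thread plus `recv_timeout`, and `run_with_timeout` and `holders_or_fallback` are hypothetical names. The probe closure stands in for the blocking `lsof` call.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Runs `work` on a helper thread and waits at most `limit` for its result.
/// Note: on timeout the helper thread is not cancelled; it keeps running
/// until `work` returns on its own.
pub fn run_with_timeout<T, F>(limit: Duration, work: F) -> Result<T, String>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Sending fails only if the caller already gave up; ignore that.
        let _ = tx.send(work());
    });
    match rx.recv_timeout(limit) {
        Ok(value) => Ok(value),
        Err(RecvTimeoutError::Timeout) => Err("lsof timeout".to_string()),
        Err(RecvTimeoutError::Disconnected) => Err("lsof worker failed".to_string()),
    }
}

/// Hypothetical caller: use the holder list when the probe finishes in
/// time, otherwise fall back to a cheaper guard (here, a precomputed list).
pub fn holders_or_fallback<F>(limit: Duration, probe: F, fallback: Vec<u32>) -> Vec<u32>
where
    F: FnOnce() -> Vec<u32> + Send + 'static,
{
    run_with_timeout(limit, probe).unwrap_or(fallback)
}
```

The point of returning an `Err` rather than panicking or blocking is that the caller's existing error path already knows what to do, so the timeout needs no special handling above this function.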
3. Housekeepers gain dry-run + Prometheus counters.
tmux_janitor::cleanup_unattached, orphan_node::reap_orphans, and
pressure::check_and_respond now take a `dry_run: bool`. When true,
they report what *would* be killed/written without taking the
action — important because these run every tick unconditionally,
with no audit gate above them. Daemon honors HOUSEKEEPER_DRY_RUN.
New labeled metric: `cmmd_housekeeper_actions_total{kind, mode}`
with kinds {tmux, orphan_node, pressure} and modes {real, dry_run}.
Six atomic counters back this; the render layer emits the labels.
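One way the six counters and the `dry_run` flag from item 3 could fit together, as a sketch under assumptions: `HousekeeperMetrics`, `Kind`, `Mode`, and `housekeep` are hypothetical names, and the `kill` closure stands in for whatever action a real housekeeper takes. Only the metric name and its label values come from the description above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Clone, Copy)]
pub enum Kind { Tmux, OrphanNode, Pressure }

#[derive(Clone, Copy)]
pub enum Mode { Real, DryRun }

const KINDS: [(Kind, &str); 3] = [
    (Kind::Tmux, "tmux"),
    (Kind::OrphanNode, "orphan_node"),
    (Kind::Pressure, "pressure"),
];
const MODES: [(Mode, &str); 2] = [(Mode::Real, "real"), (Mode::DryRun, "dry_run")];

/// Six atomic counters: one per (kind, mode) pair.
#[derive(Default)]
pub struct HousekeeperMetrics {
    counters: [[AtomicU64; 2]; 3],
}

impl HousekeeperMetrics {
    pub fn record(&self, kind: Kind, mode: Mode) {
        self.counters[kind as usize][mode as usize].fetch_add(1, Ordering::Relaxed);
    }

    /// Emits the counters in Prometheus text format; labels are added here,
    /// not stored with the counters.
    pub fn render(&self) -> String {
        let mut out = String::from("# TYPE cmmd_housekeeper_actions_total counter\n");
        for (kind, kind_label) in KINDS {
            for (mode, mode_label) in MODES {
                let n = self.counters[kind as usize][mode as usize].load(Ordering::Relaxed);
                out.push_str(&format!(
                    "cmmd_housekeeper_actions_total{{kind=\"{}\",mode=\"{}\"}} {}\n",
                    kind_label, mode_label, n
                ));
            }
        }
        out
    }
}

/// Hypothetical housekeeper body: reports every target, but only calls
/// `kill` when `dry_run` is false. Each target bumps the matching counter.
pub fn housekeep<F: FnMut(u32)>(
    metrics: &HousekeeperMetrics,
    kind: Kind,
    dry_run: bool,
    targets: &[u32],
    mut kill: F,
) -> Vec<u32> {
    let mode = if dry_run { Mode::DryRun } else { Mode::Real };
    for &pid in targets {
        if !dry_run {
            kill(pid);
        }
        metrics.record(kind, mode);
    }
    targets.to_vec()
}
```

Keeping the mode as a label on one metric, rather than two separate metrics, lets a dashboard compare what a dry run would have done against what a real run did with a single query.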
51 tests pass (7 new), clippy clean on lib + bins.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Stacked on top of #1. Three small but load-bearing safety items the previous PR deferred.
1. ipc::acquire_lock: true flock(2), not path probing
The old code did read-check-then-write: probe the PID with `kill(0)`, remove the file if dead, then write our own PID. Two daemons starting in the same microsecond both saw "lock file absent" and both passed. Replaced with an RAII `LockGuard` holding an fd and `flock(LOCK_EX|LOCK_NB)` on it. The kernel guarantees exclusivity; a crash automatically releases the lock when the fd closes. Lock + pid files removed on `Drop`; fd dropped last so another daemon can't observe the file gone before the kernel releases.
2. Bounded lsof: no more wedging on slow filesystems
New `process::memory_holders_with_timeout` wraps the blocking `lsof +D <memory_root>` in `tokio::time::timeout(LSOF_TIMEOUT_SEC=5)`. Timeout returns `Err("lsof timeout")` so the existing fallback to the name-only process guard kicks in. The sync variant is preserved for non-async callers; the parser is shared and now has its own unit tests.
3. Housekeepers gain dry-run + Prometheus counters
`tmux_janitor::cleanup_unattached`, `orphan_node::reap_orphans`, and `pressure::check_and_respond` now take a `dry_run: bool`. When true, they report what would be killed/written without taking the action. This matters because these run every tick unconditionally, with no audit gate above them. Daemon honors new `HOUSEKEEPER_DRY_RUN`.
New labeled metric: `cmmd_housekeeper_actions_total{kind, mode}` with kinds `{tmux, orphan_node, pressure}` and modes `{real, dry_run}`.
Stacked dependency
Targets `cmmd-hardening-part-1` (#1), not `main`. Merge #1 first, then re-target this to `main`.
Test plan
- `cargo test --lib`: 51 tests pass, 7 new (flock race, double-acquire, lsof parser dedup, housekeeper mode routing)
- `cargo clippy --lib --bins --no-deps`: clean
- `cargo build`: succeeds
- `HOUSEKEEPER_DRY_RUN=true` start; verify `cmmd_housekeeper_actions_total{mode="dry_run"}` increments and no actual signals sent
- `mount --bind` a slow path; verify lsof timeout fires and tick falls back gracefully
🤖 Generated with Claude Code