Skip to content

feat(gc): per-id INFO logs + orchestrator summary#54

Merged
CMGS merged 4 commits into
masterfrom
feat/gc-detail-logs
May 18, 2026
Merged

feat(gc): per-id INFO logs + orchestrator summary#54
CMGS merged 4 commits into
masterfrom
feat/gc-detail-logs

Conversation

@CMGS
Copy link
Copy Markdown
Contributor

@CMGS CMGS commented May 18, 2026

Summary

Implements the "output detail" half of issue #33 (the LRU half landed in #53).

`cocoon gc` was effectively silent — only `INFO GC completed`. Operators couldn't tell what was actually reclaimed without inspecting filesystems. This PR adds:

  1. Per-id INFO log in every GC module's `Collect`, structured as `key=value` for awk/grep parsing
  2. Orchestrator summary line at end of each cycle: per-module counts + duration + failure count
  3. Reason field explaining why each id was selected

Sample output

```
INFO gc.snapshot collected id=XEOU... name=ubuntu-hot-testing:v1 bytes=3221225472 last_accessed=2026-04-12T10:30:00Z reason=lru-age
INFO gc.snapshot collected id=2GQVEA... name= bytes=0 last_accessed=never reason=orphan
INFO gc.cloudhypervisor collected id=ABC123 runDir=/var/lib/cocoon/run/cloudhypervisor/ABC123 logDir=/var/log/cocoon/cloudhypervisor/ABC123 reason=orphan-runDir
INFO gc.oci collected blob=b40150c1c2717d... reason=unreferenced
INFO gc.cni collected id=JKL netns=cocoon-JKL nics=2 reason=orphan
INFO gc.bridge collected id=MNO iface=btMNO-0 reason=orphan-tap
INFO gc.Run completed: cloudhypervisor=1 cni=1 oci=4 snapshot=3 (failures: 0, duration: 230ms)
```

Reason vocabulary

Module Reasons
snapshot `orphan`, `stale-pending`, `lru-all`, `lru-age`, `lru-keep`, `lru-size` (multi-criterion joined by `+`)
cloudhypervisor / firecracker `orphan-runDir`, `orphan-logDir`, `stale-creating`
images (oci, cloudimg) `unreferenced`
cni `orphan`
bridge `orphan-tap`

Design notes

  • `gc.Module.Collect` signature unchanged — modules that need per-id reason (snapshot + hypervisor) use a closure-captured `reasons map[string]string` populated in Resolve, read in Collect. Framework stays generic.
  • `pickLRU` now returns `map[string]string` (id → comma/plus-joined reasons). Tests updated.
  • `Backend.GCCollect` (hypervisor) becomes private `gcCollect` and takes `reasons` — no external callers existed.
  • `GCCollectBlobs` (images) takes a `module` parameter for log routing.
  • Orchestrator records attempted id-counts in the summary; success/failure distinction comes from per-module errs aggregated via `errors.Join` (already in place).
  • Removed redundant `log.WithFunc("cmd.gc").Info(ctx, "GC completed")` in cmd/others — orchestrator's summary supersedes it.

Files changed

```
README.md | 28 +++
cmd/others/handler.go | 7 +---
gc/orchestrator.go | 26 ++++
gc/orchestrator_test.go | 24 ++++ (NEW)
hypervisor/gc.go | 35 ++++-
images/gc.go | 2 +-
images/index.go | 14 +++
network/bridge/gc_linux.go | 4 +-
network/cni/gc.go | 7 ++++
snapshot/localfile/gc.go | 74 +++++++++---
snapshot/localfile/gc_test.go | 32 +++--
11 files changed, ~200 insertions
```

Verification

```
✓ go build ./... (zero warnings)
✓ make lint (0 issues × Linux + Darwin)
✓ go test ./... (all green, including new TestFormatSummary cases + updated pickLRU tests)
```

Test plan

  • On testbed: `cocoon gc` (no flags) — verify orphan + stale-pending entries logged with reasons
  • `cocoon gc --snapshot --snapshot-age=24h` — verify lru-* reasons in snapshot module
  • `cocoon gc --snapshot --snapshot-dry-run` — verify `would-evict` lines (preview), no `collected` lines for LRU
  • Verify final `gc.Run completed: ... (failures: 0, duration: ...)` summary line appears
  • grep/awk pipeline (`grep "gc.snapshot.*reason=lru-"`) parses cleanly

CMGS added 4 commits May 18, 2026 14:11
Issue #33's "output detail" requirement: `cocoon gc` was previously
near-silent, only printing a single "GC completed". Now every
collected item is logged at INFO under `gc.<module>` with a
structured key=value payload, and the orchestrator ends each cycle
with a per-module count + duration.

Sample run:

  INFO gc.snapshot         collected id=XEOU... name=ubuntu-hot-testing:v1 bytes=3221225472 last_accessed=2026-04-12T10:30:00Z reason=lru-age
  INFO gc.cloudhypervisor  collected id=ABC123 runDir=/var/lib/cocoon/run/cloudhypervisor/ABC123 logDir=/var/log/cocoon/cloudhypervisor/ABC123 reason=orphan-runDir
  INFO gc.oci              collected blob=b40150c1c2717d... reason=unreferenced
  INFO gc.cni              collected id=JKL netns=cocoon-JKL nics=2 reason=orphan
  INFO gc.bridge           collected id=MNO iface=btMNO-0 reason=orphan-tap
  INFO gc.Run              completed: cloudhypervisor=1 cni=1 oci=4 snapshot=3 (failures: 0, duration: 230ms)

Reasons surface why each id was selected:

- snapshot:        orphan / stale-pending / lru-all / lru-age / lru-keep / lru-size
                   (multi-criterion joins with `+`, e.g. `lru-age+lru-size`)
- cloudhypervisor/firecracker: orphan-runDir / orphan-logDir / stale-creating
- images (oci, cloudimg):      unreferenced
- cni:                          orphan
- bridge:                       orphan-tap

Implementation:

- pickLRU now returns `map[string]string` (id → reason); tests updated.
- snapshot and hypervisor modules carry a `reasons` closure variable
  populated in Resolve, read in Collect — Module.Collect signature is
  unchanged so the framework stays untouched for the other modules.
- gcCollect (hypervisor) is now private and takes the reasons map;
  no external callers existed.
- GCCollectBlobs takes a `module` parameter so the per-blob log goes
  to the correct `gc.<oci|cloudimg>` namespace.
- Orchestrator tracks {module → attempted-id-count} and emits a
  summary log via formatSummary; counts are attempts (not successes)
  to keep Collect signatures untouched. Failures count comes from the
  per-module err.
- README documents the format + grep/awk examples.

Compatible: no flag changes, no DB shape changes, no Module signature
changes. The only visible difference is more log output (INFO level —
appears by default).
- Hoist EvictionPolicy + hasCriteria above backfillSizeBytes (public
  declarations must precede private per file-organization rule).
- Align backfillSizeBytes logger prefix to gc.snapshot (matches the
  other two log sites in this file; localfile.X is reserved for
  non-GC callers).
- Extract logEvictRow shared by Collect (collected) and logWouldEvict
  (would-evict) — same field shape was duplicated; inlines formatTime
  into the single caller.
- Trim three godocs of restating chatter.
…ontract

Both snapshot/localfile and hypervisor maintained a closure-captured
reasons map[string]string that Resolve wrote and Collect read — the
same coupling smell duplicated in two places.

Add snap to Collect's signature (gc.Module[S].Collect) so any map
field populated by Resolve (reasons in this case, but generally any
per-id metadata) is visible to Collect without the closure dance.

- snapshot/localfile/gc.go: drop var (records, reasons) outer capture;
  reasons becomes a field of snapshotGCSnapshot, initialized in ReadDB,
  populated in Resolve, read in Collect via snap.
- hypervisor/gc.go: drop var reasons closure; reasons becomes a field
  of VMGCSnapshot.
- images/network bridge/cni: Collect signature gains a snap parameter;
  modules that don't need it use _ XSnapshot.
- gc/module.go + runner.go + orchestrator.go: thread snap through the
  collect adapter via the snapshots map already kept for cross-module
  analysis.

Net: same line count, removes duplicated closure-state smell, makes
"per-id metadata Resolve→Collect" a first-class GC contract for any
future module that needs it.
- gc/module.go: trim Resolve+Collect godoc narration that restated the
  signature contract.
- gc/module.go: collect adapter type-asserts symmetrically with
  resolveTargets — return nil on mismatch instead of silently passing
  zero S.
- hypervisor/gc.go: gcCollect now takes snap VMGCSnapshot (consistent
  with snapshot/localfile/gc.go which also reads snap.reasons inside
  Collect); reasons-map plumbing folded into the snap contract.
- snapshot/localfile/gc_test.go: ReadDB called BEFORE chmod 0500 on
  parent so the test exercises the intended path (read-then-fail-on-
  remove) instead of accidentally depending on chmod 0500 still
  permitting parent dir reads.
@CMGS CMGS merged commit dd7a3b9 into master May 18, 2026
4 checks passed
@CMGS CMGS deleted the feat/gc-detail-logs branch May 18, 2026 06:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant