fix(k8s): cap buildLogCache size to bound memory on failure bursts (bug-bash #2) #230

Merged
mastermanas805 merged 8 commits into master from fix/buildlogcache-size-cap-2026-06-04
Jun 3, 2026
@mastermanas805

Member

What

buildLogCache (k8s provider) snapshots the kaniko build logs of failed builds so the failure autopsy can still read them after the kaniko Job's 300s TTL reaps the pod. The cache was TTL-bounded (30m) and swept stale entries on each new failure, but it had no size cap.

A burst of failing builds inside one TTL window (broken base image, wedged registry, repeated bad deploys) accumulates one ≤200-line snapshot per failure with no ceiling, so memory grows without bound on a long-lived api pod.

Fix

  • buildLogCacheMaxEntries = 256
  • capBuildLogCacheSize(): called from snapshotBuildLogs right after each store (evict-after-store). When a store pushes the live count past the cap, the oldest entries are evicted first, since a recent failure's autopsy is far likelier to be read. sync.Map has no length method, so the function collects keys and capture times in a single Range pass, sorts by capturedAt, and deletes the excess oldest entries.

Coverage

Symptom:        unbounded buildLogCache growth under a burst of failing builds
Enumeration:    grep buildLogCache. internal/providers/compute/k8s/client.go (5 sites)
Sites found:    1 store path (snapshotBuildLogs) — the only growth source
Sites touched:  1 (+ new cap fn)
Coverage test:  TestCapBuildLogCacheSize_EvictsOldestOverCap / _NoOpUnderCap
Live verified:  go tool cover → capBuildLogCacheSize 100.0%, snapshotBuildLogs 100.0%

🤖 Generated with Claude Code

@mastermanas805 mastermanas805 enabled auto-merge (squash) June 3, 2026 18:45
@mastermanas805 mastermanas805 force-pushed the fix/buildlogcache-size-cap-2026-06-04 branch from e2d04dc to dba8173 Compare June 3, 2026 18:54
…ug-bash #2)

buildLogCache snapshots kaniko build logs of FAILED builds for the
failure autopsy. It was TTL-bounded (30m) and swept stale entries on each
new failure — but had NO size cap. A burst of failing builds inside one
TTL window (a broken base image, a wedged registry, a fork-bomb of bad
deploys) accumulates one ≤200-line snapshot per failure with no ceiling:
unbounded memory growth on a long-lived api pod.

Add buildLogCacheMaxEntries=256 and capBuildLogCacheSize(), called
evict-after-store from snapshotBuildLogs. When a store pushes the live
count past the cap, evict oldest-first (a recent failure's autopsy is far
likelier to be read than one from hundreds of failures ago). sync.Map has
no length, so we snapshot keys+times in one Range pass, sort by
capturedAt, and delete the excess oldest.

Tests: TestCapBuildLogCacheSize_EvictsOldestOverCap (over-cap → newest
survive, oldest evicted), TestCapBuildLogCacheSize_NoOpUnderCap. Both new
funcs 100% covered; verified locally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 force-pushed the fix/buildlogcache-size-cap-2026-06-04 branch from dba8173 to 258f60d Compare June 3, 2026 19:12
@mastermanas805 mastermanas805 merged commit b87dd55 into master Jun 3, 2026
18 checks passed