Fix #389: Prevent 100% CPU usage when Docker restarts #41
Summary
Fixes a critical issue in which PatchMon-agent consumed 100% CPU and became completely unresponsive after the Docker service was restarted.
Problem
When the Docker service restarts, the agent's Docker event monitoring loop enters an infinite CPU spin, and the agent must be restarted manually to recover. The only log message is "Docker event error" with error=EOF.
Root Cause
The event monitoring loop had a critical flaw. When Docker restarts, its channels close, and in Go a receive from a closed channel never blocks, so the select statement returns immediately. The loop would run select → sleep 5s → continue → select (which fires again immediately), producing a busy-spin loop instead of a proper wait.
Technical Issue in Go:
When a channel is closed, receiving from it returns immediately with the zero value. The original code's select statement would fire thousands of times per second, creating a busy loop consuming 100% CPU.
Solution
Implemented a two-tier monitoring architecture with exponential backoff:
TIER 1: monitoringLoop() - Manages Reconnection
TIER 2: monitorEvents() - Handles Single Event Stream
Key Insight:
By separating event processing (monitorEvents) from reconnection logic (monitoringLoop), we ensure the wait/sleep happens OUTSIDE the select loop. This prevents busy-spinning on closed channels.
Impact
Testing
Files Changed
Quality
Fixes #389