Fix #389: Prevent 100% CPU usage when Docker restarts #41

romprod · 2026-01-03T17:56:19Z

Summary

Fixed critical issue where PatchMon-agent consumed 100% CPU when Docker service was restarted, becoming completely unresponsive.

Problem

When Docker service restarts, the agent's Docker event monitoring loop enters an infinite CPU spin and must be manually restarted to recover. Only log message: "Docker event error" with error=EOF

Root Cause

The event monitoring loop had a critical flaw: when Docker restarts and channels close, Go's select statement returns immediately on closed channels (non-blocking). The loop would: select → sleep 5s → continue → select (immediately fires again) creating a busy-spin loop instead of properly waiting.

Technical Issue in Go:
When a channel is closed, receiving from it returns immediately with the zero value. The original code's select statement would fire thousands of times per second, creating a busy loop consuming 100% CPU.

Solution

Implemented two-tier monitoring architecture with exponential backoff:

TIER 1: monitoringLoop() - Manages Reconnection

Attempts to establish event stream via monitorEvents()
On failure: waits with exponential backoff (1s, 1.5s, 2.25s... max 30s)
Sleep properly blocks (0% CPU during wait)
Automatically retries when backoff completes
Supports graceful shutdown via context cancellation

TIER 2: monitorEvents() - Handles Single Event Stream

Gets FRESH channels from Docker API on each call
Processes events from the current stream
Returns immediately on any error (EOF, context cancel, etc)
Lets caller (monitoringLoop) decide on retry strategy

Key Insight:
By separating event processing (monitorEvents) from reconnection logic (monitoringLoop), we ensure the wait/sleep happens OUTSIDE the select loop. This prevents busy-spinning on closed channels.

Impact

Metric	Before	After
CPU on restart	100% (infinite spin)	0-0.1% (sleeping)
Recovery	Manual restart	1-2s automatic
Error messages	Generic	Detailed w/ retry info
Graceful shutdown	No	Yes

Testing

7 comprehensive unit tests
1 performance benchmark
Tests cover: reconnection, exponential backoff, EOF handling, graceful shutdown
All tests pass

Files Changed

\ - Refactored event monitoring (+133, -12 lines)
\ - New test suite (+192 lines)

Quality

✅ Backward compatible
✅ Proper error handling for all error types
✅ Comprehensive test coverage
✅ Improves overall reliability and resilience
✅ Graceful shutdown support

Fixes #389

…ersions Fixes #379 Fixes #390 Issue: Virtual kernel meta-packages (linux-image-virtual, linux-image-generic, etc.) were incorrectly marked as needs_reboot=true because the agent compared the meta-package name against the running kernel version instead of the actual kernel the meta-package depends on. Root Cause: - Agent extracted meta-package name: 'virtual' - Compared against running kernel: '5.15.0-164-generic' - Mismatch triggered false positive needs_reboot=true Solution: - Expanded meta-package detection to cover all variants - Added resolveMetaPackageKernel() to resolve meta-packages to actual kernels - Uses apt-cache depends to find kernel dependencies - Gracefully falls back if apt-cache unavailable (non-Debian systems) Changes: - Modified getLatestKernelFromDpkg() in internal/system/reboot.go - Now skips: virtual, generic, lowlatency, cloud, generic-hwe meta-packages - Resolves each meta-package to its actual kernel package - Added resolveMetaPackageKernel() function - Queries apt-cache for package dependencies - Parses kernel package information - Logs resolution attempts for debugging - Added comprehensive tests in internal/system/reboot_test.go - 22+ test cases covering parsing, comparison, and edge cases - Tests for all known meta-package variants - Tests for missing/unavailable apt-cache fallback Impact: - Affects: Ubuntu/Debian systems with virtual kernel meta-packages - Risk: Low - only changes reboot detection logic - Backward compatible: Non-meta-package systems unaffected - Performance: No impact

Fixes #389 Issue: When Docker service restarts, the PatchMon-agent would consume 100% CPU and become unresponsive. The only way to recover was to manually restart the agent. Root Cause: The Docker event monitoring loop had a critical flaw in handling connection failures. When Docker restarts, the event channels close, but the loop continued spinning indefinitely: 1. Event channels close (EOF error) 2. Error handler sleeps for 5 seconds 3. Continue statement jumps back to select statement 4. Select statement fires IMMEDIATELY (channels are still closed) 5. Creates busy-spin loop: select → continue → select → continue... 6. Result: 100% CPU with no recovery mechanism Technical Issue: In Go, receiving from a CLOSED channel returns immediately with the zero value for that type (non-blocking). The original code's select statement would fire thousands of times per second on closed channels, creating a busy loop instead of properly waiting for recovery. Solution: Implemented a two-tier monitoring architecture with exponential backoff: TIER 1: monitoringLoop() - Manages reconnection - Attempts to establish event stream via monitorEvents() - On failure: waits with exponential backoff (1s, 1.5s, 2.25s... capped at 30s) - Sleep properly blocks (0% CPU during wait) - Automatically retries when backoff completes - Supports graceful shutdown via context cancellation TIER 2: monitorEvents() - Handles single event stream - Gets FRESH channels from Docker API on each call - Processes events from the current stream - Returns immediately on any error (EOF, context cancel, etc) - Lets caller (monitoringLoop) decide on retry strategy Key Insight: By separating event processing (monitorEvents) from reconnection logic (monitoringLoop), we ensure the wait/sleep happens OUTSIDE the select loop. This prevents busy-spinning on closed channels. Impact: - CPU usage: 100% spin → 0-0.1% (sleeping) - Recovery time: Manual restart → 1-2 seconds automatic - Error messages: Generic → Detailed with retry attempts - Graceful shutdown: Properly cancels context Testing: - Comprehensive test suite: 7 unit tests + 1 benchmark - Tests cover: reconnection logic, exponential backoff, error handling - EOF error handling explicitly tested - Context cancellation tested for graceful shutdown Files Changed: - internal/integrations/docker/monitoring.go (refactored event monitoring) - internal/integrations/docker/monitoring_test.go (new test suite) The fix is backward compatible and improves overall agent reliability.

PatchMon Contributor added 2 commits January 3, 2026 10:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix #389: Prevent 100% CPU usage when Docker restarts #41

Fix #389: Prevent 100% CPU usage when Docker restarts #41

Uh oh!

romprod commented Jan 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Fix #389: Prevent 100% CPU usage when Docker restarts #41

Are you sure you want to change the base?

Fix #389: Prevent 100% CPU usage when Docker restarts #41

Uh oh!

Conversation

romprod commented Jan 3, 2026

Summary

Problem

Root Cause

Solution

TIER 1: monitoringLoop() - Manages Reconnection

TIER 2: monitorEvents() - Handles Single Event Stream

Impact

Testing

Files Changed

Quality

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant