Harden Traefik and Alloy remote_write for control room resilience #270
Claude finished @timtalbot's task: Review of PR #270
Review submitted. Two non-blocking observations:
Review Summary
Overall this is a well-targeted fix — both changes address the root cause (unbounded Alloy shards overwhelming Traefik) from complementary angles. The code is clean and consistent. A few observations:
lib/steps/helm_helpers.go — Alloy queue_config
max_shards = 3 and max_backoff = "5m" are applied consistently in both the control_room and workload remote_write blocks.
Going from the default 200 shards to 3 is a 98.5% reduction. With a 15s scrape interval and typical cardinality, 3 shards should be sufficient for steady state. However, if a workload has unusually high cardinality (many sites/pods), the remote_write queue could fall behind. Worth confirming post-rollout via dashboards that prometheus_remote_storage_samples_pending stays near zero across workloads.
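The size of that reduction can be checked with quick arithmetic. The sketch below assumes each shard holds at most one in-flight push at a time; the pod count is a hypothetical figure for illustration, not a number from this PR.

```python
def worst_case_pushes(alloy_pods: int, max_shards: int) -> int:
    """Upper bound on concurrent remote_write pushes,
    assuming one in-flight request per shard (an assumption)."""
    return alloy_pods * max_shards


def reduction_pct(old: int, new: int) -> float:
    """Percentage reduction when going from `old` to `new`."""
    return (old - new) * 100 / old


ALLOY_PODS = 40  # hypothetical fleet size, not taken from the PR

before = worst_case_pushes(ALLOY_PODS, 200)  # shards at the previous default
after = worst_case_pushes(ALLOY_PODS, 3)     # shards after this change

print(f"worst case before: {before}, after: {after}")
print(f"per-pod shard reduction: {reduction_pct(200, 3)}%")
```

The absolute numbers depend on the real pod count, but the ratio between the two bounds does not.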
traefik.py — Resource requests/limits
- CPU limit omission: Without a CPU limit, Traefik can still burst to consume all available node CPU during spikes, which is the starvation scenario described in the PR context. A CPU limit (e.g., "cpu": "500m" or "cpu": "1") would cap that burst. If the intent is to allow bursting during legitimate traffic spikes, this is fine, but worth being explicit about the trade-off.
- Probe chart paths: Verify that livenessProbe and readinessProbe are the correct top-level Traefik Helm chart value paths for the deployed chart version. Some chart versions place probes under deployment.podSpec rather than top-level. If this has been tested against the deployed chart version, disregard.
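On the CPU limit trade-off: Kubernetes classifies a pod as Burstable when it sets some requests or limits but does not have requests equal to limits for both CPU and memory, so leaving out the CPU limit keeps Traefik in that class. A minimal sketch of the rule for a single container follows. The resource values are hypothetical rather than the ones in traefik.py, `qos_class` is an illustrative helper, and quantities are compared as plain strings for simplicity.

```python
def qos_class(resources: dict) -> str:
    """Approximate the Kubernetes QoS class for a single-container pod.

    Simplification: quantities are compared as strings, so equivalent
    spellings such as "500m" and "0.5" are treated as different.
    """
    requests = dict(resources.get("requests", {}))
    limits = resources.get("limits", {})

    if not requests and not limits:
        return "BestEffort"

    # Kubernetes defaults a missing request to the corresponding limit.
    for key in ("cpu", "memory"):
        if key in limits and key not in requests:
            requests[key] = limits[key]

    guaranteed = all(
        key in limits and requests.get(key) == limits[key]
        for key in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"


# Hypothetical values: memory is capped, CPU is left free to burst.
no_cpu_limit = {
    "requests": {"cpu": "100m", "memory": "128Mi"},
    "limits": {"memory": "256Mi"},
}
print(qos_class(no_cpu_limit))
```

Adding a CPU limit above the request would still yield Burstable; only matching requests and limits for both resources yields Guaranteed.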
Changes are correct and consistent with existing patterns. No blocking issues.
Summary
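Add Traefik resource requests/limits and more tolerant probes to prevent node CPU starvation and crash loops under load
Cap Alloy remote_write max_shards and set max_backoff to prevent thundering herd when multiple clusters push to a shared endpoint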
Context
Control room Traefik was crash-looping due to unbounded concurrent push connections
from workload Alloy instances. The default max_shards (200) per Alloy pod meant
recovery scenarios created massive connection floods that overwhelmed Traefik.
Changes
traefik.py: Add CPU/memory requests+limits (Burstable QoS), increase liveness probe timeout to 5s and failure threshold to 5, and set readiness probe failure threshold to 3
helm_helpers.go: Add max_shards=3 and max_backoff=5m to both control_room and workload prometheus.remote_write queue_config blocks
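To illustrate the effect of max_backoff=5m, the sketch below assumes retry delays double from a minimum value up to the cap, which is how Prometheus-style remote_write queues are generally described. The 30ms starting delay is an assumed default, and `backoff_schedule` is a hypothetical helper, not code from helm_helpers.go.

```python
from datetime import timedelta


def backoff_schedule(min_backoff: timedelta, max_backoff: timedelta,
                     retries: int) -> list:
    """Return the delay before each retry, doubling until the cap is hit."""
    delays = []
    delay = min_backoff
    for _ in range(retries):
        delays.append(delay)
        delay = min(delay * 2, max_backoff)
    return delays


# Assumed starting delay of 30ms; the 5m cap comes from this PR.
schedule = backoff_schedule(
    timedelta(milliseconds=30), timedelta(minutes=5), retries=16
)
for attempt, delay in enumerate(schedule, start=1):
    print(f"retry {attempt}: wait {delay.total_seconds():.2f}s")
```

Under these assumptions, a shard that keeps failing tapers to one attempt every five minutes, which eases pressure on a recovering endpoint at the cost of slower catch-up once it is healthy again.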
Test plan