
Harden Traefik and Alloy remote_write for control room resilience #270

Merged
timtalbot merged 1 commit into main from traefik-resources on Apr 24, 2026

Conversation

@timtalbot
Contributor

Summary

  • Add resource requests/limits and probe tuning to Traefik Helm values to prevent
    node CPU starvation and crash loops under load
  • Add max_shards=3 and max_backoff=5m to Alloy prometheus.remote_write queue_config
    to prevent thundering herd when multiple clusters push to a shared endpoint

Context

Control room Traefik was crash-looping due to unbounded concurrent push connections
from workload Alloy instances. The default max_shards (200) per Alloy pod meant
recovery scenarios created massive connection floods that overwhelmed Traefik.
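To make the scale of the flood concrete, here is a rough back-of-the-envelope sketch. It assumes each shard keeps at most one push request in flight, and the pod count is a hypothetical figure chosen only for illustration; neither number is measured from the affected clusters.

```python
def worst_case_connections(alloy_pods: int, max_shards: int) -> int:
    """Upper bound on concurrent pushes if every shard has one request in flight."""
    return alloy_pods * max_shards


# Hypothetical fleet size, not taken from the real environment.
ALLOY_PODS = 40

before = worst_case_connections(ALLOY_PODS, 200)  # default cited in this PR
after = worst_case_connections(ALLOY_PODS, 3)     # value set by this PR
reduction = 1 - after / before

print(f"before={before} after={after} reduction={reduction:.1%}")
```

The relative reduction does not depend on the assumed pod count, only on the ratio of the two shard limits.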

Changes

  • traefik.py: Add CPU/memory requests+limits (Burstable QoS), increase the liveness
    probe timeout to 5s and failure threshold to 5, and set the readiness failure threshold to 3
  • helm_helpers.go: Add max_shards=3 and max_backoff=5m to both control_room and
    workload prometheus.remote_write queue_config blocks

Test plan

  • Verify Traefik pods get Burstable QoS after deploy
  • Verify Alloy config includes max_shards after ptd ensure on a workload
  • Confirm push latencies stay sub-second under normal load
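For the first check, the following is a minimal sketch of how Kubernetes derives the QoS class for a single-container pod. The `qos_class` helper and the resource quantities in the usage example are illustrative assumptions, not the values merged in this PR, and quantities are compared as literal strings (so "500m" and "0.5" would be treated as different).

```python
def qos_class(resources: dict) -> str:
    """Classify a single-container pod following the Kubernetes QoS rules."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})

    # BestEffort: no CPU or memory request or limit at all.
    if not any(k in requests or k in limits for k in ("cpu", "memory")):
        return "BestEffort"

    # Kubernetes defaults a missing request to the limit, so mirror that here.
    effective_requests = {**limits, **requests}

    # Guaranteed: both CPU and memory have a limit, and request == limit.
    guaranteed = all(
        k in limits and effective_requests[k] == limits[k]
        for k in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"


# Placeholder quantities for illustration only.
example = {
    "requests": {"cpu": "100m", "memory": "128Mi"},
    "limits": {"memory": "512Mi"},
}
print(qos_class(example))
```

On a live cluster the assigned class can be read from the pod's `status.qosClass` field, for example with `kubectl get pod <pod> -o jsonpath='{.status.qosClass}'`.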

@claude

claude Bot commented Apr 23, 2026

Claude finished @timtalbot's task.


Review of PR #270

  • Read review guidelines
  • Analyze PR diff and changed files
  • Create and submit review

Review submitted. Two non-blocking observations:

  1. Consider whether a CPU limit on Traefik would better prevent the node starvation described in the PR context
  2. Verify Traefik Helm chart probe value paths match the deployed chart version


@claude claude Bot left a comment


Review Summary

Overall this is a well-targeted fix: both changes address the root cause (unbounded Alloy shards overwhelming Traefik) from complementary angles. The code is clean and consistent. A few observations:


lib/steps/helm_helpers.go — Alloy queue_config

max_shards = 3 and max_backoff = "5m" are applied consistently in both the control_room and workload remote_write blocks.

Going from the default 200 shards to 3 is a 98.5% reduction. With a 15s scrape interval and typical cardinality, 3 shards should be sufficient for steady state. However, if a workload has unusually high cardinality (many sites/pods), the remote-write queue could fall behind. It is worth confirming post-rollout, via dashboards, that prometheus_remote_storage_samples_pending stays near zero across workloads.
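As a rough illustration of what max_backoff = "5m" means for retries, the sketch below models a retry delay that doubles after each failed attempt and is capped at the maximum. The 30ms starting delay is an assumption about the min_backoff in effect and should be checked against the Alloy version in use; `backoff_schedule` is a hypothetical helper, not Alloy code.

```python
def backoff_schedule(min_backoff_s: float, max_backoff_s: float, retries: int) -> list[float]:
    """Return the delay before each retry: doubling from min, capped at max."""
    delays = []
    delay = min_backoff_s
    for _ in range(retries):
        delays.append(delay)
        delay = min(delay * 2, max_backoff_s)
    return delays


# Assumed 30ms minimum; 300s corresponds to max_backoff = "5m".
schedule = backoff_schedule(0.03, 300.0, 16)
for attempt, delay in enumerate(schedule, start=1):
    print(f"retry {attempt}: wait {delay:.2f}s")
```

Under these assumptions the delay reaches the 5-minute cap only after more than a dozen consecutive failures, so the higher cap mainly affects prolonged outages rather than brief blips.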


traefik.py — Resource requests/limits

  • CPU limit omission: Without a CPU limit, Traefik can still burst to consume all available node CPU during spikes, which is the starvation scenario described in the PR context. A CPU limit (e.g., "cpu": "500m" or "cpu": "1") would cap that burst. If the intent is to allow bursting during legitimate traffic spikes, this is fine, but worth being explicit about the trade-off.

  • Probe chart paths: Verify that livenessProbe and readinessProbe are the correct top-level Traefik Helm chart value paths for the deployed chart version. Some chart versions place probes under deployment.podSpec rather than top-level. If this has been tested against the deployed chart version, disregard.


Changes are correct and consistent with existing patterns. No blocking issues.

@timtalbot timtalbot marked this pull request as ready for review April 23, 2026 17:03
@timtalbot timtalbot requested a review from a team as a code owner April 23, 2026 17:03
@timtalbot timtalbot requested a review from stevenolen April 23, 2026 17:04
@timtalbot timtalbot assigned amdove and unassigned amdove Apr 23, 2026
@timtalbot timtalbot requested a review from amdove April 23, 2026 17:04
@timtalbot timtalbot added this pull request to the merge queue Apr 24, 2026
Merged via the queue into main with commit 7a40689 Apr 24, 2026
10 checks passed
@timtalbot timtalbot deleted the traefik-resources branch April 24, 2026 11:07


3 participants