Harden Traefik and Alloy remote_write for control room resilience by timtalbot · Pull Request #270 · posit-dev/ptd

timtalbot · 2026-04-23T16:57:23Z

Summary

- Add resource requests/limits and probe tuning to Traefik Helm values to prevent
  node CPU starvation and crash loops under load
- Add max_shards=3 and max_backoff=5m to Alloy prometheus.remote_write queue_config
  to prevent thundering herd when multiple clusters push to a shared endpoint

Context

Control room Traefik was crash-looping due to unbounded concurrent push connections
from workload Alloy instances. The default max_shards (200) per Alloy pod meant
recovery scenarios created massive connection floods that overwhelmed Traefik.
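
To make the scale of the flood concrete, here is a rough back-of-the-envelope sketch. It assumes each remote_write shard can hold about one in-flight request, and the fleet size (10 clusters, 2 Alloy pods each) is a made-up figure for illustration, not a number from this PR.

```python
def worst_case_connections(clusters: int, pods_per_cluster: int, max_shards: int) -> int:
    """Rough upper bound on concurrent remote_write requests to one shared endpoint.

    Assumes every shard on every Alloy pod has one request in flight at once.
    """
    return clusters * pods_per_cluster * max_shards


# Hypothetical fleet size, chosen only to illustrate the arithmetic.
before = worst_case_connections(clusters=10, pods_per_cluster=2, max_shards=200)
after = worst_case_connections(clusters=10, pods_per_cluster=2, max_shards=3)
reduction = 1 - after / before

print(before, after, f"{reduction:.1%}")
```

The ratio between the two bounds depends only on max_shards, so the relative reduction holds for any fleet size.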

Changes

- traefik.py: Add CPU/memory requests+limits (Burstable QoS), increase liveness
  probe timeout to 5s and failure threshold to 5, readiness failure threshold to 3
- helm_helpers.go: Add max_shards=3 and max_backoff=5m to both control_room and
  workload prometheus.remote_write queue_config blocks
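
For readers without the diff open, the Traefik side might look roughly like the sketch below. The probe numbers come from the description above. The resource quantities are placeholders, and the top-level `resources`, `livenessProbe`, and `readinessProbe` keys are an assumption about the chart layout that should be checked against the deployed chart version. The sketch sets no CPU limit, matching the review comment on this PR.

```python
# Hypothetical Helm values for Traefik, written as a plain Python dict of the
# kind traefik.py might pass to a Helm release. Not copied from the PR diff.
traefik_values = {
    "resources": {
        # Placeholder quantities; the PR's actual numbers are not shown here.
        "requests": {"cpu": "100m", "memory": "128Mi"},
        # Memory limit only: requests below limits yields Burstable QoS.
        "limits": {"memory": "512Mi"},
    },
    # Probe tuning as described in the PR text.
    "livenessProbe": {"timeoutSeconds": 5, "failureThreshold": 5},
    "readinessProbe": {"failureThreshold": 3},
}
```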

Test plan

- Verify Traefik pods get Burstable QoS after deploy
- Verify Alloy config includes max_shards after ptd ensure on a workload
- Confirm push latencies stay sub-second under normal load
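
As a sanity check before deploying, the expected QoS class can be predicted from the resource block. The following is a simplified sketch of the Kubernetes QoS rules for a single-container pod. The function name is hypothetical, and quantities are compared as plain strings, so "1" and "1000m" would be treated as different.

```python
def qos_class(resources: dict) -> str:
    """Predict the Kubernetes QoS class for a single-container pod (simplified)."""
    requests = dict(resources.get("requests") or {})
    limits = resources.get("limits") or {}

    # No requests and no limits at all: BestEffort.
    if not requests and not limits:
        return "BestEffort"

    # Kubernetes defaults a missing request to the limit for that resource.
    for name, value in limits.items():
        requests.setdefault(name, value)

    # Guaranteed needs both cpu and memory limits, each equal to its request.
    guaranteed = all(
        name in limits and requests.get(name) == limits[name]
        for name in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"


# Requests set, memory limit only: expected to land in Burstable.
print(qos_class({
    "requests": {"cpu": "100m", "memory": "128Mi"},
    "limits": {"memory": "512Mi"},
}))
```

On a live cluster, the class actually assigned is reported in the pod's `status.qosClass` field.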

claude · 2026-04-23T16:57:38Z

Claude finished @timtalbot's task

Review of PR #270

Read review guidelines
Analyze PR diff and changed files
Create and submit review

Review submitted. Two non-blocking observations:

- Consider whether a CPU limit on Traefik would better prevent the node starvation described in the PR context
- Verify Traefik Helm chart probe value paths match the deployed chart version

claude

Review Summary

Overall this is a well-targeted fix: both changes address the root cause (unbounded Alloy shards overwhelming Traefik) from complementary angles. The code is clean and consistent. A few observations:

lib/steps/helm_helpers.go — Alloy queue_config

max_shards = 3 and max_backoff = "5m" are applied consistently in both the control_room and workload remote_write blocks.

Going from the default 200 shards to 3 is a 98.5% reduction. With a 15s scrape interval and typical cardinality, 3 shards should be sufficient for steady-state. However, if a workload has unusually high cardinality (many sites/pods), the remote_write queue could fall behind. Worth confirming post-rollout via dashboards that prometheus_remote_storage_samples_pending stays near zero across workloads.
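
For reference, the queue_config block discussed here would render roughly as below. This is a sketch in Python of what the generated Alloy fragment might look like. The real template lives in helm_helpers.go and is not reproduced here, and the function name is hypothetical.

```python
def render_queue_config(max_shards: int = 3, max_backoff: str = "5m") -> str:
    """Render an Alloy queue_config block for a prometheus.remote_write endpoint.

    Hypothetical helper for illustration; defaults mirror the values in this PR.
    """
    return (
        "queue_config {\n"
        f"  max_shards  = {max_shards}\n"
        f'  max_backoff = "{max_backoff}"\n'
        "}"
    )


print(render_queue_config())
```

Keeping the shard count as a parameter makes a later adjustment a one-argument change, for example `render_queue_config(max_shards=10)`.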

traefik.py — Resource requests/limits

- CPU limit omission: Without a CPU limit, Traefik can still burst to consume all available node CPU during spikes, which is the starvation scenario described in the PR context. A CPU limit (e.g., "cpu": "500m" or "cpu": "1") would cap that burst. If the intent is to allow bursting during legitimate traffic spikes, this is fine, but worth being explicit about the trade-off.
- Probe chart paths: Verify that livenessProbe and readinessProbe are the correct top-level Traefik Helm chart value paths for the deployed chart version. Some chart versions place probes under deployment.podSpec rather than top-level. If this has been tested against the deployed chart version, disregard.
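
One way to check the probe paths is to load the chart's default values and test whether each dotted path exists. The helper below is a minimal sketch. The `chart_defaults` dict is a made-up stand-in for whatever the deployed chart version actually ships, and the function name is hypothetical.

```python
def has_path(values: dict, dotted_path: str) -> bool:
    """Return True if every key along a dotted path exists in nested dicts."""
    node = values
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True


# Made-up stand-in for the chart's default values, not the real chart layout.
chart_defaults = {
    "livenessProbe": {"timeoutSeconds": 2},
    "readinessProbe": {"failureThreshold": 1},
    "deployment": {"enabled": True},
}

for path in ("livenessProbe", "readinessProbe", "deployment.podSpec.livenessProbe"):
    print(path, has_path(chart_defaults, path))
```

A value set under a path the chart does not define is typically ignored without error, which is why the check is worth doing before rollout.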

Changes are correct and consistent with existing patterns. No blocking issues.

Harden Traefik and Alloy remote_write for control room resilience

c0681b0

claude Bot reviewed Apr 23, 2026

timtalbot marked this pull request as ready for review April 23, 2026 17:03

timtalbot requested a review from a team as a code owner April 23, 2026 17:03

timtalbot requested a review from stevenolen April 23, 2026 17:04

timtalbot assigned amdove and unassigned amdove Apr 23, 2026

timtalbot requested a review from amdove April 23, 2026 17:04

stevenolen approved these changes Apr 23, 2026

timtalbot added this pull request to the merge queue Apr 24, 2026

Merged via the queue into main with commit 7a40689 Apr 24, 2026
10 checks passed

timtalbot deleted the traefik-resources branch April 24, 2026 11:07

timtalbot mentioned this pull request Apr 24, 2026

Increase Alloy remote_write max_shards from 3 to 10 #274

Merged

3 tasks

timtalbot merged 1 commit into main from traefik-resources
