fix: add CPU limits to Mimir store-gateway and Traefik #276

Merged
timtalbot merged 1 commit into main from mimir-store-gateway-cpu-limits on Apr 28, 2026

@timtalbot
Contributor

Summary

  • Add resource requests/limits to Mimir store-gateway pods across all deployment targets (control room, AWS workloads, Azure workloads): 100m/512Mi requests, 1 CPU/4Gi limits
  • Add missing CPU limit (1000m) to the Traefik deployment
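
As a rough illustration of the spec described above, the store-gateway values could be expressed as a plain Python dict. This is only a sketch, not the code from this PR's diff: the `store_gateway.resources` key path is assumed from the usual mimir-distributed chart layout, and the constant and function names are hypothetical.

```python
# Hypothetical names; the real definitions live in aws_eks_cluster.py,
# helm_aws.go and helm_azure.go and may be structured differently.
STORE_GATEWAY_RESOURCES = {
    "requests": {"cpu": "100m", "memory": "512Mi"},
    "limits": {"cpu": "1", "memory": "4Gi"},
}


def store_gateway_values(resources=STORE_GATEWAY_RESOURCES):
    """Return a Helm values fragment that sets store-gateway resources.

    The "store_gateway" -> "resources" nesting is an assumption about the
    chart's values schema.
    """
    return {"store_gateway": {"resources": resources}}
```

Keeping the spec in one shared constant is one way to keep the targets identical, although the Go and Python targets in this repo necessarily define it separately.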

Store-gateways running Mimir 2.10.4 exhibit unbounded CPU growth over time, reaching 2-3 cores per pod despite requesting only 100m. This eventually starves colocated pods on the same nodes and causes cascading Traefik proxy failures. A 1 CPU limit caps runaway usage while leaving ample headroom for normal operation (~5-7m of CPU after a restart).

The Traefik CPU limit was applied during a previous incident but the change wasn't committed to IaC.

Test plan

  • Store-gateway CPU limits applied manually to the control room cluster and verified via kubectl top
  • Run ptd ensure --only-steps helm --dry-run on a workload target to verify Helm values render correctly
  • Deploy to a test cluster and confirm store-gateway pods start with the new resource spec
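
The `kubectl top` check in the first item could be scripted along these lines. This is a minimal sketch that assumes the default three-column `kubectl top pod` text output (NAME, CPU(cores), MEMORY(bytes)). The function name is hypothetical, and the sample pod names and figures are made up for illustration; they are not measurements from this PR.

```python
def pods_over_cpu(top_output: str, limit_millicores: int) -> list[str]:
    """Return names of pods whose CPU usage exceeds limit_millicores.

    Expects text shaped like `kubectl top pod` output, header line included.
    """
    over = []
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        name, cpu, _memory = line.split()[:3]
        # CPU is usually printed in millicores ("6m"); handle whole cores too.
        if cpu.endswith("m"):
            millicores = int(cpu[:-1])
        else:
            millicores = int(float(cpu) * 1000)
        if millicores > limit_millicores:
            over.append(name)
    return over


# Illustrative input only (hypothetical pods and values).
SAMPLE = """\
NAME                     CPU(cores)   MEMORY(bytes)
mimir-store-gateway-0    6m           410Mi
mimir-store-gateway-1    2350m        1900Mi
"""
```

With the 1 CPU limit in place, calling `pods_over_cpu(output, 1000)` on real output should come back empty for the store-gateway pods.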

Store-gateways running Mimir 2.10.4 exhibit unbounded CPU growth over
time, starving colocated pods and causing Traefik proxy failures.

- Add resource limits (1 CPU, 4Gi memory) to store-gateway across
  control room (aws_eks_cluster.py), AWS workloads (helm_aws.go),
  and Azure workloads (helm_azure.go)
- Add missing CPU limit (1000m) to Traefik deployment

@claude

claude Bot commented Apr 24, 2026

Claude finished @timtalbot's task

Review Complete

  • Read review guidelines
  • Get PR diff and read changed files
  • Create and submit review

Review submitted with observations on Azure config parity, minor CPU format inconsistency, and a note about self-review comments.

@claude claude Bot left a comment

Review Summary

Changes are correct and consistent across all three deployment targets (AWS Go, Azure Go, Python/EKS). The resource spec (100m/512Mi requests, 1 CPU/4Gi limits) is identical everywhere, and the Traefik CPU limit addition is clean.

Observations:

  1. Azure config parity (helm_azure.go:772-778): The Azure store_gateway doesn't set replicas, zoneAwareReplication, or affinity — unlike helm_aws.go which has all three. This is pre-existing and unrelated to this PR, but worth noting as a follow-up if Azure workloads run multi-replica store-gateways.

  2. Minor CPU format inconsistency: AWS Go uses "cpu": "1" while traefik.py uses "cpu": "1000m". Both are equivalent in Kubernetes but stylistically different. Not blocking.

  3. No self-review comments on the PR — per review guidelines, consider adding inline comments on key decisions (e.g., why 1 CPU / 4Gi was chosen as the limit) for future readers.
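
On observation 2, the equivalence of "1" and "1000m" can be checked with a small normalizer. This is a simplified sketch that only handles plain core counts and the `m` (millicore) suffix, which covers the values in this PR. It is not a full Kubernetes quantity parser, and the function name is hypothetical.

```python
from decimal import Decimal


def cpu_to_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "1", "0.5" or "100m" to millicores."""
    q = quantity.strip()
    if q.endswith("m"):
        return int(q[:-1])
    # Decimal avoids float rounding surprises for fractional core counts.
    return int(Decimal(q) * 1000)


# The two spellings used in this PR resolve to the same value.
assert cpu_to_millicores("1") == cpu_to_millicores("1000m") == 1000
```

A helper like this could back a lint check if the repo ever wants to standardize on one spelling.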

LGTM — the fix directly addresses the described unbounded CPU growth behavior.

@timtalbot timtalbot marked this pull request as ready for review April 24, 2026 21:56
@timtalbot timtalbot requested a review from a team as a code owner April 24, 2026 21:56
@timtalbot timtalbot added this pull request to the merge queue Apr 28, 2026
Merged via the queue into main with commit aed69fa Apr 28, 2026
10 checks passed
@timtalbot timtalbot deleted the mimir-store-gateway-cpu-limits branch April 28, 2026 15:38