fix: add CPU limits to Mimir store-gateway and Traefik #276
Store-gateways running Mimir 2.10.4 exhibit unbounded CPU growth over time, starving colocated pods and causing Traefik proxy failures.

- Add resource limits (1 CPU, 4Gi memory) to store-gateway across control room (aws_eks_cluster.py), AWS workloads (helm_aws.go), and Azure workloads (helm_azure.go)
- Add missing CPU limit (1000m) to Traefik deployment
Claude finished @timtalbot's task. Review complete.
Review submitted with observations on Azure config parity, a minor CPU format inconsistency, and a note about self-review comments.
Review Summary
Changes are correct and consistent across all three deployment targets (AWS Go, Azure Go, Python/EKS). The resource spec (100m/512Mi requests, 1 CPU/4Gi limits) is identical everywhere, and the Traefik CPU limit addition is clean.
Observations:
- Azure config parity (`helm_azure.go:772-778`): The Azure `store_gateway` doesn't set `replicas`, `zoneAwareReplication`, or `affinity`, unlike `helm_aws.go`, which has all three. This is pre-existing and unrelated to this PR, but worth noting as a follow-up if Azure workloads run multi-replica store-gateways.
- Minor CPU format inconsistency: AWS Go uses `"cpu": "1"` while `traefik.py` uses `"cpu": "1000m"`. Both are equivalent in Kubernetes but stylistically different. Not blocking.
- No self-review comments on the PR. Per review guidelines, consider adding inline comments on key decisions (e.g., why 1 CPU / 4Gi was chosen as the limit) for future readers.
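To illustrate why the two CPU spellings in the second observation are equivalent, here is a minimal sketch that normalizes a Kubernetes CPU quantity to millicores. The helper name `cpu_to_millicores` is hypothetical and not part of this PR, and it only handles plain decimal values and the `m` suffix, which covers the values discussed here.

```python
from decimal import Decimal


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "1", "0.5", or "1000m" to millicores.

    Hypothetical helper for illustration; handles only plain decimals
    and the "m" (millicore) suffix.
    """
    q = quantity.strip()
    if q.endswith("m"):
        # "1000m" is already expressed in millicores.
        return int(q[:-1])
    # A bare number is whole (or fractional) cores: 1 core = 1000 millicores.
    return int(Decimal(q) * 1000)


# The two spellings flagged in the review resolve to the same value.
assert cpu_to_millicores("1") == cpu_to_millicores("1000m") == 1000
```

A check like this could back a lint rule that enforces one spelling across the Go and Python configs, if the stylistic difference is ever worth removing.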
LGTM — the fix directly addresses the described unbounded CPU growth behavior.
Summary
Store-gateways running Mimir 2.10.4 exhibit unbounded CPU growth over time (reaching 2-3 cores per pod despite requesting 100m), eventually starving colocated pods on the same nodes and causing cascading Traefik proxy failures. Adding a 1 CPU limit caps runaway usage while leaving ample headroom for normal operation (~5-7m after restart).
The Traefik CPU limit was applied during a previous incident but the change wasn't committed to IaC.
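As a rough sketch of the store-gateway resource spec described in this PR (100m/512Mi requests, 1 CPU/4Gi limits), the values could be expressed as a Python dict like the one below. The `store_gateway.resources` nesting and the `STORE_GATEWAY_VALUES` name are assumptions for illustration; the actual structure in aws_eks_cluster.py, helm_aws.go, and helm_azure.go may differ.

```python
# Hypothetical Helm values fragment; the key layout is an assumption,
# only the request and limit figures come from the PR description.
STORE_GATEWAY_VALUES = {
    "store_gateway": {
        "resources": {
            # Requests are what the scheduler reserves for the pod.
            "requests": {"cpu": "100m", "memory": "512Mi"},
            # Limits cap runaway usage: 1 CPU is 10x the CPU request.
            "limits": {"cpu": "1", "memory": "4Gi"},
        }
    }
}

resources = STORE_GATEWAY_VALUES["store_gateway"]["resources"]
print(resources["limits"])
```

With these figures the CPU limit sits at ten times the request, so the pod can still burst well above its reserved share while no longer reaching the 2-3 cores observed before the change.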
Test plan
- `kubectl top`
- `ptd ensure --only-steps helm --dry-run` on a workload target to verify Helm values render correctly