fix(opencost): treat custom pricing as USD/month and lift memory limit #1637
> 🤖 Generated by the Daily AI Assistant
Two issues caused Headlamp's OpenCost panel to stop reporting costs on prod.
1. **Costs were ~730× too low.** The helm-chart `customPricing.costModel` fields `CPU`, `RAM`, `GPU`, `storage`, `spotCPU`, `spotRAM` are interpreted by OpenCost as **USD per month** and divided by `HoursPerMonth = 730` to derive an hourly rate (`opencost/pkg/cloud/provider/providerconfig.go:188`, `customprovider.go:95`). Our values were authored as hourly ($0.002125/vCPU/hr, $0.000797/GB/hr), so the resulting node prices landed at `node_cpu_hourly_cost = 3e-6` instead of ~$0.0021. Empirical match: `0.002125 / 730 ≈ 2.91e-6`. Allocation queries returned non-zero but vanishingly small values that Headlamp surfaces as "$0.00", i.e. no cost data.
Fix: convert to per-month USD derived from the same CX33 base price ($9.31/server/month, 50/50 CPU-RAM split): CPU = 1.5517, RAM = 0.5819, storage = 0.048 (Hetzner Cloud Volume). Network-egress fields are *not* in the divide-by-730 allowlist (they are per-GB transferred) so they stay at 0.001.
2. **OOMKilled while serving Headlamp's 14d query.** The exporter was capped at 256 MiB; the cost-model engine holds the aggregation window in memory and Headlamp issues `window=14d&aggregate=namespace` and `aggregate=deployment` queries. The pod hit exit code 137 mid-response, producing the intermittent 499s observed in the UI access log and leaving the panel empty after a restart.
Lift the limit to 512 MiB (~7× idle, still under the chart default of 1 GiB).
Validated via `ksail workload validate` (local + prod overlays).
Refs: https://github.com/opencost/opencost/blob/develop/pkg/cloud/provider/providerconfig.go#L188-L210
Refs: https://github.com/opencost/opencost-helm-chart/blob/main/charts/opencost/values.yaml
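The unit mismatch in item 1 can be checked with a few lines of arithmetic. This is a minimal sketch, not OpenCost code: `HOURS_PER_MONTH` mirrors the `HoursPerMonth = 730` constant cited above, and `hourly_rate` is a hypothetical helper that only imitates the divide-by-730 step.

```python
# Illustrative only: imitates the divide-by-730 step described above.
HOURS_PER_MONTH = 730  # mirrors the HoursPerMonth constant cited in the commit message


def hourly_rate(configured_value: float) -> float:
    """Hypothetical helper: treat the configured value as USD per month."""
    return configured_value / HOURS_PER_MONTH


# Value authored as an hourly price (USD per vCPU-hour) but read as monthly.
authored_hourly_cpu = 0.002125
reported = hourly_rate(authored_hourly_cpu)

# How far the reported rate falls below the intended one.
underreport_factor = authored_hourly_cpu / reported

print(f"reported node_cpu_hourly_cost ≈ {reported:.3g}")
print(f"under-reported by a factor of {underreport_factor:.0f}")
```

Any value placed in a field that goes through this conversion has to be a monthly price for the derived hourly rate to come out right.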
Pull request overview
Adjusts the OpenCost HelmRelease configuration so Headlamp’s OpenCost panel reports realistic costs again in prod by correcting custom pricing units and reducing exporter OOM restarts during long-range allocation queries.
Changes:
- Converts `customPricing.costModel` values from hourly-authored numbers to the monthly USD units OpenCost expects (CPU/RAM/storage).
- Adds clarifying documentation in the HelmRelease about OpenCost’s unit conventions and the monthly→hourly conversion behavior.
- Increases `exporter.resources.limits.memory` from 256Mi to 512Mi to avoid OOMKilled responses on 14d allocation queries.
…etzner rates
> 🤖 Generated by the Daily AI Assistant
The previous commit had three input errors that I'd silently inherited
from the original 2025 comment block, even after fixing the per-month
units bug:
1. CX33 has **4 vCPU**, not 3. Verified live: `kubectl get nodes` on
prod reports `cpu: 4` on every CX33.
2. CX33 current monthly cap is **€6.49**, not €8.54 (which my old
hourly-based math implied). Hetzner's price-adjustment doc raised
CX33 from €4.99 → €6.49/month effective 2026-04-01.
3. Hetzner Cloud Volume is now **€0.0572/GB-month**, not €0.044
(same 2026-04-01 adjustment).
Re-derived from the authoritative source price (Hetzner's published cap)
with explicit math:
Server cost = €6.49/month → $7.01/server/month (EUR/USD ≈ 1.08)
50/50 split = $3.504 each for CPU and RAM
CPU = $3.504 / 4 vCPU = $0.876 / vCPU-month
RAM = $3.504 / 8 GB = $0.438 / GB-month
Vol = €0.0572 × 1.08 = $0.0618 / GB-month
End-to-end check (1 vCPU running for 1 hour):
ConfigMap CPU = 0.876 → OpenCost /730 = $0.00120 / vCPU-hour
Real cost = (€6.49 × 0.5 / 4 vCPU / 730 hr) × 1.08
= $0.00120 / vCPU-hour ✓
Validated: `ksail workload validate` succeeds for both clusters/local
and clusters/prod (256 files each).
Refs: https://docs.hetzner.com/general/infrastructure-and-availability/price-adjustment/
Re-checked the cost calculation (this is what /review surfaced)
The conversion-factor fix (730×) was correct, but the input numbers were wrong. Three errors were silently inherited from the original 2025 comment block: the CX33 vCPU count (4, not 3), the CX33 monthly cap (€6.49, not €8.54), and the Hetzner Cloud Volume price (€0.0572/GB-month, not €0.044).
> 🤖 Generated by the Daily AI Assistant
Replaced hand-edited assumptions with values pulled directly from the
Hetzner Cloud Pricing API (`GET /v1/pricing`, location fsn1) and the
ECB EUR→USD reference rate for 2026-05-27.
API-verified (net of VAT):
CX33 cap = €6.49 / server / month (hourly €0.0104)
Volumes = €0.0572 / GB / month
Egress = €1.00 / TB overage (€0.001 / GB; 20 TiB incl. per server)
FX (ECB reference, 2026-05-27): 1.1637 EUR/USD. The previous 1.08
assumption was 8% low.
Re-derived (50/50 CPU-RAM split, unchanged convention):
€6.49 × 1.1637 = $7.5524 / server / month
CPU = $7.5524 × 0.5 / 4 vCPU ≈ $0.9441 / vCPU-month (was 0.876)
RAM = $7.5524 × 0.5 / 8 GB ≈ $0.4720 / GB-month (was 0.438)
Vol = €0.0572 × 1.1637 ≈ $0.0666 / GB-month (was 0.0618)
Egress = €0.001 × 1.1637 ≈ $0.001164 / GB (was 0.001 flat)
End-to-end check (1 vCPU for 1 hour):
ConfigMap CPU = 0.9441 → OpenCost / 730 = $0.001293 / vCPU-hour
Real cost = (€6.49 × 0.5 / 4 vCPU / 730 hr) × 1.1637
= €0.001111 × 1.1637 = $0.001293 / vCPU-hour ✓
Using net (excl. VAT) because VAT is location-dependent (DK = 25%) and
typically reclaimable for businesses; tracking it would conflate
compute cost with tax overhead. Gross is +25% on every figure.
Validated with `ksail workload validate` on both clusters/local and
clusters/prod (256 files each).
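The re-derivation above can be reproduced with a short script. This is a sketch under the commit message's own inputs; the constant names and the `derive_cost_model` function are hypothetical and exist only for this illustration, and the dictionary keys are shorthand rather than the chart's field names.

```python
# Illustrative re-derivation of the custom pricing values from the inputs above.
HOURS_PER_MONTH = 730            # divisor OpenCost applies to monthly prices

# Inputs quoted in the commit message (net of VAT).
CX33_EUR_PER_MONTH = 6.49
VOLUME_EUR_PER_GB_MONTH = 0.0572
EGRESS_EUR_PER_GB = 0.001
EUR_TO_USD = 1.1637              # ECB reference rate, 2026-05-27
VCPUS, RAM_GB = 4, 8             # CX33 shape
CPU_SHARE = 0.5                  # 50/50 CPU-RAM split convention


def derive_cost_model() -> dict[str, float]:
    """Hypothetical helper: convert EUR list prices into USD pricing values."""
    server_usd = CX33_EUR_PER_MONTH * EUR_TO_USD
    return {
        "CPU": server_usd * CPU_SHARE / VCPUS,           # USD per vCPU-month
        "RAM": server_usd * (1 - CPU_SHARE) / RAM_GB,    # USD per GB-month
        "storage": VOLUME_EUR_PER_GB_MONTH * EUR_TO_USD,  # USD per GB-month
        "egress": EGRESS_EUR_PER_GB * EUR_TO_USD,         # USD per GB transferred
    }


model = derive_cost_model()
# End-to-end check: hourly rate OpenCost would derive for one vCPU.
cpu_hourly = model["CPU"] / HOURS_PER_MONTH

for name, value in model.items():
    print(f"{name:8s} {value:.6f}")
print(f"CPU per vCPU-hour {cpu_hourly:.6f}")
```

Keeping the FX rate and the split as named constants makes it cheap to re-run the numbers when Hetzner's prices or the exchange rate change.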
Updated against authoritative live data
Hit the Hetzner Cloud Pricing API directly with the prod HCLOUD_TOKEN and replaced the price-adjustment-doc figures and my hand-picked FX rate with the API's own numbers and the ECB reference rate.
So the EUR base prices were already right. The miss was the FX rate: the ECB EUR→USD reference rate for 2026-05-27 is 1.1637, not the 1.08 I had assumed.
Net vs gross VAT
Using net (excluding the 25% Danish VAT the API also returns). Gross is +25% on every figure, but VAT is location-dependent and typically reclaimable for businesses, so tracking it in OpenCost would conflate compute cost with tax overhead. Reasonable people could argue either side; it is an easy switch if you'd rather see actual bank-account-outflow numbers.
> 🤖 Generated by the Daily AI Assistant
Per review: the request was 55Mi but the exporter idles at ~75Mi (`kubectl top`), so the scheduler under-reserved its baseline and the pod would be a prime eviction candidate under node memory pressure. Raise the request to 128Mi (above observed idle, below the 512Mi limit).
…n warning
> 🤖 Generated by the Daily AI Assistant
The merge-queue "Deploy to Prod" gate (check-event-warnings, fail-on-warning)
failed for this PR on a benign OpenCost rollout artifact:
Killing pod/opencost-…-m9r7j Stopping container opencost
SuccessfulDelete Deleted pod: opencost-…-m9r7j
Warning Unhealthy pod/opencost-…-m9r7j Readiness probe failed:
…:9003/healthz: connect: no route to host
Root cause: the chart default rollout is maxUnavailable:1 on a single replica,
so the old pod is killed the instant the new one is created. kubelet fires one
last readiness probe ~1s after Cilium removes the dead pod's route → "no route
to host". The event lands inside the gate's 90s settle window and trips it,
even though nothing is actually wrong (the pod is gone). Any OpenCost rollout
(incl. this PR's memory/pricing change) reproduces it.
Add a postRenderer that:
1. sets maxUnavailable: 0 (surge new→Ready before old terminates;
zero-downtime, matches the homepage/headlamp convention), and
2. adds preStop.sleep: 15s on the opencost container so it keeps serving
:9003 during drain — kubelet's probes land on a live endpoint instead of a
torn-down route. Native sleep lifecycle action (GA since k8s 1.30; cluster
is 1.32), so no shell is needed in the distroless image.
Validated: `ksail workload validate` (local + prod, 256 files each); the
strategy + preStop.sleep accepted by the live 1.32 API via `kubectl apply
--dry-run=server`.
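To make the failure mode concrete, here is a minimal sketch of a fail-on-warning gate. It is not the real `check-event-warnings` implementation, which is not shown in this PR; the `Event` type, the `gate_passes` function, and the timestamps are all hypothetical. It only illustrates why a single benign `Warning` inside the 90s settle window fails the deploy.

```python
from dataclasses import dataclass

SETTLE_WINDOW_SECONDS = 90  # settle window described in the commit message


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for a Kubernetes event."""
    seconds_after_deploy: float
    type: str    # "Normal" or "Warning"
    reason: str


def gate_passes(events: list[Event]) -> bool:
    """Fail if any Warning event falls inside the settle window."""
    return not any(
        e.type == "Warning" and 0 <= e.seconds_after_deploy <= SETTLE_WINDOW_SECONDS
        for e in events
    )


# Made-up timings for the rollout described above.
rollout = [
    Event(10.0, "Normal", "Killing"),
    Event(10.5, "Normal", "SuccessfulDelete"),
    Event(11.0, "Warning", "Unhealthy"),  # late readiness probe: no route to host
]

print(gate_passes(rollout))      # False: the single Warning trips the gate
print(gate_passes(rollout[:2]))  # True: no Warning events in the window
```

Under this model the only ways to pass are to avoid emitting the event or to move it outside the window, which is why the fix targets the rollout itself.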
Merge-queue failure investigation
The PR-level checks (System Test) pass, but the merge-queue …
1. Homepage cold-start probes (the originally-reported error, 19:58Z run). Not specific to this PR: runs for #1604, #1607, #1608, #1636 all failed identically in the same window. Platform-wide; fixed by #1636 (…
2. OpenCost teardown race (latest run, 21:11Z): this PR's responsibility. Live events were unambiguous: …
Honest caveat: prod reverts to main's OpenCost config between merge_group runs, so the introducing deploy still rolls an old pod that predates the …
Auto-merge enabled; it will re-enter the queue once the System Test is green.
🎉 This PR is included in version 1.12.3 🎉
The release is available on GitHub release.
Your semantic-release bot 📦🚀
Why
Headlamp's OpenCost panel stopped reporting costs on prod. Two issues were stacked.
1. Custom pricing was off by a factor of ~730
The OpenCost helm chart's `customPricing.costModel` fields (`CPU`, `RAM`, `GPU`, `storage`, `spotCPU`, `spotRAM`) are interpreted as USD per month and divided by `HoursPerMonth = 730` to derive an hourly rate. Our values were authored as hourly, so OpenCost reported `node_cpu_hourly_cost = 3e-06` (`0.002125 / 730 ≈ 2.91e-6`). Allocation queries returned values like `$0.00227` for kube-system over 14 days, which Headlamp rounds to `$0.00`, i.e. "no cost data". Conversion is hardcoded in `providerconfig.go:188-210`; chart defaults (`CPU: 1.25`, `RAM: 0.50`) confirm the monthly convention.
2. The exporter was OOMKilled while serving Headlamp's 14d query
256 MiB limit; Headlamp's `window=14d&aggregate=namespace` query holds the window in memory → exit 137 mid-response → intermittent 499s and an empty panel.
What changed
- `costModel.CPU` (USD/vCPU/month)
- `costModel.RAM` (USD/GB/month)
- `costModel.storage` (USD/GB/month)
- `costModel.*NetworkEgress` (USD/GB)
- `exporter.resources.requests.memory`
- `exporter.resources.limits.memory`
- `strategy.maxUnavailable`
- `preStop.sleep: 15`
Hetzner figures are net of VAT, pulled live from `GET https://api.hetzner.cloud/v1/pricing` (fsn1): CX33 = €6.49/server/month, Volume = €0.0572/GB-month, egress = €1.00/TB. FX = ECB EUR→USD reference 2026-05-27 = 1.1637. CX33 spec (4 vCPU / 8 GB / 80 GB) verified via `kubectl get nodes`. 50/50 CPU-RAM split is an allocation convention.
End-to-end check (1 vCPU for 1 hour): ConfigMap CPU `0.9441` → `/730` = `$0.001293/vCPU-hr`; real = `(€6.49 × 0.5 / 4 / 730) × 1.1637 = $0.001293/vCPU-hr` ✓
Merge-queue deploy gate (`check-event-warnings`)
The merge-queue Deploy to Prod job deploys to real prod, waits 90s, and fails on any `Warning` event in the window. This PR's OpenCost config change rolls the pod, and the chart-default `maxUnavailable:1` on a single replica kills the old pod the instant the new one is created; kubelet then fires one last readiness probe ~1s after Cilium removes the dead pod's route → `Unhealthy: …:9003/healthz: connect: no route to host`, tripping the gate. Mitigated with a postRenderer: `maxUnavailable:0` (surge new→Ready before old terminates) + `preStop.sleep:15` (keep the container serving `:9003` during drain so probes land on a live endpoint; native sleep action, GA k8s 1.30, cluster is 1.32).
Validation
`ksail workload validate` on both `clusters/local` and `clusters/prod` (256 files each). Strategy + `preStop.sleep` accepted by the live 1.32 API via `kubectl apply --dry-run=server`. Bug reproduced read-only on prod before the fix (`node_cpu_hourly_cost = 3e-06`, OOMKilled exit 137).
🤖 Generated with Claude Code