fix(opencost): treat custom pricing as USD/month and lift memory limit by devantler · Pull Request #1637 · devantler-tech/platform

devantler · 2026-05-28T16:15:59Z

🤖 Generated by the Daily AI Assistant

Why

Headlamp's OpenCost panel stopped reporting costs on prod. Two issues were stacked.

1. Custom pricing was off by a factor of ~730

The OpenCost helm chart's customPricing.costModel fields (CPU, RAM, GPU, storage, spotCPU, spotRAM) are interpreted as USD per month and divided by HoursPerMonth = 730 to derive an hourly rate. Our values were authored as hourly, so OpenCost reported node_cpu_hourly_cost = 3e-06 (0.002125 / 730 ≈ 2.91e-6). Allocation queries returned values like $0.00227 for kube-system over 14 days, which Headlamp rounds to $0.00 — "no cost data". Conversion is hardcoded in providerconfig.go:188-210; chart defaults (CPU: 1.25, RAM: 0.50) confirm the monthly convention.
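The unit mismatch described above can be reproduced with plain arithmetic. This is a minimal sketch, not OpenCost code: the function names are made up for illustration, and only the divisor 730 and the old CPU value 0.002125 come from the PR description.

```python
HOURS_PER_MONTH = 730  # divisor quoted in the PR description


def effective_hourly_rate(configured_value: float) -> float:
    """Hourly rate that results when the configured value is read as USD/month."""
    return configured_value / HOURS_PER_MONTH


def monthly_from_hourly(hourly_usd: float) -> float:
    """Value to configure so that the derived hourly rate equals hourly_usd."""
    return hourly_usd * HOURS_PER_MONTH


old_cpu = 0.002125  # authored as USD/vCPU/hour, but read as USD/vCPU/month

misread_rate = effective_hourly_rate(old_cpu)
roundtrip_rate = effective_hourly_rate(monthly_from_hourly(old_cpu))

print(f"misread:   {misread_rate:.3g} USD/vCPU-hour")
print(f"roundtrip: {roundtrip_rate:.6f} USD/vCPU-hour")
```

The misread rate is the hourly figure divided by 730 a second time, which is why the reported costs shrank by roughly that factor.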

2. The exporter was OOMKilled while serving Headlamp's 14d query

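The exporter had a 256 MiB memory limit. Serving Headlamp's window=14d&aggregate=namespace query requires the cost-model engine to hold the aggregation window in memory, so the pod was OOMKilled (exit 137) mid-response, producing intermittent 499s and an empty panel.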
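For readers unfamiliar with exit 137: container runtimes and shells report a process killed by signal N as exit code 128 + N, so 137 means signal 9. A small sketch, assuming a POSIX platform (the helper name is illustrative):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode the 128 + signal-number convention used for killed processes."""
    if code > 128:
        return f"killed by {signal.Signals(code - 128).name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

SIGKILL is what the kernel OOM killer sends, which matches the OOMKilled status on the pod.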

What changed

Pricing values were revised to use Hetzner-Pricing-API figures + the ECB FX rate; the table reflects the final values in the branch.

| Setting | Before | After | Source of "after" |
|---|---|---|---|
| `costModel.CPU` (USD/vCPU/month) | 0.002125 (hourly!) | 0.9441 | €6.49/mo ÷ 4 vCPU × 0.5 × 1.1637 |
| `costModel.RAM` (USD/GB/month) | 0.000797 (hourly!) | 0.4720 | €6.49/mo ÷ 8 GB × 0.5 × 1.1637 |
| `costModel.storage` (USD/GB/month) | 0.00008 (hourly!) | 0.0666 | Volume €0.0572/GB × 1.1637 |
| `costModel.*NetworkEgress` (USD/GB) | 0.001 | 0.001164 | €1.00/TB overage × 1.1637 |
| `exporter.resources.requests.memory` | 55Mi | 128Mi | above observed ~75Mi idle |
| `exporter.resources.limits.memory` | 256Mi | 512Mi | ~4× request; chart default is 1Gi |
| rollout `strategy.maxUnavailable` | 1 (chart default) | 0 (+ `preStop.sleep: 15`) | see "Merge-queue deploy gate" below |

Hetzner figures are net of VAT, pulled live from GET https://api.hetzner.cloud/v1/pricing (fsn1): CX33 = €6.49/server/month, Volume = €0.0572/GB-month, egress = €1.00/TB. FX = ECB EUR→USD reference 2026-05-27 = 1.1637. CX33 spec (4 vCPU / 8 GB / 80 GB) verified via kubectl get nodes. 50/50 CPU-RAM split is an allocation convention.

End-to-end check (1 vCPU for 1 hour): ConfigMap CPU 0.9441 → /730 = $0.001293/vCPU-hr; real = (€6.49 × 0.5 / 4 / 730) × 1.1637 = $0.001293/vCPU-hr ✓
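The derivation of the final values can be sketched as a small function. All inputs are the figures quoted above; the function, its parameter names, and the `CostModel` type are illustrative and not part of the chart or of OpenCost. Egress is converted at 1 TB = 1000 GB, as the PR does.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # divisor quoted in the PR description


@dataclass(frozen=True)
class CostModel:
    cpu: float      # USD per vCPU per month
    ram: float      # USD per GB per month
    storage: float  # USD per GB per month
    egress: float   # USD per GB transferred


def derive_cost_model(
    server_eur_month: float,
    vcpus: int,
    ram_gb: int,
    volume_eur_gb_month: float,
    egress_eur_tb: float,
    eur_to_usd: float,
    cpu_share: float = 0.5,  # the 50/50 CPU-RAM allocation convention
) -> CostModel:
    server_usd = server_eur_month * eur_to_usd
    return CostModel(
        cpu=server_usd * cpu_share / vcpus,
        ram=server_usd * (1 - cpu_share) / ram_gb,
        storage=volume_eur_gb_month * eur_to_usd,
        egress=egress_eur_tb / 1000 * eur_to_usd,
    )


model = derive_cost_model(6.49, 4, 8, 0.0572, 1.00, 1.1637)
cpu_hourly = model.cpu / HOURS_PER_MONTH

print(model)
print(f"{cpu_hourly:.6f} USD/vCPU-hour")
```

Changing `cpu_share` or `eur_to_usd` shows how sensitive the four values are to the allocation convention and to the FX snapshot.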

Merge-queue deploy gate (`check-event-warnings`)

The merge-queue Deploy to Prod job deploys to real prod, waits 90s, and fails on any Warning event in the window. This PR's OpenCost config change rolls the pod, and the chart-default maxUnavailable:1 on a single replica kills the old pod the instant the new one is created. kubelet then fires one last readiness probe ~1s after Cilium removes the dead pod's route → Unhealthy: …:9003/healthz: connect: no route to host, tripping the gate. Mitigated with a postRenderer: maxUnavailable:0 (surge new→Ready before old terminates) + preStop.sleep:15 (keep the container serving :9003 during drain so probes land on a live endpoint; native sleep action, enabled by default since k8s 1.30, cluster is 1.32).
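Why maxUnavailable matters on a single replica follows from the rolling-update bounds: a Deployment keeps at least replicas − maxUnavailable pods available and runs at most replicas + maxSurge pods in total. A minimal sketch; the helper is illustrative, and maxSurge = 1 is an assumption, since the PR does not state the chart's surge setting.

```python
def rollout_bounds(replicas: int, max_unavailable: int, max_surge: int) -> tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    min_available = max(replicas - max_unavailable, 0)
    max_total = replicas + max_surge
    return min_available, max_total


# Assumption: maxSurge = 1 in both cases.
chart_default = rollout_bounds(replicas=1, max_unavailable=1, max_surge=1)
this_pr = rollout_bounds(replicas=1, max_unavailable=0, max_surge=1)

print(chart_default)  # the old pod may be removed before the new one is Ready
print(this_pr)        # the old pod must stay until the new one is Ready
```

With a floor of zero available pods, the controller is free to terminate the old pod immediately, which is the window in which the stray readiness probe fires.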

The homepage probe warnings seen in earlier merge-queue runs of this PR are a separate, platform-wide issue (every PR's deploy gate hits them), fixed by #1636 (initialDelaySeconds). Not this PR's concern.

Validation

ksail workload validate on both clusters/local and clusters/prod (256 files each). Strategy + preStop.sleep accepted by the live 1.32 API via kubectl apply --dry-run=server. Bug reproduced read-only on prod before the fix (node_cpu_hourly_cost = 3e-06, OOMKilled exit 137).

🤖 Generated with Claude Code

> 🤖 Generated by the Daily AI Assistant
>
> Two issues caused Headlamp's OpenCost panel to stop reporting costs on prod.
>
> 1. **Costs were ~730× too low.** The helm-chart `customPricing.costModel` fields `CPU`, `RAM`, `GPU`, `storage`, `spotCPU`, `spotRAM` are interpreted by OpenCost as **USD per month** and divided by `HoursPerMonth = 730` to derive an hourly rate (`opencost/pkg/cloud/provider/providerconfig.go:188`, `customprovider.go:95`). Our values were authored as hourly ($0.002125/vCPU/hr, $0.000797/GB/hr), so the resulting node prices landed at `node_cpu_hourly_cost = 3e-6` instead of ~$0.0021. Empirical match: `0.002125 / 730 ≈ 2.91e-6`. Allocation queries returned non-zero but vanishingly small values that Headlamp surfaces as "$0.00", i.e. no cost data.
>
>    Fix: convert to per-month USD derived from the same CX33 base price ($9.31/server/month, 50/50 CPU-RAM split): CPU = 1.5517, RAM = 0.5819, storage = 0.048 (Hetzner Cloud Volume). Network-egress fields are *not* in the divide-by-730 allowlist (they are per-GB transferred) so they stay at 0.001.
>
> 2. **OOMKilled while serving Headlamp's 14d query.** The exporter was capped at 256 MiB; the cost-model engine holds the aggregation window in memory and Headlamp issues `window=14d&aggregate=namespace` and `aggregate=deployment` queries. The pod hit exit code 137 mid-response, producing the intermittent 499s observed in the UI access log and leaving the panel empty after a restart. Lift the limit to 512 MiB (~7× idle, still under the chart default of 1 GiB).
>
> Validated via `ksail workload validate` (local + prod overlays).
>
> Refs: https://github.com/opencost/opencost/blob/develop/pkg/cloud/provider/providerconfig.go#L188-L210
> Refs: https://github.com/opencost/opencost-helm-chart/blob/main/charts/opencost/values.yaml

Copilot

Pull request overview

Adjusts the OpenCost HelmRelease configuration so Headlamp’s OpenCost panel reports realistic costs again in prod by correcting custom pricing units and reducing exporter OOM restarts during long-range allocation queries.

Changes:

Converts customPricing.costModel values from hourly-authored numbers to the monthly USD units OpenCost expects (CPU/RAM/storage).
Adds clarifying documentation in the HelmRelease about OpenCost’s unit conventions and the monthly→hourly conversion behavior.
Increases exporter.resources.limits.memory from 256Mi to 512Mi to avoid OOMKilled responses on 14d allocation queries.

…etzner rates

> 🤖 Generated by the Daily AI Assistant
>
> The previous commit had three input errors that I'd silently inherited from the original 2025 comment block, even after fixing the per-month units bug:
>
> 1. CX33 has **4 vCPU**, not 3. Verified live: `kubectl get nodes` on prod reports `cpu: 4` on every CX33.
> 2. CX33 current monthly cap is **€6.49**, not €8.54 (which my old hourly-based math implied). Hetzner's price-adjustment doc raised CX33 from €4.99 → €6.49/month effective 2026-04-01.
> 3. Hetzner Cloud Volume is now **€0.0572/GB-month**, not €0.044 (same 2026-04-01 adjustment).
>
> Re-derived from the authoritative source price (Hetzner's published cap) with explicit math:
>
> Server cost = €6.49/month → $7.01/server/month (EUR/USD ≈ 1.08)
> 50/50 split = $3.504 each for CPU and RAM
> CPU = $3.504 / 4 vCPU = $0.876 / vCPU-month
> RAM = $3.504 / 8 GB = $0.438 / GB-month
> Vol = €0.0572 × 1.08 = $0.0618 / GB-month
>
> End-to-end check (1 vCPU running for 1 hour):
>
> ConfigMap CPU = 0.876 → OpenCost /730 = $0.00120 / vCPU-hour
> Real cost = (€6.49 × 0.5 / 4 vCPU / 730 hr) × 1.08 = $0.00120 / vCPU-hour ✓
>
> Validated: `ksail workload validate` succeeds for both clusters/local and clusters/prod (256 files each).
>
> Refs: https://docs.hetzner.com/general/infrastructure-and-availability/price-adjustment/

devantler · 2026-05-28T17:29:17Z

🤖 Generated by the Daily AI Assistant

Re-checked the cost calculation (this is what /review surfaced)

The conversion-factor fix (730×) was correct, but the input numbers were wrong. Three errors silently inherited from the original 2025 comment block:

| # | Field | Previous PR value | Actual | Source |
|---|---|---|---|---|
| 1 | CX33 vCPU count | 3 vCPU (comment said so) | 4 vCPU | `kubectl get nodes` on prod: `cpu: 4` on every CX33 |
| 2 | CX33 monthly cap | €0.0117/hr ⇒ €8.54/month | €6.49/month | Hetzner price-adjustment doc (post 2026-04-01) |
| 3 | Volume per GB-month | €0.044 ⇒ $0.048 | €0.0572 ⇒ $0.0618 | Same doc, same date |

Pushed 8830296a with the corrected derivation:

Server cost  = €6.49/month  → $7.01/server/month  (EUR/USD ≈ 1.08)
50/50 split  = $3.504 each for CPU and RAM
CPU = $3.504 / 4 vCPU = $0.876 / vCPU-month
RAM = $3.504 / 8 GB   = $0.438 / GB-month
Vol = €0.0572 × 1.08  = $0.0618 / GB-month

End-to-end check (1 vCPU running for 1 hour):

ConfigMap CPU = 0.876 → OpenCost / 730 = $0.001200 / vCPU-hour
Real cost     = (€6.49 × 0.5 / 4 vCPU / 730 hr) × 1.08
              = $0.001200 / vCPU-hour ✓

Known limitations (out of scope of this fix)

50/50 CPU-RAM split is a convention, not a Hetzner-published breakdown. Cloud providers don't typically publish a per-resource split; a common alternative is 70/30 (CPU-heavy, as in EC2 reserved-instance breakdowns). I kept 50/50 to minimise scope vs. the original author's choice, but happy to revisit if you want a different model.
EUR/USD = 1.08 is a mid-2026 approximation, not a live rate. OpenCost has no FX tracking, so any single chosen rate is a snapshot. If it drifts ±5%, total costs scale linearly.
Autoscaler pool cx23 (2 vCPU / 4 GB / €3.99 cap) gets the same per-vCPU/per-GB rates from this config since OpenCost custom pricing is uniform per resource. Estimate for a cx23 server at our rates: 0.876 × 2 + 0.438 × 4 = $3.50/month ≈ €3.24 vs. €3.99 actual — about 19% under, acceptable for an estimate.
Storage rate uses Hetzner Cloud Volume pricing. Longhorn replicas on those volumes will multiply observed PVC cost by the replica count; this is OpenCost's standard behaviour for replicated storage and not a math error.
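The cx23 estimate in the limitations above can be checked with a short sketch. The rates and the 1.08 FX figure are the ones used in this comment; the helper name is illustrative.

```python
CPU_RATE = 0.876   # USD per vCPU per month (this comment's value)
RAM_RATE = 0.438   # USD per GB per month (this comment's value)
EUR_TO_USD = 1.08  # FX approximation used in this comment


def estimated_server_usd(vcpus: int, ram_gb: int) -> float:
    """Monthly cost OpenCost would attribute to a fully allocated server."""
    return CPU_RATE * vcpus + RAM_RATE * ram_gb


cx33_usd = estimated_server_usd(4, 8)    # the rates were derived from this server
cx23_usd = estimated_server_usd(2, 4)    # autoscaler pool
cx23_eur = cx23_usd / EUR_TO_USD
cx23_shortfall = 1 - cx23_eur / 3.99     # versus the €3.99 cap

print(f"cx33: ${cx33_usd:.2f}")
print(f"cx23: ${cx23_usd:.2f} = €{cx23_eur:.2f}, {cx23_shortfall:.0%} under actual")
```

Because the rates are uniform per resource, any server type whose price is not proportional to the CX33's will be over- or under-estimated in the same way.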

> 🤖 Generated by the Daily AI Assistant
>
> Replaced hand-edited assumptions with values pulled directly from the Hetzner Cloud Pricing API (`GET /v1/pricing`, location fsn1) and the ECB EUR→USD reference rate for 2026-05-27.
>
> API-verified (net of VAT):
>
> CX33 cap = €6.49 / server / month (hourly €0.0104)
> Volumes = €0.0572 / GB / month
> Egress = €1.00 / TB overage (€0.001 / GB; 20 TiB incl. per server)
>
> FX (ECB reference, 2026-05-27): 1.1637 EUR/USD. The previous 1.08 assumption was about 7% low.
>
> Re-derived (50/50 CPU-RAM split, unchanged convention):
>
> €6.49 × 1.1637 = $7.5524 / server / month
> CPU = $7.5524 × 0.5 / 4 vCPU ≈ $0.9441 / vCPU-month (was 0.876)
> RAM = $7.5524 × 0.5 / 8 GB ≈ $0.4720 / GB-month (was 0.438)
> Vol = €0.0572 × 1.1637 ≈ $0.0666 / GB-month (was 0.0618)
> Egress = €0.001 × 1.1637 ≈ $0.001164 / GB (was 0.001 flat)
>
> End-to-end check (1 vCPU for 1 hour):
>
> ConfigMap CPU = 0.9441 → OpenCost / 730 = $0.001293 / vCPU-hour
> Real cost = (€6.49 × 0.5 / 4 vCPU / 730 hr) × 1.1637 = €0.001111 × 1.1637 = $0.001293 / vCPU-hour ✓
>
> Using net (excl. VAT) because VAT is location-dependent (DK = 25%) and typically reclaimable for businesses; tracking it would conflate compute cost with tax overhead. Gross is +25% on every figure.
>
> Validated with `ksail workload validate` on both clusters/local and clusters/prod (256 files each).

devantler · 2026-05-28T17:32:22Z

🤖 Generated by the Daily AI Assistant

Updated against authoritative live data

Hit the Hetzner Cloud Pricing API directly with the prod HCLOUD_TOKEN and replaced the price-adjustment-doc figures + my hand-picked FX rate with the API's own numbers + the ECB reference rate.

Hetzner API (GET /v1/pricing, location fsn1, net of 25% VAT):

| Item | Value | My PR was |
|---|---|---|
| CX33 monthly cap | €6.49 | €6.49 ✓ |
| CX33 hourly | €0.0104 | €0.0104 ✓ |
| Volume / GB-month | €0.0572 | €0.0572 ✓ |
| Egress per TB | €1.00 (= €0.001/GB) | €0.001 ✓ |
| Included egress | 20 TiB / server | mentioned ✓ |

So the EUR base prices were already right. The miss was the FX rate:

ECB EUR→USD reference rate (2026-05-27): 1.1637. I was using 1.08, about 7% low.
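The size of the FX miss can be checked directly from the two rates quoted above. A minimal sketch; the variable names are illustrative.

```python
OLD_RATE = 1.08    # hand-picked approximation from the earlier commit
ECB_RATE = 1.1637  # ECB EUR→USD reference rate for 2026-05-27

# How far the old rate sits below the reference rate.
shortfall_vs_ecb = (ECB_RATE - OLD_RATE) / ECB_RATE

# How much every USD figure grows when switching to the reference rate.
uplift_from_old = ECB_RATE / OLD_RATE - 1

print(f"old rate is {shortfall_vs_ecb:.1%} below the ECB rate")
print(f"USD figures rise by {uplift_from_old:.1%}")
```

Since every USD value is a EUR price multiplied by the rate, all four pricing fields scale by the same uplift.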

Re-derived (50/50 split, unchanged):

€6.49/server/month × 1.1637 = $7.5524/server/month
  CPU = $7.5524 × 0.5 / 4 vCPU  ≈ $0.9441/vCPU-month   (was 0.876)
  RAM = $7.5524 × 0.5 / 8 GB    ≈ $0.4720/GB-month     (was 0.438)
  Vol = €0.0572 × 1.1637        ≈ $0.0666/GB-month     (was 0.0618)
  Egress = €0.001 × 1.1637      ≈ $0.001164/GB         (was flat $0.001)

End-to-end check (1 vCPU running for 1 hour):

ConfigMap CPU = 0.9441 → OpenCost / 730 = $0.001293/vCPU-hour
Real cost     = (€6.49 × 0.5 / 4 vCPU / 730 hr) × 1.1637
              =  €0.001111 × 1.1637 = $0.001293/vCPU-hour ✓

Net vs gross VAT

Using net (excluding the 25% Danish VAT the API also returns). Gross is +25% on every figure, but VAT is location-dependent and typically reclaimable for businesses, so tracking it in OpenCost would conflate compute cost with tax overhead. Reasonable people could argue either side — easy switch if you'd rather see actual bank-account-outflow numbers.
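For anyone who prefers gross figures, the switch is a single multiplication. A sketch using the net values from this comment and the 25% Danish VAT rate; the dictionary keys are illustrative labels, not chart field names.

```python
VAT_RATE = 0.25  # Danish VAT, as stated above

net_usd = {
    "cpu": 0.9441,       # USD per vCPU per month
    "ram": 0.4720,       # USD per GB per month
    "storage": 0.0666,   # USD per GB per month
    "egress": 0.001164,  # USD per GB transferred
}

gross_usd = {name: value * (1 + VAT_RATE) for name, value in net_usd.items()}

for name, value in gross_usd.items():
    print(f"{name}: {value:.6f}")
```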

Pushed cb791c6f.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

> 🤖 Generated by the Daily AI Assistant
>
> Per review: the request was 55Mi but the exporter idles ~75Mi (kubectl top), so the scheduler under-reserved its baseline and the pod would be a prime eviction candidate under node memory pressure. Raise the request to 128Mi (above observed idle, below the 512Mi limit).

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

…n warning

> 🤖 Generated by the Daily AI Assistant
>
> The merge-queue "Deploy to Prod" gate (check-event-warnings, fail-on-warning) failed for this PR on a benign OpenCost rollout artifact:
>
> Killing pod/opencost-…-m9r7j Stopping container opencost
> SuccessfulDelete Deleted pod: opencost-…-m9r7j
> Warning Unhealthy pod/opencost-…-m9r7j Readiness probe failed: …:9003/healthz: connect: no route to host
>
> Root cause: the chart default rollout is maxUnavailable:1 on a single replica, so the old pod is killed the instant the new one is created. kubelet fires one last readiness probe ~1s after Cilium removes the dead pod's route → "no route to host". The event lands inside the gate's 90s settle window and trips it, even though nothing is actually wrong (the pod is gone). Any OpenCost rollout (incl. this PR's memory/pricing change) reproduces it.
>
> Add a postRenderer that:
>
> 1. sets maxUnavailable: 0 (surge new→Ready before old terminates; zero-downtime, matches the homepage/headlamp convention), and
> 2. adds preStop.sleep: 15s on the opencost container so it keeps serving :9003 during drain; kubelet's probes land on a live endpoint instead of a torn-down route. Native sleep lifecycle action (enabled by default since k8s 1.30; cluster is 1.32), so no shell is needed in the distroless image.
>
> Validated: `ksail workload validate` (local + prod, 256 files each); the strategy + preStop.sleep accepted by the live 1.32 API via `kubectl apply --dry-run=server`.

devantler · 2026-05-28T21:36:03Z

🤖 Generated by the Daily AI Assistant

Merge-queue failure investigation

The PR-level checks (System Test) pass, but the merge-queue merge_group run failed — that runs a different job, 🚀 Deploy to Prod, which deploys to real prod, waits 90s, and fails on any Warning event in the window (.github/actions/check-event-warnings, fail-on-warning: true). Checking the merge_group job logs (not the PR checks) showed two distinct causes across runs:

1. Homepage cold-start probes (the originally-reported error, 19:58Z run). Not specific to this PR — runs for #1604, #1607, #1608, #1636 all failed identically in the same window. Platform-wide; fixed by #1636 (startupProbe: initialDelaySeconds: 20), which is auto-merging. Live prod homepage has since reverted to chart-default and is stable, and this PR doesn't touch homepage — so it won't roll homepage on deploy regardless.

2. OpenCost teardown race (latest run, 21:11Z) — this PR's responsibility. Live events were unambiguous:
```
Killing pod/opencost-…-m9r7j Stopping container opencost
SuccessfulDelete Deleted pod: opencost-…-m9r7j
Warning Unhealthy pod/opencost-…-m9r7j Readiness probe failed: …:9003/healthz: connect: no route to host
```
The chart-default maxUnavailable:1 on a single replica kills the old pod the instant the new one is created; kubelet fires a last readiness probe ~1s after Cilium tears down the dead pod's route. Fixed in this PR with maxUnavailable:0 + preStop.sleep:15 (commit 28d83b6c).

Honest caveat: prod reverts to main's OpenCost config between merge_group runs, so the introducing deploy still rolls an old pod that predates the preStop hook. maxUnavailable:0 makes that termination ordered (new Ready first) rather than the current abrupt simultaneous kill, which sharply reduces — but may not 100% eliminate — the one-shot teardown probe. If it recurs, a re-run is clean. The fully deterministic alternative is a gate-level fix (ignore Unhealthy warnings on already-deleted pods); flagging for a possible follow-up.
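The gate-level alternative mentioned above could look roughly like the following. This is a hypothetical sketch only: the `Event` shape, the function name, and the pod names are invented for illustration and do not reflect how `.github/actions/check-event-warnings` is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    type: str    # "Normal" or "Warning"
    reason: str  # e.g. "Unhealthy", "Killing", "BackOff"
    pod: str     # name of the involved pod


def blocking_warnings(events: list[Event], live_pods: set[str]) -> list[Event]:
    """Return the Warning events that should still fail the gate."""
    blocking = []
    for event in events:
        if event.type != "Warning":
            continue
        # A failed probe on a pod that no longer exists is a teardown artifact.
        if event.reason == "Unhealthy" and event.pod not in live_pods:
            continue
        blocking.append(event)
    return blocking


events = [
    Event("Normal", "Killing", "opencost-old"),
    Event("Warning", "Unhealthy", "opencost-old"),  # pod already deleted
    Event("Warning", "Unhealthy", "opencost-new"),  # pod still running
]
remaining = blocking_warnings(events, live_pods={"opencost-new"})
print(remaining)
```

Only probe failures on pods that still exist would trip the gate, so a genuinely unhealthy new pod is still caught.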

Auto-merge enabled; it will re-enter the queue once the System Test is green.

botantler · 2026-05-28T22:07:28Z

🎉 This PR is included in version 1.12.3 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Copilot AI review requested due to automatic review settings May 28, 2026 16:16

github-project-automation Bot added this to 🌊 Project Board May 28, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board May 28, 2026

Copilot started reviewing on behalf of devantler May 28, 2026 16:16 View session

devantler temporarily deployed to ci May 28, 2026 16:16 — with GitHub Actions Inactive

Copilot AI reviewed May 28, 2026

Comment thread k8s/bases/infrastructure/controllers/opencost/helm-release.yaml Outdated

devantler had a problem deploying to ci May 28, 2026 17:29 — with GitHub Actions Error

Copilot AI review requested due to automatic review settings May 28, 2026 17:32

Copilot started reviewing on behalf of devantler May 28, 2026 17:32 View session

devantler had a problem deploying to ci May 28, 2026 17:33 — with GitHub Actions Error

Copilot AI reviewed May 28, 2026

Comment thread k8s/bases/infrastructure/controllers/opencost/helm-release.yaml

Comment thread k8s/bases/infrastructure/controllers/opencost/helm-release.yaml Outdated

devantler temporarily deployed to ci May 28, 2026 17:38 — with GitHub Actions Inactive

devantler marked this pull request as ready for review May 28, 2026 17:41

Copilot AI review requested due to automatic review settings May 28, 2026 17:41

Copilot started reviewing on behalf of devantler May 28, 2026 17:41 View session

devantler enabled auto-merge May 28, 2026 17:42

Copilot AI reviewed May 28, 2026

devantler added this pull request to the merge queue May 28, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026

devantler added this pull request to the merge queue May 28, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026

botantler Bot approved these changes May 28, 2026

botantler Bot enabled auto-merge May 28, 2026 21:31

devantler temporarily deployed to ci May 28, 2026 21:31 — with GitHub Actions Inactive

botantler Bot added this pull request to the merge queue May 28, 2026

Merged via the queue into main with commit 68557c0 May 28, 2026
9 checks passed

botantler Bot deleted the claude/magical-roentgen-fcd554 branch May 28, 2026 22:07

github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board May 28, 2026

botantler Bot added the released label May 28, 2026

devantler mentioned this pull request May 28, 2026

ci: discount Unhealthy probe warnings on deleted pods in the deploy gate #1646

Closed

botantler[bot] merged 5 commits into main from claude/magical-roentgen-fcd554
