
fix(opencost): treat custom pricing as USD/month and lift memory limit #1637

Merged: botantler[bot] merged 5 commits into main from claude/magical-roentgen-fcd554 on May 28, 2026

@devantler
Contributor

@devantler devantler commented May 28, 2026

🤖 Generated by the Daily AI Assistant

Why

Headlamp's OpenCost panel stopped reporting costs on prod. Two issues were stacked.

1. Custom pricing was off by a factor of ~730

The OpenCost helm chart's customPricing.costModel fields (CPU, RAM, GPU, storage, spotCPU, spotRAM) are interpreted as USD per month and divided by HoursPerMonth = 730 to derive an hourly rate. Our values were authored as hourly, so OpenCost reported node_cpu_hourly_cost = 3e-06 (0.002125 / 730 ≈ 2.91e-6). Allocation queries returned values like $0.00227 for kube-system over 14 days, which Headlamp rounds to $0.00 — "no cost data". Conversion is hardcoded in providerconfig.go:188-210; chart defaults (CPU: 1.25, RAM: 0.50) confirm the monthly convention.
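The unit mismatch can be reproduced with a few lines of arithmetic. This is a minimal sketch, not OpenCost's code: the helper name `effective_hourly_rate` is made up for illustration, and only the divisor 730 and the old CPU value come from the description above.

```python
HOURS_PER_MONTH = 730  # divisor named in the PR description


def effective_hourly_rate(configured_value: float) -> float:
    """Hourly rate that results when the configured value is read as USD/month."""
    return configured_value / HOURS_PER_MONTH


# The old CPU value was authored as an hourly rate but interpreted as monthly.
intended_hourly = 0.002125
misread_hourly = effective_hourly_rate(intended_hourly)

print(f"{misread_hourly:.2e}")  # 2.91e-06
```

The printed value matches the `node_cpu_hourly_cost` observed on prod, which is what points to the monthly interpretation.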

2. The exporter was OOMKilled while serving Headlamp's 14d query

The exporter was capped at 256 MiB. Headlamp's window=14d&aggregate=namespace query makes the cost-model engine hold the whole window in memory, so the pod was killed with exit code 137 mid-response, producing intermittent 499s and an empty panel.

What changed

Pricing values were revised to use Hetzner-Pricing-API figures + the ECB FX rate; the table reflects the final values in the branch.

| Setting | Before | After | Source of "after" |
| --- | --- | --- | --- |
| costModel.CPU (USD/vCPU/month) | 0.002125 (hourly!) | 0.9441 | €6.49/mo ÷ 4 vCPU × 0.5 × 1.1637 |
| costModel.RAM (USD/GB/month) | 0.000797 (hourly!) | 0.4720 | €6.49/mo ÷ 8 GB × 0.5 × 1.1637 |
| costModel.storage (USD/GB/month) | 0.00008 (hourly!) | 0.0666 | Volume €0.0572/GB × 1.1637 |
| costModel.*NetworkEgress (USD/GB) | 0.001 | 0.001164 | €1.00/TB overage × 1.1637 |
| exporter.resources.requests.memory | 55Mi | 128Mi | above observed ~75Mi idle |
| exporter.resources.limits.memory | 256Mi | 512Mi | ~4× request; chart default is 1Gi |
| rollout strategy.maxUnavailable | 1 (chart default) | 0 (+ preStop.sleep: 15) | see "Merge-queue deploy gate" below |

Hetzner figures are net of VAT, pulled live from GET https://api.hetzner.cloud/v1/pricing (fsn1): CX33 = €6.49/server/month, Volume = €0.0572/GB-month, egress = €1.00/TB. FX = ECB EUR→USD reference 2026-05-27 = 1.1637. CX33 spec (4 vCPU / 8 GB / 80 GB) verified via kubectl get nodes. 50/50 CPU-RAM split is an allocation convention.

End-to-end check (1 vCPU for 1 hour): ConfigMap CPU 0.9441/730 = $0.001293/vCPU-hr; real = (€6.49 × 0.5 / 4 / 730) × 1.1637 ≈ $0.001293/vCPU-hr, so the two agree.
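The derivation in this section can be replayed as a short script. This is a sketch under the assumptions stated in the description (CX33 price, volume price, egress price, ECB rate, 50/50 split); the dictionary keys are illustrative labels, not the chart's field names.

```python
# Inputs quoted in the PR description (Hetzner fsn1, net of VAT; ECB rate 2026-05-27).
CX33_EUR_PER_MONTH = 6.49
VOLUME_EUR_PER_GB_MONTH = 0.0572
EGRESS_EUR_PER_GB = 0.001        # EUR 1.00 per TB overage
EUR_TO_USD = 1.1637
VCPUS, RAM_GB = 4, 8             # CX33 spec
CPU_SHARE = 0.5                  # 50/50 CPU-RAM allocation convention
HOURS_PER_MONTH = 730

server_usd = CX33_EUR_PER_MONTH * EUR_TO_USD

# Keys are illustrative labels, not the chart's field names.
cost_model = {
    "cpu_per_vcpu_month": round(server_usd * CPU_SHARE / VCPUS, 4),
    "ram_per_gb_month": round(server_usd * (1 - CPU_SHARE) / RAM_GB, 4),
    "storage_per_gb_month": round(VOLUME_EUR_PER_GB_MONTH * EUR_TO_USD, 4),
    "egress_per_gb": round(EGRESS_EUR_PER_GB * EUR_TO_USD, 6),
}

# End-to-end check: configured monthly rate / 730 vs. hourly rate derived from EUR.
hourly_from_config = cost_model["cpu_per_vcpu_month"] / HOURS_PER_MONTH
hourly_from_source = (
    CX33_EUR_PER_MONTH * CPU_SHARE / VCPUS / HOURS_PER_MONTH * EUR_TO_USD
)

for name, value in cost_model.items():
    print(f"{name}: {value}")
print(f"hourly: {hourly_from_config:.6f} vs {hourly_from_source:.6f}")
```

The two hourly figures differ only by the rounding of the configured value to four decimals, so agreement to about three significant figures is the most this check can show.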

Merge-queue deploy gate (check-event-warnings)

The merge-queue Deploy to Prod job deploys to real prod, waits 90s, and fails on any Warning event in that window. This PR's OpenCost config change rolls the pod, and the chart-default maxUnavailable:1 on a single replica kills the old pod the instant the new one is created. kubelet then fires one last readiness probe ~1s after Cilium removes the dead pod's route, which produces Unhealthy: …:9003/healthz: connect: no route to host and trips the gate. Mitigated with a postRenderer that sets maxUnavailable:0 (the new pod surges to Ready before the old one terminates) and preStop.sleep:15 (the container keeps serving :9003 during drain, so probes land on a live endpoint). The sleep hook is the native lifecycle action, enabled by default since k8s 1.30; the cluster is 1.32.
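The postRenderer itself is not shown in this conversation. As a rough sketch of the Deployment fields such a patch would need to set, the structure below builds a strategic-merge-style patch as a Python dict and prints it as JSON. The Deployment name `opencost` is an assumption; the container name `opencost` is taken from the rollout events quoted later in this PR.

```python
import json

# Sketch only: the Deployment name is assumed, and the real patch in the
# HelmRelease may be shaped differently (for example as a JSON6902 patch).
deployment_patch = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "opencost"},
    "spec": {
        "strategy": {
            "type": "RollingUpdate",
            # Never take the old pod down before the new one is Ready.
            "rollingUpdate": {"maxUnavailable": 0},
        },
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "opencost",
                        # Native sleep hook: keep serving for 15s during drain.
                        "lifecycle": {"preStop": {"sleep": {"seconds": 15}}},
                    }
                ]
            }
        },
    },
}

print(json.dumps(deployment_patch, indent=2))
```

With a single replica, maxUnavailable: 0 only works because the rolling update can surge one extra pod, so the old pod stays in service until the new one passes readiness.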

The homepage probe warnings seen in earlier merge-queue runs of this PR are a separate, platform-wide issue (every PR's deploy gate hits them), fixed by #1636 (initialDelaySeconds). Not this PR's concern.

Validation

ksail workload validate on both clusters/local and clusters/prod (256 files each). Strategy + preStop.sleep accepted by the live 1.32 API via kubectl apply --dry-run=server. Bug reproduced read-only on prod before the fix (node_cpu_hourly_cost = 3e-06, OOMKilled exit 137).

🤖 Generated with Claude Code

> 🤖 Generated by the Daily AI Assistant

Two issues caused Headlamp's OpenCost panel to stop reporting costs on
prod.

1. **Costs were ~730× too low.** The helm-chart `customPricing.costModel`
   fields `CPU`, `RAM`, `GPU`, `storage`, `spotCPU`, `spotRAM` are
   interpreted by OpenCost as **USD per month** and divided by
   `HoursPerMonth = 730` to derive an hourly rate
   (`opencost/pkg/cloud/provider/providerconfig.go:188`,
   `customprovider.go:95`). Our values were authored as hourly
   ($0.002125/vCPU/hr, $0.000797/GB/hr), so the resulting node prices
   landed at `node_cpu_hourly_cost = 3e-6` instead of ~$0.0021. Empirical
   match: `0.002125 / 730 ≈ 2.91e-6`. Allocation queries returned
   non-zero but vanishingly small values that Headlamp surfaces as
   "$0.00", i.e. no cost data.

   Fix: convert to per-month USD derived from the same CX33 base price
   ($9.31/server/month, 50/50 CPU-RAM split): CPU = 1.5517, RAM = 0.5819,
   storage = 0.048 (Hetzner Cloud Volume). Network-egress fields are
   *not* in the divide-by-730 allowlist (they are per-GB transferred) so
   they stay at 0.001.

2. **OOMKilled while serving Headlamp's 14d query.** The exporter was
   capped at 256 MiB; the cost-model engine holds the aggregation window
   in memory and Headlamp issues `window=14d&aggregate=namespace` and
   `aggregate=deployment` queries. The pod hit exit code 137 mid-response,
   producing the intermittent 499s observed in the UI access log and
   leaving the panel empty after a restart. Lift the limit to 512 MiB
   (~7× idle, still under the chart default of 1 GiB).

Validated via `ksail workload validate` (local + prod overlays).

Refs: https://github.com/opencost/opencost/blob/develop/pkg/cloud/provider/providerconfig.go#L188-L210
Refs: https://github.com/opencost/opencost-helm-chart/blob/main/charts/opencost/values.yaml
Contributor

Copilot AI left a comment

Pull request overview

Adjusts the OpenCost HelmRelease configuration so Headlamp’s OpenCost panel reports realistic costs again in prod by correcting custom pricing units and reducing exporter OOM restarts during long-range allocation queries.

Changes:

  • Converts customPricing.costModel values from hourly-authored numbers to the monthly USD units OpenCost expects (CPU/RAM/storage).
  • Adds clarifying documentation in the HelmRelease about OpenCost’s unit conventions and the monthly→hourly conversion behavior.
  • Increases exporter.resources.limits.memory from 256Mi to 512Mi to avoid OOMKilled responses on 14d allocation queries.

Comment thread k8s/bases/infrastructure/controllers/opencost/helm-release.yaml Outdated
…etzner rates

> 🤖 Generated by the Daily AI Assistant

The previous commit had three input errors that I'd silently inherited
from the original 2025 comment block, even after fixing the per-month
units bug:

1. CX33 has **4 vCPU**, not 3. Verified live: `kubectl get nodes` on
   prod reports `cpu: 4` on every CX33.
2. CX33 current monthly cap is **€6.49**, not €8.54 (which my old
   hourly-based math implied). Hetzner's price-adjustment doc raised
   CX33 from €4.99 → €6.49/month effective 2026-04-01.
3. Hetzner Cloud Volume is now **€0.0572/GB-month**, not €0.044
   (same 2026-04-01 adjustment).

Re-derived from the authoritative source price (Hetzner's published cap)
with explicit math:

  Server cost  = €6.49/month → $7.01/server/month  (EUR/USD ≈ 1.08)
  50/50 split  = $3.504 each for CPU and RAM
  CPU = $3.504 / 4 vCPU = $0.876 / vCPU-month
  RAM = $3.504 / 8 GB   = $0.438 / GB-month
  Vol = €0.0572 × 1.08  = $0.0618 / GB-month

End-to-end check (1 vCPU running for 1 hour):
  ConfigMap CPU = 0.876 → OpenCost /730 = $0.00120 / vCPU-hour
  Real cost     = (€6.49 × 0.5 / 4 vCPU / 730 hr) × 1.08
                = $0.00120 / vCPU-hour ✓

Validated: `ksail workload validate` succeeds for both clusters/local
and clusters/prod (256 files each).

Refs: https://docs.hetzner.com/general/infrastructure-and-availability/price-adjustment/
@devantler
Contributor Author

🤖 Generated by the Daily AI Assistant

Re-checked the cost calculation (this is what /review surfaced)

The conversion-factor fix (730×) was correct, but the input numbers were wrong. Three errors silently inherited from the original 2025 comment block:

| # | Field | Previous PR value | Actual | Source |
| --- | --- | --- | --- | --- |
| 1 | CX33 vCPU count | 3 vCPU (comment said so) | 4 vCPU | kubectl get nodes on prod: cpu: 4 on every CX33 |
| 2 | CX33 monthly cap | €0.0117/hr ⇒ €8.54/month | €6.49/month | Hetzner price-adjustment doc (post 2026-04-01) |
| 3 | Volume per GB-month | €0.044 ⇒ $0.048 | €0.0572 ⇒ $0.0618 | Same doc, same date |

Pushed 8830296a with the corrected derivation:

Server cost  = €6.49/month  → $7.01/server/month  (EUR/USD ≈ 1.08)
50/50 split  = $3.504 each for CPU and RAM
CPU = $3.504 / 4 vCPU = $0.876 / vCPU-month
RAM = $3.504 / 8 GB   = $0.438 / GB-month
Vol = €0.0572 × 1.08  = $0.0618 / GB-month

End-to-end check (1 vCPU running for 1 hour):

ConfigMap CPU = 0.876 → OpenCost / 730 = $0.001200 / vCPU-hour
Real cost     = (€6.49 × 0.5 / 4 vCPU / 730 hr) × 1.08
              = $0.001200 / vCPU-hour ✓

Known limitations (out of scope of this fix)

  • 50/50 CPU-RAM split is a convention, not a Hetzner-published breakdown. Cloud providers don't typically publish a per-resource split; a common alternative is 70/30 (CPU-heavy, as in EC2 reserved-instance breakdowns). I kept 50/50 to stay consistent with the original author's choice and minimise scope, but I'm happy to revisit if you want a different model.
  • EUR/USD = 1.08 is a mid-2026 approximation, not a live rate. OpenCost has no FX tracking, so any single chosen rate is a snapshot. If it drifts ±5%, total costs scale linearly.
  • Autoscaler pool cx23 (2 vCPU / 4 GB / €3.99 cap) gets the same per-vCPU/per-GB rates from this config since OpenCost custom pricing is uniform per resource. Estimate for a cx23 server at our rates: 0.876 × 2 + 0.438 × 4 = $3.50/month ≈ €3.24 vs. €3.99 actual — about 19% under, acceptable for an estimate.
  • Storage rate uses Hetzner Cloud Volume pricing. Longhorn replicas on those volumes will multiply observed PVC cost by the replica count; this is OpenCost's standard behaviour for replicated storage and not a math error.
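The cx23 estimate in the list above can be checked with a small helper. This is a sketch using the rates from this comment (0.876, 0.438, EUR/USD 1.08); the function name `estimated_monthly_usd` is illustrative.

```python
# Rates from this comment's derivation (USD per month, EUR/USD assumed 1.08).
CPU_USD_PER_VCPU_MONTH = 0.876
RAM_USD_PER_GB_MONTH = 0.438
EUR_TO_USD = 1.08


def estimated_monthly_usd(vcpus: int, ram_gb: int) -> float:
    """Monthly cost a uniform per-resource price list assigns to a node."""
    return vcpus * CPU_USD_PER_VCPU_MONTH + ram_gb * RAM_USD_PER_GB_MONTH


cx23_usd = estimated_monthly_usd(vcpus=2, ram_gb=4)
cx23_eur = cx23_usd / EUR_TO_USD
cx23_actual_eur = 3.99
shortfall = 1 - cx23_eur / cx23_actual_eur

print(f"cx23 estimate: ${cx23_usd:.2f} = EUR {cx23_eur:.2f}, {shortfall:.0%} under actual")
```

The same helper applied to a CX33 (4 vCPU, 8 GB) returns the full server price the rates were derived from, which is why only node types with a different price-to-size ratio show a gap.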

> 🤖 Generated by the Daily AI Assistant

Replaced hand-edited assumptions with values pulled directly from the
Hetzner Cloud Pricing API (`GET /v1/pricing`, location fsn1) and the
ECB EUR→USD reference rate for 2026-05-27.

API-verified (net of VAT):
  CX33 cap = €6.49 / server / month  (hourly €0.0104)
  Volumes  = €0.0572 / GB / month
  Egress   = €1.00 / TB overage  (€0.001 / GB; 20 TiB incl. per server)

FX (ECB reference, 2026-05-27): 1.1637 EUR/USD. The previous 1.08
assumption was 8% low.

Re-derived (50/50 CPU-RAM split, unchanged convention):
  €6.49 × 1.1637 = $7.5524 / server / month
  CPU = $7.5524 × 0.5 / 4 vCPU  ≈ $0.9441 / vCPU-month  (was 0.876)
  RAM = $7.5524 × 0.5 / 8 GB    ≈ $0.4720 / GB-month    (was 0.438)
  Vol = €0.0572 × 1.1637        ≈ $0.0666 / GB-month    (was 0.0618)
  Egress = €0.001 × 1.1637      ≈ $0.001164 / GB        (was 0.001 flat)

End-to-end check (1 vCPU for 1 hour):
  ConfigMap CPU = 0.9441 → OpenCost / 730 = $0.001293 / vCPU-hour
  Real cost     = (€6.49 × 0.5 / 4 vCPU / 730 hr) × 1.1637
                =  €0.001111 × 1.1637 ≈ $0.001293 / vCPU-hour ✓

Using net (excl. VAT) because VAT is location-dependent (DK = 25%) and
typically reclaimable for businesses; tracking it would conflate
compute cost with tax overhead. Gross is +25% on every figure.

Validated with `ksail workload validate` on both clusters/local and
clusters/prod (256 files each).
Copilot AI review requested due to automatic review settings May 28, 2026 17:32
@devantler
Contributor Author

🤖 Generated by the Daily AI Assistant

Updated against authoritative live data

Hit the Hetzner Cloud Pricing API directly with the prod HCLOUD_TOKEN and replaced the price-adjustment-doc figures + my hand-picked FX rate with the API's own numbers + the ECB reference rate.

Hetzner API (GET /v1/pricing, location fsn1, net of 25% VAT):

| Item | Value | My PR was |
| --- | --- | --- |
| CX33 monthly cap | €6.49 | €6.49 ✓ |
| CX33 hourly | €0.0104 | €0.0104 ✓ |
| Volume / GB-month | €0.0572 | €0.0572 ✓ |
| Egress per TB | €1.00 (= €0.001/GB) | €0.001 ✓ |
| Included egress | 20 TiB / server | mentioned ✓ |

So the EUR base prices were already right. The miss was the FX rate:

ECB EUR→USD reference rate (2026-05-27): 1.1637 — was using 1.08, 8% low.

Re-derived (50/50 split, unchanged):

€6.49/server/month × 1.1637 = $7.5524/server/month
  CPU = $7.5524 × 0.5 / 4 vCPU  ≈ $0.9441/vCPU-month   (was 0.876)
  RAM = $7.5524 × 0.5 / 8 GB    ≈ $0.4720/GB-month     (was 0.438)
  Vol = €0.0572 × 1.1637        ≈ $0.0666/GB-month     (was 0.0618)
  Egress = €0.001 × 1.1637      ≈ $0.001164/GB         (was flat $0.001)

End-to-end check (1 vCPU running for 1 hour):

ConfigMap CPU = 0.9441 → OpenCost / 730 = $0.001293/vCPU-hour
Real cost     = (€6.49 × 0.5 / 4 vCPU / 730 hr) × 1.1637
              =  €0.001111 × 1.1637 ≈ $0.001293/vCPU-hour ✓

Net vs gross VAT

Using net (excluding the 25% Danish VAT the API also returns). Gross is +25% on every figure, but VAT is location-dependent and typically reclaimable for businesses, so tracking it in OpenCost would conflate compute cost with tax overhead. Reasonable people could argue either side — easy switch if you'd rather see actual bank-account-outflow numbers.
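Both scalings discussed in this comment are linear, so their effect on any configured rate is easy to estimate. The sketch below uses the 25% VAT rate and the two FX rates mentioned above; the helper name `gross_from_net` is illustrative.

```python
VAT_RATE = 0.25                 # Danish VAT, as stated in this comment
OLD_FX, NEW_FX = 1.08, 1.1637   # EUR to USD: earlier assumption vs. ECB reference


def gross_from_net(net: float) -> float:
    """Price including VAT for a net (excl. VAT) price."""
    return net * (1 + VAT_RATE)


net_cpu = 0.9441                # USD per vCPU-month, net
gross_cpu = gross_from_net(net_cpu)

# Relative increase of every USD figure caused by the FX correction alone.
fx_uplift = NEW_FX / OLD_FX - 1

print(f"gross CPU rate: {gross_cpu:.4f} USD/vCPU-month")
print(f"FX correction raises USD figures by {fx_uplift:.1%}")
```

Switching to gross figures would therefore be a single multiplication per field, with no change to the derivation itself.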

Pushed cb791c6f.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread k8s/bases/infrastructure/controllers/opencost/helm-release.yaml
Comment thread k8s/bases/infrastructure/controllers/opencost/helm-release.yaml Outdated
> 🤖 Generated by the Daily AI Assistant

Per review: the request was 55Mi but the exporter idles ~75Mi (kubectl
top), so the scheduler under-reserved its baseline and the pod would be
a prime eviction candidate under node memory pressure. Raise the request
to 128Mi (above observed idle, below the 512Mi limit).
@devantler devantler marked this pull request as ready for review May 28, 2026 17:41
Copilot AI review requested due to automatic review settings May 28, 2026 17:41
@devantler devantler enabled auto-merge May 28, 2026 17:42
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

@devantler devantler added this pull request to the merge queue May 28, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026
@devantler devantler added this pull request to the merge queue May 28, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026
…n warning

> 🤖 Generated by the Daily AI Assistant

The merge-queue "Deploy to Prod" gate (check-event-warnings, fail-on-warning)
failed for this PR on a benign OpenCost rollout artifact:

  Killing           pod/opencost-…-m9r7j  Stopping container opencost
  SuccessfulDelete  Deleted pod: opencost-…-m9r7j
  Warning Unhealthy pod/opencost-…-m9r7j  Readiness probe failed:
                    …:9003/healthz: connect: no route to host

Root cause: the chart default rollout is maxUnavailable:1 on a single replica,
so the old pod is killed the instant the new one is created. kubelet fires one
last readiness probe ~1s after Cilium removes the dead pod's route → "no route
to host". The event lands inside the gate's 90s settle window and trips it,
even though nothing is actually wrong (the pod is gone). Any OpenCost rollout
(incl. this PR's memory/pricing change) reproduces it.

Add a postRenderer that:
  1. sets maxUnavailable: 0 (surge new→Ready before old terminates;
     zero-downtime, matches the homepage/headlamp convention), and
  2. adds preStop.sleep: 15s on the opencost container so it keeps serving
     :9003 during drain, so kubelet's probes land on a live endpoint instead
     of a torn-down route. Native sleep lifecycle action (enabled by default
     since k8s 1.30; cluster is 1.32), so no shell is needed in the
     distroless image.

Validated: `ksail workload validate` (local + prod, 256 files each); the
strategy + preStop.sleep accepted by the live 1.32 API via `kubectl apply
--dry-run=server`.
@botantler botantler Bot enabled auto-merge May 28, 2026 21:31
@devantler
Contributor Author

🤖 Generated by the Daily AI Assistant

Merge-queue failure investigation

The PR-level checks (System Test) pass, but the merge-queue merge_group run failed — that runs a different job, 🚀 Deploy to Prod, which deploys to real prod, waits 90s, and fails on any Warning event in the window (.github/actions/check-event-warnings, fail-on-warning: true). Checking the merge_group job logs (not the PR checks) showed two distinct causes across runs:

1. Homepage cold-start probes (the originally-reported error, 19:58Z run). Not specific to this PR — runs for #1604, #1607, #1608, #1636 all failed identically in the same window. Platform-wide; fixed by #1636 (startupProbe: initialDelaySeconds: 20), which is auto-merging. Live prod homepage has since reverted to chart-default and is stable, and this PR doesn't touch homepage — so it won't roll homepage on deploy regardless.

2. OpenCost teardown race (latest run, 21:11Z) — this PR's responsibility. Live events were unambiguous:
```
Killing pod/opencost-…-m9r7j Stopping container opencost
SuccessfulDelete Deleted pod: opencost-…-m9r7j
Warning Unhealthy pod/opencost-…-m9r7j Readiness probe failed: …:9003/healthz: connect: no route to host
```
The chart-default maxUnavailable:1 on a single replica kills the old pod the instant the new one is created; kubelet fires a last readiness probe ~1s after Cilium tears down the dead pod's route. Fixed in this PR with maxUnavailable:0 + preStop.sleep:15 (commit 28d83b6c).

Honest caveat: prod reverts to main's OpenCost config between merge_group runs, so the introducing deploy still rolls an old pod that predates the preStop hook. maxUnavailable:0 makes that termination ordered (new Ready first) rather than the current abrupt simultaneous kill, which sharply reduces — but may not 100% eliminate — the one-shot teardown probe. If it recurs, a re-run is clean. The fully deterministic alternative is a gate-level fix (ignore Unhealthy warnings on already-deleted pods); flagging for a possible follow-up.
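The gate-level alternative mentioned above could look roughly like the filter below. This is a hypothetical sketch, not the logic of .github/actions/check-event-warnings: the `Event` shape, the function name, and the pod names in the usage example are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    type: str    # "Normal" or "Warning"
    reason: str  # e.g. "Unhealthy"
    pod: str     # name of the involved pod


def blocking_warnings(events: list[Event], live_pods: set[str]) -> list[Event]:
    """Warnings that should fail the gate.

    Unhealthy warnings whose pod no longer exists are treated as teardown
    noise and skipped; every other Warning still blocks.
    """
    return [
        e
        for e in events
        if e.type == "Warning"
        and not (e.reason == "Unhealthy" and e.pod not in live_pods)
    ]


# Hypothetical pod names: the old pod has been deleted, the new one is live.
events = [
    Event("Normal", "Killing", "opencost-old"),
    Event("Warning", "Unhealthy", "opencost-old"),
    Event("Warning", "Unhealthy", "opencost-new"),
]
print(blocking_warnings(events, live_pods={"opencost-new"}))
```

Only the warning on the live pod survives the filter, so a genuine readiness failure would still fail the deploy while the one-shot teardown probe would not.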

Auto-merge enabled; it will re-enter the queue once the System Test is green.

@botantler botantler Bot added this pull request to the merge queue May 28, 2026
Merged via the queue into main with commit 68557c0 May 28, 2026
9 checks passed
@botantler botantler Bot deleted the claude/magical-roentgen-fcd554 branch May 28, 2026 22:07
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board May 28, 2026
@botantler
Contributor

botantler Bot commented May 28, 2026

🎉 This PR is included in version 1.12.3 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
