[HDX] Bump clickhouse-operator to v0.0.6 + fix ClickHouse OOMKill by wrn14897 · Pull Request #240 · ClickHouse/ClickStack-helm-charts

wrn14897 · 2026-06-30T05:52:36Z

Summary

Bumps clickhouse-operator-helm from v0.0.2 to v0.0.6 and fixes two resulting full-stack integration-test failures caused by new operator defaults that break the single-replica ClickHouse deployment.

Supersedes the upstream bump in ClickHouse/ClickStack-helm-charts#224.

Changes

Operator bump (from #224)

clickhouse-operator-helm ~0.0.2 → ~0.0.6 (Chart.yaml, Chart.lock)
values.yaml: rename enable → enabled for webhook / certManager / crd — required, since v0.0.6 renamed these keys. The old enable keys are silently ignored in v0.0.6 and would leave webhook/certManager enabled, breaking the install (no cert-manager present).
integration-tests/full-stack/assert.sh: scope kubectl wait with --field-selector=status.phase!=Succeeded. Required by the bump: v0.0.6 introduces a version-probe Job (absent in v0.0.2) that leaves a Completed pod, which kubectl wait --for=condition=Ready pods --all can never satisfy → wait times out → suite fails.

ClickHouse hardening for operator v0.0.6 (new)

Two new operator defaults in v0.0.6 broke the single-replica ClickHouse deployment. The chart now overrides both in clickhouse.cluster.spec:

1. Explicit container resources (containerTemplate.resources: 2Gi memory, 500m CPU request)

The chart previously set no resources, relying on the operator default. v0.0.6 changed that default (release PR #206: "use the same req/limit for memory in default") to 512Mi, request == limit. At 512Mi, ClickHouse computes max_server_memory_usage ≈ 460 MiB; the full ClickStack schema plus ingestion and background merges exceeds it → OOMKilled (exit 137), crash-loop.

2. settings.enableDatabaseSync: false

v0.0.6 adds enableDatabaseSync defaulting to true (new CRD field). It creates the default database with the Replicated (DatabaseReplicated) engine, so table metadata lives in Keeper. Per the CRD docs it "Supports only Replicated and integration databases" and targets multi-replica clusters. In this single-replica deployment it only adds fragility: during ClickHouse startup a transient Keeper exception occurs —

Code: 999. Coordination::Exception: Cannot use any of provided ZooKeeper nodes. (KEEPER_EXCEPTION)   (exit 231)

— and the Replicated default database desyncs and silently drops every seeded table (otel_traces, otel_logs, …, and the goose tracking table). They never come back, so the collector loops on Table default.otel_traces does not exist and the smoke test's wait_for_table_queryable times out after 5 minutes → exit 1 (the failure seen on this PR's CI). Setting enableDatabaseSync: false keeps default Atomic, so seeded tables persist on the data volume across restarts.

Verification (local kind, fresh full-stack cluster)

For each fix, repro'd the failure on the bump and confirmed the fix end-to-end:

Resources: repro'd ClickHouse OOMKill at 512Mi; after fix CH sees 2.0 GiB, max_server_memory_usage 1.80 GiB, stable.
databaseSync: repro'd — default engine Replicated, all tables dropped after a startup Keeper hiccup, SHOW TABLES FROM default empty. After enableDatabaseSync: false: default engine Atomic, all otel_* / hyperdx_sessions / rollup tables present and queryable, ingestion succeeds (traces 173→218, logs 14→16).
helm unittest charts/clickstack: 180/180 pass (added assertions for both new defaults).
Both example values files (alb-ingress, api-only) render cleanly.

Note

2Gi raises the chart's baseline ClickHouse memory footprint vs. the operator's 512Mi default. Both settings remain overridable per environment via clickhouse.cluster.spec.

Ref

#224
#226

clickhouse-operator v0.0.6 changed its default resource block to use memory request == limit at 512Mi (operator PR #206). That ceiling is too low for the full ClickStack schema and OOMKills the ClickHouse server under ingestion + background merges, which is what broke the full-stack integration test on the operator bump. Set explicit containerTemplate.resources (2Gi memory, 500m CPU request) so the chart no longer inherits the operator default. Add a unit test asserting the default and a changeset.

changeset-bot · 2026-06-30T05:52:44Z

🦋 Changeset detected

Latest commit: 1399c31

The changes in this PR will be included in the next version bump.

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

github-actions · 2026-06-30T05:57:36Z

director_placeholder

clickhouse-operator v0.0.6 defaults settings.enableDatabaseSync: true, which creates the `default` database with the Replicated (DatabaseReplicated) engine so table metadata lives in Keeper. That feature targets multi-replica clusters. In this single-replica deployment, a transient Keeper exception during ClickHouse startup (observed: 'Cannot use any of provided ZooKeeper nodes', exit 231) desyncs the Replicated database and silently drops every seeded table -- they never come back, so the otel-collector loops on 'Table default.otel_traces does not exist' and the smoke test's wait_for_table_queryable times out. This is the remaining full-stack integration-test failure on the operator bump. Set enableDatabaseSync: false so `default` stays Atomic and the seeded tables persist on the data volume across restarts. Verified end-to-end on a fresh kind cluster: default DB engine Atomic, all otel_* tables present and queryable, ingestion succeeds. Add a unit test and fold the rationale into the changeset.

github-actions · 2026-06-30T07:00:29Z

Deep Review

Scope: ca8c0dc..HEAD — clickhouse-operator bump v0.0.2→v0.0.6 plus ClickHouse hardening (8 files: chart deps, values, one unit-test file, one integration-test script, two changesets).
Mode: report-only (read-only). Intent: absorb new operator defaults that broke the single-replica ClickHouse deployment.

✅ No critical issues found. The diff is small, config-focused, and the two new behaviors are covered by helm-unittest assertions. The items below are verification and coverage gaps worth closing before merge.

🟡 P2 — recommended

charts/clickstack/values.yaml:350 — enableDatabaseSync is nested under spec.settings, but its correct location in the v0.0.6 ClickHouseCluster CRD is unverified; if the operator reads it as a top-level spec field, this value is silently ignored, the data-loss fix becomes a no-op, and the unit test still passes because it only checks the rendered YAML path.
- Fix: Confirm placement against the installed v0.0.6 CRD (kubectl explain clickhousecluster.spec.settings.enableDatabaseSync vs clickhousecluster.spec.enableDatabaseSync) before relying on it.
- _{correctness, testing}
charts/clickstack-operators/values.yaml:10 — the enable→enabled rename for webhook/certManager/crd has no rendering test; per the PR's own premise the old keys are silently ignored in v0.0.6, so a wrong or stale key would silently leave the webhook/certManager enabled and break installs with no test catching it.
- Fix: Add a helm-unittest suite under charts/clickstack-operators/tests/ (or a values-render assertion) that locks in the renamed keys against the v0.0.6 subchart.
- _{testing, project-standards, maintainability}
integration-tests/full-stack/assert.sh:27 — the final wait uses the same selector as line 15 but omits the || true that line 15 carries, so a probe/Job pod still Pending or Running when the selector resolves is included, never reaches Ready, and times out the suite under set -e.
- Fix: Mirror line 15 by appending || true, or scope the selector to long-lived app pods so transient Jobs are never matched.
- _{correctness, reliability}
charts/clickstack/values.yaml:331 — the new resources block has no inline comment while the sibling enableDatabaseSync added in the same diff does, violating the AGENTS.md "document all values with inline comments" rule; the 2Gi/500m rationale lives only in a changeset that is consumed at release.
- Fix: Add a brief comment above resources: noting the operator's 512Mi default OOMKills the full ClickStack schema.
- _{project-standards, maintainability}

🔵 P3 nitpicks (4)

.changeset/clickhouse-explicit-resources.md:2 — a 4× memory default increase (512Mi→2Gi) ships as a patch, which can surprise downstream users on small clusters.
- Fix: Reconsider a minor bump, or note in the changeset body that patch is intentional.
integration-tests/full-stack/assert.sh:15 — the two kubectl wait invocations are now identical literals edited in lockstep and must be kept in sync by hand.
- Fix: Extract a shared wait_pods_ready() function or variable and call it twice.
charts/clickstack/tests/clickhouse-deployment_test.yaml:71 — the two new tests assert only the values.yaml defaults; AGENTS.md asks for both default and overridden cases.
- Fix: Add a set: override case (e.g. a custom limits.memory) to confirm pass-through.
charts/clickstack/values.yaml:332 — 2Gi is a tight production default; a spike above it is OOMKilled rather than throttled — the exact failure this change targets.
- Fix: Validate 2Gi under sustained ingestion+merge load, or document recommended overrides for higher-throughput deployments.

Reviewers (5): correctness, testing, maintainability, project-standards, reliability.

Testing gaps:

No test asserts the operator actually honors enableDatabaseSync at the chosen path or the enable→enabled rename — both are unit-test blind spots verifiable only at runtime.
charts/clickstack-operators/ has no tests/ directory, so the entire operator subchart values contract is untested.
The assert.sh field-selector behavior against a still-running short-lived Job pod is unexercised by CI.

jordan-simonovski · 2026-06-30T08:03:40Z

+          value: 2Gi
+      - equal:
+          path: spec.containerTemplate.resources.requests.cpu
+          value: 500m


minimum CPU requests are definitely good to have. Overprovisioning a node is a much bigger problem than pod OOMs, so I like this. Glad we don't implement cpu limits.

GrigoryPervakov · 2026-06-30T10:41:14Z

+        # Keep the `default` database Atomic; the Replicated engine the operator
+        # selects when this is true drops seeded tables on Keeper desync.
+        enableDatabaseSync: false


enableDatabaseSync has been there since the first release and has always been enabled by default.
There were no changes regarding Atomic->Replicated recreation since then.

It makes sense even for single-node clusters, as it helps scale further and keeps the engine set the same for different topologies.

drops seeded tables on Keeper desync

This should not happen. It never drops tables or recreates the default database as Replicated if any tables have been created(code)

If smth really gets dropped by the operator(somehow) give a repro and I'll fix

@GrigoryPervakov, thanks for following up on this PR. I dug into the issue and found the root cause. The collector (clickhouseexporter) creates an atomic database (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/277638be46c8653cc6574f2f9a6d2597bd5c12da/exporter/clickhouseexporter/internal/clickhouse.go#L60-L73). Later, the operator drops that database and recreates it as a Replicated database (https://github.com/ClickHouse/clickhouse-operator/blob/ecd71949243d71f37a717d21acbba20424af03a1/internal/controller/clickhouse/commands.go#L274-L290), which causes the integration test to fail.

What do you suggest as the best way to proceed here?

GrigoryPervakov and others added 2 commits June 29, 2026 22:20

chore(deps): bump clickhouse-operator-helm version to v0.0.6

46ec45f

wrn14897 requested a review from a team as a code owner June 30, 2026 05:52

wrn14897 added 2 commits June 29, 2026 23:38

docs(clickhouse): trim verbose values comments

1399c31

jordan-simonovski reviewed Jun 30, 2026

View reviewed changes

jordan-simonovski approved these changes Jun 30, 2026

View reviewed changes

GrigoryPervakov reviewed Jun 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[HDX] Bump clickhouse-operator to v0.0.6 + fix ClickHouse OOMKill#240

[HDX] Bump clickhouse-operator to v0.0.6 + fix ClickHouse OOMKill#240
wrn14897 wants to merge 4 commits into
mainfrom
warren/bump-clickhouse-operator

wrn14897 commented Jun 30, 2026 •

edited

Loading

Uh oh!

changeset-bot Bot commented Jun 30, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 30, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 30, 2026

Uh oh!

jordan-simonovski Jun 30, 2026

Uh oh!

GrigoryPervakov Jun 30, 2026

Uh oh!

wrn14897 Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

wrn14897 commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Operator bump (from #224)

ClickHouse hardening for operator v0.0.6 (new)

Verification (local kind, fresh full-stack cluster)

Note

Ref

Uh oh!

changeset-bot Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🦋 Changeset detected

Uh oh!

github-actions Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 30, 2026

Deep Review

🟡 P2 — recommended

Uh oh!

jordan-simonovski Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

GrigoryPervakov Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

wrn14897 Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

wrn14897 commented Jun 30, 2026 •

edited

Loading

changeset-bot Bot commented Jun 30, 2026 •

edited

Loading

github-actions Bot commented Jun 30, 2026 •

edited

Loading