Skip to content

[HDX] Bump clickhouse-operator to v0.0.6 + fix ClickHouse OOMKill#240

Open
wrn14897 wants to merge 4 commits into
mainfrom
warren/bump-clickhouse-operator
Open

[HDX] Bump clickhouse-operator to v0.0.6 + fix ClickHouse OOMKill#240
wrn14897 wants to merge 4 commits into
mainfrom
warren/bump-clickhouse-operator

Conversation

@wrn14897

@wrn14897 wrn14897 commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Summary

Bumps clickhouse-operator-helm from v0.0.2 to v0.0.6 and fixes two resulting full-stack integration-test failures caused by new operator defaults that break the single-replica ClickHouse deployment.

Supersedes the upstream bump in ClickHouse/ClickStack-helm-charts#224.

Changes

Operator bump (from #224)

  • clickhouse-operator-helm ~0.0.2~0.0.6 (Chart.yaml, Chart.lock)
  • values.yaml: rename enableenabled for webhook / certManager / crdrequired, since v0.0.6 renamed these keys. The old enable keys are silently ignored in v0.0.6 and would leave webhook/certManager enabled, breaking the install (no cert-manager present).
  • integration-tests/full-stack/assert.sh: scope kubectl wait with --field-selector=status.phase!=Succeeded. Required by the bump: v0.0.6 introduces a version-probe Job (absent in v0.0.2) that leaves a Completed pod, which kubectl wait --for=condition=Ready pods --all can never satisfy → wait times out → suite fails.

ClickHouse hardening for operator v0.0.6 (new)

Two new operator defaults in v0.0.6 broke the single-replica ClickHouse deployment. The chart now overrides both in clickhouse.cluster.spec:

1. Explicit container resources (containerTemplate.resources: 2Gi memory, 500m CPU request)

The chart previously set no resources, relying on the operator default. v0.0.6 changed that default (release PR #206: "use the same req/limit for memory in default") to 512Mi, request == limit. At 512Mi, ClickHouse computes max_server_memory_usage ≈ 460 MiB; the full ClickStack schema plus ingestion and background merges exceeds it → OOMKilled (exit 137), crash-loop.

2. settings.enableDatabaseSync: false

v0.0.6 adds enableDatabaseSync defaulting to true (new CRD field). It creates the default database with the Replicated (DatabaseReplicated) engine, so table metadata lives in Keeper. Per the CRD docs it "Supports only Replicated and integration databases" and targets multi-replica clusters. In this single-replica deployment it only adds fragility: during ClickHouse startup a transient Keeper exception occurs —

Code: 999. Coordination::Exception: Cannot use any of provided ZooKeeper nodes. (KEEPER_EXCEPTION)   (exit 231)

— and the Replicated default database desyncs and silently drops every seeded table (otel_traces, otel_logs, …, and the goose tracking table). They never come back, so the collector loops on Table default.otel_traces does not exist and the smoke test's wait_for_table_queryable times out after 5 minutes → exit 1 (the failure seen on this PR's CI). Setting enableDatabaseSync: false keeps default Atomic, so seeded tables persist on the data volume across restarts.

Verification (local kind, fresh full-stack cluster)

For each fix, repro'd the failure on the bump and confirmed the fix end-to-end:

  • Resources: repro'd ClickHouse OOMKill at 512Mi; after fix CH sees 2.0 GiB, max_server_memory_usage 1.80 GiB, stable.
  • databaseSync: repro'd — default engine Replicated, all tables dropped after a startup Keeper hiccup, SHOW TABLES FROM default empty. After enableDatabaseSync: false: default engine Atomic, all otel_* / hyperdx_sessions / rollup tables present and queryable, ingestion succeeds (traces 173→218, logs 14→16).
  • helm unittest charts/clickstack: 180/180 pass (added assertions for both new defaults).
  • Both example values files (alb-ingress, api-only) render cleanly.

Note

2Gi raises the chart's baseline ClickHouse memory footprint vs. the operator's 512Mi default. Both settings remain overridable per environment via clickhouse.cluster.spec.

Ref

#224
#226

GrigoryPervakov and others added 2 commits June 29, 2026 22:20
clickhouse-operator v0.0.6 changed its default resource block to use
memory request == limit at 512Mi (operator PR #206). That ceiling is too
low for the full ClickStack schema and OOMKills the ClickHouse server
under ingestion + background merges, which is what broke the full-stack
integration test on the operator bump.

Set explicit containerTemplate.resources (2Gi memory, 500m CPU request)
so the chart no longer inherits the operator default. Add a unit test
asserting the default and a changeset.
@wrn14897 wrn14897 requested a review from a team as a code owner June 30, 2026 05:52
@changeset-bot

changeset-bot Bot commented Jun 30, 2026

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: 1399c31

The changes in this PR will be included in the next version bump.

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@github-actions

github-actions Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

director_placeholder

wrn14897 added 2 commits June 29, 2026 23:38
clickhouse-operator v0.0.6 defaults settings.enableDatabaseSync: true, which
creates the `default` database with the Replicated (DatabaseReplicated) engine
so table metadata lives in Keeper. That feature targets multi-replica clusters.
In this single-replica deployment, a transient Keeper exception during ClickHouse
startup (observed: 'Cannot use any of provided ZooKeeper nodes', exit 231)
desyncs the Replicated database and silently drops every seeded table -- they
never come back, so the otel-collector loops on 'Table default.otel_traces does
not exist' and the smoke test's wait_for_table_queryable times out. This is the
remaining full-stack integration-test failure on the operator bump.

Set enableDatabaseSync: false so `default` stays Atomic and the seeded tables
persist on the data volume across restarts. Verified end-to-end on a fresh kind
cluster: default DB engine Atomic, all otel_* tables present and queryable,
ingestion succeeds. Add a unit test and fold the rationale into the changeset.
@github-actions

Copy link
Copy Markdown
Contributor

Deep Review

Scope: ca8c0dc..HEAD — clickhouse-operator bump v0.0.2→v0.0.6 plus ClickHouse hardening (8 files: chart deps, values, one unit-test file, one integration-test script, two changesets).
Mode: report-only (read-only). Intent: absorb new operator defaults that broke the single-replica ClickHouse deployment.

✅ No critical issues found. The diff is small, config-focused, and the two new behaviors are covered by helm-unittest assertions. The items below are verification and coverage gaps worth closing before merge.

🟡 P2 — recommended

  • charts/clickstack/values.yaml:350enableDatabaseSync is nested under spec.settings, but its correct location in the v0.0.6 ClickHouseCluster CRD is unverified; if the operator reads it as a top-level spec field, this value is silently ignored, the data-loss fix becomes a no-op, and the unit test still passes because it only checks the rendered YAML path.
    • Fix: Confirm placement against the installed v0.0.6 CRD (kubectl explain clickhousecluster.spec.settings.enableDatabaseSync vs clickhousecluster.spec.enableDatabaseSync) before relying on it.
    • correctness, testing
  • charts/clickstack-operators/values.yaml:10 — the enableenabled rename for webhook/certManager/crd has no rendering test; per the PR's own premise the old keys are silently ignored in v0.0.6, so a wrong or stale key would silently leave the webhook/certManager enabled and break installs with no test catching it.
    • Fix: Add a helm-unittest suite under charts/clickstack-operators/tests/ (or a values-render assertion) that locks in the renamed keys against the v0.0.6 subchart.
    • testing, project-standards, maintainability
  • integration-tests/full-stack/assert.sh:27 — the final wait uses the same selector as line 15 but omits the || true that line 15 carries, so a probe/Job pod still Pending or Running when the selector resolves is included, never reaches Ready, and times out the suite under set -e.
    • Fix: Mirror line 15 by appending || true, or scope the selector to long-lived app pods so transient Jobs are never matched.
    • correctness, reliability
  • charts/clickstack/values.yaml:331 — the new resources block has no inline comment while the sibling enableDatabaseSync added in the same diff does, violating the AGENTS.md "document all values with inline comments" rule; the 2Gi/500m rationale lives only in a changeset that is consumed at release.
    • Fix: Add a brief comment above resources: noting the operator's 512Mi default OOMKills the full ClickStack schema.
    • project-standards, maintainability
🔵 P3 nitpicks (4)
  • .changeset/clickhouse-explicit-resources.md:2 — a 4× memory default increase (512Mi→2Gi) ships as a patch, which can surprise downstream users on small clusters.
    • Fix: Reconsider a minor bump, or note in the changeset body that patch is intentional.
  • integration-tests/full-stack/assert.sh:15 — the two kubectl wait invocations are now identical literals edited in lockstep and must be kept in sync by hand.
    • Fix: Extract a shared wait_pods_ready() function or variable and call it twice.
  • charts/clickstack/tests/clickhouse-deployment_test.yaml:71 — the two new tests assert only the values.yaml defaults; AGENTS.md asks for both default and overridden cases.
    • Fix: Add a set: override case (e.g. a custom limits.memory) to confirm pass-through.
  • charts/clickstack/values.yaml:332 — 2Gi is a tight production default; a spike above it is OOMKilled rather than throttled — the exact failure this change targets.
    • Fix: Validate 2Gi under sustained ingestion+merge load, or document recommended overrides for higher-throughput deployments.

Reviewers (5): correctness, testing, maintainability, project-standards, reliability.

Testing gaps:

  • No test asserts the operator actually honors enableDatabaseSync at the chosen path or the enableenabled rename — both are unit-test blind spots verifiable only at runtime.
  • charts/clickstack-operators/ has no tests/ directory, so the entire operator subchart values contract is untested.
  • The assert.sh field-selector behavior against a still-running short-lived Job pod is unexercised by CI.

value: 2Gi
- equal:
path: spec.containerTemplate.resources.requests.cpu
value: 500m

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minimum CPU requests are definitely good to have. Overprovisioning a node is a much bigger problem than pod OOMs, so I like this. Glad we don't implement cpu limits.

Comment on lines +348 to +350
# Keep the `default` database Atomic; the Replicated engine the operator
# selects when this is true drops seeded tables on Keeper desync.
enableDatabaseSync: false

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enableDatabaseSync has been there since the first release and has always been enabled by default.
There were no changes regarding Atomic->Replicated recreation since then.

It makes sense even for single-node clusters, as it helps scale further and keeps the engine set the same for different topologies.

drops seeded tables on Keeper desync

This should not happen. It never drops tables or recreates the default database as Replicated if any tables have been created(code)

If smth really gets dropped by the operator(somehow) give a repro and I'll fix

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GrigoryPervakov, thanks for following up on this PR. I dug into the issue and found the root cause. The collector (clickhouseexporter) creates an atomic database (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/277638be46c8653cc6574f2f9a6d2597bd5c12da/exporter/clickhouseexporter/internal/clickhouse.go#L60-L73). Later, the operator drops that database and recreates it as a Replicated database (https://github.com/ClickHouse/clickhouse-operator/blob/ecd71949243d71f37a717d21acbba20424af03a1/internal/controller/clickhouse/commands.go#L274-L290), which causes the integration test to fail.

What do you suggest as the best way to proceed here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants