[HDX] Bump clickhouse-operator to v0.0.6 + fix ClickHouse OOMKill#240
[HDX] Bump clickhouse-operator to v0.0.6 + fix ClickHouse OOMKill#240wrn14897 wants to merge 4 commits into
Conversation
clickhouse-operator v0.0.6 changed its default resource block to use memory request == limit at 512Mi (operator PR #206). That ceiling is too low for the full ClickStack schema and OOMKills the ClickHouse server under ingestion + background merges, which is what broke the full-stack integration test on the operator bump. Set explicit containerTemplate.resources (2Gi memory, 500m CPU request) so the chart no longer inherits the operator default. Add a unit test asserting the default and a changeset.
🦋 Changeset detectedLatest commit: 1399c31 The changes in this PR will be included in the next version bump. Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
director_placeholder |
clickhouse-operator v0.0.6 defaults settings.enableDatabaseSync: true, which creates the `default` database with the Replicated (DatabaseReplicated) engine so table metadata lives in Keeper. That feature targets multi-replica clusters. In this single-replica deployment, a transient Keeper exception during ClickHouse startup (observed: 'Cannot use any of provided ZooKeeper nodes', exit 231) desyncs the Replicated database and silently drops every seeded table -- they never come back, so the otel-collector loops on 'Table default.otel_traces does not exist' and the smoke test's wait_for_table_queryable times out. This is the remaining full-stack integration-test failure on the operator bump. Set enableDatabaseSync: false so `default` stays Atomic and the seeded tables persist on the data volume across restarts. Verified end-to-end on a fresh kind cluster: default DB engine Atomic, all otel_* tables present and queryable, ingestion succeeds. Add a unit test and fold the rationale into the changeset.
Deep ReviewScope: ✅ No critical issues found. The diff is small, config-focused, and the two new behaviors are covered by helm-unittest assertions. The items below are verification and coverage gaps worth closing before merge. 🟡 P2 — recommended
🔵 P3 nitpicks (4)
Reviewers (5): correctness, testing, maintainability, project-standards, reliability. Testing gaps:
|
| value: 2Gi | ||
| - equal: | ||
| path: spec.containerTemplate.resources.requests.cpu | ||
| value: 500m |
There was a problem hiding this comment.
minimum CPU requests are definitely good to have. Overprovisioning a node is a much bigger problem than pod OOMs, so I like this. Glad we don't implement cpu limits.
| # Keep the `default` database Atomic; the Replicated engine the operator | ||
| # selects when this is true drops seeded tables on Keeper desync. | ||
| enableDatabaseSync: false |
There was a problem hiding this comment.
enableDatabaseSync has been there since the first release and has always been enabled by default.
There were no changes regarding Atomic->Replicated recreation since then.
It makes sense even for single-node clusters, as it helps scale further and keeps the engine set the same for different topologies.
drops seeded tables on Keeper desync
This should not happen. It never drops tables or recreates the default database as Replicated if any tables have been created(code)
If smth really gets dropped by the operator(somehow) give a repro and I'll fix
There was a problem hiding this comment.
@GrigoryPervakov, thanks for following up on this PR. I dug into the issue and found the root cause. The collector (clickhouseexporter) creates an atomic database (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/277638be46c8653cc6574f2f9a6d2597bd5c12da/exporter/clickhouseexporter/internal/clickhouse.go#L60-L73). Later, the operator drops that database and recreates it as a Replicated database (https://github.com/ClickHouse/clickhouse-operator/blob/ecd71949243d71f37a717d21acbba20424af03a1/internal/controller/clickhouse/commands.go#L274-L290), which causes the integration test to fail.
What do you suggest as the best way to proceed here?
Summary
Bumps
clickhouse-operator-helmfrom v0.0.2 to v0.0.6 and fixes two resulting full-stack integration-test failures caused by new operator defaults that break the single-replica ClickHouse deployment.Supersedes the upstream bump in ClickHouse/ClickStack-helm-charts#224.
Changes
Operator bump (from #224)
clickhouse-operator-helm~0.0.2→~0.0.6(Chart.yaml,Chart.lock)values.yaml: renameenable→enabledforwebhook/certManager/crd— required, since v0.0.6 renamed these keys. The oldenablekeys are silently ignored in v0.0.6 and would leavewebhook/certManagerenabled, breaking the install (no cert-manager present).integration-tests/full-stack/assert.sh: scopekubectl waitwith--field-selector=status.phase!=Succeeded. Required by the bump: v0.0.6 introduces a version-probe Job (absent in v0.0.2) that leaves aCompletedpod, whichkubectl wait --for=condition=Ready pods --allcan never satisfy → wait times out → suite fails.ClickHouse hardening for operator v0.0.6 (new)
Two new operator defaults in v0.0.6 broke the single-replica ClickHouse deployment. The chart now overrides both in
clickhouse.cluster.spec:1. Explicit container resources (
containerTemplate.resources: 2Gi memory, 500m CPU request)The chart previously set no
resources, relying on the operator default. v0.0.6 changed that default (release PR #206: "use the same req/limit for memory in default") to 512Mi, request == limit. At 512Mi, ClickHouse computesmax_server_memory_usage≈ 460 MiB; the full ClickStack schema plus ingestion and background merges exceeds it → OOMKilled (exit 137), crash-loop.2.
settings.enableDatabaseSync: falsev0.0.6 adds
enableDatabaseSyncdefaulting to true (new CRD field). It creates thedefaultdatabase with the Replicated (DatabaseReplicated) engine, so table metadata lives in Keeper. Per the CRD docs it "Supports only Replicated and integration databases" and targets multi-replica clusters. In this single-replica deployment it only adds fragility: during ClickHouse startup a transient Keeper exception occurs —— and the Replicated
defaultdatabase desyncs and silently drops every seeded table (otel_traces,otel_logs, …, and the goose tracking table). They never come back, so the collector loops onTable default.otel_traces does not existand the smoke test'swait_for_table_queryabletimes out after 5 minutes → exit 1 (the failure seen on this PR's CI). SettingenableDatabaseSync: falsekeepsdefaultAtomic, so seeded tables persist on the data volume across restarts.Verification (local kind, fresh full-stack cluster)
For each fix, repro'd the failure on the bump and confirmed the fix end-to-end:
max_server_memory_usage1.80 GiB, stable.defaultengineReplicated, all tables dropped after a startup Keeper hiccup,SHOW TABLES FROM defaultempty. AfterenableDatabaseSync: false:defaultengine Atomic, allotel_*/hyperdx_sessions/ rollup tables present and queryable, ingestion succeeds (traces 173→218, logs 14→16).helm unittest charts/clickstack: 180/180 pass (added assertions for both new defaults).alb-ingress,api-only) render cleanly.Note
2Gi raises the chart's baseline ClickHouse memory footprint vs. the operator's 512Mi default. Both settings remain overridable per environment via
clickhouse.cluster.spec.Ref
#224
#226