-
Notifications
You must be signed in to change notification settings - Fork 51
[HDX] Bump clickhouse-operator to v0.0.6 + fix ClickHouse OOMKill #240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
46ec45f
a0e16e8
ed46ecc
1399c31
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| --- | ||
| "helm-charts": patch | ||
| --- | ||
|
|
||
| chore(deps): bump clickhouse-operator-helm to v0.0.6 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| --- | ||
| "helm-charts": patch | ||
| --- | ||
|
|
||
| fix(clickhouse): harden ClickHouse defaults for clickhouse-operator v0.0.6 | ||
|
|
||
| Two operator-default changes in v0.0.6 broke the single-replica ClickHouse | ||
| deployment; the chart now overrides both: | ||
|
|
||
| - Explicit `containerTemplate.resources` (2Gi memory, 500m CPU request). The | ||
| operator otherwise applies a 512Mi default (request == limit as of v0.0.6), | ||
| which is too low for the full ClickStack schema and OOMKills the server | ||
| (exit 137) under ingestion plus background merges. | ||
|
|
||
| - `settings.enableDatabaseSync: false`. The operator now defaults this to true, | ||
| which creates the `default` database with the Replicated (DatabaseReplicated) | ||
| engine so table metadata lives in Keeper. That feature targets multi-replica | ||
| clusters; in a single-replica deployment a transient Keeper hiccup during | ||
| startup desyncs the Replicated database and silently drops all seeded tables, | ||
| which never come back. Keeping `default` Atomic stores tables on the | ||
| persistent data volume so they survive restarts. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -328,6 +328,12 @@ clickhouse: | |
| image: | ||
| repository: clickhouse/clickhouse-server | ||
| tag: "25.7-alpine" | ||
| resources: | ||
| requests: | ||
| cpu: 500m | ||
| memory: 2Gi | ||
| limits: | ||
| memory: 2Gi | ||
| replicas: 1 | ||
| shards: 1 | ||
| keeperClusterRef: | ||
|
|
@@ -339,6 +345,9 @@ clickhouse: | |
| requests: | ||
| storage: 10Gi | ||
| settings: | ||
| # Keep the `default` database Atomic; the Replicated engine the operator | ||
| # selects when this is true drops seeded tables on Keeper desync. | ||
| enableDatabaseSync: false | ||
|
Comment on lines
+348
to
+350
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It makes sense even for single-node clusters, as it helps scale further and keeps the engine set the same for different topologies.
This should not happen. It never drops tables or recreates the default database as Replicated if any tables have been created(code) If smth really gets dropped by the operator(somehow) give a repro and I'll fix
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @GrigoryPervakov, thanks for following up on this PR. I dug into the issue and found the root cause. The collector (clickhouseexporter) creates an atomic database (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/277638be46c8653cc6574f2f9a6d2597bd5c12da/exporter/clickhouseexporter/internal/clickhouse.go#L60-L73). Later, the operator drops that database and recreates it as a Replicated database (https://github.com/ClickHouse/clickhouse-operator/blob/ecd71949243d71f37a717d21acbba20424af03a1/internal/controller/clickhouse/commands.go#L274-L290), which causes the integration test to fail. What do you suggest as the best way to proceed here?
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @GrigoryPervakov I've created a draft PR ClickHouse/clickhouse-operator#255. Let me know what you think. For the ClickStack Helm chart, we should use the Replicated engine going forward. That will be done in a follow-up PR. |
||
| extraUsersConfig: | ||
| users: | ||
| app: | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minimum CPU requests are definitely good to have. Overprovisioning a node is a much bigger problem than pod OOMs, so I like this. Glad we don't implement cpu limits.