Antalya 26.3: Cluster Joins part 2 - global mode#1782
Conversation
Cluster Joins part 2 - global mode
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 01d8b03ac1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
arthurpassos
left a comment
There was a problem hiding this comment.
I am not familiar with this piece of code, I highly doubt I can add any value here. The code looks sane, and the tests as well.
Audit: PR #1782 — Antalya 26.3: Cluster Joins part 2 - global modeAI audit note: This review comment was generated by AI (Cursor agent, audit-review skill). Scope reviewedHead Confirmed defectsMedium:
Medium: GLOBAL mode loses
Low (latent):
Coverage summary
|
Expected.
Fixed in a8dbd32 |
Audit: PR #1782 — Antalya 26.3: Cluster Joins part 2 - global modeAI audit note: This review was generated by AI (Cursor agent, audit-review skill). Static analysis only; no local build or test run. Last updated: 2026-05-19 Executive summaryPR #1782 frontports cluster-join GLOBAL mode ( Four follow-up commits on 26.3 address AI review feedback (GLOBAL IN, cross join, Open confirmed defects: 1 Medium, 1 Low (latent). Commits in scope (26.3 delta)
Files reviewed
Architecture (call graph & transitions)Entry points
GLOBAL mode transition (
|
| Field | Detail |
|---|---|
| Impact | Session/profile object_storage_cluster_join_mode='global' forces JoinLocality::Global on queries that use rewriteJoinToGlobalJoin from parallel replicas or ClusterProxy, even for ordinary MergeTree JOIN MergeTree, overriding parallel_replicas_prefer_local_join=1. |
| Anchor | src/Storages/buildQueryTreeForShard.cpp — RewriteJoinToGlobalJoinVisitor::enterImpl |
| Trigger | object_storage_cluster_join_mode='global' + rewriteJoinToGlobalJoin from Planner/findParallelReplicasQuery.cpp:513 or Interpreters/ClusterProxy/executeQuery.cpp:832. |
| Why defect | prefer_local_join = parallel_replicas_prefer_local_join && object_storage_cluster_join_mode != GLOBAL is evaluated inside a shared visitor; only IStorageCluster::updateQueryWithJoinToSendIfNeeded should apply this object-storage setting. |
| Fix direction | Add a caller flag to rewriteJoinToGlobalJoin / visitor constructor; do not read object_storage_cluster_join_mode from context in the shared visitor. |
| Regression test | MergeTree JOIN MergeTree with max_parallel_replicas>1, parallel_replicas_prefer_local_join=1, object_storage_cluster_join_mode='global'; assert plan keeps JOIN local. |
bool prefer_local_join = getContext()->getSettingsRef()[Setting::parallel_replicas_prefer_local_join]
&& getContext()->getSettingsRef()[Setting::object_storage_cluster_join_mode] != ObjectStorageClusterJoinMode::GLOBAL;
bool should_use_global_join = !prefer_local_join || !allStoragesAreMergeTree(join_node->getRightTableExpression());
if (should_use_global_join)
join_node->setLocality(JoinLocality::Global);Callers of rewriteJoinToGlobalJoin:
IStorageCluster.cpp— GLOBAL branch (intended)findParallelReplicasQuery.cpp— parallel replicas (unintended side effect)ClusterProxy/executeQuery.cpp— distributed parallel replicas (unintended side effect)
Low (latent): SearcherVisitor cannot find match index ≥ 2
| Field | Detail |
|---|---|
| Impact | All call sites pass entry=1 today. Future entry>=2 returns nullptr → LOGICAL_ERROR in getQueryTreeInfo / updateQueryWithJoinToSendIfNeeded. |
| Anchor | src/Storages/IStorageCluster.cpp — SearcherVisitor::needChildVisit |
| Trigger | SearcherVisitor(..., entry=2, ...) on a tree with two TABLE / TABLE_FUNCTION nodes. |
| Why defect | needChildVisit uses !current_entry; after the first match current_entry>=1, traversal stops before a second sibling is visited. |
| Fix direction | Use current_entry < entry or rely only on !passed_node. |
| Regression test | Unit test: two table functions; entry=2 returns the second node. |
bool needChildVisit(QueryTreeNodePtr & /*parent*/, QueryTreeNodePtr & /*child*/)
{
return getSubqueryDepth() <= 2 && !passed_node && !current_entry;
}Resolved findings (addressed in PR)
Medium: GLOBAL IN not rewritten when IN subquery only references StorageDistributed — fixed in a8dbd327
| Field | Detail |
|---|---|
| Original issue | RewriteInToGlobalInVisitor skipped rewrite when every table in the IN subquery was StorageDistributed (no_replace early return). Wrong on object-storage initiator: swarm nodes are not the distributed table's cluster. |
| Fix | rewrite_for_distributed parameter; when true, skip no_replace logic. IStorageCluster GLOBAL path calls rewriteInToGlobalIn(..., /*rewrite_for_distributed*/ true). |
| Tests | test_cluster_joins.py: JOIN distributed table, GLOBAL IN / IN with distributed subquery; conftest with_zookeeper=True for ON CLUSTER. |
if (!rewrite_for_distributed)
{
bool no_replace = true;
for (const auto & table_node : extractTableExpressions(query->getJoinTree(), false, true))
{
// ... StorageDistributed check ...
}
if (no_replace)
return;
}Note: rewriteInToGlobalIn(..., true) runs only when info.has_local_columns_in_where. Pure WHERE tag IN (SELECT … FROM distributed) without other non–object-storage column nodes in WHERE/PREWHERE may not set that flag; behavior then relies on buildQueryTreeForShard (find_cross_join=true + object_storage_cluster_join_mode=GLOBAL in DistributedProductModeRewriteInJoinVisitor). New integration tests cover plain IN with a distributed table for both local and global join modes.
Other fixes verified (no remaining defect classification)
| Commit | Issue | Resolution |
|---|---|---|
3926fa3 |
GLOBAL IN rewrite in object-storage cluster path | rewriteInToGlobalIn / visitor plumbing |
fb33a04 |
CROSS JOIN right-side not materialized | find_cross_join + CrossJoinNode handling in buildQueryTreeForShard and DistributedProductModeRewriteInJoinVisitor |
b63ad834 |
Root IN replacement not visible to caller |
rewriteInToGlobalIn(QueryTreeNodePtr &) + visit full tree / WHERE reference |
| (feature) | Temp tables not sent to swarm | ReadFromCluster captures external_tables from planner_context → RemoteQueryExecutor |
| (feature) | getQueryProcessingStage |
LOCAL: FetchColumns when join/cross join/local WHERE; GLOBAL: join stays on swarm (no forced FetchColumns for join strip) — intentional |
Not confirmed / out of scope
| Topic | Status |
|---|---|
object_storage_remote_initiator + GLOBAL join |
external_tables wired for direct ReadFromCluster path only. Remote-initiator recursion via StorageDistributed::read + Context::createCopy in convertToRemote not exhaustively traced; not classified as confirmed defect. |
| Cross join with cluster table not at index 0 | Same limitation as LOCAL mode; by design / documented behavior. |
IN only in JOIN ON (not WHERE) |
Not covered by has_local_columns_in_where; may rely on buildQueryTreeForShard stack logic — no failing test identified. |
| CI | PR CI had integration/regression failures at a8dbd327; not analyzed for root cause vs. this feature. |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Cluster Joins part 2 - global mode
Documentation entry for user-facing changes
Frontport of #1527
Setting
object_storage_cluster_join_modewiith valueglobal.In queries like
when left table is executed on cluster (
s3Cluster,Icebergwithobject_storage_clustersetting, etc.) data from right table is extracted and sent to swarm nodes as temorary tables. JOIN is executed on swarm nodes.This PR also includes several fixes for issues, found by AI
buildQueryTreeForShardwith ARRAY JOIN and GLOBAL JOIN ClickHouse/ClickHouse#96310These changes are in last three commits, and new for 26.3 port, do not exists in #1527 for 26.1.
CI/CD Options
Exclude tests:
Regression jobs to run: