Fix Iceberg read optimization returning NULLs for stats-less manifests (#1545) — antalya-26.3 by il9ue · Pull Request #1814 · Altinity/ClickHouse

il9ue · 2026-05-20T07:29:46Z

Changelog category (leave one):

Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix Iceberg read optimization returning NULL for every column when reading from manifests written without per-file column statistics (typical of non-Spark writers like pyiceberg with default settings). Affects icebergLocal, icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants. Antalya 26.3 fix for Altinity/ClickHouse#1545.

Description

Antalya-specific bug fix on antalya-26.3. No upstream cherry-pick — this bug exists only on Antalya, introduced by Altinity/ClickHouse#1069 ("Read optimization using Iceberg metadata"). Mirror of the 25.8 fix in Altinity/ClickHouse#1688.

Why this fires

When reading an Iceberg table written by a non-Spark writer that omits per-file column statistics from the manifest's Avro schema (pyiceberg with default settings, format v1 writers, and others), the allow_experimental_iceberg_read_optimization path produces silent data loss: correct row counts, every column value NULL. The reporter confirmed it on icebergLocal; investigation showed the same code path fires for icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants.

Root cause

IcebergIterator always populates file_meta_info before yielding objects, so the file_meta_data.has_value() check in the optimization passes. The issue is what's inside the populated DataFileMetaInfo: when the manifest's data_file.value_counts / column_sizes / null_value_counts Avro fields are all absent (per the Iceberg spec, all three are optional), DataFileMetaInfo::columns_info stays empty.

The optimization's second loop in StorageObjectStorageSource::createReader then iterates every requested column, finds none of them in the empty columns_info map, and adds them all to constant_columns_with_values with Field() (NULL). requested_columns_copy is cleared, need_only_count = true, the Parquet reader returns row count only, and generate() injects every column as a constant-NULL column at the correct row count.

The optimization conflates "no stats were written" with "all columns are absent." Absent stats tell us nothing about which columns are physically present in the file.
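The failure mode described above can be modelled in a few lines. This is a minimal sketch with hypothetical, simplified types (`FileMetaBeforeFix`, `ReadPlan`, `planReadBeforeFix` are illustrative names, not ClickHouse code); it only mirrors the decision described in this section, assuming a column missing from `columns_info` is treated as absent from the file.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for DataFileMetaInfo before the fix.
struct FileMetaBeforeFix
{
    // Keyed by column name; stays empty when the manifest has no stats fields.
    std::map<std::string, long> columns_info;
};

// Hypothetical summary of what the reader is asked to do.
struct ReadPlan
{
    std::vector<std::string> constant_null_columns; // injected as constant NULL
    std::vector<std::string> columns_to_read;       // read from the Parquet file
    bool need_only_count = false;
};

// Pre-fix decision as described above: any requested column that is missing
// from columns_info is replaced by a constant NULL.
ReadPlan planReadBeforeFix(const FileMetaBeforeFix & meta,
                           const std::vector<std::string> & requested)
{
    ReadPlan plan;
    for (const auto & name : requested)
    {
        if (meta.columns_info.count(name) == 0)
            plan.constant_null_columns.push_back(name);
        else
            plan.columns_to_read.push_back(name);
    }
    // Nothing left to read means only the row count is fetched.
    plan.need_only_count = plan.columns_to_read.empty();
    return plan;
}
```

With an empty `columns_info`, every requested column lands in `constant_null_columns` and `need_only_count` becomes true, which is the "correct row count, all NULL" symptom.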

The fix

Add any_stats_field_present (bool) to DataFileMetaInfo. Populate it during manifest parsing in AvroForIcebergDeserializer.cpp — true if any of value_counts, column_sizes, or null_value_counts were emitted by the writer. Gate the optimization's absent-NULL loop on this flag: when no stats were emitted, skip the loop entirely and fall through to the Parquet reader, which correctly handles both physically-present columns (read normally) and schema-evolved-absent columns (handled upstream by IcebergMetadata::getInitialSchemaByPath setting the file's own schema as initial_header).

A per-column presence set was considered but is unnecessary because schema evolution is already handled upstream of the optimization; the boolean is sufficient.
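The gating can be sketched as follows. The types and function names here (`ManifestStatsFields`, `FileMetaWithFlag`, `columnsToReplaceWithNull`) are assumptions made for illustration; only `any_stats_field_present` and the three Avro field names come from the description above.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical record of which optional stats fields the writer emitted.
struct ManifestStatsFields
{
    bool has_value_counts = false;
    bool has_column_sizes = false;
    bool has_null_value_counts = false;
};

// True if the writer emitted at least one of the three optional fields.
bool anyStatsFieldPresent(const ManifestStatsFields & f)
{
    return f.has_value_counts || f.has_column_sizes || f.has_null_value_counts;
}

// Hypothetical stand-in for DataFileMetaInfo after the fix.
struct FileMetaWithFlag
{
    std::set<std::string> columns_with_stats; // stand-in for columns_info keys
    bool any_stats_field_present = false;     // the new flag
};

// Returns the columns that may be injected as constant NULL.
std::vector<std::string> columnsToReplaceWithNull(
    const FileMetaWithFlag & meta, const std::vector<std::string> & requested)
{
    std::vector<std::string> result;
    // No stats at all: absence cannot be inferred, so read every column.
    if (!meta.any_stats_field_present)
        return result;
    for (const auto & name : requested)
        if (meta.columns_with_stats.count(name) == 0)
            result.push_back(name);
    return result;
}
```

For a stats-less manifest the function returns an empty list, so all requested columns go to the Parquet reader; the absent-NULL substitution applies only when the writer emitted stats.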

JSON serialization (cluster reads via toJson() / JSON-ptr constructor) is updated to round-trip the new field. Missing-on-deserialization defaults to false, which matches pre-fix behavior.
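The missing-key default can be illustrated with a small sketch. A `std::map<std::string, bool>` stands in for the JSON object here, and the key name is assumed to match the field name; the real code uses ClickHouse's own JSON handling, which is not reproduced.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a parsed JSON object holding boolean fields.
using JsonObjectSketch = std::map<std::string, bool>;

// Reads the flag, falling back to false when the sender did not include it
// (for example, JSON produced before the field existed).
bool readAnyStatsFieldPresent(const JsonObjectSketch & obj)
{
    auto it = obj.find("any_stats_field_present");
    return it != obj.end() ? it->second : false;
}
```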

Files changed

- src/Storages/ObjectStorage/DataLakes/IDataLakeMetadata.h: added any_stats_field_present field to DataFileMetaInfo; constructor signature updated.
- src/Storages/ObjectStorage/DataLakes/IDataLakeMetadata.cpp: JSON serde round-trips the new field; missing-on-deserialize defaults to false.
- src/Storages/ObjectStorage/DataLakes/Iceberg/ManifestFile.h: header updates for ParsedManifestFileEntry.
- src/Storages/ObjectStorage/DataLakes/Common/AvroForIcebergDeserializer.cpp: tracks whether any stats Avro field was present during manifest parsing on 26.3.
- src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergIterator.cpp: forwards the new bool when constructing DataFileMetaInfo.
- src/Storages/ObjectStorage/StorageObjectStorageSource.cpp: the absent-NULL loop is now skipped when any_stats_field_present is false.

Note: 26.3 uses AvroForIcebergDeserializer.cpp for manifest parsing where 25.8 / 26.1 use ManifestFile.cpp (file was split upstream). Same logic, different file.

Tested

Integration test tests/integration/test_storage_iceberg_no_spark/test_iceberg_read_optimization_empty_stats.py ported from the 25.8 PR. The test logic and conftest fixture (started_cluster_iceberg_no_spark) are compatible with 26.3 as-is. Four scenarios:
- test_iceberg_local_returns_actual_rows_with_stats_less_manifest — reproducer, fails without the fix.
- test_iceberg_local_returns_correct_rows_when_optimization_disabled — control.
- test_iceberg_local_partial_stats_manifest_reads_correctly — manifest with value_counts only.
- test_iceberg_local_full_stats_manifest_reads_correctly — full Spark-style stats regression guard.
Local build verification: the changed files passed clang -fsyntax-only against 26.3's source headers during the verification round of #1688 ("Backport #100607 to 25.8.16: Re-add {database} macro support in clickhouse-client prompt"). Full integration test execution will run on CI.

CI/CD Options

Exclude tests:

Regression jobs to run:

When an Iceberg manifest's per-file column statistics are absent (a common case for non-Spark writers like pyiceberg with default settings), DataFileMetaInfo::columns_info is empty. The optimization in StorageObjectStorageSource::createReader misread this as 'all columns are absent from the file' and returned constant NULLs for every row while still returning the correct row count. Result: silent data loss on icebergLocal, icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants.

Track whether any per-file stats were emitted via a new 'any_stats_field_present' boolean on DataFileMetaInfo, populated during manifest parsing in AvroForIcebergDeserializer. The optimization's absent-NULL loop only fires when stats are present; when stats are absent entirely, fall through to the Parquet reader, which correctly handles both physically-present columns (read normally) and schema-evolved-absent columns (handled by IcebergMetadata::getInitialSchemaByPath setting the file's own schema as initial_header).

Closes Altinity#1545. Mirror of Altinity#1688 (antalya-25.8 fix).

Signed-off-by: Daniel Q. Kim <daniel.kim@altinity.com>

il9ue · 2026-05-20T12:51:12Z

Heads up

CI is failing with a workflow template error:

Error: The template is not valid. .github/workflows/pull_request.yml
(Line: 5285, Col: 18): Error reading JToken from JsonReader.
Path '', line 0, position 0.

This appears to be coming from antalya-26.3's pull_request.yml — that file is identical between this PR and the base branch (verified via git diff origin/antalya-26.3 -- .github/workflows/pull_request.yml), and the failing line references fromJson(needs.config_workflow.outputs.data) which suggests config_workflow isn't producing the expected output upstream.

This looks like an infrastructure issue on antalya-26.3 affecting all PRs targeting that branch, not anything specific to this PR. The code itself builds clean locally with clang-21 + RelWithDebInfo.

Is this something the CI team is aware of, or should I open a separate issue?

strtgbb · 2026-05-20T13:05:23Z

This is a known issue. We have separate workflows for internal and external branches. When the author of an external branch is an org member, some jobs get confused.

ianton-ru · 2026-05-20T13:06:59Z


 void DataFileMetaInfo::serialize(WriteBuffer & out) const
 {
+    writeIntBinary(static_cast<UInt8>(stats_were_read), out);


Does this break backward compatibility?

il9ue requested review from ianton-ru and zvonand May 20, 2026 08:01

il9ue mentioned this pull request May 20, 2026

Fix Iceberg read optimization returning NULLs for stats-less manifests (#1545) #1764

Closed

25 tasks

il9ue force-pushed the fix/iceberg-empty-stats-26.3 branch from d6ea43d to 4c61947 · May 20, 2026 09:49

Merge remote-tracking branch 'origin/antalya-26.3' into fix/iceberg-empty-stats-26.3 (4f483f6)

Signed-off-by: Daniel Q. Kim <daniel.kim@altinity.com>
# Conflicts:
#   src/Storages/ObjectStorage/StorageObjectStorageSource.cpp

il9ue force-pushed the fix/iceberg-empty-stats-26.3 branch from 4c61947 to 4f483f6 · May 20, 2026 12:21

ianton-ru reviewed May 20, 2026

il9ue wants to merge 2 commits into Altinity:antalya-26.3 from il9ue:fix/iceberg-empty-stats-26.3
