Fix Iceberg read optimization returning NULLs for stats-less manifests (#1545) — antalya-26.3#1814
Fix Iceberg read optimization returning NULLs for stats-less manifests (#1545) — antalya-26.3#1814il9ue wants to merge 2 commits into
Conversation
When an Iceberg manifest's per-file column statistics are absent (a common case for non-Spark writers like pyiceberg with default settings), DataFileMetaInfo::columns_info is empty. The optimization in StorageObjectStorageSource::createReader misread this as 'all columns are absent from the file' and returned constant NULLs for every row while still returning the correct row count. Result: silent data loss on icebergLocal, icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants. Track whether any per-file stats were emitted via a new 'any_stats_field_present' boolean on DataFileMetaInfo, populated during manifest parsing in AvroForIcebergDeserializer. The optimization's absent-NULL loop only fires when stats are present; when stats are absent entirely, fall through to the Parquet reader, which correctly handles both physically-present columns (read normally) and schema-evolved-absent columns (handled by IcebergMetadata::getInitialSchemaByPath setting the file's own schema as initial_header). Closes Altinity#1545. Mirror of Altinity#1688 (antalya-25.8 fix). Signed-off-by: Daniel Q. Kim <daniel.kim@altinity.com>
d6ea43d to
4c61947
Compare
…mpty-stats-26.3 Signed-off-by: Daniel Q. Kim <daniel.kim@altinity.com> # Conflicts: # src/Storages/ObjectStorage/StorageObjectStorageSource.cpp
4c61947 to
4f483f6
Compare
Heads upCI is failing with a workflow template error:
This appears to be coming from antalya-26.3's pull_request.yml — that file is identical between this PR and the base branch (verified via This looks like an infrastructure issue on antalya-26.3 affecting all PRs targeting that branch, not anything specific to this PR. The code itself builds clean locally with clang-21 + RelWithDebInfo. Is this something the CI team is aware of, or should I open a separate issue? |
|
This is a known issue. We have separate workflows for internal and external branches. When the author of an external branch is an org member, some jobs get confused. |
|
|
||
| void DataFileMetaInfo::serialize(WriteBuffer & out) const | ||
| { | ||
| writeIntBinary(static_cast<UInt8>(stats_were_read), out); |
There was a problem hiding this comment.
Does this break backward compatibility?
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix Iceberg read optimization returning NULL for every column when reading from manifests written without per-file column statistics (typical of non-Spark writers like pyiceberg with default settings). Affects
icebergLocal,icebergS3,icebergAzure,icebergHDFS, and all*Clustervariants. Antalya 26.3 fix for Altinity/ClickHouse#1545.Description
Antalya-specific bug fix on
antalya-26.3. No upstream cherry-pick — this bug exists only on Antalya, introduced by Altinity/ClickHouse#1069 ("Read optimization using Iceberg metadata"). Mirror of the 25.8 fix in Altinity/ClickHouse#1688.Why this fires
When reading an Iceberg table written by a non-Spark writer that omits per-file column statistics from the manifest's Avro schema (pyiceberg with default settings, format v1 writers, and others), the
allow_experimental_iceberg_read_optimizationpath produces silent data loss: correct row counts, every column valueNULL. The reporter confirmed it onicebergLocal; investigation showed the same code path fires foricebergS3,icebergAzure,icebergHDFS, and all*Clustervariants.Root cause
IcebergIteratoralways populatesfile_meta_infobefore yielding objects, so thefile_meta_data.has_value()check in the optimization passes. The issue is what's inside the populatedDataFileMetaInfo: when the manifest'sdata_file.value_counts/column_sizes/null_value_countsAvro fields are all absent (per the Iceberg spec, all three are optional),DataFileMetaInfo::columns_infostays empty.The optimization's second loop in
StorageObjectStorageSource::createReaderthen iterates every requested column, finds none of them in the emptycolumns_infomap, and adds them all toconstant_columns_with_valueswithField()(NULL).requested_columns_copyis cleared,need_only_count = true, the Parquet reader returns row count only, andgenerate()injects every column as a constant-NULL column at the correct row count.The optimization conflates "no stats were written" with "all columns are absent." Absent stats tell us nothing about which columns are physically present in the file.
The fix
Add
any_stats_field_present(bool) toDataFileMetaInfo. Populate it during manifest parsing inAvroForIcebergDeserializer.cpp—trueif any ofvalue_counts,column_sizes, ornull_value_countswere emitted by the writer. Gate the optimization's absent-NULL loop on this flag: when no stats were emitted, skip the loop entirely and fall through to the Parquet reader, which correctly handles both physically-present columns (read normally) and schema-evolved-absent columns (handled upstream byIcebergMetadata::getInitialSchemaByPathsetting the file's own schema asinitial_header).A per-column presence set was considered but is unnecessary because schema evolution is already handled upstream of the optimization; the boolean is sufficient.
JSON serialization (cluster reads via
toJson()/ JSON-ptr constructor) is updated to round-trip the new field. Missing-on-deserialization defaults tofalse, which matches pre-fix behavior.Files changed
src/Storages/ObjectStorage/DataLakes/IDataLakeMetadata.h: addedany_stats_field_presentfield toDataFileMetaInfo; constructor signature updated.src/Storages/ObjectStorage/DataLakes/IDataLakeMetadata.cpp: JSON serde round-trips the new field; missing-on-deserialize defaults tofalse.src/Storages/ObjectStorage/DataLakes/Iceberg/ManifestFile.h: header updates forParsedManifestFileEntry.src/Storages/ObjectStorage/DataLakes/Common/AvroForIcebergDeserializer.cpp: tracks whether any stats Avro field was present during manifest parsing on 26.3.src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergIterator.cpp: forwards the new bool when constructingDataFileMetaInfo.src/Storages/ObjectStorage/StorageObjectStorageSource.cpp: the absent-NULL loop now skips whenany_stats_field_presentisfalse.Note: 26.3 uses
AvroForIcebergDeserializer.cppfor manifest parsing where 25.8 / 26.1 useManifestFile.cpp(file was split upstream). Same logic, different file.Tested
tests/integration/test_storage_iceberg_no_spark/test_iceberg_read_optimization_empty_stats.pyported from the 25.8 PR. Test logic and conftest fixture (started_cluster_iceberg_no_spark) compatible with 26.3 as-is. Four scenarios:test_iceberg_local_returns_actual_rows_with_stats_less_manifest— reproducer, fails without the fix.test_iceberg_local_returns_correct_rows_when_optimization_disabled— control.test_iceberg_local_partial_stats_manifest_reads_correctly— manifest withvalue_countsonly.test_iceberg_local_full_stats_manifest_reads_correctly— full Spark-style stats regression guard.clang -fsyntax-onlyagainst 26.3's source headers in the verification round of Backport #100607 to 25.8.16: Re-add {database} macro support in clickhouse-client prompt #1688. Full integration test execution will run on CI.CI/CD Options
Exclude tests:
Regression jobs to run: