Support V2 Format for Iceberg Ingestion #19534
FrankChen021
left a comment
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 1 |
| P2 | 1 |
| P3 | 0 |
| Total | 2 |
Reviewed 8 of 8 changed files.
This is an automated review by Codex GPT-5.5
{
  private static final Logger log = new Logger(IcebergNativeRecordReader.class);

  private static final WarehouseFileIO WAREHOUSE_FILE_IO = new WarehouseFileIO();
[P1] Undefined FileIO class breaks compilation
The new reader instantiates WarehouseFileIO, but no class with that name is defined in this package or elsewhere in the repo, and there is no import that could resolve it. Any build that compiles this extension will fail before the V2 ingestion path can run. Use an existing Iceberg FileIO implementation or add the missing implementation.
  // Handle residual filter based on mode
- if (detectedResidual != null) {
+ if (detectedResidual == null) {
[P2] Residual-filter check is inverted
This condition now enters the residual-handling branch when detectedResidual is null, even though null means no residual was found. With ResidualFilterMode.FAIL, valid scans that have no residual, such as partition-pruned scans, will throw a residual-filter error, while scans that do have a residual skip the warning/failure path. Restore the check to run only when detectedResidual != null.
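The intended behavior can be illustrated with a minimal, self-contained sketch. It is not the extension's actual code: the class `ResidualCheckSketch`, the method `handleResidual`, the `WARN` mode, and the message strings are hypothetical, and the residual is modeled as a plain string rather than an Iceberg expression. Only the `detectedResidual != null` guard and the `FAIL` mode come from the review above.

```java
public class ResidualCheckSketch {
  // Hypothetical stand-in for ResidualFilterMode; only FAIL is named in the review.
  public enum Mode { WARN, FAIL }

  /**
   * Returns a warning message when a residual exists and the mode is WARN,
   * returns null when there is no residual, and throws when a residual
   * exists and the mode is FAIL.
   */
  public static String handleResidual(String detectedResidual, Mode mode) {
    // Guard as the review asks: act only when a residual was actually found.
    if (detectedResidual != null) {
      if (mode == Mode.FAIL) {
        throw new IllegalStateException("Residual filter detected: " + detectedResidual);
      }
      return "Residual filter will not be applied: " + detectedResidual;
    }
    // A null residual means nothing was left over, so there is nothing to report.
    return null;
  }
}
```

With this guard, a scan that has no residual passes through silently even in `FAIL` mode, while a scan that does have a residual reaches the warning or failure path.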
Fixes #19190
Description
This PR fixes Iceberg delete semantics, which were previously broken for V2 tables. No spec changes are required: the same ingestion spec works for V1 and V2 tables, and delete detection is automatic at planning time.
In the V2 path, each FileScanTask is encoded as an IcebergFileTaskInputSource containing the data file path, format, size, delete file metadata (paths, content type, equality field IDs), and the table schema captured as JSON at planning time. IcebergNativeRecordReader processes each split:
- Position delete files: The row positions to skip are read and filtered down to those that reference the current data file path.
- Equality delete files: Key tuples are read from each equality delete file.
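The two delete kinds above can be sketched in a small, self-contained example. This is an illustration under simplifying assumptions, not the reader's actual implementation: the class `DeleteFilterSketch` and its methods are hypothetical, rows are modeled as plain maps, equality fields are identified by name instead of field ID, and sequence-number scoping of delete files is ignored.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DeleteFilterSketch {
  /** One entry of a position delete file: a data file path and a row position in it. */
  public static final class PositionDelete {
    final String filePath;
    final long pos;

    public PositionDelete(String filePath, long pos) {
      this.filePath = filePath;
      this.pos = pos;
    }
  }

  /** Keeps only the positions that target the data file currently being read. */
  public static Set<Long> deletedPositions(List<PositionDelete> deletes, String dataFilePath) {
    Set<Long> positions = new HashSet<>();
    for (PositionDelete d : deletes) {
      if (d.filePath.equals(dataFilePath)) {
        positions.add(d.pos);
      }
    }
    return positions;
  }

  /** Projects a row onto the equality fields to form its key tuple. */
  public static List<Object> keyOf(Map<String, Object> row, List<String> equalityFields) {
    List<Object> key = new ArrayList<>();
    for (String field : equalityFields) {
      key.add(row.get(field));
    }
    return key;
  }

  /** Drops rows whose position or key tuple is marked as deleted. */
  public static List<Map<String, Object>> applyDeletes(
      List<Map<String, Object>> rows,
      Set<Long> deletedPositions,
      List<String> equalityFields,
      Set<List<Object>> deletedKeys) {
    List<Map<String, Object>> surviving = new ArrayList<>();
    for (int pos = 0; pos < rows.size(); pos++) {
      Map<String, Object> row = rows.get(pos);
      if (deletedPositions.contains((long) pos)) {
        continue; // removed by a position delete
      }
      if (deletedKeys.contains(keyOf(row, equalityFields))) {
        continue; // removed by an equality delete
      }
      surviving.add(row);
    }
    return surviving;
  }
}
```

Filtering the position deletes by file path first keeps the per-split set small, since a single position delete file may reference many data files.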
Rows that survive the delete filters are converted to MapBasedInputRow via IcebergRecordConverter.
FileIO is resolved per path: local paths use WarehouseFileIO, while non-local paths (HDFS, S3) fall back to catalog.loadTable().io(), using the namespace and table name passed through the split.
Release note
The druid-iceberg-extensions input source now correctly handles Iceberg V2 tables that contain row-level delete files. Previously, position deletes and equality deletes were silently ignored, causing deleted rows to appear in ingested segments. Druid now auto-detects delete files at planning time and applies them during ingestion with no spec changes required.
Key changed/added classes in this PR
- IcebergCatalog
- IcebergInputSource
This PR has: