fix: normalize dictionary types in Arrow scans by GayathriSrividya · Pull Request #3444 · apache/iceberg-python

GayathriSrividya · 2026-05-30T13:38:50Z

Rationale

Iceberg treats Arrow dictionary encoding as an encoding detail rather than a separate logical type. However, ArrowScan.to_table currently concatenates batches without decoding dictionary-encoded columns first. A table containing both plain strings and dictionary-encoded strings therefore fails to scan with ArrowTypeError: Unable to merge.

This can occur in production when some files written with dictionary encoding are later rewritten as plain strings by Athena or Trino optimization, so that a single table ends up with both representations.

Changes

Recursively unwrap Arrow dictionary types while preserving unrelated Arrow types and schema metadata.
Normalize dictionary-encoded batches before permissive concatenation in ArrowScan.to_table.
Add regression coverage for mixed plain/dictionary string batches and nested dictionary types.

Attribution

I checked the issue and PR history before opening this PR. I did not find an earlier PR or implementation to cherry-pick for #3260.

Verification

make lint
make test (3711 passed, 1534 deselected)
uv run python -m pytest tests/io/test_pyarrow.py -q -k "mixed_dictionary or ensure_non_dictionary"

fix: normalize dictionary types in Arrow scans

002837c

GayathriSrividya wants to merge 1 commit into apache:main from GayathriSrividya:fix/issue-3260-arrow-scan-dictionary-strings