
fix: normalize dictionary types in Arrow scans #3444

Open
GayathriSrividya wants to merge 1 commit into apache:main from GayathriSrividya:fix/issue-3260-arrow-scan-dictionary-strings


@GayathriSrividya

Rationale

Iceberg treats Arrow dictionary encoding as an encoding detail rather than a separate logical type. However, ArrowScan.to_table currently concatenates batches without decoding dictionary-encoded columns first. A table containing both plain strings and dictionary-encoded strings therefore fails to scan with ArrowTypeError: Unable to merge.

This can occur in production when files written with dictionary encoding are later rewritten as plain strings by an Athena or Trino optimization.

Changes

  • Recursively unwrap Arrow dictionary types while preserving unrelated Arrow types and schema metadata.
  • Normalize dictionary-encoded batches before permissive concatenation in ArrowScan.to_table.
  • Add regression coverage for mixed plain/dictionary string batches and nested dictionary types.

Attribution

I checked the issue and PR history before opening this PR. I did not find an earlier PR or implementation to cherry-pick for #3260.

Verification

  • make lint
  • make test (3711 passed, 1534 deselected)
  • uv run python -m pytest tests/io/test_pyarrow.py -q -k "mixed_dictionary or ensure_non_dictionary"

