Merged
Parallelizes manifest processing to improve performance for large tables with many manifest files. After parallel processing, merges the resulting partition maps to produce the final aggregated result.
Contributor
Author
Hey @jayceslesar, could you please take a look at this PR? Thank you. I took this PR as ref and wanted to apply to
Fokko
reviewed
Aug 20, 2025
Contributor
Fokko
left a comment
Hey @emilie-wang, thanks for speeding this up. While we're at it, I think we need to do a minor refactor as well to keep everything readable.
pyiceberg/table/inspect.py
Outdated
partitions_map: Dict[Tuple[str, Any], Any] = {}
snapshot = self._get_snapshot(snapshot_id)
for manifest in snapshot.manifests(self.tbl.io):
def process_manifest(manifest: ManifestFile) -> Dict[Tuple[str, Any], Any]:
Contributor
Since we're at it, I would suggest two things:
- Move the inline function to the class level, and add an underscore to the name to indicate that it is considered private: _process_manifest.
- Merge this function with update_partitions_map, since that function isn't used anywhere else.
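For readers following along, here is a minimal sketch of the shape this suggestion points at, combined with the parallelize-then-merge approach from the PR description. It is not the actual PyIceberg code: FakeManifest, InspectSketch, the "record_count"/"file_count" fields, and the use of ThreadPoolExecutor are all hypothetical stand-ins chosen so the example runs on the standard library alone.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

PartitionKey = Tuple[str, Any]
PartitionRow = Dict[str, int]


@dataclass
class FakeManifest:
    """Hypothetical stand-in for a manifest: (partition value, record count) per data file."""

    entries: List[Tuple[str, int]]


class InspectSketch:
    """Hypothetical stand-in for the inspect class; not the real PyIceberg implementation."""

    def __init__(self, manifests: List[FakeManifest]) -> None:
        self.manifests = manifests

    def _process_manifest(self, manifest: FakeManifest) -> Dict[PartitionKey, PartitionRow]:
        # Class-level private helper. The per-entry update logic lives here
        # instead of in a separate single-use update function.
        partitions_map: Dict[PartitionKey, PartitionRow] = {}
        for partition, record_count in manifest.entries:
            key = ("partition", partition)
            row = partitions_map.setdefault(key, {"record_count": 0, "file_count": 0})
            row["record_count"] += record_count
            row["file_count"] += 1
        return partitions_map

    def partitions(self) -> Dict[PartitionKey, PartitionRow]:
        # Each manifest is processed independently, so the work can run in parallel.
        with ThreadPoolExecutor() as executor:
            partial_maps = list(executor.map(self._process_manifest, self.manifests))

        # Merge the per-manifest maps into the final aggregated result.
        merged: Dict[PartitionKey, PartitionRow] = {}
        for partial in partial_maps:
            for key, row in partial.items():
                target = merged.setdefault(key, {"record_count": 0, "file_count": 0})
                target["record_count"] += row["record_count"]
                target["file_count"] += row["file_count"]
        return merged
```

Because each worker builds its own map and the merge happens on the calling thread afterwards, no locking is needed around the shared result in this sketch.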
Contributor
Author
Hi @Fokko, thank you for the review. I've updated the PR with the code refactoring.
Fokko
reviewed
Aug 20, 2025
Fokko
approved these changes
Aug 20, 2025
Contributor
Thanks for fixing this @emilie-wang 🙌
Previous example ref: e937f6a
Rationale for this change
Perf improvement.
We experienced slowness when calling table.inspect.partitions() on large tables.
Are these changes tested?
Yes.
Are there any user-facing changes?
No.