Add all files metadata tables by soumya-ghosh · Pull Request #1626 · apache/iceberg-python

soumya-ghosh · 2025-02-08T10:49:26Z

Implements the following metadata tables from #1053:

all_files
all_data_files
all_delete_files

Refactored code for files metadata for better reusability

kevinjqliu

Thanks for the PR, I added a few comments

kevinjqliu · 2025-02-08T18:16:58Z

pyiceberg/table/inspect.py

+        all_manifest_files_by_snapshot: Iterator[List[ManifestFile]] = executor.map(
+            lambda args: args[0].manifests(self.tbl.io), [(snapshot,) for snapshot in snapshots]
+        )
+        all_manifest_files = list(
+            {(manifest.manifest_path, manifest) for manifest_list in all_manifest_files_by_snapshot for manifest in manifest_list}
+        )
+        all_files_by_manifest: Iterator[List[Dict[str, Any]]] = executor.map(
+            lambda args: self._files_by_manifest(*args), [(manifest, data_file_filter) for _, manifest in all_manifest_files]
+        )
+        all_files_list = [file for files in all_files_by_manifest for file in files]
+        return pa.Table.from_pylist(
+            all_files_list,
+            schema=self._get_files_schema(),
+        )


WDYT about something like this?

Also, I would rename _files_by_manifest and have it return a pa.Table, so we can skip the flatten step and just concatenate the tables.

Suggested change

-        all_manifest_files_by_snapshot: Iterator[List[ManifestFile]] = executor.map(
-            lambda args: args[0].manifests(self.tbl.io), [(snapshot,) for snapshot in snapshots]
-        )
-        all_manifest_files = list(
-            {(manifest.manifest_path, manifest) for manifest_list in all_manifest_files_by_snapshot for manifest in manifest_list}
-        )
-        all_files_by_manifest: Iterator[List[Dict[str, Any]]] = executor.map(
-            lambda args: self._files_by_manifest(*args), [(manifest, data_file_filter) for _, manifest in all_manifest_files]
-        )
-        all_files_list = [file for files in all_files_by_manifest for file in files]
-        return pa.Table.from_pylist(
-            all_files_list,
-            schema=self._get_files_schema(),
-        )
+        manifest_lists = executor.map(
+            lambda snapshot: snapshot.manifests(self.tbl.io),
+            snapshots
+        )
+        unique_manifests = {
+            (manifest.manifest_path, manifest)
+            for manifest_list in manifest_lists
+            for manifest in manifest_list
+        }
+        file_lists = executor.map(
+            self._files_by_manifest,
+            [(manifest, data_file_filter) for _, manifest in unique_manifests]
+        )
+        all_files = [
+            file
+            for file_list in file_lists
+            for file in file_list
+        ]
+        return pa.Table.from_pylist(
+            all_files,
+            schema=self._get_files_schema()
+        )

I agree with this, the impl of the _files_by_manifest enforces uniqueness which wasn't clear
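For readers following the thread, here is a minimal, self-contained sketch of the pattern under discussion: collect manifests from every snapshot, de-duplicate them by path, then read them in parallel and flatten the results. The `Manifest` and `Snapshot` classes and the `files_by_manifest` helper are hypothetical stand-ins, not pyiceberg's API. One detail worth noting is that `executor.map` passes each item to the callable as a single argument, so a tuple of arguments has to be unpacked explicitly.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Manifest:  # hypothetical stand-in for a manifest file
    manifest_path: str
    files: Tuple[str, ...]


@dataclass(frozen=True)
class Snapshot:  # hypothetical stand-in for a table snapshot
    manifest_list: Tuple[Manifest, ...]

    def manifests(self) -> List[Manifest]:
        return list(self.manifest_list)


def files_by_manifest(manifest: Manifest, suffix: str) -> List[str]:
    # Hypothetical reader: keep only the paths that end in `suffix`.
    return [path for path in manifest.files if path.endswith(suffix)]


def all_files(snapshots: List[Snapshot], suffix: str) -> List[str]:
    with ThreadPoolExecutor() as executor:
        manifest_lists = executor.map(lambda s: s.manifests(), snapshots)
        # Key by path so a manifest shared by several snapshots is read once.
        unique: Dict[str, Manifest] = {
            m.manifest_path: m for ml in manifest_lists for m in ml
        }
        # executor.map hands each item to the callable as ONE argument,
        # so the (manifest, suffix) tuple is unpacked inside the lambda.
        file_lists = executor.map(
            lambda args: files_by_manifest(*args),
            [(m, suffix) for m in unique.values()],
        )
        # Flatten the per-manifest lists into one list.
        return [path for file_list in file_lists for path in file_list]
```

Keying a dict by `manifest_path` makes the uniqueness rule explicit, which is the point raised in the reply above.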

kevinjqliu · 2025-02-08T18:18:10Z

pyiceberg/table/inspect.py

+        self, manifest_list: ManifestFile, data_file_filter: Optional[Set[DataFileContent]] = None
+    ) -> List[Dict[str, Any]]:
+        files: list[dict[str, Any]] = []
+        schema = self.tbl.metadata.schema()


When time traveling across different snapshots, we shouldn't just use the current table schema.
For context: #1053 (comment)
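A rough sketch of what that could look like, assuming each snapshot records an optional schema id and the table keeps a mapping from schema id to schema. The `Snapshot` class and `schema_for_snapshot` helper are hypothetical and only illustrate the lookup, not the code that landed in this PR.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Schema = Tuple[str, ...]  # hypothetical: a schema reduced to its column names


@dataclass(frozen=True)
class Snapshot:  # hypothetical stand-in for a table snapshot
    snapshot_id: int
    schema_id: Optional[int]


def schema_for_snapshot(
    snapshot: Snapshot, schemas: Dict[int, Schema], current_schema_id: int
) -> Schema:
    # Prefer the schema the snapshot was written with.
    if snapshot.schema_id is not None and snapshot.schema_id in schemas:
        return schemas[snapshot.schema_id]
    # Fall back to the current schema when the snapshot has no usable schema id.
    return schemas[current_schema_id]
```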

@kevinjqliu updated code as per comments.

Fokko · 2025-05-06T07:58:49Z

@soumya-ghosh Gentle ping, would you be interested in contributing this? Would be great to get this in 🚀

Fokko · 2025-05-06T08:03:51Z

pyiceberg/table/inspect.py

            return pa.Table.from_pylist(
-                files,
-                schema=files_schema,
+                [],
+                schema=self._get_files_schema(),
            )


Nice one, this can be further simplified to:

return self._get_files_schema().empty_table()

Less is more :)

Fokko · 2025-05-06T08:05:21Z

pyiceberg/table/inspect.py

+    def _files(self, snapshot_id: Optional[int] = None, data_file_filter: Optional[Set[DataFileContent]] = None) -> "pa.Table":
+        import pyarrow as pa
+
+        files_table: list[pa.Table] = []


nit: we can move this one down; we don't need to create the list when we return early on line 642

Fokko · 2025-05-06T08:06:58Z

@soumya-ghosh I see that you incorporated the feedback by @kevinjqliu directly, instead of accepting the suggestion. That also works, thanks for working on this. I think we're pretty close 👍

soumya-ghosh · 2025-05-06T08:07:46Z

Yes @Fokko, there is an open discussion that was happening in #1053 (comment).

I will raise another PR for docs about the inspect operations.

Fokko · 2025-05-06T08:35:23Z

pyiceberg/table/inspect.py

+        return self._all_files({DataFileContent.DATA})
+
+    def all_delete_files(self) -> "pa.Table":
+        return self._all_files({DataFileContent.POSITION_DELETES, DataFileContent.EQUALITY_DELETES})


This should also include Puffin files:

We have a Spark table to test this:

iceberg-python/dev/provision.py

Lines 121 to 138 in 05f07ee

    for format_version in [2, 3]:
        identifier = f'{catalog_name}.default.test_positional_mor_deletes_v{format_version}'
        spark.sql(
            f"""
            CREATE OR REPLACE TABLE {identifier} (
                dt     date,
                number integer,
                letter string
            )
            USING iceberg
            TBLPROPERTIES (
                'write.delete.mode'='merge-on-read',
                'write.update.mode'='merge-on-read',
                'write.merge.mode'='merge-on-read',
                'format-version'='{format_version}'
            );
            """
        )

Okay, will check this. If this requires changes, it will also need changes in the files and delete_files tables.
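To make the filtering in the diff above concrete, here is a small sketch of selecting files by content type. The `Content` enum and the `(path, content)` tuples are hypothetical stand-ins, not pyiceberg types. Under the assumption that a deletion vector stored in a Puffin file is recorded with the position-delete content type, the same filter would pick it up. Whether that holds for the real metadata is what the integration test is meant to confirm.

```python
from enum import IntEnum
from typing import Iterable, List, Optional, Set, Tuple


class Content(IntEnum):  # hypothetical mirror of a data-file content enum
    DATA = 0
    POSITION_DELETES = 1
    EQUALITY_DELETES = 2


Entry = Tuple[str, Content]  # hypothetical (file_path, content) pair


def filter_files(
    entries: Iterable[Entry], wanted: Optional[Set[Content]] = None
) -> List[str]:
    # No filter means every file is returned, as for an all_files listing.
    return [path for path, content in entries if wanted is None or content in wanted]


def delete_files(entries: Iterable[Entry]) -> List[str]:
    # Both delete kinds are selected, regardless of the file format.
    return filter_files(entries, {Content.POSITION_DELETES, Content.EQUALITY_DELETES})
```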

@Fokko Added an integration test for a table with format version 3. I used Spark to write the data, since writes through pyiceberg to a V3 table were failing.

Note that the output of the files metadata table (and all related tables) does not completely match its Spark counterpart, because Spark has additional columns such as first_row_id, referenced_data_file, content_offset, and content_size_in_bytes. These need to be added to the DataFile class first and then propagated as required. This should be addressed in a different issue; will it be part of the V3 tracking issue?

Yes, let's do that in a separate PR: #1982

Fokko

Left one minor comment for partition, apart from that, this looks great to me. Thanks @soumya-ghosh for working on this 🙌

Fokko · 2025-05-08T07:49:53Z

pyiceberg/table/inspect.py

+                    "content": data_file.content,
+                    "file_path": data_file.file_path,
+                    "file_format": data_file.file_format,
+                    "spec_id": data_file.spec_id,


In Spark we also have the partition column, I think it would be good to add that one here as well:

iceberg-python/pyiceberg/table/inspect.py

Lines 124 to 125 in 9fff025

    partition_record = self.tbl.metadata.specs_struct()
    pa_record_struct = schema_to_pyarrow(partition_record)

@Fokko Added partition column in files metadata table schema and added a test for the same

One minor point: could we swap the order of spec_id and partition to keep it the same as in Spark?

Fixed the order of the spec_id and partition columns.

Fokko · 2025-05-08T07:53:30Z

pyiceberg/table/inspect.py

+        return self._all_files({DataFileContent.DATA})
+
+    def all_delete_files(self) -> "pa.Table":
+        return self._all_files({DataFileContent.POSITION_DELETES, DataFileContent.EQUALITY_DELETES})


Yes, let's do that in a separate PR: #1982

Fokko

One minor remark, apart from that it looks good.

Pinging @geruh @kevinjqliu to see if they have any further comments

Fokko · 2025-05-09T09:25:12Z

pyiceberg/table/inspect.py

+                    "content": data_file.content,
+                    "file_path": data_file.file_path,
+                    "file_format": data_file.file_format,
+                    "spec_id": data_file.spec_id,


One minor point: could we swap the order of spec_id and partition to keep it the same as in Spark?

Fokko · 2025-05-13T14:39:00Z

Let's merge this to unblock #1958. Thanks @soumya-ghosh for working on this, and thanks @kevinjqliu and @geruh for the reviews 🙌

Add metadata tables (all_files, all_data_files, all_delete_files) · 96c680b

soumya-ghosh mentioned this pull request Feb 8, 2025: [feat] add missing metadata tables #1053

kevinjqliu reviewed Feb 8, 2025

refactored _get_files_from_manifest and _all_files methods · fb10185

jayceslesar mentioned this pull request May 3, 2025: feat: delete orphaned files #1958

Fokko reviewed May 6, 2025

soumya-ghosh added 2 commits May 7, 2025 00:01

Merge branch 'main' into all_files_metadata_tables · 95a63cb

Add integration tests format version 3 for files metadata tables · 9fff025

Fokko reviewed May 8, 2025

Add partition field in files metadata table schema · 2bce484

Fokko approved these changes May 9, 2025

Fix order of fields in files schema · 9bdf5c7

Fokko requested a review from kevinjqliu May 10, 2025 20:57

Fokko merged commit 2a54034 into apache:main May 13, 2025
10 checks passed
