Description
Apache Iceberg version
0.10.0 (latest release)
Please describe the bug 🐞
Summary
expire_snapshots() fails on tables with statistics metadata when using PyIceberg 0.10.0 with Pydantic 2.x due to incorrect positional argument usage in RemoveStatisticsUpdate initialization.
Environment
- PyIceberg Version: 0.10.0
- Pydantic Version: 2.12.4
- Python Version: 3.13.3
- Platform: macOS
- Catalog Type: AWS Glue
Description
When calling expire_snapshots() on Iceberg tables that contain statistics metadata, the operation fails with a Pydantic validation error. The error only occurs on tables with statistics files in their metadata (e.g., tables that have undergone Trino compaction operations with statistics collection enabled).
Error
Traceback (most recent call last):
  File "/lib/python3.13/site-packages/pyiceberg/table/__init__.py", line 1208, in commit
    return self._apply(self._transaction._table_metadata)
  File "/lib/python3.13/site-packages/pyiceberg/table/__init__.py", line 1182, in _apply
    updated_metadata = update_table_metadata(
  File "/lib/python3.13/site-packages/pyiceberg/table/update/__init__.py", line 195, in update_table_metadata
    new_metadata = _apply_table_update(update, base_metadata, context)
  File "/lib/python3.13/site-packages/pyiceberg/table/update/__init__.py", line 490, in _apply_table_update
    for upd in updates:
  File "/lib/python3.13/site-packages/pyiceberg/table/update/__init__.py", line 505, in <genexpr>
    RemoveStatisticsUpdate(statistics_file.snapshot_id)
TypeError: BaseModel.__init__() takes 1 positional argument but 2 were given
Steps to Reproduce
- Create an Iceberg table with statistics metadata (e.g., via Trino compaction with statistics collection)
- Verify table has statistics files:
from pyiceberg.catalog import load_catalog

catalog = load_catalog('glue', **{'type': 'glue', 'region_name': 'us-east-1'})
table = catalog.load_table('database.table_name')
print(f"Statistics files: {len(table.metadata.statistics)}")  # Should be > 0
- Attempt to expire snapshots:
table.manage_snapshots().expire_snapshots().retain_last(3).commit()
- Observe TypeError
Root Cause
In pyiceberg/table/update/__init__.py at line 505, RemoveStatisticsUpdate is instantiated with a positional argument:
remove_statistics_updates = (
    RemoveStatisticsUpdate(statistics_file.snapshot_id)  # ❌ Positional argument
    for statistics_file in base_metadata.statistics
    if statistics_file.snapshot_id in update.snapshot_ids
)
However, RemoveStatisticsUpdate is a Pydantic 2.x BaseModel, and BaseModel.__init__() accepts field values only as keyword arguments, so the positional call raises the TypeError shown in the traceback.
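The failure mode can be illustrated without PyIceberg or Pydantic installed. The sketch below is only a stand-in: its `BaseModel` copies the keyword-only constructor shape `__init__(self, /, **data)` used by Pydantic 2.x, and its `RemoveStatisticsUpdate` is a hypothetical placeholder for the real update class, not the library code.

```python
class BaseModel:
    """Stand-in (not pydantic.BaseModel): accepts field values only as keywords."""

    def __init__(self, /, **data):
        for name, value in data.items():
            setattr(self, name, value)


class RemoveStatisticsUpdate(BaseModel):
    """Placeholder for the PyIceberg class, which declares `snapshot_id: int`."""


def try_positional(snapshot_id):
    """Return the error message raised by a positional call, or None if it succeeds."""
    try:
        RemoveStatisticsUpdate(snapshot_id)
    except TypeError as exc:
        return str(exc)
    return None


# The positional form fails before any field validation can run.
positional_error = try_positional(123)

# The keyword form binds the value to the named field.
keyword_update = RemoveStatisticsUpdate(snapshot_id=123)
```

Because the constructor takes no positional parameters besides `self`, Python rejects the call itself, which is why the traceback shows a plain `TypeError` rather than a Pydantic validation error.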
Expected Behavior
expire_snapshots() should successfully remove expired snapshots and their associated statistics metadata without errors.
Actual Behavior
Operation fails with TypeError: BaseModel.__init__() takes 1 positional argument but 2 were given.
Proposed Fix
Change the positional argument to a keyword argument:
remove_statistics_updates = (
    RemoveStatisticsUpdate(snapshot_id=statistics_file.snapshot_id)  # ✓ Keyword argument
    for statistics_file in base_metadata.statistics
    if statistics_file.snapshot_id in update.snapshot_ids
)
File: pyiceberg/table/update/__init__.py
Line: 505
Impact
This bug affects any table that has statistics metadata, which occurs when:
- Trino performs compaction operations with statistics collection enabled
- Statistics are explicitly written to table metadata
- Tables are managed with tools that generate statistics files
In environments with automated compaction jobs, this prevents snapshot expiration from functioning, leading to unbounded metadata growth.
Workaround
Manually patch the installed package:
# Find the installation path
python -c "import pyiceberg.table.update; import os; print(os.path.dirname(pyiceberg.table.update.__file__))"
# Apply the fix (adjust path accordingly; with macOS/BSD sed, use `sed -i ''` instead of `sed -i`)
sed -i 's/RemoveStatisticsUpdate(statistics_file.snapshot_id)/RemoveStatisticsUpdate(snapshot_id=statistics_file.snapshot_id)/' \
  <path-to-site-packages>/pyiceberg/table/update/__init__.py
Additional Context
Why This Bug May Go Unnoticed
Most Iceberg tables do not have statistics metadata:
- Standard APPEND operations don't create statistics
- Only specific operations (like Trino compaction) generate statistics
- The bug only triggers when tables have statistics AND snapshots are expired
In our environment testing 10 tables, only 1 had statistics metadata (from a dedicated Trino compaction job), making this a rare but critical failure mode.
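To find out which tables are exposed, it may help to scan for statistics metadata before scheduling expiration. The following is a minimal sketch, assuming each table object exposes `.metadata.statistics` as in the reproduction steps; `tables_with_statistics` and the stub objects are hypothetical helpers for illustration, not part of PyIceberg.

```python
from types import SimpleNamespace


def tables_with_statistics(tables):
    """Return (name, statistics_count) for each table whose metadata lists statistics files.

    `tables` maps a display name to an object exposing `.metadata.statistics`.
    """
    affected = []
    for name, table in tables.items():
        count = len(table.metadata.statistics or [])
        if count > 0:
            affected.append((name, count))
    return affected


def _stub(statistics_count):
    """Build a stand-in for a loaded table so the sketch runs without a catalog."""
    return SimpleNamespace(
        metadata=SimpleNamespace(statistics=[object()] * statistics_count)
    )


example = tables_with_statistics({"db.plain": _stub(0), "db.compacted": _stub(2)})
```

With a real catalog, the dictionary would instead be built from `catalog.load_table(...)` results.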
Pydantic 2.x Migration
This appears to be an incomplete migration to Pydantic 2.x. While most of PyIceberg correctly uses keyword arguments with Pydantic models, this specific instance was missed.
Similar issues may exist elsewhere in the codebase where Pydantic models are instantiated with positional arguments.
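One way to look for similar call sites is a small static scan with Python's `ast` module. This is a rough sketch under stated assumptions: `find_positional_calls` is a hypothetical helper, and the set of class names to check must be supplied by the caller. It only flags candidates, since a class that defines its own `__init__` may legitimately accept positional arguments.

```python
import ast


def find_positional_calls(source, class_names):
    """Return (line, class_name) for each call to a listed class that passes positional args."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call) or not node.args:
            continue
        func = node.func
        # Handle both `Name(...)` and `module.Name(...)` call forms.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name in class_names:
            hits.append((node.lineno, name))
    return sorted(hits)


# The two call forms from this report: only the first one should be flagged.
sample = '''
a = RemoveStatisticsUpdate(statistics_file.snapshot_id)
b = RemoveStatisticsUpdate(snapshot_id=statistics_file.snapshot_id)
'''
hits = find_positional_calls(sample, {"RemoveStatisticsUpdate"})
```

Running the function over each file's text in the package and reviewing the hits by hand would narrow down where keyword arguments are missing.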
Related Code
RemoveStatisticsUpdate class definition (appears to be correctly defined as a Pydantic model):
class RemoveStatisticsUpdate(TableUpdate):
    snapshot_id: int
The issue is purely in the instantiation at line 505, not the class definition.
Verification
After applying the fix:
- All 10 test tables successfully expire snapshots
- Table with statistics metadata (1096 snapshots) correctly reduced to 2 snapshots with retain_last(2)
- Statistics metadata properly cleaned up alongside expired snapshots
- All tables remain readable after expiration
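The statistics cleanup can be checked programmatically. The sketch below assumes the table metadata exposes `snapshots` and `statistics` sequences whose entries carry a `snapshot_id` attribute; `orphaned_statistics` is a hypothetical helper, and the stub metadata exists only so the example runs without a catalog.

```python
from types import SimpleNamespace


def orphaned_statistics(metadata):
    """Return snapshot IDs referenced by statistics files but absent from the snapshot list."""
    live_ids = {snapshot.snapshot_id for snapshot in metadata.snapshots}
    return sorted(
        stats.snapshot_id
        for stats in (metadata.statistics or [])
        if stats.snapshot_id not in live_ids
    )


# Stub metadata: snapshots 1 and 2 remain; statistics reference 2 and an expired 7.
stub = SimpleNamespace(
    snapshots=[SimpleNamespace(snapshot_id=1), SimpleNamespace(snapshot_id=2)],
    statistics=[SimpleNamespace(snapshot_id=2), SimpleNamespace(snapshot_id=7)],
)
leftover = orphaned_statistics(stub)
```

An empty result after expiration would indicate that no statistics entry still points at a removed snapshot.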
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time