Bug Report: RemoveStatisticsUpdate fails with Pydantic 2.x - positional argument error #2944

@MrDerecho

Apache Iceberg version

0.10.0 (latest release)

Please describe the bug 🐞

Summary

expire_snapshots() fails on tables with statistics metadata when using PyIceberg 0.10.0 with Pydantic 2.x due to incorrect positional argument usage in RemoveStatisticsUpdate initialization.

Environment

  • PyIceberg Version: 0.10.0
  • Pydantic Version: 2.12.4
  • Python Version: 3.13.3
  • Platform: macOS
  • Catalog Type: AWS Glue

Description

When calling expire_snapshots() on Iceberg tables that contain statistics metadata, the operation fails with a Pydantic validation error. The error only occurs on tables with statistics files in their metadata (e.g., tables that have undergone Trino compaction operations with statistics collection enabled).

Error

Traceback (most recent call last):
  File "/lib/python3.13/site-packages/pyiceberg/table/__init__.py", line 1208, in commit
    return self._apply(self._transaction._table_metadata)
  File "/lib/python3.13/site-packages/pyiceberg/table/__init__.py", line 1182, in _apply
    updated_metadata = update_table_metadata(
  File "/lib/python3.13/site-packages/pyiceberg/table/update/__init__.py", line 195, in update_table_metadata
    new_metadata = _apply_table_update(update, base_metadata, context)
  File "/lib/python3.13/site-packages/pyiceberg/table/update/__init__.py", line 490, in _apply_table_update
    for upd in updates:
  File "/lib/python3.13/site-packages/pyiceberg/table/update/__init__.py", line 505, in <genexpr>
    RemoveStatisticsUpdate(statistics_file.snapshot_id)
TypeError: BaseModel.__init__() takes 1 positional argument but 2 were given

Steps to Reproduce

  1. Create an Iceberg table with statistics metadata (e.g., via Trino compaction with statistics collection)
  2. Verify table has statistics files:
    from pyiceberg.catalog import load_catalog
    
    catalog = load_catalog('glue', **{'type': 'glue', 'region_name': 'us-east-1'})
    table = catalog.load_table('database.table_name')
    print(f"Statistics files: {len(table.metadata.statistics)}")  # Should be > 0
  3. Attempt to expire snapshots:
    table.manage_snapshots().expire_snapshots().retain_last(3).commit()
  4. Observe TypeError

Root Cause

In pyiceberg/table/update/__init__.py at line 505, RemoveStatisticsUpdate is instantiated with a positional argument:

remove_statistics_updates = (
    RemoveStatisticsUpdate(statistics_file.snapshot_id)  # ❌ Positional argument
    for statistics_file in base_metadata.statistics
    if statistics_file.snapshot_id in update.snapshot_ids
)

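However, RemoveStatisticsUpdate is a Pydantic BaseModel, and BaseModel.__init__() accepts field values only as keyword arguments, so the positional call raises the TypeError shown in the traceback. This restriction is not new in Pydantic 2.x; Pydantic 1.x models are keyword-only as well, so the call site itself is what needs to change.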
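
The failure mode can be illustrated without PyIceberg or Pydantic installed. The sketch below is an assumption-laden stand-in, not the library code: KeywordOnlyModel and RemoveStatisticsUpdateSketch are hypothetical names, and the only property they share with the real classes is an __init__ that takes self plus keyword arguments.

```python
class KeywordOnlyModel:
    """Hypothetical stand-in for a model base class whose __init__ is keyword-only."""

    def __init__(self, /, **data):
        # Store every keyword argument as an attribute; no validation is done here.
        for name, value in data.items():
            setattr(self, name, value)


class RemoveStatisticsUpdateSketch(KeywordOnlyModel):
    """Hypothetical mirror of the real update class: a single snapshot_id field."""

    snapshot_id: int


def try_positional(snapshot_id):
    """Return the TypeError message raised by a positional call, or None."""
    try:
        RemoveStatisticsUpdateSketch(snapshot_id)
    except TypeError as exc:
        return str(exc)
    return None


positional_error = try_positional(123)
keyword_update = RemoveStatisticsUpdateSketch(snapshot_id=123)

print(positional_error)            # a "takes 1 positional argument" style message
print(keyword_update.snapshot_id)  # the keyword form constructs the object
```

The positional call fails before any field handling runs, which is why the error surfaces as a plain TypeError rather than a validation error.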

Expected Behavior

expire_snapshots() should successfully remove expired snapshots and their associated statistics metadata without errors.

Actual Behavior

Operation fails with TypeError: BaseModel.__init__() takes 1 positional argument but 2 were given.

Proposed Fix

Change the positional argument to a keyword argument:

remove_statistics_updates = (
    RemoveStatisticsUpdate(snapshot_id=statistics_file.snapshot_id)  # ✓ Keyword argument
    for statistics_file in base_metadata.statistics
    if statistics_file.snapshot_id in update.snapshot_ids
)

File: pyiceberg/table/update/__init__.py
Line: 505

Impact

This bug affects any table that has statistics metadata, which occurs when:

  • Trino performs compaction operations with statistics collection enabled
  • Statistics are explicitly written to table metadata
  • Tables are managed with tools that generate statistics files

In environments with automated compaction jobs, this prevents snapshot expiration from functioning, leading to unbounded metadata growth.

Workaround

Manually patch the installed package:

# Find the installation path
python -c "import pyiceberg.table.update; import os; print(os.path.dirname(pyiceberg.table.update.__file__))"

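# Apply the fix (adjust path accordingly).
# The -i.bak form works with both GNU sed and the BSD sed shipped with macOS,
# and leaves a backup copy at __init__.py.bak.
sed -i.bak 's/RemoveStatisticsUpdate(statistics_file.snapshot_id)/RemoveStatisticsUpdate(snapshot_id=statistics_file.snapshot_id)/' \
  <path-to-site-packages>/pyiceberg/table/update/__init__.py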
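
Where sed is unavailable or behaves differently across platforms, the same textual patch can be sketched in Python. This is only an illustration: patch_source and patch_file are hypothetical helper names, and the search string assumes the call site appears exactly as quoted in this report.

```python
from pathlib import Path

OLD = "RemoveStatisticsUpdate(statistics_file.snapshot_id)"
NEW = "RemoveStatisticsUpdate(snapshot_id=statistics_file.snapshot_id)"


def patch_source(text: str) -> tuple[str, int]:
    """Return the patched text and the number of call sites rewritten."""
    return text.replace(OLD, NEW), text.count(OLD)


def patch_file(path: Path) -> int:
    """Patch one file in place, keeping a .bak copy; returns replacements made."""
    original = path.read_text(encoding="utf-8")
    patched, count = patch_source(original)
    if count:
        path.with_name(path.name + ".bak").write_text(original, encoding="utf-8")
        path.write_text(patched, encoding="utf-8")
    return count


# In-memory demonstration; applying the patch twice shows it is idempotent.
sample = "    RemoveStatisticsUpdate(statistics_file.snapshot_id)\n"
once, first_count = patch_source(sample)
twice, second_count = patch_source(once)
print(first_count, second_count)
```

To apply it to an installed package, patch_file could be pointed at the file behind pyiceberg.table.update.__file__, assuming that file is writable. The patch is lost whenever the package is reinstalled or upgraded.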

Additional Context

Why This Bug May Go Unnoticed

Most Iceberg tables do not have statistics metadata:

  • Standard APPEND operations don't create statistics
  • Only specific operations (like Trino compaction) generate statistics
  • The bug only triggers when tables have statistics AND snapshots are expired

In our environment testing 10 tables, only 1 had statistics metadata (from a dedicated Trino compaction job), making this a rare but critical failure mode.

Pydantic 2.x Migration

This may be a call site that was missed during the migration to Pydantic 2.x, or simply a code path that is not exercised by tests. Most of PyIceberg correctly uses keyword arguments with Pydantic models; this specific instance does not.

Similar issues may exist elsewhere in the codebase where Pydantic models are instantiated with positional arguments.
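
One way to look for similar call sites is a small static scan. The sketch below uses only the standard-library ast module; find_positional_calls is a hypothetical helper, and the set of class names to check has to be supplied by the caller, since models that define their own __init__ may legitimately accept positional arguments.

```python
import ast


def find_positional_calls(source, class_names):
    """Return sorted (line, name) pairs for calls to the listed classes
    that pass at least one positional argument."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call) or not node.args:
            continue  # not a call, or a call with keyword arguments only
        func = node.func
        # Handle both bare names (Foo(...)) and attribute access (mod.Foo(...)).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name in class_names:
            hits.append((node.lineno, name))
    return sorted(hits)


sample = '''
a = RemoveStatisticsUpdate(statistics_file.snapshot_id)
b = RemoveStatisticsUpdate(snapshot_id=statistics_file.snapshot_id)
'''

hits = find_positional_calls(sample, {"RemoveStatisticsUpdate"})
print(hits)
```

Only the first call in the sample is reported; the keyword form is skipped. Running the helper over each .py file in the package with the names of the TableUpdate subclasses would give a list of candidates to review by hand.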

Related Code

RemoveStatisticsUpdate class definition (appears to be correctly defined as a Pydantic model):

class RemoveStatisticsUpdate(TableUpdate):
    snapshot_id: int

The issue is purely in the instantiation at line 505, not the class definition.

Verification

After applying the fix:

  • All 10 test tables successfully expire snapshots
  • Table with statistics metadata (1096 snapshots) correctly reduced to 2 snapshots with retain_last(2)
  • Statistics metadata properly cleaned up alongside expired snapshots
  • All tables remain readable after expiration

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
