Add support for Bodo DataFrame #2167
Conversation
@ehsantn looks like there's an issue with the dependency resolution

@kevinjqliu Thanks for the quick review. Bodo requires Python >=3.10 since some dependency packages dropped Python 3.9 quite a while ago. Do all optional dependencies of PyIceberg need to support Python 3.9? What do you recommend? https://numpy.org/neps/nep-0029-deprecation_policy.html#support-table (Python 3.9 has been dropped since Apr 05, 2024).
```
version = "0.61.2"
description = "compiling Python code using LLVM"
optional = true
python-versions = ">=3.9"
```
it looks like numpy for Python 3.9 is removed here

Numpy is here: https://github.com/ehsantn/iceberg-python/blob/f36265b8cdc9fa3056ad28784467579514cfc850/poetry.lock#L3424
I'm working on packaging Bodo for Python 3.9 to avoid these Poetry issues: bodo-ai/Bodo#637
Our team will just miss structured pattern matching and the better type hints from Python 3.10 :)

sg, 3.9 is also EOL in a few months (2025-10)
https://devguide.python.org/versions/#supported-versions
Ok, updated Bodo to support Python 3.9, so this should work now.

@ehsantn i merged a few library upgrades. could you rebase this PR?

Done. I assume the CI failure is not related to this PR? The test doesn't seem relevant.

maybe try rebase

Done. No idea why the CI fails here in unrelated tests. Maybe some dependency got upgraded in the lock file?

Looks like the Bodo test is actually failing (the logs are not very visible). Seems to be just a configuration issue. Working on a fix.

All tests are passing locally for me now. Hopefully the CI works too.
```python
under_20_arrow = version.parse(pyarrow.__version__) < version.parse("20.0.0")
```
we should find another way to make these tests pass instead of branching on the pyarrow version

Any ideas? Maybe use a range of "safe" values instead of a single file size value? I'd be happy to open another PR if there is more work for this.
Bodo is currently pinned to Arrow 19 since the current release of PyIceberg supports up to Arrow 19. Bodo uses Arrow C++, which currently requires pinning to a single version for pip wheels to work (conda-forge builds against the 4 latest Arrow versions in this case, but pip doesn't support this yet). It'd be great if PyIceberg wouldn't set an upper version bound for Arrow if possible.
> Any ideas? Maybe use a range of "safe" values instead of a single file size value? I'd be happy to open another PR if there is more work for this.

I think we can just parameterize the file size. We're not really testing anything related to the size of the file.

> It'd be great if PyIceberg wouldn't set an upper version bound for Arrow if possible.

yea agreed. lets see if we can remove the upper bound
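For reference, a minimal sketch of what not pinning the exact size could look like; the keys and the stand-in data below are illustrative, not taken from the actual test:

```python
# Illustrative, self-contained sketch: assert the stable summary keys exactly
# and only sanity-check the size field, instead of pinning a byte count that
# differs between PyArrow releases.
summaries = [{"removed-data-files": "1", "removed-files-size": "15774"}] * 6  # stand-in data

summary = summaries[5]
assert summary["removed-data-files"] == "1"
assert int(summary["removed-files-size"]) > 0  # exact size varies with the PyArrow version
```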
@kevinjqliu please advise on next steps. This PR looks ready to merge to me, and the flakiness of the existing unit tests should be addressed separately (I'd be happy to contribute if it's a priority).

@ehsantn i think this should make the tests more maintainable. wdyt?
```diff
   }
   assert summaries[5] == {
-      "removed-files-size": "16174",
+      "removed-files-size": "15774" if under_20_arrow else "16174",
```
lets just do this instead since we're not really testing for the file size

```diff
-      "removed-files-size": "15774" if under_20_arrow else "16174",
+      "removed-files-size": summaries[5]["removed-files-size"],
```
Sure, will work on it ASAP.

Done. Let me know what you think.

Thanks for adding this @ehsantn

Thanks for the review and help @kevinjqliu!
# Rationale for this change
Adds support for the Bodo DataFrame library, a drop-in replacement for Pandas that automatically accelerates and scales Python code by applying query, compiler, and HPC optimizations.
# Are these changes tested?
Added integration test.
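Roughly the shape such an integration test can take; this is a hedged sketch with an illustrative catalog fixture and table name, not the exact test added in this PR:

```python
def test_table_to_bodo(catalog):  # `catalog` is an illustrative pytest fixture
    # Read the same table through Bodo and through the regular scan/pandas path
    # and compare the basic shape; the real test may assert more than this.
    table = catalog.load_table("default.test_table")  # illustrative table name
    bodo_df = table.to_bodo()
    pandas_df = table.scan().to_pandas()
    # Bodo is a drop-in Pandas replacement, so standard Pandas attributes such
    # as `columns` and `len()` are assumed to work on the resulting DataFrame.
    assert list(bodo_df.columns) == list(pandas_df.columns)
    assert len(bodo_df) == len(pandas_df)
```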
# Are there any user-facing changes?
Adds a `Table.to_bodo()` function. Example code:
```python
df = table.to_bodo() # equivalent to `bodo.pandas.read_iceberg_table(table)`
df = df[df["trip_distance"] >= 10.0]
df = df[["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime"]]
print(df)
```
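For anyone curious what the new method amounts to, here is a rough sketch based only on the equivalence noted in the comment above; the actual implementation in the PR may differ:

```python
from pyiceberg.table import Table


def to_bodo(table: Table):
    """Rough equivalent of the new Table.to_bodo(), per the comment above."""
    # Bodo is an optional dependency, so import it lazily.
    import bodo.pandas as bd

    # Reading the Iceberg table through Bodo's pandas-compatible API returns a
    # Bodo DataFrame that can be used like a regular Pandas DataFrame.
    return bd.read_iceberg_table(table)
```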