Add support for Bodo DataFrame #2167
Conversation
@ehsantn looks like there's an issue with the dependency resolution

@kevinjqliu Thanks for the quick review. Bodo requires Python >=3.10 since some dependency packages dropped Python 3.9 quite a while ago. Do all optional dependencies of PyIceberg need to support Python 3.9? What do you recommend? https://numpy.org/neps/nep-0029-deprecation_policy.html#support-table (Python 3.9 has been dropped since Apr 05, 2024).
```
version = "0.61.2"
description = "compiling Python code using LLVM"
optional = true
python-versions = ">=3.9"
```
it looks like numpy for Python 3.9 is removed here

Numpy is here: https://github.com/ehsantn/iceberg-python/blob/f36265b8cdc9fa3056ad28784467579514cfc850/poetry.lock#L3424
I'm working on packaging Bodo for Python 3.9 to avoid these Poetry issues: bodo-ai/Bodo#637
Our team will just miss structured pattern matching and the better type hints from Python 3.10 :)

sg, 3.9 is also EOL in a few months (2025-10)
https://devguide.python.org/versions/#supported-versions
Ok, updated Bodo to support Python 3.9, so this should work now.

@ehsantn i merged a few library upgrades. could you rebase this PR?

Done. I assume the CI failure is not related to this PR? The test doesn't seem relevant.

maybe try rebase

Done. No idea why the CI fails here in unrelated tests. Maybe some dependency got upgraded in the lock file?

Looks like the Bodo test is actually failing (the logs are not very visible). Seems to be just a configuration issue. Working on a fix.

All tests are passing locally for me now. Hopefully the CI works too.
```python
under_20_arrow = version.parse(pyarrow.__version__) < version.parse("20.0.0")
```
we should find another way to make these tests pass instead of branching on the pyarrow version

Any ideas? Maybe use a range of "safe" values instead of a single file size value? I'd be happy to open another PR if there is more work for this.
Bodo is currently pinned to Arrow 19 since the current release of PyIceberg supports up to Arrow 19. Bodo uses Arrow C++, which currently requires pinning to a single version for pip wheels to work (conda-forge builds against the 4 latest Arrow versions in this case, but pip doesn't support this yet). It'd be great if PyIceberg wouldn't set an upper version bound for Arrow if possible.
> Any ideas? Maybe use a range of "safe" values instead of a single file size value? I'd be happy to open another PR if there is more work for this.

I think we can just parameterize the file size. We're not really testing anything related to the size of the file.

> It'd be great if PyIceberg wouldn't set an upper version bound for Arrow if possible.

yea agreed. lets see if we can remove the upper bound
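For reference, a minimal sketch of what not pinning the exact size could look like; the keys and the stand-in data below are illustrative, not taken from the actual test:

```python
# Illustrative, self-contained sketch: assert the stable summary keys exactly
# and only sanity-check the size field, instead of pinning a byte count that
# differs between PyArrow releases.
summaries = [{"removed-data-files": "1", "removed-files-size": "15774"}] * 6  # stand-in data

summary = summaries[5]
assert summary["removed-data-files"] == "1"
assert int(summary["removed-files-size"]) > 0  # exact size varies with the PyArrow version
```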
@kevinjqliu please advise on next steps. This PR looks ready to merge to me, and the flakiness of the existing unit tests should be addressed separately (I'd be happy to contribute if it's a priority).

@ehsantn i think this should make the tests more maintainable. wdyt?
```diff
   }
   assert summaries[5] == {
-      "removed-files-size": "16174",
+      "removed-files-size": "15774" if under_20_arrow else "16174",
```
lets just do this instead since we're not really testing for the file size

```diff
-      "removed-files-size": "15774" if under_20_arrow else "16174",
+      "removed-files-size": summaries[5]["removed-files-size"],
```
Sure, will work on it ASAP.

Done. Let me know what you think.

Thanks for adding this @ehsantn

Thanks for the review and help @kevinjqliu!
# Rationale for this change
Adds support for the Bodo DataFrame library, a drop-in replacement for Pandas that automatically accelerates and scales Python code by applying query, compiler, and HPC optimizations.
# Are these changes tested?
Added integration test.
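Roughly the shape such an integration test can take; this is a hedged sketch with an illustrative catalog fixture and table name, not the exact test added in this PR:

```python
def test_table_to_bodo(catalog):  # `catalog` is an illustrative pytest fixture
    # Read the same table through Bodo and through the regular scan/pandas path
    # and compare the basic shape; the real test may assert more than this.
    table = catalog.load_table("default.test_table")  # illustrative table name
    bodo_df = table.to_bodo()
    pandas_df = table.scan().to_pandas()
    # Bodo is a drop-in Pandas replacement, so standard Pandas attributes such
    # as `columns` and `len()` are assumed to work on the resulting DataFrame.
    assert list(bodo_df.columns) == list(pandas_df.columns)
    assert len(bodo_df) == len(pandas_df)
```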
# Are there any user-facing changes?
Adds a `Table.to_bodo()` function. Example code:
```python
df = table.to_bodo() # equivalent to `bodo.pandas.read_iceberg_table(table)`
df = df[df["trip_distance"] >= 10.0]
df = df[["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime"]]
print(df)
```
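For anyone curious what the new method amounts to, here is a rough sketch based only on the equivalence noted in the comment above; the actual implementation in the PR may differ:

```python
from pyiceberg.table import Table


def to_bodo(table: Table):
    """Rough equivalent of the new Table.to_bodo(), per the comment above."""
    # Bodo is an optional dependency, so import it lazily.
    import bodo.pandas as bd

    # Reading the Iceberg table through Bodo's pandas-compatible API returns a
    # Bodo DataFrame that can be used like a regular Pandas DataFrame.
    return bd.read_iceberg_table(table)
```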