Conversation
|
There is also an open issue on the Regarding fsspec/adlfs#493, is the protocol identical? |
|
I am not sure, but I have been testing it with Azurite locally and it works as expected. I am going to try using it in the cloud. |
|
@christophediprima Thanks for testing that, appreciate it. We also test against |
|
My team and I have been testing it on Azure Blob Storage and we had no issues. What kind of tests do you have in mind? |
|
Looks like we have a few ADLS integration tests against the Azurite Docker container (iceberg-python/tests/io/test_fsspec.py, line 298 in b86d7d5). Perhaps we can extend these to include wasb and wasbs.
# Rationale for this change
Starting from version 20, PyArrow supports the ADLS filesystem. This PR adds
PyArrow Azure support to PyIceberg.
PyArrow is the [default
IO](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/__init__.py#L366-L369)
for PyIceberg catalogs. In an Azure environment it handles a wider spectrum
of auth strategies than fsspec, including, for instance, [Managed
Identities](https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview).
Also, prior to PR
#1663 (which is not merged
yet), there was no support for wasb(s) with fsspec.
See the corresponding issue for more details:
#2112
# Are these changes tested?
Tests are added under tests/io/test_pyarrow.py.
# Are there any user-facing changes?
There are no API-breaking changes. Direct impact of the PR: the PyArrow
FileIO in PyIceberg supports the Azure cloud environment. Examples of impact
for end users:
- PyIceberg is usable in services that use the Managed Identities auth strategy.
- PyIceberg is usable with wasb(s) schemes in Azure.
---------
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
Co-authored-by: Kevin Liu <kevin.jq.liu@gmail.com>
|
depends on fsspec/adlfs#493 |
|
I have a local change that parameterizes all the ADLS integration tests with abfs, abfss, wasb, and wasbs. It's currently failing with (note the wrong path) |
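For readers following along, the parameterization idea can be sketched roughly as below. This is not the test code from the branch: it uses only the standard library, and the `build_location` helper, the container name, and the account name are hypothetical. The host suffixes (`dfs.core.windows.net` for abfs(s), `blob.core.windows.net` for wasb(s)) follow the usual Azure URI conventions. In the real suite the same loop would more likely be expressed with `@pytest.mark.parametrize`.

```python
from urllib.parse import urlparse

# Schemes the ADLS integration tests would be parameterized over.
ADLS_SCHEMES = ("abfs", "abfss", "wasb", "wasbs")


def build_location(scheme: str, container: str, account: str, key: str) -> str:
    """Build an Azure storage URI for the given scheme (hypothetical helper)."""
    # abfs(s) conventionally targets the dfs endpoint, wasb(s) the blob endpoint.
    endpoint = "dfs" if scheme.startswith("abfs") else "blob"
    return f"{scheme}://{container}@{account}.{endpoint}.core.windows.net/{key}"


def parsed_paths(container: str, account: str, key: str) -> dict:
    """Parse the location for every scheme and return scheme -> parsed path."""
    return {
        scheme: urlparse(build_location(scheme, container, account, key)).path
        for scheme in ADLS_SCHEMES
    }
```

A check like `set(parsed_paths(...).values())` having exactly one element is a cheap way to confirm that every scheme resolves to the same object path before involving a real filesystem.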
|
Pushed the parameterized test here for reference. I changed all references of the protocol for ADLS to use the |
|
added the monkey patch solution here for reference. we can also wait for fsspec/adlfs#493 to land |
|
fsspec/adlfs#512 added the ability to override protocol but for older versions of adlfs, we would still need to monkey patch |
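To make the trade-off concrete, a version-gated monkey patch could look roughly like the sketch below. It is an illustration of the pattern only: `StandInFileSystem` is a stand-in class, not adlfs, and the version strings passed in are placeholders rather than real adlfs release numbers. The only assumption carried over from the discussion is that the filesystem class exposes a class-level `protocol` attribute.

```python
from typing import Tuple


class StandInFileSystem:
    """Stand-in for a filesystem class with a fixed `protocol` attribute (not adlfs itself)."""

    protocol = "abfs"

    def __init__(self) -> None:
        # Instances report whatever the class-level attribute currently says.
        self.reported_protocol = type(self).protocol


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.2.3' into (1, 2, 3); non-numeric parts are ignored in this sketch."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def patch_protocol_if_needed(fs_cls, installed: str, first_fixed: str, protocol: str) -> bool:
    """Override `fs_cls.protocol` only when the installed version predates the upstream fix."""
    if parse_version(installed) >= parse_version(first_fixed):
        return False  # upstream already supports overriding; leave the class alone
    fs_cls.protocol = protocol  # the monkey patch: mutates the class for the whole process
    return True
```

Gating on the version keeps the patch from silently lingering after an upgrade, but it still mutates shared class state, which is the maintenance concern with this approach.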
|
Since fsspec/adlfs#512 has been merged, should we pull in the latest main? |
|
Ah, looks like it didn't make it into the latest release |
|
@Fokko how do you feel about the monkeypatch solution here? |
|
Commenting here vs. the dev list: I'm personally not a huge fan of the monkey patch solution, since it relies on us remembering to correct it after the next adlfs release. |
|
I'd rather wait for the official release. I'm not a fan of monkey patching either as it is tied to a specific version 😞 |
Closes #2271 and #1606
This will work as soon as this is merged: fsspec/adlfs#493