Support ADLS with Pyarrow file IO #2111

Merged

Conversation

NikitaMatskevich
Contributor

@NikitaMatskevich NikitaMatskevich commented Jun 17, 2025

Rationale for this change

Starting from version 20.0.0, PyArrow supports the ADLS filesystem. This PR adds PyArrow Azure support to PyIceberg.

PyArrow is the default IO for PyIceberg catalogs. In an Azure environment it handles a wider spectrum of auth strategies than Fsspec, including, for instance, Managed Identities. Also, prior to PR #1663 (not yet merged) there was no support for wasb(s) with Fsspec.

See the corresponding issue for more details: #2112

Are these changes tested?

Tests are added under tests/io/test_pyarrow.py.

Are there any user-facing changes?

There are no API-breaking changes. Direct impact of the PR: the PyArrow FileIO in PyIceberg now supports the Azure cloud environment. Examples of the impact for end users:

  • PyIceberg is usable in services with a Managed Identities auth strategy.
  • PyIceberg is usable with wasb(s) schemes in Azure.
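As a rough illustration of what this enables, the adls.* catalog properties can be translated into filesystem keyword arguments. The helper below is a hypothetical sketch of that mapping, not PyIceberg's actual code; the account and tenant values are made up.

```python
# Hypothetical sketch: turning "adls."-prefixed catalog properties into
# keyword arguments for a filesystem constructor. This mirrors the general
# idea, not PyIceberg's actual implementation.

def adls_properties_to_kwargs(properties: dict) -> dict:
    """Collect "adls."-prefixed entries, strip the prefix, and convert
    dashes to underscores so the keys are usable as Python kwargs."""
    kwargs = {}
    for key, value in properties.items():
        if key.startswith("adls."):
            kwargs[key[len("adls."):].replace("-", "_")] = value
    return kwargs

properties = {
    "adls.account-name": "myaccount",  # hypothetical storage account
    "adls.tenant-id": "tenant-123",    # hypothetical tenant
    "warehouse": "abfss://container@myaccount.dfs.core.windows.net/wh",
}
print(adls_properties_to_kwargs(properties))
# → {'account_name': 'myaccount', 'tenant_id': 'tenant-123'}
```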

@NikitaMatskevich NikitaMatskevich force-pushed the nmatckevich/support-adls-pyarrow-file-io branch from e4e7260 to 076b68b on June 17, 2025 at 15:48
@Fokko
Contributor

Fokko commented Jun 17, 2025

@NikitaMatskevich Thanks for working on this, I know a lot of users are waiting for this. It looks like some tests are failing (you can run the linters locally using make lint), could you look into those?

@NikitaMatskevich
Contributor Author

NikitaMatskevich commented Jun 18, 2025

@Fokko thank you for looking into this! Sorry, indeed, I missed some formatting issues. It should be fine now.


@kevinjqliu kevinjqliu self-requested a review June 18, 2025 15:48
@djouallah

I am one of those users. Does this support authentication using an auth token? (not a SAS token)

@NikitaMatskevich
Contributor Author

NikitaMatskevich commented Jun 19, 2025

> I am one of those users, does this support authentication using auth token? (not sas token)

From the docs:

If neither account_key nor sas_token is specified, a DefaultAzureCredential is used for authentication. This means it will try several types of authentication and go with the first one that works. If any authentication parameters are provided when initialising the FileSystem, they will be used instead of the default credential.

Here is the diagram of a DefaultAzureCredential flow.
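In other words, DefaultAzureCredential walks a chain of credential sources and uses the first one that yields a token. A toy sketch of that fallback behavior follows; the source functions and token strings are made up, and the real chain lives in azure-identity:

```python
# Toy sketch of a DefaultAzureCredential-style chain: try each credential
# source in order and return the first token obtained. Source names and
# return values are illustrative only, not the azure-identity API.

def env_credential():
    return None  # pretend no environment variables are set

def managed_identity_credential():
    return "token-from-managed-identity"  # pretend this host has an MI

def cli_credential():
    return "token-from-azure-cli"

def first_working_credential(sources):
    for source in sources:
        token = source()
        if token is not None:
            return token
    raise RuntimeError("no credential source succeeded")

token = first_working_credential(
    [env_credential, managed_identity_credential, cli_credential]
)
print(token)
# → token-from-managed-identity (first source in the chain that succeeds)
```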

Contributor

@kevinjqliu kevinjqliu left a comment


Thanks for working on this! Generally LGTM

I have a few comments. We can either address them here or in a follow-up PR.

@@ -82,6 +82,10 @@
ADLS_CLIENT_ID = "adls.client-id"
ADLS_CLIENT_SECRET = "adls.client-secret"
ADLS_ACCOUNT_HOST = "adls.account-host"
ADLS_BLOB_STORAGE_AUTHORITY = "adls.blob-storage-authority"
Contributor


Contributor Author


Done

@@ -197,6 +204,7 @@
MAP_VALUE_NAME = "value"
DOC = "doc"
UTC_ALIASES = {"UTC", "+00:00", "Etc/UTC", "Z"}
MIN_PYARROW_VERSION_SUPPORTING_AZURE_FS = "20.0.0"
Contributor


nit: inline this at the function level

Contributor Author

@NikitaMatskevich NikitaMatskevich Jun 19, 2025


Then I will have to keep the version used in tests in sync with this one... I can do it if that's OK for you.

Contributor


I think that's fine, we can just use

        if version.parse(pyarrow.__version__) < version.parse("20.0.0"):

This is technically a "public" variable and I don't want users to be able to import it.

@@ -394,6 +402,9 @@ def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSyste
elif scheme in {"gs", "gcs"}:
return self._initialize_gcs_fs()

elif scheme in {"abfs", "abfss", "wasb", "wasbs"}:
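The elif chain above dispatches on the URI scheme. Below is a self-contained sketch of that routing, with string labels standing in for the private _initialize_*_fs methods (the s3 branch is assumed from the surrounding context, not shown in the diff):

```python
# Sketch of scheme-based filesystem dispatch, mirroring the elif chain in
# the diff above. String labels stand in for the real _initialize_*_fs
# methods; the s3 scheme set is an assumption, not taken from the diff.
from urllib.parse import urlparse

def filesystem_family(location: str) -> str:
    scheme = urlparse(location).scheme
    if scheme in {"s3", "s3a", "s3n"}:
        return "s3"
    elif scheme in {"gs", "gcs"}:
        return "gcs"
    elif scheme in {"abfs", "abfss", "wasb", "wasbs"}:
        return "azure"
    raise ValueError(f"Unrecognized filesystem scheme: {scheme}")

print(filesystem_family("abfss://container@acct.dfs.core.windows.net/table"))
# → azure
```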
Contributor


I can't find any PyArrow docs indicating that wasb and wasbs are supported.

@@ -1670,9 +1678,8 @@ def test_new_output_file_gcs(pyarrow_fileio_gcs: PyArrowFileIO) -> None:


@pytest.mark.gcs
@pytest.mark.skip(reason="Open issue on Arrow: https://github.com/apache/arrow/issues/36993")
Contributor


I see that apache/arrow#36993 is still open. Is the issue resolved so that we can run this test?

Contributor


Running make test-gcs locally fails.

Contributor


Weird, do we not run these integration tests in CI?

Contributor Author


Sorry, I wrongly assumed they would be run with make test...

Contributor


No worries! I was surprised too. We gotta fix it :)

Contributor


Tracking issue #2124

Contributor Author


Actually on my machine only 1 test was failing. I restored the annotation on it. Does make test-gcs run normally for you now?

Contributor


It still fails for me, even after recreating the poetry env.

We can just take out all these changes and address them in a separate PR.

Contributor

@kevinjqliu kevinjqliu left a comment


Revert the change to not skip gcs tests

@pytest.mark.skip(reason="Open issue on Arrow: https://github.com/apache/arrow/issues/36993")

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
Contributor

@kevinjqliu kevinjqliu left a comment


LGTM, let's just remove the changes to the gcs integration test and address them in a separate PR.

#2125 would help make sure the FileIO integration changes are safe to make.


Contributor

@kevinjqliu kevinjqliu left a comment


LGTM! Thanks for adding this feature :)

@kevinjqliu
Contributor

@NikitaMatskevich looks like the linter errored, could you run make lint?

@kevinjqliu
Contributor

Actually, I just pushed the make lint change :)

@kevinjqliu kevinjqliu merged commit 84c91f0 into apache:main Jun 20, 2025
11 checks passed
@kevinjqliu
Contributor

Thanks for working on this @NikitaMatskevich and thanks @Fokko for the review

amitgilad3 pushed a commit to amitgilad3/iceberg-python that referenced this pull request Jul 7, 2025