Support ADLS with Pyarrow file IO #2111
Conversation
Force-pushed from e4e7260 to 076b68b
@NikitaMatskevich Thanks for working on this, I know a lot of users are waiting for this. It looks like some tests are failing (you can run the linters locally using …)
@Fokko thank you for looking into this! Sorry, indeed, I missed some formatting issues. It should be fine now.
I am one of those users. Does this support authentication using an auth token (not a SAS token)?
From the docs: "If neither account_key or sas_token is specified a DefaultAzureCredential is used for authentication. This means it will try several types of authentication and go with the first one that works. If any authentication parameters are provided when initialising the FileSystem, they will be used instead of the default credential." Here is the diagram of a DefaultAzureCredential flow.
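A minimal sketch of that behavior with pyarrow's AzureFileSystem (requires pyarrow >= 20.0.0; the account name and key are placeholders):

from pyarrow.fs import AzureFileSystem

# No account_key or sas_token given: pyarrow falls back to DefaultAzureCredential,
# which tries environment variables, managed identity, Azure CLI, etc.
fs = AzureFileSystem(account_name="myaccount")

# Explicit credentials, when provided, take precedence over the default chain.
fs_with_key = AzureFileSystem(account_name="myaccount", account_key="<storage-key>")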
Thanks for working on this! Generally LGTM
I have a few comments. We can either address them here or in a follow up PR
@@ -82,6 +82,10 @@
ADLS_CLIENT_ID = "adls.client-id"
ADLS_CLIENT_SECRET = "adls.client-secret"
ADLS_ACCOUNT_HOST = "adls.account-host"
ADLS_BLOB_STORAGE_AUTHORITY = "adls.blob-storage-authority"
We should add these to the docs: https://py.iceberg.apache.org/configuration/#azure-data-lake
Done
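For illustration, a hedged sketch of how such a property could be passed as a catalog property, e.g. to point at a local Azurite emulator; the endpoint and key values are assumptions, not from this PR:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "default",  # hypothetical catalog name
    **{
        "adls.account-name": "devstoreaccount1",           # Azurite's well-known dev account
        "adls.account-key": "<azurite-account-key>",       # placeholder
        "adls.blob-storage-authority": "127.0.0.1:10000",  # custom endpoint authority
    },
)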
pyiceberg/io/pyarrow.py (outdated)
@@ -197,6 +204,7 @@
MAP_VALUE_NAME = "value"
DOC = "doc"
UTC_ALIASES = {"UTC", "+00:00", "Etc/UTC", "Z"}
MIN_PYARROW_VERSION_SUPPORTING_AZURE_FS = "20.0.0"
nit: inline this at the function level
Then I will have to keep the version used in the tests in sync with this one... I can do it if that's OK with you.
I think that's fine; we can just use
if version.parse(pyarrow.__version__) < version.parse("20.0.0"):
This is technically a "public" variable and I don't want users to be able to import it.
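Roughly, the inline check as it would sit at the function level (a sketch; the exception type and message are assumptions):

import pyarrow
from packaging import version

# Guard Azure filesystem initialization behind the minimum pyarrow version.
if version.parse(pyarrow.__version__) < version.parse("20.0.0"):
    raise ImportError("pyarrow >= 20.0.0 is required for AzureFileSystem support")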
@@ -394,6 +402,9 @@ def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
elif scheme in {"gs", "gcs"}:
    return self._initialize_gcs_fs()

elif scheme in {"abfs", "abfss", "wasb", "wasbs"}:
I can't find any pyarrow docs indicating that wasb and wasbs are supported.
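For context, all four schemes carry the container and account in the URI netloc; a small illustrative parse (not the PR's exact implementation):

from urllib.parse import urlparse

uri = "wasbs://container@myaccount.blob.core.windows.net/data/file.parquet"
netloc = urlparse(uri).netloc

# Split "container@host" and take the account from the host prefix.
container, _, host = netloc.partition("@")
account_name = host.split(".")[0]

print(container, account_name)  # container myaccount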
@@ -1670,9 +1678,8 @@ def test_new_output_file_gcs(pyarrow_fileio_gcs: PyArrowFileIO) -> None:

@pytest.mark.gcs
@pytest.mark.skip(reason="Open issue on Arrow: https://github.com/apache/arrow/issues/36993")
I see that apache/arrow#36993 is still open. Is the issue resolved so we can run this test?
Running make test-gcs locally fails.
weird, do we not run these integration tests in CI??
Sorry, I wrongly assumed they would be run with make test...
no worries! I was surprised too. we gotta fix it :)
Tracking issue #2124
Actually, on my machine only one test was failing, so I restored the annotation on it. Does make test-gcs run normally for you now?
it still fails for me, even after recreating the poetry env.
we can just take out all these changes and address them in a separate PR
Revert the change to not skip gcs tests
@pytest.mark.skip(reason="Open issue on Arrow: https://github.com/apache/arrow/issues/36993")
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
LGTM, let's just remove the changes to the gcs integration tests and address them in a separate PR.
#2125 would help make sure the file io integration changes are safe to make
LGTM! Thanks for adding this feature :)
@NikitaMatskevich looks like the linter errored, could you run …
Actually, I just pushed the …
Thanks for working on this @NikitaMatskevich, and thanks @Fokko for the review!
Rationale for this change
Starting from version 20.0.0, PyArrow supports an ADLS (Azure Data Lake Storage) filesystem. This PR adds PyArrow Azure support to PyIceberg.
PyArrow is the default IO for PyIceberg catalogs. In an Azure environment it handles a wider spectrum of auth strategies than fsspec, including, for instance, Managed Identities. Also, prior to PR #1663 (which is not merged yet) there was no support for wasb(s) with fsspec.
See the corresponding issue for more details: #2112
Are these changes tested?
Tests are added under tests/io/test_pyarrow.py.
Are there any user-facing changes?
There are no breaking API changes. The direct impact of this PR is that the PyArrow FileIO in PyIceberg supports the Azure cloud environment. Examples of impact for end users:
- PyIceberg is usable in services with a Managed Identities auth strategy.
- PyIceberg is usable with wasb(s) schemes in Azure.
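A minimal sketch of what this enables for end users, assuming a configured catalog and a table whose data lives at an abfss:// location; the catalog, account, and table names are illustrative:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "default",
    **{
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",  # the default IO
        "adls.account-name": "myaccount",  # hypothetical storage account
    },
)

table = catalog.load_table("analytics.events")  # hypothetical table identifier
arrow_table = table.scan().to_arrow()           # data files read via PyArrow's AzureFileSystem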