refactor(python): Use polars parquet reader for delta scan #19103

Open
wants to merge 6 commits into
base: main

ion-elgreco
Contributor

This is an intermediate stage until we have something working with delta-kernel-rs.

A couple of odd things:

  • The hive schema is still required even though we have the new schema param; it is probably best to merge the two.
  • Polars Schema doesn't have a from_arrow() method, so I am currently creating an empty Arrow table and going through a DataFrame :)

@ritchie46
Member

Nice, I didn't know we could bring our own readers.

@ion-elgreco ion-elgreco changed the title refactor(python): use polars parquet reader for delta read/scan refactor(python): use polars parquet reader for delta scan Oct 5, 2024
@ion-elgreco
Contributor Author

> Nice, I didn't know we could bring our own readers.

Absolutely we can, and it is actually encouraged. The Polars parquet reader is also a lot faster than pyarrow's.

The only thing is, this intermediate stage will be bound to the same protocol support as the pyarrow scanner. At some point we need to finish a fully native reader built on Polars and delta-kernel-rs; I did some preliminary work here: https://github.com/ion-elgreco/polars-deltalake/tree/feat/delta_io_plugin

It's just hard to find time nowadays. A core-team dev who is working on parquet could probably do this more easily, since they are deep in the Polars Rust code ^^

@ion-elgreco ion-elgreco changed the title refactor(python): use polars parquet reader for delta scan refactor(python): Use polars parquet reader for delta scan Oct 5, 2024
@github-actions github-actions bot added internal An internal refactor or improvement python Related to Python Polars and removed title needs formatting labels Oct 5, 2024

codecov bot commented Oct 5, 2024

Codecov Report

Attention: Patch coverage is 60.41667% with 19 lines in your changes missing coverage. Please review.

Project coverage is 79.78%. Comparing base (baa65b8) to head (e3ef45a).
Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| py-polars/polars/io/delta.py | 60.41% | 14 Missing and 5 partials ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #19103   +/-   ##
=======================================
  Coverage   79.77%   79.78%           
=======================================
  Files        1531     1531           
  Lines      208561   208522   -39     
  Branches     2913     2922    +9     
=======================================
- Hits       166377   166366   -11     
+ Misses      41633    41600   -33     
- Partials      551      556    +5     


@ion-elgreco
Contributor Author

ion-elgreco commented Oct 5, 2024

@ritchie46 one test fails on Windows when it encounters a hive path (https://github.com/pola-rs/polars/actions/runs/11192235502/job/31116136984?pr=19103#step:11:202). I am not seeing this on Linux, though.

Labels
internal An internal refactor or improvement python Related to Python Polars
2 participants