GH-41692: [Python] Improve substrait extended expressions support #41693

Open: amol- wants to merge 9 commits into main

amol- (Member) commented May 16, 2024

Addresses some missing features and usability issues when using PyArrow with Substrait ExtendedExpressions:

  • Allow passing BoundExpressions to Scanner(columns=X) instead of a dict of expressions.
  • Allow passing BoundExpressions to Scanner(filter=X), so that users don't have to distinguish between Expression and BoundExpressions and can always just use pyarrow.substrait.deserialize_expressions.
  • Allow decoding pyarrow.BoundExpressions directly from a protobuf.Message, which makes it possible to use substrait-python objects.
  • Return a memoryview from the methods that encode Substrait, so that the result can be passed directly to substrait-python (or, more generally, other Python libraries) without a copy.
  • Allow decoding messages from a memoryview, so that the output of the encoding functions can be passed back to the decoding functions.
  • Allow encoding and decoding schemas to and from Substrait.
  • When encoding schemas, return the extension types required for a Substrait consumer to decode the schema.
  • Handle Arrow extension types when decoding a schema.
  • Update docstrings and documentation.
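
To illustrate the zero-copy point in the memoryview items above, here is a minimal standard-library sketch. The names `encode_message` and `decode_message` are hypothetical stand-ins, not PyArrow functions, and the payload is placeholder bytes rather than a real Substrait message; the sketch only shows that a memoryview shares storage with its source and can be handed straight back to a decoder that accepts buffer-protocol objects.

```python
# Hypothetical stand-ins: these are NOT pyarrow functions, only an
# illustration of why returning a memoryview avoids copies.

def encode_message(payload: bytearray) -> memoryview:
    """Pretend encoder: expose the serialized bytes without copying them."""
    return memoryview(payload)


def decode_message(data) -> bytes:
    """Pretend decoder: accept any object that supports the buffer protocol."""
    view = memoryview(data)   # wrapping bytes, bytearray, or memoryview copies nothing
    return view.tobytes()     # the only copy, made when the consumer asks for it


serialized = bytearray(b"\x0a\x03abc")   # placeholder bytes, not real Substrait
view = encode_message(serialized)

# A write through the underlying buffer is visible through the view,
# which shows that the view shares storage instead of holding a copy.
serialized[0] = 0x0B
shares_memory = view[0] == 0x0B

# The encoder output can be fed straight back into the decoder.
round_trip = decode_message(view)
```

Under these assumptions, any library that accepts buffer-protocol objects can consume the encoded message without an intermediate `bytes` copy.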

⚠️ GitHub issue #41692 has been automatically assigned in GitHub to PR creator.

@@ -56,7 +56,7 @@ Status ParseFromBufferImpl(const Buffer& buf, const std::string& full_name,
   if (message->ParseFromZeroCopyStream(&buf_stream)) {
     return Status::OK();
   }
-  return Status::IOError("ParseFromZeroCopyStream failed for ", full_name);
+  return Status::Invalid("ParseFromZeroCopyStream failed for ", full_name);

amol- (Member Author) commented Jul 3, 2024

It seemed odd that it failed with an IOError, given that the ArrayInputStream was built from an already in-memory Buffer, so no IO was involved. If it fails, it actually means that the Substrait data is invalid.

github-actions bot added the "awaiting committer review" and "Component: Documentation" labels and removed the "awaiting review" label on Jul 3, 2024
amol- (Member Author) commented Jul 3, 2024

The failures seem to be related to #42149 and #43134

amol- (Member Author) commented Jul 4, 2024

Marking as ready for review, as the failures seem to be unrelated. @jorisvandenbossche, would you mind reviewing this when you have the chance?

amol- marked this pull request as ready for review July 4, 2024 13:14
amol- requested a review from westonpace as a code owner July 4, 2024 13:14