Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update arrow 47.0.0 in DataFusion #7587

Merged
merged 17 commits into from
Sep 25, 2023
Merged

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Sep 18, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Removes the dictionary_expressions feature as no longer necessary

@github-actions github-actions bot added physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate substrait labels Sep 18, 2023
@tustvold tustvold added the api change Changes the API exposed to users of the crate label Sep 18, 2023
@@ -35,7 +35,7 @@ itertools = "0.11"
object_store = "0.7.0"
prost = "0.11"
prost-types = "0.11"
substrait = "0.13.1"
substrait = "0.14.0"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: unfortunately even the latest subtrait is not yet upgraded to prost 0.12

@tustvold tustvold marked this pull request as ready for review September 22, 2023 13:31
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me -- thank you @tustvold

cc @viirya and @sunchao -- I think this PR changes how expressions with dictionaries are handled (though I think the same underlying code is still called).

"| logical_plan | Sort: t.a ASC NULLS LAST, t.b DESC NULLS FIRST |",
"| | TableScan: t projection=[a, b] |",
"| physical_plan | SortExec: expr=[a@0 ASC NULLS LAST,b@1 DESC] |",
"| | MemoryExec: partitions=1, partition_sizes=[5], output_ordering=a@0 ASC NULLS LAST,b@1 ASC NULLS LAST |",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is because we no longer need the small batch size, right?

use std::sync::Arc;

/// Applies a binary [`Datum`] kernel `f` to `lhs` and `rhs`
pub(crate) fn apply(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is likely obvious to you, but I think it would help to provide some context here of why this function exists

Something like

Suggested change
pub(crate) fn apply(
///
/// This is used to provide a single function implementation that works for all combinations of
/// 2 `ColumnarValue` inputs, and maps the [`ColumnarValue`] abstraction of DataFusion to
/// the [`Datum`] abstraction of arrow-rs
pub(crate) fn apply(

/// Converter for the group values
row_converter: CardinalityAwareRowConverter,
row_converter: RowConverter,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was removed as the upstream interning was removed in apache/arrow-rs#4811

cc @JayjeetAtGithub

@@ -199,6 +203,20 @@ impl GroupValues for GroupValuesRows {
}
};

// TODO: Materialize dictionaries in group keys
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if this is described in a ticket yet (I think the answer is no, but I am not 100% sure)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No a ticket would be a good addition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #7647

@sunchao
Copy link
Member

sunchao commented Sep 22, 2023

Thanks for the ping @alamb . So just trying to understand the implication of this PR on dictionary.

Removes the dictionary_expressions feature as no longer necessary

This is the only thing that's changed? and we should expect expressions operating on dictionary arrays still behave the same?

Thanks!

@tustvold
Copy link
Contributor Author

There should be no externally visible changes, correct

@alamb alamb changed the title Update arrow 47.0.0 Update arrow 47.0.0 in DataFusion Sep 25, 2023
@tustvold tustvold merged commit 5f38135 into apache:main Sep 25, 2023
21 checks passed
Ted-Jiang pushed a commit to Ted-Jiang/arrow-datafusion that referenced this pull request Oct 7, 2023
* Update arrow 47.0.0

* Downgrade prost substrait

* Fix deprecations

* Prost deprecations

* Update pbjson-build

* Format

* Remove CardinalityAwareRowConverter

* Further fixes

* Ignore spill tests

* Fix tests

* Fix Clippy

* Update pin

* Review feedback

* Clippy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate optimizer Optimizer rules physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants