Add insert_or_update and get_payloads to map #12701

dmitrybugakov · 2024-10-01T14:09:10Z

Which issue does this PR close?

Closes #12594

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

comphead · 2024-10-01T16:48:02Z

Thanks @dmitrybugakov please help understand what are use cases for this functions in DF or downstream projects?

dmitrybugakov · 2024-10-01T17:38:12Z

@comphead Thanks for the review!
I'm using this functions as part of the mode function implementation, specifically for String/View types.

comphead · 2024-10-01T19:18:03Z

Right but still not sure should it be part of DataFusion core. The mode function is in contrib package, however DF core is heavily using StringViews.
@alamb @jayzhan211 WDYT on the core extension?

alamb

Thank you @dmitrybugakov -- my biggest concern with this PR is the duplication between insert_if_new and insert_or_update

It seems like the biggest difference is when the observe_payload and update_payload functions are called (I had to study the code to see that observe_payload seems to be called also for newly inserted payloads as well as pre-exisitng ones)

what do you think instead of adding an entirely new copy, extending the existing insert_if_new function to take 3 functions (instead of just two)

Something like this, perhaps:

    pub fn insert_if_new<MP, OP>(
        &mut self,
        values: &ArrayRef,
        make_payload_fn: MP,
        observe_payload_fn: OP,
        update_payload_fn: UP,
    ) where
        MP: FnMut(Option<&[u8]>) -> V,
        UP: FnMut(&mut V),
        OP: FnMut(V),
    {

?

alamb · 2024-10-02T11:08:32Z

FWIW I think having a structure like this (optimized maps that can emit arrow arrays) in the core of datafusion is a good idea given they are general purpose. I think this one is pretty general purpose

alamb

Thank you @dmitrybugakov

I read this code and it all makes sense to me

If anyone is curious here is the use of get_payloads in mode: https://github.com/datafusion-contrib/datafusion-functions-extra/blob/dd6806037c6f3db949768485f9afdc8db0797d03/src/common/mode/bytes.rs#L81

The one last thing I would like to do before merging this is to run some performance benchmarks to make sure there are no regressions

alamb · 2024-10-04T19:28:03Z

datafusion/physical-expr-common/src/binary_map.rs

+        for (value, &hash) in values.iter().zip(batch_hashes.iter()) {
+            // Handle null value
+            let Some(value) = value else {
+                if let Some(&(payload, _)) = self.null.as_ref() {


I think this code would be faster if you checked for nulls outside the loop rather than on each row

alamb · 2024-10-04T19:31:26Z

datafusion/physical-expr-common/src/binary_view_map.rs

+    /// # Returns
+    ///
+    /// A vector of payloads for each value, or `None` if the value is not found.
+    pub fn get_payloads(self, values: &ArrayRef) -> Vec<Option<V>> {


maybe we could call this (and the other function) into_values() as that is what it appears to be doing (consuming the map to get the values)

I considered that as well, but when comparing into_values with into_state (which literally converts the map to a state), using the name into_values feels a bit off to me. Since it only returns the values’ state without changing the map.

alamb · 2024-10-04T19:36:07Z

datafusion/physical-expr-common/src/binary_view_map.rs

@@ -178,8 +183,12 @@ where
    /// `make_payload_fn`:  invoked for each value that is not already present
    /// to create the payload, in order of the values in `values`
    ///
-    /// `observe_payload_fn`: invoked once, for each value in `values`, that was
-    /// already present in the map, with corresponding payload value.
+    /// `update_payload_fn`: invoked when the value is already present in the map.


Re-reading this code i wonder why the term "payload" is used rather than "value" -- it is a Map after all, so it has keys and values 🤔

Maybe it is because the values was already used.

Yep, that might be the case. It could make sense to stick with the more traditional key/value naming.

dmitrybugakov · 2024-10-05T10:51:24Z

datafusion/physical-expr-common/benches/binary_view_map.rs

@@ -0,0 +1,77 @@
+// Licensed to the Apache Software Foundation (ASF) under one


ArrowBytesViewMap insert_if_new - items: 1000000, null_density: 0.1, str_len: 20
time: [29.995 ms 30.207 ms 30.447 ms]
Found 16 outliers among 100 measurements (16.00%)
3 (3.00%) high mild
13 (13.00%) high severe

ArrowBytesViewMap get_payloads - items: 1000000, null_density: 0.1, str_len: 20
time: [7.6807 ms 7.7203 ms 7.7624 ms]
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild

ArrowBytesViewMap insert_if_new - items: 1000000, null_density: 0.1, str_len: 50
time: [35.002 ms 35.305 ms 35.637 ms]
Found 7 outliers among 100 measurements (7.00%)
6 (6.00%) high mild
1 (1.00%) high severe

ArrowBytesViewMap get_payloads - items: 1000000, null_density: 0.1, str_len: 50
time: [8.7820 ms 8.8271 ms 8.8757 ms]
Found 7 outliers among 100 measurements (7.00%)
7 (7.00%) high mild

ArrowBytesViewMap insert_if_new - items: 1000000, null_density: 0.5, str_len: 20
time: [23.549 ms 23.660 ms 23.784 ms]
Found 11 outliers among 100 measurements (11.00%)
5 (5.00%) high mild
6 (6.00%) high severe

ArrowBytesViewMap get_payloads - items: 1000000, null_density: 0.5, str_len: 20
time: [11.704 ms 11.746 ms 11.793 ms]
Found 10 outliers among 100 measurements (10.00%)
6 (6.00%) high mild
4 (4.00%) high severe

ArrowBytesViewMap insert_if_new - items: 1000000, null_density: 0.5, str_len: 50
time: [26.712 ms 27.083 ms 27.475 ms]

ArrowBytesViewMap get_payloads - items: 1000000, null_density: 0.5, str_len: 50
time: [12.457 ms 12.516 ms 12.581 ms]
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe

dmitrybugakov · 2024-10-05T10:56:36Z

datafusion/physical-expr-common/benches/binary_map.rs

@@ -0,0 +1,74 @@
+// Licensed to the Apache Software Foundation (ASF) under one


ArrowBytesMap insert_if_new - items: 1000000, null_density: 0.1, str_len: 20
time: [26.761 ms 27.106 ms 27.472 ms]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild

ArrowBytesMap get_payloads - items: 1000000, null_density: 0.1, str_len: 20
time: [7.1595 ms 7.1766 ms 7.1950 ms]
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
3 (3.00%) high mild
2 (2.00%) high severe

ArrowBytesMap insert_if_new - items: 1000000, null_density: 0.1, str_len: 50
time: [31.591 ms 31.749 ms 31.929 ms]
Found 11 outliers among 100 measurements (11.00%)
6 (6.00%) high mild
5 (5.00%) high severe

ArrowBytesMap get_payloads - items: 1000000, null_density: 0.1, str_len: 50
time: [8.2213 ms 8.2395 ms 8.2596 ms]
Found 9 outliers among 100 measurements (9.00%)
4 (4.00%) high mild
5 (5.00%) high severe

ArrowBytesMap insert_if_new - items: 1000000, null_density: 0.5, str_len: 20
time: [22.149 ms 22.258 ms 22.378 ms]
Found 13 outliers among 100 measurements (13.00%)
6 (6.00%) high mild
7 (7.00%) high severe

ArrowBytesMap get_payloads - items: 1000000, null_density: 0.5, str_len: 20
time: [10.727 ms 10.783 ms 10.842 ms]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild

ArrowBytesMap insert_if_new - items: 1000000, null_density: 0.5, str_len: 50
time: [23.929 ms 24.083 ms 24.252 ms]
Found 17 outliers among 100 measurements (17.00%)
10 (10.00%) high mild
7 (7.00%) high severe

ArrowBytesMap get_payloads - items: 1000000, null_density: 0.5, str_len: 50
time: [11.141 ms 11.165 ms 11.194 ms]
Found 10 outliers among 100 measurements (10.00%)
3 (3.00%) high mild
7 (7.00%) high severe

dmitrybugakov · 2024-10-05T11:22:02Z

datafusion/physical-expr-common/Cargo.toml

@@ -37,8 +37,17 @@ path = "src/lib.rs"

 [dependencies]
 ahash = { workspace = true }
-arrow = { workspace = true }
+arrow = { workspace = true, features = ["test_utils"] }
+criterion = "0.5"


might be, it makes sense to move criterion to the workspace deps?

alamb

@dmitrybugakov -- thank you, this is epic work

I was dreaming about this change last night, and spent some time reviewing the mode implementation this morning

One of the main benefits of the BinaryMap structures in DataFusion over a normal HashMap is that the keys (aka string values) can be emitted directly as an Arrow StringArray without having to copy the values.

What I don't understand about mode is that it actually seems to be just looking for the single largest value and returning it as a ScalarValue

https://github.com/datafusion-contrib/datafusion-functions-extra/blob/dd6806037c6f3db949768485f9afdc8db0797d03/src/common/mode/bytes.rs#L234C1-L244

So I guess I am trying to figure out, maybe we could implement mode with just a normal HashMap (or one from hashbrown) to make the code simpler 🤔 and avoid the need for this change in the datafusion core?

I really apoligize for not reviewing the mode implementation more carefully before.

dmitrybugakov · 2024-10-06T18:17:07Z

Thank you, @alamb
No worries, I'll check the hashbrown then.

That’s the reason I started using it.

One of the main benefits of the BinaryMap structures in DataFusion over a normal HashMap is that the keys (aka string values) can be emitted directly as an Arrow StringArray without having to copy the values.

alamb · 2024-10-07T11:25:33Z

Thank you for understanding - and I apologize for all the runaround just to result in a closed PR. I will try and do better next time

github-actions bot added the physical-expr Physical Expressions label Oct 1, 2024

alamb reviewed Oct 2, 2024

View reviewed changes

dmitrybugakov requested a review from alamb October 3, 2024 12:48

alamb reviewed Oct 4, 2024

View reviewed changes

dmitrybugakov commented Oct 5, 2024

View reviewed changes

dmitrybugakov force-pushed the extend-binary-map branch from f4ddf8c to 46206f5 Compare October 5, 2024 11:18

dmitrybugakov commented Oct 5, 2024

View reviewed changes

dmitrybugakov added 7 commits October 5, 2024 13:32

Add insert_or_update and get_payloads to map

ff50dac

minor changes

ff51426

avoid code duplication

8675940

minor changes

cb4c56b

minor changes

50db977

minor changes

e573715

minor changes

44b5507

dmitrybugakov force-pushed the extend-binary-map branch from 1eb1b66 to 44b5507 Compare October 5, 2024 11:33

Update cli Cargo.lock

514dfaf

dmitrybugakov requested a review from alamb October 5, 2024 12:35

alamb reviewed Oct 6, 2024

View reviewed changes

dmitrybugakov closed this Oct 6, 2024

dmitrybugakov deleted the extend-binary-map branch October 6, 2024 18:19

alamb mentioned this pull request Oct 7, 2024

Clarify documentation on ArrowBytesMap and ArrowBytesViewMap #12789

Merged

dmitrybugakov mentioned this pull request Oct 7, 2024

Simplify mode function logic datafusion-contrib/datafusion-functions-extra#10

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add insert_or_update and get_payloads to map #12701

Add insert_or_update and get_payloads to map #12701

dmitrybugakov commented Oct 1, 2024

comphead commented Oct 1, 2024

dmitrybugakov commented Oct 1, 2024

comphead commented Oct 1, 2024

alamb left a comment

alamb commented Oct 2, 2024

alamb left a comment

alamb Oct 4, 2024

alamb Oct 4, 2024

dmitrybugakov Oct 5, 2024

alamb Oct 4, 2024

dmitrybugakov Oct 5, 2024 •

edited

Loading

dmitrybugakov Oct 5, 2024

dmitrybugakov Oct 5, 2024

dmitrybugakov Oct 5, 2024

alamb left a comment

dmitrybugakov commented Oct 6, 2024 •

edited

Loading

alamb commented Oct 7, 2024

		@@ -0,0 +1,77 @@
		// Licensed to the Apache Software Foundation (ASF) under one

		@@ -0,0 +1,74 @@
		// Licensed to the Apache Software Foundation (ASF) under one

Add insert_or_update and get_payloads to map #12701

Add insert_or_update and get_payloads to map #12701

Conversation

dmitrybugakov commented Oct 1, 2024

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

comphead commented Oct 1, 2024

dmitrybugakov commented Oct 1, 2024

comphead commented Oct 1, 2024

alamb left a comment

Choose a reason for hiding this comment

alamb commented Oct 2, 2024

alamb left a comment

Choose a reason for hiding this comment

alamb Oct 4, 2024

Choose a reason for hiding this comment

alamb Oct 4, 2024

Choose a reason for hiding this comment

dmitrybugakov Oct 5, 2024

Choose a reason for hiding this comment

alamb Oct 4, 2024

Choose a reason for hiding this comment

dmitrybugakov Oct 5, 2024 • edited Loading

Choose a reason for hiding this comment

dmitrybugakov Oct 5, 2024

Choose a reason for hiding this comment

dmitrybugakov Oct 5, 2024

Choose a reason for hiding this comment

dmitrybugakov Oct 5, 2024

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

dmitrybugakov commented Oct 6, 2024 • edited Loading

alamb commented Oct 7, 2024

dmitrybugakov Oct 5, 2024 •

edited

Loading

dmitrybugakov commented Oct 6, 2024 •

edited

Loading