RFC: Demonstrate new `GroupHashAggregate` stream approach (runs more than 2x faster!) #6800

alamb · 2023-06-29T16:13:33Z

TLDR

This branch executes Q17 (which has a high cardinality grouping on a Decimal128) in less than half (44%) of the time that main takes. 🥳

Which issue does this PR close?

Related to #4973

This PR contains a technical spike / proof of concept showing that the hash aggregate approach described in #4973 (comment) will improve performance dramatically.

I do not intend to ever merge this PR, but rather if it proves promising, I will break it up and incrementally merge it into the existing code (steps TBD)

Rationale for this change

We want faster grouping behavior, especially when there are large numbers of distinct groups

What changes are included in this PR?

A new GroupedHashAggregateStream2 operator that implements vectorized / multi-group updates
A new GroupsAccumulator trait with a proposed vectorized API for managing and updating group state
A generic implementation of GroupsAccumulator for AVG for PrimitiveArray (including decimal)
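To make the "vectorized / multi-group update" idea concrete, here is a minimal sketch that uses plain slices instead of Arrow arrays. The names `GroupsAccumulatorSketch` and `AvgSketch`, and the method signatures, are hypothetical simplifications for illustration and are not the API proposed in this PR.

```rust
/// Hypothetical, simplified stand-in for the proposed trait: a single
/// accumulator instance holds the state for *all* groups.
trait GroupsAccumulatorSketch {
    /// Update many groups at once: `values[i]` belongs to group
    /// `group_indices[i]`. `total_num_groups` is the number of groups seen so far.
    fn update_batch(&mut self, values: &[f64], group_indices: &[usize], total_num_groups: usize);

    /// Produce one output value per group (`None` for groups with no input).
    fn evaluate(&self) -> Vec<Option<f64>>;
}

/// AVG state stored column-wise: one sum and one count per group.
#[derive(Default)]
struct AvgSketch {
    sums: Vec<f64>,
    counts: Vec<u64>,
}

impl GroupsAccumulatorSketch for AvgSketch {
    fn update_batch(&mut self, values: &[f64], group_indices: &[usize], total_num_groups: usize) {
        assert_eq!(values.len(), group_indices.len());
        // Grow the state so that every known group has a slot.
        if self.sums.len() < total_num_groups {
            self.sums.resize(total_num_groups, 0.0);
            self.counts.resize(total_num_groups, 0);
        }
        // One tight loop per batch, with no per-group dynamic dispatch.
        for (&value, &group_index) in values.iter().zip(group_indices) {
            self.sums[group_index] += value;
            self.counts[group_index] += 1;
        }
    }

    fn evaluate(&self) -> Vec<Option<f64>> {
        self.sums
            .iter()
            .zip(&self.counts)
            .map(|(&sum, &count)| if count == 0 { None } else { Some(sum / count as f64) })
            .collect()
    }
}
```

The point of the sketch is the call pattern: the operator computes a group index for every input row and then makes a single `update_batch` call per accumulator per batch.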

Stuff I plan to complete in this PR

Complete fuzz testing of accumulate function
Implement opt_filter in accumulate functions
An adapter that implements GroupsAccumulator in terms of Accumulator (for slower, but simpler accumulators)

I am very pleased with how the code looks

Things not done:

Filtering (though I don't expect it would change performance at all without filters)
Null handling for counts

Performance Results:

This branch runs Q17 in less than half (44%) of the time that main takes. 🥳

This branch: Query 17 avg time: 766.31 ms
main: Query 17 avg time: 1789.73 ms

Details

Correctness

Both main and this branch produce the same answer

    +-------------------+
    | avg_yearly        |
    +-------------------+
    | 348406.0542857143 |
    +-------------------+

This branch

Query 17 iteration 0 took 876.5 ms and returned 1 rows
Query 17 iteration 1 took 757.5 ms and returned 1 rows
Query 17 iteration 2 took 737.6 ms and returned 1 rows
Query 17 iteration 3 took 728.6 ms and returned 1 rows
Query 17 iteration 4 took 731.3 ms and returned 1 rows
Query 17 avg time: 766.31 ms

Main

Query 17 iteration 0 took 1794.5 ms and returned 1 rows
Query 17 iteration 1 took 1825.9 ms and returned 1 rows
Query 17 iteration 2 took 1799.1 ms and returned 1 rows
Query 17 iteration 3 took 1793.4 ms and returned 1 rows
Query 17 iteration 4 took 1735.7 ms and returned 1 rows
Query 17 avg time: 1789.73 ms

Methodology

Run this command

cargo run --profile release-nonlto --bin tpch -- benchmark datafusion --iterations 5 -m --format parquet -q 17 --path /Users/alamb/Software/arrow-datafusion/benchmarks/data/

Query:

select
        sum(l_extendedprice) / 7.0 as avg_yearly
from
    lineitem,
    part
where
        p_partkey = l_partkey
  and p_brand = 'Brand#23'
  and p_container = 'MED BOX'
  and l_quantity < (
    select
            0.2 * avg(l_quantity)
    from
        lineitem
    where
            l_partkey = p_partkey
);

Here is the original plan:

[2023-06-29T13:26:41Z DEBUG datafusion::physical_planner] Optimized physical plan:
    ProjectionExec: expr=[CAST(SUM(lineitem.l_extendedprice)@0 AS Float64) / 7 as avg_yearly]
      AggregateExec: mode=Final, gby=[], aggr=[SUM(lineitem.l_extendedprice)]
        CoalescePartitionsExec
          AggregateExec: mode=Partial, gby=[], aggr=[SUM(lineitem.l_extendedprice)]
            ProjectionExec: expr=[l_extendedprice@1 as l_extendedprice]
              CoalesceBatchesExec: target_batch_size=8192
                HashJoinExec: mode=Partitioned, join_type=Inner, on=[(p_partkey@2, l_partkey@1)], filter=CAST(l_quantity@0 AS Decimal128(30, 15)) < Float64(0.2) * AVG(lineitem.l_quantity)@1
                  CoalesceBatchesExec: target_batch_size=8192
                    RepartitionExec: partitioning=Hash([p_partkey@2], 2), input_partitions=2
                      ProjectionExec: expr=[l_quantity@1 as l_quantity, l_extendedprice@2 as l_extendedprice, p_partkey@3 as p_partkey]
                        CoalesceBatchesExec: target_batch_size=8192
                          HashJoinExec: mode=Partitioned, join_type=Inner, on=[(l_partkey@0, p_partkey@0)]
                            CoalesceBatchesExec: target_batch_size=8192
                              RepartitionExec: partitioning=Hash([l_partkey@0], 2), input_partitions=2
                                MemoryExec: partitions=2, partition_sizes=[367, 366]
                            CoalesceBatchesExec: target_batch_size=8192
                              RepartitionExec: partitioning=Hash([p_partkey@0], 2), input_partitions=2
                                ProjectionExec: expr=[p_partkey@0 as p_partkey]
                                  CoalesceBatchesExec: target_batch_size=8192
                                    FilterExec: p_brand@1 = Brand#23 AND p_container@2 = MED BOX
                                      MemoryExec: partitions=2, partition_sizes=[13, 12]
                  ProjectionExec: expr=[CAST(0.2 * CAST(AVG(lineitem.l_quantity)@1 AS Float64) AS Decimal128(30, 15)) as Float64(0.2) * AVG(lineitem.l_quantity), l_partkey@0 as l_partkey]
                    AggregateExec: mode=FinalPartitioned, gby=[l_partkey@0 as l_partkey], aggr=[AVG(lineitem.l_quantity)]   <-- want to use the new stream for here
                      CoalesceBatchesExec: target_batch_size=8192
                        RepartitionExec: partitioning=Hash([l_partkey@0], 2), input_partitions=2
                          AggregateExec: mode=Partial, gby=[l_partkey@0 as l_partkey], aggr=[AVG(lineitem.l_quantity)]
                            MemoryExec: partitions=2, partition_sizes=[367, 366]

Next Steps

Stuff I would do after the above is done:

Implement GroupsAccumulators for all existing RowAccumulators (see list below)
Reduce duplication between BoundedAggregateStream and GroupedHashAggregateStream #6798
Write a blog post about it

Here is the list of RowAccumulators (aka accumulators that have
specialized implementations). I think Avg is the trickiest to
implement (and it is already done)

alamb · 2023-06-29T16:32:33Z

datafusion/core/src/physical_plan/aggregates/row_hash2.rs

+
+    /// The actual group by values, stored in arrow Row format
+    /// the index of group_by_values is the index
+    /// https://github.com/apache/arrow-rs/issues/4466


apache/arrow-rs#4466

Dandandan · 2023-06-29T18:20:37Z

datafusion/core/src/physical_plan/aggregates/row_hash2.rs

+    /// The actual group by values, stored in arrow Row format
+    /// the index of group_by_values is the index
+    /// https://github.com/apache/arrow-rs/issues/4466
+    group_by_values: Vec<OwnedRow>,


This should probably be a buffer of some sort? OwnedRow has a copy of the RowConfig per value. If we want to keep using rows(?), something like the following would do:

    pub struct AppendableRows {
        /// Underlying row bytes
        buffer: Vec<u8>,
        /// Row `i` has data `&buffer[offsets[i]..offsets[i+1]]`
        offsets: Vec<usize>,
        /// The config for these rows
        config: RowConfig,
    }
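A minimal, standalone sketch of how such a buffer could be appended to and read back is shown here. The type name `AppendableRowsSketch` and its methods are hypothetical, and the `config` field from the suggestion is left out because it depends on arrow-rs internals.

```rust
/// Sketch of a contiguous row store: all row bytes live in one buffer
/// instead of one heap allocation (and one config copy) per row.
struct AppendableRowsSketch {
    /// Underlying bytes of all rows, stored back to back
    buffer: Vec<u8>,
    /// Row `i` has data `&buffer[offsets[i]..offsets[i + 1]]`
    offsets: Vec<usize>,
}

impl AppendableRowsSketch {
    fn new() -> Self {
        // The leading 0 means `offsets.len() - 1` is always the row count.
        Self { buffer: Vec::new(), offsets: vec![0] }
    }

    /// Appends a row and returns its index (usable as a group index).
    fn push(&mut self, row: &[u8]) -> usize {
        self.buffer.extend_from_slice(row);
        self.offsets.push(self.buffer.len());
        self.offsets.len() - 2
    }

    /// Returns the bytes of row `i`; panics if `i` is out of range.
    fn row(&self, i: usize) -> &[u8] {
        &self.buffer[self.offsets[i]..self.offsets[i + 1]]
    }

    fn num_rows(&self) -> usize {
        self.offsets.len() - 1
    }
}
```

With this layout, adding a group is an append to two vectors, and looking up a group's key is a slice into the shared buffer.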

Thanks @Dandandan -- that is an excellent point. That is what I was trying to get at with apache/arrow-rs#4466

Note that the formulation in this PR is no worse than what is on master I don't think (which also has an OwnedRow per group)

https://github.com/apache/arrow-datafusion/blob/e91af991c5ae4a6b4afab2cb1b0c9307a69e4046/datafusion/core/src/physical_plan/aggregates/utils.rs#L40

Ah I saw you mentioned the need for it in the feature request

And it's interesting that the current code already does this with an OwnedRow; I didn't realize that

(I am feeling very good about the ability to make the code faster 🚀 )

(BTW @tustvold is being a hero. Here is a PR to help apache/arrow-rs#4470)

I like this change. It is important to reduce the memory size of group rows/keys.
One optimization we can do further is when the group keys are fixed length, we can avoid the offsets vec also.

Dandandan · 2023-06-29T19:05:12Z

datafusion/core/src/physical_plan/aggregates/row_hash2.rs

+        create_hashes(group_values, &self.random_state, &mut batch_hashes)?;
+
+        for (row, hash) in batch_hashes.into_iter().enumerate() {
+            let entry = self.map.get_mut(hash, |(_hash, group_idx)| {


I wonder if we could get this more in line with the hash join, with the following steps:

Create candidates (possible matches) based on hash-equality

Compare keys (column-wise) in a vectorized fashion (take + eq + and)

Filter candidates based on filter (filter).
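The three steps above can be sketched as follows, using plain `i64` key columns and `(batch_row, group_index)` pairs as candidates. All function names are hypothetical, and the sketch assumes at most one candidate group per hash value, which a real hash table would not guarantee.

```rust
use std::collections::HashMap;

/// Step 1: candidates based on hash equality only (these may be false matches).
fn find_candidates(hashes: &[u64], hash_to_group: &HashMap<u64, usize>) -> Vec<(usize, usize)> {
    hashes
        .iter()
        .enumerate()
        .filter_map(|(row, hash)| hash_to_group.get(hash).map(|&group| (row, group)))
        .collect()
}

/// Step 2: compare one key column for all candidates at once ("take" + "eq").
fn column_eq(batch_col: &[i64], group_col: &[i64], candidates: &[(usize, usize)]) -> Vec<bool> {
    candidates
        .iter()
        .map(|&(row, group)| batch_col[row] == group_col[group])
        .collect()
}

/// Steps 2 and 3: AND the per-column masks, then keep only the true matches.
fn confirm_matches(
    batch_cols: &[&[i64]],
    group_cols: &[&[i64]],
    candidates: Vec<(usize, usize)>,
) -> Vec<(usize, usize)> {
    let mut mask = vec![true; candidates.len()];
    for (batch_col, group_col) in batch_cols.iter().zip(group_cols) {
        let eq = column_eq(batch_col, group_col, &candidates);
        for (keep, is_eq) in mask.iter_mut().zip(eq) {
            *keep &= is_eq;
        }
    }
    candidates
        .into_iter()
        .zip(mask)
        .filter_map(|(candidate, keep)| keep.then_some(candidate))
        .collect()
}
```

Rows that have no candidate, or whose candidate fails the key comparison, would then be inserted as new groups.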

That is interesting 🤔 This is basically what the existing grouping operator does. I'll try and check out the join code at some point and see if I can transfer any of the learnings over here.

From our experience, convert_columns is also quite expensive. It may be worth considering directly comparing column by column, and only doing the row conversion when spilling is required.

I wonder if we can special case single column grouping like we do for SortPreservingMerge, the row format is only really beneficial when dealing with multiple columns as it avoids per-field type dispatch.

FWIW row conversion should have similar performance to the take kernel, with exception to dictionaries. I would be interested if this is not the case, as that is a bug.

I wonder if we can special case single column grouping like we do for SortPreservingMerge, the row format is only really beneficial when dealing with multiple columns as it avoids per-field type dispatch.

Yes, I think this would be an excellent idea.

Basically @sunchao I think we have seen that for single column sorting (in this case grouping) keeping the native representation is better than converting to row format. However, once there are multiple sort (or group) columns involved, the dynamic dispatch logic for comparisons quickly dominates the row conversion costs.

I am a bit concerned about "boiling the ocean" when improving grouping. Any work will take a significant amount of time, so keeping the scope down is important to make the change practical

That being said, if we go with the formulation in this PR, we'll be in a much better place to try and specialize group storage -- it may not be obvious but the actual operator / stream code in this PR is quite a bit simpler than the existing row_hash even though it has all the same features. This difference is largely due to not tracking parallel sets of aggregators (row and Accumulators)

100% agree, I think we should focus on getting a consistent accumulator representation and interface, before undertaking additional optimisation work of the hash table machinery

Yes, I totally agree with the approach. Getting the other changes ironed out is definitely more important for now.

Also agree we should finish the other changes first as it will get too big otherwise 👍

I might do some experiments in the future with a similar approach as I mentioned above. I think the conversion might be relatively fast, but it will make other operations (e.g. equality) slower as it is not specialized on fixed size types and not as well vectorized.

From our experience, convert_columns is also quite expensive. It may be worth considering directly comparing column by column, and only doing the row conversion when spilling is required.

I think the encoder implemented by @tustvold is very efficient. In the past I did some tests on this code path and it took almost no time.

alamb · 2023-06-30T17:12:25Z

datafusion/core/src/physical_plan/aggregates/row_hash2.rs

+
+use super::AggregateExec;
+
+/// Grouping aggregate


This code follows the basic structure of row_hash but the aggregate state management is different

alamb · 2023-06-30T17:13:10Z

datafusion/core/src/physical_plan/aggregates/row_hash2.rs

+
+                match self.mode {
+                    AggregateMode::Partial | AggregateMode::Single => {
+                        acc.update_batch(


Here is one key difference -- each accumulator is called once per input batch (not once per group)

alamb · 2023-06-30T17:14:49Z

datafusion/physical-expr/src/aggregate/average.rs

+                let avg_fn =
+                    move |sum: i128, count: u64| decimal_averager.avg(sum, count as i128);
+
+                Ok(Box::new(AvgGroupsAccumulator::<Decimal128Type, _>::new(


Here is a specialized accumulator -- the generic code is written once and instantiated for each native type (or other type) we need to support, which results in a specialized accumulator for each of those types. 👨‍🍳 👌

This also serves the purpose of allowing us to eventually deprecate the ScalarValue binary operations - #6842

alamb · 2023-06-30T17:15:38Z

datafusion/physical-expr/src/aggregate/average.rs

+        // TODO combine the null mask from values and opt_filter
+        let valids = values.nulls();
+
+        // This is based on (ahem, COPY/PASTA) arrow::compute::aggregate::sum


This particular code is likely to be very common across most accumulators so I would hope to find some way to generalize it into its own function / macro
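One possible shape for such a shared helper is sketched below. It is only an illustration: `accumulate_sketch` is a hypothetical name, and the `Option<&[bool]>` validity mask is a simplified stand-in for an Arrow null buffer (filter handling is not shown).

```rust
/// Calls `value_fn(group_index, value)` for every non-null input row.
///
/// `validity` is `None` when the input has no nulls; otherwise
/// `validity[i] == true` means row `i` is valid (non-null).
fn accumulate_sketch<T, F>(
    group_indices: &[usize],
    values: &[T],
    validity: Option<&[bool]>,
    mut value_fn: F,
) where
    T: Copy,
    F: FnMut(usize, T),
{
    assert_eq!(group_indices.len(), values.len());
    match validity {
        // Fast path: no null checks inside the loop.
        None => {
            for (&group_index, &value) in group_indices.iter().zip(values) {
                value_fn(group_index, value);
            }
        }
        // Nullable path: skip rows whose validity bit is not set.
        Some(mask) => {
            assert_eq!(mask.len(), values.len());
            for ((&group_index, &value), &is_valid) in
                group_indices.iter().zip(values).zip(mask)
            {
                if is_valid {
                    value_fn(group_index, value);
                }
            }
        }
    }
}
```

Each accumulator would then supply only the closure, for example one that adds to `sums[group_index]` and increments `counts[group_index]`.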

alamb · 2023-06-30T17:21:01Z

datafusion/physical-expr/src/aggregate/groups_accumulator.rs

+use arrow_array::{ArrayRef, BooleanArray};
+use datafusion_common::Result;
+
+/// An implementation of GroupAccumulator is for a single aggregate


Here is the new GroupsAccumulator trait that all accumulators would have to implement.

I would also plan to create a struct that implements this trait for aggregates based on Accumulators

    struct GroupsAdapter {
        groups: Vec<Box<dyn Accumulator>>
    }

    impl GroupsAccumulator for GroupsAdapter {
        ...
    }

So in that way we can start with simpler (but slower) Accumulator implementations for aggregates, and provide a fast GroupsAccumulator for the aggregates / types that need the specialization
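A rough sketch of how such an adapter might work, with simplified stand-in traits, is shown here. `AccumulatorLike`, `GroupsAccumulatorLike`, `GroupsAdapterSketch`, and `SumSketch` are all hypothetical names, and the real traits operate on Arrow arrays rather than `i64` slices.

```rust
/// Simplified stand-in for the existing one-instance-per-group accumulator.
trait AccumulatorLike {
    fn update(&mut self, value: i64);
    fn evaluate(&self) -> i64;
}

/// Simplified stand-in for the new all-groups-at-once interface.
trait GroupsAccumulatorLike {
    fn update_batch(&mut self, values: &[i64], group_indices: &[usize], total_num_groups: usize);
    fn evaluate(&self) -> Vec<i64>;
}

/// Adapter: keeps one boxed accumulator per group. Simple, but it pays a
/// dynamic dispatch per input row, which is why it is the slow path.
struct GroupsAdapterSketch {
    factory: Box<dyn Fn() -> Box<dyn AccumulatorLike>>,
    groups: Vec<Box<dyn AccumulatorLike>>,
}

impl GroupsAdapterSketch {
    fn new(factory: Box<dyn Fn() -> Box<dyn AccumulatorLike>>) -> Self {
        Self { factory, groups: Vec::new() }
    }
}

impl GroupsAccumulatorLike for GroupsAdapterSketch {
    fn update_batch(&mut self, values: &[i64], group_indices: &[usize], total_num_groups: usize) {
        // Create accumulators for any groups seen for the first time.
        while self.groups.len() < total_num_groups {
            self.groups.push((self.factory)());
        }
        for (&value, &group_index) in values.iter().zip(group_indices) {
            self.groups[group_index].update(value);
        }
    }

    fn evaluate(&self) -> Vec<i64> {
        self.groups.iter().map(|group| group.evaluate()).collect()
    }
}

/// Example of a simple per-group accumulator that the adapter can wrap.
#[derive(Default)]
struct SumSketch {
    sum: i64,
}

impl AccumulatorLike for SumSketch {
    fn update(&mut self, value: i64) {
        self.sum += value;
    }
    fn evaluate(&self) -> i64 {
        self.sum
    }
}
```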

Dandandan · 2023-07-02T08:17:56Z

I did some profiling on the current version on query 17: seems that a portion (at least 10% but could be more) of the time is spent now around Row/OwnedRow - would be interesting to see how much it improves after using apache/arrow-rs#4470

Dandandan · 2023-07-02T17:14:19Z

@alamb are you continuing this PR on your own, or would some form of assistance help? E.g. writing some of those accumulators?

Dandandan · 2023-07-02T20:03:31Z

datafusion/physical-expr/src/aggregate/average.rs

+                group_indicies,
+                values,
+                opt_filter,
+                |group_index, _new_value| {


I wonder if this compiles into the same code as with only iterating over group_indicies

It would be super helpful if you could test that / figure out if it is worth specializing -- the original version didn't handle input nulls correctly

Dandandan · 2023-07-02T20:04:08Z

datafusion/physical-expr/src/aggregate/average.rs

+
+        if values.null_count() == 0 {
+            accumulate_all(
+                group_indicies,


Suggested change

group_indicies,

group_indices,

?

🤦

    git commit -a -m 'fix spelling of indices'
    [alamb/hash_agg_spike d760a5f115] fix spelling of indices
     4 files changed, 24 insertions(+), 24 deletions(-)

alamb · 2023-07-07T21:30:32Z

I just have the last two accumulators

BoolAndRowAccumulator
BoolOrRowAccumulator

To complete and I think I'll be ready to create a PR for review

Dandandan · 2023-07-08T09:18:39Z

Found time for a small optimization (to reuse the buffer to create the hashes).

alamb · 2023-07-08T15:06:18Z

datafusion/core/src/physical_plan/aggregates/row_hash.rs

@@ -111,6 +111,8 @@ pub(crate) struct GroupedHashAggregateStream {
    /// first element in the array corresponds to normal accumulators
    /// second element in the array corresponds to row accumulators
    indices: [Vec<Range<usize>>; 2],
+    // buffer to be reused to store hashes
+    hashes_buffer: Vec<u64>,


❤️ this is a good change -- thanks @Dandandan . Pretty soon there will be no allocations while processing each batch (aka the hot loop) 🥳 -- I think with #6888 we can get rid of the counts in the sum accumulator

Note that this change was made to the existing row_hash (not the new one). I will port the change to the new one as part of #6904
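The buffer reuse pattern can be sketched in isolation like this. `HasherSketch` and `hash_batch` are hypothetical names, and the standard library hasher stands in for DataFusion's `create_hashes`.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

struct HasherSketch {
    random_state: RandomState,
    /// Reused across batches so steady-state processing does not allocate
    hashes_buffer: Vec<u64>,
}

impl HasherSketch {
    fn new() -> Self {
        Self { random_state: RandomState::new(), hashes_buffer: Vec::new() }
    }

    /// Computes one hash per key into the reused buffer.
    fn hash_batch(&mut self, keys: &[i64]) -> &[u64] {
        let random_state = &self.random_state;
        // `clear` drops the old contents but keeps the allocated capacity,
        // so no allocation happens once the buffer has reached batch size.
        self.hashes_buffer.clear();
        self.hashes_buffer.extend(keys.iter().map(|key| {
            let mut hasher = random_state.build_hasher();
            key.hash(&mut hasher);
            hasher.finish()
        }));
        &self.hashes_buffer
    }
}
```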

alamb · 2023-07-08T16:54:19Z

Ok, here are some numbers (TPCH SF1). I am quite pleased

My next plan is to turn this into a PR

--------------------
Benchmark tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ alamb_hash_agg_spike ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  789.36ms │             768.82ms │     no change │
│ QQuery 2     │  292.62ms │             219.58ms │ +1.33x faster │
│ QQuery 3     │  408.23ms │             388.36ms │     no change │
│ QQuery 4     │  239.14ms │             236.48ms │     no change │
│ QQuery 5     │  512.51ms │             516.96ms │     no change │
│ QQuery 6     │  208.24ms │             211.47ms │     no change │
│ QQuery 7     │  869.70ms │             896.97ms │     no change │
│ QQuery 8     │  574.60ms │             591.00ms │     no change │
│ QQuery 9     │  893.77ms │             908.34ms │     no change │
│ QQuery 10    │  650.66ms │             621.45ms │     no change │
│ QQuery 11    │  204.09ms │             178.99ms │ +1.14x faster │
│ QQuery 12    │  334.17ms │             327.36ms │     no change │
│ QQuery 13    │  744.82ms │             634.29ms │ +1.17x faster │
│ QQuery 14    │  292.05ms │             281.81ms │     no change │
│ QQuery 15    │  247.06ms │             218.11ms │ +1.13x faster │
│ QQuery 16    │  247.45ms │             209.87ms │ +1.18x faster │
│ QQuery 17    │ 2534.68ms │            1135.75ms │ +2.23x faster │
│ QQuery 18    │ 2630.03ms │            1751.31ms │ +1.50x faster │
│ QQuery 19    │  521.75ms │             528.30ms │     no change │
│ QQuery 20    │  926.76ms │             440.71ms │ +2.10x faster │
│ QQuery 21    │ 1278.07ms │            1275.54ms │     no change │
│ QQuery 22    │  150.15ms │             150.67ms │     no change │
└──────────────┴───────────┴──────────────────────┴───────────────┘
--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ alamb_hash_agg_spike ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  489.23ms │             455.08ms │ +1.08x faster │
│ QQuery 2     │  243.33ms │             134.34ms │ +1.81x faster │
│ QQuery 3     │  166.61ms │             158.30ms │     no change │
│ QQuery 4     │  112.69ms │             109.91ms │     no change │
│ QQuery 5     │  371.31ms │             367.26ms │     no change │
│ QQuery 6     │   38.85ms │              39.05ms │     no change │
│ QQuery 7     │  857.14ms │             848.70ms │     no change │
│ QQuery 8     │  228.76ms │             226.56ms │     no change │
│ QQuery 9     │  525.80ms │             507.89ms │     no change │
│ QQuery 10    │  322.86ms │             304.78ms │ +1.06x faster │
│ QQuery 11    │  185.13ms │             157.05ms │ +1.18x faster │
│ QQuery 12    │  158.53ms │             152.98ms │     no change │
│ QQuery 13    │  511.26ms │             254.26ms │ +2.01x faster │
│ QQuery 14    │   44.26ms │              43.50ms │     no change │
│ QQuery 15    │   75.39ms │              45.33ms │ +1.66x faster │
│ QQuery 16    │  196.56ms │             158.71ms │ +1.24x faster │
│ QQuery 17    │ 2260.88ms │             788.95ms │ +2.87x faster │
│ QQuery 18    │ 2375.63ms │            1416.96ms │ +1.68x faster │
│ QQuery 19    │  158.64ms │             150.11ms │ +1.06x faster │
│ QQuery 20    │  830.32ms │             305.56ms │ +2.72x faster │
│ QQuery 21    │  995.44ms │             978.06ms │     no change │
│ QQuery 22    │   84.62ms │              79.60ms │ +1.06x faster │
└──────────────┴───────────┴──────────────────────┴───────────────┘

alamb · 2023-07-08T17:41:26Z

Tracking my plan in #6889

alamb · 2023-07-08T17:53:45Z

I also tried some ClickBench queries and I got a similar speedup (3x) -- I am feeling good about this one

SELECT COUNT(DISTINCT "UserID") FROM 'hits.parquet';

Main:

1 row in set. Query took 8.231 seconds.

This branch

1 row in set. Query took 3.879 seconds.

🚀

Dandandan · 2023-07-08T19:25:59Z

I also tried some ClickBench queries and I got a similar speedup (3x) -- I am feeling good about this one
SELECT COUNT(DISTINCT "UserID") FROM 'hits.parquet';
Main:
1 row in set. Query took 8.231 seconds.
This branch
1 row in set. Query took 3.879 seconds.
🚀

Amazing 🚀 I think for this query we should also consider avoiding the conversion to the row-format as this likely will be one of the more expensive things now.

alamb · 2023-07-09T10:17:43Z

Amazing 🚀 I think for this query we should also consider avoiding the conversion to the row-format as this likely will be one of the more expensive things now.

That is a good idea -- it worked well for sorting as well. I put a note on #6889 to track writing up a real ticket

yahoNanJing · 2023-07-11T02:03:14Z

datafusion/physical-expr/src/aggregate/average.rs

+    counts: Vec<u64>,
+
+    /// Sums per group, stored as the native type
+    sums: Vec<T::Native>,


Is it possible to combine the counts and sums into one property, like avg_states: Vec<(T::Native, u64)>? Since one sum and the related count are always used together, I think it's better to put them together for better cache locality.

FYI @alamb sounds like a useful suggestion

Is it possible to combine the counts and sums into one property, like avg_states: Vec<(T::Native, u64)>? Since one sum and the related count are always used together, I think it's better to put them together for better cache locality.

Thank you for the comment @yahoNanJing

The reason the sums and counts are stored separately is to minimize copying when forming the final output -- since the final output is columnar (two columns) keeping the data as two Vecs allows the final ArrayRefs to be created directly from that data.

It would be an interesting experiment to see if keeping them together and improving cache locality outweighed the extra copy.
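The trade-off between the two layouts can be sketched as follows, with `i64` sums standing in for `T::Native`. `AvgColumns` and `AvgPairs` are hypothetical names used only for this comparison.

```rust
/// Layout used in this PR: two parallel vectors ("struct of arrays").
struct AvgColumns {
    sums: Vec<i64>,
    counts: Vec<u64>,
}

impl AvgColumns {
    fn new(num_groups: usize) -> Self {
        Self { sums: vec![0; num_groups], counts: vec![0; num_groups] }
    }

    /// Each update touches two separate memory regions.
    fn update(&mut self, group_index: usize, value: i64) {
        self.sums[group_index] += value;
        self.counts[group_index] += 1;
    }

    /// The output columns are the vectors themselves: no per-element copy.
    fn into_state(self) -> (Vec<i64>, Vec<u64>) {
        (self.sums, self.counts)
    }
}

/// Suggested alternative: one vector of pairs ("array of structs").
struct AvgPairs {
    states: Vec<(i64, u64)>,
}

impl AvgPairs {
    fn new(num_groups: usize) -> Self {
        Self { states: vec![(0, 0); num_groups] }
    }

    /// Each update touches one adjacent pair, which is friendlier to the cache.
    fn update(&mut self, group_index: usize, value: i64) {
        let state = &mut self.states[group_index];
        state.0 += value;
        state.1 += 1;
    }

    /// Columnar output requires splitting the pairs into two new vectors.
    fn into_state(self) -> (Vec<i64>, Vec<u64>) {
        self.states.into_iter().unzip()
    }
}
```

Which layout wins would depend on how the cost of the final `unzip` compares with the cache misses saved during updates, which is the experiment suggested above.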

BTW if people are looking to optimize the inner loops more, I think removing the bounds checks with unsafe might also help (but I don't plan to pursue it until I find need to optimize more)

So instead of

    let sum = &mut self.sums[group_index];
    *sum = sum.add_wrapping(new_value);

one could write

    unsafe {
        let sum = self.sums.get_unchecked_mut(group_index);
        *sum = sum.add_wrapping(new_value);
    }

Is it possible to make a tuple (T::Native, u64) as a primitive type at the arrow-rs side so that we can create an array of tuple? Then we don't need to return two arrays for the state()

Ah -- I see what you are saying -- I think we could potentially use a StructArray for the state (which would be a single "column" in arrow) but the underlying storage is still two separate contiguous arrays.

Maybe we could use FixedSizeBinaryArray 🤔 and pack/unpack the tuples to the appropriate size

It would be an interesting experiment
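As a sketch of what packing and unpacking the tuples could look like, assuming an `i64` sum and a `u64` count (16 bytes per state, little-endian). `STATE_WIDTH`, `pack`, and `unpack` are hypothetical names, and no arrow-rs API is used here.

```rust
/// Width in bytes of one packed (sum, count) state: 8 + 8.
const STATE_WIDTH: usize = 16;

/// Packs one state into a fixed-width byte value.
fn pack(sum: i64, count: u64) -> [u8; STATE_WIDTH] {
    let mut bytes = [0u8; STATE_WIDTH];
    bytes[..8].copy_from_slice(&sum.to_le_bytes());
    bytes[8..].copy_from_slice(&count.to_le_bytes());
    bytes
}

/// Unpacks one state; `bytes` must hold at least `STATE_WIDTH` bytes.
fn unpack(bytes: &[u8]) -> (i64, u64) {
    let mut sum_bytes = [0u8; 8];
    let mut count_bytes = [0u8; 8];
    sum_bytes.copy_from_slice(&bytes[..8]);
    count_bytes.copy_from_slice(&bytes[8..STATE_WIDTH]);
    (i64::from_le_bytes(sum_bytes), u64::from_le_bytes(count_bytes))
}
```

States packed this way sit back to back in a single buffer, so state `i` occupies bytes `i * STATE_WIDTH..(i + 1) * STATE_WIDTH`.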

I'm afraid both StructArray and FixedSizeBinaryArray may have additional overhead.

If T::Native can be a tuple, then we can provide a new array, called TupleArray. The element type is a tuple, (T::Native, T::Native). Then this tuple can be any nested tuples. And this new TupleArray can cover any nested tuple cases.

It would definitely be a cool thing to try

alamb · 2023-07-11T11:32:17Z

For anyone following along, I have created a proposed PR with these changes that is ready for review: #6904

alamb changed the title ~~(NOT READY FOR REVIEW YET) POC: Demonstrate new GroupHashAggregate stream approach~~ (NOT READY FOR REVIEW YET) POC: Demonstrate new GroupHashAggregate stream approach Jun 29, 2023

github-actions bot added the core Core DataFusion crate label Jun 29, 2023

alamb commented Jun 29, 2023

Dandandan reviewed Jun 29, 2023

github-actions bot added the physical-expr Physical Expressions label Jun 29, 2023

alamb mentioned this pull request Jun 30, 2023

Improve the performance of Aggregator, grouping, aggregation #4973

Closed

4 tasks

alamb force-pushed the alamb/hash_agg_spike branch from 31335b4 to e02c35d Compare June 30, 2023 13:46

alamb commented Jun 30, 2023

alamb changed the title ~~(NOT READY FOR REVIEW YET) POC: Demonstrate new GroupHashAggregate stream approach~~ Demonstrate new GroupHashAggregate stream approach (runs more than 2x faster!) Jun 30, 2023

alamb changed the title ~~Demonstrate new GroupHashAggregate stream approach (runs more than 2x faster!)~~ RFC: Demonstrate new GroupHashAggregate stream approach (runs more than 2x faster!) Jun 30, 2023

alamb requested a review from mingmwang June 30, 2023 17:23

tustvold reviewed Jun 30, 2023

Dandandan reviewed Jul 1, 2023

alamb mentioned this pull request Jul 2, 2023

RowAccumulators support generics #6657

Closed

Dandandan reviewed Jul 2, 2023

alamb added 11 commits July 2, 2023 16:24

POC: Demonstrate new GroupHashAggregate stream approach

9b22745

complete accumulator

4ce6671

touchups

5694190

Add comments

a58b006

Update comments and simplify code

73cb33f

factor out accumulate

0b5d74f

split nullable/non nullable handling

c30874d

Refactor out accumulation in average

2370220

Move accumulator to their own function

26570f9

update more comments

bed990e

Begin writing tests for accumulate

25787a0

Reuse hashes buffer

f2fc450

github-actions bot added the logical-expr Logical plan and expressions label Jul 8, 2023

alamb mentioned this pull request Jul 8, 2023

Performance: Use a specialized sum accumulator for retractable aggregregates #6888

Merged

alamb added 4 commits July 8, 2023 10:55

Complete BoolAnd and BoolOr accumulators

b781910

Fix doc

aebe77f

Merge remote-tracking branch 'apache/main' into alamb/hash_agg_spike

7c17638

Merge branch 'alamb/hash_agg_spike' of github.com:alamb/arrow-datafus…

0a5a749

…ion into alamb/hash_agg_spike

github-actions bot removed the logical-expr Logical plan and expressions label Jul 8, 2023

alamb commented Jul 8, 2023

alamb added 3 commits July 8, 2023 11:29

clippy

f684ae8

Performance: Use a specialized sum accumulator for retractable aggreg…

e798074

…ates

Simplify sum and make it faster

afcab34

alamb mentioned this pull request Jul 8, 2023

Complete Implement fast Vectorized grouping for high cardinality #6889

Closed

11 tasks

This was referenced Jul 9, 2023

TPCH, Query 18 and 17 very slow #5646

Closed

Vectorized hash grouping #6904

Merged

Minor: Add output to aggregrate_fuzz.rs on failure #6905

Merged

Implement fast min/max accumulator for binary / strings (now it uses the slower path) #6906

Open

yahoNanJing reviewed Jul 11, 2023

yahoNanJing mentioned this pull request Jul 11, 2023

Improve the hash join performance by replacing the RawTable to a simple Vec for JoinHashMap #6910

Open

alamb closed this in #6904 Jul 13, 2023

alamb mentioned this pull request Jul 14, 2023

Improve aggregate performance by special casing single group keys #6969

Closed

alamb mentioned this pull request Jul 24, 2023

[EPIC] Improve aggregate performance with adaptive sizing in accumulators / avoiding reallocations in accumulators #7065

Open

2 tasks
