perf: Improve hash agg performance in large-scale data situations #1178
Conversation
Thank you for your insights! The Streaming HashAggregate does not spill; it sends out tuples once the hash table is full. Its performance is closely linked to the distinctness of the data from subnodes within the in-memory hash range, and evaluating that distinctness can be challenging. While HashAggregate may require more space and disk I/O, it can deduplicate more tuples. Conversely, Streaming HashAggregate might incur higher network I/O and deduplicate fewer tuples, but potentially speed up execution of the slice above the motion node. Ultimately, the effectiveness of either approach depends on factors like network and disk performance and, crucially, the distinctness of the subnode data relative to the in-memory limit of the hash table. I made a similar observation in #1173, where we discussed adding a GUC to control Streaming HashAggregate. That would allow users to tailor the approach to their specific environments, which could be beneficial, especially in ORCA. Anyway, your work looks great!
Actually, I was thinking: in CBDB's Streaming Partial HashAggregate, is it really necessary for the following two fields (pInfcount and nInfcount) to be int64? Maybe int32 is enough.
Except for some special cases, it should be very rare for the count of +Inf/-Inf values in an aggregate to exceed 2^32-1. If we do not need compatibility with such corner cases, we could change the type from int64 to int32. Combined with a field reordering like the example below, the struct size can be reduced from 144 bytes to 128 bytes. This would keep the memory usage of numeric aggregation in line with Greenplum, while also improving hash aggregation efficiency.

original layout: 144 bytes
after reordering: 128 bytes
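To make the reordering concrete, here is one possible layout, assuming the field set of NumericAggState in upstream PostgreSQL 14 (MemoryContext and NumericSumAccum come from the PostgreSQL headers; the actual CBDB definition and the eventual patch may differ). Shrinking the two infinity counters to 32 bits and packing them next to maxScale brings the struct from 144 down to 128 bytes on 64-bit platforms:

```c
/* Sketch of one possible reordering, based on the upstream PostgreSQL 14
 * field set; NumericSumAccum is 40 bytes on 64-bit platforms. */
typedef struct NumericAggState
{
	MemoryContext agg_context;	/* context we're calculating in */
	int64		N;				/* count of processed numbers */
	NumericSumAccum sumX;		/* sum of processed numbers */
	NumericSumAccum sumX2;		/* sum of squares of processed numbers */
	int64		maxScaleCount;	/* number of values seen with maximum scale */
	int64		NaNcount;		/* count of NaN values */
	int			maxScale;		/* maximum scale seen so far */
	uint32		pInfcount;		/* count of +Inf values, now 32 bits */
	uint32		nInfcount;		/* count of -Inf values, now 32 bits */
	bool		calcSumX2;		/* if true, calculate sumX2 */
	/* 3 bytes of trailing padding -> 128 bytes total, vs. 144 before */
} NumericAggState;
```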
I'll try to give a performance comparison of this change by next Monday. Then we'll decide how to proceed.
What I want to express is that for Streaming Partial HashAggregate, the intermediate-state data structure NumericAggState used in the first-stage aggregation doesn't need to be the same as the state data structure used in the final-stage aggregation. The final-stage state should indeed stay consistent with PG, and the two fields mentioned above could indeed exceed the maximum value of int32 there. However, for Streaming Partial HashAggregate we can apply some special handling: when either of these two fields reaches the int32 maximum, we can stream out the results of the Partial HashAggregate and then rebuild a new partial aggregate state starting from zero. Considering how large the int32 range is, even in extreme cases with a very high number of tuples the overall execution performance won't actually decline, because the cost that NumericAggState adds to the motion is almost negligible compared to scanning and aggregating more than 2^32 - 1 tuples.
In other words, for the intermediate-state data structure NumericAggState in the first-stage aggregation, all of the int64 fields inside could actually be changed to int32, not just the two above.
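Here is a rough, self-contained sketch of the saturate-and-flush idea described above; all of the names are hypothetical, and the real change would live in the executor's aggregate transition path:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical partial-aggregate state with a 32-bit infinity counter. */
typedef struct PartialNumericState
{
	int32_t		pInfcount;		/* count of +Inf inputs seen so far */
	/* ... other accumulator fields ... */
} PartialNumericState;

/* Hypothetical stand-in for emitting the partial result to the motion node. */
static void
stream_out_partial(const PartialNumericState *state)
{
	printf("emit partial result: pInfcount=%d\n", state->pInfcount);
}

/* Accumulate one +Inf input; flush and reset before the counter can overflow,
 * so the finalize stage simply combines one extra partial row. */
static void
accum_pinf(PartialNumericState *state)
{
	if (state->pInfcount == INT32_MAX)
	{
		stream_out_partial(state);			/* hand the current partial downstream */
		memset(state, 0, sizeof(*state));	/* rebuild the partial state from zero */
	}
	state->pInfcount++;
}
```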
This approach is challenging. For instance, both the Finalize HashAggregate and Partial HashAggregate phases rely on the same workflow—advance_aggregates → xxx → do_numeric_accum—to perform numeric accumulation and update the aggregate state. If we were to use different NumericAggState structures for different phases, we would need to refactor a large number of numeric-related functions to support a more compact state data structure.
I took a look at the latest PostgreSQL code (PostgreSQL 18 beta). Although pInfcount and nInfcount in NumericAggState are defined as int64, they are effectively used as boolean flags in practice. I'm considering starting a discussion upstream to propose changing these two fields to bool in order to reduce the memory footprint of the NumericAggState struct. If there is consensus, I'd like to drive this change upstream.
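For illustration, the numeric aggregate final functions in upstream numeric.c only ever ask whether those counters are non-zero; a simplified, self-contained paraphrase of that check:

```c
#include <stdint.h>

/* Simplified paraphrase of the checks performed by the numeric aggregate
 * final functions in upstream numeric.c: in those functions the infinity
 * counters are only ever compared against zero. */
typedef enum { NUM_FINITE, NUM_NAN, NUM_PINF, NUM_NINF } SpecialResult;

static SpecialResult
classify_special_inputs(int64_t NaNcount, int64_t pInfcount, int64_t nInfcount)
{
	if (NaNcount > 0)
		return NUM_NAN;			/* any NaN input makes the result NaN */
	if (pInfcount > 0 && nInfcount > 0)
		return NUM_NAN;			/* +Inf and -Inf together also give NaN */
	if (pInfcount > 0)
		return NUM_PINF;
	if (nInfcount > 0)
		return NUM_NINF;
	return NUM_FINITE;			/* fall through to the normal finite result */
}
```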
During the TPC-DS tests, we observed an issue where CloudberryDB (CBDB) performed worse than Greenplum (GPDB) when the query plan generated a multi-stage aggregation (e.g., TPC-DS query 04). The deduplication effect of CBDB's **Streaming Partial HashAggregate** was significantly worse, so the **Finalize HashAggregate** operator in CBDB processed significantly more data than GPDB on the same dataset.

Example plan from CBDB:

```
Gather Motion 32:1 (slice1; segments: 32)  (cost=0.00..19988.81 rows=1800000 width=76)
  ->  Finalize HashAggregate  (cost=0.00..19663.99 rows=56250 width=81)
        Group Key: customer_gp.c_customer_id, customer_gp.c_first_name, customer_gp.c_last_name, customer_gp.c_preferred_cust_flag, customer_gp.c_birth_country, customer_gp.c_login, customer_gp.c_email_address, date_dim_gp.d_year
        ->  Redistribute Motion 32:32 (slice2; segments: 32)  (cost=0.00..19603.35 rows=56250 width=81)
              Hash Key: customer_gp.c_customer_id, customer_gp.c_first_name, customer_gp.c_last_name, customer_gp.c_preferred_cust_flag, customer_gp.c_birth_country, customer_gp.c_login, customer_gp.c_email_address, date_dim_gp.d_year
              ->  Streaming Partial HashAggregate  (cost=0.00..19589.09 rows=56250 width=81)
                    Group Key: customer_gp.c_customer_id, customer_gp.c_first_name, customer_gp.c_last_name, customer_gp.c_preferred_cust_flag, customer_gp.c_birth_country, customer_gp.c_login, customer_gp.c_email_address, date_dim_gp.d_year
                    ->  Hash Join  (cost=0.00..12346.24 rows=6935137 width=95)
                          Hash Cond: (store_sales_gp.ss_customer_sk = customer_gp.c_customer_sk)
                          ...
...
```

Upon further investigation, we found that the **`NumericAggState`** structure in CloudberryDB contains two additional fields compared to Greenplum:

```c
int64		pInfcount;		/* count of +Inf values */
int64		nInfcount;		/* count of -Inf values */
```

These fields were introduced in PostgreSQL 14 to support +/-Inf values for numeric types. Consequently, the size of the `NumericAggState` structure increased from 128 bytes to 144 bytes.

In the Streaming Partial HashAggregate, a `NumericAggState` structure is created for each grouping key to track statistics for numeric types. As a result, CBDB allocates 16 more bytes per grouping key than GPDB, and this additional memory consumption contributes to the observed performance difference.

To address this issue, we need to adjust `aggstate->hash_mem_limit`. Inspired by PostgreSQL's handling of `hash_mem_limit`, we introduce a scaling factor called `hash_mem_multiplier`. Following the changes made in PostgreSQL 15, we set the default value of `hash_mem_multiplier` to 2.0 to ensure better memory utilization during hash-based aggregation operations.

Upstream discussion: https://www.postgresql.org/message-id/flat/CAH2-Wzndc_ROk6CY-bC6p9O53q974Y0Ey4WX8jcPbuTZYM4Q3A%40mail.gmail.com
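For reference, upstream PostgreSQL applies the multiplier when sizing the in-memory hash table; the sketch below follows the shape of PostgreSQL's get_hash_memory_limit() (CBDB derives its budget from statement_mem, so the exact inputs here are only illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: how a hash_mem_multiplier-style GUC scales the hash
 * memory budget, following the shape of PostgreSQL's get_hash_memory_limit().
 * work_mem_kb plays the role of work_mem/statement_mem (in kilobytes). */
static size_t
hash_memory_limit(int work_mem_kb, double hash_mem_multiplier)
{
	/* do the arithmetic in double precision to avoid intermediate overflow */
	double		mem_limit = (double) work_mem_kb * hash_mem_multiplier * 1024.0;

	/* clamp in case the result does not fit in size_t */
	if (mem_limit > (double) SIZE_MAX)
		mem_limit = (double) SIZE_MAX;

	return (size_t) mem_limit;
}
```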
…functions: reduce memory usage and improve performance of numeric aggregate functions by changing the infinity counters from int64 to uint32. Adds overflow protection and retains compatibility with typical workloads. Introduces a new GUC parameter, gp_enable_numeric_inf_count_check, to control overflow checking for the infinity counters. When disabled, the system treats all infinity values as NaN to avoid incorrect results caused by counter overflow in cases with extremely large infinity counts.
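Reading the commit message, the protection presumably works along these lines; the sketch below is only an interpretation (the GUC name is taken from the message, while every other name and the exact behavior are guesses):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Interpretation only: not the patch's actual code. */
static bool gp_enable_numeric_inf_count_check = true;	/* GUC named in the commit message */

typedef struct InfCounters
{
	uint32_t	pInfcount;		/* +Inf count, shrunk from int64 to 32 bits */
	uint32_t	nInfcount;		/* -Inf count, shrunk from int64 to 32 bits */
	uint64_t	NaNcount;		/* NaN count */
} InfCounters;

static void
count_positive_inf(InfCounters *c)
{
	if (!gp_enable_numeric_inf_count_check)
	{
		/* Checking disabled: treat the infinity as NaN so a wrapped counter
		 * can never yield an incorrect +Inf/-Inf result. */
		c->NaNcount++;
		return;
	}
	if (c->pInfcount == UINT32_MAX)
	{
		/* Checking enabled: refuse to wrap (stand-in for ereport(ERROR, ...)). */
		fprintf(stderr, "+Inf counter would overflow\n");
		exit(EXIT_FAILURE);
	}
	c->pInfcount++;
}
```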
With the hash_mem_multiplier change described above, in actual TPC-DS tests we found that under most of the same statement_mem settings, setting hash_mem_multiplier = 1.5 brings Greenplum and CloudberryDB to similar performance.
Fixes #ISSUE_Number
What does this PR do?
Type of Change
Breaking Changes
Test Plan
make installcheck
make -C src/test installcheck-cbdb-parallel
Impact
Performance:
User-facing changes:
Dependencies:
Checklist
Additional Context
CI Skip Instructions