perf: Improve hash agg performance in large-scale data situations #1178

Open · wants to merge 2 commits into main

Conversation

@gongxun0928 (Contributor) commented Jun 20, 2025

During our TPC-DS tests, we observed that CloudberryDB (CBDB) performed worse than Greenplum (GPDB) when the query plan generated a multi-stage aggregation (e.g., TPC-DS query 04). The deduplication achieved by CBDB's Streaming Partial HashAggregate was significantly worse, so the Finalize HashAggregate operator in CBDB processed far more data than GPDB's did on the same dataset.

Example plan from CBDB:
Gather Motion 32:1  (slice1; segments: 32)  (cost=0.00..19988.81 rows=1800000 width=76)
      ->  Finalize HashAggregate  (cost=0.00..19663.99 rows=56250 width=81)
            Group Key: customer_gp.c_customer_id, customer_gp.c_first_name,
                  customer_gp.c_last_name, customer_gp.c_preferred_cust_flag,
                  customer_gp.c_birth_country, customer_gp.c_login,
                  customer_gp.c_email_address, date_dim_gp.d_year
            ->  Redistribute Motion 32:32  (slice2; segments: 32)  (cost=0.00..19603.35 rows=56250 width=81)
                  Hash Key: customer_gp.c_customer_id, customer_gp.c_first_name, customer_gp.c_last_name,
                        customer_gp.c_preferred_cust_flag, customer_gp.c_birth_country, customer_gp.c_login,
                        customer_gp.c_email_address, date_dim_gp.d_year
                  ->  Streaming Partial HashAggregate  (cost=0.00..19589.09 rows=56250 width=81)
                        Group Key: customer_gp.c_customer_id, customer_gp.c_first_name, customer_gp.c_last_name,
                              customer_gp.c_preferred_cust_flag, customer_gp.c_birth_country, customer_gp.c_login,
                              customer_gp.c_email_address, date_dim_gp.d_year
                        ->  Hash Join  (cost=0.00..12346.24 rows=6935137 width=95)
                              Hash Cond: (store_sales_gp.ss_customer_sk = customer_gp.c_customer_sk)
                              ...
                              ...

Upon further investigation, we found that the NumericAggState structure in CloudberryDB contained two additional fields compared to Greenplum:

int64 pInfcount; /* count of +Inf values */
int64 nInfcount; /* count of -Inf values */

These fields were introduced in PostgreSQL 14 to support +/- Inf values for numeric types. Consequently, the size of the NumericAggState structure increased from 128 bytes to 144 bytes.

In the Streaming Partial HashAggregate, a NumericAggState structure is created for each group to hold the transition state of the numeric aggregates, so CBDB allocates 16 more bytes per group than GPDB. With the same memory budget, fewer groups fit in the in-memory hash table, the streaming partial aggregate starts emitting tuples earlier, and it deduplicates less. This explains the observed performance difference; a rough illustration follows.
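
The numbers in the sketch below are purely illustrative assumptions (a 32 MB hash budget; per-entry hash-table and group-key overhead is ignored); it only shows why a larger per-group transition state fills the table sooner:

```c
#include <stdio.h>

/* Illustrative only: with the same in-memory budget, a 144-byte per-group
 * transition state holds fewer groups than a 128-byte one, so the streaming
 * partial aggregate starts emitting (and stops deduplicating) sooner.
 * The 32 MB budget is a made-up figure. */
int
main(void)
{
	const double budget_bytes = 32.0 * 1024 * 1024;

	printf("groups that fit at 128 B/state: %.0f\n", budget_bytes / 128.0);
	printf("groups that fit at 144 B/state: %.0f\n", budget_bytes / 144.0);
	return 0;
}
```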

To address this issue, we need to adjust aggstate->hash_mem_limit. Inspired by PostgreSQL’s handling of hash memory, we introduce a scaling factor called hash_mem_multiplier. Following the change made in PostgreSQL 15, we should set its default value to 2.0 to ensure better memory utilization during hash-based aggregation. In our TPC-DS tests, under most of the statement_mem settings we tried (the same value on both systems), setting hash_mem_multiplier = 1.5 brought CloudberryDB's performance in line with Greenplum's.

https://www.postgresql.org/message-id/flat/CAH2-Wzndc_ROk6CY-bC6p9O53q974Y0Ey4WX8jcPbuTZYM4Q3A%40mail.gmail.com
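
For reference, the sketch below is modeled on how PostgreSQL 15's get_hash_memory_limit() applies the multiplier; the variable names here (operator_mem_kb) are placeholders for illustration, not CBDB's actual GUC symbols:

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch modeled on PostgreSQL 15's get_hash_memory_limit(): scale the
 * per-operator memory budget (in kB) by hash_mem_multiplier and clamp so
 * the conversion to size_t cannot overflow. Variable names are placeholders,
 * not the actual CBDB GUC symbols. */
static int	  operator_mem_kb = 128 * 1024;	/* e.g. the operator's share of statement_mem */
static double hash_mem_multiplier = 2.0;	/* proposed default */

static size_t
hash_mem_limit_bytes(void)
{
	double		mem_limit = (double) operator_mem_kb * hash_mem_multiplier * 1024.0;

	if (mem_limit > (double) SIZE_MAX)
		mem_limit = (double) SIZE_MAX;

	return (size_t) mem_limit;
}

int
main(void)
{
	printf("hash mem limit: %zu bytes\n", hash_mem_limit_bytes());
	return 0;
}
```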

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


@avamingli (Contributor) commented:

> The phenomenon showed that the deduplication effect of CBDB’s Streaming Partial HashAggregate was significantly worse. As a result, the Finalize HashAggregate operator in CBDB processed significantly more data compared to GPDB under the same dataset.

Thank you for your insights!

The Streaming HashAggregate does not spill; it sends out tuples once the hash table is full. Its performance is closely linked to the distinct property of the data from subnodes within the in-memory hash range. However, evaluating this distinct property can be challenging.

While HashAggregate may require more space and disk I/O, it can deduplicate more tuples. Conversely, Streaming HashAggregate might incur higher network I/O, deduplicating fewer tuples but potentially speeding up execution of the slice above the motion node. Ultimately, the effectiveness of either approach depends on factors like network and disk performance and, crucially, the distinct property of the subnode data, especially given the in-memory limit of the hash table.
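
As a rough mental model only, this self-contained toy simulation (not the actual executor code; all numbers are made up) shows why the streaming variant sends more rows upstream as fewer groups fit in memory:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NKEYS  1000		/* distinct group keys in the made-up input */
#define NROWS  100000	/* input rows */

/*
 * Toy model of a streaming partial aggregate: when the in-memory table is
 * full and a new group arrives, emit everything and start over, so
 * duplicates that span a flush are no longer combined. A spilling
 * HashAggregate would instead emit at most NKEYS fully deduplicated rows.
 */
static int
streaming_rows_emitted(const int *keys, int nrows, int capacity)
{
	char	   *present = calloc(NKEYS, 1);
	int			ngroups = 0;
	int			emitted = 0;

	for (int i = 0; i < nrows; i++)
	{
		int			k = keys[i];

		if (!present[k])
		{
			if (ngroups == capacity)
			{
				emitted += ngroups;		/* flush partial groups upstream */
				memset(present, 0, NKEYS);
				ngroups = 0;
			}
			present[k] = 1;
			ngroups++;
		}
		/* else: the row is absorbed into an existing in-memory group */
	}
	emitted += ngroups;
	free(present);
	return emitted;
}

int
main(void)
{
	int		   *keys = malloc(sizeof(int) * NROWS);

	srand(42);
	for (int i = 0; i < NROWS; i++)
		keys[i] = rand() % NKEYS;

	/* Fewer in-memory slots (e.g. a bigger per-group state) => more flushes
	 * => more rows sent on to the finalize stage. */
	printf("capacity 900: %d rows emitted\n", streaming_rows_emitted(keys, NROWS, 900));
	printf("capacity 800: %d rows emitted\n", streaming_rows_emitted(keys, NROWS, 800));
	free(keys);
	return 0;
}
```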

I encountered a similar observation in #1173, where we discussed adding a GUC to control Streaming HashAggregate. This allows users to tailor the approach to their specific environments, which could be beneficial, especially in ORCA.

Anyway, your work looks great!

@avamingli added the type: Performance and executor labels on Jun 20, 2025
@jianlirong commented:

Actually, I was thinking, in CBDB's Streaming Partial HashAggregate, is it really necessary for the following two fields to be int64? Maybe int32 is enough.

int64 pInfcount; /* count of +Inf values */
int64 nInfcount; /* count of -Inf values */

@gongxun0928 (Contributor, Author) commented:

> Actually, I was thinking, in CBDB's Streaming Partial HashAggregate, is it really necessary for the following two fields to be int64? Maybe int32 is enough.
>
> int64 pInfcount; /* count of +Inf values */
> int64 nInfcount; /* count of -Inf values */

Except for some special cases, it should be very rare for the count of +Inf/-Inf values in an aggregate to exceed 2^32-1. If we do not consider compatibility with such special cases, we could change the type from int64 to int32. Combined with a field reordering like the example below, the struct size can be reduced from 144 bytes to 128 bytes. This would keep the memory usage of numeric aggregation in line with Greenplum while also improving hash aggregation efficiency.

Original: 144 bytes

typedef struct NumericAggState
{
	bool		calcSumX2;		/* if true, calculate sumX2 */
	MemoryContext agg_context;	/* context we're calculating in */
	int64		N;				/* count of processed numbers */
	NumericSumAccum sumX;		/* sum of processed numbers */
	NumericSumAccum sumX2;		/* sum of squares of processed numbers */
	int			maxScale;		/* maximum scale seen so far */
	int64		maxScaleCount;	/* number of values seen with maximum scale */
	/* These counts are *not* included in N!  Use NA_TOTAL_COUNT() as needed */
	int64		NaNcount;		/* count of NaN values */
	int64		pInfcount;		/* count of +Inf values */
	int64		nInfcount;		/* count of -Inf values */
} NumericAggState;

Reordered: 128 bytes

typedef struct NumericAggState
{
	bool		calcSumX2;		/* if true, calculate sumX2 */
	int32		maxScale;		/* maximum scale seen so far */
	int64		maxScaleCount;	/* number of values seen with maximum scale */
	MemoryContext agg_context;	/* context we're calculating in */
	int64		N;				/* count of processed numbers */
	NumericSumAccum sumX;		/* sum of processed numbers */
	NumericSumAccum sumX2;		/* sum of squares of processed numbers */
	int64		NaNcount;		/* count of NaN values */
	int32		pInfcount;		/* count of +Inf values */
	int32		nInfcount;		/* count of -Inf values */
} NumericAggState;
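
A quick standalone check of these sizes (a sketch with NumericSumAccum stubbed to the field layout from numeric.c, about 40 bytes on a typical LP64 platform; the in-tree definitions should be used for a real check):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stub matching the size/alignment of NumericSumAccum in numeric.c
 * (four ints, a bool, two pointers): 40 bytes on LP64. */
typedef struct NumericSumAccumStub
{
	int			ndigits;
	int			weight;
	int			dscale;
	int			num_uncarried;
	bool		have_carry_space;
	int32_t    *pos_digits;
	int32_t    *neg_digits;
} NumericSumAccumStub;

typedef struct OriginalAggState	/* current layout: expect 144 bytes */
{
	bool		calcSumX2;
	void	   *agg_context;
	int64_t		N;
	NumericSumAccumStub sumX;
	NumericSumAccumStub sumX2;
	int			maxScale;
	int64_t		maxScaleCount;
	int64_t		NaNcount;
	int64_t		pInfcount;
	int64_t		nInfcount;
} OriginalAggState;

typedef struct ReorderedAggState	/* proposed layout: expect 128 bytes */
{
	bool		calcSumX2;
	int32_t		maxScale;
	int64_t		maxScaleCount;
	void	   *agg_context;
	int64_t		N;
	NumericSumAccumStub sumX;
	NumericSumAccumStub sumX2;
	int64_t		NaNcount;
	int32_t		pInfcount;
	int32_t		nInfcount;
} ReorderedAggState;

int
main(void)
{
	printf("original:  %zu bytes\n", sizeof(OriginalAggState));
	printf("reordered: %zu bytes\n", sizeof(ReorderedAggState));
	return 0;
}
```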

I'll try to give a performance comparison of this change by next Monday. Then we'll decide how to proceed.

@jianlirong commented:

What I want to express is that for the Streaming Partial HashAggregate, the intermediate-state data structure NumericAggState used in the first-stage aggregation does not need to be the same as the state structure used in the final-stage aggregation. The final-stage structure should indeed stay consistent with PostgreSQL, since the two fields above could exceed the maximum value of int32.

For the Streaming Partial HashAggregate, however, we can add special handling: when either of the two fields reaches the int32 maximum, we can stream out the current partial results and then rebuild the partial aggregate state from zero. Since int32 is already very large, overall execution performance would not actually decline even in extreme cases with huge tuple counts, because the extra motion cost from NumericAggState is almost negligible compared to scanning and aggregating more than 2^32 - 1 tuples.

@jianlirong commented Jun 20, 2025

In other words, for the intermediate-state data structure NumericAggState in the first-stage aggregation, all of the int64 fields could actually be changed to int32, not just the two above.

@gongxun0928 (Contributor, Author) commented:

> In other words, for the intermediate-state data structure NumericAggState in the first-stage aggregation, all of the int64 fields could actually be changed to int32, not just the two above.

This approach is challenging. Both the Finalize HashAggregate and Partial HashAggregate phases rely on the same workflow (advance_aggregates → xxx → do_numeric_accum) to perform numeric accumulation and update the aggregate state. If we used different NumericAggState structures for different phases, we would need to refactor a large number of numeric-related functions to support the more compact state structure. In addition, finalize_aggregates would need to select different serialization functions depending on the aggsplit type. All of these changes would diverge significantly from upstream PostgreSQL and make future version upgrades harder to manage.

@gongxun0928 (Contributor, Author) commented Jun 23, 2025

I took a look at the latest PostgreSQL code (PostgreSQL 18 beta). Although pInfcount and nInfcount in NumericAggState are defined as int64, in practice they are mostly used as boolean flags. I'm considering starting a discussion upstream to propose changing these two fields to bool in order to reduce the memory footprint of the NumericAggState struct. If there is consensus, I’d like to drive this change upstream.

Update:
The only scenario where we actually need to increment and decrement the infinity counts is sliding-window (moving) aggregation, where previous statistics are reused. Even in that case, pInfcount and nInfcount will never exceed 2^32-1 in practice, so a 32-bit integer should be sufficient.

@gongxun0928 force-pushed the performance/improve-hash-agg-performance-in-large-scale-data branch from 330c593 to 9e16c2a on June 23, 2025

Reduce memory usage and improve the performance of numeric aggregate functions by changing the infinity counters from int64 to uint32. This adds overflow protection and retains compatibility with typical workloads.

Introduce a new GUC parameter, gp_enable_numeric_inf_count_check, to control overflow checking for the infinity counters. When disabled, the system treats all infinity values as NaN to avoid incorrect results caused by counter overflow in cases with extremely large infinity counts.
@gongxun0928 force-pushed the performance/improve-hash-agg-performance-in-large-scale-data branch from 9e16c2a to d8b983d on July 2, 2025