perf: Improve hash agg performance in large-scale data situations #1178

Open · wants to merge 2 commits into main

Conversation

@gongxun0928 (Contributor) commented Jun 20, 2025

During our TPC-DS tests, we observed that CloudberryDB (CBDB) performed worse than Greenplum (GPDB) when the query plan generated a multi-stage aggregation (e.g., TPC-DS query 04). The deduplication achieved by CBDB's Streaming Partial HashAggregate was significantly worse, so the Finalize HashAggregate operator in CBDB processed far more data than GPDB's did on the same dataset.

Example plan from CBDB:
Gather Motion 32:1  (slice1; segments: 32)  (cost=0.00..19988.81 rows=1800000 width=76)
      ->  Finalize HashAggregate  (cost=0.00..19663.99 rows=56250 width=81)
            Group Key: customer_gp.c_customer_id, customer_gp.c_first_name,
                  customer_gp.c_last_name, customer_gp.c_preferred_cust_flag,
                  customer_gp.c_birth_country, customer_gp.c_login,
                  customer_gp.c_email_address, date_dim_gp.d_year
            ->  Redistribute Motion 32:32  (slice2; segments: 32)  (cost=0.00..19603.35 rows=56250 width=81)
                  Hash Key: customer_gp.c_customer_id, customer_gp.c_first_name, customer_gp.c_last_name,
                        customer_gp.c_preferred_cust_flag, customer_gp.c_birth_country, customer_gp.c_login,
                        customer_gp.c_email_address, date_dim_gp.d_year
                  ->  Streaming Partial HashAggregate  (cost=0.00..19589.09 rows=56250 width=81)
                        Group Key: customer_gp.c_customer_id, customer_gp.c_first_name, customer_gp.c_last_name,
                              customer_gp.c_preferred_cust_flag, customer_gp.c_birth_country, customer_gp.c_login,
                              customer_gp.c_email_address, date_dim_gp.d_year
                        ->  Hash Join  (cost=0.00..12346.24 rows=6935137 width=95)
                              Hash Cond: (store_sales_gp.ss_customer_sk = customer_gp.c_customer_sk)
                              ...
                              ...

Upon further investigation, we found that the NumericAggState structure in CloudberryDB contained two additional fields compared to Greenplum:

int64 pInfcount; /* count of +Inf values */
int64 nInfcount; /* count of -Inf values */

These fields were introduced in PostgreSQL 14 to support +/- Inf values for numeric types. Consequently, the size of the NumericAggState structure increased from 128 bytes to 144 bytes.

In the Streaming Partial HashAggregate, a NumericAggState structure is created for each group to hold the transition state of the numeric aggregates, so CBDB allocates 16 more bytes per group than GPDB. With the same memory budget, fewer groups fit in the in-memory hash table, the streaming partial aggregate starts emitting tuples earlier, and it deduplicates less. This explains the observed performance difference; a rough illustration follows.
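
The numbers in the sketch below are purely illustrative assumptions (a 32 MB hash budget; per-entry hash-table and group-key overhead is ignored); it only shows why a larger per-group transition state fills the table sooner:

```c
#include <stdio.h>

/* Illustrative only: with the same in-memory budget, a 144-byte per-group
 * transition state holds fewer groups than a 128-byte one, so the streaming
 * partial aggregate starts emitting (and stops deduplicating) sooner.
 * The 32 MB budget is a made-up figure. */
int
main(void)
{
	const double budget_bytes = 32.0 * 1024 * 1024;

	printf("groups that fit at 128 B/state: %.0f\n", budget_bytes / 128.0);
	printf("groups that fit at 144 B/state: %.0f\n", budget_bytes / 144.0);
	return 0;
}
```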

To address this issue, we need to adjust aggstate->hash_mem_limit. Inspired by PostgreSQL’s handling of hash memory, we introduce a scaling factor called hash_mem_multiplier. Following the change made in PostgreSQL 15, we should set its default value to 2.0 to ensure better memory utilization during hash-based aggregation. In our TPC-DS tests, under most of the statement_mem settings we tried (the same value on both systems), setting hash_mem_multiplier = 1.5 brought CloudberryDB's performance in line with Greenplum's.

https://www.postgresql.org/message-id/flat/CAH2-Wzndc_ROk6CY-bC6p9O53q974Y0Ey4WX8jcPbuTZYM4Q3A%40mail.gmail.com
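
For reference, the sketch below is modeled on how PostgreSQL 15's get_hash_memory_limit() applies the multiplier; the variable names here (operator_mem_kb) are placeholders for illustration, not CBDB's actual GUC symbols:

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch modeled on PostgreSQL 15's get_hash_memory_limit(): scale the
 * per-operator memory budget (in kB) by hash_mem_multiplier and clamp so
 * the conversion to size_t cannot overflow. Variable names are placeholders,
 * not the actual CBDB GUC symbols. */
static int	  operator_mem_kb = 128 * 1024;	/* e.g. the operator's share of statement_mem */
static double hash_mem_multiplier = 2.0;	/* proposed default */

static size_t
hash_mem_limit_bytes(void)
{
	double		mem_limit = (double) operator_mem_kb * hash_mem_multiplier * 1024.0;

	if (mem_limit > (double) SIZE_MAX)
		mem_limit = (double) SIZE_MAX;

	return (size_t) mem_limit;
}

int
main(void)
{
	printf("hash mem limit: %zu bytes\n", hash_mem_limit_bytes());
	return 0;
}
```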

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


@avamingli (Contributor) commented:

> The phenomenon showed that the deduplication effect of CBDB’s Streaming Partial HashAggregate was significantly worse. As a result, the Finalize HashAggregate operator in CBDB processed significantly more data compared to GPDB under the same dataset.

Thank you for your insights!

The Streaming HashAggregate does not spill; it sends out tuples once the hash table is full. Its performance is closely linked to the distinct property of the data from subnodes within the in-memory hash range. However, evaluating this distinct property can be challenging.

While HashAggregate may require more space and disk I/O, it can deduplicate more tuples. Conversely, Streaming HashAggregate might incur higher network I/O, deduplicating fewer tuples but potentially speeding up execution of the slice above the motion node. Ultimately, the effectiveness of either approach depends on factors like network and disk performance and, crucially, the distinct property of the subnode data, especially given the in-memory limit of the hash table.
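
As a rough mental model only, this self-contained toy simulation (not the actual executor code; all numbers are made up) shows why the streaming variant sends more rows upstream as fewer groups fit in memory:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NKEYS  1000		/* distinct group keys in the made-up input */
#define NROWS  100000	/* input rows */

/*
 * Toy model of a streaming partial aggregate: when the in-memory table is
 * full and a new group arrives, emit everything and start over, so
 * duplicates that span a flush are no longer combined. A spilling
 * HashAggregate would instead emit at most NKEYS fully deduplicated rows.
 */
static int
streaming_rows_emitted(const int *keys, int nrows, int capacity)
{
	char	   *present = calloc(NKEYS, 1);
	int			ngroups = 0;
	int			emitted = 0;

	for (int i = 0; i < nrows; i++)
	{
		int			k = keys[i];

		if (!present[k])
		{
			if (ngroups == capacity)
			{
				emitted += ngroups;		/* flush partial groups upstream */
				memset(present, 0, NKEYS);
				ngroups = 0;
			}
			present[k] = 1;
			ngroups++;
		}
		/* else: the row is absorbed into an existing in-memory group */
	}
	emitted += ngroups;
	free(present);
	return emitted;
}

int
main(void)
{
	int		   *keys = malloc(sizeof(int) * NROWS);

	srand(42);
	for (int i = 0; i < NROWS; i++)
		keys[i] = rand() % NKEYS;

	/* Fewer in-memory slots (e.g. a bigger per-group state) => more flushes
	 * => more rows sent on to the finalize stage. */
	printf("capacity 900: %d rows emitted\n", streaming_rows_emitted(keys, NROWS, 900));
	printf("capacity 800: %d rows emitted\n", streaming_rows_emitted(keys, NROWS, 800));
	free(keys);
	return 0;
}
```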

I encountered a similar observation in #1173, where we discussed adding a GUC to control Streaming HashAggregate. This allows users to tailor the approach to their specific environments, which could be beneficial, especially in ORCA.

Anyway, your work looks great!

@avamingli added the type: Performance and executor labels on Jun 20, 2025
@jianlirong commented:

Actually, I was thinking, in CBDB's Streaming Partial HashAggregate, is it really necessary for the following two fields to be int64? Maybe int32 is enough.

int64 pInfcount; /* count of +Inf values */
int64 nInfcount; /* count of -Inf values */

@gongxun0928 (Contributor, Author) commented:

> Actually, I was thinking, in CBDB's Streaming Partial HashAggregate, is it really necessary for the following two fields to be int64? Maybe int32 is enough.
>
> int64 pInfcount; /* count of +Inf values */
> int64 nInfcount; /* count of -Inf values */

Except for some special cases, it should be very rare for the count of +Inf/-Inf values in an aggregate to exceed 2^32-1. If we do not consider compatibility with such special cases, we could change the type from int64 to int32. Combined with a field reordering like the example below, the struct size can be reduced from 144 bytes to 128 bytes. This would keep the memory usage of numeric aggregation in line with Greenplum while also improving hash aggregation efficiency.

Original: 144 bytes

typedef struct NumericAggState
{
	bool		calcSumX2;		/* if true, calculate sumX2 */
	MemoryContext agg_context;	/* context we're calculating in */
	int64		N;				/* count of processed numbers */
	NumericSumAccum sumX;		/* sum of processed numbers */
	NumericSumAccum sumX2;		/* sum of squares of processed numbers */
	int			maxScale;		/* maximum scale seen so far */
	int64		maxScaleCount;	/* number of values seen with maximum scale */
	/* These counts are *not* included in N!  Use NA_TOTAL_COUNT() as needed */
	int64		NaNcount;		/* count of NaN values */
	int64		pInfcount;		/* count of +Inf values */
	int64		nInfcount;		/* count of -Inf values */
} NumericAggState;

Reordered: 128 bytes

typedef struct NumericAggState
{
	bool		calcSumX2;		/* if true, calculate sumX2 */
	int32		maxScale;		/* maximum scale seen so far */
	int64		maxScaleCount;	/* number of values seen with maximum scale */
	MemoryContext agg_context;	/* context we're calculating in */
	int64		N;				/* count of processed numbers */
	NumericSumAccum sumX;		/* sum of processed numbers */
	NumericSumAccum sumX2;		/* sum of squares of processed numbers */
	int64		NaNcount;		/* count of NaN values */
	int32		pInfcount;		/* count of +Inf values */
	int32		nInfcount;		/* count of -Inf values */
} NumericAggState;
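
A quick standalone check of these sizes (a sketch with NumericSumAccum stubbed to the field layout from numeric.c, about 40 bytes on a typical LP64 platform; the in-tree definitions should be used for a real check):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stub matching the size/alignment of NumericSumAccum in numeric.c
 * (four ints, a bool, two pointers): 40 bytes on LP64. */
typedef struct NumericSumAccumStub
{
	int			ndigits;
	int			weight;
	int			dscale;
	int			num_uncarried;
	bool		have_carry_space;
	int32_t    *pos_digits;
	int32_t    *neg_digits;
} NumericSumAccumStub;

typedef struct OriginalAggState	/* current layout: expect 144 bytes */
{
	bool		calcSumX2;
	void	   *agg_context;
	int64_t		N;
	NumericSumAccumStub sumX;
	NumericSumAccumStub sumX2;
	int			maxScale;
	int64_t		maxScaleCount;
	int64_t		NaNcount;
	int64_t		pInfcount;
	int64_t		nInfcount;
} OriginalAggState;

typedef struct ReorderedAggState	/* proposed layout: expect 128 bytes */
{
	bool		calcSumX2;
	int32_t		maxScale;
	int64_t		maxScaleCount;
	void	   *agg_context;
	int64_t		N;
	NumericSumAccumStub sumX;
	NumericSumAccumStub sumX2;
	int64_t		NaNcount;
	int32_t		pInfcount;
	int32_t		nInfcount;
} ReorderedAggState;

int
main(void)
{
	printf("original:  %zu bytes\n", sizeof(OriginalAggState));
	printf("reordered: %zu bytes\n", sizeof(ReorderedAggState));
	return 0;
}
```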

I'll try to give a performance comparison of this change by next Monday. Then we'll decide how to proceed.

@jianlirong commented:

What I want to express is that for the Streaming Partial HashAggregate, the intermediate-state data structure NumericAggState used in the first-stage aggregation does not need to be the same as the state structure used in the final-stage aggregation. The final-stage structure should indeed stay consistent with PostgreSQL, since the two fields above could exceed the maximum value of int32.

For the Streaming Partial HashAggregate, however, we can add special handling: when either of the two fields reaches the int32 maximum, we can stream out the current partial results and then rebuild the partial aggregate state from zero. Since int32 is already very large, overall execution performance would not actually decline even in extreme cases with huge tuple counts, because the extra motion cost from NumericAggState is almost negligible compared to scanning and aggregating more than 2^32 - 1 tuples.

@jianlirong commented Jun 20, 2025

In other words, for the intermediate-state data structure NumericAggState in the first-stage aggregation, all of the int64 fields could actually be changed to int32, not just the two above.

@gongxun0928 (Contributor, Author) commented:

> In other words, for the intermediate-state data structure NumericAggState in the first-stage aggregation, all of the int64 fields could actually be changed to int32, not just the two above.

This approach is challenging. Both the Finalize HashAggregate and Partial HashAggregate phases rely on the same workflow (advance_aggregates → xxx → do_numeric_accum) to perform numeric accumulation and update the aggregate state. If we used different NumericAggState structures for different phases, we would need to refactor a large number of numeric-related functions to support the more compact state structure. In addition, finalize_aggregates would need to select different serialization functions depending on the aggsplit type. All of these changes would diverge significantly from upstream PostgreSQL and make future version upgrades harder to manage.

@gongxun0928 (Contributor, Author) commented Jun 23, 2025

I took a look at the latest PostgreSQL code (PostgreSQL 18 beta). Although pInfcount and nInfcount in NumericAggState are defined as int64, in practice they are mostly used as boolean flags. I'm considering starting a discussion upstream to propose changing these two fields to bool in order to reduce the memory footprint of the NumericAggState struct. If there is consensus, I’d like to drive this change upstream.

Update:
The only scenario where we actually need to increment and decrement the infinity counts is sliding-window (moving) aggregation, where previous statistics are reused. Even in that case, pInfcount and nInfcount will never exceed 2^32-1 in practice, so a 32-bit integer should be sufficient.

@gongxun0928 force-pushed the performance/improve-hash-agg-performance-in-large-scale-data branch from 330c593 to 9e16c2a on June 23, 2025

Reduce memory usage and improve the performance of numeric aggregate functions by changing the infinity counters from int64 to uint32. This adds overflow protection and retains compatibility with typical workloads.

Introduce a new GUC parameter, gp_enable_numeric_inf_count_check, to control overflow checking for the infinity counters. When disabled, the system treats all infinity values as NaN to avoid incorrect results caused by counter overflow in cases with extremely large infinity counts.
@gongxun0928 force-pushed the performance/improve-hash-agg-performance-in-large-scale-data branch from 9e16c2a to d8b983d on July 2, 2025