[WIP] Add guc gp_hot_standby_snapshot_restore_point_name #1184

Open · wants to merge 12 commits into main from hotstandby_restore_point

Conversation

fanfuxiaoran
Contributor

@fanfuxiaoran fanfuxiaoran commented Jun 23, 2025

Fixes #ISSUE_Number
Based on PR #1152.

What does this PR do?

For a hot standby cluster, a consistent state can only be reached at a
restore point, so the snapshot on the hot standby should be taken at the
restore point.

Add the GUC gp_hot_standby_snapshot_restore_point_name; it sets the name of
the restore point at which the snapshot should be taken.
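
For illustration, a minimal sketch of how such a string GUC could be
registered in guc_gp.c, assuming the usual ConfigureNamesString pattern; the
context, group, and default value here are assumptions, not the actual diff:

```c
/* Sketch only: context, group, and default are assumptions. */
char	   *gp_hot_standby_snapshot_restore_point_name = NULL;

static struct config_string ConfigureNamesString_gp[] =
{
	{
		{"gp_hot_standby_snapshot_restore_point_name", PGC_SIGHUP, REPLICATION_STANDBY,
			gettext_noop("Sets the restore point name at which the hot standby "
						 "snapshot should be taken."),
			NULL
		},
		&gp_hot_standby_snapshot_restore_point_name,
		"",						/* default: no restore point configured */
		NULL, NULL, NULL		/* no check/assign/show hooks */
	},
	/* ... other entries and the terminating empty element omitted ... */
};
```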

Add a new type of xlog record, XLOG_RESTORE_POINT_RUNNING_XACTS, which
records the running xids and the restore point name. On the hot standby,
during recovery, if the record's restore point name equals
gp_hot_standby_snapshot_restore_point_name, we update the hot standby
snapshot.
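
Roughly, the record payload and the replay-time check could look like the
sketch below; the struct layout and the UpdateHotStandbySnapshot() helper are
hypothetical, and only the record name and the GUC come from this description:

```c
#include "access/xlog_internal.h"	/* MAXFNAMELEN */

/* Hypothetical payload: the running xids plus the restore point name. */
typedef struct xl_restore_point_running_xacts
{
	char		rp_name[MAXFNAMELEN];	/* restore point name */
	int			xcnt;					/* number of running xids */
	TransactionId xids[FLEXIBLE_ARRAY_MEMBER];
} xl_restore_point_running_xacts;

/* Replay-time check (sketch): only refresh the snapshot at the named point. */
static void
restore_point_running_xacts_redo(xl_restore_point_running_xacts *xlrec)
{
	if (gp_hot_standby_snapshot_restore_point_name != NULL &&
		strcmp(xlrec->rp_name, gp_hot_standby_snapshot_restore_point_name) == 0)
		UpdateHotStandbySnapshot(xlrec->xcnt, xlrec->xids);	/* hypothetical */
}
```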

Besides XLOG_RESTORE_POINT_RUNNING_XACTS, I think the hot standby cluster is
also consistent at a shutdown checkpoint xlog record, so we can update the
snapshot there too.

TODO:
I have added the test hot_standby/snapshot, but the test fails. It is
expected to fail when querying table hs_sh on the hot standby before the
restore point "r1" is created. This means the commit does not work correctly
yet; I must be missing something.

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


huansong and others added 12 commits June 23, 2025 15:42
This is the initial commit to support hot standby dispatch in GPDB. In this
commit, hot standby dispatch is enabled when the hot_standby GUC is set to ON,
and the standby coordinator accepts connections and can run queries. Basic
query dispatching and error handling cases are covered; please see the
isolation2/hot_standby tests for those cases.

Current limitations that will be addressed in coming works:
* No read-committed isolation from global transaction, so e.g. a SELECT on
  standby QD could see partial INSERT results on the primary QD.
* No repeatable-read isolation, so e.g., a UDF that runs multiple SELECTs on the
  standby QD could see different results from the SELECTs even if they are the same.
* No transaction block BEGIN ... END, and as a result, no cursor support or
  other things that depend on BEGIN...END.
* Query conflict between primary and standby has not been tested yet. This will
  be done with/after the isolation work.

Co-authored-by: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
Co-authored-by: Jimmy Yih <jyih@vmware.com>
We currently have the GPDB-specific gp_pause_on_restore_point_replay
hidden developer GUC which allows us to pause when replaying a restore
point record. The logic was a bit flawed and needed some refactoring
to accommodate the current hot standby work.

These are the changes that were made:
* The gp_pause_on_restore_point_replay GUC has been changed from a
  boolean type to a string type. This allows us to set exactly which
  restore point to pause on (assuming the restore points provided are
  unique). The user/application can update the GUC, do a reload, and
  resume WAL replay to advance towards the next restore point to pause
  on.
* The pausing logic has been moved out of the xlog_redo() function and
  into its own separate function. If WAL replay has reached the
  restore point designated in the gp_pause_on_restore_point_replay
  GUC, it will now pause near the end of the main redo apply
  loop. When resumed (via a `SELECT pg_wal_replay_resume()` call), we
  check if a promotion has been requested. If there is a promotion
  request, then the continuous recovery target has been reached; we then stop
  recovery and go through promotion by piggybacking on the existing
  recovery-target logic (a rough sketch of this pause check follows).

Co-authored-by: Jimmy Yih <jyih@vmware.com>
Co-authored-by: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
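
A rough sketch of the relocated pause check, as it might sit near the end of
the main redo apply loop; SetRecoveryPause(), recoveryPausesHere(), and
CheckForStandbyTrigger() are existing upstream xlog.c routines, while the
placement and the other names are assumptions:

```c
/*
 * Sketch: pause WAL replay once the restore point named by the
 * gp_pause_on_restore_point_replay string GUC has been replayed.
 * lastReplayedRestorePoint is a hypothetical tracking variable.
 */
static void
MaybePauseOnRestorePoint(const char *lastReplayedRestorePoint)
{
	if (gp_pause_on_restore_point_replay[0] != '\0' &&
		strcmp(lastReplayedRestorePoint, gp_pause_on_restore_point_replay) == 0)
	{
		SetRecoveryPause(true);
		recoveryPausesHere(false);	/* blocks until pg_wal_replay_resume() */

		/*
		 * If a promotion was requested while paused, the continuous recovery
		 * target has been reached: stop recovery and promote by piggybacking
		 * on the existing recovery-target logic.
		 */
		if (CheckForStandbyTrigger())
			reachedRecoveryTarget = true;	/* hypothetical flag */
	}
}
```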
To support hot standby, we need to reconstruct the state of running dtx at the
time of the checkpoint on the standby. This is key to the correctness of the
distributed snapshot the standby will use. One key piece of information is
latestCompletedGxid: it provides the xmax of the snapshot.

But unlike the primary, which just sets latestCompletedGxid = nextGxid - 1, the
standby cannot use nextGxid. This is because nextGxid was bumped in the
checkpoint and cannot represent the xmax of running dtx (see CreateCheckPoint).
That is OK for the primary since it does not need to reconstruct the running dtx.

So now we introduce a new XLOG type XLOG_LATESTCOMPLETED_GXID which directly
writes the latestCompletedGxid at checkpoint time. It is only written on the
QD and only when hot standby is active.

P.S. the alternative is to bump nextGxid at startup instead of at checkpoint,
so its value can be used by the standby to initialize latestCompletedGxid. But
for the primary, it would be impossible to know the correct number of gxids to
bump, since gp_gxid_prefetch_num can change before a restart.

CBDB: Change the rmgr from XLOG to STANDBY as there is no room in the 4
high bits of xl_info. It also makes sense to put it into the STANDBY rmgr
since it is used to build the hot standby snapshot.
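
For illustration, emitting such a record at checkpoint time might look like
the sketch below; XLogBeginInsert()/XLogRegisterData()/XLogInsert() are the
standard WAL-insertion calls, while the info-bit value and the guard
conditions are assumptions:

```c
/*
 * Sketch: persist latestCompletedGxid at checkpoint time so the standby can
 * initialize the xmax of its distributed snapshot. Written only on the QD
 * and only when hot standby can be active.
 */
static void
LogLatestCompletedGxid(DistributedTransactionId latestCompletedGxid)
{
	XLogBeginInsert();
	XLogRegisterData((char *) &latestCompletedGxid,
					 sizeof(DistributedTransactionId));
	/* STANDBY rmgr per the CBDB note above; the info bits are assumed */
	(void) XLogInsert(RM_STANDBY_ID, XLOG_LATESTCOMPLETED_GXID);
}
```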
The previous few commits have removed some roadblocks to supporting it. This
commit mainly deals with two more aspects related to distributed transactions:

* Initialize latestCompletedGxid during StartupXLOG, and update it while the
  standby replays new transactions.
* Construct an in-progress dtx array when creating a distributed snapshot
  according to the shmCommittedGxidArray[] we already keep in the standby
  (a rough sketch follows).
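
A rough sketch of the second point, under the assumption that
shmCommittedGxidArray[] holds gxids whose commit has been replayed but not yet
forgotten; the filter below is a guess at the in-progress rule, not the actual
logic:

```c
/*
 * Sketch: derive the in-progress dtx array for a standby distributed
 * snapshot from the committed-gxid tracking the standby already keeps.
 * The extern declarations mirror names in the commit message.
 */
extern volatile int *shmNumCommittedGxacts;
extern DistributedTransactionId *shmCommittedGxidArray;

static int
StandbyCollectInProgressGxids(DistributedTransactionId *inProgress,
							  DistributedTransactionId xmax)
{
	int			count = 0;

	for (int i = 0; i < *shmNumCommittedGxacts; i++)
	{
		DistributedTransactionId gxid = shmCommittedGxidArray[i];

		/* treat committed-but-not-yet-forgotten gxids below xmax as running */
		if (gxid < xmax)
			inProgress[count++] = gxid;
	}
	return count;
}
```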

It was pondered whether or not to add a new WAL type XLOG_RUNNING_DISTRIBUTED_XACTS
similar to XLOG_RUNNING_XACTS. But it seems unnecessary at the moment: we already
have the running dtx information in the checkpoint record. The other information
in the XLOG_RUNNING_XACTS record does not seem to be needed to support
read-committed isolation. There are a few other callers of ProcArrayApplyRecoveryInfo()
that rely on XLOG_RUNNING_XACTS, but it does not seem we need to emulate them
for dtx.
In previous commits we've supported hot standby dispatch and read-committed
isolation. In order to support repeatable-read isolation, the only real
complication is just to support the BEGIN...END block. The snapshot selection
and usage for repeatable-read on a hot standby is exactly the same as a primary.

And, the main difference between a single-statement transaction and a
BEGIN...END block is just the DTX context of the QEs: in the former case the
QEs are DTX_CONTEXT_QE_AUTO_COMMIT_IMPLICIT, but in the latter case they
are DTX_CONTEXT_QE_TWO_PHASE_EXPLICIT_WRITER (see setupQEDtxContext()).

We had Assert/ERROR checks in the code assuming that for EXPLICIT_WRITER there
is always a valid distributed xid for the transaction. However, that is not the
case for hot standby: a standby never allocates an xid, and there is no use for
an xid in its BEGIN...END block. Therefore, all we need to do is make sure this
assumption is not applied to hot standby; after that, supporting
repeatable-read is a no-op (a sketch of the relaxed check follows).
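
A minimal sketch of the relaxed check; the function itself is illustrative,
and only the DTX context value and the IS_HOT_STANDBY_QE name come from this
commit:

```c
/*
 * Sketch: an explicit-writer QE must carry a valid distributed xid, except
 * on a hot standby, which never allocates one.
 */
static void
CheckExplicitWriterGxid(DtxContext context, DistributedTransactionId gxid)
{
	if (context == DTX_CONTEXT_QE_TWO_PHASE_EXPLICIT_WRITER &&
		!IS_HOT_STANDBY_QE() &&
		gxid == InvalidDistributedTransactionId)
		elog(ERROR, "explicit writer QE has no valid distributed xid");
}
```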

Another small change is to rename IS_STANDBY_QE to IS_HOT_STANDBY_QE to better
correspond to IS_HOT_STANDBY_QD.
Fixed result differences and non-runnable tests. The notable ones are:
1. Backward cursor fetch is not supported in GPDB. Move to the "_disallowed" test.
2. Some ERROR messages are replaced by "not supported" ones in GPDB which
   should be fine.
3. "cannot execute SELECT FOR SHARE in a read-only transaction" is replaced by
   "cannot acquire lock mode ExclusiveLock ... during recovery". The reason is
   that the QD needs to acquire the lock if GDD isn't enabled. If we later
   find it necessary, we may change the error message for the standby to be
   a little more informative.
4. The "test_setup" was added to the standby schedule by mistake. Removing it.

With that, we can add this test schedule to the hot standby pipeline job.
For the most part, query conflict on standby works w/o any changes in GPDB.
Add tests for the expected behavior.

One notable issue is that we are not considering the distributed snapshot in
snapshot conflict detection at this point. We added a test for that behavior too.

Add these tests:
1. All the query conflict types mentioned in
   https://www.postgresql.org/docs/12/hot-standby.html#HOT-STANDBY-CONFLICT.
   There is actually one type not mentioned there: deadlock conflict. A test
   for that is still to be produced.
2. GUCs hot_standby_feedback and vacuum_defer_cleanup_age.
3. System view gp_stat_database_conflicts which is a cluster-wide view of
   pg_stat_database_conflicts. Note that in the test we need to take the max of
   the conflict counts among all segments to avoid flakiness. Ideally we would
   have something like gp_stat_database_conflicts_summary to print the max
   counts, but we are not allowed to change the catalog now, so that is left as
   a FIXME item.
4. A test case showing distributed snapshot isn't taken into account when
   detecting snapshot conflict. This is a limitation that we'll address with
   a restore-point based dtx snapshot creation approach later.
For a selected list of PG system views (those starting with the 'pg_' prefix),
we will create a corresponding 'gp_' view for each one in the list.
Each 'gp_' view is basically a UNION ALL of the results of running the
corresponding 'pg_' view on all segments (including the coordinator).

Note that these views do not aggregate the results. The aggregated versions of
the views will be named with a '_summary' suffix (such as
'gp_stat_all_tables_summary').

To add a new 'pg_' view to this list, simply put the name in file
'src/backend/catalog/system_views_gp.in'. This commit adds an initial
list of views that we think make sense to have 'gp_' views.

With this change, we also remove the existing definition of
gp_stat_archiver view and let it be generated automatically.
We also had gp_stat_replication, but it carries additional columns compared to
pg_stat_replication, so it cannot use the automatic approach.
Some pg_ views have been modified by CBDB: the gp_segment_id column has been
added to them, so they fail to be transformed from pg_ views to gp_ views (see
commit 5028222620d410fe3d4c60f732a599e269006968). Just remove them from
system_views_gp.in for now; it may be better to fix them later.
For a hot standby cluster, a consistent state can only be reached at a
restore point, so the snapshot on the hot standby should be taken at the
restore point.

Add the GUC gp_hot_standby_snapshot_restore_point_name; it sets the name of
the restore point at which the snapshot should be taken.

Add a new type of xlog record, XLOG_RESTORE_POINT_RUNNING_XACTS, which
records the running xids and the restore point name. On the hot standby,
during recovery, if the record's restore point name equals
gp_hot_standby_snapshot_restore_point_name, we update the hot standby
snapshot.

Besides XLOG_RESTORE_POINT_RUNNING_XACTS, I think the hot standby cluster is
also consistent at a shutdown checkpoint xlog record, so we can update the
snapshot there too.

TODO:
I have added the test hot_standby/snapshot, but the test fails. It is
expected to fail when querying table hs_sh on the hot standby before the
restore point "r1" is created. This means the commit does not work correctly
yet; I must be missing something.
@fanfuxiaoran fanfuxiaoran force-pushed the hotstandby_restore_point branch from de83218 to 62f845a on June 23, 2025 07:44
@fanfuxiaoran
Contributor Author

@weinan003 @Mulily0513, this PR is still in progress; the test case hot_standby/snapshot shows an unexpected result. Please help to fix and finish it.

@my-ship-it
Contributor

> @weinan003 @Mulily0513, this PR is still in progress; the test case hot_standby/snapshot shows an unexpected result. Please help to fix and finish it.

Thanks for the POC.
