[WIP] Add guc gp_hot_standby_snapshot_restore_point_name #1184

Open · wants to merge 12 commits into main from hotstandby_restore_point

Conversation

fanfuxiaoran
Contributor

@fanfuxiaoran fanfuxiaoran commented Jun 23, 2025

Fixes #ISSUE_Number
Based on PR #1152.

What does this PR do?

For a hot standby cluster, a consistent state can only be reached at a
restore point, so the snapshot on the hot standby should be taken at the
restore point.

Add the GUC gp_hot_standby_snapshot_restore_point_name; it sets the name of
the restore point at which the snapshot should be taken.
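
For illustration, a minimal sketch of how such a string GUC could be
registered in guc_gp.c, assuming the usual ConfigureNamesString pattern; the
context, group, and default value here are assumptions, not the actual diff:

```c
/* Sketch only: context, group, and default are assumptions. */
char	   *gp_hot_standby_snapshot_restore_point_name = NULL;

static struct config_string ConfigureNamesString_gp[] =
{
	{
		{"gp_hot_standby_snapshot_restore_point_name", PGC_SIGHUP, REPLICATION_STANDBY,
			gettext_noop("Sets the restore point name at which the hot standby "
						 "snapshot should be taken."),
			NULL
		},
		&gp_hot_standby_snapshot_restore_point_name,
		"",						/* default: no restore point configured */
		NULL, NULL, NULL		/* no check/assign/show hooks */
	},
	/* ... other entries and the terminating empty element omitted ... */
};
```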

Add a new type of xlog record, XLOG_RESTORE_POINT_RUNNING_XACTS, which
records the running xids and the restore point name. On the hot standby,
during recovery, if the record's restore point name equals
gp_hot_standby_snapshot_restore_point_name, we update the hot standby
snapshot.
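
Roughly, the record payload and the replay-time check could look like the
sketch below; the struct layout and the UpdateHotStandbySnapshot() helper are
hypothetical, and only the record name and the GUC come from this description:

```c
#include "access/xlog_internal.h"	/* MAXFNAMELEN */

/* Hypothetical payload: the running xids plus the restore point name. */
typedef struct xl_restore_point_running_xacts
{
	char		rp_name[MAXFNAMELEN];	/* restore point name */
	int			xcnt;					/* number of running xids */
	TransactionId xids[FLEXIBLE_ARRAY_MEMBER];
} xl_restore_point_running_xacts;

/* Replay-time check (sketch): only refresh the snapshot at the named point. */
static void
restore_point_running_xacts_redo(xl_restore_point_running_xacts *xlrec)
{
	if (gp_hot_standby_snapshot_restore_point_name != NULL &&
		strcmp(xlrec->rp_name, gp_hot_standby_snapshot_restore_point_name) == 0)
		UpdateHotStandbySnapshot(xlrec->xcnt, xlrec->xids);	/* hypothetical */
}
```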

Besides XLOG_RESTORE_POINT_RUNNING_XACTS, I think the hot standby cluster is
also consistent at a shutdown checkpoint xlog record, so we can update the
snapshot there too.

TODO:
I have added the test hot_standby/snapshot, but the test fails. It is
expected to fail when querying table hs_sh on the hot standby before the
restore point "r1" is created. This means the commit does not work correctly
yet; I must be missing something.

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


huansong and others added 12 commits June 23, 2025 15:42
This is the initial commit to support hot standby dispatch in GPDB. In this
commit, hot standby dispatch is enabled when the hot_standby GUC is set to ON,
and the standby coordinator accepts connections and can run queries. Basic
query dispatching and error handling cases are covered; please see the
isolation2/hot_standby tests for those cases.

Current limitations that will be addressed in coming works:
* No read-committed isolation from global transaction, so e.g. a SELECT on
  standby QD could see partial INSERT results on the primary QD.
* No repeatable-read isolation, so e.g., a UDF that runs multiple SELECTs on the
  standby QD could see different results from the SELECTs even if they are the same.
* No transaction block BEGIN ... END, and as a result, no cursor support or
  other things that depend on BEGIN...END.
* Query conflict between primary and standby has not been tested yet. This will
  be done with/after the isolation work.

Co-authored-by: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
Co-authored-by: Jimmy Yih <jyih@vmware.com>
We currently have the GPDB-specific gp_pause_on_restore_point_replay
hidden developer GUC which allows us to pause when replaying a restore
point record. The logic was a bit flawed and needed some refactoring
to accommodate the current hot standby work.

These are the changes that were made:
* The gp_pause_on_restore_point_replay GUC has been changed from a
  boolean type to a string type. This allows us to set exactly which
  restore point to pause on (assuming the restore points provided are
  unique). The user/application can update the GUC, do a reload, and
  resume WAL replay to advance towards the next restore point to pause
  on.
* The pausing logic has been moved out of the xlog_redo() function and
  into its own separate function. If WAL replay has reached the
  restore point designated in the gp_pause_on_restore_point_replay
  GUC, it will now pause near the end of the main redo apply
  loop. When resumed (via a `SELECT pg_wal_replay_resume()` call), we
  check if a promotion has been requested. If there is a promotion
  request, then the continuous recovery target has been reached; we then stop
  recovery and go through promotion by piggybacking on the existing
  recovery-target logic (a rough sketch of this pause check follows).

Co-authored-by: Jimmy Yih <jyih@vmware.com>
Co-authored-by: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
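
A rough sketch of the relocated pause check, as it might sit near the end of
the main redo apply loop; SetRecoveryPause(), recoveryPausesHere(), and
CheckForStandbyTrigger() are existing upstream xlog.c routines, while the
placement and the other names are assumptions:

```c
/*
 * Sketch: pause WAL replay once the restore point named by the
 * gp_pause_on_restore_point_replay string GUC has been replayed.
 * lastReplayedRestorePoint is a hypothetical tracking variable.
 */
static void
MaybePauseOnRestorePoint(const char *lastReplayedRestorePoint)
{
	if (gp_pause_on_restore_point_replay[0] != '\0' &&
		strcmp(lastReplayedRestorePoint, gp_pause_on_restore_point_replay) == 0)
	{
		SetRecoveryPause(true);
		recoveryPausesHere(false);	/* blocks until pg_wal_replay_resume() */

		/*
		 * If a promotion was requested while paused, the continuous recovery
		 * target has been reached: stop recovery and promote by piggybacking
		 * on the existing recovery-target logic.
		 */
		if (CheckForStandbyTrigger())
			reachedRecoveryTarget = true;	/* hypothetical flag */
	}
}
```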
To support hot standby, we need to reconstruct the state of running dtx at the
time of the checkpoint on the standby. This is key to the correctness of the
distributed snapshot the standby will use. One key piece of information is
latestCompletedGxid: it provides the xmax of the snapshot.

But unlike the primary, which just sets latestCompletedGxid = nextGxid - 1, the
standby cannot use nextGxid. This is because nextGxid was bumped in the
checkpoint and cannot represent the xmax of running dtx (see CreateCheckPoint).
That is OK for the primary since it does not need to reconstruct the running dtx.

So now we introduce a new XLOG type XLOG_LATESTCOMPLETED_GXID which directly
writes the latestCompletedGxid at checkpoint time. It is only written on the
QD and only when hot standby is active.

P.S. the alternative is to bump nextGxid at startup instead of at checkpoint,
so its value can be used by the standby to initialize latestCompletedGxid. But
for the primary, it would be impossible to know the correct number of gxids to
bump, since gp_gxid_prefetch_num can change before a restart.

CBDB: Change the rmgr from XLOG to STANDBY as there is no room in the 4
high bits of xl_info. It also makes sense to put it into the STANDBY rmgr
since it is used to build the hot standby snapshot.
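
For illustration, emitting such a record at checkpoint time might look like
the sketch below; XLogBeginInsert()/XLogRegisterData()/XLogInsert() are the
standard WAL-insertion calls, while the info-bit value and the guard
conditions are assumptions:

```c
/*
 * Sketch: persist latestCompletedGxid at checkpoint time so the standby can
 * initialize the xmax of its distributed snapshot. Written only on the QD
 * and only when hot standby can be active.
 */
static void
LogLatestCompletedGxid(DistributedTransactionId latestCompletedGxid)
{
	XLogBeginInsert();
	XLogRegisterData((char *) &latestCompletedGxid,
					 sizeof(DistributedTransactionId));
	/* STANDBY rmgr per the CBDB note above; the info bits are assumed */
	(void) XLogInsert(RM_STANDBY_ID, XLOG_LATESTCOMPLETED_GXID);
}
```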
The previous few commits have removed some roadblocks to supporting it. This
commit mainly deals with two more aspects related to distributed transactions:

* Initialize latestCompletedGxid during StartupXLOG, and update it while the
  standby replays new transactions.
* Construct an in-progress dtx array when creating a distributed snapshot
  according to the shmCommittedGxidArray[] we already keep in the standby
  (a rough sketch follows).
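
A rough sketch of the second point, under the assumption that
shmCommittedGxidArray[] holds gxids whose commit has been replayed but not yet
forgotten; the filter below is a guess at the in-progress rule, not the actual
logic:

```c
/*
 * Sketch: derive the in-progress dtx array for a standby distributed
 * snapshot from the committed-gxid tracking the standby already keeps.
 * The extern declarations mirror names in the commit message.
 */
extern volatile int *shmNumCommittedGxacts;
extern DistributedTransactionId *shmCommittedGxidArray;

static int
StandbyCollectInProgressGxids(DistributedTransactionId *inProgress,
							  DistributedTransactionId xmax)
{
	int			count = 0;

	for (int i = 0; i < *shmNumCommittedGxacts; i++)
	{
		DistributedTransactionId gxid = shmCommittedGxidArray[i];

		/* treat committed-but-not-yet-forgotten gxids below xmax as running */
		if (gxid < xmax)
			inProgress[count++] = gxid;
	}
	return count;
}
```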

It was pondered whether or not to add a new WAL type XLOG_RUNNING_DISTRIBUTED_XACTS
similar to XLOG_RUNNING_XACTS. But it seems unnecessary at the moment: we already
have the running dtx information in the checkpoint record. The other information
in the XLOG_RUNNING_XACTS record does not seem to be needed to support
read-committed isolation. There are a few other callers of ProcArrayApplyRecoveryInfo()
that rely on XLOG_RUNNING_XACTS, but it does not seem we need to emulate them
for dtx.
In previous commits we've supported hot standby dispatch and read-committed
isolation. In order to support repeatable-read isolation, the only real
complication is just to support the BEGIN...END block. The snapshot selection
and usage for repeatable-read on a hot standby is exactly the same as a primary.

And, the main difference between a single-statement transaction and a
BEGIN...END block is just the DTX context of the QEs: in the former case the
QEs are DTX_CONTEXT_QE_AUTO_COMMIT_IMPLICIT, but in the latter case they
are DTX_CONTEXT_QE_TWO_PHASE_EXPLICIT_WRITER (see setupQEDtxContext()).

We had Assert/ERROR checks in the code assuming that for EXPLICIT_WRITER there
is always a valid distributed xid for the transaction. However, that is not the
case for hot standby: a standby never allocates an xid, and there is no use for
an xid in its BEGIN...END block. Therefore, all we need to do is make sure this
assumption is not applied to hot standby; after that, supporting
repeatable-read is a no-op (a sketch of the relaxed check follows).
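
A minimal sketch of the relaxed check; the function itself is illustrative,
and only the DTX context value and the IS_HOT_STANDBY_QE name come from this
commit:

```c
/*
 * Sketch: an explicit-writer QE must carry a valid distributed xid, except
 * on a hot standby, which never allocates one.
 */
static void
CheckExplicitWriterGxid(DtxContext context, DistributedTransactionId gxid)
{
	if (context == DTX_CONTEXT_QE_TWO_PHASE_EXPLICIT_WRITER &&
		!IS_HOT_STANDBY_QE() &&
		gxid == InvalidDistributedTransactionId)
		elog(ERROR, "explicit writer QE has no valid distributed xid");
}
```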

Another small change is to rename IS_STANDBY_QE to IS_HOT_STANDBY_QE to better
correspond to IS_HOT_STANDBY_QD.
Fixed result differences and non-runnable tests. The notable ones are:
1. Backward cursor fetch is not supported in GPDB. Move to the "_disallowed" test.
2. Some ERROR messages are replaced by "not supported" ones in GPDB which
   should be fine.
3. "cannot execute SELECT FOR SHARE in a read-only transaction" is replaced by
   "cannot acquire lock mode ExclusiveLock ... during recovery". The reason is
   that the QD needs to acquire the lock if GDD isn't enabled. If we later
   find it necessary, we may change the error message for the standby to be
   a little more informative.
4. The "test_setup" was added to the standby schedule by mistake. Removing it.

With that, we can add this test schedule to the hot standby pipeline job.
For the most part, query conflict on standby works w/o any changes in GPDB.
Add tests for the expected behavior.

One notable issue is that we are not considering the distributed snapshot in
snapshot conflict detection at this point. We added a test for that behavior too.

Add these tests:
1. All the query conflict types mentioned in
   https://www.postgresql.org/docs/12/hot-standby.html#HOT-STANDBY-CONFLICT.
   There is actually one type not mentioned there: deadlock conflict. A test
   for that is still to be produced.
2. GUCs hot_standby_feedback and vacuum_defer_cleanup_age.
3. System view gp_stat_database_conflicts which is a cluster-wide view of
   pg_stat_database_conflicts. Note that in the test we need to take the max of
   the conflict counts among all segments to avoid flakiness. Ideally we would
   have something like gp_stat_database_conflicts_summary to print the max
   counts, but we are not allowed to change the catalog now, so that is left as
   a FIXME item.
4. A test case showing distributed snapshot isn't taken into account when
   detecting snapshot conflict. This is a limitation that we'll address with
   a restore-point based dtx snapshot creation approach later.
For a selected list of PG system views (those starting with the 'pg_' prefix),
we will create a corresponding 'gp_' view for each one in the list.
Each 'gp_' view is basically a UNION ALL of the results of running the
corresponding 'pg_' view on all segments (including the coordinator).

Note that these views do not aggregate the results. The aggregated versions of
the views will be named with a '_summary' suffix (such as
'gp_stat_all_tables_summary').

To add a new 'pg_' view to this list, simply put the name in file
'src/backend/catalog/system_views_gp.in'. This commit adds an initial
list of views that we think make sense to have 'gp_' views.

With this change, we also remove the existing definition of
gp_stat_archiver view and let it be generated automatically.
We also had gp_stat_replication, but it carries additional columns compared to
pg_stat_replication, so it cannot use the automatic approach.
Some pg_ views have been modified by CBDB: the gp_segment_id column has been
added to them, so they fail to be transformed from pg_ views to gp_ views (see
commit 5028222620d410fe3d4c60f732a599e269006968). Just remove them from
system_views_gp.in for now; it may be better to fix them later.
For a hot standby cluster, a consistent state can only be reached at a
restore point, so the snapshot on the hot standby should be taken at the
restore point.

Add the GUC gp_hot_standby_snapshot_restore_point_name; it sets the name of
the restore point at which the snapshot should be taken.

Add a new type of xlog record, XLOG_RESTORE_POINT_RUNNING_XACTS, which
records the running xids and the restore point name. On the hot standby,
during recovery, if the record's restore point name equals
gp_hot_standby_snapshot_restore_point_name, we update the hot standby
snapshot.

Besides XLOG_RESTORE_POINT_RUNNING_XACTS, I think the hot standby cluster is
also consistent at a shutdown checkpoint xlog record, so we can update the
snapshot there too.

TODO:
I have added the test hot_standby/snapshot, but the test fails. It is
expected to fail when querying table hs_sh on the hot standby before the
restore point "r1" is created. This means the commit does not work correctly
yet; I must be missing something.
@fanfuxiaoran fanfuxiaoran force-pushed the hotstandby_restore_point branch from de83218 to 62f845a on June 23, 2025 07:44
@fanfuxiaoran
Contributor Author

@weinan003 @Mulily0513, this PR is still in progress; the test case hot_standby/snapshot shows an unexpected result. Please help to fix and finish it.

@my-ship-it
Contributor

> @weinan003 @Mulily0513, this PR is still in progress; the test case hot_standby/snapshot shows an unexpected result. Please help to fix and finish it.

Thanks for the POC.
