
Cherry pick hot standby commits from gpdb #1152

Open

fanfuxiaoran wants to merge 11 commits into main from cherry_pick_hot_standby

Conversation

fanfuxiaoran
Contributor

@fanfuxiaoran fanfuxiaoran commented Jun 10, 2025

Fixes #ISSUE_Number

What does this PR do?

This PR cherry-picks a set of commits from gpdb related to the hot standby feature.
The main changes are:

  • Enable read-only queries on the hot standby
  • Build the correct distributed snapshot on the hot standby
  • Support read-committed/repeatable-read dtx isolation on the hot standby

Also cherry-pick the restore point commit:
"Refactor restore point pausing logic for continuous archive recovery".

Because those commits are not contiguous, there are some conflicts with cbdb.

  • As some Perl refactoring commits are not cherry-picked, fix the TAP tests
    in the current style.
  • Fix the isolation tests in the current style.
  • Fix system_views_gp.in.
    Some pg_ views have been modified by cbdb: the gp_segment_id
    column has been added to them, so they fail to be transformed
    from pg_ views to gp_ views. So just remove them from
    system_views_gp.in for now; it may be better to fix them later.
  • Put XLOG_LATESTCOMPLETED_GXID into the STANDBY rmgr
    in the commit "Add XLOG_LATESTCOMPLETED_GXID".
    Change the rmgr from XLOG to STANDBY as there is no room in the 4
    high bits of xl_info, and it makes sense to put it into the STANDBY rmgr
    since it is used to build the hot standby snapshot.

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


@fanfuxiaoran fanfuxiaoran force-pushed the cherry_pick_hot_standby branch from f1d72f8 to 1ee2fcc Compare June 10, 2025 10:09
@fanfuxiaoran fanfuxiaoran changed the title [WIP]:Cherry pick hot standby commits from gpdb Cherry pick hot standby commits from gpdb Jun 11, 2025
@yjhjstz
Member

yjhjstz commented Jun 11, 2025

@fanfuxiaoran
Contributor Author

Revert "Include distributed xid in transaction commit WAL in all cases"

This PR affects TPC-C.

Hmm... Could you explain more about it? @yjhjstz Hot standby needs this commit to build the distributed snapshot.

@fanfuxiaoran fanfuxiaoran requested a review from yjhjstz June 11, 2025 03:54
@yjhjstz
Member

yjhjstz commented Jun 11, 2025

The increased WAL size causes TPC-C performance degradation.

@fanfuxiaoran
Contributor Author

The increased WAL size causes TPC-C performance degradation.

Yes, it will indeed increase the WAL size. Does it impact performance because of the extra I/O from flushing the WAL, or because of contention when writing the WAL?

@my-ship-it
Contributor

my-ship-it commented Jun 11, 2025

The increased WAL size causes TPC-C performance degradation.

Yes, it will indeed increase the WAL size. Does it impact performance because of the extra I/O from flushing the WAL, or because of contention when writing the WAL?

Based on previous tests, adding dxid information has a relatively large impact on TPC-C because of the WAL size. However, to support hot standby we need to add this information, so the two requirements are in tension.

@my-ship-it
Contributor

Based on previous tests, adding dxid information has a relatively large impact on TPC-C because of the WAL size. However, to support hot standby we need to add this information, so the two requirements are in tension.

Maybe it's possible to add a GUC to control the behavior, but that would increase the complexity for users.

@fanfuxiaoran fanfuxiaoran requested a review from weinan003 June 16, 2025 05:59
@yjhjstz
Member

yjhjstz commented Jun 18, 2025

After merging #1078, there are conflicts in src/backend/access/transam/xlog.c.

@fanfuxiaoran
Contributor Author

After merging #1078, there are conflicts in src/backend/access/transam/xlog.c.

Thanks for pointing it out. I will take a look.

bool isDtxPrepared = isPreparedDtxTransaction();
DistributedTransactionId distrib_xid = getDistributedTransactionId();
Member

Can we use EnableHotStandby to control distrib_xid usage?

Contributor Author

Actually no. EnableHotStandby will only be true on the hot standby cluster; on the primary cluster
it's always false. But the distrib_xid is written to the xlog by the primary cluster.

Discussed with Max. We prefer to use a variable, maybe named 'enable_distri_xid', similar to the
variable 'NEXTGXID': it would be stored both in a configuration file and in the xlog, and then
synced to the standby cluster.

The standby cluster will check the variable 'enable_distri_xid'; if it's true, then the cluster
cannot be configured as hot_standby.

What do you think about this idea?

Member

Maybe we can optimize the WAL to counteract this impact.

Member

We should resolve the conflicts and then add an installcheck-hot-standby pipeline job.

Contributor Author

Yes. I'm thinking about adding a job to run installcheck-hot-standby in the pipeline. @edespino Could you help add it?

Member

Add an entry like the one below to .github/workflows/build-cloudberry.yml:

{"test": "ic-isolation2",
 "make_configs": ["src/test/isolation2:installcheck-isolation2"]
},

huansong and others added 11 commits June 20, 2025 10:06
This is the initial commit to support hot standby dispatch in GPDB. In this
commit, hot standby dispatch is enabled when the hot_standby GUC is set to ON,
and the standby coordinator can be connected to and queries can be run on it.
Basic query dispatching and error handling cases are covered; see the
isolation2/hot_standby tests for those cases.

Current limitations that will be addressed in coming works:
* No read-committed isolation for global transactions, so e.g. a SELECT on the
  standby QD could see partial INSERT results from the primary QD.
* No repeatable-read isolation, so e.g. a UDF that runs multiple SELECTs on the
  standby QD could see different results from the SELECTs even if they are the same.
* No transaction block BEGIN ... END, and as a result, no cursor support or
  other things that depend on BEGIN...END.
* Query conflict between primary and standby has not been tested yet. This will
  be done with/after the isolation work.

Co-authored-by: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
Co-authored-by: Jimmy Yih <jyih@vmware.com>
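As an illustration of what this commit enables, here is a minimal, hedged sketch of a session against the standby coordinator (host, port, and table name are made up for the example):

-- Connect to the standby coordinator, e.g.:
--   psql -h standby-host -p 5432 postgres
SELECT pg_is_in_recovery();        -- returns true on a hot standby
SELECT count(*) FROM t_example;    -- read-only query dispatched to the segments
INSERT INTO t_example VALUES (1);  -- fails: the standby is read-only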
We currently have the GPDB-specific gp_pause_on_restore_point_replay
hidden developer GUC which allows us to pause when replaying a restore
point record. The logic was a bit flawed and needed some refactoring
to accommodate the current hot standby work.

These are the changes that were made:
* The gp_pause_on_restore_point_replay GUC has been changed from a
  boolean type to a string type. This allows us to set exactly which
  restore point to pause on (assuming the restore point names provided
  are unique). The user/application can update the GUC, do a reload, and
  resume WAL replay to advance towards the next restore point to pause
  on (a usage sketch follows this commit message).
* The pausing logic has been moved out of the xlog_redo() function and
  into its own separate function. If WAL replay has reached the
  restore point designated in the gp_pause_on_restore_point_replay
  GUC, it will now pause near the end of the main redo apply
  loop. When resumed (via a `SELECT pg_wal_replay_resume()` call), we
  check if a promotion has been requested. If there is a promotion
  request, the continuous recovery target has been reached, and we then
  stop recovery and go through promotion by piggybacking on the existing
  recovery target logic.

Co-authored-by: Jimmy Yih <jyih@vmware.com>
Co-authored-by: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
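A short, hedged usage sketch of the workflow described in the commit message above (the restore point name is only an example):

-- On the primary, create a named restore point:
SELECT pg_create_restore_point('before_data_load');

-- On the standby, set gp_pause_on_restore_point_replay = 'before_data_load'
-- (e.g. in postgresql.conf) and reload; replay pauses when that restore point
-- record is reached:
SELECT pg_reload_conf();

-- Resume replay towards the next restore point to pause on, or go through
-- promotion if a promotion request is pending:
SELECT pg_wal_replay_resume();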
To support hot standby, we need to reconstruct the state of running dtx at the
time of checkpoint on the standby. It is key to the correctness of the distributed
snapshots the standby will use. One key piece of information is
latestCompletedGxid - it provides the xmax of the snapshot.

But unlike the primary, which just sets latestCompletedGxid = nextGxid - 1, the standby
cannot use nextGxid. This is because nextGxid was bumped in the checkpoint and
cannot represent the xmax of the running dtx (see CreateCheckPoint). This is OK for
the primary since it does not need to reconstruct the running dtx.

So now we introduce a new XLOG record type, XLOG_LATESTCOMPLETED_GXID, which directly
writes the latestCompletedGxid at checkpoint time. It is only written on the
QD, and only when hot standby is active.

P.S. The alternative is to bump nextGxid at startup instead of at checkpoint, so
its value can be used by the standby to initialize latestCompletedGxid. But
for the primary, it would be impossible to know the correct number of gxids to
bump, since gp_gxid_prefetch_num can change before a restart.

CBDB: Change the rmgr from XLOG to STANDBY, as there is no room in the 4
high bits of xl_info. It also makes sense to put it into the STANDBY rmgr
since it is used to build the hot standby snapshot.
The previous few commits have removed some roadblocks to supporting it. This
commit mainly deals with two more aspects related to distributed transactions:

* Initialize latestCompletedGxid during StartupXLOG, and update it while the
  standby replays new transactions.
* Construct an in-progress dtx array when creating distributed snapshot
  according to the shmCommittedGxidArray[] we already keep in the standby.

It was pondered whether or not to add a new WAL type XLOG_RUNNING_DISTRIBUTED_XACTS
similar to XLOG_RUNNING_XACTS. But it seems unnecessary at the moment: we already
have the running dtx information in the checkpoint record. The other information
in the XLOG_RUNNING_XACTS record does not seem to be needed to support
read-committed isolation. There are a few other callers of ProcArrayApplyRecoveryInfo()
that rely on XLOG_RUNNING_XACTS, but we don't seem to have a need to
emulate them for dtx.
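A hedged illustration of the read-committed behavior this provides (t_example is a hypothetical table): each statement on the standby QD now uses a distributed snapshot, so it sees a concurrent distributed transaction from the primary either entirely or not at all.

-- On the primary QD:
INSERT INTO t_example SELECT generate_series(1, 1000);

-- On the standby QD, a single SELECT sees either all 1000 new rows or none of
-- them, never a partial, per-segment subset:
SELECT count(*) FROM t_example;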
In previous commits we've supported hot standby dispatch and read-committed
isolation. In order to support repeatable-read isolation, the only real
complication is supporting the BEGIN...END block. The snapshot selection
and usage for repeatable-read on a hot standby is exactly the same as on a primary.

And, the main difference between a single-statement transaction and a
BEGIN...END block is just the DTX context of the QEs: in the former case the
QEs are DTX_CONTEXT_QE_AUTO_COMMIT_IMPLICIT, but in the latter case they
are DTX_CONTEXT_QE_TWO_PHASE_EXPLICIT_WRITER (see setupQEDtxContext()).

We had Asserts/ERRORs in the code assuming that for EXPLICIT_WRITER there is
always a valid distributed xid for the transaction. However, that is not the
case for hot standby: a standby never allocates an xid and there is definitely
no use for an xid in its BEGIN...END block. Therefore, all we need to do is
make sure not to apply this assumption to hot standby. After that, supporting
repeatable-read is a no-op.

Another small change is to rename IS_STANDBY_QE to IS_HOT_STANDBY_QE to better
correspond to IS_HOT_STANDBY_QD.
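A minimal, hedged sketch of the repeatable-read BEGIN...END block this commit allows on the standby coordinator (t_example is a hypothetical table):

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM t_example;  -- takes the snapshot used for the rest of the block
SELECT count(*) FROM t_example;  -- returns the same result as the first SELECT
END;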
Fixed result differences and non-runnable tests. The notable ones are:
1. Backward cursor fetch is not supported in GPDB. Move to the "_disallowed" test.
2. Some ERROR messages are replaced by "not supported" ones in GPDB which
   should be fine.
3. "cannot execute SELECT FOR SHARE in a read-only transaction" is replaced by
   "cannot acquire lock mode ExclusiveLock ... during recovery". The reason is
   that the QD needs to acquire the lock if GDD isn't enabled. If later we
   found it needed we may try to change the error message for standby just to
   be a little more informative.
4. The "test_setup" was added to the standby schedule by mistake. Removing it.

With that, we can add this test schedule to the hot standby pipeline job.
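To illustrate item 3 above, a hedged sketch of the error-message difference on the standby QD (t_example is a hypothetical table; the elided part of the message depends on the locked object):

SELECT * FROM t_example FOR SHARE;
-- upstream hot standby:           ERROR:  cannot execute SELECT FOR SHARE in a read-only transaction
-- GPDB standby QD (GDD disabled): ERROR:  cannot acquire lock mode ExclusiveLock ... during recovery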
For the most part, query conflict on standby works without any changes in GPDB.
Add tests for the expected behavior.

One notable issue is that we are not considering the distributed snapshot in
snapshot conflict detection at this point. We added a test for that behavior too.

Add these tests:
1. All the query conflict types mentioned in
   https://www.postgresql.org/docs/12/hot-standby.html#HOT-STANDBY-CONFLICT.
   There is actually one not mentioned there, which is deadlock conflict; a test
   for that still needs to be produced.
2. GUCs hot_standby_feedback and vacuum_defer_cleanup_age.
3. System view gp_stat_database_conflicts, which is a cluster-wide view of
   pg_stat_database_conflicts. Note that in the test we need to take the max of
   the conflict counts among all segments to avoid flakiness (see the sketch
   after this list). Ideally we should just have something like
   gp_stat_database_conflicts_summary to print the max counts, but we are not
   allowed to change the catalog now. So leaving that as a FIXME item.
4. A test case showing distributed snapshot isn't taken into account when
   detecting snapshot conflict. This is a limitation that we'll address with
   a restore-point based dtx snapshot creation approach later.
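A hedged sketch of the kind of check item 3 describes, taking the max of a conflict counter across segments to avoid flakiness (confl_snapshot is one of the counters inherited from pg_stat_database_conflicts):

SELECT max(confl_snapshot) AS max_snapshot_conflicts
FROM gp_stat_database_conflicts
WHERE datname = current_database();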
For a selected list of PG system views (those with the 'pg_' prefix), we
will create a corresponding 'gp_' view for each one in the list.
Each 'gp_' view is basically a UNION ALL of the results of running the
corresponding 'pg_' view on all segments (including the coordinator).

Note that these views do not aggregate the results. The aggregated
version of the views will be named with a '_summary' suffix (such
as 'gp_stat_all_tables_summary').

To add a new 'pg_' view to this list, simply put the name in file
'src/backend/catalog/system_views_gp.in'. This commit adds an initial
list of views that we think make sense to have 'gp_' views.

With this change, we also remove the existing definition of the
gp_stat_archiver view and let it be generated automatically.
We also have gp_stat_replication, but it carries additional columns compared to
pg_stat_replication, so it cannot use the automatic approach.
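For illustration, a hedged sketch of the general shape of a generated 'gp_' view; this is not the exact generated definition, and it assumes the existing GPDB helpers gp_execution_segment() and gp_dist_random():

CREATE VIEW gp_stat_archiver AS
    SELECT -1 AS gp_segment_id, * FROM pg_stat_archiver              -- coordinator
    UNION ALL
    SELECT gp_execution_segment() AS gp_segment_id, *
    FROM gp_dist_random('pg_stat_archiver');                          -- all segments

This may also help explain the cbdb conflict described in the next commit message: a pg_ view that already carries a gp_segment_id column cannot be transformed this way without producing a duplicate column.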
Some pg_ views have been modified by cbdb: the gp_segment_id
column has been added to them, so they fail to be transformed
from pg_ views to gp_ views (see commit
5028222620d410fe3d4c60f732a599e269006968).
So just remove them from system_views_gp.in for now; it may be better to fix
them later.
Member

@yjhjstz yjhjstz left a comment

LGTM
