
Reconcile (repair or expand) quorum queue membership periodically #8218

Merged

Conversation

@SimonUnge (Collaborator) commented May 17, 2023

Proposed Changes

This is a subset PR for issue #7209.

Add a process that periodically evaluates queue node membership. Currently this covers only quorum queues, but the idea is that the same process would eventually also evaluate streams.

The high-level algorithm is:

  • A node periodically evaluates local leader queue membership
    • The interval defaults to 60 minutes
  • The node loops over all local leader queues, fetching each queue's members
    • It checks the member list against rabbit_nodes:list_members() to find 'dead' members. This should be rare, but could happen if the forget_cluster_node command fails before it 'shrinks' a queue. If there are 'dead' members, the shrink command is executed for those dead members
    • It checks the member list against new cluster members by comparing it with rabbit_nodes:list_running(). If the following criteria are met, the process will add one new node as a member to the quorum queue:
      • The queue has not met its target membership count (based on the target-group-size policy and config)
      • There are enough new nodes in the cluster for the queue to reach its target membership count
      • The queue's current members are up and running
      • The queue's current members are 'reasonably' caught up [not implemented]
  • If the process either adds or removes members during the evaluation, the interval/timeout before the next evaluation is lowered (currently to 10 seconds), as there might be more work to do, e.g. adding more members, since the code only adds one member at a time
  • If the process does nothing, i.e. a no-op, the interval/timeout is the default 60 minutes
  • If a node is added to or removed from the cluster, the interval/timeout is lowered, currently to 10 seconds
  • If there are policy changes, the interval/timeout is lowered
  • If a queue is created, the interval is lowered (a sketch of one reconciliation pass follows this list)
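
To make the flow above concrete, here is a rough Erlang sketch of a single reconciliation pass. Only the rabbit_nodes:list_members/0 and rabbit_nodes:list_running/0 calls are taken from the description above; the qq_helpers module and its functions are hypothetical placeholders, not the actual implementation in this PR.

%% Illustrative sketch of one reconciliation pass (not the merged code).
-module(qq_reconciliation_sketch).
-export([reconcile_local_leaders/0]).

reconcile_local_leaders() ->
    ClusterNodes = rabbit_nodes:list_members(),   %% all known cluster nodes
    RunningNodes = rabbit_nodes:list_running(),   %% nodes currently running
    Results = [reconcile_queue(Q, ClusterNodes, RunningNodes)
               || Q <- qq_helpers:local_leader_queues()],
    %% If anything changed, re-check soon; otherwise wait the full interval.
    case lists:member(changed, Results) of
        true  -> {next_check_after, timer:seconds(10)};
        false -> {next_check_after, timer:minutes(60)}
    end.

reconcile_queue(Q, ClusterNodes, RunningNodes) ->
    Members = qq_helpers:member_nodes(Q),
    case Members -- ClusterNodes of
        [_ | _] = Dead ->
            %% Members on nodes that are no longer cluster members: shrink.
            qq_helpers:shrink(Q, Dead),
            changed;
        [] ->
            Target     = qq_helpers:target_group_size(Q), %% policy/argument/config
            Candidates = RunningNodes -- Members,
            AllUp      = (Members -- RunningNodes) =:= [],
            Enough     = length(Members) + length(Candidates) >= Target,
            case length(Members) < Target andalso Enough andalso AllUp of
                true ->
                    %% Grow by a single member; the shortened next pass can
                    %% add more if the target is still not reached.
                    qq_helpers:grow(Q, hd(Candidates)),
                    changed;
                false ->
                    noop
            end
    end.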

Configuration and policies

Static configuration

quorum_queue.continuous_membership_reconciliation.enabled = true|false

quorum_queue.continuous_membership_reconciliation.auto_remove = true|false

quorum_queue.continuous_membership_reconciliation.interval = non-negative integer

quorum_queue.continuous_membership_reconciliation.trigger_interval = non-negative integer

quorum_queue.continuous_membership_reconciliation.target_group_size = non-negative integer

Policy and argument

Policy (user and operator): target-group-size = non-negative integer
Argument: x-quorum-target-group-size = non-negative integer
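
For illustration, a combined configuration example (all values are placeholders; the interval settings are assumed here to be in milliseconds, i.e. 60 minutes and 10 seconds, and should be checked against the final schema):

quorum_queue.continuous_membership_reconciliation.enabled = true
quorum_queue.continuous_membership_reconciliation.auto_remove = false
quorum_queue.continuous_membership_reconciliation.interval = 3600000
quorum_queue.continuous_membership_reconciliation.trigger_interval = 10000
quorum_queue.continuous_membership_reconciliation.target_group_size = 3

A matching policy could be set with the standard CLI (policy name and pattern are arbitrary examples):

rabbitmqctl set_policy --apply-to queues qq-target-size "^qq\." '{"target-group-size": 3}'

Alternatively, a queue can be declared with the x-quorum-target-group-size argument set to the desired size.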

Tasks

  • Check to react to policy changes
  • Logic to pick suitable nodes as new member nodes for the new-nodes list (e.g. current load, availability zone, etc.)
  • Configuration to enable the feature
  • Configure timeout values
  • Add configuration target-group-size
  • Policies (operator and user). The merge strategy uses max.

Future work

  • Check to evaluate if members are reasonably caught up
  • Check to evaluate if the log size is 'too big'

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

The design was in collaboration with @kjnilsson.

An alternative solution, investigated and partly implemented, was to add functionality to the RA library: a new callback that triggered once RA noticed a node joining or leaving the cluster, asking the 'client' to evaluate a queue's membership. That solution was 'per queue', whereas this PR is per node. There were worries that RA would be doing too many non-Raft tasks, so the idea to update RA was dropped.

@mergify mergify bot added the bazel label May 17, 2023
@michaelklishin (Member):

If a branch isn't ready, please do not submit PRs without a title here. You can submit it in your own fork where everyone should be able to review it.

@SimonUnge (Collaborator, Author) commented May 19, 2023

@michaelklishin sorry, I created it for comments from Karl and thought it was fine to create it in a DRAFT state. I will update the PR today though, as requested by Karl. But yeah, the title could have been better, even for a draft.

@SimonUnge SimonUnge changed the title Work in progress Evaluate quorum queue membership periodically May 19, 2023
@SimonUnge SimonUnge marked this pull request as ready for review May 19, 2023 17:46
@kjnilsson (Contributor) left a review comment:

Looks really good. Some smaller nitpicks and suggestions - time to add some tests! :)

(Five review threads on deps/rabbit/src/rabbit_queue_member_eval.erl; outdated, resolved.)
@SimonUnge (Collaborator, Author):

Looks really good. Some smaller nitpicks and suggestions - time to add some tests! :)

Tests added!

@SimonUnge SimonUnge requested a review from kjnilsson May 30, 2023 23:15
@kjnilsson (Contributor) commented May 31, 2023 via email

@SimonUnge SimonUnge force-pushed the eval_membership_stand_alone_process branch 3 times, most recently from 381df01 to 41405a1 Compare June 2, 2023 18:23
(Four review threads on deps/rabbit/src/rabbit_queue_member_eval.erl; outdated, resolved.)
@SimonUnge SimonUnge force-pushed the eval_membership_stand_alone_process branch 2 times, most recently from f2dbb36 to 8d80817 Compare June 6, 2023 19:46
@kjnilsson kjnilsson added this to the 3.13.0 milestone Jun 7, 2023
@ansd (Member) commented Jun 7, 2023

I wonder whether node maintenance should be taken into account such that no quorum queue replicas are changed if a node upgrade takes a long time. rabbit_maintenance:is_being_drained_*/1 should tell (thanks @dcorbacho).

@SimonUnge (Collaborator, Author) commented Jun 7, 2023

@ansd @kjnilsson

I wonder whether node maintenance should be taken into account such that no quorum queue replicas are changed if a node upgrade takes a long time. rabbit_maintenance:is_being_drained_*/1 should tell (thanks @dcorbacho).

Good idea. Should 'all' nodes be checked - that is, the leader node, the current members, and the potential new nodes - or would it be enough to just check leader and potential new nodes?

filter_out_drained_nodes_local_read/1 seems to be a good function to use.

@SimonUnge (Collaborator, Author), following up on the above:

An attempt to check whether nodes are in maintenance mode has been added; the logic now checks the leader node and the potential new members.
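
For reference, a minimal sketch of how that filtering might be wired, assuming filter_out_drained_nodes_local_read/1 lives in rabbit_maintenance and using is_being_drained_local_read/1 as one of the is_being_drained_*/1 variants mentioned above; qq_helpers:grow/2 is a hypothetical placeholder, not code from this PR:

%% Sketch only: skip growth while the leader node is being drained, and
%% never pick a drained node as the new member.
maybe_grow(Q, CandidateNodes) ->
    case rabbit_maintenance:is_being_drained_local_read(node()) of
        true ->
            skipped;   %% this (leader) node is under maintenance
        false ->
            case rabbit_maintenance:filter_out_drained_nodes_local_read(CandidateNodes) of
                []        -> noop;
                [New | _] -> qq_helpers:grow(Q, New)
            end
    end.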

@SimonUnge SimonUnge force-pushed the eval_membership_stand_alone_process branch 2 times, most recently from f005c57 to 00c3fb7 Compare June 14, 2023 16:28
(Review thread on deps/rabbit/priv/schema/rabbit.schema; outdated, resolved.)
@@ -175,6 +175,12 @@
{requires, [rabbit_alarm, guid_generator]},
{enables, core_initialized}]}).

-rabbit_boot_step({rabbit_queue_member_eval,
Contributor review comment on this hunk:

Could we not just have the process as static configuration?

@SimonUnge (Collaborator, Author) replied Jun 26, 2023:

I tried to 'follow suit' from other supervised processes and creation of supervisors, so piggybacked on that code.

But I have no objections to changing it to something else.

(Review threads on deps/rabbit/src/rabbit.erl, deps/rabbit/src/rabbit_queue_member_eval.erl, and deps/rabbit/src/rabbit_quorum_queue.erl; resolved.)
@SimonUnge SimonUnge force-pushed the eval_membership_stand_alone_process branch 2 times, most recently from 65629bd to f0e6de2 Compare June 27, 2023 22:08
@SimonUnge SimonUnge requested a review from kjnilsson June 30, 2023 17:54
@michaelklishin (Member) left a review comment:

This definitely needs naming work, both for user-facing and internal (e.g. module and function) names.

It also has some overlap with the queue leader location (even though the algorithm that picks an initial leader does not have to be identical to those adding extra replicas). Hopefully this will be organized into a similar system of pluggable modules before 3.13.0.

(Review threads on deps/rabbit/priv/schema/rabbit.schema, deps/rabbit/src/rabbit.erl, and deps/rabbit/src/rabbit_queue_member_eval.erl; outdated, resolved.)
@michaelklishin (Member) left a review comment:

I'd rename this feature to "member reconciliation" or "member repair" because it's not obvious to me what "member evaluation" means. "Reconciliation" is a term used by Akka, and "repair" by Riak and Cassandra.

@michaelklishin (Member) left a review comment:

The naming part was addressed, now we can make the leader selection algorithm a little bit more efficient (and match what happens during the initial leader placement), and perhaps do one last round of testing.

@michaelklishin (Member):

Apparently there is no way for me to "clear" my earlier review with a "just comment", I have to explicitly approve or the PR will keep saying I've requested changes. Oh well. This is an "approval in anticipation" I guess :)

@michaelklishin (Member):

I have resolved a conflict with #8790.

@mkuratczyk (Contributor):

I found a bug in the implementation. It could be specific to a node reset but resets happen.

With this config file:

quorum_queue.continuous_membership_reconciliation.enabled = true
cluster_formation.target_cluster_size_hint = 3

Given these steps:

  1. start-cluster (make/bazel)
  2. import 10k-quorum-queues (https://github.com/rabbitmq/sample-configs/blob/main/queues/10k-quorum-queues.json)
  3. rabbitmqctl -n rabbit-2 stop_app; rabbitmqctl -n rabbit-2 reset; rabbitmqctl -n rabbit-2 join_cluster rabbit-0@$(hostname); rabbitmqctl -n rabbit-2 start_app

Node rabbit-2 starts and looks like it has the queues but it doesn't:

  • some queues may show +1, some +2, but either way they are broken
    (screenshot of the management UI from 2023-07-10 omitted)
  • logs on the leader
2023-07-10 18:17:55.681223+02:00 [debug] <0.132220.0> queue 'qq4367' in vhost '/': committing ra cluster change to [{'%2F_qq4367','rabbit-0@mkuratczykPF0JR'},{'%2F_qq4367','rabbit-1@mkuratczykPF0JR'}]
  • logs on the reset node
2023-07-10 18:18:27.376516+02:00 [warning] <0.1405668.0> Quorum queue recovery: configured member of queue 'qq1534' in vhost '/' was not found on this node. Starting member as a new one. Context: name_not_registered
  • queue record (only two nodes mentioned)
(rabbit-0@mkuratczykPF0JR)3> rabbit_amqqueue:lookup({resource,<<"/">>,queue,<<"qq7564">>}).
{ok,{amqqueue,{resource,<<"/">>,queue,<<"qq7564">>},
              true,false,none,
              [{<<"x-queue-type">>,longstr,<<"quorum">>}],
              {'%2F_qq7564','rabbit-0@mkuratczykPF0JR'},
              [],[],[],undefined,undefined,[],[],live,0,[],<<"/">>,
              #{user => <<"rmq-internal">>},
              rabbit_quorum_queue,
              #{nodes =>
                    ['rabbit-0@mkuratczykPF0JR','rabbit-1@mkuratczykPF0JR']}}}
  • quorum status
> rabbitmq-queues -n rabbit-0 quorum_status qq1010
Status of quorum queue qq1010 on node rabbit-0@mkuratczykPF0JR ...
┌──────────────────────────┬────────────┬───────────┬──────────────┬────────────────┬──────┬─────────────────┐
│ Node Name                │ Raft State │ Log Index │ Commit Index │ Snapshot Index │ Term │ Machine Version │
├──────────────────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit-0@mkuratczykPF0JR │ leader     │ 3         │ 3            │ undefined      │ 1    │ 3               │
├──────────────────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit-2@mkuratczykPF0JR │ follower   │ 0         │ 0            │ undefined      │ 0    │ 3               │
├──────────────────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit-1@mkuratczykPF0JR │ follower   │ 3         │ 3            │ undefined      │ 1    │ 3               │
└──────────────────────────┴────────────┴───────────┴──────────────┴────────────────┴──────┴─────────────────┘

TL;DR, rabbit-2 is running a member of a queue of the same name, but it's actually not clustered with the other two members

@mkuratczyk (Contributor):

Can I also ask for another name change:
quorum_queue.continuous_membership_reconciliation.default_timeout -> quorum_queue.continuous_membership_reconciliation.interval

@SimonUnge (Collaborator, Author) commented Jul 10, 2023

cluster_formation.target_cluster_size_hint = 3

Looking into it. But this config is wrong; it will not trigger the auto-growth code.
Does the same issue happen with quorum_queue.continuous_membership_reconciliation.target_group_size = 3?
@mkuratczyk

@SimonUnge (Collaborator, Author) commented Jul 10, 2023

@mkuratczyk I do see the same 'issue' on the main branch, i.e. without the reconcile code. Or rather, I see the issue on the reset node, i.e. the Context: name_not_registered log, with the same status for the queue:

rabbitmq-queues -n rabbit-2 quorum_status q1                                                                                                           
Status of quorum queue q1 on node rabbit-2@147ddad209ac ...                                                                                                                                  
┌───────────────────────┬────────────┬───────────┬──────────────┬────────────────┬──────┬─────────────────┐                                                                                  
│ Node Name             │ Raft State │ Log Index │ Commit Index │ Snapshot Index │ Term │ Machine Version │                                                                                  
├───────────────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤                                                                                  
│ rabbit-0@147ddad209ac │ leader     │ 2         │ 2            │ undefined      │ 1    │ 3               │                                                                                  
├───────────────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤                                                                                  
│ rabbit-2@147ddad209ac │ follower   │ 0         │ 0            │ undefined      │ 0    │ 3               │                                                                                  
├───────────────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤                                                                                  
│ rabbit-1@147ddad209ac │ follower   │ 2         │ 2            │ undefined      │ 1    │ 3               │                                                                                  
└───────────────────────┴────────────┴───────────┴──────────────┴────────────────┴──────┴─────────────────┘ 

Since you reset the node, it 'should' be automatically 'cleaned up' by my code. But that takes 10 seconds to trigger the check after the node was stopped.

I would think you normally run 'forget_cluster_node' when you remove a node like you did though, so this issue should not happen.

Not sure what a good solution would be here though? @kjnilsson any ideas?

@mkuratczyk (Contributor):

Thanks, I checked on main before I fully understood the problem, and I thought the problem only occurs when I see +1 in the Management UI (which, for some reason, never happened to me on main). We'll look into this issue separately then.

@SimonUnge (Collaborator, Author):

Thanks, I checked on main before I fully understood the problem and I thought the the problem only occurs when I see +1 in the Management UI (which, for some reason, never happened to me on main). We'll look into this issue separately then

Yeah, the +1 I guess happened because some queues did start to get the auto-clean triggered. But since you did not have target-group-size specified, they did not grow back to 3.

@SimonUnge (Collaborator, Author):

Can I also ask for another name change:
quorum_queue.continuous_membership_reconciliation.default_timeout -> quorum_queue.continuous_membership_reconciliation.interval

How about
quorum_queue.continuous_membership_reconciliation.interval for the default long interval and quorum_queue.continuous_membership_reconciliation.trigger_interval for the short interval? Or like... I don't know, recheck_interval... event_interval...

@michaelklishin (Member):

quorum_queue.continuous_membership_reconciliation.interval and quorum_queue.continuous_membership_reconciliation.trigger_interval sound good to me.

@kjnilsson (Contributor):

I am wondering if we should perhaps not (at least not by default) remove members automatically. It feels like removals are the most likely to cause unexpected issues.

@SimonUnge (Collaborator, Author):

@kjnilsson Yes, I'll make it optional, off by default. It does not hurt to have the flexibility.

That said, what is the expected behavior for @mkuratczyk's example above, when you do 'stop_app, reset, join_cluster, start_app' (regardless of whether you have this feature or not)? Is it basically a case of 'Doctor, it hurts when I do this' and the response is 'Then don't do that'? I.e., don't just reset; run forget_cluster_node first?

@mkuratczyk (Contributor):

I've played with this a bit more. On main, the main issue I mentioned doesn't actually happen, but you're also right that it is a problem in the steps I performed. I guess this should be prevented on a different level.

  • on main: after stop/reset/join/start, the node starts, the QQs remember it and start followers on it
  • on the PR branch: the same steps lead to the member being removed from the QQs; then the node starts but, for some reason, decides to start the queues anyway, and they are not clustered with the other two nodes (so we effectively have two Ra clusters of the same name: one 2-node and one single-node, with the queue record pointing at the 2-node one)
  • if I have the target size configured, this gets fixed through reconciliation
  • if I run forget_cluster_node before reset, the node starts without any QQ processes (see the command sequence sketched below)

I think the main conclusion is that we should probably make reset perform the forget steps. But this PR makes this more urgent :)
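
For completeness, the removal sequence that avoids the orphaned local member would presumably look like this (node names mirror the reproduction steps above; this is an illustration, not a tested recipe):

rabbitmqctl -n rabbit-2 stop_app
rabbitmqctl -n rabbit-0 forget_cluster_node rabbit-2@$(hostname)
rabbitmqctl -n rabbit-2 reset
rabbitmqctl -n rabbit-2 join_cluster rabbit-0@$(hostname)
rabbitmqctl -n rabbit-2 start_app

With forget_cluster_node run from a remaining node before the reset, the quorum queues drop the departing member cluster-wide, so the rejoining node does not come back with stale local queue state.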

@SimonUnge SimonUnge force-pushed the eval_membership_stand_alone_process branch from 9886443 to 6b15c57 Compare July 11, 2023 19:23
@SimonUnge SimonUnge force-pushed the eval_membership_stand_alone_process branch from 6b15c57 to 559a83d Compare July 11, 2023 20:14
@SimonUnge (Collaborator, Author):

@michaelklishin I think most name changes are done, auto-remove is now optional (off by default), and the branch has been rebased into one commit.

@michaelklishin (Member):

We still would like to address one (fairly degenerate) scenario described above. There are several options, some are not mutually exclusive. We will file a new issue for that.

@SimonUnge (Collaborator, Author):

We still would like to address one (fairly degenerate) scenario described above. There are several options, some are not mutually exclusive. We will file a new issue for that.

Yes, I'd assume the same scenario would happen if you did the following operations on 'node 2':

stop_app, reset, run forget_cluster_node, join_cluster, start_app

node 2 would then still have the queue, but the clustered version of the queue would have removed that node from its membership (via forget_cluster_node).

auto-remove is now optional, off by default. But if we feel it's unsafe, we could remove it altogether and re-add it once the duplicated-queue scenario is figured out?

@michaelklishin michaelklishin merged commit 52d78e0 into rabbitmq:main Jul 13, 2023
12 checks passed
@michaelklishin (Member):

An important follow-up for environments with a lot of quorum queues: #10193

@kjnilsson (Contributor):

An important follow-up for environments with a lot of quorum queues: #10193

Just to clarify: the change in #10193 isn't directly related to the functionality in this PR.
