
Reconcile (repair or expand) quorum queue membership periodically #8218

Merged

Conversation

@SimonUnge (Collaborator) commented May 17, 2023

Proposed Changes

This is a subset PR for issue #7209.

Add a process that periodically evaluates queue node membership. Currently this covers only quorum queues, but the idea is that the same process would eventually also evaluate streams.

The high-level algorithm is:

  • A node periodically evaluates local leader queue membership
    • The interval defaults to 60 minutes
  • The node loops over all local leader queues, fetching each queue's members
    • It checks the member list against rabbit_nodes:list_members() to find 'dead' members. This should be rare, but could happen if the forget_cluster_node command fails before it 'shrinks' a queue. If there are 'dead' members, the shrink command is executed for those dead members
    • It checks the member list against new cluster members by comparing it with rabbit_nodes:list_running(). If the following criteria are met, the process will add one new node as a member to the quorum queue:
      • The queue has not met its target membership count (based on the target-group-size policy and config)
      • There are enough new nodes in the cluster for the queue to reach its target membership count
      • The queue's current members are up and running
      • The queue's current members are 'reasonably' caught up [not implemented]
  • If the process either adds or removes members during the evaluation, the interval/timeout before the next evaluation is lowered (currently to 10 seconds), as there might be more work to do, e.g. adding more members, since the code only adds one member at a time
  • If the process does nothing, i.e. a no-op, the interval/timeout is the default 60 minutes
  • If a node is added to or removed from the cluster, the interval/timeout is lowered, currently to 10 seconds
  • If there are policy changes, the interval/timeout is lowered
  • If a queue is created, the interval is lowered (a sketch of one reconciliation pass follows this list)
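
To make the flow above concrete, here is a rough Erlang sketch of a single reconciliation pass. Only the rabbit_nodes:list_members/0 and rabbit_nodes:list_running/0 calls are taken from the description above; the qq_helpers module and its functions are hypothetical placeholders, not the actual implementation in this PR.

%% Illustrative sketch of one reconciliation pass (not the merged code).
-module(qq_reconciliation_sketch).
-export([reconcile_local_leaders/0]).

reconcile_local_leaders() ->
    ClusterNodes = rabbit_nodes:list_members(),   %% all known cluster nodes
    RunningNodes = rabbit_nodes:list_running(),   %% nodes currently running
    Results = [reconcile_queue(Q, ClusterNodes, RunningNodes)
               || Q <- qq_helpers:local_leader_queues()],
    %% If anything changed, re-check soon; otherwise wait the full interval.
    case lists:member(changed, Results) of
        true  -> {next_check_after, timer:seconds(10)};
        false -> {next_check_after, timer:minutes(60)}
    end.

reconcile_queue(Q, ClusterNodes, RunningNodes) ->
    Members = qq_helpers:member_nodes(Q),
    case Members -- ClusterNodes of
        [_ | _] = Dead ->
            %% Members on nodes that are no longer cluster members: shrink.
            qq_helpers:shrink(Q, Dead),
            changed;
        [] ->
            Target     = qq_helpers:target_group_size(Q), %% policy/argument/config
            Candidates = RunningNodes -- Members,
            AllUp      = (Members -- RunningNodes) =:= [],
            Enough     = length(Members) + length(Candidates) >= Target,
            case length(Members) < Target andalso Enough andalso AllUp of
                true ->
                    %% Grow by a single member; the shortened next pass can
                    %% add more if the target is still not reached.
                    qq_helpers:grow(Q, hd(Candidates)),
                    changed;
                false ->
                    noop
            end
    end.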

Configuration and policies

Static configuration

quorum_queue.continuous_membership_reconciliation.enabled = true|false

quorum_queue.continuous_membership_reconciliation.auto_remove = true|false

quorum_queue.continuous_membership_reconciliation.interval = non-negative integer

quorum_queue.continuous_membership_reconciliation.trigger_interval = non-negative integer

quorum_queue.continuous_membership_reconciliation.target_group_size = non-negative integer

Policy and argument

Policy (user and operator): target-group-size = non-negative integer
Argument: x-quorum-target-group-size = non-negative integer
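
For illustration, a combined configuration example (all values are placeholders; the interval settings are assumed here to be in milliseconds, i.e. 60 minutes and 10 seconds, and should be checked against the final schema):

quorum_queue.continuous_membership_reconciliation.enabled = true
quorum_queue.continuous_membership_reconciliation.auto_remove = false
quorum_queue.continuous_membership_reconciliation.interval = 3600000
quorum_queue.continuous_membership_reconciliation.trigger_interval = 10000
quorum_queue.continuous_membership_reconciliation.target_group_size = 3

A matching policy could be set with the standard CLI (policy name and pattern are arbitrary examples):

rabbitmqctl set_policy --apply-to queues qq-target-size "^qq\." '{"target-group-size": 3}'

Alternatively, a queue can be declared with the x-quorum-target-group-size argument set to the desired size.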

Tasks

  • Check to react to policy changes
  • Logic to pick suitable nodes as new member nodes for the new-nodes list (e.g. current load, availability zone, etc.)
  • Configuration to enable the feature
  • Configure timeout values
  • Add configuration target-group-size
  • Policies (operator and user). The merge strategy uses max.

Future work

  • Check to evaluate if members are reasonably caught up
  • Check to evaluate if the log size is 'too big'

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

The design was in collaboration with @kjnilsson.

An alternative solution, investigated and partly implemented, was to add functionality to the RA library: a new callback that triggered once RA noticed a node joining or leaving the cluster, asking the 'client' to evaluate a queue's membership. That solution was 'per queue', whereas this PR is per node. There were worries that RA would be doing too many non-Raft tasks, so the idea to update RA was dropped.

@mergify mergify bot added the bazel label May 17, 2023
@michaelklishin (Member):

If a branch isn't ready, please do not submit PRs without a title here. You can submit it in your own fork where everyone should be able to review it.

@SimonUnge (Collaborator, Author) commented May 19, 2023

@michaelklishin sorry, I created it for comments from Karl and thought it was fine to create it in a DRAFT state. I will update the PR today though, as requested by Karl. But yeah, the title could have been better, even for a draft.

@SimonUnge SimonUnge changed the title Work in progress Evaluate quorum queue membership periodically May 19, 2023
@SimonUnge SimonUnge marked this pull request as ready for review May 19, 2023 17:46
@kjnilsson (Contributor) left a review comment:

Looks really good. Some smaller nitpicks and suggestions - time to add some tests! :)

(Five review threads on deps/rabbit/src/rabbit_queue_member_eval.erl; outdated, resolved.)
@SimonUnge (Collaborator, Author):

Looks really good. Some smaller nitpicks and suggestions - time to add some tests! :)

Tests added!

@SimonUnge SimonUnge requested a review from kjnilsson May 30, 2023 23:15
@kjnilsson (Contributor) commented May 31, 2023 via email

@SimonUnge SimonUnge force-pushed the eval_membership_stand_alone_process branch 3 times, most recently from 381df01 to 41405a1 Compare June 2, 2023 18:23
(Four review threads on deps/rabbit/src/rabbit_queue_member_eval.erl; outdated, resolved.)
@SimonUnge SimonUnge force-pushed the eval_membership_stand_alone_process branch 2 times, most recently from f2dbb36 to 8d80817 Compare June 6, 2023 19:46
@kjnilsson kjnilsson added this to the 3.13.0 milestone Jun 7, 2023
@ansd (Member) commented Jun 7, 2023

I wonder whether node maintenance should be taken into account such that no quorum queue replicas are changed if a node upgrade takes a long time. rabbit_maintenance:is_being_drained_*/1 should tell (thanks @dcorbacho).

@SimonUnge (Collaborator, Author) commented Jun 7, 2023

@ansd @kjnilsson

I wonder whether node maintenance should be taken into account such that no quorum queue replicas are changed if a node upgrade takes a long time. rabbit_maintenance:is_being_drained_*/1 should tell (thanks @dcorbacho).

Good idea. Should 'all' nodes be checked - that is, the leader node, the current members, and the potential new nodes - or would it be enough to just check leader and potential new nodes?

filter_out_drained_nodes_local_read/1 seems to be a good function to use.

@SimonUnge (Collaborator, Author), following up on the above:

An attempt to check whether nodes are in maintenance mode has been added; the logic now checks the leader node and the potential new members.
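
For reference, a minimal sketch of how that filtering might be wired, assuming filter_out_drained_nodes_local_read/1 lives in rabbit_maintenance and using is_being_drained_local_read/1 as one of the is_being_drained_*/1 variants mentioned above; qq_helpers:grow/2 is a hypothetical placeholder, not code from this PR:

%% Sketch only: skip growth while the leader node is being drained, and
%% never pick a drained node as the new member.
maybe_grow(Q, CandidateNodes) ->
    case rabbit_maintenance:is_being_drained_local_read(node()) of
        true ->
            skipped;   %% this (leader) node is under maintenance
        false ->
            case rabbit_maintenance:filter_out_drained_nodes_local_read(CandidateNodes) of
                []        -> noop;
                [New | _] -> qq_helpers:grow(Q, New)
            end
    end.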

@SimonUnge SimonUnge force-pushed the eval_membership_stand_alone_process branch 2 times, most recently from f005c57 to 00c3fb7 Compare June 14, 2023 16:28
(Review thread on deps/rabbit/priv/schema/rabbit.schema; outdated, resolved.)
@@ -175,6 +175,12 @@
{requires, [rabbit_alarm, guid_generator]},
{enables, core_initialized}]}).

-rabbit_boot_step({rabbit_queue_member_eval,
Contributor review comment on this hunk:

Could we not just have the process as static configuration?

@SimonUnge (Collaborator, Author) replied Jun 26, 2023:

I tried to 'follow suit' from other supervised processes and creation of supervisors, so piggybacked on that code.

But I have no objections to changing it to something else.

(Review threads on deps/rabbit/src/rabbit.erl, deps/rabbit/src/rabbit_queue_member_eval.erl, and deps/rabbit/src/rabbit_quorum_queue.erl; resolved.)
@SimonUnge SimonUnge force-pushed the eval_membership_stand_alone_process branch 2 times, most recently from 65629bd to f0e6de2 Compare June 27, 2023 22:08
@SimonUnge SimonUnge requested a review from kjnilsson June 30, 2023 17:54
@michaelklishin (Member) left a review comment:

This definitely needs naming work, both for user-facing and internal (e.g. module and function) names.

It also has some overlap with the queue leader location (even though the algorithm that picks an initial leader does not have to be identical to those adding extra replicas). Hopefully this will be organized into a similar system of pluggable modules before 3.13.0.

(Review threads on deps/rabbit/priv/schema/rabbit.schema, deps/rabbit/src/rabbit.erl, and deps/rabbit/src/rabbit_queue_member_eval.erl; outdated, resolved.)
@michaelklishin (Member) left a review comment:

I'd rename this feature to "member reconciliation" or "member repair" because it's not obvious to me what "member evaluation" means. "Reconciliation" is a term used by Akka, and "repair" by Riak and Cassandra.

@michaelklishin (Member) left a review comment:

The naming part was addressed, now we can make the leader selection algorithm a little bit more efficient (and match what happens during the initial leader placement), and perhaps do one last round of testing.

@michaelklishin (Member):

Apparently there is no way for me to "clear" my earlier review with a "just comment", I have to explicitly approve or the PR will keep saying I've requested changes. Oh well. This is an "approval in anticipation" I guess :)

@michaelklishin (Member):

I have resolved a conflict with #8790.

@mkuratczyk (Contributor):

I found a bug in the implementation. It could be specific to a node reset but resets happen.

With this config file:

quorum_queue.continuous_membership_reconciliation.enabled = true
cluster_formation.target_cluster_size_hint = 3

Given these steps:

  1. start-cluster (make/bazel)
  2. import 10k-quorum-queues (https://github.com/rabbitmq/sample-configs/blob/main/queues/10k-quorum-queues.json)
  3. rabbitmqctl -n rabbit-2 stop_app; rabbitmqctl -n rabbit-2 reset; rabbitmqctl -n rabbit-2 join_cluster rabbit-0@$(hostname); rabbitmqctl -n rabbit-2 start_app

Node rabbit-2 starts and looks like it has the queues but it doesn't:

  • some queues may show +1, some +2, but either way they are broken
    (screenshot of the management UI from 2023-07-10 omitted)
  • logs on the leader
2023-07-10 18:17:55.681223+02:00 [debug] <0.132220.0> queue 'qq4367' in vhost '/': committing ra cluster change to [{'%2F_qq4367','rabbit-0@mkuratczykPF0JR'},{'%2F_qq4367','rabbit-1@mkuratczykPF0JR'}]
  • logs on the reset node
2023-07-10 18:18:27.376516+02:00 [warning] <0.1405668.0> Quorum queue recovery: configured member of queue 'qq1534' in vhost '/' was not found on this node. Starting member as a new one. Context: name_not_registered
  • queue record (only two nodes mentioned)
(rabbit-0@mkuratczykPF0JR)3> rabbit_amqqueue:lookup({resource,<<"/">>,queue,<<"qq7564">>}).
{ok,{amqqueue,{resource,<<"/">>,queue,<<"qq7564">>},
              true,false,none,
              [{<<"x-queue-type">>,longstr,<<"quorum">>}],
              {'%2F_qq7564','rabbit-0@mkuratczykPF0JR'},
              [],[],[],undefined,undefined,[],[],live,0,[],<<"/">>,
              #{user => <<"rmq-internal">>},
              rabbit_quorum_queue,
              #{nodes =>
                    ['rabbit-0@mkuratczykPF0JR','rabbit-1@mkuratczykPF0JR']}}}
  • quorum status
> rabbitmq-queues -n rabbit-0 quorum_status qq1010
Status of quorum queue qq1010 on node rabbit-0@mkuratczykPF0JR ...
┌──────────────────────────┬────────────┬───────────┬──────────────┬────────────────┬──────┬─────────────────┐
│ Node Name                │ Raft State │ Log Index │ Commit Index │ Snapshot Index │ Term │ Machine Version │
├──────────────────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit-0@mkuratczykPF0JR │ leader     │ 3         │ 3            │ undefined      │ 1    │ 3               │
├──────────────────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit-2@mkuratczykPF0JR │ follower   │ 0         │ 0            │ undefined      │ 0    │ 3               │
├──────────────────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit-1@mkuratczykPF0JR │ follower   │ 3         │ 3            │ undefined      │ 1    │ 3               │
└──────────────────────────┴────────────┴───────────┴──────────────┴────────────────┴──────┴─────────────────┘

TL;DR, rabbit-2 is running a member of a queue of the same name, but it's actually not clustered with the other two members

@mkuratczyk (Contributor):

Can I also ask for another name change:
quorum_queue.continuous_membership_reconciliation.default_timeout -> quorum_queue.continuous_membership_reconciliation.interval

@SimonUnge (Collaborator, Author) commented Jul 10, 2023

cluster_formation.target_cluster_size_hint = 3

Looking into it. But this config is wrong; it will not trigger the auto-growth code.
Does the same issue happen with quorum_queue.continuous_membership_reconciliation.target_group_size = 3?
@mkuratczyk

@SimonUnge (Collaborator, Author) commented Jul 10, 2023

@mkuratczyk I do see the same 'issue' on the main branch, i.e. without the reconcile code. Or rather, I see the issue on the reset node, i.e. the Context: name_not_registered log, with the same status for the queue:

rabbitmq-queues -n rabbit-2 quorum_status q1                                                                                                           
Status of quorum queue q1 on node rabbit-2@147ddad209ac ...                                                                                                                                  
┌───────────────────────┬────────────┬───────────┬──────────────┬────────────────┬──────┬─────────────────┐                                                                                  
│ Node Name             │ Raft State │ Log Index │ Commit Index │ Snapshot Index │ Term │ Machine Version │                                                                                  
├───────────────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤                                                                                  
│ rabbit-0@147ddad209ac │ leader     │ 2         │ 2            │ undefined      │ 1    │ 3               │                                                                                  
├───────────────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤                                                                                  
│ rabbit-2@147ddad209ac │ follower   │ 0         │ 0            │ undefined      │ 0    │ 3               │                                                                                  
├───────────────────────┼────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤                                                                                  
│ rabbit-1@147ddad209ac │ follower   │ 2         │ 2            │ undefined      │ 1    │ 3               │                                                                                  
└───────────────────────┴────────────┴───────────┴──────────────┴────────────────┴──────┴─────────────────┘ 

Since you reset the node, it 'should' be automatically 'cleaned up' by my code. But that takes 10 seconds to trigger the check after the node was stopped.

I would think you normally run 'forget_cluster_node' when you remove a node like you did though, so this issue should not happen.

Not sure what a good solution would be here though? @kjnilsson any ideas?

@mkuratczyk (Contributor):

Thanks, I checked on main before I fully understood the problem, and I thought the problem only occurs when I see +1 in the Management UI (which, for some reason, never happened to me on main). We'll look into this issue separately then.

@SimonUnge (Collaborator, Author):

Thanks, I checked on main before I fully understood the problem and I thought the the problem only occurs when I see +1 in the Management UI (which, for some reason, never happened to me on main). We'll look into this issue separately then

Yeah, the +1 I guess happened because some queues did start to get the auto-clean triggered. But since you did not have target-group-size specified, they did not grow back to 3.

@SimonUnge (Collaborator, Author):

Can I also ask for another name change:
quorum_queue.continuous_membership_reconciliation.default_timeout -> quorum_queue.continuous_membership_reconciliation.interval

How about
quorum_queue.continuous_membership_reconciliation.interval for the default long interval and quorum_queue.continuous_membership_reconciliation.trigger_interval for the short interval? Or like... I don't know, recheck_interval... event_interval...

@michaelklishin (Member):

quorum_queue.continuous_membership_reconciliation.interval and quorum_queue.continuous_membership_reconciliation.trigger_interval sound good to me.

@kjnilsson (Contributor):

I am wondering if we should perhaps not (at least not by default) remove members automatically. It feels like removals are the most likely to cause unexpected issues.

@SimonUnge (Collaborator, Author):

@kjnilsson Yes, I'll make it optional, off by default. It does not hurt to have the flexibility.

That said, what is the expected behavior for @mkuratczyk's example above, when you do 'stop_app, reset, join_cluster, start_app' (regardless of whether you have this feature or not)? Is it basically a case of 'Doctor, it hurts when I do this' and the response is 'Then don't do that'? I.e., don't just reset; run forget_cluster_node first?

@mkuratczyk (Contributor):

I've played with this a bit more. On main, the main issue I mentioned doesn't actually happen, but you're also right that it is a problem in the steps I performed. I guess this should be prevented on a different level.

  • on main: after stop/reset/join/start, the node starts, the QQs remember it and start followers on it
  • on the PR branch: the same steps lead to the member being removed from the QQs; then the node starts but, for some reason, decides to start the queues anyway, and they are not clustered with the other two nodes (so we effectively have two Ra clusters of the same name: one 2-node and one single-node, with the queue record pointing at the 2-node one)
  • if I have the target size configured, this gets fixed through reconciliation
  • if I run forget_cluster_node before reset, the node starts without any QQ processes (see the command sequence sketched below)

I think the main conclusion is that we should probably make reset perform the forget steps. But this PR makes this more urgent :)
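
For completeness, the removal sequence that avoids the orphaned local member would presumably look like this (node names mirror the reproduction steps above; this is an illustration, not a tested recipe):

rabbitmqctl -n rabbit-2 stop_app
rabbitmqctl -n rabbit-0 forget_cluster_node rabbit-2@$(hostname)
rabbitmqctl -n rabbit-2 reset
rabbitmqctl -n rabbit-2 join_cluster rabbit-0@$(hostname)
rabbitmqctl -n rabbit-2 start_app

With forget_cluster_node run from a remaining node before the reset, the quorum queues drop the departing member cluster-wide, so the rejoining node does not come back with stale local queue state.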

@SimonUnge SimonUnge force-pushed the eval_membership_stand_alone_process branch from 9886443 to 6b15c57 Compare July 11, 2023 19:23
@SimonUnge SimonUnge force-pushed the eval_membership_stand_alone_process branch from 6b15c57 to 559a83d Compare July 11, 2023 20:14
@SimonUnge (Collaborator, Author):

@michaelklishin I think most name changes are done, auto-remove is now optional (off by default), and the branch has been rebased into one commit.

@michaelklishin (Member):

We still would like to address one (fairly degenerate) scenario described above. There are several options, some are not mutually exclusive. We will file a new issue for that.

@SimonUnge (Collaborator, Author):

We still would like to address one (fairly degenerate) scenario described above. There are several options, some are not mutually exclusive. We will file a new issue for that.

Yes, I'd assume the same scenario would happen if you did the following operations on 'node 2':

stop_app, reset, run forget_cluster_node, join_cluster, start_app

node 2 would then still have the queue, but the clustered version of the queue would have removed that node from its membership (via forget_cluster_node).

auto-remove is now optional, off by default. But if we feel it's unsafe, we could remove it altogether and re-add it once the duplicated-queue scenario is figured out?

@michaelklishin michaelklishin merged commit 52d78e0 into rabbitmq:main Jul 13, 2023
12 checks passed
@michaelklishin (Member):

An important follow-up for environments with a lot of quorum queues: #10193

@kjnilsson (Contributor):

An important follow-up for environments with a lot of quorum queues: #10193

Just to clarify: the change in #10193 isn't directly related to the functionality in this PR.
