Tolerate partition deallocation invariant failures from delete_topic_cmd #7385
Conversation
A controller log may end up containing a sequence of commands that deallocates partitions from a node that has no partitions in the given domain. This change relaxes the `vassert` to a logged warning in deallocation scenarios that result directly from `delete_topic_cmd`, and prevents partition counts from decreasing below zero. May be a workaround for #7343.
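As a rough illustration of the intended behaviour, here is a minimal sketch of the strict-vs-relaxed pattern (all names and signatures below are assumptions for illustration; this is not the actual Redpanda code):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical policy mirroring the one added in this PR: under `strict`,
// a violated invariant aborts the process (vassert-like); under `relaxed`,
// it is logged and the counter is clamped at zero instead of underflowing.
enum class deallocation_error_policy { strict, relaxed };

struct domain_counter {
    std::uint32_t partitions = 0;

    void deallocate(deallocation_error_policy policy) {
        if (partitions == 0) {
            if (policy == deallocation_error_policy::strict) {
                std::fprintf(stderr, "fatal: no partitions to deallocate\n");
                std::abort(); // strict: treat as a fatal invariant failure
            }
            // relaxed: warn and refuse to decrement below zero
            std::fprintf(
              stderr, "warning: deallocation from an empty domain ignored\n");
            return;
        }
        --partitions;
    }
};
```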
@dlex - do you have a theory about how the log could end up in such a state? Could it be that the partition in-memory state gets inconsistent with the log, causing multiple deletion commands to be accepted (because it checks the in-memory state to see if an action makes sense)? I'm a bit worried about relaxing an assert that guards against a clearly invalid state, though I do agree that in this case it might be harmless. I think it would be nice to get a second set of eyes on this; perhaps @jcsp, @dotnwat or @mmaslankaprv could oblige.
```diff
@@ -30,6 +30,7 @@ class allocation_node {
     enum class state { active, decommissioned, deleted };
     using allocation_capacity
       = named_type<uint32_t, struct allocation_node_slot_tag>;
+    enum class deallocation_error_policy { strict, relaxed };
```
Please include some comments about what this is about and reference the current issue.
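For reference, one possible shape of such a comment (hypothetical wording, not what ultimately landed in the PR):

```cpp
// Controls how allocation_node reacts when a deallocation request violates
// its accounting invariants, e.g. a delete_topic_cmd replayed against a node
// that no longer holds partitions in the domain (see #7343):
//   strict  - vassert and terminate the process (the historical behaviour)
//   relaxed - log a warning and keep the counters non-negative
enum class deallocation_error_policy { strict, relaxed };
```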
```diff
-    && domain_partitions <= _allocated_partitions,
+  domain_partitions <= _allocated_partitions,
   "Unable to deallocate partition from core {} in domain {} at node {}",
   core,
```
Can't this remaining condition still fail in the same way? E.g., domain 0 has 0 allocated partitions and domain 1 has 1.
So we have { 0, 1 } for the per-domain allocations and 1 allocated globally.
Then a dealloc request comes in for domain 0 (the empty one): with the relaxed policy we now skip the `--domain_partitions`, but we still `--allocated_partitions`. So we (still) have { 0, 1 } per-domain but 0 globally, and this assert will trigger when we start processing domain 1?
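This failure mode can be reproduced with a small self-contained simulation (the counter layout below is an assumption for illustration only, not the real `allocation_node` code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <map>

int main() {
    // Domain 0 has 0 partitions, domain 1 has 1; 1 allocated globally.
    std::map<int, std::uint32_t> domain_partitions{{0, 0}, {1, 1}};
    std::uint32_t allocated_partitions = 1;

    // Relaxed deallocation against the empty domain 0: the per-domain
    // decrement is skipped, but the global counter is still decremented.
    if (domain_partitions[0] > 0) {
        --domain_partitions[0];
    } else {
        std::fprintf(stderr, "warning: nothing to deallocate in domain 0\n");
    }
    --allocated_partitions; // global count drops to 0

    // A later, legitimate deallocation from domain 1 now trips the remaining
    // invariant: domain_partitions (1) > allocated_partitions (0).
    assert(domain_partitions[1] <= allocated_partitions); // fires
    return 0;
}
```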
I've been thinking about that since I posted this PR. I think a much better way to go here would be this:

- In the case of `!(domain_partitions > allocation_capacity{0})`, do not change the `allocation_node` state at all. The most probable reason for this condition is a repeated delete topic operation, and while tolerating a repeated deletion, it's not a good idea to update either `weights` or `allocated_partitions`. See the sketch after this list.
- Do not try to relax any other invariants until we observe a specific case where relaxing them would help.
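A sketch of that first point, assuming simplified counters (illustrative names only, not the real `allocation_node` members):

```cpp
#include <cstdint>
#include <cstdio>

struct allocation_node_sketch {
    std::uint32_t domain_partitions = 0;
    std::uint32_t allocated_partitions = 0;

    void deallocate_from_domain() {
        // Most likely a replayed delete_topic_cmd: warn and return without
        // touching weights, domain counters, or the global count.
        if (domain_partitions == 0) {
            std::fprintf(
              stderr, "warning: ignoring deallocation from an empty domain\n");
            return;
        }
        --domain_partitions;
        --allocated_partitions;
    }
};
```

Because the global counter is only touched when the domain counter is, a replayed deletion cannot introduce the global/per-domain inconsistency demonstrated above.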
I would suggest first finding the underlying issue that caused the allocated partition count to go negative. The whole system should be strongly consistent, and other checks should prevent the allocated partition count from going negative. I would suggest not merging this workaround.
Hmm. So we know we have a bug, but we haven't found the source. From the way #7343 manifested immediately on upgrade (and not on partition creation/deletion), it sounds like some content written by an earlier version of redpanda violated the expectations of our latest code. The question is whether the writer was wrong (in which case we need this tolerant apply() logic), or whether we're doing something wrong while applying updates (in which case we can fix it there without making apply() tolerant).
After taking a fresh look, a highly likely cause of the problem has been found: #7406. Closing this one as not needed, and also per @mmaslankaprv's suggestion.
Exactly this.