
Tolerate partition deallocation invariant failures from delete_topic_cmd #7385

Closed

Conversation

dlex
Contributor

@dlex dlex commented Nov 18, 2022

A controller log may end up with a command sequence that deallocates partitions from a node when the node has no partitions in that domain. This change relaxes the vassert to a logged warning in deallocation scenarios that directly result from delete_topic_cmd, and prevents partition counts from decreasing below zero.

May be a workaround for #7343.
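
For illustration, a hedged, self-contained sketch of the behavior described above. The type and member names (allocation_node_sketch, domain_allocated, allocated_partitions) are stand-ins invented for this example, not the actual redpanda code, and the real patch may differ in detail.

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

enum class deallocation_error_policy { strict, relaxed };

// Stand-in for the per-node allocation bookkeeping discussed in this PR.
struct allocation_node_sketch {
    std::vector<uint32_t> domain_allocated{0, 0}; // partitions allocated per domain
    uint32_t allocated_partitions = 0;            // partitions allocated node-wide

    void deallocate(std::size_t domain, deallocation_error_policy policy) {
        if (domain_allocated[domain] == 0) {
            if (policy == deallocation_error_policy::strict) {
                // strict: keep the hard invariant, as the original vassert does
                assert(false && "deallocating from a domain with no partitions");
            }
            // relaxed: downgrade the failure to a warning, and never let the
            // per-domain counter go below zero
            std::cerr << "warn: deallocation from empty domain " << domain << "\n";
        } else {
            --domain_allocated[domain];
        }
        // the node-wide counter is likewise kept from going below zero
        if (allocated_partitions > 0) {
            --allocated_partitions;
        }
    }
};

Under this reading, the node-wide count can still drift relative to the per-domain counts, which is exactly the concern raised in the review discussion below.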

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

  • none

Release Notes

  • Automatic recovery from certain controller states that caused deletion of non-existing partitions during replay

@piyushredpanda piyushredpanda added this to the v22.3.4 milestone Nov 18, 2022
@travisdowns
Member

@dlex - do you have a theory for how the log could end up in such a state? Could it be that the partition in-memory state gets inconsistent with the log, causing multiple deletion commands to be accepted (because the in-memory state is checked to see if an action makes sense)?

I'm a bit worried about relaxing an assert that guards against a clearly invalid state, though I do agree in this case it might be harmless.

I think it would be nice to get a second set of eyes on this; perhaps @jcsp, @dotnwat, or @mmaslankaprv could oblige.

@@ -30,6 +30,7 @@ class allocation_node {
enum class state { active, decommissioned, deleted };
using allocation_capacity
= named_type<uint32_t, struct allocation_node_slot_tag>;
enum class deallocation_error_policy { strict, relaxed };
Member


Please include a comment explaining what this policy is for, and reference the current issue.

&& domain_partitions <= _allocated_partitions,
"Unable to deallocate partition from core {} in domain {} at node {}",
core,
domain_partitions <= _allocated_partitions,
Member

@travisdowns travisdowns Nov 18, 2022


Can't this remaining condition still fail in the same way? E.g., domain 0 has 0 allocated partitions and domain 1 has 1.

So we have { 0, 1 } for domain allocated and 1 global allocated.

Then a dealloc request comes in for domain 0 (the empty one): with relaxed policy, we will now skip the --domain_partitions but we will still --allocated_partitions. So now we (still) have { 0, 1 } domain allocated but 0 global, and this assert will trigger when we start processing domain 1?
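
For concreteness, a step-by-step trace of that scenario (counter names follow the diff; the relaxed-policy behavior is as read from the PR description):

// start:             domain allocated = { 0, 1 }, _allocated_partitions = 1
// dealloc(domain 0): domain 0 is empty, so the relaxed policy skips
//                    --domain_partitions, but --_allocated_partitions still runs
// after:             domain allocated = { 0, 1 }, _allocated_partitions = 0
// dealloc(domain 1): the remaining check domain_partitions <= _allocated_partitions
//                    evaluates 1 <= 0, and the vassert fires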

Contributor Author


I've been thinking about that since I posted this PR. I think a much better way to go here would be this:

  • In the case of !(domain_partitions > allocation_capacity{0}), do not change the allocation_node state at all (see the sketch after this list). The most probable cause of this condition is a repeated delete-topic operation; while tolerating a repeated deletion, it's not a good idea to update either the weights or allocated_partitions.
  • Do not try to relax any other invariants until we observe a specific case where relaxing them would help.
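
A minimal sketch of the first bullet's idea, reusing the stand-in allocation_node_sketch type from the sketch in the description above; names remain illustrative, not the actual patch:

// Tolerant deallocation that leaves the node state untouched when the domain is empty.
void deallocate_tolerant(allocation_node_sketch& node, std::size_t domain) {
    if (node.domain_allocated[domain] == 0) {
        // most likely a replayed/repeated topic deletion: warn and return without
        // updating any counters (or, in the real code, the weights)
        std::cerr << "warn: ignoring deallocation from empty domain " << domain << "\n";
        return;
    }
    --node.domain_allocated[domain];
    --node.allocated_partitions;
}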

@mmaslankaprv
Member

I would suggest first finding the underlying issue that caused the allocated partition count to go negative. The whole system should be strongly consistent, and other checks should prevent the allocated partition count from going negative. I would suggest not merging this workaround.

@jcsp
Contributor

jcsp commented Nov 21, 2022

Hmm. So we know we have a bug, but we haven't found the source.

From the way #7343 manifested immediately on upgrade (and not on partition creation/deletion), it sounds like there was some content written by an earlier version of redpanda that violated the expectations of our latest code. The question is whether the writer was wrong (in which case we need this tolerant apply() logic), or whether we're doing something wrong while applying updates (in which case we can fix it there without making apply() tolerant).

@dlex
Contributor Author

dlex commented Nov 21, 2022

After taking a fresh look, a highly likely cause of the problem has been found: #7406. Closing this PR as not needed, and also per @mmaslankaprv's suggestion.

or whether we're doing something wrong while applying updates (in which case we can fix it there without making apply() tolerant)

Exactly this.

@dlex dlex closed this Nov 21, 2022