
Tolerate partition deallocation invariant failures from delete_topic_cmd #7385

Closed

Conversation

dlex
Contributor

@dlex dlex commented Nov 18, 2022

A controller log may end up with a command sequence that deallocates partitions from a node when the node has no partitions in that domain. This change relaxes the vassert to a logged warning in deallocation scenarios that directly result from delete_topic_cmd, and prevents partition counts from decreasing below zero.

May be a workaround for #7343.
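
For illustration, a hedged, self-contained sketch of the behavior described above. The type and member names (allocation_node_sketch, domain_allocated, allocated_partitions) are stand-ins invented for this example, not the actual redpanda code, and the real patch may differ in detail.

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

enum class deallocation_error_policy { strict, relaxed };

// Stand-in for the per-node allocation bookkeeping discussed in this PR.
struct allocation_node_sketch {
    std::vector<uint32_t> domain_allocated{0, 0}; // partitions allocated per domain
    uint32_t allocated_partitions = 0;            // partitions allocated node-wide

    void deallocate(std::size_t domain, deallocation_error_policy policy) {
        if (domain_allocated[domain] == 0) {
            if (policy == deallocation_error_policy::strict) {
                // strict: keep the hard invariant, as the original vassert does
                assert(false && "deallocating from a domain with no partitions");
            }
            // relaxed: downgrade the failure to a warning, and never let the
            // per-domain counter go below zero
            std::cerr << "warn: deallocation from empty domain " << domain << "\n";
        } else {
            --domain_allocated[domain];
        }
        // the node-wide counter is likewise kept from going below zero
        if (allocated_partitions > 0) {
            --allocated_partitions;
        }
    }
};

Under this reading, the node-wide count can still drift relative to the per-domain counts, which is exactly the concern raised in the review discussion below.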

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

  • none

Release Notes

  • Automatic recovery from certain controller states that caused deletion of non-existing partitions during replay

@piyushredpanda piyushredpanda added this to the v22.3.4 milestone Nov 18, 2022
@travisdowns
Member

@dlex - do you have a theory for how the log could end up in such a state? Could it be that the partition in-memory state gets inconsistent with the log, causing multiple deletion commands to be accepted (because the in-memory state is checked to see if an action makes sense)?

I'm a bit worried about relaxing an assert that guards against a clearly invalid state, though I do agree in this case it might be harmless.

I think it would be nice to get a second set of eyes on this; perhaps @jcsp, @dotnwat, or @mmaslankaprv could oblige.

@@ -30,6 +30,7 @@ class allocation_node {
enum class state { active, decommissioned, deleted };
using allocation_capacity
= named_type<uint32_t, struct allocation_node_slot_tag>;
enum class deallocation_error_policy { strict, relaxed };
Member


Please include a comment explaining what this policy is for, and reference the current issue.

&& domain_partitions <= _allocated_partitions,
"Unable to deallocate partition from core {} in domain {} at node {}",
core,
domain_partitions <= _allocated_partitions,
Member

@travisdowns travisdowns Nov 18, 2022


Can't this remaining condition still fail in the same way? E.g., domain 0 has 0 allocated partitions and domain 1 has 1.

So we have { 0, 1 } for domain allocated and 1 global allocated.

Then a dealloc request comes in for domain 0 (the empty one): with relaxed policy, we will now skip the --domain_partitions but we will still --allocated_partitions. So now we (still) have { 0, 1 } domain allocated but 0 global, and this assert will trigger when we start processing domain 1?
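
For concreteness, a step-by-step trace of that scenario (counter names follow the diff; the relaxed-policy behavior is as read from the PR description):

// start:             domain allocated = { 0, 1 }, _allocated_partitions = 1
// dealloc(domain 0): domain 0 is empty, so the relaxed policy skips
//                    --domain_partitions, but --_allocated_partitions still runs
// after:             domain allocated = { 0, 1 }, _allocated_partitions = 0
// dealloc(domain 1): the remaining check domain_partitions <= _allocated_partitions
//                    evaluates 1 <= 0, and the vassert fires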

Contributor Author


I've been thinking about that since I posted this PR. I think a much better way to go here would be this:

  • In the case of !(domain_partitions > allocation_capacity{0}), do not change the allocation_node state at all (see the sketch after this list). The most probable cause of this condition is a repeated delete-topic operation; while tolerating a repeated deletion, it's not a good idea to update either the weights or allocated_partitions.
  • Do not try to relax any other invariants until we observe a specific case where relaxing them would help.
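
A minimal sketch of the first bullet's idea, reusing the stand-in allocation_node_sketch type from the sketch in the description above; names remain illustrative, not the actual patch:

// Tolerant deallocation that leaves the node state untouched when the domain is empty.
void deallocate_tolerant(allocation_node_sketch& node, std::size_t domain) {
    if (node.domain_allocated[domain] == 0) {
        // most likely a replayed/repeated topic deletion: warn and return without
        // updating any counters (or, in the real code, the weights)
        std::cerr << "warn: ignoring deallocation from empty domain " << domain << "\n";
        return;
    }
    --node.domain_allocated[domain];
    --node.allocated_partitions;
}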

@mmaslankaprv
Member

I would suggest first finding the underlying issue that caused the allocated partition count to go negative. The whole system should be strongly consistent, and other checks should prevent the allocated partition count from going negative. I would suggest not merging this workaround.

@jcsp
Contributor

jcsp commented Nov 21, 2022

Hmm. So we know we have a bug, but we haven't found the source.

From the way #7343 manifested immediately on upgrade (and not on partition creation/deletion), it sounds like there was some content written by an earlier version of redpanda that violated the expectations of our latest code. The question is whether the writer was wrong (in which case we need this tolerant apply() logic), or whether we're doing something wrong while applying updates (in which case we can fix it there without making apply() tolerant).

@dlex
Contributor Author

dlex commented Nov 21, 2022

After taking a fresh look, a highly likely cause of the problem has been found: #7406. Closing this PR as not needed, and also per @mmaslankaprv's suggestion.

or whether we're doing something wrong while applying updates (in which case we can fix it there without making apply() tolerant)

Exactly this.

@dlex dlex closed this Nov 21, 2022