Tolerate partition deallocation invariant failures from delete_topic_cmd #7385

Closed
27 changes: 21 additions & 6 deletions src/v/cluster/scheduling/allocation_node.cc

@@ -11,6 +11,8 @@
 
 #include "cluster/scheduling/allocation_node.h"
 
+#include "cluster/logger.h"
+
 #include <fmt/ranges.h>
 
 namespace cluster {
@@ -58,7 +60,9 @@ allocation_node::allocate(const partition_allocation_domain domain) {
 }
 
 void allocation_node::deallocate_on(
-  ss::shard_id core, const partition_allocation_domain domain) {
+  ss::shard_id core,
+  const partition_allocation_domain domain,
+  const deallocation_error_policy error_policy) {
     vassert(
       core < _weights.size(),
       "Tried to deallocate a non-existing core:{} - {}",
@@ -73,13 +77,24 @@ void allocation_node::deallocate_on(
     allocation_capacity& domain_partitions
       = _allocated_domain_partitions[domain];
     vassert(
-      domain_partitions > allocation_capacity{0}
-        && domain_partitions <= _allocated_partitions,
-      "Unable to deallocate partition from core {} in domain {} at node {}",
-      core,
+      domain_partitions <= _allocated_partitions,

@travisdowns (Member) commented on Nov 18, 2022:

Can't this remaining condition still fail in the same way? E.g., domain 0 has 0 allocated partitions and domain 1 has 1. So we have { 0, 1 } for domain allocated and 1 global allocated.

Then a dealloc request comes in for domain 0: with the relaxed policy, we will now skip the --domain_partitions but we will still do --_allocated_partitions. So now we (still) have { 0, 1 } domain allocated but 0 global allocated, and this assert will trigger when we start processing domain 1? (A standalone sketch of this failure sequence follows below.)

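A minimal standalone sketch of the failure sequence described above, with plain integers standing in for allocation_capacity (the variable names are illustrative, not Redpanda's):

#include <array>
#include <cassert>
#include <cstdint>

int main() {
    // Per-domain counters: domain 0 holds 0 partitions, domain 1 holds 1,
    // and the global counter holds the sum.
    std::array<uint32_t, 2> domain_allocated{0, 1};
    uint32_t allocated_total = 1;

    // Relaxed deallocation against the empty domain 0: the per-domain
    // decrement is skipped, but the global counter is still decremented.
    if (domain_allocated[0] > 0) {
        --domain_allocated[0];
    } // else: the relaxed policy only logs a warning
    --allocated_total; // now 0

    // A later deallocation in domain 1 sees domain_allocated[1] == 1 while
    // allocated_total == 0, violating the remaining invariant
    // domain_partitions <= _allocated_partitions.
    assert(domain_allocated[1] > allocated_total); // the vassert would fire
    return 0;
}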
(PR author, Contributor) replied:

I've been thinking about that since I posted this PR. I think a much better way to go here would be the following (roughly the shape sketched below):

  • In the !(domain_partitions > allocation_capacity{0}) case, do not change the allocation_node state at all. The most probable reason for this condition is a repeated delete-topic operation, and while tolerating a repeated deletion it is not a good idea to update either the weights or _allocated_partitions.
  • Do not relax any other invariants until we observe a specific case where relaxing them would actually help.

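A rough sketch of that alternative (an assumption about the follow-up, not code from this PR), reusing the member names visible in the diff above and returning early without touching any counters:

void allocation_node::deallocate_on(
  ss::shard_id core,
  const partition_allocation_domain domain,
  const deallocation_error_policy error_policy) {
    allocation_capacity& domain_partitions
      = _allocated_domain_partitions[domain];
    if (domain_partitions == allocation_capacity{0}) {
        // Most likely a repeated topic deletion: crash under the strict
        // policy, otherwise warn and leave _weights, _allocated_partitions
        // and the domain counter untouched.
        vassert(
          error_policy == deallocation_error_policy::relaxed,
          "no partition in domain {} to deallocate on core {}",
          domain,
          core);
        vlog(
          clusterlog.warn,
          "no partition in domain {} to deallocate on core {}",
          domain,
          core);
        return;
    }
    --domain_partitions;
    --_allocated_partitions;
    --_weights[core];
}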
"Allocation node data inconsistency: not enough _allocated_partitions. "
"Domain: {}, core: {}, node: {}",
domain,
core,
*this);
--domain_partitions;
if (likely(domain_partitions > allocation_capacity{0})) {
--domain_partitions;
} else {
const std::string msg = fmt::format(
"Allocation node data inconsisteny: no partition in domain {} to "
"deallocate. Core: {}, node: {}",
domain,
core,
*this);
vassert(error_policy == deallocation_error_policy::relaxed, "{}", msg);
vlog(clusterlog.warn, "{}", msg);
}

_allocated_partitions--;
_weights[core]--;
Expand Down
6 changes: 5 additions & 1 deletion src/v/cluster/scheduling/allocation_node.h

@@ -30,6 +30,7 @@ class allocation_node {
     enum class state { active, decommissioned, deleted };
     using allocation_capacity
       = named_type<uint32_t, struct allocation_node_slot_tag>;
+    enum class deallocation_error_policy { strict, relaxed };

@travisdowns (Member) commented:

Please include a comment about what this policy is for, and reference the current issue.

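One possible shape for the requested comment (the wording is a placeholder suggestion; the concrete issue number is not given on this page and is therefore omitted):

// Controls how deallocate_on() reacts when a partition counter would
// underflow, e.g. on a repeated delete_topic command (see the issue this
// PR addresses): `strict` asserts, `relaxed` logs a warning and continues.
enum class deallocation_error_policy { strict, relaxed };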
 
     allocation_node(
       model::node_id,
@@ -122,7 +123,10 @@
 private:
     friend allocation_state;
 
-    void deallocate_on(ss::shard_id core, partition_allocation_domain);
+    void deallocate_on(
+      ss::shard_id core,
+      partition_allocation_domain,
+      deallocation_error_policy);
     void allocate_on(ss::shard_id core, partition_allocation_domain);
 
     model::node_id _id;
8 changes: 5 additions & 3 deletions src/v/cluster/scheduling/allocation_state.cc

@@ -28,7 +28,8 @@ void allocation_state::rollback(
   const std::vector<model::broker_shard>& v,
   const partition_allocation_domain domain) {
     for (auto& bs : v) {
-        deallocate(bs, domain);
+        deallocate(
+          bs, domain, allocation_node::deallocation_error_policy::strict);
     }
 }
 
@@ -157,9 +158,10 @@ bool allocation_state::is_empty(model::node_id id) const {
 
 void allocation_state::deallocate(
   const model::broker_shard& replica,
-  const partition_allocation_domain domain) {
+  const partition_allocation_domain domain,
+  const allocation_node::deallocation_error_policy error_policy) {
     if (auto it = _nodes.find(replica.node_id); it != _nodes.end()) {
-        it->second->deallocate_on(replica.shard, domain);
+        it->second->deallocate_on(replica.shard, domain, error_policy);
     }
 }
 
5 changes: 4 additions & 1 deletion src/v/cluster/scheduling/allocation_state.h

@@ -42,7 +42,10 @@ class allocation_state {
     int16_t available_nodes() const;
 
     // Operations on state
-    void deallocate(const model::broker_shard&, partition_allocation_domain);
+    void deallocate(
+      const model::broker_shard&,
+      partition_allocation_domain,
+      allocation_node::deallocation_error_policy);
     void apply_update(
       std::vector<model::broker_shard>,
       raft::group_id,
6 changes: 4 additions & 2 deletions src/v/cluster/scheduling/partition_allocator.cc

@@ -395,7 +395,8 @@ void partition_allocator::deallocate(
   const partition_allocation_domain domain) {
     for (auto& r : replicas) {
         // find in brokers
-        _state->deallocate(r, domain);
+        _state->deallocate(
+          r, domain, allocation_node::deallocation_error_policy::relaxed);
     }
 }
 
@@ -420,7 +421,8 @@ void partition_allocator::remove_allocations(
   const std::vector<model::broker_shard>& to_remove,
   const partition_allocation_domain domain) {
     for (const auto& bs : to_remove) {
-        _state->deallocate(bs, domain);
+        _state->deallocate(
+          bs, domain, allocation_node::deallocation_error_policy::relaxed);
     }
 }
5 changes: 4 additions & 1 deletion src/v/cluster/scheduling/types.cc

@@ -63,7 +63,10 @@ allocation_units::~allocation_units() {
     for (auto& pas : _assignments) {
         for (auto& replica : pas.replicas) {
             if (!_previous.contains(replica)) {
-                _state->deallocate(replica, _domain);
+                _state->deallocate(
+                  replica,
+                  _domain,
+                  allocation_node::deallocation_error_policy::strict);
             }
         }
     }