Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update abort_old_txes of the consumer group to use locks #11566

Closed
wants to merge 1 commit into from

Conversation

rystsov
Copy link
Contributor

@rystsov rystsov commented Jun 20, 2023

An execution of abort_old_txes could span multiple terms so the so the method could modify new state assuming it's the old state resulting in undefined behavior.

This commit is the rewrite of the reverted f7fc026f in 11474. The problem was caused by:

  • do_detach_partition got blocked
  • RP ignored blocked do_detach_partition and attempted next op leading double registration of the consumer groups ntp

The op was blocked by a deadlock:

  • do_abort_old_txes was waiting for read lock while holding _gate
  • do_detach_partition was holding write lock while waiting to the gate to be closed

This version doesn't wait for the read lock to become available and exit do_abort_old_txes releasing the _gate.

It still isn't clear why RP ignored a blocked op

Fixes #11562

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

  • none

An execution of abort_old_txes could span multiple terms so the so the
method could modify new state assuming it's the old state resulting in
undefined behavior.

This commit is the rewrite of the reverted f7fc026 in 11474. The pro-
blem was caused by:

  - do_detach_partition got blocked
  - RP ignored blocked do_detach_partition and attempted next op leading
    double registration of the consumer groups ntp

The op was blocked by a deadlock:

  - do_abort_old_txes was waiting for read lock while holding _gate
  - do_detach_partition was holding write lock while waiting to the
    gate to be closed

This version doesn't wait for the read lock to become available and exit
do_abort_old_txes releasing the _gate.

It still isn't clear why RP ignored a blocked op

co_await shutdown_groups(std::move(groups_for_shutdown));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question.. why should shutdown_groups() be within catchup_writelock scope? Does the following break something?


std::vector<group_ptr> groups_for_shutdown;
{
   auto units = co_await p->catchup_lock.hold_write_lock();

    // Becasue shutdown group is async operation we should run it after
    // rehash for groups map
 
    for (auto g_it = _groups.begin(); g_it != _groups.end();) {
        if (g_it->second->partition()->ntp() == p->partition->ntp()) {
            groups_for_shutdown.push_back(g_it->second);
            _groups.erase(g_it++);
            continue;
        }
        ++g_it;
    }
    // if p has background work that won't complete without an abort being
    // requested, then do that now because once the partition is removed from
    // the _partitions container it won't be available in group_manager::stop.
    if (!p->as.abort_requested()) {
        p->as.request_abort();
    }
    _partitions.erase(ntp);
    _partitions.rehash(0);
}
co_await shutdown_groups(std::move(groups_for_shutdown));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most likely nothing but it takes effort to reason so I prefer to leave it as is. The PR doesn't introduce this lock there so I think it's safer to left it intact.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is related to the deadlock.. AFAICT, moving shutdown_groups() out of write scope removes the nested lock/gate dependency and fixes the deadlock, worth exploring.

@rystsov
Copy link
Contributor Author

rystsov commented Jun 23, 2023

Closing the PR in favor of #11621

@rystsov rystsov closed this Jun 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reimplement "txn: update abort_old_txes to use locks" without a bug
2 participants