[v24.2.x] k/group_manager: return not_coordinator quickly in tx operations #23189

Commits on Sep 4, 2024

  1. k/group_manager: return not_coordinator quickly in tx operations

    group_manager::attached_partition::catchup_lock can get blocked for
    extended periods of time, for example in the following scenario:
    1. consumer_offsets partition leader gets isolated
    2. some group operation acquires a read lock and tries to replicate a
      batch to the consumer_offsets partition. This operation hangs for an
      indefinite period of time.
    3. the consumer_offsets leader steps down
    4. group state cleanup gets triggered, tries to acquire a write lock,
      hangs until (2) finishes
    
    Meanwhile, clients trying to perform any tx group operation will get
    coordinator_load_in_progress errors and blindly retry, without even
    trying to find the real coordinator.
    
    Check for leadership without the read lock first to prevent that. This
    is basically a "double-check" pattern, as we have to check a second
    time under the lock.
    
    (cherry picked from commit 440ed2c)
    ztlpn authored and vbotbuildovich committed Sep 4, 2024
    17890ac
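The "double-check" ordering described in the commit message can be sketched roughly as below. This is a minimal single-threaded model, not the actual group_manager code: `catchup_lock_model`, `partition_model`, `begin_tx_operation` and the `errc` enum are hypothetical names invented for illustration, and the lock is assumed to refuse new readers while a writer (the state cleanup) is waiting.

```cpp
#include <cassert>

// Hypothetical error codes, named after the errors mentioned in the commit message.
enum class errc { none, not_coordinator, coordinator_load_in_progress };

// Hypothetical stand-in for catchup_lock: a non-blocking read lock that
// fails fast while a writer (e.g. group state cleanup) is waiting.
class catchup_lock_model {
public:
    bool try_read_lock() {
        if (writer_waiting_) {
            return false;
        }
        ++readers_;
        return true;
    }
    void read_unlock() { --readers_; }
    void set_writer_waiting(bool waiting) { writer_waiting_ = waiting; }

private:
    int readers_ = 0;
    bool writer_waiting_ = false;
};

// Hypothetical stand-in for an attached consumer_offsets partition.
struct partition_model {
    bool is_leader = true;
    catchup_lock_model catchup_lock;
};

errc begin_tx_operation(partition_model& p) {
    // First check, without the lock: a node that is no longer the leader
    // answers not_coordinator right away, so the client can look up the
    // real coordinator instead of retrying against this node.
    if (!p.is_leader) {
        return errc::not_coordinator;
    }
    // Only a leader goes on to contend for the read lock.
    if (!p.catchup_lock.try_read_lock()) {
        return errc::coordinator_load_in_progress;
    }
    // Second check, under the lock. In this synchronous model it cannot
    // fail, but in asynchronous code leadership may change between the
    // first check and lock acquisition, so it has to be repeated.
    if (!p.is_leader) {
        p.catchup_lock.read_unlock();
        return errc::not_coordinator;
    }
    // ... the tx group operation itself would run here ...
    p.catchup_lock.read_unlock();
    return errc::none;
}
```

With a writer waiting and the node still leader, the sketch returns coordinator_load_in_progress; once leadership is lost it returns not_coordinator without touching the lock, which is the behaviour the commit aims for.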