-
Notifications
You must be signed in to change notification settings - Fork 580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
idempotency: Fix lifetime issues from intrusive container #7824
Conversation
src/v/cluster/rm_stm.cc
Outdated
if ( | ||
_log_state.lru_idempotent_pids.size() | ||
<= _max_concurrent_producer_ids()) { | ||
if (_log_state.seq_table.size() <= _max_concurrent_producer_ids()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seq_table contains not only idempotent pids.
We can not use it to check the size of idempotent_pids.
For example _max_concurrent_producer_ids =n + 1. You can have n + 1 transaction pids and 1 idempotent_pid. So you will clear idempotent_pid in this case
Why we apply this settings not for transaction/idempotent pids together:
- We can not delete pid for transaction if transaction is not aborted or commited.
- For idempotent we can just forget old pids in LRU order.
- So we decided check size of pids (transaction and idempotency) independently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I thought is_idempotent() is mutually exclusive from is_transactional(), looks like its not. I'm testing another version of the fix, will get back.
e6d334d
to
703a4f7
Compare
For non auto-unlinking intrusive containers, we need to delete the linked object before destroying the containers, otherwise can cause crashes. Without this fix, there were multiple code paths that didn't exercise this behavior causing chaos tests to crash.
703a4f7
to
8478438
Compare
for (const auto& entry : _log_state.seq_table) { | ||
_log_state.unlink_lru_pid(entry.second); | ||
} | ||
vassert( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I am not sure about assert there
Maybe vlog error, because it is stop
method and in general we can finish stopping redpanda process
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats an invariant that should always hold true. If not it can potentially crash. vassert() OTOH gives us much more meaningful crash diagnostics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Running full chaos suite, will merge tomorrow if its green. |
/backport 22.3.x |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch. was there an issue/crash/bactrace that could be referenced?
For non auto-unlinking intrusive containers, we need to delete
the linked object before destroying the containers, otherwise can cause
crashes.
Without this fix, there were multiple code paths that didn't exercise
this behavior causing chaos tests to crash.
Backports Required
UX Changes
Release Notes
Bug Fixes