idempotency: Fix lifetime issues from intrusive container #7824

Merged
merged 2 commits into from
Dec 21, 2022

Contributor

@bharathv bharathv commented Dec 17, 2022

For non-auto-unlinking intrusive containers, we need to delete
the linked objects before destroying the containers; otherwise this can
cause crashes.

Without this fix, multiple code paths did not follow this ordering,
which caused chaos tests to crash.
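
To illustrate the ordering problem, here is a minimal sketch using a hand-rolled intrusive list rather than Boost.Intrusive. Only `seq_table`, `lru_idempotent_pids`, and `unlink_lru_pid` are names taken from this PR; `hook`, `intrusive_list`, `seq_entry`, `log_state`, and `stop` are hypothetical stand-ins and do not reflect the real Redpanda types.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical hook embedded in each element (an assumption, not Boost.Intrusive).
struct hook {
    hook* prev = nullptr;
    hook* next = nullptr;
    bool linked() const { return next != nullptr; }
};

// Minimal circular doubly linked list that never owns its elements and
// never unlinks them automatically.
class intrusive_list {
public:
    intrusive_list() { _head.prev = _head.next = &_head; }
    intrusive_list(const intrusive_list&) = delete;
    intrusive_list& operator=(const intrusive_list&) = delete;

    void push_back(hook& h) {
        h.prev = _head.prev;
        h.next = &_head;
        _head.prev->next = &h;
        _head.prev = &h;
    }

    // Detach a hook from whatever list it is in; no-op if it is not linked.
    static void unlink(hook& h) {
        if (!h.linked()) {
            return;
        }
        h.prev->next = h.next;
        h.next->prev = h.prev;
        h.prev = h.next = nullptr;
    }

    bool empty() const { return _head.next == &_head; }

private:
    hook _head; // sentinel node
};

struct seq_entry {
    int64_t last_seq = -1;
    hook lru; // links this entry into the LRU list
};

struct log_state {
    std::map<int64_t, seq_entry> seq_table; // owns the entries
    intrusive_list lru_idempotent_pids;     // points into seq_table entries

    void unlink_lru_pid(seq_entry& e) { intrusive_list::unlink(e.lru); }
};

// Unlink every entry first, check the invariant, then destroy the entries.
// Destroying an entry while it is still linked would leave its neighbours
// holding dangling pointers.
bool stop(log_state& st) {
    for (auto& entry : st.seq_table) {
        st.unlink_lru_pid(entry.second);
    }
    const bool lru_empty = st.lru_idempotent_pids.empty();
    st.seq_table.clear();
    return lru_empty;
}
```

In this sketch `stop` returns whether the list was empty after unlinking, which corresponds to the invariant a shutdown path would want to assert before the owning map is torn down.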

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

Bug Fixes

  • Fixes a segfault in partition shutdown caused by incorrect use of intrusive lists.

- if (
-   _log_state.lru_idempotent_pids.size()
-   <= _max_concurrent_producer_ids()) {
+ if (_log_state.seq_table.size() <= _max_concurrent_producer_ids()) {
Contributor

@VadimPlh VadimPlh Dec 19, 2022

seq_table does not contain only idempotent pids, so we cannot use it to check the size of idempotent_pids.
For example, let _max_concurrent_producer_ids = n + 1. You can have n + 1 transaction pids and 1 idempotent pid, so seq_table.size() exceeds the limit and the idempotent pid will be cleared in this case.

Why we do not apply this setting to transaction and idempotent pids together:

  1. We cannot delete a pid for a transaction if the transaction is not aborted or committed.
  2. For idempotent pids, we can just forget old pids in LRU order.
  3. So we decided to check the number of pids (transaction and idempotency) independently.
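
A rough sketch of the point above, assuming a simplified model in which each pid is either transactional or idempotent-only (the thread notes the real predicates are not mutually exclusive). `producer_table`, `entry`, `add`, and `evict_idempotent` are hypothetical names, and a plain `std::list` stands in for the intrusive LRU list.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

struct producer_table {
    struct entry {
        bool transactional;
    };

    std::unordered_map<int64_t, entry> seq_table; // all pids, both kinds
    std::list<int64_t> lru_idempotent_pids;       // idempotent pids only, oldest first
    std::size_t max_concurrent_producer_ids = 0;

    void add(int64_t pid, bool transactional) {
        seq_table[pid] = entry{transactional};
        if (!transactional) {
            lru_idempotent_pids.push_back(pid);
        }
    }

    // Evict the oldest idempotent pids while *their* count exceeds the limit.
    // Comparing seq_table.size() against the limit instead would also count
    // transactional pids, which cannot be dropped mid-transaction.
    std::size_t evict_idempotent() {
        std::size_t evicted = 0;
        while (lru_idempotent_pids.size() > max_concurrent_producer_ids) {
            seq_table.erase(lru_idempotent_pids.front());
            lru_idempotent_pids.pop_front();
            ++evicted;
        }
        return evicted;
    }
};
```

With a limit of 3, three transactional pids plus one idempotent pid give `seq_table.size() == 4`, yet nothing should be evicted because only one idempotent pid exists.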

Contributor Author

@bharathv bharathv Dec 19, 2022

Good point, I thought is_idempotent() was mutually exclusive with is_transactional(); looks like it's not. I'm testing another version of the fix and will get back.

@bharathv bharathv force-pushed the chaos_fix_intrusive branch 2 times, most recently from e6d334d to 703a4f7 Compare December 20, 2022 18:00
for (const auto& entry : _log_state.seq_table) {
    _log_state.unlink_lru_pid(entry.second);
}
vassert(
Contributor

@VadimPlh VadimPlh Dec 20, 2022

nit: I am not sure about the assert there.
Maybe vlog an error instead, because this is the stop method and in general we can still finish stopping the redpanda process.

Contributor Author

That's an invariant that should always hold true. If it doesn't, it can potentially crash. vassert(), OTOH, gives us much more meaningful crash diagnostics.

@VadimPlh VadimPlh self-requested a review December 20, 2022 18:12
Contributor

@VadimPlh VadimPlh left a comment

LGTM!

@bharathv
Contributor Author

Running the full chaos suite, will merge tomorrow if it's green.

@bharathv bharathv merged commit 4116b06 into redpanda-data:dev Dec 21, 2022
@bharathv
Contributor Author

/backport 22.3.x

Member

@dotnwat dotnwat left a comment

nice catch. was there an issue/crash/backtrace that could be referenced?
