-
Notifications
You must be signed in to change notification settings - Fork 580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix cross-shard allocator manipulation #7351
Fix cross-shard allocator manipulation #7351
Conversation
Const member prevents copy and move assignment, but copy and move assignment seem to have reasonable semantics (the destination shard id is replaced by the source) and we need it for allocation_units oncore tracking. Issue redpanda-data#5558.
Even though they ave movable, allocation_units objects must be destroyed on the same core the *original* allocation was made on, as they contain an embedded pointer back to the allocation state which lives on that core, since to do otherwise would be a cross-shard race-condition. Enforce this condition in debug mode using oncore checker.
4844d09
to
c61631c
Compare
Return the allocation_units wrapped in a foreign pointer, since they need to be freed on the same core on which they were allocated. Fixes redpanda-data#5558.
Checks at the top of most functions to ensure that the core of the caller is the same as the home core of the state, only in debug builds.
c61631c
to
44c0dc7
Compare
Nice work I added myself as a reviewer just to be kept in the loop here |
This is relatively clean. Any reason is draft. |
ci-failure is #6870 |
@emaxerrno wrote:
I just didn't have time to fill out the cover letter yet but I wanted to get a CI run in in the meantime and maybe get an eye on it. I've marked it as ready for review now. |
@@ -29,7 +29,7 @@ class oncore final { | |||
void verify_shard_source_location(const char* file, int linenum) const; | |||
|
|||
private: | |||
const shard_id_type _owner_shard; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dotnwat you added this const in a recent compile-time related change, but is it necessary? It prevents assignment but I think assignment basically DTRT at least in the case of allocation_units
the destination will be assigned the internals which still point to the shard associated with the source so the oncore should also be assigned with the shard of the source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it isn't necessary, but i'm unaware of any case in which the owner shard is not the same as the shard that created the oncore object (and could make that write-once assignment). it almost seems like if it is being manually assigned or changed, then it implies that some code is doing manual tracking of core owner, which defeats the sort of checks we wanted in the first place.
i'll take a closer look at this PR to understand what is happening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dotnwat - right, it may be true that "the owner shard is not the same as the shard that created the oncore object" but making the member const
is implies something even stricter: "this object can never be assigned". Now I think you can argue that assignment of an oncore
object itself doesn't make much sense since the only state it has should not change, but this also means any object that contains an oncore
member can't be assigned (at least not without overriding the operator=
and manually assigning all the fields except that one, ugh), and that may be an entirely valid operation.
So one way to keep similar semantics without preventing assignment operator generation could be to assert in the oncore
assignment operator that the LHS and RHS shard ID are the same.
nit: please can you use the |
Do we like this as a cause for #7343 ? It seems odd that the issue would manifest on upgrade if this is a long-standing bug in Redpanda. |
/backport v22.3.x |
/backport v22.2.x |
Oops! Something went wrong. |
Failed to run cherry-pick command. I executed the below command:
|
The attempt to backport to v22.2.x branch failed because this PR modifies @travisdowns @dotnwat or anybody else who knows the innards of this codebase...need advice. |
/backport v22.3.x |
notes from conversation with @dlex regarding backport to v22.2.x branch:
|
@andrewhsu i don't think you need to backport 6986. this PR makes a one line change to that oncore file, but in 22.2.x that file exists in a different location. you could even drop the change to that file entirely, just document it in the PR as a conflict. |
fyi i added release notes section to the pr body; it was empty |
Thanks all for the help here. Just to clarify, the Remove const from shard_id in oncore change was related and required for this series of changes in |
This patch series prevents cross-shard manipulation of the allocator state which may cause segfaults and assertion failures.
The main problem is that the destructor for allocation_units maintains a pointer to the state from which it was created, which when moved onto another shard will do a call back to the original shard on destruction: a race condition. Now we use foreign pointer to avoid this: the destructor will be called on the original shard.
Fixes #5558.
Backports Required
UX Changes
Release Notes
Bug Fixes