Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a use-after-move in partition_balancer #7406

Merged

Conversation

dlex
Copy link
Contributor

@dlex dlex commented Nov 21, 2022

A use-after-move in lambda capture expression may lead to incorrect initialization of allocation domain to 0, which would affect deallocation of __consumer_offsets partitions causing a vassert in allocation_node.

Related to #7343 (fixes one of the underlying issues).

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

  • none

Release Notes

Bug Fixes

  • Fix an issue that prevented 22.3.x nodes from starting when recent controller log contained commands to manipulate __consumer_offsets partitions

@dlex dlex requested a review from ztlpn November 21, 2022 16:15
@piyushredpanda piyushredpanda added this to the v22.3.4 milestone Nov 21, 2022
ztlpn
ztlpn previously approved these changes Nov 21, 2022
Copy link
Contributor

@ztlpn ztlpn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great catch

@bharathv
Copy link
Contributor

great find 🤯

@dlex dlex force-pushed the 7343_fix-use-after-move-in-delete-topic-cmd branch from bb547a2 to d1a16bc Compare November 21, 2022 18:17
@dlex
Copy link
Contributor Author

dlex commented Nov 21, 2022

force-push: lint failure

@dlex dlex marked this pull request as ready for review November 21, 2022 18:24
@emaxerrno
Copy link
Contributor

@dlex - does this need backport?

@dlex
Copy link
Contributor Author

dlex commented Nov 21, 2022

@dlex - does this need backport?

Only to 22.3

@dlex
Copy link
Contributor Author

dlex commented Nov 21, 2022

CI failure due to #6903, retrying

@dlex
Copy link
Contributor Author

dlex commented Nov 21, 2022

/backport v22.3.x

@vbotbuildovich
Copy link
Collaborator

The pull request is not merged yet. Cancelling backport...

Workflow run logs.

@dlex dlex changed the title fix a use-after-move in partition_balancer Fix a use-after-move in partition_balancer Nov 21, 2022
@dlex
Copy link
Contributor Author

dlex commented Nov 22, 2022

CI failure due to #7418, retrying

dotnwat
dotnwat previously approved these changes Nov 22, 2022
Comment on lines -60 to +61
allocation_domain = get_allocation_domain(
del_cmd.key)](std::error_code ec) {
if (ec == errc::success) {
vassert(
topic_assignments.has_value(),
"Topic had to exist before successful delete");
deallocate_topic(
*topic_assignments, in_progress, allocation_domain);
.then(
[this,
tp_ns = std::move(del_cmd.key),
topic_assignments = std::move(topic_assignments),
in_progress = std::move(in_progress)](std::error_code ec) {
if (ec == errc::success) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch.

prior to this change get_allocation_domain ran before the continuation (when the capture list was created). but now it runs when the continuation is scheduled. is that something to be concerned with? in this case it might make sense to remove std::move from tp_ns in the capture list and pay a small string copy cost to keep things simple and preserve the semantics.

Copy link
Contributor Author

@dlex dlex Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No concern here. tp_ns is not changed (the capture is not mutable), the lambda is not a coro, and get_allocation_domain() is a pure function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No concern here. tp_ns is not changed (the capture is not mutable), the lambda is not a coro, and get_allocation_domain() is a pure function.

awesome. thanks for the explanation

@dotnwat
Copy link
Member

dotnwat commented Nov 22, 2022

PR looks good--some minor feedback but no blockers from me.

@dlex dlex requested a review from ztlpn November 22, 2022 16:30
@dlex dlex merged commit efa17cc into redpanda-data:dev Nov 22, 2022
@dlex
Copy link
Contributor Author

dlex commented Nov 22, 2022

/backport v22.3.x

@andrewhsu
Copy link
Member

fyi i updated the release notes section with text from what @dlex suggested

@dlex dlex deleted the 7343_fix-use-after-move-in-delete-topic-cmd branch November 24, 2022 03:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants