-
Notifications
You must be signed in to change notification settings - Fork 677
[fromtree] bluetooth: fix bug when destroying tx buffers on disconnected #3026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[fromtree] bluetooth: fix bug when destroying tx buffers on disconnected #3026
Conversation
…nd_buf" This reverts commit 827eb44. Signed-off-by: Aleksander Wasaznik <aleksander.wasaznik@nordicsemi.no>
…p_send_pdu" This reverts commit 685c53a. Signed-off-by: Aleksander Wasaznik <aleksander.wasaznik@nordicsemi.no>
… disconnect Channel tx_queue purging on disconnect was inconsistently handled by the different channels: iso, l2cap, l2cap_br. iso channels handled purging in the tx_data_pull hook. l2cap and l2cap_br did the purging in channel delete functions and did not expect tx_data_pull to be called for a disconnected channel. Their data_pull functions could return a ptr to a net_buf that was still on the tx_queue, which is problematic when the conn tx_processor unrefs the returned buffer resulting in multiple calls to the buf destroy function. To make things consistent and correct, remove the code that tries to purge tx_queues in the tx_processor and only do purging in the channels themselves when they are deleted/disconnected. Also refactor and clarify referencing of the net_buf returned by tx_data_pull. It was confusing who had a reference and when, which could vary depending on the length of the original buffer. There are three cases: the buffer length is less than the tx.mps, greater the mps but less than the mtu so requiring segementation but not fragmentation, or greater than both mps and mtu so requiring both segmentation and fragmentation. The conn layer would increase the refcnt if the length was greater than the mtu, but not have any awareness of whether the net_buf was still on the tx_queue or not. Now it is the tx_data_pull callbacks responsibitity to increment the reference count if it is returning a pointer to a net_buf that it is still keeping on the tx_queue for segmentation purposes. The conn layer will now always transfer that reference into a fragment view and not conditional it on the length relative to the mtu, and always decrement the reference to the parent when the fragment is destroyed. So there is no risk of decrementing a reference to a net buf that might still be on a tx_queue, which simplifies error handling in particular. Also add error handling paths for when asserts are not enabled. Signed-off-by: Mike J. Chen <mjchen@google.com> (cherry picked from commit a392c33c60286574b74fe708e1a182dd99666c8d)
Allow for an additional buffer reference if callback is provided. This can be used to extend lifetime of the net buffer until the data transmission is confirmed by ACK of the remote. Signed-off-by: Marek Pieta <Marek.Pieta@nordicsemi.no> (cherry picked from commit d6539f4) (cherry picked from commit 685c53a)
If ATT sent callback is delayed until data transmission is done by BLE controller, the transmitted buffer may have an additional reference. The reference is used to extend lifetime of the net buffer until the data transmission is confirmed by ACK of the remote. send_buf function can be called multiple times, if buffer has to be fragmented over HCI. In that case, the callback is provided as an argument only for the last transmitted fragment. The `buf->ref == 1` check is skipped because it's impossible to properly validate number of references for the sent fragments if buffers may have the additional reference. Jira: NCSDK-28624 Signed-off-by: Marek Pieta <Marek.Pieta@nordicsemi.no> (cherry picked from commit e589307) (cherry picked from commit 827eb44)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks okay. I didn't do a detailed review of the fromtree commit itself, seeing as this has already been through a thorough review upstream. Handling of the noups looks like it's doing the correct thing as far as I can tell.
subsys/bluetooth/host/l2cap.c
Outdated
@@ -729,13 +729,17 @@ int bt_l2cap_send_pdu(struct bt_l2cap_le_chan *le_chan, struct net_buf *pdu, | |||
return -ENOTCONN; | |||
} | |||
|
|||
if (pdu->ref != 1) { | |||
/* Allow for an additional buffer reference if callback is provided. This can be used to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may have misunderstood your description, but could you elaborate a bit why make this change in this PR and what the change is fixing? While the change itself (increasing the reference count possible if there is a callback) does seem sensible, I am not finding why this is added here in particular, and also whether we have some test in place ensuring this doesn't cause resource issues in some other end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi
The noups are related to NCS-specific (not accepted in upstream Zephyr) workaround for HID report flow over Bluetooth LE. This PR does not show the complete use-case for the noups. For more details, refer to:
The related Jira bug ticket: https://nordicsemi.atlassian.net/browse/NCSDK-27422
PRs introducing fixes to the NCS: nrfconnect/sdk-nrf#16027 and #1819
* the only reference (e.g. buf was removed from the conn | ||
* tx_queue). It would be 2 if the tx_data_pull kept it on | ||
* the tx_queue for segmentation. | ||
/* If ATT sent callback is delayed until data transmission |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: "letter count" (forgot proper word) in one line had been increased. So better to go with the new one. Probably you can just have two long lines and run clang-format.
@alwa-nordic Have you looked more closely at the test you mention in the description? Because if it is relatively simple to cherry-pick, I'd say we should (we may have nooup changes in some other corner affecting the fix, or will have in the future, so better to have the test also downstream than not imo). Ofc, if the test needs to be amended or drags in lots of other changes, we can decide to rely on it being upstream only, guess we also want to cherry-pick this fix quick. |
Add early_disconnect test cases for iso/cis and l2cap/stress to test the bluetooth stack handling of disconnects while transmit is still in progress. Signed-off-by: Mike J. Chen <mjchen@google.com> (cherry picked from commit 93997dab8eaa4777351d6d00df49695a35d41c76)
I cherry-picked the test. |
if (buf->ref > 2 + (cb ? 1 : 0)) { | ||
__ASSERT_NO_MSG(false); | ||
} | ||
__ASSERT_NO_MSG(IS_ENABLED(CONFIG_BT_ATT_SENT_CB_AFTER_TX) || (buf->ref == 1) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we squash the 2 noups? This one reverts part of the first noup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline with @alwa-nordic . Will not squash in scope of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks good. Tested locally, no regression found
Cherry-pick commit from upstream #90975
Review guide:
The only purpose of this PR is to cherry-pick commit from upstream #90975.
So why are there 5 commits in this PR?
The commit in the middle is the cherry-pick we want. The NCS CI required that the diff of all 'fromtree' is the same as the upstream commit. This means it's not possible to alter it, and also precludes a manual conflict resolution when applying that commit.
The upstream commit did not cherry-pick without conflict. It conflicts with some noup commits we have downstream. To resolve this will move the noup changes to after the cherry-pick, by reverting them and reapplying them, doing a manual conflict resolution in the noup changes.
So what do I need to review?
Guide to the noup changes
[nrf noup] bluetooth: conn: Allow for an extra ref in bt_l2cap_send_pdu
The diff-diff is a bit unwieldly because of the comments in the code. I have cheated and removed all diff-diff to do with code comments, and get the below:
What you see there is that the previous version of this noup changes an assertion to an if-statement, adding
+ (cb ? 1 : 0)
to what's already there. The new version of the noup does the same, but it converts the new|| (buf->ref == 2)
to just bumping up the number to2
. This should be equivalent.[nrf noup] bluetooth: conn: Skip buffer ref count check in send_buf
Again, with the comment-stuff removed:
This noup is removing the if-condition and replacing it with an assertion again, so we convert the
> 2
back to a|| (buf->ref == 2)
in the assertion.The validity of the cherry-pick itself
You can look at the upstream PR to understand the change and note any comments that could indicate potential problems.
It's worth noting that the upstream PR has a test that has not been cherry-picked in this PR. I did not put much though into this other than wanting to keep the cherry-pick small. Less is more. But maybe cherry-picking this test will increase our confidence for this fix?