Skip to content

nrf_rpc: implement timeout for response #1764

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Damian-Nordic
Copy link
Contributor

@Damian-Nordic Damian-Nordic commented Jun 6, 2025

Introduce necessary changes to support timing out waiting for a command response:

  1. Handle nrf_rpc_os_msg_get() returning null data pointer, indicating that the response has not arrived within the required maximum time.
  2. Introduce context mutex to protect all context members. This is necessary because now the response may arrive after the initiating thread has already released the context.

Note that 2 may not sufficient solution to address another scenario where the context is freed by the initiator but is then reused before a delayed response arrives - to solve this, we likely need to introduce the concept of session ID or conversation ID to detect outdated packets.

manifest-pr-skip

@Damian-Nordic
Copy link
Contributor Author

@doki-nordic @grochu Friendly reminder to take a look.

Copy link
Contributor

@grochu grochu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not having reviewed in shorter time, I hoped someone more experienced with nRF RPC could do it. Anyway the changes technically are OK to me, I just have some minor comments from the documentation perspective.

@doki-nordic
Copy link
Contributor

I started reviewing this PR, but since I have limited time to review it, I did not finalized it. I have concerns about mutex locking logic. Do we really need to lock it in context alloc function? When locking and unlocking logic is spread across multiple functions and multiple files, there is higher risk of edge cases or time races that may lead to dead lock. I am concern about context reusing functionality. Can you do simple context reusing test on this PR? See docs for some details about it: https://docs.nordicsemi.com/bundle/ncs-latest/page/nrfxlib/nrf_rpc/doc/architecture.html#id5 . This will happen, for example, when the local side sends a command and waits for a response, next the remote send another command back to local within the command handler context. In this case, context on local side should be reused.

@Damian-Nordic
Copy link
Contributor Author

Damian-Nordic commented Jun 24, 2025

I started reviewing this PR, but since I have limited time to review it, I did not finalized it. I have concerns about mutex locking logic. Do we really need to lock it in context alloc function? When locking and unlocking logic is spread across multiple functions and multiple files, there is higher risk of edge cases or time races that may lead to dead lock. I am concern about context reusing functionality. Can you do simple context reusing test on this PR? See docs for some details about it: https://docs.nordicsemi.com/bundle/ncs-latest/page/nrfxlib/nrf_rpc/doc/architecture.html#id5 . This will happen, for example, when the local side sends a command and waits for a response, next the remote send another command back to local within the command handler context. In this case, context on local side should be reused.

cmd_ctx_alloc may seem counterintuitive but I didn't find a better place to do the locking:

  1. cmd_ctx_alloc already modifies the context to initialize it, so it must lock the mutex in order to prevent another thread from using a partially initialized context when e.g. a delayed response arrives. If we wanted to do lock/unlock every time in a single function, we would need to do this in many functions, such as cmd_ctx_alloc, cmd_ctx_reserve, nrf_rpc_cmd_common. I thought this would be less efficient.
  2. cmd_ctx_alloc is called only when a new context is allocated - in the case of context reusing, this won't be called again. If we moved locking to e.g. nrf_rpc_cmd_common, we might be going to lock an already locked mutex, so we would need to assume the mutex is reentrant or add an extra check.

I'm happy to document it better, or rework this if you prefer locking the mutex in individual functions, but I thought doing this the current way actually simplifies the design - you always run nRF RPC code with the context mutex held and unlock it only when waiting for the peer or when releasing the context.

And yeah, I can run a test like that :)

@Damian-Nordic
Copy link
Contributor Author

Damian-Nordic commented Jul 7, 2025

@grochu I addressed your comments, please take another look.
@doki-nordic I tested this code with the context reusing (implemented UDP callback which retrieves extra information about the received UDP packet from the remote device) and it works as expected.
Let me know if you have time for a call so I can provide the rationale for the current design.

Copy link
Contributor

@grochu grochu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to unblock development. Let's discuss the context locking with mutex later in another context.

@github-actions github-actions bot added the doc-required PR must not be merged without tech writer approval. label Jul 9, 2025
@Damian-Nordic Damian-Nordic requested a review from annwoj July 9, 2025 13:15
Introduce necessary changes to support timing out waiting
for a command response:
1. Handle nrf_rpc_os_msg_get() returning null data pointer,
   indicating that the response has not arrived within
   the required maximum time.
2. Introduce context mutex to protect all context members.
   This is necessary because now the response may arrive
   after the initiating thread has already released the
   context.

Note that 2 may not sufficient solution to address another
scenario where the context is freed by the initiator but
is then reused before a delayed response arrives - to solve
this, we likely need to introduce the concept of session
ID or conversation ID to detect outdated packets.

Signed-off-by: Damian Krolik <damian.krolik@nordicsemi.no>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-required PR must not be merged without tech writer approval.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants