[libteam]: Reimplement Warm-Reboot procedure #3016

Merged


pavel-shirshov
Contributor

New implementation of Warm-Reboot support in teamd

Port of #2999

During manual testing of the previous Warm-Reboot implementation for teamd, we found that teamd restores state incorrectly if one of the ports was put into the OPER DOWN state during the procedure.
To fix that, I redesigned the procedure completely:

  1. When we prepare the system for the WR procedure, we save the LACP PDU for every LAG member port (as before), plus additional information about the current LAG:
  • number of LAG members
  • interface names of the LAG members
  • operational states of the LAG members
  2. When we start the system in WR state, we read the saved LAG information, which allows us to restore the state correctly.
  • if the LAG state before the reboot was down, we disable WR mode immediately and start in normal mode
  • if the LAG state before the reboot was up, we start with the LAG interface enabled, so as not to disrupt the dataplane, and begin restoring the state of the LAG members
  • when SONiC adds LAG interfaces, teamd runs the lacp_update_carrier() function, which calculates the operational state of the LAG interface. We keep the state up while we're in WR mode. Every time lacp_update_carrier() is executed, we check the operational state of the LAG member interfaces. If a member is up, we read its LACP PDU from the saved file; if it is not up, we wait.
  • after lacp_update_carrier() has executed once, we start a timer. If we cannot restore the state within 3 seconds, we stop WR mode and start working as usual.
  • as soon as we have read the state of all LAG member interfaces, we can disable WR mode and run teamd as usual
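
The decision flow above can be modelled as a small state machine. This is only a rough sketch, not the code in the patch: the names `wr_lag`, `wr_member`, `wr_begin()` and `wr_update_carrier()` are hypothetical, reading a saved LACP PDU is reduced to setting a flag, and "LAG was up before the reboot" is assumed to mean that at least one member was up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WR_MAX_MEMBERS 16   /* hypothetical limit, not taken from the patch */
#define WR_TIMEOUT_MS  3000 /* the 3-second limit from the description */

/* Hypothetical model of the per-member data saved before the reboot. */
struct wr_member {
    char ifname[16];   /* interface name of the LAG member */
    bool was_up;       /* operational state before the reboot */
    bool pdu_restored; /* stand-in for "saved LACP PDU has been read" */
};

struct wr_lag {
    size_t nr_members; /* number of LAG members */
    struct wr_member members[WR_MAX_MEMBERS];
    bool wr_active;    /* while true, the LAG carrier is kept up */
    bool timer_started;
    long timer_start_ms;
};

enum wr_result { WR_NORMAL, WR_IN_PROGRESS, WR_DONE, WR_TIMED_OUT };

/* Enter WR mode only if the LAG was up before the reboot
 * (assumption: "up" means at least one member was up). */
bool wr_begin(struct wr_lag *lag)
{
    bool lag_was_up = false;

    assert(lag->nr_members <= WR_MAX_MEMBERS);
    for (size_t i = 0; i < lag->nr_members; i++)
        if (lag->members[i].was_up)
            lag_was_up = true;

    lag->wr_active = lag_was_up;
    lag->timer_started = false;
    return lag->wr_active;
}

/* Model of one carrier update: restore members that are up now, finish
 * when all are restored, give up once the timeout has passed. */
enum wr_result wr_update_carrier(struct wr_lag *lag,
                                 const bool *oper_up, long now_ms)
{
    size_t restored = 0;

    if (!lag->wr_active)
        return WR_NORMAL;

    for (size_t i = 0; i < lag->nr_members; i++) {
        if (!lag->members[i].pdu_restored && oper_up[i])
            lag->members[i].pdu_restored = true; /* would read the PDU file */
        if (lag->members[i].pdu_restored)
            restored++;
    }

    if (restored == lag->nr_members) {
        lag->wr_active = false; /* all members restored: leave WR mode */
        return WR_DONE;
    }

    if (!lag->timer_started) {
        lag->timer_started = true; /* timer starts after the first update */
        lag->timer_start_ms = now_ms;
    } else if (now_ms - lag->timer_start_ms > WR_TIMEOUT_MS) {
        lag->wr_active = false; /* give up and continue in normal mode */
        return WR_TIMED_OUT;
    }
    return WR_IN_PROGRESS;
}
```

In this model a caller would treat the LAG carrier as up for as long as `wr_active` is set, and fall back to the regular carrier calculation once `wr_update_carrier()` returns anything other than `WR_IN_PROGRESS`.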

The WR start logic has been moved entirely into lacp_update_carrier().
I've added a lot of debug messages for WR mode, which should make issues easier to find.

I've rearranged the libteam patches in the series so that the WR patch comes last. This will allow us to change WR behaviour more easily.
