Multiple restart of swss during config load fails to start swss #3244

Closed
antony-rheneus opened this issue Jul 31, 2019 · 13 comments

@antony-rheneus
Contributor

https://github.com/Azure/sonic-buildimage/blob/67463f18b2ea396c1b3bab87575f803376a8046e/files/build_templates/swss.service.j2#L13

@jleveque, the interval has been increased from the default 10 sec to 1200 sec, but the burst has been decreased from the default 5 to 3.
Because the burst is so low over such a long interval, systemd did not start swss.
Can we revert the burst, or increase it?

@jleveque
Contributor

jleveque commented Jul 31, 2019

@antony-rheneus: Could you please explain your issue more? The new settings (StartLimitInterval = 1200sec, StartLimitBurst = 3) now mean that if systemd has already restarted swss 3 times in the past 20 minutes, it should stop trying to restart the service and mark it as failed. Under what circumstances are you encountering this? It should never occur under normal operation.
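
To make the arithmetic behind these settings concrete, here is a minimal sketch of a start rate limit. It is a simplified model written for illustration, not systemd's actual implementation; the `StartLimit` class and its method names are hypothetical. It assumes a window opens at the first start and that at most `burst` starts are allowed until `interval` seconds have passed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StartLimit:
    """Toy model of a start rate limit (hypothetical, not systemd code)."""
    interval: float                      # StartLimitInterval, in seconds
    burst: int                           # StartLimitBurst
    window_start: Optional[float] = None
    count: int = 0

    def allow_start(self, now: float) -> bool:
        # Open a fresh window on the first start, or once the old one expired.
        if self.window_start is None or now - self.window_start > self.interval:
            self.window_start = now
            self.count = 1
            return True
        # Inside the window: allow the start only while under the burst.
        if self.count < self.burst:
            self.count += 1
            return True
        return False


# Four restarts spaced five minutes apart, e.g. four config pushes.
times = (0, 300, 600, 900)

new_settings = StartLimit(interval=1200, burst=3)
print([new_settings.allow_start(t) for t in times])   # [True, True, True, False]

old_defaults = StartLimit(interval=10, burst=5)
print([old_defaults.allow_start(t) for t in times])   # [True, True, True, True]
```

Under this model, the fourth restart within 20 minutes is refused with the new settings, while the old defaults would let every one of these widely spaced restarts through.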

@nikos-github
Collaborator

nikos-github commented Jul 31, 2019

@jleveque We are having the same issue. Configuration that can't yet be applied incrementally can't be pushed more than the allowed number of times within the time limit; doing so causes swss to not start again.

@jleveque
Contributor

@nikos-github: To clarify, you have a need to perform > 3 configuration pushes (and thus >3 SwSS restarts, due to lack of incremental config) within 20 minutes? Is this correct?

@antony-rheneus changed the title from "Multiple restart of swss during config load fails to stats swss" to "Multiple restart of swss during config load fails to start swss" on Aug 1, 2019

@antony-rheneus
Contributor Author

@jleveque, if you run the test suite, this fails, because the test suite changes the configuration multiple times and tests the changes one by one.
Excerpt: "if systemd has already restarted swss 3 times in the past 20 minutes, it should stop trying to restart the service and mark it as failed."
How can you be sure that restarting swss more than three times is not a valid scenario? I personally feel it is.

Continuous restarting is invalid only when there is an issue in the application that made it exit or dump core, in which case an endless loop in the same state has to be avoided. We cannot guard against that with a prevention in the generic infrastructure that also affects intentional restarts.

@antony-rheneus
Contributor Author

@jleveque Could you give some pointers on why this was changed from the default systemd service values? Some insight would help me understand the reason for the change, since you must have analysed this to come up with the new values.

@nikos-github
Collaborator

nikos-github commented Aug 1, 2019

> @nikos-github: To clarify, you have a need to perform > 3 configuration pushes (and thus >3 SwSS restarts, due to lack of incremental config) within 20 minutes? Is this correct?

@jleveque That is correct. Currently not all configuration pertaining to sonic can be applied incrementally or without restarting swss. Keep in mind that users may also push configuration through our software at different times which when applied will force a swss restart. I don't think there is a deterministic way to predict how many times swss should be allowed to restart and in what interval.

@jleveque
Contributor

jleveque commented Aug 2, 2019

@antony-rheneus: This was changed from the default values once we added the 'auto-restart-upon-critical-process-crash' feature (#2845). This is to prevent SONiC from indefinitely restarting the service if there is something causing one of the critical processes to crash consistently.

@antony-rheneus
Contributor Author

Can we add "systemctl reset-failed swss" to reset the restart counter in sonic-utilities, where config load/load_minigraph is called?
This is to ensure the test suite doesn't report failures for valid multiple restarts.

@avi-milner

> can we add "systemctl reset-failed swss" to reset the restart counter in the sonic-utilities/ where config load/load_minigraph is being called?
> This is to ensure the test suite doesn't report failure for valid multiple restarts

Also, the config reload operation will need to clear the counter. This is not only for test suites, but also for allowing new configurations to be pushed or changed as often as desired, without restriction.

@jleveque
Contributor

> can we add "systemctl reset-failed swss" to reset the restart counter in the sonic-utilities/ where config load/load_minigraph is being called?
> This is to ensure the test suite doesn't report failure for valid multiple restarts

We can. This is something I was already considering adding. I'll look into creating a PR.
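
The proposal could be sketched roughly as follows. This is a hedged illustration, not the code from any sonic-utilities PR: the function names `reset_failed` and `restart_services` and the injectable `run` parameter are assumptions made for the example. The only real commands used are `systemctl reset-failed <unit>` and `systemctl restart <unit>`.

```python
import subprocess


def reset_failed(services, run=subprocess.check_call):
    """Clear the failed state (and start rate limit counter) of each unit."""
    for service in services:
        run(["systemctl", "reset-failed", service])


def restart_services(services, run=subprocess.check_call):
    """Reset the units first so an intentional restart is not rate-limited."""
    reset_failed(services, run)
    for service in services:
        run(["systemctl", "restart", service])


# Dry run: record the commands instead of executing them.
recorded = []
restart_services(["swss"], run=recorded.append)
for command in recorded:
    print(" ".join(command))
```

Passing the runner as a parameter is only a convenience for trying the sketch without root privileges; a real caller would use the default `subprocess.check_call`.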

@jleveque
Contributor

@antony-rheneus, @avi-milner: PR here: sonic-net/sonic-utilities#607. Please review.

@jleveque
Contributor

Should be addressed by sonic-net/sonic-utilities#607

@avi-milner

avi-milner commented Aug 25, 2019

Hi @jleveque,
It seems that the fix is not working as we expected when we call config_reload / config_load / config_load_minigraph.
The code only resets the failed counter for services that are already in the failed state, so the config load scenarios still fail once they are run 3 times within the 20-minute time window.

Can you please change it to always reset the failed counter for config load commands?

I have opened sonic-net/sonic-utilities#616 for this.
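
The difference between the two reset policies can be illustrated with a small toy model. It is a sketch under stated assumptions, not the sonic-utilities code: the function `run_reloads` is hypothetical, every reload is assumed to trigger exactly one restart, and all reloads are assumed to fall inside a single rate-limit window.

```python
def run_reloads(n_reloads, burst, always_reset):
    """Count how many of n_reloads restarts succeed within one window."""
    starts_in_window = 0
    failed = False
    succeeded = 0
    for _ in range(n_reloads):
        # Reset either unconditionally, or only if the unit already failed.
        if always_reset or failed:
            starts_in_window = 0
            failed = False
        if starts_in_window < burst:
            starts_in_window += 1
            succeeded += 1
        else:
            # Rate limit hit: the restart is refused and the unit is failed.
            failed = True
    return succeeded


print(run_reloads(5, burst=3, always_reset=False))  # 4 (the fourth reload fails)
print(run_reloads(5, burst=3, always_reset=True))   # 5
```

In this model, resetting only units that are already failed still lets the fourth reload hit the limit, because the counter is full but the unit has not failed yet. Resetting unconditionally before every reload avoids that.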

mssonicbld added a commit that referenced this issue Apr 2, 2024
…atically (#18524)

#### Why I did it
src/sonic-utilities
```
* bd86d33b - (HEAD -> master, origin/master, origin/HEAD) [generate_dump] call hw-management-generate-dump.sh in collect_cisco_8000 (#2809) (2 hours ago) [Geert Vlaemynck]
* 52e9117c - [dualtor_neighbor_check] Fix the script not exists issue (#3244) (24 hours ago) [Longxiang Lyu]
```
#### How I did it
#### How to verify it
#### Description for the changelog
mssonicbld added a commit that referenced this issue Apr 2, 2024
…atically (#18522)

#### Why I did it
src/sonic-utilities
```
* d6eec0f4 - (HEAD -> 202305, origin/202305) [dualtor_neighbor_check] Fix the script not exists issue (#3244) (16 hours ago) [Longxiang Lyu]
```
#### How I did it
#### How to verify it
#### Description for the changelog
mssonicbld added a commit that referenced this issue Apr 2, 2024
…atically (#18521)

#### Why I did it
src/sonic-utilities
```
* a056e9d5 - (HEAD -> 202311, origin/202311) [dualtor_neighbor_check] Fix the script not exists issue (#3244) (16 hours ago) [Longxiang Lyu]
```
#### How I did it
#### How to verify it
#### Description for the changelog