Fault Management (Analysis and Handling) #1520

Open
shyam77git opened this issue Nov 20, 2023 · 5 comments
@shyam77git
Contributor

shyam77git commented Nov 20, 2023

Basic Information (context)
Any failure or error impacting a system/chassis or a sub-system is regarded as a fault.
Faults are broadly classified into SW (software) and HW (hardware) faults:

  • SW faults occur during SW processing of a workflow at the process/sub-system or system level.
  • HW faults occur during SW or HW processing of a workflow at the HW (board) level, e.g. in a HW component or device.

Faults may occur at any of the following stages of the system's functioning:

  • system configuration and bring-up
  • feature enablement/configuration
  • steady state
  • feature disablement/unconfiguration
  • going down (config reload, reboot, etc.)

Present State
In SONiC, a fault is represented by an event or an alarm.
SONiC has an Event Framework HLD that lets an event detector publish its events to the eventD redisDB.
However, there is no Fault Manager/Handler that can take the needed/platform-specified action(s) to recover the system from the generated fault.

Need for this feature
This feature aims to add a generic FM (Fault Management) infrastructure that can do the following:

  • Abstract the platform/HWSKU nuances from an open source NOS (i.e. SONiC) by publishing a platform-specific 'Fault-Action Policy table'
  • Fetch events (alarms/faults) from eventD (based on the published YANG/schema)
  • Analyze them (in a generic way) against the above-mentioned Policy Table
  • Take action based on the lookup/match in the Policy Table
    The action could be either generic or platform-specific
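
To make the fetch/analyze/act flow above concrete, here is a minimal sketch of a policy-table lookup. Everything in it is an assumption for illustration only: the `FaultEvent` fields, the `(source, tag)` key, the action names, and the `log-only` fallback are hypothetical and are not taken from the HLD or from any SONiC schema. Fetching from eventD is left out; the events are passed in as plain objects.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FaultEvent:
    """Hypothetical shape of a fault fetched from eventD."""
    source: str    # assumed: component that detected the fault
    tag: str       # assumed: event identifier
    severity: str  # assumed: e.g. "minor", "major", "critical"


# Hypothetical platform-supplied Fault-Action Policy table,
# keyed by (source, tag). Action names are placeholders.
POLICY_TABLE: Dict[Tuple[str, str], str] = {
    ("psu", "psu-fail"): "log-and-alert",
    ("asic", "parity-error"): "reload-linecard",
}

# Assumed generic fallback when the platform table has no match.
DEFAULT_ACTION = "log-only"


def lookup_action(event: FaultEvent,
                  table: Dict[Tuple[str, str], str] = POLICY_TABLE) -> str:
    """Analyze one event against the policy table and return an action."""
    return table.get((event.source, event.tag), DEFAULT_ACTION)


def handle(events: List[FaultEvent],
           table: Dict[Tuple[str, str], str] = POLICY_TABLE) -> List[Tuple[str, str]]:
    """Return (tag, action) pairs; a real handler would execute the actions."""
    return [(e.tag, lookup_action(e, table)) for e in events]
```

Keeping the table as plain data is what would let a platform vendor ship its own policy without touching the generic lookup code.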

Benefits
The platform-supplied 'Fault-Action Policy table' has a holistic, system-level view of the platform (chassis/board/HWSKU) and can gauge the right action required to recover from a fault. It can either go with the recommended action (provided by the fault source/detector) or override it with a system-level one.
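
The choice between the detector's recommended action and a system-level override could be sketched as follows. The function name, the use of `None` to mean "no entry", the precedence order, and the `log-only` default are all assumptions made for this example, not behavior defined by the proposal.

```python
from typing import Optional


def resolve_action(recommended: Optional[str],
                   platform_override: Optional[str],
                   default: str = "log-only") -> str:
    """Pick the action to take for a fault (hypothetical precedence).

    recommended:       action suggested by the fault source/detector, if any
    platform_override: system-level action from the policy table, if any
    default:           assumed fallback when neither is present
    """
    if platform_override is not None:
        # The platform has the system-level view, so its entry wins.
        return platform_override
    if recommended is not None:
        return recommended
    return default
```

For example, `resolve_action("restart-process", "reload-linecard")` would pick the platform's action, while `resolve_action("restart-process", None)` would fall back to the detector's recommendation.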

@shyam77git
Contributor Author

HLD (md) PR: #1527

@zhangyanzhao
Collaborator

Dell registered as the reviewer.

@zhangyanzhao
Collaborator

@venkatmahalingam can you please let me know the github id of other reviewers from Dell? Thanks.

@venkatmahalingam
Collaborator
