-
Notifications
You must be signed in to change notification settings - Fork 647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLI support for SmartSwitch PMON #3271
base: master
Are you sure you want to change the base?
Conversation
Can you please add UT for the new functions? |
addressed. 2. The DPU reboot-cause data is fetched directly fromn the chassis_state_db now
temporarily bypassing the check
is differed for now.
@rameshraghupathy Could you please add sample output for the DPUs/Smartswitch case? |
@rameshraghupathy can you update the CLI in this doc https://github.com/sonic-net/sonic-utilities/blob/master/doc/Command-Reference.md . There is a "reboot-cause" section |
show/system_health.py
Outdated
|
||
''' | ||
# | ||
# TBD: Uncomment this code in phase:2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy can you remove these comment code? You can raise a PR in your personal PR for future reference.
tests/system_health_test.py
Outdated
|
||
''' | ||
# | ||
# TBD: Uncomment this code in phase:2 when system-health is supported |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy can you please remove these commented test code. you can raise a PR in your personal repo for future reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy can you please remove these commented test code. you can raise a PR in your personal repo for future reference
Done
These CLIs were executed on the Switch: root@MtFuji:/opt/cisco/bin# show reboot-cause SWITCH 2024_07_24_20_43_22 Power Loss N/A N/A 2024_07_24_20_43_22 Power Loss N/A N/A Unknown DPU7 DPU7 Unknown N/A N/A N/A SWITCH 2024_07_24_20_43_22 Power Loss N/A N/A Unknown root@MtFuji:/opt/cisco/bin# show reboot-cause history SWITCH SWITCH 2024_07_24_20_43_22 Power Loss N/A N/A Unknown |
Done. BTW, the platform work is in progress to update the N/A fields in the output |
Done |
PR will be raised eventually for phase:2
f818c3c
to
d229307
Compare
|
||
- Example: | ||
``` | ||
root@MtFuji:~$ show reboot-cause all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy This will list the DPUs that are in admin shut?
SWITCH 2024_07_24_20_43_22 Power Loss N/A N/A | ||
DPU2 2024_07_24_20_43_22 Software causes (Reboot) N/A N/A | ||
DPU1 2024_07_24_20_43_22 Software causes (Reboot) N/A N/A | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy What will be the output for DPU that is in admin shut?
@@ -221,12 +223,16 @@ def port_optics_get(db, intf_name, type): | |||
Get optic type info for port | |||
""" | |||
full_table_id = PORT_TRANSCEIVER_TABLE_PREFIX + intf_name | |||
port_id = PORT_TABLE_PREFIX + intf_name | |||
port_role = db.get(db.CONFIG_DB, port_id, PORT_ROLE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy I am not sure why we need the PORT_ROLE in config DB. Can you point me to the HLD section?
state_db = SonicV2Connector(host="127.0.0.1", port="6379") | ||
state_db.connect(state_db.STATE_DB) | ||
|
||
key_pattern = '*' | ||
key_pattern = CHASSIS_MODULE_INFO_TABLE + '|*' | ||
if chassis_module_name: | ||
key_pattern = '|' + chassis_module_name | ||
key_pattern = CHASSIS_MODULE_INFO_TABLE + '|' + chassis_module_name | ||
|
||
keys = state_db.keys(state_db.STATE_DB, CHASSIS_MODULE_INFO_TABLE + key_pattern) | ||
keys = state_db.keys(state_db.STATE_DB, key_pattern) | ||
if not keys: | ||
print('Key {} not found in {} table'.format(key_pattern, CHASSIS_MODULE_INFO_TABLE)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy Have you tested this CLI change in T2 chassis?
@reboot_cause.command() | ||
@click.argument('module_name', required=False) | ||
def history(module_name): | ||
"""Show history of reboot-cause""" | ||
reboot_cause_history = fetch_reboot_cause_history_from_db(module_name) | ||
if module_name is not None: | ||
header = ['Device', 'Name', 'Cause', 'Time', 'User', 'Comment'] | ||
click.echo(tabulate(reboot_cause_history, header, numalign="left")) | ||
else: | ||
click.echo("Reboot-cause history is not yet available in StateDB") | ||
sys.exit(1) | ||
header = ['Name', 'Cause', 'Time', 'User', 'Comment'] | ||
click.echo(tabulate(reboot_cause_history, header, numalign="left")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy please test this CLI on non-smartswitch
row[6] = value | ||
|
||
|
||
@system_health.command() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy Seems this command is not mentioned in command-reference.md?
elif up_cnt == 3: | ||
oper_status = "Online" | ||
else: | ||
oper_status = "Partial Online" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy What is partial online? Is this defined in HLD?
for key, value in state_info.items(): | ||
if key.endswith('_state'): | ||
if value.lower() == 'up': | ||
up_cnt = up_cnt + 1 | ||
if 'midplane' in key and value.lower() == 'down': | ||
midplanedown = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy can you explain whats going on here? Either the midplane is down or up when the user executed the CLI. ?
@@ -24,6 +23,100 @@ def read_reboot_cause_file(): | |||
return reboot_cause_dict | |||
|
|||
|
|||
# Function to fetch reboot cause data from database | |||
def fetch_data_from_db(module_name, fetch_history=False, use_chassis_db=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy how are these functions prevented from executing on non-smartswitch platforms?
entry = rdb.get_all(rdb.CHASSIS_STATE_DB, tk) | ||
else: | ||
entry = rdb.get_all(rdb.STATE_DB, tk) | ||
|
||
if module_name is not None: | ||
if 'device' in entry: | ||
if module_name != entry['device'] and module_name != "all": | ||
continue | ||
if entry['device'] in d and not history: | ||
append = False | ||
continue | ||
elif not entry['device'] in d or entry['device'] in d and history: | ||
if not entry['device'] in d: | ||
d.append(entry['device']) | ||
append = True | ||
r.append(entry['device'] if 'device' in entry else "SWITCH") | ||
|
||
name = tk.replace(prefix, "") | ||
if "|" in name: | ||
name = name[:name.rindex('|')] + '' | ||
r.append(name) | ||
r.append(entry['cause'] if 'cause' in entry else "") | ||
r.append(entry['time'] if 'time' in entry else "") | ||
r.append(entry['user'] if 'user' in entry else "") | ||
if append and not fetch_history: | ||
table.append(r) | ||
elif fetch_history: | ||
r.append(entry['comment'] if 'comment' in entry else "") | ||
if module_name is None or module_name == 'all' or module_name.startswith('SWITCH') or \ | ||
'device' in entry and module_name == entry['device']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy can you test if these CLIs crash if the DB is empty? are we handling those corner case gracefully?
def show_module_state(module_name): | ||
chassis_state_db = SonicV2Connector(host=CHASSIS_SERVER, port=CHASSIS_SERVER_PORT) | ||
chassis_state_db.connect(chassis_state_db.CHASSIS_STATE_DB) | ||
key = 'DPU_STATE|' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy I don't see any check here that prevents this function being called on non-smartswitch platform
What I did
Enhanced the following CLIs to support SmartSwitch PMON as described in the PMON HLD documentation "https://github.com/sonic-net/SONiC/blob/d19d8933a43d0a31a4f3b2310f4336f289bca340/doc/smart-switch/pmon/smartswitch-pmon.md"
CLIs:
Added new module "DPUX" support for 1 and 2 below
1. "config chassis module startup DPUX" , where X could be 0, to the maximum number of DPUs-1 in the SmartSwitch chassis
2. "config chassis module shutdown DPUX"
Extended the following CLIs to support the new module "DPUX" and also proved a "all" option to display the "SWITCH" and all "DPUX" modules
1. "show reboot-cause" will remain the same and added "show reboot-cause all"
2. "show reboot-cause history" will remain the same and added "show reboot-cause history ", where module name could be DPUX, SWITCH and all.
Extended the following CLIs to support the new module "DPUX" and also proved a "all" option to display the "SWITCH" and all "DPUX" modules
1. "show system-health summary" will remain the same and added sub-command "show system-health summary ", where module name could be DPUX, SWITCH and all.
2. "show system-health monitor-list" will remain the same and added sub-command "show system-health monitor-list ", where module name could be DPUX, SWITCH and all.
3. "show system-health summary" will remain the same and added sub-command "show system-health summary ", where module name could be DPUX, SWITCH and all.etail" will remain the same and added sub-command "show system-health detail ", where module name could be DPUX, SWITCH and all.
4. Added a new sub command "show system-health dpu ", where module name could be DPUX, and all. This new subcommand will provide additional DPU state details as mentioned in the HLD
How I did it
How to verify it
Require files:
- This PR including reboot_cause.py, chassis_modules.py, system_health.py)
- The other PR including module_base.py, chassis_base.py, docker-pmon.supervisord.conf.j2, chassisd, mock_module_base.py, and the appropriate database_config.json
- Platform "platform-cisco-8000" supporting PMON (module.py, chassis.py, inventory.py, pmon_daemon_control.json, and the required grpc and DB changes)
Previous command output (if the output of a command-line utility has changed)
root@sonic:~# show reboot-cause
Unknown
root@sonic:~# show reboot-cause history
Name Cause Time User Comment
2023_06_19_11_00_24 Power Loss N/A N/A Unknown (First boot of SONiC version 202311.10869-dirty-2024044)
New command output (if the output of a command-line utility has changed)
root@sonic:~# show reboot-cause history all
Device Name Cause Time User Comment
SWITCH 2023_06_19_11_00_24 Power Loss N/A N/A Unknown (First boot of SONiC version 202311.10869-dirty-2024044)
root@sonic:~# show reboot-cause history SWITCH
Device Name Cause Time User Comment
SWITCH 2023_06_19_11_00_24 Power Loss N/A N/A Unknown (First boot of SONiC version 202311.10869-dirty-2024044)
root@sonic:~# show reboot-cause history DPU0
Device Name Cause Time User Comment