Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLI support for SmartSwitch PMON #3271

Open
wants to merge 59 commits into
base: master
Choose a base branch
from

Conversation

rameshraghupathy
Copy link

@rameshraghupathy rameshraghupathy commented Apr 14, 2024

What I did

Enhanced the following CLIs to support SmartSwitch PMON as described in the PMON HLD documentation "https://github.com/sonic-net/SONiC/blob/d19d8933a43d0a31a4f3b2310f4336f289bca340/doc/smart-switch/pmon/smartswitch-pmon.md"

CLIs:
Added new module "DPUX" support for 1 and 2 below
1. "config chassis module startup DPUX" , where X could be 0, to the maximum number of DPUs-1 in the SmartSwitch chassis
2. "config chassis module shutdown DPUX"

Extended the following CLIs to support the new module "DPUX" and also proved a "all" option to display the "SWITCH" and all "DPUX" modules
1. "show reboot-cause" will remain the same and added "show reboot-cause all"
2. "show reboot-cause history" will remain the same and added "show reboot-cause history ", where module name could be DPUX, SWITCH and all.

Extended the following CLIs to support the new module "DPUX" and also proved a "all" option to display the "SWITCH" and all "DPUX" modules
1. "show system-health summary" will remain the same and added sub-command "show system-health summary ", where module name could be DPUX, SWITCH and all.
2. "show system-health monitor-list" will remain the same and added sub-command "show system-health monitor-list ", where module name could be DPUX, SWITCH and all.
3. "show system-health summary" will remain the same and added sub-command "show system-health summary ", where module name could be DPUX, SWITCH and all.etail" will remain the same and added sub-command "show system-health detail ", where module name could be DPUX, SWITCH and all.
4. Added a new sub command "show system-health dpu ", where module name could be DPUX, and all. This new subcommand will provide additional DPU state details as mentioned in the HLD

How I did it

  1. Kept the original CLI output unaltered
  2. Added sub command to support SmartSwitch "DPUs"
  3. Added additional code in chassisd, and in platform modules.py, chassis.py to support it
  4. Updated the DB tables as mentioned in the PMON HLD

How to verify it

  1. Build an image with the required files (refer to the other upstream PRs and the platform PRs)
    Require files:
    - This PR including reboot_cause.py, chassis_modules.py, system_health.py)
    - The other PR including module_base.py, chassis_base.py, docker-pmon.supervisord.conf.j2, chassisd, mock_module_base.py, and the appropriate database_config.json
    - Platform "platform-cisco-8000" supporting PMON (module.py, chassis.py, inventory.py, pmon_daemon_control.json, and the required grpc and DB changes)
  2. Run the CLIs and see the new output

Previous command output (if the output of a command-line utility has changed)

root@sonic:~# show reboot-cause
Unknown

root@sonic:~# show reboot-cause history
Name Cause Time User Comment


2023_06_19_11_00_24 Power Loss N/A N/A Unknown (First boot of SONiC version 202311.10869-dirty-2024044)

New command output (if the output of a command-line utility has changed)

root@sonic:~# show reboot-cause history all
Device Name Cause Time User Comment


SWITCH 2023_06_19_11_00_24 Power Loss N/A N/A Unknown (First boot of SONiC version 202311.10869-dirty-2024044)

root@sonic:~# show reboot-cause history SWITCH
Device Name Cause Time User Comment


SWITCH 2023_06_19_11_00_24 Power Loss N/A N/A Unknown (First boot of SONiC version 202311.10869-dirty-2024044)

root@sonic:~# show reboot-cause history DPU0
Device Name Cause Time User Comment


@oleksandrivantsiv
Copy link
Collaborator

Can you please add UT for the new functions?

show/chassis_modules.py Outdated Show resolved Hide resolved
@prgeor
Copy link
Contributor

prgeor commented Sep 16, 2024

@rameshraghupathy Could you please add sample output for the DPUs/Smartswitch case?

@prgeor
Copy link
Contributor

prgeor commented Sep 16, 2024

@rameshraghupathy can you update the CLI in this doc https://github.com/sonic-net/sonic-utilities/blob/master/doc/Command-Reference.md . There is a "reboot-cause" section


'''
#
# TBD: Uncomment this code in phase:2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy can you remove these comment code? You can raise a PR in your personal PR for future reference.


'''
#
# TBD: Uncomment this code in phase:2 when system-health is supported
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy can you please remove these commented test code. you can raise a PR in your personal repo for future reference

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy can you please remove these commented test code. you can raise a PR in your personal repo for future reference

Done

@prgeor

@rameshraghupathy
Copy link
Author

rameshraghupathy commented Sep 24, 2024

@rameshraghupathy Could you please add sample output for the DPUs/Smartswitch case?

These CLIs were executed on the Switch:

root@MtFuji:/opt/cisco/bin# show reboot-cause
Power Loss
root@MtFuji:/opt/cisco/bin# show reboot-cause all
Device Name Cause Time User


SWITCH 2024_07_24_20_43_22 Power Loss N/A N/A
DPU7 DPU7 Unknown N/A N/A
DPU3 DPU3 Unknown N/A N/A
DPU2 DPU2 Unknown N/A N/A
DPU1 DPU1 Unknown N/A N/A
root@MtFuji:/opt/cisco/bin# show reboot-cause history
Name Cause Time User Comment


2024_07_24_20_43_22 Power Loss N/A N/A Unknown
2024_07_24_06_21_32 Power Loss N/A N/A Unknown
2024_07_23_23_06_57 Kernel Panic Tue Jul 23 11:02:27 PM UTC 2024 N/A N/A
2024_07_23_12_07_57 Power Loss N/A N/A Unknown
2024_07_23_11_21_32 Power Loss N/A N/A Unknown
2024_07_23_09_51_35 Power Loss N/A N/A Unknown (First boot of SONiC version azure_cisco_master.15585-dirty-20240907.225359)
root@MtFuji:/opt/cisco/bin#
root@MtFuji:/opt/cisco/bin# show reboot-cause history DPU7
Device Name Cause Time User Comment


DPU7 DPU7 Unknown N/A N/A N/A
DPU7 DPU7 Unknown N/A N/A N/A
DPU7 DPU7 Software causes (Hardware watchdog reset) N/A N/A N/A
root@MtFuji:/opt/cisco/bin# show reboot-cause history all
Device Name Cause Time User Comment


SWITCH 2024_07_24_20_43_22 Power Loss N/A N/A Unknown
SWITCH 2024_07_24_06_21_32 Power Loss N/A N/A Unknown
SWITCH 2024_07_23_23_06_57 Kernel Panic Tue Jul 23 11:02:27 PM UTC 2024 N/A N/A
SWITCH 2024_07_23_12_07_57 Power Loss N/A N/A Unknown
SWITCH 2024_07_23_11_21_32 Power Loss N/A N/A Unknown
SWITCH 2024_07_23_09_51_35 Power Loss N/A N/A Unknown (First boot of SONiC version azure_cisco_master.15585-dirty-20240907.225359)
DPU7 DPU7 Unknown N/A N/A N/A
DPU7 DPU7 Unknown N/A N/A N/A
DPU7 DPU7 Software causes (Hardware watchdog reset) N/A N/A N/A
DPU3 DPU3 Unknown N/A N/A N/A
DPU3 DPU3 Unknown N/A N/A N/A
DPU2 DPU2 Unknown N/A N/A N/A
DPU2 DPU2 Software causes (Reboot) N/A N/A N/A
DPU2 DPU2 Unknown N/A N/A N/A
DPU2 DPU2 Unknown N/A N/A N/A
DPU1 DPU1 Unknown N/A N/A N/A
DPU1 DPU1 Software causes (Reboot) N/A N/A Unknown

root@MtFuji:/opt/cisco/bin# show reboot-cause history SWITCH
Device Name Cause Time User Comment


SWITCH 2024_07_24_20_43_22 Power Loss N/A N/A Unknown
SWITCH 2024_07_24_06_21_32 Power Loss N/A N/A Unknown
SWITCH 2024_07_23_23_06_57 Kernel Panic Tue Jul 23 11:02:27 PM UTC 2024 N/A N/A
SWITCH 2024_07_23_12_07_57 Power Loss N/A N/A Unknown
SWITCH 2024_07_23_11_21_32 Power Loss N/A N/A Unknown
SWITCH 2024_07_23_09_51_35 Power Loss N/A N/A Unknown (First boot of SONiC version azure_cisco_master.15585-dirty-20240907.225359)

@prgeor

@rameshraghupathy
Copy link
Author

@rameshraghupathy Could you please add sample output for the DPUs/Smartswitch case?

Done. BTW, the platform work is in progress to update the N/A fields in the output

@rameshraghupathy
Copy link
Author

@rameshraghupathy can you update the CLI in this doc https://github.com/sonic-net/sonic-utilities/blob/master/doc/Command-Reference.md . There is a "reboot-cause" section

Done
@prgeor


- Example:
```
root@MtFuji:~$ show reboot-cause all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy This will list the DPUs that are in admin shut?

Comment on lines +730 to +733
SWITCH 2024_07_24_20_43_22 Power Loss N/A N/A
DPU2 2024_07_24_20_43_22 Software causes (Reboot) N/A N/A
DPU1 2024_07_24_20_43_22 Software causes (Reboot) N/A N/A
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy What will be the output for DPU that is in admin shut?

@@ -221,12 +223,16 @@ def port_optics_get(db, intf_name, type):
Get optic type info for port
"""
full_table_id = PORT_TRANSCEIVER_TABLE_PREFIX + intf_name
port_id = PORT_TABLE_PREFIX + intf_name
port_role = db.get(db.CONFIG_DB, port_id, PORT_ROLE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy I am not sure why we need the PORT_ROLE in config DB. Can you point me to the HLD section?

Comment on lines +40 to 49
state_db = SonicV2Connector(host="127.0.0.1", port="6379")
state_db.connect(state_db.STATE_DB)

key_pattern = '*'
key_pattern = CHASSIS_MODULE_INFO_TABLE + '|*'
if chassis_module_name:
key_pattern = '|' + chassis_module_name
key_pattern = CHASSIS_MODULE_INFO_TABLE + '|' + chassis_module_name

keys = state_db.keys(state_db.STATE_DB, CHASSIS_MODULE_INFO_TABLE + key_pattern)
keys = state_db.keys(state_db.STATE_DB, key_pattern)
if not keys:
print('Key {} not found in {} table'.format(key_pattern, CHASSIS_MODULE_INFO_TABLE))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy Have you tested this CLI change in T2 chassis?

Comment on lines +168 to +178
@reboot_cause.command()
@click.argument('module_name', required=False)
def history(module_name):
"""Show history of reboot-cause"""
reboot_cause_history = fetch_reboot_cause_history_from_db(module_name)
if module_name is not None:
header = ['Device', 'Name', 'Cause', 'Time', 'User', 'Comment']
click.echo(tabulate(reboot_cause_history, header, numalign="left"))
else:
click.echo("Reboot-cause history is not yet available in StateDB")
sys.exit(1)
header = ['Name', 'Cause', 'Time', 'User', 'Comment']
click.echo(tabulate(reboot_cause_history, header, numalign="left"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy please test this CLI on non-smartswitch

row[6] = value


@system_health.command()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy Seems this command is not mentioned in command-reference.md?

elif up_cnt == 3:
oper_status = "Online"
else:
oper_status = "Partial Online"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy What is partial online? Is this defined in HLD?

Comment on lines +192 to +197
for key, value in state_info.items():
if key.endswith('_state'):
if value.lower() == 'up':
up_cnt = up_cnt + 1
if 'midplane' in key and value.lower() == 'down':
midplanedown = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy can you explain whats going on here? Either the midplane is down or up when the user executed the CLI. ?

@@ -24,6 +23,100 @@ def read_reboot_cause_file():
return reboot_cause_dict


# Function to fetch reboot cause data from database
def fetch_data_from_db(module_name, fetch_history=False, use_chassis_db=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy how are these functions prevented from executing on non-smartswitch platforms?

Comment on lines +50 to +79
entry = rdb.get_all(rdb.CHASSIS_STATE_DB, tk)
else:
entry = rdb.get_all(rdb.STATE_DB, tk)

if module_name is not None:
if 'device' in entry:
if module_name != entry['device'] and module_name != "all":
continue
if entry['device'] in d and not history:
append = False
continue
elif not entry['device'] in d or entry['device'] in d and history:
if not entry['device'] in d:
d.append(entry['device'])
append = True
r.append(entry['device'] if 'device' in entry else "SWITCH")

name = tk.replace(prefix, "")
if "|" in name:
name = name[:name.rindex('|')] + ''
r.append(name)
r.append(entry['cause'] if 'cause' in entry else "")
r.append(entry['time'] if 'time' in entry else "")
r.append(entry['user'] if 'user' in entry else "")
if append and not fetch_history:
table.append(r)
elif fetch_history:
r.append(entry['comment'] if 'comment' in entry else "")
if module_name is None or module_name == 'all' or module_name.startswith('SWITCH') or \
'device' in entry and module_name == entry['device']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy can you test if these CLIs crash if the DB is empty? are we handling those corner case gracefully?

def show_module_state(module_name):
chassis_state_db = SonicV2Connector(host=CHASSIS_SERVER, port=CHASSIS_SERVER_PORT)
chassis_state_db.connect(chassis_state_db.CHASSIS_STATE_DB)
key = 'DPU_STATE|'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy I don't see any check here that prevents this function being called on non-smartswitch platform

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants