[mirroring] Missing per-vendor validation of mirror session queue parameter #8189

raphaelt-nvidia · 2021-07-15T15:41:27Z

Description

It is possible to configure a mirror session with a queue value that is outside the range supported by the switch vendor. The flow is: MirrorOrch::createEntry calls activateSession, which calls sai_mirror_api->create_mirror_session. If that call fails, e.g. because of an invalid queue, activateSession calls handleSaiCreateStatus, which makes orchagent exit on any failure.

I think the solution involves orchagent comparing the user-supplied value of queue with SAI_SWITCH_ATTR_QOS_MAX_NUMBER_OF_TRAFFIC_CLASSES as part of its earlier validations, and not attempting to create the session if this validation fails.

Steps to reproduce the issue:

config mirror_session erspan add ms6 40.0.0.1 40.0.0.2 63 250 "" 15 Ethernet136 tx

Describe the results you received:

ERR syncd#SDK: [SPAN.ERR] SWITCH_PRIO 15 is outside valid range 0-14.
ERR syncd#SDK: [SPAN.ERR] __span_add failed, err: Parameter Error.
ERR syncd#SDK: [SAI_MIRROR.ERR] mlnx_sai_mirror.c[2647]- mlnx_create_mirror_session: Error creating mirror session
ERR syncd#SDK: :- sendApiResponse: api SAI_COMMON_API_CREATE failed in syncd mode: SAI_STATUS_INVALID_PARAMETER
ERR syncd#SDK: :- processQuadEvent: attr: SAI_MIRROR_SESSION_ATTR_TC: 15
ERR swss#orchagent: :- create: create status: SAI_STATUS_INVALID_PARAMETER
ERR swss#orchagent: :- activateSession: Failed to activate mirroring session ms6
ERR swss#orchagent: :- handleSaiCreateStatus: Encountered failure in create operation, exiting orchagent, SAI API: SAI_API_MIRROR, status: SAI_STATUS_INVALID_PARAMETER

Describe the results you expected:

A single error in the log, from orchagent only, and the process does not exit.

Output of `show version`:

SONiC Software Version: SONiC.202106.0-ebc962c22
Distribution: Debian 10.10
Kernel: 4.19.0-12-2-amd64
Build commit: ebc962c
Build date: Thu Jun 24 17:59:46 UTC 2021
Built by: raphaelt@r-build-sonic05

Platform: x86_64-mlnx_msn2410-r0
HwSKU: ACS-MSN2410
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1921X01546
Uptime: 13:55:39 up 2:44, 2 users, load average: 1.58, 1.65, 1.65

Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-syncd-mlnx 202106.0-ebc962c22 d00568dad081 961MB
docker-syncd-mlnx latest d00568dad081 961MB
docker-snmp 202106.0-ebc962c22 69263d7ff0ca 454MB
docker-snmp latest 69263d7ff0ca 454MB
docker-dhcp-relay 202106.0-ebc962c22 12bd15bd2785 420MB
docker-dhcp-relay latest 12bd15bd2785 420MB
docker-teamd 202106.0-ebc962c22 92f76cb0c626 424MB
docker-teamd latest 92f76cb0c626 424MB
docker-nat 202106.0-ebc962c22 36858e717963 427MB
docker-nat latest 36858e717963 427MB
docker-router-advertiser 202106.0-ebc962c22 34fc1983f5d3 413MB
docker-router-advertiser latest 34fc1983f5d3 413MB
docker-platform-monitor 202106.0-ebc962c22 74dd0063f256 738MB
docker-platform-monitor latest 74dd0063f256 738MB
docker-lldp 202106.0-ebc962c22 ce9a9e42fecd 453MB
docker-lldp latest ce9a9e42fecd 453MB
docker-macsec 202106.0-ebc962c22 89d2ebe2ddaa 427MB
docker-macsec latest 89d2ebe2ddaa 427MB
docker-database 202106.0-ebc962c22 32ce5d38e191 413MB
docker-database latest 32ce5d38e191 413MB
docker-orchagent 202106.0-ebc962c22 c6254d27f85d 442MB
docker-orchagent latest c6254d27f85d 442MB
docker-sonic-telemetry 202106.0-ebc962c22 bd04168564d8 501MB
docker-sonic-telemetry latest bd04168564d8 501MB
docker-sonic-mgmt-framework 202106.0-ebc962c22 a5ec5adee9cd 570MB
docker-sonic-mgmt-framework latest a5ec5adee9cd 570MB
docker-fpm-frr 202106.0-ebc962c22 5436d8f7e983 442MB
docker-fpm-frr latest 5436d8f7e983 442MB
docker-sflow 202106.0-ebc962c22 0d353410dcb2 425MB
docker-sflow latest 0d353410dcb2 425MB

zhangyanzhao · 2021-07-21T15:22:09Z

orchagent crashes because of this issue. @shi-su please work with @raphaelt-nvidia on this issue.

raphaelt-nvidia · 2021-08-05T06:36:12Z

Here is suggested code to add to MirrorOrch::createEntry that would address this:

        else if (fvField(i) == MIRROR_SESSION_QUEUE)
        {
            sai_status_t status;
            sai_attribute_t attr;
            entry.queue = to_uint<uint8_t>(fvValue(i));
            attr.id = SAI_SWITCH_ATTR_QOS_MAX_NUMBER_OF_TRAFFIC_CLASSES;
            status = sai_switch_api->get_switch_attribute(gSwitchId, 1, &attr);
            if (status == SAI_STATUS_SUCCESS)
            {
                if (entry.queue > attr.value.u8)
                {
                    SWSS_LOG_ERROR("Failed to get valid queue %s", fvValue(i).c_str());
                    return task_process_status::task_invalid_entry;
                }
            }
        }

My doubt is about the line

                if (entry.queue > attr.value.u8)

Should it be '>' or '>='? I chose '>' because of the attribute's description:

/**
 * @brief Maximum traffic classes limit
 *
 * @type sai_uint8_t
 * @flags READ_ONLY
 */
SAI_SWITCH_ATTR_QOS_MAX_NUMBER_OF_TRAFFIC_CLASSES,

For example, if the valid values for queue are 0-14, I would expect get_switch_attribute to return 14. Or would you expect 15?

shi-su · 2021-08-05T17:29:32Z

I am not sure about this, but I think it should be '>=', since the attribute seems to be defined as the maximum number of traffic classes, and the range 0-14 contains 15 values. In the extreme case where no traffic classes are supported, the return value should be 0; if the attribute meant the highest valid index instead, it could not express that case. This is just my impression; we need to check.

Another concern about this change is that it queries SAI_SWITCH_ATTR_QOS_MAX_NUMBER_OF_TRAFFIC_CLASSES every time before creating an entry, which seems a bit inefficient. Maybe we can query that value during initialization and save it for future use?
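The query-once idea could be sketched roughly as follows. This is only an illustration, not code from sonic-swss: QueueLimitCache and MaxTcQuery are hypothetical names, and the SAI call is replaced by an injected callback so the sketch compiles without the SAI headers (C++17 assumed). It deliberately leaves out the '>' versus '>=' comparison, which is still open at this point in the thread.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>

// Hypothetical stand-in for the SAI switch-attribute query: returns the
// maximum number of traffic classes, or std::nullopt if the query fails.
using MaxTcQuery = std::function<std::optional<uint8_t>()>;

// Hypothetical helper (not part of sonic-swss): queries the limit once
// and reuses the stored value for every later validation.
class QueueLimitCache
{
public:
    explicit QueueLimitCache(MaxTcQuery query) : m_query(std::move(query)) {}

    // Intended to be called once from an initialization routine.
    void init()
    {
        m_maxTc = m_query();
    }

    // True if a limit was obtained during init().
    bool hasLimit() const { return m_maxTc.has_value(); }

    // The stored limit; only meaningful when hasLimit() is true.
    uint8_t limit() const { return m_maxTc.value_or(0); }

private:
    MaxTcQuery m_query;
    std::optional<uint8_t> m_maxTc;
};
```

In an orch, init() would presumably be called from the constructor or another initialization routine, and the per-entry code would only read hasLimit() and limit().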

raphaelt-nvidia · 2021-08-05T17:46:38Z

I see that many instances of calling sai_switch_api->get_switch_attribute occur in initialization routines, so I agree with your second comment. If you wish me to supply the code, I would like to wait until there is a definite ruling on my original question, so that we can test it with a conforming SAI implementation.

shi-su · 2021-08-05T22:33:40Z

I tried your proposed fix. Interestingly, in the scenario where the valid queue values are 0-14, SAI_SWITCH_ATTR_QOS_MAX_NUMBER_OF_TRAFFIC_CLASSES is neither 14 nor 15: I got 16 for this attribute. I am not sure what went wrong.

raphaelt-nvidia · 2021-08-06T10:27:48Z

The 16 is a bug in our SAI implementation that has been changed to 15 - not sure when that goes upstream. My question here is whether it should actually be changed to 14.

shi-su · 2021-08-06T16:46:31Z

As I understand it, 15 makes more sense, but this question would be better answered by someone who is familiar with the SAI definition.

raphaelt-nvidia · 2021-08-08T06:46:08Z

How do we identify and get the attention of the people who should decide? It seems to me that if the decision is 15, then the comment "Maximum traffic classes limit" should be changed and clarified. It should also say something about the lower bound. Is the assumption that 0 is the lowest valid value true for all platforms?

shi-su · 2021-08-09T06:25:20Z

@zhangyanzhao Could you please help get attention from someone who has expertise in the SAI definition? This seems to go beyond my knowledge set.

raphaelt-nvidia · 2021-08-12T11:50:45Z

Ping

zhangyanzhao · 2021-08-12T18:13:57Z

Ack. Let me see how I can help on the SAI part.

prsunny · 2021-09-11T00:07:50Z

Agree with @shi-su on the suggestion. It should be '>='. This is what is being done for cases like MAX ECMP groups (ref). Also, as suggested above, please get the value only once.
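To make the '>=' boundary concrete, here is a minimal sketch. It assumes the attribute is a count of traffic classes, so the valid queue indices run from 0 to the count minus one. The function name isQueueOutOfRange is hypothetical and is not taken from the merged change.

```cpp
#include <cstdint>

// Hypothetical helper: 'maxTrafficClasses' is treated as a count, so the
// valid queue indices are 0 .. maxTrafficClasses - 1. A count of 0 means
// that no queue value is valid.
inline bool isQueueOutOfRange(uint8_t queue, uint8_t maxTrafficClasses)
{
    return queue >= maxTrafficClasses;
}
```

With a reported count of 15, queue 14 is accepted and queue 15 is rejected, which matches the 0-14 range in the SDK error above. With '>' instead, queue 15 would slip through to the SAI call.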

zhangyanzhao added the Triaged label Jul 21, 2021

liat-grozovik added the port mirroring label Aug 24, 2021

zhangyanzhao added the Issue for 202106 label Sep 13, 2021

rupesh-k mentioned this issue Sep 30, 2021: SONiC Yang model support for Mirror #7877 (Merged)

raphaelt-nvidia mentioned this issue Oct 5, 2021: Orchagent validates mirror session queue parameter against maximum va… raphaelt-nvidia/sonic-swss#2 (Closed)

raphaelt-nvidia mentioned this issue Oct 13, 2021: Orchagent validates mirror session queue parameter against maximum value from SAI sonic-net/sonic-swss#1957 (Merged)

prsunny closed this as completed in sonic-net/sonic-swss#1957 Oct 18, 2021
