[warmboot] Recovery path failed when syncd hits "Runtime error: :- translateVidToRid" #8722

Closed · vaibhavhd opened this issue Sep 9, 2021 · 9 comments · Fixed by #8732

@vaibhavhd (Contributor)

Description

Syncd sends a switch shutdown request to OA after "Runtime error: :- translateVidToRid: unable to get RID for VID".

Orchagent cannot process this request gracefully while warm recovery is in progress, which leads to a SIGSEGV with a core dump.
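
For context, the sketch below shows how a missing RID escalates into the shutdown request seen in the logs; it is a minimal illustration with simplified, stand-in names, not the actual syncd code:

```cpp
// Minimal sketch (illustrative names) of the VID -> RID translation failure.
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <unordered_map>

using sai_object_id_t = uint64_t;

// Restored from the saved RID2VID map during warm boot recovery.
static std::unordered_map<sai_object_id_t, sai_object_id_t> g_vid2rid;

static sai_object_id_t translateVidToRid(sai_object_id_t vid)
{
    auto it = g_vid2rid.find(vid);
    if (it == g_vid2rid.end())
        throw std::runtime_error("translateVidToRid: unable to get RID for VID");
    return it->second;
}

int main()
{
    try
    {
        // The buffer pool created during INIT_VIEW has a VID but no RID yet.
        translateVidToRid(0x18000000000759);
    }
    catch (const std::runtime_error& e)
    {
        // syncd's top-level loop catches this, logs "Runtime error: ...",
        // and sends switch_shutdown_request to orchagent.
        std::cerr << "Runtime error: " << e.what() << '\n';
    }
    return 0;
}
```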

Steps to reproduce the issue:

  1. Install latest master image
  2. Issue "warm-reboot" command (no test/traffic needed)
  3. Errors appear in syslog, and a core is generated in the warmboot recovery path.

Describe the results you received:

Sep  9 18:10:59.846684 str2-7050cx3-acs-02 NOTICE admin: Rebooting with /sbin/kexec -e to SONiC-OS-master.34161-7c9be314e ...

Sep  9 18:11:49.725054 str2-7050cx3-acs-02 NOTICE syncd#syncd: :- performWarmRestartSingleSwitch: Warm boot: create switch VID: oid:0x21000000000000 took 8.884678 sec
Sep  9 18:11:49.794082 str2-7050cx3-acs-02 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_BUFFER_POOL: 5
Sep  9 18:11:49.810178 str2-7050cx3-acs-02 NOTICE syncd#syncd: :- checkWarmBootDiscoveredRids: all discovered RIDs are present in current RID2VID map for switch VID oid:0x21000000000000

Sep  9 18:11:55.550007 str2-7050cx3-acs-02 ERR syncd#syncd: :- run: Runtime error: :- translateVidToRid: unable to get RID for VID oid:0x18000000000759
Sep  9 18:11:55.550101 str2-7050cx3-acs-02 NOTICE syncd#syncd: :- sendShutdownRequest: sending switch_shutdown_request notification to OA for switch: oid:0x21000000000000
Sep  9 18:11:55.550260 str2-7050cx3-acs-02 NOTICE syncd#syncd: :- sendShutdownRequestAfterException: notification send successfully

Sep  9 18:12:55.589358 str2-7050cx3-acs-02 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
Sep  9 18:12:55.589358 str2-7050cx3-acs-02 ERR swss#orchagent: :- wait: failed to get response for getresponse
Sep  9 18:12:57.493945 str2-7050cx3-acs-02 NOTICE swss#neighsyncd: :- setWarmStartState: neighsyncd warm start state changed to reconciled
Sep  9 18:13:39.012938 str2-7050cx3-acs-02 NOTICE bgp#fpmsyncd: :- main: Warm-Restart timer expired.
Sep  9 18:13:39.012938 str2-7050cx3-acs-02 NOTICE bgp#fpmsyncd: :- reconcile: Warm-Restart: Initiating reconciliation process for bgp application.
Sep  9 18:13:39.209177 str2-7050cx3-acs-02 NOTICE bgp#fpmsyncd: :- setWarmStartState: bgp warm start state changed to reconciled
Sep  9 18:13:39.209177 str2-7050cx3-acs-02 NOTICE bgp#fpmsyncd: :- reconcile: Warm-Restart: Concluded reconciliation process for bgp application.
Sep  9 18:13:39.209177 str2-7050cx3-acs-02 NOTICE bgp#fpmsyncd: :- main: Warm-Restart reconciliation processed.
Sep  9 18:13:55.593396 str2-7050cx3-acs-02 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
Sep  9 18:13:55.593396 str2-7050cx3-acs-02 ERR swss#orchagent: :- wait: failed to get response for getresponse

Sep  9 18:14:50.951101 str2-7050cx3-acs-02 NOTICE swss#fdbsyncd: :- setWarmStartState: fdbsyncd warm start state changed to reconciled
Sep  9 18:14:50.951101 str2-7050cx3-acs-02 NOTICE swss#fdbsyncd: :- main: VXLAN FDB VNI Reconcillation Complete
Sep  9 18:14:55.654262 str2-7050cx3-acs-02 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
Sep  9 18:14:55.654262 str2-7050cx3-acs-02 ERR swss#orchagent: :- wait: failed to get response for getresponse
Sep  9 18:15:55.715170 str2-7050cx3-acs-02 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
Sep  9 18:15:55.715170 str2-7050cx3-acs-02 ERR swss#orchagent: :- wait: failed to get response for getresponse

Sep  9 18:16:42.011817 str2-7050cx3-acs-02 NOTICE root: WARMBOOT_FINALIZER : Finalizing warmboot...
Sep  9 18:16:42.796416 str2-7050cx3-acs-02 INFO systemd[1]: warmboot-finalizer.service: Succeeded.

Sep  9 18:16:55.773487 str2-7050cx3-acs-02 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
Sep  9 18:16:55.773487 str2-7050cx3-acs-02 ERR swss#orchagent: :- wait: failed to get response for getresponse
Sep  9 18:16:54.714433 str2-7050cx3-acs-02 INFO ntpd[5100]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Sep  9 18:17:55.833319 str2-7050cx3-acs-02 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
Sep  9 18:17:55.833319 str2-7050cx3-acs-02 ERR swss#orchagent: :- wait: failed to get response for getresponse
Sep  9 18:17:55.833549 str2-7050cx3-acs-02 ERR swss#orchagent: :- on_switch_shutdown_request: Syncd stopped
Sep  9 18:17:55.837050 str2-7050cx3-acs-02 INFO kernel: [  413.483865] orchagent[2293]: segfault at 18 ip 000055a16607601d sp 00007ffd2b970430 error 4 in orchagent[55a165f67000+21e000]
Sep  9 18:17:55.837103 str2-7050cx3-acs-02 INFO kernel: [  413.483884] Code: 00 48 8d 75 20 48 89 df e8 20 11 ef ff 85 c0 0f 88 57 09 00 00 48 8d 43 10 48 8b 6d 40 49 39 c4 74 08 4c 89 e7 e8 03 16 ef ff <4c> 8b 65 18 48 8d 45 08 48 8d 4c 24 70 41 bd 01 00 00 00 48 89 44
Sep  9 18:17:56.940899 str2-7050cx3-acs-02 INFO swss#supervisord 2021-09-09 18:17:56,939 INFO exited: orchagent (terminated by SIGSEGV (core dumped); not expected)

SAI redis:

2021-09-09.18:11:39.400830|a|INIT_VIEW
2021-09-09.18:11:50.203023|A|SAI_STATUS_SUCCESS

2021-09-09.18:11:53.250357|c|SAI_OBJECT_TYPE_BUFFER_POOL:oid:0x18000000000759|SAI_BUFFER_POOL_ATTR_THRESHOLD_MODE=SAI_BUFFER_POOL_THRESHOLD_MODE_STATIC|SAI_BUFFER_POOL_ATTR_SIZE=32599040|SAI_BUFFER_POOL_ATTR_TYPE=SAI_BUFFER_POOL_TYPE_EGRESS

2021-09-09.18:11:55.550253|n|switch_shutdown_request|{"switch_id":"oid:0x21000000000000"}|

Describe the results you expected:

Output of show version:

SONiC Software Version: SONiC.master.34161-7c9be314e
Distribution: Debian 10.10
Kernel: 4.19.0-12-2-amd64
Build commit: 7c9be314e
Build date: Tue Sep  7 13:36:19 UTC 2021
Built by: AzDevOps@sonic-build-workers-000O4E

Platform: x86_64-arista_7050cx3_32s
HwSKU: Arista-7050CX3-32S-C32
ASIC: broadcom
ASIC Count: 1

SAI version:

Package: libsaibcm
Status: install ok installed
Priority: extra
Section: libs
Installed-Size: 245525
Maintainer: Guohan Lu <gulv@microsoft.com>
Architecture: amd64
Source: saibcm
Version: 5.0.0.6-1
Provides: libsai

Output of show techsupport:

Additional information you deem important (e.g. issue happens only occasionally):

@lguohan (Collaborator) commented Sep 10, 2021

The earliest issue I see is this one.

fixture_generator = <function verify_dut_health at 0x7fd1bdc39550>
args = (<SubRequest 'verify_dut_health' for <Function test_warm_reboot>>, <tests.common.devices.duthosts.DutHosts object at 0..., {'auto_recover': 'False', 'comment': 'Tests virtual switch vm', 'conf-name': 'vms-kvm-t0', 'duts': ['vlab-01'], ...})
kargs = {}, it = <generator object verify_dut_health at 0x7fd1bc057aa0>
res = None

    @decorator.decorator
    def _fixture_generator_decorator(fixture_generator, *args, **kargs):
        # setup, to the first yield in fixture
        logging.info("-" * 20 + (" fixture %s setup starts " % fixture_generator.__name__) + "-" * 20)
        it = fixture_generator(*args, **kargs)
        try:
            res = next(it)
            logging.info("-" * 20 + (" fixture %s setup ends " % fixture_generator.__name__) + "-" * 20)
            yield res
        except Exception as detail:
            logging.exception("\n%r", detail)
            logging.info("-" * 20 + (" fixture %s setup ends " % fixture_generator.__name__) + "-" * 20)
            raise
    
        # teardown, fixture will raise StopIteration
        logging.info("-" * 20 + (" fixture %s teardown starts " % fixture_generator.__name__) + "-" * 20)
        try:
>           next(it)

args       = (<SubRequest 'verify_dut_health' for <Function test_warm_reboot>>, <tests.common.devices.duthosts.DutHosts object at 0..., {'auto_recover': 'False', 'comment': 'Tests virtual switch vm', 'conf-name': 'vms-kvm-t0', 'duts': ['vlab-01'], ...})
fixture_generator = <function verify_dut_health at 0x7fd1bdc39550>
it         = <generator object verify_dut_health at 0x7fd1bc057aa0>
kargs      = {}
res        = None

common/plugins/log_section_start/__init__.py:84: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

request = <SubRequest 'verify_dut_health' for <Function test_warm_reboot>>
duthosts = <tests.common.devices.duthosts.DutHosts object at 0x7fd1bdb72050>
rand_one_dut_hostname = 'vlab-01'
tbinfo = {'auto_recover': 'False', 'comment': 'Tests virtual switch vm', 'conf-name': 'vms-kvm-t0', 'duts': ['vlab-01'], ...}

    @pytest.fixture
    def verify_dut_health(request, duthosts, rand_one_dut_hostname, tbinfo):
        global test_report
        test_report = {}
        duthost = duthosts[rand_one_dut_hostname]
        check_services(duthost)
        check_interfaces_and_transceivers(duthost, request)
        check_neighbors(duthost, tbinfo)
        if "20191130" in duthost.os_version:
            pre_existing_cores = duthost.shell('ls /var/core/ | grep -v python | wc -l')['stdout']
        else:
            pre_existing_cores = duthost.shell('ls /var/core/ | wc -l')['stdout']
        pytest_assert(all(list(test_report.values())), "DUT not ready for test. Health check failed before reboot: {}"
            .format(test_report))
    
        yield
    
        test_report = {}
        check_services(duthost)
        check_interfaces_and_transceivers(duthost, request)
        check_neighbors(duthost, tbinfo)
        verify_no_coredumps(duthost, pre_existing_cores)
        check_all = all([check == True for check in list(test_report.values())])
        pytest_assert(check_all, "Health check failed after reboot: {}"
>           .format(test_report))
E       Failed: Health check failed after reboot: {'check_interfaces_and_transceivers': True, 'check_services': True, 'verify_no_coredumps': 'Core dumps found. Expected: 0 Found: 1', 'check_neighbors': True}

check      = True
check_all  = False
duthost    = <MultiAsicSonicHost> vlab-01
duthosts   = <tests.common.devices.duthosts.DutHosts object at 0x7fd1bdb72050>
pre_existing_cores = u'0'
rand_one_dut_hostname = 'vlab-01'
request    = <SubRequest 'verify_dut_health' for <Function test_warm_reboot>>
tbinfo     = {'auto_recover': 'False', 'comment': 'Tests virtual switch vm', 'conf-name': 'vms-kvm-t0', 'duts': ['vlab-01'], ...}

platform_tests/verify_dut_health.py:156: Failed

@lguohan (Collaborator) commented Sep 10, 2021

This test:
https://dev.azure.com/mssonic/build/_build/results?buildId=34620&view=results

The image is:

SONiC Software Version: SONiC.master.34588-f4dea87cf
Distribution: Debian 10.10
Kernel: 4.19.0-12-2-amd64
Build commit: f4dea87cf
Build date: Wed Sep  8 16:56:45 UTC 2021
Built by: AzDevOps@sonic-build-workers-000O7P

lguohan added a commit that referenced this issue Sep 11, 2021
Fix #8722

Revert two commits that caused the warm reboot regression:

* eb79ca4 2021-09-01 | [pbh]: Add PBH OA (#1782) [Nazarii Hnydyn]
* 3d6b1f0 2021-08-31 | [buffer orch] Bugfix: Don't query counter SAI_BUFFER_POOL_STAT_XOFF_ROOM_WATERMARK_BYTES on a pool where it is not supported (#1857) [Stephen Sun]

Signed-off-by: Guohan Lu <lguohan@gmail.com>
lguohan reopened this Sep 11, 2021
@nazariig (Collaborator)

@stephenxs FYI

@kcudnik (Contributor) commented Sep 14, 2021

Sep  9 18:11:49.725054 str2-7050cx3-acs-02 NOTICE syncd#syncd: :- performWarmRestartSingleSwitch: Warm boot: create switch VID: oid:0x21000000000000 took 8.884678 sec
Sep  9 18:11:49.794082 str2-7050cx3-acs-02 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_BUFFER_POOL: 5
Sep  9 18:11:49.810178 str2-7050cx3-acs-02 NOTICE syncd#syncd: :- checkWarmBootDiscoveredRids: all discovered RIDs are present in current RID2VID map for switch VID oid:0x21000000000000

Is this the whole log, uncut? This seems terribly wrong if only buffer pools are discovered; there should be a whole list of other objects as well.

@kcudnik (Contributor) commented Sep 14, 2021

Sep  9 18:11:55.550007 str2-7050cx3-acs-02 ERR syncd#syncd: :- run: Runtime error: :- translateVidToRid: unable to get RID for VID oid:0x18000000000759

This is caused by a clear_stats call on the buffer pool during INIT_VIEW mode.
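
To make the sequence concrete, here is a hedged reconstruction from the SAI redis recording above (stand-in types and names, not the actual sairedis code): objects created during INIT_VIEW only receive a RID at APPLY_VIEW, but a stats call needs a real RID immediately.

```cpp
#include <cstdint>

using sai_object_id_t = uint64_t;
using sai_status_t    = int32_t;

sai_object_id_t translateVidToRid(sai_object_id_t vid); // throws if no RID exists

// 18:11:39  INIT_VIEW             -> syncd enters temporary-view mode
// 18:11:53  create BUFFER_POOL    -> object gets a VID; its RID is allocated
//                                    only at APPLY_VIEW, after view comparison
// 18:11:55  clear_stats(pool VID) -> stats calls bypass the view mechanism:
sai_status_t clearPoolStats(sai_object_id_t pool_vid)
{
    // Throws "unable to get RID for VID" for the freshly created pool,
    // which triggers the shutdown request.
    sai_object_id_t rid = translateVidToRid(pool_vid);
    (void)rid; // ... would forward the clear to the vendor SAI here ...
    return 0;  // SAI_STATUS_SUCCESS
}
```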

@yxieca (Contributor) commented Sep 14, 2021

This PR, sonic-net/sonic-swss#1857, introduced the call Kamil mentioned above.

@stephenxs (Collaborator) commented Sep 15, 2021

> This PR: Azure/sonic-swss#1857 introduced the call Kamil mentioned above.

Hi @yxieca,
Originally, there was a clear call in the buffer orchagent during its initialization, which caused the issue sporadically; a bug was already filed for it: sonic-net/sonic-sairedis#862 ("Warm reboot: How is the FlexCounter handled after init view?").
In PR sonic-net/sonic-swss#1857, the clear remains untouched and a get operation was added. I don't know whether that makes the issue reproduce with higher probability, but I think we can fix the original bug, sonic-net/sonic-sairedis#862, first.

@kcudnik (Contributor) commented Sep 15, 2021

I added a PR that should print a nice message and fail in OA on those APIs: sonic-net/sonic-sairedis#930
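
Such a guard might look like the following sketch, assuming sairedis tracks whether the temporary view is active (illustrative names and status values, not the actual PR code):

```cpp
#include <cstdint>

using sai_object_id_t = uint64_t;
using sai_status_t    = int32_t;

constexpr sai_status_t SAI_STATUS_SUCCESS       = 0;
constexpr sai_status_t SAI_STATUS_NOT_SUPPORTED = -2; // illustrative value

static bool g_initViewMode = false; // set on INIT_VIEW, cleared on APPLY_VIEW

sai_status_t clearStatsGuarded(sai_object_id_t vid)
{
    if (g_initViewMode)
    {
        // Fail the call in orchagent with a clear message instead of letting
        // syncd throw deep inside VID -> RID translation.
        return SAI_STATUS_NOT_SUPPORTED;
    }
    // ... translate VID -> RID and forward to the vendor SAI ...
    return SAI_STATUS_SUCCESS;
}
```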

vaibhavhd added a commit to sonic-net/sonic-swss that referenced this issue Oct 6, 2021
Revert "[buffer orch] Bugfix: Don't query counter SAI_BUFFER_POOL_STAT_XOFF_ROOM_WATERMARK_BYTES on a pool where it is not supported (#1857)" (#1945)

This reverts commit 3d6b1f0.

Fix sonic-net/sonic-buildimage#8893

What I did
This commit had earlier caused an issue with master-image warmboot: sonic-net/sonic-buildimage#8722

To fix that issue, a PR was created to roll back the sonic-swss head on buildimage: sonic-net/sonic-buildimage#8732

Now, this commit was again pulled into sonic-buildimage as part of a sonic-swss submodule advance:
sonic-net/sonic-buildimage#8839

And warm-reboot broke again for the same reason.

This change ensures that any future swss submodule update on buildimage will not break warmboot again.
judyjoseph pushed a commit to sonic-net/sonic-swss that referenced this issue Oct 7, 2021
Revert "[buffer orch] Bugfix: Don't query counter SAI_BUFFER_POOL_STAT_XOFF_ROOM_WATERMARK_BYTES on a pool where it is not supported (#1857)" (#1945)

liat-grozovik pushed a commit to sonic-net/sonic-swss that referenced this issue Nov 24, 2021
Don't handle buffer pool watermark during warm reboot reconciling (#1987)

- What I did
Don't handle the buffer pool watermark during warm reboot reconciling.

- Why I did it
This fixes the community issues sonic-net/sonic-sairedis#862 and sonic-net/sonic-buildimage#8722.

- How I verified it
Perform a warm reboot. Check that:
  - buffer pool watermark handling is skipped during reconciling and handled after it;
  - other watermark handling takes place during reconciling, as it did before.

- Details if related
The warm reboot flow is as follows:
  1. The system starts. Orchagent fetches the items stored in the database before warm reboot and pushes them into m_toSync of all orchagents. This is done by bake, which can be overridden by a sub orchagent.
  2. All sub orchagents handle the items in m_toSync. At this point, any notification from redis-db is blocked.
  3. Warm reboot converges.
  4. Orchagent starts to handle notifications from redis-db.

The fix: in FlexCounterOrch::bake, the buffer pool watermark handling is skipped.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
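
A minimal, self-contained sketch of that bake()-time filtering (stand-in types; the real change lives in FlexCounterOrch::bake): replay everything except the buffer pool watermark group, which is re-handled only after reconciliation converges.

```cpp
#include <algorithm>
#include <deque>
#include <string>

struct Entry { std::string key; }; // stand-in for swss KeyOpFieldsValuesTuple

// Filter the items restored from the pre-warm-reboot database before they
// are pushed into m_toSync: the buffer pool watermark group is skipped here.
std::deque<Entry> bakeFiltered(std::deque<Entry> stored)
{
    stored.erase(std::remove_if(stored.begin(), stored.end(),
                                [](const Entry& e) {
                                    return e.key == "BUFFER_POOL_WATERMARK";
                                }),
                 stored.end());
    return stored; // remaining entries replay during reconciling, as before
}
```
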
judyjoseph pushed a commit to sonic-net/sonic-swss that referenced this issue Dec 1, 2021
Don't handle buffer pool watermark during warm reboot reconciling (#1987)

@stephenxs (Collaborator)

Hi, can we close this bug, as it has been resolved by sonic-net/sonic-swss#1987?

qiluo-msft pushed a commit to sonic-net/sonic-swss that referenced this issue Jan 10, 2022
Don't handle buffer pool watermark during warm reboot reconciling (#1987)