[Telemetry] After ONIE install, the telemetry process inside telemetry container exits but docker stays up #16533

Closed
dgsudharsan opened this issue Sep 13, 2023 · 5 comments
Labels: Issue for 202305, MSFT, Triaged

Comments

@dgsudharsan
Collaborator

dgsudharsan commented Sep 13, 2023

Description

After installing through ONIE, the telemetry process inside the telemetry container exits, and sometimes its state is FATAL.

root@r-bulldog-03:~# docker exec -it telemetry bash
root@r-bulldog-03:/# supervisorctl status
containercfgd                    RUNNING   pid 16, uptime 0:10:16
dependent-startup                EXITED    Sep 13 02:33 AM
dialout                          RUNNING   pid 22, uptime 0:10:13
rsyslogd                         RUNNING   pid 11, uptime 0:10:18
start                            EXITED    Sep 13 02:33 AM
supervisor-proc-exit-listener    RUNNING   pid 8, uptime 0:10:19
telemetry                        EXITED    Sep 13 02:33 AM
root@r-anaconda-51:/home/admin# docker exec telemetry supervisorctl status
containercfgd                    RUNNING   pid 16, uptime 0:04:43
dependent-startup                RUNNING   pid 7, uptime 0:04:46
dialout                          STOPPED   Not started
rsyslogd                         RUNNING   pid 11, uptime 0:04:45
start                            EXITED    Sep 07 03:27 PM
supervisor-proc-exit-listener    RUNNING   pid 8, uptime 0:04:46
telemetry                        FATAL     Exited too quickly (process log may have details)

Sep 7 18:27:45.653772 r-anaconda-51 INFO telemetry#supervisord 2023-09-07 15:27:45,652 INFO exited: telemetry (exit status 0; not expected)

CONTAINER ID   IMAGE                                COMMAND                  CREATED         STATUS         PORTS     NAMES
ef27c6674bd1   da0d5011c828                         "/usr/local/bin/supe…"   5 minutes ago   Up 5 minutes             what-just-happened
88975307ca20   docker-sonic-telemetry:latest        "/usr/local/bin/supe…"   5 minutes ago   Up 5 minutes             telemetry
406b86ebed27   docker-snmp:latest                   "/usr/local/bin/supe…"   5 minutes ago   Up 5 minutes             snmp
d8928328285d   docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   5 minutes ago   Up 5 minutes             mgmt-framework
eb47aa6353e9   docker-lldp:latest                   "/usr/bin/docker-lld…"   5 minutes ago   Up 5 minutes             lldp
b07e18a80aa7   17676f080268                         "/usr/local/bin/supe…"   5 minutes ago   Up 5 minutes             doai
9ec1e4932275   1c536017f212                         "/usr/bin/docker_ini…"   5 minutes ago   Up 5 minutes             dhcp_relay
3442ac024c5c   docker-router-advertiser:latest      "/usr/bin/docker-ini…"   8 minutes ago   Up 6 minutes             radv
a5fd6055df9f   docker-platform-monitor:latest       "/usr/bin/docker_ini…"   8 minutes ago   Up 6 minutes             pmon
43d29c7f45a6   docker-syncd-mlnx:latest             "/usr/local/bin/supe…"   8 minutes ago   Up 6 minutes             syncd
efb481585b52   docker-fpm-frr:latest                "/usr/bin/docker_ini…"   8 minutes ago   Up 7 minutes             bgp
a2a815188981   docker-teamd:latest                  "/usr/local/bin/supe…"   9 minutes ago   Up 7 minutes             teamd
c3d0ab212a1b   docker-orchagent:latest              "/usr/bin/docker-ini…"   9 minutes ago   Up 7 minutes             swss
5c61e39a9fd9   docker-eventd:latest                 "/usr/local/bin/supe…"   9 minutes ago   Up 7 minutes             eventd
fce8c3979214   docker-database:latest               "/usr/local/bin/dock…"   9 minutes ago   Up 7 minutes             database
root@r-bulldog-03:~# sonic-cfggen -d -v TELEMETRY

root@r-bulldog-03:~#

Steps to reproduce the issue:

  1. Perform an ONIE install.
  2. Check telemetry status.

Describe the results you received:

The telemetry process exits. However, the docker container stays up even though telemetry is a critical process.

Describe the results you expected:

The main telemetry process should not exit. If it does exit, the docker container should exit as well.

Output of show version:

show version

SONiC Software Version: SONiC.202305_RC.4-a4fbef8bc_Internal
SONiC OS Version: 11
Distribution: Debian 11.7
Kernel: 5.10.0-18-2-amd64
Build commit: a4fbef8bc
Build date: Tue Sep 12 16:31:36 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci03-243

Platform: x86_64-mlnx_msn2100-r0
HwSKU: ACS-MSN2100
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1752X06330
Model Number: MSN2100-CB2F
Hardware Revision: A1
Uptime: 03:17:46 up 47 min,  1 user,  load average: 0.57, 0.53, 0.65
Date: Wed 13 Sep 2023 03:17:46

Docker images:
REPOSITORY                                         TAG                              IMAGE ID       SIZE
docker-orchagent                                   202305_RC.4-a4fbef8bc_Internal   7acdadbb064c   328MB
docker-orchagent                                   latest                           7acdadbb064c   328MB
docker-fpm-frr                                     202305_RC.4-a4fbef8bc_Internal   26f22a12fc79   348MB
docker-fpm-frr                                     latest                           26f22a12fc79   348MB
docker-nat                                         202305_RC.4-a4fbef8bc_Internal   fa385a23398e   319MB
docker-nat                                         latest                           fa385a23398e   319MB
docker-sflow                                       202305_RC.4-a4fbef8bc_Internal   2ff9bf1e70a9   318MB
docker-sflow                                       latest                           2ff9bf1e70a9   318MB
docker-teamd                                       202305_RC.4-a4fbef8bc_Internal   5d9f9ae038aa   317MB
docker-teamd                                       latest                           5d9f9ae038aa   317MB
docker-macsec                                      latest                           55e56b22516d   319MB
docker-syncd-mlnx                                  202305_RC.4-a4fbef8bc_Internal   616ccd12a441   823MB
docker-syncd-mlnx                                  latest                           616ccd12a441   823MB
docker-dhcp-relay                                  latest                           343d390dae33   306MB
docker-eventd                                      202305_RC.4-a4fbef8bc_Internal   2b7aec4ae7a0   299MB
docker-eventd                                      latest                           2b7aec4ae7a0   299MB
docker-platform-monitor                            202305_RC.4-a4fbef8bc_Internal   3ba68825f54c   815MB
docker-platform-monitor                            latest                           3ba68825f54c   815MB
docker-snmp                                        202305_RC.4-a4fbef8bc_Internal   81ccd0cf706e   338MB
docker-snmp                                        latest                           81ccd0cf706e   338MB
docker-sonic-telemetry                             202305_RC.4-a4fbef8bc_Internal   3fa87969b07c   599MB
docker-sonic-telemetry                             latest                           3fa87969b07c   599MB
docker-lldp                                        202305_RC.4-a4fbef8bc_Internal   9412a37cc891   341MB
docker-lldp                                        latest                           9412a37cc891   341MB
docker-mux                                         202305_RC.4-a4fbef8bc_Internal   49c55dacb4fb   348MB
docker-mux                                         latest                           49c55dacb4fb   348MB
docker-database                                    202305_RC.4-a4fbef8bc_Internal   448e222d1079   299MB
docker-database                                    latest                           448e222d1079   299MB
docker-router-advertiser                           202305_RC.4-a4fbef8bc_Internal   62a30b600998   299MB
docker-router-advertiser                           latest                           62a30b600998   299MB
docker-sonic-mgmt-framework                        202305_RC.4-a4fbef8bc_Internal   2b97e20df004   415MB
docker-sonic-mgmt-framework                        latest                           2b97e20df004   415MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh   1.0.0-202305-2                   07eeec349434   432MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/doai        1.0.0-202305-1                   17676f080268   277MB

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

sonic_dump_r-bulldog-03_20230913_023753.tar.gz
sonic_dump_r-anaconda-51_20230907_183233.tar.gz

@dgsudharsan changed the title from "[Telemetry] After ONIE install, the telemetry process inside telemetry container exits" to "[Telemetry] After ONIE install, the telemetry process inside telemetry container exits but docker stays up" on Sep 13, 2023
@prgeor added the Triaged and MSFT labels on Sep 13, 2023
@prgeor
Contributor

prgeor commented Sep 13, 2023

@dgsudharsan could you please capture the difference in behavior between the two SONiC versions?

@dgsudharsan
Collaborator Author

In 202211, when installing from ONIE, the telemetry process exits. However, the telemetry docker exits along with it, since the telemetry process is defined as a critical process. In 202305, the telemetry process exits but the telemetry docker does not.

root@r-anaconda-51:/home/admin# docker exec telemetry bash -c '[ -f /etc/supervisor/critical_processes ] && cat /etc/supervisor/critical_processes'
program:telemetry
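
The check described here (a critical process that is not running while the container stays up) can be sketched as a small parser over the two outputs shown in this thread. This is only an illustration: the function names are hypothetical, the sample text is copied from the outputs above, and only `program:` entries in `critical_processes` are handled.

```python
def parse_supervisor_status(text):
    """Map process name -> state from `supervisorctl status` output."""
    states = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            states[parts[0]] = parts[1]
    return states


def parse_critical_processes(text):
    """Return program names from lines such as `program:telemetry`."""
    names = []
    for line in text.splitlines():
        kind, sep, name = line.strip().partition(":")
        if sep and kind == "program" and name:
            names.append(name)
    return names


def down_critical_processes(status_text, critical_text):
    """List critical programs whose state is anything other than RUNNING."""
    states = parse_supervisor_status(status_text)
    return [
        name
        for name in parse_critical_processes(critical_text)
        if states.get(name) != "RUNNING"
    ]


# Sample input copied from the outputs quoted in this issue.
STATUS = """\
containercfgd                    RUNNING   pid 16, uptime 0:10:16
dependent-startup                EXITED    Sep 13 02:33 AM
dialout                          RUNNING   pid 22, uptime 0:10:13
telemetry                        EXITED    Sep 13 02:33 AM
"""
CRITICAL = "program:telemetry\n"

print(down_critical_processes(STATUS, CRITICAL))  # ['telemetry']
```

A non-empty result while `docker ps` still lists the container as up corresponds to the 202305 behavior reported in this issue.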

@FengPan-Frank
Contributor

Reproduced the issue locally on version 20230531.03.

After ONIE installation, the telemetry process has indeed exited.

admin@sonic:/var/log$ docker exec telemetry supervisorctl status
containercfgd                    RUNNING   pid 16, uptime 0:34:55
dependent-startup                EXITED    Sep 20 07:45 AM
dialout                          RUNNING   pid 22, uptime 0:34:50
rsyslogd                         RUNNING   pid 11, uptime 0:34:58
start                            EXITED    Sep 20 07:45 AM
supervisor-proc-exit-listener    RUNNING   pid 8, uptime 0:35:03
telemetry                        EXITED    Sep 20 07:46 AM

Snippet from telemetry.log:
Sep 20 07:45:57.973354 sonic INFO telemetry#supervisord: telemetry Traceback (most recent call last):
Sep 20 07:45:57.974320 sonic INFO telemetry#supervisord: telemetry File "/usr/local/bin/sonic-cfggen", line 452, in
Sep 20 07:45:57.975525 sonic INFO telemetry#supervisord: telemetry main()
Sep 20 07:45:57.976782 sonic INFO telemetry#supervisord: telemetry File "/usr/local/bin/sonic-cfggen", line 416, in main
Sep 20 07:45:57.977365 sonic INFO telemetry#supervisord: telemetry template_data = template.render(data)
Sep 20 07:45:57.977883 sonic INFO telemetry#supervisord: telemetry File "/usr/local/lib/python3.9/dist-packages/jinja2/environment.py", line 1301, in render
Sep 20 07:45:57.980311 sonic INFO telemetry#supervisord: telemetry self.environment.handle_exception()
Sep 20 07:45:57.981591 sonic INFO telemetry#supervisord: telemetry File "/usr/local/lib/python3.9/dist-packages/jinja2/environment.py", line 936, in handle_exception
Sep 20 07:45:57.982543 sonic INFO telemetry#supervisord: telemetry raise rewrite_traceback_stack(source=source)
share/sonic/templates/telemetry_vars.j2", line 2, in top-level template code
Sep 20 07:45:57.986648 sonic INFO telemetry#supervisord: telemetry "certs": {% if "certs" in TELEMETRY.keys() %}{{ TELEMETRY["certs"] }}{% else %}""{% endif %},
Sep 20 07:45:57.987416 sonic INFO telemetry#supervisord: telemetry File "/usr/local/lib/python3.9/dist-packages/jinja2/environment.py", line 485, in getattr
Sep 20 07:45:57.988104 sonic INFO telemetry#supervisord: telemetry return getattr(obj, attribute)
Sep 20 07:45:57.988721 sonic INFO telemetry#supervisord: telemetry jinja2.exceptions.UndefinedError: 'TELEMETRY' is undefined
Sep 20 07:46:00.291148 sonic INFO telemetry#supervisord: telemetry Incorrect threshold value, expecting positive integers

Investigating a proper fix.
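
The `UndefinedError` in the traceback can be reproduced outside SONiC with the third-party `jinja2` package. The sketch below renders the same expression shape as line 2 of `telemetry_vars.j2` without a `TELEMETRY` variable, then a guarded variant. The guarded template is a hypothetical illustration of one way to avoid the error, not necessarily the fix that was merged, and the helper names are made up for this example.

```python
from jinja2 import Template
from jinja2.exceptions import UndefinedError

# Same expression shape as line 2 of telemetry_vars.j2 in the traceback.
UNGUARDED = (
    '{% if "certs" in TELEMETRY.keys() %}'
    '{{ TELEMETRY["certs"] }}{% else %}""{% endif %}'
)

# Hypothetical guarded variant: test that TELEMETRY exists before using it.
GUARDED = (
    '{% if TELEMETRY is defined and "certs" in TELEMETRY %}'
    '{{ TELEMETRY["certs"] }}{% else %}""{% endif %}'
)


def render(source, data):
    """Render a template string with the given top-level variables."""
    return Template(source).render(data)


def fails_when_table_missing(source):
    """Return True if rendering without TELEMETRY raises UndefinedError."""
    try:
        render(source, {})
    except UndefinedError:
        return True
    return False


print(fails_when_table_missing(UNGUARDED))  # True
print(fails_when_table_missing(GUARDED))    # False
print(render(GUARDED, {}))                  # ""
```

Calling `.keys()` on an undefined variable is what raises here; the `is defined` test short-circuits before the table is touched.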

@qiluo-msft qiluo-msft removed their assignment Sep 20, 2023
@qnos
Contributor

qnos commented Sep 22, 2023

This happens because the telemetry service introduced cert authentication, but there is no TELEMETRY config in Config DB.

127.0.0.1:6379[4]> keys TELEMETRY*
(empty array)
127.0.0.1:6379[4]>

Therefore, we need to manually load the TELEMETRY config into config DB:

telemetry.json

  1. No client auth:
{
    "TELEMETRY": {
        "gnmi": {
            "client_auth": "false",
            "port": "50051",
            "log_level": "2"
        }
    }
}
  2. With client_auth and explicit cert paths; this requires generating the CA and the cert/key pair first:
{
    "TELEMETRY": {
        "certs": {
            "server_crt": "/etc/sonic/telemetry/streamingtelemetryserver.cer",
            "server_key": "/etc/sonic/telemetry/streamingtelemetryserver.key",
            "ca_crt": "/etc/sonic/telemetry/dsmsroot.cer"
        },
        "gnmi": {
            "client_auth": "true",
            "port": "50051",
            "log_level": "2"
        }
    }
}
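
Before loading either variant of telemetry.json, the file can be sanity-checked. The sketch below is only an illustration: the rules it enforces (string values for `client_auth` and `port`, and all three `certs` paths present when `client_auth` is "true") are inferred from the two examples above and are an assumption, not a documented schema. The function name is hypothetical.

```python
import json

# Key names taken from the "certs" example above.
REQUIRED_CERT_KEYS = ("server_crt", "server_key", "ca_crt")


def check_telemetry_config(text):
    """Return a list of problems found in a telemetry.json document."""
    doc = json.loads(text)
    table = doc.get("TELEMETRY") if isinstance(doc, dict) else None
    if not isinstance(table, dict):
        return ["missing TELEMETRY table"]
    gnmi = table.get("gnmi")
    if not isinstance(gnmi, dict):
        return ["missing TELEMETRY.gnmi section"]

    problems = []
    if gnmi.get("client_auth") not in ("true", "false"):
        problems.append('gnmi.client_auth should be "true" or "false"')
    if not str(gnmi.get("port", "")).isdigit():
        problems.append("gnmi.port should be a numeric string")

    # Assumed rule: client auth needs the three certificate paths.
    if gnmi.get("client_auth") == "true":
        certs = table.get("certs")
        if not isinstance(certs, dict):
            certs = {}
        for key in REQUIRED_CERT_KEYS:
            if not certs.get(key):
                problems.append(f"certs.{key} is required when client_auth is true")
    return problems


NO_AUTH = """
{"TELEMETRY": {"gnmi": {"client_auth": "false", "port": "50051", "log_level": "2"}}}
"""
print(check_telemetry_config(NO_AUTH))  # []
```

An empty list means the file matches the shape used in this thread; it does not guarantee that the certificate files exist on the switch.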

Load telemetry config into CONFIG DB:

sudo config load telemetry.json -y

Then start the telemetry process:

docker exec telemetry supervisorctl start telemetry

After that, the telemetry issue above is resolved. A proper fix requires a mechanism that generates a default TELEMETRY config in Config DB.

@qnos
Contributor

qnos commented Sep 25, 2023

Loading a customized TELEMETRY config is still recommended. After the fix, if there is no TELEMETRY configuration in redis DB, the service will use the default TELEMETRY configuration.
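
The fallback described here can be sketched as a lookup that prefers the table from the DB and otherwise returns defaults. This is a minimal illustration, not the merged implementation: the function name is hypothetical, and the default values are borrowed from the "no client auth" example earlier in this thread rather than from the actual fix.

```python
import copy

# Assumed defaults, borrowed from the "no client auth" example in this thread.
DEFAULT_TELEMETRY = {
    "gnmi": {"client_auth": "false", "port": "50051", "log_level": "2"},
}


def effective_telemetry_config(config_db):
    """Return the TELEMETRY table from config_db, or defaults if absent or empty."""
    table = config_db.get("TELEMETRY")
    if table:
        return table
    # Copy so callers cannot mutate the shared defaults.
    return copy.deepcopy(DEFAULT_TELEMETRY)


# Scenario 1: TELEMETRY present in the DB, so the DB values win.
print(effective_telemetry_config({"TELEMETRY": {"gnmi": {"port": "8080"}}}))
# Scenario 2: no TELEMETRY table, so the defaults are used.
print(effective_telemetry_config({}))
```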

qiluo-msft pushed a commit that referenced this issue Sep 28, 2023
#### Why I did it
Fix issue #16533: the telemetry service exits in the master and 202305 branches because there are no TELEMETRY configs in redis DB.

#### How I did it
Enable a default config if there are no TELEMETRY configs in redis DB.

#### How to verify it
After the fix, the telemetry service works in the following two scenarios:
1. With a TELEMETRY config in redis DB, the service loads its configs from the DB.
2. With no TELEMETRY config in redis DB, the service uses the default configs.
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this issue Oct 17, 2023
…-net#16683)

mssonicbld pushed a commit that referenced this issue Oct 21, 2023