Plugin: ipmi_sensor: fails to detect psu failure #5755

Closed
vvershkov opened this issue Apr 23, 2019 · 4 comments · Fixed by #5816

@vvershkov

Relevant telegraf.conf:

interval = "60s"
metric_version = 1
timeout = "10s"

System info:

Any OS or Telegraf version: this bug is caused by IPMI itself.

Steps to reproduce:

We have a server with a failed PSU. We know this because we saw it: it shows an amber light instead of green and sounds an alarm. But ipmi and telegraf report its status as OK:

...
PS1 Status       | 0x03              | ok
PS2 Status       | 0x01              | ok

However, the 0x03 flag means "failure":
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-sg8039en_us&docLocale=en_US

Expected behavior:

PS1 Status is bad (0)

Actual behavior:

PS1 Status is OK (1)

Additional info:

AFAIK, there are no readings where 0x03 goes with an "OK" status. On old motherboards, for example, 0x03 for a CPU means overheating. But maybe the flag check is needed only for PSUs.

@danielnelson
Contributor

It looks like we don't handle hexadecimal values currently. Could you also run this command and add the output for these sensors so I can be sure to fix it for metric_version = 2 as well:

ipmitool sdr elist
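
As an illustration of what handling the hexadecimal value could involve, here is a minimal Python sketch that parses one line of the three-column `ipmitool sdr` output quoted in this issue. It is not Telegraf's code (the plugin is written in Go); the function name `parse_sdr_line` and the assumption that every line has exactly three `|`-separated fields are hypothetical.

```python
import re

# Assumed layout, taken from the `ipmitool sdr` lines quoted in this issue:
#   name | value | status
def parse_sdr_line(line):
    """Split one `ipmitool sdr` line into (name, value, status).

    A value such as "0x03" is converted to an int. Any other value is
    returned as the stripped string. Lines that do not have exactly
    three columns (for example `sdr elist` output) return None.
    """
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 3:
        return None
    name, raw, status = parts
    if re.fullmatch(r"0x[0-9a-fA-F]+", raw):
        value = int(raw, 16)  # hexadecimal discrete reading
    else:
        value = raw           # left untouched in this sketch
    return name, value, status


print(parse_sdr_line("PS1 Status       | 0x03              | ok"))
```

The hexadecimal reading is kept as a number here so that later checks can test individual bits instead of comparing strings.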

@danielnelson danielnelson added the bug unexpected problem or unintended behavior label Apr 23, 2019
@vvershkov
Author

Wow, that is the worst timing ever :) I just replaced it with a new one.
But I can try to reproduce this bug tomorrow with the dead PSU and a server in the office.

@vvershkov
Author

OK, the ipmitool sdr and sdr elist outputs look like this.
Old IPMI version, failed PSU:

PS Status        | 0x02              | ok
PS Status        | 55h | ok  | 10.1 | Failure detected

(both PSUs are reported as a single status, which is horrible)

After the BMC update, with one power line turned off:

PS1 Status       | 0x0b              | ok
PS2 Status       | 0x01              | ok
PS1 Status       | C8h | ok  | 10.1 | Presence detected, Failure detected, Power Supply AC lost
PS2 Status       | C9h | ok  | 10.2 | Presence detected

After switching to the failed PSU:

PS1 Status       | 0x03              | ok
PS2 Status       | 0x01              | ok
PS1 Status       | C8h | ok  | 10.1 | Presence detected, Failure detected
PS2 Status       | C9h | ok  | 10.2 | Presence detected

P.S. A funny thing:

PS1 Status       | 0x00              | ok
PS2 Status       | 0x01              | ok
PS1 Status       | C8h | ok  | 10.1 | 
PS2 Status       | C9h | ok  | 10.2 | Presence detected

PS1 is missing, but it's OK :)
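
The value/description pairs above suggest that the hexadecimal reading is a bit field. The following Python sketch decodes it using only the three bits that can be inferred from this thread (0x01, 0x02 and 0x08); the table `PSU_FLAGS` and the function `decode_psu_status` are hypothetical names, and other bits or other vendors may behave differently.

```python
# Bit meanings inferred only from the outputs quoted in this thread:
#   0x01 -> "Presence detected"
#   0x03 -> "Presence detected, Failure detected"
#   0x0b -> "Presence detected, Failure detected, Power Supply AC lost"
PSU_FLAGS = {
    0x01: "Presence detected",
    0x02: "Failure detected",
    0x08: "Power Supply AC lost",
}

def decode_psu_status(value):
    """Return the descriptions of all known flags set in `value`."""
    return [name for bit, name in PSU_FLAGS.items() if value & bit]


for reading in (0x00, 0x01, 0x02, 0x03, 0x0B):
    print(hex(reading), decode_psu_status(reading))
```

With this mapping, the missing supply (0x00) decodes to an empty list, which matches the empty description column in the last `sdr elist` output.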

@danielnelson
Contributor

@vvershkov We are finalizing a fix for the value, but you should know that it will still report these examples as status=1. We don't want to attempt any special interpretation, since I don't think that would be feasible for all products, so we will just pass the info along.

You will need to set up your alerts to handle this specific issue, using either the value with metric_version = 1 or the status_desc with metric_version = 2.
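
An alert condition along these lines could be sketched in Python as follows. This is only an illustration under stated assumptions: that the fixed plugin reports the hexadecimal reading as a number for metric_version = 1, that status_desc carries the description text (possibly lowercased with underscores) for metric_version = 2, and that bits 0x01 and 0x02 mean presence and failure as in the outputs above. Both function names are hypothetical.

```python
def psu_needs_alert_v1(value):
    """metric_version = 1: `value` is the numeric reading, e.g. 3 for 0x03.

    Assumption: bit 0x01 means "present" and bit 0x02 means "failure",
    as inferred from the outputs quoted in this thread.
    """
    value = int(value)
    present = bool(value & 0x01)
    failed = bool(value & 0x02)
    return failed or not present


def psu_needs_alert_v2(status_desc):
    """metric_version = 2: `status_desc` is the description text.

    Assumption: the text may use underscores and any letter case, so it
    is normalised before matching.
    """
    text = status_desc.replace("_", " ").lower()
    return "failure" in text or "presence detected" not in text


print(psu_needs_alert_v1(0x03))
print(psu_needs_alert_v2("Presence detected"))
```

Treating a missing "Presence detected" flag as an alert also covers the case where the supply is absent but still reported as ok.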

@danielnelson danielnelson added this to the 1.10.4 milestone May 7, 2019