fix: add monitoring dashboard
Signed-off-by: Ilya Kheifets <ikheifets@splunk.com>
ikheifets-splunk committed Sep 3, 2024
1 parent dc5f24d commit f6210dc
Showing 8 changed files with 290 additions and 0 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -1,6 +1,7 @@
# Changelog

## Unreleased
- add metrics dashboard

### Changed

227 changes: 227 additions & 0 deletions dashboard/dashboard.xml
@@ -0,0 +1,227 @@
<form version="1.1" theme="dark">
<label>sc4snmp</label>
<fieldset submitButton="false" autoRun="true"></fieldset>
<row>
<panel>
<title>SNMP polling status.</title>
<input type="dropdown" token="poll_status_host" searchWhenChanged="true">
<label>SNMP device</label>
<choice value="*">all</choice>
<default>*</default>
<initialValue>*</initialValue>
<fieldForLabel>ip</fieldForLabel>
<fieldForValue>ip</fieldForValue>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" "Scheduler: Sending due task sc4snmp;*;*;poll" | rex field=_raw "Sending due task sc4snmp;(?&lt;ip&gt;.+);(?&lt;num&gt;\d+);poll" | stats count by ip</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
</input>
<chart>
<title>If polling is unsuccessful, copy the SPL query from this chart and use it to find the failed tasks. Error log messages are explained at https://splunk.github.io/splunk-connect-for-snmp/main/bestpractices/</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" splunk_connect_for_snmp.snmp.tasks.poll $poll_status_host$ | rex field=_raw "Task splunk_connect_for_snmp.*\[*\] (?&lt;status&gt;\w+)" | where status != "received" | timechart count by status</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
<option name="charting.axisTitleX.visibility">visible</option>
<option name="charting.axisTitleY.visibility">visible</option>
<option name="charting.axisTitleY2.visibility">visible</option>
<option name="charting.chart">line</option>
<option name="charting.chart.nullValueMode">connect</option>
<option name="charting.drilldown">all</option>
<option name="charting.legend.placement">right</option>
<option name="height">331</option>
<option name="refresh.display">progressbar</option>
<option name="trellis.enabled">0</option>
<drilldown>
<link target="_blank">search?q=index%3D*%20sourcetype%3D%22*%3Acontainer%3Asplunk-connect-for-snmp-*%22%20splunk_connect_for_snmp.snmp.tasks.poll%20$poll_status_host$%20%7C%20rex%20field%3D_raw%20%22Task%20splunk_connect_for_snmp.*%5C%5B*%5C%5D%20(%3F%3Cstatus%3E%5Cw%2B)%22%20%7C%20where%20status%20!%3D%20%22received%22%20%7C%20timechart%20count%20by%20status&amp;earliest=-24h@h&amp;latest=now</link>
</drilldown>
</chart>
</panel>
<panel>
<title>SNMP schedule of polling tasks.</title>
<input type="dropdown" token="poll_host" searchWhenChanged="true">
<label>SNMP device</label>
<choice value="*">all</choice>
<default>*</default>
<initialValue>*</initialValue>
<fieldForLabel>ip</fieldForLabel>
<fieldForValue>ip</fieldForValue>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" "Scheduler: Sending due task sc4snmp;*;*;poll" | rex field=_raw "Sending due task sc4snmp;(?&lt;ip&gt;.+);(?&lt;num&gt;\d+);poll" | stats count by ip</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
</input>
<chart>
<title>If the count is zero for the selected device, polling was not scheduled for that device.</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" Scheduler: Sending due task sc4snmp;$poll_host$;*poll | timechart count</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">all</option>
<option name="height">331</option>
<option name="refresh.display">progressbar</option>
<drilldown>
<link target="_blank">search?q=index%3D*%20sourcetype%3D%22*%3Acontainer%3Asplunk-connect-for-snmp-*%22%20Scheduler%3A%20Sending%20due%20task%20sc4snmp%3B$poll_host$%3B*poll%20%7C%20timechart%20count&amp;earliest=-24h@h&amp;latest=now</link>
</drilldown>
</chart>
</panel>
</row>
<row>
<panel>
<title>SNMP walk status.</title>
<input type="dropdown" token="walk_status_host">
<label>SNMP device</label>
<choice value="*">all</choice>
<default>*</default>
<initialValue>*</initialValue>
<fieldForLabel>ip</fieldForLabel>
<fieldForValue>ip</fieldForValue>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" "Scheduler: Sending due task sc4snmp;*;walk" | rex field=_raw "Sending due task sc4snmp;(?&lt;ip&gt;.+);walk" | stats count by ip</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
</input>
<chart>
<title>If the walk is unsuccessful, copy the SPL query from this chart and use it to find the failed tasks. Error log messages are explained at https://splunk.github.io/splunk-connect-for-snmp/main/bestpractices/</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" splunk_connect_for_snmp.snmp.tasks.walk $walk_status_host$ | rex field=_raw "Task splunk_connect_for_snmp.*\[*\] (?&lt;status&gt;\w+)" | where status != "received" | timechart count by status</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">all</option>
<option name="height">327</option>
<option name="refresh.display">progressbar</option>
<drilldown>
<link target="_blank">search?q=index%3D*%20sourcetype%3D%22*%3Acontainer%3Asplunk-connect-for-snmp-*%22%20splunk_connect_for_snmp.snmp.tasks.walk%20$walk_status_host$%20%7C%20rex%20field%3D_raw%20%22Task%20splunk_connect_for_snmp.*%5C%5B*%5C%5D%20(%3F%3Cstatus%3E%5Cw%2B)%22%20%7C%20where%20status%20!%3D%20%22received%22%20%7C%20timechart%20count%20by%20status&amp;earliest=-24h@h&amp;latest=now</link>
</drilldown>
</chart>
</panel>
<panel>
<title>SNMP schedule for walk tasks.</title>
<input type="dropdown" token="walk_host">
<label>SNMP device</label>
<choice value="*">all</choice>
<default>*</default>
<initialValue>*</initialValue>
<fieldForLabel>ip</fieldForLabel>
<fieldForValue>ip</fieldForValue>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" "Scheduler: Sending due task sc4snmp;*;walk" | rex field=_raw "Sending due task sc4snmp;(?&lt;ip&gt;.+);walk" | stats count by ip</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
</input>
<chart>
<title>If the count is zero for the selected device, the walk was not scheduled for that device.</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" Scheduler: Sending due task sc4snmp;$walk_host$;walk | timechart count</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">all</option>
<option name="height">324</option>
<option name="refresh.display">progressbar</option>
<drilldown>
<link target="_blank">search?q=index%3D*%20sourcetype%3D%22*%3Acontainer%3Asplunk-connect-for-snmp-*%22%20Scheduler%3A%20Sending%20due%20task%20sc4snmp%3B$walk_host$%3Bwalk%20%7C%20timechart%20count&amp;earliest=-24h@h&amp;latest=now</link>
</drilldown>
</chart>
</panel>
</row>
<row>
<panel>
<title>SNMP trap status.</title>
<chart>
<title>If trap processing is unsuccessful, copy the SPL query from this chart and use it to find the failed tasks. Error log messages are explained at https://splunk.github.io/splunk-connect-for-snmp/main/bestpractices/</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" splunk_connect_for_snmp.snmp.tasks.trap | rex field=_raw "Task splunk_connect_for_snmp.*\[*\] (?&lt;status&gt;\w+)" | where status != "received" | timechart count by status</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="height">332</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>SNMP trap authorisation.</title>
<chart>
<title>A failed status means the trap did not succeed because of an SNMP authorisation problem.</title>
<search>
<query>index=* "ERROR Security Model failure for device" OR "splunk_connect_for_snmp.snmp.tasks.trap\[*\] succeeded" | eval status=if(searchmatch("succeeded"), "succeeded", "failed") | timechart count by status</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="height">329</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
</row>
<row>
<panel>
<title>SNMP send to Splunk status.</title>
<chart>
<title>If the send task is unsuccessful, copy the SPL query from this chart and use it to find the failed tasks. Error log messages are explained at https://splunk.github.io/splunk-connect-for-snmp/main/bestpractices/</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" splunk_connect_for_snmp.splunk.tasks.send | rex field=_raw "Task splunk_connect_for_snmp.*\[*\] (?&lt;status&gt;\w+)" | where status != "received" | timechart count by status</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>SNMP enrich task status.</title>
<chart>
<title>If the enrich task is unsuccessful, copy the SPL query from this chart and use it to find the failed tasks. Error log messages are explained at https://splunk.github.io/splunk-connect-for-snmp/main/bestpractices/</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" splunk_connect_for_snmp.enrich.tasks.enrich | rex field=_raw "Task splunk_connect_for_snmp.*\[*\] (?&lt;status&gt;\w+)" | where status != "received" | timechart count by status</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>SNMP prepare task status.</title>
<chart>
<title>If the prepare task is unsuccessful, copy the SPL query from this chart and use it to find the failed tasks. Error log messages are explained at https://splunk.github.io/splunk-connect-for-snmp/main/bestpractices/</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" splunk_connect_for_snmp.splunk.tasks.prepare | rex field=_raw "Task splunk_connect_for_snmp.*\[*\] (?&lt;status&gt;\w+)" | where status != "received" | timechart count by status</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
<option name="refresh.display">progressbar</option>
</chart>
</panel>
<panel>
<title>SNMP inventory poller task status.</title>
<chart>
<title>If the inventory poller task is unsuccessful, copy the SPL query from this chart and use it to find the failed tasks. Error log messages are explained at https://splunk.github.io/splunk-connect-for-snmp/main/bestpractices/</title>
<search>
<query>index=* sourcetype="*:container:splunk-connect-for-snmp-*" splunk_connect_for_snmp.inventory.tasks.inventory_setup_poller | rex field=_raw "Task splunk_connect_for_snmp.*\[*\] (?&lt;status&gt;\w+)" | where status != "received" | timechart count by status</query>
<earliest>-24h@h</earliest>
<latest>now</latest>
</search>
<option name="charting.chart">line</option>
<option name="charting.drilldown">none</option>
</chart>
</panel>
</row>
</form>
61 changes: 61 additions & 0 deletions docs/dashboard.md
@@ -0,0 +1,61 @@
# Dashboard

Use the dashboard to monitor SC4SNMP and confirm that it is healthy and working correctly.

## Presetting

1. [Create metrics indexes](gettingstarted/splunk-requirements.md#requirements-for-splunk-enterprise-or-enterprise-cloud) in Splunk.
2. Enable metrics logging for your runtime:
* For K8S install [Splunk OpenTelemetry Collector for K8S](gettingstarted/sck-installation.md)
* For docker-compose use [Splunk logging driver for docker](dockercompose/9-splunk-logging.md)

## Install dashboard

1. In the Splunk platform, open **Search -> Dashboards**.
2. Click **Create New Dashboard** and create an empty dashboard. Be sure to choose Classic Dashboards.
3. In the **Edit Dashboard** view, go to Source and replace the initial XML with the contents of [dashboard/dashboard.xml](https://github.com/splunk/splunk-connect-for-snmp/blob/main/dashboard/dashboard.xml) published in the SC4SNMP repository.
4. Save your changes. Your dashboard is ready to use.


## Metrics explanation

### Polling dashboards

To check that polling of your device is working correctly, first check the **SNMP schedule of polling tasks** dashboard.
This chart shows when SC4SNMP last scheduled polling for your SNMP device. Scheduling works if tasks appear at regular intervals.

After confirming that SC4SNMP schedules polling tasks for your SNMP device, verify that the polling itself succeeds.
To do that, look at the **SNMP polling status** dashboard. If everything is okay, you will see only the **succeeded** status.
If something is going wrong, you will also see other statuses (as in the screenshot). In that case, use the [troubleshooting docs](bestpractices.md).

*Note: if you set a very long polling period, such as 2 hours, it is expected that no new polling tasks appear during those 2 hours.*

![Polling dashboards](images/dashboard/polling_dashboard.png)
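
As a rough illustration of what the **SNMP device** dropdown search does, the sketch below applies the same pattern as the dashboard's `rex` command to a few log lines and counts scheduled polls per device. The sample log lines and the helper name `count_scheduled_polls` are hypothetical and only approximate real scheduler output. Python spells named groups as `(?P<name>...)` rather than `(?<name>...)`.

```python
import re
from collections import Counter

# Python spelling of the rex pattern used by the polling dropdown search.
SCHEDULE_RE = re.compile(r"Sending due task sc4snmp;(?P<ip>.+);(?P<num>\d+);poll")


def count_scheduled_polls(lines):
    """Count scheduled poll tasks per device, like `stats count by ip`."""
    counts = Counter()
    for line in lines:
        match = SCHEDULE_RE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return dict(counts)


# Hypothetical log lines, assumed for illustration only.
sample_poll_lines = [
    "Scheduler: Sending due task sc4snmp;10.0.0.1;60;poll (splunk_connect_for_snmp.snmp.tasks.poll)",
    "Scheduler: Sending due task sc4snmp;10.0.0.2;300;poll (splunk_connect_for_snmp.snmp.tasks.poll)",
    "Scheduler: Sending due task sc4snmp;10.0.0.1;60;poll (splunk_connect_for_snmp.snmp.tasks.poll)",
    "Scheduler: Sending due task sc4snmp;10.0.0.1;walk (splunk_connect_for_snmp.snmp.tasks.walk)",
]

print(count_scheduled_polls(sample_poll_lines))
```

A device that is missing from the result has no scheduled polling tasks in the searched time range, which corresponds to a zero count in the schedule chart.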

### Walk dashboards

To check that the walk of your device is working correctly, first check the **SNMP schedule for walk tasks** dashboard.
This chart shows when SC4SNMP last scheduled a walk for your SNMP device. Scheduling works if tasks appear at regular intervals.

After confirming that SC4SNMP schedules walk tasks for your SNMP device, verify that the walk itself succeeds.
To do that, look at the **SNMP walk status** dashboard. If everything is okay, you will see only the **succeeded** status.
If something is going wrong, you will also see other statuses (as in the screenshot). In that case, use the [troubleshooting docs](bestpractices.md).

*Note: if you set a very long walk period, such as 2 hours, it is expected that no new walk tasks appear during those 2 hours.*

![Walk dashboards](images/dashboard/walk_dashboard.png)
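
The **SNMP walk status** chart extracts a status word from each task log line, drops `received`, and counts the rest. The sketch below mirrors that logic without the time bucketing that `timechart` adds. The sample log lines and the helper name `count_statuses` are hypothetical, and the regular expression is a Python spelling of the dashboard's `rex` pattern.

```python
import re
from collections import Counter

# Python spelling of the rex pattern used by the status charts.
STATUS_RE = re.compile(r"Task splunk_connect_for_snmp.*\[*\] (?P<status>\w+)")


def count_statuses(lines, task="splunk_connect_for_snmp.snmp.tasks.walk"):
    """Count task statuses other than 'received' for one task name."""
    counts = Counter()
    for line in lines:
        if task not in line:
            continue  # the dashboard search filters on the task name first
        match = STATUS_RE.search(line)
        if match and match.group("status") != "received":
            counts[match.group("status")] += 1
    return dict(counts)


# Hypothetical log lines, assumed for illustration only.
sample_walk_lines = [
    "Task splunk_connect_for_snmp.snmp.tasks.walk[a1] received",
    "Task splunk_connect_for_snmp.snmp.tasks.walk[a1] succeeded in 1.2s",
    "Task splunk_connect_for_snmp.snmp.tasks.walk[b2] received",
    "Task splunk_connect_for_snmp.snmp.tasks.walk[b2] raised unexpected: TimeoutError('no response')",
    "Task splunk_connect_for_snmp.snmp.tasks.poll[c3] succeeded in 0.3s",
]

print(count_statuses(sample_walk_lines))
```

Any key other than `succeeded` in the result corresponds to an extra series on the chart and is a reason to open the troubleshooting docs.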

### Trap dashboards

First, check the **SNMP trap authorisation** dashboard. If you see only the **succeeded** status, authorisation is configured correctly. Otherwise, use the [troubleshooting docs](bestpractices.md#identifying-traps-issues).

After confirming that there are no trap authorisation issues, check that trap tasks are working correctly. To do that, go to the **SNMP trap status**
dashboard. If you see only the **succeeded** status, everything is working. Otherwise, the chart shows the other statuses that occurred.

![Trap dashboards](images/dashboard/trap_dashboard.png)
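
The **SNMP trap authorisation** chart labels each matching event as `succeeded` or `failed`. The sketch below approximates that classification: it keeps lines that contain the security model error or a succeeded trap task, then labels them. The sample log lines and the helper name `trap_auth_counts` are hypothetical, and SPL's `searchmatch` is approximated here with a case-insensitive substring test.

```python
import re
from collections import Counter

AUTH_FAILURE = "ERROR Security Model failure for device"
# Approximation of the search term for succeeded trap tasks.
TRAP_SUCCEEDED_RE = re.compile(r"splunk_connect_for_snmp\.snmp\.tasks\.trap\[.*\] succeeded")


def trap_auth_counts(lines):
    """Label matching lines as 'succeeded' or 'failed' and count each label."""
    counts = Counter()
    for line in lines:
        if AUTH_FAILURE in line or TRAP_SUCCEEDED_RE.search(line):
            status = "succeeded" if "succeeded" in line.lower() else "failed"
            counts[status] += 1
    return dict(counts)


# Hypothetical log lines, assumed for illustration only.
sample_trap_lines = [
    "Task splunk_connect_for_snmp.snmp.tasks.trap[d4] succeeded in 0.1s",
    "ERROR Security Model failure for device 10.0.0.3",
    "ERROR Security Model failure for device 10.0.0.4",
    "Task splunk_connect_for_snmp.snmp.tasks.poll[e5] succeeded in 0.2s",
]

print(trap_auth_counts(sample_trap_lines))
```

A non-zero `failed` count points to an authorisation problem between the trap sender and SC4SNMP.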

### Other dashboards

Some tasks run as callbacks for walk and poll. For example, **send** publishes the results to Splunk. After a successful walk or poll, these callbacks must also finish successfully. Check that these tasks show only the **succeeded** status.

![Other dashboards](images/dashboard/other_dashboard.png)
Binary file added docs/images/dashboard/other_dashboard.png
Binary file added docs/images/dashboard/polling_dashboard.png
Binary file added docs/images/dashboard/trap_dashboard.png
Binary file added docs/images/dashboard/walk_dashboard.png
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -87,3 +87,4 @@ nav:
- Releases: "releases.md"
- High Availability: ha.md
- Improved polling performance: "improved-polling.md"
- Monitoring dashboard: "dashboard.md"