
[Fleet] Added logs-elastic_agent* read privileges to kibana_system #91701

Merged 6 commits on Nov 23, 2022

Conversation

@juliaElastic (Contributor) commented Nov 18, 2022

Required change so that Fleet telemetry can read from the Elastic Agent log indices.
elastic/kibana#146107

@juliaElastic juliaElastic self-assigned this Nov 18, 2022
@elasticsearchmachine elasticsearchmachine added the external-contributor Pull request authored by a developer outside the Elasticsearch team label Nov 18, 2022
@juliaElastic juliaElastic added the :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC label Nov 18, 2022
@elasticsearchmachine elasticsearchmachine added the Team:Security Meta label for security team label Nov 18, 2022
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-security (Team:Security)

@elasticsearchmachine (Collaborator)

Hi @juliaElastic, I've created a changelog YAML for you.

@juliaElastic (Contributor, Author) commented Nov 18, 2022

I see some ML tests failing, which are unrelated to my changes. Any idea how to fix this? I just merged from main as well.

15:53:26 * What went wrong:
15:53:26 Execution failed for task ':x-pack:plugin:ml:test'.
15:53:26 > There were failing tests. See the report at: file:///dev/shm/elastic+elasticsearch+pull-request+part-3-fips/x-pack/plugin/ml/build/reports/tests/test/index.html
15:53:26

15:47:59 org.elasticsearch.xpack.ml.utils.NativeMemoryCalculatorTests > testActualNodeSizeCalculationConsistency FAILED
15:47:59     java.lang.AssertionError: native memory [113246208] smaller than original native memory [113464638]
15:47:59     Expected: a value equal to or greater than <113464637L>
15:47:59          but: <113246208L> was less than <113464637L>
15:47:59         at __randomizedtesting.SeedInfo.seed([34D70F53042A2AED:11DE06502B080F4A]:0)
15:47:59         at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
15:47:59         at org.junit.Assert.assertThat(Assert.java:956)
15:47:59         at org.elasticsearch.xpack.ml.utils.NativeMemoryCalculatorTests.lambda$testActualNodeSizeCalculationConsistency$0(NativeMemoryCalculatorTests.java:170)
15:47:59         at org.elasticsearch.xpack.ml.utils.NativeMemoryCalculatorTests.testActualNodeSizeCalculationConsistency(NativeMemoryCalculatorTests.java:207)
15:47:59 

@droberts195 do you have any insight here? I see the test file changed a few hours ago here: #91694

@juliaElastic (Contributor, Author)

Jenkins, test this please

@droberts195 (Contributor)

Jenkins run elasticsearch-ci/part-3-fips

@ywangd (Member) commented Nov 21, 2022

Ping @elastic/kibana-security

@juliaElastic (Contributor, Author)

@ywangd are you ready to approve?

@@ -719,6 +719,8 @@ public static RoleDescriptor kibanaSystemRoleDescriptor(String name) {
// Fleet Server indices. Kibana create this indice before Fleet Server use them.
// Fleet Server indices. Kibana read and write to this indice to manage Elastic Agents
RoleDescriptor.IndicesPrivileges.builder().indices(".fleet*").allowRestrictedIndices(true).privileges("all").build(),
// Fleet telemetry queries Agent Logs indices in kibana task runner
RoleDescriptor.IndicesPrivileges.builder().indices("logs-elastic_agent*").privileges("read").build(),
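The index pattern in the new privilege uses a simple `*` wildcard, so it covers any Elastic Agent log data stream regardless of dataset or namespace. A minimal sketch of that matching behavior (the data stream names below are hypothetical, and Python's `fnmatchcase` only approximates Elasticsearch's wildcard matching):

```python
from fnmatch import fnmatchcase

# The privilege pattern added to the kibana_system role in this PR.
PRIVILEGE_PATTERN = "logs-elastic_agent*"

# Hypothetical data stream names, for illustration only.
names = [
    "logs-elastic_agent-default",
    "logs-elastic_agent.filebeat-default",
    "logs-endpoint.events-default",
]

# fnmatchcase approximates Elasticsearch's simple `*` wildcard matching.
matched = [n for n in names if fnmatchcase(n, PRIVILEGE_PATTERN)]
```

Note that the pattern also picks up the per-component streams such as `logs-elastic_agent.filebeat-*`, which is why a single privilege entry suffices.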
(Member)

question: I see @joshdover hasn't replied on this yet in https://github.com/elastic/ingest-dev/issues/1261#issuecomment-1322223139; did you folks sync over Slack/Zoom? I just want to be sure we're all aligned. If so, the change looks good from the Kibana Security perspective.

(Contributor, Author)
Not yet, I'll wait for his feedback.

(Member)

Great, thanks for confirming

@juliaElastic (Contributor, Author) Nov 23, 2022

Is there anything else you need from me to approve?

(Member)

Nope, everything looks good from the Kibana perspective; feel free to merge as soon as the ES team approves the change.

(Contributor, Author)

@elastic/elasticsearch-team Hey, could someone review and approve?

@slobodanadamovic (Contributor) Nov 23, 2022

@juliaElastic
We try to advise people not to use the @elastic/elasticsearch-team handle, since it pings the whole Elasticsearch team (116 people).

I see that you already notified the correct area team (Team:Security) and that Yang (who is on PTO) did the initial review. Everyone on the Security team receives notifications for every comment, so you can simply comment on this PR, or ping @elastic/es-security if there is an urgent need for someone to take a look.

From my perspective, the changes look good! 🚀 If you need this backported to the 8.6 branch, you can simply apply the auto-backport-and-merge label (before merging this PR) and it will automatically open a new PR and merge it (otherwise you would have to do it manually).

Another small suggestion would be to update the description of this PR (as it serves as a future reference) and remove the contributor template bullet points and the link to a private GitHub repo.

(Contributor, Author)

Thank you for your suggestions! I wasn't sure which team to ping, as no team was assigned as PR reviewer. I'll keep this in mind.

(Contributor)

I've made the same mistakes myself. It's not something you could have known, so no worries. :)

juliaElastic added a commit to elastic/kibana that referenced this pull request Nov 23, 2022
## Summary

Closes elastic/ingest-dev#1261

Added a snippet of the collected telemetry for each requirement below.
Please review and let me know if any changes are needed.
Also asked a few questions below. @jlind23 @kpollich 

6. is blocked by [elasticsearch
change](elastic/elasticsearch#91701) to give
kibana_system the missing privilege to read logs-elastic_agent* indices.

Took inspiration for task versioning from
https://github.com/elastic/kibana/pull/144494/files#diff-0c7c49bf5c55c45c19e9c42d5428e99e52c3a39dd6703633f427724d36108186

- [x] 1. Elastic Agent versions
Versions of all the Elastic Agent running: `agent.version` field on
`.fleet-agents` documents

```
"agent_versions": [
    "8.6.0"
  ],
```

- [x] 2. Fleet server configuration
I think we can query `.fleet-policies` for policies where some `input` has `type:
'fleet-server'` for this, as well as use the `Fleet Server Hosts` settings that
we define via saved objects in Fleet.


```
  "fleet_server_config": {
    "policies": [
      {
        "input_config": {
          "server": {
            "limits.max_agents": 10000
          },
          "server.runtime": "gc_percent:20"
        }
      }
    ]
  }
```

- [x] 3. Number of policies
Document count of the `.fleet-policies` index

To confirm, did we mean agent policies here?

```
 "agent_policies": {
    "count": 7,
```

- [x] 4. Output type contained in those policies
Collecting this from TypeScript logic, querying the `.fleet-policies` index.
The alternative would be to write a Painless script (because `outputs` is an
object with dynamic keys, we can't do an aggregation directly).

```
"agent_policies": {
    "output_types": [
      "elasticsearch"
    ]
  }
```

Did we mean to just collect the types here, or any other info? e.g.
output urls
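The TypeScript-side collection described above can be sketched in a few lines; the policy document shape here is hypothetical and only illustrates iterating an `outputs` object with dynamic keys:

```python
# Hypothetical .fleet-policies document shape, for illustration only; the real
# documents differ, but the dynamic-key `outputs` object is the point here.
policies = [
    {"data": {"outputs": {"default": {"type": "elasticsearch"}}}},
    {"data": {"outputs": {"default": {"type": "elasticsearch"},
                          "remote": {"type": "logstash"}}}},
]

# Collect the distinct output types across all policies in application code,
# instead of aggregating over dynamic keys in Elasticsearch.
output_types = sorted({
    output["type"]
    for policy in policies
    for output in policy["data"]["outputs"].values()
    if "type" in output
})
```

Deduplicating in application code avoids the Painless script entirely, at the cost of fetching the policy documents.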

- [x] 5. Average number of checkin failures
We only have the most recent checkin status and timestamp on
`.fleet-agents`.

Do we mean to publish the total count of last-checkin failures? E.g. 3
if 3 agents currently have a failed checkin status.
Or do we mean to publish specific info for all agents
(`last_checkin_status`, `last_checkin` time, `last_checkin_message`)?
Are the only statuses `error` and `degraded` that we want to send?

```
  "agent_last_checkin_status": {
    "error": 0,
    "degraded": 0
  },
```

- [ ] 6. Top 3 most common errors in the Elastic Agent logs

Do we mean here elastic-agent logs only, or fleet-server logs as well
(maybe separately)?

I found an alternative way to query the message field, using sampler and
categorize_text aggregations:
```
GET logs-elastic_agent*/_search
{
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "log.level": "error"
                    }
                },
                {
                    "range": {
                        "@timestamp": {
                            "gte": "now-1h"
                        }
                    }
                }
            ]
        }
    },
    "aggregations": {
        "message_sample": {
            "sampler": {
                "shard_size": 200
            },
            "aggs": {
                "categories": {
                    "categorize_text": {
                        "field": "message",
                        "size": 10
                    }
                }
            }
        }
    }
}
```
Example response:
```
"aggregations": {
    "message_sample": {
      "doc_count": 112,
      "categories": {
        "buckets": [
          {
            "doc_count": 73,
            "key": "failed to unenroll offline agents",
            "regex": ".*?failed.+?to.+?unenroll.+?offline.+?agents.*?",
            "max_matching_length": 36
          },
          {
            "doc_count": 7,
            "key": """stderr panic close of closed channel n ngoroutine running Stop ngithub.51.al/elastic/beats/v7/libbeat/cmd/instance Beat launch.func5 \n\t/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go n
```
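Reducing a response like the one above to the "top 3 most common errors" is then a small post-processing step. A sketch (the response dict is abbreviated from the example; the two smaller buckets are hypothetical, added so the top-3 cut has something to drop):

```python
# Abbreviated categorize_text aggregation response, as returned under
# "aggregations" in the search response above.
response = {
    "aggregations": {
        "message_sample": {
            "doc_count": 112,
            "categories": {
                "buckets": [
                    {"doc_count": 73, "key": "failed to unenroll offline agents"},
                    {"doc_count": 7, "key": "stderr panic close of closed channel"},
                    {"doc_count": 5, "key": "failed to dispatch actions"},
                    {"doc_count": 2, "key": "checkin request timed out"},
                ]
            }
        }
    }
}

buckets = response["aggregations"]["message_sample"]["categories"]["buckets"]
# Buckets already arrive ordered by doc_count, but sorting defensively makes
# the top-3 cut explicit.
top_3_errors = [
    b["key"]
    for b in sorted(buckets, key=lambda b: b["doc_count"], reverse=True)[:3]
]
```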


- [x] 7. Number of checkin failures over the past period of time

I think this is almost the same as #5. The difference would be reporting only
new failures that happened in the last hour, versus reporting all agents in a
failure state (which would be an increasing number if an agent stays in a
failed state).
Do we want these as 2 separate telemetry fields?

EDIT: removed the last1hr query, instead added a new field to report
agents enrolled per policy (top 10). See comments below.

```
  "agent_checkin_status": {
    "error": 3,
    "degraded": 0
  },
  "agents_per_policy": [2, 1000],
```
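The new `agents_per_policy` field can be computed with a simple counting pass over the agent documents; the document shape below is hypothetical, only the `policy_id` field matters for the sketch:

```python
from collections import Counter

# Hypothetical .fleet-agents documents, for illustration only.
agents = [
    {"policy_id": "policy-a"},
    {"policy_id": "policy-b"},
    {"policy_id": "policy-a"},
    {"policy_id": "policy-a"},
]

# Counts of enrolled agents per policy, largest first, capped at 10 policies,
# matching the "top 10" shape of the agents_per_policy telemetry field.
counts = Counter(agent["policy_id"] for agent in agents)
agents_per_policy = [count for _, count in counts.most_common(10)]
```

In practice this would be a terms aggregation on the policy id rather than a client-side count, but the resulting list has the same shape.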

- [x] 8. Number of Elastic Agents and number of Fleet Servers

This is already there in the existing telemetry:
```
  "agents": {
    "total_enrolled": 0,
    "healthy": 0,
    "unhealthy": 0,
    "offline": 0,
    "total_all_statuses": 1,
    "updating": 0
  },
  "fleet_server": {
    "total_enrolled": 0,
    "healthy": 0,
    "unhealthy": 0,
    "offline": 0,
    "updating": 0,
    "total_all_statuses": 0,
    "num_host_urls": 1
  },
```




### Checklist

- [ ] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 23, 2022
(cherry picked from commit e00e26e)
kibanamachine added a commit to elastic/kibana that referenced this pull request Nov 23, 2022
# Backport

This will backport the following commits from `main` to `8.6`:
- [Fleet Usage telemetry extension
(#145353)](#145353)

<!--- Backport version: 8.9.7 -->

### Questions?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

these 2 separate telemetry fields?\r\n\r\nEDIT: removed the last1hr
query, instead added a new field to report\r\nagents enrolled per policy
(top 10). See comments below.\r\n\r\n```\r\n \"agent_checkin_status\":
{\r\n \"error\": 3,\r\n \"degraded\": 0\r\n },\r\n
\"agents_per_policy\": [2, 1000],\r\n```\r\n\r\n- [x] 8. Number of
Elastic Agent and number of fleet server\r\n\r\nThis is already there in
the existing telemetry:\r\n```\r\n \"agents\": {\r\n \"total_enrolled\":
0,\r\n \"healthy\": 0,\r\n \"unhealthy\": 0,\r\n \"offline\": 0,\r\n
\"total_all_statuses\": 1,\r\n \"updating\": 0\r\n },\r\n
\"fleet_server\": {\r\n \"total_enrolled\": 0,\r\n \"healthy\": 0,\r\n
\"unhealthy\": 0,\r\n \"offline\": 0,\r\n \"updating\": 0,\r\n
\"total_all_statuses\": 0,\r\n \"num_host_urls\": 1\r\n
},\r\n```\r\n\r\n\r\n\r\n\r\n### Checklist\r\n\r\n- [ ] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\nCo-authored-by: Kibana Machine
<42973632+kibanamachine@users.noreply.github.com>","sha":"e00e26e86854bdbde7c14f88453b717505fed4d9"}}]}]
BACKPORT-->
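The `agents_per_policy` values from item 7 above would come from a `terms` aggregation on the policy id over `.fleet-agents`. A minimal TypeScript sketch of reducing such a response to the top-10 counts; the interfaces, field names, and function name here are illustrative, not the actual Fleet telemetry code:

```typescript
// Minimal shape of a `terms` aggregation bucket in a search response.
interface TermsBucket {
  key: string;
  doc_count: number;
}

interface AgentsPerPolicyResponse {
  aggregations?: { agents_per_policy?: { buckets?: TermsBucket[] } };
}

// Reduce a terms-aggregation response to the enrolled-agent counts of the
// `top` largest policies, largest first.
function agentsPerPolicy(resp: AgentsPerPolicyResponse, top = 10): number[] {
  const buckets = resp.aggregations?.agents_per_policy?.buckets ?? [];
  return buckets
    .map((bucket) => bucket.doc_count)
    .sort((a, b) => b - a)
    .slice(0, top);
}

// Example: two policies with 2 and 1000 enrolled agents.
const example: AgentsPerPolicyResponse = {
  aggregations: {
    agents_per_policy: {
      buckets: [
        { key: 'policy-a', doc_count: 2 },
        { key: 'policy-b', doc_count: 1000 },
      ],
    },
  },
};

console.log(agentsPerPolicy(example)); // → [1000, 2]
```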

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
@juliaElastic juliaElastic merged commit 8b34388 into elastic:main Nov 23, 2022
juliaElastic added a commit to juliaElastic/elasticsearch that referenced this pull request Nov 23, 2022
…lastic#91701)

* Added logs-elastic_agent* read privileges to kibana_system

* Update docs/changelog/91701.yaml

* added unit test

* Fixed formatting

* removed read cross cluster role
elasticsearchmachine pushed a commit that referenced this pull request Nov 23, 2022
…91701) (#91842)

* Added logs-elastic_agent* read privileges to kibana_system

* Update docs/changelog/91701.yaml

* added unit test

* Fixed formatting

* removed read cross cluster role
juliaElastic added a commit to elastic/kibana that referenced this pull request Nov 29, 2022
## Summary

Closes elastic/ingest-dev#1261

Merged: [elasticsearch
change](elastic/elasticsearch#91701) to give
kibana_system the missing privilege to read logs-elastic_agent* indices.

## Top 3 most common errors in the Elastic Agent logs

Added the most common elastic-agent and fleet-server error messages to telemetry.

The query samples the `message` field and buckets it with a `categorize_text`
aggregation. This is a workaround, as we can't aggregate directly on the
`message` field.
```
GET logs-elastic_agent*/_search
{
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "log.level": "error"
                    }
                },
                {
                    "range": {
                        "@timestamp": {
                            "gte": "now-1h"
                        }
                    }
                }
            ]
        }
    },
    "aggregations": {
        "message_sample": {
            "sampler": {
                "shard_size": 200
            },
            "aggs": {
                "categories": {
                    "categorize_text": {
                        "field": "message",
                        "size": 10
                    }
                }
            }
        }
    }
}
```
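On the Kibana side, the top categories then just need to be read out of the aggregation response. A minimal TypeScript sketch of that extraction; the interfaces and function name are illustrative, not the actual Fleet telemetry code, and only the response shape follows the `categorize_text` output above:

```typescript
// Minimal shape of the part of the search response we care about.
interface CategorizeTextBucket {
  key: string;
  doc_count: number;
}

interface MessageSampleResponse {
  aggregations?: {
    message_sample?: {
      categories?: { buckets?: CategorizeTextBucket[] };
    };
  };
}

// Return the `size` most frequent error-message categories, most common first.
function topErrorCategories(resp: MessageSampleResponse, size = 3): string[] {
  const buckets = resp.aggregations?.message_sample?.categories?.buckets ?? [];
  return [...buckets]
    .sort((a, b) => b.doc_count - a.doc_count)
    .slice(0, size)
    .map((bucket) => bucket.key);
}

// Example using a response shaped like the categorize_text output above.
const response: MessageSampleResponse = {
  aggregations: {
    message_sample: {
      categories: {
        buckets: [
          { key: 'failed to unenroll offline agents', doc_count: 73 },
          { key: 'stderr panic close of closed channel', doc_count: 7 },
        ],
      },
    },
  },
};

console.log(topErrorCategories(response));
// → ['failed to unenroll offline agents', 'stderr panic close of closed channel']
```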

Tested with the latest Elasticsearch snapshot, and verified that the logs
are added to telemetry:
```
   {
      "agent_logs_top_errors": [
         "failed to dispatch actions error failed reloading q q q nil nil config failed reloading artifact config for composed snapshot.downloader failed to generate snapshot config failed to detect remote snapshot repo proceeding with configured not an agent uri",
         "fleet-server stderr level info time message No applicable limit for agents using default \\n level info time message No applicable limit for agents using default \\n",
         "stderr panic close of closed channel n ngoroutine running Stop"
      ],
      "fleet_server_logs_top_errors": [
         "Dispatch abort response",
         "error while closing",
         "failed to take ownership"
      ]
   }
```

I did some measurements locally, and the query took only a few ms. I'll
also try to check against larger datasets in the elastic agent logs.


### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
juliaElastic added a commit to juliaElastic/kibana that referenced this pull request Nov 29, 2022
Labels
>enhancement external-contributor Pull request authored by a developer outside the Elasticsearch team :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC Team:Fleet Team:Security Meta label for security team v8.6.1 v8.7.0