[ECS] [TSDB] Centralisation of Dimension Fields #5193

Open
agithomas opened this issue Feb 7, 2023 · 40 comments

@agithomas
Contributor

agithomas commented Feb 7, 2023

Scope

  • Identify common fields in packages that can be moved to ECS as dimensions.
  • Aggregate fields related to service integrations and cloud-native deployments, if needed.
  • Identify the required ECS repo changes and ECS testing.
  • Create the PR.
@agithomas agithomas self-assigned this Feb 7, 2023
@agithomas agithomas changed the title [Draft] [Research / Scoping] [ECS] Centralisation of Dimension Fields [Draft] [ECS] [TSDB] Centralisation of Dimension Fields Feb 7, 2023
@agithomas
Contributor Author

agithomas commented Feb 8, 2023

Among the packages owned by the Service Integrations team, certain fields are present in every package and are potential candidates for dimension fields.

  • service.address
  • host.ip or host.name (expect duplicates if host.ip is not public and is part of a subnet)
  • agent.id

@agithomas
Contributor Author

agithomas commented Feb 8, 2023

When metrics are collected from a resource running in the cloud or in a container, the following fields are potential candidates for dimension fields:

  • cloud.instance.id or cloud.instance.name
  • cloud.provider: to account for multi-cloud / hybrid infrastructure deployments
  • cloud.project.id: one organisation can have multiple projects; this is to support multi-regional deployment
  • container.id: container.name may not be suitable here, since pods with the same container names may run on the same host
  • host.name: expect duplicates if host.ip is not public and is part of a subnet
  • agent.id

Should the subnet / network name be included? TBD

@ruflin
Member

ruflin commented Feb 13, 2023

  • For apps running in k8s, do we need something like k8s.cluster.name or similar?
  • Let's assume we specify all the fields above in ECS as dimension fields. In the k8s case, the cloud provider fields would likely not be used. Do they still count towards the maximum of 16 fields even if they have no values? And how does this look if the fields are set in dynamic templates? (@P1llus)
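
The thread does not settle whether the limit applies to mapped fields or to fields that actually carry values. As a minimal sketch of the mapping-side view only, the snippet below counts the fields flagged with the Elasticsearch mapping parameter `time_series_dimension`. The helper name `mapped_dimensions`, the constant `DIMENSION_LIMIT` and the sample mapping are hypothetical and chosen for illustration.

```python
# Hypothetical helper: count fields flagged as dimensions in a mapping dict.
DIMENSION_LIMIT = 16  # the limit quoted in this thread


def mapped_dimensions(properties, prefix=""):
    """Return dotted names of all fields with time_series_dimension set."""
    found = []
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if spec.get("time_series_dimension"):
            found.append(path)
        # Object fields nest their children under "properties".
        found.extend(mapped_dimensions(spec.get("properties", {}), path + "."))
    return found


# Illustrative mapping fragment, not taken from any real package.
mapping = {
    "agent": {"properties": {"id": {"type": "keyword", "time_series_dimension": True}}},
    "cloud": {"properties": {
        "provider": {"type": "keyword", "time_series_dimension": True},
        "region": {"type": "keyword"},
    }},
    "container": {"properties": {"id": {"type": "keyword", "time_series_dimension": True}}},
}

dims = mapped_dimensions(mapping)
print(dims, "within limit:", len(dims) <= DIMENSION_LIMIT)
```

Here cloud.provider is counted even if a k8s-only deployment never populates it, which is exactly the case the question above is about.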

@lalit-satapathy
Collaborator

CC @tommyers-elastic @gizas @felixbarny for any suggestions/comments on the TSDB ECS dimension fields. Once we close on these ECS fields, we can raise a PR for the same.

@felixbarny
Member

@kruskall has tried to define the dimensions for APM (elastic/apm-server#9730) but quickly hit the dimension limit (elastic/elasticsearch#93564).

@ruflin
Member

ruflin commented Feb 14, 2023

I like the direction elastic/elasticsearch#93564 is taking. An initial shortcut might be to just increase the limit to 32 which could already help.

Looking at elastic/apm-server#9730, it seems there is some overlap with the dimensions proposed here, but there are also quite a few dimensions which I would argue are potentially unique to APM data. It would be nice if all the default dimensions could be set in ECS (or a common base) but only be used / applied when there is actual data. This goes back to my question: is the limit reached only when a dimension field actually has a value, or already when it is in the mapping itself? If being in the mapping is enough, will it help to have it in a dynamic template?

For each default dimension we define, I would like us to keep a note on why it is a dimension. A lot of thought went into which dimensions to pick, and that reasoning should be persisted and shared.

@agithomas
Contributor Author

Can we consider the above-mentioned lists for Service Integrations and for Cloud (#5193 (comment))? We should expect new common fields to be added to these lists as we test new scenarios.

We know that, in such cases, an explicit dimension field mapping can be added to avoid being blocked.

Is the above list good enough to start preparing an RFC in ECS?

@agithomas
Contributor Author

agithomas commented Feb 14, 2023

  • For apps running in k8s, do we need something like k8s.cluster.name or similar?

I think having host.ip and container.id will cover the criterion of uniquely identifying a document. Don't you think so?

@ruflin
Member

ruflin commented Feb 15, 2023

Is the above list good enough to start preparing an RFC in ECS?

I assume so, you know best :-)

I think having host.ip and container.id will cover the criterion of uniquely identifying a document. Don't you think so?

Is a container.id globally unique? I assume the chance of conflicts is low enough for this to work.

@agithomas
Contributor Author

Is a container.id globally unique? I assume the chance of conflicts is low enough for this to work.

I could not find any reference saying that a container.id repeats within a cluster.

I understand the scenario you are referring to: one where multiple k8s clusters are provisioned. The cluster name is a good logical segregation in such cases.

Reference : https://cloud.google.com/stackdriver/docs/solutions/gke/observing

The following should then be checked:

  • Is the way to get the cluster name the same or different across GKE / EKS / self-managed installations?
  • Is this information currently captured? If yes, which ECS field is currently used?

These questions can be put to the owners of the GKE / EKS integrations.

@agithomas
Contributor Author

The following should then be checked:

  • Is the way to get the cluster name the same or different across GKE / EKS / self-managed installations?
  • Is this information currently captured? If yes, which ECS field is currently used?

These questions can be put to the owners of the GKE / EKS integrations.

My preference would be not to modify the ECS dimension fields based on an application or cluster technology / deployment architecture. Every technology, such as aws.kinesis, GKE or EKS, may have its own identifier for the unique resource it monitors. In aws.kinesis it is aws.dimensions.streamName.

As part of enhancing an integration to use TSDB, the unique fields that represent a resource are expected to be carefully identified and set as dimension fields in the integration itself.

@agithomas agithomas changed the title [Draft] [ECS] [TSDB] Centralisation of Dimension Fields [ECS] [TSDB] Centralisation of Dimension Fields Feb 15, 2023
@agithomas
Contributor Author

data_stream.namespace is an important one too. Adding it to the confirmed list.

@ruflin
Member

ruflin commented Feb 20, 2023

Can you share some details why data_stream.namespace is an important dimension? If this value changes, the data goes into a different index.

@agithomas
Contributor Author

Can you share some details why data_stream.namespace is an important dimension? If this value changes, the data goes into a different index.

As you rightly said, this is unnecessary.

@lalit-satapathy
Collaborator

@agithomas,

Can we summarise below the final list of ECS fields that are dimensions, with a one-line description for each giving the rationale?

@felixbarny
Member

When the dimension limit is removed (elastic/elasticsearch#93564), can we just make any non-metric (keyword?) field a dimension by default? I don't see the value of spending our time on finding out what good dimension fields are. Other TSDBs only support two types of fields: metrics and dimensions. Can we just operate under the same mental model?

@lalit-satapathy
Collaborator

@felixbarny,

It's a good point. In particular, it is not very clear what the difference is between non-dimension meta fields that are keywords and dimension fields. TSDB documents will primarily contain a combination of metric fields, dimension fields and meta fields. From a document query point of view, I assume dimension fields and meta fields behave the same.

We are currently missing specific details that link the number of dimension fields to TSDB size/performance. I am assuming there is a relationship, and I hope someone from the ES team can provide more details on this.

In the meantime, we can continue to annotate dimensions as we do today, since this is the ask for TSDB enablement.

@felixbarny
Member

@martijnvg could you give us some guidance on the impact of having a lot of dimensions, assuming the _tsid is a hash and there are no size restrictions. See also elastic/elasticsearch#93564 (comment).

What negative consequences, if any, should we expect if we declare too many dimensions or make all non-metric fields a dimension by default?

Note that this is the default in other TSDBs so if there are negative consequences in ES when treating non-metric fields as dimensions by default, I'd be curious to have your thoughts on whether they're tolerable, and if not, what we could do to minimize the impact so that we can work with ES like with any other TSDB.

@martijnvg
Member

@felixbarny I need to think more about this.

We might end up with a default dynamic mapping in where every keyword field or every non-metric field (everything except for counter, gauge, or histogram) in order to support dynamic user-defined metrics. (from elastic/elasticsearch#93564 (comment).)

How are keyword labels modelled in this model?

@felixbarny
Member

Keyword labels would be mapped as a dimension. By default, everything except actual metrics would be a dimension.
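
A minimal sketch of that mental model, assuming each field definition carries an optional `metric_type` key: anything typed as counter, gauge or histogram is a metric, and everything else (including keyword labels) is treated as a dimension. The function `classify` and the sample field names are hypothetical.

```python
METRIC_TYPES = {"counter", "gauge", "histogram"}


def classify(fields):
    """Split field names into (dimensions, metrics) using the default rule."""
    dimensions, metrics = [], []
    for name, spec in fields.items():
        if spec.get("metric_type") in METRIC_TYPES:
            metrics.append(name)
        else:
            # Default: every non-metric field becomes a dimension.
            dimensions.append(name)
    return dimensions, metrics


# Illustrative field definitions, not taken from any real package.
fields = {
    "host.name": {"type": "keyword"},
    "labels.env": {"type": "keyword"},
    "example.requests.total": {"type": "long", "metric_type": "counter"},
    "example.memory.used": {"type": "long", "metric_type": "gauge"},
}

dimensions, metrics = classify(fields)
print("dimensions:", dimensions)
print("metrics:", metrics)
```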

@agithomas
Contributor Author

While we discuss the limitations and possible future enhancements, I would like to freeze the list of ECS fields that must be marked as dimension fields.

  1. host.ip
  2. service.address
  3. agent.id
  4. cloud.project.id
  5. cloud.instance.id
  6. cloud.provider
  7. container.id

@ruflin
Member

ruflin commented Feb 22, 2023

Let's separate immediate changes from future plans. TSDB is to be released soonish and we want to adopt it in integrations, also to make sure it all works as expected. This is where we need the list from @agithomas. These are all ECS fields, and if we add them to ECS, all integrations will have these dimensions by default as soon as ECS is updated. Everything using ECS will have these fields annotated as dimensions from then on, but as long as TSDB is not enabled, it won't have any effect. @agithomas List LGTM

Then there is the mid term and long term, and I agree with @felixbarny: ideally we should not have to think about dimensions at all, but this will likely not happen immediately. I suggest keeping the "no dimension" discussion in the Elasticsearch issue.

@agithomas
Contributor Author

  • host.ip
  • service.address
  • agent.id
  • cloud.project.id
  • cloud.instance.id
  • cloud.provider
  • container.id

Based on the recommendations, host.ip will be replaced by host.name.

The new list will be

  • host.name
  • service.address
  • agent.id
  • cloud.project.id
  • cloud.instance.id
  • cloud.provider
  • container.id
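
As a rough sketch of what annotating this list could look like on the Elasticsearch side, the snippet below expands the dotted field names into a nested mapping fragment with `time_series_dimension` set. It assumes all of these fields are mapped as `keyword`; the helper `build_properties` is hypothetical, and the list reflects this comment only (later comments revise it).

```python
import json

# The list as stated in this comment.
DIMENSIONS = [
    "host.name",
    "service.address",
    "agent.id",
    "cloud.project.id",
    "cloud.instance.id",
    "cloud.provider",
    "container.id",
]


def build_properties(fields):
    """Expand dotted field names into a nested 'properties' mapping dict."""
    root = {}
    for dotted in fields:
        *parents, leaf = dotted.split(".")
        node = root
        for part in parents:
            # Descend into (or create) the object field for each parent segment.
            node = node.setdefault(part, {}).setdefault("properties", {})
        # Assumption: every dimension here is a keyword field.
        node[leaf] = {"type": "keyword", "time_series_dimension": True}
    return root


properties = build_properties(DIMENSIONS)
print(json.dumps({"mappings": {"properties": properties}}, indent=2))
```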

@agithomas
Contributor Author

@lalit-satapathy, can you please approve if there are no further queries?

@lalit-satapathy
Collaborator

@agithomas,

Let's update the TSDB migration document to change host.ip to host.name.

@tetianakravchenko
Contributor

As discussed with @agithomas:

  1. It is recommended to include all the fields below, to be on the safe side.
    However, package developers can choose a subset of the recommended list after analyzing the possible impact.
  2. For now this list is not enforced and is not set as a default dimensions list. In the future this might change on the ECS side.

For packages that can be deployed on cloud/on-prem/k8s (examples - MongoDB, Nginx)

| Field name | Explanation/reasoning |
| --- | --- |
| host.name | The host where the agent is running. Mainly used when the integration is installed on-prem. Note: we are not using host.id since it might not be unique. |
| service.address | Present when we provide a concrete target for scraping, such as the IP:PORT of a service (e.g. mongodb). In some cases this field might not be needed. |
| container.id | For now mainly used by the docker/containerd/kubernetes packages; this field is empty for other integrations. container.id in this case is the id of the monitored container. |
| cloud.account.id (new) | Id used to identify different entities in a multi-tenant environment. |
| cloud.provider | To avoid the small chance that account.id is the same for different providers. |
| cloud.region (new) | For services that are region specific. |
| cloud.availability_zone (new) | For services that are zone specific. |
| cloud.instance.id | host.name (which can be defined manually by the customer) is not unique enough. For Azure, instance.id is globally unique; for AWS it is unique per region; for GCP, per availability zone. Technically, for Azure for example, it would be enough to define cloud.instance.id only, but since the list should be unified we include all fields: cloud.region/zone. |
| agent.id | For cases when 2 data shippers are monitoring the same resource. |
| cloud.project.id (deleted) | |

For Cloud-only Integration Packages / Managed Services Packages ( examples - AWS S3 )

| Field name | Required | Explanation/reasoning |
| --- | --- | --- |
| cloud.account.id | Required | Cloud id used to identify different entities in a multi-tenant environment. |
| cloud.region | Required | For managed services that are region specific. |
| cloud.availability_zone | Required | For managed services that are zone specific. |
| cloud.provider | Not needed | Because package-specific fields already cover it. |
| agent.id | Required | For cases when 2 data shippers are monitoring the same resource. |
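
Since the recommendation lets package developers pick a subset, one way to review a package is to check which recommended dimensions a sample document actually populates. The sketch below transcribes the two tables into lists; the keys `generic` and `cloud_only`, the helper `unpopulated_dimensions` and the sample document are hypothetical, and documents are assumed to be flattened to dotted keys.

```python
# Recommended dimensions transcribed from the two tables above.
RECOMMENDED = {
    "generic": [
        "host.name", "service.address", "container.id", "cloud.account.id",
        "cloud.provider", "cloud.region", "cloud.availability_zone",
        "cloud.instance.id", "agent.id",
    ],
    "cloud_only": [
        "cloud.account.id", "cloud.region", "cloud.availability_zone", "agent.id",
    ],
}


def unpopulated_dimensions(doc, package_kind):
    """List recommended dimensions with no value in a flattened document."""
    return [f for f in RECOMMENDED[package_kind] if doc.get(f) in (None, "")]


# Illustrative on-prem document: no container or cloud metadata present.
sample = {
    "host.name": "db-host-1",
    "service.address": "10.0.0.5:27017",
    "agent.id": "agent-a",
}

print(unpopulated_dimensions(sample, "generic"))
```

Fields reported here are candidates to leave out of a package's subset, or to keep knowing they will be empty in that deployment type.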

@agithomas
Contributor Author

The above recommended list is based on our present understanding of

  • the dimensions of managed services running in the cloud
  • the dimensions a service / product needs when running on-prem or in a public cloud, in a monolithic or microservice manner
  • the fields available in ECS.

The above list may need to be revised when more fields are added to ECS and used. For example: details of the subnet (for on-prem infrastructure).

The above list will be used to prepare the RFC-1 of RFC-0

@ruflin , @felixbarny , @martijnvg

Kindly help by reviewing the new list mentioned here

@ruflin
Member

ruflin commented May 8, 2023

I stumbled over the following line.

agent.id: For cases when 2 data shippers are monitoring the same resource

If 2 agents are monitoring the same resource, shouldn't it be the same time series? Can you provide an example of where this happens? That would likely clarify things.

@agithomas
Contributor Author

If 2 agents are monitoring the same resource, shouldn't it be the same time series?

One policy can be deployed on any number of agents. This permits two agents to monitor the same resource, which the customer may do intentionally or accidentally.

Case 1: If it is intentional, agent.id should be a dimension field so that the data is recorded as separate time series. A valid use case I can think of: a standalone elastic-agent runs on a single node and monitors several infrastructure assets. On finding a problem such as a disk issue or over-utilisation, the admin chooses to migrate to a different system. During the cut-over maintenance window, the user needs to verify that the data received from the new agent is consistent. Without agent.id, the data from the new agent will be recorded in ES in a staggered manner.

Case 2: If an agent policy is accidentally installed on more than one agent, is Elasticsearch expected to de-duplicate by relying on the dimension field constraint of the time series database (which is not a feature)? We think it is best for the datastore to be a true representation of the data received from the upstream system, in this case the integration packages.
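
To make both cases concrete, here is a small sketch that groups documents by their dimension values. The tuple returned by `series_key` is only a stand-in for the `_tsid` that Elasticsearch computes itself; the field values and helper names are hypothetical.

```python
def series_key(doc, dimensions):
    """Stand-in for _tsid: the ordered dimension values of a document."""
    return tuple((d, doc.get(d)) for d in dimensions)


def count_series(docs, dimensions):
    """Number of distinct time series the documents fall into."""
    return len({series_key(doc, dimensions) for doc in docs})


# Two agents polling the same managed service at the same moment.
docs = [
    {"@timestamp": "2023-05-08T10:00:00Z", "cloud.account.id": "acct-1",
     "cloud.region": "region-a", "agent.id": "agent-a", "value": 41},
    {"@timestamp": "2023-05-08T10:00:00Z", "cloud.account.id": "acct-1",
     "cloud.region": "region-a", "agent.id": "agent-b", "value": 42},
]

without_agent = ["cloud.account.id", "cloud.region"]
with_agent = without_agent + ["agent.id"]

print("series without agent.id:", count_series(docs, without_agent))
print("series with agent.id:", count_series(docs, with_agent))
```

Without agent.id both documents land in one series with the same timestamp, so they cannot be told apart; with agent.id they stay separate, which is the behaviour argued for above.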

@agithomas
Contributor Author

@ruflin, I have described here the reasons why agent.id must be included.

Do you think these use cases and scenarios are valid reasons to include agent.id? Or should we treat them as exceptions and save a few bytes in the _tsid field by removing agent.id?

@ruflin
Member

ruflin commented May 9, 2023

At the moment, I would rather opt for too many than too few dimensions, so I'm good with the approach.

tetianakravchenko added a commit to tetianakravchenko/integrations that referenced this issue Jun 7, 2023
…ment)

Signed-off-by: Tetiana Kravchenko <tetiana.kravchenko@elastic.co>
@gpop63 gpop63 mentioned this issue Nov 28, 2023
4 tasks
@botelastic

botelastic bot commented May 8, 2024

Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!

@botelastic botelastic bot added the Stalled label May 8, 2024