
Configurable _source.enabled Elastic mapping property #1629

Closed
lnorregaard opened this issue Oct 3, 2019 · 14 comments · Fixed by #2363
Labels: enhancement (A general enhancement), registry: elastic (An ElasticSearch Registry related issue)

@lnorregaard

From Elastic's documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/7.3/mapping-source-field.html

The metrics use case is distinct from other time-based or logging use cases in that there are many small documents which consist only of numbers, dates, or keywords. There are no updates, no highlighting requests, and the data ages quickly so there is no need to reindex. Search requests typically use simple queries to filter the dataset by date or tags, and the results are returned as aggregations.
In this case, disabling the _source field will save space and reduce I/O.

This means that it is not possible to see the field values in the index, and the Kibana view will not show the metrics.

I would make a pull request to change to enabled: true for Elasticsearch 7.3 and up

@shakuzen shakuzen added the registry: elastic An ElasticSearch Registry related issue label Oct 4, 2019
@robachmann

Thank you, Lars, for offering to make a PR.
This Stack Overflow answer also concludes that setting enabled: true would solve the issue of not seeing metrics in Kibana. I think this would be a quick fix until there is native support for Elastic APM, as requested in #793.

@shakuzen
Member

The quoted section of the documentation explains why disabling the _source field is a good idea for metrics, but you're proposing enabling it.

Just so we're on the same page, the fields are there (you can check them in the Kibana index pattern, for instance) and you can query/aggregate them, but the raw values don't show in the Kibana Discover page for each document. I can see how this makes ad-hoc querying of the metrics difficult unless you already know the metric names/tags. It would be nice if Kibana could still auto-complete the field names/values for this purpose.

The proposal to enable the _source field would remedy this pain, but is it worth everyone paying the cost of including the _source field for metrics data? I'm not sure how much that cost is, to be honest. I wonder if there is some other way we can enable easier querying of the data without the overhead of enabling the _source field.
@xeraa sorry to ping you suddenly, but I'd love to get any thoughts from you on the above, if you have time to help.

Other Elasticsearch/Kibana users also feel free to give your input on this, as the proposed change would affect the default behavior for all users of the Micrometer Elasticsearch registry.


The above being said, there is the workaround currently of specifying your own template that uses "_source": { "enabled": true }. Even if we decide keeping the current default is best, another option would be to make a configuration property for whether to enable the _source field, so users don't have to copy the template just to modify that.
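For reference, a minimal sketch of such a template (the template name and the micrometer-metrics* index pattern are placeholders, and the legacy _template API is used; adapt both to your setup and to the index name Micrometer is configured to write to):

PUT _template/micrometer_source_enabled
{
  "index_patterns": ["micrometer-metrics*"],
  "order": 1,
  "mappings": {
    "_source": {
      "enabled": true
    }
  }
}

With the legacy API, a higher "order" value lets this template's mappings win over a lower-order template matching the same indices.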

Also, I want to point out that recently, the following dashboards/visualizations of Micrometer metrics in Kibana have been published by some users.
https://github.com/acroquest/micrometer-kibana-dashboard

@xeraa

xeraa commented Dec 11, 2019

What's the impact?

To make the impact of this explicit: it might be obvious already, but just to put everyone on the same page.

If you have the following mapping and 3 sample docs:

PUT metrics
{
  "mappings": {
    "_source": {
      "enabled": false
    }
  }
}

POST metrics/_doc
{
  "some-long": 200,
  "some-float": 0.9,
  "timestamp": "2019-12-11T12:10:10"
}
POST metrics/_doc
{
  "some-long": 400,
  "some-float": 1.2,
  "timestamp": "2019-12-11T12:10:20"
}
POST metrics/_doc
{
  "some-long": 240,
  "some-float": 1.0,
  "timestamp": "2019-12-11T12:10:30"
}

A search like GET metrics/_search will only give you an output like this:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "metrics",
        "_type" : "_doc",
        "_id" : "LA8N9m4BscjDut7uOBBz",
        "_score" : 1.0
      },
      ...

And Discover will show the same:

[Screenshot: Discover lists the three hits with no field values shown]

But Visualizations (and thus Dashboards) will work just like normal:

[Screenshot: a Kibana visualization of the same data renders normally]

Is this the right tradeoff? I don't know :)
It might depend on the use-case, so making it configurable could make sense (potentially pointing to the right index template in the docs)? My personal feeling is that you want to keep this set to true by default and more experienced users with larger datasets can change it to false if that is what they want.

How much disk will this save?

As always, it depends. The number of fields, their mappings, and how well the data can be compressed will all play a role.

Do you have a representative dataset? Then we could easily try it out.

Which Elasticsearch versions support this?

I would make a pull request to change to enabled: true for Elasticsearch 7.3 and up

This feature has been around for a long time. I'm not sure what version range of Elasticsearch you support, but you could probably enable this for all versions (for example see the docs for 5.0).

PS: Rollups

Maybe the solution could also include rollups? Basically it takes the raw data and pre-aggregates it. So you could have 10s intervals for today, but after 48h you only keep 5min intervals around; IMO that would save a lot more disk space (but add different tradeoffs).
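For illustration, a rollup job along those lines might look roughly like the following (the job name, rollup index, field names, and the 5m interval are assumptions based on the sample documents above; the job also needs to be started afterwards with POST _rollup/job/metrics_rollup_5m/_start):

PUT _rollup/job/metrics_rollup_5m
{
  "index_pattern": "metrics",
  "rollup_index": "metrics_rollup_5m",
  "cron": "0 */5 * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": {
      "field": "timestamp",
      "fixed_interval": "5m"
    }
  },
  "metrics": [
    {
      "field": "some-long",
      "metrics": ["avg", "min", "max"]
    },
    {
      "field": "some-float",
      "metrics": ["avg", "min", "max"]
    }
  ]
}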

PPS: Compression

As mentioned in the docs, you could also look into compression if you're not doing that already.
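Concretely, the compression knob referred to here is the index codec, a static setting chosen at index creation time (or in an index template); a minimal sketch on top of the earlier example:

PUT metrics
{
  "settings": {
    "index.codec": "best_compression"
  },
  "mappings": {
    "_source": {
      "enabled": false
    }
  }
}

Note that best_compression mainly affects stored fields such as _source, so it tends to matter most when _source stays enabled.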

@spencergibb
Collaborator

another option would be to make a configuration property for whether to enable the _source field,

I like options

@xeraa

xeraa commented Dec 12, 2019

BTW some other ideas for further reducing storage requirements:

  • Pick the right numeric datatype, for example integer or short instead of the default long; or instead of double either float or even scaled_float.
  • Set "index": false on numeric values, so you can still aggregate on a value, but not search or filter. This is also doable through an index template.
  • Strings should probably only be keyword if you're not already doing that. (A combined mapping is sketched after this list.)
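Putting those suggestions together, a mapping along these lines (field names match the sample documents above and are otherwise just examples, not what Micrometer generates) might look like:

PUT metrics
{
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "timestamp":  { "type": "date" },
      "some-long":  { "type": "integer", "index": false },
      "some-float": { "type": "scaled_float", "scaling_factor": 100, "index": false },
      "name":       { "type": "keyword" }
    }
  }
}

With "index": false the numeric fields keep their doc_values, so aggregations still work even though you can no longer filter on them.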

@dome313

dome313 commented Dec 17, 2019

The description states that this is the behaviour in Elasticsearch 7.3 and up, but if you check the docs this is the behaviour in any Elasticsearch version. I don't see what was introduced in terms of _source in Elasticsearch 7.3.

@izeye
Contributor

izeye commented Nov 27, 2020

I created #2363 to try to resolve this.

@izeye
Contributor

izeye commented Nov 27, 2020

I also created #2364 to incorporate the suggestion from @xeraa in #1629 (comment).

@shakuzen shakuzen changed the title Elastic metrics: The 7.3 have changed the meaning for _source enabled false Configurable _source.enabled Elastic mapping property Mar 25, 2022
izeye added a commit to izeye/micrometer that referenced this issue Mar 25, 2022
@shakuzen
Member

We discussed this internally today. My thought is still that metrics data should not have _source enabled by default. I'm even a bit wary about making it configurable, because for all but small-scale usage, I suspect it is a bad idea to have it enabled in production due to the added cost, which increases with more data while the benefit does not. You can query the metrics even if the contents of the documents are not returned in a search.

Despite that, I can see the usefulness of being able to explore the contents of the metrics more easily. As a compromise, we're thinking of making this configurable but printing a warning log discouraging its use in production. See #2363. We are still open to feedback on the idea.

After our internal discussion, I now see that the latest documentation is worded more to encourage leaving _source enabled.

Do you have a representative dataset? Then we could easily try it out.

@xeraa I like the idea of a data-driven approach. Things can vary a lot, but a good baseline for a lot of users is probably the metrics that are auto-configured in a basic Spring Boot application. This can be reproduced by generating a project from https://start.spring.io with the Actuator and Web dependencies. Add micrometer-registry-elastic as a dependency, start a local Elasticsearch instance, and start the Spring Boot app. By default, it will publish registered metrics to Elasticsearch at localhost:9200 once per minute. If you have some controller endpoints and make requests to them, the number of metrics published will increase. In a real environment, you'll have this amount of data multiplied by each application instance you have running. And most applications will probably have more metrics than such a bare-bones Spring Boot app.

We could run such an app for 10 minutes or so to extrapolate how much data might be generated in an hour and multiply it by a few hundred or thousand to estimate a production environment of many apps with many instances.
Then we can repeat the scenario with _source enabled and see the difference in storage space required.
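One simple way to compare the two runs is the cat indices API (the index pattern below assumes the registry's default micrometer-metrics naming); force-merging first makes the size comparison more stable:

POST micrometer-metrics*/_forcemerge?max_num_segments=1

GET _cat/indices/micrometer-metrics*?v&h=index,docs.count,pri.store.size,store.size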

I don't know whether this is a good strategy for estimating the difference, or whether there are fixed costs at the relatively low volume of data from 10 minutes with one instance that would prevent us from extrapolating to a longer run with many more instances.

Separate from this, I think we really should be moving in the direction of writing metric data as a data stream. I don't know whether that obviates this _source discussion or not.

shakuzen pushed a commit that referenced this issue Mar 28, 2022
But also warn about the associated costs as a mitigation against this mistakenly being enabled in a production environment.

Closes gh-1629
@shakuzen
Member

shakuzen commented Jun 7, 2023

Re-opening this as it looks like #2363 got inadvertently dropped from 1.10.0 when reverting 2.0-specific changes. See #2363 (comment)

@shakuzen shakuzen reopened this Jun 7, 2023
@shakuzen shakuzen added this to the 1.12 backlog milestone Jun 7, 2023
@shakuzen shakuzen added the enhancement A general enhancement label Jun 7, 2023
@shakuzen shakuzen modified the milestones: 1.12 backlog, 1.13 backlog Nov 14, 2023
@denistsyplakov

Looks like the easiest way to fix this is to clone the library, patch it, and build it myself. God bless open source.
This bug is very painful, and I have been struggling with it for the past 3-4 years.

@jvmlet

jvmlet commented Sep 18, 2024

@denistsyplakov, the moment you do that, you start the adventure of syncing your fork and maintaining the CI/CD yourself... Not sure which one is a bigger pain.

@xeraa

xeraa commented Sep 19, 2024

Two sidenotes from the Elasticsearch side:

  1. We now have synthetic source, which basically gets you the disk space savings while still keeping _source available.
  2. There's now a much more efficient index type for metrics called TSDS. It uses synthetic source along with other tricks to reduce storage further. If you have a larger metrics use case, this will make a major difference. (A minimal sketch follows below.)
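As a rough illustration only (this is not how Micrometer sets things up today; the template name, index pattern, and the dimension/metric field choices are all assumptions), a TSDS is declared through an index template with index.mode set to time_series:

PUT _index_template/micrometer-tsds
{
  "index_patterns": ["metrics-micrometer-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.mode": "time_series",
      "index.routing_path": ["name"]
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "name":       { "type": "keyword", "time_series_dimension": true },
        "value":      { "type": "double", "time_series_metric": "gauge" }
      }
    }
  }
}

Indices in time_series mode use synthetic _source automatically, so the disk savings discussed above come without losing the ability to retrieve documents.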

@shakuzen shakuzen modified the milestones: 1.13.x, 1.14.0-RC1 Sep 20, 2024
@shakuzen
Member

@xeraa thanks for always keeping us aware of the latest features that may be relevant. It's much appreciated. On TSDS, where things left off was #3763 (comment). As far as I'm aware, we can already write metrics using Micrometer to a TSDS using the same functionality that writes to a traditional index (per #2997), but we don't handle creating the index template (or equivalent) for it now. Maybe our next step should be to do what was mentioned in #3763 (comment) and default to that type if an index doesn't already exist. I assume all supported versions of Elasticsearch support TSDS by now.

@denistsyplakov Micrometer by default creates an index template if one doesn't already exist. You can create your own index template in your Elasticsearch instance and have Micrometer write to indices it covers. You could also modify the index template that Micrometer had previously created. So there should have been nothing blocking users this whole time from enabling _source if they wanted to; it merely wasn't a configuration option on the index template Micrometer creates when none exists. That said, the feature is now (re-)merged for inclusion in 1.14.0-RC1 and will be available in 1.14.0-SNAPSHOT versions momentarily.
