Emit DataDog statsd metrics with metadata tags #28961
@potiuk @hussein-awala @uranusjr I've started working with our backend folks to add the new metric tags to the backend so that it can read the soon-to-be-published metrics, and I was reminded that metric cardinality is an issue for both storage space and the retention period of the tags. I'm not sure about other infrastructures, but for us, the cardinality of a metric is measured as:
Introducing tags for values that are already concatenated into the existing metric names doesn't actually increase the cardinality by much (it only doubles, because the same events are now emitted under both the old and the new metrics). But as a rule of thumb, I think we would benefit from carefully analyzing the potential for cardinality explosion from each new tag. As an example, my only concern with this PR is the new tag attribute 'run_id', which is unique for every single dag_run and hence increases the cardinality by the number of unique scheduled dag_runs during a retention period. This means that for an Airflow instance with 1000 daily jobs and a metric retention period of 10 days, we are increasing the cardinality by 10,000 on a single metric by adding this tag alone. If we add this tag to a few other metrics, that could easily result in an explosion of metric cardinality and of storage requirements for a metrics backend. As a benchmark, our allocated quota for metric cardinality is 100,000 per tenancy, and I'm wondering if other tag users may face similar storage-based concerns as well. Could I get your thoughts on this? Is there room to discuss and potentially backtrack the addition of run_id as a metric tag in the upcoming release?
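To make the arithmetic above concrete, here is a rough sketch. It assumes each daily job produces exactly one dag_run per day and ignores any multiplication with other tags; the function and variable names are made up for illustration.

```python
def added_series(daily_dag_runs: int, retention_days: int) -> int:
    """Distinct run_id values seen on one metric within the retention window.

    Assumption: one dag_run per job per day, and no multiplication
    with other tags on the same metric.
    """
    return daily_dag_runs * retention_days


QUOTA = 100_000  # per-tenancy cardinality quota quoted in the comment above

per_metric = added_series(daily_dag_runs=1000, retention_days=10)
metrics_until_quota = QUOTA // per_metric

print(per_metric)           # 10000
print(metrics_until_quota)  # 10
```

Under these assumptions, tagging roughly ten metrics with run_id would be enough to use up the whole quota on its own.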
On that note, I'm wondering if we should review this metric as well: If you think this warrants its own Issue to facilitate more discussion before opening a PR, I'm happy to open one as well.
I'll test it and check with the folks at Datadog whether the newly added tag might cause a problem; then we can decide whether to remove it completely or add a parameter to enable/disable it.
Thank you @hussein-awala - appreciate it!
At a quick glance, I think the concern about high-cardinality metrics is pretty universal: DataDog: https://arapulido.github.io/blog/2021/11/15/understanding-dd-tag-cardinality-in-kubernetes/ It also looks like metric cardinality would directly affect the pricing plan for custom metrics.
Yes. I think if we see high-cardinality metrics, we could indeed add features to disable them. I'm not sure, though, whether that should be a single "disable-high-cardinality" flag or a list of metrics to disable. Both have advantages and disadvantages. The single flag is more opinionated about what counts as high cardinality, but it also has the potential to be used in the OTEL implementation (cc: @feruzzi). It looks like there is a discussion on how cardinality should be explained, and on making that documentation part of the OTEL specification, in open-telemetry/opentelemetry-specification#2996, so once we get to the OTEL implementation we could also think about it and take part in the discussion.
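The two options could look roughly like the sketch below. Everything here is hypothetical: the names `HIGH_CARDINALITY_TAGS`, `filter_tags`, `disable_high_cardinality` and `disabled_tags` are not Airflow settings, and the sketch applies the choice to tag names, which is an assumption (the comment above speaks of metrics).

```python
from __future__ import annotations

# Hypothetical, opinionated set used by the single-flag option.
HIGH_CARDINALITY_TAGS = frozenset({"run_id"})


def filter_tags(
    tags: dict[str, str],
    *,
    disable_high_cardinality: bool = False,
    disabled_tags: frozenset[str] = frozenset(),
) -> dict[str, str]:
    """Drop tags excluded by the single flag and/or by an explicit list."""
    blocked = set(disabled_tags)
    if disable_high_cardinality:
        blocked |= HIGH_CARDINALITY_TAGS
    return {key: value for key, value in tags.items() if key not in blocked}


tags = {"dag_id": "example_dag", "task_id": "extract", "run_id": "scheduled__2023-01-01"}

# Option 1: one flag, the project decides what is high cardinality.
print(filter_tags(tags, disable_high_cardinality=True))
# Option 2: the user lists exactly what to drop.
print(filter_tags(tags, disabled_tags=frozenset({"task_id"})))
```

The flag keeps configuration short but hard-codes an opinion; the list leaves the decision to each deployment at the cost of more configuration.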
Thank you for that reference @potiuk - I think we can lean on the fact that OTEL is also trying to better document the problems that high-cardinality metrics pose to users, and use it to justify implementing a solution of our own for StatsD metrics in the interim. I think this discussion has grown enough to warrant an Issue of its own where we can agree on a solution. Will open one up.
In this PR, I'm adding a new config, metrics.statsd_datadog_metrics_tags. When it's True, Airflow will emit some of the metrics (the counters) with tags that add details about the metric source. This can help users create custom dashboards, aggregate and filter the metrics, and detect problems more easily. However, activating this feature can increase the Datadog cost.
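For illustration, a tagged counter in the DogStatsD datagram format looks like `name:value|c|#key:value,...`. The sketch below only builds that string; it is not the code from this PR, and the metric name and tag values are made-up examples.

```python
from __future__ import annotations


def dogstatsd_counter(name: str, value: int = 1, tags: dict[str, str] | None = None) -> str:
    """Build a DogStatsD counter datagram, appending tags after '|#' when given."""
    datagram = f"{name}:{value}|c"
    if tags:
        datagram += "|#" + ",".join(f"{key}:{val}" for key, val in tags.items())
    return datagram


# Without tags, the source details have to live in the metric name itself.
print(dogstatsd_counter("ti.finish.example_dag.extract.success"))
# ti.finish.example_dag.extract.success:1|c

# With tags, one metric name carries the details as filterable attributes.
print(dogstatsd_counter("ti.finish", tags={"dag_id": "example_dag", "task_id": "extract"}))
# ti.finish:1|c|#dag_id:example_dag,task_id:extract
```

Each distinct combination of tag values becomes its own time series on the backend, which is where the extra cost comes from.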