Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable tagged metric names for existing Statsd metric publishing events | influxdb-statsd support #29093

Merged
merged 26 commits into from
Feb 14, 2023

Conversation

sungwy
Copy link
Contributor

@sungwy sungwy commented Jan 21, 2023

This PR attempts to close a long open issue (closes: #11463 ) related to allowing Airflow to publish tags in a backward compatible manner.

The support for tags has been long-desired, and the recent PR #28961 introduced the tags parameter to the StatsLogger Protocol. However, this tag only allows us to introduce tags in a forward looking way: useful existing metrics that take form of concatenated tag-like attributes cannot meaningfully be changed to be efficiently aggregated. The following are some examples of metrics that would greatly benefit from having an updated naming/tagging convention as they hold an important significance at the cluster level:

  • dagrun.schedule_delay.<dag_id>
  • dagrun.<dag_id>.first_task_scheduling_delay

I am introducing two additional parameters, along with one configuration property in order to enable a backward-compatible approach to introducing tags for tag-supported plugins (Datadog, Influxdb etc)

Parameters:

  • name_tags: OrderedDict[str, str]
  • stat_suffix: str

name_tags are similar to the existing tags, except that they will be included into the end of the stat_name if statsd_aggregation_optimized_naming_enabled is set to False. This means that adding or changing name_tags will break backward compatibility. In contrast, the tags property will remain a non-breaking parameter that will allow us to introduce a breadth of tags to airflow.

stat_suffix are added to the stat_name after the name_tags have been concatenated. We would prefer to not have this attribute as it seems extraneous, but it is necessary to maintain backward compatibility with some metrics that have included tag-like attributes in the middle of the metric name. (e.g. dagrun.<dag_id>.first_task_scheduling_delay).

Configuration Property

  • statsd_aggregation_optimized_naming_enabled: boolean = False

When set to False (default), the current naming format is preserved. But when set to True, name_tags will be excluded from the stat_name for better aggregation. In our example, if dag_id is passed into the name_tags parameter as OrderedDict({"dag_id": dag.dag_id}), it will be excluded from the stat_name and instead be published as one of the tags to the metric. (The resulting stat_name will be dagrun.first_task_scheduling_delay, which allows us to aggregate the average of scheduling delays better).


EDIT:

This PR now publishes two separate stats metrics for some important existing metrics. A single scheduling delay event will now trigger the following two Stats.incr calls. The rationale for this change is discussed in the comments.

Stats.timing(f"dagrun.schedule_delay.{dag.dag_id}", schedule_delay)
Stats.timing(f"dagrun.schedule_delay", schedule_delay, tags={"dag": dag.dag_id})

Additional Change: InfluxDB support:

InfluxDB-StatsD support is entirely opt-in, and is driven by a configuration property: statsd_influxdb_enabled which is set to False by default. When enabled, the tags will be added into the stats_name according to the influxdb-statsd convention in order to allow the publication of tagged metrics.

I have currently made the change to dagrun.schedule_delay.<dag_id> or dagrun.<dag_id>.first_task_scheduling_delay to show how name_tags could be used to update the existing metrics in a backward-compatible way. If this sounds like a good idea, I could keep working on this PR to update other existing tags as well.


Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Jan 21, 2023
@sungwy
Copy link
Contributor Author

sungwy commented Jan 21, 2023

@potiuk @hussein-awala @uranusjr would love to get your thoughts on this PR

@sungwy sungwy changed the title Enable backward compatible tags for statsd [WIP] Enable backward compatible tags for statsd Jan 21, 2023
@hussein-awala
Copy link
Member

Thank your for this PR, I was working on a PR to support the choice between adding the variables (job_name, task_id, dag_id, run_id, ...) as a part of the metric name or as tags with same metric name for DataDog metrics.

I will check your PR by tomorrow.

@sungwy sungwy changed the title [WIP] Enable backward compatible tags for statsd [WIP] Enable tags for statsd metrics in a backward compatible way Jan 25, 2023
@sungwy
Copy link
Contributor Author

sungwy commented Jan 30, 2023

Another alternative to this PR, is just to simply publish duplicate metrics: one that has the current naming strategy, and another one that has an aggregation friendly name and its tags.

Stats.timing(f"dagrun.schedule_delay.{dag.dag_id}", schedule_delay)
Stats.timing(f"dagrun.schedule_delay", schedule_delay, tags={"dag": dag.dag_id})

@uranusjr
Copy link
Member

Another alternative to this PR, is just to simply publish duplicate metrics

I was thinking exactly this when I opened this PR. Is there a good way to deprecate stat names? Keeping the old names forever is probably not a very viable solution.

@sungwy
Copy link
Contributor Author

sungwy commented Jan 31, 2023

Another alternative to this PR, is just to simply publish duplicate metrics

I was thinking exactly this when I opened this PR. Is there a good way to deprecate stat names? Keeping the old names forever is probably not a very viable solution.

First off: thank you for the review! Very much appreciate you taking a look. I agree with you @uranusjr, that tagless metrics are very limiting for obvious reasons. But as I understand, there are a couple of factors that act against us completely deprecating the Statsd old stat names...

Firstly, it looks like the long-term vision is to deprecate Statsd support overall and move towards adopting OpenTelemetry. And a part of that proposal is to deprecate Statsd metrics / the existing stat names and introducing tagged metrics. But as I have pointed out in this issue #11463, adopting OpenTelemetry will be an organizational effort rather than an individual one for users in large companies. On top of that, I believe that that AIP is still a long way from being implemented.

Unfortunately with Statsd, although it is widely used, we have to deal with the unfortunate lack of standardization for tag-supported plugins. As an Influxdb-Statsd plugin user, I would love to just simplify the existing stat naming conventions to be more aggregation friendly and just introduce meaningful tags. But completely removing variable (dag, pool, etc) information from the existing stats metrics would be a pretty destructive change to current classic statsd users that may rely on that information.

I think if we want to reserve that destructive change until OpenTelemetry is introduced, then preserving backward compatibility as proposed in this PR might be a gentle way to bridge the gap.

@uranusjr
Copy link
Member

uranusjr commented Feb 1, 2023

If the goal is to remove statsd altogether eventually, emitting an event under multiple keys feels like the easiest solution to me.

@sungwy sungwy changed the title [WIP] Enable tags for statsd metrics in a backward compatible way Enable tagged metric names for existing Statsd metric publishing events | influxdb-statsd support Feb 1, 2023
@sungwy
Copy link
Contributor Author

sungwy commented Feb 1, 2023

If the goal is to remove statsd altogether eventually, emitting an event under multiple keys feels like the easiest solution to me.

Sounds good to me @uranusjr . Made the changes to some important metrics. May I ask you for another review?

Copy link
Member

@uranusjr uranusjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits. General logic lgtm.

airflow/config_templates/config.yml Outdated Show resolved Hide resolved
airflow/jobs/scheduler_job.py Outdated Show resolved Hide resolved
airflow/models/dagrun.py Outdated Show resolved Hide resolved
airflow/stats.py Outdated Show resolved Hide resolved
airflow/stats.py Outdated Show resolved Hide resolved
airflow/stats.py Outdated Show resolved Hide resolved
airflow/stats.py Outdated Show resolved Hide resolved
@sungwy
Copy link
Contributor Author

sungwy commented Feb 2, 2023

Some nits. General logic lgtm.

Thank you! I've made the suggested changes.

@sungwy
Copy link
Contributor Author

sungwy commented Feb 3, 2023

@potiuk could I ask for your review on this PR?

@potiuk potiuk merged commit 289ae47 into apache:main Feb 14, 2023
@potiuk
Copy link
Member

potiuk commented Feb 14, 2023

Woooho !

@sungwy
Copy link
Contributor Author

sungwy commented Feb 14, 2023

Thank you both!! @potiuk @uranusjr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler type:new-feature Changelog: New Features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tags rather than names in variable parts of the metrics
5 participants