Disallow any dag tags longer than 100 char #25196

TommyDew42 · 2022-07-20T19:33:47Z

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

boring-cyborg · 2022-07-20T19:33:50Z

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
Be sure to read the Airflow Coding style.
Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

potiuk · 2022-07-21T17:36:20Z

airflow/models/dag.py

@@ -347,6 +347,9 @@ def __init__(
    ):
        from airflow.utils.task_group import TaskGroup

+        if tags and any(len(tag) > 100 for tag in tags):


I just wonder - why 100 ? Do we know why the 100 causes the DAGs to disappear (in the original issue? ) I think having such a magic constant here is not good - it needs to come from somewhere ?

eladkal · 2022-07-21T17:58:36Z

I don't think this is right (or at least not valid to all backends)
I was able to create a DAG with tag longer than 100 chars with Sqlite (didn't test the others)

potiuk

Yep. We need to confirm where the 100 limit comes from (and whether it's real) before we merge this one.

anselboero · 2022-07-22T14:31:08Z

I'd like to try to give an explanation :)

The DagTag class requires the maximum length of the Tag to be 100 chars.
Because of that, the error occurs during the loading of the Tag into the Db.

I believe SqlLite is not affected by this bug as, according to the SqlLite documentation (3.1.1),

numeric arguments in parentheses that following the type name (ex: "VARCHAR(255)") are ignored by SQLite - SQLite does not impose any length restrictions [...] on the length of strings

It follows that the tag is uploaded into the db despite the restriction.

potiuk · 2022-07-22T14:45:59Z

I'd like to try to give an explanation :)

The DagTag class requires the maximum length of the Tag to be 100 chars. Because of that, the error occurs during the loading of the Tag into the Db.

I believe SqlLite is not affected by this bug as, according to the SqlLite documentation (3.1.1),

numeric arguments in parentheses that following the type name (ex: "VARCHAR(255)") are ignored by SQLite - SQLite does not impose any length restrictions [...] on the length of strings

It follows that the tag is uploaded into the db despite the restriction.

Cool. Lets then extract the 100 from there to a constant and use it in Both places.

…g-tag-longer-than-100-char

TommyDew42 · 2022-07-22T15:38:38Z

Thanks everyone's help! Updated made!

TommyDew42 · 2022-07-22T16:02:58Z

I am a bit confused about the test failure tho. Is the test flaky or we did break sth?

potiuk · 2022-07-22T16:37:01Z

You need to rebase. It was a flaky test that I fixed in #25231

The new markdown-lint is a bit more picky about headeers in markdown

In case of "merge-run" (i.e. when maintainer merges somethin to main or v2-*-test branch is pushed, the ARM instance does not need to be built because in such case cache is built and pushed for ARM image.

@task

* Bump typing-extensions and mypy for ParamSpec I want to use them in some @task signature improvements. Mypy added this in 0.950, but let's just bump to latest since why not. Changelog of typing-extensions is spotty before 4.0, but ParamSpec was introduced some time before that (likely some time in 2021), and it seems to be a reasonble minimum to bump to. For more about ParamSpec, read PEP 612: https://peps.python.org/pep-0612/

Original implementation has used the API that exists only on AWS, so it will be failing on Azure & GCP Databricks. This PR uses another API that is available on all Databricks cloud platforms.

Tells you, for a given dag run, the dataset events that are "part of" the dag run. I.e. they were part of the collection of dataset events that contributed to the triggering of the dag run. In practice we just query the events that occurred since the prev dag run. We may make the relationship tighter, see #24969.

related: #22447, #22430

…s any exception (#24943)

Convert dataframe object columns to str, to avoid errors when converting from df to parquet. Renamed methods to remove old name: _fix_int_dtypes -> _fix_dtypes test_fix_int_dtypes -> test_fix_dtypes Co-authored-by: Paul Stanton <paul.stanton@selectquote.com>

I missed the hive provider from the commit in #25214

Assets compilation when CI and PROD images are prepared should also be run in non-main branch.

)

…ation_for_ids` docs (#25197)

* dataset details view * Fix sorting and naming * update table tests * update comments

* Convert RDS Export Sample DAG to System Test * PR Fixes

…ssorManager queue (#25147)

The 22.2 version of `pip` has **just** been released. Looks like it does not break neither constraints nor "eager upgrade" builds when we build images locally so it's safe to update to it in main.

* Improve ElasticsearchTaskHandler: - use builtin logging.makeLogRecord instead of strange _ESJsonLogFmt - do not re-sort already sorted logs - apply ISO 8601 datetime format - fixed several found bugs

In Postgres especially (but in generaly in all databases, if there is no order specified in select query, the rows might come in random order. It depends on many factors. The test query here retrieved the dags without any order but expected the list to be in specific order. This PR adds ordering - it also removes side-effects of the test by using fixture that clears the datasets before and after the tests that rely/modify datasets - because othrwise failure of one of the tests can create side effects that fail the other tests (this is what happened in this case)

…(AIP-47) (#25134) * Sagemaker System Tests - Part 3 of 3 - example_sagemaker_endpoint.py * PR Fixes * Fix failing static checks - unused import

* Add batching to SQL Check Operators Commit adds a WHERE clause to the sql statement that allows for arbitrary batching in a given table. * Fix bug with multiple table checks When multiple table checks are given to the SQLTableCheckOperator and at least one is not a fully aggregate statement, a GROUP BY clause was previously needed. This commit updates the operator to use the get_pandas_df() method instead of _get_first() to return a pandas dataframe object that contains the check names and check results from the new style of query. The new style of query uses UNION ALL to run each test as its own SELECT statement, bypassing the need to do a GROUP BY. * Update test failure logic Changed name of method from _get_failed_tests to _get_failed_checks to better match naming, and updated logic of the method to include an optional column param. The query in the column check operator is removed from the failed test exception message, as it was only ever showing the last query, instead of the relevant one(s). This is replaced by the column, which will be more useful in debugging. * Add table alias to SQLTableCheckOperator query Without a table alias, the query does not run on Postgres and other databases. The alias is arbitrary and used only for proper query execution. * Fix formatting error in operator * Add batching to SQL Check Operators Commit adds a WHERE clause to the sql statement that allows for arbitrary batching in a given table. * Fix bug with multiple table checks When multiple table checks are given to the SQLTableCheckOperator and at least one is not a fully aggregate statement, a GROUP BY clause was previously needed. This commit updates the operator to use the get_pandas_df() method instead of _get_first() to return a pandas dataframe object that contains the check names and check results from the new style of query. The new style of query uses UNION ALL to run each test as its own SELECT statement, bypassing the need to do a GROUP BY. * Update test failure logic Changed name of method from _get_failed_tests to _get_failed_checks to better match naming, and updated logic of the method to include an optional column param. The query in the column check operator is removed from the failed test exception message, as it was only ever showing the last query, instead of the relevant one(s). This is replaced by the column, which will be more useful in debugging. * Add table alias to SQLTableCheckOperator query Without a table alias, the query does not run on Postgres and other databases. The alias is arbitrary and used only for proper query execution. * Fix formatting error in operator * Move alias to proper query build statement The table alias should be in the self.sql query build statement as that is where the table it needs to alias is defined. * Add batching to SQL Check Operators Commit adds a WHERE clause to the sql statement that allows for arbitrary batching in a given table. * Fix bug with multiple table checks When multiple table checks are given to the SQLTableCheckOperator and at least one is not a fully aggregate statement, a GROUP BY clause was previously needed. This commit updates the operator to use the get_pandas_df() method instead of _get_first() to return a pandas dataframe object that contains the check names and check results from the new style of query. The new style of query uses UNION ALL to run each test as its own SELECT statement, bypassing the need to do a GROUP BY. * Update test failure logic Changed name of method from _get_failed_tests to _get_failed_checks to better match naming, and updated logic of the method to include an optional column param. The query in the column check operator is removed from the failed test exception message, as it was only ever showing the last query, instead of the relevant one(s). This is replaced by the column, which will be more useful in debugging. * Add table alias to SQLTableCheckOperator query Without a table alias, the query does not run on Postgres and other databases. The alias is arbitrary and used only for proper query execution. * Fix formatting error in operator * Bug fixes and updates to test and operator Fixed bug in test where the dataframe column names did not match the operator's expected dataframe column names. Added more info to the SQLColumnCheckOperator's batch arg. Fixed the location of table aliasing in SQLTableCheckOperator. * Remove merge conflict lines * Rename parameter batch to partition_clause Gives a clearer name to the parameter and adds templating to the SQLTableCheckOperator. * Fix typo in docstring * Reformat operator file

…mmyDew42/airflow into disallow-dag-tag-longer-than-100-char

TommyDew42 · 2022-07-22T16:44:07Z

@potiuk Thanks. I actually never try rebase but always use merge LOL. Hope what I did fixed the tests.

TommyDew42 · 2022-07-22T16:45:00Z

Oh wait. So I think the test was fixed when I merged the master into my branch. 😂 no need to do rebase.

potiuk · 2022-07-22T16:46:44Z

Well. you now have 58 commits in your change.. Rebase is REALLY better :)

potiuk · 2022-07-22T16:47:24Z

Highly recommend learning to rebase: https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#id14

anselboero · 2022-07-22T17:13:04Z

airflow/models/dag.py

@@ -374,6 +375,9 @@ def __init__(
    ):
        from airflow.utils.task_group import TaskGroup

+        if tags and any(len(tag) > TAG_MAX_LEN for tag in tags):
+            raise AirflowException("tag cannot be longer than 100 characters")


Sorry, short question,
should we map the Exception text to the TAG_MAX_LEN variable, too?

potiuk · 2022-07-25T13:46:18Z

Merging. Random issue with MsSQL

boring-cyborg · 2022-07-25T13:46:30Z

Awesome work, congrats on your first merged pull request!

TommyDew42 added 2 commits July 21, 2022 03:31

raise exception when any tag length > 100

a545b90

an unit test for that

2abb48a

TommyDew42 requested review from kaxil, XD-DENG and ashb as code owners July 20, 2022 19:33

potiuk reviewed Jul 21, 2022

View reviewed changes

potiuk requested changes Jul 22, 2022

View reviewed changes

TommyDew42 added 3 commits July 22, 2022 23:31

improve test

30c8573

make the max length for tag 100 as a constant

47856aa

Merge branch 'main' of github.com:TommyDew42/airflow into disallow-da…

557d575

…g-tag-longer-than-100-char

TommyDew42 requested a review from potiuk July 22, 2022 15:38

jwitz and others added 12 commits July 23, 2022 00:40

Update tutorial docs to include a definition of operators (#25012)

92918fa

Fix markdown error in Airflow Readme (#25120)

9fd7467

The new markdown-lint is a bit more picky about headeers in markdown

Avoid running build ARM image twice (#24858)

bd04ba0

In case of "merge-run" (i.e. when maintainer merges somethin to main or v2-*-test branch is pushed, the ARM instance does not need to be built because in such case cache is built and pushed for ARM image.

Databricks: fix test_connection implementation (#25114)

a5a5ce1

Original implementation has used the API that exists only on AWS, so it will be failing on Azure & GCP Databricks. This PR uses another API that is available on all Databricks cloud platforms.

Migrate Google example gcs_to_sftp to new design AIP-47 (#25107)

5b9f0ae

related: #22447, #22430

Convert the batch sample dag to system tests (AIP-47) (#24448)

60de340

Added exception catching to send default email if template file raise…

7f424c2

…s any exception (#24943)

Convert Glue Sample DAG to System Test (#25136)

6a58614

AIP-47 - Migrate Slack DAG to new design (#25137)

8e6fcc1

bbovenzi and others added 15 commits July 23, 2022 00:40

call updateNodeLabels after expandGroup (#25217)

4d9b5ef

make MsSQL tests runnable on Python 3.8 (2nd) (#25216)

b69b273

I missed the hive provider from the commit in #25214

Also compile assets in non-main (#25220)

5b36b03

Assets compilation when CI and PROD images are prepared should also be run in non-main branch.

SQSPublishOperator should allow sending messages to a FIFO Queue (#25171

d547b76

)

Show dataset readiness for the next run (#25141)

bd1a50f

Include missing mention of external_executor_id in `sql_engine_coll…

c6702d9

…ation_for_ids` docs (#25197)

Dataset details view (#25208)

a3ec165

* dataset details view * Fix sorting and naming * update table tests * update comments

Convert RDS Export Sample DAG to System Test (AIP-47) (#25205)

8b75564

* Convert RDS Export Sample DAG to System Test * PR Fixes

Stop SLA callbacks gazumping other callbacks and DOS'ing the DagProce…

c998394

…ssorManager queue (#25147)

Upgrade to Pip 22.2 (#25218)

5bed461

The 22.2 version of `pip` has **just** been released. Looks like it does not break neither constraints nor "eager upgrade" builds when we build images locally so it's safe to update to it in main.

Improve ElasticsearchTaskHandler (#21942)

5dfb6bf

* Improve ElasticsearchTaskHandler: - use builtin logging.makeLogRecord instead of strange _ESJsonLogFmt - do not re-sort already sorted logs - apply ISO 8601 datetime format - fixed several found bugs

Sagemaker System Tests - Part 3 of 3 - example_sagemaker_endpoint.py …

f2d15b4

…(AIP-47) (#25134) * Sagemaker System Tests - Part 3 of 3 - example_sagemaker_endpoint.py * PR Fixes * Fix failing static checks - unused import

Merge branch 'disallow-dag-tag-longer-than-100-char' of github.com:To…

1081080

…mmyDew42/airflow into disallow-dag-tag-longer-than-100-char

potiuk approved these changes Jul 22, 2022

View reviewed changes

anselboero reviewed Jul 22, 2022

View reviewed changes

TommyDew42 added 2 commits July 23, 2022 01:31

use TAG_MAX_LEN in exception message

6ac7da6

Merge branch 'main' into disallow-dag-tag-longer-than-100-char

738d439

potiuk merged commit 4b28635 into apache:main Jul 25, 2022

ephraimbuddy added the type:improvement Changelog: Improvements label Aug 15, 2022

ephraimbuddy added this to the Airflow 2.4.0 milestone Sep 14, 2022

ephraimbuddy added type:bug-fix Changelog: Bug Fixes and removed type:improvement Changelog: Improvements labels Sep 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Disallow any dag tags longer than 100 char #25196

Disallow any dag tags longer than 100 char #25196

TommyDew42 commented Jul 20, 2022

boring-cyborg bot commented Jul 20, 2022

potiuk Jul 21, 2022

eladkal commented Jul 21, 2022

potiuk left a comment

anselboero commented Jul 22, 2022 •

edited

Loading

potiuk commented Jul 22, 2022 •

edited

Loading

TommyDew42 commented Jul 22, 2022

TommyDew42 commented Jul 22, 2022

potiuk commented Jul 22, 2022

TommyDew42 commented Jul 22, 2022

TommyDew42 commented Jul 22, 2022

potiuk commented Jul 22, 2022

potiuk commented Jul 22, 2022

anselboero Jul 22, 2022

potiuk Jul 22, 2022

potiuk commented Jul 25, 2022

boring-cyborg bot commented Jul 25, 2022

Disallow any dag tags longer than 100 char #25196

Disallow any dag tags longer than 100 char #25196

Conversation

TommyDew42 commented Jul 20, 2022

boring-cyborg bot commented Jul 20, 2022

potiuk Jul 21, 2022

Choose a reason for hiding this comment

eladkal commented Jul 21, 2022

potiuk left a comment

Choose a reason for hiding this comment

anselboero commented Jul 22, 2022 • edited Loading

potiuk commented Jul 22, 2022 • edited Loading

TommyDew42 commented Jul 22, 2022

TommyDew42 commented Jul 22, 2022

potiuk commented Jul 22, 2022

TommyDew42 commented Jul 22, 2022

TommyDew42 commented Jul 22, 2022

potiuk commented Jul 22, 2022

potiuk commented Jul 22, 2022

anselboero Jul 22, 2022

Choose a reason for hiding this comment

potiuk Jul 22, 2022

Choose a reason for hiding this comment

potiuk commented Jul 25, 2022

boring-cyborg bot commented Jul 25, 2022

anselboero commented Jul 22, 2022 •

edited

Loading

potiuk commented Jul 22, 2022 •

edited

Loading