Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disallow any dag tags longer than 100 char #25196

Merged
merged 60 commits into from
Jul 25, 2022
Merged

Disallow any dag tags longer than 100 char #25196

merged 60 commits into from
Jul 25, 2022

Conversation

TommyDew42
Copy link
Contributor


closes #24820.

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg
Copy link

boring-cyborg bot commented Jul 20, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@@ -347,6 +347,9 @@ def __init__(
):
from airflow.utils.task_group import TaskGroup

if tags and any(len(tag) > 100 for tag in tags):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wonder - why 100 ? Do we know why the 100 causes the DAGs to disappear (in the original issue? ) I think having such a magic constant here is not good - it needs to come from somewhere ?

@eladkal
Copy link
Contributor

eladkal commented Jul 21, 2022

I don't think this is right (or at least not valid to all backends)
I was able to create a DAG with tag longer than 100 chars with Sqlite (didn't test the others)
Screen Shot 2022-07-21 at 20 56 44

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. We need to confirm where the 100 limit comes from (and whether it's real) before we merge this one.

@anselboero
Copy link

anselboero commented Jul 22, 2022

I'd like to try to give an explanation :)

The DagTag class requires the maximum length of the Tag to be 100 chars.
Because of that, the error occurs during the loading of the Tag into the Db.

I believe SqlLite is not affected by this bug as, according to the SqlLite documentation (3.1.1),

numeric arguments in parentheses that following the type name (ex: "VARCHAR(255)") are ignored by SQLite - SQLite does not impose any length restrictions [...] on the length of strings

It follows that the tag is uploaded into the db despite the restriction.

@potiuk
Copy link
Member

potiuk commented Jul 22, 2022

I'd like to try to give an explanation :)

The DagTag class requires the maximum length of the Tag to be 100 chars. Because of that, the error occurs during the loading of the Tag into the Db.

I believe SqlLite is not affected by this bug as, according to the SqlLite documentation (3.1.1),

numeric arguments in parentheses that following the type name (ex: "VARCHAR(255)") are ignored by SQLite - SQLite does not impose any length restrictions [...] on the length of strings

It follows that the tag is uploaded into the db despite the restriction.

Cool. Lets then extract the 100 from there to a constant and use it in Both places.

@TommyDew42
Copy link
Contributor Author

Thanks everyone's help! Updated made!

@TommyDew42 TommyDew42 requested a review from potiuk July 22, 2022 15:38
@TommyDew42
Copy link
Contributor Author

I am a bit confused about the test failure tho. Is the test flaky or we did break sth?

@potiuk
Copy link
Member

potiuk commented Jul 22, 2022

You need to rebase. It was a flaky test that I fixed in #25231

jwitz and others added 12 commits July 23, 2022 00:40
The new markdown-lint is a bit more picky about headeers in markdown
In case of "merge-run" (i.e. when maintainer merges somethin
to main or v2-*-test branch is pushed, the ARM instance does not
need to be built because in such case cache is built and pushed
for ARM image.
* Bump typing-extensions and mypy for ParamSpec

I want to use them in some @task signature improvements. Mypy added this
in 0.950, but let's just bump to latest since why not.

Changelog of typing-extensions is spotty before 4.0, but ParamSpec was
introduced some time before that (likely some time in 2021), and it
seems to be a reasonble minimum to bump to.

For more about ParamSpec, read PEP 612: https://peps.python.org/pep-0612/
Original implementation has used the API that exists only on AWS, so it will be failing on
Azure & GCP Databricks.  This PR uses another API that is available on all Databricks
cloud platforms.
Tells you, for a given dag run, the dataset events that are "part of" the dag run.  I.e. they were part of the collection of dataset events that contributed to the triggering of the dag run.  In practice we just query the events that occurred since the prev dag run.  We may make the relationship tighter, see #24969.
Convert dataframe object columns to str, to avoid errors when converting from df to parquet.

Renamed methods to remove old name:
_fix_int_dtypes -> _fix_dtypes
test_fix_int_dtypes -> test_fix_dtypes

Co-authored-by: Paul Stanton <paul.stanton@selectquote.com>
bbovenzi and others added 15 commits July 23, 2022 00:40
I missed the hive provider from the commit in #25214
Assets compilation when CI and PROD images are prepared should
also be run in non-main branch.
* dataset details view

* Fix sorting and naming

* update table tests

* update comments
* Convert RDS Export Sample DAG to System Test

* PR Fixes
The 22.2 version of `pip` has **just** been released. Looks like
it does not break neither constraints nor "eager upgrade" builds
when we build images locally so it's safe to update to it in main.
* Improve ElasticsearchTaskHandler:
- use builtin logging.makeLogRecord instead of strange _ESJsonLogFmt
- do not re-sort already sorted logs
- apply ISO 8601 datetime format
- fixed several found bugs
In Postgres especially (but in generaly in all databases, if
there is no order specified in select query, the rows might
come in random order. It depends on many factors.

The test query here retrieved the dags without any order but
expected the list to be in specific order.

This PR adds ordering - it also removes side-effects of the
test by using fixture that clears the datasets before and after
the tests that rely/modify datasets - because othrwise failure of
one of the tests can create side effects that fail the other
tests (this is what happened in this case)
…(AIP-47) (#25134)

* Sagemaker System Tests - Part 3 of 3 - example_sagemaker_endpoint.py

* PR Fixes

* Fix failing static checks - unused import
* Add batching to SQL Check Operators

Commit adds a WHERE clause to the sql statement that allows for
arbitrary batching in a given table.

* Fix bug with multiple table checks

When multiple table checks are given to the SQLTableCheckOperator
and at least one is not a fully aggregate statement, a GROUP BY
clause was previously needed. This commit updates the operator to
use the get_pandas_df() method instead of _get_first() to return a
pandas dataframe object that contains the check names and check
results from the new style of query. The new style of query uses
UNION ALL to run each test as its own SELECT statement, bypassing
the need to do a GROUP BY.

* Update test failure logic

Changed name of method from _get_failed_tests to _get_failed_checks
to better match naming, and updated logic of the method to include
an optional column param. The query in the column check operator
is removed from the failed test exception message, as it was only
ever showing the last query, instead of the relevant one(s). This is
replaced by the column, which will be more useful in debugging.

* Add table alias to SQLTableCheckOperator query

Without a table alias, the query does not run on Postgres and
other databases. The alias is arbitrary and used only for
proper query execution.

* Fix formatting error in operator

* Add batching to SQL Check Operators

Commit adds a WHERE clause to the sql statement that allows for
arbitrary batching in a given table.

* Fix bug with multiple table checks

When multiple table checks are given to the SQLTableCheckOperator
and at least one is not a fully aggregate statement, a GROUP BY
clause was previously needed. This commit updates the operator to
use the get_pandas_df() method instead of _get_first() to return a
pandas dataframe object that contains the check names and check
results from the new style of query. The new style of query uses
UNION ALL to run each test as its own SELECT statement, bypassing
the need to do a GROUP BY.

* Update test failure logic

Changed name of method from _get_failed_tests to _get_failed_checks
to better match naming, and updated logic of the method to include
an optional column param. The query in the column check operator
is removed from the failed test exception message, as it was only
ever showing the last query, instead of the relevant one(s). This is
replaced by the column, which will be more useful in debugging.

* Add table alias to SQLTableCheckOperator query

Without a table alias, the query does not run on Postgres and
other databases. The alias is arbitrary and used only for
proper query execution.

* Fix formatting error in operator

* Move alias to proper query build statement

The table alias should be in the self.sql query build statement
as that is where the table it needs to alias is defined.

* Add batching to SQL Check Operators

Commit adds a WHERE clause to the sql statement that allows for
arbitrary batching in a given table.

* Fix bug with multiple table checks

When multiple table checks are given to the SQLTableCheckOperator
and at least one is not a fully aggregate statement, a GROUP BY
clause was previously needed. This commit updates the operator to
use the get_pandas_df() method instead of _get_first() to return a
pandas dataframe object that contains the check names and check
results from the new style of query. The new style of query uses
UNION ALL to run each test as its own SELECT statement, bypassing
the need to do a GROUP BY.

* Update test failure logic

Changed name of method from _get_failed_tests to _get_failed_checks
to better match naming, and updated logic of the method to include
an optional column param. The query in the column check operator
is removed from the failed test exception message, as it was only
ever showing the last query, instead of the relevant one(s). This is
replaced by the column, which will be more useful in debugging.

* Add table alias to SQLTableCheckOperator query

Without a table alias, the query does not run on Postgres and
other databases. The alias is arbitrary and used only for
proper query execution.

* Fix formatting error in operator

* Bug fixes and updates to test and operator

Fixed bug in test where the dataframe column names did not match
the operator's expected dataframe column names. Added more info
to the SQLColumnCheckOperator's batch arg. Fixed the location of
table aliasing in SQLTableCheckOperator.

* Remove merge conflict lines

* Rename parameter batch to partition_clause

Gives a clearer name to the parameter and adds templating to
the SQLTableCheckOperator.

* Fix typo in docstring

* Reformat operator file
…mmyDew42/airflow into disallow-dag-tag-longer-than-100-char
@TommyDew42
Copy link
Contributor Author

@potiuk Thanks. I actually never try rebase but always use merge LOL. Hope what I did fixed the tests.

@TommyDew42
Copy link
Contributor Author

Oh wait. So I think the test was fixed when I merged the master into my branch. 😂 no need to do rebase.

@potiuk
Copy link
Member

potiuk commented Jul 22, 2022

Well. you now have 58 commits in your change.. Rebase is REALLY better :)

@potiuk
Copy link
Member

potiuk commented Jul 22, 2022

Highly recommend learning to rebase: https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#id14

@@ -374,6 +375,9 @@ def __init__(
):
from airflow.utils.task_group import TaskGroup

if tags and any(len(tag) > TAG_MAX_LEN for tag in tags):
raise AirflowException("tag cannot be longer than 100 characters")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, short question,
should we map the Exception text to the TAG_MAX_LEN variable, too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls

@potiuk
Copy link
Member

potiuk commented Jul 25, 2022

Merging. Random issue with MsSQL

@potiuk potiuk merged commit 4b28635 into apache:main Jul 25, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented Jul 25, 2022

Awesome work, congrats on your first merged pull request!

@ephraimbuddy ephraimbuddy added the type:improvement Changelog: Improvements label Aug 15, 2022
@ephraimbuddy ephraimbuddy added this to the Airflow 2.4.0 milestone Sep 14, 2022
@ephraimbuddy ephraimbuddy added type:bug-fix Changelog: Bug Fixes and removed type:improvement Changelog: Improvements labels Sep 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dag disappears when DAG tag is longer than 100 char limit