Disallow any dag tags longer than 100 char #25196
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
airflow/models/dag.py (outdated)

@@ -347,6 +347,9 @@ def __init__(
    ):
        from airflow.utils.task_group import TaskGroup

        if tags and any(len(tag) > 100 for tag in tags):
I just wonder - why 100? Do we know why 100 causes the DAGs to disappear (in the original issue)? I think having such a magic constant here is not good - it needs to come from somewhere.
Yep. We need to confirm where the 100 limit comes from (and whether it's real) before we merge this one.
I'd like to try to give an explanation :) The DagTag class requires the maximum length of the tag to be 100 chars. I believe SQLite is not affected by this bug because, according to the SQLite documentation (section 3.1.1), the declared length of a VARCHAR column is not enforced. It follows that the tag is stored in the database despite the restriction.
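For anyone following along, SQLite's behavior is easy to demonstrate: the declared VARCHAR length is simply ignored (a standalone sketch, not Airflow code):

```python
import sqlite3

# SQLite gives VARCHAR(100) plain TEXT affinity and does not enforce the
# declared length, so an over-long tag would be stored anyway.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag_tag (name VARCHAR(100))")
conn.execute("INSERT INTO dag_tag VALUES (?)", ("x" * 200,))
stored_length = conn.execute("SELECT length(name) FROM dag_tag").fetchone()[0]
print(stored_length)  # 200: the 100-char limit was ignored
```

MySQL and Postgres, by contrast, reject or truncate values longer than the declared column length, which is why the bug only shows up on those backends.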
Cool. Let's then extract the 100 from there into a constant and use it in both places.
Thanks for everyone's help! Update made!
I am a bit confused about the test failure though. Is the test flaky, or did we break something?
You need to rebase. It was a flaky test that I fixed in #25231 |
The new markdown-lint is a bit more picky about headers in markdown.
In case of a "merge-run" (i.e. when a maintainer merges something to main or a v2-*-test branch is pushed), the ARM instance does not need to be built, because in that case the cache is built and pushed for the ARM image.
* Bump typing-extensions and mypy for ParamSpec. I want to use them in some @task signature improvements. Mypy added this in 0.950, but let's just bump to the latest since why not. The changelog of typing-extensions is spotty before 4.0, but ParamSpec was introduced some time before that (likely some time in 2021), and it seems a reasonable minimum to bump to. For more about ParamSpec, read PEP 612: https://peps.python.org/pep-0612/
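For context, ParamSpec (PEP 612) lets a decorator preserve the wrapped function's parameter types for type checkers. A minimal sketch of the pattern (illustrative, not the Airflow @task code itself):

```python
import sys
from typing import Callable, TypeVar

if sys.version_info >= (3, 10):
    from typing import ParamSpec
else:
    from typing_extensions import ParamSpec

P = ParamSpec("P")
R = TypeVar("R")


def logged(func: Callable[P, R]) -> Callable[P, R]:
    """Decorator that preserves func's exact signature for type checkers."""
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


@logged
def add(a: int, b: int) -> int:
    return a + b
```

Without ParamSpec, a decorator like this would erase the wrapped signature to `Callable[..., R]`, losing argument checking at call sites.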
The original implementation used an API that exists only on AWS, so it would fail on Azure and GCP Databricks. This PR uses another API that is available on all Databricks cloud platforms.
Tells you, for a given dag run, the dataset events that are "part of" the dag run. I.e. they were part of the collection of dataset events that contributed to the triggering of the dag run. In practice we just query the events that occurred since the prev dag run. We may make the relationship tighter, see #24969.
…s any exception (#24943)
Convert dataframe object columns to str, to avoid errors when converting from df to parquet. Renamed methods to remove old name: _fix_int_dtypes -> _fix_dtypes test_fix_int_dtypes -> test_fix_dtypes Co-authored-by: Paul Stanton <paul.stanton@selectquote.com>
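The dtype fix described above can be sketched roughly like this (a hypothetical helper named after the renamed method; the provider's actual implementation may differ):

```python
import pandas as pd


def _fix_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Cast object-dtype columns to str so writing to parquet cannot
    fail on mixed-type object columns (illustrative sketch only)."""
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].astype(str)
    return df


frame = _fix_dtypes(pd.DataFrame({"id": [1, 2], "mixed": ["a", 3]}))
print(frame["mixed"].tolist())  # ['a', '3']
```

Parquet writers require a single type per column, so a column holding both strings and ints would otherwise raise during conversion.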
I missed the hive provider from the commit in #25214
Assets compilation when CI and PROD images are prepared should also be run in non-main branch.
…ation_for_ids` docs (#25197)
* dataset details view * Fix sorting and naming * update table tests * update comments
* Convert RDS Export Sample DAG to System Test * PR Fixes
…ssorManager queue (#25147)
The 22.2 version of `pip` has **just** been released. It looks like it breaks neither constraints nor "eager upgrade" builds when we build images locally, so it is safe to update to it in main.
* Improve ElasticsearchTaskHandler:
  - use builtin logging.makeLogRecord instead of the strange _ESJsonLogFmt
  - do not re-sort already sorted logs
  - apply ISO 8601 datetime format
  - fix several bugs found along the way
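For reference, `logging.makeLogRecord` is the stdlib helper mentioned above: it rebuilds a LogRecord from a plain dict, which is a natural fit for log documents fetched back from Elasticsearch (generic example, not the handler's code):

```python
import logging

# Rebuild a LogRecord from a dict, e.g. one deserialized from an
# Elasticsearch JSON document; unspecified fields get sane defaults.
record = logging.makeLogRecord({
    "msg": "task started",
    "levelname": "INFO",
    "name": "airflow.task",
})
print(record.getMessage())  # task started
```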
In Postgres especially (but generally in all databases), if no order is specified in a SELECT query, the rows may come back in random order; it depends on many factors. The test query here retrieved the dags without any ordering but expected the list in a specific order. This PR adds ordering. It also removes side effects of the test by using a fixture that clears the datasets before and after the tests that rely on or modify datasets, because otherwise the failure of one test can create side effects that fail the other tests (which is what happened in this case).
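The underlying issue is easy to reproduce with any engine; a standalone sqlite3 sketch of why the fix adds explicit ordering:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (uri TEXT)")
conn.executemany("INSERT INTO dataset VALUES (?)", [("s3://b",), ("s3://a",)])

# Without ORDER BY, row order is an implementation detail of the engine and
# the physical layout; an explicit ORDER BY makes the assertion deterministic.
uris = [row[0] for row in conn.execute("SELECT uri FROM dataset ORDER BY uri")]
print(uris)  # ['s3://a', 's3://b']
```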
…(AIP-47) (#25134) * Sagemaker System Tests - Part 3 of 3 - example_sagemaker_endpoint.py * PR Fixes * Fix failing static checks - unused import
* Add batching to SQL Check Operators. Adds a WHERE clause to the SQL statement that allows for arbitrary batching in a given table.
* Fix bug with multiple table checks. When multiple table checks are given to the SQLTableCheckOperator and at least one is not a fully aggregate statement, a GROUP BY clause was previously needed. This commit updates the operator to use the get_pandas_df() method instead of _get_first() to return a pandas dataframe containing the check names and check results from the new style of query. The new style of query uses UNION ALL to run each test as its own SELECT statement, bypassing the need for a GROUP BY.
* Update test failure logic. Renamed _get_failed_tests to _get_failed_checks to better match naming, and updated the method to take an optional column param. The query in the column check operator is removed from the failed-test exception message, as it only ever showed the last query instead of the relevant one(s); it is replaced by the column, which is more useful for debugging.
* Add table alias to SQLTableCheckOperator query. Without a table alias, the query does not run on Postgres and other databases. The alias is arbitrary and used only for proper query execution.
* Fix formatting error in operator.
* Move alias to the proper query-build statement. The table alias should be in the self.sql build statement, as that is where the table it aliases is defined.
* Bug fixes and updates to test and operator. Fixed a bug in the test where the dataframe column names did not match the operator's expected column names; added more info to the SQLColumnCheckOperator's batch arg; fixed the location of table aliasing in SQLTableCheckOperator.
* Remove merge conflict lines.
* Rename parameter batch to partition_clause. Gives the parameter a clearer name and adds templating to the SQLTableCheckOperator.
* Fix typo in docstring.
* Reformat operator file.
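The UNION ALL strategy and the partition_clause parameter described above can be sketched as a small query builder (names and SQL shape are illustrative, not the operator's exact output):

```python
def build_checks_sql(table, checks, partition_clause=None):
    """Combine each check into its own SELECT joined by UNION ALL, so
    aggregate and non-aggregate checks coexist without a GROUP BY.
    Illustrative sketch; SQLTableCheckOperator's real SQL differs."""
    where = f" WHERE {partition_clause}" if partition_clause else ""
    selects = [
        f"SELECT '{name}' AS check_name, {expression} AS check_result "
        f"FROM {table} AS check_table{where}"
        for name, expression in checks.items()
    ]
    return " UNION ALL ".join(selects)


sql = build_checks_sql(
    "orders",
    {"row_count_check": "COUNT(*) >= 1", "total_check": "SUM(total) >= 0"},
    partition_clause="order_date >= '2022-01-01'",
)
print(sql)
```

Each check becomes a one-row SELECT, so the combined result set maps one row per check name to its pass/fail result, which is what the pandas dataframe then inspects.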
…mmyDew42/airflow into disallow-dag-tag-longer-than-100-char
@potiuk Thanks. I actually never tried.
Oh wait. So I think the test was fixed when I merged the master into my branch. 😂 No need to rebase.
Well, you now have 58 commits in your change. Rebase is REALLY better :)
Highly recommend learning to rebase: https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#id14
airflow/models/dag.py (outdated)

@@ -374,6 +375,9 @@ def __init__(
    ):
        from airflow.utils.task_group import TaskGroup

        if tags and any(len(tag) > TAG_MAX_LEN for tag in tags):
            raise AirflowException("tag cannot be longer than 100 characters")
Sorry, short question: should we map the exception text to the TAG_MAX_LEN variable, too?
pls
Merging. Random issue with MsSQL.
Awesome work, congrats on your first merged pull request!
closes #24820.
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.