Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExternalTaskSensor respects soft_fail if the external task enters a failed_state #23647

Merged
merged 4 commits into from
Jun 24, 2022
Merged

ExternalTaskSensor respects soft_fail if the external task enters a failed_state #23647

merged 4 commits into from
Jun 24, 2022

Conversation

argibbs
Copy link
Contributor

@argibbs argibbs commented May 11, 2022

Problem

You have an external task which may "skip". When that happens, you want the external task sensor to also skip. This PR enables that behaviour.

Solution

The "right" way to do this is:

  1. Set failed_states to [State.SKIPPED] (by default it is None)
  2. Set soft_fail=True

With the above settings, when the target task enters the SKIPPED state, the ExternalTaskSensor will see this as a failed state and immediately react to the failure. Because of the soft_fail flag, it should not mark as failed, but instead as skipped.

Effect of change

Prior to this change, the ExternalTaskSensor ignored the soft_fail setting, and always marked it as an error. With this change in place, the ExternalTaskSensor will instead mark the task as skipped, as desired.

Other comments

soft_fail would already kick in if the ExternalTaskSensors timed out (i.e. if the external task did not enter an allowed_state or failed_state in the time. That means a possible workaround for the above is to set a timeout on the ExternalTaskSensor which you believe is long enough that the external task should have completed.

However (in my experience at least) ExternalTaskSensors are often run without timeouts. This is because if the target task hangs on upstream dependencies, having the external sensor timeout means cascading failures throughout the system which normally means more cleanup. It's usually easier to have the ExternalTaskSensors not timeout and thus only fail if there's an actual problem.

Additionally, soft_fail + execution timeouts is not perfect. If the task skips you may want the skip to propagate immediately; under the workaround, the skip would only kick in after the timeout. And of course, there are obvious goldilocks issues with finessing the timeout to be not too short nor too long.

Related Issues

closes: #19754

@boring-cyborg boring-cyborg bot added the area:core-operators Operators, Sensors and hooks within Core Airflow label May 11, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented May 11, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@ferruzzi
Copy link
Contributor

Looks reasonable to me.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks cool, however this is a change in behaviour which should be documented. It's not a breaking change I think, I'd rather lean towards a bugfix (It should have behaved like that already). I think though this one is enough of a change to warrant a newsfragment (of bugfix type) https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#step-4-prepare-pr

Can you please add it (and also add information about this behaviour in the docstring of the External Task Sensor @argibbs ?

@jedcunningham @ephraimbuddy - this is not a "critical enough" to make into 2.3.1 so I'd leave it up to you whether to cherry-pick / add it to 2.3.1 (though it should be small and localized enought to not be a problem to cherry-pick if we alll agree this is a "bugfix".

@argibbs
Copy link
Contributor Author

argibbs commented May 21, 2022

Sorry, missed the comment, will crack on with the requested changes ASAP.

@argibbs
Copy link
Contributor Author

argibbs commented Jun 7, 2022

Alright, sorry for the delay, took me a while to get a working test env; breeze + docker worked well once I actually followed the instructions!

@potiuk those requested changes are all done.

  • Fixed tests (I hadn't realised the taskinstance would swallow the skip exception ... they now pass locally in breeze)
  • Added one-line bugfix news fragment
  • Updated sensor docstring
  • Ran black on both the sensor and the test file

Additionally, I've rebased onto latest main...

@argibbs argibbs requested a review from potiuk June 7, 2022 15:08
Copy link
Contributor

@josh-fell josh-fell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very cool! I love all the extra documentation in the operator too; very helpful. Some nits and suggestions so other static checks are happier before releasing all of the other ones in CI. And it's a good idea to run the static checks locally too if you can.

airflow/sensors/external_task.py Outdated Show resolved Hide resolved
newsfragments/23647.bugfix.rst Outdated Show resolved Hide resolved
tests/sensors/test_external_task_sensor.py Outdated Show resolved Hide resolved
tests/sensors/test_external_task_sensor.py Outdated Show resolved Hide resolved
tests/sensors/test_external_task_sensor.py Outdated Show resolved Hide resolved
@argibbs
Copy link
Contributor Author

argibbs commented Jun 9, 2022

Thanks for the feedback; sorry should have used breeze to run the static checks locally; should have RTM properly.

@argibbs
Copy link
Contributor Author

argibbs commented Jun 9, 2022

Ok, I've run black again, weirdly black via breeze seems to do something different to black run externally 🤷 .

Anyhoo, breeze static-checks -t black passes locally for me.

@potiuk
Copy link
Member

potiuk commented Jun 9, 2022

Likely whatever you used to run black does not take pyproject.toml into account:

[tool.black]
line-length = 110
target-version = ['py37', 'py38', 'py39', 'py310']
skip-string-normalization = true

That's why breeze main power is - to make sure that there is no 'works for me` syndrome - everyone has the same environment and tool setup.

@argibbs
Copy link
Contributor Author

argibbs commented Jun 9, 2022

Yeah, I ran it from the root of the checkout... I dunno what I did wrong. I'm really hoping this is it now though. 😄

It's been a learning curve but I'll hopefully walk into fewer rakes if (when?) I pick up another PR.

@potiuk
Copy link
Member

potiuk commented Jun 9, 2022

Yeah, I ran it from the root of the checkout... I dunno what I did wrong. I'm really hoping this is it now though. 😄

Could also be a different black version :). Also I really recommend "pre-commit install" and then you would not even have to think about running ANYTHING - all that is needed will be run automatically when you do "git commit".

@potiuk
Copy link
Member

potiuk commented Jun 9, 2022

Tests are failing :)

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jun 9, 2022
@github-actions
Copy link

github-actions bot commented Jun 9, 2022

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@josh-fell
Copy link
Contributor

Tests are failing :)

Yeah, that's my fault. I provided an incorrect suggestion. @argibbs the state values should be State.FAILED and State.SUCCESS in your test.

@argibbs
Copy link
Contributor Author

argibbs commented Jun 9, 2022

OK, <counts on fingers> 4th time's the charm

@argibbs
Copy link
Contributor Author

argibbs commented Jun 10, 2022

Just checking but I assume the helm tests timing out are not a blocker? Do they need to be rerun, or do we just not mind given the unrelated nature of the change, or <insert option C here>?

@argibbs
Copy link
Contributor Author

argibbs commented Jun 13, 2022

@potiuk @josh-fell just checking I don't have to do anything here...

I see there are some fixes for flakey tests hitting master; do you need me to rebase?

@potiuk
Copy link
Member

potiuk commented Jun 13, 2022

@potiuk @josh-fell just checking I don't have to do anything here...

I see there are some fixes for flakey tests hitting master; do you need me to rebase?

It's always good idea no matter what/

@argibbs
Copy link
Contributor Author

argibbs commented Jun 14, 2022

It's always good idea no matter what

Done. I'm not at my laptop, so have used github's web ui to rebase to main ... I think it's done the right thing, o frabjous day.

Copy link
Contributor

@josh-fell josh-fell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending "fun with tests". Great addition!

@argibbs
Copy link
Contributor Author

argibbs commented Jun 14, 2022

Ok so I'm at a loss now; the failing tests are Helm k8s stuffs, unrelated to my changes.

  >           raise ApiException(http_resp=r)
  E           kubernetes.client.exceptions.ApiException: (404)
  E           Reason: Not Found
  E           HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '17d9895e-1b8d-4e42-b78e-95252c495cc0', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'a1289028-3422-4c01-84db-a75375dda550', 'Date': 'Tue, 14 Jun 2022 15:38:05 GMT', 'Content-Length': '242'})
  E           HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"test-cd53faaae56f4d1abb2766e7d1f75797\" not found","reason":"NotFound","details":{"name":"test-cd53faaae56f4d1abb2766e7d1f75797","kind":"pods"},"code":404}
  
  .build/.kubernetes_venv/airflow-python-3.7-v1.21.12_host_python_3.7_KubernetesExecutor/lib/python3.7/site-packages/kubernetes/client/rest.py:234: ApiException

I've rebased to latest master. Do it get merged anyway, or does this PR just sit in limbo? Is there anything I can do to get it over the line?

@josh-fell
Copy link
Contributor

Ok so I'm at a loss now; the failing tests are Helm k8s stuffs, unrelated to my changes.

  >           raise ApiException(http_resp=r)
  E           kubernetes.client.exceptions.ApiException: (404)
  E           Reason: Not Found
  E           HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '17d9895e-1b8d-4e42-b78e-95252c495cc0', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'a1289028-3422-4c01-84db-a75375dda550', 'Date': 'Tue, 14 Jun 2022 15:38:05 GMT', 'Content-Length': '242'})
  E           HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"test-cd53faaae56f4d1abb2766e7d1f75797\" not found","reason":"NotFound","details":{"name":"test-cd53faaae56f4d1abb2766e7d1f75797","kind":"pods"},"code":404}
  
  .build/.kubernetes_venv/airflow-python-3.7-v1.21.12_host_python_3.7_KubernetesExecutor/lib/python3.7/site-packages/kubernetes/client/rest.py:234: ApiException

I've rebased to latest master. Do it get merged anyway, or does this PR just sit in limbo? Is there anything I can do to get it over the line?

Still in a holding pattern. There is work being done to investigate these test failures and resolve. Don't worry, it's not you 🙂

@argibbs
Copy link
Contributor Author

argibbs commented Jun 16, 2022

Fair beans, ty for the info. What's the best place to track the progress of the fixes? Should I just lurk on the devel slack for a Eureka post, or is there an issue or something I can watch? (Don't want to be rebasing and asking for reruns unnecessarily)

@potiuk
Copy link
Member

potiuk commented Jun 19, 2022

Fair beans, ty for the info. What's the best place to track the progress of the fixes? Should I just lurk on the devel slack for a Eureka post, or is there an issue or something I can watch? (Don't want to be rebasing and asking for reruns unnecessarily)

Just rebase often and early. No problem with that. You have ONE PR to take care about the maintainers about 30-50 things to look at every day (see my talk here https://www.youtube.com/watch?v=G6VjYvKr2wQ&t=1s - to get the right perspective.

Be persistent. rebase when you feel like it and see your PR is green. It's you who want to merge it, and just make sure things are gree. I rebase it now - but I leave it up to you if it is fixed and green (you can continue rebasing when you see it's not}.

Ping us please if you see it's green and ready for review.

@argibbs
Copy link
Contributor Author

argibbs commented Jun 20, 2022

Be persistent.

Yeah, ok. I've no problem rebasing daily, but unless something's changed, as a first time committer I don't have the ability to run the workflows. I was trying to avoid being "that guy" who pops up every day asking about a low prio MR, precisely because you guys have so much to deal with. 😄

Will see how it goes.

Ty!

@potiuk
Copy link
Member

potiuk commented Jun 20, 2022

The thing is that if nothing changes, we are not notified about your new push with fixes. Probably every day is too often (You need to consider people\s time availabiliyt etc. ). Apache Software Foundation has this very good rule of giving people minimum 72 Hours for important stuff - like voting and participating in discussions, this is the right baseline. And of course everyone's mileage vary.

The important thing is - that if you are fresh committer to GitHub and the project - any push of yours will require from us to approve your workflow (this is security protection against bots mining bitcoins by pull requests) so until you get your first PR merged, you need to ping in the PR to approve the workflow.

@argibbs
Copy link
Contributor Author

argibbs commented Jun 20, 2022

Yup, all makes sense (esp the backdoor bitcoin mining thing sadly; this is why we can't have nice things). Love the app, trying to do my part and not get in the way 😄

I'll check in every +/-3 days (unless this latest attempt works!)

Thanks for the help!

argibbs and others added 2 commits June 23, 2022 18:49
…e in the failed state (#19754)

- Changed behaviour of sensor to as above to respect soft_fail
- Added tests of new soft_fail behaviour (#19754)
- Added newsfragment and improved sensor docstring
- Removing some black-induced-trailing-commas
- Using explicit enum values rather than strings for state
- Joining two strings black pulled up onto the same line

Co-authored-by: Josh Fell <48934154+josh-fell@users.noreply.github.com>
@argibbs
Copy link
Contributor Author

argibbs commented Jun 23, 2022

@potiuk @josh-fell I've rebased onto master, and resolved a minor conflict against main arising from tweaks to the file imports.

If you could approve the workflow, I'd like to roll the dice again 😄

@argibbs
Copy link
Contributor Author

argibbs commented Jun 24, 2022

Static checks passing locally now.

@argibbs
Copy link
Contributor Author

argibbs commented Jun 24, 2022

@potiuk @josh-fell w00t, it's passed. Can either of you merge it for me? I (understandably) don't have write access.

@potiuk potiuk merged commit 1b34598 into apache:main Jun 24, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented Jun 24, 2022

Awesome work, congrats on your first merged pull request!

@potiuk
Copy link
Member

potiuk commented Jun 24, 2022

🎉 🎉 🎉 🎉 🎉

@potiuk
Copy link
Member

potiuk commented Jun 24, 2022

Too bad you became 2098th contributor... it was close though ....

@argibbs
Copy link
Contributor Author

argibbs commented Jun 25, 2022

Fantastic, thanks for all the help!

@ephraimbuddy ephraimbuddy added this to the Airflow 2.3.3 milestone Jul 5, 2022
@ephraimbuddy ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label Jul 5, 2022
ephraimbuddy pushed a commit that referenced this pull request Jul 5, 2022
…ailed_state (#23647)

* Respecting soft_fail in ExternalTaskSensor when the upstream tasks are in the failed state (#19754)

- Changed behaviour of sensor to as above to respect soft_fail
- Added tests of new soft_fail behaviour (#19754)
- Added newsfragment and improved sensor docstring

(cherry picked from commit 1b34598)
ephraimbuddy pushed a commit that referenced this pull request Jul 5, 2022
…ailed_state (#23647)

* Respecting soft_fail in ExternalTaskSensor when the upstream tasks are in the failed state (#19754)

- Changed behaviour of sensor to as above to respect soft_fail
- Added tests of new soft_fail behaviour (#19754)
- Added newsfragment and improved sensor docstring

(cherry picked from commit 1b34598)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core-operators Operators, Sensors and hooks within Core Airflow full tests needed We need to run full set of tests for this PR to merge type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ExternalTaskSensor should skip if soft_fail=True and external task in one of the failed_states
5 participants