Intermittent log on deferrable operator #28647
Comments
You may have some logs on the triggerer, since once the task is deferred the execution happens in the trigger.
Hi @pankajastro. We use SLURM to distribute tasks across the nodes of our cluster. The Airflow operator we use submits a job to SLURM. The workflow is as follows:

1. Operator: log a few things, submit the job, and defer itself.
2. Trigger: check the SLURM log. When a few lines have been added, yield a TriggerEvent.
3. Operator: receive the new SLURM log lines in the TriggerEvent, log them in Airflow, and defer itself again.

Therefore, all the logging happens inside the operator, not in the trigger. It looks like a frontend bug: while the operator is deferred (the trigger "runs point"), the website loses count of which try is happening and defaults to the last one. On Monday I'll try to write a simpler operator + trigger (same structure but without using SLURM) to see if this keeps happening.
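Stripped of the Airflow and SLURM specifics, the deferral loop described above has roughly this shape. This is a plain-asyncio sketch of the control flow only, not the actual operator; `trigger_run`, `operator_loop`, and the event payload keys are made-up stand-ins:

```python
import asyncio

async def trigger_run(logfile_lines, offset):
    """Trigger side: wait until new lines appear past `offset`, then
    hand back a payload (mimics yielding a TriggerEvent)."""
    while True:
        new = logfile_lines[offset:]
        if new:
            return {"lines": new, "offset": offset + len(new)}
        await asyncio.sleep(0.01)  # tight async loop, no blocking I/O

async def operator_loop(logfile_lines, total_expected):
    """Operator side: defer, receive the event payload, log it, defer again."""
    collected, offset = [], 0
    while offset < total_expected:
        event = await trigger_run(logfile_lines, offset)  # "deferred" here
        collected.extend(event["lines"])  # logging happens operator-side
        offset = event["offset"]
    return collected

# Simulate a SLURM log that already has three lines:
lines = ["line 1", "line 2", "line 3"]
result = asyncio.run(operator_loop(lines, total_expected=3))
```

The key structural point is that the lines are only ever logged on the operator side, after each resume.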
Looks like a duplicate of #27955
Hello @tanelk. It looks similar, but I've reproduced the same issue with the "old standalone" log page as well. Here are some logs that correspond to retry=2:
cc @andrewgodwin, maybe that rings a bell for you?
The logging web frontend definitely has some weird behaviour depending on whether the task is in the RUNNING state, mostly as a result of the code that tries to fetch live logs directly from the node when it is RUNNING. If you've written a custom log backend, maybe that bit isn't working well? I've worked on a couple of logging things around this for our hosted solution, and we had to do especially fun things to turn that mode off and make Airflow always read logs from the same place.
Hi, we have not written any custom log backend. In the following weeks we'll upgrade to the latest version of Airflow to check whether this still happens.
Today I updated to Airflow 2.5.3. The new log view works as expected, but the old log page still shows and hides the latest log.
I think that changed in 2.6.0 (or maybe we discussed that we should remove the old log page, @dstandish?)
Great! I'll wait for 2.6 then. Thanks.
Since we upgraded to Airflow 2.6.2, this issue does not appear anymore. Closing it.
We continue to have this issue with 2.6.2 on Kubernetes. In the naked log view I get this:

When I change the URL of that view to have

Please reopen. Thanks.
Can you please open a new issue, @DFINITYManu, and add your logs and circumstances? While the behaviour might be similar, logs, circumstances and more details, including a description of your deployment, might be needed to diagnose your issue. We generally very rarely reopen issues, because what looks similar is often caused by different reasons. Having more detailed information and circumstances described in detail helps the users of Airflow (and saves time for the volunteers who look at the issues when they have time, helping people who get the software for free and use it for free). Helping them diagnose your issue by providing a more detailed report is a great way to give back to the community. Thanks in advance for helping volunteers diagnose the issues.
I'm wary of opening another issue, but I would like to inform you that the problem happens exclusively when accessing the logs of deferred tasks. Logs of running tasks are fetched fine.
@DFINITYManu what you describe is exactly what used to happen to us. We make heavy use of deferred tasks, and since we updated to 2.6.2 (and rewrote the trigger to log from there), this issue has not happened anymore.
Would you mind sharing how you rewrote your trigger? We have a deferred trigger and we log from it, but nothing appears in the log; I get the error.
Sure, this is the minimum possible example. We do many more things, but with this it should work. I also see you're using Kubernetes; we installed Airflow using pip.
@ecodina We have the same code in principle: we use defer and run, but we hit the bug on every deferred task. Whenever the task is deferred, I get the error shown in the screenshot above. Note this is a problem in the grid logs page.
@DFINITYManu please create a new issue, and add all relevant details about your setup. It appears that your webserver cannot access that address; that would be a good place to start your troubleshooting. Is the hostname / IP correct? Can you actually access it from the webserver?
I'll try, but in the meantime, can the software actually dump a proper exception log, complete with the hostname it is putatively trying to access? Thanks.
Depends on the definition of proper. You can definitely forward all logging and exceptions to a remote logger: Elasticsearch, CloudWatch, etc. That would definitely be proper, and yes, you can do it today. You just need a properly configured system that collects and analyses your logs.
I can't, because the code that fetches the logs simply swallows the exception traceback. It doesn't matter where I log or how sophisticated the logging stack is; the code is not logging anything useful.
I think you are wrong, @DFINITYManu. If you are using the 2.6 (latest) version of Airflow, there was a change implemented in #27758, logs from the triggerer, which enabled:
If neither of those works for you, then likely you have something misconfigured, or maybe you are not using Airflow 2.6. Note that you might underestimate the complexity of the problem, BTW. Logs raised in the trigger cannot just be written to a log file, because the trigger code is executed in a tight asyncio loop that should run as fast as possible and cannot do any synchronous operations (including writing to a file or sending over TCP). The people who implemented the solution I described went to great lengths to build a completely custom implementation that stores the logs in memory and has completely separate threads that process the logs and make them available either to the webserver or to a remote logging handler that forwards them to GCS/S3/whatever can "properly" collect your logs. If you do not understand the complexity involved and summarise it with the one word "properly", then I would like to ask you for a little bit of empathy and appreciation for the work of people who do it in their free time to build software that you can use for free, and without any guarantees of any sort. Because, in case you have not noticed, this is how Airflow is developed, and you paid exactly 0 for the software you are using. A little more appreciation would go a long way. And if you want to be a good member of the community, please follow the helpful advice (including using the latest Airflow and using remote logging). Thank you for your understanding, and I hope you will enjoy many years of using open source software for free (in the past and in the future). The people who spend their nights and weekends on it are counting on that.
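The "store in memory, hand the I/O to a separate thread" idea described above can be illustrated with the standard library alone. This is an illustration of the general technique, not Airflow's actual triggerer handler: a `QueueHandler` turns the logging call into a non-blocking enqueue (safe inside an event loop), while a `QueueListener` thread does the slow synchronous work elsewhere.

```python
import logging
import queue
from logging.handlers import QueueHandler, QueueListener

log_queue: "queue.Queue[logging.LogRecord]" = queue.Queue(-1)

# The async-side logger only enqueues records -- no blocking I/O.
trigger_logger = logging.getLogger("demo.trigger")
trigger_logger.setLevel(logging.INFO)
trigger_logger.addHandler(QueueHandler(log_queue))

# A separate thread drains the queue and performs the actual I/O.
records = []

class ListHandler(logging.Handler):
    def emit(self, record: logging.LogRecord) -> None:
        records.append(record.getMessage())  # stand-in for file/remote I/O

listener = QueueListener(log_queue, ListHandler())
listener.start()
trigger_logger.info("yielded TriggerEvent for job %s", 42)
listener.stop()  # flushes the queue before the thread exits
```

In a real deployment the `ListHandler` stand-in would be a file or remote-logging handler; the point is that the event loop never waits on it.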
I may be wrong, but sadly, and more urgently, I'm quite swamped with work at the moment. It is certainly my intention to open a proper discussion at some point in the near future, all the more since Airflow may end up being a production platform for us (I am a big advocate for it). You have my empathy. I come from 20 years of OSS development (even though this GitHub account doesn't show it), and I rely on user contributions too. Have a great day.
Sure, it can wait. I just wanted you to know that sometimes things aren't as straightforward, and a simple "proper exception" might be much more difficult than it seems. Actually creating an issue and describing your problem in a way that makes it easy to help you is a better approach than commenting on a closed issue in a way that implies that what happens is somehow "improper". I also sympathise with being swamped; unfortunately it does not mean there is anything we can do to help with "fixing" the "improper" behaviours.
I just upgraded to 2.7.0 to see if the issue was addressed, and I'm still experiencing the same problem: logs cannot be fetched for deferred tasks. For the record, I'm using the CeleryExecutor with two workers on Kubernetes. Pertinent traceback from the webserver pod / container:

It would be nice if the traceback mentioned exactly which URL it was attempting to fetch. FYI: I can absolutely confirm the logs in question are being written to the worker.
The problem ("Name or service not known") is that your worker's address uses a local IP address that is not reachable from the webserver. The address it tries to connect to is printed as the first line in the log. This is part of your deployment configuration: depending on your networking setup, there are various ways your workers can expose addresses that pods inside K8s can use to communicate. The 192.168.* address is a pod-local address that is only reachable inside the container network, so your
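One knob worth checking in this situation is Airflow's `hostname_callable` setting in the `[core]` section, which controls the address a worker records for log retrieval. Below is a hedged sketch; the module path and the environment variable name are made up for illustration:

```python
# my_company/netutils.py -- referenced from airflow.cfg as:
#   [core]
#   hostname_callable = my_company.netutils.reachable_hostname
import os
import socket

def reachable_hostname() -> str:
    """Return a name the webserver can actually resolve, e.g. the pod's
    service DNS name, instead of a pod-local 192.168.* address."""
    # Assumption: the deployment exports POD_SERVICE_NAME with a
    # cluster-resolvable name; otherwise fall back to the FQDN.
    return os.environ.get("POD_SERVICE_NAME", socket.getfqdn())
```

Whatever this callable returns is what the webserver will later try to connect to when fetching logs, so it must be resolvable and reachable from the webserver pod.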
Great idea, I think it's a very nice first contribution you could make, especially now that you've got the explanation about the hostname callable being part of the problem. You could help us by contributing it in a way that people like you, searching for help, can easily find in the docs. You probably looked for help there, so you likely know what's best. I think it would be great to write a FAQ/troubleshooting chapter, for example here: https://airflow.apache.org/docs/helm-chart/stable/manage-logs.html, from the user's perspective (especially once you solve the problem after getting the advice and directions here), where you are probably the best person to find the right words and provide examples. And it would possibly be great to link to such a chapter from the log message, if you would like to make a PR to add the logging you mentioned. This would be a great help for users like you, and a fantastic contribution back to the free software you are using, similar to what some 2600 people have done (most of them, like you, had a problem, got some advice, solved it and contributed back some documentation or code changes so that others will have fewer problems in the future). I hope this was helpful, that it will help you solve the problem, and that you will contribute it back.
I don't think that's it. Logs work perfectly when tasks are running or finished; they only fail while deferred. I also tried different variants of the
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
On Airflow 2.4.3
When using a deferrable operator, the log disappears from the webpage while the operator is deferred and reappears when it is running/finished. This doesn't happen consistently across all DAG runs.
The error has been reproduced in the new log panel on the grid page & the "standalone" log page.
As an example, I've tried running a simple DAG (it just executes `sleep 10`). The logs are for the 3rd retry and I am using the "Auto-refresh" button on the grid page. There are no visible errors in the web console when this happens, but the webserver logs look like Airflow doesn't know it is the 3rd try while the operator is deferred, and it fetches the 2nd try instead.
What you think should happen instead
No response
How to reproduce
Use a deferrable operator. We use this custom one: https://gist.github.com/ecodina/157b5dc44b79b13fe296b1275b4f0967
Trigger it from the webpage and see how the log appears intermittently.
Operating System
CentOS Linux 8
Versions of Apache Airflow Providers
apache-airflow-providers-cncf-kubernetes | 4.4.0 | Kubernetes
apache-airflow-providers-common-sql | 1.2.0 | Common SQL Provider
apache-airflow-providers-ftp | 3.1.0 | File Transfer Protocol (FTP)
apache-airflow-providers-http | 4.0.0 | Hypertext Transfer Protocol (HTTP)
apache-airflow-providers-imap | 3.0.0 | Internet Message Access Protocol (IMAP)
apache-airflow-providers-postgres | 5.2.2 | PostgreSQL
apache-airflow-providers-sqlite | 3.2.1 | SQLite
apache-airflow-providers-ssh | 3.2.0 | Secure Shell (SSH)
Deployment
Other
Deployment details
Installed using pip in a conda environment (as if it were a virtualenv).
Using PostgreSQL.
Anything else
Relevant lines of the webserver's log:
Are you willing to submit PR?
Code of Conduct