Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermitent log on deferrable operator #28647

Closed
1 of 2 tasks
ecodina opened this issue Dec 30, 2022 · 29 comments
Closed
1 of 2 tasks

Intermitent log on deferrable operator #28647

ecodina opened this issue Dec 30, 2022 · 29 comments
Labels
affected_version:2.4 Issues Reported for 2.4 area:async-operators AIP-40: Deferrable ("Async") Operators area:core kind:bug This is a clearly a bug

Comments

@ecodina
Copy link
Contributor

ecodina commented Dec 30, 2022

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

On Airflow 2.4.3

When using a deferrable operator the log disappears from the webpage while the operator is deferred and appears when it is running/finished. This doesn't happen consistently over all DAG runs.

The error has been reproduced in the new log panel on the grid page & the "standalone" log page.

As an example, I've tried running a simple DAG (just executes sleep 10). The logs are for the 3rd retry and I am using the "Auto-refresh" button on the grid page.

There are no visible errors on the web console when this happens, but the web server logs look like Airflow doesn't know it is the 3rd try while the operator is deferred and gets the 2nd try.

What you think should happen instead

No response

How to reproduce

Use a deferrable operator. We use this custom one: https://gist.github.com/ecodina/157b5dc44b79b13fe296b1275b4f0967

Trigger it from the webpage and see how the log appears intermittently.

Operating System

CentOS Linux 8

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes | 4.4.0 | Kubernetes
apache-airflow-providers-common-sql | 1.2.0 | Common SQL Provider
apache-airflow-providers-ftp | 3.1.0 | File Transfer Protocol (FTP)
apache-airflow-providers-http | 4.0.0 | Hypertext Transfer Protocol (HTTP)
apache-airflow-providers-imap | 3.0.0 | Internet Message Access Protocol (IMAP)
apache-airflow-providers-postgres | 5.2.2 | PostgreSQL
apache-airflow-providers-sqlite | 3.2.1 | SQLite
apache-airflow-providers-ssh | 3.2.0 | Secure Shell (SSH)

Deployment

Other

Deployment details

Installed using PIP in a conda environment (as if it was a virtualenv).

Using postgresql .

Anything else

Relevant lines of the webserver's log:

--> the log doesn't appear
Dec 30 08:53:52 bright01 webserver.sh[1915713]: 192.168.253.76 - - [30/Dec/2022:08:53:52 +0000] "GET /object/grid_data?dag_id=dag.testing&num_runs=25 HTTP/1.1" 200 1800 "http://airflow.local/dags/dag.testing/grid?root=&dag_run_id=scheduled__2022-12-30T07%3A00%3A00%2B00%3A00&task_id=prova" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Dec 30 08:53:52 bright01 webserver.sh[1915713]: 192.168.253.76 - - [30/Dec/2022:08:53:52 +0000] "GET /api/v1/dags/dag.testing/dagRuns/scheduled__2022-12-30T07:00:00+00:00/taskInstances/prova/logs/2?full_content=false HTTP/1.1" 200 14349 "http://airflow.local/dags/dag.testing/grid?root=&dag_run_id=scheduled__2022-12-30T07%3A00%3A00%2B00%3A00&task_id=prova" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Dec 30 08:53:55 bright01 webserver.sh[1915713]: 192.168.253.76 - - [30/Dec/2022:08:53:55 +0000] "GET /object/grid_data?dag_id=dag.testing&num_runs=25 HTTP/1.1" 200 1800 "http://airflow.local/dags/dag.testing/grid?root=&dag_run_id=scheduled__2022-12-30T07%3A00%3A00%2B00%3A00&task_id=prova" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Dec 30 08:53:55 bright01 webserver.sh[1915713]: 192.168.253.76 - - [30/Dec/2022:08:53:55 +0000] "GET /api/v1/dags/dag.testing/dagRuns/scheduled__2022-12-30T07:00:00+00:00/taskInstances/prova/logs/2?full_content=false HTTP/1.1" 200 14349 "http://airflow.local/dags/dag.testing/grid?root=&dag_run_id=scheduled__2022-12-30T07%3A00%3A00%2B00%3A00&task_id=prova" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Dec 30 08:53:58 bright01 webserver.sh[1915713]: 192.168.253.76 - - [30/Dec/2022:08:53:58 +0000] "GET /object/grid_data?dag_id=dag.testing&num_runs=25 HTTP/1.1" 200 1801 "http://airflow.local/dags/dag.testing/grid?root=&dag_run_id=scheduled__2022-12-30T07%3A00%3A00%2B00%3A00&task_id=prova" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Dec 30 08:53:58 bright01 webserver.sh[1915713]: 192.168.253.76 - - [30/Dec/2022:08:53:58 +0000] "GET /api/v1/dags/dag.testing/dagRuns/scheduled__2022-12-30T07:00:00+00:00/taskInstances/prova/logs/2?full_content=false HTTP/1.1" 200 14349 "http://airflow.local/dags/dag.testing/grid?root=&dag_run_id=scheduled__2022-12-30T07%3A00%3A00%2B00%3A00&task_id=prova" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Dec 30 08:53:58 bright01 webserver.sh[1915713]: 192.168.253.76 - - [30/Dec/2022:08:53:58 +0000] "GET /api/v1/dags/dag.testing/dagRuns/scheduled__2022-12-30T07:00:00+00:00/taskInstances/prova/logs/3?full_content=false HTTP/1.1" 200 7804 "http://airflow.local/dags/dag.testing/grid?root=&dag_run_id=scheduled__2022-12-30T07%3A00%3A00%2B00%3A00&task_id=prova" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"

-> the log starts appearing
Dec 30 08:54:01 bright01 webserver.sh[1915713]: 192.168.253.76 - - [30/Dec/2022:08:54:01 +0000] "GET /object/grid_data?dag_id=dag.testing&num_runs=25 HTTP/1.1" 200 1802 "http://airflow.local/dags/dag.testing/grid?root=&dag_run_id=scheduled__2022-12-30T07%3A00%3A00%2B00%3A00&task_id=prova" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Dec 30 08:54:02 bright01 webserver.sh[1915713]: 192.168.253.76 - - [30/Dec/2022:08:54:02 +0000] "GET /api/v1/dags/dag.testing/dagRuns/scheduled__2022-12-30T07:00:00+00:00/taskInstances/prova/logs/3?full_content=false HTTP/1.1" 200 8198 "http://airflow.local/dags/dag.testing/grid?root=&dag_run_id=scheduled__2022-12-30T07%3A00%3A00%2B00%3A00&task_id=prova" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@ecodina ecodina added area:core kind:bug This is a clearly a bug labels Dec 30, 2022
@pankajastro
Copy link
Member

you maybe have some logs on the trigger since once the task is deferred the execution happens on the trigger.

@ecodina
Copy link
Contributor Author

ecodina commented Dec 30, 2022

Hi @pankajastro

We use SLURM to distribute tasks in different nodes of our cluster. The Airflow operator we use submits a job to SLURM. The workflow is as follows:

Operator: log a few things & submit job & defer itself -> Trigger: check the SLURM log. When a few lines have been added, yield a TriggerEvent -> Operator: receive the new SLURM log lines in the TriggerEvent and log them in Airflow & defer itself again

Therefore, all the logging happens inside the Operator, and not in the trigger.

It looks like a frontend bug: when the operator is deferred (the trigger "runs point"), the website loses count of what try is happening and defaults to the last one.

On Monday I'll try to write a simpler operator+triggerer (same structure but without using SLURM) to see if this keeps happening.

@tanelk
Copy link
Contributor

tanelk commented Dec 30, 2022

Looks like duplicate of #27955

@ecodina
Copy link
Contributor Author

ecodina commented Jan 2, 2023

Hello @tanelk

It looks similar, but I've reproduced the same issue with the "old standalone" log page as well. Here are some logs, that correspond to the retry=2

--> the task is in the deferred state. I refresh the log page & the log doesn't appear
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1 HTTP/1.1" 200 44841 "http://airflow.local/dags/dag.testing/grid?dag_run_id=scheduled__2023-01-02T05%3A00%3A00%2B00%3A00&task_id=prova" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/appbuilder/css/bootstrap.min.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/appbuilder/select2/select2.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/appbuilder/css/font-awesome.min.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/appbuilder/datepicker/bootstrap-datepicker.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/appbuilder/css/flags/flags16.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/appbuilder/css/ab.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/dist/airflowDefaultTheme.731e57571b52cca4350d.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/dist/materialIcons.f9559e4953177b8b9a4a.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/dist/main.a53ccbaa4ca384c91837.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/dist/loadingDots.5da42d00b5455806e709.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/dist/bootstrap-datetimepicker.min.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/dist/switch.0e45299d29df3c0c8282.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/appbuilder/js/jquery-latest.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/appbuilder/js/ab_filters.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/appbuilder/js/ab_actions.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/dist/flash.2b1a873e0aabc828a165.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/appbuilder/js/bootstrap.min.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/appbuilder/datepicker/bootstrap-datepicker.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/appbuilder/select2/select2.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/appbuilder/js/ab.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/dist/moment.197a6f3cab42e240f8bd.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/dist/main.a53ccbaa4ca384c91837.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/dist/bootstrap-datetimepicker.min.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/dist/bootstrap3-typeahead.min.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/dist/dag.a9efcc271e92a9bdc654.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/dist/tiLog.7864c61b0b31ef37f084.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /static/loading.gif HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:20 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:20 +0000] "GET /object/next_run_datasets/dag.testing HTTP/1.1" 200 2 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:21 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:21 +0000] "GET /get_logs_with_metadata?dag_id=dag.testing&task_id=prova&map_index=-1&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&try_number=1&metadata=null HTTP/1.1" 200 16370 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"


--> the task is in the running state. I refresh the log page & the log appears
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1 HTTP/1.1" 200 45311 "http://airflow.local/dags/dag.testing/grid?dag_run_id=scheduled__2023-01-02T05%3A00%3A00%2B00%3A00&task_id=prova" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/appbuilder/select2/select2.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/appbuilder/css/font-awesome.min.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/appbuilder/datepicker/bootstrap-datepicker.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/appbuilder/css/bootstrap.min.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/appbuilder/css/flags/flags16.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/appbuilder/css/ab.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/dist/airflowDefaultTheme.731e57571b52cca4350d.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/dist/materialIcons.f9559e4953177b8b9a4a.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/dist/main.a53ccbaa4ca384c91837.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/dist/bootstrap-datetimepicker.min.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/dist/loadingDots.5da42d00b5455806e709.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/dist/switch.0e45299d29df3c0c8282.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/appbuilder/js/jquery-latest.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/appbuilder/js/ab_filters.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/appbuilder/js/ab_actions.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/dist/flash.2b1a873e0aabc828a165.css HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/appbuilder/datepicker/bootstrap-datepicker.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/appbuilder/js/bootstrap.min.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/appbuilder/select2/select2.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/appbuilder/js/ab.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/dist/moment.197a6f3cab42e240f8bd.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/dist/main.a53ccbaa4ca384c91837.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/dist/bootstrap-datetimepicker.min.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/dist/bootstrap3-typeahead.min.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/dist/dag.a9efcc271e92a9bdc654.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/dist/tiLog.7864c61b0b31ef37f084.js HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /static/loading.gif HTTP/1.1" 304 0 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /object/next_run_datasets/dag.testing HTTP/1.1" 200 2 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /get_logs_with_metadata?dag_id=dag.testing&task_id=prova&map_index=-1&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&try_number=1&metadata=null HTTP/1.1" 200 16370 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"
Jan 02 06:17:25 bright01 webserver.sh[1915713]: 192.168.253.76 - - [02/Jan/2023:06:17:25 +0000] "GET /get_logs_with_metadata?dag_id=dag.testing&task_id=prova&map_index=-1&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&try_number=2&metadata=null HTTP/1.1" 200 7468 "http://airflow.local/log?dag_id=dag.testing&task_id=prova&execution_date=2023-01-02T05%3A00%3A00%2B00%3A00&map_index=-1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"

@uranusjr uranusjr added the area:async-operators AIP-40: Deferrable ("Async") Operators label Jan 2, 2023
@eladkal eladkal added the affected_version:2.4 Issues Reported for 2.4 label Jan 7, 2023
@potiuk
Copy link
Member

potiuk commented Feb 20, 2023

cc: @andrewgodwin maybe that rings a bell for you ?

@andrewgodwin
Copy link
Contributor

The logging web frontend definitely does have some weird behaviour depending on if the task is in RUNNING state or not, mostly as a result of the code that tries to go get live logs from the node directly if it's in RUNNING. If you've written a custom log backend, maybe that bit isn't working well?

I've worked on a couple of logging things around this for our hosted solution, and we had to do especially fun things to turn that mode off and make Airflow just read from the same place for logs all the time.

@ecodina
Copy link
Contributor Author

ecodina commented Feb 22, 2023

Hi,

We have not written any custom log backend: we use self.log from within the custom operator and the only related configuration we've changed from airflow.cfg is base_log_folder.

In the following weeks we'll upgrade to the latest version of Airflow, to check whether it is still happening.

@ecodina
Copy link
Contributor Author

ecodina commented Apr 3, 2023

Today I updated to Airflow 2.5.3. The new log view works as expected, but the old log web page still shows and hides the latest log.

@potiuk
Copy link
Member

potiuk commented Apr 3, 2023

Today I updated to Airflow 2.5.3. The new log view works as expected, but the old log web page still shows and hides the latest log.

I think that changes in 2.6.0 (or maybe we discussed that we should remove the old log page @dstandish ?)

@ecodina
Copy link
Contributor Author

ecodina commented Apr 3, 2023

Great! I'll wait for 2.6 then. Thanks.

@ecodina
Copy link
Contributor Author

ecodina commented Jun 25, 2023

Since we upgraded to Airflow 2.6.2, this issue does not appear anymore. Closing it.

@ecodina ecodina closed this as completed Jun 25, 2023
@DFINITYManu
Copy link

DFINITYManu commented Jun 27, 2023

We continue to have this issue. 2.6.2 on Kubernetes.

image

In the naked log view I get this:

{"message": [["default_host", "Error fetching the logs. Try number 0 is invalid."]], "metadata": {"end_of_log": true}}

When I change the URL of that view to have try_number=1 instead of try_number=0, I get this:

{"message": [["192.168.251.22", "*** Could not read served logs: [Errno -2] Name or service not known\n"]], "metadata": {"end_of_log": false, "log_pos": 0}}

Please reopen. Thanks.

@potiuk
Copy link
Member

potiuk commented Jun 27, 2023

Can you please open a new issue @DFINITYManu -> add your logs and circumstances. While behaviour might be similar, logs, circumstances and more details - incluiding description of your deployment might be needed to help you to diagnose your issue. We generally very rarely reopen issues because often what looks similar might be caused by different reasons, Having more detailed information and circumstances described in detail help to help the users of Airflow (and save time of volunteers who look at the issues when they have time and help people who get the software for free and use it for free in their time. Helping them to diagnose your issues by providing more detailed report is a great way to give back to the community.

Thanks in advance for helping volunteers to diagnose the issues.

@DFINITYManu
Copy link

I'm wary of opening another issue, but I would like to inform you that the problem happens exclusively when accessing the logs of deferred tasks. Logs of running tasks are fetched fine.

@ecodina
Copy link
Contributor Author

ecodina commented Jul 25, 2023

@DFINITYManu what you describe is exactly what used to happen to us. We do heavy use of deferred tasks and since we updated to 2.6.2 (and rewrote the trigger to log from there), this issue has not happened anymore.

@DFINITYManu
Copy link

Would you mind sharing how you rewrote your trigger? We have a deferred trigger and we log from it, but nothing appears on the log -- I get the error.

@ecodina
Copy link
Contributor Author

ecodina commented Jul 25, 2023

Sure, this is the minimum possible example. We do many more things, but with this it should work. I also see you're using kubernetes. We installed it using pip.

class MyOperator(BaseOperator):
    # ...

    def execute(self, context):
        # Do a few things
        self.defer(trigger=MyTrigger(), method_name="trigger_finished")

    def trigger_finished(self, context, event=None):
        # Do something


class MyTrigger(BaseTrigger):
    # ...

    async def run(self):
        # We monitor a logfile (external to Airflow), reading it with aiofiles
        # new_lines is the variable containing the new lines from this logfile, which will be printed on Ariflow's log

        for l in new_lines:
            self.log.info(l)

@dstandish
Copy link
Contributor

I think that changes in 2.6.0 (or maybe we discussed that we should remove the old log page @dstandish ?)

Probably a good idea @potiuk. created issue here #32833

@DFINITYManu
Copy link

@ecodina We have the same code in principle -- we use defer and run, but we have the bug on every deferred task. Whenever the task is deferred, I get the error shown in the screenshot above.

Note this is a problem in the grid logs page.

@dstandish
Copy link
Contributor

@DFINITYManu please create a new issue, and add all relevant details about your setup. it appears that your webserver cannot access that address. that would be a good place to start with your troubleshooting. is the hostname / IP correct? can you actually access it from the webserver?

@DFINITYManu
Copy link

I'll try, but in the meantime, can the software actually dump a proper exception log complete with the hostname it is putatively trying to access? Thanks.

@potiuk
Copy link
Member

potiuk commented Aug 6, 2023

I'll try, but in the meantime, can the software actually dump a proper exception log complete with the hostname it is putatively trying to access? Thanks.

Depends on the proper definition. You can definitely forward all logging and exceptions to remote logger. Elasticsearch, Cloudwatch eetc. That would be definitely proper, and yes you can do it today. You just need a proper system that you collect anylyse your logs configured.

@DFINITYManu
Copy link

I can't, because the code that fetches the logs simply swallows the exception traceback. Doesn't matter where I log or how sophisticated the logging stack is - the code is not logging anything useful.

@potiuk
Copy link
Member

potiuk commented Aug 7, 2023

I think you are wrong @DFINITYManu.

If you are using 2.6 (latest) version of Airlfow, There was a change implemented that logs from Triggerer #27758 that enabled:

  1. seeing triggerer logs in webserver
  2. forwarding them to remote logging if you have remote logging configured

If neither of that does not work for you, then likely you have something misconfigured, or maybe you are not using Airflow 2.6.

Note, that you might underestimate complexity of the problem BTW. Logs raised in the trigger cannot be JUST written to a log file, because the trigger code is executed in a tight asyncio loop that should be executed as soon as possible and it cannot do any synchronous operations (including writing to a file or sending it over TCP). People who implement the solution I described went to a great extent to have a compleley custom implementation of storing the logs in memory and having completeley separate threads that are processing the logs and making them available either to the webserver or to remote logging handler that forward to to GCS/S3/ whatever can "properly" collect your logs.

If you do not understand the complexity involved, and summarise it with one "properly" world, then I would like to ask you for a litle bit of empathy and appreciation of the work of people who do it in their free time to build the software that you can use for free, (and without any guarantees of any sort)., Because - in case you have not noticed - this is how Airflow is developed and you paid exactly 0 for the sofware you are using. A little more appreciation would go a long way.

And if you want to be a good member of the community - if you still have problems after following the helpful advices (including using latest airflow and using remote logging), the proper thing to do will be to open an issue or discussion and describe your problem in detail.This is the best way you can help community to build a better software.

Thank you for your understanding and I hope you will enjoy many years of using open source software for free (in the past and in the future). People who spend they nights and weekends on spending their personal time are counting on that.

@DFINITYManu
Copy link

DFINITYManu commented Aug 14, 2023

I may be wrong, but sadly and more urgently I'm quite swamped with work at the moment. It is certainly my intention to open a proper discussion at some point in the near future -- all the more since Airflow may end up being a production platform for us (I am a big advocate for it).

You got my empathy. I come from 20 years of OSS development (even though this Github account doesn't show it), and I rely on user contributions too. Have a great day.

@potiuk
Copy link
Member

potiuk commented Aug 17, 2023

Sure. It can wait, just wanted you to know that sometimes things aren't as straightforward and simple "proper exceptin" might be much more difficult than it seems, and actually creating an issue and describing your problem in the way that it makes it easy to help you is better way than commenting on a closed issue in a way that imposes that what happens is somewhat "improper". And I also sympathise with being swamped, unfortunately it does not mean that there is anything we can help with "fixing" the "improper" behaviours.

@DFINITYManu
Copy link

DFINITYManu commented Sep 7, 2023

I just upgraded to 2.7.0 to see if the issue was addressed, and I'm still experiencing the same issue of not being able to fetch logs from deferred tasks:

image

For the record I'm using the CeleryExecutor in two workers on Kubernetes.

Pertinent traceback from webserver pod / container:

airflow-webserver-6f7696bd7d-26fm4: 10.13.9.190 - - [07/Sep/2023:12:54:57 +0000] "GET /api/v1/dags/rollout_ic_os_to_mainnet_subnets/dagRuns/manual__2023-09-07T12:37:20+00:00/taskInstances/batch_2.wait_until_no_alerts/1 HTTP/1.1" 200 1505 "https://airflow.ch1-obsdev1.dfinity.network/dags/rollout_ic_os_to_mainnet_subnets/grid?dag_run_id=manual__2023-09-07T12%3A37%3A20%2B00%3A00&task_id=batch_2.wait_until_no_alerts&tab=mapped_tasks&map_index=1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
airflow-webserver-6f7696bd7d-26fm4: [2023-09-07T12:54:59.951+0000] {file_task_handler.py:525} ERROR - Could not read served logs
airflow-webserver-6f7696bd7d-26fm4: Traceback (most recent call last):
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
airflow-webserver-6f7696bd7d-26fm4:     yield
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpcore/backends/sync.py", line 94, in connect_tcp
airflow-webserver-6f7696bd7d-26fm4:     sock = socket.create_connection(
airflow-webserver-6f7696bd7d-26fm4:   File "/usr/local/lib/python3.10/socket.py", line 824, in create_connection
airflow-webserver-6f7696bd7d-26fm4:     for res in getaddrinfo(host, port, 0, SOCK_STREAM):
airflow-webserver-6f7696bd7d-26fm4:   File "/usr/local/lib/python3.10/socket.py", line 955, in getaddrinfo
airflow-webserver-6f7696bd7d-26fm4:     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
airflow-webserver-6f7696bd7d-26fm4: socket.gaierror: [Errno -2] Name or service not known
airflow-webserver-6f7696bd7d-26fm4: 
airflow-webserver-6f7696bd7d-26fm4: During handling of the above exception, another exception occurred:
airflow-webserver-6f7696bd7d-26fm4: 
airflow-webserver-6f7696bd7d-26fm4: Traceback (most recent call last):
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
airflow-webserver-6f7696bd7d-26fm4:     yield
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_transports/default.py", line 218, in handle_request
airflow-webserver-6f7696bd7d-26fm4:     resp = self._pool.handle_request(req)
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 253, in handle_request
airflow-webserver-6f7696bd7d-26fm4:     raise exc
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 237, in handle_request
airflow-webserver-6f7696bd7d-26fm4:     response = connection.handle_request(request)
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 86, in handle_request
airflow-webserver-6f7696bd7d-26fm4:     raise exc
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 63, in handle_request
airflow-webserver-6f7696bd7d-26fm4:     stream = self._connect(request)
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 111, in _connect
airflow-webserver-6f7696bd7d-26fm4:     stream = self._network_backend.connect_tcp(**kwargs)
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpcore/backends/sync.py", line 93, in connect_tcp
airflow-webserver-6f7696bd7d-26fm4:     with map_exceptions(exc_map):
airflow-webserver-6f7696bd7d-26fm4:   File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
airflow-webserver-6f7696bd7d-26fm4:     self.gen.throw(typ, value, traceback)
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
airflow-webserver-6f7696bd7d-26fm4:     raise to_exc(exc)
airflow-webserver-6f7696bd7d-26fm4: httpcore.ConnectError: [Errno -2] Name or service not known
airflow-webserver-6f7696bd7d-26fm4: 
airflow-webserver-6f7696bd7d-26fm4: The above exception was the direct cause of the following exception:
airflow-webserver-6f7696bd7d-26fm4: 
airflow-webserver-6f7696bd7d-26fm4: Traceback (most recent call last):
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/airflow/utils/log/file_task_handler.py", line 508, in _read_from_logs_server
airflow-webserver-6f7696bd7d-26fm4:     response = _fetch_logs_from_service(url, rel_path)
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/airflow/utils/log/file_task_handler.py", line 90, in _fetch_logs_from_service
airflow-webserver-6f7696bd7d-26fm4:     response = httpx.get(
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_api.py", line 189, in get
airflow-webserver-6f7696bd7d-26fm4:     return request(
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_api.py", line 100, in request
airflow-webserver-6f7696bd7d-26fm4:     return client.request(
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_client.py", line 821, in request
airflow-webserver-6f7696bd7d-26fm4:     return self.send(request, auth=auth, follow_redirects=follow_redirects)
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_client.py", line 908, in send
airflow-webserver-6f7696bd7d-26fm4:     response = self._send_handling_auth(
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_client.py", line 936, in _send_handling_auth
airflow-webserver-6f7696bd7d-26fm4:     response = self._send_handling_redirects(
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_client.py", line 973, in _send_handling_redirects
airflow-webserver-6f7696bd7d-26fm4:     response = self._send_single_request(request)
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_client.py", line 1009, in _send_single_request
airflow-webserver-6f7696bd7d-26fm4:     response = transport.handle_request(request)
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_transports/default.py", line 217, in handle_request
airflow-webserver-6f7696bd7d-26fm4:     with map_httpcore_exceptions():
airflow-webserver-6f7696bd7d-26fm4:   File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
airflow-webserver-6f7696bd7d-26fm4:     self.gen.throw(typ, value, traceback)
airflow-webserver-6f7696bd7d-26fm4:   File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
airflow-webserver-6f7696bd7d-26fm4:     raise mapped_exc(message) from exc
airflow-webserver-6f7696bd7d-26fm4: httpx.ConnectError: [Errno -2] Name or service not known

It would be nice to see the traceback mention exactly which URL was being attempted to fetch.

FYI: I can absolutely confirm the logs in question are being written to the worker.

@potiuk
Copy link
Member

potiuk commented Sep 7, 2023

The problem is ("Name or service not known") is that your worker's address uses the local IP address that is not reachable from the webserver.

The address it tries to connect to is printed as the first line in the log 192.168.... Liikely you need to configure hostname_callable: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#hostname-callable so that the IP address that your workers do not use the IP address that is that of the internal networking interface, but that they are providing the IP address that webserver can communicate with to retrieve the logs from the workers.

This is part of your deployment configuration - depending on your networking configuration, there are various ways your workers will expose the addresses that your pods inside K8S can use to communicate. The 192.168.* address is the loopback address that is only reachable inside the container, so your hostname_callable should be modified to retrieve the address that is exposed in your networing for intra-pod communication. The documentation shows an example how it can look like, but since networking configuration looks very different in various implementations of Kubernetes and deployments, the way how to obtain the right address might be different. Kubernetes is quite leaky abstraction and you need to sometimes goe down to the lower layers (such as DNS and networking) to configure certain aspects of K8S apps - this is also why we exposed hostname_callable as a code that you can add to retrieve the right address.

It would be nice to see the traceback mention exactly which URL was being attempted to fetch.

Great idea. I think it's a very nice first contributio you could do.

Also maybe a good idea after you got the explanation about the hostname callable being part of the problem. you coudl help us and contribute it in the way that people like you, searching for help would be able to find easily in the docs? You probably looke for some help there so you likely know what's best. I tink it would be great to write some FAQ/troubleshooting for example here - https://airflow.apache.org/docs/helm-chart/stable/manage-logs.html from the user's perspective (especially when you solve the problem after getting those advices and directions here) where you probably are the best person to find the right words and provide examples. And possibly it would be greaat to link to such a chapter from the log message if you would like to make a PR to add the logging you mentioned.

This would be a agreat help for users like you, and fantastic contribution back to the free sofware you are using - similarly as some 2600 people (most of them like you had some problems, got some advices, solved them and contributed back some documentation or code changes so that others will have less problems in the future).

I hope it was helpful and that it wlll help you to solve the problem and that you will contribute it back.

@DFINITYManu
Copy link

The address it tries to connect to is printed as the first line in the log 192.168.... Liikely you need to configure hostname_callable: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#hostname-callable so that the IP address that your workers do not use the IP address that is that of the internal networking interface, but that they are providing the IP address that webserver can communicate with to retrieve the logs from the workers.

I don't think that's it. Logs work perfectly when tasks are running or finished. They only fail when they are deferred. I also tried different variants of the hostname_callable setting a few months ago, and all I managed to do is break logging completely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affected_version:2.4 Issues Reported for 2.4 area:async-operators AIP-40: Deferrable ("Async") Operators area:core kind:bug This is a clearly a bug
Projects
None yet
Development

No branches or pull requests

9 participants