Long-running task blocks other tasks with LocalExecutor #11331
Thanks for opening your first issue here! Be sure to follow the issue template!
Have you tried increasing it? On another note, have you considered using CeleryExecutor? Then you could have separate queues for long- and short-running tasks.
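For the record, the queue split would look roughly like this (a sketch assuming CeleryExecutor is configured; the queue names `long_queue`/`short_queue` are made up for illustration). Each operator gets a `queue=` argument, and dedicated workers subscribe to one queue each:

```shell
# Workers pinned to a single queue each (Airflow 1.10 CLI).
# On the DAG side, pass queue='long_queue' or queue='short_queue'
# to the corresponding operators; the queue parameter is only
# honored by CeleryExecutor.
airflow worker -q long_queue
airflow worker -q short_queue
```

This way a multi-hour task only ever occupies a `long_queue` worker, and the short hourly tasks keep flowing through their own workers.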
I've just tried setting it.
How many DAGs/tasks do you have? Would you be able to provide a reproducible setup?
We have 13 DAGs, each with ~5 tasks (so ~70 tasks total). Most of those DAGs run hourly (with some time offset). In most cases a DAG just checks for new files to grab and does nothing if no new files are found. When a DAG discovers a new unprocessed file, it grabs it, parses it, loads the parsed data into the database and calls a processing function there. In that case it can take a while (up to a few hours).

I've prepared reproduction code with 2 DAGs: a long-running DAG and a short-running DAG. While the long-running DAG is running, no new runs of the short-running DAG are scheduled or started (and the UI reports that the scheduler doesn't appear to be running).

blocking_reproduce_dag.py:

```python
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.utils.dates import days_ago
from airflow.utils.trigger_rule import TriggerRule

long_dag = DAG(
    dag_id='long_dag',
    schedule_interval='@hourly',
    start_date=days_ago(1),
    catchup=False
)
check_files_task = BranchPythonOperator(
    task_id='check_files', dag=long_dag,
    python_callable=lambda: 'parse_files'
)
parse_files_task = BashOperator(
    task_id='parse_files', dag=long_dag,
    bash_command='sleep 20m'
)
process_files_task = BashOperator(
    task_id='process_files', dag=long_dag,
    bash_command='sleep 15m'
)
slack_report_task = DummyOperator(
    task_id='slack_report', dag=long_dag,
    trigger_rule=TriggerRule.NONE_FAILED_OR_SKIPPED
)
check_files_task >> parse_files_task >> process_files_task >> slack_report_task
check_files_task >> slack_report_task

short_dag = DAG(
    dag_id='short_dag',
    schedule_interval='*/5 * * * *',
    start_date=days_ago(1),
    catchup=False,
    max_active_runs=1
)
query_service_task = BashOperator(
    task_id='query_service', dag=short_dag,
    bash_command='sleep 30s'
)
do_something_task = DummyOperator(
    task_id='do_something', dag=short_dag
)
query_service_task >> do_something_task
```
I figured out what the problem is. We have the Airflow scheduler installed as a daemon service on an Ubuntu machine. The service command is the following:
SCHEDULER_RUNS is set to 5 in the environment variables. So the scheduler starts, makes 5 loops of scanning the DAG files and then stops; the Linux daemon restart policy automatically restarts it. However, if a long-running task is still running after the service restart (the task itself is not stopped), the scheduler doesn't pick up new tasks until it finishes. Setting SCHEDULER_RUNS to -1 solved the issue.
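For reference, a sketch of what the scheduler unit might look like after the fix (the paths and unit layout here are assumptions modeled on the example systemd unit shipped with Airflow, not our exact file); `-n`/`--num_runs` is the number of scheduling loops before the process exits, and -1 makes it loop forever:

```ini
# /etc/systemd/system/airflow-scheduler.service (excerpt, hypothetical paths)
[Service]
Environment="SCHEDULER_RUNS=-1"
# -1 keeps one scheduler process alive instead of exiting after N loops
ExecStart=/usr/local/bin/airflow scheduler -n ${SCHEDULER_RUNS}
Restart=always
```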
Apache Airflow version: 1.10.12
Environment:
Kernel (`uname -a`): Linux ip-XX-XX-XX-XX.ec2.internal 5.4.0-1025-aws #25~18.04.1-Ubuntu SMP Fri Sep 11 12:03:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

What happened:
We have 13 DAGs in our Airflow. Under some circumstances, some of them process a large amount of data: usually parsing a large file, transforming the parsed data and loading it into a database. There are also database-processing tasks that involve long-running queries, so some tasks can run for several hours. The problem is that those long-running tasks block all other tasks from being started: tasks scheduled to run hourly do not start until the long-running task completes. We also see a yellow bar in our Airflow web UI:
We examined the Airflow scheduler logs and found that the scheduler simply doesn't try to pick up new tasks while a long-running task is running. When no long-running task is running, we see the scheduler checking whether any task could run and verifying the parallelism/concurrency limits for it; with a long-running task in progress, there are no such log messages.
Manual triggering doesn't help either: triggered tasks are not started until the long-running task is finished.
What you expected to happen:
We expect all other DAGs to start according to their schedule while a long-running task is running; this is how LocalExecutor should work according to the documentation.
We also checked server resources in those cases, but there is plenty of free RAM and CPU at the time, so that shouldn't be the cause.
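The expected behavior can be sketched with nothing but the standard library (this is an illustration of the concurrency model, not Airflow's actual LocalExecutor code): a pool of parallel worker processes should finish a long task and an independent short task in roughly the time of the long one, not their sum.

```python
import time
from multiprocessing import Pool


def run_task(spec):
    """Simulate one task: sleep for the given duration, then report its name."""
    name, seconds = spec
    time.sleep(seconds)
    return name


def run_like_local_executor(tasks, workers=2):
    """Fan tasks out to a pool of worker processes, the way an executor
    with parallel local workers is expected to hand out queued tasks."""
    start = time.monotonic()
    with Pool(processes=workers) as pool:
        finished = pool.map(run_task, tasks)
    return finished, time.monotonic() - start


if __name__ == "__main__":
    # A 1.0 s "long" task and a 0.8 s "short" task on 2 workers:
    # they overlap, so the wall time is ~1.0 s, not the 1.8 s sum.
    done, elapsed = run_like_local_executor([("long", 1.0), ("short", 0.8)])
    print(done, round(elapsed, 1))
```

In our installation the short task instead waits for the long one, i.e. the observed wall time behaves like the sequential sum.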
Anything else we need to know:
This problem occurs every time.