
Re-serialize all DAGs on 'airflow db upgrade' #24518

Merged: potiuk merged 1 commit into apache:main from uranusjr:reserialize-on-upgrade on Jun 19, 2022

Conversation

@uranusjr (Member) commented Jun 17, 2022

Spawned out of conversations in #23860. I also added some session=session arguments that seem to make sense.
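
(Aside: a minimal sketch of the provide_session pattern those session=session arguments rely on, assuming a configured Airflow environment; some_helper is a hypothetical name, not part of this PR.)

from airflow.utils.session import NEW_SESSION, provide_session

@provide_session
def some_helper(*, session=NEW_SESSION):
    # All DB work goes through the injected session, so a caller can make
    # this helper participate in its own transaction.
    ...

# Called bare, provide_session opens (and commits/closes) a session for us;
# called as some_helper(session=existing_session), it reuses the caller's.
some_helper()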

@github-actions bot added the "full tests needed" label (We need to run full set of tests for this PR to merge) on Jun 17, 2022
@github-actions (bot) commented:

The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it onto the latest main at your convenience, or amend the last commit of the PR and push it with --force-with-lease.

@potiuk potiuk merged commit a10965f into apache:main Jun 19, 2022
@uranusjr uranusjr deleted the reserialize-on-upgrade branch June 21, 2022 02:16
Comment on lines +791 to +797
@provide_session
def reserialize_dags(*, session: Session = NEW_SESSION) -> None:
    # Drop every existing serialized DAG row; skip session-state
    # synchronization since nothing else touches these objects here.
    session.query(SerializedDagModel).delete(synchronize_session=False)
    # Re-parse every DAG file from scratch and write fresh serialized
    # representations back to the metadata database.
    dagbag = DagBag()
    dagbag.collect_dags(only_if_updated=False, safe_mode=False)
    dagbag.sync_to_db(session=session)
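
(A minimal usage sketch, not part of the diff, assuming a configured Airflow environment with a reachable metadata DB: thanks to @provide_session the helper can be called bare, or handed an existing session via Airflow's create_session context manager.)

from airflow.utils.session import create_session

with create_session() as session:
    reserialize_dags(session=session)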

Member commented:

At the very least I would really like to have a flag to not do that.

I would like to control when DAGs get reserialized. For customers with bigger DAGs (>100 tasks) and a larger number of DAGs (100-500), this is going to add quite a bit of time. Some of our migrations already take a good chunk of time, and I would really like to avoid adding more here (time to parse the DAGs + time to serialize them + time to send the entire blob over the network).

@potiuk (Member) commented Jun 29, 2022

I am not sure, but I think we should know whether we are talking about a 5% time increase (for example, when we add a simple new field to the task instance table), or 50%, or 200%. Or to put it another way - are we talking about 5 minutes or 1 hour of overhead?

I think we should get some data to base our decisions on. I believe (this is my intuition rather than hard data) that even with a big installation, re-serialization of DAGs will not take a long time - minutes at most.

Back-of-the-envelope calculations: the cardinality of the SerializedDAG table is (until we add versioning, at least) the smallest of the "dag-related" tables, and this is the most important factor IMHO. The number of rows in the table is equal to the number of DAGs, so it's not like we multiply 500 DAGs by 100 tasks in the case above. For comparison, TaskInstance is by far the worst, because we have DAG count * per-DAG task count * number of runs. That can grow large quickly, but not the SerializedDAG table.

Yeah, you have to parse all those files. But even if you take 0.5s to parse each DAG, with 500 DAGs you get about 4 minutes. And 0.5s per DAG indicates that the user has far bigger problems already. Sending 500 blobs (this is what reserializing does) and deleting/updating 500 rows in a table is unlikely to be slow. Sending biggish blobs, one per transaction, is about as efficient as it gets when it comes to the overhead of communication with the DB - roughly equivalent to sending the raw data (the overhead for a transaction and a row insert/delete is not negligible, but it is an order of magnitude less than sending the data). Assuming we have a 1 MB blob to send per DAG on average (which is outrageous), we are talking about sending roughly 500 MB to the DB. Assuming a remote database and a 100 Mbit/s connection (extremely low for a prod DB), that is less than 2 minutes (assuming 50% saturation of the network).
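
(The arithmetic above as a short sketch; every input is an assumption quoted from this discussion, not a measurement.)

n_dags = 500
parse_time_s = 0.5            # assumed parse time per DAG file
blob_mb = 1.0                 # assumed serialized blob size per DAG (generous)
effective_mbit_s = 100 * 0.5  # 100 Mbit/s link at 50% saturation

parse_total_min = n_dags * parse_time_s / 60
transfer_total_min = n_dags * blob_mb * 8 / effective_mbit_s / 60

print(f"parsing:  ~{parse_total_min:.1f} min")    # ~4.2 min
print(f"transfer: ~{transfer_total_min:.1f} min") # ~1.3 min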

So the back-of-the-envelope calculation tells me that even with such a huge installation, we are talking about at most a few minutes of overhead.

My intuition tells me that if this is 5% or 5 minutes, the added delay is entirely justifiable. Conversely, it will take hours of downtime and investigation, and will lose multiple people's time, to deal with a serialization issue when it happens. So avoiding a "sure" one-time 5-minute delay once every few months, at the price of possible hours of investigation, is simply premature optimization. There are always costs that come with optimizations - complexity, and in this case quite likely a risk of failure.

On the other hand, if for some customers it adds hours of delay, then yeah, we might think about optimizing it and avoiding the reserialization.

Do we actually have any numbers on how long this takes for a customer that has 500 DAGs with 100 tasks each?

@ephraimbuddy added the "type:improvement" label (Changelog: Improvements) on Aug 12, 2022
Labels: area:CLI, full tests needed, type:improvement