
Re-serialize all DAGs on 'airflow db upgrade' #24518

Merged: potiuk merged 1 commit into apache:main from uranusjr:reserialize-on-upgrade on Jun 19, 2022

Conversation

@uranusjr (Member) commented Jun 17, 2022

Spawned out of conversations in #23860. I also added some session=session arguments that seem to make sense.
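
(Aside: a minimal sketch of the provide_session pattern those session=session arguments rely on, assuming a configured Airflow environment; some_helper is a hypothetical name, not part of this PR.)

from airflow.utils.session import NEW_SESSION, provide_session

@provide_session
def some_helper(*, session=NEW_SESSION):
    # All DB work goes through the injected session, so a caller can make
    # this helper participate in its own transaction.
    ...

# Called bare, provide_session opens (and commits/closes) a session for us;
# called as some_helper(session=existing_session), it reuses the caller's.
some_helper()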

@github-actions bot added the "full tests needed" label (We need to run full set of tests for this PR to merge) on Jun 17, 2022
@github-actions (bot) commented:

The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it onto the latest main at your convenience, or amend the last commit of the PR and push it with --force-with-lease.

@potiuk potiuk merged commit a10965f into apache:main Jun 19, 2022
@uranusjr uranusjr deleted the reserialize-on-upgrade branch June 21, 2022 02:16
Comment on lines +791 to +797
@provide_session
def reserialize_dags(*, session: Session = NEW_SESSION) -> None:
    # Drop every existing serialized DAG row; skip session-state
    # synchronization since nothing else touches these objects here.
    session.query(SerializedDagModel).delete(synchronize_session=False)
    # Re-parse every DAG file from scratch and write fresh serialized
    # representations back to the metadata database.
    dagbag = DagBag()
    dagbag.collect_dags(only_if_updated=False, safe_mode=False)
    dagbag.sync_to_db(session=session)
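
(A minimal usage sketch, not part of the diff, assuming a configured Airflow environment with a reachable metadata DB: thanks to @provide_session the helper can be called bare, or handed an existing session via Airflow's create_session context manager.)

from airflow.utils.session import create_session

with create_session() as session:
    reserialize_dags(session=session)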

Member commented:

At the very least I would really like to have a flag to not do that.

I would like to control when DAGs get reserialized. For customers with bigger DAGs (>100 tasks) and a larger number of DAGs (100-500), this is going to add quite a bit of time. Some of our migrations already take a good chunk of time, and I would really like to avoid adding more here (time to parse the DAGs + time to serialize them + time to send the entire blob over the network).

@potiuk (Member) commented Jun 29, 2022

I am not sure, but I think we should know whether we are talking about a 5% time increase (for example, when we add a simple new field to the task instance table), or 50%, or 200%. Or to put it another way - are we talking about 5 minutes or 1 hour of overhead?

I think we should get some data to base our decisions on. I believe (this is my intuition rather than hard data) that even with a big installation, re-serialization of DAGs will not take a long time - minutes at most.

Back-of-the-envelope calculations: the cardinality of the SerializedDAG table is (until we add versioning, at least) the smallest of the "dag-related" tables, and this is the most important factor IMHO. The number of rows in the table is equal to the number of DAGs, so it's not like we multiply 500 DAGs by 100 tasks in the case above. For comparison, TaskInstance is by far the worst, because we have DAG count * per-DAG task count * number of runs. That can grow large quickly, but not the SerializedDAG table.

Yeah, you have to parse all those files. But even if you take 0.5s to parse each DAG, with 500 DAGs you get about 4 minutes. And 0.5s per DAG indicates that the user has far bigger problems already. Sending 500 blobs (this is what reserializing does) and deleting/updating 500 rows in a table is unlikely to be slow. Sending biggish blobs, one per transaction, is about as efficient as it gets when it comes to the overhead of communication with the DB - roughly equivalent to sending the raw data (the overhead for a transaction and a row insert/delete is not negligible, but it is an order of magnitude less than sending the data). Assuming we have a 1 MB blob to send per DAG on average (which is outrageous), we are talking about sending roughly 500 MB to the DB. Assuming a remote database and a 100 Mbit/s connection (extremely low for a prod DB), that is less than 2 minutes (assuming 50% saturation of the network).
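
(The arithmetic above as a short sketch; every input is an assumption quoted from this discussion, not a measurement.)

n_dags = 500
parse_time_s = 0.5            # assumed parse time per DAG file
blob_mb = 1.0                 # assumed serialized blob size per DAG (generous)
effective_mbit_s = 100 * 0.5  # 100 Mbit/s link at 50% saturation

parse_total_min = n_dags * parse_time_s / 60
transfer_total_min = n_dags * blob_mb * 8 / effective_mbit_s / 60

print(f"parsing:  ~{parse_total_min:.1f} min")    # ~4.2 min
print(f"transfer: ~{transfer_total_min:.1f} min") # ~1.3 min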

So the back-of-the-envelope calculation tells me that even with such a huge installation, we are talking about at most a few minutes of overhead.

My intuition tells me that if this is 5% or 5 minutes, the added delay is entirely justifiable. Conversely, it will take hours of downtime and investigation, and will lose multiple people's time, to deal with a serialization issue when it happens. So avoiding a "sure" one-time 5-minute delay once every few months, at the price of possible hours of investigation, is simply premature optimization. There are always costs that come with optimizations - complexity, and in this case quite likely a risk of failure.

On the other hand, if for some customers it adds hours of delay, then yeah, we might think about optimizing it and avoiding the reserialization.

Do we actually have any numbers on how long this takes for a customer that has 500 DAGs with 100 tasks each?

@ephraimbuddy added the "type:improvement" label (Changelog: Improvements) on Aug 12, 2022
Labels: area:CLI, full tests needed, type:improvement