Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect data_interval_start due to scheduling time change #21715

Closed
wants to merge 0 commits into from

Conversation

wanlce
Copy link
Contributor

@wanlce wanlce commented Feb 21, 2022

This PR fixes the incorrect first run of data_interval_start after
changing the scheduling time.

New version of the timetables issue

eg:
At first I had a scheduling time of 30 * * * * *
After running it twice, I changed it to 32 * * * *
The actual first run time after the change is fine, but because the data interval starts from the end of the previous interval, it has already been determined and not refreshed, so the displayed execution_date is incorrect. This can cause problems with Cross-dag dependencies.

2022-02-15_23-34 2022-02-15_23-36

Correct:

2022-02-15_23-26 2022-02-15_23-27

My current solution is to add start = self._get_prev(end) below end = self._get_next(start) to reset data_interval_start.
I'm not sure if this is appropriate 😃


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@uranusjr
Copy link
Member

Should we call _align on start before line 94 instead? start is also used to check against restriction (which is constructed from start_date etc.), and fixing start at the end of the function may be too late.

@wanlce wanlce force-pushed the fix-timetables-interval-start branch from 195d738 to abac9fe Compare February 23, 2022 04:41
@wanlce
Copy link
Contributor Author

wanlce commented Feb 23, 2022

Thank you for your review, I made changes. Looking forward to your new comment 😄

Should we call _align on start before line 94 instead? start is also used to check against restriction (which is constructed from start_date etc.), and fixing start at the end of the function may be too late.

Comment on lines 220 to 229
def _prev_align(self, current: DateTime) -> DateTime:
"""Get the prev scheduled time.

This is ``current - interval``, unless ``current`` falls right on the
interval boundary, when ``current`` is returned.
"""
prev_time = self._get_prev(current)
if self._get_next(prev_time) != current:
return prev_time
return current
Copy link
Member

@uranusjr uranusjr Feb 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be always the same as self._get_prev(self._get_next(current))?

Say a CRON schedule runs one every midnight. For this implementation:

  • If current is at midnight, the return value is at midnight
  • If current is on 1am, the return value is at midnight

For self._get_prev(self._get_next(current)):

  • If current is at midnight, _get_next returns the next midnight, and _get_prev returns this midnight
  • If current is at 1am, _get_next returns the next midnight, and _get_prev returns this midnight

And for the timedelta interval, self._get_prev(self._get_next(current)) is always current.

So if I’m not mistaken, this function is not needed.

Copy link
Contributor Author

@wanlce wanlce Feb 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are different ha

For self._prev_align()

  • If current is at 1am, the return value is at midnight

For self._align()

  • If current is at 1am, the return value is next midnight.

It is correct that data_schedule_start is equal to the data_schedule_end of the previous one when no scheduling time changes. However, when the scheduling time changes, The first data_schedule_start for a new schedule may not be the same as the data_schedule_end before the schedule was changed
2022-02-23_16-55

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the breakdown. From this analysis it seems like a _prev_align is unavoidable. However, perhaps we could rename the functions to descibe things better. Maybe we could rename _align to _align_to_next, and name this new function _align_to_prev?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we need to document pretty thoroughly why we need both functions.

@uranusjr
Copy link
Member

Hi, do you plan to work on this PR?

@wanlce
Copy link
Contributor Author

wanlce commented Mar 31, 2022

Hi, do you plan to work on this PR?

I'm very sorry, I forgot about it.
Seems to have accidentally messed up 😢

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants