Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve error handling in Lancium adapter #281

Merged
merged 2 commits into from
Feb 16, 2023

Conversation

giffels
Copy link
Member

@giffels giffels commented Feb 2, 2023

During testing the integration of Lancium Compute into the infrastructure, TARDIS sometimes crashes due to two reasons, which can be internally handled by TARDIS.

  1. Since we are updating the resource status only once a minute the following situation can occur. A drone previously in state queued or running can be changed meantime to error or finished. If TARDIS tries to terminate this drone, the Lancium API returns HTTP Status Code 419 ("Unable to terminate a job that is not queued or running"). In that case TARDIS should handle it as TardisResourceStatusUpdateFailed exception. So, after the next resource status update, the life cycle management can take care of it.
  2. It can rarely happen that a drone crashes and is not anymore in the system, so that the Lancium API returns a HTTP Status Code 404 ("Not found"). In that case TARDISshould handle it as TardisDroneCrashed exception. Afterwards the life cycle management takes care of it.

FYI the handle_exceptions context manager is involved in here.

@giffels giffels added the Improvement Code Improvements label Feb 2, 2023
@@ -277,3 +278,11 @@
with self.assertRaises(TardisError):
with self.adapter.handle_exceptions():
raise AuthError("test", "test")

with self.assertRaises(TardisResourceStatusUpdateFailed):

Check warning

Code scanning / CodeQL

Unreachable code

This statement is unreachable.
@giffels giffels requested review from a team, eileen-kuehn and RHofsaess and removed request for a team February 3, 2023 08:43
@giffels giffels marked this pull request as ready for review February 3, 2023 08:43
RHofsaess
RHofsaess previously approved these changes Feb 3, 2023
Copy link

@RHofsaess RHofsaess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it looks fine 👍

@eileen-kuehn
Copy link
Member

2. It can rarely happen that a drone crashes and is not anymore in the system, so that the Lancium API returns a HTTP Status Code 404 ("Not found"). In that case TARDISshould handle it as TardisDroneCrashed exception. Afterwards the life cycle management takes care of it.

I have a quick question on this. Does the handling of a drone as crashed have any further meaning? It can (or should) happen that a HTTP Status Code 404 is given even when the drone has not crashed but was correctly terminated by TARDIS (or something else), e.g. the confirmation might just have gone lost.

In case it does not make any difference, I will approve :)

@giffels
Copy link
Member Author

giffels commented Feb 16, 2023

  1. It can rarely happen that a drone crashes and is not anymore in the system, so that the Lancium API returns a HTTP Status Code 404 ("Not found"). In that case TARDISshould handle it as TardisDroneCrashed exception. Afterwards the life cycle management takes care of it.

I have a quick question on this. Does the handling of a drone as crashed have any further meaning?

The drone is not handled anymore by TARDIS, since it is already gone.

It can (or should) happen that a HTTP Status Code 404 is given even when the drone has not crashed but was correctly terminated by TARDIS (or something else), e.g. the confirmation might just have gone lost.

Yes, that can happen, however the result is the same. The drone is gone and it is not handled anymore by TARDIS.

In case it does not make any difference, I will approve :)

Yes, does not make any difference! :-)

Copy link

@RHofsaess RHofsaess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still fine for me :D

@giffels giffels added this pull request to the merge queue Feb 16, 2023
Merged via the queue into MatterMiners:master with commit 64fd917 Feb 16, 2023
@giffels giffels deleted the handle-crashed-drones-lancium branch February 16, 2023 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Improvement Code Improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants