Auditor plugin crashes when drone does not reach AvailableState #305

Closed
QuantumDancer opened this issue Aug 2, 2023 · 0 comments · Fixed by #306

Comments

@QuantumDancer
Contributor

In some cases, a drone is started but never reaches the AvailableState, yet it still reaches the DownState during clean-up. The auditor plugin currently does not handle this case correctly, see

if isinstance(state, AvailableState):
    record = self.construct_record(resource_attributes)
    await self._client.add(record)
elif isinstance(state, DownState):
    record = self.construct_record(resource_attributes)
    record.with_stop_time(
        resource_attributes["updated"]
        .replace(tzinfo=self._local_timezone)
        .astimezone(pytz.utc)
    )
    await self._client.update(record)

A new record is created when the drone reaches AvailableState. The record is then updated with the stop time once the drone reaches DownState. If the drone never reached AvailableState, there is no record to update. In this case, AUDITOR returns the HTTP error `400 BAD REQUEST` (The server cannot or will not process the request due to something that is perceived to be a client error).
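The failure mode can be reproduced without a running AUDITOR instance. The following is only a sketch: `FakeAuditorClient`, `drone_lifecycle` and `reproduce` are hypothetical names, not part of TARDIS or the AUDITOR client, and the stand-in merely assumes that `update` on an unknown record raises a `RuntimeError`, as the traceback below shows for the real client.

```python
import asyncio


class FakeAuditorClient:
    """Hypothetical in-memory stand-in for the AUDITOR client (not the real API)."""

    def __init__(self):
        self.records = {}

    async def add(self, record_id, record):
        self.records[record_id] = dict(record)

    async def update(self, record_id, record):
        if record_id not in self.records:
            # Assumption: mimics the error text seen in the traceback.
            raise RuntimeError("HTTP status client error (400 Bad Request)")
        self.records[record_id].update(record)


async def drone_lifecycle(client, drone_uuid, states):
    """Replay state names the way the plugin's notify() reacts to them."""
    for state in states:
        if state == "AvailableState":
            await client.add(drone_uuid, {"stop_time": None})
        elif state == "DownState":
            await client.update(drone_uuid, {"stop_time": "set"})


def reproduce(states):
    """Return the error message, or None if the lifecycle completed."""
    client = FakeAuditorClient()
    try:
        asyncio.run(drone_lifecycle(client, "nemo-f25919f1d0", states))
    except RuntimeError as err:
        return str(err)
    return None
```

Calling `reproduce(["AvailableState", "DownState"])` completes normally, while `reproduce(["DownState"])` hits the error path because no record was ever added.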

Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]: cobald.runtime.tardis.plugins.auditor: 2023-07-13 10:20:02 Drone: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16996522, 'created': datetime.datetime(2023, 7, 12, 20, 4, 53, 130170), 'updated': datetime.datetime(2023, 7, 13, 10, 20, 2, 340694), 'drone_uuid': 'nemo-f25919f1d0', 'resource_status': <ResourceStatus.Deleted: 4>} has changed state to DownState
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]: cobald.runtime.runner.asyncio: 2023-07-13 10:20:02 runner aborted: <cobald.daemon.runners.asyncio_runner.AsyncioRunner object at 0x7f17fab74040>
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]: Traceback (most recent call last):
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]:   File "/usr/local/lib/python3.8/site-packages/cobald/daemon/runners/base_runner.py", line 68, in run
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]:     await self.manage_payloads()
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]:   File "/usr/local/lib/python3.8/site-packages/cobald/daemon/runners/asyncio_runner.py", line 54, in manage_payloads
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]:     await self._payload_failure
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]:   File "/usr/local/lib/python3.8/site-packages/cobald/daemon/runners/asyncio_runner.py", line 40, in _monitor_payload
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]:     result = await payload()
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]:   File "/usr/local/lib/python3.8/site-packages/tardis/resources/drone.py", line 123, in run
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]:     await current_state.run(self)
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]:   File "/usr/local/lib/python3.8/site-packages/tardis/resources/dronestates.py", line 288, in run
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]:     await drone.set_state(await cls.run_processing_pipeline(drone))
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]:   File "/usr/local/lib/python3.8/site-packages/tardis/resources/drone.py", line 143, in set_state
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]:     await self.notify_plugins()
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]:   File "/usr/local/lib/python3.8/site-packages/tardis/resources/drone.py", line 153, in notify_plugins
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]:     await plugin.notify(self.state, self.resource_attributes)
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]:   File "/usr/local/lib/python3.8/site-packages/tardis/plugins/auditor.py", line 88, in notify
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]:     await self._client.update(record)
Jul 13 12:20:02 monopol.bfg.privat docker-COBalD-Tardis-atlhei[3808458]: RuntimeError: Reqwest Error: HTTP status client error (400 Bad Request) for url (http://10.18.0.12:8001/update)

I think this is the correct behaviour from auditor, as updating a non-existing record does not make sense.

So we should either handle this exception in the plugin or find another solution.
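One way to handle the exception could look roughly like the sketch below. It is not a tested patch for the plugin: `update_stop_time` and `FakeAuditorClient` are hypothetical names, and the check for `"400"` in the error text is an assumption based on the message in the traceback above.

```python
import asyncio
import logging

logger = logging.getLogger("auditor_sketch")


class FakeAuditorClient:
    """Hypothetical stand-in: update() fails for records that were never added."""

    def __init__(self):
        self.records = {}

    async def add(self, record_id, record):
        self.records[record_id] = dict(record)

    async def update(self, record_id, record):
        if record_id not in self.records:
            raise RuntimeError("HTTP status client error (400 Bad Request)")
        self.records[record_id].update(record)


async def update_stop_time(client, drone_uuid, record):
    """Update a record; log a warning instead of crashing if it is missing.

    Returns True if the record was updated, False if the update was skipped.
    """
    try:
        await client.update(drone_uuid, record)
    except RuntimeError as err:
        # Assumption: the HTTP status is only available via the error text.
        if "400" not in str(err):
            raise
        logger.warning("No record to update for drone %s: %s", drone_uuid, err)
        return False
    return True
```

Matching on the error text is brittle, and a 400 response may have causes other than a missing record, so remembering which drones actually had a record added might be the more robust alternative.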
