TARDIS crashing due to Drones in ShutdownState having remote_resource_uuid=None #248

Closed
giffels opened this issue May 25, 2022 · 4 comments · Fixed by #249

giffels commented May 25, 2022

I have spotted two cases on the Compute4PUNCH infrastructure where drones in ShutdownState have remote_resource_uuid=None, which causes TARDIS to crash continuously. At the moment it is unclear how this happened.

cobald.runtime.tardis.resources.dronestates: 2022-05-25 10:40:03 Drone {'site_name': 'GridKa', 'machine_type': 'eightcore', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1024, 'Disk': 1048576}, 'remote_resource_uuid': None, 'created': datetime.datetime(2022, 5, 21, 15, 32, 27, 240837), 'updated': datetime.datetime(2022, 5, 21, 15, 55, 31, 606878), 'drone_uuid': 'gridka-d61765a93c'} in ShutDownState
....
Traceback (most recent call last):
  File "/home/compute4punch/venv/lib64/python3.6/site-packages/tardis/adapters/sites/htcondor.py", line 326, in handle_exceptions
    yield
  File "/home/compute4punch/venv/lib64/python3.6/site-packages/tardis/agents/siteagent.py", line 54, in resource_status
    return await self._site_adapter.resource_status(resource_attributes)
  File "/home/compute4punch/venv/lib64/python3.6/site-packages/tardis/adapters/sites/htcondor.py", line 285, in resource_status
    resource_uuid = _job_id(resource_attributes.remote_resource_uuid)
  File "/home/compute4punch/venv/lib64/python3.6/site-packages/tardis/adapters/sites/htcondor.py", line 33, in _job_id
    return resource_uuid if "." in resource_uuid else f"{resource_uuid}.0"
TypeError: argument of type 'NoneType' is not iterable
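
The traceback above boils down to a membership test on None. A minimal sketch that reproduces it: `_job_id` is copied from the line quoted in the traceback, while `job_id_or_none` is a hypothetical guard added here purely for illustration and is not claimed to be the change made in #249.

```python
from typing import Optional


def _job_id(resource_uuid: str) -> str:
    # Body taken from the traceback: append ".0" when no process id is present.
    return resource_uuid if "." in resource_uuid else f"{resource_uuid}.0"


def job_id_or_none(resource_uuid: Optional[str]) -> Optional[str]:
    # Hypothetical guard (assumption, not TARDIS code): pass a missing id
    # through instead of letting the membership test raise.
    if resource_uuid is None:
        return None
    return _job_id(resource_uuid)


def reproduces_type_error() -> bool:
    # `"." in None` raises TypeError, which is the crash reported above.
    try:
        _job_id(None)  # type: ignore[arg-type]
    except TypeError:
        return True
    return False
```

Such a guard would only hide the symptom; the drone would still have no remote id to query the batch system with, so the missing value in the DB remains the thing to fix.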
giffels added the bug label May 25, 2022

giffels commented May 25, 2022

It happened again. This time TARDIS is still running, but the remote_resource_uuid is missing in the DB.

id|remote_resource_uuid|drone_uuid|state_id|site_id|machine_type_id|created|updated
4||gridka-ac71906584|5|1|1|2022-05-25 12:01:36.163755|2022-05-25 12:06:38.426646

However, the value is still known to TARDIS:

cobald.runtime.tardis.resources.dronestates: 2022-05-25 12:14:39 Resource attributes: {'site_name': 'GridKa', 'machine_type': 'eightcore', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1024, 'Disk': 1048576}, 'remote_resource_uuid': '8729870.0', 'created': datetime.datetime(2022, 5, 25, 12, 1, 38, 65858), 'updated': datetime.datetime(2022, 5, 25, 12, 6, 38, 426646), 'drone_uuid': 'gridka-ac71906584', 'resource_status': <ResourceStatus.Running: 2>}

So it seems that the DB is not updated properly.

giffels commented May 25, 2022

The problem was potentially introduced by #247. Here is the first DB update for the corresponding drone.

cobald.runtime.tardis.plugins.sqliteregistry: 2022-05-25 12:01:36 Drone: {'site_name': 'GridKa', 'machine_type': 'eightcore', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1024, 'Disk': 1048576}, 'remote_resource_uuid': None, 'created': datetime.datetime(2022, 5, 25, 12, 1, 36, 163755), 'updated': datetime.datetime(2022, 5, 25, 12, 1, 36, 978763), 'drone_uuid': 'gridka-ac71906584'} has changed state to RequestState
cobald.runtime.tardis.plugins.sqliteregistry: 2022-05-25 12:01:36
        INSERT OR ROLLBACK INTO
        Resources(remote_resource_uuid, drone_uuid, state_id, site_id, machine_type_id,
        created, updated)
        SELECT :remote_resource_uuid, :drone_uuid, RS.state_id, S.site_id,
        MT.machine_type_id, :created, :updated
        FROM ResourceStates RS
        JOIN Sites S ON S.site_name = :site_name
        JOIN MachineTypes MT ON MT.machine_type = :machine_type AND MT.site_id =
        S.site_id
        WHERE RS.state = :state,{'state': 'RequestState', 'site_name': 'GridKa', 'machine_type': 'eightcore', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1024, 'Disk': 1048576}, 'remote_resource_uuid': None, 'created': datetime.datetime(2022, 5, 25, 12, 1, 36, 163755), 'updated': datetime.datetime(2022, 5, 25, 12, 1, 36, 978763), 'drone_uuid': 'gridka-ac71906584'} executed

giffels commented May 25, 2022

However, on the second update, the remote_resource_uuid is correctly written to the DB.

cobald.runtime.tardis.plugins.sqliteregistry: 2022-05-25 12:01:38 Drone: {'site_name': 'GridKa', 'machine_type': 'eightcore', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1024, 'Disk': 1048576}, 'remote_resource_uuid': '8729870.0', 'created': datetime.datetime(2022, 5, 25, 12, 1, 38, 65858), 'updated': datetime.datetime(2022, 5, 25, 12, 1, 38, 66210), 'drone_uuid': 'gridka-ac71906584'} has changed state to BootingState
cobald.runtime.tardis.plugins.sqliteregistry: 2022-05-25 12:01:38 UPDATE Resources SET updated = :updated,
        state_id = (SELECT state_id FROM ResourceStates WHERE state = :state)
        WHERE drone_uuid = :drone_uuid
        AND site_id = (SELECT site_id FROM Sites WHERE site_name = :site_name),{'state': 'BootingState', 'site_name': 'GridKa', 'machine_type': 'eightcore', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1024, 'Disk': 1048576}, 'remote_resource_uuid': '8729870.0', 'created': datetime.datetime(2022, 5, 25, 12, 1, 38, 65858), 'updated': datetime.datetime(2022, 5, 25, 12, 1, 38, 66210), 'drone_uuid': 'gridka-ac71906584'} executed

giffels commented May 25, 2022

The update call only updates the state_id and the updated timestamp, so the remote_resource_uuid that was None at insert time is never written to the DB.
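
This behaviour can be reproduced with a small in-memory SQLite sketch. The single-table schema is a simplified assumption (the real Resources table references ResourceStates, Sites and MachineTypes), and the second UPDATE, which also writes remote_resource_uuid, is a hypothetical variant for comparison and not necessarily the change made in #249.

```python
import sqlite3

# Simplified, assumed schema: state is stored inline instead of via state_id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Resources ("
    "id INTEGER PRIMARY KEY, remote_resource_uuid TEXT, "
    "drone_uuid TEXT UNIQUE, state TEXT, updated TEXT)"
)

# First write in RequestState: the remote id is not known yet, so NULL is stored.
conn.execute(
    "INSERT INTO Resources(remote_resource_uuid, drone_uuid, state, updated) "
    "VALUES (:remote_resource_uuid, :drone_uuid, :state, :updated)",
    {
        "remote_resource_uuid": None,
        "drone_uuid": "gridka-ac71906584",
        "state": "RequestState",
        "updated": "2022-05-25 12:01:36",
    },
)

# Attributes at the BootingState transition: the remote id is now known.
booting = {
    "remote_resource_uuid": "8729870.0",
    "drone_uuid": "gridka-ac71906584",
    "state": "BootingState",
    "updated": "2022-05-25 12:01:38",
}


def stored_uuid():
    # Read back what the DB holds for the drone.
    row = conn.execute(
        "SELECT remote_resource_uuid FROM Resources WHERE drone_uuid = ?",
        ("gridka-ac71906584",),
    ).fetchone()
    return row[0]


# UPDATE in the style of the logged statement: only state and timestamp are set.
# The extra key in the parameter dict is ignored by sqlite3.
conn.execute(
    "UPDATE Resources SET updated = :updated, state = :state "
    "WHERE drone_uuid = :drone_uuid",
    booting,
)
after_logged_style_update = stored_uuid()

# Hypothetical variant that also writes the remote id.
conn.execute(
    "UPDATE Resources SET updated = :updated, state = :state, "
    "remote_resource_uuid = :remote_resource_uuid "
    "WHERE drone_uuid = :drone_uuid",
    booting,
)
after_full_update = stored_uuid()
```

With the first UPDATE the column stays NULL even though the value is present in the parameters, which matches the empty remote_resource_uuid field in the DB row shown earlier. A restart that reloads drones from the DB would then see None.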
