no ResourceState from SQLiteRegistry db after service crash/restart #291

Closed
dirksammel opened this issue Mar 28, 2023 · 25 comments · Fixed by #292 or #293
Labels
bug Something isn't working

Comments

@dirksammel
Contributor

Hey,

We seem to have problems recovering the ResourceState from the drone db after a crash/restart of the service.
As far as I know, this problem is not related to any recent update but has always been there (@stefan-k can maybe comment).
This is the relevant part of the config:

SqliteRegistry:
    db_file: "/db/drone_registry_atlhei.db"

We use Docker, so on the host the actual db is located in /home/tardis/db/ (mounted into the container as /db/):

-> docker::run { 'COBalD/Tardis-atlhei':
    image   => 'matterminers/cobald-tardis:latest',
    net => 'host',
    volumes => [
      '/etc/cobaldtardis/cobald_atlhei.yml:/srv/cobald.yml',
      '/etc/cobaldtardis/tardis_atlhei.yml:/srv/tardis.yml',
      '/home/tardis/.ssh/id_slurm:/keys/id_slurm',
      '/home/tardis/.ssh/id_nemo:/keys/id_nemo',
      '/home/tardis/known_hosts:/keys/known_hosts',
      '/home/tardis/db/:/db/',
    ],
    ports => [
      '8127:8127',
    ]
  }
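
As a minimal sketch of how the configured `db_file` maps to a file on the host, the volume list above can be resolved like this. The helper `host_path` is hypothetical and only written for illustration; it is not part of Docker, Puppet, or TARDIS.

```python
from pathlib import PurePosixPath

# Volume mappings copied from the docker::run resource above ("host:container").
VOLUMES = [
    "/etc/cobaldtardis/cobald_atlhei.yml:/srv/cobald.yml",
    "/etc/cobaldtardis/tardis_atlhei.yml:/srv/tardis.yml",
    "/home/tardis/.ssh/id_slurm:/keys/id_slurm",
    "/home/tardis/.ssh/id_nemo:/keys/id_nemo",
    "/home/tardis/known_hosts:/keys/known_hosts",
    "/home/tardis/db/:/db/",
]


def host_path(container_path, volumes=VOLUMES):
    """Hypothetical helper: translate a path inside the container to the host path."""
    target = PurePosixPath(container_path)
    for mapping in volumes:
        host, container = mapping.split(":", 1)
        container = PurePosixPath(container)
        # Match either the mount point itself or anything below it.
        if target == container or container in target.parents:
            return str(PurePosixPath(host) / target.relative_to(container))
    return None  # not backed by any of the listed mounts


print(host_path("/db/drone_registry_atlhei.db"))
```

With the mappings above, the `db_file` from the config should resolve to a file below /home/tardis/db/ on the host, so the database is expected to survive a container restart.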

Here is an example that just happened.
The db looks like this:

sqlite> .dump Resources
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE Resources (id INTEGER PRIMARY KEY AUTOINCREMENT,remote_resource_uuid VARCHAR(255), drone_uuid VARCHAR(255) UNIQUE, state_id INTEGER, site_id INTEGER, machine_type_id INTEGER, created TIMESTAMP, updated TIMESTAMP, FOREIGN KEY(state_id) REFERENCES ResourceState(state_id), FOREIGN KEY(site_id) REFERENCES Sites(site_id), FOREIGN KEY(machine_type_id) REFERENCES MachineTypes(machine_type_id), CONSTRAINT unique_remote_resource_uuid_per_site UNIQUE (site_id, remote_resource_uuid));
INSERT INTO Resources VALUES(1,NULL,'nemo-8455542ca3',1,1,1,'2023-03-27 12:19:55.080348','2023-03-27 12:19:55.658881');
INSERT INTO Resources VALUES(2,NULL,'nemo-db29f70c92',1,1,1,'2023-03-27 13:01:40.251877','2023-03-27 13:01:40.792939');
INSERT INTO Resources VALUES(3,NULL,'nemo-5706f85d59',1,1,1,'2023-03-27 13:17:10.740430','2023-03-27 13:17:10.950521');
INSERT INTO Resources VALUES(4,NULL,'nemo-961bd21e33',1,1,1,'2023-03-27 13:52:34.910558','2023-03-27 13:52:34.958508');
INSERT INTO Resources VALUES(5,NULL,'nemo-26b643134b',1,1,1,'2023-03-27 13:52:34.910680','2023-03-27 13:52:34.959096');
INSERT INTO Resources VALUES(6,NULL,'nemo-61774b02db',1,1,1,'2023-03-27 13:59:05.725789','2023-03-27 13:59:05.759855');
INSERT INTO Resources VALUES(7,NULL,'nemo-9f6f374ca0',1,1,1,'2023-03-27 14:01:05.732834','2023-03-27 14:01:06.733262');
INSERT INTO Resources VALUES(8,NULL,'nemo-bb361407b9',1,1,1,'2023-03-27 14:01:05.732716','2023-03-27 14:01:06.737537');
INSERT INTO Resources VALUES(9,NULL,'nemo-1ca4256a87',1,1,1,'2023-03-27 15:19:26.731433','2023-03-27 15:19:26.930026');
INSERT INTO Resources VALUES(10,NULL,'nemo-667fa5f690',1,1,1,'2023-03-27 15:21:26.737791','2023-03-27 15:21:26.821617');
INSERT INTO Resources VALUES(11,NULL,'nemo-1fb0b384ae',1,1,1,'2023-03-27 15:21:26.737700','2023-03-27 15:21:26.825348');
INSERT INTO Resources VALUES(12,NULL,'nemo-f7c82faef6',1,1,1,'2023-03-27 15:21:26.737750','2023-03-27 15:21:26.825736');
INSERT INTO Resources VALUES(13,NULL,'nemo-2025eabfe8',1,1,1,'2023-03-27 15:21:26.737546','2023-03-27 15:21:26.826064');

(Maybe it is already strange that there are no values for ResourceState etc.?)
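
One way to check what the registry actually stores is to join `Resources` against `ResourceState` and list the drones without a `remote_resource_uuid`. The following is only a sketch on an in-memory database: the `Resources` columns are a reduced subset of the dump above, while the layout of `ResourceState` (a `state` column) and the state name `PlaceholderState` are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in; point this at the real db_file to inspect a live registry.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ResourceState (state_id INTEGER PRIMARY KEY, state VARCHAR(255));
    INSERT INTO ResourceState VALUES (1, 'PlaceholderState');
    CREATE TABLE Resources (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        remote_resource_uuid VARCHAR(255),
        drone_uuid VARCHAR(255) UNIQUE,
        state_id INTEGER,
        FOREIGN KEY(state_id) REFERENCES ResourceState(state_id)
    );
    INSERT INTO Resources VALUES (1, NULL, 'nemo-8455542ca3', 1);
    INSERT INTO Resources VALUES (2, NULL, 'nemo-db29f70c92', 1);
    """
)

# LEFT JOIN keeps drones even if their state_id has no matching ResourceState row;
# in that case the state column comes back as None.
rows = conn.execute(
    """
    SELECT r.drone_uuid, r.remote_resource_uuid, s.state
    FROM Resources AS r
    LEFT JOIN ResourceState AS s ON s.state_id = r.state_id
    ORDER BY r.id
    """
).fetchall()

# Drones that were stored without the batch system's job ID.
missing_remote_id = [drone for drone, remote_id, _state in rows if remote_id is None]
print(rows)
print(missing_remote_id)
```

In the dump above every row has `remote_resource_uuid` set to NULL, so a query like this would flag all 13 stored drones.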

The log showed these booting drones:

Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:27:51 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16521168, 'created': datetime.datetime(2023, 3, 28, 6, 9, 9, 2201), 'updated': datetime.datetime(2023, 3, 28, 13, 27, 51, 571601), 'drone_uuid': 'nemo-16521168', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:27:51 NEMO has status {'JobID': '16521169', 'State': 'Idle'}.
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:27:51 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16521169, 'created': datetime.datetime(2023, 3, 28, 6, 9, 9, 8072), 'updated': datetime.datetime(2023, 3, 28, 13, 27, 51, 572560), 'drone_uuid': 'nemo-16521169', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:27:51 NEMO has status {'JobID': '16521170', 'State': 'Idle'}.
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:27:51 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16521170, 'created': datetime.datetime(2023, 3, 28, 6, 9, 9, 34917), 'updated': datetime.datetime(2023, 3, 28, 13, 27, 51, 573467), 'drone_uuid': 'nemo-16521170', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:27:51 NEMO has status {'JobID': '16521171', 'State': 'Idle'}.
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:27:51 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16521171, 'created': datetime.datetime(2023, 3, 28, 6, 9, 9, 63762), 'updated': datetime.datetime(2023, 3, 28, 13, 27, 51, 574335), 'drone_uuid': 'nemo-16521171', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:27:51 NEMO has status {'JobID': '16521172', 'State': 'Idle'}.
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:27:51 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16521172, 'created': datetime.datetime(2023, 3, 28, 6, 9, 9, 65612), 'updated': datetime.datetime(2023, 3, 28, 13, 27, 51, 575221), 'drone_uuid': 'nemo-16521172', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:27:51 NEMO has status {'JobID': '16521173', 'State': 'Idle'}.
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:27:51 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16521173, 'created': datetime.datetime(2023, 3, 28, 6, 9, 9, 68858), 'updated': datetime.datetime(2023, 3, 28, 13, 27, 51, 576265), 'drone_uuid': 'nemo-16521173', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:27:51 NEMO has status {'JobID': '16521174', 'State': 'Idle'}.
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:27:51 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16521174, 'created': datetime.datetime(2023, 3, 28, 6, 9, 9, 77660), 'updated': datetime.datetime(2023, 3, 28, 13, 27, 51, 577232), 'drone_uuid': 'nemo-16521174', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:27:51 NEMO has status {'JobID': '16521175', 'State': 'Idle'}.
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:27:51 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16521175, 'created': datetime.datetime(2023, 3, 28, 6, 9, 9, 79127), 'updated': datetime.datetime(2023, 3, 28, 13, 27, 51, 578138), 'drone_uuid': 'nemo-16521175', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:27:51 NEMO has status {'JobID': '16521176', 'State': 'Idle'}.
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:27:51 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16521176, 'created': datetime.datetime(2023, 3, 28, 6, 9, 9, 82262), 'updated': datetime.datetime(2023, 3, 28, 13, 27, 51, 579000), 'drone_uuid': 'nemo-16521176', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:27:51 NEMO has status {'JobID': '16521177', 'State': 'Idle'}.
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:27:51 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16521177, 'created': datetime.datetime(2023, 3, 28, 6, 9, 9, 83663), 'updated': datetime.datetime(2023, 3, 28, 13, 27, 51, 579918), 'drone_uuid': 'nemo-16521177', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:27:51 NEMO has status {'JobID': '16521181', 'State': 'Idle'}.
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:27:51 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16521181, 'created': datetime.datetime(2023, 3, 28, 6, 9, 9, 872632), 'updated': datetime.datetime(2023, 3, 28, 13, 27, 51, 580808), 'drone_uuid': 'nemo-16521181', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:27:51 NEMO has status {'JobID': '16521182', 'State': 'Idle'}.
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:27:51 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16521182, 'created': datetime.datetime(2023, 3, 28, 6, 9, 9, 882702), 'updated': datetime.datetime(2023, 3, 28, 13, 27, 51, 581702), 'drone_uuid': 'nemo-16521182', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:27:51 NEMO has status {'JobID': '16521180', 'State': 'Idle'}.
Mar 28 15:27:51 auditor.novalocal docker-COBalD-Tardis-atlhei[730491]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:27:51 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16521180, 'created': datetime.datetime(2023, 3, 28, 6, 9, 9, 861091), 'updated': datetime.datetime(2023, 3, 28, 13, 27, 51, 582568), 'drone_uuid': 'nemo-16521180', 'resource_status': <ResourceStatus.Booting: 1>}

After a restart of the service, the drones from the db can be seen in the log:

Mar 28 15:37:04 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:37:04 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': None, 'created': datetime.datetime(2023, 3, 27, 14, 1, 5, 732834), 'updated': datetime.datetime(2023, 3, 27, 14, 1, 6, 733262), 'drone_uuid': 'nemo-9f6f374ca0'} in RequestState
Mar 28 15:37:04 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:37:04 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': None, 'created': datetime.datetime(2023, 3, 27, 12, 19, 55, 80348), 'updated': datetime.datetime(2023, 3, 27, 12, 19, 55, 658881), 'drone_uuid': 'nemo-8455542ca3'} in RequestState
Mar 28 15:37:04 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:37:04 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': None, 'created': datetime.datetime(2023, 3, 27, 15, 21, 26, 737791), 'updated': datetime.datetime(2023, 3, 27, 15, 21, 26, 821617), 'drone_uuid': 'nemo-667fa5f690'} in RequestState
Mar 28 15:37:04 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:37:04 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': None, 'created': datetime.datetime(2023, 3, 27, 13, 52, 34, 910558), 'updated': datetime.datetime(2023, 3, 27, 13, 52, 34, 958508), 'drone_uuid': 'nemo-961bd21e33'} in RequestState
Mar 28 15:37:04 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:37:04 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': None, 'created': datetime.datetime(2023, 3, 27, 13, 52, 34, 910680), 'updated': datetime.datetime(2023, 3, 27, 13, 52, 34, 959096), 'drone_uuid': 'nemo-26b643134b'} in RequestState
Mar 28 15:37:04 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:37:04 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': None, 'created': datetime.datetime(2023, 3, 27, 15, 21, 26, 737546), 'updated': datetime.datetime(2023, 3, 27, 15, 21, 26, 826064), 'drone_uuid': 'nemo-2025eabfe8'} in RequestState
Mar 28 15:37:04 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:37:04 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': None, 'created': datetime.datetime(2023, 3, 27, 13, 1, 40, 251877), 'updated': datetime.datetime(2023, 3, 27, 13, 1, 40, 792939), 'drone_uuid': 'nemo-db29f70c92'} in RequestState
Mar 28 15:37:04 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:37:04 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': None, 'created': datetime.datetime(2023, 3, 27, 14, 1, 5, 732716), 'updated': datetime.datetime(2023, 3, 27, 14, 1, 6, 737537), 'drone_uuid': 'nemo-bb361407b9'} in RequestState
Mar 28 15:37:04 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:37:04 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': None, 'created': datetime.datetime(2023, 3, 27, 15, 21, 26, 737700), 'updated': datetime.datetime(2023, 3, 27, 15, 21, 26, 825348), 'drone_uuid': 'nemo-1fb0b384ae'} in RequestState
Mar 28 15:37:04 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:37:04 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': None, 'created': datetime.datetime(2023, 3, 27, 13, 17, 10, 740430), 'updated': datetime.datetime(2023, 3, 27, 13, 17, 10, 950521), 'drone_uuid': 'nemo-5706f85d59'} in RequestState
Mar 28 15:37:04 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:37:04 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': None, 'created': datetime.datetime(2023, 3, 27, 15, 19, 26, 731433), 'updated': datetime.datetime(2023, 3, 27, 15, 19, 26, 930026), 'drone_uuid': 'nemo-1ca4256a87'} in RequestState
Mar 28 15:37:04 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:37:04 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': None, 'created': datetime.datetime(2023, 3, 27, 13, 59, 5, 725789), 'updated': datetime.datetime(2023, 3, 27, 13, 59, 5, 759855), 'drone_uuid': 'nemo-61774b02db'} in RequestState

But later the log shows this:

Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:38:43 NEMO has status {'JobID': '16522846', 'State': 'Idle'}.
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522846, 'created': datetime.datetime(2023, 3, 28, 13, 37, 42, 187912), 'updated': datetime.datetime(2023, 3, 28, 13, 38, 43, 786640), 'drone_uuid': 'nemo-16522846', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522852, 'created': datetime.datetime(2023, 3, 28, 13, 37, 42, 229095), 'updated': datetime.datetime(2023, 3, 28, 13, 37, 42, 229129), 'drone_uuid': 'nemo-16522852', 'resource_status': <ResourceStatus.Booting: 1>} in BootingState
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522851, 'created': datetime.datetime(2023, 3, 28, 13, 37, 42, 225214), 'updated': datetime.datetime(2023, 3, 28, 13, 37, 42, 225246), 'drone_uuid': 'nemo-16522851', 'resource_status': <ResourceStatus.Booting: 1>} in BootingState
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522853, 'created': datetime.datetime(2023, 3, 28, 13, 37, 42, 229840), 'updated': datetime.datetime(2023, 3, 28, 13, 37, 42, 229870), 'drone_uuid': 'nemo-16522853', 'resource_status': <ResourceStatus.Booting: 1>} in BootingState
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:38:43 NEMO has status {'JobID': '16522845', 'State': 'Idle'}.
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522845, 'created': datetime.datetime(2023, 3, 28, 13, 37, 42, 179586), 'updated': datetime.datetime(2023, 3, 28, 13, 38, 43, 788804), 'drone_uuid': 'nemo-16522845', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:38:43 NEMO has status {'JobID': '16522844', 'State': 'Idle'}.
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522844, 'created': datetime.datetime(2023, 3, 28, 13, 37, 42, 169470), 'updated': datetime.datetime(2023, 3, 28, 13, 38, 43, 789600), 'drone_uuid': 'nemo-16522844', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:38:43 NEMO has status {'JobID': '16522847', 'State': 'Idle'}.
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522847, 'created': datetime.datetime(2023, 3, 28, 13, 37, 42, 198940), 'updated': datetime.datetime(2023, 3, 28, 13, 38, 43, 790416), 'drone_uuid': 'nemo-16522847', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:38:43 NEMO has status {'JobID': '16522848', 'State': 'Idle'}.
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522848, 'created': datetime.datetime(2023, 3, 28, 13, 37, 42, 207396), 'updated': datetime.datetime(2023, 3, 28, 13, 38, 43, 791399), 'drone_uuid': 'nemo-16522848', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:38:43 NEMO has status {'JobID': '16522849', 'State': 'Idle'}.
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522849, 'created': datetime.datetime(2023, 3, 28, 13, 37, 42, 216867), 'updated': datetime.datetime(2023, 3, 28, 13, 38, 43, 792193), 'drone_uuid': 'nemo-16522849', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:38:43 NEMO has status {'JobID': '16522850', 'State': 'Idle'}.
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522850, 'created': datetime.datetime(2023, 3, 28, 13, 37, 42, 224114), 'updated': datetime.datetime(2023, 3, 28, 13, 38, 43, 792956), 'drone_uuid': 'nemo-16522850', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:38:43 NEMO has status {'JobID': '16522852', 'State': 'Idle'}.
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522852, 'created': datetime.datetime(2023, 3, 28, 13, 37, 42, 229095), 'updated': datetime.datetime(2023, 3, 28, 13, 38, 43, 793745), 'drone_uuid': 'nemo-16522852', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:38:43 NEMO has status {'JobID': '16522851', 'State': 'Idle'}.
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522851, 'created': datetime.datetime(2023, 3, 28, 13, 37, 42, 225214), 'updated': datetime.datetime(2023, 3, 28, 13, 38, 43, 794546), 'drone_uuid': 'nemo-16522851', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:38:43 NEMO has status {'JobID': '16522853', 'State': 'Idle'}.
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522853, 'created': datetime.datetime(2023, 3, 28, 13, 37, 42, 229840), 'updated': datetime.datetime(2023, 3, 28, 13, 38, 43, 795314), 'drone_uuid': 'nemo-16522853', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522854, 'created': datetime.datetime(2023, 3, 28, 13, 37, 43, 726978), 'updated': datetime.datetime(2023, 3, 28, 13, 37, 43, 727054), 'drone_uuid': 'nemo-16522854', 'resource_status': <ResourceStatus.Booting: 1>} in BootingState
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:38:43 NEMO has status {'JobID': '16522854', 'State': 'Idle'}.
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522854, 'created': datetime.datetime(2023, 3, 28, 13, 37, 43, 726978), 'updated': datetime.datetime(2023, 3, 28, 13, 38, 43, 830582), 'drone_uuid': 'nemo-16522854', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522855, 'created': datetime.datetime(2023, 3, 28, 13, 37, 43, 729845), 'updated': datetime.datetime(2023, 3, 28, 13, 37, 43, 729887), 'drone_uuid': 'nemo-16522855', 'resource_status': <ResourceStatus.Booting: 1>} in BootingState
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:38:43 NEMO has status {'JobID': '16522855', 'State': 'Idle'}.
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522855, 'created': datetime.datetime(2023, 3, 28, 13, 37, 43, 729845), 'updated': datetime.datetime(2023, 3, 28, 13, 38, 43, 850533), 'drone_uuid': 'nemo-16522855', 'resource_status': <ResourceStatus.Booting: 1>}
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Drone {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522856, 'created': datetime.datetime(2023, 3, 28, 13, 37, 43, 732932), 'updated': datetime.datetime(2023, 3, 28, 13, 37, 43, 732973), 'drone_uuid': 'nemo-16522856', 'resource_status': <ResourceStatus.Booting: 1>} in BootingState
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.adapters.sites.moab: 2023-03-28 13:38:43 NEMO has status {'JobID': '16522856', 'State': 'Idle'}.
Mar 28 15:38:43 auditor.novalocal docker-COBalD-Tardis-atlhei[768830]: cobald.runtime.tardis.resources.dronestates: 2023-03-28 13:38:43 Resource attributes: {'site_name': 'NEMO', 'machine_type': 'tardis_c40m100', 'obs_machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1000, 'Disk': 1000}, 'remote_resource_uuid': 16522856, 'created': datetime.datetime(2023, 3, 28, 13, 37, 43, 732932), 'updated': datetime.datetime(2023, 3, 28, 13, 38, 43, 851888), 'drone_uuid': 'nemo-16522856', 'resource_status': <ResourceStatus.Booting: 1>}

We now have 26 booting machines (Idle jobs) in Moab, but only the 13 most recent ones are shown in the log:

16521180           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 08:09:09
16521181           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 08:09:09
16521182           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 08:09:09
16521168           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 08:09:08
16521169           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 08:09:08
16521170           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 08:09:08
16521171           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 08:09:08
16521172           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 08:09:08
16521173           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 08:09:08
16521174           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 08:09:08
16521175           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 08:09:08
16521176           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 08:09:08
16521177           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 08:09:08
16522854           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 15:37:42
16522855           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 15:37:42
16522856           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 15:37:42
16522844           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 15:37:42
16522845           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 15:37:42
16522846           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 15:37:42
16522847           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 15:37:42
16522848           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 15:37:42
16522849           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 15:37:42
16522850           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 15:37:42
16522851           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 15:37:42
16522852           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 15:37:42
16522853           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 15:37:42

And the db entries did not change during all of this.
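A minimal sketch of how the mismatch between the moab queue and the drones TARDIS still tracks could be checked. It assumes the job ID is the first whitespace-separated column of the queue listing; `untracked_jobs` and the sample `tracked` set are hypothetical names used only for illustration, not part of TARDIS.

```python
def untracked_jobs(queue_listing, tracked_ids):
    """Return job IDs from a queue listing that are not in tracked_ids."""
    job_ids = []
    for line in queue_listing.splitlines():
        fields = line.split()
        # Assumption: the job ID is the first column and is purely numeric.
        if fields and fields[0].isdigit():
            job_ids.append(fields[0])
    return [job_id for job_id in job_ids if job_id not in tracked_ids]


listing = """\
16521180           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 08:09:09
16522855           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 15:37:42
16522856           fr_bh104       Idle    20  4:00:00:00  Tue Mar 28 15:37:42
"""
# Job IDs that still show up in the TARDIS log (taken from remote_resource_uuid).
tracked = {"16522855", "16522856"}

orphans = untracked_jobs(listing, tracked)
print(orphans)
```

Any job ID returned here is running or queued in moab without a matching drone in TARDIS.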

Sorry for the log dump, but I hope you can understand our issue from this.
Please let me know if you need any further information!

@giffels
Member

giffels commented Mar 28, 2023

Thanks for the report. What TARDIS version are you using? Just asking, since docker run does not usually update an already pulled image.

@giffels
Member

giffels commented Mar 28, 2023

sqlite> .dump Resources
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE Resources (id INTEGER PRIMARY KEY AUTOINCREMENT,remote_resource_uuid VARCHAR(255), drone_uuid VARCHAR(255) UNIQUE, state_id INTEGER, site_id INTEGER, machine_type_id INTEGER, created TIMESTAMP, updated TIMESTAMP, FOREIGN KEY(state_id) REFERENCES ResourceState(state_id), FOREIGN KEY(site_id) REFERENCES Sites(site_id), FOREIGN KEY(machine_type_id) REFERENCES MachineTypes(machine_type_id), CONSTRAINT unique_remote_resource_uuid_per_site UNIQUE (site_id, remote_resource_uuid));
INSERT INTO Resources VALUES(1,NULL,'nemo-8455542ca3',1,1,1,'2023-03-27 12:19:55.080348','2023-03-27 12:19:55.658881');
INSERT INTO Resources VALUES(2,NULL,'nemo-db29f70c92',1,1,1,'2023-03-27 13:01:40.251877','2023-03-27 13:01:40.792939');
INSERT INTO Resources VALUES(3,NULL,'nemo-5706f85d59',1,1,1,'2023-03-27 13:17:10.740430','2023-03-27 13:17:10.950521');
INSERT INTO Resources VALUES(4,NULL,'nemo-961bd21e33',1,1,1,'2023-03-27 13:52:34.910558','2023-03-27 13:52:34.958508');
INSERT INTO Resources VALUES(5,NULL,'nemo-26b643134b',1,1,1,'2023-03-27 13:52:34.910680','2023-03-27 13:52:34.959096');
INSERT INTO Resources VALUES(6,NULL,'nemo-61774b02db',1,1,1,'2023-03-27 13:59:05.725789','2023-03-27 13:59:05.759855');
INSERT INTO Resources VALUES(7,NULL,'nemo-9f6f374ca0',1,1,1,'2023-03-27 14:01:05.732834','2023-03-27 14:01:06.733262');
INSERT INTO Resources VALUES(8,NULL,'nemo-bb361407b9',1,1,1,'2023-03-27 14:01:05.732716','2023-03-27 14:01:06.737537');
INSERT INTO Resources VALUES(9,NULL,'nemo-1ca4256a87',1,1,1,'2023-03-27 15:19:26.731433','2023-03-27 15:19:26.930026');
INSERT INTO Resources VALUES(10,NULL,'nemo-667fa5f690',1,1,1,'2023-03-27 15:21:26.737791','2023-03-27 15:21:26.821617');
INSERT INTO Resources VALUES(11,NULL,'nemo-1fb0b384ae',1,1,1,'2023-03-27 15:21:26.737700','2023-03-27 15:21:26.825348');
INSERT INTO Resources VALUES(12,NULL,'nemo-f7c82faef6',1,1,1,'2023-03-27 15:21:26.737750','2023-03-27 15:21:26.825736');
INSERT INTO Resources VALUES(13,NULL,'nemo-2025eabfe8',1,1,1,'2023-03-27 15:21:26.737546','2023-03-27 15:21:26.826064');

This seems to be okay. The state_id is 1; what is missing is the remote_resource_uuid. Could you verify that state_id 1 means RequestState in the ResourceStates table, please?

The reason for this is that TARDIS cannot know the remote_resource_uuid before deploying the resource. So that field should only be set once the drone is in BootingState.

@giffels giffels added the bug Something isn't working label Mar 28, 2023
@dirksammel
Contributor Author

This seems to be okay. The state_id is 1; what is missing is the remote_resource_uuid. Could you verify that state_id 1 means RequestState in the ResourceStates table, please?

Yes, it's RequestState:

sqlite> .dump ResourceStates
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE ResourceStates (state_id INTEGER PRIMARY KEY AUTOINCREMENT, state VARCHAR(255) UNIQUE);
INSERT INTO ResourceStates VALUES(1,'RequestState');
INSERT INTO ResourceStates VALUES(2,'BootingState');
INSERT INTO ResourceStates VALUES(3,'IntegrateState');
INSERT INTO ResourceStates VALUES(4,'IntegratingState');
INSERT INTO ResourceStates VALUES(5,'AvailableState');
INSERT INTO ResourceStates VALUES(6,'DrainState');
INSERT INTO ResourceStates VALUES(7,'DrainingState');
INSERT INTO ResourceStates VALUES(8,'DisintegrateState');
INSERT INTO ResourceStates VALUES(9,'ShutDownState');
INSERT INTO ResourceStates VALUES(10,'ShuttingDownState');
INSERT INTO ResourceStates VALUES(11,'CleanupState');
INSERT INTO ResourceStates VALUES(12,'DownState');
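For reference, a small sketch of how the state name can be resolved directly with a join instead of comparing the two dumps by hand. It uses an in-memory database with a trimmed-down copy of the two tables (only the columns needed here), so the schema below is an illustrative subset and not the full registry schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down copies of the two tables, filled with rows from the dumps above.
conn.executescript(
    """
    CREATE TABLE ResourceStates (
        state_id INTEGER PRIMARY KEY AUTOINCREMENT,
        state VARCHAR(255) UNIQUE
    );
    CREATE TABLE Resources (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        remote_resource_uuid VARCHAR(255),
        drone_uuid VARCHAR(255) UNIQUE,
        state_id INTEGER
    );
    INSERT INTO ResourceStates VALUES(1,'RequestState');
    INSERT INTO ResourceStates VALUES(2,'BootingState');
    INSERT INTO Resources VALUES(1,NULL,'nemo-8455542ca3',1);
    """
)

# Resolve the numeric state_id to its name for every stored drone.
rows = conn.execute(
    """
    SELECT R.drone_uuid, R.remote_resource_uuid, RS.state
    FROM Resources R
    JOIN ResourceStates RS ON R.state_id = RS.state_id
    """
).fetchall()
print(rows)
```

The same SELECT can be pasted into the sqlite3 shell against the real db file.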

@dirksammel
Contributor Author

Thanks for the report. What TARDIS version are you using? Just asking, since docker run does not usually update an already pulled image.

Good question, where can I find this? 😅
Our puppet run shows this:

Mar 28 19:09:45 auditor puppet-agent[776993]: (/Stage[main]/Profile::Cobaldtardis/Docker::Image[matterminers/cobald-tardis]/Exec[/usr/local/bin/update_docker_image.sh matterminers/cobald-tardis:latest]/returns) latest: Pulling from matterminers/cobald-tardis
Mar 28 19:09:45 auditor puppet-agent[776993]: (/Stage[main]/Profile::Cobaldtardis/Docker::Image[matterminers/cobald-tardis]/Exec[/usr/local/bin/update_docker_image.sh matterminers/cobald-tardis:latest]/returns) Digest: sha256:76977ef4274b35db4d0f036839b523f4f1c8a0eb996762069e93fcbb82834c0a
Mar 28 19:09:45 auditor puppet-agent[776993]: (/Stage[main]/Profile::Cobaldtardis/Docker::Image[matterminers/cobald-tardis]/Exec[/usr/local/bin/update_docker_image.sh matterminers/cobald-tardis:latest]/returns) Status: Image is up to date for matterminers/cobald-tardis:latest
Mar 28 19:09:45 auditor puppet-agent[776993]: (/Stage[main]/Profile::Cobaldtardis/Docker::Image[matterminers/cobald-tardis]/Exec[/usr/local/bin/update_docker_image.sh matterminers/cobald-tardis:latest]/returns) docker.io/matterminers/cobald-tardis:latest
Mar 28 19:09:45 auditor puppet-agent[776993]: (/Stage[main]/Profile::Cobaldtardis/Docker::Image[matterminers/cobald-tardis]/Exec[/usr/local/bin/update_docker_image.sh matterminers/cobald-tardis:latest]/returns) No updates to matterminers/cobald-tardis:latest available. Currently on sha256:028634ae5e39b6c2691c1f807fb3d349babd8280c6fa337cdd50ed2f88b66021.

@giffels
Member

giffels commented Mar 29, 2023

The most recent one is Digest: sha256:5405840eebafdadd3e8598e18d04e7eaf65ef8dfd650c5b56ea539ee230a3869. You can run docker pull matterminers/cobald-tardis:latest to update. But I am not too optimistic that this will solve the issue. It looks more like you are running into some sort of race condition here.

@dirksammel
Contributor Author

Hm, somehow I don't get this version:

sudo docker pull matterminers/cobald-tardis:latest
latest: Pulling from matterminers/cobald-tardis
Digest: sha256:76977ef4274b35db4d0f036839b523f4f1c8a0eb996762069e93fcbb82834c0a
Status: Image is up to date for matterminers/cobald-tardis:latest
docker.io/matterminers/cobald-tardis:latest

@giffels
Member

giffels commented Mar 29, 2023

Hmm, that is also the version that I pulled just now. So the information on Docker Hub is not correct.

Screenshot 2023-03-29 at 11:07:19

@giffels
Member

giffels commented Mar 29, 2023

I think I have identified the problem. Actually, it is a feature or a bug in the Moab adapter. The adapter changes the drone_uuid once the resource has been requested and the remote_resource_uuid is available. So the SqliteRegistry entry never gets updated in that case.

drone_uuid=self.drone_uuid(remote_resource_uuid),

In earlier versions this worked, since the resource was only added to the SqliteRegistry once it was in BootingState. However, there was a good reason to change this behaviour, which I do not remember right now.

So my question to @dirksammel and @mschnepf: do we still need this feature? Moab is the only adapter that does it.
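To illustrate the effect, here is a minimal sketch using an in-memory database and a simplified Resources table. The UPDATE statement is an assumption about how a registry keyed on drone_uuid behaves, not a copy of the actual TARDIS query, and pairing 'nemo-8455542ca3' with job 16522855 is purely illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Resources (drone_uuid VARCHAR(255) UNIQUE, "
    "remote_resource_uuid VARCHAR(255), state_id INTEGER)"
)
# Row written while the drone is in RequestState, under the generated uuid.
conn.execute("INSERT INTO Resources VALUES ('nemo-8455542ca3', NULL, 1)")

# After submission the drone_uuid is replaced by one derived from the job id,
# so an update keyed on the new uuid no longer finds the stored row.
cursor = conn.execute(
    "UPDATE Resources SET remote_resource_uuid = ?, state_id = ? "
    "WHERE drone_uuid = ?",
    ("16522855", 2, "nemo-16522855"),
)
updated_rows = cursor.rowcount
stored = conn.execute("SELECT * FROM Resources").fetchall()
print(updated_rows, stored)
```

The update matches no row, and the stored entry stays in RequestState without a remote_resource_uuid, which is what the dump above shows.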

@giffels
Member

giffels commented Mar 29, 2023

Here is the reason: #247

@giffels
Member

giffels commented Mar 29, 2023

Two possible solutions:

  • Remove that feature, so that the Moab adapter behaves like the other adapters.
  • Add an update_drone_uuid call that changes the drone_uuid in the SqliteRegistry as well.

I would prefer the first proposal.
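For comparison, a rough sketch of what the second option could look like. update_drone_uuid does not exist in TARDIS; the function, its signature and the simplified table are hypothetical and only show the extra bookkeeping this option would need.

```python
import sqlite3


def update_drone_uuid(conn, old_uuid, new_uuid):
    """Hypothetical helper: rename a drone in the Resources table.

    Returns True if exactly one row was renamed.
    """
    cursor = conn.execute(
        "UPDATE Resources SET drone_uuid = ? WHERE drone_uuid = ?",
        (new_uuid, old_uuid),
    )
    conn.commit()
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Resources (drone_uuid VARCHAR(255) UNIQUE, state_id INTEGER)"
)
conn.execute("INSERT INTO Resources VALUES ('nemo-8455542ca3', 1)")

renamed = update_drone_uuid(conn, "nemo-8455542ca3", "nemo-16522855")
rows = conn.execute("SELECT drone_uuid, state_id FROM Resources").fetchall()
print(renamed, rows)
```

Every plugin that stores the drone_uuid would need a similar hook, which is one more argument for the first proposal.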

@mschnepf
Member

I also prefer the first proposal due to consistency. For systems with HTCondor as OBS it should work since the HTCondor batch system adapter has the TardisDroneUuid to identify drones on batch system level.
However, I do not see such a mechanism for the slurm adapter.

@dirksammel
Contributor Author

dirksammel commented Mar 29, 2023

However, I do not see such a mechanism for the slurm adapter.

I'm not sure, but I think this is happening here
Because I get this:

sinfo --Format="Features" -e --noheader -r           
nemo-16522615       
nemo-16523462       
nemo-16521124       
nemo-16518819       
nemo-16523461       
nemo-16523422       
nemo-16522613       
nemo-16522614       
nemo-16521076       
nemo-16518822       
nemo-16518876       
nemo-16523415  
nemo-16523479       
nemo-16521167       
nemo-16518820

Edit: Ah, but I guess that's not the drone uuid we need, right?

@giffels
Member

giffels commented Mar 29, 2023

These are the updated drone_uuids, not the generated ones. In principle, that is just the hostname, right?

Two questions:

  • How is the hostname of the VM defined and can we influence that? I know, there is a script executed on the assigned NEMO batch node, which spawns a VM on that node.
  • Looking at this, it seems to me that the Slurm OBS only works for Moab-based clusters at the moment. Is that correct? Do we want to change this?

Edit: I am not a Slurm expert, but it seems that features can be defined using Feature=<string>. So it would be possible to get that, right?

@dirksammel
Contributor Author

These are the updated drone_uuids, not the generated ones. In principle, that is just the hostname, right?

The number is the corresponding JOBID in moab. This ends up in 'remote_resource_uuid': 16524055 and 'drone_uuid': 'nemo-16524055'.

Two questions:

* How is the hostname of the VM defined and can we influence that? I know, there is a script executed on the assigned NEMO batch node, which spawns a VM on that node.

The hostnames of the VMs can be retrieved like this:

sinfo --Format="Features,NodeHost" -e --noheader -r | grep nemo
nemo-16522615       host-10-20-40-16    
nemo-16523462       host-10-20-40-37    
nemo-16521076       host-10-20-40-9     
nemo-16521124       host-10-20-40-3     
nemo-16518819       host-10-20-40-4     
nemo-16523461       host-10-20-40-5     
nemo-16523422       host-10-20-40-6     
nemo-16522613       host-10-20-40-7     
nemo-16522614       host-10-20-40-8     
nemo-16518822       host-10-20-40-10    
nemo-16518876       host-10-20-40-11    
nemo-16523415       host-10-20-40-12    
nemo-16523479       host-10-20-40-13    
nemo-16521167       host-10-20-40-14    
nemo-16518820       host-10-20-40-15    
nemo-16522741       host-10-20-40-17    
nemo-16521163       host-10-20-40-18    
nemo-16523417       host-10-20-40-19

The hostname is defined by the IPs that are available for us. Influencing the hostname dynamically is not really possible, because slurm needs a pre-defined list of the available hostnames.

* Looking at this, it seems to me that the Slurm OBS only works for Moab-based clusters at the moment. Is that correct? Do we want to change this?

I'm not sure, @stefan-k should comment, but he's on vacation till Monday.

@stefan-k
Contributor

I'm not sure, @stefan-k should comment, but he's on vacation till Monday.

AFAICT everything @dirksammel said is correct.

I'm unsure why the hostname is needed. In our setup, startvm starts a VM on the NEMO node via OpenStack. It then attaches the MOAB job id to the VM meta information thingy. A script inside the VM reads this value (via the OpenStack API) and sets the Feature field of the node in Slurm to nemo-<jobid>. The Feature field is then read by the Slurm batch system adapter and is used to match VMs to their corresponding drones.

Therefore I believe that in principle, the Slurm batch system adapter should work with all site adapters.

OpenStack decides which hostname a VM gets. We don't use hostnames to uniquely identify VMs, because we only have a finite amount of hostnames and they are regularly reused. As Dirk said, unfortunately Slurm requires us to predefine all possible nodes.
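The Feature-based matching described above can be sketched roughly as follows. This is not the code of the Slurm batch system adapter; features_by_host is a hypothetical helper that only parses two-column output like the sinfo listing shown earlier in the thread.

```python
def features_by_host(sinfo_output):
    """Map each Slurm Feature (here: the drone identifier) to its NodeHost."""
    mapping = {}
    for line in sinfo_output.splitlines():
        fields = line.split()
        # Expect exactly two columns: Features and NodeHost.
        if len(fields) == 2:
            feature, host = fields
            mapping[feature] = host
    return mapping


sample = """\
nemo-16522615       host-10-20-40-16
nemo-16523462       host-10-20-40-37
nemo-16521076       host-10-20-40-9
"""

mapping = features_by_host(sample)
print(mapping["nemo-16522615"])
```

Because the lookup key is the Feature and not the hostname, reused hostnames do not cause wrong matches.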

@giffels
Member

giffels commented Mar 30, 2023

These are the updated drone_uuids, not the generated ones. In principle, that is just the hostname, right?

The number is the corresponding JOBID in moab. This ends up in 'remote_resource_uuid': 16524055 and 'drone_uuid': 'nemo-16524055'.

Yes, it is the JobId. Currently, the drone_uuid is updated to be the Moab JobId once after submitting the drone.

@giffels
Member

giffels commented Mar 30, 2023

@stefan-k Thanks a lot for throwing light on how this works internally.

I'm unsure why the hostname is needed. In our setup, startvm starts a VM on the NEMO node via OpenStack. It then attaches the MOAB job id to the VM meta information thingy. A script inside the VM reads this value (via the OpenStack API) and sets the Feature field of the node in Slurm to nemo-<jobid>. The Feature field is then read by the Slurm batch system adapter and is used to match VMs to their corresponding drones.

I had wrongly assumed that nemo-<jobid> was also the hostname of the VM, which is not the case. So let us forget about the hostname.
In principle, it would also be possible to add the drone_uuid instead of the MOAB job id via this approach, right? For example, it can be added to the environment of the Moab job, like it is done for the Slurm site adapter.

Therefore I believe that in principle, the Slurm batch system adapter should work with all site adapters.

Yes, that is true. One just needs to add the drone_uuid as a Feature to the Slurm node.

@giffels
Member

giffels commented Mar 30, 2023

In order to fix this issue, I would propose to:

  • Remove the line changing the drone_uuid from the Moab site adapter and add the usual TARDIS environment variables, including TardisDroneUuid, to the Moab job.
  • Ask @stefan-k and @dirksammel to update their scripts to add the TardisDroneUuid environment variable to the OpenStack meta information thingy and use it as the Slurm Feature of the node instead of the Moab JobId.

Edit: That would make the Moab site adapter consistent with the other site adapters.
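A minimal sketch of the first point, assuming the variables are passed through msub's -v variable list. build_msub_command, the startvm.sh script name and the TardisDroneCores entry are illustrative assumptions; only TardisDroneUuid is taken from this discussion, and this is not the code from the eventual fix.

```python
def build_msub_command(drone_uuid, startup_command, extra_env=None):
    """Build an msub call that exports TardisDroneUuid to the job."""
    environment = {"TardisDroneUuid": drone_uuid}
    environment.update(extra_env or {})
    # Assumption: -v takes a comma-separated list of NAME=value pairs.
    variable_list = ",".join(
        f"{name}={value}" for name, value in environment.items()
    )
    return f"msub -v {variable_list} {startup_command}"


command = build_msub_command(
    "nemo-8455542ca3", "startvm.sh", {"TardisDroneCores": 20}
)
print(command)
```

The script inside the VM could then read TardisDroneUuid from the job environment and set it as the Slurm Feature, so the drone_uuid stays the same from request to shutdown.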

@dirksammel
Contributor Author

Sounds good!

@giffels
Member

giffels commented Apr 5, 2023

#292 has been merged into the current master. @dirksammel could you give it a try, please?
If I get the green light from you, I will create a new release.

@stefan-k
Contributor

stefan-k commented Apr 5, 2023

@dirksammel is on vacation and will be back next week. I don't want to speak on his behalf, but we have a well laid out action plan (for once!), therefore I believe he should be able to provide feedback early next week.

@giffels
Member

giffels commented May 6, 2023

Is there any news on that? Does the fixed version work for you?

@dirksammel
Contributor Author

We're seeing a (maybe unrelated) problem with the AUDITOR plugin. I plan to look into that later this week.
Sorry for the delay!

@dirksammel
Contributor Author

I disabled the AUDITOR plugin on Friday and observed the situation during the weekend, and it looks good!
The number of booting/running machines in moab was as expected (also after a restart of the C/T service), and the information from the logs and the sqlite db is in agreement.

@giffels
Member

giffels commented May 16, 2023

Thanks a lot for confirming.
