no ResourceState from SQLiteRegistry db after service crash/restart #291
Comments
Thanks for the report, what |
This seems to be okay. The The reason for this is that |
Yes, it's
|
Good question, where can I find this? 😅
|
The most recent one is |
Hm, somehow I don't get this version:
|
I think I have identified the problem. Actually it is a feature or bug in the Moab adapter. The adapter is changing the tardis/tardis/adapters/sites/moab.py Line 132 in 799757e
In earlier versions it worked that way, since the resource was only added to the So my questions to @dirksammel and @mschnepf: do we still need this feature? Moab is the only adapter that does it. |
Here the reason: #247 |
Two possible solutions:
I would prefer the first proposal. |
I also prefer the first proposal due to consistency. For systems with HTCondor as OBS it should work since the HTCondor batch system adapter has the |
I'm not sure, but I think this is happening here
Edit: Ah, but I guess that's not the drone uuid we need, right? |
These are the updated Two questions:
Edit: I am not a Slurm expert, but it seems that features can be defined using |
The number is the corresponding JOBID in moab. This ends up in
The hostnames of the VMs can be retrieved like this:
The hostname is defined by the IPs that are available for us. Influencing the hostname dynamically is not really possible, because slurm needs a pre-defined list of the available hostnames.
I'm not sure, @stefan-k should comment, but he's on vacation till Monday. |
AFAICT everything @dirksammel said is correct. I'm unsure why the hostname is needed. In our setup, Therefore I believe that in principle, the Slurm batch system adapter should work with all site adapters. OpenStack decides which hostname a VM gets. We don't use hostnames to uniquely identify VMs, because we only have a finite amount of hostnames and they are regularly reused. As Dirk said, unfortunately Slurm requires us to predefine all possible nodes. |
Yes, it is the JobId. Currently, the |
@stefan-k Thanks a lot for throwing light on how this works internally.
Unfortunately, I thought the
Yes, that is true. One just needs to add the |
In order to fix this issue, I would propose to ..
Edit: That would make the Moab site adapter consistent with the other site adapters. |
Sounds good! |
#292 has been merged to the current master. @dirksammel could you give it a try, please? |
@dirksammel is on vacation and will be back next week. I don't want to speak on his behalf, but we have a well laid out action plan (for once!), therefore I believe he should be able to provide feedback early next week. |
Is there any news on that? Does the fixed version work for you? |
We're seeing a (maybe unrelated) problem with the AUDITOR plugin. I plan to look into that later this week. |
I disabled the AUDITOR plugin on Friday and observed the situation during the weekend, and it looks good! |
Thanks a lot for confirming. |
Hey,
We seem to have some problems getting the ResourceState from the drone db after a crash/restart of the service.
As far as I know, this problem is not related to any recent update, but was always there (@stefan-k can maybe comment).
This is from the config:
We use docker, so the actual db is at /home/tardis/db/:
Some example that just happened:
The db looks like this:
(Maybe already strange that there are no values for ResourceState etc?)
The log showed these booting drones:
After a restart of the service, the drones from the db can be seen in the log:
but later this:
We now have 26 booting machines in moab, but only the recent 13 are shown in the log:
And no changes of the db entries during all of this.
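For anyone who wants to reproduce the "no changes of the db entries" check, here is a minimal sketch. It assumes nothing about the table layout the registry uses: it reads the table names from the SQLite file itself and compares two snapshots. The helper names `snapshot` and `changed_tables` are made up for illustration and are not part of TARDIS.

```python
import sqlite3


def snapshot(db_path):
    """Return {table_name: sorted list of row tuples} for every user table."""
    # Open read-only so that inspecting the file cannot modify it.
    con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        # Table names come from SQLite's own catalogue, not from any
        # assumed schema.
        tables = [
            name
            for (name,) in con.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        # Sort rows by their repr so the comparison ignores row order.
        return {
            table: sorted(
                con.execute(f'SELECT * FROM "{table}"').fetchall(), key=repr
            )
            for table in tables
        }
    finally:
        con.close()


def changed_tables(before, after):
    """Return the names of tables whose content differs between snapshots."""
    return sorted(
        table
        for table in before.keys() | after.keys()
        if before.get(table) != after.get(table)
    )
```

Take one snapshot before restarting the service and one afterwards, each time passing the path of the db file, then call `changed_tables(before, after)`. An empty list means no table content changed in between.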
Sorry for the log dump, but I hope you can understand our issue from this.
Please let me know if you need any further information!