Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resmgr: lifecycle overlap detection and workaround. #358

Merged

Conversation

klihub
Copy link
Collaborator

@klihub klihub commented Sep 17, 2024

We recently discovered a problem with the generated stream of container lifecycle events with some runtime versions. A side effect of this is that we get Create/Stop events for multiple container instances with seemingly overlapping lifecycle: the latter instance get created before the former one is stopped.

When undetected, such a false overlap might cause overcommit of resources, with both instances temporarily using the full resource set of the offending container. As a workaround, we now track containers also by fully qualified name ($namespace/pod/ctr) and internally generate an event for releasing the resources of the old instance whenever we notice that a creation event would cause a duplicate instance for the same name.

Use PrettyName()-compatible container name dumps together with
container IDs in NRI evnet dumps.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
Having loggers embedded in resmgr types as members was a bad idea.
Replace them with module global logger instances.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
@klihub klihub force-pushed the fixes/overlapping-lifecycle-workaround branch from 0d2f646 to c4b6aba Compare September 18, 2024 07:20
@klihub klihub marked this pull request as ready for review September 18, 2024 08:27
We recently discovered a problem with the generated stream of
container lifecycle events with some runtime versions. A side
effect of this is that we get Create/Stop events for multiple
container instances with seemingly overlapping lifecycle: the
latter instance get created before the former one is stopped.

When undetected, such a false overlap might cause overcommit
of resources, with both instances temporarily using the full
resource set of the container. As a workaround, we now track
containers also by fully qualified name ($namespace/pod/ctr)
and internally generate an event for releasing the resources
if the old instance whenever we notice that a creation event
would cause a duplicate instance for the same name.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
Copy link
Collaborator

@askervin askervin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@klihub klihub marked this pull request as draft September 19, 2024 13:27
@klihub klihub force-pushed the fixes/overlapping-lifecycle-workaround branch from f609d93 to 934044d Compare September 19, 2024 13:36
@klihub klihub marked this pull request as ready for review September 19, 2024 13:59
@klihub klihub requested a review from kad September 20, 2024 07:30
Copy link

@pfl pfl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the comments already on checkAllocations()

Check grants, looking for grants with stale allocations
or duplicate containers (detected using fully qualified
names). Dump total memory and CPU granted.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
@klihub klihub force-pushed the fixes/overlapping-lifecycle-workaround branch from 934044d to 63e4cdd Compare September 20, 2024 11:35
@klihub klihub requested a review from kad September 20, 2024 11:38
Copy link
Collaborator

@askervin askervin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@askervin askervin merged commit 14f103d into containers:main Sep 20, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants