-
Notifications
You must be signed in to change notification settings - Fork 365
Fix leaking of tmpfs mount in CDI mode #1168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Pull Request Test Coverage Report for Build 16003545229Details
💛 - Coveralls |
1b9c576
to
31828a6
Compare
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
38c25d5
to
1c79350
Compare
This change ensures that the tmpfs mount created for the modified NVIDIA params file does not leak to the host. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> Signed-off-by: Evan Lezar <elezar@nvidia.com>
@@ -61,3 +79,31 @@ func createParamsFileInContainer(containerRootDirPath string, contents []byte) e | |||
func createTmpFs(target string, size int) error { | |||
return unix.Mount("tmpfs", target, "tmpfs", 0, fmt.Sprintf("size=%d", size)) | |||
} | |||
|
|||
func createFileInRoot(containerRootDirPath string, destinationPath string, mode os.FileMode) (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: This function also exists in internal/ldconfig
we should move this to a separate pacakge.
_, _, err = runner.Run("mkdir -p /tmp/empty") | ||
Expect(err).ToNot(HaveOccurred()) | ||
|
||
_, _, err = runner.Run("mount | sort > /tmp/mounts.before") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about capturing the output of mount | sort
as variables and then comparing these using the Gomega matchers?
@@ -234,12 +240,28 @@ var _ = Describe("docker", Ordered, ContinueOnFailure, func() { | |||
Expect(err).ToNot(HaveOccurred()) | |||
Expect(output).To(Equal("ModifyDeviceFiles: 0\n")) | |||
}) | |||
|
|||
//sudo docker run --runtime=nvidia --rm -ti -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all --mount type=bind,source=/tmp/empty,target=/empty,bind-propagation=shared ubuntu true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, sorry it was me during devel, to not forget that bit
@@ -234,12 +240,28 @@ var _ = Describe("docker", Ordered, ContinueOnFailure, func() { | |||
Expect(err).ToNot(HaveOccurred()) | |||
Expect(output).To(Equal("ModifyDeviceFiles: 0\n")) | |||
}) | |||
|
|||
//sudo docker run --runtime=nvidia --rm -ti -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all --mount type=bind,source=/tmp/empty,target=/empty,bind-propagation=shared ubuntu true | |||
It("should work with nvidia-container-runtime", func(ctx context.Context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also add a case for --runtime=runc
for the "legacy" code path. As a general question, could we add a tag to tests to indicate that it's targeting the legacy code path?
}) | ||
|
||
When("A container is run using CDI", Ordered, func() { | ||
BeforeAll(func(ctx context.Context) { | ||
_, _, err := runner.Run("docker pull ubuntu") | ||
Expect(err).ToNot(HaveOccurred()) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Let's remove this from the diff.
50ac666
to
91f1a73
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors how the NVIDIA container runtime hook creates and mounts its params file—switching to procfd-based APIs and secure file creation to prevent mount leaks—and adds end-to-end tests to ensure no host mounts remain after running a container.
- Switch
createParamsFileInContainer
to useutils.WithProcfd
and acreateFileInRoot
helper for safer tmpfs and bind mounts. - Introduce a secure, mknodat-based file creation function (
createFileInRoot
). - Add E2E tests that record host mounts before/after running containers in both legacy and nvidia runtimes to catch mount leaks.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
tests/e2e/nvidia-container-toolkit_test.go | New “Disabling device node creation” suite: captures host mounts, runs containers, and asserts no new mounts |
cmd/nvidia-cdi-hook/disable-device-node-modification/params_linux.go | Refactored createParamsFileInContainer to use procfd mounts and secure file creation, replacing temp dir logic |
Comments suppressed due to low confidence (1)
cmd/nvidia-cdi-hook/disable-device-node-modification/params_linux.go:44
- [nitpick] The error message could include the target path (e.g.,
hookScratchDirPath
) to make debugging mount failures clearer.
return fmt.Errorf("failed to create tmpfs mount for params file: %w", err)
err = utils.WithProcfd(containerRootDirPath, nvidiaDriverParamsPath, func(nvidiaDriverParamsFdPath string) error { | ||
return unix.Mount(modifiedParamsFile.Name(), nvidiaDriverParamsFdPath, "", unix.MS_BIND|unix.MS_RDONLY|unix.MS_NODEV|unix.MS_PRIVATE|unix.MS_NOSYMFOLLOW, "") | ||
err = utils.WithProcfd(containerRootDirPath, modifiedParamsFilePath, func(modifiedParamsFileFdPath string) error { | ||
modifiedParamsFile, err := os.OpenFile(modifiedParamsFileFdPath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0444) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The opened file descriptor for modifiedParamsFile
is never closed, which can lead to resource leaks. Add a defer modifiedParamsFile.Close()
immediately after opening.
Copilot uses AI. Check for mistakes.
tmpRoot, err := os.MkdirTemp("", "nvct-empty-dir*") | ||
if err != nil { | ||
return fmt.Errorf("failed to create temp root: %w", err) | ||
hookScratchDirPath := "/var/run/nvidia-ctk-hook" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] This magic path (/var/run/nvidia-ctk-hook
) should be extracted to a package-level constant to improve clarity and avoid duplication.
hookScratchDirPath := "/var/run/nvidia-ctk-hook" |
Copilot uses AI. Check for mistakes.
_, _, err = runner.Run("mkdir -p /tmp/empty") | ||
Expect(err).ToNot(HaveOccurred()) | ||
|
||
mountsBefore, _, err = runner.Run("mount | sort") | ||
Expect(err).ToNot(HaveOccurred()) | ||
}) | ||
|
||
AfterAll(func() { | ||
_, _, err := runner.Run("rm -rf /tmp/empty") | ||
Expect(err).ToNot(HaveOccurred()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Using a fixed path (/tmp/empty
) may cause collisions or flaky tests. Consider creating a unique temporary directory (e.g., via Ginkgo's TempDir()
or mktemp
) and cleaning it up afterward.
_, _, err = runner.Run("mkdir -p /tmp/empty") | |
Expect(err).ToNot(HaveOccurred()) | |
mountsBefore, _, err = runner.Run("mount | sort") | |
Expect(err).ToNot(HaveOccurred()) | |
}) | |
AfterAll(func() { | |
_, _, err := runner.Run("rm -rf /tmp/empty") | |
Expect(err).ToNot(HaveOccurred()) | |
tempDir := TempDir("", "empty") | |
mountsBefore, _, err = runner.Run("mount | sort") | |
Expect(err).ToNot(HaveOccurred()) | |
}) | |
AfterAll(func() { | |
// No explicit cleanup needed as TempDir() handles it automatically |
Copilot uses AI. Check for mistakes.
… in container Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
91f1a73
to
70cb383
Compare
This patch fixes the handling of temporary files and directories in the NVIDIA container runtime hook, ensuring we don't leak tmpfs mounts when handling the params file in the container when in CDI mode.
Refactoring and File Handling Improvements:
cmd/nvidia-cdi-hook/disable-device-node-modification/params_linux.go
: RefactoredcreateParamsFileInContainer
to use a predefined scratch directory (/var/run/nvidia-ctk-hook
) instead of temporary directories. Introduced thecreateFileInRoot
function for secure file creation usingsecurejoin
andMknodat
.Expanded Test Coverage:
tests/e2e/nvidia-container-toolkit_test.go
: Added new tests to validate the behavior of disabling device node creation with bothnvidia-container-runtime-hook
in legacy mode andnvidia-container-runtime
. These tests ensure that no device nodes are created by comparing mount states before and after running containers.