Fix leaking of tmpfs mount in CDI mode #1168

Open
wants to merge 3 commits into
base: main

ArangoGutierrez
Collaborator

@ArangoGutierrez ArangoGutierrez commented Jun 30, 2025

This patch fixes the handling of temporary files and directories in the NVIDIA container runtime hook, ensuring that tmpfs mounts are not leaked when the params file is handled inside the container in CDI mode.

Refactoring and File Handling Improvements:

Expanded Test Coverage:

  • tests/e2e/nvidia-container-toolkit_test.go: Added new tests to validate the behavior of disabling device node creation with both nvidia-container-runtime-hook in legacy mode and nvidia-container-runtime. These tests ensure that no device nodes are created by comparing mount states before and after running containers.

@ArangoGutierrez ArangoGutierrez self-assigned this Jun 30, 2025
Copilot

This comment was marked as outdated.

@coveralls

coveralls commented Jun 30, 2025

Pull Request Test Coverage Report for Build 16003545229

Details

  • 0 of 52 (0.0%) changed or added relevant lines in 1 file are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.09%) to 33.117%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| cmd/nvidia-cdi-hook/disable-device-node-modification/params_linux.go | 0 | 52 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| cmd/nvidia-cdi-hook/disable-device-node-modification/params_linux.go | 2 | 0.0% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 16002648947 | -0.09% |
| Covered Lines | 4381 |
| Relevant Lines | 13229 |

💛 - Coveralls

@ArangoGutierrez ArangoGutierrez added the bug Issue/PR to expose/discuss/fix a bug label Jun 30, 2025
@ArangoGutierrez ArangoGutierrez force-pushed the b/5363680 branch 2 times, most recently from 1b9c576 to 31828a6 on June 30, 2025 10:59
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez ArangoGutierrez force-pushed the b/5363680 branch 2 times, most recently from 38c25d5 to 1c79350 on June 30, 2025 12:00
This change ensures that the tmpfs mount created for the modified
NVIDIA params file does not leak to the host.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@@ -61,3 +79,31 @@ func createParamsFileInContainer(containerRootDirPath string, contents []byte) e
func createTmpFs(target string, size int) error {
return unix.Mount("tmpfs", target, "tmpfs", 0, fmt.Sprintf("size=%d", size))
}

func createFileInRoot(containerRootDirPath string, destinationPath string, mode os.FileMode) (string, error) {
Member

TODO: This function also exists in internal/ldconfig; we should move it to a separate package.

_, _, err = runner.Run("mkdir -p /tmp/empty")
Expect(err).ToNot(HaveOccurred())

_, _, err = runner.Run("mount | sort > /tmp/mounts.before")
Member

What about capturing the output of mount | sort as variables and then comparing these using the Gomega matchers?
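
As a rough sketch of that suggestion, the two `mount | sort` outputs could be held as strings and diffed in memory instead of being written to files under /tmp. The helper name `newMounts` is hypothetical and not part of this PR; it only assumes that each listing has one mount entry per line.

```go
package main

import (
	"sort"
	"strings"
)

// newMounts is a hypothetical helper (not part of the PR). It returns the
// mount entries that appear in after but not in before. Both arguments are
// raw `mount` output with one entry per line.
func newMounts(before, after string) []string {
	// Index every non-empty line of the "before" listing.
	seen := make(map[string]struct{})
	for _, line := range strings.Split(before, "\n") {
		if line = strings.TrimSpace(line); line != "" {
			seen[line] = struct{}{}
		}
	}

	// Collect lines of the "after" listing that were not present before.
	var leaked []string
	for _, line := range strings.Split(after, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		if _, ok := seen[line]; !ok {
			leaked = append(leaked, line)
		}
	}
	sort.Strings(leaked)
	return leaked
}
```

In the e2e suite the result could then be checked with a Gomega matcher, for example `Expect(newMounts(mountsBefore, mountsAfter)).To(BeEmpty())`, where `mountsAfter` is an assumed variable captured after the container exits. A failing assertion would then print the leaked entries directly.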

@@ -234,12 +240,28 @@ var _ = Describe("docker", Ordered, ContinueOnFailure, func() {
Expect(err).ToNot(HaveOccurred())
Expect(output).To(Equal("ModifyDeviceFiles: 0\n"))
})

//sudo docker run --runtime=nvidia --rm -ti -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all --mount type=bind,source=/tmp/empty,target=/empty,bind-propagation=shared ubuntu true
Member

What is the purpose of this comment?

Collaborator Author

Oh, sorry, that was a note to myself during development so I wouldn't forget that bit.

@@ -234,12 +240,28 @@ var _ = Describe("docker", Ordered, ContinueOnFailure, func() {
Expect(err).ToNot(HaveOccurred())
Expect(output).To(Equal("ModifyDeviceFiles: 0\n"))
})

//sudo docker run --runtime=nvidia --rm -ti -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all --mount type=bind,source=/tmp/empty,target=/empty,bind-propagation=shared ubuntu true
It("should work with nvidia-container-runtime", func(ctx context.Context) {
Member

Let's also add a case for --runtime=runc for the "legacy" code path. As a general question, could we add a tag to tests to indicate that it's targeting the legacy code path?

})

When("A container is run using CDI", Ordered, func() {
BeforeAll(func(ctx context.Context) {
_, _, err := runner.Run("docker pull ubuntu")
Expect(err).ToNot(HaveOccurred())

Member

Nit: Let's remove this from the diff.

@ArangoGutierrez ArangoGutierrez changed the title from "Fix Disabling device node creation hook by passing MS_PRIVATE flag during mount creation" to "Fix handling of temporary files and directories in the NVIDIA container runtime hook" Jun 30, 2025
@ArangoGutierrez ArangoGutierrez changed the title from "Fix handling of temporary files and directories in the NVIDIA container runtime hook" to "Fix createParamsFileInContainer func to prevent mount leaks when calling the NVIDIA container runtime hook" Jun 30, 2025
@ArangoGutierrez ArangoGutierrez force-pushed the b/5363680 branch 2 times, most recently from 50ac666 to 91f1a73 on June 30, 2025 15:48
@ArangoGutierrez ArangoGutierrez requested review from elezar and Copilot July 1, 2025 07:45

@Copilot Copilot AI left a comment

Pull Request Overview

This PR refactors how the NVIDIA container runtime hook creates and mounts its params file—switching to procfd-based APIs and secure file creation to prevent mount leaks—and adds end-to-end tests to ensure no host mounts remain after running a container.

  • Switch createParamsFileInContainer to use utils.WithProcfd and a createFileInRoot helper for safer tmpfs and bind mounts.
  • Introduce a secure, mknodat-based file creation function (createFileInRoot).
  • Add E2E tests that record host mounts before/after running containers in both legacy and nvidia runtimes to catch mount leaks.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
tests/e2e/nvidia-container-toolkit_test.go New “Disabling device node creation” suite: captures host mounts, runs containers, and asserts no new mounts
cmd/nvidia-cdi-hook/disable-device-node-modification/params_linux.go Refactored createParamsFileInContainer to use procfd mounts and secure file creation, replacing temp dir logic
Comments suppressed due to low confidence (1)

cmd/nvidia-cdi-hook/disable-device-node-modification/params_linux.go:44

  • [nitpick] The error message could include the target path (e.g., hookScratchDirPath) to make debugging mount failures clearer.
		return fmt.Errorf("failed to create tmpfs mount for params file: %w", err)

err = utils.WithProcfd(containerRootDirPath, nvidiaDriverParamsPath, func(nvidiaDriverParamsFdPath string) error {
return unix.Mount(modifiedParamsFile.Name(), nvidiaDriverParamsFdPath, "", unix.MS_BIND|unix.MS_RDONLY|unix.MS_NODEV|unix.MS_PRIVATE|unix.MS_NOSYMFOLLOW, "")
err = utils.WithProcfd(containerRootDirPath, modifiedParamsFilePath, func(modifiedParamsFileFdPath string) error {
modifiedParamsFile, err := os.OpenFile(modifiedParamsFileFdPath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0444)

Copilot AI Jul 1, 2025

The opened file descriptor for modifiedParamsFile is never closed, which can lead to resource leaks. Add a defer modifiedParamsFile.Close() immediately after opening.

tmpRoot, err := os.MkdirTemp("", "nvct-empty-dir*")
if err != nil {
return fmt.Errorf("failed to create temp root: %w", err)
hookScratchDirPath := "/var/run/nvidia-ctk-hook"

Copilot AI Jul 1, 2025

[nitpick] This magic path (/var/run/nvidia-ctk-hook) should be extracted to a package-level constant to improve clarity and avoid duplication.

Suggested change
hookScratchDirPath := "/var/run/nvidia-ctk-hook"

Comment on lines 227 to 223
_, _, err = runner.Run("mkdir -p /tmp/empty")
Expect(err).ToNot(HaveOccurred())

mountsBefore, _, err = runner.Run("mount | sort")
Expect(err).ToNot(HaveOccurred())
})

AfterAll(func() {
_, _, err := runner.Run("rm -rf /tmp/empty")
Expect(err).ToNot(HaveOccurred())

Copilot AI Jul 1, 2025

[nitpick] Using a fixed path (/tmp/empty) may cause collisions or flaky tests. Consider creating a unique temporary directory (e.g., via Ginkgo's TempDir() or mktemp) and cleaning it up afterward.

Suggested change
_, _, err = runner.Run("mkdir -p /tmp/empty")
Expect(err).ToNot(HaveOccurred())
mountsBefore, _, err = runner.Run("mount | sort")
Expect(err).ToNot(HaveOccurred())
})
AfterAll(func() {
_, _, err := runner.Run("rm -rf /tmp/empty")
Expect(err).ToNot(HaveOccurred())
tempDir := TempDir("", "empty")
mountsBefore, _, err = runner.Run("mount | sort")
Expect(err).ToNot(HaveOccurred())
})
AfterAll(func() {
// No explicit cleanup needed as TempDir() handles it automatically

@ArangoGutierrez ArangoGutierrez changed the title from "Fix createParamsFileInContainer func to prevent mount leaks when calling the NVIDIA container runtime hook" to "Fix leaking of tmpfs mount in CDI mode" Jul 1, 2025
… in container

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Labels
bug Issue/PR to expose/discuss/fix a bug
3 participants