Skip to content

[no-relnote] Add test to check CDI injection by default #1169

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ArangoGutierrez
Copy link
Collaborator

Now that the NVIDIA Container Toolkit uses CDI injection by default. This does not trigger the code in the nvidia-container-cli that creates the empty files on the host.

@ArangoGutierrez ArangoGutierrez self-assigned this Jun 30, 2025
Copilot

This comment was marked as outdated.

@coveralls
Copy link

coveralls commented Jun 30, 2025

Pull Request Test Coverage Report for Build 16028510793

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 33.21%

Totals Coverage Status
Change from base Build 16002648947: 0.0%
Covered Lines: 4381
Relevant Lines: 13192

💛 - Coveralls

@ArangoGutierrez ArangoGutierrez force-pushed the b/5366608 branch 2 times, most recently from 80b5ea1 to 13ca3e3 Compare June 30, 2025 13:04
@@ -240,6 +240,22 @@ var _ = Describe("docker", Ordered, ContinueOnFailure, func() {
BeforeAll(func(ctx context.Context) {
_, _, err := runner.Run("docker pull ubuntu")
Expect(err).ToNot(HaveOccurred())

err = buildDockerImage(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test should be in its own section. We can then test the different behaviours based on runtime mode. For example if we run docker run --runtime=runc --gpus all we would expect to see an error due to the path traversal.

Comment on lines 173 to 183
func buildDockerImage(runner Runner, tag string, args map[string]string, dockerfile string) error {
var buildArgs string
for k, v := range args {
buildArgs += fmt.Sprintf("--build-arg %s=%q ", k, v)
}

command := fmt.Sprintf("docker build -t %s %s - <<EOF\n%s\nEOF", tag, buildArgs, dockerfile)

_, _, err := runner.Run(command)
return err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the raw script is easier to follow.

_, _, err := runner.Run("docker run --rm --runtime=nvidia --gpus=all firmware-test")
Expect(err).To(HaveOccurred())

containerOutput, _, err := runner.Run("docker run --rm --runtime=runc --gpus=all firmware-test")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is EXPECTED to fail.

Comment on lines 277 to 278
_, _, err := runner.Run("docker run --rm --runtime=nvidia --gpus=all firmware-test")
Expect(err).To(HaveOccurred())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the v1.18.0 release this uses CDI by default which means that there will be NO error. This should be a different test case.

@ArangoGutierrez ArangoGutierrez force-pushed the b/5366608 branch 7 times, most recently from 8778e11 to f47a6be Compare July 1, 2025 07:45
@ArangoGutierrez ArangoGutierrez requested review from elezar and Copilot July 1, 2025 08:00
Copilot

This comment was marked as outdated.

Expect(err).ToNot(HaveOccurred())
})

It("should not create empty firmware files on the host when using CDI", func(ctx context.Context) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't actually check that the files are not created on this host.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now edited the test description to be in line with the context

Expect(output).To(BeEmpty())
})

It("should fail when using the pre-start hook", func(ctx context.Context) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

De we refer to this as "legacy" mode in other tests?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

label added

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not what I asked. In the other test that we explicitly want to trigger the legacy code path we use:

		It("should work with nvidia-container-runtime-hook", func(ctx context.Context) {

Let's update this one to be:

		It("should fail when using the nvidia-container-runtime-hook", func(ctx context.Context) {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test description edited

It("should not fail when using CDI", func(ctx context.Context) {
output, _, err := runner.Run("docker run --rm --runtime=nvidia --gpus=all firmware-test")
Expect(err).ToNot(HaveOccurred())
Expect(output).To(BeEmpty())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's update the tests to actually check that no files are created. Using an affected version of the toolkit we see:

$ docker run --rm -ti --gpus=all --runtime=nvidia firmware-test true
$ ls
gsp_ga10x.bin  gsp_tu10x.bin

So we would ensure that ls gsp_*.bin is empty.

_, _, err := runner.Run("docker pull ubuntu")
Expect(err).ToNot(HaveOccurred())

_, _, err = runner.Run(`docker build -t firmware-test --build-arg RM_VERSION="$(basename $(ls -d /lib/firmware/nvidia/*.*))" --build-arg CURRENT_DIR="$(pwd)" - <<EOF
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we set CURRENT_DIR to a tmp folder we can clean it up between each of the test cases to ensure that we're properly testing that we're NOT creating empty files on the host.

It("should fail when using the nvidia-container-runtime-hook", Label("legacy"), func(ctx context.Context) {
_, stderr, err := runner.Run("docker run --rm --runtime=runc --gpus=all firmware-test")
Expect(err).To(HaveOccurred())
Expect(stderr).To(ContainSubstring("nvidia-container-cli.real: mount error: path error:"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to this we should also check that files are not created.

Copilot

This comment was marked as outdated.

Comment on lines 272 to 274
tmpDir, _, err = runner.Run("mktemp -d")
Expect(err).ToNot(HaveOccurred())
tmpDir = strings.TrimSpace(tmpDir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there no per-test tmp dir that we can use?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what the BeforeEach is doing: it creates a new tmp dir per test (per If block)

Expect(err).ToNot(HaveOccurred())
tmpDir = strings.TrimSpace(tmpDir)

dockerBuildCmd := fmt.Sprintf(`docker build -t firmware-test --build-arg RM_VERSION="$(basename $(ls -d /lib/firmware/nvidia/*.*))" --build-arg CURRENT_DIR=%q - <<EOF`, tmpDir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does %q use produce double quotes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup -> https://gobyexample.com/string-formatting

"To double-quote strings as in Go source, use %q."

@ArangoGutierrez ArangoGutierrez force-pushed the b/5366608 branch 3 times, most recently from 49f57d6 to 244939c Compare July 2, 2025 11:44
@ArangoGutierrez ArangoGutierrez requested review from elezar and Copilot July 2, 2025 14:20
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR updates the test runner to propagate stderr on failures and adds an end-to-end test validating firmware file creation under CDI vs. legacy runtimes.

  • Propagate actual stderr from runner.Run when a script fails
  • Add new E2E tests to ensure CDI injection prevents host-side firmware file creation
  • Import fmt to support formatted commands in tests

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
tests/e2e/runner.go Return stderr.String() as the second value on errors
tests/e2e/nvidia-container-toolkit_test.go Introduce firmware-file-creation tests with setup and cleanup
Comments suppressed due to low confidence (2)

tests/e2e/nvidia-container-toolkit_test.go:307

  • This ignores the err return from ls; adding _, _, err := runner.Run(...) followed by Expect(err).ToNot(HaveOccurred()) will ensure any unexpected failures are caught.
			output, _, _ := runner.Run(fmt.Sprintf("ls -A %s", outputDir))

tests/e2e/nvidia-container-toolkit_test.go:277

  • [nitpick] The variable dockerBuildDockerfile holds the Dockerfile content; renaming it to something like dockerfileContent could improve clarity.
FROM ubuntu

@@ -124,7 +124,7 @@ func (r remoteRunner) Run(script string) (string, string, error) {
// Run the script
err = session.Run(script)
if err != nil {
return "", "", fmt.Errorf("script execution failed: %v\nSTDOUT: %s\nSTDERR: %s", err, stdout.String(), stderr.String())
return "", stderr.String(), fmt.Errorf("script execution failed: %v\nSTDOUT: %s\nSTDERR: %s", err, stdout.String(), stderr.String())
Copy link
Preview

Copilot AI Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider returning both stdout and stderr on error (e.g., return stdout.String(), stderr.String(), fmt.Errorf(...)) so callers have full context for debugging.

Suggested change
return "", stderr.String(), fmt.Errorf("script execution failed: %v\nSTDOUT: %s\nSTDERR: %s", err, stdout.String(), stderr.String())
return stdout.String(), stderr.String(), fmt.Errorf("script execution failed: %v\nSTDOUT: %s\nSTDERR: %s", err, stdout.String(), stderr.String())

Copilot uses AI. Check for mistakes.


AfterAll(func(ctx context.Context) {
if outputDir != "" {
runner.Run(fmt.Sprintf("rm -rf %s", outputDir))
Copy link
Preview

Copilot AI Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanup command's error is ignored; you may want to capture and Expect(err).ToNot(HaveOccurred()) to surface any failures during teardown.

Suggested change
runner.Run(fmt.Sprintf("rm -rf %s", outputDir))
_, _, err := runner.Run(fmt.Sprintf("rm -rf %s", outputDir))
Expect(err).ToNot(HaveOccurred())

Copilot uses AI. Check for mistakes.

@@ -257,4 +258,54 @@ var _ = Describe("docker", Ordered, ContinueOnFailure, func() {
Expect(libs).To(ContainElements([]string{"libcuda.so", "libcuda.so.1"}))
})
})

When("A container tries to create firmware files on the host", Ordered, func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe:

Suggested change
When("A container tries to create firmware files on the host", Ordered, func() {
When("Running a container where the firmware folder resolves outside the container root", Ordered, func() {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adopted

When("A container tries to create firmware files on the host", Ordered, func() {
var outputDir string
BeforeAll(func(ctx context.Context) {
pwd, _, err := runner.Run("pwd")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the os package?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

			pwd, err := os.Getwd()
			Expect(err).ToNot(HaveOccurred())

			workingDir := filepath.Join(pwd, "test-output")

			err = os.MkdirAll(workingDir, 0755)
			Expect(err).ToNot(HaveOccurred())

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember, we are running this over SSH, since the runner interface is designed to be agnostic, we can not use Go packages for system-level actions, we need to use bash via the runner

Now that the NVIDIA Container Toolkit uses CDI injection by default.
This does not trigger the code in the nvidia-container-cli that creates the empty files on the host.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants