
Add NVSwitch device ID for p6 instance type #2987


Merged: himani2411 merged 2 commits into aws:develop from nvdia-fabric-manager on Jul 11, 2025

Conversation

himani2411
Contributor

@himani2411 himani2411 commented Jul 3, 2025

Description of changes

  • Add the NVSwitch device ID for the p6 instance type, because NVIDIA Fabric Manager must be enabled for GPU health checks to be invoked.

Steps to find the device ID: https://nvidia.custhelp.com/app/answers/detail/a_id/2040/~/identifying-the-graphics-card-model-and-device-id-in-a-pc

  • This PR addresses a technical blocker and does not provide full support for the p6 instance type.
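
On a Linux host, one way to read a `vendor:device` ID is from `lspci -nn`, which prints the numeric IDs in square brackets next to the device names. The sketch below parses such output. It is only an illustration: `extract_pci_ids` is a hypothetical helper, and the sample text is hand-written in the `lspci -nn` format, not captured from a p6 instance.

```ruby
# Hypothetical helper: returns the unique vendor:device IDs found in
# `lspci -nn` output for one PCI vendor (default 10de, the NVIDIA vendor ID).
def extract_pci_ids(lspci_output, vendor: '10de')
  # Class codes such as [0680] contain no colon, so they are not matched.
  lspci_output.scan(/\[(#{vendor}:[0-9a-f]{4})\]/i).flatten.uniq
end

# Illustrative sample in the `lspci -nn` format (not real host output).
sample = <<~OUTPUT
  00:05.0 Ethernet controller [0200]: Amazon.com, Inc. Elastic Network Adapter (ENA) [1d0f:ec20]
  a0:00.0 Bridge [0680]: NVIDIA Corporation Device [10de:2901] (rev a1)
  a1:00.0 Bridge [0680]: NVIDIA Corporation Device [10de:2901] (rev a1)
OUTPUT

p extract_pci_ids(sample)
```

On a real instance the input would come from running `lspci -nn` rather than from a literal string.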

Tests

  • Launched a cluster with the p4d instance type and checked the following temporary log line (since removed from the code):
[2025-07-03T18:27:07+00:00] INFO: NVSwitch works 6

    * service[nvidia-fabricmanager] action start[2025-07-03T18:27:07+00:00] INFO: Processing service[nvidia-fabricmanager] action start (aws-parallelcluster-platform::test line 36)
 (up to date)

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

nvswitch_check_p4 = shell_out("lspci -d 10de:1af1 | wc -l")
nvswitch_check_p5 = shell_out("lspci -d 10de:22a3 | wc -l")
nvswitch_check_p4.stdout.strip.to_i + nvswitch_check_p5.stdout.strip.to_i
# NVSwitch device id is 10de:2901 for P6 instance

Contributor

Where did we get the device ID 10de:2901 from?
Is there a public reference that we can link in the PR?

Contributor Author

nvswitch_check_p4.stdout.strip.to_i + nvswitch_check_p5.stdout.strip.to_i
# NVSwitch device id is 10de:2901 for P6 instance
nvswitch_device_ids = ['10de:1af1', '10de:22a3', '10de:2901']
nvswitch_device_ids.sum { |id| shell_out("lspci -d #{id} | wc -l").stdout.strip.to_i }

Contributor

Why are we summing the switch counts across all device IDs rather than returning the count for the specific instance type?

Contributor Author

These device IDs are based on the GPU being used, so the solution is independent of the instance type: we use the device IDs of the GPUs that we know have NVSwitches.
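
The summing approach described in this reply can be sketched as a standalone snippet. This is an illustration under stated assumptions, not the cookbook's code: `count_nvswitches` and the injected `runner` lambda are hypothetical names standing in for Chef's `shell_out`, so the snippet runs without Chef or a GPU host. The device IDs are the ones listed in the diff.

```ruby
# NVSwitch PCI IDs (vendor:device), taken from the diff in this PR.
NVSWITCH_DEVICE_IDS = ['10de:1af1', '10de:22a3', '10de:2901'].freeze

# Hypothetical helper: sums the number of PCI devices matching each known ID.
# `runner` is an assumption standing in for Chef's shell_out; it takes a
# command string and returns that command's stdout as a String.
def count_nvswitches(device_ids = NVSWITCH_DEVICE_IDS, runner: ->(cmd) { `#{cmd}` })
  device_ids.sum { |id| runner.call("lspci -d #{id} | wc -l").strip.to_i }
end

# Fake runner simulating a host that exposes six devices with ID 10de:1af1
# and none with the other IDs.
fake_runner = lambda do |cmd|
  cmd.include?('10de:1af1') ? "6\n" : "0\n"
end

puts count_nvswitches(runner: fake_runner)
```

IDs that are absent from the host contribute zero, so when only one of the listed IDs is present the sum equals the count for that ID, and no instance-type lookup is needed.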

@@ -54,10 +54,10 @@ def _nvidia_driver_version

# Get number of nv switches
def get_nvswitches

Contributor

@gmarciani gmarciani Jul 3, 2025

Can we cover this change within the fabric manager spec test?

Contributor Author

I will try to cover this function with a unit test.

Contributor

Thanks for adding the unit test!
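
The unit test added in the PR is not shown in this conversation. As a rough idea of how `get_nvswitches` can be exercised without real hardware, here is a minimal sketch that stubs `shell_out` in plain Ruby. `FakeResult` and `FakeNode` are hypothetical names, the counts are arbitrary test values, and the actual spec in the cookbook may be structured differently.

```ruby
# Minimal stand-in for the object returned by Chef's shell_out; only the
# stdout reader used by the code under test is modelled.
FakeResult = Struct.new(:stdout)

# Hypothetical test double that hosts the logic under test.
class FakeNode
  attr_reader :commands

  def initialize(counts)
    @counts = counts # device ID => number of matching PCI devices
    @commands = []   # every command passed to shell_out, kept for assertions
  end

  # Stubbed shell_out: records the command and returns a canned count.
  def shell_out(cmd)
    @commands << cmd
    id = cmd[/lspci -d (\S+)/, 1]
    FakeResult.new("#{@counts.fetch(id, 0)}\n")
  end

  # Same logic as the lines added in this PR.
  def get_nvswitches
    nvswitch_device_ids = ['10de:1af1', '10de:22a3', '10de:2901']
    nvswitch_device_ids.sum { |id| shell_out("lspci -d #{id} | wc -l").stdout.strip.to_i }
  end
end

# Arbitrary count for the newly added ID; not a statement about real hardware.
host_with_new_id = FakeNode.new({ '10de:2901' => 4 })
puts host_with_new_id.get_nvswitches
puts host_with_new_id.commands.length
```

Recording the commands makes it possible to assert that each device ID is queried exactly once, in addition to checking the returned total.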

@himani2411 himani2411 force-pushed the nvdia-fabric-manager branch 2 times, most recently from 07dfcb0 to 1e9a845 Compare July 3, 2025 20:59
@himani2411 himani2411 force-pushed the nvdia-fabric-manager branch from 2951aac to b91084f Compare July 11, 2025 19:13
@himani2411 himani2411 enabled auto-merge (rebase) July 11, 2025 19:14
@himani2411 himani2411 merged commit 778d32a into aws:develop Jul 11, 2025
28 of 30 checks passed
2 participants