Run-time checks for CUDA and cuBLAS versions #1938

Open: wants to merge 7 commits into base: main

timmoon10
Collaborator

Description

Some users have experienced errors when running TE on systems with older CUDA versions (#1585, #1922). The cause is that we build the core library (libtransformer_engine.so) with a recent CUDA version and distribute it in a pip wheel, so the compile-time and run-time CUDA versions may not match. This PR adds more careful version-checking logic, especially for cuBLAS GEMMs and CUDA multicast operations.

Closes #1585. Closes #1922.
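
For illustration only, the mismatch described above can be sketched in a few lines of Python. This is not the code in this PR: the function names, the feature name, and the 12.1 threshold are hypothetical. The only fact assumed is that CUDA encodes its version as 1000 * major + 10 * minor (e.g. 12080 for 12.8); in real code the run-time value would come from a call such as cudaRuntimeGetVersion rather than a literal.

```python
def decode_cuda_version(version: int) -> tuple[int, int]:
    """Split a CUDA version integer (1000 * major + 10 * minor) into (major, minor)."""
    return version // 1000, (version % 1000) // 10


def feature_supported(build_version: int, runtime_version: int, required_version: int) -> bool:
    """A feature is usable only if both the build and the loaded runtime are new enough."""
    return build_version >= required_version and runtime_version >= required_version


def require_feature(name: str, build_version: int, runtime_version: int,
                    required_version: int) -> None:
    """Raise a descriptive error up front instead of failing later inside a CUDA call."""
    if feature_supported(build_version, runtime_version, required_version):
        return
    req = "%d.%d" % decode_cuda_version(required_version)
    built = "%d.%d" % decode_cuda_version(build_version)
    found = "%d.%d" % decode_cuda_version(runtime_version)
    raise RuntimeError(
        f"{name} requires CUDA {req}+ (built with CUDA {built}, running with CUDA {found})"
    )


if __name__ == "__main__":
    # Hypothetical scenario: wheel built with CUDA 12.8, system provides CUDA 11.8,
    # and the feature is assumed (for illustration) to need CUDA 12.1.
    try:
        require_feature("multicast", 12080, 11080, 12010)
    except RuntimeError as err:
        print(err)
```

The point of checking both values is that a compile-time guard alone passes on the build machine and says nothing about the libraries actually loaded on the user's system.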

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Add run-time checks for CUDA and cuBLAS versions

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

timmoon10 added 2 commits July 9, 2025 02:28
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
timmoon10 added the bug label (Something isn't working) Jul 9, 2025
@timmoon10
Collaborator Author

/te-ci L1

Oleg-Goncharov
Oleg-Goncharov previously approved these changes Jul 9, 2025
Collaborator

Oleg-Goncharov left a comment

LGTM. Nice additions to improve the runtime versions check!

@timmoon10
Collaborator Author

/te-ci L1

@timmoon10
Collaborator Author

/te-ci L1

Labels
bug: Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

  • in function has_mnnvl_fabric: CUDA Error: invalid argument
  • cuBLAS Error
2 participants