
[Common] Skip cuDNN 9.10.0/9.10.1 due to bugs #1937


Open · wants to merge 3 commits into base: main

Conversation

@cyanguwa (Collaborator) commented Jul 8, 2025

Description

cuDNN 9.10.2 fixes a few bugs present in earlier 9.10.x releases, and this PR:

  • skips cuDNN 9.10.0 for SDPA FP8-related bugs,
  • skips cuDNN 9.10.0 and 9.10.1 for SDPA FP16/BF16-related bugs,
  • unifies the use of ModelConfig across all unit tests, and
  • fixes the FP8 and certain FP16/BF16 tests so they check backend availability before running any configs.
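The version gating described above can be sketched as a small predicate. This is a hypothetical illustration of the skip logic, not the actual Transformer Engine test code; the function name and the tuple-based version representation are assumptions.

```python
# Hypothetical sketch of the cuDNN version gating this PR applies.
# should_skip_sdpa_test() and the (major, minor, patch) tuples are
# illustrative assumptions, not Transformer Engine's actual API.
def should_skip_sdpa_test(cudnn_version, dtype):
    """Return True if this cuDNN version has known SDPA bugs for this dtype."""
    if dtype == "fp8":
        # FP8 SDPA bugs affect 9.10.0 only
        return cudnn_version == (9, 10, 0)
    if dtype in ("fp16", "bf16"):
        # FP16/BF16 SDPA bugs affect both 9.10.0 and 9.10.1
        return cudnn_version in ((9, 10, 0), (9, 10, 1))
    return False
```

In a pytest suite, a predicate like this would typically feed a `pytest.mark.skipif` or an early `pytest.skip()` call at the top of each affected test.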

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please see the description above.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

cyanguwa added 2 commits July 8, 2025 15:07
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
@cyanguwa (Collaborator, Author) commented Jul 8, 2025

/te-ci pytorch L1 L0

@cyanguwa cyanguwa requested a review from zhongbozhu July 8, 2025 22:25
@cyanguwa (Collaborator, Author) commented Jul 8, 2025

Pipeline 31306522 for 9.10.0 + 25.06 + 12.9
Pipeline 31306722 for 9.10.1 + 25.06 + 12.9

@cyanguwa cyanguwa added the 2.6.0 label Jul 9, 2025
@@ -47,6 +47,7 @@
from transformer_engine.pytorch.tensor.utils import replace_raw_data
from transformer_engine.pytorch.distributed import checkpoint
from test_numerics import reset_rng_states, dtype_tols
from fused_attn.test_fused_attn import ModelConfig, _get_attention_backends
A Collaborator commented:

test_fused_attn.py is kind of a strange place to keep ModelConfig. It's indicating that these tests are becoming over-tuned to attention. We could avoid this refactor by moving test_sanity_attention_extra_state into the fused attention tests:

def test_sanity_attention_extra_state(model, dtype):

Alternatively, we could move ModelConfig from test_fused_attn.py into utils.py. It has many attention-specific options though, so I'd prefer having separate implementations.
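To illustrate the reviewer's point that ModelConfig is over-tuned to attention, here is a minimal sketch of what such a config class might look like. The fields below are illustrative assumptions, not the actual `ModelConfig` from test_fused_attn.py.

```python
# Hypothetical sketch of an attention-oriented test config; the field
# names are assumptions meant to show why a shared utils.py home would
# drag attention-only options into unrelated tests.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    batch_size: int
    num_heads: int
    head_dim: int
    max_seqlen_q: int
    max_seqlen_kv: int
    attn_mask_type: str = "no_mask"   # attention-specific option
    attn_bias_type: str = "no_bias"   # attention-specific option
```

A generic test in utils.py would use only the first few fields, which is the reviewer's argument for keeping separate implementations rather than sharing one class.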

@zhongbozhu (Collaborator) left a comment

LGTM
