
[Common] Skip cuDNN 9.10.0/9.10.1 due to bugs #1937


Open · wants to merge 3 commits into base: main

Conversation

@cyanguwa (Collaborator) commented Jul 8, 2025

Description

cuDNN 9.10.2 fixes a few bugs present in earlier 9.10.x releases, and this PR:

  • skips cuDNN 9.10.0 for SDPA FP8-related bugs,
  • skips cuDNN 9.10.0 and 9.10.1 for SDPA FP16/BF16-related bugs,
  • unifies the use of ModelConfig across all unit tests, and
  • fixes the FP8 and certain FP16/BF16 tests so they check backend availability before running any configs.
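The version gating described above can be sketched as a small predicate. This is a hypothetical illustration of the skip logic, not the actual Transformer Engine test code; the function name and the tuple-based version representation are assumptions.

```python
# Hypothetical sketch of the cuDNN version gating this PR applies.
# should_skip_sdpa_test() and the (major, minor, patch) tuples are
# illustrative assumptions, not Transformer Engine's actual API.
def should_skip_sdpa_test(cudnn_version, dtype):
    """Return True if this cuDNN version has known SDPA bugs for this dtype."""
    if dtype == "fp8":
        # FP8 SDPA bugs affect 9.10.0 only
        return cudnn_version == (9, 10, 0)
    if dtype in ("fp16", "bf16"):
        # FP16/BF16 SDPA bugs affect both 9.10.0 and 9.10.1
        return cudnn_version in ((9, 10, 0), (9, 10, 1))
    return False
```

In a pytest suite, a predicate like this would typically feed a `pytest.mark.skipif` or an early `pytest.skip()` call at the top of each affected test.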

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please see the description above.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

cyanguwa added 2 commits July 8, 2025 15:07
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
@cyanguwa (Collaborator, Author) commented Jul 8, 2025

/te-ci pytorch L1 L0

@cyanguwa cyanguwa requested a review from zhongbozhu July 8, 2025 22:25
@cyanguwa (Collaborator, Author) commented Jul 8, 2025

Pipeline 31306522 for 9.10.0 + 25.06 + 12.9
Pipeline 31306722 for 9.10.1 + 25.06 + 12.9

@cyanguwa cyanguwa added the 2.6.0 label Jul 9, 2025
@@ -47,6 +47,7 @@
from transformer_engine.pytorch.tensor.utils import replace_raw_data
from transformer_engine.pytorch.distributed import checkpoint
from test_numerics import reset_rng_states, dtype_tols
from fused_attn.test_fused_attn import ModelConfig, _get_attention_backends
A Collaborator commented:

test_fused_attn.py is kind of a strange place to keep ModelConfig. It's indicating that these tests are becoming over-tuned to attention. We could avoid this refactor by moving test_sanity_attention_extra_state into the fused attention tests:

def test_sanity_attention_extra_state(model, dtype):

Alternatively, we could move ModelConfig from test_fused_attn.py into utils.py. It has many attention-specific options though, so I'd prefer having separate implementations.
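To illustrate the reviewer's point that ModelConfig is over-tuned to attention, here is a minimal sketch of what such a config class might look like. The fields below are illustrative assumptions, not the actual `ModelConfig` from test_fused_attn.py.

```python
# Hypothetical sketch of an attention-oriented test config; the field
# names are assumptions meant to show why a shared utils.py home would
# drag attention-only options into unrelated tests.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    batch_size: int
    num_heads: int
    head_dim: int
    max_seqlen_q: int
    max_seqlen_kv: int
    attn_mask_type: str = "no_mask"   # attention-specific option
    attn_bias_type: str = "no_bias"   # attention-specific option
```

A generic test in utils.py would use only the first few fields, which is the reviewer's argument for keeping separate implementations rather than sharing one class.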

@zhongbozhu (Collaborator) left a comment

LGTM
