[TP comm overlap unit test] `CUDA Error: misaligned address` when testing with a recent cuBLAS (or PyTorch container)

I get `CUDA Error: misaligned address` when running the TP comm overlap unit test in a recent PyTorch container. The error is raised from `split_overlap_ag` during an FP8 GEMM (`tex.fp8_gemm`).
I think the error comes from cuBLAS versions that enable `nvjet`.
```
[rank1]: Traceback (most recent call last):
[rank1]:   File "/lustre/fsw/coreai_mlperf_training/slym/module_tests/tp_overlap/te.tp_tests/tests/pytorch/distributed/run_gemm_with_overlap.py", line 922, in <module>
[rank1]:     sys.exit(_main(_parse_args()))
[rank1]:              ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
[rank1]:     return f(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/fsw/coreai_mlperf_training/slym/module_tests/tp_overlap/te.tp_tests/tests/pytorch/distributed/run_gemm_with_overlap.py", line 721, in _main
[rank1]:     all_outputs = _fp8_gemm()
[rank1]:                   ^^^^^^^^^^^
[rank1]:   File "/lustre/fsw/coreai_mlperf_training/slym/module_tests/tp_overlap/te.tp_tests/tests/pytorch/distributed/run_gemm_with_overlap.py", line 602, in _fp8_gemm
[rank1]:     return tex.fp8_gemm(
[rank1]:            ^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/cpp_extensions/gemm.py", line 180, in fp8_gemm
[rank1]:     _ = fn(*args)
[rank1]:         ^^^^^^^^^
[rank1]: RuntimeError: /workspace/TransformerEngine/transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp:802 in function split_overlap_ag: CUDA Error: misaligned address
```
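
A possible way to narrow this down, as a minimal sketch only: check whether every chunk of a buffer that is split evenly across ranks starts on an aligned address. The helper name `misaligned_chunks` and the 16-byte default are assumptions for illustration; nothing in the traceback confirms which alignment the failing kernel needs or that chunk offsets are the cause. In PyTorch the base address could be taken from `Tensor.data_ptr()`.

```python
def misaligned_chunks(base_ptr, total_bytes, num_chunks, alignment=16):
    """Return the indices of chunks whose start address is not a multiple of `alignment`.

    Hypothetical diagnostic helper: the buffer is assumed to be split into
    `num_chunks` equal, contiguous pieces, and `alignment` (in bytes) is an
    assumed requirement, not one taken from the cuBLAS documentation.
    """
    if num_chunks <= 0 or alignment <= 0:
        raise ValueError("num_chunks and alignment must be positive")
    if total_bytes % num_chunks != 0:
        raise ValueError("total_bytes must divide evenly into num_chunks")

    chunk_bytes = total_bytes // num_chunks
    return [
        i
        for i in range(num_chunks)
        if (base_ptr + i * chunk_bytes) % alignment != 0
    ]


# Illustrative address only; a real one would come from tensor.data_ptr().
base = 0x7F0000000000

# Two chunks of 24 bytes: the second chunk starts 24 bytes in, which is not 16-byte aligned.
print(misaligned_chunks(base, total_bytes=48, num_chunks=2))  # [1]

# Two chunks of 32 bytes: both chunk starts are 16-byte aligned.
print(misaligned_chunks(base, total_bytes=64, num_chunks=2))  # []
```

If a check like this reports misaligned chunks for the shapes used in the test, that would be consistent with a kernel that requires stricter alignment than before, but it would not by itself prove that `nvjet` is responsible.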

Issue #1332