Skip to content

[TP comm overlap unit test]CUDA Error: misaligned address error when testing with recent cublas (or pytorch container) #1332

Open
@erhoo82

Description

@erhoo82

I get CUDA Error: misaligned address when running the tp comm overlap unit test with recent pytorch container.
I think the error comes from the cublas versions that enables nvjet.

[rank1]: Traceback (most recent call last):
[rank1]:   File "/lustre/fsw/coreai_mlperf_training/slym/module_tests/tp_overlap/te.tp_tests/tests/pytorch/distributed/run_gemm_with_overlap.py", line 922, in <module>
[rank1]:     sys.exit(_main(_parse_args()))
[rank1]:              ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
[rank1]:     return f(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/fsw/coreai_mlperf_training/slym/module_tests/tp_overlap/te.tp_tests/tests/pytorch/distributed/run_gemm_with_overlap.py", line 721, in _main
[rank1]:     all_outputs = _fp8_gemm()
[rank1]:                   ^^^^^^^^^^^
[rank1]:   File "/lustre/fsw/coreai_mlperf_training/slym/module_tests/tp_overlap/te.tp_tests/tests/pytorch/distributed/run_gemm_with_overlap.py", line 602, in _fp8_gemm
[rank1]:     return tex.fp8_gemm(
[rank1]:            ^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/cpp_extensions/gemm.py", line 180, in fp8_gemm
[rank1]:     _ = fn(*args)
[rank1]:         ^^^^^^^^^
[rank1]: RuntimeError: /workspace/TransformerEngine/transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp:802 in function split_overlap_ag: CUDA Error: misaligned address

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions