[JAX] Collective GEMM custom op + primitive + minimal supporting functions #1846

denera · 2025-06-03T20:22:19Z

Description

This PR introduces a new XLA custom op for calling nvte_cublas_gemm or the related comm+GEMM overlap algorithms, the accompanying JAX primitive, and the bare-minimum Python wrappers required to work with the custom call.

FWD/BWD autograd implementation will be tackled in a separate upcoming PR.

Type of change

Documentation change (change only to the documentation, either a fix or new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Alp Dener <adener@nvidia.com>

started GemmPrimitive, abstract done
Signed-off-by: Alp Dener <adener@nvidia.com>

gemm custom op working with BF16, needs testing for FP8/MXFP8
Signed-off-by: Alp Dener <adener@nvidia.com>

converted TE GEMM API to use ScaledTensor and added os ENV flag to use TE GEMM under general gemm() call
Signed-off-by: Alp Dener <adener@nvidia.com>

BF16 tests passing, FP8 tests should be passing but contracting_dims has a scoping issue
Signed-off-by: Alp Dener <adener@nvidia.com>

fp8 tests passing for E4M3, getting CUBLAS_STATUS_NOT_SUPPORTED for E5M2
Signed-off-by: Alp Dener <adener@nvidia.com>

updated GEMM API to use separate LHS and RHS quantizers instead of a QuantizerSet
Signed-off-by: Alp Dener <adener@nvidia.com>

new GemmPrimitive passing all Dense tests
Signed-off-by: Alp Dener <adener@nvidia.com>

import cleanup and reverted code chunk movement
Signed-off-by: Alp Dener <adener@nvidia.com>

removed unused .transpose() implementations from ScaledTensors
Signed-off-by: Alp Dener <adener@nvidia.com>

all custom call tests passing on Hopper, GEMM-related tests cover both GemmPrimitive and native JAX impl
Signed-off-by: Alp Dener <adener@nvidia.com>

removed direct calls to GemmPrimitive.enabled() from outside of cpp_extensions
Signed-off-by: Alp Dener <adener@nvidia.com>

removed unused changes to ScaledTensor classes and debug prints
Signed-off-by: Alp Dener <adener@nvidia.com>


Signed-off-by: Alp Dener <adener@nvidia.com>

all unit tests passing on H100x8 node
Signed-off-by: Alp Dener <adener@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci

linting fixes
Signed-off-by: Alp Dener <adener@nvidia.com>

fixed batch dimension numbers
Signed-off-by: Alp Dener <adener@nvidia.com>

fixed FP8 scale sharding rule when there are no FP8 scales
Signed-off-by: Alp Dener <adener@nvidia.com>

added error message for unsupported Shardy partitioner
Signed-off-by: Alp Dener <adener@nvidia.com>

fixed test tolerances for FP8 cases
Signed-off-by: Alp Dener <adener@nvidia.com>

fixed shardy test skip cases
Signed-off-by: Alp Dener <adener@nvidia.com>


denera requested review from phu0ngng and jberchtold-nvidia June 3, 2025 20:22

denera self-assigned this Jun 3, 2025

denera added the jax label Jun 3, 2025

This was referenced Jun 3, 2025

[JAX] Add collective GEMM without compute/communication overlap #1675

Closed

[JAX] Collective GEMM custom op with nvte_cublas_gemm (no comm. overlap) #1307

Closed

[C/JAX] Comm+GEMM Overlap API for TE/JAX #1337

Closed

denera force-pushed the jax/collective-gemm-api branch from 1a845e9 to e92c81a Compare June 4, 2025 17:47

denera and others added 22 commits June 13, 2025 04:55

minor unit test cleanup

da0709a

Signed-off-by: Alp Dener <adener@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

e5b933c

for more information, see https://pre-commit.ci

FP8 tests passing on Blackwell but MXFP8 outputs NaN

92dec51

Signed-off-by: Alp Dener <adener@nvidia.com>

Merge branch 'jax/nvte-cublas-gemm-op' of github.com:denera/TransformerEngine into jax/nvte-cublas-gemm-op

50d319b

reverted dense and fuseddense changes, FP8 test passing on Hopper and Blackwell, MXFP8 has issues with E5M2

9eba586

Signed-off-by: Alp Dener <adener@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

b80e284

for more information, see https://pre-commit.ci

MXFP8 issue traced to scale factor padding with NaNs instead of zeros

a7aa2f4

Signed-off-by: Alp Dener <adener@nvidia.com>

padding scale with 2^-127 instead of nans

1be8773

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

fix bug on rhs_scale_inv usage

75008de

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

cleanup E8M0 type converter use it in gemm.cpp

5b0c1f5

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

segfault fixed, passing all unittests on Blackwell

b49d586

Signed-off-by: Alp Dener <adener@nvidia.com>

merge with main

b760460

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

fix for fuseddense tests

bd9bca3

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

fix workspace alignment

8fcb1bb

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

b2b4159

for more information, see https://pre-commit.ci

Merge remote-tracking branch 'upstream/main' into jax/nvte-cublas-gemm-op

17d7a51

[pre-commit.ci] auto fixes from pre-commit.com hooks

ddaaab9

for more information, see https://pre-commit.ci

moved reshape of encoder output in encoder examples to make custom partitioning rules work correctly

44e5b81

Signed-off-by: Alp Dener <adener@nvidia.com>

Merge remote-tracking branch 'upstream/main' into jax/nvte-cublas-gemm-op

a281c97

added helper functions for padding and unpadding block scales, changed GemmPrimitive to accept unpadded scales and pad them after sharding

b8ca0b1

Signed-off-by: Alp Dener <adener@nvidia.com>
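
The MXFP8 NaN issue recorded in the commits above (a7aa2f4, 1be8773) comes down to which byte is used to pad the block scales. Assuming the scales are stored as E8M0 (an 8-bit biased exponent with bias 127, where 0xFF is reserved for NaN), a padding byte of 0x00 decodes to 2^-127, while 0xFF decodes to NaN. The sketch below only illustrates that encoding; the helper names are hypothetical and are not the functions added in this PR.

```python
import math

E8M0_BIAS = 127   # exponent bias of the E8M0 scale format
E8M0_NAN = 0xFF   # the single bit pattern reserved for NaN


def e8m0_to_float(byte: int) -> float:
    """Decode one E8M0 scale byte into a Python float."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("E8M0 values must fit in one byte")
    if byte == E8M0_NAN:
        return math.nan
    # E8M0 has no sign and no mantissa: the value is 2^(byte - bias).
    return math.ldexp(1.0, byte - E8M0_BIAS)


def pad_scales(scales: list[int], padded_len: int, pad_byte: int = 0x00) -> list[int]:
    """Hypothetical helper: right-pad raw scale bytes to padded_len.

    The default pad byte 0x00 decodes to 2^-127, so padded entries stay finite.
    """
    if padded_len < len(scales):
        raise ValueError("padded_len must not be smaller than the input")
    return scales + [pad_byte] * (padded_len - len(scales))


def unpad_scales(padded: list[int], orig_len: int) -> list[int]:
    """Hypothetical helper: drop the padding added by pad_scales."""
    return padded[:orig_len]


raw = [127, 128, 126]                       # decodes to 1.0, 2.0, 0.5
safe = pad_scales(raw, 8)                   # padded with 0x00 (2^-127)
unsafe = pad_scales(raw, 8, pad_byte=0xFF)  # padded with the NaN pattern

print([e8m0_to_float(b) for b in safe])
print(any(math.isnan(e8m0_to_float(b)) for b in unsafe))
```

Padding with a finite value matters whenever a kernel reads the padded region: a NaN scale can propagate into the output, whereas 2^-127 cannot.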

denera and others added 4 commits June 27, 2025 16:39

Merge remote-tracking branch 'upstream/main' into jax/nvte-cublas-gemm-op

3ee96ba

[pre-commit.ci] auto fixes from pre-commit.com hooks

7187582

for more information, see https://pre-commit.ci

stashing

0b7692a

Signed-off-by: Alp Dener <adener@nvidia.com>

both AG and RS overlaps working

77eaa63

Signed-off-by: Alp Dener <adener@nvidia.com>

denera force-pushed the jax/collective-gemm-api branch from 4575b98 to 77eaa63 Compare July 2, 2025 07:23

Comm+GEMM overlap working with row-parallel DenseGeneral FWD/BWD

aeddd66

Signed-off-by: Alp Dener <adener@nvidia.com>

denera force-pushed the jax/collective-gemm-api branch from 6b9dc0e to aeddd66 Compare July 4, 2025 07:42

fixed AG->GEMM overlap auxiliary output for all-gathered LHS copy

74ab649

Signed-off-by: Alp Dener <adener@nvidia.com>

denera force-pushed the jax/collective-gemm-api branch from 49536b2 to 74ab649 Compare July 4, 2025 07:57

denera added 2 commits July 4, 2025 08:22

comm+GEMM overlap working for column-parallel layernorm_dense FWD/BWD

1c7d5a3

Signed-off-by: Alp Dener <adener@nvidia.com>

comm+GEMM overlap working with layernorm_mlp FWD/BWD

b4ff961

Signed-off-by: Alp Dener <adener@nvidia.com>

denera force-pushed the jax/collective-gemm-api branch from 3f1214e to b4ff961 Compare July 4, 2025 09:03

te.flax modules updated for comm+GEMM overlap but untested

95564fc

Signed-off-by: Alp Dener <adener@nvidia.com>

denera force-pushed the jax/collective-gemm-api branch from 6762c45 to 95564fc Compare July 4, 2025 10:23

[pre-commit.ci] auto fixes from pre-commit.com hooks

3330052

for more information, see https://pre-commit.ci
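
Reading "AG" and "RS" in the commits above as all-gather and reduce-scatter, the two collective GEMM patterns can be checked against a plain GEMM. The sketch below simulates the devices with Python lists and shows only the arithmetic, not the communication overlap or any TE/JAX API; all function names are hypothetical.

```python
def matmul(a, b):
    """Plain (M, K) x (K, N) matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def ag_then_gemm(lhs_shards, rhs, n_dev):
    """Column-parallel pattern: all-gather the row-sharded LHS, then each
    device multiplies it by its own column shard of the RHS."""
    gathered = [row for shard in lhs_shards for row in shard]  # all-gather
    n = len(rhs[0])
    assert n % n_dev == 0, "N must divide evenly across devices"
    cols = n // n_dev
    return [
        matmul(gathered, [row[d * cols:(d + 1) * cols] for row in rhs])
        for d in range(n_dev)
    ]


def gemm_then_rs(lhs, rhs, n_dev):
    """Row-parallel pattern: K is sharded, each device produces a partial
    (M, N) result, and reduce-scatter sums the partials and splits the rows."""
    k, m = len(rhs), len(lhs)
    assert k % n_dev == 0 and m % n_dev == 0, "K and M must divide evenly"
    ks = k // n_dev
    partials = [
        matmul([row[d * ks:(d + 1) * ks] for row in lhs], rhs[d * ks:(d + 1) * ks])
        for d in range(n_dev)
    ]
    total = partials[0]
    for p in partials[1:]:
        total = add(total, p)  # the "reduce" half of reduce-scatter
    rows = m // n_dev
    return [total[d * rows:(d + 1) * rows] for d in range(n_dev)]  # the "scatter" half


lhs = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
rhs = [[1, 0], [0, 1], [2, 3], [4, 5]]
full = matmul(lhs, rhs)

ag_out = ag_then_gemm([lhs[:2], lhs[2:]], rhs, n_dev=2)
rs_out = gemm_then_rs(lhs, rhs, n_dev=2)

# Concatenating per-device outputs reproduces the unsharded GEMM.
print([a + b for a, b in zip(ag_out[0], ag_out[1])] == full)
print([row for shard in rs_out for row in shard] == full)
```

An overlapped implementation interleaves the communication with chunks of the GEMM instead of running them back to back, but it should produce the same values as this reference.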
