
[PyTorch Debug] More advanced stats for Quantized Tensors #1897


Open · wants to merge 24 commits into main

Conversation

@pggPL pggPL (Collaborator) commented Jun 26, 2025

Description

This PR adds more statistics for Quantized Tensors that can be used to debug FP8 convergence issues.

It also adds the inspect_tensor_all API call, which was necessary to support this feature.
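As a rough illustration of the kind of statistics useful for debugging FP8 convergence (the names and thresholds below are hypothetical, not the PR's actual implementation, and the real code operates on PyTorch tensors rather than Python lists):

```python
# Illustrative sketch: per-tensor stats that help diagnose FP8 convergence
# issues, e.g. what fraction of values underflow toward zero or saturate
# at the format maximum after scaling into FP8 range.
FP8_E4M3_MAX = 448.0            # largest representable E4M3 magnitude
FP8_E4M3_MIN_NORMAL = 2.0 ** -6 # smallest normal E4M3 magnitude

def fp8_debug_stats(values, scale):
    """Return hypothetical debug statistics for `values` scaled by `scale`."""
    scaled = [abs(v) * scale for v in values]
    n = len(scaled)
    return {
        # values below the smallest normal may lose most precision or flush to zero
        "underflow%": 100.0 * sum(s < FP8_E4M3_MIN_NORMAL for s in scaled) / n,
        # values above the format maximum get clipped
        "overflow%": 100.0 * sum(s > FP8_E4M3_MAX for s in scaled) / n,
        "amax": max(abs(v) for v in values),
    }

stats = fp8_debug_stats([1e-8, 0.5, 1.0, 1000.0], scale=1.0)
```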

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)

  • Bug fix (non-breaking change which fixes an issue)

  • New feature (non-breaking change which adds functionality)

  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

  • Infra/Build change

  • Code refactoring

  • I have read and followed the contributing guidelines

  • The functionality is complete

  • I have commented my code, particularly in hard-to-understand areas

  • I have made corresponding changes to the documentation

  • My changes generate no new warnings

  • I have added tests that prove my fix is effective or that my feature works

  • New and existing unit tests pass locally with my changes

pggPL and others added 23 commits June 26, 2025 11:59
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL pggPL force-pushed the debug_log_more_fp8_stats branch from 30688dd to b78a504 Compare July 4, 2025 14:57
@pggPL pggPL marked this pull request as ready for review July 4, 2025 14:59
@@ -231,6 +233,9 @@ def inspect_tensor(
tp_group: torch.distributed.ProcessGroup,
) -> None:
"""
This is a legacy call; we advise using *inspect_tensor_all* and *inspect_tensor_all_enabled* instead.
Collaborator
If we don't want users to call inspect_tensor and other legacy functions, we should raise a deprecation warning.

Also, it seems like a shame to so quickly abandon these function names and replace them with clunky _all variants. Is there a way to maintain backward compatibility? Maybe have kwargs like high_precision_tensors and quantized_tensors, and have quantized_tensors=False by default.
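The reviewer's two suggestions could be sketched as follows (all names and signatures here are illustrative assumptions, not the PR's actual API): keep the original function name with opt-in kwargs whose defaults reproduce legacy behavior, and make any remaining legacy entry point emit a DeprecationWarning.

```python
import warnings

def inspect_tensor(tensor, *, high_precision_tensors=True, quantized_tensors=False):
    """Backward-compatible entry point: defaults reproduce the legacy behavior."""
    stats = {}
    if high_precision_tensors:
        stats["high_precision_amax"] = max(abs(x) for x in tensor)
    if quantized_tensors:
        # placeholder for post-quantization FP8 statistics
        stats["quantized_amax"] = max(abs(x) for x in tensor)
    return stats

def inspect_tensor_legacy(tensor):
    """If a legacy name must stay, it can warn instead of silently diverging."""
    warnings.warn(
        "inspect_tensor_legacy is deprecated; use "
        "inspect_tensor(..., quantized_tensors=True) instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return inspect_tensor(tensor)
```

With this shape, existing callers of the original function see no behavior change, while new callers opt into quantized-tensor statistics explicitly.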

It allows inspecting both quantized and high-precision tensors.
The feature LogFp8TensorStats uses this call to collect FP8 statistics after quantization.

If tensor and the transpose are quantized differently
Collaborator

@timmoon10 timmoon10 Jul 9, 2025

Accidentally deleted some of the docstring?

Labels: None yet
Projects: None yet
2 participants