[PyTorch debug] Improve precision debug tools performance #1909

Open
wants to merge 21 commits into
base: main
from

Conversation

pggPL
Collaborator

@pggPL pggPL commented Jun 30, 2025

Description

This PR speeds up layers that are not affected by any feature in a given iteration. Such layers should run exactly as fast as they do when the debug tools are not initialized.

I needed to fix 3 things:

  • There was a lot of CPU overhead in deciding whether a layer uses any feature in the current iteration: inspect_tensor_enabled and a few similar calls were made for each layer, iteration, and tensor. Calls like inspect_tensor_enabled may now return a tuple (bool, int), where the int tells in how many iterations the feature will next be enabled. If every tensor of a layer returns (bool, n), we run the non-debug layer for the next n iterations.
  • debug_api.step() is called after every iteration. Inside it, we call STATS_BUFFER.log(), which performs synchronization and some CPU ops even if no stats are logged. I disabled this logic for iterations in which no stat was logged.
  • COMM/GEMM overlap used to be disabled the whole time; now it is disabled only when the layer is affected by at least one feature.
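
To illustrate the first item, here is a minimal sketch of how a (bool, int) return value can be used to skip the per-tensor queries. It is not the code from this PR: `ToyLayer`, `every_n`, and `uses_debug_path` are hypothetical names, and the int is assumed to mean "iterations until the feature is next enabled".

```python
from typing import Callable, Dict, Tuple

# Hypothetical per-tensor query: (enabled_now, iterations_until_next_enabled).
TensorQuery = Callable[[int], Tuple[bool, int]]


def every_n(n: int) -> TensorQuery:
    """Toy feature that is enabled on iterations 0, n, 2n, ..."""

    def query(iteration: int) -> Tuple[bool, int]:
        if iteration % n == 0:
            return True, 0
        return False, n - iteration % n

    return query


class ToyLayer:
    """Caches how long the layer may stay on the non-debug path."""

    def __init__(self, queries: Dict[str, TensorQuery]):
        self.queries = queries
        self.skip_until = 0   # first iteration at which features are queried again
        self.query_calls = 0  # counts the (assumed expensive) per-tensor queries

    def uses_debug_path(self, iteration: int) -> bool:
        if iteration < self.skip_until:
            return False  # fast path: no per-tensor queries at all
        results = []
        for query in self.queries.values():
            self.query_calls += 1
            results.append(query(iteration))
        if any(enabled for enabled, _ in results):
            return True
        # No tensor needs a feature now: skip until the earliest next activation.
        self.skip_until = iteration + min(n for _, n in results)
        return False


layer = ToyLayer({"activation": every_n(100), "weight": every_n(50)})
debug_iters = [i for i in range(200) if layer.uses_debug_path(i)]
print(debug_iters, layer.query_calls)
```

In this toy setup the layer takes the debug path only at iterations 0, 50, 100, and 150, and issues 16 queries instead of the 400 that querying both tensors on every iteration would need.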

If stats are logged only every n iterations, this PR should make the run approach the speed of the non-debug workflow as n -> infinity.
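
The effect of skipping the per-step flush can be sketched with a toy buffer. This is only an illustration under stated assumptions: `ToyStatsBuffer`, `feed`, and `run` are hypothetical names and do not reflect the real STATS_BUFFER implementation; the counter stands in for the synchronization and CPU work.

```python
class ToyStatsBuffer:
    """Stand-in for a stats buffer whose flush is expensive (e.g. needs a sync)."""

    def __init__(self):
        self.pending = []           # stats recorded since the last flush
        self.expensive_flushes = 0  # placeholder for synchronization + CPU ops

    def feed(self, name, value):
        self.pending.append((name, value))

    def log(self):
        if not self.pending:
            return  # nothing was logged this iteration: skip the expensive part
        self.expensive_flushes += 1
        self.pending.clear()


def run(num_iters, log_every):
    """Log one stat every `log_every` iterations; flush after every iteration."""
    buf = ToyStatsBuffer()
    for it in range(num_iters):
        if it % log_every == 0:
            buf.feed("amax", float(it))
        buf.log()  # called unconditionally, like a per-step hook
    return buf.expensive_flushes


print(run(1000, 1), run(1000, 100))
```

With logging on every iteration the toy buffer flushes 1000 times; with logging every 100 iterations it flushes 10 times, so the per-step cost shrinks as n grows.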

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

pggPL added 3 commits June 27, 2025 09:10
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL pggPL force-pushed the nvinspect_performance branch from e2f237d to b5024af Compare July 2, 2025 17:10
pre-commit-ci bot and others added 3 commits July 2, 2025 17:10
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL pggPL force-pushed the nvinspect_performance branch from 58cf805 to 9893831 Compare July 2, 2025 20:26
pre-commit-ci bot and others added 3 commits July 2, 2025 20:27
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL pggPL force-pushed the nvinspect_performance branch from dfa89ee to a0ae480 Compare July 3, 2025 10:20
pre-commit-ci bot and others added 5 commits July 3, 2025 10:20
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL pggPL force-pushed the nvinspect_performance branch from 9522170 to 1343547 Compare July 3, 2025 14:33
pre-commit-ci bot and others added 3 commits July 3, 2025 14:33
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL pggPL force-pushed the nvinspect_performance branch from 968336f to 7c1a1f7 Compare July 3, 2025 14:43
@pggPL pggPL marked this pull request as ready for review July 3, 2025 14:44
@pggPL
Collaborator Author

pggPL commented Jul 3, 2025

PR ready for review, waiting for NVIDIA/nvidia-dlfw-inspect#7 to be merged to update version and run tests.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL pggPL force-pushed the nvinspect_performance branch from ba3d72e to 7322fc2 Compare July 3, 2025 20:42
@pggPL
Collaborator Author

pggPL commented Jul 3, 2025

/te-ci pytorch L1

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL
Collaborator Author

pggPL commented Jul 4, 2025

/te-ci pytorch

1 similar comment
@timmoon10
Collaborator

/te-ci pytorch

2 participants