Pipeline-aware CPU offload #1886


Open · wants to merge 3 commits into base: main
Conversation


@liuzhenhai93 liuzhenhai93 commented Jun 17, 2025

Description

Pipeline-aware CPU offload

Type of change

  • New feature (non-breaking change which adds functionality)

Changes

Please list the changes introduced in this PR:

  • Pipeline-aware CPU offload

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

self._b_event = PipelineOffloadManager.get_instance()._b_event
self.do_offload = offload

def is_first_last_layer(self):
Contributor

The naming is ambiguous. It returns true when the current layer is the last layer and the current vpp chunk is the first one or the last one; is my understanding correct?
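A minimal sketch of the semantics the reviewer describes, assuming hypothetical names (`PipelineChunkState`, `layer_idx`, `vpp_rank`, etc. are illustrative, not the PR's actual attributes): the method returns true only when the current layer is the last layer of its chunk and that chunk is the first or last virtual-pipeline (vpp) chunk.

```python
class PipelineChunkState:
    """Illustrative stand-in for the per-chunk state tracked by the offload
    manager (all names here are assumptions, not the PR's actual API)."""

    def __init__(self, layer_idx, num_layers, vpp_rank, num_vpp_chunks):
        self.layer_idx = layer_idx            # index of current layer within the chunk
        self.num_layers = num_layers          # layers per vpp chunk
        self.vpp_rank = vpp_rank              # which vpp chunk this is
        self.num_vpp_chunks = num_vpp_chunks  # total vpp chunks on this stage

    def is_first_last_layer(self):
        # True iff: this is the last layer of the chunk, AND the chunk is
        # the first or the last vpp chunk on this pipeline stage.
        is_last_layer = self.layer_idx == self.num_layers - 1
        is_edge_chunk = self.vpp_rank in (0, self.num_vpp_chunks - 1)
        return is_last_layer and is_edge_chunk
```

Under that reading, a name like `is_last_layer_of_edge_chunk` would make the two conditions explicit.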

return self.cur_backward_chunk().tensor_pop(saved_state)


OFFLOAD_TAG = "offloading_mlp_input"
Contributor

Do we only support offloading the MLP input?
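The reviewer's question points at the tag being a single hard-coded string. A small sketch of what tag-gated offloading implies, with all names other than `OFFLOAD_TAG` assumed for illustration: only activations saved under the matching tag would be moved to a host-side buffer, so with one fixed tag, only the MLP input qualifies.

```python
OFFLOAD_TAG = "offloading_mlp_input"

class OffloadBuffer:
    """Illustrative stand-in for a CPU-side activation buffer keyed by name.
    (In the real path this would be pinned host memory with async copies.)"""

    def __init__(self):
        self.host = {}

    def maybe_offload(self, name, tensor, tag):
        # Only activations explicitly tagged as MLP input are offloaded;
        # everything else stays resident on the device.
        if tag == OFFLOAD_TAG:
            self.host[name] = tensor   # real path: copy to pinned CPU memory
            return ("cpu", name)       # lightweight handle kept on the device side
        return tensor

    def reload(self, handle):
        # Bring a previously offloaded activation back for the backward pass.
        _, name = handle
        return self.host[name]
```

Supporting more activation types would then mean accepting a set of tags (or a predicate) rather than comparing against one constant.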

timmoon10 and others added 2 commits June 17, 2025 20:37
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: liuzhenhai93 <liuzhenhai93@outlook.com>
Signed-off-by: liuzhenhai93 <liuzhenhai93@outlook.com>
@pggPL
Collaborator

pggPL commented Jun 17, 2025

Hi, thank you for the PR. We are working on some bigger changes in CPU Offload so I think we will need to sync. I reached out to @lhb8125.

Labels: None yet
Projects: None yet
4 participants