
FusedLayerNorm corrupts data when switching devices #1022

stas00 opened this issue Jan 2, 2021 · 0 comments

stas00 commented Jan 2, 2021

I am porting transformers' Bart model to model parallelism, so there is a lot of device switching, and I run into:

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm(...)`

when using apex's FusedLayerNorm.

This problem doesn't exist with torch.nn.LayerNorm in pt-nightly (which was ported from FusedLayerNorm in pytorch/pytorch#27634).

I was able to reproduce it quite simply:

# fused.py
from apex.normalization import FusedLayerNorm
import torch

class Norm(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.ln1 = torch.nn.LayerNorm(4)
        self.ln2 = FusedLayerNorm(4)

    def forward(self, x):
        # apex's fused norm first, then the native one on the same input
        y2 = self.ln2(x)
        print(f"apex  : {y2}")
        y1 = self.ln1(x)
        print(f"native: {y1}")

model = Norm()

x = torch.tensor([[1.5, .0, .0, .0]])
for id in [0, 1]:
    print(f"ID {id}")
    # move both the input and the model to the next device
    x = x.to(id)
    model.to(id)
    model(x)

2 GPUs are needed to see the problem:

python fused.py
ID 0
apex  : tensor([[ 1.7320, -0.5773, -0.5773, -0.5773]], device='cuda:0',
       grad_fn=<FusedLayerNormAffineFunctionBackward>)
native: tensor([[ 1.7320, -0.5773, -0.5773, -0.5773]], device='cuda:0',
       grad_fn=<NativeLayerNormBackward>)
ID 1
apex  : tensor([[0., 0., 0., 0.]], device='cuda:1',
       grad_fn=<FusedLayerNormAffineFunctionBackward>)
native: tensor([[ 1.7320, -0.5773, -0.5773, -0.5773]], device='cuda:1',
       grad_fn=<NativeLayerNormBackward>)

As you can see, apex's norm broke the tensor, and if I then pass it to some other pytorch op, it blows up with CUBLAS_STATUS_EXECUTION_FAILED.
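
For illustration only (this matmul is a hypothetical follow-up op I made up for this report, not part of my actual code), the downstream failure looks something like this:

# hypothetical follow-up op (assumption, for illustration only): any dense op
# that consumes the corrupted apex output on cuda:1, e.g. a matmul, is where
# the CUBLAS_STATUS_EXECUTION_FAILED error then surfaces
y2 = self.ln2(x)
z = y2 @ torch.randn(4, 4, device=x.device)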

If, however, I flip the two norm calls:

    def forward(self, x):
        y1 = self.ln1(x)
        print(f"native: {y1}")
        y2 = self.ln2(x)
        print(f"apex  : {y2}")

then everything works:

ID 0
native: tensor([[ 1.7320, -0.5773, -0.5773, -0.5773]], device='cuda:0',
       grad_fn=<NativeLayerNormBackward>)
apex  : tensor([[ 1.7320, -0.5773, -0.5773, -0.5773]], device='cuda:0',
       grad_fn=<FusedLayerNormAffineFunctionBackward>)
ID 1
native: tensor([[ 1.7320, -0.5773, -0.5773, -0.5773]], device='cuda:1',
       grad_fn=<NativeLayerNormBackward>)
apex  : tensor([[ 1.7320, -0.5773, -0.5773, -0.5773]], device='cuda:1',
       grad_fn=<FusedLayerNormAffineFunctionBackward>)

So apex's version is missing something with respect to switching to a new device.
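
My guess (an assumption, I haven't dug into the apex kernel code) is that the fused kernel launches on the current CUDA device rather than on the input's device. A scoped guard around the call, sketched here with torch.cuda.device_of purely to illustrate the idea, would avoid the mismatch:

# sketch only, based on my assumption about the cause: make the input's device
# the current device for the duration of the fused call, similar to what a
# C++-side device guard would do
def guarded_fused_ln(fused_ln, x):
    with torch.cuda.device_of(x):  # sets the current device to x.device
        return fused_ln(x)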

The workaround I found for this simple script is to call torch.cuda.set_device(id) before the call to FusedLayerNorm:

x = torch.tensor([[1.5,.0,.0,.0]])
for id in [0, 1]:
    print(f"ID {id}")
    x = x.to(id)
    model.to(id)
    torch.cuda.set_device(id)  # without this, FusedLayerNorm corrupts the output
    model(x)
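
An equivalent variant (untested sketch, same idea) scopes the switch with the torch.cuda.device context manager instead of calling set_device directly:

# untested sketch of the same workaround: let the context manager set and
# restore the current device around each iteration
x = torch.tensor([[1.5, .0, .0, .0]])
for id in [0, 1]:
    print(f"ID {id}")
    with torch.cuda.device(id):
        x = x.to(id)
        model.to(id)
        model(x)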

This is with pt-nightly, python 3.8, and apex master.
