
Support for torch.cuda.amp in VQ-VAE training #65

Open
vvvm23 opened this issue Apr 28, 2021 · 6 comments
vvvm23 commented Apr 28, 2021

Feature request for AMP support in VQ-VAE training.
So far, I tried naively modifying the train function in train_vqvae.py like so:

# ...
scaler = torch.cuda.amp.GradScaler()  # created once, before the training loop

for i, (img, label) in enumerate(loader):
    model.zero_grad()

    img = img.to(device)

    # forward pass under autocast, so eligible ops run in fp16
    with torch.cuda.amp.autocast():
        out, latent_loss = model(img)
        recon_loss = criterion(out, img)
        latent_loss = latent_loss.mean()
        loss = recon_loss + latent_loss_weight * latent_loss

    # scale the loss before backward to avoid fp16 gradient underflow
    scaler.scale(loss).backward()

    if scheduler is not None:
        scheduler.step()

    scaler.step(optimizer)  # unscales gradients and skips the step if they contain inf/nan
    scaler.update()
# ...

The MSE (reconstruction) loss appears normal, but the latent loss becomes infinite.
I'm going to try a few ideas when I have the time. I suspect that half precision and/or loss scaling doesn't play well with the EMA codebook updates. One "workaround" is to replace the EMA update with the second term of the loss function from the original paper, so that the codebook is only updated via gradients, but that is far from ideal.
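
To be concrete about that workaround, I mean the codebook and commitment terms from the paper, roughly like this (untested sketch; z_e, quantized and latent_loss_no_ema are just placeholder names):

import torch.nn.functional as F

def latent_loss_no_ema(z_e, quantized, beta=0.25):
    # codebook loss: pull the embeddings towards the encoder output, ||sg[z_e] - e||^2
    codebook_loss = F.mse_loss(quantized, z_e.detach())
    # commitment loss: keep the encoder output close to the codebook, ||z_e - sg[e]||^2
    commitment_loss = F.mse_loss(z_e, quantized.detach())
    return codebook_loss + beta * commitment_loss

(The codebook would also have to become a regular nn.Parameter instead of an EMA buffer for the gradient to reach it.)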

Thanks!

@rosinality (Owner)

I think it would be safer to use fp32 for the entire quantize operation.

vvvm23 (Author) commented Apr 29, 2021

So, wrapping Quantize.forward in @torch.cuda.amp.autocast(enabled=False) and casting the buffers to torch.float32? We might also have to cast the input.
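
Roughly something like this (untested sketch; FP32Quantize is just an illustrative wrapper name, not something from the repo):

import torch
from torch import nn

class FP32Quantize(nn.Module):
    # Illustrative wrapper: run an existing Quantize module in fp32
    # even when the surrounding forward pass is inside autocast.
    def __init__(self, quantize):
        super().__init__()
        self.quantize = quantize.float()  # cast the codebook parameters/buffers to fp32

    @torch.cuda.amp.autocast(enabled=False)
    def forward(self, input):
        # under autocast the incoming activations may be fp16, so cast them back up
        return self.quantize(input.float())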

@rosinality (Owner)

Yes. It may work.

vvvm23 (Author) commented Apr 29, 2021

Okay! I can make a pull request for this if you want? If not, I can just close this.

@rosinality (Owner)

If it suffices to reproduce the results of fp32 training, it would definitely be nice to have.

vvvm23 (Author) commented May 5, 2021

For some reason I can't improve forward-pass speed under fp16 (maybe it is bottlenecked by the fp32 quantize operations?). Memory usage is improved, though. I'll play around with this a little more and then maybe make a pull request.
