
train.py fails when gpus=2 (or something other than gpus=1) #139

Closed
metaphorz opened this issue Jul 7, 2021 · 9 comments

@metaphorz

OS: CentOS Version 7
Python: 3.7.6
PyTorch Version: 1.7.1+cu110
GPU: 2 V100s
Docker: No, have not gone that route yet
Related Posted Issues: none that I could find based solely on GPU count

I am running the stylegan2-ada-pytorch GitHub repo. With help from others on PyTorch versions,
I was able to train successfully with gpus=1, so gpus=1 is working.

The system I am on has 2 V100s. When I set gpus=2 on "python train.py ...." I receive the following
errors (traceback truncated and file references anonymized):

Distributing across 2 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Truncated Traceback (most recent call last):
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Truncated Traceback (most recent call last):
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "…/notebooks/stylegan2-ada-pytorch/train.py", line 422, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "…/notebooks/stylegan2-ada-pytorch/training/training_loop.py", line 290, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, sync=sync, gain=gain)
  File "…/notebooks/stylegan2-ada-pytorch/training/loss.py", line 134, in accumulate_gradients
    training_stats.report('Loss/D/loss', loss_Dgen + loss_Dreal)
RuntimeError: The size of tensor a (4) must match the size of tensor b (2) at non-singleton dimension 0

@woctezuma commented Jul 7, 2021

#91 and #98, just in case they help, even though I know you already had a look. You are correct that they are only tangentially related, because the numbers do not match (4 and 2 here vs. 512 and 256 in my links).

It looks like the error happens here, even though the line numbers do not match:

loss_Dreal = 0
if do_Dmain:
    loss_Dreal = torch.nn.functional.softplus(-real_logits) # -log(sigmoid(real_logits))
    training_stats.report('Loss/D/loss', loss_Dgen + loss_Dreal)

There is a sum of two terms, the first of which is:

# Dmain: Minimize logits for generated images.
loss_Dgen = 0
if do_Dmain:
    with torch.autograd.profiler.record_function('Dgen_forward'):
        gen_img, _gen_ws = self.run_G(gen_z, gen_c, sync=False)
        gen_logits = self.run_D(gen_img, gen_c, sync=False) # Gets synced by loss_Dreal.
        training_stats.report('Loss/scores/fake', gen_logits)
        training_stats.report('Loss/signs/fake', gen_logits.sign())
        loss_Dgen = torch.nn.functional.softplus(gen_logits) # -log(1 - sigmoid(gen_logits))
    with torch.autograd.profiler.record_function('Dgen_backward'):
        loss_Dgen.mean().mul(gain).backward()
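
To make the shape complaint concrete, here is a minimal standalone sketch (not code from this repo) that reproduces the same failure, assuming the two loss terms end up with different per-GPU batch sizes, as the "4" and "2" in your error message suggest:

import torch
import torch.nn.functional as F

# Hypothetical logits with mismatched batch sizes (4 vs. 2), standing in for
# the outputs of run_D() on generated and real images respectively.
gen_logits = torch.randn(4)
real_logits = torch.randn(2)

loss_Dgen = F.softplus(gen_logits)      # -log(1 - sigmoid(gen_logits))
loss_Dreal = F.softplus(-real_logits)   # -log(sigmoid(real_logits))

# Summing them tries to broadcast along dim 0 and fails exactly like the report() call:
# RuntimeError: The size of tensor a (4) must match the size of tensor b (2)
# at non-singleton dimension 0
total = loss_Dgen + loss_Dreal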

@metaphorz (Author)

Also found this and asked the poster there: lucidrains/stylegan2-pytorch#209

@metaphorz (Author)

Tried an experiment and got part-way there. There is a "cfg" config option in train.py. The config had been set to 11gb-gpu, which worked fine as long as gpus=1 but not with gpus>1. So I tried setting it to auto, and while that worked with multiple GPUs, the fake*** images it generated were bizarre (mostly red or green, nothing like the starting network (wikiart.pkl) or the images used in training). So now I am retracing my steps, wondering whether there is a config that will generate accurate fake*** images on multiple GPUs. To see all config options, look in train.py for the variable cfg_specs. If I find something, I'll report back.

@metaphorz (Author)

--cfg='stylegan2' works for me in a trial with one node and two GPUs.
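
For context, a full multi-GPU invocation with that config would look something like the line below; the output and dataset paths are placeholders, and the flags are the standard ones from the upstream README plus a resume network:

python train.py --outdir=training-runs --data=datasets/my-dataset.zip --gpus=2 --cfg=stylegan2 --resume=wikiart.pkl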

@woctezuma commented Jul 9, 2021

You are using a fork, because the config you mentioned (11gb-gpu) is not part of this repository.

cfg_specs = {
    'auto':      dict(ref_gpus=-1, kimg=25000, mb=-1, mbstd=-1, fmaps=-1, lrate=-1, gamma=-1, ema=-1, ramp=0.05, map=2), # Populated dynamically based on resolution and GPU count.
    'stylegan2': dict(ref_gpus=8, kimg=25000, mb=32, mbstd=4, fmaps=1, lrate=0.002, gamma=10, ema=10, ramp=None, map=8), # Uses mixed-precision, unlike the original StyleGAN2.
    'paper256':  dict(ref_gpus=8, kimg=25000, mb=64, mbstd=8, fmaps=0.5, lrate=0.0025, gamma=1, ema=20, ramp=None, map=8),
    'paper512':  dict(ref_gpus=8, kimg=25000, mb=64, mbstd=8, fmaps=1, lrate=0.0025, gamma=0.5, ema=20, ramp=None, map=8),
    'paper1024': dict(ref_gpus=8, kimg=25000, mb=32, mbstd=4, fmaps=1, lrate=0.002, gamma=2, ema=10, ramp=None, map=8),
    'cifar':     dict(ref_gpus=2, kimg=100000, mb=64, mbstd=32, fmaps=1, lrate=0.0025, gamma=0.01, ema=500, ramp=0.05, map=2),
}
assert cfg in cfg_specs
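
For what it is worth, the 'auto' entry only holds -1 placeholders; upstream train.py then fills them in from the dataset resolution and the requested GPU count. Roughly like this (paraphrased from memory with example values for res and gpus, so treat it as a sketch rather than the exact code):

res = 1024   # dataset resolution (example value)
gpus = 2     # value of --gpus (example value)

spec = dict(cfg_specs['auto'])
spec['ref_gpus'] = gpus
spec['mb'] = max(min(gpus * min(4096 // res, 32), 64), gpus)   # total batch scales with the GPU count
spec['mbstd'] = min(spec['mb'] // gpus, 4)                     # minibatch-std group size per GPU
spec['fmaps'] = 1 if res >= 512 else 0.5
spec['lrate'] = 0.002 if res >= 1024 else 0.0025
spec['gamma'] = 0.0002 * (res ** 2) / spec['mb']               # R1 regularization weight heuristic
spec['ema'] = spec['mb'] * 10 / 32

Note that mb and mbstd are derived from the GPU count here, whereas a fixed single-GPU config from a fork (such as 11gb-gpu) presumably is not, which might be related to the 4-vs-2 mismatch above.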

@MoemaMike

Is this issue applicable to the Colab Pro environment? I was under the impression that, while as a Colab Pro user I have access to 4 GPUs, I can only access one GPU per Colab notebook. If I am wrong and I can run gpus=2 or more, that would be welcome news.

@metaphorz (Author) commented Jul 9, 2021 via email

@metaphorz (Author) commented Jul 9, 2021 via email

@woctezuma

> PS: I just realized that you were right on the 11gb-gpu. Not sure where that came from.

It is part of the fork. I know this fork, even though I don't use it. :)
