
train.py fails when gpus=2 (or something other than gpus=1) #139

Closed
metaphorz opened this issue Jul 7, 2021 · 9 comments

@metaphorz

OS: CentOS Version 7
Python: 3.7.6
PyTorch Version: 1.7.1+cu110
GPU: 2 V100s
Docker: No, have not gone that route yet
Related Posted Issues: none that I could find based solely on GPU count

I am running the stylegan2-ada-pytorch GitHub repo. With help from others on PyTorch versions,
I was able to train successfully with gpus=1, so gpus=1 is working.

The system I am on has 2 V100s. When I set gpus=2 on "python train.py ...." I receive the following
errors (traceback truncated and file references anonymized):

Distributing across 2 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Truncated Traceback (most recent call last):
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Truncated Traceback (most recent call last):
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "…/notebooks/stylegan2-ada-pytorch/train.py", line 422, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "…/notebooks/stylegan2-ada-pytorch/training/training_loop.py", line 290, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, sync=sync, gain=gain)
  File "…/notebooks/stylegan2-ada-pytorch/training/loss.py", line 134, in accumulate_gradients
    training_stats.report('Loss/D/loss', loss_Dgen + loss_Dreal)
RuntimeError: The size of tensor a (4) must match the size of tensor b (2) at non-singleton dimension 0

@woctezuma commented Jul 7, 2021

#91 and #98, just in case they help, even though I know you already had a look. You are correct that they are only tangentially related, because the numbers do not match (4 and 2 here vs. 512 and 256 in my links).

It looks like the error happens here, even though the line numbers do not match:

loss_Dreal = 0
if do_Dmain:
    loss_Dreal = torch.nn.functional.softplus(-real_logits) # -log(sigmoid(real_logits))
    training_stats.report('Loss/D/loss', loss_Dgen + loss_Dreal)

There is a sum of two terms, the first of which is:

# Dmain: Minimize logits for generated images.
loss_Dgen = 0
if do_Dmain:
    with torch.autograd.profiler.record_function('Dgen_forward'):
        gen_img, _gen_ws = self.run_G(gen_z, gen_c, sync=False)
        gen_logits = self.run_D(gen_img, gen_c, sync=False) # Gets synced by loss_Dreal.
        training_stats.report('Loss/scores/fake', gen_logits)
        training_stats.report('Loss/signs/fake', gen_logits.sign())
        loss_Dgen = torch.nn.functional.softplus(gen_logits) # -log(1 - sigmoid(gen_logits))
    with torch.autograd.profiler.record_function('Dgen_backward'):
        loss_Dgen.mean().mul(gain).backward()
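
To make the shape complaint concrete, here is a minimal standalone sketch (not code from this repo) that reproduces the same failure, assuming the two loss terms end up with different per-GPU batch sizes, as the "4" and "2" in your error message suggest:

import torch
import torch.nn.functional as F

# Hypothetical logits with mismatched batch sizes (4 vs. 2), standing in for
# the outputs of run_D() on generated and real images respectively.
gen_logits = torch.randn(4)
real_logits = torch.randn(2)

loss_Dgen = F.softplus(gen_logits)      # -log(1 - sigmoid(gen_logits))
loss_Dreal = F.softplus(-real_logits)   # -log(sigmoid(real_logits))

# Summing them tries to broadcast along dim 0 and fails exactly like the report() call:
# RuntimeError: The size of tensor a (4) must match the size of tensor b (2)
# at non-singleton dimension 0
total = loss_Dgen + loss_Dreal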

@metaphorz (Author)

Also found this and asked the poster there: lucidrains/stylegan2-pytorch#209

@metaphorz (Author)

Tried an experiment and got part-way there. There is a "cfg" config option in train.py. The config had been set to 11gb-gpu, which worked fine as long as gpus=1 but not with gpus>1. So I tried setting it to auto, and while that worked with multiple GPUs, the fake*** images it generated were bizarre (mostly red or green, nothing like the starting network (wikiart.pkl) or the images used in training). So now I am retracing my steps, wondering whether there is a config that will generate accurate fake*** images on multiple GPUs. To see all config options, look in train.py for the variable cfg_specs. If I find something, I'll report back.

@metaphorz (Author)

--cfg='stylegan2' works for me in a trial with one node and two GPUs.
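
For context, a full multi-GPU invocation with that config would look something like the line below; the output and dataset paths are placeholders, and the flags are the standard ones from the upstream README plus a resume network:

python train.py --outdir=training-runs --data=datasets/my-dataset.zip --gpus=2 --cfg=stylegan2 --resume=wikiart.pkl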

@woctezuma commented Jul 9, 2021

You are using a fork, because the config you mentioned (11gb-gpu) is not part of this repository.

cfg_specs = {
    'auto':      dict(ref_gpus=-1, kimg=25000, mb=-1, mbstd=-1, fmaps=-1, lrate=-1, gamma=-1, ema=-1, ramp=0.05, map=2), # Populated dynamically based on resolution and GPU count.
    'stylegan2': dict(ref_gpus=8, kimg=25000, mb=32, mbstd=4, fmaps=1, lrate=0.002, gamma=10, ema=10, ramp=None, map=8), # Uses mixed-precision, unlike the original StyleGAN2.
    'paper256':  dict(ref_gpus=8, kimg=25000, mb=64, mbstd=8, fmaps=0.5, lrate=0.0025, gamma=1, ema=20, ramp=None, map=8),
    'paper512':  dict(ref_gpus=8, kimg=25000, mb=64, mbstd=8, fmaps=1, lrate=0.0025, gamma=0.5, ema=20, ramp=None, map=8),
    'paper1024': dict(ref_gpus=8, kimg=25000, mb=32, mbstd=4, fmaps=1, lrate=0.002, gamma=2, ema=10, ramp=None, map=8),
    'cifar':     dict(ref_gpus=2, kimg=100000, mb=64, mbstd=32, fmaps=1, lrate=0.0025, gamma=0.01, ema=500, ramp=0.05, map=2),
}
assert cfg in cfg_specs
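
For what it is worth, the 'auto' entry only holds -1 placeholders; upstream train.py then fills them in from the dataset resolution and the requested GPU count. Roughly like this (paraphrased from memory with example values for res and gpus, so treat it as a sketch rather than the exact code):

res = 1024   # dataset resolution (example value)
gpus = 2     # value of --gpus (example value)

spec = dict(cfg_specs['auto'])
spec['ref_gpus'] = gpus
spec['mb'] = max(min(gpus * min(4096 // res, 32), 64), gpus)   # total batch scales with the GPU count
spec['mbstd'] = min(spec['mb'] // gpus, 4)                     # minibatch-std group size per GPU
spec['fmaps'] = 1 if res >= 512 else 0.5
spec['lrate'] = 0.002 if res >= 1024 else 0.0025
spec['gamma'] = 0.0002 * (res ** 2) / spec['mb']               # R1 regularization weight heuristic
spec['ema'] = spec['mb'] * 10 / 32

Note that mb and mbstd are derived from the GPU count here, whereas a fixed single-GPU config from a fork (such as 11gb-gpu) presumably is not, which might be related to the 4-vs-2 mismatch above.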

@MoemaMike

Is this issue applicable to the Colab Pro environment? I was under the impression that, while as a Colab Pro user I have access to 4 GPUs, I can only access one GPU per Colab notebook. If I am wrong and I can run gpus=2 or more, that would be welcome news.

@metaphorz (Author) commented Jul 9, 2021 via email

@metaphorz (Author) commented Jul 9, 2021 via email

@woctezuma

> PS: I just realized that you were right on the 11gb-gpu. Not sure where that came from.

It is part of the fork. I know this fork, even though I don't use it. :)
