train.py fails when gpus=2 (or something other than gpus=1) #139
Linking #91 and #98 just in case they help, even though I know you already had a look. You are correct that it is only tangentially related, because the numbers do not match (4 and 2 here vs. 512 and 256 in my links). It looks like the error happens here, even though the line numbers do not match: training/loss.py, lines 116 to 119 (at commit d4b2afe).
There is a sum of two terms, the first of which is computed in training/loss.py, lines 94 to 104 (at commit d4b2afe).
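For what it is worth, here is a minimal sketch (not the repo's actual code) of how summing two per-sample loss tensors that cover different minibatch slices produces exactly this kind of error:

import torch

# Hypothetical illustration only: if loss_Dgen is accumulated over a per-GPU
# slice of 4 samples while loss_Dreal covers only 2, the elementwise sum that
# feeds training_stats.report() cannot broadcast.
loss_Dgen = torch.randn(4)    # shape [4]
loss_Dreal = torch.randn(2)   # shape [2]
try:
    _ = loss_Dgen + loss_Dreal
except RuntimeError as e:
    print(e)  # The size of tensor a (4) must match the size of tensor b (2) ...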
Also, I found this and asked the poster: lucidrains/stylegan2-pytorch#209
Tried an experiment and got part-way there. There is a "cfg" config option in train.py. The config had been set to 11gb-gpu, and that worked fine as long as gpus=1, but not >1. So I tried setting it to auto, and while that worked with multiple GPUs, the fake*** images generated were bizarre (mostly red or green, nothing like the starting network (wikiart.pkl) or the images used in training). So now I am retracing steps, wondering whether there is a config that will generate accurate fake*** images on multiple GPUs. To see all config options, look in train.py for the variable cfg_specs. If I find something, I'll report back.
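In case it helps others, a rough sketch for listing the preset names without starting a run (cfg_specs is a local dict inside setup_training_loop_kwargs, so this just scans the source file; it assumes you run it from the repo root and that the dict's closing brace sits alone on its own line):

in_block = False
with open('train.py') as f:
    for line in f:
        if 'cfg_specs = {' in line:
            in_block = True
            continue
        if in_block:
            if line.strip() == '}':
                break
            # Each entry looks like  'name': dict(...),  so print just the quoted key.
            print(line.split(':', 1)[0].strip().strip("'"))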
--cfg='stylegan2' works for me in a trial with one node and two GPUs.
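For reference, the invocation looked roughly like this (output directory and dataset path are placeholders for my own arguments):

python train.py --outdir=./results --data=./datasets/mydata.zip --gpus=2 --cfg=stylegan2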
You are using a fork, because that config is not part of this repository (see train.py, lines 154 to 163 at commit d4b2afe).
Is this issue applicable to the Colab Pro environment? I was under the impression that, while as a Colab Pro user I have access to 4 GPUs, I can only access one GPU per Colab notebook. If I am wrong and I can run gpus=2 or more, that would be welcome news.
You are right, I am using a fork (from dvschultz); however, look at the function setup_training_loop_kwargs, where cfg is defined as an option:
https://github.com/NVlabs/stylegan2-ada-pytorch/blob/main/train.py
It is under # Base config.
PS: I just realized that you were right about 11gb-gpu. Not sure where that came from.
I also use Colab Pro. As a Colab Pro user, to my knowledge, you have access to a node that contains only one GPU. Typically this will be a P100, but if you are lucky you get a V100. So for multiple GPUs you need to go the server route, which is admittedly a bit painful compared with Colab. I think Paperspace and vast.ai support multiple GPUs. So my workflow consists of starting on Colab, creating or modifying a notebook, and then translating this to a server to get to multiple GPUs.
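A quick way to check how many GPUs a given Colab or server session actually exposes (a minimal sketch, nothing repo-specific):

import torch

# Number of CUDA devices visible to PyTorch; on Colab Pro this is typically 1,
# so train.py should be launched with gpus=1 there.
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))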
It is part of the fork. I know this fork, even though I don't use it. :)
OS: CentOS Version 7
Python: 3.7.6
Pytorch Version: 1.7.1+cu110
GPU: 2 V100s
Docker: No, have not gone that route yet
Related Posted Issues: none that I could find based solely on GPU count
I am running the GitHub repo for stylegan2-ada-pytorch. Through the help of others with PyTorch versions, I was able to do successful training with gpus=1. So, gpus=1 is working.
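For context, the exact invocations look roughly like this (output directory, dataset path, and resume pickle are placeholders for my actual arguments):

# works:
python train.py --outdir=./results --data=./datasets/mydata.zip --resume=wikiart.pkl --gpus=1
# fails with the error below:
python train.py --outdir=./results --data=./datasets/mydata.zip --resume=wikiart.pkl --gpus=2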
The system I am on has 2 V100s. When I set gpus=2 on "python train.py ...." I receive the following errors:
(Traceback truncated and file references anonymized.)
Distributing across 2 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Truncated Traceback (most recent call last):
torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/…python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Truncated Traceback (most recent call last):
File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "…/notebooks/stylegan2-ada-pytorch/train.py", line 422, in subprocess_fn
training_loop.training_loop(rank=rank, **args)
File "…/notebooks/stylegan2-ada-pytorch/training/training_loop.py", line 290, in training_loop
loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, sync=sync, gain=gain)
File "…/notebooks/stylegan2-ada-pytorch/training/loss.py", line 134, in accumulate_gradients
training_stats.report('Loss/D/loss', loss_Dgen + loss_Dreal)
RuntimeError: The size of tensor a (4) must match the size of tensor b (2) at non-singleton dimension 0