
Some beginner questions #12

Open
lendrick opened this issue Aug 24, 2020 · 23 comments

@lendrick

Hi, I'm trying to train a network and I can't find much in the way of documentation as to stylegan2's metrics on tensorboard, so I figured I'd ask if you have any insight on this stuff.

Based on this screenshot here:

https://i.imgur.com/gXFEyys.png

Apologies about how basic these questions are, but:

  • Where do I want my g_loss and d_loss to go? Are they moving in the right direction?
  • What is grad_norm?
  • What are g_reg and d_reg?
  • How can I tell if I've got my learning rate set too fast or too slow, and is there a rule of thumb as far as how much to adjust it by?

Thanks!

@lendrick
Author

Just some additional info: I'm training a 512x512 network with ~10,000 normalized fantasy portrait images. I've attached a representative sample of the dataset, as well as some of the generated images.

The fine details don't seem to have improved much at all for a couple of days, despite the dataset being 512x512. I'm at 38,500 iterations with a batch size of 16.

[Attached images: dataset-sample, seed10028, tiles24]

@adriansahlman
Owner

adriansahlman commented Aug 25, 2020

Hey! Don't worry about questions being basic. It's actually been a while since I fiddled with GANs, so it's good to repeat some of the basics so I don't forget them!

So what I can start by mentioning is that the reason the results of the original paper were so amazing was the quality of the data. All of the portraits were of humans (which drastically reduces the amount of structural variation compared to fantasy art) and all of the faces were perfectly aligned, with the pupils in the exact same pixel positions in every image. I can see in your data sample that there are multiple art styles present, something the original model never had to learn when generating faces (where the only style was "real photo"). I am honestly amazed by the results you already have, so I think I might have overestimated how good the data actually has to be. But either way, I think there is too much variation in your dataset for the model to learn to generate high-fidelity results at that resolution.

For example, the original face-generating model has learned how to create very realistic hair. That is possible because human hair is quite similar from person to person (the variation is more in length, curls, etc.). In your dataset the hair in each portrait varies much more in style. The same goes for the skin, clothes, etc.

You can see how the generator has learned to produce only one "style" of portrait. If you could extract all the images from the dataset with that same kind of style, I think you could reach a higher fidelity in your output.

There is another example of similar training where someone trained the generator to produce anime portraits. This worked really well as the style was more limited.

Now I am speculating quite a bit, but I do think that images in your training data whose style differs from the one currently being generated may end up polluting the training.

Anyway, I hope you manage to improve the results. Still pretty cool how far you got! Maybe results would be better at a lower resolution, with the same number of model parameters, to account for the increased variety?

Where do I want my g_loss and d_loss to go? Are they moving in the right direction?

This is very hard to say, if not impossible. Since the training is completely adversarial, the loss may not be indicative of any kind of progress (which is a big problem for GANs; there are no really good metrics for overall progress). I think you can just ignore the loss values (unless they blow up to infinity or something crazy).

What is grad_norm?

For every backward pass we calculate the gradient for each parameter. The gradient norm (grad_norm) is the Euclidean norm of the vector of all those gradients. So if we had two parameters in total (in reality we have millions), we would have two gradients after a backward pass, gradient_A and gradient_B. We get the gradient norm by calculating sqrt(gradient_A^2 + gradient_B^2) (the Pythagorean theorem). A high gradient norm indicates that the gradients calculated during training are large. This is not necessarily indicative of the progress of the training, but I have found it to sometimes spike when training is about to collapse.
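For illustration, here is a minimal PyTorch sketch of how such a global gradient norm can be computed after a backward pass (the function name is hypothetical, not the repo's actual logging code):

import torch

# Hypothetical helper: sum the squared gradients of every parameter and
# take the square root, giving one scalar norm for the whole model.
def global_grad_norm(model: torch.nn.Module) -> float:
    squared_sum = 0.0
    for p in model.parameters():
        if p.grad is not None:
            squared_sum += p.grad.detach().pow(2).sum().item()
    return squared_sum ** 0.5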

What are g_reg and d_reg?

Since training GANs is very unstable, it is often useful to regularize the training. Regularization is usually some kind of "regulation to keep things in check". A classic regularization in the training of neural networks is weight decay, which in practice keeps weights from becoming too large (negative or positive). Some regularization produces a loss that we try to minimize (which in turn regularizes the training).
g_reg is the generator regularization loss, calculated using "path length regularization", and d_reg is the discriminator regularization loss, calculated from the gradients of the discriminator's input (an R1-style gradient penalty).
You can play around with these and change their scales to alter their relative importance during training. You can also just turn them off if you prefer.
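For reference, a minimal sketch of the discriminator-side regularization described above (an R1-style gradient penalty on real images); the function and argument names are illustrative, not this repo's exact API:

import torch

# R1-style penalty: penalize the squared gradient of the discriminator's
# scores with respect to its (real) input images.
def r1_penalty(D, reals, gamma=10.0):
    reals = reals.detach().requires_grad_(True)
    scores = D(reals)
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=reals,
                                 create_graph=True)
    return 0.5 * gamma * grads.pow(2).reshape(grads.size(0), -1).sum(1).mean()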

How can I tell if I've got my learning rate set too fast or too slow, and is there a rule of thumb as far as how much to adjust it by?

This one is hard to answer. You probably have to test a lot of different values. A general rule of thumb is that the higher your batch size, the higher your learning rate can be. The default learning rate for the Adam optimizer (which is used here) is 1e-3, so try something around that.
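As a purely illustrative example of setting up optimizers with a chosen learning rate (toy modules stand in for the real generator and discriminator; if I recall correctly, the StyleGAN2 paper uses roughly 2e-3 with betas = (0, 0.99), so values in that neighborhood are a reasonable start):

import torch

G = torch.nn.Linear(512, 512)  # placeholder for the generator
D = torch.nn.Linear(512, 1)    # placeholder for the discriminator

# 1e-3 is Adam's default learning rate; adjust up or down from here.
G_opt = torch.optim.Adam(G.parameters(), lr=1e-3, betas=(0.0, 0.99))
D_opt = torch.optim.Adam(D.parameters(), lr=1e-3, betas=(0.0, 0.99))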

@lendrick
Author

So as I've been looking around, I'm seeing a lot of people talking about having gotten good results with transfer learning (oddly enough, even with completely dissimilar datasets). When I try resuming training on an existing model (NVIDIA's 512x512 FFHQ model, for instance), my generated images look as if it has completely discarded the old model, as if it's starting from the first iteration. Is there some way I can get around that?

@adriansahlman
Owner

Make sure you use the --g_file and --d_file arguments when running the training script to load pretrained models (I added these options a couple of days ago, so you might need to pull the most recent changes from the repository).

I do not have any experience doing transfer learning with this type of model, but maybe a lower learning rate would make sure the pretrained model isn't discarded too quickly through parameter changes during training? Would love to hear if you make any progress on this!

@lendrick
Author

I'm getting an out of memory error:

(pytorch) D:\AI\sg2\stylegan2_pytorch>python run_training.py ffhq.yaml --g_file=checkpoints\ffhq_512x512\0000\G.pth --d_file=checkpoints\ffhq_512x512\0000\D.pth --resume --gpu 1
C:\Users\Bart\anaconda3\envs\pytorch\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\Bart\anaconda3\envs\pytorch\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\Bart\anaconda3\envs\pytorch\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\Bart\anaconda3\envs\pytorch\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\Bart\anaconda3\envs\pytorch\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\Bart\anaconda3\envs\pytorch\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
0%| | 0/1000000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_training.py", line 1002, in <module>
    main()
  File "run_training.py", line 997, in main
    run(args)
  File "run_training.py", line 966, in run
    trainer.train(iterations=args.iterations)
  File "D:\AI\sg2\stylegan2_pytorch\stylegan2\train.py", line 511, in train
    latent_labels=latent_labels
  File "D:\AI\sg2\stylegan2_pytorch\stylegan2\loss_fns.py", line 75, in G_logistic_ns
    fake_scores = D(G(latents, labels=latent_labels), labels=latent_labels).float()
  File "C:\Users\Bart\anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\AI\sg2\stylegan2_pytorch\stylegan2\models.py", line 1216, in forward
    x = block(input=x)
  File "C:\Users\Bart\anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\AI\sg2\stylegan2_pytorch\stylegan2\modules.py", line 1597, in forward
    x = layer(input=x)
  File "C:\Users\Bart\anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\AI\sg2\stylegan2_pytorch\stylegan2\modules.py", line 294, in forward
    x = self.act(x)
  File "C:\Users\Bart\anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\Bart\anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\activation.py", line 559, in forward
    return F.leaky_relu(input, self.negative_slope, self.inplace)
  File "C:\Users\Bart\anaconda3\envs\pytorch\lib\site-packages\torch\nn\functional.py", line 1063, in leaky_relu
    result = torch._C._nn.leaky_relu(input, negative_slope)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 1; 8.00 GiB total capacity; 6.34 GiB already allocated; 23.69 MiB free; 6.37 GiB reserved in total by PyTorch)

My yaml file is as follows:

channels: [32, 64, 128, 256, 512, 512, 512, 512]
tensorboard_log_dir: 'runs/ffhq_512x512'
tensorboard_image_interval: 25
checkpoint_dir: 'checkpoints/ffhq_512x512'
data_dir: d:/ai/pinterest/downloads/output-normalized
batch_size: 16
checkpoint_interval: 100
gpu: [1]

I tried adjusting the batch size down to 8 and then 4, and ran into the same error. My other network is also 512x512, so I'm not sure what the difference here would be.

@adriansahlman
Owner

This happens when loading a pretrained model? Have you updated to the latest code? Recently someone found a way to reduce memory usage, so that was added about a day ago.

@adriansahlman
Owner

You are using 6.37 out of 8 GB but only have 23.69 MB free when PyTorch is trying to allocate 32 MB. I'm guessing some memory is being used by other applications as well? I don't remember how much memory I was using when training a 512x512 model :/

@lendrick
Author

Are you running this on Linux? Maybe it's a Windows thing. I left some space unpartitioned on my hard drive for Linux, so maybe it's time I installed it.

I'll try today's build and see if that fixes it.

@lendrick
Author

lendrick commented Aug 27, 2020

Also, I typed this up yesterday and didn't save it: on a lark, I tried running training on my CPU to see what the memory usage would look like, and it was 41 gigabytes (fortunately I have 64 gigs on here), so I wonder if there's something else going on there.

Still trying to get Linux to work, but I'll report back after that.

@adriansahlman
Owner

Yeah, I gave up on Windows for this kind of stuff. Running Linux just makes things a lot easier! Although you should probably learn the basics of the Linux terminal first, but that's super easy.

If you want to use 0% of your GPU memory for the operating system, you will have to run it without any graphical user interface. Otherwise there's always going to be a bit of memory used by the OS (unless maybe you can run the OS graphics through an integrated GPU and use your dedicated GPU only for PyTorch).

41 GB sounds like there might be a bug or some very weird settings. Get back to me when you have played around with it :)

@lendrick
Author

I had to manually install the NVIDIA driver and things finally worked. Fortunately I'm a Linux admin at work, so I'm already comfortable with the command line -- it's just hard with a completely black screen. :)

Anyway, I'm getting the same out of memory problem on Linux:

$ python run_training.py ffhq.yaml --g_file=ffhq512/Gs.pth --d_file=ffhq512/D.pth --gpu 1
Traceback (most recent call last):
  File "run_training.py", line 1002, in <module>
    main()
  File "run_training.py", line 997, in main
    run(args)
  File "run_training.py", line 966, in run
    trainer.train(iterations=args.iterations)
  File "/mnt/d/linux_ai/sg2/stylegan2/train.py", line 598, in train
    reg_loss, self.D_opt, mul=self.D_reg_interval or 1)
  File "/mnt/d/linux_ai/sg2/stylegan2/train.py", line 461, in _backward
    loss.backward()
  File "/home/bart/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/bart/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 1; 7.80 GiB total capacity; 6.67 GiB already allocated; 54.19 MiB free; 330.80 MiB cached)

@lendrick
Author

Also, would it be possible while training to load the generator and discriminator into CPU memory and then load and unload them on the GPU depending on whether they're being used? I feel like that could cut GPU memory use way down (at the cost of some performance), but it would still be vastly better than training on the CPU.
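A rough sketch of what that swapping could look like for the discriminator step (toy modules as stand-ins; this is not something the repo currently does, and as noted in the reply below the saving would mostly be limited to the weights):

import torch
import torch.nn.functional as F

device = torch.device('cuda')
G = torch.nn.Linear(64, 3 * 64 * 64)   # placeholder generator
D = torch.nn.Linear(3 * 64 * 64, 1)    # placeholder discriminator

def d_step(latents, reals):
    # Generate fakes first, then move G back to the CPU before D's
    # forward/backward so only one network's weights sit on the GPU.
    G.to(device)
    with torch.no_grad():
        fakes = G(latents.to(device))
    G.to('cpu')
    D.to(device)
    # Non-saturating logistic discriminator loss on fakes and reals.
    loss = F.softplus(D(fakes)).mean() + F.softplus(-D(reals.to(device))).mean()
    loss.backward()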

@adriansahlman
Owner

I think it could be possible to do that, but the vast majority of memory usage actually comes from the training, not the model weights themselves! I'm guessing a model is something like 100 MB.
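A quick way to sanity-check that guess, assuming you have the model instance at hand (the placeholder module below is just for illustration):

import torch

def weight_megabytes(model: torch.nn.Module) -> float:
    # Weight memory is just parameter count times bytes per element.
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 2**20

model = torch.nn.Linear(512, 512)          # placeholder; use the real G or D
print(f"weights: {weight_megabytes(model):.2f} MB")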

Is the batch size 4 by default? 4 should be the lowest batch size you can run on these models. Can you try that?

If that doesn't work I can try to find the same model, run it, and see how much memory is used on my machine (I have 11 GB of GPU memory).

@lendrick
Author

lendrick commented Sep 1, 2020

I think it could be possible to do that, but the vast majority of memory usage actually comes from the training, not the model weights themselves! I'm guessing a model is something like 100 MB.

I tested whether that would be enough to slip in under the RAM limit, and it wasn't.

I tried a batch size of 4 and it still failed. The model was NVIDIA's 512x512 face photos model.

I think I'm just going to have to give up and get a new graphics card. Maybe I'll suck it up and get one of those 3090s they just announced, with 24 gigs of RAM. Then I won't have to worry about this for a while.

Until, of course, they start making even bigger models in like 6 months. :)

@adriansahlman
Owner

If you get a 3090 I will be mighty jealous! If I had more time to play around with GANs and fun things like that, I would probably buy one.

@adriaciurana

I'm trying to get an idea of the training times. To get the results you show, how much training time did you dedicate, and with what resources?

Thank you very much.

@adriansahlman
Owner

Hey,
I'm guessing you're asking @lendrick about his results?

If you use transfer learning you should be able to train the model a lot faster. The authors of the paper state that they trained at 61 images per second using 8 NVIDIA V100 GPUs, and the discriminator saw 25M images in total during training.
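(As a rough back-of-the-envelope estimate, not a number from the paper: 25,000,000 images at 61 images per second is about 410,000 seconds, i.e. roughly 4.7 days of training on that 8x V100 setup. A single GPU with transfer learning will of course behave very differently.)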

@adriaciurana

Yes, exactly. I just want to get a rough idea of how long it takes to obtain those results. Unfortunately, I can't get 8 V100s, haha.
I was wondering roughly how long it might take, with a single V100 or a 2080 Ti, to reach results similar to the author of this post (@lendrick) at 512x512 resolution, starting from pre-trained models of course.

Additionally, I want to thank you for all the effort put into this project; it is great to see StyleGAN2 available in PyTorch.

@adriansahlman
Owner

Oh yeah, I wish I had access to 8 V100s! I could not train a model at 1024x1024 resolution as I did not have enough GPU memory.

I have actually never tried transfer learning with GANs, so I don't dare say how long it would take. You can start the training and use the option to log generator output to TensorBoard at a set iteration interval, so you can watch the progress while training runs.

I am glad you found the code useful! It took a while to figure out exactly how some of the different parts worked before I could get them running correctly in PyTorch. I am not sure this code is as performant as the TensorFlow version, since they use custom CUDA code for some operations and this is written in pure PyTorch.

@hyx07

hyx07 commented Sep 29, 2020

I think WGAN can provide a rather good metric for evaluating training progress. It cannot tell you exactly where you are, because training doesn't converge to the theoretical Nash equilibrium point, but a declining Wasserstein distance is a safe sign that the model is getting better.
Thank you for this great project and excellent code!
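A minimal sketch of how such a Wasserstein distance estimate can be read off during WGAN training (names are illustrative, not this repo's API):

import torch

def estimated_w_distance(critic, reals, fakes):
    # The WGAN critic's objective estimates E[critic(real)] - E[critic(fake)],
    # which is the quantity suggested above for tracking training progress.
    with torch.no_grad():
        return (critic(reals).mean() - critic(fakes).mean()).item()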

@adriansahlman
Owner

@hyx07 Glad you enjoy the code! If you try running training with the WGAN loss instead of the default loss, post an update on whether the loss was any indication of progress; it would be fun to hear about!

@murthy95
Contributor

@hyx07 eagerly waiting for your reply.

@hyx07

hyx07 commented Nov 25, 2020

@hyx07 eagerly waiting for your reply.

Sorry, I haven't had the time to try WGAN on StyleGAN2, but I have tried it on other generation tasks like pix2pix. In general, when the generator is as competitive as the critic (the discriminator is called the critic in WGAN), I find the critic's output, i.e. the estimated Wasserstein distance, rises in the first few epochs and then slowly declines until the end of training.
However, in practice I found that LSGAN usually gave better results than WGAN, and LSGAN doesn't need gradient regularization, so I would first recommend LSGAN; then, if LSGAN doesn't work well, try WGAN or the original GAN loss with gradient regularization.
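A minimal sketch of the least-squares (LSGAN) losses mentioned above, using the common 0/1 targets (names are illustrative, not this repo's API):

import torch

def lsgan_d_loss(real_scores, fake_scores):
    # Push real scores toward 1 and fake scores toward 0.
    return 0.5 * ((real_scores - 1).pow(2).mean() + fake_scores.pow(2).mean())

def lsgan_g_loss(fake_scores):
    # The generator tries to make fake scores look "real" (target 1).
    return 0.5 * (fake_scores - 1).pow(2).mean()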
