Experience from "Accurate, Large Minibatch SGD" #22

Open · 4 of 7 tasks
hma02 opened this issue Jul 11, 2017 · 0 comments

hma02 (Collaborator) commented Jul 11, 2017

According to the Facebook paper, there are several implementation details to be adjusted:

  • 1. Momentum correction. In our implementation, we used equation (10) without momentum correction. We should either add the momentum correction or switch to equation (9). See the first sketch after this list.

  • 2. Gradient aggregation. In our implementation, we used either weight averaging (avg) or gradient summing (cdd); neither normalizes the per-worker loss by the total minibatch size kn, only by the per-worker size n. We should consider averaging gradients across workers and scaling up the lr accordingly (see the allreduce sketch further below).

  • 3. Learning rate gradual warmup and linear scaling. The reason we didn't scale the lr up was that when I tried it, gradients exploded at the beginning of training (even with a small number of workers) for VGG16. Note that gradual warmup increases the lr on every iteration, not every epoch; see the first sketch after this list.

  • 4. Batch Normalization parameters. According to the paper, "the BN statistics should not be computed across all workers". We should explicitly exclude the BN parameters from parameter exchange (see the sketch after this list).

  • 5. Use HeNormal initialization for ConvLayers and Normal for the last FCLayer. Set gamma to 0 for the last BN of each Residual Block. See the initialization sketch after this list.

  • 6. Do multiple trials to report random variation: report the median error of the final 5 epochs, plus the mean and standard deviation of that error over 5 independent runs. Each run is 90 epochs, with the lr divided by 10 at epochs 30, 60 and 80.

  • 7. Use scale and aspect-ratio data augmentation, and normalize images by the per-color-channel mean and std (see the augmentation sketch after this list).
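
For items 1 and 3, here is a minimal Python sketch of the warmup schedule and the momentum-corrected update. The 5-epoch warmup length and the correction factor lr_new/lr_old come from the paper, not from our code; the function names and the NumPy-style parameter handling are only for illustration.

```python
import numpy as np

def learning_rate(iteration, iters_per_epoch, base_lr, k, warmup_epochs=5):
    """Gradual warmup: ramp the lr per *iteration* from base_lr to k*base_lr
    over the first warmup_epochs, then apply the usual step schedule
    (lr/10 at epochs 30, 60 and 80)."""
    epoch = iteration / float(iters_per_epoch)
    target_lr = base_lr * k                               # linear scaling rule
    if epoch < warmup_epochs:
        alpha = iteration / float(warmup_epochs * iters_per_epoch)
        return base_lr + alpha * (target_lr - base_lr)
    return target_lr * 0.1 ** sum(epoch >= e for e in (30, 60, 80))

def momentum_sgd_step(w, v, grad, lr, prev_lr, momentum=0.9):
    """Equation (10) plus momentum correction: rescale the velocity by
    lr/prev_lr whenever the lr changes, so the update stays equivalent
    to equation (9)."""
    v = momentum * (lr / prev_lr) * v + lr * grad
    w = w - v
    return w, v
```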
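
For item 4, the change could be as simple as filtering which shared variables go through the exchanger. A sketch, assuming a (hypothetical) naming convention for the BN statistics; Theano-MPI's actual parameter names may differ:

```python
def params_to_exchange(params):
    """Keep only the shared variables that should be averaged across workers,
    dropping BN statistics so that each worker computes them over its own
    per-worker minibatch of size n."""
    bn_stat_names = ('bn_mean', 'bn_var', 'bn_std')   # hypothetical names
    return [p for p in params
            if not any(s in (p.name or '') for s in bn_stat_names)]
```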
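
For item 5, a NumPy sketch of the three initializers (the helper names are made up):

```python
import numpy as np

rng = np.random.RandomState(1234)

def he_normal(filter_shape):
    """HeNormal for conv layers: std = sqrt(2 / fan_in),
    with fan_in = channels_in * kernel_h * kernel_w."""
    fan_in = np.prod(filter_shape[1:])
    return rng.normal(0.0, np.sqrt(2.0 / fan_in),
                      size=filter_shape).astype('float32')

def fc_normal(n_in, n_out, std=0.01):
    """Plain Normal(0, 0.01) for the last FCLayer."""
    return rng.normal(0.0, std, size=(n_in, n_out)).astype('float32')

def last_bn_gamma(n_channels):
    """gamma = 0 for the last BN of each Residual Block, so every
    residual branch starts at zero and the block starts as identity."""
    return np.zeros(n_channels, dtype='float32')
```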
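
For item 7, a rough sketch of scale and aspect-ratio augmentation plus per-channel normalization, written with PIL/NumPy rather than our actual data pipeline; the sampling ranges (8%-100% of the area, aspect ratio in [3/4, 4/3]) and the mean/std constants are the commonly used ImageNet values, not something taken from this repo:

```python
import numpy as np
from PIL import Image

rng = np.random.RandomState(0)

# Per-channel RGB mean and std on the [0, 1] scale (assumed constants).
MEAN = np.array([0.485, 0.456, 0.406], dtype='float32')
STD = np.array([0.229, 0.224, 0.225], dtype='float32')

def random_sized_crop(img, out_size=224):
    """Sample a crop covering 8%-100% of the image area with aspect ratio
    in [3/4, 4/3], then resize it to out_size x out_size."""
    w, h = img.size
    for _ in range(10):
        area = w * h * rng.uniform(0.08, 1.0)
        ratio = rng.uniform(3.0 / 4.0, 4.0 / 3.0)
        crop_w = int(round(np.sqrt(area * ratio)))
        crop_h = int(round(np.sqrt(area / ratio)))
        if crop_w <= w and crop_h <= h:
            x = rng.randint(0, w - crop_w + 1)
            y = rng.randint(0, h - crop_h + 1)
            crop = img.crop((x, y, x + crop_w, y + crop_h))
            return crop.resize((out_size, out_size), Image.BILINEAR)
    # Fallback: plain resize if no valid crop was sampled.
    return img.resize((out_size, out_size), Image.BILINEAR)

def normalize(img):
    """Scale to [0, 1] and normalize by the per-channel mean and std."""
    arr = np.asarray(img, dtype='float32') / 255.0
    return (arr - MEAN) / STD
```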

On the HPC side, the three-phase allreduce "NCCL(reduction) -> MPI_Allreduce -> NCCL(broadcast)" mentioned in the paper could possibly be replaced by a single NCCL2 operation. Or do we need to make a Python binding for Gloo?
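
For reference, here is what item 2 boils down to with a single blocking allreduce, sketched with mpi4py (not Theano-MPI's actual exchanger code): each worker's gradients are already normalized by its per-worker size n, the allreduce sums them over the k workers, and dividing by k gives the kn normalization, with the lr then scaled by k.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
k = comm.Get_size()

base_lr = 0.1          # lr tuned for the per-worker minibatch size n
lr = base_lr * k       # linear scaling rule

# Stand-in for the per-worker gradients, each already normalized by n.
local_grads = [np.random.rand(100).astype('float32') for _ in range(3)]

for g in local_grads:
    comm.Allreduce(MPI.IN_PLACE, g, op=MPI.SUM)   # sum over the k workers
    g /= k                                        # now normalized by k*n
```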

The parallel communication idea mentioned in section 4 of the paper,

To allow for near perfect linear scaling, the aggregation must be performed in parallel with backprop

needs support from Theano. Currently, computation and communication run serially in Theano-MPI.
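
If Theano ever exposes per-layer gradients as they are produced, the overlap could look roughly like the following mpi4py sketch with non-blocking collectives; this is only a toy illustration of the idea, not something Theano-MPI can do today.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
k = comm.Get_size()

# Stand-ins for per-layer gradients, produced from the last layer backwards.
grads = [np.random.rand(1000).astype('float32') for _ in range(5)]

requests = []
for g in reversed(grads):
    # Start this layer's allreduce as soon as its gradient is available,
    # while backprop (not shown) keeps computing the earlier layers.
    requests.append(comm.Iallreduce(MPI.IN_PLACE, g, op=MPI.SUM))

MPI.Request.Waitall(requests)     # ensure all communication has finished
for g in grads:
    g /= k                        # sum -> average across the k workers
```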

hma02 added a commit that referenced this issue Jul 12, 2017
hma02 added a commit that referenced this issue Jul 13, 2017
hma02 added a commit that referenced this issue Jul 14, 2017
hma02 added a commit that referenced this issue Jul 14, 2017: "Use HeNormal for ConvLayers and Normal for the last FCLayer"
hma02 added a commit that referenced this issue Jul 20, 2017: "in models: alexnet, googlenet, resnet50, vgg16"