Experience from "Accurate, Large Minibatch SGD" #22

Open · 4 of 7 tasks
hma02 opened this issue Jul 11, 2017 · 0 comments

hma02 (Collaborator) commented Jul 11, 2017

According to the Facebook paper, there are several implementation details to be adjusted:

  • 1. Momentum correction. In our implementation, we used equation (10) without momentum correction. We should either add the momentum correction or switch to equation (9). See the first sketch after this list.

  • 2. Gradient aggregation. In our implementation, we used either weight averaging (avg) or gradient summing (cdd); neither normalizes the per-worker loss by the total minibatch size kn, only by the per-worker size n. We should consider averaging gradients across workers and scaling up the lr accordingly (see the allreduce sketch further below).

  • 3. Learning rate gradual warmup and linear scaling. The reason we didn't scale the lr up was that when I tried it, gradients exploded at the beginning of training (even with a small number of workers) for VGG16. Note that gradual warmup increases the lr on every iteration, not every epoch; see the first sketch after this list.

  • 4. Batch Normalization parameters. According to the paper, "the BN statistics should not be computed across all workers". We should explicitly exclude the BN parameters from parameter exchange (see the sketch after this list).

  • 5. Use HeNormal initialization for ConvLayers and Normal for the last FCLayer. Set gamma to 0 for the last BN of each Residual Block. See the initialization sketch after this list.

  • 6. Do multiple trials to report random variation: report the median error of the final 5 epochs, plus the mean and standard deviation of that error over 5 independent runs. Each run is 90 epochs, with the lr divided by 10 at epochs 30, 60 and 80.

  • 7. Use scale and aspect-ratio data augmentation, and normalize images by the per-color-channel mean and std (see the augmentation sketch after this list).
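
For items 1 and 3, here is a minimal Python sketch of the warmup schedule and the momentum-corrected update. The 5-epoch warmup length and the correction factor lr_new/lr_old come from the paper, not from our code; the function names and the NumPy-style parameter handling are only for illustration.

```python
import numpy as np

def learning_rate(iteration, iters_per_epoch, base_lr, k, warmup_epochs=5):
    """Gradual warmup: ramp the lr per *iteration* from base_lr to k*base_lr
    over the first warmup_epochs, then apply the usual step schedule
    (lr/10 at epochs 30, 60 and 80)."""
    epoch = iteration / float(iters_per_epoch)
    target_lr = base_lr * k                               # linear scaling rule
    if epoch < warmup_epochs:
        alpha = iteration / float(warmup_epochs * iters_per_epoch)
        return base_lr + alpha * (target_lr - base_lr)
    return target_lr * 0.1 ** sum(epoch >= e for e in (30, 60, 80))

def momentum_sgd_step(w, v, grad, lr, prev_lr, momentum=0.9):
    """Equation (10) plus momentum correction: rescale the velocity by
    lr/prev_lr whenever the lr changes, so the update stays equivalent
    to equation (9)."""
    v = momentum * (lr / prev_lr) * v + lr * grad
    w = w - v
    return w, v
```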
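
For item 4, the change could be as simple as filtering which shared variables go through the exchanger. A sketch, assuming a (hypothetical) naming convention for the BN statistics; Theano-MPI's actual parameter names may differ:

```python
def params_to_exchange(params):
    """Keep only the shared variables that should be averaged across workers,
    dropping BN statistics so that each worker computes them over its own
    per-worker minibatch of size n."""
    bn_stat_names = ('bn_mean', 'bn_var', 'bn_std')   # hypothetical names
    return [p for p in params
            if not any(s in (p.name or '') for s in bn_stat_names)]
```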
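
For item 5, a NumPy sketch of the three initializers (the helper names are made up):

```python
import numpy as np

rng = np.random.RandomState(1234)

def he_normal(filter_shape):
    """HeNormal for conv layers: std = sqrt(2 / fan_in),
    with fan_in = channels_in * kernel_h * kernel_w."""
    fan_in = np.prod(filter_shape[1:])
    return rng.normal(0.0, np.sqrt(2.0 / fan_in),
                      size=filter_shape).astype('float32')

def fc_normal(n_in, n_out, std=0.01):
    """Plain Normal(0, 0.01) for the last FCLayer."""
    return rng.normal(0.0, std, size=(n_in, n_out)).astype('float32')

def last_bn_gamma(n_channels):
    """gamma = 0 for the last BN of each Residual Block, so every
    residual branch starts at zero and the block starts as identity."""
    return np.zeros(n_channels, dtype='float32')
```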
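
For item 7, a rough sketch of scale and aspect-ratio augmentation plus per-channel normalization, written with PIL/NumPy rather than our actual data pipeline; the sampling ranges (8%-100% of the area, aspect ratio in [3/4, 4/3]) and the mean/std constants are the commonly used ImageNet values, not something taken from this repo:

```python
import numpy as np
from PIL import Image

rng = np.random.RandomState(0)

# Per-channel RGB mean and std on the [0, 1] scale (assumed constants).
MEAN = np.array([0.485, 0.456, 0.406], dtype='float32')
STD = np.array([0.229, 0.224, 0.225], dtype='float32')

def random_sized_crop(img, out_size=224):
    """Sample a crop covering 8%-100% of the image area with aspect ratio
    in [3/4, 4/3], then resize it to out_size x out_size."""
    w, h = img.size
    for _ in range(10):
        area = w * h * rng.uniform(0.08, 1.0)
        ratio = rng.uniform(3.0 / 4.0, 4.0 / 3.0)
        crop_w = int(round(np.sqrt(area * ratio)))
        crop_h = int(round(np.sqrt(area / ratio)))
        if crop_w <= w and crop_h <= h:
            x = rng.randint(0, w - crop_w + 1)
            y = rng.randint(0, h - crop_h + 1)
            crop = img.crop((x, y, x + crop_w, y + crop_h))
            return crop.resize((out_size, out_size), Image.BILINEAR)
    # Fallback: plain resize if no valid crop was sampled.
    return img.resize((out_size, out_size), Image.BILINEAR)

def normalize(img):
    """Scale to [0, 1] and normalize by the per-channel mean and std."""
    arr = np.asarray(img, dtype='float32') / 255.0
    return (arr - MEAN) / STD
```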

On the HPC side, the three-phase allreduce "NCCL(reduction) -> MPI_Allreduce -> NCCL(broadcast)" mentioned in the paper could possibly be replaced by a single NCCL2 operation. Or do we need to make a Python binding for Gloo?
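
For reference, here is what item 2 boils down to with a single blocking allreduce, sketched with mpi4py (not Theano-MPI's actual exchanger code): each worker's gradients are already normalized by its per-worker size n, the allreduce sums them over the k workers, and dividing by k gives the kn normalization, with the lr then scaled by k.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
k = comm.Get_size()

base_lr = 0.1          # lr tuned for the per-worker minibatch size n
lr = base_lr * k       # linear scaling rule

# Stand-in for the per-worker gradients, each already normalized by n.
local_grads = [np.random.rand(100).astype('float32') for _ in range(3)]

for g in local_grads:
    comm.Allreduce(MPI.IN_PLACE, g, op=MPI.SUM)   # sum over the k workers
    g /= k                                        # now normalized by k*n
```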

The parallel communication idea mentioned in section 4 of the paper,

To allow for near perfect linear scaling, the aggregation must be performed in parallel with backprop

needs support from Theano. Currently, computation and communication run serially in Theano-MPI.
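
If Theano ever exposes per-layer gradients as they are produced, the overlap could look roughly like the following mpi4py sketch with non-blocking collectives; this is only a toy illustration of the idea, not something Theano-MPI can do today.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
k = comm.Get_size()

# Stand-ins for per-layer gradients, produced from the last layer backwards.
grads = [np.random.rand(1000).astype('float32') for _ in range(5)]

requests = []
for g in reversed(grads):
    # Start this layer's allreduce as soon as its gradient is available,
    # while backprop (not shown) keeps computing the earlier layers.
    requests.append(comm.Iallreduce(MPI.IN_PLACE, g, op=MPI.SUM))

MPI.Request.Waitall(requests)     # ensure all communication has finished
for g in grads:
    g /= k                        # sum -> average across the k workers
```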

hma02 added a commit that referenced this issue Jul 12, 2017
hma02 added a commit that referenced this issue Jul 13, 2017
hma02 added a commit that referenced this issue Jul 14, 2017
hma02 added a commit that referenced this issue Jul 14, 2017: "Use HeNormal for ConvLayers and Normal for the last FCLayer"
hma02 added a commit that referenced this issue Jul 20, 2017: "in models: alexnet, googlenet, resnet50, vgg16"