After reading this paper, I decided to implement the network and run some experiments on CIFAR-10, which is small enough. This is my first time working on vision tasks (aside from MNIST, which doesn't require much knowledge of vision).
My first implementation was based on this code. It is quite simple, and I'm not sure why a softmax layer isn't used at the end. The model it implements is NIN (Network in Network). I changed the model to use residual blocks:
def convLayer(l, num_filters, filter_size=(1, 1), stride=(1, 1),
The function bottleneck implements the architecture shown on the right of Figure 5 (a "bottleneck" building block).
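To make the structure concrete, here is a minimal numpy sketch of a bottleneck block (1×1 reduce, 3×3, 1×1 expand, with an identity shortcut and ReLU after the addition). This is an illustrative reconstruction, not my actual Lasagne code; batch normalization, biases, and strided/projection shortcuts are omitted.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    return np.tensordot(w, x, axes=([1], [0]))

def conv3x3(x, w):
    # x: (C_in, H, W), w: (C_out, C_in, 3, 3); zero padding keeps H and W
    c_in, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(3):
        for j in range(3):
            out += np.tensordot(w[:, :, i, j], xp[:, i:i + h, j:j + wd],
                                axes=([1], [0]))
    return out

def bottleneck(x, w_reduce, w_mid, w_expand):
    # F(x): 1x1 reduce channels, 3x3 in the narrow space, 1x1 expand back
    f = relu(conv1x1(x, w_reduce))
    f = relu(conv3x3(f, w_mid))
    f = conv1x1(f, w_expand)
    return relu(f + x)  # identity shortcut, ReLU after the addition
```

The shortcut is a plain addition, so the expand convolution must bring the channel count back to that of the input.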
Unfortunately, the accuracy was really low (less than 60%). I changed the training procedure to use a flexible learning rate schedule. That improved the result, but it was still unacceptable.
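By a "flexible learning rate" I mean a step-decay schedule along these lines. The milestone epochs and base rate below are hypothetical, not the values I actually used; the ResNet paper starts at 0.1 and divides the rate by 10 as training plateaus.

```python
def learning_rate(epoch, base_lr=0.1, decay=0.1, milestones=(80, 120)):
    """Step decay: multiply the rate by `decay` at each milestone epoch.
    The milestone values here are placeholders for illustration."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= decay
    return lr
```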
Then I implemented the data preprocessing, which is described in the original paper:
We follow the simple data augmentation in  for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip.
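The augmentation in the quote can be sketched in a few lines of numpy (an illustrative version, assuming CHW image layout; not the code from the paper or from my implementation below):

```python
import numpy as np

def augment(img, pad=4, rng=np.random):
    # img: (C, 32, 32). Pad `pad` zero pixels on each side, take a random
    # 32x32 crop from the padded image, and flip horizontally half the time.
    c, h, w = img.shape
    padded = np.pad(img, ((0, 0), (pad, pad), (pad, pad)))
    top = rng.randint(0, 2 * pad + 1)
    left = rng.randint(0, 2 * pad + 1)
    crop = padded[:, top:top + h, left:left + w]
    if rng.randint(2):
        crop = crop[:, :, ::-1]
    return crop
```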
Although the report by benanne (winner of the Galaxy Challenge on Kaggle) mentioned the importance of image preprocessing, I had never read the original paper or implementation. I referred to Alex Krizhevsky's implementation on Google Code. The following is my implementation:
Realtime augmentation seems useful, and I plan to integrate it in the future.
Adding the preprocessing brought the accuracy to 70%–76%. The main reason is the small number of parameters (about 40k for 32 layers) in the bottleneck blocks. I adjusted the number of filter maps in the bottleneck block (to n/2), and the accuracy improved from 75% to 85%. So far I haven't found the right way to apply the bottleneck design to CIFAR-10; in the original paper it is used only on the ImageNet dataset.
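A quick back-of-the-envelope count shows why widening the bottleneck helps: the 3×3 term grows quadratically in the middle width. This is a sketch that counts only convolution weights (biases and batch normalization ignored); the concrete channel numbers in the usage are examples, not my network's actual sizes.

```python
def bottleneck_weights(c_in, c_mid):
    # Weight count of one bottleneck block:
    #   1x1 reduce (c_in -> c_mid), 3x3 (c_mid -> c_mid),
    #   1x1 expand (c_mid -> c_in).
    return c_in * c_mid + c_mid * c_mid * 3 * 3 + c_mid * c_in
```

For example, with 16 input maps, widening the middle from 4 to 8 maps takes one block from 272 to 832 weights, roughly quadrupling the 3×3 term.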
The final result (32 layers) on CIFAR-10:
92.64% on validation set
92.38% on test set