This reading list refers to the following articles:

Due to lack of time (driver's license tests and PhD applications), this list was updated in the above order. The reason is that deep learning papers usually take less time to read (although it is sometimes hard to reproduce the reported results).

### Deep Learning Papers

###### Deep Residual Learning for Image Recognition

This paper was not published in NIPS 2015; it is included because of its extraordinary performance on ImageNet.
"Residual" here means the objective function fits the residual part f(x) - x.
Intuition: if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. So the introduction of shortcuts (element-wise addition with the input) reduces the difficulty of fitting identity mappings, which allows training much deeper neural networks.

• It shows impressive performance on ImageNet and other datasets. Compared with the corresponding plain neural networks (without shortcuts), the residual networks achieve better performance.
• Residual networks enable training of much deeper neural networks (152 layers in this paper) without performance loss; instead, the deeper networks improve the results. For shallower networks, the residual variants converge faster.
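The shortcut idea above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's actual architecture: the function names and the two-layer form of F are my own, and convolutions and batch normalization are omitted.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Residual block: output = relu(F(x) + x), where the layers fit the
    residual F(x) = H(x) - x rather than the full mapping H(x).
    W1, W2 are the weights of a two-layer transform F (illustrative)."""
    f = relu(x @ W1) @ W2   # F(x): the residual mapping to be learned
    return relu(f + x)      # shortcut: element-wise addition with the input

# If the weights are zero, F(x) = 0 and the block reduces to an identity
# mapping (for non-negative inputs), so stacking blocks cannot increase
# the training error of the shallower model.
x = np.array([1.0, 2.0, 0.5, 3.0])
W1 = np.zeros((4, 4))
W2 = np.zeros((4, 4))
print(residual_block(x, W1, W2))  # -> [1.  2.  0.5 3. ], same as x
```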
###### Very Deep Learning with Highway Networks

This covers related papers such as Training Very Deep Networks and Highway Networks. The intuition is very similar to residual networks (in fact, this work is earlier than deep residual networks).
I'd like to introduce the highway network as a one-layer residual block with parameters. It uses a gate, of the kind broadly used in LSTMs, to control whether the layer passes through the input identity or its computed output.
More precisely, each layer outputs T(x)x + (1 - T(x))f(x, W), where T(x) controls the portions of the input x and the calculated value f(x, W). The initial values make T(x) equal to one, so the block initially always outputs the input value.
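A minimal sketch of one such layer, following the gating convention in the text (the published paper swaps the roles of T and 1 - T, but the idea is the same). The function names, the tanh nonlinearity for f, and the separate gate weights Wt, bt are my own assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W, Wt, bt):
    """One highway layer: y = T(x) * x + (1 - T(x)) * f(x, W),
    where T(x) is a sigmoid transform gate (illustrative sketch)."""
    T = sigmoid(x @ Wt + bt)        # gate: portion of the input to carry through
    f = np.tanh(x @ W)              # candidate output f(x, W) (assumed nonlinearity)
    return T * x + (1.0 - T) * f

# Initializing the gate bias bt to a large positive value drives T(x)
# toward one, so the layer starts out as a near-identity mapping.
x = np.array([0.5, -1.0, 2.0])
y = highway_layer(x, np.random.randn(3, 3), np.zeros((3, 3)), np.full(3, 10.0))
print(np.round(y, 3))  # very close to x
```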

Another related paper was published by Yu Zhang, who works on the CNTK project:

Zhang et al. "Highway Long Short-Term Memory RNNs for Distant Speech Recognition." arXiv preprint arXiv:1510.08983 (2015).

###### Path-SGD: Path-Normalized Optimization in Deep Neural Networks

Background: SGD is not rescaling invariant (rescaling + update -> different networks).

Core Application: Dynamic Programming

Inspired by the success of max-norm regularization, the authors proposed a global, path-based max-norm strategy. Dynamic programming is used to calculate the global regularizer, so that Path-SGD is as efficient as SGD.
The experiments show considerable improvements.
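The dynamic-programming trick can be illustrated on the path norm itself: the sum over all input-to-output paths of the product of |w|^p along each path can be computed layer by layer, at the cost of a single forward pass, instead of enumerating paths. This is a hypothetical sketch for fully connected layers, not the paper's code:

```python
import numpy as np

def path_norm(weights, p=2):
    """gamma_p(w)^p = sum over all input->output paths of prod |w_e|^p.
    Dynamic programming: push a vector of ones through the network with
    every weight replaced by |w|^p; each layer accumulates the partial
    path products, so no explicit path enumeration is needed."""
    v = np.ones(weights[0].shape[0])
    for W in weights:
        v = v @ (np.abs(W) ** p)   # DP step: extend all partial paths by one layer
    return float(v.sum())

# Toy two-layer network (2 inputs -> 2 hidden -> 1 output): the four paths
# contribute 1*1 + 9*1 + 4*4 + 1*4 = 30 with p = 2.
W1 = np.array([[1.0, 2.0],
               [3.0, 1.0]])
W2 = np.array([[1.0],
               [2.0]])
print(path_norm([W1, W2], p=2))  # -> 30.0
```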

### Others

###### Competitive Distribution Estimation: Why is Good-Turing Good

A theoretical paper; it seems to be the best paper according to Paul Mineiro.

###### End-to-End Attention-based Large Vocabulary Speech Recognition

(Updated on 2nd Feb.) Attention models have been very popular in recent years. This is the first end-to-end system using an attention model for speech recognition. It is a very promising structure, and I think it's a great replacement for the averaging operation in my work. The model focuses on a few frames, which should be useful for speaker recognition and spoofing detection.
Smoothing and sharpening of the attention weights are very useful for improving the results.
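As a sketch of replacing plain averaging with attention, here is a minimal attention-pooling function where a temperature beta sharpens the weights (beta > 1 concentrates on fewer frames, beta < 1 smooths them). The scoring vector w and the function names are my own illustrative assumptions, not the paper's model:

```python
import numpy as np

def attention_pool(frames, w, beta=1.0):
    """Attention pooling over frames: a weighted average instead of a
    plain mean. beta is a sharpening temperature applied before the
    softmax (illustrative sketch; w is an assumed scoring vector)."""
    scores = frames @ w                         # one relevance score per frame
    e = np.exp(beta * (scores - scores.max()))  # max-shift for numerical stability
    alpha = e / e.sum()                         # attention weights sum to one
    return alpha @ frames                       # weighted average of the frames

frames = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [5.0, 5.0]])
w = np.array([1.0, 1.0])
# With sharpening, the pooled vector is dominated by the high-scoring frame,
# unlike the plain mean of all frames.
print(attention_pool(frames, w, beta=2.0))
```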