Cubs Reading Group's profile - ShortScience.org

jmlr.org
scholar.google.com

DRAW: A Recurrent Neural Network For Image Generation
Gregor, Karol and Danihelka, Ivo and Graves, Alex and Rezende, Danilo Jimenez and Wierstra, Daan
International Conference on Machine Learning - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
Generate images with recurrent neural networks

#### Summary: 
This paper propose an architecture for image generation. The model itself is similar to variational autoencoder, but both the encoder and decoder are implemented with recurrent neural networks, in particular LSTM. It also has two new components: 

1. a reader that select an area of interest for the next recurrence and 

2. a writer that write to that particular area. They believe this mimics the attention and demonstrated this on a cluttered mnist dataset.

#### Novelty:
Using RNNs for image generation and selective attention of the region of interest

#### Drawbacks:
Idea seems over complicated and the image generation performance is not that good on real image datset such as CIFAR.

#### Datasets:
MNIST, SVHN, CIFAR

#### Additional remarks:
Presentation video available on cedar server

#### Presenter:
Yingbo Zhou

papers.nips.cc
scholar.google.com

Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks
Denton, Emily L. and Chintala, Soumith and Szlam, Arthur and Fergus, Rob
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
Learn to generate sharp high resolution images

#### Summary: 
The authors applied generative adversarial networks (GANs) for image modeling. Instead of learning one model for a high resolution image directly, they learn it in a hierachical way using pyramid decomposition of the image. The image is decomposed to smaller versions that contains all low frequency information and its corresponding high frequency ones. It can also be regarded as a similar idea from lossless compression, where one have both compressed version of the image and the corresponding error so that once the compression is given the image can be reconstructed perfectly. The quantitative and qualitaive performance are impressive, and is so that the best image generative model for high resolution images.

#### Novelty:
Using laplacian pyramid decomposition that enables the generation of sharp high resolution images possible

#### Drawbacks:
Training is stagewise and slow, and levels of decomposition is fixed for all kinds of images.

#### Datasets:
CIFAR-10, STL, SUN

#### Additional remarks:
Presentation video available on cedar server

#### Resources:
Other relevant work to this one: Conditional Generative Adversarial Nets, Conditional generative adversarial nets for convolutional face generation

#### Presenter:
Yingbo Zhou

papers.nips.cc
scholar.google.com

Generative Adversarial Nets
Goodfellow, Ian J. and Pouget-Abadie, Jean and Mirza, Mehdi and Xu, Bing and Warde-Farley, David and Ozair, Sherjil and Courville, Aaron C. and Bengio, Yoshua
Neural Information Processing Systems Conference - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
learning data distribution using non-parametric models

#### Summary: 
The authors propose learning generative model as playing an adversarial game, where one player is the generative model and the other is a discriminative model. The discriminative model learns to differentiate samples obtained from the generative model with true samples from the dataset, and the generative model tries to improve its data likelihood. They provide theoretical result that when one use such adversarial objective the global optimum of generative model will be the data distribution. Good generative performance is presented both quantitatively and qualitatively.

#### Novelty:
Treating learning generative model as playing a two player adversarial game, although such idea is kind of there in the training of energy based models, this paper is the first to successfully demonstrate this power.

#### Drawbacks:
As with any of the non-parametric generative models, the evaluation of data probability is not very easy.

#### Datasets:
MNIST, TFD, CIFAR-10 (only qualitative)

#### Additional remarks:
Presentation video available on cedar server

#### Presenter:
Yingbo Zhou

doi.ieeecomputersociety.org
sci-hub
scholar.google.com

Conditional Random Fields as Recurrent Neural Networks
Zheng, Shuai and Jayasumana, Sadeep and Romera-Paredes, Bernardino and Vineet, Vibhav and Su, Zhizhong and Du, Dalong and Huang, Chang and Torr, Philip H. S.
International Conference on Computer Vision - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
Image Segmentation, Pixel labelling, Object recognition

#### Summary: 
The authors approximate the CRF inference procedure using the mean field approximation. They use a specific set of unary and binary potentials. Each step in the mean field inference is modelled as a convolutional layer with appropriate filter sizes and channels. The mean field inference procedure requires multiple iterations (over time) to achieve convergence. This is exploited to model the whole procedure as CNN-RNN. The unary potentials and initial pixel labels are learnt using a FCN. The authors train the FCN and CNN-RNN separately and jointly and find that joint training gives the better performance of the two on the VOC2007 dataset.

#### Novelty:
Formulating the mean field CRF inference procedure as a combination of CNN and RNN. Joint training procedure of a fully convolutional network (FCN) + CRF as RNN to perform pixel labelling tasks

#### Drawbacks:
Does not scale with number of classes. No theoretical justification for success of joint training, only empirical justification

#### Datasets:
VOC2012, COCO

#### Additional remarks:
Presentation video available on cedar server

#### Resources:
http://www.robots.ox.ac.uk/~szheng/papers/CRFasRNN.pdf

#### Presenter:
Bhargava U. Kota

papers.nips.cc
scholar.google.com

Attentional Neural Network: Feature Selection Using Cognitive Feedback
Wang, Qian and Zhang, Jiaxing and Song, Sen and Zhang, Zheng
Neural Information Processing Systems Conference - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
Denoising images, Recognition in cluttered images, Segmentation of objects of interest

#### Summary: 
The authors propose an architecture to perform image denoising and segmentation for MNIST variation datasets and MNIST-2. Their work involves producing a 'cognitive' bias which is like a prior assumption that a certain class is present in the input image. Their network generates a denoised image given the prior class distribution at each iteration. A regular classifier is used at the end of generation process for termination condition. The generated image is gated with the input to prevent hallucination due to cognitive bias. MNIST-2 is created by the authors by superimposing 2 digits on the same image.

#### Novelty:
Integrates cognitive bias and feature extraction in the same network

#### Drawbacks:
Does not scale with number of classes. Works only on binarized images due to gating and masking

#### Datasets:
MNIST variations, MNIST-2

#### Resources:
http://papers.nips.cc/paper/5268-attentional-neural-network-feature-selection-using-cognitive-feedback.pdf

#### Presenter:
Bhargava U. Kota

doi.ieeecomputersociety.org
sci-hub
scholar.google.com

Deep Convolutional Network Cascade for Facial Point Detection
Sun, Yi and Wang, Xiaogang and Tang, Xiaoou
Conference and Computer Vision and Pattern Recognition - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
Estimation of the position of facial keypoints

#### Summary: 
The authors propose an effective convolutional network cascade for facial point detection. Deep convolutional networks at the first level provide highly robust initial estimations, while shallower convolutional networks at the following two levels finely tune the initial prediction to achieve high accuracy. The method therefore can avoid local minimum caused by ambiguity and data corruption. Networks at the first level are deep convolutional networks with four convolutional stages, absolute value rectification, and locally shared weights. Networks at the second and third levels share a common shallower structure. Since they are designed to extract local features, deep structures and locally sharing weights are unnecessary.

#### Novelty:
Three-level carefully designed convolution networks

#### Drawbacks:
High computational cost because of deep model

#### Datasets:
BioID, LFPW

#### Resources:
http://www.ee.cuhk.edu.hk/~xgwang/papers/sunWTcvpr13.pdf

#### Presenter:
Neeti Narayan

arxiv.org
arxiv-vanity.com
scholar.google.com

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe and Christian Szegedy
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG
more

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed:
Strategy for training deep neural networks

#### Summary:
The input distribution (to every layer) undergoes constant changes while training a deep network. The authors call this internal covariate shift in the input distribution. The authors claim this leads to slow learning of optimal model parameters. In order to overcome this, they introduce the idea of normalizing the input of every layer a part of the optimization strategy. Specifically, they reparameterize the input to each layer so that it is whitened and thus has non-changing distribution at every iteration.

They apply 2 approximation in their strategy:

1. this normalization is done for every mini-batch of training data,

2. the input dimensions are assumed to be uncorrelated.

Finally, the output of last layer is mean subtracted and variance normalized (these can be back-propagated while training). Additionally, the authors also introduce 2 learnable scalar parameters $(r,b)$ per dimension such that the final input to a layer is $y=rg(BN(x))+b$ where g is the activation function.

The advantage of BN apart from the intuition mentioned above is that it allows higher learning rate and network behavior remains unaffected by the scale of the parameters W and bias. The authors also empirically show that BN acts as a regularizer since optimization without dropout yields at par performance.

#### Novelty:
Previous work only focused on whitening in 1st layer input. This work extends this idea to all layers and suggests a practical approach for applying this idea to real world data.

#### Datasets:
Imagenet

#### Resources:
presentation video available on cedar server

#### Presenter:
Devansh Arpit

arxiv.org
arxiv-vanity.com
scholar.google.com

Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks
Caglar Gulcehre and Kyunghyun Cho and Razvan Pascanu and Yoshua Bengio
arXiv e-Print archive - 2013 via Local arXiv
Keywords: cs.NE, cs.LG, stat.ML
more

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
A new type of activation function

#### Summary: 
This paper propose a new activation function that computes a Lp norm from multiple projections on an input vector. The p value can be learned from training example, and can also be different for each hidden unit. The intuition is that 1) for different datasets there may exist different optimal p-values, so it make more sense to make p tunable; 2) allowing different unit take different p-values can potentially make the approximation of decision boundaries more efficient and more flexible. The empirical results support these two intuitions, and achieved comparable results on three datasets.

#### Novelty:
A generalization of pooling but applied through channels, when the data and weight vector dot product plus bias is constrained to non-negative case, the $L_\infty$ is equivalent to maxout unit.

#### Drawbacks:
Empirical performance is not very impressive, although evidence of supporting the intuition occurs.

#### Datasets:
MNIST, TFD, Pentomino

#### Resources:
http://arxiv.org/abs/1311.1780

#### Presenter:
Yingbo Zhou

arxiv.org
scholar.google.com

Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
Gal, Yarin and Ghahramani, Zoubin
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
Bayesian approximation of neural netowrks

#### Summary: 
This paper gives an alternative view of dropout as Bayesian approximation, which allow one to obtain uncertainty from the predictions. The result is surprisingly simple, both the predictive mean and variance can be obtained by calculating the mean and variance (with some minor adjustment) of multiple passes through the network with dropout.

#### Novelty:
A new interpretation of dropout as a Bayesian approximation.

#### Drawbacks:
Some computational overhead, since calculating the predictive mean and variance need multiple passes through the network.

#### Datasets:
MNIST, solar irradiance, Maunua Loa Co2,

#### Resources:
Paper: http://arxiv.org/pdf/1506.02142v1.pdf

Blog post: http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html

#### Presenter:
Yingbo Zhou"

papers.nips.cc
scholar.google.com

Spatial Transformer Networks
Jaderberg, Max and Simonyan, Karen and Zisserman, Andrew and Kavukcuoglu, Koray
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
A module to spatially transform feature maps conditioned on feature maps itself. Attempts to improve rotation, scale and shift invariance in neural nets.

#### Summary: 
This paper introduces Spatial Transformer Networks (STN) for rotation, shift and scale invariance. The module consists of three parts - Localization function, Grid point generation and Sampling. Each of these modules are differentiable and can be inserted at any point in a standard neural network architecture. The constraints are that the learnt spatial transform must be parametrized. Localization function learns these parameters by looking at the previous layer output (typically HxWxC for convolutional layers) and regressing to the parameters using FC layers or convolutional layers. The source grid generator parameters are learnt the same way. Given these two, the output of the STN is constructed by sampling (using any differentiable kernel) the source grid and using the transform parameters.

#### Novelty:
A new module is introduced to increase invariance to rotation, scale and shift

#### Drawbacks:
Since only some points in the source feature maps are selected due to grid generation it is unclear how the error is backpropagated to previous layers

#### Datasets:
Distorted MNIST, CUB-200-2011 Birds, SVHN

#### Resources:
http://arxiv.org/pdf/1506.02025v1.pdf

#### Presenter:
Bhargava U. Kota

arxiv.org
arxiv-vanity.com
scholar.google.com

Explaining and Harnessing Adversarial Examples
Ian J. Goodfellow and Jonathon Shlens and Christian Szegedy
arXiv e-Print archive - 2014 via Local arXiv
Keywords: stat.ML, cs.LG
more

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
A fast way of finding adversarial examples, and a hypothesis for the adversarial examples

#### Summary: 
This paper tries to explain why adversarial examples exists, the adversarial example is defined in another paper \cite{arxiv.org/abs/1312.6199}. The adversarial example is kind of counter intuitive because they normally are visually indistinguishable from the original example, but leads to very different predictions for the classifier. For example, let sample $x$ be associated with the true class $t$. A classifier (in particular a well trained dnn) can correctly predict $x$ with high confidence, but with a small perturbation $r$, the same network will predict $x+r$ to a different incorrect class also with high confidence.
 
 This paper explains that the exsistence of such adversarial examples is more because of low model capacity in high dimensional spaces rather than overfitting, and got some empirical support on that. It also shows a new method that can reliably generate adversarial examples really fast using `fast sign' method. Basically, one can generate an adversarial example by taking a small step toward the sign direction of the objective. They also showed that training along with adversarial examples helps the classifier to generalize.

#### Novelty:
A fast method to generate adversarial examples reliably, and a linear hypothesis for those examples.

#### Datasets:
MNIST

#### Resources:
Talk of the paper https://www.youtube.com/watch?v=Pq4A2mPCB0Y

#### Presenter:
Yingbo Zhou

www.jmlr.org
scholar.google.com

Understanding the difficulty of training deep feedforward neural networks
Glorot, Xavier and Bengio, Yoshua
Journal of Machine Learning Research - 2010 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed:
Instead of approaching the problem why pre-training works, this paper addresses why traditional way of training deep NNs dont work.

#### Summary:
The main focus of this paper is to empirically study why deep nets dont work with backprop without any pre-training. To analyse this, the authors mainly study the trends of activations and gradient strength across layers vs training iteration using simple backprop. Their study shows that the higher layer units saturate to 0 in the case of Sigmoid which prevents any backpropagated gradients to lower layers. It takes a lot of iterations to get out of saturation after which the lower layers start to learn.

For this reason the authors suggest using activations symmetric around 0 to avoid saturation, like Tanh and Softsign. For Tanh, they find that units of every layer initialized on either part of 0 start saturating (to respective sides) one after the other starting from lower layer to higher layer. For Softsign on the other hand, units from all layers move towards saturation together. Further the histogram of final activations suggest that Tanh units have a peak at both 0 and -1,+1 saturation, while Softsign units generally lie in the linear region. Note that the linear region in Tanh/Softsign has activation gradients-- hence propagates information.

The most interesting part of this study is the way the authors analyse the flow of information from the input layer to the top layer and vice versa. While the forward prop transmits the information about input to higher layers, backward prop transmits the error gradient. They measure the flow of information in terms of the variance of activation (forward) and gradients (backwards) for different layers. Since we would want the information flow to be equal at all layers, the variance should also be the same. So they propose to initialize the weight vectors such that this variance is preserved across layers. They call this ""Normalized Initialization"". Their empirical results show that both activations and gradients (hence information) at all layers have better propagation with their initialization.

#### Novelty:
Analysis of activation values and back-prop gradient across layers for analyzing training difficulties. Also, a new weight initialization method.

#### Drawbacks:
The variance study for activation/gradient is done for linear networks but applied to Tanh and Softsign. How is this justified?

#### Datasets:
Shapeset 3x2, MNIST, CIFAR-10

#### Presenter:
Devansh Arpit

arxiv.org
arxiv-vanity.com
scholar.google.com

Bounding the Test Log-Likelihood of Generative Models
Yoshua Bengio and Li Yao and Kyunghyun Cho
arXiv e-Print archive - 2013 via Local arXiv
Keywords: cs.LG
more

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
Evaluation and comparison of generative models

#### Summary: 
This paper improves upon an existing non parametric estimator by sampling from hidden variables instead of features. They present an unbiased estimator and prove it asymptotically converges to true distribution with number of samples. They also prove that the expected value of unbiased estimator is a lower bound on the true distribution. They also present a biased estimator with a different sampling scheme. They empirically validate their estimators using MNIST dataset on different generative models

#### Novelty:
Sampling from hidden space for non-parametric estimation

#### Drawbacks:
This method works only for models which have hidden variables. Application for deep networks is not clear. Procedure for sampling from hidden variables is not explicitly mentioned. Assumes that P(x|h) is easily calculated from the model

#### Datasets:
MNIST


#### Resources:
paper: http://arxiv.org/pdf/1311.6184v4.pdf

#### Presenter:
Bhargava U. Kota

papers.nips.cc
scholar.google.com

Iterative Neural Autoregressive Distribution Estimator NADE-k
Raiko, Tapani and Li, Yao and Cho, KyungHyun and Bengio, Yoshua
Neural Information Processing Systems Conference - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
Fully visible Bayesian network learning

#### Summary: 
This paper is very similar to the order agnostic NADE paper, it generalized the idea of order agnostic NADE and extended to k iterations. The difference between this work and the previous NADE work is: 1, instead of totally mask out the variables to compute, it instead provide the data mean for those variables; 2. mask is not supplied to the network; 3. it employed a walk-back like scheme, where the prediction is completed in k iterations.

#### Novelty:
It is a generalization of NADE models.

#### Drawbacks:
Training would be slow, and with large k, the challenge of training very deep net remains.

#### Datasets:
binary mnist, caltec-101 silhouettes

#### Additional remarks:


#### Resources:
implementation is at https://github.com/yaoli/nade_k

#### Presenter:
Yingbo Zhou

jmlr.org
scholar.google.com

A Deep and Tractable Density Estimator
Uria, Benigno and Murray, Iain and Larochelle, Hugo
International Conference on Machine Learning - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
Fully visible Bayesian network learning

#### Summary: 
This work is an extension of the original NADE paper. As oppose to using a prefixed random fully visible connected Bayesian network (FVBN), they try to train a factorial number of all possible FVBN by optimizing a stochastic version of the objective, which is an unbiased estimator. The resultant model is very easy to do any type of inference, in addition, since it is trained on all orderings, the ensemble generation of NADE models are also very easy with no additional cost. The training is to mask out the variables that one wants to predict, and maximize the likelihood over training data for the prediction of those missing variables. The model is very similar to denoising autoencoder with Bernoulli type of noise on the input. One drawback of this masking is that the model has no distinction between a masked out variable and a variable that has value 0. To overcome this, they supply the mask as additional input to the network and showed that this is an important ingredient for the model to work.

#### Novelty:
Proposed order agnoistic NADE, which overcome several drawbacks of original NADE.

#### Drawbacks:
The inference at test time is a bit expensive.

#### Datasets:
UCI, binary MNIST

#### Additional remarks:


#### Resources:
The first author provided the implementation on his website

#### Presenter:
Yingbo Zhou

papers.nips.cc
scholar.google.com

Learning Generative Models with Visual Attention
Tang, Yichuan and Srivastava, Nitish and Salakhutdinov, Ruslan
Neural Information Processing Systems Conference - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed:
Training specific generative model under milder unsupervised assumption

#### Summary:
This paper implemented an attention based scheme for learning generative models, which could make the unsupervised learning more applicable in practice. In common unsupervised settings, one would assume that the unsupervised data is already in the desired format that one could used directly, which could be a very strong assumption. In this work, the demonstrated a specific application of the idea for training face models. They use a canonical low resolution face model that model the object in mind, alone with a search scheme that resembles the attention to search the face region in a high resolution image. The whole scheme is formalized as a full probabilistic model, and the attention is thus implemented as a inference through the model. The probabilistic model is implemented using RBMs. For inference, they imployed hybrid Monte Carlo, as with all MCMC methods, it is hard for the sampling methods to go between modes that are separated by low density areas. To overcome this, they instead used convnet to propose the moves and used hmc with the convnet initilized states so that the full system is still probabilistic. The result demonstrated are pretty interesting.

#### Novelty:
The application of visual attention using RBM with probabilistic inference.

#### Drawbacks:
The inference do need a good convnet for initilization, and it seems the mixing of Markov chain is a big problem.

#### Datasets:
Caltech and CMU face dataset

#### Resources:
A neurobiological model of visual attention and
invariant pattern recognition based on dynamic routing of information, is the paper of visual attention

#### Presenter:
Yingbo Zhou

papers.nips.cc
scholar.google.com

Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection
Socher, Richard and Huang, Eric H. and Pennington, Jeffrey and Ng, Andrew Y. and Manning, Christopher D.
Neural Information Processing Systems Conference - 2011 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
Paraphrase Detection using Recursive Auto Encoders

#### Summary: 
The authors present the usage of autoencoders to learn recursive tree (in a semantic sense) structures in data. They use deep recursive autoencoders to learn a representation for natural language sentences in unsupervised and semi-supervised frameworks. They then introduce a pooling scheme on top of this representation to handle sentences of varying length and determine if they are paraphrases of each other and achieve state of the art results on MSRP paraphrase corpus.

#### Novelty:
The idea of unfolding recursive autoencoders (RAE),
 Pooling to handle sentences and representations of varying sentence lengths.

#### Datasets:
MSRP paraphrase corpus

#### Additional remarks:
The main idea was introduced in Semi-Supervised Recursive
 Autoencoders for Predicting Sentiment Distributions, R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. In EMNLP, 2011. Please refer that for training of RAEs.

#### Resources:
paper: http://www.socher.org/uploads/Main/SocherHuangPenningtonNgManning_NIPS2011.pdf

#### Presenter:
Bhargava U. Kota

nips.djvuzone.org
sci-hub
scholar.google.com

Extracting Tree-Structured Representations of Trained Networks
Craven, Mark and Shavlik, Jude W.
Neural Information Processing Systems Conference - 1995 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed:
Neural nets are good at learning complex functions but they are blackboxes since they cannot be used to reason why a decision is made. This paper addresses this problem by learning a decision tree that approximates the hypothesis learned by the NN.

#### Summary:
While NNs have been shown to be able to learn complex functions efficiently, we cannot find the logic/reason why a particular test sample is classified into a particular class. Thus when a net when a mistake, we cannot know the reason behind it to rectify it. This raises critical concerns regarding the reliability of such learning algorithms. This paper introduces the idea of inducing a decision tree that approximates the hypothesis of the trained network. This has 2 advantages:
1. the decision tree can query as many test cases as it needs from the trained NN as the task is to learn the parent hypothesis.
2. This implies much more training examples for leaf nodes as compared to conventional tree algorithms.

Another difference from traditional methods is that for splitting a node, the authors use 'm-of-n expressions'. Here, a conjunction of n conditions are used at least m of which must be satisfied to go on one side or the other. These conditions are derived in a greedy manner.

The evaluation criteria of a node is to check its fidelity with the NN and check how many training examples reach a node. If too many examples reach a node then it is not performing the task of classification.

#### Novelty:
learning a decision tree to approximate the hypothesis learned by a NN, the algorithm to train the decision tree

#### Drawbacks:
1. As this paper is from 1996, the empirical results are very weak since the dataset sizes are within a few hundreds while the feature dimensions are less than 100.
2. The algorithm is also not explained very clearly.
3. No qualitative results are shown for the core claim of reasoning the decisions of a NN.

#### Datasets:
Congressional voting dataset, Cleveland heart-disease dataset, UC-Irvine database, Un-named dataset for 'recognizing protein-coding regions in DNA'

#### Additional remarks:
Since every node is a raw feature and reasoning is being done on the same, this algorithm seems more suitable for Medical data.

A good extension of this work would be to learn a decision tree that approximates a NN hypothesis in a more semantic feature space.

#### Presenter:
Devansh Arpit

jmlr.org
scholar.google.com

Better Mixing via Deep Representations
Bengio, Yoshua and Mesnil, Grégoire and Dauphin, Yann and Rifai, Salah
International Conference on Machine Learning - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed:
It has been empirically observed that deep representations lead to better mode mixing when sampled using MCMC. The authors present a set of hypotheses as to why this happens and confirm them empirically.

#### Summary:
The paper claims that deep representations (specially from parametric models) disentangle the factors of variations in the raw feature space. This disentangling leads to better ""mode mixing"" during MCMC sampling. For eg., in faces, the factors of variation could be identity-pose-illumination. If the higher layer learns these features then changing the representation in this space starting from a ""valid"" point would lead to changes in each of these factors directly and hence will produce ""valid"" images, which in the original feature space would be far apart; thus better mode mixing. This hypothesis is explained using 2 additional ones: (a) the manifold structure of the ""valid"" data is flattened in the higher layer space, and (b) the fraction of total volume occupied by high probability (valid) points is larger in the higher layer space. While (a) should lead to better interpolation in higher layer space, (b) should lead to more valid points in a parzen window around any known sample. These are confirmed experimentally.

#### Novelty:
novel intuitions why deep representations are good for generative modeling.

#### Drawbacks:
no theoretical justification

#### Datasets:
MNIST, Toronto Face dataset (TFD)

#### Additional remarks:
used DBN and Deep CAE for experiments on the datasets

#### Presenter:
Devansh Arpit

scholar.google.com

Parsing Natural Scenes and Natural Language with Recursive Neural Networks
Socher, Richard and Lin, Cliff Chiung-Yu and Ng, Andrew Y. and Manning, Christopher D.
International Conference on Machine Learning - 2011 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
Learning recursive structure from natural data - sentences and images

#### Summary: 
This paper introduces recursive neural networks in order to learn recursive structure in data. This recursive structure is illustrated by using examples of scene understanding using superpixel labels and parsing of natural language sentences. The authors go on define a novel neural network architecture, an objective function based on max-margin estimation and provide methods for backpropagation to train this network. They then evaluate its performance for accuracy in pixel labelling on the Stanford background dataset as well as for parsing sentences of Wall Street Journal section of the Penn TreeBank database. They achieve state-of-the-art and near state-of-the-art results respectively.

#### Novelty:
Introduction of recursive neural networks, framing an objective function to learn a recursive tree structure from data
 
 State of the art results on scene understanding and pixel labels for Stanford background dataset

#### Drawbacks:
Computational analysis of method not provided Not many details about the backpropagation method.

#### Datasets:
Stanford background dataset, Penn TreeBank

#### Additional remarks:
The thesis of the first author, Chapter 3 was referred during presentation. This paper almost identical to that and the thesis provides slightly more details.

#### Resources:
paper: http://www-nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf
 
 thesis: http://nlp.stanford.edu/~socherr/thesis.pdf

#### Presenter:
Bhargava U. Kota

arxiv.org
arxiv-vanity.com
scholar.google.com

Estimating or Propagating Gradients Through Stochastic Neurons
Yoshua Bengio
arXiv e-Print archive - 2013 via Local arXiv
Keywords: cs.LG
more

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
Gradient estimation for stochastic neurons

#### Summary: 
This paper proposed an unbiased estimator of stochastic units so that one can use gradient based learning. In addition, it also proposed a simple, biased estimator called straight through.

#### Novelty:
A new approach for estimating gradient for stochastic units.

#### Drawbacks:
The proposed unbised estimator seems to have large variance, and the biased one seems not performing very well in practice

#### Presenter:
Yingbo Zhou

arxiv.org
scholar.google.com

Distilling the Knowledge in a Neural Network
Hinton, Geoffrey E. and Vinyals, Oriol and Dean, Jeffrey
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed:
Traditional classifiers are trained using hard targets. This not only calls for learning a very complex function (due to spikes) but also ignores the relative similarity between classes, e.g., truck is more probable to be misclassified as a car instead of a cat. Instead the classifier is forced to assign both the car and cat to a single target value. This leads to poor generalization. This paper addresses this problem.

#### Summary:
In order to address the aforementioned problems, the paper proposes a method to generate soft labels for each sample by first training a cubersome/large/complex classifier like dropout at a high ""temperature"" in so that it generates soft probabilities for every sample which represents its membership to each class. It then trains a vanilla NN initially at a high temperature and then at a low one using the generated soft labels on either the same training data or a transfer data. By doing so the simpler (student) model performs similar to the complex (teacher) model.

#### Novelty:
technique for generating soft labels for classes for training a much simpler classifier compared to currently used large and complex methods like dropout/conv-nets.

#### Drawbacks:
I believe a major drawback of this paper is that it entails learning a complex classifier for generating soft labels. Another drawback is that it is incapable of using unlabeled data.

#### Datasets:
MNIST, JFT (internal google image dataset)

#### Additional remarks:

#### Resources:
https://www.youtube.com/watch?v=7kAlBa7yhDM

#### Presenter:
Devansh Arpit

www.jmlr.org
scholar.google.com

The Neural Autoregressive Distribution Estimator
Larochelle, Hugo and Murray, Iain
Journal of Machine Learning Research - 2011 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
Density estimation

#### Summary: 
This paper presented a tractable density estimator inspired from RBM. The inspiration comes from the procedure of using mean field approximation for the actual RBM probablities. It turns out that one step of the mean field approximation just corresponds to a one-hidden layer feed forward nerual networks with tied weights.

#### Novelty:
Find the link between the one step mean field approximation on RBM to a one layer neural network, and come up with a tractable density estimator that performs well.

#### Drawbacks:
There still a need to pre-select an ordering of variables before running the algorithm.

#### Datasets:
MNIST, binary observation datasets from Larochelle 2010

#### Additional remarks:
The KL divergence minimization derivation is quite tricky (eq 7, 8)

#### Resources:


#### Presenter:
Yingbo Zhou

arxiv.org
arxiv-vanity.com
scholar.google.com

Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling
arXiv e-Print archive - 2013 via Local arXiv
Keywords: stat.ML, cs.LG
more

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
Variational learning of Bayesian networks

#### Summary: 
This paper present a generic method for learning belief networks, which uses variational lower bound for the likelihood term.

#### Novelty:
Uses a re-parameterization trick to change random variables to deterministic function plus a noise term, so one can apply normal gradient based learning

#### Drawbacks:
The resulting model marginal likelihood is still intractible, may not be very good for applications that require the use of actual values of the marginal probablities

#### Datasets:
MNIST, Frey face

#### Additional remarks:
Experimentally compared with wake sleep algorithm on logliklihood lower bound as well as estimated marginal likelihood

#### Resources:
Implementation: https://github.com/y0ast/Variational-Autoencoder

#### Presenter:
Yingbo Zhou

Cubs Reading Group

sciscore: 2