NIPS Conference Reviews's profile

papers.nips.cc
scholar.google.com

Disentangling factors of variation in deep representation using adversarial training
Mathieu, Michaël and Zhao, Junbo Jake and Sprechmann, Pablo and Ramesh, Aditya and LeCun, Yann
Neural Information Processing Systems Conference - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

The authors present a new generative model that learns to disentangle the factors of variation of the data. The authors claim that the proposed model is quite robust to supervision. This is achieved by combining two of the most successful generative models: VAE and GAN. The model is able to resolve the analogies in a consistent way on several datasets with minimal parameter/architecture tuning.

This paper presents a way to learn latent codes for data that capture both the information relevant for a given classification task and the remaining irrelevant factors of variation (rather than discarding the latter as a classification model would). This is done by combining a VAE-style generative model with adversarial training. The model proves capable of disentangling style and content in images (without explicit supervision for style information), and proves useful for analogy resolution.

This paper introduces a generative model for learning to disentangle hidden factors of variation. The disentangling separates the code into two parts, where one is claimed to be the code that describes factors relevant to solving a specific task, and the other describes the remaining factors. Experimental results show that the proposed method is promising.

The authors combine the state-of-the-art methods VAE and GAN to generate images with two complementary codes: one relevant and one irrelevant. The major contribution of the paper is the development of a training procedure that exploits triplets of images (two sharing the relevant code, one not sharing it) to regularize the encoder-decoder architecture and avoid trivial solutions. The results are qualitatively good and comparable to previous work using more sources of supervision.

The paper seeks to explore the variations amongst samples which separate multiple classes using autoencoders and decoders. Specifically, the authors propose combining generative adversarial networks and variational autoencoders. The idea mimics the game play between two opponents, where one attempts to fool the other into believing a synthetic sample is in fact a natural sample. The paper proposes an iterative training procedure where the generative model is first trained on a number of samples while keeping the weights of the adversary constant, and later the adversary is trained while keeping the generative model weights constant. The paper performs experiments on generation of instances between classes, retrieval of instances belonging to a given class, and interpolation of instances between two classes. The experiments were performed on MNIST, a set of 2D character animation sprites, and the 2D NORB toy image dataset.

arxiv.org
scholar.google.com

Improved Techniques for Training GANs
Salimans, Tim and Goodfellow, Ian J. and Zaremba, Wojciech and Cheung, Vicki and Radford, Alec and Chen, Xi
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

The authors provide a bag of tricks for training GANs in the image domain. Using these, they achieve very strong semi-supervised results on SVHN, MNIST, and CIFAR.

The authors then train the improved model on several image datasets, evaluate it on semi-supervised learning and on its generative capabilities, and achieve state-of-the-art results.

This paper investigates several techniques to stabilize GAN training and encourage convergence. Although they lack theoretical justification, the proposed heuristic techniques give better-looking samples. In addition to human judgement, the paper proposes a new metric called the Inception score, computed by applying a pre-trained deep classification network to the generated samples. By labelling the generated samples as an additional category, which comes for free, the paper applies GANs in the semi-supervised learning setting and achieves SOTA semi-supervised performance on several benchmark datasets (MNIST, CIFAR-10, and SVHN).
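To make the Inception score concrete, here is a minimal sketch, assuming the usual definition $\exp(\mathbb{E}_x[\mathrm{KL}(p(y|x) \,\|\, p(y))])$, where $p(y|x)$ is the classifier's output for a generated sample and $p(y)$ is its average over all samples. The function name `inception_score` and the toy probability tables are illustrative assumptions; in practice the rows would come from a pre-trained classifier.

```python
import math

def inception_score(probs):
    """probs: list of per-sample class distributions p(y|x), each summing to 1."""
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal class distribution p(y), averaged over all generated samples.
    marginal = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    # Average KL(p(y|x) || p(y)); terms with p(y|x) = 0 contribute nothing.
    mean_kl = 0.0
    for p in probs:
        mean_kl += sum(pc * math.log(pc / marginal[c])
                       for c, pc in enumerate(p) if pc > 0.0)
    mean_kl /= n
    return math.exp(mean_kl)

# Toy inputs (hypothetical): confident, diverse predictions vs. uninformative ones.
confident_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
print(inception_score(confident_and_diverse), inception_score(uninformative))
```

The score is high when each sample is classified confidently and the samples together cover many classes, and it drops to 1 when the classifier's predictions carry no information.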

papers.nips.cc
scholar.google.com

An Online Sequence-to-Sequence Model Using Partial Conditioning
Jaitly, Navdeep and Le, Quoc V. and Vinyals, Oriol and Sutskever, Ilya and Sussillo, David and Bengio, Samy
Neural Information Processing Systems Conference - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

The paper proposes a "neural transducer" model for sequence-to-sequence tasks that operates in a left-to-right and on-line fashion. In other words, the model produces output as the input is received instead of waiting until the full input is received like most sequence-to-sequence models do. Key ideas used to make the model work include a recurrent attention mechanism, the use of an end-of-block symbol in the output alphabet to indicate when the transducer should move to the next input block, and approximate algorithms based on dynamic programming and beam search for training and inference with the transducer model. Experiments on the TIMIT speech task show that the model works well and explore some of the design parameters of the model.

Like similar models of this type, the input is processed by an encoder and a decoder produces an output sequence using the information provided by the encoder and conditioned on its own previous predictions. The method is evaluated on a toy problem and the TIMIT phoneme recognition task. The authors also propose some smaller ideas like two different attention mechanism variations.

The map from block input to output is governed by a standard sequence-to-sequence model with additional state carried over from the previous block. Alignment of the two sequences is approximated by a dynamic program using a greedy local search heuristic. Experimental results are presented for phone recognition on TIMIT.

The encoder is a multi-layer LSTM RNN. The decoder is an RNN model conditioned on weighted sums of the last layer of the encoder and its previous output. The weighting schemes (attention) vary and can be conditioned on the hidden states or also on previous attention vectors. The decoder model produces a sequence of symbols until it outputs a special end character "e" and is moved to the next block. Other mechanisms were explored as well: using no end-of-block symbol, and separately predicting the end of a block given the attention vector. It is then fed the weighted sum of the next block of encoder states. The resulting sequence of symbols determines an alignment of the target symbols over the blocks of inputs, where each block may be assigned a variable number of characters. The system is trained by fixing an alignment that approximately resembles the best alignment. Finding this approximately best alignment is akin to a beam search with a beam size of M (line 169), but with a restricted set of symbols conditional on the last symbol in a particular hypothesis (since the target sequence is known). Alignments are computed less frequently than model updates (typically every 100 to 300 sequences). For inference, an unconstrained beam-search procedure is performed with a threshold on sequence length and beam size.

arxiv.org
arxiv-vanity.com
scholar.google.com

Professor Forcing: A New Algorithm for Training Recurrent Networks
Alex Lamb and Anirudh Goyal and Ying Zhang and Saizheng Zhang and Aaron Courville and Yoshua Bengio
arXiv e-Print archive - 2016 via Local arXiv
Keywords: stat.ML, cs.LG

[link] Summary by NIPS Conference Reviews 9 years ago

Authors present a method similar to teacher forcing that uses generative adversarial networks to guide training on sequential tasks.

This work describes a novel algorithm to ensure that the dynamics of an LSTM during inference follow those during training. The motivating example is sampling for a large number of steps at test time while only training on shorter sequences at training time. Experimental results are shown on PTB language modelling, MNIST, handwriting generation and music synthesis.

The paper is similar to Generative Adversarial Networks (GAN): in addition to a normal sequence model loss function, the parameters try to “fool” a classifier. That classifier tries to distinguish sequences generated by the sequence model from real data. A few objectives are proposed in section 2.2. The key difference to GAN is the B in equations 1-4. B is a function that outputs some statistics of the model, such as the hidden state of the RNN, whereas GAN tries rather to discriminate the actual output sequences.

This paper proposes a method for training recurrent neural networks (RNN) in the framework of adversarial training. Since RNNs can be used to generate sequential data, the goal is to optimize the network parameters in such a way that the generated samples are hard to distinguish from real data. This is particularly interesting for RNNs as the classical training criterion only involves the prediction of the next symbol in the sequence. Given a sequence of symbols $x_1, ..., x_t$, the model is trained so as to output $y_t$ as close to $x_{t+1}$ as possible. Training that way does not provide models that are robust during generation, as a mistake at time $t$ potentially makes the prediction at time $t+k$ totally unreliable. This idea is somewhat similar to the idea of computing a sentence-wide loss in the context of encoder-decoder translation models. The loss can only be computed after a complete sequence has been generated.

papers.nips.cc
scholar.google.com

Can Active Memory Replace Attention?
Kaiser, Lukasz and Bengio, Samy
Neural Information Processing Systems Conference - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

The authors propose to replace the notion of 'attention' in neural architectures with the notion of 'active memory' where rather than focusing on a single part of the memory one would operate on the whole of it in parallel.

This paper introduces an extension to neural GPUs for machine translation. I found the experimental analysis section lacking in both comparisons to state of the art MT techniques as well as thoroughly evaluating the proposed method.

This paper proposes active memory, a memory mechanism that operates on all parts of the memory in parallel. Active memory is compared to the attention mechanism, and it is shown to be more effective than attention for long-sentence translation in English-French translation.

This paper proposes two new models for modeling sequential data in the sequence-to-sequence framework. The first is called the Markovian Neural GPU and the second is called the Extended Neural GPU. Both models are extensions of the Neural GPU model (Kaiser and Sutskever, 2016), but unlike the Neural GPU, the proposed models do not model the outputs independently but instead connect the output token distributions recursively. The paper provides empirical evidence on a machine translation task showing that the two proposed models perform better than the Neural GPU model and that the Extended Neural GPU performs on par with a GRU-based encoder-decoder model with attention.

papers.nips.cc
scholar.google.com

On Multiplicative Integration with Recurrent Neural Networks
Wu, Yuhuai and Zhang, Saizheng and Zhang, Ying and Bengio, Yoshua and Salakhutdinov, Ruslan
Neural Information Processing Systems Conference - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

This paper has a simple premise: that the, say, LSTM cell works better with multiplicative updates (equation 2) rather than additive ones (equation 1). This multiplicative update is used in lieu of additive ones in various places in the LSTM recurrence equations (the exact formulation is in the supplementary material). A slightly hand-wavy argument is made in favour of the multiplicative update, on the grounds of superior gradient flow (section 2.2). Mainly, however, the authors make a rather thorough empirical investigation which shows remarkably good performance of their new architectures on a range of real problems. Figure 1(a) is nice, showing an apparent greater information flow (as defined by a particular gradient) through time for the new scheme, as well as faster convergence and less saturated hidden unit activations. Overall, the experimental results appear thorough and convincing, although I am not a specialist in this area.

This model presents a multiplicative alternative (with an additive component) to the additive update which happens at the core of various RNNs (Simple RNNs, GRUs, LSTMs). The multiplicative component, without introducing a significant change in the number of parameters, yields better gradient passing properties which enable the learning of better models, as shown in experiments.
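The contrast between the two update rules can be sketched as follows. This is a minimal illustration, assuming a general form $\phi(\alpha \odot Wx \odot Uh + \beta_1 \odot Uh + \beta_2 \odot Wx + b)$ for the multiplicative update with its additive component; the gating vectors `alpha`, `beta1`, `beta2` and the helper names are assumptions made for this sketch, and the paper should be consulted for the exact formulation used in each architecture.

```python
import math

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def additive_step(W, U, b, x, h):
    """Standard additive update: tanh(Wx + Uh + b)."""
    wx, uh = matvec(W, x), matvec(U, h)
    return [math.tanh(a + c + bias) for a, c, bias in zip(wx, uh, b)]

def multiplicative_step(W, U, b, alpha, beta1, beta2, x, h):
    """Assumed general form: tanh(alpha*Wx*Uh + beta1*Uh + beta2*Wx + b), elementwise."""
    wx, uh = matvec(W, x), matvec(U, h)
    return [math.tanh(al * a * c + b1 * c + b2 * a + bias)
            for al, b1, b2, a, c, bias in zip(alpha, beta1, beta2, wx, uh, b)]

# Toy weights and inputs (hypothetical values).
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]
x, h = [1.0, 2.0], [2.0, 4.0]

plain = additive_step(W, U, b, x, h)
# With alpha = 0 and beta1 = beta2 = 1 the sketch reduces to the additive update.
reduced = multiplicative_step(W, U, b, [0.0, 0.0], [1.0, 1.0], [1.0, 1.0], x, h)
# With alpha = 1 and both betas = 0 only the product term Wx * Uh remains.
product_only = multiplicative_step(W, U, b, [1.0, 1.0], [0.0, 0.0], [0.0, 0.0], x, h)
```

The three extra vectors add only a few parameters per hidden unit, which is consistent with the remark above that the parameter count barely changes.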

papers.nips.cc
scholar.google.com

Architectural Complexity Measures of Recurrent Neural Networks
Zhang, Saizheng and Wu, Yuhuai and Che, Tong and Lin, Zhouhan and Memisevic, Roland and Salakhutdinov, Ruslan and Bengio, Yoshua
Neural Information Processing Systems Conference - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

This paper proposes several definitions of measures of complexity of a recurrent neural network. They measure:
1) recurrent depth (degree of multi-layeredness as a function of time of recursive connections)
2) feedforward depth (degree of multi-layeredness as a function of input -> output connections)
3) recurrent skip coefficient (degree of directness, like the inverse of multi-layeredness, of connections)

In addition to the actual definitions, there are two main contributions:
- The authors show that the measures (which are limits as the number of time steps -> infinity) are well defined.
- The authors correlate the measures with empirical performance in various ways, showing that all measures of depth can lead to improved performance.

This paper provides 3 measures of complexity for RNNs. The authors then show experimentally that these complexity measures are meaningful, in the sense that increasing complexity seems to correlate with better performance.

The authors first present a rigorous graph-theoretic framework that describes the connecting architectures of RNNs in general, with which the authors easily explain how we can unfold an RNN. The authors then go on to propose three architecture complexity measures of RNNs, namely the recurrent depth, the feedforward depth and the recurrent skip coefficient. Experiments on various tasks show the importance of certain measures on certain tasks, which indicates that those three complexity measures might be good guidelines when designing a recurrent neural network for certain tasks.

papers.nips.cc
scholar.google.com

Reward Augmented Maximum Likelihood for Neural Structured Prediction
Norouzi, Mohammad and Bengio, Samy and Chen, Zhifeng and Jaitly, Navdeep and Schuster, Mike and Wu, Yonghui and Schuurmans, Dale
Neural Information Processing Systems Conference - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

The proposed approach consists of corrupting the training targets with noise derived from the task reward while doing maximum likelihood training. This simple but specific smoothing of the target distribution significantly boosts the performance of neural structured output prediction, as showcased on TIMIT phone and translation tasks. The link between this approach and RL-based expected reward maximization is also made clear by the paper.

Prior work has chosen either maximum likelihood learning, which is relatively tractable but assumes a log likelihood loss, or reinforcement learning, which can be performed for a task-specific loss function but requires sampling many predictions to estimate gradients. The proposed objective bridges the gap with "reward-augmented maximum likelihood," which is similar to maximum likelihood but estimates the expected loss with samples that are drawn in proportion to their distance from the ground truth. Empirical results show good improvements with LSTM-based predictors on speech recognition and machine translation benchmarks relative to maximum likelihood training.

This work is inspired by recent advances in reinforcement learning and likelihood learning. The authors suggest learning parameters so as to minimize the KL divergence between CRFs and a probability model that is proportional to the reward function (which the authors call the payoff distribution, see Equation 4). The authors suggest an optimization algorithm for the KL-divergence minimization that depends on sampling from the payoff distribution.

Current methods to learn a model for structured prediction include max margin optimisation and reinforcement learning. However, the max margin approach only optimises a bound on the true reward, and requires loss augmented inference to obtain gradients, which can be expensive. On the other hand, reinforcement learning does not make use of available supervision, and can therefore struggle when the reward is sparse, and furthermore the gradients can have high variance. The paper proposes a novel approach to learning for problems that involve structured prediction. They relate their approach to simple maximum likelihood (ML) learning and reinforcement learning (RL): ML optimises the KL divergence of a delta distribution relative to the model distribution, and RL optimises the KL divergence of the model distribution relative to the exponentiated reward distribution. They propose reward-augmented maximum likelihood learning, which optimises the KL divergence of the exponentiated reward distribution relative to the model distribution. Compared to RL, the arguments of the KL divergence are swapped. Compared to ML, the delta distribution is generalised to the exponentiated reward distribution. Training is cheap in RML learning. It is only necessary to sample from the output set according to the exponentiated reward distribution. All experiments are performed in speech recognition and machine translation, where the structure over the output set is defined by the edit distance. An improvement is demonstrated over simple ML.
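The sampling step described above can be sketched as follows. This is a minimal illustration, assuming the exponentiated reward distribution has the form $q(y \mid y^*) \propto \exp(r(y, y^*)/\tau)$ with $r$ equal to the negative edit distance, and assuming a small explicit candidate set that can be enumerated. The candidate list, the temperature values, and the function names are hypothetical; over a full output space a dedicated sampling scheme would be needed instead of enumeration.

```python
import math
import random

def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def payoff_distribution(candidates, target, temperature):
    """Normalized exp(reward / temperature), with reward = -edit distance."""
    rewards = [-edit_distance(c, target) for c in candidates]
    max_r = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - max_r) / temperature) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

def sample_target(candidates, target, temperature, rng):
    """Draw a (possibly corrupted) training target from the payoff distribution."""
    probs = payoff_distribution(candidates, target, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]

candidates = ["cat", "cut", "dog"]  # hypothetical output set
smooth = payoff_distribution(candidates, "cat", temperature=1.0)
sharp = payoff_distribution(candidates, "cat", temperature=0.01)
noisy_target = sample_target(candidates, "cat", 1.0, random.Random(0))
```

As the temperature shrinks, the distribution concentrates on the ground truth, which recovers the delta distribution of plain maximum likelihood; the model is then trained by maximum likelihood on the sampled targets.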

arxiv.org
arxiv-vanity.com
scholar.google.com

Swapout: Learning an ensemble of deep architectures
Saurabh Singh and Derek Hoiem and David Forsyth
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CV, cs.LG, cs.NE

[link] Summary by NIPS Conference Reviews 9 years ago

Swapout is a method that stochastically selects forward propagation in a neural network from a palette of choices: drop, identity, feedforward, residual. It achieves the best results on CIFAR-10 and CIFAR-100 that I'm aware of.

This paper examines a stochastic training method for deep architectures that is formulated in such a way that the method generalizes dropout and stochastic depth techniques. The paper studies a stochastic formulation for layer outputs which can be written as $Y =\Theta_1 \odot X+ \Theta_2 \odot F(X)$, where $\Theta_1$ and $\Theta_2$ are tensors of i.i.d. Bernoulli random variables. This allows layers to either be dropped $(Y=0)$, act as a feedforward layer $(Y=F(X))$, be skipped $(Y=X)$, or behave like a residual network $(Y=X+F(X))$. The paper provides some well-reasoned conjectures as to why "both dropout and swapout networks interact poorly with batch normalization if one uses deterministic inference", while also providing some nice experiments on the importance of the choice of the form of stochastic training schedules and the number of samples required to obtain estimates that make sampling useful. The approach is able to yield performance improvement over comparable models if the critical details of the stochastic training schedule are respected and a sufficient number of samples is used.
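The formulation $Y = \Theta_1 \odot X + \Theta_2 \odot F(X)$ can be illustrated with a minimal sketch on plain lists. The keep probabilities `p1` and `p2`, the function name `swapout`, and the toy inputs are assumptions for illustration; `fx` stands for the precomputed layer output $F(X)$.

```python
import random

def swapout(x, fx, p1, p2, rng):
    """Return Theta1 * x + Theta2 * F(x) with independent per-unit Bernoulli masks."""
    out = []
    for xi, fi in zip(x, fx):
        theta1 = 1.0 if rng.random() < p1 else 0.0  # keep the identity path?
        theta2 = 1.0 if rng.random() < p2 else 0.0  # keep the transformed path?
        out.append(theta1 * xi + theta2 * fi)
    return out

rng = random.Random(0)
x = [1.0, 2.0, 3.0]
fx = [10.0, 20.0, 30.0]  # hypothetical F(x)

residual = swapout(x, fx, 1.0, 1.0, rng)     # both paths always kept: x + F(x)
feedforward = swapout(x, fx, 0.0, 1.0, rng)  # only F(x)
skipped = swapout(x, fx, 1.0, 0.0, rng)      # only x
dropped = swapout(x, fx, 0.0, 0.0, rng)      # all zeros
mixed = swapout(x, fx, 0.5, 0.5, rng)        # each unit picks one of the four cases
```

Because the masks are drawn per unit, a single layer can mix all four behaviours within one forward pass.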

This paper proposes a generalization of some stochastic regularization techniques for effectively training deep networks with skip connections (i.e. dropout, stochastic depth, ResNets.) Like stochastic depth, swapout allows for connections that randomly skip layers, which has been shown to give improved performance--perhaps due to shorter paths to the loss layer and the resulting implicit ensemble over architectures with differing depth. However, like dropout, swapout is independently applied to each unit in a layer allowing for a richer space of sampled architectures. Since accurate expectation approximations are not easily attainable due to the skip connections, the authors propose stochastic inference (in which multiple forward passes are averaged during inference) instead of deterministic inference. To evaluate its effectiveness, the authors evaluate swapout on the CIFAR dataset, showing improvements over various baselines.

papers.nips.cc
scholar.google.com

Deep ADMM-Net for Compressive Sensing MRI
Yang, Yan and Sun, Jian and Li, Huibin and Xu, Zongben
Neural Information Processing Systems Conference - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

The paper addresses the problem of compressive sensing MRI (CS-MRI) by proposing a "deep unfolding" approach (cf. http://arxiv.org/abs/1409.2574) with a sparsity-based data prior and inference via ADMM. All layers of the proposed ADMM-Net are based on a generalization of ADMM inference steps and are discriminatively trained to minimize a reconstruction error. In contrast to other methods for CS-MRI, the proposed approach offers both high reconstruction quality and fast run-time.

The basic idea is to convert the conventional optimization-based CS reconstruction algorithm into a fixed neural network learned with the back-propagation algorithm. Specifically, the ADMM-based CS reconstruction is approximated with a deep neural network. Experimental results show that the approximated neural network outperforms several existing CS-MRI algorithms with less computational time.

The ADMM algorithm has proven to be useful for solving problems with differentiable and non-differentiable terms, and therefore has a clear link with compressed sensing. Experiments show some gain in performance with respect to the state of the art, especially in terms of computational cost at test time.

papers.nips.cc
scholar.google.com

Scan Order in Gibbs Sampling: Models in Which it Matters and Bounds on How Much
He, Bryan D. and Sa, Christopher De and Mitliagkas, Ioannis and Ré, Christopher
Neural Information Processing Systems Conference - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

A study of how scan orders influence mixing time in Gibbs sampling.

This paper is interested in comparing the mixing rates of Gibbs sampling using either systematic scan or random updates. There are two basic contributions. First, in Section 2, a set of cases where systematic scan is polynomially faster than random updates. Together with a previously known case where it can be slower, this contradicts a conjecture that the speeds of systematic and random updates are similar. Second, in Theorem 1, a set of mild conditions under which the mixing times of systematic scan and random updates are not "too" different (roughly within squares of each other).
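To make the two scan orders concrete, here is a minimal sketch on a toy model. The model (an Ising chain with spins in {-1, +1} and coupling strength `coupling`) and all function names are assumptions chosen for illustration, not the constructions used in the paper.

```python
import math
import random

def resample_site(state, i, coupling, rng):
    """Resample spin i from its conditional distribution given its neighbours."""
    field = 0.0
    if i > 0:
        field += state[i - 1]
    if i < len(state) - 1:
        field += state[i + 1]
    # P(spin = +1 | neighbours) for an Ising chain with the given coupling.
    p_up = 1.0 / (1.0 + math.exp(-2.0 * coupling * field))
    state[i] = 1 if rng.random() < p_up else -1

def systematic_scan_sweep(state, coupling, rng):
    """Update every variable once, in a fixed order."""
    for i in range(len(state)):
        resample_site(state, i, coupling, rng)

def random_scan_sweep(state, coupling, rng):
    """Perform the same number of updates, each on a uniformly random variable."""
    for _ in range(len(state)):
        resample_site(state, rng.randrange(len(state)), coupling, rng)

rng = random.Random(0)
systematic_state = [1] * 8
random_state = [1] * 8
for _ in range(100):
    systematic_scan_sweep(systematic_state, 0.5, rng)
    random_scan_sweep(random_state, 0.5, rng)
```

Both sweeps cost the same number of conditional updates; the question studied in the paper is how the choice between them affects the number of sweeps needed to mix.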

First, following from a recent paper by Roberts and Rosenthal, the authors construct several examples which do not satisfy the commonly held belief that systematic scan is never more than a constant factor slower and a log factor faster than random scan. The authors then provide a result, Theorem 1, which gives weaker bounds, which however they verify at least under some conditions. In fact, the theorem compares random scan to a lazy version of the systematic scan and obtains bounds in terms of various other quantities, like the minimum probability or the minimum holding probability.

MCMC is at the heart of many applications of modern machine learning and statistics. It is thus important to understand the computational and theoretical performance under various conditions. The present paper focuses on examining systematic scan Gibbs sampling in comparison to random scan Gibbs sampling. The authors do so first through the construction of several examples which challenge the dominant intuitions about mixing times, and then develop theoretical bounds which are much wider than previously conjectured.

papers.nips.cc
scholar.google.com

Exploring Models and Data for Image Question Answering
Ren, Mengye and Kiros, Ryan and Zemel, Richard S.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

This paper addresses the task of image-based Q&A on 2 axes: comparison of different models on 2 datasets and creation of a new dataset based on existing captions.

The paper addresses an important and interesting new topic which has seen a recent surge of interest (Malinowski2014, Malinowski2015, Antol2015, Gao2015, etc.). The paper is technically sound, well-written, and well-organized. The authors achieve good results on both datasets and the baselines are useful to understand important ablations. The new dataset is also much larger than previous work, allowing training of stronger models, esp. deep NN ones.

However, there are several weaknesses: their main model is not very different from existing work on image-Q&A (Malinowski2015, who also had a VIS+LSTM style model, although they also jointly trained the CNN and RNN and decoded with RNNs to produce longer answers) and achieves similar performance (except that adding bidirectionality and 2-way image input helps). Also, as the authors themselves discuss, the dataset in its current form, synthetically created from captions, is a good start but is quite conservative and limited, being restricted to single-word answers, with transformation rules designed only for certain simple syntactic cases.

It is exploratory work and will benefit a lot from a bit more progress in terms of new models and a slightly broader dataset (at least with answers up to 2-3 words).

Regarding new models, e.g., attention-based models are very relevant and intuitive here (and the paper would be much more complete with this), since these models should learn to focus on the right area of the image to answer the given question and it would be very interesting to analyze the results of whether this focusing happens correctly.

Before attention models, since 2-way image input helped (actually, it would be good to ablate 2-way versus bidirectionality in the 2-VIS+BLSTM model), it would be good to also show the model version that feeds the image vector at every time step of the question.

Also, it would be useful to have a nearest neighbor baseline as in Devlin et al., 2015, given their discussion of COCO's properties. Here too, one could imagine copying answers of training questions, for cases where the captions are very similar.

Regarding a broader-scope dataset, the issue with the current approach is that it is too similar to the captioning approach or task, which has the drawback that a major motivation to move to image-Q&A is to move away from single, vague (non-specific), generic, one-event-focused captions to a more complex and detailed understanding of and reasoning over the image; which doesn't happen with this paper's current dataset creation approach, and so this will also not encourage thinking of very different models to handle image-Q&A, since the best captioning models will continue to work well here. Also, having 2-3 word answers will capture more realistic and more diverse scenarios; and though it is true that evaluation is harder, one can start with existing metrics like BLEU, METEOR, CIDEr, and human eval. And since these will not be full sentences but just 2-3 word phrases, such existing metrics will be much more robust and stable already.

Originality:

The task of image-Q&A is very recent with only a couple of prior and concurrent works, and the dataset creation procedure, despite its limitations (discussed above), is novel. The models are mostly not novel, being very similar to Malinowski2015, but the authors add bidirectionality and 2-way image input (but then Malinowski2015 was jointly training the CNN and RNN, and also decoding with RNNs to produce longer answers).

Significance:

As discussed above, the paper shows useful results and ablations on the important, recent task of image-Q&A, based on 2 datasets -- an existing small dataset and a new large dataset; however, the second, new dataset is synthetically created by rule-transforming captions and only to single-word answers, thus keeping the impact of the dataset limited, because it keeps the task too similar to the generic captioning task and because there is no generation of answers or prediction of multi-word answers.

papers.nips.cc
scholar.google.com

Winner-Take-All Autoencoders
Makhzani, Alireza and Frey, Brendan J.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

The paper proposes a novel way to train a sparse autoencoder where the hidden unit sparsity is governed by a winner-take-all kind of selection scheme. This is a convincing way to achieve a sparse autoencoder, while the paper could have included some more details about their training strategy and the complexity of the algorithm.

The authors present a fully connected auto-encoder with a new sparsity constraint called the lifetime sparsity. For each hidden unit across the mini-batch, they rank the activation values, keeping only the top-k% for reconstruction. The approach is appealing because they don't need to find a hard threshold and it makes sure every hidden unit/filter is updated (no dead filters because their activation was below the threshold).
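The ranking step described above can be sketched as follows. This is a minimal illustration on nested lists (rows are mini-batch examples, columns are hidden units); rounding the number of winners up and keeping at least one winner per unit are assumptions of this sketch, as are the function name and toy activations.

```python
import math

def lifetime_sparsity(activations, keep_fraction):
    """For each hidden unit, keep its top activations across the mini-batch; zero the rest."""
    batch_size = len(activations)
    num_units = len(activations[0])
    # Assumption: round up and always keep at least one winner per unit.
    keep = max(1, math.ceil(keep_fraction * batch_size))
    sparse = [[0.0] * num_units for _ in range(batch_size)]
    for unit in range(num_units):
        column = [activations[row][unit] for row in range(batch_size)]
        winners = sorted(range(batch_size),
                         key=lambda row: column[row], reverse=True)[:keep]
        for row in winners:
            sparse[row][unit] = column[row]
    return sparse

# Hypothetical activations: 4 examples, 2 hidden units.
batch = [[0.1, 0.9],
         [0.5, 0.2],
         [0.3, 0.4],
         [0.8, 0.7]]
top_quarter = lifetime_sparsity(batch, 0.25)
top_half = lifetime_sparsity(batch, 0.5)
```

Because the ranking is done per unit rather than per example, every unit has at least one winner in each mini-batch and therefore receives a gradient, which is the "no dead filters" property mentioned above.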

Their encoder is a deep stack of ReLU layers and the decoder is shallow and linear (note that usually non-symmetric auto-encoders lead to worse results). They also show how to apply the idea to RBMs. The effect of sparsity is clearly noticeable in the images depicting the filters.

They extend this auto-encoder in a convolutional/deconvolutional framework, making it possible to train on larger images than MNIST or TFD. They add a spatial sparsity, keeping the top activation per feature map for the reconstruction and combine it with the lifetime sparsity presented before.

The proposed approach exploits a mechanism close to that of the k-sparse autoencoders proposed by Makhzani et al. [14]. The authors extend the idea from [14] to build winner-take-all encoders (and RBMs) that enforce both spatial and lifetime regularization by keeping only a percentage (the biggest) of activations. The lifetime sparsity allows overcoming problems that could arise with k-sparse autoencoders. The authors next propose to embed their modeling framework in convolutional neural nets to deal with larger images than, e.g., those of MNIST.

papers.nips.cc
scholar.google.com

End-To-End Memory Networks
Sukhbaatar, Sainbayar and Szlam, Arthur and Weston, Jason and Fergus, Rob
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

This paper presents an end-to-end version of memory networks (Weston et al., 2015) such that the model doesn't train on the intermediate 'supporting facts' strong supervision of which input sentences are the best memory accesses, making it much more realistic. The model also has multiple hops (computational steps) per output symbol. The tasks are Q&A and language modeling, and the model achieves strong results.

The paper is a useful extension of memNN because it removes the strong, unrealistic supervision requirement and still performs pretty competitively. The architecture is defined pretty cleanly and simply. The related work section is quite well-written, detailing the various similarities and differences with multiple streams of related work. The discussion about the model's connection to RNNs is also useful.

papers.nips.cc
scholar.google.com

Stop Wasting My Gradients: Practical SVRG
Harikandeh, Reza and Ahmed, Mohamed Osama and Virani, Alim and Schmidt, Mark and Konecný, Jakub and Sallinen, Scott
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

This paper extends SVRG, a stochastic optimization algorithm proposed in recent years. The modifications mainly include: a convergence analysis of SVRG with a corrupted (inexact) full gradient; mixing SGD and SVRG iterations; a mini-batch strategy; and exploiting support vectors. For each modification, the authors give clear proofs and achieve linear convergence under smoothness and strong convexity assumptions. However, the paper's novelty is limited. The improvement in convergence rate is not obvious and the proof outline is very similar to that of the original SVRG. Key problems, such as support for non-strongly convex losses, remain unsolved.

This paper starts with a key proposition showing that SVRG does not require a very accurate approximation of the full gradient of the objective function. The authors use this proposition to derive a batching SVRG algorithm with the same convergence rate as the original SVRG. Then, the authors propose a mixed stochastic gradient/SVRG approach and give a convergence proof for this scheme. As a different way of speeding up the method, the authors propose a speed-up technique for the Huberized hinge-loss support vector machine.
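
For readers unfamiliar with the baseline being extended, here is a minimal sketch of plain SVRG on a one-dimensional least-squares problem. It shows only the standard variance-reduced update with an exact full gradient at each snapshot, not the batching or mixed variants proposed in the paper; the function name, step size, and loop lengths are assumptions chosen for the toy problem.

```python
import random

def svrg_least_squares(a, b, step=0.005, epochs=30, inner_steps=100, seed=0):
    """Minimise (1/n) * sum_i 0.5 * (a[i] * w - b[i])**2 over a scalar w."""
    rng = random.Random(seed)
    n = len(a)

    def grad_i(w, i):
        # Gradient of the i-th term of the sum.
        return a[i] * (a[i] * w - b[i])

    w_snapshot = 0.0
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per outer iteration.
        full_grad = sum(grad_i(w_snapshot, i) for i in range(n)) / n
        w = w_snapshot
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced stochastic gradient.
            w -= step * (grad_i(w, i) - grad_i(w_snapshot, i) + full_grad)
        w_snapshot = w
    return w_snapshot
```

The `full_grad` line is the expensive step; the key proposition summarised above concerns how inexact that quantity may be without losing the linear rate.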

papers.nips.cc
scholar.google.com

Spatial Transformer Networks
Jaderberg, Max and Simonyan, Karen and Zisserman, Andrew and Kavukcuoglu, Koray
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

This paper presents a novel layer that can be used in convolutional neural networks. A spatial transformer layer computes re-sampling points of the signal based on another neural network. The suggested transformations include scaling, cropping, rotations and non-rigid deformation, whose parameters are trained end-to-end with the rest of the model. The resulting re-sampling grid is then used to create a new representation of the underlying signal through bi-linear or nearest neighbor interpolation. This has interesting implications: the network can learn to co-locate objects in a set of images that all contain the same object, the transformation parameters localize the attention area explicitly, and fine data resolution is restricted to areas important for the task. Furthermore, the model improves over previous state-of-the-art on a number of tasks.

The layer has three parts: a mini neural network that regresses the parameters of a parametric transformation (e.g. affine), a module that applies the transformation to a regular grid, and a third module that more or less "reads off" the values at the transformed positions and maps them to a regular grid, hence un-deforming the image or previous layer. Gradients for back-propagation are derived in a few cases. The results are mostly of the classic deep learning variety, including MNIST and SVHN, but there is also the fine-grained birds dataset. The networks with spatial transformers seem to lead to improved results in all cases.
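
The grid and sampling stages can be sketched without any deep learning library. The example below assumes a single-channel image stored as a list of rows, coordinates normalised to [-1, 1], and a 2x3 affine matrix `theta` supplied by hand instead of by a localisation network; the function names are illustrative.

```python
import math

def affine_grid(theta, height, width):
    """Map every output pixel to source coordinates in [-1, 1] with a 2x3 affine matrix."""
    grid = []
    for r in range(height):
        y = -1.0 + 2.0 * r / (height - 1) if height > 1 else 0.0
        row = []
        for c in range(width):
            x = -1.0 + 2.0 * c / (width - 1) if width > 1 else 0.0
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append((xs, ys))
        grid.append(row)
    return grid

def bilinear_sample(image, grid):
    """Read the image at the (generally non-integer) grid positions."""
    h, w = len(image), len(image[0])
    out = []
    for row in grid:
        out_row = []
        for xs, ys in row:
            # Convert normalised coordinates back to pixel coordinates.
            px = (xs + 1.0) * (w - 1) / 2.0
            py = (ys + 1.0) * (h - 1) / 2.0
            x0, y0 = math.floor(px), math.floor(py)
            value = 0.0
            for yy in (y0, y0 + 1):
                for xx in (x0, x0 + 1):
                    if 0 <= xx < w and 0 <= yy < h:
                        # Bilinear weight; pixels outside the image count as zero.
                        weight = max(0.0, 1 - abs(px - xx)) * max(0.0, 1 - abs(py - yy))
                        value += weight * image[yy][xx]
            out_row.append(value)
        out.append(out_row)
    return out
```

With `theta = [[1, 0, 0], [0, 1, 0]]` the output reproduces the input, while scaling the diagonal entries by 0.5 zooms into the centre of the image. The sampled values are piecewise linear in the source coordinates, which is what lets gradients flow back to `theta`.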

papers.nips.cc
scholar.google.com

Inverse Reinforcement Learning with Locally Consistent Reward Functions
Nguyen, Quoc Phong and Low, Kian Hsiang and Jaillet, Patrick
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

This paper addresses the problem of inverse reinforcement learning when the agent can change its objective during the recording of trajectories. This results in transitions between several reward functions, each of which explains the trajectory of the observed agent only locally. Transition probabilities between reward functions are unknown. The authors propose a cascade of EM and Viterbi algorithms to discover the reward functions and the segments on which they are valid.

Their algorithm consists in maximizing the log-likelihood of the expert's demonstrated trajectories with respect to some parameters, namely the initial distributions of states and rewards, the local rewards and the transition function between rewards. To do so, they use the expectation-maximisation (EM) method. Then, via the Viterbi algorithm, they are able to partition the trajectories into segments with locally consistent rewards.
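
The segmentation step can be illustrated with a generic Viterbi decoder. In this sketch the hidden label at each time step stands for the reward function assumed to be active, and `log_emit[t][k]` is a hypothetical per-step log-likelihood of the expert's action under reward function k; how the paper actually computes these quantities is not reproduced here.

```python
def viterbi(log_init, log_trans, log_emit):
    """Most likely sequence of hidden labels.

    log_init[k]: log-probability of starting with label k.
    log_trans[j][k]: log-probability of switching from label j to label k.
    log_emit[t][k]: log-likelihood of the observation at step t under label k.
    """
    n_steps, n_states = len(log_emit), len(log_init)
    score = [log_init[k] + log_emit[0][k] for k in range(n_states)]
    back = []
    for t in range(1, n_steps):
        new_score, pointers = [], []
        for k in range(n_states):
            # Best predecessor for label k at step t.
            best_prev = max(range(n_states),
                            key=lambda j: score[j] + log_trans[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][k] + log_emit[t][k])
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_states), key=lambda k: score[k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Consecutive runs of the same label in the returned path correspond to the segments over which a single reward function is taken to be valid.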

Strengths of the paper:

1. The authors leverage existing and classical methods from the machine learning and optimization fields such as EM, Viterbi, Value iteration and gradient ascent in order to build their algorithm. This will allow the community to easily reproduce their results.
2. The experiments are conducted on synthetic and real-world data. They compare their method to MLIRL which does not use locally consistent rewards and which is the canonical choice to compare to as their algorithm is a generalization of MLIRL. The results presented show the superiority of their method over MLIRL.
3. The idea presented by the authors is original as far as I know.

Weaknesses of the paper:

1. The paper is very dense (the figures are incorporated in the text), which makes it difficult to read.
2. The proposed algorithm needs knowledge of the dynamics and of the number of rewards. As future work, the authors plan to extend their algorithm to an unknown number of rewards; however, they do not mention removing the need for knowledge of the dynamics. Could the authors comment on that, as some IRL algorithms do not need perfect knowledge of the dynamics?
3. The method needs to solve MDPs iteratively when learning the reward functions. For each theta in the gradient ascent, an MDP needs to be solved. Is this prohibitive for huge MDPs? Is there a way to avoid that step? The action-value function Q is defined via a softmax operator in order to have a differentiable policy; does this allow the MDP to be solved more efficiently?
4. The authors use gradient ascent in the EM method; could they comment on the concavity of their criterion?
5. In the experiments (gridworlds), the number of features for the states is very small, and thus it is understandable that a reward which is linear in the features will perform badly. Do the authors consider comparing their method to an IRL method where the number of features defining the states is greater? This is the main problem that I have with the experiments: the features used are not expressive enough to consider using a classical IRL method, and this can explain why MLIRL performs badly and why its performance does not improve when the number of expert trajectories grows.
6. The performance is measured by the average log-likelihood of the expert's demonstrated trajectories which is the criterion maximized by the algorithm. I think that a more pertinent measure would be the value function of the policy produced by the optimization of the reward obtained by the algorithm. Could the authors comment on that and explain why their performance metric is more appropriate?

papers.nips.cc
scholar.google.com

Teaching Machines to Read and Comprehend
Hermann, Karl Moritz and Kociský, Tomás and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

This paper deals with the formal question of machine reading. It proposes a novel methodology for automatically building datasets for evaluating machine reading models. To do so, the authors leverage news resources that come with a summary to generate a large number of questions about articles by replacing their named entities. Furthermore, an attention-enhanced, LSTM-inspired reading model is proposed and evaluated. The paper is well-written and clear; the originality seems to lie in two aspects. First, an original methodology for question answering dataset creation, where context-query-answer triples are automatically extracted from news feeds. This proposal can be considered important because it opens the way for large model learning and evaluation. The second contribution is the addition of an attention mechanism to an LSTM reading model. The empirical results seem to show relevant improvement with respect to an up-to-date list of machine reading models.

Given the lack of an appropriate dataset, the authors provide a new dataset scraped from CNN and Daily Mail, using both the full text and abstract summaries/bullet points. The dataset was then anonymised (i.e. entity names removed). Next, the authors present two novel deep long short-term memory models which perform well on the Cloze query task.
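
How a context-query-answer triple might be built from an article and one summary sentence can be sketched as follows. The marker strings `@entityN` and `@placeholder`, the function names, and the assumption that the entity list is already available (in practice it would come from an entity recogniser) are all illustrative, not taken from the paper.

```python
import re

def substitute(text, mapping):
    """Replace every key of `mapping` found in `text` in a single pass."""
    if not mapping:
        return text
    # Longer names first, so a name that contains another one is matched whole.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda match: mapping[match.group(0)], text)

def make_cloze(document, summary_sentence, entities, answer):
    """Build an anonymised (context, query, answer) triple."""
    markers = {name: "@entity%d" % idx for idx, name in enumerate(entities)}
    context = substitute(document, markers)
    # In the query, the answer entity is hidden behind a placeholder.
    query_markers = dict(markers)
    query_markers[answer] = "@placeholder"
    query = substitute(summary_sentence, query_markers)
    return context, query, markers[answer]
```

Anonymising the entities means a model cannot answer from world knowledge about the names alone; it has to read the context to resolve the placeholder.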

papers.nips.cc
scholar.google.com

Bandits with Unobserved Confounders: A Causal Approach
Bareinboim, Elias and Forney, Andrew and Pearl, Judea
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

The paper "Bandits with unobs. confounders: a causal approach" addresses the problem of bandit learning. It is assumed that in the observational setting, the player's decision is influenced by some unobserved context. If we randomize the player's decision, however, this intention is lost. The key idea is now that, using the available data from both scenarios, one can infer whether one should overrule the player's intention. Ultimately, this leads to the following strategy: observe the player's intention and then decide whether he should act accordingly or pull the other arm.

The authors show that current MAB algorithms actually attempt to maximize rewards according to the experimental distribution, which is not optimal in the confounding case, and propose to make use of the effect of the treatment on the treated (ETT), i.e., comparing the average payouts obtained by players for going in favor of or against their intuition. To me, the paper is interesting because it addresses the confounding issue in MAB and proposes a way to estimate some properties of the confounder (related to the casino's payout strategy in the given example) based on ETT.

At first glance, one might think that the blinking light on the slot machines (B) and the drunkenness of the patron (D) could be either modified or observed in lines 153-159, where we read about a hypothetical attempt to optimize reward using traditional Thompson sampling. If those factors were observable or subject to intervention -- and I'd think they would be, in reality -- then it would be straightforward to do better than the 30% reward rate that's given. The paper eventually makes it clear that both of these variables are unobserved and unalterable. It would help if this were explicit early in the example, or if the cover story were modified to make this aspect more intuitive.

papers.nips.cc
scholar.google.com

Multi-Task Bayesian Optimization
Swersky, Kevin and Snoek, Jasper and Adams, Ryan Prescott
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper presents a multi-task Bayesian optimization approach to hyper-parameter setting in machine learning models. In particular, it leverages previous work on multi-task GP learning with decomposable covariance functions and Bayesian optimization of expensive cost functions. Previous work has shown that decomposable covariance functions can be useful in multi-task regression problems (e.g. \cite{conf/nips/BonillaCW07}) and that Bayesian optimization based on response-surfaces can also be useful for hyper-parameter tuning of machine learning algorithms \cite{conf/nips/SnoekLA12} \cite{conf/icml/BergstraYC13}. 

The paper combines the decomposable covariance assumption \cite{conf/nips/BonillaCW07} and Bayesian optimization based on expected improvement \cite{journals/jgo/Jones01} and entropy search \cite{conf/icml/BergstraYC13} to show empirically that it is possible to:
1. Transfer optimization knowledge across related problems, addressing e.g. the cold-start problem 
2. Optimize an aggregate of different objective functions with applications to speeding-up cross validation 
3. Use information from a smaller problem to help optimize a bigger problem faster 

Positive experimental results are shown on synthetic data (Branin-Hoo function), optimizing logistic regression hyper-parameters and optimizing hyper-parameters of online LDA on real data.
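
The decomposable covariance assumption can be pictured as the product of a task-similarity matrix and an ordinary input kernel. The following is a minimal sketch for scalar inputs; the names `task_cov` and `multitask_kernel` and the choice of a squared-exponential input kernel are assumptions made for the illustration.

```python
import math

def rbf(x, y, length_scale=1.0):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((x - y) / length_scale) ** 2)

def multitask_kernel(x, task, y, other_task, task_cov, length_scale=1.0):
    """Covariance between (x, task) and (y, other_task).

    task_cov is a positive semi-definite matrix of task similarities.
    """
    return task_cov[task][other_task] * rbf(x, y, length_scale)
```

When two tasks have a large entry in `task_cov`, an observation on one of them reduces the posterior uncertainty on the other, which is the mechanism behind the transfer and cold-start applications listed above.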

papers.nips.cc
scholar.google.com

Predicting Parameters in Deep Learning
Denil, Misha and Shakibi, Babak and Dinh, Laurent and Ranzato, Marc'Aurelio and de Freitas, Nando
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

Motivated by recent attempts to learn very large networks, this work proposes an approach for reducing the number of free parameters in neural-network type architectures. The method is based on the intuition that there is typically strong redundancy in the learned parameters (for instance, the first layer filters of NNs applied to images are smooth): the authors suggest learning only a subset of the parameter values and then predicting the remaining ones through some form of interpolation. The proposed approach is evaluated for several architectures (MLP, convolutional NN, reconstruction-ICA) and different vision datasets (MNIST, CIFAR, STL-10). The results suggest that in general it is sufficient to learn fewer than 50% of the parameters without any loss in performance (significantly fewer parameters seem sufficient for MNIST).

The method is relatively simple: The authors assume a low-rank decomposition of the weight matrix and then further fix one of the two matrices using prior knowledge about the data (e.g., in the vision case, exploiting the fact that nearby pixels - and weights - tend to be correlated). This can be interpreted as predicting the "unobserved" parameters from the subset of learned filter weights via kernel ridge regression, where the kernel captures prior knowledge about the topology / "smoothness" of the weights. For the situation when such prior knowledge is not available the authors describe a way to learn a suitable kernel from data.
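
In standard kernel ridge regression notation, the prediction step described above can be written as follows. The symbols are assumptions made for this sketch and may differ from the paper's: $\alpha$ indexes the learned subset of weight locations, $K_\alpha$ is the kernel matrix among those locations, $k_\alpha$ holds the kernel values between the learned locations and all locations, $w_\alpha$ is the vector of learned weights, and $\lambda$ is a ridge parameter.

$$ \hat{w} = k_\alpha^\top (K_\alpha + \lambda I)^{-1} w_\alpha $$

Stacking the filters as columns gives the low-rank form $W = U V$, where the fixed factor $U = k_\alpha^\top (K_\alpha + \lambda I)^{-1}$ encodes the prior knowledge and the learned factor $V = W_\alpha$ contains the subset of weights that is actually trained.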

The idea of reducing the number of parameters in NN-like architectures through connectivity constraints in itself is of course not novel, and the authors provide a pretty good discussion of related work in section 5. Their method is very closely related to the idea of factorizing weight matrices as is, for instance, commonly done for 3-way RBMs (e.g. ref [22] in the paper), but also occasionally for standard RBMs (e.g. [R1], missing in the paper). The present paper differs from these in that the authors propose to exploit prior knowledge to constrain one of the matrices. As also discussed by the authors, the approach can further be interpreted as a particular type of pooling -- a strategy commonly employed in convolutional neural networks. Another view of the proposed approach is that the filters are represented as a linear combination of basis functions (in the paper, the particular form of the basis functions is determined by the choice of kernel). Such representations have been explored in various forms and to various ends in the computer vision and signal processing literature (see e.g. [R2,R3,R4,R5]). [R4,R5], for instance, represent filters in terms of a linear combination of basis functions that reduce the computational complexity of the filtering process.

papers.nips.cc
scholar.google.com

Memory Limited, Streaming PCA
Mitliagkas, Ioannis and Caramanis, Constantine and Jain, Prateek
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper proposes an approach to one-pass SVD based on a blocked variant of the power method, in which variance is reduced within each block of streaming data, and compares it to exact batch SVD.
Figure 1d is offered as an example where the proposed Algo 1 scales to data that the authors claim is so large that "traditional batch methods" could not be run and reported. Yet there are many existing well-known SVD methods which are routinely used for even larger data sets than the largest here (sparse 8.2M vectors for 120k dimensions). These include EMPCA (Roweis 1998) and fast randomized SVD (Halko et al. 2011), both of which the authors cite. Why were these methods (both very simple to implement efficiently even in Matlab, etc.) not reported for this data? A comparison against the randomized SVD is especially necessary, since it too can be done in one pass (see Halko et al.), although that cited paper discusses the tradeoffs in doing multiple passes, something this paper does not even discuss. The authors say it took "a few hours" for Algo 1 to extract the top 7 components. Methods like the randomized SVD family of Halko et al. scale linearly in those parameters (n=8.2M and d=120k and k=7 and the number of non-zeros of the sparse data) and typically run in less than 1 hour for even larger data sets. So, demonstrating both the speed and accuracy of the proposed Algo 1 compared to the randomized algorithms seems necessary at this point, to establish the practical significance of this proposed approach.

This paper identifies and resolves a basic gap in the design of streaming PCA algorithms. It is shown that a block stochastic streaming version of the power method recovers the dominant rank-k PCA subspace with optimal memory requirements and a sample complexity not much worse than that of batch PCA (which maintains the covariance matrix explicitly), assuming that the streaming data is drawn from a natural probabilistic generative model. The paper is excellently written and provides intuitions for the analysis, proceeding from the exact rank-1 and exact rank-k cases to the general rank-k approximation problem. Some empirical analysis is also provided, illustrating the approach for PCA on large document-term matrices.
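
The idea of a block power method that never forms the covariance matrix can be sketched for the rank-1 case. This is a simplified illustration rather than the paper's Algo 1: it tracks a single direction, takes the starting vector as an argument instead of drawing it at random, and `synthetic_stream` is a hypothetical data source added so the example is self-contained.

```python
import math
import random

def block_power_method(stream, dim, block_size, init):
    """One pass over the stream using O(dim) memory; returns a unit vector."""
    norm = math.sqrt(sum(v * v for v in init))
    q = [v / norm for v in init]
    acc = [0.0] * dim
    count = 0
    for x in stream:
        # Accumulate x * (x . q), i.e. the sample covariance applied to q.
        proj = sum(a * b for a, b in zip(x, q))
        for d in range(dim):
            acc[d] += proj * x[d]
        count += 1
        if count == block_size:
            norm = math.sqrt(sum(v * v for v in acc))
            if norm > 0:
                q = [v / norm for v in acc]  # one power iteration per block
            acc = [0.0] * dim
            count = 0
    return q

def synthetic_stream(n, seed=0):
    """Hypothetical test data: variance 9 along the first axis, 0.25 along the second."""
    rng = random.Random(seed)
    for _ in range(n):
        yield [3.0 * rng.gauss(0, 1), 0.5 * rng.gauss(0, 1)]
```

Averaging over a block before each normalisation is what reduces the variance of the update; only the current estimate and one accumulator of the same size are kept in memory.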

papers.nips.cc
scholar.google.com

Analyzing the Harmonic Structure in Graph-Based Learning
Wu, Xiao-Ming and Li, Zhenguo and Chang, Shih-Fu
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors introduce a functional called the "harmonic loss" and show that (a) it characterizes smoothness in the sense that functions with small harmonic loss change little across large cuts (to be precise, the cut has to be a level set separator) (b) several algorithms for learning on graphs implicitly try to find functions that minimize the harmonic loss, subject to some constraints.

The "harmonic loss" they define is essentially the (signed) divergence $\nabla f$ of the function across the cut, so it's not surprising that it should be closely related to smoothness. In classical vector calculus one would take the inner product of this divergence with itself and use the identity

$\langle \nabla f, \nabla f \rangle = \langle f, \nabla^2 f \rangle$

to argue that functions with small variation, i.e., small $| \nabla f |^2$ almost everywhere can be found by solving the Laplace equation. On graphs, modulo some tweaking with edge weights, essentially the same holds, leading to minimizing the quadratic form $ f^\top L f$, which is at the heart of all spectral methods. So in this sense, I am not surprised.
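
The graph version of this statement is easy to check numerically: for a symmetric weight matrix, the quadratic form $f^\top L f$ equals the weighted sum of squared differences across edges. The sketch below uses the unnormalised Laplacian (degree matrix minus weight matrix) and illustrative function names.

```python
def laplacian(weights):
    """Unnormalised graph Laplacian L = D - W for a symmetric weight matrix."""
    n = len(weights)
    return [[(sum(weights[i]) if i == j else 0.0) - weights[i][j]
             for j in range(n)] for i in range(n)]

def quadratic_form(matrix, f):
    """Compute f^T M f."""
    n = len(f)
    return sum(f[i] * matrix[i][j] * f[j] for i in range(n) for j in range(n))

def dirichlet_energy(weights, f):
    """Sum over edges of w_ij * (f_i - f_j)^2, each edge counted once."""
    n = len(f)
    return sum(weights[i][j] * (f[i] - f[j]) ** 2
               for i in range(n) for j in range(i + 1, n))
```

For the path graph with weights `[[0, 1, 0], [1, 0, 2], [0, 2, 0]]` and `f = [1, 3, 0]`, both quantities evaluate to 22.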

Alternatively, one can minimize the integral of $| \nabla f |$, which is the total variation, and leads to a different type of regularization ($l1$ rather than $l2$ is one way to put it). The "harmonic loss" introduced in this paper is essentially this total variation, except there is no absolute value sign. Among all this fairly standard stuff, the interesting thing about the paper is that for the purpose of analyzing algorithms one can get away with only considering this divergence across cuts that separate level sets of $f$, and in that case all the gradients point in the same direction so one can drop the absolute value sign. This is nice because the "harmonic loss" becomes linear and a bunch of things about it are very easy to prove. At least this is my interpretation of what the paper is about.

scholar.google.com

Distributed representations of words and phrases and their compositionality
Mikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff
Advances in neural information processing systems - 2013 via Local Bibsonomy
Keywords: thema:deepwalk, language, modelling, representation

[link] Summary by NIPS Conference Reviews 10 years ago

The paper discusses a number of extensions to the Skip-gram model previously proposed by Mikolov et al (citation [7] in the paper), which learns linear word embeddings that are particularly useful for analogical reasoning tasks. The extensions proposed (namely, negative sampling and sub-sampling of high frequency words) enable extremely fast training of the model on large scale datasets. This also results in significantly improved performance as compared to previously proposed techniques based on neural networks. The authors also provide a method for training phrase level embeddings by slightly tweaking the original training algorithm.

This paper proposes 3 improvements for the skip-gram model, which allows for learning embeddings for words. The first improvement is subsampling of frequent words, the second is the use of a simplified version of noise contrastive estimation (NCE), and the third is a method to learn idiomatic phrase embeddings. In all three cases the improvements are somewhat ad-hoc. In practice, both the subsampling and negative samples help to improve generalization substantially on an analogical reasoning task. The paper reviews related work and furthers the interesting topic of additive compositionality in embeddings.
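
The first two improvements can be written down compactly. The sketch below shows the discard probability used for subsampling, with a threshold of 1e-5 assumed as a typical value, and the negative-sampling loss for one (centre, context) pair with negatives that have already been drawn; the function names are illustrative and the noise distribution used to draw the negatives is left out.

```python
import math

def discard_probability(frequency, threshold=1e-5):
    """Probability of dropping an occurrence of a word with this relative frequency."""
    return max(0.0, 1.0 - math.sqrt(threshold / frequency))

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(center, context, negatives):
    """Loss for one positive pair and a list of sampled negative word vectors."""
    positive = log_sigmoid(dot(center, context))
    negative = sum(log_sigmoid(-dot(center, v)) for v in negatives)
    return -(positive + negative)
```

A word that makes up 0.1% of the corpus is dropped about 90% of the time under this rule, while words rarer than the threshold are always kept. The loss touches only the sampled negatives and never the full vocabulary, which is where the speed comes from.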

The article does not propose any explanation as to why negative sampling produces better results than NCE, which it is supposed to loosely approximate. In fact, it doesn't explain why, besides the obvious generalization gain, the negative sampling scheme should be preferred to NCE, since they achieve similar speeds.

papers.nips.cc
scholar.google.com

The Fast Convergence of Incremental PCA
Balsubramani, Akshay and Dasgupta, Sanjoy and Freund, Yoav
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper proves fast convergence rates for Oja's well-known incremental algorithm for PCA. The proof uses a novel technique to describe the progress of the algorithm, by breaking it into several "epochs"; this is necessary because the PCA problem is not convex, and has saddle points. The proof also uses some ideas from the study of stochastic gradient descent algorithms for strongly convex functions. The theoretical bounds give some insight into the practical performance of Oja's algorithm, and its sensitivity to different parameter settings. 

They prove a $\tilde{O}(1/n)$ finite-sample rate of convergence for estimating the leading eigenvector of the covariance matrix. Their results suggest the best learning rate for incremental PCA. Their analysis also provides insight into the relationship with SGD on strongly convex functions.
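
For reference, Oja's update itself is only a few lines. The sketch below assumes a step size of the form c/n, chosen here for illustration, and uses a hypothetical `synthetic_stream` helper as test data; neither is taken from the paper.

```python
import math
import random

def oja(stream, init, step_constant=1.0):
    """Oja's incremental rule for the leading eigenvector, with step c / n."""
    norm = math.sqrt(sum(v * v for v in init))
    q = [v / norm for v in init]
    for n, x in enumerate(stream, start=1):
        gamma = step_constant / n
        proj = sum(a * b for a, b in zip(x, q))
        # Stochastic gradient step on the Rayleigh quotient, then renormalise.
        q = [qi + gamma * proj * xi for qi, xi in zip(q, x)]
        norm = math.sqrt(sum(v * v for v in q))
        if norm > 0:
            q = [v / norm for v in q]
    return q

def synthetic_stream(n, seed=0):
    """Hypothetical test data: variance 9 along the first axis, 0.25 along the second."""
    rng = random.Random(seed)
    for _ in range(n):
        yield [3.0 * rng.gauss(0, 1), 0.5 * rng.gauss(0, 1)]
```

The constant in the step size has to be large enough relative to the eigengap for the estimate to converge quickly, which is the kind of parameter sensitivity the theoretical bounds shed light on.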

papers.nips.cc
scholar.google.com

Matrix factorization with binary components
Slawski, Martin and Hein, Matthias and Lutsik, Pavlo
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper discusses a new approach to binary matrix factorization 
that is motivated by recent developments in non-negative matrix 
factorization. The goal of 
the paper is to present an algorithm for finding a factorization of a 
matrix in the form $D = T A$ where the entries of $T$ are in 
$\{0,1\}$. Such a model has wide applicability and is of interest to 
the ML community. The algorithm has provable recovery guarantees in 
the case of noiseless observations. A modified algorithm is applied 
to the noisy setting; however, the authors do not establish recovery 
guarantees. 

The paper presents an algorithm for low-rank matrix factorization with the constraint that one of the factors be binary. The paper has several novel contributions for this problem. The algorithm guarantees the exact solution with a time complexity of $O(mr2^r+mnr)$, whereas the previous approach (E. Meeds et al., NIPS 2007) uses an MCMC algorithm and therefore cannot guarantee global convergence. Under additional assumptions on the binary factor matrix $T$, the uniqueness of $T$ is proved, which means that each data point has a unique representation in terms of the columns of $T$. Using the Littlewood-Offord lemma, the paper computes a theoretical speed-up factor for the heuristic used in the candidate binary vector set reduction step.

media.nips.cc
sci-hub
scholar.google.com

Learning to Pass Expectation Propagation Messages
Heess, Nicolas and Tarlow, Daniel and Winn, John
Advances in Neural Information Processing Systems 26 - 2013 via Local Bibsonomy
Keywords: ep

[link] Summary by NIPS Conference Reviews 10 years ago

This paper proposes learning expectation propagation (EP) message update operators from data, which would enable fast and efficient approximate inference in situations where computing these operators is otherwise intractable.

This paper attacks the problem of computing the intractable low dimensional statistics in EP message passing by training a neural network. Training data is obtained using importance sampling and assuming that we know the forward model. The paper appears technically correct, honest about shortcomings, provides an original approach to a known challenge within EP and nicely illustrates the developed method in a number of well-chosen examples.

The authors propose a method for learning a mapping from input messages to the output message in the context of expectation propagation. The method can be thought of as a sort of "compilation" step, where there is a one-time cost of closely approximating the true output messages using importance sampling, after which a neural network is trained to reproduce the output messages in the context of future inference queries.

papers.nips.cc
scholar.google.com

Robust Low Rank Kernel Embeddings of Multivariate Distributions
Song, Le and Dai, Bo
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors present a robust low rank kernel embedding related to higher order tensors and 
latent variable models. In general the work is interesting and promising. It provides synergies between machine learning, kernel methods, tensors and latent variable models. 

The RKHS embedding of a joint probability distribution between two variables involves the notion of covariance operators. For joint distributions over multiple variables, a tensor operator is needed. The paper defines these objects together with appropriate inner product, norms and reshaping operations on them. The paper then notes that in the presence of latent variables where the conditional dependence structure is a tree, these operators are low-rank when reshaped along the edges connecting latent variables. A low-rank decomposition of the embedding is then proposed that can be implemented on Gram matrices. Empirical results on density estimation tasks are impressive.

papers.nips.cc
scholar.google.com

Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis
Voss, James R. and Rademacher, Luis and Belkin, Mikhail
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper presents a fast ICA algorithm that works best under Gaussian noise. This is demonstrated with components simulated from different univariate distributions and variable Gaussian noise. 

The writing is clear. The paper is incremental in the sense that it builds on ideas from (Belkin et al., 2013) but focuses on speeding up and improving their cumulant-based approach.
This is achieved via 
1. a Hessian expansion of the cumulant-tensor-based quasi-orthogonalization. 
2. gradient-based iterations that preserve quasi-orthogonalization of the latent factors (noised case) as well as whitening in the noiseless case. 

This paper proposes a cumulant-based independent component analysis (ICA) algorithm for source separation in the presence of additive Gaussian noise. The algorithm is somewhat incremental, building upon Refs [2] and [3], but appears technically correct, with experimental results confirming the claims made. The algorithms used for benchmarking assume no additive noise but, like InfoMax, are often quite robust to the addition of noise.

papers.nips.cc
scholar.google.com

Multi-Prediction Deep Boltzmann Machines
Goodfellow, Ian J. and Mirza, Mehdi and Courville, Aaron C. and Bengio, Yoshua
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper presents a method for learning layers of representation and for completing missing queries, both in inputs and labels, in a single procedure, unlike some other methods such as deep Boltzmann machines (DBMs). It is a recurrent net following the same operations as a DBM, with the goal of predicting a subset of inputs from its complement. Parts of the paper are badly written, especially the model explanation and the multi-inference section; nevertheless the paper should be published and I hope the authors will rewrite them.

Deep Boltzmann Machines (DBMs) are usually initialized by greedily training a stack of RBMs, and then fine-tuning the overall model using persistent contrastive divergence (PCD). To perform classification, one typically provides the mean-field features to a separate classifier (e.g. an MLP) which is trained discriminatively. Therefore the overall process is somewhat ad-hoc, consisting of L + 2 models (where L is the number of hidden layers) each with its own objective. This paper presents a holistic training procedure for DBMs which has a single training stage (where both input and output variables are predicted), producing models which can classify directly as well as efficiently perform other tasks such as imputing missing inputs. The main technical contribution is the mechanism by which training is performed: a way of training DBMs which uses the mean field equations for the DBM to induce recurrent nets that are trained to solve different inference tasks (essentially predicting different subsets of observed variables).

papers.nips.cc
scholar.google.com

Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
Dauphin, Yann and Bengio, Yoshua
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper uses a subsampling-based method to speed up ratio matching training of 
RBMs on high-dimensional sparse binary data. The proposed approach is a simple 
adaptation of the method proposed by Dauphin et al. (2011) for denoising 
autoencoders. 

This paper develops an algorithm that can successfully train RBMs on very high dimensional but sparse input data, such as often arises in NLP problems. The algorithm adapts a previous method developed for denoising autoencoders for use with RBMs. The authors present extensive experimental results verifying that their method learns a good generative model; provides unbiased gradient estimates; attains a two-orders-of-magnitude speedup on large sparse problems relative to the standard implementation; and yields state of the art performance on a number of NLP tasks. They also document the curious result that using a biased version of their estimator in fact leads to better performance on the classification tasks they tested.

papers.nips.cc
scholar.google.com

Fast Convergence of Regularized Learning in Games
Syrgkanis, Vasilis and Agarwal, Alekh and Luo, Haipeng and Schapire, Robert E.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors give a theoretical analysis of faster convergence in multi-player normal-form games, generalizing techniques developed for two-player zero-sum games. They also perform an empirical evaluation on a 4-bidder simultaneous auction game.

The paper is concerned with two problems:

1. How does the social welfare of players using regret minimization algorithms compare to the optimal welfare?
2. Can one obtain better regret bounds when all players use a regret minimization algorithm?

The paper deals with bounds on regret minimization algorithms in games. The usual regret bound for these algorithms is in $O(\sqrt{T})$. This, however, assumes that the learner faces a completely adversarial opponent. It is natural to assume that in a game everyone will play a regret minimization algorithm, and the question is whether one can obtain better rates in this scenario. The authors show that regret in $O(T^{1/4})$ is achievable for general games.
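To make the notion of regret under self-play concrete, here is a minimal sketch in which two players run Hedge against each other and each player's external regret is measured. It is not the authors' experiment: the two-player zero-sum setting, the function name `self_play_regret`, the step size `eta`, and the `optimistic` flag (which counts the most recent payoff vector twice, one common recency-biased variant) are all assumptions made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_play_regret(payoff, rounds, eta, optimistic=True):
    """Both players run Hedge on a zero-sum matrix game.

    payoff[i][j] is the row player's utility; the column player gets its negative.
    Returns the external regret of (row player, column player).
    """
    n, m = len(payoff), len(payoff[0])
    cum_row, cum_col = [0.0] * n, [0.0] * m      # cumulative utility per action
    last_row, last_col = [0.0] * n, [0.0] * m    # most recent utility vectors
    earned_row = earned_col = 0.0
    bonus = 1.0 if optimistic else 0.0           # recency bias (assumption)
    for _ in range(rounds):
        x = softmax([eta * (c + bonus * l) for c, l in zip(cum_row, last_row)])
        y = softmax([eta * (c + bonus * l) for c, l in zip(cum_col, last_col)])
        # Expected utility of every pure action against the opponent's mix.
        u_row = [sum(payoff[i][j] * y[j] for j in range(m)) for i in range(n)]
        u_col = [-sum(payoff[i][j] * x[i] for i in range(n)) for j in range(m)]
        earned_row += sum(p * u for p, u in zip(x, u_row))
        earned_col += sum(p * u for p, u in zip(y, u_col))
        cum_row = [c + u for c, u in zip(cum_row, u_row)]
        cum_col = [c + u for c, u in zip(cum_col, u_col)]
        last_row, last_col = u_row, u_col
    # Regret = best fixed action in hindsight minus what was actually earned.
    return max(cum_row) - earned_row, max(cum_col) - earned_col
```

Calling `self_play_regret(game, T, eta)` for increasing `T`, with and without `optimistic`, gives a rough feel for how regret grows when the opponent is itself a learner rather than an adversary.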

papers.nips.cc
scholar.google.com

Competitive Distribution Estimation: Why is Good-Turing Good
Orlitsky, Alon and Suresh, Ananda Theertha
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper gives justification for the widespread use of the Good-Turing estimator for discrete distribution estimation through minimax regret analysis with two comparator classes. The paper obtains competitive regret bounds that lead to a more accurate characterization of the performance of Good-Turing estimators and in some cases are much better than the best known risk bounds. The comparator classes considered are estimators with knowledge of the distribution up to permutation, and estimators with full knowledge of the distribution, but with the constraint that they must assign the same probability mass to symbols appearing with the same frequencies.
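For readers unfamiliar with the estimator, the following is a small sketch of the basic (unsmoothed) Good-Turing rule, not the modified variant analyzed in the paper. A symbol seen $t$ times in $n$ samples gets probability $(t+1) N_{t+1} / (N_t \, n)$, where $N_t$ is the number of distinct symbols seen exactly $t$ times. The function name `good_turing` is hypothetical.

```python
from collections import Counter

def good_turing(sample):
    """Basic Good-Turing estimates for observed symbols plus the missing mass."""
    n = len(sample)
    counts = Counter(sample)                   # symbol -> number of occurrences t
    freq_of_freq = Counter(counts.values())    # t -> N_t
    estimates = {}
    for symbol, t in counts.items():
        # Symbols with equal counts receive equal probability by construction.
        estimates[symbol] = (t + 1) * freq_of_freq.get(t + 1, 0) / (freq_of_freq[t] * n)
    # Estimated total probability of symbols never observed.
    missing_mass = freq_of_freq.get(1, 0) / n
    return estimates, missing_mass
```

Note that the raw rule assigns zero probability to the most frequent symbols whenever $N_{t+1} = 0$, which is one reason practical versions smooth or combine it with the empirical frequency.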

papers.nips.cc
scholar.google.com

A* Sampling
Maddison, Chris J. and Tarlow, Daniel and Minka, Tom
Neural Information Processing Systems Conference - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper introduces a new approach to sampling from continuous probability distributions. The method extends prior work on using a combination of Gumbel perturbations and optimization to the continuous case. This is technically challenging, and they devise several interesting ideas to deal with continuous spaces, e.g. to produce an exponentially large or even infinite number of random variables (one per point of the continuous/discrete space) with the right distribution in an implicit way. Finally, they highlight an interesting connection with adaptive rejection sampling. Some experimental results are provided and show the promise of the approach. 

This paper introduces a sampling algorithm based on the Gumbel-max trick and A* search for continuous spaces. The Gumbel-Max trick adds perturbations to an energy function and after applying argmax, results in exact samples from the Gibbs distribution. While this applies to discrete spaces, this paper extends this idea to continuous spaces using the upper bounds on the infinitely many perturbation values.
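The discrete Gumbel-Max trick that the paper starts from can be sketched in a few lines. This covers only the finite case, not the A* search over continuous spaces; the function name `gumbel_max_sample` and the use of log-potentials (negative energies) as input are illustrative assumptions.

```python
import math
import random

def gumbel_max_sample(log_potentials, rng):
    """Draw index i with probability proportional to exp(log_potentials[i])."""
    best_index, best_value = None, -math.inf
    for i, phi in enumerate(log_potentials):
        u = rng.random()
        while u == 0.0:               # avoid log(0); random() lies in [0, 1)
            u = rng.random()
        gumbel = -math.log(-math.log(u))   # standard Gumbel noise
        if phi + gumbel > best_value:
            best_index, best_value = i, phi + gumbel
    return best_index
```

Perturbing every log-potential independently and taking the argmax yields an exact sample, which is why the continuous extension needs a way to handle infinitely many perturbations implicitly.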

papers.nips.cc
scholar.google.com

Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)
Shrivastava, Anshumali and Li, Ping
Neural Information Processing Systems Conference - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper generalizes the LSH method to account for the (bounded) lengths of the database vectors, so that the LSH tricks for fast approximate nearest neighbor search can exploit the well-known relation between Euclidean distance and dot product similarity (e.g. as in equation 2) and support MIPS as well. They give 3 motivating examples where solving MIPS rather than kNN per se is more appropriate and needed. Their algorithm is essentially equation 9 (using equation 7 to compute vector reformulations $Q(q)$ and $P(x)$ of the query and a database element respectively). This is based on an apparently novel observation (equation 8) that the squared distance from the query converges to a constant minus twice the dot product when a parameter m, which controls the exponentiation of the $P(x)$ vector elements, is sufficiently large (e.g. just 3 is claimed to suffice, leading to vectors $Q(q)$ and $P(x)$ which have just m more dimensions than the original input).
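The asymmetric transformation can be checked numerically. The sketch below assumes the commonly stated form $P(x) = [x; \|x\|^2; \|x\|^4; \dots; \|x\|^{2^m}]$ and $Q(q) = [q; 1/2; \dots; 1/2]$ with $\|q\| = 1$ and $\|x\| < 1$, which gives $\|Q(q) - P(x)\|^2 = 1 + m/4 - 2\, q \cdot x + \|x\|^{2^{m+1}}$. The function names are hypothetical, and the hashing step itself is omitted.

```python
def preprocess(x, m):
    """P(x): append ||x||^2, ||x||^4, ..., ||x||^(2^m) to a database vector."""
    power = sum(v * v for v in x)      # ||x||^2
    extra = []
    for _ in range(m):
        extra.append(power)
        power = power * power          # squaring doubles the exponent
    return list(x) + extra

def query_transform(q, m):
    """Q(q): append m copies of 1/2 to a unit-norm query."""
    return list(q) + [0.5] * m

def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))
```

Because the residual term $\|x\|^{2^{m+1}}$ vanishes quickly for $\|x\| < 1$, the nearest neighbor of $Q(q)$ among the $P(x)$ is, up to that residual, the database vector with the largest inner product.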

papers.nips.cc
scholar.google.com

Scalable Influence Estimation in Continuous-Time Diffusion Networks
Du, Nan and Song, Le and Gomez-Rodriguez, Manuel and Zha, Hongyuan
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper addresses how to estimate and maximize influence in large networks, where influence of node (or set of nodes) A is the expected number of nodes that will eventually adopt a certain idea following the initial adoption by A. The authors develop an algorithm for estimating influence within a given time frame, then use it as the basis of a greedy algorithm to find a given number of nodes to (approximately) maximize influence within the given time frame. They present theoretical bounds and an experimental evaluation of the algorithm. 

The authors build on an extensive list of existing work, which is appropriately cited. The most relevant is the work by Gomez-Rodriguez & Scholkopf (2012) \cite{conf/icml/Gomez-RodriguezS12}, which provides an exact analytical solution to the identical formulation of the influence estimation problem. The main innovation in the present paper is a fast randomized algorithm for estimating influence, which is based on the algorithm for estimating neighborhood size by Cohen (1997) \cite{journals/jcss/Cohen97}. This approximation allows more flexibility in modeling the flows through the edges, is substantially faster than the analytical solution, and scales well with network size. Overall, this is a solid paper on an important topic of practical relevance.
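The core idea behind Cohen-style size estimation can be illustrated as follows. This is a simplified sketch on an unweighted directed graph, not the authors' continuous-time algorithm: every node gets an independent Exp(1) label, the minimum label over a set of size $s$ is Exp($s$) distributed, and so $(m-1)/\sum_{r=1}^{m} \ell_r$ over $m$ independent rounds estimates $s$. The naive traversal and the names `reachable` and `estimate_reach_size` are assumptions used only to check the estimator; the efficient algorithm obtains least labels without a traversal per source.

```python
import random
from collections import deque

def reachable(graph, source):
    """Set of nodes reachable from source (including source) via BFS."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def estimate_reach_size(graph, source, rounds, rng):
    """Estimate the number of reachable nodes from least exponential labels."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    reach = reachable(graph, source)
    total = 0.0
    for _ in range(rounds):
        labels = {v: rng.expovariate(1.0) for v in nodes}   # fresh labels per round
        total += min(labels[v] for v in reach)              # least label in the set
    return (rounds - 1) / total
```

The relative error shrinks roughly like one over the square root of the number of rounds, independent of the network size.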

papers.nips.cc
scholar.google.com

Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints
Iyer, Rishabh K. and Bilmes, Jeff A.
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors introduce two new submodular optimization problems and 
investigate approximation algorithms for them. The problems are 
natural generalizations of many previous problems: there is a covering 
problem ($min\\{ f(X) : g(X) \ge c\\}$) and a packing or knapsack problem 
($max\\{ g(X) : f(X) \le b\\}$), where both f and g are submodular. These 
generalize well-known, previously studied versions of the problems, which 
usually assume that f is modular. They show that there is an intimate 
relationship between the two problems: any polynomial-time 
bi-criterion algorithm for one problem implies one for the other 
problem (with similar approximation factors) using a simple reduction. 
They then present a general iterative framework for solving the two 
problems by replacing either f or g by tight upper or lower bounds 
(often modular) at each iteration. These tend to reduce the problem 
at each iteration to a simpler subproblem for which there are existing 
algorithms with approximation guarantees. In many cases, they are able 
to translate these into approximation guarantees for the more general 
problem. Their approximation bounds are curvature-dependent and 
highlight the importance of this quantity on the difficulty of the 
problem. The authors also present a hardness result that matches their 
best approximation guarantees up to log factors, show that a number of 
existing approximation algorithms (e.g. greedy ones) for the simpler 
problem variants can be recovered from their framework by using 
specific modular bounds, and show experimentally that the simpler 
algorithm variants may perform as well as the ones with better 
approximation guarantees in practice.

papers.nips.cc
scholar.google.com

A memory frontier for complex synapses
Lahiri, Subhaneil and Ganguli, Surya
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper studies the problem of memory storage with discrete (digital) synapses. Previous work established that memory capacity can be increased by adding a cascade of (latent) states but the optimal state transition dynamics was unknown and the actual dynamics was usually hand-picked using some heuristic rules. In this paper the authors aim to derive the optimal transition dynamics for synaptic cascades. They first derive an upper bound on achievable memory capacity and show that simple models with linear chain structures can approach (achieve) this bound. 

The paper applies the theory of ergodic Markov chains in continuous time to the analysis of the memory properties of online learning in synapses with intrinsic states, extending earlier work of Abbott, Fusi and their co-workers.

papers.nips.cc
scholar.google.com

Optimal Neural Population Codes for High-dimensional Stimulus Variables
Wang, Zhuo and Stocker, Alan A. and Lee, Daniel D.
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

Finding the objective functions that regions of the nervous system are optimized for is a central question in neuroscience, since such an objective provides a computational principle behind neural representation in a given region. One common objective is to maximize the Shannon information the neural response encodes about the input (infomax). This is supported by some experimental evidence. Another is to minimize the decoding error when the neural population is decoded for a particular variable or variables. This is also supported by some experimental evidence. These two objectives are similar in some circumstances, giving similar predictions; in other cases they differ more.

Studies that derive optimal distributions of neural population tuning for minimizing decoding error (L2-min) have mostly considered 1-dimensional stimuli. In this paper the authors extend substantially on this by developing analytical methods for finding the optimal distributions of neural tuning for higher dimensional stimuli. Their methods apply under certain limited conditions, such as when the number of neurons equals the number of stimulus dimensions (diffeomorphic). The authors compare their results to the infomax solution (in most detail for the 2D case) and find fairly similar results in some respects, but with two key differences: the L2-min basis functions are more orthogonal than the infomax ones, and L2-min has discrete solutions rather than the continuum found for infomax. A consequence of these differences is that L2-min representations encode more correlated signals.

papers.nips.cc
scholar.google.com

Correlations strike back (again): the case of associative memory retrieval
Savin, Cristina and Dayan, Peter and Lengyel, Máté
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper investigates how correlation among synaptic weights, not correlation among neural activity, influences the retrieval performance of auto-associative memory. The authors study two well-known types of learning rules, additive (e.g., Hebbian learning) and palimpsest (e.g., cascade learning), and show that synaptic correlations are induced in most cases. They also investigate optimal retrieval dynamics and show that there exists a local version of the dynamics that can be implemented in neural networks (except for an XOR cascade model).

papers.nips.cc
scholar.google.com

Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
Titsias, Michalis K. and Lázaro-Gredilla, Miguel
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

In a GP regression model, the process outputs can be integrated over analytically, but this is not so for (a) inputs and (b) kernel hyperparameters. Titsias et al. (2010) showed a very clever way to do (a) with a particular variational technique (the goal was to do density estimation). In this paper, (b) is tackled, which requires some nontrivial extensions of Titsias et al. In particular, they show how to decouple the GP prior from the kernel hyperparameters. This is a simple trick, but very effective for what they want to do. They also treat the large number of kernel hyperparameters with an additional level of ARD and show how the ARD hyperparameters can be solved for analytically, which is nice.

papers.nips.cc
scholar.google.com

One-shot learning and big data with n=2
Dicker, Lee H. and Foster, Dean P.
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper studies a linear latent factor model, where one observes "examples" consisting of high-dimensional vectors $x_1, x_2, ... \in R^d$, and one wants to predict "labels" consisting of scalars $y_1, y_2, ... \in R$. Crucially, one is working in the "one-shot learning" regime, where the number of training examples n is small (say, $n=2$ or $n=10$), while the dimension d is large (say, $d \rightarrow \infty$). This paper considers a well-known method, principal component regression (PCR), and proves some somewhat surprising theoretical results: PCR is inconsistent, but a modified PCR estimator is weakly consistent; the modified estimator is obtained by "expanding" the PCR estimator, which is different from the usual "shrinkage" methods for high-dimensional data.

This paper aims to provide an analysis for principal component
regression in the setting where the feature vectors $x$ are noisy. The authors
let $x = v + e$ where $e$ is some corruption of the nominal feature
vector $v$; and $v = a u$ where $a \sim N(0,\eta^2 \gamma^2 d)$ while
the observations $y = \theta/(\gamma \sqrt{d}) \langle v,u \rangle + \xi$. This
formulation is slightly different than the standard one because our
design vectors are noisy, which can pose challenges in identifying the
linear relationship between $x$ and $y$. Thus, using the top principal
components of $x$ is a standard method used in order to help
regularize the estimation. The paper is relevant to the ML
community. The key message of using a bias-corrected estimate of $y$
is interesting, but not necessarily new. Handling bias in regularized
methods is a common problem (cf. Regularization and variable selection
via the Elastic Net, Zou and Hastie, 2005). The authors present
theoretical analysis to justify their results. I find the paper
interesting; however I am not sure if the number of new results and
level of insights warrants acceptance.

papers.nips.cc
scholar.google.com

Summary Statistics for Partitionings and Feature Allocations
Fidaner, Isik Baris and Cemgil, Ali Taylan
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors propose novel approaches for summarizing the posterior of partitions in infinite mixture models. Often in applications, the posterior of the partition is quite diffuse; thus, the default MAP estimate is unsatisfactory. The proposed approach is based on the cumulative block sizes, which count the number of clusters of size $\ge k$, for $k=1, …,n$. They also examine the projected cumulative block sizes, obtained when the partition is projected onto a subset of $\\{1,...,n\\}$. These quantities are summarized by the cumulative occurrence distribution, the per element information of a set, the entropy, the projected entropy, and the subset occurrence. Finally, they propose using an agglomerative clustering algorithm where the projection entropy is used to measure distances between sets. In illustrations, the posterior of the partition is summarized by the dendrogram produced from the entropy agglomerative algorithm, along with existing summaries such as the posterior histogram of the number of clusters and the pairwise occurrences.
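The two basic quantities described above are easy to compute for a single partition. The following sketch assumes a partition is given as a list of disjoint sets; the function names are hypothetical, and the entropy-based summaries built on top of these counts are not reproduced.

```python
def cumulative_block_sizes(partition):
    """For k = 1..n, count the blocks that contain at least k elements."""
    n = sum(len(block) for block in partition)
    return [sum(1 for block in partition if len(block) >= k) for k in range(1, n + 1)]

def project(partition, subset):
    """Restrict a partition to a subset of elements, dropping empty blocks."""
    restricted = [block & subset for block in partition]
    return [block for block in restricted if block]
```

For example, the partition `[{1, 2, 3}, {4, 5}, {6}]` has three blocks of size at least 1, two of size at least 2, and one of size at least 3; projecting onto `{1, 2, 4}` leaves the blocks `{1, 2}` and `{4}`.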

papers.nips.cc
scholar.google.com

Actor-Critic Algorithms for Risk-Sensitive MDPs
A., Prashanth L. and Ghavamzadeh, Mohammad
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper addresses the problem of finding a policy with a high expected return and a bounded variance. The paper considers both the discounted and the average reward cases. The authors formulate this problem as a constrained optimization problem, where the gradient of the Lagrangian dual function is estimated from samples. This gradient is composed of the gradient of the expected return and the gradient of the expected squared return. Both gradients need to be estimated in every state. The authors use a linear function approximation to generalize the gradient estimates to states that were not encountered in the samples. The authors use stochastic perturbation to evaluate the gradients in particular states by sampling two trajectories, one with policy parameters theta and another with policy parameters theta+beta, where beta is a perturbation random variable. The policy parameters are updated in an actor-critic scheme. The authors prove that the proposed optimization method converges to a local optimum. Numerical experiments on a traffic light control problem show that the proposed technique finds a policy with a slightly higher risk than the optimal solution, but with a significantly lower variance.

papers.nips.cc
scholar.google.com

What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach
Dai, Zhenwen and Exarchakis, Georgios and Lücke, Jörg
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper presents a generative model for natural image patches which takes into account occlusions and the translation invariance of features. The model consists of a set of masks and a set of features which can be translated throughout the patch. Given a set of translations for the masks and features, the patch is then generated by sampling (conditionally) independent Gaussian noise. An inference framework for the parameters is proposed and is demonstrated on synthetic data with convincing results. Additionally, experiments are run on natural image patches and the method learns a set of masks and features for natural images. When combined, the resulting receptive fields look mostly like Gabors, but some of them have a globular structure.

papers.nips.cc
scholar.google.com

Decision Jungles: Compact and Rich Models for Classification
Shotton, Jamie and Sharp, Toby and Kohli, Pushmeet and Nowozin, Sebastian and Winn, John M. and Criminisi, Antonio
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper revisits the idea of decision DAGs for classification. Unlike a decision tree, a decision DAG is able to merge nodes at each layer, preventing the model from growing exponentially with depth. This represents an alternative to decision trees that rely on pruning methods as a means of controlling model size and preventing overfitting. The paper casts learning with this model as an empirical risk minimization problem, where the idea is to learn both the DAG structure and the split parameters of each node. Two algorithms are presented to learn the structure and parameters in a greedy layer-wise manner using an information-gain based objective. Compared to several baseline approaches using ensembles of fixed-size decision trees, ensembles of decision DAGs seem to provide improved generalization performance for a given model size (as measured by the total number of nodes in the ensemble).

papers.nips.cc
scholar.google.com

Density estimation from unweighted k-nearest neighbor graphs: a roadmap
von Luxburg, Ulrike and Alamgir, Morteza
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

A method of estimating a density (up to constants) from an unweighted, directed k nearest neighbor graph is described. It is assumed (more or less) that the density is continuously differentiable, supported on a compact and connected subset of $R^d$ with non-empty interior and a smooth boundary, and is upper- and lower-bounded on its support.

papers.nips.cc
scholar.google.com

Variational Policy Search via Trajectory Optimization
Levine, Sergey and Koltun, Vladlen
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper introduces a new approach in which classical policy search is combined with, and improved by, trajectory optimization methods that serve as an exploration strategy. An optimization criterion with the goal of finding optimal policy parameters is decomposed with a variational approach. The variational distribution is approximated as a Gaussian distribution, which allows a solution with the iterative LQR algorithm. The overall algorithm uses expectation maximization to iterate between minimizing the KL divergence of the variational decomposition and maximizing the lower bound with respect to the policy parameters.

papers.nips.cc
scholar.google.com

A simple example of Dirichlet process mixture inconsistency for the number of components
Miller, Jeffrey W. and Harrison, Matthew T.
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper addresses one simple but potentially very important point: That Dirichlet process mixture models can be inconsistent in the number of mixture components that they infer. This is important because DPs are nowadays widely used in various types of statistical modeling, for example when building clustering type algorithms. This can have real-world implications, for example when clustering breast cancer data with the aim of identifying distinct disease subtypes. Such subtypes are used in clinical practice to inform treatment, so identifying the correct number of clusters (and hence subtypes) has a very important real-world impact. 

The paper focuses on proofs concerning two specific cases where the DP turns out to be inconsistent. Both consider the case of the "standard normal DPM", where the likelihood is a univariate normal distribution with unit variance, the mean of which is subject to a normal prior with unit variance. The first proof shows that, if the data are drawn i.i.d. from a zero-mean, unit-variance normal (hence matching the assumed DPM model), $P(T=1 | \text{data})$ does not converge to 1. The second proof takes this further, demonstrating that in fact $P(T=1 | \text{data}) \to 0$.

papers.nips.cc
scholar.google.com

Training and Analysing Deep Recurrent Neural Networks
Hermans, Michiel and Schrauwen, Benjamin
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors propose a new deep architecture, which combines the hierarchy of deep learning with time-series modeling known from HMMs or recurrent neural networks. The proposed training algorithm builds the network layer by layer using supervised (pre-)training with a next-letter prediction objective. The experiments demonstrate that after training very large networks for about 10 days, the network performance on a Wikipedia dataset published by Hinton et al. improves over previous work. The authors then proceed to analyze and discuss details of how the network approaches its task. For example, long-term dependencies are modeled in higher layers, and correspondences between opening and closing parentheses are modeled as a “pseudo-stable attractor-like state”.

papers.nips.cc
scholar.google.com

Variance Reduction for Stochastic Gradient Optimization
Wang, Chong and Chen, Xi and Smola, Alexander J. and Xing, Eric P.
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors propose to accelerate the stochastic gradient optimization algorithm by reducing the variance of the noisy gradient estimate using the 'control variate' trick (a standard variance reduction technique for Monte Carlo simulations, explained in [3] for example). The control variate is a vector which hopefully has high correlation with the noisy gradient but for which the expectation is easier to compute. Standard convergence rates for stochastic gradient optimization depend on the variance of the gradient estimates, and thus a variance reduction technique should yield an acceleration of convergence. The authors give examples of control variates by using Taylor approximations of the gradient estimate for the optimization problem arising in regularized logistic regression as well as for MAP estimation for the latent Dirichlet allocation (LDA) model. They compare constant step-size SGD with and without variance reduction for logistic regression on the covtype dataset, claiming that the variance reduction allows the use of bigger step sizes without the problem of high variance and thus yields faster empirical convergence. For LDA, they compare the adaptive step-size version of the stochastic optimization method of [10] with and without variance reduction, showing faster convergence of the held-out test log-likelihood on three large corpora.
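The control variate trick itself can be shown on a toy Monte Carlo problem. This is a scalar illustration under stated assumptions rather than the paper's gradient setting: the noisy quantity is $g = e^U$ with $U \sim \text{Uniform}(0,1)$, and the control variate is its first-order Taylor expansion $h = 1 + U$, whose mean $1.5$ is known exactly. The function name `control_variate_demo` is hypothetical, and the coefficient is estimated from the same samples for simplicity.

```python
import math
import random

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)

def control_variate_demo(n, rng):
    """Return (plain samples, variance-reduced samples) for estimating E[exp(U)]."""
    u = [rng.random() for _ in range(n)]
    g = [math.exp(v) for v in u]      # noisy quantity, true mean e - 1
    h = [1.0 + v for v in u]          # Taylor-based control variate
    h_mean = 1.5                      # known expectation of h
    mg, mh = mean(g), mean(h)
    cov = sum((a - mg) * (b - mh) for a, b in zip(g, h)) / (n - 1)
    coef = cov / variance(h)          # variance-minimizing scaling
    corrected = [a - coef * (b - h_mean) for a, b in zip(g, h)]
    return g, corrected
```

Both sample lists have the same expectation, but the corrected one has much lower variance because $h$ is highly correlated with $g$; this is the effect that lets SGD tolerate larger step sizes.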

papers.nips.cc
scholar.google.com

Sparse Additive Text Models with Low Rank Background
Shi, Lei
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper presents a model inspired by the SAGE (Sparse Additive GEnerative) model of Eisenstein et al. The authors use a different approach for modeling the "background" component of the model. SAGE uses the same background model for all; the authors allow different backgrounds for different topics/classification labels/etc., but try to keep the background matrix low rank. To make inference faster when using this low rank constraint, they use a bound on the likelihood function that avoids the log-sum-exp calculations from SAGE. Experimental results are positive for a few different tasks.

Sparse additive models represent sets of distributions over large vocabularies as log-linear combinations of a dense, shared background vector and a sparse, distribution-specific vector. The paper presents a modification that allows distributions to have distinct background vectors, but requires that the matrix of background vectors be low-rank. This method leads to better predictive performance in a labeled classification task and in a mixed-membership LDA-like setting.

Previous work on SAGE introduced a new model for text. It built a lexical distribution by adding deviation components to a fixed background. The model presented in this paper, SAM-LRB, builds on SAGE and claims to improve it through two additions: first, a unique background for each class/topic; second, an approximation of the log-likelihood that gives a faster learning and inference algorithm than SAGE.

papers.nips.cc
scholar.google.com

Deep Fisher Networks for Large-Scale Image Classification
Simonyan, Karen and Vedaldi, Andrea and Zisserman, Andrew
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper proposes a new image representation for recognition based on a stacking of two layers of Fisher vector encoders, with the first layer capturing semi-local information and the second performing sum-pooling aggregation over the entire picture. The approach is inspired by the recent success of deep convolutional networks (CNN). The key difference is that the architecture proposed in this paper is predominantly hand-designed, with relatively few parameters learned compared to CNNs. This is both the strength and the weakness of the approach, as it leads to much faster training but also slightly lower accuracy compared to fully learned deep networks.

This paper uses Fisher Vectors as inner building blocks in a recognition architecture. The basic Fisher vector module had previously demonstrated superior performance in recognition application. Here, it is augmented with discriminative linear projection for dimensionality reduction, and multiscale local pooling, to make it suitable for stacking. Inputs of all layers are jointly used for classification.

papers.nips.cc
scholar.google.com

Causal Inference on Time Series using Restricted Structural Equation Models
Peters, Jonas and Janzing, Dominik and Schölkopf, Bernhard
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper considers a class of structural equation models for time series data. The models allow nonlinear instantaneous effects and lagged effects. In contrast, Granger-causality based methods do not allow instantaneous effects, and the linear non-Gaussian method TS-LiNGAM (Hyvarinen et al., ICML2008, JMLR2010) assumes linear effects.

This paper introduces a model and procedure for learning instantaneous and lagged causal relationships among variables in a time series when each causal relationship is either identifiable in the sense of the additive noise model (Hoyer et al. 2009) or exhibits a time structure. The learning procedure finds a causal order by iteratively fitting VAR or GAM models where each variable is a function of all other variables and making the variable with the least dependence the lowest variable in the order. Excess parents are then pruned to produce the summary causal graph (where x->y indicates either an instantaneous or lagged cause up to the order of the VAR or GAM model that is fit). Experiments show that the method outperforms competing methods and returns no results (rather than wrong results) in cases where the model cannot be identified.

papers.nips.cc
scholar.google.com

More data speeds up training time in learning halfspaces over sparse vectors
Daniely, Amit and Linial, Nati and Shalev-Shwartz, Shai
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper provides one of the most natural examples of a learning problem for which the problem becomes computationally tractable when given a sufficient amount of data, but is computationally intractable (though still information theoretically tractable) when given a smaller quantity of data. This computational intractability is based on a complexity-theoretic assumption about the hardness of distinguishing satisfiable 3SAT formulas from random ones at a given clause density (more specifically, the 3MAJ variant of the conjecture). 

The specific problem considered by the authors is learning halfspaces over 3-sparse vectors. The authors complement their negative results with nearly matching positive results (if one believes a significantly stronger complexity-theoretic conjecture, namely that hardness persists even for random formulae whose density is $n^\mu$ over the satisfiability threshold). Sadly, the algorithmic results are described in the Appendix, and are not discussed. It seems like they are essentially modifications of Hazan et al. (2012), though it would be greatly appreciated if the authors included a high-level discussion of the algorithm. Even if no formal proofs of correctness will fit in the body, a description of the algorithm would be helpful.

papers.nips.cc
scholar.google.com

Transportability from Multiple Environments with Limited Experiments: Completeness Results
Bareinboim, Elias and Pearl, Judea
Neural Information Processing Systems Conference - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

Previously it has been shown that do-calculus is a sound inferential machinery for estimating a causal effect from a causal diagram and a set of observations and interventions. This paper further proves that it is not only sound, but also complete, meaning that every valid equality between probabilities defined on a semi-Markovian graph can be obtained through finite applications of the three rules of do-calculus. Moreover, the paper studies mz-transportability, which unifies those previously studied special cases of meta-identifiability. The authors propose a complete algorithm that determines whether a causal effect is mz-transportable and, if it is, outputs a transport formula for estimating the causal effect.

papers.nips.cc
scholar.google.com

Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
Fiori, Marcelo and Sprechmann, Pablo and Vogelstein, Joshua T. and Musé, Pablo and Sapiro, Guillermo
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper examines the problem of approximate graph matching (isomorphism). Given graphs G, H with p nodes, represented by respective adjacency matrices A, B, find a permutation matrix P that best "matches" AP and PB.
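The matching criterion can be illustrated with a small sketch. It assumes the mismatch between AP and PB is scored with the Frobenius norm, which is one common choice and not necessarily the exact objective of the paper; the function name `matching_cost` is hypothetical.

```python
import numpy as np


def matching_cost(A, B, P):
    """Frobenius-norm mismatch ||AP - PB||_F for a candidate permutation P (assumed score)."""
    return np.linalg.norm(A @ P - P @ B, ord="fro")


# Adjacency matrix of the path graph 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

# Build an isomorphic graph B by relabelling the nodes with a known permutation.
P_true = np.eye(3)[[2, 0, 1]]   # rows of the identity, reordered
B = P_true.T @ A @ P_true       # chosen so that A @ P_true == P_true @ B

cost_true = matching_cost(A, B, P_true)         # exact match
cost_identity = matching_cost(A, B, np.eye(3))  # wrong correspondence
```

The true permutation gives zero cost, while a wrong correspondence (here the identity) leaves a positive residual. Searching over all permutations is combinatorial, which is why relaxations of the permutation constraint are of interest.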

This paper poses the multimodal graph matching problem as a convex optimization problem, and solves it using augmented Lagrangian techniques (viz., ADMM). This is an important problem with applications in several fields. Experimental results on synthetic and multiple real-world datasets demonstrate the effectiveness of the proposed approach.

papers.nips.cc
scholar.google.com

Modeling Clutter Perception using Parametric Proto-object Partitioning
Yu, Chen-Ping and Hua, Wen-Yu and Samaras, Dimitris and Zelinsky, Gregory J.
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper proposes an image-based model for visual clutter perception ("a crowded, disorderly state"). For a given image, the model begins by applying an existing superpixel clustering, then computing the intensity, colour and orientation histograms of pixels within each superpixel. Boundaries between adjacent superpixels are then retained or merged to create "proto-objects". The novel merging algorithm acts on the Earth Mover's Distance (EMD), a measure of the similarity between two histograms. The distribution of histogram distances in each image for each image feature is modeled as a mixture of two Weibull distributions. The crossover point between the two distributions (or a fixed cumulative percentile if a single distribution is preferred by model selection) is used as the threshold point for merging: an edge is labelled "similar", and the superpixels merged, if the pair of superpixels exceed the threshold point for all three features. The clutter value for each image is the ratio of the final number of proto-objects to the initial number of superpixels (i.e. 0 = no proto-objects, not cluttered; 1 = all superpixels are proto-objects).
The model is validated by comparing to human clutter rankings of a subset of an existing image database. Human observers rank images from least to most cluttered, then the median ranking for each image is used as the ground truth for clutter perception. The new model correlates more highly with human rankings of clutter than a number of previous clutter perception and image segmentation models (including human object segmentation from a previous study).

papers.nips.cc
scholar.google.com

PAC-Bayes-Empirical-Bernstein Inequality
Tolstikhin, Ilya O. and Seldin, Yevgeny
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper derives a new empirical PAC-Bayesian bound by combining an existing (non-empirical) PAC-Bayesian Bernstein bound (i.e., involving the true variance of the loss values) with a PAC-Bayesian analysis of the concentration of the empirical variance around its true value. This new bound has the advantage of being tighter when the empirical variance is small compared to the empirical loss. Experiments on real and empirical data with simple models compare the new bound with the usual empirical PAC-Bayesian bound, confirming the advantage.

papers.nips.cc
scholar.google.com

Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs
MacDermed, Liam and Isbell, Charles L.
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper proposes a new method for Dec-POMDP planning that is built out of several components. The first is a new way of solving cooperative Bayesian games (CBGs) using an integer linear program. The second is the transformation of the Dec-POMDP to a belief POMDP in which a "centralized mediator" must select at each timestep the best action for each agent-belief pair. The third is to automate the discovery of optimal belief compression by dividing each timestep into two parts, the first corresponding to the original Dec-POMDP and the second giving each agent a chance to select how its beliefs in that timestep are mapped to a bounded set and thus compressed. The fourth assembles these components together into a point-based value iteration method that solves the resulting belief POMDP using a variant of PERSEUS in which the CBG solver is used to compute maximizations.

Three contributions are made: 

* An approach to convert DEC-POMDPs to bounded belief DEC-POMDPs 
* An approach to convert bounded belief DEC-POMDPs to POMDPs with exponentially many actions 
* An integer linear program to optimize one-step look-ahead policies in POMDPs with exponentially many actions

papers.nips.cc
scholar.google.com

On Decomposing the Proximal Map
Yu, Yaoliang
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper deals with an interesting theoretical question concerning the proximity operator. It investigates when the proximity of the sum of two convex functions decomposes into the composition of the corresponding proximity operators. The problem is interesting since in the applications there is a growing interest in building complex regularizers by adding several simple terms. 
The author pursues a quite complete study. After proving a simple sufficient condition (Theorem 1), the paper gives its main result (Theorem 4): a complete characterization of the property (for a function) of being radial versus the property of being "well-coupled" with positively homogeneous functions (where well-coupled means that the prox of the sum of the couple decomposes into the composition of the two individual prox maps). The paper also considers the case of polyhedral gauge functions, deriving a sufficient condition which is expressed by means of a cone invariance property. Examples are provided which show several prox-decomposition results, recovering known facts (in a simpler way) but also proving new ones.
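As a minimal numerical illustration of what prox decomposition means, the sketch below uses a standard textbook pair in one dimension (chosen here for illustration, not taken from the paper): the radial function $(\mu/2)x^2$ and the positively homogeneous function $\lambda|x|$. The prox of their sum is compared against the composition of the two individual prox maps using a brute-force grid search; all function names are hypothetical.

```python
import numpy as np


def soft_threshold(v, lam):
    """Prox of lam * |x| (a positively homogeneous function)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)


def prox_squared(v, mu):
    """Prox of (mu / 2) * x**2 (a radial function)."""
    return v / (1.0 + mu)


def prox_sum_brute_force(v, lam, mu, grid):
    """Minimise 0.5*(x - v)**2 + lam*|x| + 0.5*mu*x**2 over a fine grid."""
    objective = 0.5 * (grid - v) ** 2 + lam * np.abs(grid) + 0.5 * mu * grid ** 2
    return grid[np.argmin(objective)]


lam, mu = 0.5, 2.0
grid = np.linspace(-5.0, 5.0, 200001)
points = (-3.0, -0.2, 1.7, 4.0)

# Composition: first the prox of lam*|x|, then the prox of the squared term.
composed = [prox_squared(soft_threshold(v, lam), mu) for v in points]
reference = [prox_sum_brute_force(v, lam, mu, grid) for v in points]

# The reverse order of composition generally gives a different answer.
reversed_at_1_7 = soft_threshold(prox_squared(1.7, mu), lam)
```

For this pair the composition in the order shown agrees with the prox of the sum up to grid resolution, whereas the reversed order does not, which shows that the order of composition matters.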

The value of the paper is mainly on the theoretical side. It sheds light on the mechanism of composing proximity operators and unifies several particular results that were scattered in the literature. The article is well written and technically sound. The only fault I see is that it is sometimes not completely rigorous, as I explain in the following.

papers.nips.cc
scholar.google.com

Polar Operators for Structured Sparse Estimation
Zhang, Xinhua and Yu, Yaoliang and Schuurmans, Dale
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors build their work on top of the generalized conditional gradient (GCG) method for sparse optimization. In particular, GCG methods require computation of the polar operator for the sparse regularization function (an example is the dual norm if the regularization function is an atomic norm). In this work, the authors identify a class of regularization functions, which are based on an underlying subset cost function. The key idea is to 'lift' the regularizer into a higher-dimensional space together with some constraints in that space, where it has the property of 'marginalized modularity', allowing it to be reformulated as a linear program. Finally, the approach is generalized to general proximal objectives. The results demonstrate that the method is able to achieve better objective values in much less CPU time when compared with another polar operator method and accelerated proximal gradient (APG) on group Lasso and path coding problems.

papers.nips.cc
scholar.google.com

Generalized Random Utility Models with Multiple Types
Soufiani, Hossein Azari and Diao, Hansheng and Lai, Zhenyu and Parkes, David C.
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper addresses the problem of demand estimation with multiple heterogeneous agents, specifically, how to classify agents and estimate the preferences of each agent type using the agents’ ranking data over different alternatives. The problem is important since it has great practical value in studying the underlying preference distributions of multiple agents. To tackle the problem, the authors introduce generalized random utility models (GRUM), provide RJMCMC algorithms for parameter estimation in GRUM, and theoretically establish conditions for identifiability of the model. Experimental results on both synthetic and real datasets show the model’s effectiveness.

papers.nips.cc
scholar.google.com

Provable Subspace Clustering: When LRR meets SSC
Wang, Yu-Xiang and Xu, Huan and Leng, Chenlei
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper proposes a new subspace clustering algorithm called Low Rank Sparse Subspace Clustering (LRSSC) and aims to study the conditions under which it is guaranteed to produce a correct clustering. The correctness is defined in terms of two properties. The self-expressiveness property (SEP) captures whether a data point is expressed as a linear combination of other points in the same subspace. The graph connectivity property (GCP) captures whether the points in one subspace form a connected component of the graph formed by all the data points. The LRSSC algorithm builds on two existing subspace clustering algorithms, SSC and LRR, which have complementary properties. The solution of LRR is guaranteed to satisfy the SEP under the strong assumption of independent subspaces and the GCP under weak assumptions (shown in this paper). On the other hand, the solution of SSC is guaranteed to satisfy the SEP under milder conditions, even with noisy data or data corrupted with outliers, but the solution of SSC need not satisfy the GCP. This paper combines the objective functions of both methods with the hope of obtaining a method that satisfies both SEP and GCP for some range of values of the relative weight between the two objective functions. Theorem 1 derives conditions under which LRSSC satisfies SEP in the deterministic case. These conditions are natural generalizations of existing conditions for SSC, but they are actually weaker than the existing conditions. Theorem 2 derives conditions under which LRSSC satisfies SEP in the random case (data drawn at random from randomly drawn subspaces). Overall, it is shown that when the weight of the SSC term is large enough and the ratio of the data dimension to the subspace dimension grows with the log of the number of points, then LRSSC is guaranteed to satisfy SEP with high probability. I say high, because it does not tend to 1. Finally, Proposition 1 and Lemma 4 show that LRR satisfies GCP (presumably almost surely).
Experiments support that, for a range of the SSC weight, LRSSC works. Additional experiments on model selection show the usefulness of the analysis.

papers.nips.cc
scholar.google.com

Bayesian optimization explains human active search
Borji, Ali and Itti, Laurent
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors explore different optimization strategies for 1-D continuous functions and their relationship to how people optimize the functions. They used a wide variety of continuous functions (with one exception): polynomial, exponential, trigonometric, and the Dirac function. They also explore how people interpolate and extrapolate noisy samples from a latent function (which has a long tradition in psychology under the name of function learning) and how people select an additional sample to observe under the task of interpolating or extrapolating. Overall, they found that Gaussian processes do a better job of describing human performance than any of the approximately 20 other optimization methods tested.

papers.nips.cc
scholar.google.com

Transfer Learning in a Transductive Setting
Rohrbach, Marcus and Ebert, Sandra and Schiele, Bernt
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper describes how to attack the zero-, one-, or few-shot recognition problem, where we have a fair amount of training data for some classes, but none or very little for some other classes. It does this using three different techniques, all combined in a single framework: using semantically meaningful mid-layer knowledge (attributes), building a graph on new classes to exploit the manifold structure, and using an attribute-based representation for building the graph structure (rather than low-level features), which improves performance. The method is evaluated on 3 different datasets (Animals with Attributes, ImageNet, and MPII Cooking composites), and shows slightly improved performance on all of them compared to the state of the art.

papers.nips.cc
scholar.google.com

Data-driven Distributionally Robust Polynomial Optimization
Mevissen, Martin and Ragnoli, Emanuele and Yu, Jia Yuan
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors consider robust optimization for polynomial optimization problems where the uncertainty set is a set of possible distributions of the parameter. Specifically, this set is a ball around a density function estimated from data samples. The authors show that this distributionally robust optimization formulation can be reduced to a polynomial optimization problem; hence, computationally, the robust counterpart is of the same hardness as the nominal (non-robust) problem, and can be solved using a tower of SDPs known in the literature. The authors also provide finite-sample guarantees for estimating the uncertainty set from data. Finally, they apply their methods to a water network problem.

papers.nips.cc
scholar.google.com

Latent Maximum Margin Clustering
Zhou, Guang-Tong and Lan, Tian and Vahdat, Arash and Mori, Greg
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This work proposes an extension to the maximum margin clustering (MMC) method that introduces latent variables. The motivation for adding latent variables is that they can model additional data semantics, resulting in better final clusters. The authors introduce a latent MMC (LMMC) objective, state how to optimize it, and then apply it to the task of video clustering. For this task, the latent variables are tag words, and the affinity of a video for a tag is given by a pre-trained binary tag detector. Experiments show that LMMC consistently, and sometimes substantially, beats several reasonable baselines.

papers.nips.cc
scholar.google.com

Reciprocally Coupled Local Estimators Implement Bayesian Information Integration Distributively
Zhang, Wenhao and Wu, Si
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors show by approximate analysis of two identical continuous attractor networks (Zhang 1996), reciprocally coupled by Gaussian weights, that such a network can approximately implement the Bayesian posterior solution for cue integration.

papers.nips.cc
scholar.google.com

Documents as multiple overlapping windows into grids of counts
Perina, Alessandro and Jojic, Nebojsa and Bicego, Manuele and Truski, Andrzej
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper describes a creative alternative for topic modeling: mixed-membership on a "counting grid." The advantage of this approach seems to be that you can move smoothly across the grid, achieving a high effective number of topics while the spatial smoothing prevents overfitting. The disadvantage seems to be that there are more parameters (grid dimension and size, and window size). A variational inference procedure that is somewhat similar to LDA is possible, although no speed/complexity comparisons are provided. The spatial nature of the approach has potential advantages for visualization as well.

papers.nips.cc
scholar.google.com

The Randomized Dependence Coefficient
López-Paz, David and Hennig, Philipp and Schölkopf, Bernhard
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors propose a non-linear measure of dependence between two random variables. This turns out to be the canonical correlation between random, nonlinear projections of the variables after a copula transformation which renders the marginals of the random variables invariant to linear transformations.

The paper introduces a new method called RDC to measure the statistical dependence between random variables. It combines a copula transform with a variant of kernel CCA using random projections, resulting in $O(n \log n)$ complexity. Experiments on synthetic and real benchmark data show promising results for feature selection.

The RDC is a non-linear dependency estimator that satisfies Renyi's criteria and exploits the very recent FastFood speedup trick (ICML13) \cite{journals/corr/LeSS14}. This is a straightforward recipe: 1) copularize the data, effectively preserving the dependency structure while ignoring the marginals, 2) sample k non-linear features of each datum (inspired by Bochner's theorem) and 3) solve the regular CCA eigenvalue problem on the resulting paired datasets. Ultimately, RDC feels like a copularised variation of kCCA (misleading as this may sound). Its efficiency is illustrated successfully on a set of classical non-linear bivariate dependency scenarios and 12 real datasets via a forward feature selection procedure.
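The three-step recipe can be sketched roughly as follows for two scalar variables. This is an illustrative reading of the summary, not the authors' implementation: the rank-based copula, the sine features with a random bias column, the ridge term `reg`, and the defaults `k=10` and `scale=3.0` are all assumptions made for this sketch, and every function name is hypothetical.

```python
import numpy as np


def empirical_copula(x):
    """Step 1: replace each sample by its normalised rank in (0, 1]."""
    ranks = np.argsort(np.argsort(x)) + 1
    return ranks / len(x)


def random_sine_features(u, k, scale, rng):
    """Step 2: k random non-linear (sine) projections of each datum."""
    augmented = np.column_stack([u, np.ones_like(u)])  # second column acts as a bias
    weights = rng.normal(scale=scale, size=(2, k))
    return np.sin(augmented @ weights)


def largest_canonical_correlation(X, Y, reg=1e-6):
    """Step 3: top canonical correlation from a ridge-regularised CCA eigenproblem."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Eigenvalues of Cxx^{-1} Cxy Cyy^{-1} Cyx are squared canonical correlations.
    M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
    top = np.max(np.linalg.eigvals(M).real)
    return float(np.sqrt(np.clip(top, 0.0, 1.0)))


def rdc_sketch(x, y, k=10, scale=3.0, seed=0):
    """Copula transform, random features, then CCA (illustrative only)."""
    rng = np.random.default_rng(seed)
    fx = random_sine_features(empirical_copula(x), k, scale, rng)
    fy = random_sine_features(empirical_copula(y), k, scale, rng)
    return largest_canonical_correlation(fx, fy)


rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=1000)
dependent = rdc_sketch(x, x ** 2)  # non-linear dependence
independent = rdc_sketch(x, rng.uniform(-1.0, 1.0, size=1000))
```

The quadratic pair has essentially no linear correlation, yet it should score higher than the independent pair because the random features can capture the non-linear relationship.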

papers.nips.cc
scholar.google.com

Bayesian Active Model Selection with an Application to Automated Audiometry
Gardner, Jacob R. and Malkomes, Gustavo and Garnett, Roman and Weinberger, Kilian Q. and Barbour, Dennis L. and Cunningham, John P.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors introduce a new method for actively selecting the model that best fits a dataset. In contrast to active learning, where the next learning point is chosen to get a better estimate of the model hyperparameters, this method selects the next point to better distinguish between a set of models. Similar active model selection techniques exist, but they need to retrain each model for each new data point to evaluate. The strength of the authors' method is that it only requires evaluating the predictive distributions of the models, without retraining.

They propose to apply this method to detect noise-induced hearing loss. The traditional way of screening for NIHL involves testing a wide range of intensities and frequencies, which is time consuming. The authors show that with their method, the number of tests to be run could be drastically decreased, reducing the cost of large-scale screenings for NIHL.

papers.nips.cc
scholar.google.com

Training Very Deep Networks
Srivastava, Rupesh Kumar and Greff, Klaus and Schmidhuber, Jürgen
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

Machine learning researchers frequently find that they get better results by adding more and more layers to their neural networks, but the difficulties of initialization and decaying/exploding gradients have been severely limiting. Indeed, the difficulties of getting information to flow through deep neural networks arguably kept them out of widespread use for 30 years. This paper addresses this problem head on and demonstrates one method for training 100 layer nets.

The paper describes an effective method to train very deep neural networks by means of 'information highways', or building direct connections to upper network layers. Although it is a generalization of prior techniques, such as cross-layer connections, the authors have shown this method to be effective by experimentation. The contributions are quite novel and well supported by experimental evidence.

papers.nips.cc
scholar.google.com

Particle Gibbs for Infinite Hidden Markov Models
Tripuraneni, Nilesh and Gu, Shixiang and Ge, Hong and Ghahramani, Zoubin
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper proposes a sampler for iHMMs, which the authors show has improved mixing properties and performs better in posterior inference problems when compared to the existing state-of-the-art sampling methods. An existing Gibbs sampler is turned into a particle Gibbs sampler by using a conditional SMC step to sample the latent sequence of states. The paper uses conjugacy to derive optimal SMC proposals and ancestor sampling to improve the performance of the conditional SMC step. The result is more efficient sampling of the latent states, making the sampler robust to spurious states and yielding faster convergence.

papers.nips.cc
scholar.google.com

A Bayesian Framework for Modeling Confidence in Perceptual Decision Making
Khalvati, Koosha and Rao, Rajesh P.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors model confidence data from two experiments (conducted by others and previously published in the scientific literature) using a POMDP. In both experiments, subjects saw a random-dot kinematogram on each trial and made a binary choice about the dominant motion direction. The first experiment used monkeys as subjects and stimuli had a fixed duration. The second experiment used people as subjects and stimuli continued until a subject made a response. The paper reports that the POMDP model does a good job of fitting the experimental data, both the accuracy data and the confidence data.

papers.nips.cc
scholar.google.com

Path-SGD: Path-Normalized Optimization in Deep Neural Networks
Neyshabur, Behnam and Salakhutdinov, Ruslan and Srebro, Nathan
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

Deep rectified neural networks are over-parameterized in the sense that scaling of the weights in one layer can be compensated for exactly in the subsequent layer. This paper introduces Path-SGD, a simple modification to the SGD update rule, whose update is invariant to such rescaling. The method is derived from the proximal form of gradient descent, whereby a constraint term is added which preserves the norm of the "product weight" formed along each path in the network (from input to output node). Path-SGD is thus principled and shown to yield faster convergence for a standard 2-layer rectifier network, across a variety of datasets (MNIST, CIFAR-10, CIFAR-100, SVHN). As the method implicitly regularizes the neural weights, this also translates to better generalization performance on half of the datasets.
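The rescaling symmetry described above is easy to check numerically. The sketch below uses a bias-free two-layer ReLU network with arbitrary sizes, and a hypothetical `path_norm` helper that takes the $\ell_2$ norm of the per-path weight products (one possible reading of the "product weight" norm, assumed here for illustration). It does not implement the Path-SGD update itself.

```python
import numpy as np


def relu_net(x, W1, W2):
    """Two-layer rectifier network without biases."""
    return np.maximum(x @ W1, 0.0) @ W2


def path_norm(W1, W2):
    """l2 norm over all input-to-output paths of the product of weights on the path."""
    return np.sqrt(np.sum((W1 ** 2) @ (W2 ** 2)))


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 2))

c = 7.0  # any positive scale works, since relu(c * z) == c * relu(z) for c > 0
out_original = relu_net(x, W1, W2)
out_rescaled = relu_net(x, c * W1, W2 / c)

norm_original = path_norm(W1, W2)
norm_rescaled = path_norm(c * W1, W2 / c)
```

Both the network output and the path-wise norm are unchanged by the rescaling, whereas the ordinary weight norm of each layer changes by a factor of `c`, so a plain gradient step is not invariant to it.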

At its core, Path-SGD belongs to the family of learning algorithms which aim to be invariant to model reparametrizations. This is the central tenet of Amari's natural gradient (NG) \cite{amari_natural_1998}, whose importance has resurfaced in the area of deep learning. Path-SGD can thus be cast as an approximation to NG, which focuses on a particular type of rescaling between neighboring layers. The paper would greatly benefit from such a discussion in my opinion. I also believe NG to be a much more direct way to motivate Path-SGD than the heuristics of max-norm regularization.

papers.nips.cc
scholar.google.com

DeViSE: A Deep Visual-Semantic Embedding Model
Frome, Andrea and Corrado, Gregory S. and Shlens, Jonathon and Bengio, Samy and Dean, Jeffrey and Ranzato, Marc'Aurelio and Mikolov, Tomas
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This computer vision paper uses an unsupervised, neural-net-based semantic embedding of a Wikipedia text corpus trained using skip-gram coding to enhance the performance of the Krizhevsky et al. deep network \cite{krizhevsky2012imagenet} that won the 2012 ImageNet large scale visual recognition challenge, particularly for zero-shot learning problems (i.e. previously unseen classes with some similarity to previously seen ones). The two networks are trained separately, then the output layer of \cite{krizhevsky2012imagenet} is replaced with a linear mapping to the semantic text representation and re-trained on ImageNet 1k using a dot product loss reminiscent of a structured output SVM one. The text representation is not currently re-trained. The model is tested on ImageNet 1k and 21k. With the semantic embedding output it does not quite manage to reproduce the ImageNet 1k flat-class hit rates of the original softmax-output model, but it does better than the original on hierarchical-class hit rates and on previously unseen classes from ImageNet 21k. For unseen classes, the improvements are modest in absolute terms (albeit somewhat larger in relative ones).

It consists of the following steps:
1. Learn an embedding of a large number of words in a Euclidean space.
2. Learn a deep architecture which takes images as input and predicts one of 1,000 object categories. The 1,000 categories are a subset of the 'large number of words' of step (1).
3. Remove the last layer of the visual model, leaving what is referred to as the 'core' visual model. Replace it by the word embeddings and add a layer to map the core visual model output to the word embeddings.

papers.nips.cc
scholar.google.com

Generalized Denoising Auto-Encoders as Generative Models
Bengio, Yoshua and Yao, Li and Alain, Guillaume and Vincent, Pascal
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper continues a recent line of theoretical work that seeks to explain what autoencoders learn about the data-generating distribution. Of practical importance from this work have been ways to sample from autoencoders. Specifically, this paper picks up where \cite{journals/jmlr/AlainB14} left off. That paper was able to show that autoencoders (under a number of conditions) estimate the score (derivative of the log-density) of the data-generating distribution in a way that was proportional to the difference between reconstruction and input. However, it was these conditions that limited this work: it only considered Gaussian corruption, it only applied to continuous inputs, it was proven for only squared error, and was valid only in the limit of small corruption. The current paper connects the autoencoder training procedure to the implicit estimation of the data-generating distribution for arbitrary corruption, arbitrary reconstruction loss, and can handle both discrete and continuous variables for non-infinitesimal corruption noise. Moreover, the paper presents a new training algorithm called "walkback" which estimates the same distribution as the "vanilla" denoising algorithm, but, as experimental evidence suggests, may do so in a more efficient way.

papers.nips.cc
scholar.google.com

Deep Convolutional Neural Network for Image Deconvolution
Xu, Li and Ren, Jimmy S. J. and Liu, Ce and Jia, Jiaya
Neural Information Processing Systems Conference - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper presents a method for nonblind deconvolution of blurry images that can also fix artifacts (e.g. compression, clipping) in the input, and is robust to deviations from the input generation model. A convolutional network is used both to deblur and to fix artifacts; deblurring is performed using a sequence of horizontal and vertical conv kernels, which take advantage of a high degree of separability in the pseudoinverse blur kernel and are initialized with a decomposition of the pseudoinverse. A standard compact-kernel convnet is stacked on top, allowing further fixing of artifacts and noise, and the whole network is trained end-to-end with pairs of blurry and ground truth images.

papers.nips.cc
scholar.google.com

Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation
Denton, Emily L. and Zaremba, Wojciech and Bruna, Joan and LeCun, Yann and Fergus, Rob
Neural Information Processing Systems Conference - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper addresses the problem of speeding up the evaluation of pre-trained image classification ConvNets. To this end, a number of techniques are proposed, which are based on the tensor representation of the conv. layer weight matrix. Namely, the following techniques are considered (Sect. 3.2-3.5):

1. SVD decomposition of the tensor
2. outer product decomposition of the tensor
3. monochromatic approximation of the first conv. layer - projecting RGB colors to a 1-D space, followed by clustering
4. biclustering tensor approximation - clustering input and output features to split the tensor into a number of sub-tensors, each of which is then separately approximated
5. fine-tuning of approximate models to (partially) recover the lost accuracy
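The first technique in the list can be illustrated on a plain matrix. The sketch below is a generic truncated-SVD approximation of a random weight matrix, not the tensor decomposition used in the paper; the matrix sizes, the rank `k`, and the helper name `low_rank_factors` are assumptions chosen for illustration.

```python
import numpy as np


def low_rank_factors(W, k):
    """Return factors (L, R) of the best rank-k approximation of W in Frobenius norm."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k] * S[:k], Vt[:k]


rng = np.random.default_rng(0)
W = rng.normal(size=(64, 48))  # stand-in for a flattened layer weight matrix
k = 8

L, R = low_rank_factors(W, k)
x = rng.normal(size=48)

exact_out = W @ x
approx_out = L @ (R @ x)  # two thin products instead of one dense product

full_cost = W.shape[0] * W.shape[1]            # multiplications for W @ x
low_rank_cost = k * (W.shape[0] + W.shape[1])  # multiplications for L @ (R @ x)

singular_values = np.linalg.svd(W, compute_uv=False)
approx_error = np.linalg.norm(W - L @ R, ord="fro")
```

The approximation error equals the norm of the discarded singular values, so the speed-up is only worthwhile when the spectrum decays quickly. A random matrix like the one above is close to a worst case, which is one reason a fine-tuning step (item 5) can help recover accuracy in practice.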

papers.nips.cc
scholar.google.com

Two-Stream Convolutional Networks for Action Recognition in Videos
Simonyan, Karen and Zisserman, Andrew
Neural Information Processing Systems Conference - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper proposes a model for solving discriminative tasks with video inputs. The
model consists of two convolutional nets. The input to one net is an appearance
frame. The input to the second net is a stack of densely computed optical flow
features. Each pathway is trained separately to classify its input. The
prediction for a video is obtained by taking a (weighted) average of the
predictions made by each net.
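The final fusion step can be sketched in a few lines. The logits below are made-up numbers for a three-class problem, and the function names and the fusion weight `w_temporal` are hypothetical; the sketch only illustrates a weighted average of per-stream class probabilities.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fuse_streams(spatial_probs, temporal_probs, w_temporal=0.5):
    """Weighted average of the class probabilities of the two streams."""
    return (1.0 - w_temporal) * spatial_probs + w_temporal * temporal_probs


# Made-up logits: the appearance net prefers class 0, the optical-flow net class 1.
spatial_logits = np.array([2.0, 1.0, 0.1])
temporal_logits = np.array([0.5, 2.5, 0.3])

fused = fuse_streams(softmax(spatial_logits), softmax(temporal_logits))
predicted_class = int(np.argmax(fused))
```

Because each stream outputs a probability vector and the weights sum to one, the fused vector is again a valid distribution over classes.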

papers.nips.cc
scholar.google.com

Communication Efficient Distributed Machine Learning with the Parameter Server
Li, Mu and Andersen, David G. and Smola, Alexander J. and Yu, Kai
Neural Information Processing Systems Conference - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper presents improvements on a system for large-scale learning known as "parameter server". The parameter server is designed to perform reliable distributed machine learning in large-scale industrial systems (1000's of nodes). The architecture is based on a bipartite graph composed of "servers" and "workers". Workers compute gradients based on subsets of the training instances, while servers aggregate the workers' gradients, update the shared parameter vector and redistribute it to the workers for the next iteration. The architecture is based on asynchronous communication and allows trading off convergence speed and accuracy through a flexible consistency model. The optimization problem is solved with a modified proximal gradient method, in which only blocks of coordinates are updated at a time. Results are shown on an ad-click prediction dataset with $O(10^{11})$ instances as well as features. Results are presented both in terms of convergence time of the algorithm and average time spent per worker. Both are roughly half of the values for the previous version of the parameter server (the version called "B" in the paper). Convergence takes roughly 1h using 1000 machines, each with 16 cores, 192Gb RAM and a 10Gb Ethernet connection (800 workers and 200 servers). Other jobs were run concurrently in the cluster. The authors claim it was not possible to compare against other algorithms since, at the scale they are operating, there is no other open-source solution. In the supplementary material they do compare their system with shotgun \cite{conf/icml/BradleyKBG11} and obtain faster convergence (4x) and a similar value of the objective function at convergence.
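
The worker/server split and the block-wise proximal update can be sketched in a single process as below. This is a simplified, synchronous toy: the squared loss, the L1 regularizer, and all function names are assumptions made for illustration, and the asynchronous communication and consistency model of the real system are not modeled.

```python
def worker_gradient(w, shard):
    """Gradient of 0.5 * sum((w . x - y)^2) over one worker's shard of (x, y) pairs."""
    grad = [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad


def soft_threshold(v, t):
    """Proximal operator of t * |v| (assumed L1 regularizer)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def server_update(w, worker_grads, block, step, l1):
    """Aggregate worker gradients and update only the coordinates in `block`."""
    new_w = list(w)
    for j in block:
        g = sum(grad[j] for grad in worker_grads)
        new_w[j] = soft_threshold(w[j] - step * g, step * l1)
    return new_w


def train(shards, dim, blocks, step=0.1, l1=0.0, rounds=200):
    """Cycle through coordinate blocks; each round is one gather/update/broadcast."""
    w = [0.0] * dim
    for r in range(rounds):
        grads = [worker_gradient(w, shard) for shard in shards]
        w = server_update(w, grads, blocks[r % len(blocks)], step, l1)
    return w
```

For example, `train([[([1, 0], 2), ([0, 1], -1)], [([1, 1], 1)]], dim=2, blocks=[[0], [1]])` uses two workers and two single-coordinate blocks on data generated by the weights (2, -1).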

papers.nips.cc
scholar.google.com

Semi-Separable Hamiltonian Monte Carlo for Inference in Bayesian Hierarchical Models
Zhang, Yichuan and Sutton, Charles A.
Neural Information Processing Systems Conference - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper proposes a way to speed up Hamiltonian Monte Carlo (HMC) \cite{Duane1987216} sampling for hierarchical models. It is similar in spirit to RMHMC, in which the mass matrix varies according to local topology, except that here the mass matrices for each parameter type (parameter or hyperparameter) only depend on their counterpart, which allows an explicit leapfrog integrator to be used to simulate dynamics rather than an implicit integrator requiring fixed-point iteration to convergence for each step. The authors point out that their method goes beyond straightforward Gibbs sampling with HMC within each Gibbs step since their method leaves the counterpart parameter's momentum intact.
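
For reference, the explicit leapfrog integrator mentioned above looks like this in its standard form. The sketch assumes a one-dimensional position and a constant scalar mass; it does not reproduce the paper's semi-separable scheme, in which each group's mass matrix depends on the other group.

```python
def leapfrog(q, p, grad_potential, step, n_steps, mass=1.0):
    """Simulate Hamiltonian dynamics for H(q, p) = U(q) + p^2 / (2 * mass).

    grad_potential: function returning dU/dq at a given position.
    Every update is explicit: no fixed-point iteration is needed.
    """
    p = p - 0.5 * step * grad_potential(q)      # initial half step for momentum
    for i in range(n_steps):
        q = q + step * p / mass                 # full step for position
        if i < n_steps - 1:
            p = p - step * grad_potential(q)    # full step for momentum
    p = p - 0.5 * step * grad_potential(q)      # final half step for momentum
    return q, p


if __name__ == "__main__":
    # Standard normal target: U(q) = q^2 / 2, so dU/dq = q.
    q1, p1 = leapfrog(1.0, 0.0, lambda q: q, step=0.1, n_steps=10)
    print(q1, p1, 0.5 * q1 * q1 + 0.5 * p1 * p1)
```

When the mass depends on the position being updated, as in RMHMC, the position and momentum updates become implicit equations, which is the cost the paper's construction avoids.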

papers.nips.cc
scholar.google.com

Kernel Mean Estimation via Spectral Filtering
Muandet, Krikamol and Sriperumbudur, Bharath K. and Schölkopf, Bernhard
Neural Information Processing Systems Conference - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper presents a family of kernel mean shrinkage estimators (KMSEs). These estimators generalize the ones proposed in \cite{journals/jmlr/FukumizuSG13} and can incorporate useful domain knowledge through spectral filters. Here is a summary of the interesting contributions:
1. Theorem 1 that shows the consistency and admissibility of kmse presented in \cite{journals/jmlr/FukumizuSG13}.
2. The idea of spectral kmse (its use in this unsupervised setting) and similarity of final form with the supervised setting.
3. Theorem 5 that shows consistency of the proposed spectral kmse.

papers.nips.cc
scholar.google.com

Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets
Joulin, Armand and Mikolov, Tomas
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

Endowing recurrent neural networks with memory is clearly one of the most important topics in deep learning and is crucial for real reasoning. The proposed stack-augmented recurrent nets outperform simple RNNs and LSTMs \cite{journals/neco/HochreiterS97} on a series of synthetic problems (learning simple algorithmic patterns). The complexity of the problems is clearly defined, and the behavior of the resulting stack RNN can be well understood and easily analyzed. However, conclusions that depend only on these synthetic data sets are risky. The relevance of the problems to real sequence modeling tasks is uncertain, and the failures of the other models might be greatly reduced by a larger and denser hyper-parameter search. For example, in \cite{journals/corr/LeJH15} a very simple trick lets an RNN work very well on a toy task (the adding problem) that seems to require modeling long-term dependencies.

papers.nips.cc
scholar.google.com

Probabilistic Line Searches for Stochastic Optimization
Mahsereci, Maren and Hennig, Philipp
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors propose a probabilistic version of the "line search" procedure that is commonly used as a subroutine in many deterministic optimization algorithms. The new technique can be applied when the evaluations of the objective function and its gradients are corrupted by noise. The proposed method can therefore be used in stochastic optimization problems, eliminating the need to specify a learning rate parameter in this type of problem. The method uses a Gaussian process surrogate model for the objective and its gradients. This yields a probabilistic version of the conditions commonly used to terminate line searches in the deterministic scenario. The result is a soft version of those conditions that is used to stop the probabilistic line search. At each iteration of the search, the next evaluation location is chosen using Bayesian optimization methods. A series of experiments with neural networks on the MNIST and CIFAR10 datasets validate the usefulness of the proposed technique.
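
The deterministic termination test that the method softens can be sketched as below, assuming the conditions in question are the (weak) Wolfe conditions that are standard in line searches. The constants c1 and c2 are common textbook defaults rather than the paper's settings, and the Gaussian process machinery that turns this test into a probability is not shown.

```python
def wolfe_conditions(f0, df0, ft, dft, t, c1=1e-4, c2=0.9):
    """Check the weak Wolfe conditions for a candidate step size t > 0.

    f0, df0: objective value and directional derivative at step size 0
    ft, dft: objective value and directional derivative at step size t
    """
    sufficient_decrease = ft <= f0 + c1 * t * df0   # Armijo condition
    curvature = dft >= c2 * df0                     # slope has flattened enough
    return sufficient_decrease and curvature


if __name__ == "__main__":
    # One-dimensional slice f(t) = (t - 1)^2 with derivative 2 * (t - 1).
    f = lambda t: (t - 1.0) ** 2
    df = lambda t: 2.0 * (t - 1.0)
    for t in (0.01, 1.0, 2.5):
        print(t, wolfe_conditions(f(0.0), df(0.0), f(t), df(t), t))
```

With noisy evaluations these two inequalities can no longer be checked exactly, which is what motivates replacing the hard test with a probability that both hold.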

papers.nips.cc
scholar.google.com

Fast and Accurate Inference of Plackett-Luce Models
Maystre, Lucas and Grossglauser, Matthias
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper proposes a new inference mechanism for the Plackett-Luce (PL) model, based on the preliminary observation that the ML estimate can be seen as the stationary distribution of a certain Markov chain. In fact, two inference mechanisms are proposed: one is approximate and consistent, the other converges to the ML estimate but is slower. The authors then discuss the application settings (pairwise preferences, partial rankings). Finally, the authors present three sets of experiments. The first compares the proposed algorithm to other approximate inference mechanisms for the PL model in terms of statistical efficiency. Then, on real-world datasets, one experiment compares the empirical performance of the approximate methods and another compares the speed at which exact methods reach a certain level of optimality.
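
The Markov-chain view can be illustrated for pairwise comparisons with the sketch below, written in the spirit of the approximate spectral estimator rather than as the authors' exact algorithm. It assumes the comparison graph is strongly connected; the function name and the power-iteration solver are illustrative choices.

```python
def spectral_strengths(n_items, comparisons, n_iter=1000):
    """Estimate item strengths as the stationary distribution of a Markov chain.

    comparisons: list of (winner, loser) index pairs.
    Probability mass flows from each loser to the item that beat it.
    """
    if not comparisons:
        raise ValueError("need at least one comparison")
    rate = [[0.0] * n_items for _ in range(n_items)]
    for winner, loser in comparisons:
        rate[loser][winner] += 1.0
    # Turn rates into a lazy discrete-time chain with the same stationary law.
    scale = 2.0 * max(sum(row) for row in rate)
    trans = [[rate[i][j] / scale for j in range(n_items)] for i in range(n_items)]
    for i in range(n_items):
        trans[i][i] = 1.0 - sum(rate[i]) / scale
    # Power iteration from the uniform distribution.
    pi = [1.0 / n_items] * n_items
    for _ in range(n_iter):
        pi = [sum(pi[i] * trans[i][j] for i in range(n_items)) for j in range(n_items)]
    return pi
```

With two items where item 0 wins three of four comparisons, the balance equation gives strengths in ratio 3:1, which matches the maximum-likelihood estimate for that tiny case.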

papers.nips.cc
scholar.google.com

Color Constancy by Learning to Predict Chromaticity from Luminance
Chakrabarti, Ayan
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The algorithm presented here is simple and interesting. Pixel luminance, chrominance, and illumination chrominance are all histogrammed, and then evaluation is simply each pixel's luminance voting on each pixel's true chrominance for each of the "memorized" illuminations. The model can be trained generatively by simply counting pixels in the training set, or can be trained end-to-end for a slight performance boost. This algorithm's simplicity and speed are appealing, and additionally it seems like it may be a useful building block for a more sophisticated spatially-varying illumination model.

papers.nips.cc
scholar.google.com

Bayesian Manifold Learning: The Locally Linear Latent Variable Model (LL-LVM)
Park, Mijung and Jitkrittum, Wittawat and Qamar, Ahmad and Szabó, Zoltán and Buesing, Lars and Sahani, Maneesh
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper introduces a probabilistic model for nonlinear manifold discovery. It is based on a generative model with missing variables and requires a variational EM implementation, which is standard but nevertheless technical to derive in this specific context.

papers.nips.cc
scholar.google.com

Unlocking neural population non-stationarities using hierarchical dynamics models
Park, Mijung and Bohner, Gergo and Macke, Jakob H.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper describes using an additional time scale over trials to model (slow) non-stationarities. It adds to the successful PLDS model another gain vector, matching the latent dimensions, that is constant during each trial. Many neuroscientific datasets indeed show such slow drifts, which could very well be captured by such a modeling effort.

papers.nips.cc
scholar.google.com

On the Pseudo-Dimension of Nearly Optimal Auctions
Morgenstern, Jamie and Roughgarden, Tim
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper addresses the problem of learning reserve prices that approximately maximize revenue, using sample draws from an unknown distribution over bidder valuations. The authors introduce t-level auctions, in which (roughly speaking) each bidder's bid space is effectively discretized into levels, and the bidder whose bid falls on the highest level wins and pays the lowest value that falls on its lowest level required to win.

The authors bound the number of samples needed to find an approximately revenue-maximizing auction from all auctions in a set C (e.g., from the set of 10-level auctions). They bound the difference in revenue between the revenue-maximizing t-level auction and the optimal auction. Results are presented for single-item auctions but are generalized to matroid settings and single-parameter settings.

papers.nips.cc
scholar.google.com

Galileo: Perceiving Physical Object Properties by Integrating a Physics Engine with Deep Learning
Wu, Jiajun and Yildirim, Ilker and Lim, Joseph J. and Freeman, Bill and Tenenbaum, Joshua B.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors introduce a novel approach for inferring hidden physical properties of objects (mass and friction), which also allows the system to make subsequent predictions that depend on these properties. They use a black-box generative model (a physics simulator), to perform sampling-based inference, and leverage a tracking algorithm to transform the data into more suitable latent variables (and reduce its dimensionality) as well as a deep model to improve the sampler. The authors assume priors over the hidden physical properties, and make point estimates of the geometry and velocities of objects using a tracking algorithm, which comprise a full specification of the scene that can be input to a physics engine to generate simulated velocities. These simulated velocities then support inference of the hidden properties within an MCMC sampler: the properties' values are proposed and their consequent simulated velocities are generated, which are then scored against the estimated velocities, similar to ABC. A deep network can be trained as a recognition model, from the inferences of the generative model, and also from the Physics 101 dataset directly. Its predictions of the mass and friction can be used to initialize the MCMC sampler.

papers.nips.cc
scholar.google.com

Smooth Interactive Submodular Set Cover
He, Bryan D. and Yue, Yisong
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper considers a generalization of the Interactive Submodular Set Cover (ISSC) problem \cite{conf/icml/GuilloryB10}. In ISSC, the goal is to interactively collect elements until the value of the set of elements, represented by an unknown submodular function, reaches some threshold. In the original ISSC there is a single correct submodular function, which can be revealed using responses to each selected element, and a single desired threshold. This paper proposes to simultaneously require reaching some threshold for all the possible submodular functions. The threshold value is determined as a convex function of a submodular agreement measure between the given function and the responses to all elements. Each element has a cost, and so the goal is to efficiently decide which elements to collect to satisfy the goal at a small cost.

papers.nips.cc
scholar.google.com

A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements
Zheng, Qinqing and Lafferty, John D.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper presents results on recovery of low-rank semidefinite matrices from linear measurements, using nonconvex optimization. The approach is inspired by recent work on phase retrieval, and combines spectral initialization with gradient descent. The connection to phase retrieval comes because measurements which are linear in the semidefinite matrix $X = Z Z'$ are quadratic in the factors $Z$. The paper proves recovery results which imply that correct recovery occurs when the number of measurements $m$ is essentially proportional to $n r^2$, where $n$ is the dimensionality and $r$ is the rank. The convergence analysis is based on a form of restricted strong convexity (restricted because there is an $r(r-1)/2$-dimensional set of equivalent solutions along which the objective is flat). This condition also implies linear convergence of the proposed algorithm.
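
To make the link to phase retrieval concrete, here is a short worked derivation. The symbols $A_i$ (symmetric measurement matrices), $b_i$ (observed values) and the least-squares objective $f$ are assumptions introduced for illustration; the summary itself only fixes $X = Z Z'$, $m$, $n$ and $r$.

$$
b_i = \operatorname{tr}(A_i X) = \operatorname{tr}(A_i Z Z') = \sum_{k=1}^{r} z_k' A_i z_k ,
\qquad i = 1, \dots, m,
$$

where $z_k$ is the $k$-th column of $Z$: each measurement is linear in $X$ but quadratic in $Z$. A natural objective and its gradient are then

$$
f(Z) = \frac{1}{4m} \sum_{i=1}^{m} \bigl( \operatorname{tr}(A_i Z Z') - b_i \bigr)^2 ,
\qquad
\nabla f(Z) = \frac{1}{m} \sum_{i=1}^{m} \bigl( \operatorname{tr}(A_i Z Z') - b_i \bigr) A_i Z .
$$

Since $(ZR)(ZR)' = Z Z'$ for any $r \times r$ orthogonal matrix $R$, we have $f(ZR) = f(Z)$; the orthogonal group has dimension $r(r-1)/2$, which is the flat set of equivalent solutions mentioned above.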

The implementation seems awful. Compared to recent implementations, e.g. http://arxiv.org/abs/1408.2467, the performance seems orders of magnitude away from the state of the art, and being an order of magnitude faster than a general-purpose SDP solver on the nuclear norm does not make it any better. The authors should acknowledge that and compare the results with other codes on some established benchmark (e.g. Lenna), so as to show that the price in terms of run-time brings about much better performance in terms of objective function values (SNR, RMSE), which is plausible but far from certain.

papers.nips.cc
scholar.google.com

Space-Time Local Embeddings
Sun, Ke and Wang, Jun and Kalousis, Alexandros and Marchand-Maillet, Stéphane
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper presents a data visualisation method based on the concept of space-time. The space-time representation is capable of showing a broader family of proximities than a Euclidean space with the same dimensionality. Based on the KL measure, the authors argue that a lower dimensional representation of high dimensional data obtained with the space-time local embedding method can keep more information than Euclidean embeddings. I am quite convinced, but there is one open question about the interpretability of the visualised data in space-time.

papers.nips.cc
scholar.google.com

Parallel Correlation Clustering on Big Graphs
Pan, Xinghao and Papailiopoulos, Dimitris S. and Oymak, Samet and Recht, Benjamin and Ramchandran, Kannan and Jordan, Michael I.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This work addresses an important special case of the correlation clustering problem: Given as input a graph with edges labeled -1 (disagreement) or +1 (agreement), the goal is to decompose the graph so as to maximize agreement within components. Building on recent work \cite{conf/kdd/BonchiGL14} \cite{conf/kdd/ChierichettiDK14}, this paper contributes two concurrent algorithms, a proof of their approximation ratio, a run-time analysis as well as a set of experiments which demonstrate convincingly the advantage of the proposed algorithms over the state of the art.
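
For orientation, the serial pivot scheme that randomized correlation clustering algorithms of this kind are commonly built on can be sketched as below. It is not the paper's concurrent algorithm: the sketch assumes a complete graph in which every pair not listed as a +1 edge is a -1 edge, and the function names are hypothetical.

```python
import random


def pivot_clustering(vertices, positive_edges, seed=0):
    """Serial pivot clustering: pick a random unclustered vertex, then
    put it and its unclustered +1 neighbours into one new cluster."""
    rng = random.Random(seed)
    neighbors = {v: set() for v in vertices}
    for u, v in positive_edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    order = list(vertices)
    rng.shuffle(order)
    cluster_of = {}
    for pivot in order:
        if pivot in cluster_of:
            continue
        cluster_of[pivot] = pivot
        for u in neighbors[pivot]:
            if u not in cluster_of:
                cluster_of[u] = pivot
    return cluster_of


def disagreements(vertices, positive_edges, cluster_of):
    """Count +1 pairs split across clusters and -1 pairs placed together."""
    positive = {frozenset(edge) for edge in positive_edges}
    verts = list(vertices)
    cost = 0
    for i, u in enumerate(verts):
        for v in verts[i + 1:]:
            together = cluster_of[u] == cluster_of[v]
            if together != (frozenset((u, v)) in positive):
                cost += 1
    return cost
```

The serial dependence lies in the pivot order, since each pivot must see which vertices earlier pivots already claimed; removing or relaxing that dependence is the difficulty a concurrent version has to deal with.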

papers.nips.cc
scholar.google.com

Expressing an Image Stream with a Sequence of Natural Sentences
Park, Cesc C. and Kim, Gunhee
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper attacks the problem of describing a sequence of images from blog posts with a sequence of consistent sentences. For this, the paper proposes to first retrieve the K=5 most similar images and associated sentences from the training set for each query image. The main contribution of the paper lies in defining a way to select the most relevant sentences for the query image sequence, providing a coherent description. To do so, sentences are first embedded as vectors and the sequence of sentences is then modeled with a bidirectional LSTM. The output of the bidirectional LSTM is first fed through a ReLU \cite{conf/icml/NairH10} and a fully connected layer and then scored with a compatibility score between image and sentence. Additionally, a local coherence model \cite{journals/coling/BarzilayL08} is included to enforce the compatibility between sentences.

papers.nips.cc
scholar.google.com

Planar Ultrametrics for Image Segmentation
Yarkony, Julian and Fowlkes, Charless C.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper presents a method to obtain a hierarchical clustering of a planar graph by posing the problem as that of approximating a set of edge weights using an ultrametric. This is accomplished by minimizing the $\ell_2$ norm between the given edge weights and the learnt ultrametric. Learning the ultrametric amounts to estimating a collection of multicuts that satisfies a hierarchical partitioning constraint. An efficient algorithm is presented that solves an approximation based on finding a linear combination of a subset of possible two-way cuts of the graph.

papers.nips.cc
scholar.google.com

Logarithmic Time Online Multiclass prediction
Choromanska, Anna and Langford, John
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper proposes a novel online algorithm for constructing a multiclass classifier that enjoys a time complexity logarithmic in the number of classes k. This is done by constructing online a decision tree which locally maximizes an appropriate novel objective function, which measures the quality of a tree according to a combined "balancedness" and "purity" score. A theoretical analysis (of a probably intractable algorithm) is provided via a boosting argument (assuming weak learnability), essentially extending the work of Kearns and Mansour (1996) \cite{conf/stoc/KearnsM96} to the multiclass setup. A concrete algorithm is given to a relaxed problem (but see below) without any guarantees, but quite simple, natural and interesting.

papers.nips.cc
scholar.google.com

Robust Portfolio Optimization
Qiu, Huitong and Han, Fang and Liu, Han and Caffo, Brian
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The authors derive an estimator of a "proxy" of the covariance matrix of a stationary stochastic process (in their case asset returns) which is robust to data outliers and does not make assumptions on the tails of the distribution. They show that for elliptical distributions, which include Gaussians, this proxy is consistent with the true covariance matrix up to a scaling factor, and that their proposed estimator of the proxy has bounded error.

papers.nips.cc
scholar.google.com

Covariance-Controlled Adaptive Langevin Thermostat for Large-Scale Bayesian Sampling
Shang, Xiaocheng and Zhu, Zhanxing and Leimkuhler, Benedict J. and Storkey, Amos J.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper presents a new method (the "covariance-controlled adaptive Langevin thermostat") for MCMC posterior sampling for Bayesian inference. Along the lines of previous work in scalable MCMC, this is a stochastic gradient sampling method. The presented method aims to decrease parameter-dependent noise (in order to speed up convergence to the given invariant distribution of the Markov chain, and generate beneficial samples more efficiently), while maintaining the desired invariant distribution of the Markov chain. Similar to existing stochastic gradient MCMC methods, this method aims to find use in large-scale machine learning settings (i.e. Bayesian inference with large numbers of observations). Experiments on three models (a normal-gamma model, Bayesian logistic regression, and a discriminative restricted Boltzmann machine) aim to show that the presented method performs better than Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) \cite{10.1016/0370-2693(87)91197-X} and Stochastic Gradient Nosé-Hoover Thermostat (SGNHT), two similar existing methods.

papers.nips.cc
scholar.google.com

Adaptive Low-Complexity Sequential Inference for Dirichlet Process Mixture Models
Tsiligkaridis, Theodoros and Forsythe, Keith W.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This paper introduces ASUGS (adaptive sequential updating and greedy search), building on the previous work on SUGS by Wang & Dunson 2011 \cite{10.1198/jcgs.2010.07081}, which is a sequential (i.e. online) MAP inference method for DPMMs.

The main contribution of the paper is to provide online updating for the concentration parameter, $\alpha$.

The paper shows that the posterior distribution on $\alpha$ can be expected to behave as a gamma distribution (that depends on the current number of clusters and on $n$) in the large-scale limit, assuming an exponential prior on $\alpha$.

ASUGS uses the mean of this gamma distribution as the $\alpha$ for updating cluster assignments, with the remainder of the algorithm proceeding as in SUGS (i.e. using conjugacy to update model parameters in an online fashion, with hard assignments of data to clusters).

The paper also shows that this choice of $\alpha$ is bounded by $\log^\epsilon n$ for an arbitrarily small $\epsilon$, so that we may expect this process to converge, or at the very least be stable even in large settings.

papers.nips.cc
scholar.google.com

Algorithmic Stability and Uniform Generalization
Alabdulmohsin, Ibrahim M.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper seeks to establish a connection between algorithmic stability and generalization performance. Notions of algorithmic stability have been proposed before and linked to the generalization performance of learning algorithms \cite{conf/uai/KutinN02} \cite{journals/neco/KearnsR99} and have also been shown to be crucial for learnability \cite{journals/jmlr/Shalev-ShwartzSSS10}.

\cite{PoggioETAL:04} proved that for bounded loss functions, the generalization of ERM is equivalent to the probabilistic leave-one-out stability of the learning algorithm. \cite{journals/jmlr/Shalev-ShwartzSSS10} then showed that a problem is learnable in Vapnik's general setting of learning iff there exists an asymptotically stable ERM procedure.

This paper first establishes that, for Vapnik's general setting of learning, a probabilistic notion of stability is necessary and sufficient for the training losses to converge to test losses uniformly for all distributions. The paper then discusses how this notion of stability can be interpreted to give results in terms of the capacity of the function class or the size of the population.

papers.nips.cc
scholar.google.com

Learning with Symmetric Label Noise: The Importance of Being Unhinged
van Rooyen, Brendan and Menon, Aditya Krishna and Williamson, Robert C.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper presents a solution to binary classification with symmetric label noise (SLN). The authors show that, in order to obtain consistency (w.r.t. the 0-1 loss in the "noiseless" case) while using a convex surrogate, one must use the loss $\ell(v,y) = 1 - vy$, the "unhinged loss", which is shown to enjoy some useful properties, including robustness to SLN. In a more restricted sense of robustness, it is the only such loss, but in any case it overcomes the limitations of other convex losses for the same problem.

Different implications of using the unhinged loss are discussed; the problem of classification with SLN with the unhinged loss and "linear" classifiers is investigated and solved analytically. The authors also present an empirical evaluation to motivate that their theoretical considerations have practical impact.
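
A small numerical check of the robustness property follows. It assumes labels in {-1, +1} that are flipped independently with probability `flip_prob`; the helper names are illustrative. Because the loss is linear in $vy$, the expected loss under noise is an affine function of the clean loss, so both are minimized by the same scores.

```python
def unhinged_loss(v, y):
    """Unhinged loss l(v, y) = 1 - v * y for a score v and a label y in {-1, +1}."""
    return 1.0 - v * y


def expected_noisy_loss(v, y, flip_prob):
    """Expected unhinged loss when the true label y is flipped with probability flip_prob."""
    return (1.0 - flip_prob) * unhinged_loss(v, y) + flip_prob * unhinged_loss(v, -y)


def affine_form(v, y, flip_prob):
    """Same quantity written as (1 - 2 * rho) * clean_loss + 2 * rho."""
    return (1.0 - 2.0 * flip_prob) * unhinged_loss(v, y) + 2.0 * flip_prob


if __name__ == "__main__":
    for v in (-1.0, 0.3, 2.0):
        print(v, expected_noisy_loss(v, 1, 0.2), affine_form(v, 1, 0.2))
```

For a flip probability below 1/2 the scaling factor $1 - 2\rho$ is positive, so the ordering of candidate scores by expected loss is unchanged by the noise.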

papers.nips.cc
scholar.google.com

Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing
Shah, Nihar Bhadresh and Zhou, Denny
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

The paper proposes a payment rule for crowdsourced tasks. This rule is intended to incentivize workers to accurately report their confidence (e.g. by skipping a task when they have low confidence), and to pay little to spammers. Payment is based on the product of the evaluations of a worker's responses to a set of gold-standard tasks; if the worker gets a single gold standard task wrong and asserts high confidence, the overall payment is zero.
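
The multiplicative structure of the rule can be sketched as below. The per-task factors (a doubling for each correct answer, 1 for a skip, 0 for a wrong answer) and the normalization by the maximum score are assumptions chosen to match the paper's title; the graded confidence levels are collapsed here into "attempted" versus "skipped".

```python
def multiplicative_payment(gold_outcomes, max_payment, correct_factor=2.0):
    """Payment as a product of per-task evaluations on gold-standard tasks.

    gold_outcomes: list of 'correct', 'wrong' or 'skip', one per gold task.
    A single confident wrong answer zeroes the whole payment.
    """
    score = 1.0
    for outcome in gold_outcomes:
        if outcome == "correct":
            score *= correct_factor
        elif outcome == "wrong":
            return 0.0
        elif outcome != "skip":
            raise ValueError("unknown outcome: %r" % (outcome,))
    # Normalize so that answering every gold task correctly earns max_payment.
    return max_payment * score / correct_factor ** len(gold_outcomes)
```

Under these assumptions a worker who guesses at random on many gold tasks is very likely to hit at least one wrong answer and earn nothing, whereas skipping costs only the forgone doubling.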

papers.nips.cc
scholar.google.com

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Ren, Shaoqing and He, Kaiming and Girshick, Ross B. and Sun, Jian
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 10 years ago

This work proposes a two-stage object detection algorithm based on convolutional neural networks (CNNs). The first stage is region proposal, performed by a Region Proposal Network (RPN) that follows the traditional sliding window method but operates on the top-layer feature map of the CNN. In the second stage, a Fast R-CNN is applied to the proposed regions. Since the convolution layers are shared between the RPN and the R-CNN, and the computation is sped up using a GPU, the algorithm achieves near real-time speed (5fps).
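
The sliding-window stage can be pictured with the sketch below, which enumerates the reference boxes (anchors) that an RPN scores at each feature-map location. The stride, scales and aspect ratios are illustrative values passed in by the caller, and the function name is hypothetical; the network that classifies and refines these boxes is not shown.

```python
def sliding_window_anchors(feat_h, feat_w, stride, scales, ratios):
    """Enumerate anchor boxes (x1, y1, x2, y2) in image coordinates.

    feat_h, feat_w: size of the top-layer feature map
    stride: image pixels per feature-map cell
    scales: square roots of the anchor areas, in pixels
    ratios: height / width aspect ratios
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx = (col + 0.5) * stride   # window centre in the image
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale / ratio ** 0.5
                    h = scale * ratio ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

Because the windows slide over a shared feature map rather than over the image, proposing regions adds little computation on top of the detection network.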
