Welcome to ShortScience.org! 
[link]
FaceNet directly maps face images to $\mathbb{R}^{128}$ where distances directly correspond to a measure of face similarity. They use a triplet loss function. The triplet is (face of person A, other face of person A, face of person which is not A). Later, this is called (anchor, positive, negative). The loss function is learned and inspired by LMNN. The idea is to minimize the distance between the two images of the same person and maximize the distance to the other persons image. ## LMNN Large Margin Nearest Neighbor (LMNN) is learning a pseudometric $$d(x, y) = (x y) M (x y)^T$$ where $M$ is a positivedefinite matrix. The only difference between a pseudometric and a metric is that $d(x, y) = 0 \Leftrightarrow x = y$ does not hold. ## Curriculum Learning: Triplet selection Show simple examples first, then increase the difficulty. This is done by selecting the triplets. They use the triplets which are *hard*. For the positive example, this means the distance between the anchor and the positive example is high. For the negative example this means the distance between the anchor and the negative example is low. They want to have $$f(x_i^a)  f(x_i^p)_2^2 + \alpha < f(x_i^a)  f(x_i^n)_2^2$$ where $\alpha$ is a margin and $x_i^a$ is the anchor, $x_i^p$ is the positive face example and $x_i^n$ is the negative example. They increase $\alpha$ over time. It is crucial that $f$ maps the images not in the complete $\mathbb{R}^{128}$, but on the unit sphere. Otherwise one could double $\alpha$ by simply making $f' = 2 \cdot f$. ## Tasks * **Face verification**: Is this the same person? * **Face recognition**: Who is this person? ## Datasets * 99.63% accuracy on Labeled FAces in the Wild (LFW) * 95.12% accuracy on YouTube Faces DB ## Network Two models are evaluated: The [Zeiler & Fergus model](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13) and an architecture based on the [Inception model](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14). ## See also * [DeepFace](http://www.shortscience.org/paper?bibtexKey=conf/cvpr/TaigmanYRW14#martinthoma) 
[link]
Spatial Pyramid Pooling (SPP) is a technique which allows Convolutional Neural Networks (CNNs) to use input images of any size, not only $224\text{px} \times 224\text{px}$ as most architectures do. (However, there is a lower bound for the size of the input image). ## Idea * Convolutional layers operate on any size, but fully connected layers need fixedsize inputs * Solution: * Add a new SPP layer on top of the last convolutional layer, before the fully connected layer * Use an approach similar to bag of words (BoW), but maintain the spatial information. The BoW approach is used for text classification, where the order of the words is discarded and only the number of occurences is kept. * The SPP layer operates on each feature map independently. * The output of the SPP layer is of dimension $k \cdot M$, where $k$ is the number of feature maps the SPP layer got as input and $M$ is the number of bins. Example: We could use spatial pyramid pooling with 21 bins: * 1 bin which is the max of the complete feature map * 4 bins which divide the image into 4 regions of equal size (depending on the input size) and rectangular shape. Each bin gets the max of its region. * 16 bins which divide the image into 4 regions of equal size (depending on the input size) and rectangular shape. Each bin gets the max of its region. ## Evaluation * Pascal VOC 2007, Caltech101: stateoftheart, without finetuning * ImageNet 2012: Boosts accuracy for various CNN architectures * ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014: Rank #2 ## Code The paper claims that the code is [here](http://research.microsoft.com/enus/um/people/kahe/), but this seems not to be the case any more. People have tried to implement it with Tensorflow ([1](http://stackoverflow.com/q/40913794/562769), [2](https://github.com/fchollet/keras/issues/2080), [3](https://github.com/tensorflow/tensorflow/issues/6011)), but by now no public working implementation is available. ## Related papers * [Atrous Convolution](https://arxiv.org/abs/1606.00915)
1 Comments

[link]
Federated learning is the problem of training a model that incorporates updates from the data of many individuals, without having direct access to that data, or having to store it. This is potentially desirable both for reasons of privacy (not wanting to have access to private data in a centralized way), and for potential benefits to transport cost when data needed to train models exists on a user's device, and would require a lot of bandwidth to transfer to a centralized server. Historically, the default way to do Federated Learning was with an algorithm called FedSGD, which worked by:  Sending a copy of the current model to each device/client  Calculating a gradient update to be applied on top of that current model given a batch of data sampled from the client's device  Sending that gradient back to the central server  Averaging those gradients and applying them all at once to a central model The authors note that this approach is equivalent to one where a single device performs a step of gradient descent locally, sends the resulting *model* back to the the central server, and performs model averaging by averaging the parameter vectors there. Given that, and given their observation that, in federated learning, communication of gradients and models is generally much more costly than the computation itself (since the computation happens across so many machines), they ask whether the communication required to get to a certain accuracy could be better optimized by performing multiple steps of gradient calculation and update on a given device, before sending the resulting model back to a central server to be average with other clients models. Specifically, their algorithm, FedAvg, works by:  Dividing the data on a given device into batches of size B  Calculating an update on each batch and applying them sequentially to the starting model sent over the wire from the server  Repeating this for E epochs Conceptually, this should work perfectly well in the world where data from each batch is IID  independently drawn from the same distribution. But that is especially unlikely to be true in the case of federated learning, when a given user and device might have very specialized parts of the data space, and prior work has shown that there exist pathological cases where averaged models can perform worse than either model independently, even *when* the IID condition is met. The authors experiment empirically ask the question whether these sorts of pathological cases arise when simulating a federated learning procedure over MNIST and a language model trained on Shakespeare, trying over a range of hyperparameters (specifically B and E), and testing the case where data is heavily nonIID (in their case: where different "devices" had nonoverlapping sets of digits). https://i.imgur.com/xq9vi8S.png They show that, in both the IID and nonIID settings, they are able to reach their target accuracy, and are able to do so with many fewer rounds of communciation than are required by FedSGD (where an update is sent over the wire, and a model sent back, for each round of calculation done on the device.) The authors argue that this shows the practical usefulness of a Federated Learning approach that does more computation on individual devices before updating, even in the face of theoretical pathological cases. 
[link]
The paper introduces a sequential variational autoencoder that generates complex images iteratively. The authors also introduce a new spatial attention mechanism that allows the model to focus on small subsets of the image. This new approach for image generation produces images that can’t be distinguished from the training data. #### What is DRAW: The deep recurrent attention writer (DRAW) model has two differences with respect to other variational autoencoders. First, the encoder and the decoder are recurrent networks. Second, it includes an attention mechanism that restricts the input region observed by the encoder and the output region observed by the decoder. #### What do we gain? The resulting images are greatly improved by allowing a conditional and sequential generation. In addition, the spatial attention mechanism can be used in other contexts to solve the “Where to look?” problem. #### What follows? A possible extension to this model would be to use a convolutional architecture in the encoder or the decoder. Although this might be less useful since we are already restricting the input of the network. #### Like: * As observed in the samples generated by the model, the attention mechanism works effectively by reconstructing images in a local way. * The attention model is fully differentiable. #### Dislike: * I think a better exposition of the attention mechanism would improve this paper. 
[link]
The authors propose a unified setting to evaluate the performance of batch reinforcement learning algorithms. The proposed benchmark is discrete and based on the popular Atari Domain. The authors review and benchmark several current batch RL algorithms against a newly introduced version of BCQ (Batch Constrained Deep Q Learning) for discrete environments. https://i.imgur.com/zrCZ173.png Note in line 5 that the policy chooses actions with a restricted argmax operation, eliminating actions that have not enough support in the batch. One of the key difficulties in batchRL is the divergence of value estimates. In this paper the authors use Double DQN, which means actions are selected with a value net $Q_{\theta}$ and the policy evaluation is done with a target network $Q_{\theta'}$ (line 6). **How is the batch created?** A partially trained DQNagent (trained online for 10mio steps, aka 40mio frames) is used as behavioral policy to collect a batch $B$ containing 10mio transitions. The DQN agent uses either with probability 0.8 an $\epsilon=0.2$ and with probability 0.2 an $\epsilon = 0.001$. The batch RL agents are trained on this batch for 10mio steps and evaluated every 50k time steps for 10 episodes. This process of batch creation differs from the settings used in other papers in i) having only a single behavioral policy, ii) the batch size and iii) the proficiency level of the batch policy. The experiments, performed on the arcade learning environment include DQN, REM, QRDQN, KLControl, BCQ, OnlineDQN and Behavioral Cloning and show that:  for conventional RL algorithms distributional algorithms (QRDQN) outperform the plain algorithms (DQN)  batch RL algorithms perform better than conventional algorithms with BCQ outperforming every other algorithm in every tested game In addition to the return the authors plot the value estimates for the Qnetworks. A drop in performance corresponds in all cases to a divergence (up or down) in value estimates. The paper is an important contribution to the debate about what is the right setting to evaluate batch RL algorithms. It remains however to be seen if the proposed choice of i) a single behavior policy, ii) the batch size and iii) quality level of the behavior policy will be accepted as standard. Further work is in any case required to decide upon a benchmark for continuous domains. 