[link]
I'll admit that I found this paper a bit of a letdown, relative to expectations rooted in its high citation count and my general excitement to see how deep learning could be brought to bear on molecular design. But before a critique, let's first walk through the mechanics of the authors' approach.

The method proposed is basically a very straightforward Variational Autoencoder, or VAE. It takes in a textual SMILES string representation of a molecular structure, uses an encoder to map that into a continuous vector representation, a decoder to map the vector representation back into a SMILES string, and an auxiliary predictor to predict properties of a molecule given the continuous representation. So, the training loss is a combination of the reconstruction loss (the negative log probability of the true molecule under the distribution produced by the decoder) and the semi-supervised predictive loss. The hope with this model is that it would allow you to sample from a space of potential molecules by starting from an existing molecule, and then optimizing the vector representation of that molecule to make it score higher on whatever property you want to optimize for.

https://i.imgur.com/WzZsCOB.png

The authors acknowledge that, in this setup, you're just producing a probability distribution over characters, and that the continuous vectors sampled from the latent space might not actually map to valid SMILES strings, and beyond that may well not correspond to chemically valid molecules. Empirically, they said that the proportion of valid generated molecules ranged between 1% and 70%. But they argue that it'd be too difficult to enforce those constraints, and instead just sample from the model and run the results through a hand-designed filter for molecular validity.
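To make the training objective described above concrete, here is a minimal sketch of a per-molecule loss. It is an illustration under assumptions, not the authors' implementation: the function name, the squared-error property term, the `property_weight` parameter, and the explicit KL regularizer (the standard VAE term, which the summary above does not spell out) are all my own choices.

```python
import math

def vae_molecule_loss(char_log_probs, mu, log_var,
                      predicted_property, true_property,
                      property_weight=1.0):
    """Illustrative joint loss for a single molecule (hypothetical helper).

    char_log_probs: log-probability the decoder assigns to each true SMILES character.
    mu, log_var: mean and log-variance of the encoder's diagonal Gaussian.
    """
    # Reconstruction: negative log-likelihood of the true SMILES string.
    reconstruction = -sum(char_log_probs)
    # Standard VAE regularizer: KL(N(mu, sigma^2) || N(0, 1)), summed over dimensions.
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))
    # Auxiliary property prediction (squared error is an assumption).
    property_loss = (predicted_property - true_property) ** 2
    return reconstruction + kl + property_weight * property_loss
```

In a real model each of these terms would be computed on batches with an autodiff framework; the point here is only how the pieces add up.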
In my view, this is the central weakness of the method proposed in this paper: the authors seem not to have tackled the question of either chemical viability or even syntactic correctness of the produced molecules. I found it difficult to nail down from the paper what the ultimate percentage of valid molecules was from points in latent space that were off of the training data. A table reports "percentage of 5000 randomly-selected latent points that decode to valid molecules after 1000 attempts," but I'm confused by what the 1000 attempts means here: does that mean we draw 1000 samples from the distribution given by the decoder, and see if *any* of those samples are valid? That would be a strange metric, if so, and perhaps it means something different, but it's hard to tell.

https://i.imgur.com/9sy0MXB.png

This paper made me really curious to see whether a GAN could do better in this space, since it would presumably be better at the task of incentivizing syntactic correctness of produced strings (given that any deviation from correctness could be signal for the discriminator), but it might also lead to issues around mode collapse, and when I last checked the literature, GANs on text data in particular were still not great.
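To see why an "any of 1000 attempts" reading would make for a strange metric, here is a small back-of-the-envelope sketch. It assumes (my assumption, not something the paper states) that decoding attempts from a given latent point are independent and each is valid with the same probability.

```python
def prob_any_valid(p_valid, attempts):
    """Probability that at least one of `attempts` independent decodes is valid."""
    return 1.0 - (1.0 - p_valid) ** attempts

# Hypothetical per-attempt validity rates, spanning the 1%-70% range quoted above.
for p in (0.01, 0.1, 0.7):
    print(p, prob_any_valid(p, attempts=1000))
```

Under that reading, even a decoder that is valid only 1% of the time per attempt would count as "valid" for nearly every latent point, which is why the number is hard to interpret.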
[link]
The goal of one-shot learning tasks is to design a learning structure that can perform a new task (or, more canonically, add a new class to an existing task) using only one or a small number of examples of the new task or class. So, as an example: you'd want to be able to take one positive and one negative example of a given task and correctly classify subsequent points as either positive or negative. A common way of achieving this, and the way that the paper builds on, is to learn a parametrized function projecting both your labeled points (your "support set") and your unlabeled point (your "query") into an embedding space, and then assigning a class to your query according to how close it is to the support set points associated with each label. The hope is that, in the course of training on different but similar tasks, you've learned a metric space where nearby things tend to be of similar classes. This method is called a "matching network". This paper has the specific objective of using such one-shot methods for drug discovery, and evaluates on tasks drawn from that domain, but most of the mechanics of the paper can be understood without reference to molecular data in particular.

In the simplest version of such a network, the query and support set points are embedded unconditionally, meaning that the query would be embedded in the same way regardless of the values in the support set, and that each point in the support set would be embedded without knowledge of the others. However, given how little data we're giving our model to work with, it might be valuable to allow our query embedder (f(x)) and support set embedder (g(x)) to depend on the values within the support set. Prior work had achieved this by:

1) Creating initial f'(x) and g'(x) query and support embedders.
2) Concatenating the embedded support points g'(x) into a single vector and running a bidirectional LSTM over the concatenation, which results in a representation g(x) of each input that incorporates information from g'(x_i) for all other x_i (albeit in a way that imposes a concatenation ordering that may not correspond to a meaningful order).

3) Calculating f(x) of your query point by using an attention mechanism to combine f'(x) with the contextualized embeddings g(x).

The authors of the current paper argue that this approach is suboptimal because of the artificially imposed ordering, and because it calculates g(x) prior to f(x) using asymmetrical model structures (though it's not super clear why this latter point is a problem). Instead, they propose a somewhat elaborate and difficult-to-follow attention-based mechanism. As best as I can understand, this is what they're suggesting:

https://i.imgur.com/4DLWh8H.png

1) Update the query embedding f(x) by calculating an attention distribution over the vector of current embeddings of support set points (here referred to as bolded <r>), pooling downward to a single aggregate embedding vector r, and then using an LSTM that takes in that aggregate vector and the prior update to generate a new update. This update, dz, is added to the existing query embedding estimate to get a new one.

2) Update the vector of support set embeddings by iteratively calculating an attention mapping between the vector of current support set embeddings and the original features g'(S), and using that attention mapping to create a new <r>, which, similar to the above, is fed into an LSTM to calculate the next update.

Since the model is evaluated on molecular tasks, all of the embedding functions are structured as graph convolutions.
Other than the obvious fact that attention is a great way of aggregating information in an order-independent way, the authors give disappointingly little justification of why they would expect their method to work meaningfully better than past approaches. Empirically, they do find that it performs slightly better than prior contextualized matching networks on held-out tasks of predicting toxicity and side effects, given only a small number of examples from the held-out task. However, neither this paper's new method nor previous one-shot learning work is able to perform very well on the challenging MUV dataset, where held-out binding tasks involve molecules structurally dissimilar from those seen during training, suggesting that whatever generalization this method is able to achieve doesn't quite rise to the task of making inferences based on molecules with different structures.
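As a reference point for the simplest, unconditional matching network described at the start of this summary, here is a minimal sketch of how a query could be classified from already-computed embeddings. Cosine similarity and a softmax over the support set are assumptions on my part (a common choice), and `match_query` is a hypothetical name; the paper's iterative attention/LSTM refinement is not modeled here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_query(query_emb, support_embs, support_labels):
    """Return a label -> score mapping from attention over the support set."""
    sims = [cosine(query_emb, s) for s in support_embs]
    top = max(sims)
    weights = [math.exp(s - top) for s in sims]  # numerically stable softmax
    total = sum(weights)
    scores = {}
    for w, label in zip(weights, support_labels):
        scores[label] = scores.get(label, 0.0) + w / total
    return scores

# One positive and one negative support example (toy 2-d embeddings).
print(match_query([1.0, 0.0], [[1.0, 0.1], [0.0, 1.0]], [1, 0]))
```

The predicted class is simply the label with the largest score; everything interesting in the paper happens in how the embeddings themselves are made context-dependent.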
[link]
This paper was published after the 2015 Duvenaud et al. paper proposing a differentiable alternative to circular fingerprints of molecules: substituting out exact-match random hash functions to identify molecular structures with learned convolution-esque kernels. As far as I can tell, the Duvenaud paper was the first to propose something we might today recognize as graph convolutions on atoms. I hoped this paper would build on that one, but it seems to be coming from a conceptually different direction, and it seems like it was more or less contemporaneous, for all that it was released later. This paper introduces a structure that allows for more explicit message passing along bonds, by calculating atom features as a function of their incoming bonds, and then bond features as a function of their constituent atoms, and iterating this procedure, so information from an atom can be passed into a bond, then, on the next iteration, pulled in by another atom on the other end of that bond, and then pulled into that atom's bonds, and so forth. This has the effect of, similar to a convolutional or recurrent network, creating representations for each atom in the molecular graph that are informed by context elsewhere in the graph, to different degrees depending on distance from that atom.
More specifically, it defines:

* A function mapping from a prior layer atom representation to a subsequent layer atom representation, taking into account only information from that atom (Atom to Atom)
* A function mapping from a prior layer bond representation (indexed by the two atoms on either side of the bond) to a subsequent layer bond representation, taking into account only information from that bond at the prior layer (Bond to Bond)
* A function creating a bond representation by applying a shared function to the atoms at either end of it, and then combining those representations with an aggregator function (Atoms to Bond)
* A function creating an atom representation by applying a shared function to all the bonds that atom is a part of, and then combining those results with an aggregator function (Bonds to Atom)

At the top of this set of layers, when each atom has had information diffused into it by other parts of the graph, depending on the network depth, the authors aggregate the per-atom representations into histograms (basically, instead of summing or max-pooling feature-wise, creating coarse distributions of each feature), and use that for supervised tasks.

One frustration I had with this paper is that it doesn't do a great job of highlighting its differences with and advantages over prior work; in particular, I think it doesn't do a very good job arguing that its performance is superior to the earlier Duvenaud work. That said, for all that the presentation wasn't ideal, the idea of message-passing is an important one in graph convolutions, and would end up becoming more standard in later works.
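The four update functions listed above can be sketched in a few lines. This is a toy version under loud assumptions: the shared functions are the identity, the aggregator is a sum, bonds are updated before atoms within a step, and `message_passing_step` is a hypothetical name. The paper uses learned functions; the sketch only shows how information hops atom to bond to atom.

```python
def add(u, v):
    """Element-wise sum of two feature vectors."""
    return [a + b for a, b in zip(u, v)]

def message_passing_step(atom_feats, bond_feats, bonds):
    """One round of updates. `bonds` is a list of (i, j) atom-index pairs,
    aligned with `bond_feats`."""
    # Bond to Bond (keep own features) + Atoms to Bond (sum of both end atoms).
    new_bonds = [add(bond_feats[k], add(atom_feats[i], atom_feats[j]))
                 for k, (i, j) in enumerate(bonds)]
    # Atom to Atom (keep own features) + Bonds to Atom (sum over incident bonds).
    new_atoms = [list(f) for f in atom_feats]
    for k, (i, j) in enumerate(bonds):
        new_atoms[i] = add(new_atoms[i], new_bonds[k])
        new_atoms[j] = add(new_atoms[j], new_bonds[k])
    return new_atoms, new_bonds

# A three-atom chain 0-1-2 where only atom 0 starts with a non-zero feature.
atoms, bond_features = [[1.0], [0.0], [0.0]], [[0.0], [0.0]]
chain = [(0, 1), (1, 2)]
atoms_1, bonds_1 = message_passing_step(atoms, bond_features, chain)
atoms_2, bonds_2 = message_passing_step(atoms_1, bonds_1, chain)
```

After one step atom 2 has still seen nothing from atom 0; after two steps it has, which is the "receptive field grows with depth" behaviour the summary describes.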
[link]
My objective in reading this paper was to gain another perspective on, and thus a more well-grounded view of, machine learning scoring functions for docking-based prediction of ligand/protein binding affinity. As quick background context, these models are useful because many therapeutic compounds act by binding to a target protein, and it can be valuable to prioritize doing wet lab testing on compounds that are predicted to have a stronger binding affinity. Docking systems work by predicting the pose in which a compound (or ligand) would bind to a protein, and then scoring prospective poses based on how likely such a pose would be to have high binding affinity. It's important to note that there are two predictive components in such a pipeline, and thus two sources of potential error: the searching over possible binding poses, done by physics-based systems, and scoring of the affinity of a given pose, assuming that were actually the correct one. Therefore, in the second kind of modeling, which this paper focuses on, you take in features *of a particular binding pose*, which include information like which atoms of the compound are near which atoms of the protein.

The actual neural network structure used here was admittedly a bit underwhelming (though, to be fair, many of the ideas it seems to be gesturing at wouldn't be properly formalized until Graph Convolutional Networks came around). I'll describe the network mechanically first, and then offer some commentary on the design choices.

https://i.imgur.com/w9wKS10.png

1. For each atom (a) in the compound, a set of neighborhood features is defined. The neighborhood is based on two hyperparameters, one for "how many atoms from the protein should be included," and one for "how many atoms from the compound should be included". In both cases, you start by adding the closest atom from either the compound or protein, and as the hyperparameter values increase, you add in farther-away atoms.
The neighborhood features here are (i) What are the types of the atoms? (ii) What are the partial charges of the atoms? (iii) How far are the atoms from the reference atom? (iv) What amino acid within the protein do the protein atoms come from?

2. All of these features are turned into embeddings. Yes, all of them, even the ones (distance and charge) that are continuous values. Coming from a machine learning perspective, this is... pretty weird as a design choice. The authors straight-up discretize the distance values, and then use those as discrete values for the purpose of looking up embeddings. (So, you'd have one embedding vector for distances in 0.25-0.5, and a different one for 0.0-0.25, say.)

3. The embeddings are concatenated together into a single "atom neighborhood vector" based on a predetermined ordering of the neighbor atoms and their property vectors. We now have one atom neighborhood vector for each atom in the compound.

4. The authors then do what they call a convolution over the atom neighborhood vectors. But it doesn't act like a normal convolution in the sense of mixing information from nearby regions of atom space. It is basically just a fully connected layer that's applied to each atom neighborhood vector separately, but with shared weights, so the same layer is applied to each neighborhood vector. They then do a feature-wise max pool across the layer-transformed versions of the neighborhood vectors, getting you one vector for the full compound.

5. This single vector is then put into a softmax, which predicts whether this ligand (in this particular pose) will have strong binding with the protein.

Some thoughts on what's going on here. First, I really don't have a good explanation for why they'd have needed to embed a discretized version of the continuous variables, and since they don't do an ablation test of that design choice, it's hard to know if it mattered.
Second, it's interesting to see, in their "convolution" (which I think is more accurately described as a Siamese Network, since it's only convolution-like insofar as there are shared weights), the beginning intuitions of what would become Graph Convolutions. The authors knew that they needed methods to aggregate information from arbitrary numbers of atoms, and also that they needed to learn representations that have visibility onto neighborhoods of atoms, rather than single ones, but they do so in an entirely hand-engineered way: manually specifying a fixed neighborhood and pulling in information from all those neighbors equally, in a big concatenated vector. By contrast, when Graph Convolutions come along, they act by defining a "message-passing" function for features to aggregate across graph edges (here: molecular bonds, or binary indicators of being "near enough" to another atom), which similarly allows information to be combined across neighborhoods. And, then, the 'convolution' is basically just a simple aggregation, necessary because there's no canonical ordering of elements within a graph, so you need an order-agnostic aggregation like a sum or max pool.

The authors find that their method is able to improve on the hand-designed scoring functions within the docking programs. However, they also find (similar to another paper I read recently) that their model is able to do quite well without even considering structural relationships of the binding pose with the protein, which suggests that the dataset (DUD, a dataset of 40 proteins with ~4K correctly binding ligands, and ~35K ligands paired with proteins they don't bind to) and the problem given to the model are too easy. It's also hard to tell how I should consider AUCs within this problem: it's one thing to be better than an existing method, but how much value do you get from a given unit of AUC improvement, when it comes to actually meaningfully reducing wet-lab time used on testing compounds?
I don't know that there's much to take away from this paper in terms of useful techniques, but it is interesting to see the evolution of ideas that would later be more cleanly formalized in other works. 
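Steps 4 and 5 of the network described above (a shared fully connected layer applied to every atom neighborhood vector, followed by a feature-wise max pool) can be sketched as follows. The ReLU nonlinearity, the toy weights, and the name `shared_layer_max_pool` are assumptions for illustration only.

```python
def shared_layer_max_pool(neighborhood_vectors, weights, bias):
    """Apply one shared dense layer to each atom neighborhood vector, then
    max-pool feature-wise across atoms.

    weights: list of rows, shape (out_dim, in_dim); bias: list of length out_dim.
    """
    transformed = []
    for vec in neighborhood_vectors:
        out = [max(0.0, sum(w * x for w, x in zip(row, vec)) + b)  # ReLU (assumed)
               for row, b in zip(weights, bias)]
        transformed.append(out)
    # Feature-wise max over all atoms: one fixed-size vector per compound.
    return [max(column) for column in zip(*transformed)]

identity = [[1.0, 0.0], [0.0, 1.0]]
pooled = shared_layer_max_pool([[1.0, -2.0], [0.0, 3.0]], identity, [0.0, 0.0])
```

Because the max is taken over atoms, the result does not depend on the order in which atoms are listed, and compounds with different numbers of atoms map to vectors of the same size.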
[link]
Wang et al. discuss an alternative definition of adversarial examples, taking into account an oracle classifier. Adversarial perturbations are usually constrained in their norm (e.g., $L_\infty$ norm for images); however, the main goal of this constraint is to ensure label invariance: if the image didn't change notably, the label didn't change either. As an alternative formulation, the authors consider an oracle for the task, e.g., humans for image classification tasks. Then, an adversarial example is defined as a slightly perturbed input, whose predicted label changes, but where the true label (i.e., the oracle's label) does not change. Additionally, the perturbation can be constrained in some norm; specifically, the perturbation can be constrained on the true manifold of the data, as represented by the oracle classifier. Based on this notion of adversarial examples, Wang et al. argue that deep neural networks are not robust as they utilize over-complete feature representations. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
[link]
Xie et al. propose to regularize deep neural networks by randomly disturbing (i.e., changing) training labels. In particular, for each training batch, they randomly change the label of each sample with probability $\alpha$; when changing a label, it's sampled uniformly from the set of labels. In experiments, the authors show that this sort of loss regularization improves generalization. However, Dropout usually performs better; in their case, only the combination with Dropout leads to noticeable improvements on MNIST and SVHN, and only compared to no regularization and no data augmentation at all. In their discussion, they offer two interpretations of disturbing labels. First, it can be seen as learning an ensemble of models on different noisy label sets; second, it can be seen as implicitly performing data augmentation. Both interpretations are reasonable, but do not provide a definite answer to why disturbing training labels should work well.

https://i.imgur.com/KH36sAM.png

Figure 1: Comparison of training and testing error rates during training for no regularization, Dropout and DisturbLabel.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
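The label-disturbing step described above is simple enough to sketch directly. The function name and the `rng` parameter are hypothetical; note that, following the description, a "changed" label is drawn uniformly from all classes, so it can coincide with the original one.

```python
import random

def disturb_labels(labels, num_classes, alpha, rng=random):
    """With probability alpha, replace each label in a batch by a label drawn
    uniformly from {0, ..., num_classes - 1}."""
    disturbed = []
    for y in labels:
        if rng.random() < alpha:
            disturbed.append(rng.randrange(num_classes))
        else:
            disturbed.append(y)
    return disturbed

# Applied independently to every training batch; alpha=0.1 is a hypothetical value.
batch = [3, 1, 4, 1, 5, 9, 2, 6]
noisy_batch = disturb_labels(batch, num_classes=10, alpha=0.1, rng=random.Random(0))
```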
[link]
Carlini and Wagner show that defensive distillation as a defense against adversarial examples does not work. Specifically, they show that the attack by Papernot et al. [1] can easily be modified to attack distilled networks. Interestingly, the main change is to introduce a temperature in the last softmax layer. This temperature, when chosen high enough, will take care of aligning the gradients from the softmax layer and from the logit layer; otherwise, they will have significantly different magnitudes. Personally, I found that this also aligns with the observations in [2], where Carlini and Wagner also find that attack objectives defined on the logits work considerably better.

[1] N. Papernot, P. McDaniel, X. Wu, S. Jha, A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. SP, 2016.
[2] N. Carlini, D. Wagner. Towards Evaluating the Robustness of Neural Networks. ArXiv, 2016.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
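The role of the temperature can be illustrated numerically. The sketch below is not the authors' attack; it only shows, for made-up logits of large magnitude (my assumption about what a saturated network might produce), how the gradient of the top softmax probability with respect to the logits vanishes at temperature 1 and becomes usable again at a high temperature.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def grad_top_prob(logits, temperature=1.0):
    """Gradient of the largest softmax probability p_k w.r.t. each logit z_i:
    d p_k / d z_i = p_k * (1[i == k] - p_i) / temperature."""
    p = softmax(logits, temperature)
    k = p.index(max(p))
    return [p[k] * ((1.0 if i == k else 0.0) - p[i]) / temperature
            for i in range(len(p))]

logits = [100.0, 0.0, -50.0]            # hypothetical, strongly saturated logits
print(grad_top_prob(logits, 1.0))       # essentially zero everywhere
print(grad_top_prob(logits, 100.0))     # clearly non-zero
```

A gradient-based attack that differentiates through the softmax gets no signal in the first case, which is consistent with the summary's remark that objectives defined on the logits tend to work better.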
[link]
Medical image segmentation has been a classic problem in medical image analysis, with a large body of research behind it. Many approaches worked by designing hand-crafted features, while others worked using global or local intensity cues. These approaches were sometimes extended to 3D, but most of the algorithms work with 2D images (or 2D slices of a 3D image). It is hypothesized that using the full 3D volume of a scan may improve segmentation performance due to the amount of context that the algorithm can be exposed to, but such approaches have been very expensive computationally. Deep learning approaches like ConvNets have been applied to segmentation problems, and are computationally very efficient at inference time due to highly optimized linear algebra routines. Although these approaches form the state of the art, they still utilize 2D views of a scan, and fail to work well on full 3D volumes.

To this end, Milletari et al. propose a new CNN architecture consisting of volumetric convolutions with 3D kernels, applied to full 3D MRI prostate scans and trained on the task of segmenting the prostate from the images. The network architecture primarily consists of 3D convolutions which use volumetric kernels of size 5x5x5 voxels. As the data proceeds through different stages along the compression path, its resolution is reduced. This is performed through convolution with 2x2x2 voxel-wide kernels applied with stride 2, hence there are no pooling layers in the architecture. The architecture resembles an encoder-decoder type architecture, in which the encoder part (the compression, or downsampling, path) reduces the size of the signal presented as input and increases the receptive field of the features being computed in subsequent network layers. Each of the stages of the left part of the network computes a number of features which is two times higher than that of the previous layer.
The right portion of the network extracts features and expands the spatial support of the lower-resolution feature maps in order to gather and assemble the necessary information to output a two-channel volumetric segmentation. The two feature maps computed by the very last convolutional layer, which has a 1x1x1 kernel size and produces outputs of the same size as the input volume, are converted to probabilistic segmentations of the foreground and background regions by applying a voxel-wise softmax. In order to train the network, the authors propose to use a Dice loss function. The CNN is trained end-to-end on a dataset of 50 prostate MRI scans. The network achieved a Dice coefficient of 0.869 $\pm$ 0.033, and beat the other state-of-the-art models.
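For reference, a Dice-based loss over a flattened volume can be sketched as below. The squared terms in the denominator follow the formulation as I recall it from the paper; treat that detail, the `eps` smoothing constant, and the function names as assumptions.

```python
def dice_coefficient(pred, target, eps=1e-7):
    """Soft Dice overlap between predicted foreground probabilities and
    binary ground-truth labels (both flattened to 1-d lists of voxels)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(p * p for p in pred) + sum(t * t for t in target)
    return (2.0 * intersection + eps) / (denominator + eps)

def dice_loss(pred, target):
    """Loss to minimize: 0 for perfect overlap, close to 1 for no overlap."""
    return 1.0 - dice_coefficient(pred, target)
```

Unlike a voxel-wise cross-entropy, this objective is driven by overlap, so it does not need class re-weighting when the foreground (here, the prostate) occupies only a small part of the volume.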
[link]
# **Introduction**

### **Goal of the paper**

* The goal of this paper is to use an RGB-D image to find the best pose for grasping an object using a parallel plate gripper.
* The algorithm is also meant to give an open-loop method for manipulating the object using vision data.

### **Previous Research**

* Even state-of-the-art grasp detection algorithms fail under real-world circumstances and cannot work in real time.
* To perform grasping, a 7D grasp representation is used. Usually, however, a 5D grasp representation is used and then projected back into 7D space.
* Previous methods directly found the 7D pose representation using only the vision data.
* Compared to older computer vision techniques like sliding window classifiers, deep learning methods are more robust to occlusion, rotation and scaling.
* Grasp point detection gave high accuracy (> 92%) but was helpful only for grasping clothes or towels.

### **Method**

* Grasp detection is generally a computer vision problem.
* The algorithm given by the paper uses computer vision to find the grasp as a 5D representation. The 5D representation is faster to compute, less computationally intensive, and can be used in real time.
* General grasp planning algorithms can be divided into three distinct sequential phases:
    1. Grasp detection
    2. Trajectory planning
    3. Grasp execution
* One of the major tasks in grasping algorithms is to find the best place for grasping and to map the vision data to coordinates that can be used for manipulation.
* The method makes use of three neural networks:
    1. A 50-layer deep residual network (ResNet-50) to find the features in the RGB image. This network is pretrained on the ImageNet dataset.
    2. Another neural network to find the features in the depth image.
    3. The outputs of the two neural networks are fed into another network that gives the final grasp configuration as the output.
* The robot grasping configuration can be given as a function of x, y, w, h and theta, where (x, y) is the centre of the grasp rectangle, w and h are its width and height, and theta is the angle of the grasp rectangle.
* Since very deep networks are being used (number of layers > 20), residual layers are used, which help improve the loss surface of the network and reduce vanishing gradient problems.
* This paper gives two types of networks for grasp detection:
    1. Uni-modal Grasp Predictor
        * This uses only an RGB 2D image to extract features from the input image, and then uses those features to give the best pose.
        * A linear SVM is used as the final classifier to classify the best pose for the object.
    2. Multi-modal Grasp Predictor
        * This model makes use of both the 2D image and the RGB-D image to extract the grasp.
        * The RGB-D image is decomposed into an RGB image and a depth image.
        * Both images are passed through the networks, and the outputs are then combined and fed to a shallow CNN.
        * The output of the shallow CNN is the best grasp for the object.

### **Experiments and Results**

* The experiments are done on the Cornell Grasp dataset.
* Almost no preprocessing is done on the images, except resizing.
* The results of the algorithm given by this paper are compared to uni-modal methods that use only RGB images.
* To validate the model, a predicted grasp counts as correct if its angle is within 30 degrees of the ground-truth grasp angle and its Jaccard similarity with the ground-truth rectangle is more than 25%.

### **Conclusion**

* This paper shows that deep convolutional neural networks can be used to predict the grasping pose for an object.
* Another major observation is that the deep residual layers help in better extraction of the features of the grasp object from the image.
* The new model was able to run at real-time speeds.
* The model gave state-of-the-art results on the Cornell Grasping dataset.

### **Open research questions**

* Applying transfer learning concepts to try the model on real robots.
* Trying the model in industrial environments on objects of different sizes and shapes.
* Formulating the grasping problem as a regression problem.
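The validation criterion from the experiments section (angle within 30 degrees and Jaccard similarity above 25%) can be sketched for rotated grasp rectangles given as (x, y, w, h, theta). All function names are hypothetical, theta is assumed to be in degrees, and the rectangle intersection uses plain Sutherland-Hodgman clipping rather than whatever tooling the authors used.

```python
import math

def rect_corners(x, y, w, h, theta_deg):
    """Counter-clockwise corners of a rectangle centred at (x, y), rotated by theta."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in offsets]

def polygon_area(points):
    """Shoelace formula; returns 0.0 for an empty polygon."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def clip_polygon(subject, clipper):
    """Sutherland-Hodgman: clip `subject` against a convex, counter-clockwise `clipper`."""
    output = list(subject)
    for i, a in enumerate(clipper):
        b = clipper[(i + 1) % len(clipper)]

        def side(p):
            # >= 0 when p lies on the inner (left) side of the edge a -> b.
            return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

        current, output = output, []
        for j, q in enumerate(current):
            p = current[j - 1]
            dp, dq = side(p), side(q)
            if (dp < 0) != (dq < 0):  # the edge p -> q crosses the clip line
                t = dp / (dp - dq)
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
            if dq >= 0:
                output.append(q)
    return output

def jaccard(rect_a, rect_b):
    """Intersection over union of two (x, y, w, h, theta) rectangles."""
    pa, pb = rect_corners(*rect_a), rect_corners(*rect_b)
    inter = polygon_area(clip_polygon(pa, pb))
    union = polygon_area(pa) + polygon_area(pb) - inter
    return inter / union if union > 0 else 0.0

def is_correct_grasp(pred, truth, max_angle=30.0, min_jaccard=0.25):
    diff = abs(pred[4] - truth[4]) % 180.0
    diff = min(diff, 180.0 - diff)  # a gripper rotated by 180 degrees is the same grasp
    return diff < max_angle and jaccard(pred, truth) > min_jaccard
```

For example, `is_correct_grasp((0, 0, 2, 1, 10), (0, 0, 2, 1, 0))` compares a prediction rotated by 10 degrees against the ground truth at the same centre.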
[link]
## Segment-CNN

**Summary**: this paper uses a three-stage 3D CNN to identify candidate proposals, recognize actions and localize temporal boundaries.

**Models**: the pipeline can be divided into three parts: generating proposals, selecting proposals and refining their temporal boundaries, and using NMS to remove redundant proposals.

1. Generate multi-scale (16, 32, 64, 128, 256, 512) segments using sliding windows with 75% overlap. High computational complexity!
2. Network: each stage of the three-stage network uses 3D ConvNets followed by 3 FC layers.
    * The proposal network is basically a classifier which judges whether each proposal contains an action or not.
    * The classification network classifies each proposal that the proposal network considers valid into background and K action categories.
    * The localization network functions as a scoring system which raises the scores of proposals that have high overlap with the corresponding ground truth while decreasing the others.
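The proposal generation and the final NMS step above can be sketched as follows. The segment lengths are assumed to be in frames, the NMS overlap threshold is a hypothetical value, and the scores would in practice come from the localization network; none of this is taken from the authors' code.

```python
def sliding_segments(num_frames, lengths=(16, 32, 64, 128, 256, 512), overlap=0.75):
    """Multi-scale (start, end) segments; consecutive windows overlap by `overlap`."""
    segments = []
    for length in lengths:
        stride = max(1, int(length * (1.0 - overlap)))
        for start in range(0, num_frames - length + 1, stride):
            segments.append((start, start + length))
    return segments

def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union

def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring segment, drop heavily overlapping ones."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(temporal_iou(segments[i], segments[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return [segments[i] for i in kept]
```

With 75% overlap the stride is a quarter of the window length, so every scale contributes roughly four windows per window-length of video, which is where the "high computational complexity" remark comes from.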
[link]
Implementations:

* https://hub.docker.com/r/mklinov/caffeflownet2/
* https://github.com/lmbfreiburg/flownet2docker
* https://github.com/lmbfreiburg/flownet2

Explanations:

* A Brief Review of FlowNet (not a clear explanation): https://medium.com/towardsdatascience/abriefreviewofflownetdca6bd574de0
* https://www.youtube.com/watch?v=JSzUdVBmQP4

Supplementary material: http://openaccess.thecvf.com/content_cvpr_2017/supplemental/Ilg_FlowNet_2.0_Evolution_2017_CVPR_supplemental.pdf
[link]
This model, called Med2Vec, is inspired by Word2Vec. It is Word2Vec for time series of patient visits with ICD codes. The model learns embeddings for medical codes as well as the demographics of patients.

https://i.imgur.com/Zjj6Xxz.png

The context is temporal. For each $x_t$ as input, the model predicts $x_{t+1}$ and $x_{t-1}$, or more, depending on the temporal window size.
[link]
This work extends sequence-to-sequence models for machine translation by using syntactic information on the source language side. This paper looks at the translation task where English is the source language, and Japanese is the target language. The dataset is the ASPEC corpus of scientific paper abstracts that seem to be in both English and Japanese? (See note below.) The trees for the source (English) are generated by running the ENJU parser on the English data, resulting in binary trees, and only the bracketing information is used (no phrase category information).

Given that setup, the method is an extension of seq2seq translation models where they augment it with a Tree-LSTM to do the encoding of the source language. They deviate from a standard Tree-LSTM by running an LSTM across tokens first, and using the LSTM hidden states as the leaves of the tree instead of the token embeddings themselves. Once they have the encoding from the tree, it is concatenated with the standard encoding from an LSTM. At decoding time, the attention for output token $y_j$ is computed across all source tree nodes $i$, which include $n$ input token nodes and $n-1$ phrasal nodes, as the similarity between the hidden state $s_j$ and the encoding at node $i$, then passed through a softmax. Another deviation from standard practice (I believe) is that the hidden state calculations $s_j$ in the decoder are a function of the previous output token $y_{j-1}$, the previous time step's hidden state $s_{j-1}$ and the previous time step's attention-modulated hidden state $\tilde{s}_{j-1}$.

The authors introduce an additional trick for improving decoding performance when translating long sentences, since they say standard length normalization did not work. Their method is to compute a probability distribution over output length given input length, and use this to create an additional penalty term in their scoring function, as the log of the probability of the current output length given input length.
They evaluate using RIBES (not familiar) and BLEU scores, and show better performance than other NMT and SMT methods, and performance similar to the best performing (non-neural) tree-to-sequence model.

Implementation: They seem to have a custom implementation in C++ rather than using a DNN library. Their implementation takes one day to run one epoch of training on the full training set. They do not say how many epochs they train for.

Note on data: We have looked at this data a bit for a project I'm working on, and the English sentences look like translations from Japanese. A large proportion of the sentences are written in passive form with the structure "X was Y-ed", e.g. "the data was processed, the cells were cultured." This looks to me like they translated subject-dropped Japanese sentences, which would have the same word order but are not actually passive! So that raises for me the question of how representative the source side inputs are of natural English.
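The length-based penalty described above can be sketched with simple counts. Estimating p(output length | input length) from relative frequencies in the training pairs, the additive smoothing constant, and all function names are assumptions for illustration; the paper may estimate and weight the term differently.

```python
import math
from collections import Counter, defaultdict

def fit_length_model(pairs):
    """pairs: iterable of (input_length, output_length) seen in training data."""
    counts = defaultdict(Counter)
    for in_len, out_len in pairs:
        counts[in_len][out_len] += 1
    return counts

def length_log_prob(counts, in_len, out_len, smoothing=1e-6):
    """log p(out_len | in_len), smoothed so unseen lengths get a finite penalty."""
    by_out = counts.get(in_len, Counter())
    total = sum(by_out.values())
    if total == 0:
        return math.log(smoothing)
    return math.log(by_out[out_len] / total + smoothing)

def score_hypothesis(token_log_probs, in_len, counts):
    """Translation log-probability plus the length term used to rank hypotheses."""
    return sum(token_log_probs) + length_log_prob(counts, in_len, len(token_log_probs))

model = fit_length_model([(5, 6), (5, 6), (5, 7)])
```

With this term, two hypotheses with the same total token log-probability are no longer tied: the one whose length is typical for the given input length scores higher, which counteracts the usual bias of beam search toward short outputs.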
[link]
Rozsa et al. propose PASS, a perceptual similarity metric invariant to homographies, to quantify adversarial perturbations. In particular, PASS is based on the structural similarity metric SSIM [1]; specifically $PASS(\tilde{x}, x) = SSIM(\psi(\tilde{x},x), x)$ where $\psi(\tilde{x}, x)$ transforms the perturbed image $\tilde{x}$ to the image $x$ by applying a homography $H$ (which can be found through optimization). Based on this similarity metric, they consider additional attacks which create small perturbations in terms of the PASS score, but result in larger $L_p$ norms; see the paper for experimental results.

[1] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.

Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
[link]
Bastani et al. propose formal robustness measures and an algorithm for approximating them for piecewise linear networks. Specifically, the notion of robustness is similar to related work: $\rho(f,x) = \inf\{\epsilon \geq 0 \mid f \text{ is not } (x,\epsilon)\text{-robust}\}$ where $(x,\epsilon)$-robustness demands that for every $x'$ with $\|x' - x\|_\infty \leq \epsilon$ it holds that $f(x') = f(x)$ – in other words, the label does not change for perturbations $\eta = x' - x$ which are small in terms of the $L_\infty$ norm and the constant $\epsilon$. Clearly, a higher $\epsilon$ implies a stronger notion of robustness. Additionally, the above definition is essentially a pointwise definition of robustness. In order to measure robustness for the whole network (i.e. not only pointwise), the authors introduce the adversarial frequency: $\psi(f,\epsilon) = p_{x\sim D}(\rho(f,x) \leq \epsilon)$. This quantity measures how often $f$ fails to be robust in the sense of $(x,\epsilon)$-robustness. The network is more robust when it has low adversarial frequency. Additionally, they introduce adversarial severity: $\mu(f,\epsilon) = \mathbb{E}_{x\sim D}[\rho(f,x) \mid \rho(f,x) \leq \epsilon]$ which measures how severely $f$ fails to be robust (if it fails to be robust for a sample $x$). Both of the above measures can be approximated by counting, given that the robustness $\rho(f, x)$ is known for all samples $x$ in a separate test set. And this is the problem with the proposed measures: in order to approximate $\rho(f, x)$, the authors propose an optimization-based approach assuming that the neural network is piecewise linear. This assumption is not necessarily unrealistic: dot products, convolutions, $\text{ReLU}$ activations and max pooling are all piecewise linear. Even batch normalization is piecewise linear at test time. The problem, however, is that the network needs to be encoded in terms of linear programs, which I believe is cumbersome for real-world networks. 
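To make the "approximated by counting" step concrete, here is a minimal sketch. It assumes the pointwise robustness values $\rho(f, x)$ have already been computed for a test set (that computation is the hard part and is not shown); the function names are mine, not the paper's.

```python
def adversarial_frequency(rhos, eps):
    """Fraction of test samples whose robustness rho(f, x) is at most eps."""
    return sum(1 for r in rhos if r <= eps) / len(rhos)


def adversarial_severity(rhos, eps):
    """Mean of rho(f, x) over the samples with rho(f, x) <= eps.

    Returns None when no sample fails, since the conditional
    expectation is then undefined.
    """
    failing = [r for r in rhos if r <= eps]
    if not failing:
        return None
    return sum(failing) / len(failing)


# Hypothetical robustness values for four test samples.
rhos = [0.05, 0.2, 0.5, 0.08]
frequency = adversarial_frequency(rhos, eps=0.1)
severity = adversarial_severity(rhos, eps=0.1)
```

In this toy example two of the four samples have $\rho \leq 0.1$, so the frequency is one half and the severity is the mean of those two values.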
Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Fawzi et al. study robustness in the transition from random samples to semi-random and adversarial samples. Specifically, they present bounds relating the norm of an adversarial perturbation to the norm of random perturbations – for the exact form I refer to the paper. Personally, I find the definition of semi-random noise most interesting, as it provides an intuition for distinguishing random noise from adversarial examples. As in related literature, adversarial examples are defined as $r_S(x_0) = \arg\min_{r \in S} \|r\|_2$ s.t. $f(x_0 + r) \neq f(x_0)$ where $f$ is the classifier to attack and $S$ the set of allowed perturbations (e.g. requiring that the perturbed samples are still images). If $S$ is mostly unconstrained regarding the direction of $r$ in high-dimensional space, Fawzi et al. consider $r$ to be an adversarial perturbation – intuitively, an adversary can choose $r$ arbitrarily to fool the classifier. If, however, the directions considered in $S$ are constrained to an $m$-dimensional subspace, Fawzi et al. consider $r$ to be semi-random noise. In the extreme case, if $m = 1$, $r$ is random noise. In this case, we can intuitively think of $S$ as a randomly chosen one-dimensional subspace – i.e. a random direction in multi-dimensional space. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Tanay and Griffin introduce the boundary tilting perspective as an alternative to the “linear explanation” for adversarial examples. Specifically, they argue that it is not reasonable to assume that the linearity in deep neural networks causes the existence of adversarial examples. Originally, Goodfellow et al. [1] explained the impact of adversarial examples by considering a linear classifier: $w^T x' = w^Tx + w^T\eta$ where $\eta$ is the adversarial perturbation. In high dimensions, the second term might result in a significant shift of the neuron's activation. Tanay and Griffin, in contrast, argue that the dimensionality does not have an impact; although the impact of $w^T\eta$ grows with the dimensionality, so does $w^Tx$, such that the ratio should be preserved. Additionally, they show (by giving a counter-example) that linearity is not sufficient for the existence of adversarial examples. Instead, they offer a different perspective on the existence of adversarial examples that is formalized in the course of the paper. Their main idea is that the training samples live on a manifold in the actual input space. The claim is that on the manifold there are no adversarial examples (meaning that the classes are well separated on the manifold and it is hard to find adversarial examples for most training samples). However, the decision boundary extends beyond the manifold and might lie close to the manifold, such that adversarial examples leaving the manifold can be found easily. This idea is illustrated in Figure 1. https://i.imgur.com/SrviKgm.png Figure 1: Illustration of the underlying idea of the boundary tilting perspective, see the text for details. [1] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy: Explaining and Harnessing Adversarial Examples. CoRR abs/1412.6572 (2014) Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Zahavy et al. introduce the concept of ensemble robustness and show that it can be used as an indicator for generalization performance. In particular, the main idea is to lift the concept of robustness against adversarial examples to ensembles of networks – as trained, e.g. through Dropout or Bayes-by-Backprop. Letting $Z$ denote the sample set, a learning algorithm is $(K, \epsilon)$-robust if $Z$ can be divided into $K$ disjoint sets $C_1,\ldots,C_K$ such that for every training set $s_1,\ldots,s_n \in Z$ it holds: $\forall i, \forall z \in Z, \forall k = 1,\ldots, K$: if $s_i,z \in C_k$, then $|l(f,s_i) - l(f,z)| \leq \epsilon(s_1,\ldots,s_n)$ where $f$ is the model produced by the learning algorithm, $l$ measures the loss and $\epsilon:Z^n \mapsto \mathbb{R}$. For ensembles (explicit or implicit) this definition is extended by considering the maximum generalization loss under the expectation of a randomized learning algorithm: $\forall i, \forall k = 1,\ldots,K$: if $s_i \in C_k$, then $\mathbb{E}_f \max_{z \in C_k} |l(f,s_i) - l(f,z)| \leq \epsilon(s_1,\ldots,s_n)$. Here, the randomized learning algorithm computes a distribution over models given a training set. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Kurakin et al. present some larger-scale experiments using adversarial training on ImageNet to increase robustness. In particular, they claim to be the first to use adversarial training on ImageNet. Furthermore, they provide experiments underlining the following conclusions:

- Adversarial training can also be seen as a regularizer. This, however, is not surprising, as training on noisy training samples is also known to act as regularization.
- Label leaking describes the observation that an adversarially trained model is able to defend against (i.e. correctly classify) an adversarial example which has been computed by knowing the true label, while not defending against adversarial examples that were crafted without knowing the true label. This means that crafting adversarial examples without guidance by the true label might be beneficial (in terms of a stronger attack).
- Model complexity seems to have an impact on robustness after adversarial training. However, from the experiments, it is hard to deduce what this connection looks like exactly.

Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Liu et al. provide a comprehensive study on the transferability of adversarial examples considering different attacks and models on ImageNet. In their experiments, they consider both targeted and non-targeted attacks and also provide a real-world example by attacking clarifai.com. Here, I want to list some interesting conclusions drawn from their experiments:

- Non-targeted attacks easily transfer between models; targeted attacks, in contrast, generally do not transfer – meaning that the target label does not transfer across models.
- The level of transferability also seems to rely heavily on the hyperparameters of the trained models. In the experiments, the authors observed this on different ResNet models which share the general architecture building blocks, but are of different depth.
- Considering different models, it turns out that the gradient directions (i.e. the adversarial directions used in many gradient-based attacks) are mostly orthogonal – this means that different models have different vulnerabilities. However, the observed transferability suggests that this only holds for the “steepest” adversarial direction; the gradient direction of one model is, thus, still useful to craft adversarial examples for another model.
- The authors also provide an interesting visualization of the local decision landscape around individual examples. As illustrated in Figure 1, the region where the chosen image is classified correctly is often limited to a small central area. Of course, I believe that these examples are hand-picked to some extent, but they show the worst-case scenario relevant for defense mechanisms.

https://i.imgur.com/STz0iwo.png

Figure 1: Decision boundary showing different classes in different colors. The axes correspond to one pixel differences; the used images are computed using $x' = x +\delta_1u + \delta_2v$ where $u$ is the gradient direction and $v$ a random direction.

Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Moosavi-Dezfooli et al. propose universal adversarial perturbations – perturbations that are image-agnostic. Specifically, they extend the framework for crafting adversarial examples, i.e. iteratively solving $\arg\min_r \|r\|_2$ s.t. $f(x + r) \neq f(x)$. Here, $r$ denotes the adversarial perturbation, $x$ a training sample and $f$ the neural network. Instead of solving this problem for a specific $x$, the authors propose to solve the problem over the full training set, i.e. in each iteration, a different sample $x$ is chosen, one step in the direction of the gradient is taken and the perturbation is updated accordingly. In experiments, they show that these universal perturbations are indeed able to fool networks on several images; in addition, these perturbations are – sometimes – transferable to other networks. Also view this summary on [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Carlini and Wagner propose three novel methods/attacks for adversarial examples and show that defensive distillation is not effective. In particular, they devise attacks for all three commonly used norms $L_0$, $L_2$ and $L_\infty$ – which are used to measure the deviation of the adversarial perturbation from the original test sample. In the course of the paper, starting with the targeted objective $\min_\delta d(x, x + \delta)$ s.t. $f(x + \delta) = t$ and $x+\delta \in [0,1]^n$, they consider up to 7 different surrogate objectives to express the constraint $f(x + \delta) = t$. Here, $f$ is the neural network to attack and $\delta$ denotes the perturbation. This leads to the formulation $\min_\delta \|\delta\|_p + cL(x + \delta)$ s.t. $x + \delta \in [0,1]^n$ where $L$ is the surrogate loss. After extensive evaluation, the loss $L$ is taken to be $L(x') = \max(\max\{Z(x')_i : i\neq t\} - Z(x')_t, -\kappa)$ where $x' = x + \delta$ and $Z(x')_i$ refers to the logit for class $i$; $\kappa$ is a constant ($=0$ in their experiments) that can be used to control the confidence of the adversarial example. In practice, the box constraint $[0,1]^n$ is encoded through a change of variable by expressing $\delta$ in terms of the hyperbolic tangent, see the paper for details. Carlini and Wagner then discuss the detailed attacks for all three norms, i.e. $L_0$, $L_2$ and $L_\infty$, where the first and the last are discussed in more detail as they are not differentiable. 
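The surrogate loss is easy to state in code. The sketch below follows the sign convention of the original paper, where the logit margin is clipped from below at $-\kappa$; it only evaluates the loss for given logits and is not an attack implementation, and the function name is my own.

```python
def cw_surrogate_loss(logits, target, kappa=0.0):
    """max(max_{i != t} Z_i - Z_t, -kappa) for a list of logits Z.

    The loss is positive while some other class still beats the target
    class, and bottoms out at -kappa once the target wins by a margin
    of at least kappa.
    """
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)


# Target class 0 is currently losing to class 1 by a margin of 2.
not_yet_adversarial = cw_surrogate_loss([1.0, 3.0, 0.5], target=0)
# Target class 0 wins by a margin of 4; with kappa=2 the loss is clipped at -2.
confident = cw_surrogate_loss([5.0, 1.0, 0.0], target=0, kappa=2.0)
```

A larger `kappa` keeps pushing the optimization until the target class wins by a larger logit margin, which is what "controlling the confidence" refers to.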
[link]
The problem this paper addresses is that training sets for object detection are marked by a large imbalance between the number of foreground examples and background examples. To make the point concrete: for sliding-window object detectors such as deformable parts models, the imbalance may be as extreme as 100,000 background examples to one annotated foreground example. Before I go into the details of hard example mining (HEM), I just want to note that in essence it amounts to sorting your losses during training and training the model on the most difficult examples, which mostly means the ones with the highest loss. (An extension of this idea can be found in the Focal Loss paper.) This is a simple but powerful technique. With this as background, the authors propose a simple but effective method to train a Fast R-CNN. Their approach is as follows:

1. For an input image at SGD iteration t, they first compute a convolutional feature map using the conv network.
2. The RoI network uses this feature map and all the input RoIs to do a forward pass.
3. Hard examples are selected by sorting the RoIs by loss and taking the B/N examples for which the current network performs worst. (Here B is the batch size and N is the number of images per mini-batch.)
4. While doing this, the researchers notice that co-located RoIs with high overlap are likely to have correlated losses. Overlapping RoIs also project onto mostly the same region in the conv feature map, because the feature map is a smaller (lower-resolution) representation of the image. This might lead to loss double counting. To deal with this, they use standard non-maximum suppression (NMS).
5. NMS works here by iteratively selecting the RoI with the highest loss and removing all lower-loss RoIs that have high overlap with the selected region. They use an IoU threshold of 0.7.
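The selection in steps 3 to 5 can be sketched as follows. This is a simplified, pure-Python illustration of loss-based NMS followed by top-k selection, not the authors' code; boxes are assumed to be `(x1, y1, x2, y2)` tuples and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_hard_rois(boxes, losses, num_keep, iou_threshold=0.7):
    """Return indices of up to `num_keep` high-loss RoIs after loss-based NMS.

    RoIs are visited in order of decreasing loss; an RoI is dropped if it
    overlaps an already selected (higher-loss) RoI by more than the threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: losses[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
        if len(kept) == num_keep:
            break
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
losses = [0.9, 0.8, 0.5, 0.1]
hard = select_hard_rois(boxes, losses, num_keep=2)
```

In the toy example the second box has the second-highest loss but overlaps the first one heavily, so it is suppressed and the third box is selected instead.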
[link]
Generative Adversarial Networks (GANs) are an exciting technique, a kernel of an effective concept that has been shown to be able to overcome many of the problems of previous generative models: particularly the fuzziness of VAEs. But, as I’ve mentioned before, and as you’ve doubtless read if you’ve read any material about the topic, they’re finicky things, difficult to train in a stable way, and particularly prone to devolving into mode collapse. Mode collapse is a phenomenon where, at each iteration, the generator places all of its mass on one single output or dense cluster of outputs, instead of representing the full distribution of output space, the way we’d like it to. One proposed solution to this is the one I discussed yesterday, of explicitly optimizing the generator according to not only what the discriminator thinks about its current allocation of probability, but what the discriminator’s next move will be (thus incentivizing the generator not to take indefensible strategies like “put all your mass in one location the discriminator can push down next round”). An orthogonal approach to that one is the one described in LSGANs: to change the objective function of the network, away from sigmoid cross-entropy, and instead to a least squares loss. While I don’t have the LaTeX capabilities to walk through the exact mathematics in this format, what this means on a conceptual level is that instead of incentivizing the generator to put all of its mass on places that the discriminator is sure are a “true data” region, we’re instead incentivizing the generator to put mass right on the true/fake data decision boundary. Likely this doesn’t make very much sense yet (it didn’t for me, at this point in reading). Occasionally, delving deeper into the math and theory behind an idea provides you rigor, but without much intuition. I found the opposite to be true in this case, where learning more (for the first time!) 
about f divergences actually made this method make more sense. So, bear with me, and hopefully trust me not to take you too deep into the weeds without a good reason. On a theoretical level, this paper’s loss function means that you end up minimizing a chi squared divergence between the distributions, instead of a KL divergence. "F divergences" are quantities that measure how different two distributions are from one another, and do that by taking an average of the density q, weighted at each point by f, which is some function of the ratio of densities, p/q. (You could also think of this as being an average of the function f, weighted by the density q; they’re equivalent statements). For the KL divergence, this function is x*logx. For chi squared it’s (x-1)^2. All of this starts to coalesce into meaning with the information that the behavior of a typical GAN looks like the divergence FROM the generator’s probability mass, TO the discriminator’s probability mass. That means that we take the ratio of how much mass a generator puts somewhere to how much mass the data has there, and we plug it into the x*logx function seen below. https://i.imgur.com/BYRfi0u.png Now, look how much the function value spikes when that ratio goes over 1. Intuitively, what this means is that we heavily punish the generator when it puts mass in a place that’s unrealistic, i.e. where there isn’t representation from the data distribution. But, and this is the important thing, we don’t symmetrically punish it when its mass at a point is far lower than the mass put there in the real data, i.e. when the ratio is much smaller than one. This means that we don’t have a way of punishing mode collapse, the scenario where the generator puts all of its mass on one of the modes of the data; we don’t do a good job of pushing the generator to have mass everywhere that the data has mass. 
By contrast, the Chi Squared divergence pushes the ratio of (generator/data) to be equal to 1 *from both directions*. So, if there’s more generator mass than data mass somewhere, that’s bad, but it’s also bad for there to be more data mass than generator mass. This gives the network a stronger incentive to not learn mode collapsed solutions. 
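To see the asymmetry numerically, here is a small sketch that evaluates both f functions on a toy two-point distribution. The distributions are made up for illustration, and the code only covers the discrete case with strictly positive data mass.

```python
import math


def f_divergence(p, q, f):
    """sum_x q(x) * f(p(x) / q(x)) for discrete distributions with q(x) > 0."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))


def kl_f(x):
    """f(x) = x log x, with the usual convention f(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def chi2_f(x):
    """f(x) = (x - 1)^2."""
    return (x - 1.0) ** 2


data = [0.5, 0.5]          # two equally likely modes
generator = [0.98, 0.02]   # nearly collapsed onto the first mode

# Ratio generator/data is 1.96 on the first mode and 0.04 on the second.
kl_over, kl_under = kl_f(1.96), kl_f(0.04)
chi_over, chi_under = chi2_f(1.96), chi2_f(0.04)
chi2_total = f_divergence(generator, data, chi2_f)
```

Here `kl_over` is about 1.32 while `kl_under` is only about -0.13, whereas `chi_over` and `chi_under` are both about 0.92: the chi squared term charges the generator just as much for the under-covered mode as for the over-covered one.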
[link]
If you’ve ever read a paper on Generative Adversarial Networks (from now on: GANs), you’ve almost certainly heard the author refer to the scourge upon the land of GANs that is mode collapse. When a generator succumbs to mode collapse, that means that, instead of modeling the full distribution of input data, it will choose one region where there is a high density of data, and put all of its generated probability weight there. Then, on the next round, the discriminator pushes strongly away from that region (since it now is majority-occupied by fake data), and the generator finds a new mode. In the view of the authors of the Unrolled GANs paper, one reason why this happens is that, in the typical GAN, at each round the generator implicitly assumes that it’s optimizing itself against the final and optimal discriminator. And, so, it makes its best move given that assumption, which is to put all its mass on a region the discriminator assigns high probability. Unfortunately for our short-sighted robot friend, this isn’t a one-round game, and this mass-concentrating strategy gives the discriminator a really good way to find fake data during the next round: just dramatically downweight how likely you think data is in the generator’s prior-round sweet spot, which its heavy concentration allows you to do without impacting your assessment of other data. Unrolled GANs operate on this key question: what if we could give the generator an ability to be less short-sighted, and make moves that aren’t just optimizing for the present, but are also defensive against the future, in ways that will hopefully tamp down on the running-around-in-circles dynamic described above? If the generator were incentivized not only to make moves that fool the current discriminator, but also to make moves that make the next-step discriminator less likely to tell it apart, the hope is that it would spread out its mass more, and be less likely to fall into the hole of a mode collapse. 
This intuition was realized in Unrolled GANs, through a mathematical approach that is admittedly a little complex for this discussion format. Essentially, in addition to the typical GAN loss (which is based on the current values of the generator and discriminator), this model also takes one “step forward” of the discriminator (calculates what the new parameters of the discriminator would be, if it took one update step), and backpropagates through that step. The loss under the next-step discriminator parameters is a function of both the current generator, and the next-step parameters, which come from the way the discriminator reacts to the current generator. When you take the gradient with respect to the generator of both of these things, you get something very like the ideal we described earlier: a generator that is trying to put its mass into areas the current discriminator sees as high-probability, but also change its parameters such that it gives the discriminator a less effective response strategy. https://i.imgur.com/0eEjm0g.png Empirically: Unrolled GANs do quite a good job at their stated aim of reducing mode collapse, and the unrolled training procedure is now a common building-block technique used in other papers. 
[link]
DRL has a lot of disadvantages, such as large data requirements, slow learning, difficult interpretation, difficult transfer, no notion of causality, and analogical reasoning done at a statistical level rather than at an abstract level. These can be overcome by adding a symbolic front end on top of a DL layer before feeding it to the RL agent. The symbolic front end gives the advantages of smaller state space generalization, flexible predicate length and easier combination of predicate expressions. DL, unlike symbolic reasoning, avoids manual creation of features. Hence DL along with symbolic reasoning might be the way to progress towards AGI. State space reduction in symbolic reasoning is carried out by using object interactions (object positions and object types) for state representation. Although certain assumptions are made in the process, such as objects of the same type behaving similarly, one can better understand causal relations in terms of actions, object interactions and reward by using symbolic reasoning.

Broadly, the pipeline consists of:

(1) CNN layer: raw pixels to representation.

(2) Salient pixel identification: pixels that have activations in the CNN above a certain threshold.

(3) Identifying objects of a similar kind by using the activation spectra of salient pixels.

(4) Identifying the same objects in consecutive time steps to track object motion, using spatial closeness (as objects can move only a small distance between consecutive frames) and similar neighbors (objects of different types can be placed close to each other, so spatial closeness alone cannot identify matching objects).

(5) Building symbolic interactions by using relative object positions for all pairs of objects located within a certain maximal distance. Relative object position is necessary to capture object dynamics. The maximal distance threshold is required to make learning quicker, even though it may lead to a locally optimal policy.

(6) The RL agent uses object interactions as states in the Q-learning update.
Instead of using all object interactions in a frame as one state, the number of states is further reduced by considering interactions between two types to be independent of other types and doing a Q-learning update separately for each type pair. The intuitive explanation for doing so is to look at a frame as a set of independent object type interactions. The action chosen at a state is then the one that maximizes the sum of Q values across all type pairs. The results claim that, using DRL with symbolic reasoning, transfer of policies can be observed by first training on an evenly spaced grid world and then using the policy on a randomly spaced grid world, with a performance close to 70%, in contrast to DQN, which achieves 50% even after training for 1000 epochs with an epoch length of 100. 
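The action-selection rule (pick the action maximizing the sum of Q values over type pairs) might look like this in code. The data layout is my own assumption for illustration: one tabular Q function per type pair, keyed by relative position and action; the object type names are made up.

```python
from collections import defaultdict


def choose_action(q_tables, interactions, actions):
    """Greedy action under the sum of per-type-pair Q values.

    q_tables[type_pair][(relative_position, action)] holds a Q value
    (hypothetical layout); `interactions` is a list of
    (type_pair, relative_position) tuples observed in the current frame.
    """
    def total_q(action):
        return sum(q_tables[pair][(rel, action)] for pair, rel in interactions)

    return max(actions, key=total_q)


# Unseen (state, action) entries default to a Q value of 0.
q_tables = defaultdict(lambda: defaultdict(float))
q_tables[("agent", "plus")][((1, 0), "right")] = 1.0
q_tables[("agent", "minus")][((0, 1), "right")] = -2.0
q_tables[("agent", "minus")][((0, 1), "up")] = 0.5

interactions = [(("agent", "plus"), (1, 0)), (("agent", "minus"), (0, 1))]
best = choose_action(q_tables, interactions, ["right", "up"])
```

Moving right looks good for the first type pair alone, but the summed value over both pairs favors moving up.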
[link]
This paper proposes a simple method for sequentially training on new tasks while avoiding catastrophic forgetting. The paper starts with the Bayesian formulation of learning a model, that is $$ \log P(\theta \mid D) = \log P(D \mid \theta) + \log P(\theta) - \log P(D) $$ By replacing the prior with the posterior of the previous task(s), we have $$ \log P(\theta \mid D) = \log P(D \mid \theta) + \log P(\theta \mid D_{prev}) - \log P(D) $$ The paper uses the following form for the posterior, a Gaussian whose precision is the diagonal of the Fisher information: $$ P(\theta \mid D_{prev}) = N(\theta_{prev}, diag(F)^{-1}) $$ where $F$ is the Fisher Information matrix $E_x[ \nabla_\theta \log P(x \mid \theta) (\nabla_\theta \log P(x \mid \theta))^T]$. The resulting objective function is then $$ L(\theta) = L_{new}(\theta) + \frac{\lambda}{2}\sum_i F_{ii} (\theta_i - \theta^{prev*}_i)^2 $$ where $L_{new}$ is the loss on the new task, and $\theta^{prev*}$ is the best parameter setting for the previous task. The penalty can be viewed as a distance which uses the Fisher Information matrix to properly scale each dimension, and the experiments further show that the Fisher Information matrix matters, by comparing with a simple $L_2$ distance. 
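As a concrete reading of the objective, here is a small sketch of the penalty with parameters stored as flat lists. The diagonal Fisher estimate from per-example gradients is a common empirical approximation and an assumption on my part; the function names are hypothetical.

```python
def diagonal_fisher(per_example_grads):
    """Empirical diagonal Fisher: mean of squared per-example gradients.

    `per_example_grads` is a list of gradient vectors of log P(x | theta),
    one per example, evaluated at the previous task's optimum.
    """
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]


def ewc_loss(new_task_loss, theta, theta_prev, fisher_diag, lam):
    """L_new(theta) + lam / 2 * sum_i F_ii * (theta_i - theta_prev_i)^2."""
    penalty = sum(
        f * (t - tp) ** 2 for f, t, tp in zip(fisher_diag, theta, theta_prev)
    )
    return new_task_loss + 0.5 * lam * penalty


fisher = diagonal_fisher([[1.0, 0.0], [3.0, 0.0]])
# The second parameter has zero Fisher information, so moving it is free.
loss = ewc_loss(1.0, theta=[1.0, 5.0], theta_prev=[0.0, 0.0],
                fisher_diag=fisher, lam=2.0)
```

Replacing `fisher_diag` with a vector of ones recovers the plain $L_2$ distance that the paper compares against.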
[link]
# Main Results (tl;dr)

## Deep *Linear* Networks

1. Loss function is **non-convex** and non-concave
2. **Every local minimum is a global minimum**
3. Shallow neural networks *don't* have bad saddle points
4. Deep neural networks *do* have bad saddle points

## Deep *ReLU* Networks

* Same results as above by reduction to deep linear networks under strong simplifying assumptions
* Strong assumptions:
    * The probability that a path through the ReLU network is active is the same, agnostic to which path it is.
    * The activations of the network are independent of the input data and the weights.

## Highlighted Takeaways

* Depth *doesn't* create non-global minima, but depth *does* create bad saddle points.
* This paper moves deep linear networks closer to a good model for deep ReLU networks by discarding 5 of the 7 previously used assumptions. This gives more "support" for the conjecture that deep ReLU networks don't have bad local minima.
* Deep linear networks don't have bad local minima, so if deep ReLU networks do have bad local minima, it's purely because of the introduction of nonlinear activations. This highlights the importance of the activation function used.
* Shallow linear networks don't have bad saddle points while deep linear networks do, indicating that the saddle point problem is introduced with depth beyond the first hidden layer.

Bad saddle point : saddle point whose Hessian has no negative eigenvalues (no direction to descend)

Shallow neural network : single hidden layer

Deep neural network : more than one hidden layer

Bad local minima : local minima that aren't global minima

# Position in Research Landscape

* Conjecture from 1989: For deep linear networks, every local minimum is a global minimum: [Neural networks and principal component analysis: Learning from examples without local minima (Neural networks 1989)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.408.1839&rep=rep1&type=pdf)
    * This paper proves that conjecture.
* Given 7 strong assumptions, the losses of local minima are concentrated in an exponentially (with dimension) tight band: [The Loss Surfaces of Multilayer Networks (AISTATS 2015)](https://arxiv.org/abs/1412.0233)
* Discarding some of the above assumptions is an open problem: [Open Problem: The landscape of the loss surfaces of multilayer networks (COLT 2015)](http://proceedings.mlr.press/v40/Choromanska15.pdf)
    * This paper discards 5 of those assumptions and proves the result for a strictly more general deep nonlinear model class.

# More Details

## Deep *Linear* Networks

* Main result is Result 2, which proves the conjecture from 1989: every local minimum is a global minimum.
* Not where the strong assumptions come in
* Assumptions (realistic and practically easy to satisfy):
    * $XX^T$ and $XY^T$ are full rank
    * $d_y \leq d_x$ (output is lower dimension than input)
    * $\Sigma = YX^T(XX^T)^{-1}XY^T$ has $d_y$ distinct eigenvalues
    * specific to the squared error loss function
* Essentially gives a comprehensive understanding of the loss surface of deep linear networks

## Deep ReLU Networks

* Specific to ReLU activation. Makes strong use of its properties
* Choromanska et al. (2015) relate the loss function to the Hamiltonian of the spherical spin-glass model, using 3 reshaping assumptions. This allows them to apply existing random matrix theory results. This paper drops those reshaping assumptions by performing a completely different analysis.
* Because Choromanska et al. (2015) used random matrix theory, they analyzed a random Hessian, which means they need to make 2 distributional assumptions. This paper also drops those 2 assumptions and analyzes a deterministic Hessian.
* Remaining Unrealistic Assumptions:
    * The probability that a path through the ReLU network is active is the same, agnostic to which path it is.
    * The activations of the network are independent of the input data and the weights.

# Related Resources

* [NIPS Oral Presentation](https://channel9.msdn.com/Events/NeuralInformationProcessingSystemsConference/NeuralInformationProcessingSystemsConferenceNIPS2016/DeepLearningwithoutPoorLocalMinima)
[link]
Problem: Refine synthetically simulated images to look real

https://machinelearning.apple.com/images/journals/gan/real_synt_refined_gaze.png

Approach:

* Generative adversarial networks

Contributions:

1. **Refiner** FCN that turns a simulated image into a realistic-looking image
2. **Adversarial + self-regularization loss**
    * **Adversarial loss** term = CNN that classifies whether the image is refined or real
    * **Self-regularization** term = L1 distance of the refiner-produced image from the simulated image. The distance can be either in pixel space or in feature space (to preserve gaze direction, for example).

https://i.imgur.com/I4KxCzT.png

Datasets:

* grayscale eye images
* depth sensor hand images

Technical Contributions:

1. **Local adversarial loss**: the discriminator is applied on image patches, thus creating multiple "realness" metrics

   https://machinelearning.apple.com/images/journals/gan/locald.png

2. **Discriminator with history**: to prevent the refiner from going back to previously used refined images.

   https://machinelearning.apple.com/images/journals/gan/history.gif
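A rough sketch of how the two refiner loss terms combine is given below. It is an illustration under several assumptions of mine, not the paper's implementation: images are flattened lists of pixel values, the discriminator is represented only by its per-patch "real" probabilities, and the weight `lam` is a hypothetical hyperparameter.

```python
import math


def self_regularization(refined, simulated):
    """L1 distance between the refined and the simulated image (pixel space)."""
    return sum(abs(r - s) for r, s in zip(refined, simulated))


def local_adversarial_loss(patch_real_probs):
    """Sum over patches of -log p(real); the refiner wants every patch to look real."""
    return -sum(math.log(p) for p in patch_real_probs)


def refiner_loss(refined, simulated, patch_real_probs, lam=0.5):
    """Adversarial term plus lam times the self-regularization term."""
    return (local_adversarial_loss(patch_real_probs)
            + lam * self_regularization(refined, simulated))


simulated = [0.0, 0.5]
refined = [0.2, 0.4]
fooled = refiner_loss(refined, simulated, patch_real_probs=[1.0, 1.0], lam=2.0)
caught = refiner_loss(refined, simulated, patch_real_probs=[0.5, 0.9], lam=2.0)
```

When every patch fools the discriminator, only the L1 term remains, so the refiner is still discouraged from drifting far from the simulated input and its annotations.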
[link]
# Very Short

The authors propose a deep, recurrent, convolutional architecture called PredNet, inspired by the idea of predictive coding from neuroscience. In PredNet, the first layer attempts to predict the input frame, based on past frames and input from higher layers. The next layer then attempts to predict the *prediction error* of the first layer, and so on. The authors show that such an architecture can predict future frames of video, and predict the parameters of synthetically-generated video, better than a conventional recurrent autoencoder.

# Short

## The Model

PredNet has the following architecture:

https://i.imgur.com/7vOcGwI.png

Where the R blocks are Recurrent Neural Networks, and the A blocks are Convolutional Layers. $E_l$ indicates the prediction error at layer $l$. The network is trained from snippets of video, and the loss is given as:

$L_{train} = \sum_{t=1}^T \sum_l \frac{\lambda_l}{n_l} \sum_i^{2n_l} [E_l^t]_i$

Where $t$ indexes the time step, $l$ indexes the layer, $n_l$ is the number of units in the layer, $E_l^t = [ReLU(A_l^t - \hat A_l^t) ; ReLU(\hat A_l^t - A_l^t) ]$ is the concatenation of the positive and negative components of the error, and $\lambda_l$ is a hyperparameter determining the effect that the layer $l$ error should have on the loss.

In the experiments, they use two settings for the $\lambda_l$ hyperparameters. In the "$L_0$" setting, they set $\lambda_0=1, \lambda_{>0}=0$, which ends up being optimal when trying to optimize next-frame L1 error. In the "$L_{all}$" setting, they use $\lambda_0=1, \lambda_{>0}=0.1$, which, in the synthetic-images experiment, seems to be better at predicting the parameters of the synthetic-image generator.

## Results

They apply the model on two tasks:

1) Predicting the future frames of a synthetic video generated by a graphics engine.
Here they predict both the next frame (in which their $L_0$ model does best), and the parameters (face characteristics, rotation, angle) of the program that generates the synthetic faces (on which their $L_{all}$ model does best). They predict face-generating parameters by first training the model, and then freezing the weights and regressing from the learned representations at a given layer to the parameters. They show that both the $L_0$ and $L_{all}$ models outperform a more conventional recurrent autoencoder.

https://i.imgur.com/S8PpJnf.png

**Next-frame predictions on a sequence of faces (note: here, predictions are *not* fed back into the model to generate the next frame)**

2) Predicting future frames of video from dashboard cameras.

https://i.imgur.com/Zus34Vm.png

**Next-frame predictions of dashboard-camera images**

The authors conclude that allowing higher layers to model *prediction errors*, instead of *abstract representations*, can lead to better modeling of video.
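The error units and the layer-weighted loss from the model description can be sketched as follows. This is a minimal illustration for a single time step on flat lists; the function names are hypothetical and real activations would be convolutional feature maps.

```python
def relu(x):
    return max(0.0, x)


def error_units(a, a_hat):
    """E = [ReLU(A - A_hat); ReLU(A_hat - A)] for flat lists of activations."""
    positive = [relu(x - y) for x, y in zip(a, a_hat)]
    negative = [relu(y - x) for x, y in zip(a, a_hat)]
    return positive + negative  # concatenation, so len(E) == 2 * n_l


def step_loss(errors_per_layer, lambdas):
    """sum_l (lambda_l / n_l) * sum_i E_l[i] for one time step."""
    total = 0.0
    for e, lam in zip(errors_per_layer, lambdas):
        n_l = len(e) // 2  # E_l holds 2 * n_l entries
        total += lam / n_l * sum(e)
    return total
```

With `lambdas = [1.0, 0.0]` this reproduces the "$L_0$" setting (only the pixel-level error counts); `[1.0, 0.1]` corresponds to "$L_{all}$".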
[link]
Doing POS tagging using a bidirectional LSTM with word- and character-based embeddings. They add an extra component to the loss function – predicting a frequency class for each word, together with its POS tag. Results show that overall performance remains similar, but there’s an improvement in tagging accuracy for low-frequency words.

https://i.imgur.com/nwb8dOC.png
[link]
Investigation of how well LSTMs capture long-distance dependencies. The task is to predict verb agreement (singular or plural) when the subject noun is separated from the verb by different numbers of distractors. They find that an LSTM trained explicitly for this task manages to handle even most of the difficult cases, but a regular language model is more prone to being misled by the distractors.

https://i.imgur.com/0kYhawn.png
[link]
The aim is to have the system discover a method for parsing that would benefit a downstream task.

https://i.imgur.com/q57gGCz.png

They construct a neural shift-reduce parser – as it’s moving through the sentence, it can either shift the word to the stack or reduce two words on top of the stack by combining them. A Tree-LSTM is used for composing the nodes recursively. The whole system is trained using reinforcement learning, based on an objective function of the downstream task. The model learns parse rules that are beneficial for that specific task, either without any prior knowledge of parsing or by initially training it to act as a regular parser.
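The shift/reduce mechanics described above can be sketched without any neural components. In this illustration the action sequence is given by hand (in the paper it is chosen by the learned policy), and the hypothetical `compose` argument stands in for the Tree-LSTM composition function.

```python
def shift_reduce(words, actions, compose=lambda left, right: (left, right)):
    """Apply a sequence of 'shift' / 'reduce' actions to a sentence.

    shift:  move the next word from the buffer onto the stack.
    reduce: pop the two top stack items and push their composition.
    """
    buffer = list(words)
    stack = []
    for action in actions:
        if action == "shift":
            stack.append(buffer.pop(0))
        elif action == "reduce":
            right = stack.pop()
            left = stack.pop()
            stack.append(compose(left, right))
        else:
            raise ValueError(f"unknown action: {action}")
    return stack


tree = shift_reduce(
    ["the", "cat", "sat"],
    ["shift", "shift", "reduce", "shift", "reduce"],
)
```

Different action sequences over the same sentence yield different binary trees, which is exactly the degree of freedom the reinforcement learning objective optimizes.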
[link]
The authors tackle the problem of domain adaptation for NER, where the label set of the target domain is different from the source domain. They first train a CRF model on the source domain. Next, they train an LR classifier to predict labels in the target domain, based on predicted label scores from the model. Finally, the weights from the classifier are used to initialise another CRF model, which is then fine-tuned on the target domain data.

https://i.imgur.com/zwSB7qN.png
[link]
They describe a method for augmenting existing word embeddings with knowledge of semantic constraints. The idea is similar to retrofitting by Faruqui et al. (2015), but using additional constraints and a different optimisation function.

https://i.imgur.com/zedR5FV.png

Existing word vectors are further optimised to 1) have high similarity for known synonyms, 2) have low similarity for known antonyms, and 3) have high similarity to words that were highly similar in the original space. They evaluate on SimLex-999, showing state-of-the-art performance. Also, they use the method to improve a dialogue tracking system.
[link]
They create an LSTM neural language model that 1) has better handling of numerical values, and 2) is conditioned on a knowledge base.

https://i.imgur.com/Rb6V1Hy.png

First, the numerical value of each token is given as an additional signal to the network at each time step. While we normally represent the token “25” with an ordinary word embedding, we now also have an extra feature with the numerical value float(25). Second, they condition the language model on text in a knowledge base. All the information in the KB is converted to a string, passed through an LSTM and then used to condition the main LM. They evaluate on a dataset of 16,003 clinical records which come paired with small KB tuples of 20 possible attributes. The numerical grounding helps quite a bit, and the best results are obtained when the KB conditioning is also added.
[link]
They start with the neural machine translation model using alignment, by Bahdanau et al. (2014), and add an extra variational component.

https://i.imgur.com/6yIEbDf.png

The authors use two neural variational components to model a distribution over latent variables z that captures the semantics of a sentence being translated. First, they model the posterior probability of z, conditioned on both input and output. Then they also model the prior of z, conditioned only on the input. During training, these two distributions are optimised to be similar using the Kullback-Leibler divergence, and during testing the prior is used. They report improvements on Chinese-English and English-German translation, compared to using the original encoder-decoder NMT framework.
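To make the KL term concrete, here is a small sketch of the divergence between a posterior and a prior over z. It assumes both are diagonal Gaussians, which is a common choice for variational models but is an assumption here, as are the function and argument names.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for diagonal Gaussians, summed over latent dimensions.

    q plays the role of the posterior (conditioned on source and target),
    p the role of the prior (conditioned on the source only).
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total
```

The term is zero when the two distributions coincide and grows as the prior's prediction of z drifts from what the posterior infers once the target sentence is known.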
[link]
They propose a joint model for 1) identifying event keywords in a text, 2) identifying entities, and 3) identifying the connections between these events and entities. They also do this across different sentences, jointly for the whole text. https://i.imgur.com/ETKZL7V.png Example of the entity and event annotation that the system is modelling. The entity detection part is done with a CRF; the structure of an event is learned with a probabilistic graphical model; information is integrated from surrounding sentences using a Stanford coreference system; and these are all tied together across the whole document using Integer Linear Programming. 
[link]
Adversarial examples are datapoints that are designed to fool a classifier. For example, we can take an image that is classified correctly using a neural network, then backprop through the model to find which changes we need to make in order for it to be classified as something else. And these changes can be quite small, such that a human would hardly notice a difference. https://i.imgur.com/pkK570X.png Examples of adversarial images. In this paper, they show that much of this property holds even when the images are fed into the classifier from the real world – after being photographed with a cell phone camera. While the accuracy goes from 85.3% to 36.3% when adversarial modifications are applied on the source images, the performance still drops from 79.8% to 36.4% when the images are photographed. They also propose two modifications to the process of generating adversarial images – making it into a more gradual iterative process, and optimising for a specific adversarial class. 
[link]
The goal is to improve the training process for a spoken dialogue system, more specifically a telephone-based system providing restaurant information for the Cambridge (UK) area. They train a supervised system which tries to predict the success of the current dialogue – if the model is certain about the outcome, the predicted label is used for training the dialogue system; if the model is uncertain, the user is asked to provide a label. Essentially it reduces the amount of annotation that is required, by choosing which examples should be annotated through active learning.

https://i.imgur.com/dWY1EdE.png

The dialogue is mapped to a vector representation using a bidirectional LSTM trained like an autoencoder, and a Gaussian Process is used for modelling dialogue success.
[link]
Hermann et al. (2015) created a dataset for testing reading comprehension by extracting summarised bullet points from CNN and Daily Mail. All the entities in the text are anonymised and the task is to place correct entities into empty slots based on the news article.

https://i.imgur.com/qeJATKq.png

This paper has hand-reviewed 100 samples from the dataset and concludes that around 25% of the questions are difficult or impossible to answer even for a human, mostly due to the anonymisation process. They present a simple classifier that achieves unexpectedly good results, and a neural network based on attention that beats all previous results by quite a margin.
[link]
* Presents an architecture dubbed ResNeXt
* They use modules built of
  * 1x1 conv
  * 3x3 group conv, keeping the depth constant. It's like a usual conv, but it's not fully connected along the depth axis, only connected within groups
  * 1x1 conv
  * plus a skip connection coming from the module input
* Advantages:
  * Fewer parameters, since the full connections are only within the groups
  * Allows more feature channels at the cost of more aggressive grouping
  * Better performance when keeping the number of params constant
* Questions/Disadvantages:
  * Instead of keeping the number of params constant, how about aiming at constant memory consumption? Having more feature channels requires more RAM, even if the connections are sparser and hence there are fewer params
  * Not so much improvement over ResNet
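The parameter trade-off in the list above is simple arithmetic, sketched here. The helper names are hypothetical, biases and batch-norm parameters are ignored, and the channel widths in the usage lines are illustrative values chosen to give roughly equal budgets.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution with optional channel grouping."""
    assert c_in % groups == 0 and c_out % groups == 0
    # Each output channel only sees c_in / groups input channels.
    return k * k * (c_in // groups) * c_out


def bottleneck_params(c_io, width, groups):
    """1x1 conv -> 3x3 (group) conv -> 1x1 conv, as in the module above."""
    return (
        conv_params(c_io, width, 1)
        + conv_params(width, width, 3, groups)
        + conv_params(width, c_io, 1)
    )


plain = bottleneck_params(256, 64, groups=1)     # ungrouped, narrow
grouped = bottleneck_params(256, 128, groups=32)  # grouped, twice as wide
```

Grouping divides the cost of the 3x3 layer by the number of groups, which is what pays for the wider middle layer at an almost unchanged parameter count.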
[link]
# Very Short

The authors propose **learning** an optimizer **to** optimally **learn** a function (the *optimizee*) which is being trained **by gradient descent**. This optimizer, a recurrent neural network, is trained to make optimal parameter updates to the optimizee **by gradient descent**.

# Short

Let's suppose we have a stochastic function $f: \mathbb R^{\text{dim}(\theta)} \rightarrow \mathbb R^+$ (the *optimizee*), which we wish to minimize with respect to $\theta$. Note that this is the typical situation we encounter when training a neural network with Stochastic Gradient Descent, where the stochasticity comes from sampling random minibatches of the data (the data is omitted as an argument here). The "vanilla" gradient descent update is: $\theta_{t+1} = \theta_t - \alpha_t \nabla_{\theta_t} f(\theta_t)$, where $\alpha_t$ is some learning rate. Other optimizers (Adam, RMSProp, etc.) replace the multiplication of the gradient by $\alpha_t$ with some sort of weighted sum of the history of gradients.

This paper proposes to apply an optimization step $\theta_{t+1} = \theta_t + g_t$, where the update $g_t \in \mathbb R^{\text{dim}(\theta)}$ is defined by a recurrent network $m_\phi$:

$$(g_t, h_{t+1}) := m_\phi (\nabla_{\theta_t} f(\theta_t), h_t)$$

Where in their implementation, $h_t \in \mathbb R^{\text{dim}(\theta)}$ is the hidden state of the recurrent network. To make the number of parameters in the optimizer manageable, they implement their recurrent network $m$ as a *coordinate-wise* LSTM (i.e. a set of $\text{dim}(\theta)$ small LSTMs that share parameters $\phi$). They train the optimizer network's parameters $\phi$ by "unrolling" T subsequent steps of optimization, and minimizing:

$$\mathcal L(\phi) := \mathbb E_f[f(\theta^*(f, \phi))] \approx \frac1T \sum_{t=1}^T f(\theta_t)$$

Where $\theta^*(f, \phi)$ are the final optimizee parameters.
In order to avoid computing second derivatives while calculating $\frac{\partial \mathcal L(\phi)}{\partial \phi}$, they make the approximation $\frac{\partial}{\partial \phi} \nabla_{\theta_t}f(\theta_t) \approx 0$ (corresponding to the dotted lines in the figure, along which gradients are not backpropagated).

https://i.imgur.com/HMaCeip.png

**The computational graph of the optimization of the optimizer, unrolled across 3 timesteps. Note that $\nabla_t := \nabla_{\theta_t}f(\theta_t)$. The dotted line indicates that we do not backpropagate across this path.**

The authors demonstrate that their method usually outperforms traditional optimizers (ADAM, RMSProp, SGD, NAG) on a synthetic dataset, MNIST, CIFAR-10, and Neural Style Transfer. They argue that their algorithm constitutes a form of transfer learning, since a pre-trained optimizer can be applied to accelerate training of a newly initialized network.
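The update rule $\theta_{t+1} = \theta_t + g_t$ with a coordinate-wise, stateful update function can be sketched as below. This is not the learned LSTM: `update_fn` is a hypothetical stand-in for $m_\phi$ with the same interface (gradient and state in, step and new state out), and plain SGD is plugged in to show that it is a special case.

```python
def run_optimizer(grad_fn, theta, update_fn, state, steps):
    """Unroll `steps` updates theta <- theta + g, one coordinate at a time.

    update_fn(grad_i, state_i) -> (g_i, new_state_i) is shared by all
    coordinates, mirroring the shared parameters of the coordinate-wise LSTM.
    """
    for _ in range(steps):
        grads = grad_fn(theta)
        new_theta, new_state = [], []
        for th, g, h in zip(theta, grads, state):
            step, h_next = update_fn(g, h)
            new_theta.append(th + step)
            new_state.append(h_next)
        theta, state = new_theta, new_state
    return theta


def sgd_update(grad, state, lr=0.1):
    """Vanilla gradient descent expressed in the same interface (stateless)."""
    return -lr * grad, state


def quadratic_grad(theta):
    """Gradient of the toy optimizee f(theta) = sum(theta_i ** 2)."""
    return [2.0 * t for t in theta]


final = run_optimizer(quadratic_grad, [1.0, -2.0], sgd_update, [None, None], steps=50)
```

Training the optimizer would mean replacing `sgd_update` with a parameterized function and differentiating the summed losses along this unrolled loop with respect to its parameters.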
[link]
_Objective:_ Find a generative model that avoids the usual shortcomings, i.e. one that produces (i) high-resolution images, (ii) a variety of images and (iii) samples matching the dataset diversity.

_Dataset:_ [ImageNet](https://www.imagenet.org/)

## Inner-workings:

The idea is to find an image that maximizes the probability for a given label by using a variant of a Markov Chain Monte Carlo (MCMC) sampler.

[![screen shot 20170601 at 12 31 14 pm](https://cloud.githubusercontent.com/assets/17261080/26675978/3c9e6d9446c611e79f67477c4036a891.png)](https://cloud.githubusercontent.com/assets/17261080/26675978/3c9e6d9446c611e79f67477c4036a891.png)

Where the first term ensures that we stay in the image manifold that we're trying to find and don't just produce adversarial examples, and the second term makes sure that we find an image corresponding to the label we're looking for. Basically we start with a random image and iteratively find a better image to match the label we're trying to generate.

### MALA-approx:

MALA-approx is the MCMC sampler, based on the Metropolis-Adjusted Langevin Algorithm, that they use in the paper. It is defined iteratively as follows:

[![screen shot 20170601 at 12 25 45 pm](https://cloud.githubusercontent.com/assets/17261080/26675866/bf15cc2846c511e79620659d26f84bf8.png)](https://cloud.githubusercontent.com/assets/17261080/26675866/bf15cc2846c511e79620659d26f84bf8.png)

where:

* epsilon1 makes the image more generic.
* epsilon2 increases confidence in the chosen class.
* epsilon3 adds noise to encourage diversity.

### Image prior:

They try several priors for the images:

1. PPGN-x: p(x) is modeled with a Denoising AutoEncoder (DAE).

[![screen shot 20170601 at 1 48 33 pm](https://cloud.githubusercontent.com/assets/17261080/26678501/1737c64e46d111e782a47ee0aa8bfe2f.png)](https://cloud.githubusercontent.com/assets/17261080/26678501/1737c64e46d111e782a47ee0aa8bfe2f.png)

2. DGN-AM: use a latent space to model x with h using a GAN.
[![screen shot 20170601 at 1 49 41 pm](https://cloud.githubusercontent.com/assets/17261080/26678517/2e74319446d111e795dc9bb638128242.png)](https://cloud.githubusercontent.com/assets/17261080/26678517/2e74319446d111e795dc9bb638128242.png)

3. PPGN-h: incorporates a prior for p(h) using a DAE.

[![screen shot 20170601 at 1 51 14 pm](https://cloud.githubusercontent.com/assets/17261080/26678579/6bd8cb5846d111e7895df9432b7e5e1f.png)](https://cloud.githubusercontent.com/assets/17261080/26678579/6bd8cb5846d111e7895df9432b7e5e1f.png)

4. Joint PPGN-h: to increase the expressivity of G, model h by first modeling x in the DAE.

[![screen shot 20170601 at 1 51 23 pm](https://cloud.githubusercontent.com/assets/17261080/26678622/a7bf2f6846d111e7920998f97e0a218d.png)](https://cloud.githubusercontent.com/assets/17261080/26678622/a7bf2f6846d111e7920998f97e0a218d.png)

5. Noiseless joint PPGN-h: same as previous but without noise.

[![screen shot 20170601 at 1 54 11 pm](https://cloud.githubusercontent.com/assets/17261080/26678655/d549922046d111e793d0d48a6b6fa1a8.png)](https://cloud.githubusercontent.com/assets/17261080/26678655/d549922046d111e793d0d48a6b6fa1a8.png)

### Conditioning:

In the paper they mostly condition on the label, but captions or pretty much anything else can also be used.

[![screen shot 20170601 at 2 26 53 pm](https://cloud.githubusercontent.com/assets/17261080/26679654/6297ab8646d611e786faf763face01ca.png)](https://cloud.githubusercontent.com/assets/17261080/26679654/6297ab8646d611e786faf763face01ca.png)

## Architecture:

The final architecture using a pretrained classifier network is below. Note that only G and D are trained.

[![screen shot 20170601 at 2 29 49 pm](https://cloud.githubusercontent.com/assets/17261080/26679785/db14352046d611e7966872864f1a8eb1.png)](https://cloud.githubusercontent.com/assets/17261080/26679785/db14352046d611e7966872864f1a8eb1.png)

## Results:

Pretty much any base network can be used with minimal training of G and D.
It produces very realistic images with great diversity; see below for examples of 227x227 images with ImageNet.

[![screen shot 20170601 at 2 32 38 pm](https://cloud.githubusercontent.com/assets/17261080/26679884/4494002a46d711e7882ec69aff2ddd17.png)](https://cloud.githubusercontent.com/assets/17261080/26679884/4494002a46d711e7882ec69aff2ddd17.png)
[link]
_Objective:_ Fundamental analysis of random networks using mean-field theory. Introduces two scales controlling network behavior.

## Results:

A guide to choosing hyperparameters for random networks so that they are nearly critical (in between order and chaos). This in turn implies that information can propagate forward and backward and thus the network is trainable (no vanishing or exploding gradients). Basically, for any given number of layers and initialization covariances for weights and biases, it tells you whether the network will be trainable or not, a kind of architecture validation tool.

**To be noted:** any amount of dropout removes the critical point and therefore implies an upper bound on trainable network depth.

## Caveats:

* Considers only bounded activation units: no ReLU, etc.
* Applies directly only to fully connected feed-forward networks: no convnets, etc.
[link]
_Objective:_ Build a network easily trainable by back-propagation to perform unsupervised domain adaptation while at the same time learning a good embedding for both source and target domains.

_Dataset:_ [SVHN](ufldl.stanford.edu/housenumbers/), [MNIST](yann.lecun.com/exdb/mnist/), [USPS](https://www.otexts.org/1577), [CIFAR](https://www.cs.toronto.edu/%7Ekriz/cifar.html) and [STL](https://cs.stanford.edu/%7Eacoates/stl10/).

## Architecture:

Very similar to RevGrad but with some differences. Basically a shared encoder followed by a classifier and a reconstructor.

[![screen shot 20170522 at 6 11 22 pm](https://cloud.githubusercontent.com/assets/17261080/26318076/213615923f1a11e792139cc07cfe2f2a.png)](https://cloud.githubusercontent.com/assets/17261080/26318076/213615923f1a11e792139cc07cfe2f2a.png)

The two losses are:

* the usual cross-entropy with softmax for the classifier
* the pixel-wise squared loss for reconstruction

These are then combined using a trade-off hyperparameter between classification and reconstruction.

They also use data augmentation to generate additional training data during the supervised training, using only geometrical deformations (translation, rotation, skewing, and scaling), plus denoising to reconstruct clean inputs given their noisy counterparts (zero-masked noise and Gaussian noise).

## Results:

Outperformed the state of the art on most tasks at the time; it has since been outperformed by Generate To Adapt on most tasks.
[link]
_Objective:_ Find a feature representation that cannot discriminate between the training (source) and test (target) domains using a discriminator trained directly on this embedding.

_Dataset:_ MNIST, SYN Numbers, SVHN, SYN Signs, OFFICE, PRID, VIPeR and CUHK.

## Architecture:

The basic idea behind this paper is to use a standard classifier network and choose one layer that will be the feature representation. The part of the network before this layer is called the `Feature Extractor` and the part after it the `Label Predictor`. Then a new network called a `Domain Classifier` is introduced that takes the extracted feature as input; its objective is to tell whether a computed feature embedding came from an image from the source or the target dataset.

During training the aim is to minimize the loss of the `Label Predictor` while maximizing the loss of the `Domain Classifier`. In theory we should end up with a feature embedding where the discriminator can't tell if the image came from the source or target domain, thus the domain shift should have been eliminated.

To maximize the domain loss, a new layer is introduced, the `Gradient Reversal Layer`, which is equal to the identity during the forward pass but reverses the gradient in the back-propagation phase. This enables the network to be trained using simple gradient descent algorithms.

What is interesting about this approach is that any initial network can be used by simply adding a few new sets of layers for the domain classifier. Below is a generic architecture.

[![screen shot 20170418 at 1 59 53 pm](https://cloud.githubusercontent.com/assets/17261080/25129680/590f57ee243f11e7892791124303b584.png)](https://cloud.githubusercontent.com/assets/17261080/25129680/590f57ee243f11e7892791124303b584.png)

## Results:

Their approach works, but for some domain adaptation tasks it completely fails, and overall its performance is not great. Since then the state of the art has changed; see DANN combined with GAN, or ADDA.
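The behaviour of the `Gradient Reversal Layer` can be shown with a framework-free sketch. The class below is a hypothetical illustration with hand-written forward and backward methods; the scaling factor `lam` is an assumption (a weight on the reversed gradient), and in a real framework the same thing would be done with a custom autograd function.

```python
class GradientReversal:
    """Identity in the forward pass, negated (scaled) gradient in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical weight on the reversed gradient

    def forward(self, features):
        # The domain classifier sees the features unchanged.
        return list(features)

    def backward(self, grad_output):
        # The feature extractor receives the opposite gradient, so a plain
        # descent step on it *increases* the domain classifier's loss.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=1.0)
features = grl.forward([1.0, 2.0])
reversed_grad = grl.backward([0.5, -1.0])
```

Because the reversal lives inside the graph, a single ordinary optimizer step trains the domain classifier to discriminate and the feature extractor to confuse it at the same time.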
[link]
# Semantic Segmentation using Adversarial networks

## Luc, Couprie, Chintala, Verbeek, 2016

* The paper aims to improve segmentation performance (IoU) by extending the network
* The authors derive intuition from GANs, where a game is played between generator and discriminator.
* In this work, the game works as follows: a segmentation network maps an image WxHx3 to a label map WxHxC. A discriminator CNN is given the task of discriminating the generated label maps from the ground truth. It is an adversarial game, because the segmentor aims for _more real_ label maps and the discriminator aims to distinguish them from ground truth.
* The discriminator is a CNN that maps from HxWxC to a binary label.
* Section 3.2 outlines how to feed the label maps in three ways
  * __Basic__ where the label maps are concatenated to the image and fed to the discriminator. Actually, the authors observe that leaving the image out does not change performance. So they end up feeding only the label maps for _basic_
  * __Product__ where the label maps and input are multiplied, leading to an input of 3C channels
  * __Scaling__ which resembles basic, but the one-hot distribution is perturbed a bit. This prevents the discriminator from trivially detecting the entropy rather than anything useful
* The discriminator is constructed with two axes of variation, leading to 4 architectures
  * __FOV__: either a field of view of 18x18 or 34x34 over the label map
  * __light__: an architecture with more or less capacity, e.g. number of channels
* The paper shows some fair results on the Stanford dataset, but keep in mind that it only contains 700 images
* The improvements on the Pascal dataset are minor, with the IoU going from 71.8 to 72.0.
* The authors tried to pretrain the adversary, but they found this led to unstable training. They end up training in an alternating scheme between segmentor and discriminator. They found that slow alternations work best.
[link]
The propagation rule used in this paper is the following:

$$ H^l = \sigma \left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{l-1} W^l \right) $$

Where $\tilde{A}$ is the [adjacency matrix][adj] of the undirected graph (with self connections, so it has a diagonal of 1s and is symmetric) and $H^l$ are the hidden activations at layer $l$. The $\tilde{D}$ matrices perform the normalisation. In the [reference code][pygcn] this is done as plain row normalisation, $\tilde{D}^{-1} \tilde{A}$ (with $\tilde{A}$ as `mx`):

```
import numpy as np
import scipy.sparse as sp

rowsum = np.array(mx.sum(1))            # sum along rows
r_inv = np.power(rowsum, -1).flatten()  # 1./rowsum elementwise
r_mat_inv = sp.diags(r_inv)             # cast to sparse diagonal matrix
mx = r_mat_inv.dot(mx)                  # sparse matrix multiply
```

The symmetric way to express this, $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$, is part of the [symmetric normalized Laplacian][laplace].

This work, and the [other][hammond] [work][defferrard] on graph convolutional networks, is motivated by convolving a parametric filter over $x$. Convolution becomes easy if we can perform a *graph Fourier transform* (don't worry I don't understand this either), but that requires us having access to eigenvectors of the normalized graph Laplacian (which is expensive). [Hammond's early paper][hammond] suggested getting around this problem by using Chebyshev polynomials for the approximation. This paper takes the approximation even further, using only a *first order* Chebyshev polynomial, on the assumption that this will be fine because the modeling capacity can be picked up by the deep neural network.

That's how the propagation rule above is derived, but we don't really need to remember the details. In practice $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} = \hat{A}$ is calculated beforehand, and using a graph convolutional network involves multiplying your activations by this sparse matrix at every hidden layer.
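For a tiny graph the whole rule fits in a few lines of dependency-free Python. This is a dense sketch for illustration only: the function names are made up here, nested lists replace sparse matrices, and ReLU is assumed for $\sigma$.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """A_hat = D^{-1/2} (A + I) D^{-1/2} for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(a_hat, h, w):
    """H^l = ReLU(A_hat @ H^{l-1} @ W^l)."""
    mixed = matmul(a_hat, matmul(h, w))
    return [[max(0.0, v) for v in row] for row in mixed]


# Two connected nodes, one feature each, identity weight.
a_hat = normalized_adjacency([[0.0, 1.0], [1.0, 0.0]])
out = gcn_layer(a_hat, [[1.0], [3.0]], [[1.0]])
```

On this two-node graph each node ends up with the average of its own and its neighbour's feature, which is the "mixing datapoints according to the graph" effect in its simplest form.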
If you're thinking in terms of a batch with $N$ examples and $D$ features, this multiplication happens *over the examples*, mixing datapoints together (according to the graph structure). If you want to think of this in an orthodox deep learning way, it's the following:

```
import torch.nn.functional as F

activations = F.linear(H_lm1, W_l)            # (N,D)
activations = activations.permute(1,0)        # (D,N)
activations = F.linear(activations, hat_A)    # (D,N)
activations = activations.permute(1,0)        # (N,D)
H_l = F.relu(activations)
```

**Won't this be really slow though, $\hat{A}$ is $(N,N)$!** Yes, if you implemented it that way it would be very slow. But, many deep learning frameworks support sparse matrix operations ([although maybe not the backward pass][sparse]). Using that, a graph convolutional layer can be implemented [quite easily][pygcnlayer].

**Wait a second, these are batches, not minibatches?** Yup, minibatches are future work.

**What are the experimental results, though?** There are experiments showing this works well for semi-supervised experiments on graphs, as advertised. Also, the many approximations to get the propagation rule at the top are justified by experiment.

**This summary is bad.** Fine, smarter people have written their own posts: [the author's][kipf], [Ferenc's][ferenc].

[adj]: https://en.wikipedia.org/wiki/Adjacency_matrix
[pygcn]: https://github.com/tkipf/pygcn/blob/master/pygcn/utils.py#L56L63
[laplace]: https://en.wikipedia.org/wiki/Laplacian_matrix#Symmetric_normalized_Laplacian
[hammond]: https://arxiv.org/abs/0912.3848
[defferrard]: https://arxiv.org/abs/1606.09375
[sparse]: https://discuss.pytorch.org/t/doespytorchsupportautogradonsparsematrix/6156/7
[pygcnlayer]: https://github.com/tkipf/pygcn/blob/master/pygcn/layers.py#L35L68
[kipf]: https://tkipf.github.io/graphconvolutionalnetworks/
[ferenc]: http://www.inference.vc/howpowerfularegraphconvolutionsreviewofkipfwelling20162/
[link]
This work deals with rotation-equivariant convolutional filters. The idea is that when you rotate an image you should not need to learn new filters to deal with this rotation. First we can look at how convolutions typically handle rotation and how we would expect a rotation-invariant solution to perform:

https://i.imgur.com/cirTi4S.png

https://i.imgur.com/iGpUZDC.png

The method computes all possible rotations of the filter, which results in a list of activations where each element represents a different rotation. From this list the maximum is taken, which results in a two-dimensional output for every pixel (rotation, magnitude). This happens at the pixel level, so the result is a vector field over the image.

https://i.imgur.com/BcnuI1d.png

We can visualize their degree selection method with a figure from https://arxiv.org/abs/1603.04392 which determined the rotation of a building:

https://i.imgur.com/hPI8J6y.png

We can also think of this approach as attention \cite{1409.0473}, where they attend over the possible rotations to obtain a score for each possible rotation value to pass on. The network can learn to adjust the rotation value to be whatever value the later layers will need.

Results on [Rotated MNIST](http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/MnistVariations) show an impressive improvement in training speed and generalization error:

https://i.imgur.com/YO3poOO.png
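The "max over rotated copies of a filter" step can be sketched for a single pixel location. This is a simplification under stated assumptions: only the four 90-degree rotations are used (arbitrary angles would need interpolation), and the helper names are hypothetical.

```python
def rot90(f):
    """Rotate a square filter 90 degrees counter-clockwise."""
    n = len(f)
    return [[f[j][n - 1 - i] for j in range(n)] for i in range(n)]


def rotation_max(patch, filt):
    """Return (angle, magnitude) of the strongest response over 4 rotations."""
    best_angle, best_value = 0, None
    current = filt
    for k in range(4):
        value = sum(p * w
                    for patch_row, filt_row in zip(patch, current)
                    for p, w in zip(patch_row, filt_row))
        if best_value is None or value > best_value:
            best_angle, best_value = 90 * k, value
        current = rot90(current)
    return best_angle, best_value


corner_filter = [[1, 0], [0, 0]]
same = rotation_max([[1, 0], [0, 0]], corner_filter)
rotated = rotation_max([[0, 1], [0, 0]], corner_filter)
```

The magnitude is identical for the original and the rotated patch while the reported angle changes, so one filter covers both cases; applying this at every pixel gives the vector field described above.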
[link]
#### Motivation:

+ Take advantage of the fact that missing values can be very informative about the label.
+ Sampling a time series generates many missing values.

![Sampling](https://raw.githubusercontent.com/tiagotvv/mlpapers/master/clinicaldata/images/Lipton2016_motivation.png?raw=true)

#### Model (indicator flag):

+ Indicator of occurrence of missing value.

![Indicator](https://raw.githubusercontent.com/tiagotvv/mlpapers/master/clinicaldata/images/Lipton2016_indicator.png?raw=true)

+ An RNN can learn about missing values and their importance using only the indicator function. The nonlinearity of this type of model helps capture these dependencies.
+ If one wants to use a linear model, feature engineering is needed to overcome its limitations:
    + indicator for whether a variable was measured at all
    + mean and standard deviation of the indicator
    + frequency with which a variable switches from measured to missing and vice versa.

#### Architecture:

+ RNN with target replication following the work "Learning to Diagnose with LSTM Recurrent Neural Networks" by the same authors.

![Architecture](https://raw.githubusercontent.com/tiagotvv/mlpapers/master/clinicaldata/images/Lipton2016_architecture.png?raw=true)

#### Dataset:

+ Children's Hospital LA
+ An episode is a multivariate time series that describes the stay of one patient in the intensive care unit

| Dataset properties | Value |
| --- | --- |
| Number of episodes | 10,401 |
| Duration of episodes | From 12h to several months |
| Time series variables | Systolic blood pressure, Diastolic blood pressure, Peripheral capillary refill rate, End tidal CO2, Fraction of inspired O2, Glasgow coma scale, Blood glucose, Heart rate, pH, Respiratory rate, Blood O2 Saturation, Body temperature, Urine output. |

#### Experiments and Results:

**Goal**

+ Predict 128 diagnoses.
+ Multilabel: patients can have more than one diagnosis.

**Methodology**

+ Split: 80% training, 10% validation, 10% test.
+ Normalized data to be in the range [0,1].
+ LSTM RNN:
    + 2 hidden layers with 128 cells. Dropout = 0.5, L2 regularization: 1e-6
    + Training for 100 epochs. The parameters kept are those from the point in training with the smallest error on the validation dataset.
+ Baselines:
    + Logistic Regression (L2 regularization)
    + MLP with 3 hidden layers and 500 hidden neurons / layer (parameters chosen via validation set)
    + Tested with raw features and hand-engineered features.
+ Strategies for missing values:
    + Zeroing
    + Impute via forward / backfilling
    + Impute with zeros and use indicator function
    + Impute via forward / backfilling and use indicator function
    + Use indicator function only

#### Results

+ Metrics:
    + Micro AUC, Micro F1: calculated by adding the TPs, FPs, TNs and FNs for the entire dataset and for all classes.
    + Macro AUC, Macro F1: Arithmetic mean of AUCs and F1 scores for each of the classes.
    + Precision at 10: Fraction of correct diagnoses among the top 10 predictions of the model.
    + The upper bound for precision at 10 is 0.2281, since in the test set there are on average 2.281 diagnoses per patient.

![Results](https://raw.githubusercontent.com/tiagotvv/mlpapers/master/clinicaldata/images/Lipton2016_results.png?raw=true)

#### Discussion:

+ The predictive model is based on data collected following a given routine. This routine can change if the model is put into practice. Will the model predictions remain valid under the new routine?
+ Missing values in a way give an indication of the type of treatment being followed.
+ Trade-off between complex models operating on raw features and more interpretable models operating on very complex features.
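The difference between micro and macro averaging, and the precision-at-10 ceiling quoted above, can be checked with a short sketch. The per-class counts below are invented for illustration; only the 2.281 average comes from the summary.

```python
def f1(tp, fp, fn):
    """F1 score from raw counts (true negatives do not enter F1)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per class."""
    macro = sum(f1(*c) for c in counts) / len(counts)  # mean of per-class F1
    tp = sum(c[0] for c in counts)                     # pool counts first...
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = f1(tp, fp, fn)                             # ...then compute one F1
    return micro, macro


# A frequent, well-predicted class and a rare, poorly-predicted one.
micro, macro = micro_macro_f1([(8, 2, 2), (1, 1, 3)])

# With k predictions and on average m true labels per patient,
# at most m of the k can be correct on average.
precision_at_10_bound = 2.281 / 10
```

Micro averaging is dominated by the frequent class, while macro averaging weights the rare class equally, which is why the two can diverge on a long-tailed set of 128 diagnoses.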
[link]
* They present a variation of Faster R-CNN.
  * Faster R-CNN is a model that detects bounding boxes in images.
* Their variation is about as accurate as the best performing versions of Faster R-CNN.
* Their variation is significantly faster than these versions (roughly 50ms per image).

### How

* PVANET reuses the standard Faster R-CNN architecture:
  * A base network that transforms an image into a feature map.
  * A region proposal network (RPN) that uses the feature map to predict bounding box candidates.
  * A classifier that uses the feature map and the bounding box candidates to predict the final bounding boxes.
* PVANET modifies the base network and keeps the RPN and classifier the same.
* Inception
  * Their base network uses eight Inception modules.
  * They argue that these are good choices here, because they are able to represent an image at different scales (aka at different receptive field sizes) due to their mixture of 3x3 and 1x1 convolutions.
    * ![Receptive field sizes in inception modules](images/PVANET__inception_fieldsize.jpg?raw=true "Receptive field sizes in inception modules")
  * Representing an image at different scales is useful here in order to detect both large and small bounding boxes.
  * Inception modules are also reasonably fast.
  * Visualization of their Inception modules:
    * ![Inception modules architecture](images/PVANET__inception_modules.jpg?raw=true "Inception modules architecture")
* Concatenated ReLUs
  * Before the eight Inception modules, they start the network with eight convolutions using concatenated ReLUs.
  * These CReLUs compute both the classic ReLU result (`max(0, x)`) and concatenate to that the ReLU of the negated input, i.e. something like `f(x) = max(0, x <concat> (-1)*x)`.
  * That is done, because among the early layers one can often find pairs of convolution filters that are the negated variations of each other.
So by adding CReLUs, the network does not have to compute these any more, instead they are created (almost) for free, reducing the computation time by up to 50%. * Visualization of their final CReLU block: * TODO * ![CReLU modules](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/PVANET__crelu.jpg?raw=true "CReLU modules") * MultiScale output * Usually one would generate the final feature map simply from the output of the last convolution. * They instead combine the outputs of three different convolutions, each resembling a different scale (or level of abstraction). * They take one from an early point of the network (downscaled), one from the middle part (kept the same) and one from the end (upscaled). * They concatenate these and apply a 1x1 convolution to generate the final output. * Other stuff * Most of their network uses residual connections (including the Inception modules) to facilitate learning. * They pretrain on ILSVRC2012 and then perform finetuning on MSCOCO, VOC 2007 and VOC 2012. * They use plateau detection for their learning rate, i.e. if a moving average of the loss does not improve any more, they decrease the learning rate. They say that this increases accuracy significantly. * The classifier in Faster RCNN consists of fully connected layers. They compress these via Truncated SVD to speed things up. (That was already part of Fast RCNN, I think.) ### Results * On Pascal VOC 2012 they achieve 82.5% mAP at 46ms/image (Titan X GPU). * Faster RCNN + ResNet101: 83.8% at 2.2s/image. * Faster RCNN + VGG16: 75.9% at 110ms/image. * RFCN + ResNet101: 82.0% at 133ms/image. * Decreasing the number of region proposals from 300 per image to 50 almost doubles the speed (to 27ms/image) at a small loss of 1.5 percentage points mAP. * Using Truncated SVD for the classifier reduces the required timer per image by about 30% at roughly 1 percentage point of mAP loss. 
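The concatenated ReLU from the How section can be sketched in a few lines. This is a minimal illustration on a plain list of channel activations; the function name `crelu` and the list-based layout are assumptions for the example, and the scale/shift layers that a full CReLU block may include are left out.

```python
def crelu(activations):
    """Concatenated ReLU over a list of channel activations.

    Returns twice as many channels: ReLU(x) followed by ReLU(-x).
    """
    positive = [max(0.0, x) for x in activations]
    negative = [max(0.0, -x) for x in activations]
    return positive + negative


# Two filter responses become four output channels.
print(crelu([1.5, -2.0]))  # [1.5, 0.0, 0.0, 2.0]
```

Because each filter yields two output channels, a layer needs only half as many convolution filters to produce the same number of feature maps, which is where the savings come from.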
[link]
* They present a variation of Faster R-CNN, i.e. a model that predicts bounding boxes in images and classifies them.
* In contrast to Faster R-CNN, their model is fully convolutional.
* In contrast to Faster R-CNN, the computation per bounding box candidate (region proposal) is very low.

### How

* The basic architecture is the same as in Faster R-CNN:
  * A base network transforms an image to a feature map. Here they use ResNet-101 to do that.
  * A region proposal network (RPN) uses the feature map to locate bounding box candidates ("region proposals") in the image.
  * A classifier uses the feature map and the bounding box candidates and classifies each one of them into `C+1` classes, where `C` is the number of object classes to spot (e.g. "person", "chair", "bottle", ...) and `1` is added for the background.
    * During that process, small subregions of the feature maps (those that match the bounding box candidates) must be extracted and converted to fixed-size matrices. The method to do that is called "Region of Interest Pooling" (RoIPooling) and is based on max pooling. It is mostly the same as in Faster R-CNN.
  * Visualization of the basic architecture:
    * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/RFCN__architecture.jpg?raw=true "Architecture")
* Position-sensitive classification
  * Fully convolutional bounding box detectors tend to not work well.
  * The authors argue that the problems come from the translation invariance of convolutions, which is a desirable property in the case of classification but not when precise localization of objects is required.
  * They tackle that problem by generating multiple heatmaps per object class, each one being slightly shifted ("position-sensitive score maps").
  * More precisely:
    * The classifier generates per object class `c` a total of `k*k` heatmaps.
    * In the simplest form `k` is equal to `1`. Then only one heatmap is generated, which signals whether a pixel is part of an object of class `c`.
    * They use `k=3`, i.e. `3*3` heatmaps per class. The first of those heatmaps signals whether a pixel is part of the *top left* corner of a bounding box of class `c`. The second heatmap signals whether a pixel is part of the *top center* of a bounding box of class `c` (and so on).
    * The RoIPooling is applied to these heatmaps.
    * For `k=3`, each bounding box candidate is converted to `3*3` values. The first one represents the top left corner of the bounding box candidate. Its value is generated by taking the average of the values in that area in the first heatmap.
    * Once the `3*3` values are generated, the final score of class `c` for that bounding box candidate is computed by averaging the values.
    * That process is repeated for all classes and a softmax is used to determine the final class.
    * The graphic below shows examples for that:
      * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/RFCN__examples.jpg?raw=true "Examples")
  * The above described RoIPooling uses only averages and hence is almost (computationally) free.
    * They make use of that during training by sampling many candidates and only backpropagating on those with high losses (online hard example mining, OHEM).
* À trous trick
  * In order to increase accuracy for small bounding boxes they use the à trous trick.
  * That means that they use a pretrained base network (here ResNet-101), then remove a pooling layer and set the à trous rate (aka dilation) of all convolutions after the removed pooling layer to `2`.
  * The à trous rate describes the distance between the sampling locations of a convolution. Usually that is `1` (sampled locations are right next to each other). If it is set to `2`, there is one value "skipped" between each pair of neighbouring sampling locations.
  * By doing that, the convolutions still behave as if the pooling layer existed (and therefore their weights can be reused). At the same time, they work at an increased resolution, making them more capable of classifying small objects. (Runtime increases though.)
* Training of R-FCN happens similarly to Faster R-CNN.

### Results

* Similar accuracy to the most accurate Faster R-CNN configurations at a lower runtime of roughly 170ms per image.
* Switching to ResNet-50 decreases accuracy by about 2 percentage points mAP (at faster runtime). Switching to ResNet-152 seems to provide no measurable benefit.
* OHEM improves mAP by roughly 2 percentage points.
* À trous trick improves mAP by roughly 2 percentage points.
* Training on `k=1` (one heatmap per class) results in a failure, i.e. a model that fails to predict bounding boxes. `k=7` is slightly more accurate than `k=3`.
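The position-sensitive pooling described in the How section can be sketched as follows. This is a simplified illustration for a single class, using nested lists as heatmaps; the function names, the end-exclusive box convention and the toy `part_map` heatmaps are assumptions made for the example, not details from the paper.

```python
import math


def ps_roi_score(score_maps, box, k):
    """Score of one class for one box from k*k position-sensitive heatmaps.

    score_maps: list of k*k heatmaps (H x W nested lists), row-major over bins.
    box: (x0, y0, x1, y1) in heatmap pixels, end-exclusive (assumed convention).
    """
    x0, y0, x1, y1 = box
    bin_w, bin_h = (x1 - x0) / k, (y1 - y0) / k
    bin_means = []
    for row in range(k):
        for col in range(k):
            heatmap = score_maps[row * k + col]  # the map responsible for this bin
            ys = range(y0 + math.floor(row * bin_h), y0 + math.ceil((row + 1) * bin_h))
            xs = range(x0 + math.floor(col * bin_w), x0 + math.ceil((col + 1) * bin_w))
            values = [heatmap[y][x] for y in ys for x in xs]
            bin_means.append(sum(values) / len(values))  # average within the bin
    return sum(bin_means) / len(bin_means)  # average over all k*k bins


def part_map(row, col, size=8, cell=2):
    """Toy heatmap that fires only where part (row, col) of an object at (0,0)-(6,6) lies."""
    return [[1.0 if row * cell <= y < (row + 1) * cell and col * cell <= x < (col + 1) * cell
             else 0.0 for x in range(size)] for y in range(size)]


maps = [part_map(r, c) for r in range(3) for c in range(3)]
aligned = ps_roi_score(maps, (0, 0, 6, 6), k=3)   # box sits exactly on the object
shifted = ps_roi_score(maps, (2, 2, 8, 8), k=3)   # box is shifted by one bin
```

In this toy setup the aligned box collects a high response in every bin while the shifted box does not, which is the localization sensitivity that a single heatmap per class (`k=1`) cannot provide.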
[link]
* Style transfer between images works, in its original form, by iteratively making changes to a content image, so that its style matches more and more the style of a chosen style image.
* That iterative process is very slow.
* Alternatively, one can train a single feed-forward generator network to apply a style in one forward pass. The network is trained on a dataset of input images and their stylized versions (stylized versions can be generated using the iterative approach).
* So far, these generator networks were much faster than the iterative approach, but their quality was lower.
* They describe a simple change to these generator networks to increase the image quality (up to the same level as the iterative approach).

### How

* In the generator networks, they simply replace all batch normalization layers with instance normalization layers.
* Batch normalization normalizes using the information from the whole batch, while instance normalization normalizes each feature map on its own.
* Equations
  * Let `H` = Height, `W` = Width, `T` = Batch size
  * Batch Normalization:
    * ![Batch Normalization Equations](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Instance_Normalization_The_Missing_Ingredient_for_Fast_Stylization__batch_normalization.jpg?raw=true "Batch Normalization Equations")
  * Instance Normalization
    * ![Instance Normalization Equations](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Instance_Normalization_The_Missing_Ingredient_for_Fast_Stylization__instance_normalization.jpg?raw=true "Instance Normalization Equations")
* They apply instance normalization at test time too (identically).

### Results

* Same image quality as iterative approach (at a fraction of the runtime).
* One content image with two different styles using their approach:
  * ![Example](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Instance_Normalization_The_Missing_Ingredient_for_Fast_Stylization__example.jpg?raw=true "Example")
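The difference between the two normalizations comes down to which values the mean and variance are computed over. Below is a minimal sketch on nested lists with layout `batch[t][c][h][w]`; the function names are illustrative, `eps` is an assumed stabilizing constant, and the learned scale and shift parameters are omitted.

```python
def _mean_std(values, eps):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, (var + eps) ** 0.5


def instance_norm(batch, eps=1e-5):
    """Statistics per image and per channel (over H and W only)."""
    out = []
    for image in batch:
        normed = []
        for fmap in image:
            mean, std = _mean_std([v for row in fmap for v in row], eps)
            normed.append([[(v - mean) / std for v in row] for row in fmap])
        out.append(normed)
    return out


def batch_norm(batch, eps=1e-5):
    """Statistics per channel, shared across all T images (over T, H and W)."""
    n_channels = len(batch[0])
    stats = [_mean_std([v for image in batch for row in image[c] for v in row], eps)
             for c in range(n_channels)]
    return [[[[(v - stats[c][0]) / stats[c][1] for v in row] for row in image[c]]
             for c in range(n_channels)] for image in batch]


# Two one-channel "images" with the same pattern but very different contrast.
batch = [[[[1.0, 3.0]]], [[[10.0, 30.0]]]]
print(instance_norm(batch))  # both images map to roughly [-1, +1]
print(batch_norm(batch))     # the contrast difference between images is kept
```

With instance normalization each image is rescaled on its own, so the result no longer depends on the contrast of the individual content image or on which other images share the batch.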
[link]
Official code: https://github.com/anewell/posehgtrain

* They suggest a new model architecture for human pose estimation (i.e. "lay a skeleton over a person").
* Their architecture is based on progressive pooling followed by progressive upsampling, creating an hourglass form.
* Inputs are images showing a person's body.
* Outputs are K heatmaps (for K body joints), with each heatmap showing the likely position of a single joint on the person (e.g. "ankle", "wrist", "left hand", ...).

### How

* *Basic building block*
  * They use residuals as their basic building block.
  * Each residual has three layers: One 1x1 convolution for dimensionality reduction (from 256 to 128 channels), a 3x3 convolution, a 1x1 convolution for dimensionality increase (back to 256).
  * Visualized:
    * ![Building Block](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Stacked_Hourglass_Networks_for_Human_Pose_Estimation__building_block.jpg?raw=true "Building Block")
* *Architecture*
  * Their architecture starts with one standard 7x7 convolution that has strides of (2, 2).
  * They use MaxPooling (2x2, strides of (2, 2)) to downsample the images/feature maps.
  * They use Nearest Neighbour upsampling (factor 2) to upsample the images/feature maps.
  * After every pooling step they add three of their basic building blocks.
  * Before each pooling step they branch off the current feature map as a minor branch and apply three basic building blocks to it. Then they add it back to the main branch after that one has been upsampled again to the original size.
  * The feature maps between each basic building block have (usually) 256 channels.
  * Their HourGlass ends in two 1x1 convolutions that create the heatmaps.
  * They stack two of their HourGlass networks after each other. Between them they place an intermediate loss. That way, the second network can learn to improve the predictions of the first network.
  * Architecture visualized:
    * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Stacked_Hourglass_Networks_for_Human_Pose_Estimation__architecture.jpg?raw=true "Architecture")
* *Heatmaps*
  * The outputs generated by the network are heatmaps, one per joint.
  * Each ground truth heatmap has a small Gaussian peak at the correct position of a joint, everything else has value 0.
  * If a joint isn't visible, the ground truth heatmap for that joint is all zeros.
* *Other stuff*
  * They use batch normalization.
  * Activation functions are ReLUs.
  * They use RMSprop as their optimizer.
  * Implemented in Torch.

### Results

* They train and test on FLIC (only one HourGlass) and MPII (two stacked HourGlass networks).
* Training is done with augmentations (horizontal flip, up to 30 degrees rotation, scaling, no translation to keep the body of interest in the center of the image).
* Evaluation is done via PCK@0.2 (i.e. the percentage of predicted keypoints that are within 0.2 head sizes of their ground truth annotation, using the head size of the specific body).
* Results on FLIC are at >95%.
* Results on MPII are between 80.6% (ankle) and 97.6% (head). Average is 89.4%.
* Using two stacked HourGlass networks performs around 3% better than one HourGlass network (even when adjusting for parameters).
* Training time was 5 days on a Titan X (9xx generation).
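The ground truth heatmaps and the PCK metric described above can be sketched as follows. The function names, the value of `sigma` and the truncation of the peak at three standard deviations are assumptions for illustration; the notes only state that the peak is a small Gaussian and that everything else is 0.

```python
import math


def gaussian_heatmap(height, width, joint, sigma=1.0):
    """Ground truth heatmap for one joint; `joint` is (x, y) or None if not visible."""
    heatmap = [[0.0] * width for _ in range(height)]
    if joint is None:
        return heatmap  # invisible joint: all zeros
    jx, jy = joint
    for y in range(height):
        for x in range(width):
            dist_sq = (x - jx) ** 2 + (y - jy) ** 2
            if dist_sq <= (3 * sigma) ** 2:  # assumed cut-off so the rest stays exactly 0
                heatmap[y][x] = math.exp(-dist_sq / (2 * sigma ** 2))
    return heatmap


def pck(predicted, truth, head_size, threshold=0.2):
    """Fraction of predicted keypoints within threshold * head_size of the ground truth."""
    hits = sum(1 for p, t in zip(predicted, truth)
               if math.hypot(p[0] - t[0], p[1] - t[1]) <= threshold * head_size)
    return hits / len(truth)
```

For example, with a head size of 10 pixels a prediction counts as correct if it lies within 2 pixels of the annotated joint.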
[link]
* Most neural machine translation models currently operate on word vectors or one-hot vectors of words.
* They instead generate the vector of each word on a character level.
  * Thereby, the model can spot character similarities between words and treat them in a similar way.
  * They do that only for the source language, not for the target language.

### How

* They treat each word of the source text on its own.
* To each word they then apply the model from [Character-aware neural language models](https://arxiv.org/abs/1508.06615), i.e. they do per word:
  * Embed each character into a 620-dimensional space.
  * Stack these vectors next to each other, resulting in a 2D tensor in which each column is one of the vectors (i.e. shape `620xN` for `N` characters).
  * Apply convolutions of size `620xW` to that tensor, where a few different values are used for `W` (i.e. some convolutions cover few characters, some cover many characters).
  * Apply a tanh after these convolutions.
  * Apply a max-over-time to the results of the convolutions, i.e. for each convolution use only the maximum value.
  * Reshape to a 1D vector.
  * Apply two highway layers.
  * They get 1024-dimensional vectors (one per word).
  * Visualization of their steps:
    * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Characterbased_Neural_Machine_Translation__architecture.jpg?raw=true "Architecture")
* Afterwards they apply the model from [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473) to these vectors, yielding a translation to a target language.
* Whenever that translation yields an unknown target-language word ("UNK"), they replace it with the respective (untranslated) word from the source text.

### Results

* They use the German-English [WMT](http://www.statmt.org/wmt15/translationtask.html) dataset.
* BLEU improvements (compared to neural translation without character-level words):
  * German-English improves by about 1.5 points.
  * English-German improves by about 3 points.
* Reduction in the number of unknown target-language words (same baseline again):
  * German-English goes down from about 1500 to about 1250.
  * English-German goes down from about 3150 to about 2650.
* Translation examples (Phrase = phrase-based/non-neural translation, NN = non-character-based neural translation, CHAR = theirs):
  * ![Examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Characterbased_Neural_Machine_Translation__examples.jpg?raw=true "Examples")
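The convolution, tanh and max-over-time steps of the per-word character model can be sketched in plain Python. This is a toy version with tiny dimensions; the function name, the weight layout and the zero padding for words shorter than a filter are assumptions for the example, and the embedding lookup and highway layers are left out.

```python
import math


def char_cnn_features(char_vectors, filters):
    """One feature per filter for a single word.

    char_vectors: N embedding vectors of length D (one per character).
    filters: each filter is a list of W weight vectors of length D.
    """
    dim = len(char_vectors[0])
    features = []
    for filt in filters:
        width = len(filt)
        # Zero-pad short words so every filter fits at least once (an assumption).
        padded = char_vectors + [[0.0] * dim] * max(0, width - len(char_vectors))
        responses = []
        for start in range(len(padded) - width + 1):
            window = padded[start:start + width]
            total = sum(w * x for w_vec, x_vec in zip(filt, window)
                        for w, x in zip(w_vec, x_vec))
            responses.append(math.tanh(total))
        features.append(max(responses))  # max-over-time: keep only the strongest response
    return features
```

Because only the maximum response per filter is kept, the resulting vector has one entry per filter regardless of the word length, which is what allows words of different lengths to be fed to the same highway layers.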
[link]
* They suggest a single architecture that tries to solve the following tasks:
  * Face localization ("Where are faces in the image?")
  * Face landmark localization ("For a given face, where are its landmarks, e.g. eyes, nose and mouth?")
  * Face landmark visibility estimation ("For a given face, which of its landmarks are actually visible and which of them are occluded by other objects/people?")
  * Face roll, pitch and yaw estimation ("For a given face, what is its rotation on the x/y/z-axis?")
  * Face gender estimation ("For a given face, which gender does the person have?")

### How

* *Pretraining the base model*
  * They start with a basic model following the architecture of AlexNet.
  * They train that model to classify whether the input images are faces or not faces.
  * They then remove the fully connected layers, leaving only the convolutional layers.
* *Locating bounding boxes of face candidates*
  * They then use a [selective search and segmentation algorithm](https://www.robots.ox.ac.uk/~vgg/rg/papers/sande_iccv11.pdf) on images to extract bounding boxes of objects.
  * Each bounding box is considered a possible face.
  * Each bounding box is rescaled to 227x227.
* *Feature extraction per face candidate*
  * They feed each bounding box through the above-mentioned pretrained network.
  * They extract the activations of the network from the layers `max1` (27x27x96), `conv3` (13x13x384) and `pool5` (6x6x256).
  * They apply convolutions to the first two extracted tensors (from max1, conv3) so that their tensor shapes are reduced to 6x6xC.
  * They concatenate the three tensors to a 6x6x768 tensor.
  * They apply a 1x1 convolution to that tensor to reduce it to 6x6x192.
  * They feed the result through a fully connected layer resulting in 3072-dimensional vectors (per face candidate).
* *Classification and regression*
  * They feed each 3072-dimensional vector through 5 separate networks:
    1. Detection: Does the bounding box contain a face or no face. (2 outputs, i.e. yes/no)
    2. Landmark Localization: What are the coordinates of landmark features (e.g. mouth, nose, ...). (21 landmarks, each 2 values for x/y = 42 outputs total)
    3. Landmark Visibility: Which landmarks are visible. (21 yes/no outputs)
    4. Pose estimation: Roll, pitch, yaw of the face. (3 outputs)
    5. Gender estimation: Male/female face. (2 outputs)
  * Each of these networks contains a single fully connected layer with 512 nodes, followed by the output layer with the above-mentioned number of nodes.
* *Architecture Visualization*:
  * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/HyperFace__architecture.jpg?raw=true "Architecture")
* *Training*
  * The base model is trained once (see above).
  * The feature extraction layers and the five classification/regression networks are trained afterwards (jointly).
  * The loss functions for the five networks are:
    1. Detection: BCE (binary cross-entropy). Detected bounding boxes that have an overlap `>=0.5` with an annotated face are considered positive samples, bounding boxes with overlap `<0.35` are considered negative samples, everything in between is ignored.
    2. Landmark localization: Roughly MSE (mean squared error), with some weighting for visibility. Only bounding boxes with overlap `>0.35` are considered. Coordinates are normalized with respect to the bounding box's center, width and height.
    3. Landmark visibility: MSE (predicted visibility factor vs. expected visibility factor). Only for bounding boxes with overlap `>0.35`.
    4. Pose estimation: MSE.
    5. Gender estimation: BCE.
* *Testing*
  * They use two post-processing methods for detected faces:
    * Iterative Region Proposals:
      * They localize landmarks per face region.
      * Then they compute a more appropriate face bounding box based on the localized landmarks.
      * They feed that new bounding box through the network.
      * They compute the face score (face / not face, i.e. number between 0 and 1) for both bounding boxes and choose the one with the higher score.
      * This shrinks down bounding boxes that turned out to be too big.
      * The method visualized:
        * ![IRP](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/HyperFace__irp.jpg?raw=true "IRP")
    * Landmarks-based Non-Maximum Suppression:
      * When multiple detected face bounding boxes overlap, one has to choose which of them to keep.
      * A method to do that is to only keep the bounding box with the highest face score.
      * They instead use a median-of-k method.
      * Their steps are:
        1. Reduce every box in size so that it is a bounding box around the localized landmarks.
        2. For every box, find all bounding boxes with a certain amount of overlap.
        3. Among these bounding boxes, select the `k` ones with highest face score.
        4. Based on these boxes, create a new box whose size is derived from the median coordinates of the landmarks.
        5. Compute the median values for landmark coordinates, landmark visibility, gender, pose and use them as the respective values for the new box.

### Results

* Example results:
  * ![Example results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/HyperFace__example_results.jpg?raw=true "Example results")
* They test on AFW, AFLW, PASCAL, FDDB, CelebA.
* They achieve the best mean average precision values on PASCAL and AFW (compared to selected competitors).
* AFW results visualized:
  * ![AFW](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/HyperFace__afw.jpg?raw=true "AFW")
* Their approach achieves good performance on FDDB. It has some problems with small and/or blurry faces.
* If the feature fusion is removed from their approach (i.e. extracting features only from one fully connected layer at the end of the base network instead of merging feature maps from different convolutional layers), the accuracy of the predictions goes down.
* Their architecture ends in 5 shallow networks and shares many layers before them. If instead these networks share no or few layers, the accuracy of the predictions goes down.
* The post-processing of bounding boxes (via Iterative Region Proposals and Landmarks-based Non-Maximum Suppression) has a quite significant influence on the performance.
* Processing time per image is 3s, of which 2s is the selective search algorithm (for the bounding boxes).
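The overlap thresholds used to pick training samples for the detection loss can be sketched as below. The sketch assumes that "overlap" means intersection over union (IoU) and that boxes are given as `(x0, y0, x1, y1)`; both are assumptions for illustration, as are the function names.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_label(candidate, annotated_faces, pos=0.5, neg=0.35):
    """Label a selective-search candidate for the face / no-face loss."""
    best = max((iou(candidate, face) for face in annotated_faces), default=0.0)
    if best >= pos:
        return "positive"
    if best < neg:
        return "negative"
    return "ignore"  # between the thresholds: not used for the detection loss
```

Leaving a gap between the two thresholds keeps ambiguous candidates, which cover only part of a face, out of both the positive and the negative set.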
[link]
* When using pretrained networks (like VGG) to solve tasks, one has to use features generated by these networks.
* These features come from specific layers, e.g. from the fully connected layers at the end of the network.
* They test whether the features from fully connected layers or from the last convolutional layer are better suited for face attribute prediction.

### How

* Base networks
  * They use standard architectures for their test networks, specifically the architectures of FaceNet and VGG (very deep version).
  * They modify both architectures to use PReLUs.
  * They do not use the pretrained weights, instead they train the networks on their own.
  * They train them on the WebFace dataset (350k images, 10k different identities) to classify the identity of the shown person.
* Attribute prediction
  * After training of the base networks, they train a separate SVM to predict attributes of faces.
  * The datasets used for this step are CelebA (100k images, 10k identities) and LFWA (13k images, 6k identities).
  * Each image in these datasets is annotated with 40 binary face attributes.
    * Examples for attributes: Eyeglasses, bushy eyebrows, big lips, ...
  * The features for the SVM are extracted from the base networks (i.e. feed forward a face through the network, then take the activations of a specific layer).
  * The following features are tested:
    * FC2: Activations of the second fully connected layer of the base network.
    * FC1: As FC2, but the first fully connected layer.
    * Spat 3x3: Activations of the last convolutional layer, max-pooled so that their widths and heights are both 3 (i.e. shape Cx3x3).
    * Spat 1x1: Same as "Spat 3x3", but max-pooled to Cx1x1.

### Results

* The SVMs trained on "Spat 1x1" performed overall worst, the ones trained on "Spat 3x3" performed best.
* The accuracy order was roughly: `Spat 3x3 > FC1 > FC2 > Spat 1x1`.
* This effect was consistent for both networks (VGG, FaceNet) and for other training datasets as well.
* FC2 performed particularly badly for the "blurry" attribute (most likely because that was unimportant to the classification task).
* Accuracy comparison per attribute:
  * ![Comparison](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Face_Attribute_Prediction_Using_OfftheShelf_CNN_Features__comparison.png?raw=true "Comparison")
* The conclusion is that when using pretrained networks one should not only try the last fully connected layer. Many characteristics of the input image might not appear any more in that layer (and later ones in general) as they were unimportant to the classification task.
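The "Spat 3x3" and "Spat 1x1" features can be illustrated with a small pooling sketch. How exactly the paper places its pooling windows is not stated in these notes, so the bin boundaries below (floor/ceil splitting, as in adaptive pooling) are an assumption, as are the function names.

```python
import math


def spatial_max_pool(feature_maps, out_size):
    """Max-pool each H x W map of a C x H x W tensor to out_size x out_size."""
    pooled = []
    for fmap in feature_maps:
        h, w = len(fmap), len(fmap[0])
        grid = []
        for i in range(out_size):
            rows = range(math.floor(i * h / out_size), math.ceil((i + 1) * h / out_size))
            grid.append([
                max(fmap[y][x] for y in rows
                    for x in range(math.floor(j * w / out_size),
                                   math.ceil((j + 1) * w / out_size)))
                for j in range(out_size)])
        pooled.append(grid)
    return pooled


def flatten(tensor):
    """Turn a C x S x S tensor into the flat feature vector given to the SVM."""
    return [v for fmap in tensor for row in fmap for v in row]
```

Pooling to 3x3 keeps a coarse notion of where in the face a response occurred, while pooling to 1x1 discards all spatial layout, which fits the observation that "Spat 1x1" performed worst.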
[link]
* They describe a model to locate faces in images.
* Their model uses information from suspected face regions *and* from the corresponding suspected body regions to classify whether a region contains a face.
* The intuition is that seeing the region around the face (specifically where the body should be) can help in estimating whether a suspected face is really a face (e.g. it might also be part of a painting, statue or doll).

### How

* Their whole model is called "CMS-RCNN" (Contextual Multi-Scale Region-CNN).
* It is based on the "Faster R-CNN" architecture.
* It uses the VGG network.
* Subparts of their model are: MS-RPN, CMS-CNN.
* MS-RPN finds candidate face regions. CMS-CNN refines their bounding boxes and classifies them (face / not face).
* **MS-RPN** (Multi-Scale Region Proposal Network)
  * "Looks" at the feature maps of the network (VGG) at multiple scales (i.e. before/after pooling layers) and suggests regions for possible faces.
  * Steps:
    * Feed an image through the VGG network.
    * Extract the feature maps of the three last convolutions that are before a pooling layer.
    * Pool these feature maps so that they have the same heights and widths.
    * Apply L2 normalization to each feature map so that they all have the same scale.
    * Apply a 1x1 convolution to merge them to one feature map.
    * Regress face bounding boxes from that feature map according to the Faster R-CNN technique.
* **CMS-CNN** (Contextual Multi-Scale CNN):
  * "Looks" at feature maps of face candidates found by MS-RPN and classifies whether these regions contain faces.
  * It also uses the same multi-scale technique (i.e. take feature maps from convs before pooling layers).
  * It uses some area around these face regions as additional information (suspected regions of bodies).
  * Steps:
    * Receive face candidate regions from MS-RPN.
    * Do per candidate region:
      * Calculate the suspected coordinates of the body (only based on the x/y-position and size of the face region, i.e. not learned).
      * Extract the feature maps of the *face* region (at multiple scales) and apply RoIPooling to it (i.e. convert to a fixed height and width).
      * Extract the feature maps of the *body* region (at multiple scales) and apply RoIPooling to it (i.e. convert to a fixed height and width).
      * L2-normalize each feature map.
      * Concatenate the (RoI-pooled and normalized) feature maps of the face (at multiple scales) with each other (creates one tensor).
      * Concatenate the (RoI-pooled and normalized) feature maps of the body (at multiple scales) with each other (creates another tensor).
      * Apply a 1x1 convolution to the face tensor.
      * Apply a 1x1 convolution to the body tensor.
      * Apply two fully connected layers to the face tensor, creating a vector.
      * Apply two fully connected layers to the body tensor, creating a vector.
      * Concatenate both vectors.
      * Based on that vector, make a classification of whether it is really a face.
      * Based on that vector, make a regression of the face's final bounding box coordinates and dimensions.
* Note: They use the multi-scale approach in both networks in order to be able to find small or tiny faces. Otherwise, after pooling these small faces would be hard or impossible to detect.

### Results

* Adding context to the classification (i.e. the body regions) empirically improves the results.
* Their model achieves the highest recall rate on FDDB compared to other models. However, it has lower recall if only very few false positives are accepted.
* FDDB ROC curves (theirs is bold red):
  * ![FDDB results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/CMSRCNN__fddb.jpg?raw=true "FDDB results")
* Example results on FDDB:
  * ![FDDB examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/CMSRCNN__examples.jpg?raw=true "FDDB examples")
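The reason for L2-normalizing before concatenation can be shown with a toy sketch. It treats the pooled features of each scale as one flat vector, which simplifies the actual tensors; the function names and the `eps` constant are assumptions, and any learned rescaling after the normalization is omitted.

```python
def l2_normalize(vector, eps=1e-12):
    """Scale a feature vector to (approximately) unit L2 norm."""
    norm = sum(v * v for v in vector) ** 0.5
    return [v / (norm + eps) for v in vector]


def fuse_scales(features_per_scale):
    """Normalize the features of every scale, then concatenate them."""
    fused = []
    for features in features_per_scale:
        fused.extend(l2_normalize(features))
    return fused


# Features from a deeper layer can be orders of magnitude larger than earlier ones.
print(fuse_scales([[3.0, 4.0], [300.0, 400.0]]))  # roughly [0.6, 0.8, 0.6, 0.8]
```

Without the normalization, the layer with the largest activations would dominate the concatenated tensor and the following 1x1 convolution.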
[link]
* Usually GANs transform a noise vector `z` into images. `z` might be sampled from a normal or uniform distribution.
* The effect of this is that the components in `z` are deeply entangled.
  * Changing single components has hardly any influence on the generated images. One has to change multiple components to affect the image.
  * The components end up not being interpretable. Ideally one would like to have meaningful components, e.g. for human faces one that controls the hair length and a categorical one that controls the eye color.
* They suggest a change to GANs based on Mutual Information, which leads to interpretable components.
  * E.g. for MNIST a component that controls the stroke thickness and a categorical component that controls the digit identity (1, 2, 3, ...).
  * These components are learned in a (mostly) unsupervised fashion.

### How

* The latent code `c`
  * "Normal" GANs parameterize the generator as `G(z)`, i.e. G receives a noise vector and transforms it into an image.
  * This is changed to `G(z, c)`, i.e. G now receives a noise vector `z` and a latent code `c` and transforms both into an image.
  * `c` can contain multiple variables following different distributions, e.g. in MNIST a categorical variable for the digit identity and a Gaussian one for the stroke thickness.
* Mutual Information
  * If using a latent code via `G(z, c)`, nothing forces the generator to actually use `c`. It can easily ignore it and just degenerate to `G(z)`.
  * To prevent that, they force G to generate images `x` in a way that `c` must be recoverable. So, if you have an image `x` you must be able to reliably tell which latent code `c` it has, which means that G must use `c` in a meaningful way.
  * This relationship can be expressed with mutual information, i.e. the mutual information between `x` and `c` must be high.
    * The mutual information between two variables X and Y is defined as `I(X; Y) = entropy(X) - entropy(X|Y) = entropy(Y) - entropy(Y|X)`.
    * If the mutual information between X and Y is high, then knowing Y helps you to decently predict the value of X (and the other way round).
    * If the mutual information between X and Y is low, then knowing Y doesn't tell you much about the value of X (and the other way round).
  * The new GAN loss becomes `old loss - lambda * I(G(z, c); c)`, i.e. the higher the mutual information, the lower the result of the loss function.
* Variational Mutual Information Maximization
  * In order to maximize `I(G(z, c); c)`, one has to know the distribution `P(c|x)` (from image to latent code), which however is unknown.
  * So instead they create `Q(c|x)`, which is an approximation of `P(c|x)`.
  * `I(G(z, c); c)` is then computed using a lower bound maximization, similar to the one in variational autoencoders (called "Variational Information Maximization", hence the name "InfoGAN").
  * Basic equation: `LowerBoundOfMutualInformation(G, Q) = E[log Q(c|x)] + H(c) <= I(G(z, c); c)`
    * `c` is the latent code.
    * `x` is the generated image.
    * `H(c)` is the entropy of the latent codes (constant throughout the optimization).
    * Optimization w.r.t. Q is done directly.
    * Optimization w.r.t. G is done via the reparameterization trick.
  * If `Q(c|x)` approximates `P(c|x)` *perfectly*, the lower bound becomes the mutual information ("the lower bound becomes tight").
  * In practice, `Q(c|x)` is implemented as a neural network. Both Q and D have to process the generated images, which means that they can share many convolutional layers, significantly reducing the extra cost of training Q.

### Results

* MNIST
  * They use for `c` one categorical variable (10 values) and two continuous ones (uniform between -1 and +1).
  * InfoGAN learns to associate the categorical one with the digit identity and the continuous ones with rotation and width.
  * Applying `Q(c|x)` to an image and then classifying only on the categorical variable (i.e. fully unsupervised) yields 95% accuracy.
  * Sampling new images with exaggerated continuous variables in the range `[-2,+2]` yields sound images (i.e. the network generalizes well).
  * ![MNIST examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/InfoGAN__mnist.png?raw=true "MNIST examples")
* 3D face images
  * InfoGAN learns to represent the faces via pose, elevation, lighting.
  * They used five uniform variables for `c`. (So two of them apparently weren't associated with anything sensible? They are not mentioned.)
* 3D chair images
  * InfoGAN learns to represent the chairs via identity (categorical) and rotation or width (apparently they did two experiments).
  * They used one categorical variable (four values) and one continuous variable (uniform `[-1, +1]`).
* SVHN
  * InfoGAN learns to represent lighting and to spot the center digit.
  * They used four categorical variables (10 values each) and two continuous variables (uniform `[-1, +1]`). (Again, a few variables were apparently not associated with anything sensible?)
* CelebA
  * InfoGAN learns to represent pose, presence of sunglasses (not perfectly), hair style and emotion (in the sense of "smiling or not smiling").
  * They used 10 categorical variables (10 values each). (Again, a few variables were apparently not associated with anything sensible?)
  * ![CelebA examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/InfoGAN__celeba.png?raw=true "CelebA examples")
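The relationship between the mutual information and its variational lower bound can be checked numerically on a tiny discrete example. Here the "image" `x` is replaced by a discrete variable so that everything can be computed exactly; the joint distribution and the function names are made up for illustration.

```python
import math


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def mutual_information(joint):
    """I(c; x) = H(c) - H(c|x), with joint[c][x] = P(c, x)."""
    p_c = [sum(row) for row in joint]
    p_x = [sum(col) for col in zip(*joint)]
    h_c_given_x = -sum(p * math.log(p / p_x[x])
                       for row in joint for x, p in enumerate(row) if p > 0)
    return entropy(p_c) - h_c_given_x


def lower_bound(joint, q):
    """E[log Q(c|x)] + H(c), where q[x][c] approximates P(c|x)."""
    p_c = [sum(row) for row in joint]
    expected_log_q = sum(p * math.log(q[x][c])
                         for c, row in enumerate(joint)
                         for x, p in enumerate(row) if p > 0)
    return expected_log_q + entropy(p_c)


joint = [[0.4, 0.1], [0.1, 0.4]]          # toy P(c, x)
exact_q = [[0.8, 0.2], [0.2, 0.8]]        # the true posterior P(c|x)
uniform_q = [[0.5, 0.5], [0.5, 0.5]]      # a Q that ignores x
print(mutual_information(joint))
print(lower_bound(joint, exact_q))        # equals the mutual information (tight)
print(lower_bound(joint, uniform_q))      # about 0: an uninformative Q gives a loose bound
```

The better `Q(c|x)` matches the true posterior, the closer the bound gets to the mutual information, which is why training Q and G jointly on this bound pushes G to keep `c` recoverable from the image.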
[link]
* They describe an architecture for deep CNNs that contains short and long paths. (Short = few convolutions between input and output, long = many convolutions between input and output) * They achieve comparable accuracy to residual networks, without using residuals. ### How * Basic principle: * They start with two branches. The left branch contains one convolutional layer, the right branch contains a subnetwork. * That subnetwork again contains a left branch (one convolutional layer) and a right branch (a subnetwork). * This creates a recursion. * At the last step of the recursion they simply insert two convolutional layers as the subnetwork. * Each pair of branches (left and right) is merged using a pairwise mean. (Result: One of the branches can be skipped or removed and the result after the merge will still be sound.) * Their recursive expansion rule (left) and architecture (middle and right) visualized: ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/FractalNet_UltraDeep_Networks_without_Residuals__architecture.png?raw=true "Architecture") * Blocks: * Each of the recursively generated networks is one block. * They chain five blocks in total to create the network that they use for their experiments. * After each block they add a max pooling layer. * Their first block uses 64 filters per convolutional layer, the second one 128, followed by 256, 512 and again 512. * Drop-path: * They randomly drop whole convolutional layers between merge layers. * They define two methods for that: * Local drop-path: Drops each input to each merge layer with a fixed probability, but at least one always survives. (See image, first three examples.) * Global drop-path: Drops convolutional layers so that only a single column (and thereby path) in the whole network survives. (See image, right.)
* Visualization: ![Drop-path](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/FractalNet_UltraDeep_Networks_without_Residuals__drop_path.png?raw=true "Drop-path") ### Results * They test on CIFAR-10, CIFAR-100 and SVHN with no or mild (crops, flips) augmentation. * They add dropout at the start of each block (probabilities: 0%, 10%, 20%, 30%, 40%). * They use local drop-path at 15% for 50% of the batches and global drop-path for the other 50%. * They achieve comparable accuracy to ResNets (a bit behind them actually). * Note: The best ResNet that they compare to is "ResNet with Identity Mappings". They don't compare to Wide ResNets, even though they perform best. * If they use image augmentations, dropout and drop-path don't seem to provide much benefit (only small improvement). * If they extract the deepest column and test on that one alone, they achieve nearly the same performance as with the whole network. * They conclude from this that their fractal architecture mainly serves to help that deepest column learn anything at all. (Without shorter paths it would just learn nothing due to vanishing gradients.)
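The mean-merge with local drop-path described under "How" can be sketched as follows. This is an illustrative toy, not the paper's code: branch outputs are plain lists of floats, and the function name, the `rng` parameter and the way the surviving branch is picked when everything was dropped are assumptions.

```python
import random


def local_drop_path_merge(branch_outputs, drop_prob, rng=random):
    """Mean-merge branch outputs, dropping each with probability drop_prob.

    branch_outputs: list of equally long lists of floats, one per branch.
    At least one branch always survives: if every branch was dropped,
    one of them is picked at random and kept.
    """
    kept = [b for b in branch_outputs if rng.random() >= drop_prob]
    if not kept:
        kept = [rng.choice(branch_outputs)]
    n = len(kept)
    # Element-wise mean over the surviving branches only.
    return [sum(values) / n for values in zip(*kept)]
```

Because the merge is a mean over whichever branches survive, the scale of the merged activation stays roughly the same regardless of how many inputs were dropped, which is why removing a branch still leaves a sound result.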
[link]
* They describe a convolutional network that takes in photos and returns where (on the planet) these photos were likely taken. * The output is a distribution over locations around the world (so not just one single location). This can be useful in the case of ambiguous images. ### How * Basic architecture * They simply use the Inception architecture for their model. * They have 97M parameters. * Grid * The network uses a grid of cells over the planet. * For each photo and every grid cell it returns the likelihood that the photo was taken within the region covered by the cell (simple softmax layer). * The naive way would be to use a regular grid around the planet (i.e. a grid in which all cells have the same size). * Possible disadvantages: * In places where lots of photos are taken you still have the same grid cell size as in places where barely any photos are taken. * Maps are often distorted towards the poles (countries are represented much larger than they really are). This will likely affect the grid cells too. * They instead use an adaptive grid pattern based on S2 cells. * S2 cells interpret the planet as a sphere and project a cube onto it. * The 6 sides of the cube are then partitioned using quad trees, creating the grid cells. * They don't use the same depth for all quad trees. Instead they subdivide them only if their leaves contain enough photos (based on their dataset of geolocated images). * They remove some cells for which their dataset does not contain enough images, e.g. cells on oceans. (They also remove these images from the dataset. They don't say how many images are affected by this.) * They end up with roughly 26k cells, some of them reaching the street level of major cities.
* Visualization of their cells: ![S2 cells](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/PlaNet__S2.jpg?raw=true "S2 cells") * Training * For each example photo that they feed into the network, they set the correct grid cell to `1.0` and all other grid cells to `0.0`. * They train on a dataset of 126M images with Exif geolocation information. The images were collected from all over the web. * They used Adagrad. * They trained on 200 CPUs for 2.5 months. * Album network * For photo albums they develop variations of their network. * They do that because albums often contain images that are very hard to geolocate on their own, but much easier if the other images of the album are seen. * They use LSTMs for their album network. * The simplest one just iterates over every photo, applies their previously described model to it and extracts the last layer (before output) from that model. These vectors (one per image) are then fed into an LSTM, which is trained to predict (again) the grid cell location per image. * More complicated versions use multiple passes or are bidirectional LSTMs (to use the information from the last images to classify the first ones in the album). ### Results * They beat previous models (based on hand-engineered features or nearest neighbour methods) by a significant margin. * In a small experiment they can beat experienced humans in geoguessr.com. * Based on a dataset of 2.3M photos from Flickr, their method correctly predicts the country where the photo was taken in 30% of all cases (top-1; top-5: about 50%). City-level accuracy is about 10% (top-1; top-5: about 18%). * Example predictions (using a coarser grid with 354 cells): ![Examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/PlaNet__examples.png?raw=true "Examples") * Using the LSTM technique for albums significantly improves prediction accuracy for these images.
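The adaptive grid from the "Grid" part can be illustrated with a small quadtree sketch. It works on a flat square instead of the faces of the S2 cube, and the thresholds `t_max` (split a cell holding more photos than this) and `t_min` (discard a cell holding fewer photos than this) as well as `max_depth` are hypothetical names introduced here, not values taken from the summary.

```python
def subdivide(photos, x0, y0, x1, y1, t_max, t_min, max_depth):
    """Return the leaf cells (x0, y0, x1, y1, count) of an adaptive quadtree.

    photos: list of (x, y) points inside the half-open box [x0, x1) x [y0, y1).
    """
    if len(photos) < t_min:
        return []  # too few photos (e.g. open ocean): drop the cell
    if len(photos) <= t_max or max_depth == 0:
        return [(x0, y0, x1, y1, len(photos))]
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    quadrants = [
        (x0, y0, xm, ym), (xm, y0, x1, ym),
        (x0, ym, xm, y1), (xm, ym, x1, y1),
    ]
    cells = []
    for ax, ay, bx, by in quadrants:
        inside = [(x, y) for (x, y) in photos if ax <= x < bx and ay <= y < by]
        cells += subdivide(inside, ax, ay, bx, by, t_max, t_min, max_depth - 1)
    return cells
```

Dense regions end up with many small cells and sparse regions with a few large ones, so each output class of the softmax covers a comparable number of training photos.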
[link]
* What * They describe a new architecture for GANs. * The architecture is based on letting the Generator (G) create images in multiple steps, similar to DRAW. * They also briefly suggest a method to compare the quality of the results of different generators with each other. * How * In a classic GAN one samples a noise vector `z`, feeds that into a Generator (`G`), which then generates an image `x`, which is then fed through the Discriminator (`D`) to estimate its quality. * Their method operates in basically the same way, but internally G is changed to generate images in multiple time steps. * Outline of how their G operates: * Time step 0: * Input: Empty image `delta C-1`, randomly sampled `z`. * Feed `delta C-1` through a number of downsampling convolutions to create a tensor. (Not very useful here, as the image is empty. More useful in later time steps.) * Feed `z` through a number of upsampling convolutions to create a tensor (similar to DCGAN). * Concat the output of the previous two steps. * Feed that concatenation through a few more convolutions. * Output: `delta C0` (changes to apply to the empty starting canvas). * Time step 1 (and later): * Input: Previous change `delta C0`, randomly sampled `z` (can be the same as in step 0). * Feed `delta C0` through a number of downsampling convolutions to create a tensor. * Feed `z` through a number of upsampling convolutions to create a tensor (similar to DCGAN). * Concat the output of the previous two steps. * Feed that concatenation through a few more convolutions. * Output: `delta C1` (further changes to apply to the canvas). * At the end, after all time steps have been performed: * Create final output image by summing all the changes, i.e. `delta C0 + delta C1 + ...`, which basically means `empty start canvas + changes from time step 0 + changes from time step 1 + ...`.
* Their architecture as an image: * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Generating_Images_with_Recurrent_Adversarial_Networks__architecture.png?raw=true "Architecture") * Comparison measure * They suggest a new method to compare GAN results with each other. * They suggest training pairs of G and D, e.g. two pairs (G1, D1), (G2, D2). Then they let the pairs compete with each other. * To estimate the quality of D they suggest `r_test = errorRate(D1, testset) / errorRate(D2, testset)`. ("Which D is better at spotting that the test set images are real images?") * To estimate the quality of the generated samples they suggest `r_sample = errorRate(D1, images by G2) / errorRate(D2, images by G1)`. ("Which G is better at fooling an unknown D, i.e. possibly better at generating lifelike images?") * They suggest estimating which G is better using `r_sample` and then estimating how valid that result is using `r_test`. * Results * Generated images of churches, with time steps 1 to 5: * ![Churches](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Generating_Images_with_Recurrent_Adversarial_Networks__churches.jpg?raw=true "Churches") * Overfitting * They saw no indication of overfitting in the sense of memorizing images from the training dataset. * However, they saw some indication of G just interpolating between some good images and of G reusing small image patches in different images. * Randomness of noise vector `z`: * Sampling the noise vector once seems to be better than resampling it at every time step. * Resampling it at every time step often led to very similar looking output images.
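The two ratios from the "Comparison measure" part are easy to compute once the discriminators' verdicts are available. The following is a minimal sketch under stated assumptions: a discriminator is modelled as any callable returning `True` for "real", and the helper names are made up for illustration. It does not guard against a zero error rate in the denominator.

```python
def error_rate(discriminator, images, label):
    """Fraction of images on which the discriminator's verdict differs from label."""
    wrong = sum(1 for img in images if discriminator(img) != label)
    return wrong / len(images)


def battle(d1, d2, g1_images, g2_images, test_images):
    """Return (r_test, r_sample) for the pairs (G1, D1) and (G2, D2)."""
    # Both discriminators judge real test images, which they should call real.
    r_test = error_rate(d1, test_images, True) / error_rate(d2, test_images, True)
    # Each discriminator judges the *other* pair's fakes, which it should call fake.
    r_sample = error_rate(d1, g2_images, False) / error_rate(d2, g1_images, False)
    return r_test, r_sample
```

An `r_sample` below 1 means D1 was fooled by G2 less often than D2 was fooled by G1, which points to G1 as the stronger generator; `r_test` close to 1 indicates the two discriminators are comparably good, so that comparison is a fair one.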
[link]
* They suggest a new architecture for GANs. * Their architecture adds another Generator for a reverse branch (from images to noise vector `z`). * Their architecture takes some ideas from VAEs/variational neural nets. * Overall they can improve on the previous state of the art (DCGAN). ### How * Architecture * Usually, in GANs one feeds a noise vector `z` into a Generator (G), which then generates an image (`x`) from that noise. * They add a reverse branch (G2), in which another Generator takes a real image (`x`) and generates a noise vector `z` from that. * The noise vector can now be viewed as a latent space vector. * Instead of letting G2 output one fixed value per entry of `z`, they take the approach commonly used in VAEs and let it output a *distribution* per entry. * That is, if `z` represents `N` latent variables, they let G2 generate `N` means and `N` variances of gaussian distributions, with each distribution representing one value of `z`. * So the model could e.g. represent something along the lines of "this face looks a lot like a female, but with very low probability could also be male". * Training * The Discriminator (D) is now trained on pairs of either `(real image, generated latent space vector)` or `(generated image, randomly sampled latent space vector)` and has to tell them apart from each other. * Both Generators are trained to maximally confuse D. * G1 (from `z` to `x`) confuses D maximally, if it generates new images that (a) look real and (b) fit well to the latent variables in `z` (e.g. if `z` says "image contains a cat", then the image should contain a cat). * G2 (from `x` to `z`) confuses D maximally, if it generates good latent variables `z` that fit to the image `x`. * Continuous variables * The variables in `z` follow gaussian distributions, which makes the training more complicated, as you can't trivially backpropagate through gaussians.
* When training G1 (from `z` to `x`) the situation is easy: You draw a random `z` vector following a gaussian distribution (`N(0, I)`). (This is basically the same as in "normal" GANs. They just often use uniform distributions instead.) * When training G2 (from `x` to `z`) the situation is a bit harder. * Here we need to use the reparameterization trick. * That roughly means that G2 predicts the means and variances of the gaussian variables in `z` and then we draw a sample of `z` according to exactly these means and variances. * That sample gives us concrete values for our backpropagation. * If we do that sampling often enough, we get a good approximation of the true gradient (of the continuous variables). (Monte Carlo approximation.) * Results * Images generated based on CelebA dataset: * ![CelebA samples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Adversarially_Learned_Inference__celebasamples.png?raw=true "CelebA samples") * Left column per pair: Real image, right column per pair: reconstruction (`x -> z` via G2, then `z -> x` via G1) * ![CelebA reconstructions](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Adversarially_Learned_Inference__celebareconstructions.png?raw=true "CelebA reconstructions") * Reconstructions of SVHN, notice how the digits often stay the same, while the font changes: * ![SVHN reconstructions](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Adversarially_Learned_Inference__svhnreconstructions.png?raw=true "SVHN reconstructions") * CIFAR-10 samples, still lots of errors, but some quite correct: * ![CIFAR-10 samples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Adversarially_Learned_Inference__cifar10samples.png?raw=true "CIFAR-10 samples")
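The reparameterization trick mentioned under "Continuous variables" boils down to writing the sample as a deterministic function of the predicted means and variances plus independent noise. A minimal sketch, assuming G2's outputs are available as plain lists (the function name and the `rng` parameter are illustrative, not from the paper):

```python
import math
import random


def sample_z(means, variances, rng=random):
    """Draw z_i = mu_i + sigma_i * eps_i with eps_i ~ N(0, 1).

    All randomness lives in eps, so z is a differentiable function of
    the predicted means and variances.
    """
    z = []
    for mu, var in zip(means, variances):
        sigma = math.sqrt(var)
        eps = rng.gauss(0.0, 1.0)
        z.append(mu + sigma * eps)
    return z
```

In an autodiff framework the gradient of the loss with respect to `mu` and `sigma` then flows through `mu + sigma * eps` like through any other arithmetic, with `eps` treated as a constant input.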
[link]
* Autoencoders typically have some additional criterion that pushes them towards learning meaningful representations. * E.g. L1 penalty on the code layer (z), Dropout on z, Noise on z. * Often, representations with sparse activations are considered meaningful (so that each activation reflects a clear concept). * This paper introduces another technique that leads to sparsity. * They use a rank ordering on z. * The first (according to the ranking) activations have to do most of the reconstruction work of the data (i.e. image). ### How * Basic architecture: * They use an Autoencoder architecture: Input -> Encoder -> z -> Decoder -> Output. * Their encoder and decoder seem to be empty, i.e. z is the only hidden layer in the network. * Their output is not just one image (or whatever is encoded), instead they generate one for every unit in layer z. * Then they order these outputs based on the activation of the units in z (rank ordering), i.e. the output of the unit with the highest activation is placed in the first position, the output of the unit with the 2nd highest activation gets the 2nd position and so on. * They then generate the final output image based on a cumulative sum. So for three reconstructed output images `I1, I2, I3` (rank ordered that way) they would compute `final image = I1 + (I1+I2) + (I1+I2+I3)`. * They then compute the error based on that reconstruction (`reconstruction - input image`) and backpropagate it. * Cumulative sum: * Using the cumulative sum puts most optimization pressure on units with high activation, as they have the largest influence on the reconstruction error. * The cumulative sum is best optimized by letting few units have high activations and generate most of the output (correctly). All the other units have ideally low to zero activations and low or no influence on the output. (Though if the output generated by the first units is wrong, you should then end up with an extremely high cumulative error sum...)
* So their `z` coding should end up with few but high activations, i.e. it should become very sparse. * The cumulative sum generates an individual error per output, while an ordinary sum generates the same error for every output. They argue that this "blurs" the error less. * To avoid blow-ups in their network they use TReLUs, which saturate below 0 and above 1, i.e. `min(1, max(0, input))`. * They use a custom derivative function for the TReLUs, which is dependent on both the input value of the unit and its gradient. Basically, if the input is `>1` (saturated) and the error is high, then the derivative pushes the weight down, so that the input gets into the unsaturated regime. Similarly for input values `<0` (pushed up). If the input value is between 0 and 1 and/or the error is low, then nothing is changed. * They argue that the algorithmic complexity of the rank ordering should be low, due to sorts being `O(n log(n))`, where `n` is the number of hidden units in `z`. ### Results * They autoencode 7x7 patches from CIFAR-10. * They get very sparse activations. * Training and test loss develop identically, i.e. no overfitting.
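The rank ordering, the running sums and the TReLU can be sketched in a few lines. This is a toy under stated assumptions: per-unit outputs are given as plain lists of floats, how each unit produces its output is left out, and the function names are invented for illustration.

```python
def trelu(x):
    """Truncated ReLU: saturates below 0 and above 1."""
    return min(1.0, max(0.0, x))


def rank_ordered_partial_sums(activations, unit_outputs):
    """Order per-unit outputs by activation (highest first) and return
    the running sums I1, I1+I2, I1+I2+I3, ...

    activations: one activation per unit in z.
    unit_outputs: one output (list of floats) per unit in z.
    """
    order = sorted(
        range(len(activations)), key=lambda i: activations[i], reverse=True
    )
    running = [0.0] * len(unit_outputs[0])
    partial_sums = []
    for i in order:
        running = [r + o for r, o in zip(running, unit_outputs[i])]
        partial_sums.append(running)
    return partial_sums
```

The output of the top-ranked unit appears in every partial sum, the second one in all but the first, and so on, which is where the extra optimization pressure on highly active units comes from.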
[link]
* The authors start with a standard ResNet architecture (i.e. residual network as suggested in "Identity Mappings in Deep Residual Networks"). * Their residual block: ![Residual block](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Wide_Residual_Networks__residual_block.png?raw=true "Residual block") * Several residual blocks of 16 filters per conv layer, followed by 32 and then 64 filters per conv layer. * They empirically try to answer the following questions: * How many residual blocks are optimal? (Depth) * How many filters should be used per convolutional layer? (Width) * How many convolutional layers should be used per residual block? * Does Dropout between the convolutional layers help? ### Results * *Layers per block and kernel sizes*: * Using 2 convolutional layers per residual block seems to perform best: ![Convs per block](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Wide_Residual_Networks__convs_per_block.png?raw=true "Convs per block") * Using 3x3 kernel sizes for both layers seems to perform best. * However, using 3 layers with kernel sizes 3x3, 1x1, 3x3 and then using fewer residual blocks performs nearly as well and decreases the required time per batch. * *Width and depth*: * Increasing the width considerably improves the test error. * They achieve the best results (on CIFAR-10) when decreasing the depth to 28 convolutional layers, with each having 10 times their normal width (i.e. 16\*10 filters, 32\*10 and 64\*10): ![Depth and width results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Wide_Residual_Networks__depth_and_width.png?raw=true "Depth and width results") * They argue that their results show no evidence that would support the common theory that thin and deep networks are somehow regularized better than wide and shallow(er) networks. * *Dropout*: * They use dropout with p=0.3 (CIFAR) and p=0.4 (SVHN).
* On CIFAR-10 dropout doesn't seem to consistently improve test error. * On CIFAR-100 and SVHN dropout seems to lead to improvements that are either small (wide and shallower net, i.e. depth=28, width multiplier=10) or significant (ResNet-50). ![Dropout](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Wide_Residual_Networks__dropout.png?raw=true "Dropout") * They also observed oscillations in error (both train and test) during the training. Adding dropout decreased these oscillations. * *Computational efficiency*: * Applying few big convolutions is much more efficient on GPUs than applying many small ones sequentially. * Their network with the best test error is 1.6 times faster than ResNet-1001, despite having about 3 times more parameters.
[link]
* The well known method of Artistic Style Transfer can be used to generate new texture images (from an existing example) by skipping the content loss and only using the style loss. * The method however can have problems with large scale structures and quasi-periodic patterns. * They add a new loss based on the spectrum of the images (synthesized image and style image), which decreases these problems and handles especially periodic patterns well. ### How * Everything is handled in the same way as in the Artistic Style Transfer paper (without content loss). * On top of that they add their spectrum loss: * The loss is based on a squared distance, i.e. $\frac{1}{2} d(I_s, I_t)^2$. * $I_s$ is the last synthesized image. * $I_t$ is the texture example. * $d(I_s, I_t)$ then does the following: * It assumes that $I_t$ is an example for a space of target images. * Within that set it finds the image $I_p$ which is most similar to $I_s$. That is done using a projection via Fourier Transformations. (See formula 5 in the paper.) * The returned distance is then $\|I_s - I_p\|$. ### Results * Equal quality for textures without quasi-periodic structures. * Significantly better quality for textures with quasi-periodic structures. ![Overview](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Texture_Synthesis_Through_CNNs_and_Spectrum_Constraints__overview.png?raw=true "Overview") *Overview of their method, i.e. generated textures using style and/or spectrum-based loss.*
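A rough illustration of the spectrum distance, reduced to a 1-D, single-channel signal. It assumes that the set of target images consists of all signals sharing the Fourier magnitudes of the texture example, so that the projection keeps the example's magnitudes and the synthesized signal's phases. Whether this matches formula 5 of the paper in every detail (in particular for color images) is an assumption, and the naive DFT below is only there to keep the sketch free of dependencies.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real 1-D signal."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns the real part of the reconstructed signal."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def spectrum_distance(synth, target):
    """Distance between synth and its projection onto the set of signals
    that share the Fourier magnitudes of target."""
    s_hat, t_hat = dft(synth), dft(target)
    # Keep the target's magnitudes, reuse the synthesized signal's phases.
    p_hat = [
        abs(t) * (s / abs(s) if abs(s) > 0 else 1.0)
        for s, t in zip(s_hat, t_hat)
    ]
    projected = idft(p_hat)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(synth, projected)))
```

Since a circular shift does not change Fourier magnitudes, a shifted copy of the example has distance (numerically close to) zero, which fits the goal of matching periodic structure rather than exact pixel positions.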
[link]
https://www.youtube.com/watch?v=PRD8LpPvdHI * They describe a method that can be used for two problems: * (1) Choose a style image and apply that style to other images. * (2) Choose an example texture image and create new texture images that look similar. * In contrast to previous methods their method can be applied very fast to images (style transfer) or noise (texture creation). However, per style/texture a single (expensive) initial training session is still necessary. * Their method builds upon their previous paper "Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis". ### How * Rough overview of their previous method: * Transfer styles using three losses: * Content loss: MSE between VGG representations. * Regularization loss: Sum of x-gradients and y-gradients (encouraging smooth areas). * MRF-based style loss: Sample `k x k` patches from VGG representations of content image and style image. For each patch from content image find the nearest neighbor (based on normalized cross correlation) from style patches. Loss is then the sum of squared errors of euclidean distances between content patches and their nearest neighbors. * Generation of new images is done by starting with noise and then iteratively applying changes that minimize the loss function. * They introduce two major changes: * (a) Get rid of the costly nearest neighbor search for the MRF loss. Instead, use a discriminator network that receives a patch and rates how real that patch looks. * This discriminator network is costly to train, but that only has to be done once (per style/texture). * (b) Get rid of the slow, iterative generation of images. Instead, start with the content image (style transfer) or noise image (texture generation) and feed that through a single generator network to create the output image (with transferred style or generated texture). * This generator network is costly to train, but that only has to be done once (per style/texture).
* MDANs * They implement change (a) to the standard architecture and call that an "MDAN" (Markovian Deconvolutional Adversarial Networks). * So the architecture of the MDAN is: * Input: Image (RGB pixels) * Branch 1: Markovian Patch Quality Rater (aka Discriminator) * Starts by feeding the image through VGG19 until layer `relu3_1`. (Note: VGG weights are fixed/not trained.) * Then extracts `k x k` patches from the generated representations. * Feeds each patch through a shallow ConvNet (convolution with BN then fully connected layer). * Training loss is a hinge loss, i.e. max margin between classes +1 (real looking patch) and -1 (fake looking patch). (Could also take a single sigmoid output, but they argue that hinge loss isn't as likely to saturate.) * This branch will be trained continuously while synthesizing a new image. * Branch 2: Content Estimation/Guidance * Note: This branch is only used for style transfer, i.e. when a content image is used, and not for texture generation. * Starts by feeding the currently synthesized image through VGG19 until layer `relu5_1`. (Note: VGG weights are fixed/not trained.) * Also feeds the content image through VGG19 until layer `relu5_1`. * Then uses an MSE loss between both representations (so similar to an MSE on RGB pixels that is often used in autoencoders). * Nothing in this branch needs to be trained, the loss only affects the synthesizing of the image. * MGANs * The MGAN is like the MDAN, but additionally implements change (b), i.e. they add a generator that takes an image and stylizes it. * The generator's architecture is: * Input: Image (RGB pixels) or noise (for texture synthesis) * Output: Image (RGB pixels) (stylized input image or generated texture) * The generator takes the image (pixels) and feeds that through VGG19 until layer `relu4_1`. * Similar to the DCGAN generator, they then apply a few fractionally strided convolutions (with BN and LeakyReLUs) to that, ending in a Tanh output.
(Fractionally strided convolutions increase the height/width of the images, here to compensate for the VGG pooling layers.) * The output after the Tanh is the output image (RGB pixels). * They train the generator with pairs of `(input image, stylized image or texture)`. These pairs can be gathered by first running the MDAN alone on several images. (With significant augmentation a few dozen pairs already seem to be enough.) * One of two possible loss functions can then be used: * Simple standard choice: MSE on the euclidean distance between expected output pixels and generated output pixels. Can cause blurriness. * Better choice: MSE on a higher VGG representation. Simply feed the generated output pixels through VGG19 until `relu4_1` and then reuse the already generated (see above) VGG representation of the input image. This is very similar to the pixelwise comparison, but tends to cause less blurriness. * Note: For some reason the authors call their generator a VAE, but don't mention any typical VAE technique, so it's not described like one here. * They use Adam to train their networks. * For texture generation they use Perlin Noise instead of simple white noise. In Perlin Noise, lower frequency components dominate more than higher frequency components. White noise didn't work well with the VGG representations in the generator (activations were close to zero). ### Results * Similar quality to previous methods, but much faster (compared to most methods). * For the Markovian Patch Quality Rater (MDAN branch 1): * They found that the weights of this branch can be used as initialization for other training sessions (e.g. other texture styles), leading to a decrease in required iterations/epochs. * Using VGG for feature extraction seems to be crucial. Training from scratch led to worse results. * Using larger patch sizes preserves more of the structure of the style image/texture. Smaller patches lead to more flexibility in generated patterns.
* They found that using more than 3 convolutional layers or more than 64 filters per layer provided no visible benefit in quality. ![Example](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Markovian_GANs__example.png?raw=true "Example") *Result of their method, compared to other methods.* ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Markovian_GANs__architecture.png?raw=true "Architecture") *Architecture of their model.* 
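The hinge loss used for the Markovian Patch Quality Rater can be written down in one line. A minimal sketch, assuming the standard margin of 1 (the summary does not state the margin) and scores given as plain floats:

```python
def hinge_loss(scores, labels):
    """Mean hinge loss over patches.

    scores: raw (unbounded) discriminator outputs, one per patch.
    labels: +1 for patches from the real style image, -1 for synthesized ones.
    """
    return sum(max(0.0, 1.0 - y * s) for s, y in zip(scores, labels)) / len(scores)
```

Patches that are already classified correctly with a margin of at least 1 contribute zero loss and zero gradient, while misclassified patches keep a constant-magnitude gradient, unlike a saturating sigmoid output.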
[link]
* They describe a method to transfer image styles based on semantic classes. * This allows one to: * (1) Transfer styles between images more accurately than with previous models. E.g. so that the background of an image does not receive the style of skin/hair/clothes/... seen in the style image. Skin in the synthesized image should receive the style of skin from the style image. Same for hair, clothes, etc. * (2) Turn simple doodles into artwork by treating the simplified areas in the doodle as semantic classes and annotating an artwork with these same semantic classes. (E.g. "this blob should receive the style from these trees.") ### How * Their method is based on [Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis](Combining_MRFs_and_CNNs_for_Image_Synthesis.md). * They use the same content loss and mostly the same MRF-based style loss. (Apparently they don't use the regularization loss.) * They change the input of the MRF-based style loss. * Usually that input would only be the activations of a VGG layer (for the synthesized image or the style source image). * They concatenate a semantic map with weighting `gamma` to the activation, i.e. `<representation of image> = <activation of specific layer for that image> || gamma * <semantic map>`, where `||` denotes channel-wise concatenation. * The semantic map has N channels with 1s in a channel where a specific class is located (e.g. skin). * The semantic map has to be created by the user for both the content image and the style image. * As usual for the MRF loss, patches are then sampled from the representations. The semantic maps then influence the distance measure. I.e. patches are more likely to be sampled from the same semantic class. * Higher `gamma` values make it more likely to sample from the same semantic class (because the distance from patches from different classes gets larger). * One can create a small doodle with few colors, then use the colors as the semantic map.
Then add a semantic map to an artwork and run the algorithm to transform the doodle into an artwork. ### Results * More control over the transferred styles than previously. * Less sensitive to the style weighting, because of the additional `gamma` hyperparameter. * Easy transformation from doodle to artwork. ![Example](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Neural_Doodle__example.png?raw=true "Example") *Turning a doodle into an artwork. Note that the doodle input image is also used as the semantic map of the input.*
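How the `gamma`-weighted semantic map changes patch distances can be shown with a toy example. The sketch assumes that the semantic channels are appended to the feature vector at each spatial position; the data layout (lists of per-position feature vectors) and function names are made up for illustration.

```python
def augment_with_semantic_map(activations, semantic_map, gamma):
    """Append gamma-weighted semantic channels to each position's features.

    activations: list of positions, each a list of channel values.
    semantic_map: list of positions, each a one-hot list over classes.
    """
    return [
        feat + [gamma * s for s in sem]
        for feat, sem in zip(activations, semantic_map)
    ]


def squared_distance(a, b):
    """Squared euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

Two positions with identical VGG features but different classes have distance 0 for `gamma = 0` and distance `2 * gamma^2` otherwise, so a large `gamma` effectively forbids matches across classes during the nearest neighbor search.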
[link]
* They describe a method that applies the style of a source image to a target image. * Example: Let a normal photo look like a van Gogh painting. * Example: Let a normal car look more like a specific luxury car. * Their method builds upon the well known artistic style paper and uses a new MRF prior. * The prior leads to locally more plausible patterns (e.g. fewer artifacts). ### How * They reuse the content loss from the artistic style paper. * The content loss is calculated by feeding the source and target image through a network (here: VGG19) and then estimating the squared error of the euclidean distance between one or more hidden layer activations. * They use layer `relu4_2` for the distance measurement. * They replace the original style loss with an MRF-based style loss. * Step 1: Extract from the source image `k x k` sized overlapping patches. * Step 2: Perform step (1) analogously for the target image. * Step 3: Feed the source image patches through a pretrained network (here: VGG19) and select the representations `r_s` from specific hidden layers (here: `relu3_1`, `relu4_1`). * Step 4: Perform step (3) analogously for the target image. (Result: `r_t`) * Step 5: For each patch of `r_s` find the best matching patch in `r_t` (based on normalized cross correlation). * Step 6: Calculate the sum of squared errors (based on euclidean distances) of each patch in `r_s` and its best match (according to step 5). * They add a regularizer loss. * The loss encourages smooth transitions in the synthesized image (i.e. few edges, corners). * It is based on the raw pixel values of the last synthesized image. * For each pixel in the synthesized image, they calculate the squared x-gradient and the squared y-gradient and then add both. * They use the sum of all those values as their loss (i.e. `regularizer loss = <sum over all pixels> x_gradient^2 + y_gradient^2`).
* Their whole optimization problem is then roughly `image = argmin_image MRF_style_loss + alpha1 * content_loss + alpha2 * regularizer_loss`.
* In practice, they start their synthesis with a low resolution image and then progressively increase the resolution (each time performing some iterations of optimization).
* In practice, they sample patches from the style image under several different rotations and scalings.

### Results

* In comparison to the original artistic style paper:
  * Fewer artifacts.
  * Their method tends to preserve style better, but content worse.
  * Can handle photorealistic style transfer better, so long as the images are similar enough. If no good matches between patches can be found, their method performs worse.

![Non-photorealistic example images](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Combining_MRFs_and_CNNs_for_Image_Synthesis__examples.png?raw=true "Non-photorealistic example images")

*Non-photorealistic example images. Their method vs. the one from the original artistic style paper.*

![Photorealistic example images](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Combining_MRFs_and_CNNs_for_Image_Synthesis__examples_real.png?raw=true "Photorealistic example images")

*Photorealistic example images. Their method vs. the one from the original artistic style paper.*
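The MRF style loss (steps 1 to 6) and the regularizer described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function names are made up, the feature maps `r_s` and `r_t` are taken as already extracted from the network, and normalized cross correlation is computed as cosine similarity between flattened patches.

```python
import numpy as np

def extract_patches(feat, k=3):
    """Return all overlapping k x k patches of a (C, H, W) feature map as rows."""
    c, h, w = feat.shape
    patches = [
        feat[:, i:i + k, j:j + k].ravel()
        for i in range(h - k + 1)
        for j in range(w - k + 1)
    ]
    return np.stack(patches)

def mrf_style_loss(r_s, r_t, k=3, eps=1e-8):
    """For each patch of r_s, find the best matching patch of r_t
    (cosine similarity, an assumed form of normalized cross correlation)
    and sum the squared distances to those matches."""
    p_s = extract_patches(r_s, k)
    p_t = extract_patches(r_t, k)
    n_s = p_s / (np.linalg.norm(p_s, axis=1, keepdims=True) + eps)
    n_t = p_t / (np.linalg.norm(p_t, axis=1, keepdims=True) + eps)
    best = np.argmax(n_s @ n_t.T, axis=1)  # index of best match per patch
    return float(np.sum((p_s - p_t[best]) ** 2))

def smoothness_loss(img):
    """Sum over all pixels of squared x-gradient plus squared y-gradient,
    using simple forward differences on the raw pixel values."""
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return float(np.sum(dx ** 2) + np.sum(dy ** 2))
```

In a real implementation the matching would run on GPU feature maps and be differentiated with respect to the synthesized image; the sketch only shows what is being measured.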
[link]
* They describe a model that upscales low resolution images to their high resolution equivalents ("Single Image Super Resolution").
* Their model uses a deeper architecture than previous models and has a residual component.

### How

* Their model is a fully convolutional neural network.
* Input of the model: The image to upscale, *already upscaled to the desired size* (but still blurry).
* Output of the model: The upscaled image (without the blurriness).
* They use 20 layers of padded 3x3 convolutions with size 64xHxW with ReLU activations. (No pooling.)
* They have a residual component, i.e. the model only learns and outputs the *change* that has to be applied/added to the blurry input image (instead of outputting the full image). That change is applied to the blurry input image before using the loss function on it. (Note that this is a bit different from the currently used "residual learning".)
* They use an MSE between the "correct" upscaling and the generated upscaled image (input image + residual).
* They use SGD starting with a learning rate of 0.1 and decay it 3 times by a factor of 10.
* They use weight decay of 0.0001.
* During training they use a special gradient clipping adapted to the learning rate. Usually gradient clipping restricts the gradient values to `[-t, t]` (`t` is a hyperparameter). Their gradient clipping restricts the values to `[-t/lr, t/lr]` (where `lr` is the learning rate).
* They argue that their special gradient clipping allows the use of significantly higher learning rates.
* They train their model on multiple scales, e.g. 2x, 3x, 4x upscaling. (Not really clear how. They probably feed their upscaled image again into the network or something like that?)

### Results

* Higher accuracy upscaling than all previous methods.
* Handles upscaling factors above 2x well.
* Residual network learns significantly faster than non-residual network.
![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Accurate_Image_SuperResolution__architecture.png?raw=true "Architecture")

*Architecture of the model.*

![Examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Accurate_Image_SuperResolution__examples.png?raw=true "Examples")

*Super-resolution quality of their model (top; bottom is a competing model).*
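The learning-rate-adapted gradient clipping and the residual loss from the notes above can be sketched as follows. The function names are hypothetical, and applying weight decay before clipping is an assumption of this sketch, not something the notes specify.

```python
import numpy as np

def clip_gradient_adaptive(grad, t, lr):
    """Clip gradient values to [-t/lr, t/lr]. The bound widens as lr decays,
    so the largest possible weight change per step (lr * bound) stays at t."""
    bound = t / lr
    return np.clip(grad, -bound, bound)

def sgd_step(weights, grad, t, lr, weight_decay=0.0001):
    """One SGD step with L2 weight decay (assumed to be added before clipping)."""
    clipped = clip_gradient_adaptive(grad + weight_decay * weights, t, lr)
    return weights - lr * clipped

def residual_mse(blurry_input, predicted_residual, target):
    """MSE between the target and (blurry input + predicted residual)."""
    output = blurry_input + predicted_residual
    return float(np.mean((output - target) ** 2))
```

With plain clipping to `[-t, t]`, the maximum step shrinks together with the learning rate; dividing the bound by `lr` keeps it constant, which is one way to read the claim that this clipping permits higher learning rates.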
[link]
* They present a hierarchical method for reinforcement learning.
* The method combines "long"-term goals with short-term action choices.

### How

* They have two components:
  * Meta-Controller:
    * Responsible for the "long"-term goals.
    * Is trained to pick goals (based on the current state) that maximize (extrinsic) rewards, just like you would usually optimize to maximize rewards by picking good actions.
    * The Meta-Controller only picks goals when the Controller terminates or has achieved the goal.
  * Controller:
    * Receives the current state and the current goal.
    * Has to pick a reward maximizing action based on those, just as the agent would usually do (only the goal is added here).
    * The reward is intrinsic. It comes from the Critic. The Critic gives reward whenever the current goal is reached.
* For Montezuma's Revenge:
  * A goal is to reach a specific object.
  * The goal is encoded via a bitmask (as big as the game screen). The mask contains 1s wherever the object is.
  * They hand-extract the location of a few specific objects.
  * So basically:
    * The Meta-Controller picks the next object to reach via a Q-value function.
    * It receives extrinsic reward when objects have been reached in a specific sequence.
    * The Controller picks actions that lead to reaching the object based on a Q-value function. It iterates action-choosing until it terminates or has reached the goal-object.
    * The Critic awards intrinsic reward to the Controller whenever the goal-object was reached.
* They use CNNs for the Meta-Controller and the Controller, similar in architecture to the Atari-DQN paper (shallow CNNs).
* They use two replay memories, one for the Meta-Controller (size 50k) and one for the Controller (size 1M).
* Both follow an epsilon-greedy policy (for picking goals/actions). Epsilon starts at 1.0 and is annealed down to 0.1.
* They use a discount factor / gamma of 0.99.
* They train with SGD.

### Results

* Learns to play Montezuma's Revenge.
* Learns to act well in a more abstract MDP with delayed rewards and where simple Q-learning failed.

# Rough chapter-wise notes

* (1) Introduction
  * Basic problem: Learn goal directed behaviour from sparse feedbacks.
  * Challenges:
    * Explore state space efficiently
    * Create multiple levels of spatio-temporal abstractions
  * Their method: Combines deep reinforcement learning with hierarchical value functions.
  * Their agent is motivated to solve specific intrinsic goals.
  * Goals are defined in the space of entities and relations, which constrains the search space.
  * They define their value function as V(s, g) where s is the state and g is a goal.
  * First, their agent learns to solve intrinsically generated goals. Then it learns to chain these goals together.
  * Their model has two hierarchy levels:
    * Meta-Controller: Selects the current goal based on the current state.
    * Controller: Takes state s and goal g, then selects a good action based on s and g. The controller operates until g is achieved, then the meta-controller picks the next goal.
  * Meta-Controller gets extrinsic rewards, controller gets intrinsic rewards.
  * They use SGD to optimize the whole system (with respect to reward maximization).
* (3) Model
  * Basic setting: Action a out of all actions A, state s out of S, transition function T(s,a) -> s', reward by state F(s) -> R.
  * epsilon-greedy is good for local exploration, but it's not good at exploring very different areas of the state space.
  * They use intrinsically motivated goals to better explore the state space.
  * Sequences of goals are arranged to maximize the received extrinsic reward.
  * The agent learns one policy per goal.
  * Meta-Controller: Receives current state, chooses goal.
  * Controller: Receives current state and current goal, chooses action. Keeps choosing actions until goal is achieved or a terminal state is reached. Has the optimization target of maximizing cumulative reward.
  * Critic: Checks if current goal is achieved and if so provides intrinsic reward.
  * They use deep Q learning to train their model.
  * There are two Q-value functions. One for the controller and one for the meta-controller.
    * Both formulas are extended by the last chosen goal g.
    * The Q-value function of the meta-controller does not depend on the chosen action.
    * The Q-value function of the controller receives only intrinsic direct reward, not extrinsic direct reward.
    * Both Q-value functions are represented with DQNs.
    * Both are optimized to minimize MSE losses.
  * They use separate replay memories for the controller and meta-controller.
  * A transition is added to the meta-controller's memory whenever the controller terminates.
  * Each new goal is picked by the meta-controller epsilon-greedy (based on the current state).
  * The controller picks actions epsilon-greedy (based on the current state and goal).
  * Both epsilons are annealed down.
* (4) Experiments
  * (4.1) Discrete MDP with delayed rewards
    * Basic MDP setting, following roughly: Several states (s1 to s6) organized in a chain. The agent can move left or right. It gets high reward if it moves to state s6 and then back to s1, otherwise it gets small reward per reached state.
    * They use their hierarchical method, but without neural nets.
    * Baseline is Q-learning without a hierarchy/intrinsic rewards.
    * Their method performs significantly better than the baseline.
  * (4.2) ATARI game with delayed rewards
    * They play Montezuma's Revenge with their method, because that game has very delayed rewards.
    * They use CNNs for the controller and meta-controller (architecture similar to the Atari-DQN paper).
    * The critic reacts to (entity1, relation, entity2) relationships. The entities are just objects visible in the game. The relation is (apparently ?) always "reached", i.e. whether object1 arrived at object2.
    * They extract the objects manually, i.e. assume the existence of a perfect unsupervised object detector.
    * They encode the goals apparently not as vectors, but instead just use a bitmask (game screen height and width), which has 1s at the pixels that show the object.
    * Replay memory sizes: 1M for controller, 50k for meta-controller.
    * gamma=0.99
    * They first only train the controller (i.e. meta-controller completely random) and only then train both jointly.
    * Their method successfully learns to perform actions which lead to rewards with long delays.
    * It starts with easier goals and then learns harder goals.
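The interplay of meta-controller, controller and critic described in these notes can be sketched as a plain control loop. Everything here is a simplified illustration: `ChainEnv`, `run_episode` and the callables `q_meta` / `q_ctrl` are hypothetical names, a goal is just a target state index (instead of an object bitmask), and the critic is reduced to an equality check.

```python
import random

class ChainEnv:
    """Hypothetical toy chain with states 0..n-1; action 0 = left, 1 = right."""
    def __init__(self, n=6):
        self.n = n
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n - 1)
        done = self.state == self.n - 1
        return self.state, (1.0 if done else 0.0), done

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random index with probability epsilon, otherwise the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

def run_episode(env, q_meta, q_ctrl, goals, eps_meta, eps_ctrl, rng, max_steps=100):
    """Roll out one episode; return transitions for both replay memories."""
    meta_memory, ctrl_memory = [], []
    state, done, steps = env.reset(), False, 0
    while not done and steps < max_steps:
        # Meta-controller: pick a goal based on the current state.
        goal = goals[epsilon_greedy(q_meta(state), eps_meta, rng)]
        start_state, extrinsic_total, reached = state, 0.0, False
        # Controller: act until the goal is reached or the episode ends.
        while not (done or reached) and steps < max_steps:
            action = epsilon_greedy(q_ctrl(state, goal), eps_ctrl, rng)
            next_state, extrinsic, done = env.step(action)
            reached = next_state == goal            # critic: goal achieved?
            intrinsic = 1.0 if reached else 0.0     # controller sees only this
            ctrl_memory.append((state, goal, action, intrinsic, next_state))
            extrinsic_total += extrinsic
            state = next_state
            steps += 1
        # Meta-controller transition is stored once the controller stops.
        meta_memory.append((start_state, goal, extrinsic_total, state))
    return meta_memory, ctrl_memory
```

In the paper's setting, `q_meta` and `q_ctrl` would be the two DQNs and the returned lists would feed the two replay memories; the sketch leaves the learning updates out entirely.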
[link]
https://www.youtube.com/watch?v=vQk_Sfl7kSc&feature=youtu.be

* The paper describes a method to transfer the style (e.g. choice of colors, structure of brush strokes) of an image to a whole video.
* The method is designed so that the transferred style is consistent over many frames.
* Examples for such consistency:
  * No flickering of style between frames. So the next frame has always roughly the same style in the same locations.
  * No artefacts at the boundaries of objects, even if they are moving.
  * If an area gets occluded and then unoccluded a few frames later, the style of that area is still the same as before the occlusion.

### How

* Assume that we have a frame to stylize $x$ and an image from which to extract the style $a$.
* The basic process is the same as in the original Artistic Style Transfer paper, they just add a bit on top of that.
* They start with a gaussian noise image $x'$ and change it gradually so that a loss function gets minimized.
* The loss function has the following components:
  * Content loss *(old, same as in the Artistic Style Transfer paper)*
    * This loss makes sure that the content in the generated/stylized image still matches the content of the original image.
    * $x$ and $x'$ are fed forward through a pretrained network (VGG in their case).
    * Then the generated representations of the intermediate layers of the network are extracted/read.
    * One or more layers are picked and the difference between those layers for $x$ and $x'$ is measured via a MSE.
    * E.g. if we used only the representations of the layer conv5 then we would get something like `(conv5(x) - conv5(x'))^2` per example. (Where conv5() also executes all previous layers.)
  * Style loss *(old)*
    * This loss makes sure that the style of the generated/stylized image matches the style source $a$.
    * $x'$ and $a$ are fed forward through a pretrained network (VGG in their case).
    * Then the generated representations of the intermediate layers of the network are extracted/read.
    * One or more layers are picked and the Gram Matrices of those layers are calculated.
    * Then the difference between those matrices is measured via a MSE.
  * Temporal loss *(new)*
    * This loss enforces consistency in style between a pair of frames.
    * The main sources of inconsistency are boundaries of moving objects and areas that get unoccluded.
    * They use the optical flow to detect motion.
      * Applying an optical flow method to two frames $(i, i+1)$ returns per pixel the movement of that pixel, i.e. if the pixel at $(x=1, y=2)$ moved to $(x=2, y=4)$ the optical flow at that pixel would be $(u=1, v=2)$.
      * The optical flow can be split into the forward flow (here `fw`) and the backward flow (here `bw`). The forward flow is the flow from frame $i$ to $i+1$ (as described in the previous point). The backward flow is the flow from frame $i+1$ to $i$ (reverse direction in time).
    * Boundaries
      * At boundaries of objects the derivative of the flow is high, i.e. the flow "suddenly" changes significantly from one pixel to the other.
      * So to detect boundaries they use (per pixel) roughly the equation `gradient(u)^2 + gradient(v)^2 > length((u,v))`.
    * Occlusions and disocclusions
      * If a pixel does not get occluded/disoccluded between frames, the optical flow method should be able to correctly estimate the motion of that pixel between the frames. The forward and backward flows then should be roughly equal, just in opposing directions.
      * If a pixel does get occluded/disoccluded between frames, it will not be visible in one of the two frames and therefore the optical flow method cannot reliably estimate the motion for that pixel. It is then expected that the forward and backward flow are unequal.
      * To measure that effect they roughly use (per pixel) a formula matching `length(fw + bw)^2 > length(fw)^2 + length(bw)^2`.
    * Mask $c$
      * They create a mask $c$ with the size of the frame.
      * For every pixel they estimate whether the boundary-equation *or* the disocclusion-equation is true.
      * If either of them is true, they add a 0 to the mask, otherwise a 1. So the mask is 1 wherever there is *no* disocclusion or motion boundary.
    * Combination
      * The final temporal loss is the mean (over all pixels) of $c*(x-w)^2$.
        * $x$ is the frame to stylize.
        * $w$ is the previous *stylized* frame (frame i-1), warped according to the optical flow between frame i-1 and i.
        * `c` is the mask value at the pixel.
      * By using the difference `x-w` they ensure that the difference in styles between two frames is low.
      * By adding `c` they ensure the style-consistency only at pixels that probably should have a consistent style.
  * Long-term loss *(new)*
    * This loss enforces consistency in style between pairs of frames that are further apart from each other.
    * It is a simple extension of the temporal (short-term) loss.
    * The temporal loss was computed for frames (i-1, i). The long-term loss is the sum of the temporal losses for the frame pairs {(i-4,i), (i-2,i), (i-1,i)}.
    * The $c$ mask is recomputed for every pair and is 1 if there are no boundaries/disocclusions detected, but only if there is not a 1 for the same pixel in a later mask. The additional condition is intended to associate pixels with their closest neighbours in time to minimize possible errors.
    * Note that the long-term loss can completely replace the temporal loss as the latter one is contained in the former one.
* Multi-pass approach *(new)*
  * They had problems with contrast around the boundaries of the frames.
  * To combat that, they use a multi-pass method in which they seem to calculate the optical flow in multiple forward and backward passes? (Not very clear here what they do and why it would help.)
* Initialization with previous frame *(new)*
  * Instead of starting at a gaussian noise image every time, they instead use the previous stylized frame.
  * That immediately leads to more similarity between the frames.
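The masked temporal loss and the long-term mask rule from the notes above can be sketched in NumPy. The function names are hypothetical, the consistency check follows only the rough inequality given in these notes (no tuning constants), and the backward flow is assumed to be already aligned to the pixels of the forward flow.

```python
import numpy as np

def flow_consistency_mask(fw, bw):
    """fw, bw: (H, W, 2) forward and (aligned) backward flow.
    Returns 1.0 where the flows are consistent and 0.0 where
    length(fw + bw)^2 > length(fw)^2 + length(bw)^2 (suspected disocclusion)."""
    lhs = np.sum((fw + bw) ** 2, axis=-1)
    rhs = np.sum(fw ** 2, axis=-1) + np.sum(bw ** 2, axis=-1)
    return (lhs <= rhs).astype(np.float64)

def temporal_loss(x, w, c):
    """Mean of c * (x - w)^2, where x is the current stylized frame (H, W, C),
    w the warped previous stylized frame and c the per-pixel mask (H, W).
    The mean here runs over all pixel-channel entries."""
    return float(np.mean(c[..., None] * (x - w) ** 2))

def long_term_masks(masks):
    """masks: binary masks ordered from the nearest frame pair to the farthest.
    A pixel keeps its 1 only if no nearer pair already has a 1 there."""
    claimed = np.zeros_like(masks[0])
    result = []
    for m in masks:
        result.append(m * (1.0 - claimed))
        claimed = np.maximum(claimed, m)
    return result
```

The long-term loss would then be the sum of `temporal_loss` over the frame pairs, each with its own warped frame and its filtered mask. Motion-boundary detection is left out of the sketch.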
[link]
* Certain activation functions, mainly sigmoid, tanh, hard-sigmoid and hard-tanh can saturate.
* That means that their gradient is either flat 0 after threshold values (e.g. -1 and +1) or that it approaches zero for high/low values.
* If there's no gradient, training becomes slow or stops completely.
* That's a problem, because sigmoid, tanh, hard-sigmoid and hard-tanh are still often used in some models, like LSTMs, GRUs or Neural Turing Machines.
* To fix the saturation problem, they add noise to the output of the activation functions.
* The noise increases as the unit saturates.
* Intuitively, once the unit is saturating, it will occasionally "test" an activation in the non-saturating regime to see if that output performs better.

### How

* The basic formula is: `phi(x,z) = alpha*h(x) + (1-alpha)u(x) + d(x)std(x)epsilon`
* Variables in that formula:
  * Non-linear part `alpha*h(x)`:
    * `alpha`: A constant hyperparameter that determines the "direction" of the noise and the slope. Values above 1.0 let the noise point away from the unsaturated regime. Values <=1.0 let it point towards the unsaturated regime (higher alpha = stronger noise).
    * `h(x)`: The original activation function.
  * Linear part `(1-alpha)u(x)`:
    * `u(x)`: First-order Taylor expansion of h(x).
      * For sigmoid: `u(x) = 0.25x + 0.5`
      * For tanh: `u(x) = x`
      * For hard-sigmoid: `u(x) = 0.25x + 0.5`, with `h(x) = max(min(u(x), 1), 0)`
      * For hard-tanh: `u(x) = x`, with `h(x) = max(min(u(x), 1), -1)`
  * Noise/Stochastic part `d(x)std(x)epsilon`:
    * `d(x) = -sgn(x)sgn(1-alpha)`: Changes the "direction" of the noise.
    * `std(x) = c(sigmoid(p*v(x))-0.5)^2 = c(sigmoid(p*(h(x)-u(x)))-0.5)^2`
      * `c` is a hyperparameter that controls the scale of the standard deviation of the noise.
      * `p` controls the magnitude of the noise. Due to the `sigmoid(y)-0.5` this can influence the sign. `p` is learned.
    * `epsilon`: A noise creating random variable. Usually either a Gaussian or the positive half of a Gaussian (i.e. `z` or `|z|`).
* The hyperparameter `c` can be initialized at a high value and then gradually decreased over time. That would be comparable to simulated annealing.
* Noise could also be applied to the input, i.e. `h(x)` becomes `h(x + noise)`.

### Results

* They replaced sigmoid/tanh/hard-sigmoid/hard-tanh units in various experiments (without further optimizations).
* The experiments were:
  * Learn to execute source code (LSTM?)
  * Language model from Penn Treebank (2-layer LSTM)
  * Neural Machine Translation engine trained on Europarl (LSTM?)
  * Image caption generation with soft attention trained on Flickr8k (LSTM)
  * Counting unique integers in a sequence of integers (LSTM)
  * Associative recall (Neural Turing Machine)
* Noisy activations practically always led to a small or moderate improvement in resulting accuracy/NLL/BLEU.
* In one experiment annealed noise significantly outperformed unannealed noise, even beating careful curriculum learning. (Somehow there are not more experiments about that.)
* The Neural Turing Machine learned far faster with noisy activations and also converged to a much better solution.

![Influence of alphas](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Noisy_Activation_Functions__alphas.png?raw=true "Influence of alphas.")

*Hard-tanh with noise for various alphas. Noise increases in different ways in the saturating regimes.*

![Neural Turing Machine results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Noisy_Activation_Functions__ntm.png?raw=true "Neural Turing Machine results.")

*Performance during training of a Neural Turing Machine with and without noisy activation units.*

# Rough chapter-wise notes

* (1) Introduction
  * ReLU and Maxout activation functions have improved the capabilities of training deep networks.
  * Previously, tanh and sigmoid were used, which were only suited for shallow networks, because they saturate, which kills the gradient.
  * They suggest a different avenue: Use saturating nonlinearities, but inject noise when they start to saturate (and let the network learn how much noise is "good").
  * The noise allows training deep networks with saturating activation functions.
  * Many current architectures (LSTMs, GRUs, Neural Turing Machines, ...) require "hard" decisions (yes/no). But they use "soft" activation functions to implement those, because hard functions lack gradient.
  * The soft activation functions can still saturate (no more gradient) and don't match the nature of the binary decision problem. So it would be good to replace them with something better.
  * They instead use hard activation functions and compensate for the lack of gradient by using noise (during training).
  * Networks with hard activation functions outperform those with soft ones.
* (2) Saturating Activation Functions
  * Activation Function = A function that maps a real value to a new real value and is differentiable almost everywhere.
  * Right saturation = The gradient of an activation function becomes 0 if the input value goes towards infinity.
  * Left saturation = The gradient of an activation function becomes 0 if the input value goes towards negative infinity.
  * Saturation = An activation function saturates if it right-saturates and left-saturates.
  * Hard saturation = If there is a constant c beyond which the gradient becomes 0.
  * Soft saturation = If there is no such constant, i.e. the input value must become +/- infinity.
  * Soft saturating activation functions can be converted to hard saturating ones by using a first-order Taylor expansion and then clipping the values to the required range (e.g. 0 to 1).
  * A hard activating tanh is just `f(x) = x`. With clipping to [-1, 1]: `max(min(f(x), 1), -1)`.
  * The gradient for hard activation functions is 0 above/below certain constants, which will make training significantly more challenging.
  * hard-sigmoid, sigmoid and tanh are contractive mappings, hard-tanh for some reason only when it's greater than the threshold.
  * The fixed-point for tanh is 0, for the others !=0. That can have influences on the training performance.
* (3) Annealing with Noisy Activation Functions
  * Suppose that there is an activation function like hard-sigmoid or hard-tanh with additional noise (iid, mean=0, variance=std^2).
  * If the noise's `std` is 0 then the activation function is the original, deterministic one.
  * If the noise's `std` is very high then the derivatives and gradient become high too. The noise then "drowns" signal and the optimizer just moves randomly through the parameter space.
  * Let the signal to noise ratio be `SNR = std_signal / std_noise`. So if SNR is low then noise drowns the signal and exploration is random.
  * By letting SNR grow (i.e. decreasing `std_noise`) we switch the model to fine tuning mode (less coarse exploration).
  * That is similar to simulated annealing, where noise is also gradually decreased to focus on better and better regions of the parameter space.
* (4) Adding Noise when the Unit Saturate
  * This approach does not always add the same noise. Instead, noise is added proportionally to the saturation magnitude. More saturation, more noise.
  * That results in a clean signal in "good" regimes (non-saturation, strong gradients) and a noisy signal in "bad" regimes (saturation).
  * Basic activation function with noise: `phi(x, z) = h(x) + (mu + std(x)*z)`, where `h(x)` is the saturating activation function, `mu` is the mean of the noise, `std` is the standard deviation of the noise and `z` is a random variable.
  * Ideally the noise is unbiased so that the expectation values of `phi(x,z)` and `h(x)` are the same.
  * `std(x)` should take higher values as h(x) enters the saturating regime.
  * To calculate how "saturating" an activation function is, one can use `v(x) = h(x) - u(x)`, where `u(x)` is the first-order Taylor expansion of `h(x)`.
  * Empirically they found that a good choice is `std(x) = c(sigmoid(p*v(x)) - 0.5)^2` where `c` is a hyperparameter and `p` is learned.
  * (4.1) Derivatives in the Saturated Regime
    * For values below the threshold, the gradient of the noisy activation function is identical to the normal activation function.
    * For values above the threshold, the gradient of the noisy activation function is `phi'(x,z) = std'(x)*z`. (Assuming that z is unbiased so that mu=0.)
  * (4.2) Pushing Activations towards the Linear Regime
    * In saturated regimes, one would like to have more of the noise point towards the unsaturated regimes than away from them (i.e. let the model try often whether the unsaturated regimes might be better).
    * To achieve this they use the formula `phi(x,z) = alpha*h(x) + (1-alpha)u(x) + d(x)std(x)epsilon`
      * `alpha`: A constant hyperparameter that determines the "direction" of the noise and the slope. Values above 1.0 let the noise point away from the unsaturated regime. Values <=1.0 let it point towards the unsaturated regime (higher alpha = stronger noise).
      * `h(x)`: The original activation function.
      * `u(x)`: First-order Taylor expansion of h(x).
      * `d(x) = -sgn(x)sgn(1-alpha)`: Changes the "direction" of the noise.
      * `std(x) = c(sigmoid(p*v(x))-0.5)^2 = c(sigmoid(p*(h(x)-u(x)))-0.5)^2` with `c` being a hyperparameter and `p` learned.
      * `epsilon`: Either `z` or `|z|`. If `z` is a Gaussian, then `|z|` is called "half-normal" while just `z` is called "normal". Half-normal lets the noise only point towards one "direction" (towards the unsaturated regime or away from it), while normal noise lets it point in both directions (with the slope being influenced by `alpha`).
    * The formula can be split into three parts:
      * `alpha*h(x)`: Nonlinear part.
      * `(1-alpha)u(x)`: Linear part.
      * `d(x)std(x)epsilon`: Stochastic part.
    * Each of these parts resembles a path along which gradient can flow through the network.
    * During test time the activation function is made deterministic by using its expectation value: `E[phi(x,z)] = alpha*h(x) + (1-alpha)u(x) + d(x)std(x)E[epsilon]`.
    * If `epsilon` is half-normal then `E[epsilon] = sqrt(2/pi)`. If `epsilon` is normal then `E[epsilon] = 0`.
* (5) Adding Noise to Input of the Function
  * Noise can also be added to the input of an activation function, i.e. `h(x)` becomes `h(x + noise)`.
  * The noise can either always be applied or only once the input passes a threshold.
* (6) Experimental Results
  * They applied noise only during training.
  * They used existing setups and just changed the activation functions to noisy ones. No further optimizations.
  * `p` was initialized uniformly to [-1,1].
  * Basic experiment settings:
    * NAN: Normal noise applied to the outputs.
    * NAH: Half-normal noise, i.e. `|z|`, i.e. noise is "directed" towards the unsaturated or saturated regime.
    * NANI: Normal noise applied to the *input*, i.e. `h(x+noise)`.
    * NANIL: Normal noise applied to the input with learned variance.
    * NANIS: Normal noise applied to the input, but only if the unit saturates (i.e. above/below thresholds).
  * (6.1) Exploratory analysis
    * A very simple MNIST network performed slightly better with noisy activations than without. But comparison was only to tanh and hard-tanh, not ReLU or similar.
    * In an experiment with a simple GRU, NANI (noisy input) and NAN (noisy output) performed practically identically. NANIS (noisy input, only when saturated) performed significantly worse.
  * (6.2) Learning to Execute
    * Problem setting: Predict the output of some lines of code.
    * They replaced sigmoids and tanhs with their noisy counterparts (NAH, i.e. half-normal noise on output). The model learned faster.
  * (6.3) Penn Treebank Experiments
    * They trained a standard 2-layer LSTM language model on Penn Treebank.
    * Their model used noisy activations, as opposed to the usual non-noisy ones.
    * They could improve upon the previously best value.
Normal noise and half-normal noise performed roughly the same.
  * (6.4) Neural Machine Translation Experiments
    * They replaced all sigmoids and tanh units in the Neural Attention Model with noisy ones. Then they trained on the Europarl corpus.
    * They improved upon the previously best score.
  * (6.5) Image Caption Generation Experiments
    * They train a network with soft attention to generate captions for the Flickr8k dataset.
    * Using noisy activation units improved the result over normal sigmoids and tanhs.
  * (6.6) Experiments with Continuation
    * They build an LSTM and train it to predict how many unique integers there are in a sequence of random integers.
    * Instead of using a constant value for hyperparameter `c` of the noisy activations (scale of the standard deviation of the noise), they start at `c=30` and anneal down to `c=0.5`.
    * Annealed noise performed significantly better than unannealed noise.
    * Noise applied to the output (NAN) significantly beat noise applied to the input (NANIL).
    * In a second experiment they trained a Neural Turing Machine on the associative recall task.
    * Again they used annealed noise.
    * The NTM with annealed noise learned far faster than the one without annealed noise and converged to a perfect solution.
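To make the formula `phi(x,z) = alpha*h(x) + (1-alpha)u(x) + d(x)std(x)epsilon` from these notes concrete, here is a minimal NumPy sketch for a noisy hard-tanh. It assumes `u(x) = x` as the linearization and `h(x) = clip(u(x), -1, 1)` as the hard-tanh; the function name and default values are made up for illustration, and `p` is a fixed number here rather than a learned parameter.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def noisy_hardtanh(x, alpha=0.9, c=1.0, p=1.0, rng=None, half_normal=True, train=True):
    """Noisy hard-tanh following the formula in the notes (a sketch, not the
    authors' code). Noise only appears where the unit saturates (h != u)."""
    x = np.asarray(x, dtype=np.float64)
    u = x                                    # assumed linearization u(x) = x
    h = np.clip(u, -1.0, 1.0)                # hard-tanh
    v = h - u                                # 0 in the unsaturated regime
    std = c * (sigmoid(p * v) - 0.5) ** 2    # grows with saturation
    # Note: np.sign(0) == 0, so alpha == 1 switches the noise term off here.
    d = -np.sign(x) * np.sign(1.0 - alpha)
    if train:
        rng = np.random.default_rng() if rng is None else rng
        z = rng.standard_normal(x.shape)
        eps = np.abs(z) if half_normal else z
    else:
        # Test time: replace epsilon by its expectation value.
        eps = np.sqrt(2.0 / np.pi) if half_normal else 0.0
    return alpha * h + (1.0 - alpha) * u + d * std * eps
```

For inputs in `[-1, 1]` the function returns `x` unchanged, because `v(x) = 0` makes `std(x) = 0`. With `alpha < 1` and half-normal noise, a saturated positive input only receives noise that points back towards zero.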
[link]
#### Motivation:

+ When sampling a clinical time series, missing values become ubiquitous due to a variety of factors such as frequency of medical events (when a blood test is performed, for example).
+ Missing values can be very informative about the label (*informative missingness*).
+ The goal of the paper is to propose a deep learning model that **exploits the missingness patterns** to enhance its performance.

#### Time series notation:

Multivariate time series with $D$ variables of length $T$:

+ ${\bf X} = ({\bf x}_1, {\bf x}_2, \ldots, {\bf x}_T)^T \in \mathbb{R}^{T \times D}$.
+ ${\bf x}_t \in \mathbb{R}^{D}$ is the $t$-th measurement of all variables.
+ $x_t^d$ is the $d$-th component of ${\bf x}_t$.

Missing value information is incorporated using *masking* and *time-interval* concepts.

+ Masking: says which of the entries are missing values.
  + Masking vector ${\bf m}_t \in \{0, 1\}^D$, $m_t^d = 1$ if $x_t^d$ exists and $m_t^d = 0$ if $x_t^d$ is missing.
+ Time-interval: temporal pattern of 'no-missing' observations. Represented by timestamps $s_t$ and time intervals $\delta_t$ (since its last observation).
Example: ${\bf X}$: input time series with 2 variables,

$$ {\bf X} = \begin{pmatrix} 47 & 49 & NA & 40 & NA & 43 & 55 \\ NA & 15 & 14 & NA & NA & NA & 15 \end{pmatrix} $$

with timestamps

$${\bf s} = \begin{pmatrix} 0 & 0.1 & 0.6 & 1.6 & 2.2 & 2.5 & 3.1 \end{pmatrix} $$

The masking vectors ${\bf m}_t$ and time intervals ${\delta}_t$ for each variable are computed and stacked, forming the masking matrix ${\bf M}$ and time interval matrix ${\bf \Delta}$:

$$ {\bf M} = \begin{pmatrix} 1 & 1 & 0 & 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 0 & 1 \end{pmatrix} $$

$$ {\bf \Delta} = \begin{pmatrix} 0 & 0.1 & 0.5 & 1.5 & 0.6 & 0.9 & 0.6 \\ 0 & 0.1 & 0.5 & 1.0 & 1.6 & 1.9 & 2.5 \end{pmatrix} $$

#### Proposed Architecture:

+ GRU (Gated Recurrent Units) with "trainable" decays:
  + Input decay: causes a missing variable to converge to its empirical mean over time instead of simply being filled with its last observed value. The decay of each input is treated independently.
  + Hidden state decay: attempts to capture richer information from missing patterns. In this case the hidden state of the network at the previous time step is decayed.

#### Dataset:

+ MIMIC III v1.4: https://mimic.physionet.org/
  + Input events, Output events, Lab events, Prescription events
+ PhysioNet Challenge 2012: https://physionet.org/challenge/2012/

| | MIMIC III | PhysioNet 2012 |
|---|---|---|
| Number of samples ($N$) | 19714 | 4000 |
| Number of variables ($D$) | 99 | 33 |
| Mean number of time steps | 35.89 | 68.91 |
| Maximum number of time steps | 150 | 155 |
| Mean of variable missing rate | 0.9621 | 0.8225 |

#### Experiments and Results:

**Methodology**

+ Baselines:
  + Logistic Regression, SVM, Random Forest (PhysioNet sampled every 1h. MIMIC sampled every 2h). Forward / backfilling imputation. The masking vector is concatenated to the input to inform the models which inputs are imputed.
  + LSTM with mean imputation.
+ Variations of the proposed GRU model:
  + GRU-mean: impute average of the training set.
  + GRU-forward: impute last value.
  + GRU-simple: masking vectors and time interval are inputs. There is no imputation.
  + GRU-D: proposed model.
+ Batch normalization and dropout (p = 0.5) applied to the regression layer.
+ Normalized inputs to have a mean of 0 and standard deviation 1.
+ Parameter optimization: early stopping on validation set.

**Results**

Mortality Prediction (results in terms of AUC):

+ Proposed GRU-D outperforms other models on both datasets:
  + AUC = 0.8527 $\pm$ 0.003 for MIMIC-III and 0.8424 $\pm$ 0.012 for PhysioNet
+ Random Forest and SVM are the best non-RNN baselines.
+ GRU-simple was the best RNN variant.

Multitask Prediction (results in terms of AUC):

+ PhysioNet: mortality, length of stay <3 days, surgery, cardiac condition.
+ MIMIC III: 20 diagnostic categories.
+ The proposed GRU-D outperforms other baseline models.

#### Positive Aspects:

+ Instead of performing simple mean imputation or using indicator functions, the paper exploits missing values and missing patterns in a novel way.
+ The paper performs lengthy comparisons against baselines.

#### Caveats:

+ Clinical mortality datasets usually have very high imbalance between classes. In such cases, AUC alone is not the best metric to evaluate. It would have been interesting to see the results in terms of precision/recall.
[link]
The paper presents a new deep learning framework for person search. The authors propose to unify the two disjoint tasks of 'person detection' and 'person re-identification' into a single problem of 'person search' using a Convolutional Neural Network (CNN). A new large-scale benchmark dataset for person search is also collected and annotated. It contains $18,184$ images, $8,432$ identities, and $96,143$ pedestrian bounding boxes. Conventional person re-identification approaches detect people first, then extract features for each person, and finally classify each one into a category (as depicted in the figure below). Instead of breaking this into separate processes of detection and classification, the problem is solved jointly by a single CNN similar to the Faster R-CNN framework.

https://i.imgur.com/ISDQd9L.png

The proposed framework (shown in the figure below) has a pedestrian proposal net, which is used to detect people, and an identification net, which extracts features for comparison with the target person. The two modules adapt to each other through joint optimization. In addition, a loss function called Online Instance Matching (OIM) is introduced to cope with the problems of Softmax or pairwise/triplet distance loss functions when the number of identities is large. A lookup table (LUT) of features from all the labeled identities is maintained. The approach also takes into account the many unlabeled identities likely to appear in scene images, storing their features in a circular queue (CQ) and using them as negatives for the labeled identities. There are no parameters to learn; the LUT and CQ are just feature buffers. In the forward pass, each labeled identity is matched against all the stored features. In the backward pass, the LUT entry for the corresponding ID is updated, new features are pushed to the CQ, and out-of-date ones are popped.

https://i.imgur.com/1Smsi56.png

To validate the approach, a new person search dataset is collected. On this dataset, the training accuracy when using Softmax loss is around $15\%$. 
However, with the OIM loss the accuracy improves consistently. Experiments are also performed to compare the method with baseline approaches. The baseline result is around $74\%$, while the proposed approach reaches $76.1\%$ without unlabeled identities and $78.7\%$ with them.
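The forward/backward behaviour of the two feature buffers can be sketched in plain Python. This is an illustrative reading of the description above, not the authors' implementation: the class name `OIMBuffers`, the temperature-scaled softmax over all stored features, the moving-average LUT update, and the default values of `temperature` and `momentum` are all assumptions made for the sketch.

```python
import math
from collections import deque


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class OIMBuffers:
    """Lookup table (labeled identities) plus circular queue (unlabeled ones)."""

    def __init__(self, num_ids, dim, queue_size, temperature=0.1, momentum=0.5):
        self.lut = [[0.0] * dim for _ in range(num_ids)]
        # A deque with maxlen drops the oldest entry automatically.
        self.cq = deque(maxlen=queue_size)
        self.temperature = temperature
        self.momentum = momentum

    def loss(self, x, target_id):
        """Forward: match x against every stored feature, labeled or not."""
        logits = [dot(v, x) / self.temperature for v in self.lut]
        logits += [dot(u, x) / self.temperature for u in self.cq]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        return log_norm - logits[target_id]  # negative log-probability

    def update(self, x, target_id=None):
        """Backward: refresh the LUT row for a labeled ID, else push to the CQ."""
        if target_id is None:
            self.cq.append(list(x))
            return
        old = self.lut[target_id]
        mixed = [self.momentum * o + (1 - self.momentum) * a
                 for o, a in zip(old, x)]
        self.lut[target_id] = l2_normalize(mixed)


buffers = OIMBuffers(num_ids=2, dim=2, queue_size=2)
x = l2_normalize([3.0, 4.0])
loss_before = buffers.loss(x, target_id=0)
buffers.update(x, target_id=0)
loss_after = buffers.loss(x, target_id=0)
for feature in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]):
    buffers.update(feature)  # unlabeled features only ever enter the queue
```

Because both buffers only store features, nothing here is a trainable parameter, which matches the point made above about the LUT and CQ.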
[link]
This paper proposes a class of loss functions for image generation that are based on distances in feature spaces:

$$\mathcal{L} = \lambda_{feat}\mathcal{L}_{feat} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{img}\mathcal{L}_{img}$$

### Key Points

- Using only an $\ell_2$ loss in image space yields over-smoothed results, since it averages over all likely locations of details.
- $\mathcal{L}_{feat}$ measures the distance in a suitable feature space and therefore preserves the distribution of fine details instead of their exact locations.
- Using only $\mathcal{L}_{feat}$ yields bad results since feature representations are contractive: many non-natural images are also mapped to the same feature vector.
- By introducing a natural image prior (a GAN), we can make sure that samples lie on the natural image manifold.

### Model

https://i.imgur.com/qNzMwQ6.png

### Exp

- Training an autoencoder
- Generating images using a VAE
- Inverting features

### Thought

I think the experiment section is a little complicated to comprehend. However, the proposed loss seems really promising and can be applied to many tasks related to image generation.

### Questions

- Sections 4.2 & 4.3 are hard to follow for me; I need to pay more attention to them in the future.
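To make the weighted sum of the three terms concrete, here is a small Python sketch on toy one-dimensional "images". Everything in it is a stand-in chosen for illustration: `pooled_features` is a hypothetical contractive feature map (pairwise averaging), the discriminator is a constant function, the adversarial term is assumed to take the common $-\log D(\cdot)$ form, and the $\lambda$ values are arbitrary.

```python
import math


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pooled_features(image):
    """Toy contractive feature map: average each adjacent pair of pixels."""
    return [(image[i] + image[i + 1]) / 2 for i in range(0, len(image) - 1, 2)]


def loss_terms(generated, target, features, discriminator):
    """Compute the three terms separately so each can be inspected."""
    return {
        "img": squared_distance(generated, target),
        "feat": squared_distance(features(generated), features(target)),
        # Assumed generator-side adversarial loss: -log D(generated).
        "adv": -math.log(discriminator(generated)),
    }


def combined_loss(terms, lam_feat=1.0, lam_adv=1.0, lam_img=1.0):
    return (lam_feat * terms["feat"]
            + lam_adv * terms["adv"]
            + lam_img * terms["img"])


target = [0.0, 1.0, 0.0, 1.0]
shifted = [1.0, 0.0, 1.0, 0.0]  # same fine detail, moved by one pixel
terms = loss_terms(shifted, target, pooled_features, lambda image: 0.5)
total = combined_loss(terms, lam_feat=1.0, lam_adv=0.1, lam_img=0.01)
```

In this toy case the shifted image has a large image-space distance but a feature distance of zero, which illustrates both why a feature loss tolerates displaced details and why, being contractive, it cannot be used alone.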
[link]
This paper performs activation maximization (AM) using a Deep Generator Network (DGN), which serves as a learned natural image prior, to synthesize realistic images that are fed as inputs into the DNN we want to understand. By visualizing synthesized images that highly activate particular neurons in the DNN, we can interpret what each neuron in the DNN has learned to detect.

### Key Points

- The DGN (natural image prior) generates more coherent images when optimizing fully-connected layer codes instead of low-level codes. However, previous studies showed that low-level features result in better reconstructions because they contain more image details. The difference is that here DGN-AM is trying to synthesize an entire layer code from scratch. Low-level features only have a small, local receptive field, so the optimization process has to tune each part of the image independently without knowing the global structure. Also, the code space at a convolutional layer is much more high-dimensional, making it harder to optimize.
- The learned prior trained on ImageNet can also generalize to Places.
- It doesn't generalize well if the architecture of the encoder trained with the DGN is different from the DNN we wish to inspect.
- The learned prior also generalizes to visualizing hidden neurons, producing more realistic textures/colors.
- When visualizing hidden neurons, DGN-AM trained on ImageNet also generalizes to Places and produces results similar to [1].
- The synthesized images are shown to teach us what the neurons in the DNN we wish to inspect prefer, instead of what the prior prefers.

### Model

![](https://cloud.githubusercontent.com/assets/7057863/21002626/b094d7aebd6111e68c95fd4931648426.png)

### Thought

Solid paper with diverse visualizations and thorough analysis.

### Reference

[1] Object Detectors Emerge In Deep Scene CNNs, B. Zhou et al.
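The core loop of activation maximization in code space can be illustrated with a toy example in Python. This is only a sketch of the idea: `generator` and `neuron` are tiny hand-written stand-ins for the DGN and for a unit of the inspected DNN, and the gradient is estimated by finite differences rather than backpropagation, purely to keep the example dependency-free.

```python
import math


def generator(z):
    """Hypothetical stand-in for the DGN: maps a 2-d code to a 3-pixel image."""
    weights = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
    return [math.tanh(sum(w * c for w, c in zip(row, z))) for row in weights]


def neuron(image):
    """Hypothetical stand-in for one neuron of the network being inspected."""
    u = [1.0, -1.0, 0.5]
    return sum(a * b for a, b in zip(u, image))


def numeric_grad(f, z, eps=1e-5):
    """Central finite-difference estimate of the gradient of f at z."""
    grad = []
    for i in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[i] += eps
        z_minus[i] -= eps
        grad.append((f(z_plus) - f(z_minus)) / (2 * eps))
    return grad


def activation_maximization(z0, steps=100, lr=0.1):
    """Gradient ascent on the code z so that neuron(generator(z)) grows."""
    z = list(z0)

    def objective(code):
        return neuron(generator(code))

    for _ in range(steps):
        g = numeric_grad(objective, z)
        z = [c + lr * gi for c, gi in zip(z, g)]
    return z


z_start = [0.0, 0.0]
z_opt = activation_maximization(z_start)
preferred_image = generator(z_opt)
```

The image is never optimized directly; only the code is updated, so every candidate stays within the range of the generator, which is what lets the generator act as an image prior.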
[link]
In this paper, the authors present a new measure for evaluating person tracking performance that overcomes problems with other event-based measures (MOTA: Multi-Object Tracking Accuracy, MCTA: Multi-Camera Tracking Accuracy, handover error) in the multi-camera scenario. The emphasis is on maintaining the correct ID for a trajectory in most frames instead of penalizing identity switches. This way, the proposed measure is suitable for the MTMC (Multi-Target Multi-Camera) setting, where the tracker is agnostic to the true identities. They do not claim that one measure is better than the other, but rather that each one serves a different purpose. For applications where preserving identity is important, it is fundamental to have measures (like the proposed ID precision and ID recall) which evaluate how well computed identities conform to true identities, while disregarding where or why mistakes occur. More formally, the new pair of precision-recall measures ($IDP$ and $IDR$) and the corresponding $F_1$ score $IDF_1$ are formulated as:

\begin{equation} IDP = \dfrac{IDTP}{IDTP+IDFP} \end{equation}

\begin{equation} IDR = \dfrac{IDTP}{IDTP+IDFN} \end{equation}

\begin{equation} IDF_1 = \dfrac{2 \times IDTP}{2 \times IDTP + IDFP + IDFN} \end{equation}

where $IDTP$ is the True Positive ID, $IDFP$ is the False Positive ID, and $IDFN$ is the False Negative ID for every corresponding association.

Another contribution of the paper is a large, fully-annotated dataset recorded in an outdoor environment. Details of the dataset: it has more than $2$ million frames of high-resolution $1080$p, $60$ fps video, observing more than $2700$ identities, and includes surveillance footage from $8$ cameras with approximately $85$ minutes of video for each camera. The dataset is available here: http://vision.cs.duke.edu/DukeMTMC/. 
Experiments show that the performance of their reference tracking system on another dataset (http://mct.idealtest.org/Datasets.html), when evaluated with existing measures, is comparable to other MTMC trackers. Also, a baseline framework on their data is established for future comparisons. 
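The three identity measures defined above are simple ratios of the identity counts, which a short Python function makes explicit. This sketch covers only the final formulas: it assumes the counts $IDTP$, $IDFP$ and $IDFN$ have already been obtained from the association between true and computed trajectories, and the numbers in the usage lines are made up for illustration.

```python
def id_measures(idtp, idfp, idfn):
    """Return (IDP, IDR, IDF1) from identity true/false positive/negative counts."""
    idp = idtp / (idtp + idfp)
    idr = idtp / (idtp + idfn)
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn)
    return idp, idr, idf1


# Hypothetical counts, not taken from the paper.
idp, idr, idf1 = id_measures(idtp=80, idfp=20, idfn=40)
harmonic_mean = 2 * idp * idr / (idp + idr)
```

As with the usual $F_1$ score, $IDF_1$ coincides with the harmonic mean of $IDP$ and $IDR$.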
[link]
In this paper the authors place a prior on the representation of a logistic regression model using known protein-protein interactions. They do so by regularizing the weights of the model using the Laplacian encoding of a graph. Here is a regularization term of this form:

$$\lambda \|w\|_1 + \eta w^T L w,$$

#### A small example:

Given a small graph of three nodes A, B, and C with one edge: {A-B} we have the following Laplacian:

$$ L = D - A = \left[\array{ 1 & 0 & 0 \\ 0 & 1 & 0\\ 0 & 0 & 0}\right] - \left[\array{ 0 & 1 & 0 \\ 1 & 0 & 0\\ 0 & 0 & 0}\right]$$

$$L = \left[\array{ 1 & -1 & 0 \\ -1 & 1 & 0\\ 0 & 0 & 0}\right] $$

If we have a small linear regression of the form:

$$y = x_Aw_A + x_Bw_B + x_Cw_C$$

Then we can look at how $w^TLw$ will impact the weights to gain insight:

$$w^TLw $$

$$= \left[\array{ w_A & w_B & w_C}\right] \left[\array{ 1 & -1 & 0 \\ -1 & 1 & 0\\ 0 & 0 & 0}\right] \left[\array{ w_A \\ w_B \\ w_C}\right] $$

$$= \left[\array{ w_A & w_B & w_C}\right] \left[\array{ w_A - w_B \\ -w_A + w_B \\ 0}\right] $$

$$ = (w_A^2 - w_Aw_B ) + (-w_Aw_B + w_B^2) $$

Setting aside the squared terms, which only penalize the magnitude of each weight, the remaining cross terms show the real impact of the regularization:

$$ = (-w_Aw_B ) + (-w_Aw_B) $$

$$ = -2w_Aw_B$$

Minimizing this term rewards a large positive product $w_Aw_B$, so the Laplacian regularization seems to increase the weight values of nodes which are connected by an edge. Because of the squared terms and the $L1$ penalty that is also used, the weights cannot grow without bound.

#### A few more experiments:

If we perform the same computation for a graph with two edges: {A-B, B-C} we have the following term, which increases the weights of both pairwise interactions:

$$ = -2w_Aw_B - 2w_Bw_C$$

If we perform the same computation for a graph with two edges: {A-B, A-C} we have no surprises:

$$ = -2w_Aw_B - 2w_Aw_C$$

Another thing to think about is what happens if there are no edges. If by default there are self-loops then the degree matrix will have 1 on the diagonal and it will be the identity, which will be an $L2$ term. If no self-loops are defined then the result is a 0 matrix, yielding no regularization at all.

#### Contribution:

A contribution of this paper is to use the absolute value of the weights to make training easier.

$$|w|^T L |w|$$

TODO: Add more about how this impacts learning.

#### Overview

Here a high-level figure shows the data and targets together with a graph prior. It looks nice so I wanted to include it.

https://i.imgur.com/rnGtHqe.png
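The small graph examples can be checked numerically with a short Python sketch. It builds $L = D - A$ for an undirected graph without self-loops and evaluates the quadratic form, which for such a Laplacian equals the sum of squared weight differences over the edges. The function names, the weight values, and the `lam` and `eta` defaults are illustrative assumptions, not values from the paper.

```python
def laplacian(nodes, edges):
    """Build L = D - A for an undirected graph given as a list of node pairs."""
    index = {name: i for i, name in enumerate(nodes)}
    size = len(nodes)
    lap = [[0] * size for _ in range(size)]
    for a, b in edges:
        i, j = index[a], index[b]
        lap[i][i] += 1  # degree contribution
        lap[j][j] += 1
        lap[i][j] -= 1  # adjacency contribution
        lap[j][i] -= 1
    return lap


def quadratic_form(w, lap):
    """Compute w^T L w."""
    n = len(w)
    return sum(w[i] * lap[i][j] * w[j] for i in range(n) for j in range(n))


def penalty(w, lap, lam=0.1, eta=1.0, use_abs=False):
    """L1 term plus the graph term; use_abs switches to |w|^T L |w|."""
    v = [abs(x) for x in w] if use_abs else list(w)
    return lam * sum(abs(x) for x in w) + eta * quadratic_form(v, lap)


nodes = ["A", "B", "C"]
L_one_edge = laplacian(nodes, [("A", "B")])
L_path = laplacian(nodes, [("A", "B"), ("B", "C")])

w = [2.0, 3.0, 5.0]
w_mixed_sign = [2.0, -3.0, 5.0]
plain = quadratic_form(w_mixed_sign, L_one_edge)                      # (2 - (-3))^2
absolute = quadratic_form([abs(x) for x in w_mixed_sign], L_one_edge)  # (2 - 3)^2
```

For the single-edge graph the quadratic form reduces to $(w_A - w_B)^2$, so connected weights are pulled toward each other; the absolute-value variant compares magnitudes only, so connected weights of opposite sign are not penalized for their sign difference.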