[link]
This paper's proposed method, the cleverly named ORGAN, combines techniques from GANs and reinforcement learning to generate candidate molecular sequences that incentivize desirable properties while still remaining plausibly ondistribution. Prior papers I've read on molecular generation have by and large used approaches based in maximum likelihood estimation (MLE)  where you construct some distribution over molecular representations, and maximize the probability of your true data under that distribution. However, MLE methods can't be less powerful when it comes to incentivizing your model to precisely conform with structural details of your target data distribution. Generative Adversarial Networks (GANs) on the other hand, use a discriminator loss that directly penalizes your generator for being recognizably different from the true data. However, GANs have historically been difficult to use on data like the stringbased molecular representations used in this paper. That's because strings are made up of discrete characters, which need to be sampled from underlying distributions, and we don't naively have good ways of making sampling from discrete distributions a differentiable process. SeqGAN was proposed to remedy this: instead of using the discriminator loss directly as the generator's loss  which would require backpropogating through the sampling operation  the generator is trained with reinforcement learning, using the discriminator score as a reward function. Each addition of an element to the sequence  or, in our case, each addition of a character to our molecular representation string  represents an action, and full sequences are rewarded based on the extent to which they resemble true sequences. https://i.imgur.com/dqtcJDU.png This paper proposes taking that model as a base, but then adding a more actuallyrewardoriented reward onto it, incentivizing the model to produce molecules that have certain predicted properties, as determined by a (also very not differentiable) external molecular simulator. So, just taking a weighted sum of discriminator loss and reward, and using that as your RL reward. After all, if you're already using the policy gradient structures of RL to train the underlying generator, you might as well add on some more traditionallooking RL rewards. The central intuition behind using RL in both of these cases is that it provides a way of using training signals that are unknown or  more to the point  nondifferentiable functions functions of model output. In their empirical tests focusing on molecules, the authors target the RL to optimize for one of solubility, synthesizability, and druggability (three welldefined properties within molecular simulator RDKit), as well as for uniqueness, penalizing any molecule that had been generated already. https://i.imgur.com/WszVd1M.png For all that this is an interesting mechanic, the empirical results are more equivocal. Compared to Naive RL, which directly optimizes for reward without the discriminator loss, ORGAN (Or, ORWGAN, the betterperforming method using a Wasserstein GAN) doesn't have notably better rates of how often its generated strings are valid, and (as you would expect) performs comparably or slightly worse when it comes to optimizing the underlying reward. It does exhibit higher diversity than naive RL on two of the three tasks, but it's hard to get an intuition for the scales involved, and how much that scale of diversity increase would impact real results. 
[link]
In the years before this paper came out in 2017, a number of different graph convolution architectures  which use weightsharing and orderinvariant operations to create representations at nodes in a graph that are contextualized by information in the rest of the graph  had been suggested for learning representations of molecules. The authors of this paper out of Google sought to pull all of these proposed models into a single conceptual framework, for the sake of better comparing and testing the design choices that went into them. All empirical tests were done using the QM9 dataset, where 134,000 molecules have predicted chemical properties attached to them, things like the amount of energy released if bombs are sundered and the energy of electrons at different electron shells. https://i.imgur.com/Mmp8KO6.png An interesting note is that these properties weren't measured empirically, but were simulated by a very expensive quantum simulation, because the former wouldn't be feasible for this large of a dataset. However, this is still a moderately interesting test because, even if we already have the capability to computationally predict these features, a neural network would do much more quickly. And, also, one might aspirationally hope that architectures which learn good representations of molecules for quantum predictions are also useful for tasks with a less available automated prediction mechanism. The framework assumes the existence of "hidden" feature vectors h at each node (atom) in the graph, as well as features that characterize the edges between nodes (whether that characterization comes through sorting into discrete bond categories or through a continuous representation). The features associated with each atom at the lowest input level of the moleculesummarizing networks trained here include: the element ID, the atomic number, whether it accepts electrons or donates them, whether it's in an aromatic system, and which shells its electrons are in. https://i.imgur.com/J7s0q2e.png Given these building blocks, the taxonomy lays out three broad categories of function, each of which different architectures implement in slightly different ways. 1. The Message function, M(). This function is defined with reference to a node w, that the message is coming from, and a node v, that it's being sent to, and is meant to summarize the information coming from w to inform the node representation that will be calculated at v. It takes into account the feature vectors of one or both nodes at the next level down, and sometimes also incorporates feature vectors attached to the edge connecting the two nodes. In a notable example of weight sharing, you'd use the same Message function for every combination of v and w, because you need to be able to process an arbitrary number of pairs, with each v having a different number of neighbors. The simplest example you might imagine here is a simple concatenation of incoming node and edge features; a more typical example from the architectures reviewed is a concatenation followed by a neural network layer. The aggregate message being sent to the receiver node is calculated by summing together the messages from each incoming vector (though it seems like other options are possible; I'm a bit confused why the paper presented summing as the only orderinvariant option). 2. The Update function, U(). This function governs how to take the aggregated message vector sent to a particular node, and combine that with the priorlayer representation at that node, to come up with a nextlayer representation at that node. Similarly, the same Update function weights are shared across all atoms. 3. The Readout function, R(), which takes the finallayer representation of each atom node and aggregates the representations into a final graphlevel representation an orderinvariant way Rather than following in the footsteps of the paper by describing each proposed model type and how it can be described in this framework, I'll instead try to highlight some of the more interesting ways in which design choices differed across previously proposed architectures.  Does the message function being sent from w to v depend on the feature value at both w and v, or just v? To put the question more colloquially, you might imagine w wanting to contextually send different information based on different values of the feature vector at node v, and this extra degree of expressivity (not present in the earliest 2015 paper), seems like a quite valuable addition (in that all subsequent papers include it)  Are the edge features static, categorical things, or are they feature vectors that get iteratively updated in the same way that the node vectors do? For most of the architectures reviewed, the former is true, but the authors found that the highest performance in their tests came from networks with continuous edge vectors, rather than just having different weights for different category types of edge  Is the Readout function something as simple as a summation of all toplevel feature vectors, or is it more complex? Again, the authors found that they got the best performance by using a more complex approach, a Set2Set aggregator, which uses itemtoitem attention within the set of finallayer atom representations to construct an aggregated graplevel embedding The empirical tests within the paper highlight a few more interestingly relevant design choices that are less directly captured by the framework. The first is the fact that it's quite beneficial to explicitly include Hydrogen atoms as part of the graph, rather than just "attaching" them to their nearestby atoms as a count that goes on that atom's feature vector. The second is that it's valuable to start out your edge features with a continuous representation of the spatial distance between atoms, along with an embedding of the bond type. This is particularly worth considering because getting spatial distance data for a molecule requires solving the freeenergy problem to determine its spatial conformation, a costly process. We might ideally prefer a network that can work on bond information alone. The authors do find a nonspatialinformation network that can perform reasonably well  reaching full accuracy on 5 of 13 targets, compared to 11 with spatial information. However, the difference is notable, which, at least from my perspective, begs the question of whether it'd ever be possible to learn representations that can match the performance of spatiallyinformed ones without explicitly providing that information. 
[link]
ScatterNets incorporates geometric knowledge of images to produce discriminative and invariant (translation and rotation) features i.e. edge information. The same outcome as CNN's first layers hold. So why not replace that first layer/s with an equivalent, fixed, structure and let the optimizer find the best weights for the CNN with its leadingedge removed. The main motivations of the idea of replacing the first convolutional, ReLU and pooling layers of the CNN with a twolayer parametric logbased DualTree Complex Wavelets Transform (DTCWT), covered by a few papers, were: Despite the success of CNNs, the design and optimizing configuration of these networks is not well understood which makes it difficult to develop these networks This improves the training of the network as the later layers can learn more complex patterns from the start of learning because the edge representations are already present Converge faster as it has fewer filter weights to learn My takeaway: a slight reduction in the amount of data necessary for training! On CIFAR10 and Caltech101 with 14 selfmade CNN with increasing depth, VGG, NIN and WideResnet: When doing transfer learning(Imagenet): DTSCNN outperformed (“useful margin”) all the CNN architectures counterpart when finetuning with only 1000 examples(balanced over classes). While on larger datasets the gap decreases ending on par with. However, when freezing the first layers on VGG and NIN, as in DTSCNN, the NIN results are in par with, while VGG outperforms! DTSCNN learns faster in the rate but reaches the same target with minor speedup (few mins) Complexity analysis in terms of weights and operations is missing Datasets: CIFAR10 & Caltech101, is a good start point (further step with a substantial dataset like COCO would be a plus). For other modalities/domains, please try and let me know Great work but ablation study is missing such as comparing full training WResNet+DTCWT vs. WResNet 14 citation so far (Cambridge): probably low value per money at the moment https://i.imgur.com/GrzSviU.png 
[link]
This is a paper released by the creators of the DeepChem library/framework, explaining the efforts they've put into facilitating straightforward and reproducible testing of new methods. They advocate for consistency between tests on three main axes. 1. On the most basic level, that methods evaluate on the same datasets 2. That they use canonical train/test splits 3. That they use canonical metrics. To that end, they've integrated a framework they call "MoleculeNet" into DeepChem, containing standardized interfaces to datasets, metrics, and test sets. **Datasets** MoleculeNet contains 17 different datasets, where "dataset" here just means a collection of data labeled for a certain task or set of tasks. The tasks fall into one of four groups:  quantum mechanical prediction (atomization energy, spectra)  prediction of properties of physical chemistry (solubility, lipophilicity)  prediction of biophysical interactions like bonding affinity  prediction of humanlevel physiological properties (toxicity, side effects, whether it passes the blood brain barrier) An interesting thing to note here is that only some datasets contain 3D orientations of molecules, because spatial orientations are properties of *a given conformation* of a molecule, and while some output measures (like binding geometry) depend on 3D arrangement, others (like solubility) don't. **Metrics** The metrics chosen were pretty straightforward  Root Mean Squared Error or Absolute Error for continuous prediction tasks, and ROCAUC or PRCAUC for prediction ones. The only notable nuance was that the paper argued for PRCAUC as the standard metric for datasets with a low number of positives, since that metric is the strictest on false positives. **Test/Train Split** Most of these were fairly normal  random split and timebased split  but I found the idea of a scaffold split (where you cluster molecules by similarity, and assign each cluster to either train or test), interesting. The idea here is that if molecules are similar enough to one another, seeing one of a pair during training might be comparable to seeing an actual shared example between training and test, and have the same propensity for overconfident results. **Models** DeepChem has put together implementations of a number of standard machine learning methods (SVM, Random Forest, XGBoost, Logistic Regression) on molecular features, as well as a number of moleculespecific graphstructured methods. At a high level, these are: https://i.imgur.com/x4yutlp.png  Graph Convolutions, which update atom representations by combining transformations of the features of bonded neighbor atoms  DAGs, which create an "atomcentric" graph for each atom in the molecule and "pull" information inwards from farther away nodes (for the record, I don't fully follow how this one works, since I haven't read the underlying paper)  Weave Model, which maintains both atom representations and pair representations between all pairs of atoms, not just ones bonded to one another, and updates each in a crosscutting way: updating an atom representation from all of its pairs (as well as itself), and then updating a pair representation from the atoms in its pairing (as well as itself). This has the benefit of making information from faraway molecules available immediately, rather than having to propagate through a graph, but is also more computationally taxing  Message Passing Neural Network, which operates like Graph Convolutions except that the feature transform used to pull in information from neighboring atoms changes depending on the type of the bond between atoms  Deep Tensor Neural Network  Instead of bonds, this approach represents atoms in 3D space, and pulls in information based on other atoms nearby in spatial distance **Results** As part of creating its benchmark, MoleculeNet also tested its implementations of its models on all its datasets. It's interesting the extent to which the results form a narrative, in terms of which tasks benefit most from flexible structurebased methods (like graph approaches) vs handengineered features. https://i.imgur.com/dCAdJac.png Predictions of quantum mechanical properties and properties of physical chemistry do consistently better with graphbased methods, potentially suggesting that the features we've thought to engineer aren't in line with useful features for those tasks. By contrast, on biophysical tasks, handengineered features combined with traditional machine learning mostly comes out on top, a fact I found a bit surprising, given the extent to which I'd read about deep learning methods claiming strong results on prediction of things like binding affinity. This was a useful pointer of things I should do some more work to resolve clarity on. And, when it came to physiological properties like toxicity and side effects, results are pretty mixed between graphbased and traditional methods. 
[link]
Tramer et al. study adversarial subspaces, subspaces of the input space that are spanned by multiple, orthogonal adversarial examples. This is achieved by iteratively searching for orthogonal adversarial examples, relative to a specific test example. This can, for example, be done using classical second or firstorder optimization methods for finding adversarial examples with the additional constraint of finding orthogonal adversarial examples. However, the authors also consider different attack strategies that work on discrete input features. In practice, on MNIST, this allows to find, on average, 44 orthogonal directions per test example. This finding indicates that adversarial examples indeed span large adversarial subspaces. Additionally, adversarial examples from the subspaces seem to transfer reasonably well to other models. The remainder of the paper links this ease of transferability to a similarity in decision boundaries learnt by different models from the same hypotheses set. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Recently, DeepMind released a new paper showing strong performance on board game tasks using a mechanism similar to the Value Prediction Network one in this paper, which inspired me to go back and get a grounding in this earlier work. A goal of this paper is to design a modelbased RL approach that can scale to complex environment spaces, but can still be used to run simulations and do explicit planning. Traditional, modelbased RL has worked by learning a dynamics model of the environment  predicting the next observation state given the current one and an action, and then using that model of the world to learn values and plan with. In addition to the advantages of explicit planning, a hope is that modelbased systems generalize better to new environments, because they predict onestep changes in local dynamics in a way that can be more easily separated from longterm dynamics or reward patterns. However, a downside of MBRL is that it can be hard to train, especially when your observation space is highdimensional, and learning a straight model of your environment will lead to you learning details that aren't actually unimportant for planning or creating policies. The synthesis proposed by this paper is the Value Prediction Network. Rather than predicting observed state at the next step, it learns a transition model in latent space, and then learns to predict nextstep reward and future value from that latent space vector. Because it learns to encode latentspace state from observations, and also learns a transition model from one latent state to another, the model can be used for planning, by simulating multiple transitions between latent state. However, unlike a normal dynamics model, whose training signal comes from a loss against observational prediction, the signal for training both latent → reward/value/discount predictions, and latent → latent transitions comes from using this pipeline to predict reward values. This means that if an aspect of the environment isn't useful for predicting reward, it won't generally be encoded into latent state, meaning you don't waste model capacity predicting irrelevant detail. https://i.imgur.com/4bJylms.png Once this model exists, it can be used for generating a policy through a treesearch planning approach: simulating future trajectories and aggregating the predicted reward along those trajectories, and then taking the highestvalue one. The authors find that their model is able to do better than both modelfree and modelbased methods on the tasks they tested on. In particular, they find that it has many of the benefits of a model that predicts full observations, but that the Value Prediction Network learns more quickly, and is more robust to stochastic environments where there's an inherent ceiling on how well a nextstep observation prediction can work. My main question coming into this paper is: how is this different from simply a value estimator like those used in DQN or A2C, and my impression is that the difference comes from this model's ability to do explicit state simulation in latent space, and then predict a value off of the *latent* state, whereas a value network predicts value from observational state. 
[link]
The paper proposes a standardized benchmark for a number of safetyrelated problems, and provides an implementation that can be used by other researchers. The problems fall in two categories: specification and robustness. Specification refers to cases where it is difficult to specify a reward function that encodes our intentions. Robustness means that agent's actions should be robust when facing various complexities of a realworld environment. Here is a list of problems: 1. Specification: 1. Safe interruptibility: agents should neither seek nor avoid interruption. 2. Avoiding side effects: agents should minimize effects unrelated to their main objective. 3. Absent supervisor: agents should not behave differently depending on presence of supervisor. 4. Reward gaming: agents should not try to exploit errors in reward function. 2. Robustness: 1. Selfmodification: agents should behave well when environment allows selfmodification. 2. Robustness to distributional shift: agents should behave robustly when test differs from train. 3. Robustness to adversaries: agents should detect and adapt to adversarial intentions in environment. 4. Safe exploration: agent should behave safely during learning as well. It is worth noting that problems 1.2, 1.4, 2.2, and 2.4 have been described back in "Concrete Problems in AI Safety". It is suggested that each of these problems be tackled in a "gridworld" environment — a 2D environment where the agent lives on a grid, and the only actions it has available are up/down/left/right movements. The benchmark consists of 10 environments, each corresponding to one of 8 problems mentioned above. Each of the environments is an extremely simple instance of the problem, but nevertheless they are of interest as current SotA algorithms usually don't solve the posed task. Specifically, the authors trained A2C and Rainbow with DQN update on each of the environments and showed that both algorithms fail on all of specification problems, except for Rainbow on 1.1. This is expected, as neither of those algorithms are designed for cases where reward function is misspecified. Both algorithms failed on 2.22.4, except for A2C on 2.3. On 2.1, the authors swapped A2C for Rainbow with Sarsa update and showed that Rainbow DQN failed while Rainbow Sarsa performed well. Overall, this is a good groundwork paper with only a few questionable design decisions, such as the design of actual reward in 1.2. It is unlikely to have impact similar to MNIST or ImageNet, but it should stimulate safetyrelated research. 
[link]
MunozGonzalez et al. propose a multiclass data poisening attack against deep neural networks based on backgradient optimization. They consider the common poisening formulation stated as follows: $ \max_{D_c} \min_w \mathcal{L}(D_c \cup D_{tr}, w)$ where $D_c$ denotes a set of poisened training samples and $D_{tr}$ the corresponding clea dataset. Here, the loss $\mathcal{L}$ used for training is minimized as the inner optimization problem. As result, as long as learning itself does not have closedform solutions, e.g., for deep neural networks, the problem is computationally infeasible. To resolve this problem, the authors propose using backgradient optimization. Then, the gradient with respect to the outer optimization problem can be computed while only computing a limited number of iterations to solve the inner problem, see the paper for detail. In experiments, on spam/malware detection and digit classification, the approach is shown to increase test error of the trained model with only few training examples poisened. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Sarkar et al. propose two “learned” adversarial example attacks, UPSET and ANGRI. The former, UPSET, learns to predict universal, targeted adversarial examples. The latter, ANGRI, learns to predict (nonuniversal) targeted adversarial attacks. For UPSET, a network takes the target label as input and learns to predict a perturbation, which added to the original image results in misclassification; for ANGRI, a network takes both the target label and the original image as input to predict a perturbation. These networks are then trained using a misclassification loss while also minimizing the norm of the perturbation. To this end, the target classifier needs to be differentiable – i.e., UPSET and ANGRI require whitebox access. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Ranjan et al. propose to constrain deep features to lie on hyperspheres in order to improve robustness against adversarial examples. For the last fullyconnected layer, this is achieved by the L2softmax, which forces the features to lie on the hypersphere. For intermediate convolutional or fullyconnected layer, the same effect is achieved analogously, i.e., by normalizing inputs, scaling them and applying the convolution/weight multiplication. In experiments, the authors argue that this improves robustness against simple attacks such as FGSM and DeepFool. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Pereyra et al. propose an entropy regularizer for penalizing overconfident predictions of deep neural networks. Specifically, given the predicted distribution $p_\theta(y_ix)$ for labels $y_i$ and network parameters $\theta$, a regularizer $\beta \max(0, \Gamma – H(p_\theta(yx))$ is added to the learning objective. Here, $H$ denotes the entropy and $\beta$, $\Gamma$ are hyperparameters allowing to weight and limit the regularizers influence. In experiments, this regularizer showed slightly improved performance on MNIST and Cifar10. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Liu et al. propose a whitebox attack against defensive distillation. In particular, the proposed attack combines the objective of the Carlini+Wagner attack [1] with a slightly different reparameterization to enforce an $L_\infty$constraint on the perturbation. In experiments, defensive distillation is shown to no be robust. [1] Nicholas Carlini, David A. Wagner: Towards Evaluating the Robustness of Neural Networks. IEEE Symposium on Security and Privacy 2017: 3957 Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Kim et al. propose Concept Activation Vectors (CAV) that represent the direction of features corresponding to specific humaninterpretable concepts. In particular, given a network for a classification task, a concept is defined as a set of images with that concept. A linear classifier is then trained to distinguish images with concept from random images without the concept based on a chosen feature layer. The normal of the obtained linear classification boundary corresponds to the learned Concept Activation Vector (CAV). By considering the directional derivative along this direction for a given input allows to quantify how well the input aligns with the chosen concept. This way, images can be ranked and the model’ sensitivity to particular concepts can be quantified. The idea is also illustrated in Figure 1. https://i.imgur.com/KOqPeag.png Figure 1: Process of constructing Concept Activation Vectors (CAVs). Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Cheney et al. study the robustness of deep neural networks, especially AlexNet, with regard to randomly dropping or perturbing weights. In particular, the authors consider three types of perturbations: synapse knockouts set random weights to zero, node knockouts set all weights corresponding to a set of neurons to zero, and weight perturbations add random Gaussian noise to the weights of a specific layer. These perturbations are studied on AlexNet, considering the top5 accuracy on ImageNet; perturbations are considered per layer. For example, Figure 1 (left) shows the influence on accuracy when knocking out synapses. As can be seen, the lower layers, especially the first convolutional layer, are impacted significantly by these perturbations. Similar observations, Figure 1 (right) are made for random perturbations of weights; although the impact is less significant. Especially highlevel features, i.e., the corresponding layers, seem to be robust to these kind of perturbations. The authors also provide evidence that these results extend to the top1 accuracy, as well as other architectures. For VGG, however, the impact is significantly less pronounced which may also be due to the employed dropout layers. https://i.imgur.com/78T6Gg2.png Figure 1: Left: Influence of setting weights in the corresponding layers to zero. Right: Influence of randomly perturbing weights of specific layers. Experiments are on ImageNet using AlexNet. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Shen et al. introduce APEGAN, a generative adversarial network (GAN) trained to remove adversarial noise from adversarial examples. In specific, as illustrated in Figure 1, a GAN is traiend to specifically distinguish clean/real images from adversarial images. The generator is conditioned on th einput image and can be seen as auto encoder. Then, during testing, the generator is applied to remove the adversarial noise. https://i.imgur.com/mgAbzCT.png Figure 1: The proposed adversarial perturbation eliminating GAN (APEGAN), see the paper for details. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
For a machine learning model to be trusted/ used one would need to be confident in its capabilities of dealing with all possible scenarios. To that end, designing unit test cases for more complex and global problems could be costly and bordering on impossible to create. **Idea**: We need a basic guideline that researchers and developers can adhere to when defining problems and outlining solutions, so that model interpretability can be defined accurately in terms of the problem statement. **Solution**: This paper outlines the basics of machine learning interpretability, what that means for different users, and how to classify these into understandable categories that can be evaluated. This paper highlights the need for interpretability, which arises from *incompleteness*,either of the problem statement, or the problem domain knowledge. This paper provides three main categories to evaluating a model/ providing interpretations:  *Application Grounded Evaluation*: These evaluations are more costly, and involve real humans evaluating real tasks that a model would take up. Domain knowledge is necessary for the humans evaluating the real task handled by the model.  *Human Grounded Evaluation:* these evaluations are simpler than application grounded, as they simplify the complex task and have humans evaluate the simplified task. Domain knowledge is not necessary in such an evaluation.  *Functionally Grounded Evaluation:* No humans are involved in this version of evaluation, here previously evaluated models are perfected or tweaked to optimize certain functionality. Explanation quality is measured by a formal definition of interpretability. This paper also outlines certain issues with the above three evaluation processes, there are certain questions that need answering before we can pick an evaluation method and metric. To highlight the factors of interpretability, we are provided with the Datadriven approach. Here we analyze each task and the various methods used to fulfill the task and see which of these methods and tasks are most significant to the model.  We are introduced to the term latent dimensions of interpretability, i.e. dimensions that are inferred not observed. These are divided into task related latent dimensions and method related latent dimensions, these are a long list of factors that are task specific or method specific. Thus this paper provides a basic taxonomy for how we should evaluate our model, and how these evaluations differ from problem to problem. The ideal scenario outlined is that researchers provide the relevant information to evaluate their proposition correctly (correctly in terms of the domain and the problem scope). 
[link]
The fundamental unit of Reinforcement Learning is the reward function, with a core assumption of the area being that actions induce rewards, with some actions being higher reward than others. But, reward functions are just artificial objects we design to induce certain behaviors; the universe doesn’t hand out “true” rewards we can build off of. Inverse Reinforcement Learning as a field is rooted in the difficulty of designing reward functions, and has the aspiration of, instead of requiring a human to hard code a reward function, inferring rewards from observing human behavior. The rough idea is that if we imagine a human is (even if they don’t know it) operating so as to optimize some set of rewards, we might be able to infer that set of underlying incentives from their actions, and, once we’ve extracted a reward function, use that to train new agents. This is a mathematically quite tricky problem, for the basic reason that a human’s actions are often consistent with a wide range of possible underlying “policy” parameters, and also that a given human policy could be an optimal for a wide range of underlying reward functions. This paper proposes using an adversarial frame on the problem, where you learn a reward function by trying to make reward higher for the human demonstrations you observe, relative to the actions the agent itself is taking. This has the effect of trying to learn an agent that can imitate human actions. However, it specifically designs its model structure to allow it to go beyond just imitation. The problem with learning a purely imitative policy is that it’s hard for the model to separate out which actions the human is taking because they are intrinsically high reward (like, perhaps, eating candy), versus actions which are only valuable in a particular environment (perhaps opening a drawer if you’re in a room where that’s where the candy is kept). If you didn’t realize that the real reward was contained in the candy, you might keep opening drawers, even if you’re in a room where the candy is laying out on the table. In mathematical terms, separating out intrinsic vs instrumental (also known as "shaped") rewards is a matter of making sure to learn separate out the reward associated with a given state from value of taking a given action at that state, because the value of your action is only born out based on assumptions about how states transition between each other, which is a function of the specific state to state dynamics of the you’re in. The authors do this by defining a g(s) function, and a h(s) function. They then define their overall reward of an action as (g(s) + h(s’))  h(s), where s’ is the new state you end up in if you take an action. https://i.imgur.com/3ENPFVk.png This follows the natural form of a Bellman update, where the sum of your future value at state T should be equal to the sum of your future value at time T+1 plus the reward you achieve at time T. https://i.imgur.com/Sd9qHCf.png By adopting this structure, and learning a separate neural network to capture the h(s) function representing the value from here to the end, the authors make it the case that the g(s) function is a purer representation of the reward at a state, regardless of what we expect to happen in the future. Using this, they’re able to use this learned reward to bootstrap good behavior in new environments, even in contexts where a learned value function would be invalid because of the assumptions of instrumental value. They compare their method to the baseline of GAIL, which is a purely imitationlearning approach, and show that theirs is more able to transfer to environments with similar states but different statetostate dynamics. 
[link]
Model Interpretability aims at explaining the inner workings of a model promoting transparency of any decisions made by the model, however for the sake of human acceptance or understanding, these explanations seem to be more geared toward human trust than remaining faithful to the model. **Idea** There is a distinct difference and tradeoff between persuasive and descriptive Interpretations of a model, one promotes human trust while the other stays truthful to the model. Promoting the former can lead to a loss in transparency of the model. **Questions to be answered:**  How do we balance between a persuasive strategy and a descriptive strategy?  How do we combat human cognitive bias? **Solutions:**  *Separating the descriptive and persuasive steps: *  We first generate a descriptive explanation, without trying to simplify it  In our final steps we add persuasiveness to this explanation to make it more understandable  *Explicit inclusion of cognitive features:*  We would include attributes that affect our functional measures of interpretability to our objective function.  This approach has some drawbacks however:  we would need to map the knowledge of the user which is an expensive process.  Any features that we fail to add to the objective function would add to the human cognitive bias  Increased complexity in optimizing of a multiobjective loss function. **Important terms:**  *Explanation Strategy*: An explanation strategy is defined as an explanation vehicle coupled with the objective function, constraints, and hyper parameters required to generate a model explanation  *Explanation model*: An explanation model is defined as the implementation of an explanation strategy, which is fit to a model that is to be interpreted.  *Human Cognitive Bias*: if an explanation model is highly persuasive or tuned toward human trust as opposed to staying true to the model, the overall evaluation of this explanation would be highly biased compared to a descriptive model. This bias can lead from commonalities between human users across a domain, expertise of the application, or the expectation of a model explanation. Such bias is known as implicit human cognitive bias.  *Persuasive Explanation Strategy*: A persuasive explanation strategy aim at convincing a user/ humanizing a model so that the user feels more comfortable with the decisions generated by the model. Fidelity or truthfulness to the model in such a strategy can be very low, which can lead to ethical dilemmas as to where to draw the line between being persuasive and being descriptive. Persuasive strategies do promote human understanding and cognition, which are important aspects of interpretability, however they fail to address the certain other aspects such as fidelity to the model.  *Descriptive Explanation Strategy*: A descriptive explanation strategy stays true to the underlying model, and generates explanations with maximum fidelity to the model. Ideally such a strategy would describe exactly what the inner working of the underlying model is, which is the main purpose of model interpretation in terms of better understanding the actual workings of the model. 
[link]
Model interpretations must be true to the model but must also promote human understanding of the working of the model. To this end we would need an interpretability model that balances the two. **Idea** : Although there exist model interpretations that balance fidelity and human cognition on a local level specific to an underlying model, there are no global model agnostic interpretation models that can achieve the same. **Solution:**  Break up each aspect of the underlying model into distinct compact decision sets that have no overlap to generate explanations that are faithful to the model, and also cover all possible feature spaces of the model.  How the solution dealt with:  *Fidelity* (staying true to the model): the labels in the approximation match that of the underlying model.  *Unambiguity* (single clear decision): compact decision sets in every feature space ensures unambiguity in the label assigned to it.  *Interpretability* (Understandable by humans): Intuitive rule based representation, with limited number of rules and predicates.  *Interactivity* (Allow user to focus on specific feature spaces): Each feature space is divided into distinct compact sets, allowing users to focus on their area of interest.  Details on a “decision set”:  Each decision set is a twolevel decision (a nested ifthen decision set), where the outer ifthen clause specifies the subspace, and the inner ifthen clause specifies the logic of assigning a label by the model.  A default set is defined to assign labels that do not satisfy any of the twolevel decisions  The pros of such a model is that we do not need to trace the logic of an assigned label too far, thus less complex than a decision tree which follows a similar ifthen structure. **Mapping fidelity vs interpretability**  To see how their model handled fidelity vs interpretability, they mapped the rate of agreement (number of times the approximation label of an instance matches the blackbox assigned label) against predefined interpretability complexity defining terms such as:  Number of predicates (sum of width of all decision sets)  Number of rules (a set of outer decision, inner decision, and classifier label)  Number of defined neighborhoods (outer ifthen decision)  Their model reached higher agreement rates to other models at lower values for interpretability complexity. 
[link]
A few years ago, a paper came out demonstrating that adaptive gradient methods (which dynamically scale gradient updates in a perparameter way according to the magnitudes of past updates) have a tendency to generalize less well than nonadaptive methods, even they adaptive methods sometimes look more performant in training, and are easier to hyperparameter tune. The 2017 paper offered a theoretical explanation for this fact based on Adam learning less complex solutions than SGD; this paper offers a different one, namely that Adam performs poorly because it is typically implemented alongside L2 regularization, which has importantly different mechanical consequences than it does in SGD. Specifically, in SGD, L2 regularization, where the loss includes both the actual loss and a L2 norm of the weights, can be made equivalent to weight decay, by choosing the right parameters for each. (see proof below). https://i.imgur.com/79jfZg9.png However, for Adam, this equivalence doesn’t hold. This is true because, in SGD, all the scaling factors are just constants, and for each learning rate value and regularization parameter, a certain weight decay parameter is implied by that. However, since Adam scales its parameter updates not by a constant learning rate but by a matrix, it’s not possible to pick other hyperparameters in a way that could get you something similar to constantparameter weight decay. To solve this, the authors suggest using an explicit weight decay term, rather than just doing implicit weight decay via L2 regularization. This is salient because the L2 norm is added to the *loss function*, and it makes up part of the gradient update, and thus gets scaled down by Adam by the same adaptive mechanism that scales down historically large gradients. When weight decay is moved outside of the form of being a norm calculation inside a loss function, and just something applied to the final update but not actually part of the adaptive scaling calculation, the authors find that 1) Adam is able to get comparable performance on image and sequence tasks (where it has previously had difficult), and 2) that even for SGD, where it was possible to find a optimal parameter setting to reproduce weight decay, having an explicit and decoupled weight decay parameter made parameters that were previously dependent on one another in their optimal values (regularization and learning rate) more independent. 
[link]
In modern machine learning, gradient descent has diversified into a zoo of subtly distinct techniques, all designed, analytically, heuristically, or practically, to ease or accelerate our model’s path through multidimensional loss space. A solid contingent of these methods are Adaptive Gradient methods, which scale the size of gradient updates according to variously calculated historical averages or variances of the vector update, which has the effect of scaling down the updates along feature dimensions that have experienced large updates in the past. The intuition behind this is that we may want to effectively reduce the learning rate (by dividing by a larger number) along dimensions where there have been large or highly variable updates. These methods are commonly used because, as the name suggests, they update to the scale of your dataset and particular loss landscape, avoiding what might otherwise be a lengthy process of hyperparameter tuning. But this paper argues that, at least on a simplified problem, adaptive methods can reach overly simplified and overfit solutions that generalize to test data less well than a nonadaptive, more standard gradient descent method. The theoretical core of the paper is a proof showing limitations of the solution reached by adaptive gradient on a simple toy regression problem, on linearly separable data. It’s a little dodgy to try to recapitulate a mathematical proof in verbal form, but I’ll do my best, on the understanding that you should really read the fully thing to fully verify the logic. The goal of the proof is to characterize the solution weight vector learned by different optimization systems. In this simplified environment, a core informational unit of your equations is X^T(y), which (in a world where labels are either 1 or 1), goes through each feature, and for each feature, takes a dot product between that feature vector (across examples) and the label vector, which has the effect of adding up a positive sum of all the feature values attached to positive examples, and then subtracting out (because of the multiply by 1) all the feature values attached to positive examples. When this is summed, we get a per feature value that will be positive if positive values of the feature tend to indicate positive labels, and negative if the opposite is true, in each case with a magnitude relating to the strength of that relationship. The claim made by the paper, supported by a lot of variable transformations, is that the solution learned by Adaptive Gradient methods reduces to a sign() operation on top of that vector, where magnitude information is lost. This happens because the running gradients that you divide out happen to correspond to the absolute value of this vector, and dividing a vector (which would be the core of the solution in the nonadaptive case) by its absolute value gives you a simple sign. The paper then goes on to show that this edge case can lead to cases of pathological overfitting in cases of high feature dimensionality relative to data points. (I wish I could give more deep insight on why this is the case, but I wasn’t really able to translate the math into intuition, outside of this fact of scaling by gradient magnitudes having the effect of losing potentially useful gradient information. The big question from all this is...does this matter? Does it matter, in particular, beyond a toy dataset, and an artificially simple problem? The answer seems to be a pretty strong maybe. The authors test adaptive methods against hyperparameteroptimized SGD and momentum SGD (a variant, but without the adaptive aspects), and find that, while adaptive methods often learn more quickly at first, SGD approaches pick up later in training, first in terms of test set error at a time when adaptive methods’ training set error still seems to be decreasing, and later even in training set error. So there seems to be evidence that solutions learned by adaptive methods generalize worse than ones learned by SGD, at least on some image recognition and languageRNN models. (Though, interestingly, RMSProp comes close to the SGD test set levels, doing the best out of the adaptive methods). Overall, this suggests to me that doing fully hyperparameter optimized SGD might be a stronger design choice, but that adaptive methods retain popularity because of their (very appealing, practically) lack of need for hyperparameter tuning to at least to a *reasonable* job, even if their performance might have more of a ceiling than that of vanilla SGD. 
[link]
The method they use basically tells the robot to reason as follows: 1. The human gave me a reward function $\tilde{r}$, selected in order to get me to behave the way they wanted. 2. So I should favor reward functions which produce that kind of behavior. This amounts to doing RL (step 1) followed by IRL on the learned policy (step 2); see the final paragraph of section 4. 
[link]
Wang et al. discuss the robustness of $k$nearest neighbors against adversarial perturbations, providing both a theoretical analysis as well as a robust 1nearest neighbor version. Specifically, for low $k$ it is shown that nearest neighbor is usually not robust. Here, robustness is judged in a distributional sense; so for fixed and low $k$, the lowest distance of any training sample to an adversarial sample tends to zero, even if the training set size increases. For $k \in \mathcal{O}(dn \log n)$, however, it is shown that $k$/nearest neighbor can be robust – the prove, showing where the $dn \log n$ comes from can be found in the paper. Finally, they propose a simple but robust $1$nearest neighbor algorithm. The main idea is to remove samples from the training set that cause adversarial examples. In particular, a minimum distance between any two samples with different labels is enforced. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Ghorbani et al. Show that neural network visualization techniques, often introduced to improve interpretability, are susceptible to adversarial examples. For example, they consider common featureimportance visualization techniques and aim to find an advesarial example that does not change the predicted label but the original interpretation – e.g., as measured on some of the most important features. Examples of the socalled top1000 attack where the 1000 most important features are changed during the attack are shown in Figure 1. The general finding, i.e., that interpretations are not robust or reliable, is definitely of relevance for the general acceptance and security of deep learning systems in practice. https://i.imgur.com/QFyrSeU.png Figure 1: Examples of changed interpretations. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Sharma and Chen provide an experimental comparison of different stateoftheart attacks against the adversarial training defense by Madry et al. [1]. They consider several attacks, including the Carlini Wagner attacks [2], elastic net attacks [3] as well as projected gradient descent [1]. Their experimental finding – that the defense by Madry et al. Can be broken by increasing the allowed perturbation size (i.e., epsilon) – should not be surprising. Every network trained adversarially will only defend reliable against attacks from the attacker used during training. [1] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. ArXiv, 1706.06083, 2017. [2] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks.InIEEE Symposiumon Security and Privacy (SP), 39–57., 2017. [3] P.Y. Chen, Y. Sharma, H. Zhang, J. Yi, and C.J. Hsieh. Ead: Elasticnet attacks to deep neuralnetworks via adversarial examples. arXiv preprint arXiv:1709.04114, 2017. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Kanbak et al. propose ManiFool, a method to determine a network’s invariance to transformations by iteratively finding adversarial transformations. In particular, given a class of transformations to consider, ManiFool iteratively alternates two steps. First, a gradient step is taken in order to move into an adversarial direction; then, the obtained perturbation/direction is projected back to the space of allowed transformations. While the details are slightly more involved, I found that this approach is similar to the general projected gradient ascent approach to finding adversarial examples. By finding worstcase transformations for a set of test samples, Kanbak et al. Are able to quantify the invariance of a network against specific transformations. Furthermore, they show that adversarial finetuning using the found adversarial transformations allows to boost invariance, while only incurring a small loss in general accuracy. Examples of the found adversarial transformations are shown in Figure 1. https://i.imgur.com/h83RdE8.png Figure 1: The proposed attack method allows to consider different classes of transformations as shown in these examples. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Brown et al. Introduce a universal adversarial patch that, when added to an image, will cause a targeted misclassification. The concept is illustrated in Figure 1; essentially, a “sticker” is computed that, when placed randomly on an image, causes misclassification. In practice, the objective function optimized can be written as $\max_p \mathbb{E}_{x\sim X, t \sim T, l \sim L} \log p(yA(p,x,l,t))$ where $y$ is the target label and $X$, $T$ and $L$ are te data space, the transformation space and the location space, respectively. The function $A$ takes as input the image and the patch and places the adversarial patch on the image according to the transformation and the location $t$ and $p$. Note that the adversarial patch is unconstrained (in contrast to general adversarial examples). In practice, the computed patch might look as illustrated in Figure 1. https://i.imgur.com/a0AB6Wz.png Figure 1: Illustration of the optimization procedure to obtain adversarial patches. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Wong and Kolter propose a method for learning provablyrobust, deep, ReLU based networks by considering the socalled adversarial polytope of finallayer activations reachable through adversarial examples. Overall, the proposed approach has some similarities to adversarial training in that the overall objective can be written as $\min_\theta \sum_{i = 1}^N \max_{\\Delta\_\infty \leq \epsilon} L(f_\theta(x_i + \Delta), y_i)$. However, in contrast to previous work, the inner maximization problem (i.e. finding the optimal adversarial example to train on) can be avoided by considering the socalled “dual network” $g_\theta$ (note that the parameters $\theta$ are the same for $f$ and $g$): $\min_\theta \sum_{i = 1}^N L(J_\epsilon(x_i, g_\theta(e_{y_i}1^T – I)), y_i)$ where $e_{y_i}$ is the onehot vector of class $y_i$ and $\epsilon$ the maximum perturbation allowed. Both the network $g_\theta$ and the objective $J_\epsilon$ are derive from the dual problem of a linear program corresponding to the adversarial perturbation objective. Considering $z_k$ to be the activations of the final layer (e.g., the logits), a common objective for adversarial examples is $\min_{\Delta} z_k(x + \Delta)_{y} – z_k(x + \Delta)_{y_{\text{target}}}$ with $x$ being a sample, and $y$ being true or target label. Wong and Kolter show that this can be rewritten as a linear program: $\min_{\Delta} c^T z_k(x + \Delta)$. Instead of minimizing over the perturbation $\Delta$, we can also optimize over the activations $z_k$ itself. We can even constrain the activations to a set $\tilde{Z}_\epsilon(x)$ around a sample $x$ in which we want the network’s activations to be robust. In the paper, this set is obtained using a convex relaxation of the ReLU network, where it is assumed that upper on lowe rbounds of all activations can be computed efficiently. The corresponding dual problem then involves $\max_\alpha J_\epsilon(x, g_\theta(x, \alpha))$ with $g_\theta$ being the dual network. Details can be found in the paper; however, I wanted to illustrate the general idea. Because of the simple structure of the network, the dual network is almost idenitical to the true network and the required upper and lower bounds can be computed in a backward style computation. In experiments, Wong and Kolter demonstrate that they can compute reasonable robustness guarantees for simple ReLu networks on MNIST. They also show that the classification boundaries of the learned networks are smoother; however, these experiments have only been conducted on simple 2D toy datasets (with significantly larger networks compared to MNIST). Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
*Note: This is a review of both Self Governing Neural Networks and ProjectionNet.* # [Self Governing Neural Networks (SGNN): the Projection Layer](https://github.com/guillaumechevalier/SGNNSelfGoverningNeuralNetworksProjectionLayer) > A SGNN's word projections preprocessing pipeline in scikitlearn In this notebook, we'll use T=80 random hashing projection functions, each of dimensionnality d=14, for a total of 1120 features per projected word in the projection function P. Next, we'll need feedforward neural network (dense) layers on top of that (as in the paper) to reencode the projection into something better. This is not done in the current notebook and is left to you to implement in your own neural network to train the dense layers jointly with a learning objective. The SGNN projection created hereby is therefore only a preprocessing on the text to project words into the hashing space, which becomes spase 1120dimensional word features created dynamically hereby. Only the CountVectorizer needs to be fitted, as it is a char ngram term frequency prior to the hasher. This one could be computed dynamically too without any fit, as it would be possible to use the [power set](https://en.wikipedia.org/wiki/Power_set) of the possible ngrams as sparse indices computed on the fly as (indices, count_value) tuples, too. ```python import sklearn from sklearn.feature_extraction.text import CountVectorizer from sklearn.pipeline import Pipeline, FeatureUnion from sklearn.random_projection import SparseRandomProjection from sklearn.base import BaseEstimator, TransformerMixin from sklearn.metrics.pairwise import cosine_similarity from collections import Counter from pprint import pprint ``` ## Preparing dummy data for demonstration: ```python class SentenceTokenizer(BaseEstimator, TransformerMixin): # char lengths: MINIMUM_SENTENCE_LENGTH = 10 MAXIMUM_SENTENCE_LENGTH = 200 def fit(self, X, y=None): return self def transform(self, X): return self._split(X) def _split(self, string_): splitted_string = [] sep = chr(29) # special separator character to split sentences or phrases. string_ = string_.strip().replace(".", "." + sep).replace("?", "?" + sep).replace("!", "!" + sep).replace(";", ";" + sep).replace("\n", "\n" + sep) for phrase in string_.split(sep): phrase = phrase.strip() while len(phrase) > SentenceTokenizer.MAXIMUM_SENTENCE_LENGTH: # clip too long sentences. sub_phrase = phrase[:SentenceTokenizer.MAXIMUM_SENTENCE_LENGTH].lstrip() splitted_string.append(sub_phrase) phrase = phrase[SentenceTokenizer.MAXIMUM_SENTENCE_LENGTH:].rstrip() if len(phrase) >= SentenceTokenizer.MINIMUM_SENTENCE_LENGTH: splitted_string.append(phrase) return splitted_string with open("./data/HowtoGrowNeatSoftwareArchitectureoutofJupyterNotebooks.md") as f: raw_data = f.read() test_str_tokenized = SentenceTokenizer().fit_transform(raw_data) # Print text example: print(len(test_str_tokenized)) pprint(test_str_tokenized[3:9]) ``` 168 ["Have you ever been in the situation where you've got Jupyter notebooks " '(iPython notebooks) so huge that you were feeling stuck in your code?', 'Or even worse: have you ever found yourself duplicating your notebook to do ' 'changes, and then ending up with lots of badly named notebooks?', "Well, we've all been here if using notebooks long enough.", 'So how should we code with notebooks?', "First, let's see why we need to be careful with notebooks.", "Then, let's see how to do TDD inside notebook cells and how to grow a neat " 'software architecture out of your notebooks.'] ## Creating a SGNN preprocessing pipeline's classes ```python class WordTokenizer(BaseEstimator, TransformerMixin): def fit(self, X, y=None): return self def transform(self, X): begin_of_word = "<" end_of_word = ">" out = [ [ begin_of_word + word + end_of_word for word in sentence.replace("//", " /").replace("/", " /").replace("", " ").replace(" ", " ").split(" ") if not len(word) == 0 ] for sentence in X ] return out ``` ```python char_ngram_range = (1, 4) char_term_frequency_params = { 'char_term_frequency__analyzer': 'char', 'char_term_frequency__lowercase': False, 'char_term_frequency__ngram_range': char_ngram_range, 'char_term_frequency__strip_accents': None, 'char_term_frequency__min_df': 2, 'char_term_frequency__max_df': 0.99, 'char_term_frequency__max_features': int(1e7), } class CountVectorizer3D(CountVectorizer): def fit(self, X, y=None): X_flattened_2D = sum(X.copy(), []) super(CountVectorizer3D, self).fit_transform(X_flattened_2D, y) # can't simply call "fit" return self def transform(self, X): return [ super(CountVectorizer3D, self).transform(x_2D) for x_2D in X ] def fit_transform(self, X, y=None): return self.fit(X, y).transform(X) ``` ```python import scipy.sparse as sp T = 80 d = 14 hashing_feature_union_params = { # T=80 projections for each of dimension d=14: 80 * 14 = 1120dimensionnal word projections. **{'union__sparse_random_projection_hasher_{}__n_components'.format(t): d for t in range(T) }, **{'union__sparse_random_projection_hasher_{}__dense_output'.format(t): False # only AFTER hashing. for t in range(T) } } class FeatureUnion3D(FeatureUnion): def fit(self, X, y=None): X_flattened_2D = sp.vstack(X, format='csr') super(FeatureUnion3D, self).fit(X_flattened_2D, y) return self def transform(self, X): return [ super(FeatureUnion3D, self).transform(x_2D) for x_2D in X ] def fit_transform(self, X, y=None): return self.fit(X, y).transform(X) ``` ## Fitting the pipeline Note: at fit time, the only thing done is to discard some unused char ngrams and to instanciate the random hash, the whole thing could be independent of the data, but here because of discarding the ngrams, we need to "fit" the data. Therefore, fitting could be avoided all along, but we fit here for simplicity of implementation using scikitlearn. ```python params = dict() params.update(char_term_frequency_params) params.update(hashing_feature_union_params) pipeline = Pipeline([ ("word_tokenizer", WordTokenizer()), ("char_term_frequency", CountVectorizer3D()), ('union', FeatureUnion3D([ ('sparse_random_projection_hasher_{}'.format(t), SparseRandomProjection()) for t in range(T) ])) ]) pipeline.set_params(**params) result = pipeline.fit_transform(test_str_tokenized) print(len(result), len(test_str_tokenized)) print(result[0].shape) ``` 168 168 (12, 1120) ## Let's see the output and its form. ```python print(result[0].toarray().shape) print(result[0].toarray()[0].tolist()) print("") # The whole thing is quite discrete: print(set(result[0].toarray()[0].tolist())) # We see that we could optimize by using integers here instead of floats by counting the occurence of every entry. print(Counter(result[0].toarray()[0].tolist())) ``` (12, 1120) [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 2.005715251142432, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] {0.0, 2.005715251142432, 2.005715251142432} Counter({0.0: 1069, 2.005715251142432: 27, 2.005715251142432: 24}) ## Checking that the cosine similarity before and after word projection is kept Note that this is a yet lowquality test, as the neural network layers above the projection are absent, so the similary is not yet semantic, it only looks at characters. ```python word_pairs_to_check_against_each_other = [ # Similar: ["start", "started"], ["prioritize", "priority"], ["twitter", "tweet"], ["Great", "great"], # Dissimilar: ["boat", "cow"], ["orange", "chewbacca"], ["twitter", "coffee"], ["ab", "ae"], ] before = pipeline.named_steps["char_term_frequency"].transform(word_pairs_to_check_against_each_other) after = pipeline.named_steps["union"].transform(before) for i, word_pair in enumerate(word_pairs_to_check_against_each_other): cos_sim_before = cosine_similarity(before[i][0], before[i][1])[0,0] cos_sim_after = cosine_similarity( after[i][0], after[i][1])[0,0] print("Word pair tested:", word_pair) print("\t  similarity before:", cos_sim_before, "\t Are words similar?", "yes" if cos_sim_before > 0.5 else "no") print("\t  similarity after :", cos_sim_after , "\t Are words similar?", "yes" if cos_sim_after > 0.5 else "no") print("") ``` Word pair tested: ['start', 'started']  similarity before: 0.8728715609439697 Are words similar? yes  similarity after : 0.8542062410985866 Are words similar? yes Word pair tested: ['prioritize', 'priority']  similarity before: 0.8458888522202895 Are words similar? yes  similarity after : 0.8495862181305898 Are words similar? yes Word pair tested: ['twitter', 'tweet']  similarity before: 0.5439282932204212 Are words similar? yes  similarity after : 0.4826046482460216 Are words similar? no Word pair tested: ['Great', 'great']  similarity before: 0.8006407690254358 Are words similar? yes  similarity after : 0.8175049752615363 Are words similar? yes Word pair tested: ['boat', 'cow']  similarity before: 0.1690308509457033 Are words similar? no  similarity after : 0.10236537810666581 Are words similar? no Word pair tested: ['orange', 'chewbacca']  similarity before: 0.14907119849998599 Are words similar? no  similarity after : 0.2019908169580899 Are words similar? no Word pair tested: ['twitter', 'coffee']  similarity before: 0.09513029883089882 Are words similar? no  similarity after : 0.1016460166230715 Are words similar? no Word pair tested: ['ab', 'ae']  similarity before: 0.408248290463863 Are words similar? no  similarity after : 0.42850530886130067 Are words similar? no ## Next up So we have created the sentence preprocessing pipeline and the sparse projection (random hashing) function. We now need a few feedforward layers on top of that. Also, a few things could be optimized, such as using the power set of the possible ngram values with a predefined character set instead of fitting it, and the Hashing's fit function could be avoided as well by passing the random seed earlier, because the Hasher doesn't even look at the data and it only needs to be created at some point. This would yield a truly embeddingfree approach. Free to you to implement this. I wanted to have something that worked first, leaving optimization for later. ## License BSD 3Clause License Copyright (c) 2018, Guillaume Chevalier All rights reserved. ## Extra links ### Connect with me  [LinkedIn](https://ca.linkedin.com/in/chevalierg)  [Twitter](https://twitter.com/guillaume_che)  [GitHub](https://github.com/guillaumechevalier/)  [Quora](https://www.quora.com/profile/GuillaumeChevalier2)  [YouTube](https://www.youtube.com/c/GuillaumeChevalier)  [Dev/Consulting](http://www.neuraxio.com/en/) ### Liked this piece of code? Did it help you? Leave a [star](https://github.com/guillaumechevalier/SGNNSelfGoverningNeuralNetworksProjectionLayer/stargazers), [fork](https://github.com/guillaumechevalier/SGNNSelfGoverningNeuralNetworksProjectionLayer/network/members) and share the love! # ProjectionNets **Notes are from [Issue 1](https://github.com/guillaumechevalier/SGNNSelfGoverningNeuralNetworksProjectionLayer/issues/1)**: Very interesting. I've finally read the [previous supporting paper](https://arxiv.org/pdf/1708.00630.pdf), thanks for the shootout. Here are my thoughts after reading it. To sum up, I think that the projections are at wordlevel instead of at sentence level. This is for two reasons: 1. they use a hidden layer size of only 256 to represent words neurally (whereas sentence representations would be quite bigger), and 2. they seem to use an LSTM on top of the ProjectionNet (SGNN) to model short sentences in their benchmarks, which would mean the ProjectionNet doesn't encode at sentencelevel but at least at a lower level (probably words). Here is my full review: On 80\*14 v.s. 1\*1120 projections:  I thought the set of 80 projection functions was not for time performance, but rather to make the union of potentially different features. I think that either way, if one projection function of 1120 entries would take as much time to compute as 80 functions of 14 entries (80\*14=1120)  please correct me if I'm wrong. On the hidden layer size of 256:  I find peculiar that the size of their FullyConnected layers is only of 256. I'd expect 300 for wordlevel neural representations and rather 2000 for sentencelevel neural representations. This leads me to think that the projection layer is at the wordlevel with char features and not at the sentencelevel with char features. On the benchmark against a nested RNN (see section "4.3 Semantic Intent Classification") in the previous supporting paper:  They say "We use an RNN sequence model with multilayer LSTM architecture (2 layers, 100 dimensions) as the baseline and trainer network. The LSTM model and its ProjectionNet variant are also compared against other baseline systems [...]". The fact they phrase their experiment as "The LSTM model and its ProjectionNet" leads me to think that they pretokenized texts on words and that the projection layer is applied at wordlevel from skipgram char features. This would seem to go in the same direction of the fact they use a hidden layer (FullyConnected) size of only 256 rather than something over or equal to like 1000. On [teacherstudent model training](https://www.quora.com/WhatisateacherstudentmodelinaConvolutionalneuralnetwork/answer/GuillaumeChevalier2):  They seem to use a regular NN like a crutch to assist the projection layer's toplevel layer to reshape information correctly. They even train the teacher at the same time that they train the student SGNN, which is something I hadn't yet seen compared to regular teacherstudent setups. I'd find simpler to use a Matching Networks directly which would be quite simpler than setting up studentteacher learning. I'm not sure how their "graph structured loss functions" works  I yet still assume that they'd need train the whole thing like in word2vec with skipgram or CBOW (but here with the new type of skipgram training procedure instead of the char featureextraction skipgram). I wonder why they did things in a so complicated way. Matching Networks (a.k.a. cosine similarity loss, a.k.a. selfattention queries dotted with either attention keys or values before a softmax) directly with negative sampling seems so much simpler. 
[link]
## Task Add '**rejection**' output to an existing classification model with softmax layer. ## Method 1. Choose some threshold $\delta$ and temperature $T$ 2. Add a perturbation to the input x (eq 2), let $\tilde x = x  \epsilon \text{sign}(\nabla_x \log S_{\hat y}(x;T))$ 3. If $p(\tilde x;T)\le \delta$, rejects 4. If not, return the output of the original classifier $p(\tilde x;T)$ is the max prob with temperature scailing for input $\tilde x$ $\delta$ and $T$ are manually chosen. 
[link]
## Task A neural network for classification typically has a **softmax** layer and outputs the class with the max probability. However, this probability does not represent the **confidence**. If the average confidence (average of max probs) for a dataset matches the accuracy, it is called **wellcalibrated**. Old models like LeNet (1998) was wellcalibrated, but modern networks like ResNet (2016) are no longer wellcalibrated. This paper explains what caused this and compares various calibration methods. ## Figure  Confidence Histogram https://i.imgur.com/dMtdWsL.png The bottom row: group the samples by confidence (max probailities) into bins, and calculates the accuracy (# correct / # bin size) within each bin.  ECE (Expected Calibration Error): average of accuracyconfidence of bins  MCE (Maximum Calibration Error): max of accuracyconfidence of bins ## Analysis  What The paper experiments how models are miscalibrated with different factors: (1) model capacity, (2) batch norm, (3) weight decay, (4) NLL. ## Solution  Calibration Methods Many calibration methods for binary classification and multiclass classification are evaluated. The method that performed the best is **temperature scailing**, which simply multiplies logits before the softmax by some constant. The paper used the validation set to choose the best constant. 
[link]
Residual Networks (ResNets) have greatly advanced the stateoftheart in Deep Learning by making it possible to train much deeper networks via the addition of skip connections. However, in order to compute gradients during the backpropagation pass, all the units' activations have to be stored during the feedforward pass, leading to high memory requirements for these very deep networks. Instead, the authors propose a **reversible architecture** based on ResNets, in which activations at one layer can be computed from the ones of the next. Leveraging this invertibility property, they design a more efficient implementation of backpropagation, effectively trading compute power for memory storage. * **Pros (+): ** The change does not negatively impact model accuracy (for equivalent number of model parameters) and it only requires a small change in the backpropagation algorithm. * **Cons (): ** Increased number of parameters, thus need to change the unit depth to match the "equivalent" ResNet  # Proposed Architecture ## RevNet This paper proposes to incorporate idea from previous reversible architectures, such as NICE [1], into a standard ResNet. The resulting model is called **RevNet** and is composed of reversible blocks, inspired from *additive coupling* [1, 2]: $ \begin{array}{rr} \texttt{RevNet Block} & \texttt{Inverse Transformation}\\ \hline \mathbf{input }\ x & \mathbf{input }\ y \\ x_1, x_2 = \mbox{split}(x) & y1, y2 = \mbox{split}(y)\\ y_1 = x_1 + \mathcal{F}(x_2) & x_2 = y_2  \mathcal{G}(y_1) \\ y_2 = x_2 + \mathcal{G}(y_1) & x_1 = y_1  \mathcal{F}(x_2)\\ \mathbf{output}\ y = (y_1, y_2) & \mathbf{output}\ x = (x_1, x_2) \end{array} $ where $\mathcal F$ and $\mathcal G$ are residual functions, composed of sequences of convolutions, ReLU and Batch Normalization layers, analoguous to the ones in a standard ResNet block, although operations in the reversible blocks need to have a stride of 1 to avoid information loss and preserve invertibility. Finally, for the `split` operation, the authors consider spliting the input Tensor across the channel dimension as in [1, 2]. Similarly to ResNet, the final RevNet architecture is composed of these invertible residual blocks, as well as nonreversible subsampling operations (e.g., pooling) for which activations have to be stored. However the number of such operations is much smaller than the number of residual blocks in a typical ResNet architecture. ## Backpropagation ### Standard The backpropagaton algorithm is derived from the chain rule and is used to compute the total gradients of the loss with respect to the parameters in a neural network: given a loss function $L$, we want to compute the gradients of $L$ with respect to the parameters of each layer, indexed by $n \in [1, N]$, i.e., the quantities $ \overline{\theta_{n}} = \partial L /\ \partial \theta_n$. (where $\forall x, \bar{x} = \partial L / \partial x$). We roughly summarize the algorithm in the left column of **Table 1**: In order to compute the gradients for the $n$th block, backpropagation requires the input and output activation of this block, $y_{n  1}$ and $y_{n}$, which have been stored, and the derivative of the loss respectively to the output, $\overline{y_{n}}$, which has been computed in the backpropagation iteration of the upper layer; Hence the name backpropagation ### RevNet Since activations are not stored in RevNet, the algorithm needs to be slightly modified, which we describe in the right column of **Table 1**. In summary, we first need to recover the input activations of the RevNet block using its invertibility. These activations will be propagated to the earlier layers for further backpropagation. Secondly, we need to compute the gradients of the loss with respect to the inputs, i.e. $\overline{y_{n  1}} = (\overline{y_{n 1, 1}}, \overline{y_{n  1, 2}})$, using the fact that: $ \begin{align} \overline{y_{n  1, i}} = \overline{y_{n, 1}}\ \frac{\partial y_{n, 1}}{y_{n  1, i}} + \overline{y_{n, 2}}\ \frac{\partial y_{n, 2}}{y_{n  1, i}} \end{align} $ Once again, this result will be propagated further down the network. Finally, once we have computed both these quantities we can obtain the gradients with respect to the parameters of this block, $\theta_n$. $ \begin{array}{cll} \hline & \mathbf{ResNet} & \mathbf{RevNet} \\ \hline \mathbf{Block} & y_{n} = y_{n  1} + \mathcal F(y_{n  1}) & y_{n  1, 1}, y_{n  1, 2} = \mbox{split}(y_{n  1})\\ && y_{n, 1} = y_{n  1, 1} + \mathcal{F}(y_{n  1, 2})\\ && y_{n, 2} = y_{n  1, 2} + \mathcal{G}(y_{n, 1})\\ && y_{n} = (y_{n, 1}, y_{n, 2})\\ \hline \mathbf{Params} & \theta = \theta_{\mathcal F} & \theta = (\theta_{\mathcal F}, \theta_{\mathcal G})\\ \hline \mathbf{Backprop} & \mathbf{in:}\ y_{n  1}, y_{n}, \overline{ y_{n}} & \mathbf{in:}\ y_{n}, \overline{y_{n }}\\ & \overline{\theta_n} =\overline{y_n} \frac{\partial y_n}{\partial \theta_n} &\texttt{# recover activations} \\ &\overline{y_{n  1}} = \overline{y_{n}}\ \frac{\partial y_{n}}{\partial y_{n1}} &y_{n, 1}, y_{n, 2} = \mbox{split}(y_{n}) \\ &\mathbf{out:}\ \overline{\theta_n}, \overline{y_{n 1}} & y_{n  1, 2} = y_{n, 2}  \mathcal{G}(y_{n, 1})\\ &&y_{n  1, 1} = y_{n, 1}  \mathcal{F}(y_{n  1, 2})\\ &&\texttt{# gradients wrt. inputs} \\ &&\overline{y_{n 1, 1}} = \overline{y_{n, 1}} + \overline{y_{n,2}} \frac{\partial \mathcal G}{\partial y_{n,1}} \\ &&\overline{y_{n 1, 2}} = \overline{y_{n, 1}} \frac{\partial \mathcal F}{\partial y_{n,2}} + \overline{y_{n,2}} \left(1 + \frac{\partial \mathcal F}{\partial y_{n,2}} \frac{\partial \mathcal G}{\partial y_{n,1}} \right) \\ &&\texttt{ gradients wrt. parameters} \\ &&\overline{\theta_{n, \mathcal G}} = \overline{y_{n, 2}} \frac{\partial \mathcal G}{\partial \theta_{n, \mathcal G}}\\ &&\overline{\theta_{n, \mathcal F}} = \overline{y_{n,1}} \frac{\partial F}{\partial \theta_{n, \mathcal F}} + \overline{y_{n, 2}} \frac{\partial F}{\partial \theta_{n, \mathcal F}} \frac{\partial \mathcal G}{\partial y_{n,1}}\\ &&\mathbf{out:}\ \overline{\theta_{n}}, \overline{y_{n 1}}, y_{n  1}\\ \hline \end{array} $ **Table 1:** Backpropagation in the standard case and for Reversible blocks  ## Experiments ** Computational Efficiency.** RevNets trade off memory requirements, by avoiding storing activations, against computations. Compared to other methods that focus on improving memory requirements in deep networks, RevNet provides the best tradeoff: no activations have to be stored, the spatial complexity is $O(1)$. For the computation complexity, it is linear in the number of layers, i.e. $O(L)$. One small disadvantage is that RevNets introduces additional parameters, as each block is composed of two residuals, $\mathcal F$ and $\mathcal G$, and their number of channels is also halved as the input is first split into two. **Results.** In the experiments section, the author compare ResNet architectures to their RevNets "counterparts": they build a RevNet with roughly the same number of parameters by halving the number of residual units and doubling the number of channels. Interestingly, RevNets achieve **similar performances** to their ResNet counterparts, both in terms of final accuracy, and in terms of training dynamics. The authors also analyze the impact of floating errors that might occur when reconstructing activations rather than storing them, however it appears these errors are of small magnitude and do not seem to negatively impact the model. To summarize, reversible networks seems like a very promising direction to efficiently train very deep networks with memory budget constraints.  ## References * [1] NICE: Nonlinear Independent Components Estimation, Dinh et al., ICLR 2015 * [2] Density estimation using Real NVP, Dinh et al., ICLR 2017 
[link]
 Presents a simple visualization method based on “filter normalization.”  Observed that __the deeper networks become, neural loss landscapes become more chaotic__; causes a dramatic drop in generalization error, and ultimately to a lack of trainability.  Observed that __skip connections promote flat minimizers and prevent the transition to chaotic behavior__; helps explain why skip connections are necessary for training extremely deep networks.  Quantitatively measures nonconvexity.  Studies the visualization of SGD optimization trajectories. 
[link]
## Temporal unit regression network keyword: temporal action proposal; computing efficiency **Summary**: In this paper, Jiyang et al designed a proposal generation and refinement network with high computation efficiency by reusing unit feature on coordinated regression and classification network. Especially, a new metric against temporal proposal called ARF is raised to meet 2 metric criteria: 1. evaluate different method on the same dataset efficiently. 2. capable to evaluate same method's performance across several datasets(generalization capability) **Model**: * decompose video and extract feature to form clip pyramid: 1. A video is first decomposed into short units where each unit has 16/32 frames. 2. extract each unit's feature using C3D/Twostream CNN model. 3. several units' features are average pooled to compose clip level feature. In order to provide context and adaptive for different length action, clip level feature also concatenate surround feature and scaled to different length by concatenating more or fewer clips. Feature for a slip is $f_c = P(\{u_j\}_{s_un_{ctx}}^{s_u})P(\{u_j\}_{s_u}^{e_u})P(\{u_j\}_{e_u}^{e_u+n_{ctx}}) $ 4. for each proposal pyramid, a classifier is used to judge if the proposal contains an action and a regressor is used to provide an offset for each proposal to refine proposal's temporal boundary. 5. finally, during prediction, NMS is used to remove redundant proposal thus provide high accuracy without changing the recall rate. https://i.imgur.com/zqvHOxj.png **Training**: There are two output need to be optimized, the classification result and the regression offset. Intuitively, the distance between the proposal and corresponding ground truth should be measured. In this paper, the authors used the L1 metric for regressor targeted the positive proposals. total loss is measured as follow: $L = \lambda L_{reg}+L_{cls}$ $L_{reg} = \frac{1}{N_{pos}}\sum_{i = 1}^N*l_s^*o_{s,i}  o_{s,i}^*+o_{e,i}  o_{e,i}^*$ During training, the ratio between positive samples and negative samples is set to 1:10. And for each positive proposal, its ground truth is the one with which it has the highest IOU or which it has IOU more than 0.5. **result**: 1. Computation complexity: 880 fps using the C3D feature on TITAN X GPU, while 260 FPS using flow CNN feature on the same machine. 2. Accuracy: mAP@0.5 = 25.6% on THUMOS14 **Conclusion**: Within this paper, it generates proposals by generate candidate at each unit with different scale and then using regression to refine the boundary. *However, there are a lot of redundant proposals for each unit which is an unnecessary waste of computing source; Also, proposals are generated with the predefined length which restricted its adaptivity to different length action; Finally the proposals are generated on the unit level which will suffer granularity problem* 
[link]
## Structured segmented network ### **key word**: action detection in video; computing complexity reduction; structurize proposal **Abstract**: using a temporal action grouping scheme (TAG) to generate accurate proposals, using a structured pyramid to model the temporal structure of each action instance to tackle the issue that detected actions are not complete, using two classifiers to determine class and completeness and using a regressor for each category to further modify the temporal bound. In this paper, Yue Zhao et al mainly tackle the problem of high computing complexity by sampling video frame and remove redundant proposals in video detection and the lack of action stage modeling. **Model**: 1. generate proposals: find continuous temporal regions with mostly high actioness. $P = \{ p_i = [s_i,e_i]\}_{i = 1}^N$ 2. splitting proposals into 3 stages: start, course, and end: first augment the proposal by 2 times symmetrical to center, and course part is the original proposal, while start and end is the left part and right part of the difference between the transformed proposal and original one. 3. build temporal pyramid representation for each stage: first L samples are sampled from the augmented proposal, then twostream feature extractor is used on each one of them and pooling features for each stage 4. build global representation for each proposal by concatenating stagelevel representations 5. a global representation for each proposal is used as input for classifiers * input = ${S_t}_{t = 1} ^{T}$a sequence of T snippet representing the video. each snippet = the frames + an optical flow stack * network: two linear classifiers; L twosteam feature extractor and several pooling layer * output: category and completeness and modification for each proposals. https://i.imgur.com/thM9oWz.png **Training**: * joint loss for classifiers: $L_{cls} = log(P(c_ip_i)* P(b_i,c_i,p_i)) $ * loss for location regression: $\lambda * 1(c_i>=1, b_i = 1) L(u_i,\varphi _i;p_i)$ **Summary**: This paper has three highlights: 1. Parallel: it uses a paralleled network structure where proposals can be processed in paralleled which will shorten the processing time based on GPU 2. temporal structure modeling and regression: give each proposal certain structure so that completeness of proposals can be achieved 3. reduce computing complexity: use two tricks: remove video redundancy by sampling frame; remove proposal redundance 
[link]
## **Keywords** One pixel attack , adversarial examples , differential evolution , targeted and nontargeted attack  ## **Summary** 1. **Introduction ** 1. **Basics** 1. Deep learning methods are better than the traditional image processing techniques in most of the cases in computer vision domain. 1. "Adversarial examples" are specifically modified images with imperceptible perturbations that are classified wrong by the network. 1. **Goals of the paper** 1. In most of the older techniques excessive modifications are made on the images and it may become perceivable to the human eyes. The authors of the paper suggest a method to create adversarial examples by changing only one , three or five pixels of the image. 1. Generating examples under constrained conditions can help in _getting insights about the decision boundaries_ in the higher dimensional space. 1. **Previous Work** 1. Methods to create adversarial examples : 1. Gradientbased algorithms using backpropagation for obtaining gradient information 1. "fast gradient sign" algorithm 1. Greedy perturbation searching method 1. Jacobian matrix to build "Adversarial Saliency Map" 1. Understanding and visualizing the decision boundaries of the DNN input space. 1. Concept of "Universal perturbations" , a perturbation that when added to any natural image can generate adversarial samples with high effectiveness 1. **Advantages of the new types of attack ** 1. _Effectiveness_ : One pixel modification with efficiency ranging from 60%  75%. 1. _SemiBlackBox attack _: Requires only blackbox feedback (probability labels) , no gradient and network architecture required. 1. _Flexibility_ : Can generalize between different types of network architectures. 1. **Methodology** 1. Finding the adversarial example as an optimization problem with constraints.** ** 1. _Differential evolution_ 1. _"Differential evolution" _, a general kind of evolutionary algorithms , used to solve multimodal optimization problems. 1. Does Not make use of gradient information 1. Advantages of DE for generating adversarial images : 1. _Higher probability of finding the global optima_ 1. _Requires less information from the target system_ 1. _Simplicity_ : Independent of the classifier 1. **Results ** 1. CIFAR10 dataset was selected with 3 types of networks architectures , all convolution network , Network in Network and VGG16 network . 500 random images were selected to create the perturbations and run both _targeted_ and_ nontargeted attack._ 1. Adversarial examples were created with only one pixel change in some cases and with 3 and 5 pixel changes in other cases. 1. The attack was generalized over different architectures. 1. Some specific targetpair classes are more vulnerable to attack compared to the others. 1. Some classes are very difficult to perturb to other classes and some cannot be changed at all. 1. Robustness of the class against attack can be broken by using higher dimensional perturbations. 1. **Conclusion** 1. Few pixels are enough to fool different types of networks. 1. The properties of the targeted perturbation depends on its decision boundary. 1. Assumptions made that small changes addictive perturbation on the values of many dimensions will accumulate and cause huge change to the output , might not be necessary for explaining why natural images are sensitive to small perturbation.  ## **Notes ** * Location of data points near the decision boundaries might affect the robustness against perturbations. * If the boundary shape is wide enough it is possible to have natural images far away from the boundary such that it is hard to craft adversarial images from it. * If the boundary shape is mostly long and thin with natural images close to the border, it is easy to craft adversarial images from them but hard to craft adversarial images to them. * The data points are moved in small steps and the change in the class probabilities are observed. ## **Open research questions** 1. Effect of a larger set of initial candidate solutions( Training images) to finding the adversarial image? 1. Generate better adversarial examples by having more iterations of Differential evolution? 1. Why imbalances occur when creating perturbations? 
[link]
This paper feels a bit like watching a 90’s show, and everyone’s in denim and miniskirts, except it’s a 2017 ML paper, and everything uses attention. (I’ll say it again, ML years are like dog years, but more so). That said, that’s not a critique of the paper: finding clever ways to cobble together techniques for your application can be an important and valuable contribution. This paper addresses the problem of text to image generation: how to take a description of an image and generate an image that matches it, and it makes two main contributions: 1) a GAN structure that seems to merge insights from Attention and Progressive GANs in order to select areas of the sentence to inform details in specific image regions, and 2) a novel discriminator structure to evaluate whether a sentence matches an image. https://i.imgur.com/JLuuhJF.png Focusing on the first of these first: their generation system works by an iterative process, that gradually builds up image resolution, and also pulls specific information from the sentence to inform details in each region. The first layer of the network generates a first “hidden state” based on a compressed representation of the sentence as a whole (the final hidden state of a LSTM text encoder, I believe), as well as random noise (typical input to a GAN). Subsequent “hidden states” are calculated by calculating attention weightings between each region of the image, and each word in the sentence, and pulling together a perregion context vector based on that attention map. (As far as I understand it, “region” here refers to the fact that when you’re at lower spatial scales of what is essentially a progressive generation process, 64x64 rather than 256x256, for example, each “pixel” actually represents a larger region of the image). I’m using quotes around “hidden state” in the above paragraph because I think it’s actually pretty confusing terminology, since it suggests a recurrent structure, but this model isn’t actually recurrent: there’s a specific set of weights for resolution block 0, and 1, and 2. This whole approach, of calculating a specific attentionweighted context vector over input words based on where you are in the generation process is very conceptually similar to the original domain of attention, where the attention query would be driven by the hidden state of the LSTM generating the translated version of some input sentence, except, here, instead of translating between languages, you’re translating across mediums. The loss for this model is a combination of perlayer loss, and a final, special, fullresolution loss. At each level of resolution, there exists a separate discriminator, which seems to be able to take in both 1) only an image, and judge whether it thinks that image looks realistic on it’s own, and 2) an image and a global sentence vector, and judge whether the image matches the sentence. It’s not fully clear from the paper, but it seems like this is based on just feeding in the sentence vector as additional input? https://i.imgur.com/B6qPFax.png For each nonfinal layer’s discriminator, the loss is a combination of both of these unconditional and conditional losses. The final contribution of this paper is something they call the DAMSM loss: the Deep Attention Multimodal Similarity Model. This is a fairly complex model structure, whose ultimate goal is to assess how closely a final generated image matches a sentence. The whole structure of this loss is based on projecting regionlevel image features (from an intermediate, 17x17 layer of a pretrained Inception Net) and word features into the same space, and then calculating dot product similarities between them, which are then used to build “visual context vectors” for each word (for each word, created a weighted sum of visual vectors, based on how similar each is to the word). Then, we take each word’s context vector, and see how close it is to the original word vector. If we, again, imagine image and word vectors as being in a conceptually shared space, then this is basically saying “if I take a weighted average of all the things that are the most similar to me, how ultimately similar is that weighted average to me”. This allows there to be a “concept representation” match found when, for example, a particular word’s concept, like “beak”, is only present in one region, but present there very strongly: the context vector will be strongly weighted towards that region, and will end up being very close, in cosine similarity terms, to the word itself. By contrast, if none of the regions are a particularly good match for the word’s concept, this value will be low. DAMSM then aggregates up to an overall “relevance” score between a sentence and image, that’s simply a sum over a word’s “concept representation”, for each word in a sentence. It then calculates conditional probabilities in two directions: what’s the probability of the sentence, given the image (relevance score of (Sent, Imag), divided by that image’s summed relevance with all possible sentences in the batch), and, also, what’s the probability of the image, given the sentence (relevance score of the pair, divided by the sentence’s summed relevance with all possible images in the batch). In addition to this wordlevel concept modeling, DAMSM also has full sentencelevel versions, where it simply calculates the relevance of each (sentence, image) pair by taking the cosine similarity between the global sentence and global image features (the final hidden state of an encoder RNN, and the final aggregated InceptionNet features, respectively). All these losses are aggregated together, to get one that uses both global information, and information as to whether specific words in a sentence are represented well in an image. 
[link]
This paper performs a fascinating toy experiment, to try to see if something languagelike in structure can be effectively induced in a population of agents, if they are given incentives that promote it. In some sense, a lot of what they find “just makes sense,” but it’s still a useful proof of concept to show that it can be done. The experiment they run takes place in a simple, twodimensional world, with a fixed number of landmarks (representing locations goals need to take place), and agents, and actions. In this construction, each agent has a set of internal goals, which can either be actions (like “go to green landmark”) they themselves need to perform, or actions that they want another agent to perform. Agents’ goals are not visible to other agents, but all agents’ reward is defined to be the aggregated reward of all agents together, so if agent A has a goal involving an action of agent B’s, it’s in B’s “interest” to do that action, if it can be communicated to them. In order to facilitate other agents performing goals, at each step, each agent both takes an action, and also emits an “utterance”, which is just a discrete symbolic “word” out of some some fixed vocabulary of words (Note that applying “word” here is a but fuzzy; the agents do not pronounce or spell a characterbased word, they just pick a discrete symbol that is playing the role of a word”. Even though other agents cannot see a given agent’s goals, they can see its public utterances, and so agents learn that communication is a way to induce other agents to perform desired actions. As a mathematically interesting aside: this setup, of allowing each agent to sample a single discrete word out of a small vocabulary at each setting, takes the deployment of some interesting computational tricks to accomplish. First off, in general, sampling a discrete single symbol out of a set of possible symbols is not differentiable, since it’s a discrete rather than continuous action, and derivatives require continuous functions. However, a paper from 2016 proposed a (heuristic) solution to this problem by means of the Gumbel Softmax Trick. This derives from the older “Gumbel Max Trick”, which is the mathematical fact that if you want to sample from a categorical distribution, a computationally easy way to do so is to add a variable sampled from a (0,1) Gumbel distribution to the log probability of each category, and then take the argmax of this as the index of the sample category (I’m not going to go another level down into why this is true, since I think it’s too far afield of the scope of this summary). Generally, argmax functions are also not differentiable. However, they can be approximated with softmaxes, which interpolate between a totally uniform and very nearly discretesample distribution based on a temperature parameter. In practice, or, at least, if this paper does what the original Gumbel Softmax paper did, during training, a discrete sample is taken, but a lowtemperature continuous approximation is used for actual gradient calculation (i.e. for gradients, the model pretends that it used the continuous approximation rather than the discrete sample). https://i.imgur.com/0RpRJG2.png Coming back to the actual communication problem, the authors do find that under these (admittedly fairly sanitized and contrived) circumstances, agents use series of discrete symbols to communicate goals to other agents, which ends up looking a lot like a very simple language. https://i.imgur.com/ZF0EbN4.png As one might expect, in environments where there were only two agents, there was no symbol that ended up corresponding to “red agent” or “blue agent”, since each could realize that the other was speaking to it. However, in threeagent environments, the agents did develop symbols that clearly mapped to these categories, to specify who directions were being given to. The authors also tried cutting off verbal communication; in these situations, the agents used gaze and movement to try to signal what they wanted other agents to do. Probably most entertainingly, when neither verbal nor visual communication was allowed, agents would move to and “physically” push other agents to the location where their action needed to be performed. 
[link]
It's like mask rcnn but for salient instances. code will be available at https://github.com/RuochenFan/S4Net. They invented a layer "mask pooling" that they claim is better than ROI pooling and ROI align. >As can be seen, our proposed binary RoIMasking and ternary RoIMasking both outperform RoIPool and RoIAlign in mAP0.7 . Specifically, our ternary RoIMasking result improves the RoIAlign result by around 2.5 points. This reflects that considering more context information outside the proposals does help for salient instance segmentation Important benchmark attached: https://i.imgur.com/wOF2Ovz.png 
[link]
# Metadata * **Title**: The Do’s and Don’ts for CNNbased Face Verification * **Authors**: Ankan Bansal Carlos Castillo Rajeev Ranjan Rama Chellappa UMIACS  University of Maryland, College Park * **Link**: https://arxiv.org/abs/1705.07426 # Abstract >Convolutional neural networks (CNN) have become the most sought after tools for addressing object recognition problems. Specifically, they have produced stateofthe art results for unconstrained face recognition and verification tasks. While the research community appears to have developed a consensus on the methods of acquiring annotated data, design and training of CNNs, many questions still remain to be answered. In this paper, we explore the following questions that are critical to face recognition research: (i) Can we train on still images and expect the systems to work on videos? (ii) Are deeper datasets better than wider datasets? (iii) Does adding label noise lead to improvement in performance of deep networks? (iv) Is alignment needed for face recognition? We address these questions by training CNNs using CASIAWebFace, UMDFaces, and a new video dataset and testing on YouTubeFaces, IJBA and a disjoint portion of UMDFaces datasets. Our new data set, which will be made publicly available, has 22,075 videos and 3,735,476 human annotated frames extracted from them. # Introduction >We make the following main contributions in this paper: • We introduce a large dataset of videos of over 3,000 subjects along with 3,735,476 human annotated bounding boxes in frames extracted from these videos. • We conduct a large scale systematic study about the effects of making certain apparently routine decisions about the training procedure. Our experiments clearly show that data variety, number of individuals in the dataset, quality of the dataset, and good alignment are keys to obtaining good performance. • We suggest the best practices that could lead to an improvement in the performance of deep face recognition networks. These practices will also guide future data collection efforts. # How they made the dataset  collect youtube videos  automated filtering with yolo and landmark detection projects  crowd source final filtering (AMT  give 50 face images to turks and ask which don't belong)  quality control through sentinels: give turks the same test but with 5 known correct answers, and rank the turks according to how they perform on this ground truth test. If they're good, trust their answers on the real tests.  result: > we have 3,735,476 annotated frames in 22,075 videos. We will publicly release this massive dataset # Questions and experiments ## Do deep recognition networks trained on stills perform well on videos? > We study the effects of this difference between still images and frames extracted from videos in section 3.1 using our new dataset. We found that mixing both still images and the large number of video frames during training performs better than using just still images or video frames for testing on any of the test datasets ## What is better: deeper or wider datasets? >In section 3.2 we investigate the impact of using a deep dataset against using a wider dataset. For two datasets with the same number of images, we call one deeper than the other if on average it has more images per subject (and hence fewer subjects) than the other. We show that it is important to have a wider dataset than a deeper dataset with the same number of images. ## Does some amount of label noise help improve the performance of deep recognition networks? >When training any supervised face classification system, each image is first associated with a label. Label noise is the phenomenon of assigning an incorrect label to some images. Label noise is an inherent part of the data collection process. Some authors intentionally leave in some label noise [25, 6, 7] in the dataset in hopes of making the deep networks more robust. In section 3.3 we examine the effect of this label noise on the performance of deep networks for verification trained on these datasets and demonstrate that clean datasets almost always lead to significantly better performance than noisy datasets. ## Does thumbnail creation method affect performance? >... This leads to generation of different types of bounding boxes for faces. Verification accuracy can be affected by the type of bounding box used. In addition, most recent face recognition and verification methods [35, 31, 33, 5, 9, 34] use some kind of 2D or 3D alignment procedure [41, 14, 28, 8]. ... In section 3.4 we study the consequences of using different thumbnail generation methods on verification performance of deep networks. We show that using a good keypoint detection method and aligning faces both during training and testing leads to the best performance. 
[link]
Normal RL agents in multiagent scenarios treat their opponents as a static part of the environment, not taking into account the fact that other agents are learning as well. This paper proposes LOLA, a learning rule that should take the agency and learning of opponents into account by optimizing "return under one step lookahead of opponent learning" So instead of optimizing under the current parameters of agent 1 and 2 $$V^1(\theta_i^1, \theta_i^2)$$ LOLA proposes to optimize taking into account one step of opponent (agent 2) learning $$V^1(\theta_i^1, \theta_i^2 + \Delta \theta^2_i)$$ where we assume the opponent's naive learning update $\Delta \theta^2_i = \nabla_{\theta^2} V^2(\theta^1, \theta^2) \cdot \eta$ and we add a secondorder correction term on top of this, the authors propose  a learning rule with policy gradients in the case that the agent does not have access to exact gradients  a way to estimate the parameters of the opponent, $\theta^2$, from its trajectories using maximum likelihood in the case you can't access them directly $$\hat \theta^2 = \text{argmax}_{\theta^2} \sum_t \log \pi_{\theta^2}(u_t^2s_t)$$ LOLA is tested on iterated prisoner's dilemma and converges to a titfortat strategy more frequently than the naive RL learning algorithm, and outperforms it. LOLA is tested on iterated matching pennies (similar to prisoner's dilemma) and stably converges to the Nash equilibrium whereas the naive learners do not. In testing on coin game (a higher dimensional version of prisoner's dilemma) they find that naive learners generally choose the defect option whereas LOLA agents have a mostlycooperative strategy. As well, the authors show that LOLA is a dominant learning rule in IPD, where both agents always do better if either is using LOLA (and even better if both are using LOLA). Finally, the authors also propose second order LOLA, which instead of assuming the opponent is a naive learner, assumes the opponent uses a LOLA learning rule. They show that second order LOLA does not lead to improved performance so there is no need to have a $n$th order LOLA arms race. 
[link]
This paper introduces a CNN based segmentation of an object that is defined by a user using four extreme points (i.e. bounding box). Interestingly, in a related work, it has been shown that clicking extreme points is about 5 times more efficient than drawing a bounding box in terms of speed. https://i.imgur.com/9GJvf17.png The extreme points have several goals in this work. First, they are used as a bounding box to crop the object of interest. Secondly, they are utilized to create a heatmap with activations in the regions of extreme points. The heatmap is created as a 2D Gaussian centered around each of the extreme points. This heatmap is matched to the size of the resized crop (i.e. 512x512) and is concatenated with the original RGB channels of the crop. The concatenated input of channel depth=4 is fed to the network which is a ResNet101 with FC and last two maxpool layers removed. In order to maintain the same receptive field, an astrous convolution is used. Pyramid scene parsing module from PSPNet is used to aggregate global context. The network is trained with a standard crossentropy loss weighted by a normalization factor (i.e. a frequency of a class in a dataset). How does it compare to "Efficient Interactive Annotation of Segmentation Datasets with PolygonRNN++ " paper in terms of accuracy? Specifically, if the polygon is wrong it is easy to correct points on the polygon that are wrong. However, it is unclear how to obtain preferred segmentation when no matter how many (greater than four) extreme points are selected, the object of interest is not segmented properly. 
[link]
This paper is about interactive Visual Question Answering (VQA) setting in which agents must ask questions about images to learn. This closely mimics how people learn from each other using natural language and has a strong potential to learn much faster with fewer data. It is referred as learning by asking (LBA) through the paper. The approach is composed of three models: http://imisra.github.io/projects/lba/approach_HQ.jpeg 1. **Question proposal module** is responsible for generating _important_ questions about the image. It is a combination of 2 models:  **Question generator** model produces a question. It is LSTM that takes image features and question type (random choice from available options) as input and outputs a question.  **Question relevance** model that selects questions relevant to the image. It is a stacked attention architecture network (shown below) that takes in generated question and image features and filters out irrelevant to the image questions. https://i.imgur.com/awPcvYz.png 2. **VQA module** learns to predict answer given the image features and question. It is implemented as stacked attention architecture shown above. 3. **Question selection module** selects the most informative question to ask. It takes current state of VQA module and its output to calculate expected accuracy improvement (details are in the paper) to measure how fast the VQA module has a potential to improve for each answer. The single question selection (i.e. best question for VQA to improve the fastest) strategy is based on epsilongreedy policy. This method (i.e. LBA) is shown to be about 50% more data efficient than naive VQA method. As an interesting future direction of this work, the authors propose to use realworld images and include a human in the training as an answer provider. 
[link]
This paper introduces a new AI task  Embodied Question Answering. The goal of this task for an agent is to be able to answer the question by observing the environment through a single egocentric RGB camera while being able to navigate inside the environment. The agent has 4 natural modules: https://i.imgur.com/6Mjidsk.png 1. **Vision**. 224x224 RGB images are processed by CNN to produce a fixedsize representation. This CNN is pretrained on pixeltopixel tasks such as RGB reconstruction, semantic segmentation, and depth estimation. 2. **Language**. Questions are encoded with 2layer LSTMs with 128d hidden states. Separate question encoders are used for the navigation and answering modules to capture important words for each module. 3. **Navigation** is composed of a planner (forward, left, right, and stop actions) and a controller that executes planner selected action for a variable number of times. The planner is LSTM taking hidden state, image representation, question, and previous action. Contrary, a controller is an MLP with 1 hidden layer which takes planner's hidden state, action from the planner, and image representation to execute an action or pass the lead back to the planner. 4. **Answering** module computes an imagequestion similarity of the last 5 frames via a dot product between image features (passed through an fclayer to align with question features) and question encoding. This similarity is converted to attention weights via a softmax, and the attentionweighted image features are combined with the question features and passed through an answer classifier. Visually this process is shown in the figure below. https://i.imgur.com/LeZlSZx.png [Successful results](https://www.youtube.com/watch?v=gVjTeIJfrk) as well as [failure cases](https://www.youtube.com/watch?v=4zH8cz2VlEg) are provided. Generally, this is very promising work which literally just scratches the surface of what is possible. There are several constraints which can be mitigated to push this field to more general outcomes. For example, use more general environments with more realistic graphics and broader set of questions and answers. 
[link]
Huang et al. study adversarial attacks on reinforcement learning policies. One of the main problems, in contrast to supervised learning, is that there might not be a reward in any time step, meaning there is no clear objective to use. However, this is essential when crafting adversarial examples as they are mostly based on maximizing the training loss. To avoid this problem, Huang et al. assume a welltrained policy; the policy is expected to output a distribution over actions. Then, adversarial examples can be computed by maximizing the crossentropy loss using the mostlikely action as ground truth. 
[link]
Biggio and Roli provide a comprehensive survey and discussion of work in adversarial machine learning. In contrast to related work [1,2], they explicitly discuss the relation of recent developments regarding the security of deep neural networks (as primarily discussed in [1] and [2]) and adversarial machine learning in general. The latter can be traced back to early work starting in 2004, e.g. involving adversarial attacks on spam filters. As a result, terminology used by Biggio and Roli is slightly different compared to publications focusing on deep neural networks. However, it also turns out that many approaches recently discussed in the deep learning community (such as adversarial training as defense) has already been introduced earlier regarding other machine learning algorithms. They also give a concise discussion of different threat models that is worth reading. [1] N. Akhtar and A. Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. arXiv.org, abs/1801.00553, 2018. [2] X. Yuan, P. He, Q. Zhu, R. R. Bhat, and X. Li. Adversarial examples: Attacks and defenses for deep learning. arXiv.org, abs/1712.07107, 2017. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Yuan et al. present a comprehensive survey of attacks, defenses and studies regarding the robustness and security of deep neural networks. Published on ArXiv in December 2017, it includes most recent attacks and defenses. For examples, Table 1 lists all known attacks – Yuan et al. categorize the attacks according to the level of knowledge needed, targeted or nontargeted, the optimization needed (e.g. iterative) as well as the perturbation measure employed. As a result, Table 1 gives a solid overview of stateoftheart attacks. Similarly, Table 2 gives an overview of applications reported so far. Only for defenses, a nice overview table is missing. Still, the authors discuss (as of my knowledge) all relevant defense strategies and comment on their performance reported in the literature. https://i.imgur.com/3KpoYWr.png Table 1: An overview of stateoftheart attacks on deep neural networks. https://i.imgur.com/4eq6Tzm.png Table 2: An overview of application sof some of the attacks in Table 1. 
[link]
Ulyanov et al. utilize untrained neural networks as regularizer/prior for various image restoration tasks such as denoising, inpainting and superresolution. In particualr, the standard formulation of such tasks, i.e. $x^\ast = \arg\min_x E(x, x_0) + R(x)$ where $x_0$ is the input image and $E$ a taskdependent data term, is rephrased as follows: $\theta^\ast = \arg\min_\theta E(f_\theta(z); x_0)$ and $x^\ast = f_{\theta^\ast}(z)$ for a fixed but random $z$. Here, the regularizer $R$ is essentially replaced by an untrained neural network $f_\theta$ – usually in the form of a convolutional encoder. The authors argue that the regualizer is effectively $R(x) = 0$ if the image can be generated by the encoder from the fixed code $z$ and $R(x) = \infty$ if not. However, this argument does not necessarily provide any insights on why this approach works (as demonstrated in the paper). A main question addressed in the paper is why the network $f_\theta$ can be used as a prior – regarding the assumption that highcapacity networks can essentially fit any image (including random noise). In my opinion, the authors do not give a convincing answer to this question. Essentially, they argue that random noise is just harder to fit (i.e. it takes longer). Therefore, limiting the number of iterations is enough as regularization. Personally I would argue that this observation is mainly due to prior knowledge put into the encoder architecture and the idea that natural images (or any images with some structure) are easily embedded into lowdimensional latent spaced compared to fully I.i.d. random noise. They provide experiments on a range of tasks including denoising, image inpainting, superresolution and neural network “inversion”. Figure 1 shows some results for image inpainting that I found quite convincing. For the remaining experiments I refer to the paper. https://i.imgur.com/BVQsaup.png Figure 1: Qualitative results for image inpainting. Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Cisse et al. propose parseval networks, deep neural networks regularized to learn orthonormal weight matrices. Similar to the work by Hein et al. [1], the mean idea is to constrain the Lipschitz constant of the network – which essentially means constraining the Lipschitz constants of each layer independently. For weight matrices, this can be achieved by constraining the matrixnorm. However, this (depending on the norm used) is often intractable during gradient descent training. Therefore, Cisse et al. propose to use a perlayer regularizer of the form: $R(W) = \W^TW – I\$ where $I$ is the identity matrix. During training, this regularizer is supposed to ensure that the learned weigth matrices are orthonormal – an efficient alternative to regular matrix manifold optimization techniques (see the paper). [1] Matthias Hein, Maksym Andriushchenko: Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation. CoRR abs/1705.08475 (2017) Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Feinman et al. use dropout to compute an uncertainty measure that helps to identify adversarial examples. Their socalled Bayesian Neural Network Uncertainty is computed as follows: $\frac{1}{T} \sum_{i=1}^T \hat{y}_i^T \hat{y}_i  \left(\sum_{i=1}^T \hat{y}_i\right)\left(\sum_{i=1}^T \hat{y}_i\right)$ where $\{\hat{y}_1,\ldots,\hat{y}_T\}$ is a set of stochastic predictions (i.e. predictions with different noise patterns in the dropout layers). Here, is can easily be seen that this measure corresponds to a variance computatin where the first term is correlation and the second term is the product of expectations. In Figure 1, the authors illustrate the distributions of this uncertainty measure for regular training samples, adversarial samples and noisy samples for two attacks (BIM and JSMA, see paper for details). https://i.imgur.com/kTWTHb5.png Figure 1: Uncertainty distributions for two attacks (BIM and JSMA, see paper for details) and normal samples, adversarial samples and noisy samples. Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Carlini and Wagner study the effectiveness of adversarial example detectors as defense strategy and show that most of them can by bypassed easily by known attacks. Specifically, they consider a set of adversarial example detection schemes, including neural networks as detectors and statistical tests. After extensive experiments, the authors provide a set of lessons which include:  Randomization is by far the most effective defense (e.g. dropout).  Defenses seem to be datasetspecific. There is a discrepancy between defenses working well on MNIST and on CIFAR.  Detection neural networks can easily be bypassed. Additionally, they provide a set of recommendations for future work:  For developing defense mechanism, we always need to consider strong whitebox attacks (i.e. attackers that are informed about the defense mechanisms).  Reporting accuracy only is not meaningful; instead, false positives and negatives should be reported.  Simple datasets such as MNIST and CIFAR are not enough for evaluation. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Grosse et al. use statistical tests to detect adversarial examples; additionally, machine learning algorithms are adapted to detect adversarial examples onthefly of performing classification. The idea of using statistics tests to detect adversarial examples is simple: assuming that there is a true data distribution, a machine learning algorithm can only approximate this distribution – i.e. each algorithm “learns” an approximate distribution. The ideal adversary uses this discrepancy to draw a sample from the data distribution where data distribution and learned distribution differ – resulting in misclassification. In practice, they show that kernelbased twosample statistics hypothesis testing can be used to identify a set of adversarial examples (but not individual one). In order to also detect individual ones, each classifier is augmented to also detect whether the input is an adversarial example. This approach is similar to adversarial training, where adversarial examples are included in the training set with the correct label. However, I believe that it is possible to again craft new examples to the augmented classifier – as is also possible with adversarial training. 
[link]
Ross and DoshiVelez propose input gradient regularization to improve robustness and interpretability of neural networks. As the discussion of interpretability is quite limited in the paper, the main contribution is an extensive evaluation of input gradient regularization against adversarial examples – in comparison to defenses such as distillation or adversarial training. Specifically, input regularization as proposed in [1] is used: $\arg\min_\theta H(y,\hat{y}) + \lambda \\nabla_x H(y,\hat{y})\_2^2$ where $\theta$ are the network’s parameters, $x$ its input and $\hat{y}$ the predicted output. Here, $H$ might be a crossentropy loss. It also becomes apparent why this regularization was originally called doublebackpropagation because the second derivative is necessary during training. In experiments, the authors show that the proposed regularization is superior to many other defenses including distillation and adversarial training. Unfortunately, the comparison does not include other “regularization” techniques to improve robustness – such as Lipschitz regularization. This makes the comparison less interpretable, especially as the combination of input gradient regularization and adversarial training performs best (suggesting that adversarial training is a meaningful defense, as well). Still, I recommend a closer look on the experiments. For example, the authors also study the input gradients of defended models, leading to some interesting conclusions. [1] H. Drucket, Y. LeCun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 1992. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Nayebi and Ganguli propose saturating neural networks as defense against adversarial examples. The main observation driving this paper can be stated as follows: Neural networks are essentially based on linear sums of neurons (e.g. fully connected layers, convolutiona layers) which are then activated; by injecting a small amount of noise per neuron it is possible to shift the final sum by large values, thereby propagating the noisy through the network and fooling the network into misclassifying an example. To prevent the impact of these adversarial examples, the network should be trained in a manner to drive many neurons into a saturated regime – noisy will, so the argument, have less impact then. The authors also give a biological motivation, which I won't go into detail here. Letting $\psi$ be the used activation function, e.g. sigmoid or ReLU, a regularizer is added to drive neurons into saturation. In particular, a penalty $\lambda \sum_l \sum_i \psi_c(h_i^l)$ is added to the loss. Here, $l$ indexes the layer and $i$ the unit in the layer; $h_i^l$ then describes the input to the nonlinearity computed for unit $i$ in layer $l$. $\psi_c$ is the complementary function defined as $\psi_c(z) = \inf_{z': \psi'(z') = 0} z – z'$ It defines the distance of the point $z$ to the nearest saturated point $z'$ where $\psi'(z') = 0$. For ReLU activations, the complementary function is the ReLU function itself; for sigmoid activations, the complementary function is $\sigma_c(z) = \sigma(z)(1  \sigma(z))$. In experiments, Nayebi and Ganguli show that training with the additional penalty yields networks with higher robustness against adversarial examples compared to adversarial training (i.e. training on adversarial examples). They also provide some insight, showing e.g. the activation and weight distribution of layers illustrating that neurons are indeed saturated in large parts. For details, see the paper. I also want to point to a comment on the paper written by Brendel and Bethge [1] questioning the effectiveness of the proposed defense strategy. They discuss a variant of the fast sign gradient method (FSGM) with stabilized gradients which is able to fool saturated networks. [1] W. Brendel, M. Behtge. Comment on “Biologically inspired protection of deep networks from adversarial attacks”, https://arxiv.org/abs/1704.01547. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Xu et al. propose feature squeezing for detecting and defending against adversarial examples. In particular, they consider “squeezing” the bit depth of the input images as well as local and nonlocal smoothing (Gaussian, median filtering etc.). In experiments they show that feature squeezing preserves accuracy while defending against adversarial examples. Figure 1 additionally shows an illustration of how feature squeezing can be used to detect adversarial examples. https://i.imgur.com/Ixv522J.png Figure 1: Illustration of using squeezing for adversarial example detection. Also find this summary on [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Sinha et al. introduce a variant of adversarial training based on distributional robust optimization. I strongly recommend reading the paper for understanding the introduced theoretical framework. The authors also provide guarantees on the obtained adversarial loss – and show experimentally that this guarantee is a realistic indicator. The adversarial training variant itself follows the general strategy of training on adversarially perturbed training samples in a minmax framework. In each iteration, an attacker crafts an adversarial examples which the network is trained on. In a nutshell, their approach differs from previous ones (apart from the theoretical framework) in the used attacker. Specifically, their attacker optimizes $\arg\max_z l(\theta, z)  \gamma \z – z^t\_p^2$ where $z^t$ is a training sample chosen randomly during training. On a side note, I also recommend reading the reviews of this paper: https://openreview.net/forum?id=Hk6kPgZA Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Zantedschi et al. propose Gaussian data augmentation in conjunction with bounded $\text{ReLU}$ activations as defense strategy against adversarial examples. Here, Gaussian data augmentation refers to the practice of adding Gaussian noise to the input during training. 
[link]
Liu et al. propose randomizing neural networks, implicitly learning an ensemble of models, to defend against adversarial attacks. In particular, they introduce Gaussian noise layers before regular convolutional layers. The noise can be seen as additional parameter of the model. During training, noise is randomly added. During testing, the model is evaluated on a single testing input using multiple random noise vectors; this essentially corresponds to an ensemble of different models (parameterized by the different noise vectors). Mathemtically, the authors provide two interesting interpretations. First, they argue that training essentially minimizes an upper bound of the (noisy) inference loss. Second, they show that their approach is equivalent to Lipschitz regularization [1]. [1] M. Hein, M. Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. ArXiv:1705.08475, 2017. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Oh et al. propose two different approaches for whitening black box neural networks, i.e. predicting details of their internals such as architecture or training procedure. In particular, they consider attributes regarding architecture (activation function, dropout, max pooling, kernel size of convolutional layers, number of convolutionaly/fully connected layers etc.), attributes concerning optimization (batch size and optimization algorithm) and attributes regarding the data (data split and size). In order to create a dataset of models, they trained roughly 11k models on MNIST; they ensured that these models have at least 98% accuracy on the validation set and they also consider ensembles. For predicting model attributes, they propose two models, called kenneno and kenneni, see Figure 1. Kenneno takes as input a set of $100$ predictions of the models (i.e. final probability distributions) and tries to directly learn the attributes using a MLP of two fully connected layers. Kenneni instead crafts a single input which allows to reason about a specific model attribute. An example for kenneni is shown in Figure 2. In experiments, they demonstrate that both models are able to predict model attributes significantly better than chance. For details, I refer to the paper. https://i.imgur.com/YbFuniu.png Figure 1: Illustration of the two proposed approaches, kenneno (top) and kenneni (bottom). https://i.imgur.com/ZXj22zG.png Figure 2: Illustration of the images created by kenneni to classify different attributes. See the paper for details. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Brendel et al. propose a decisionbased blackbox attacks against (deep convolutional) neural networks. Specifically, the socalled Boundary Attack starts with a random adversarial example (i.e. random noise that is not classified as the image to be attacked) and randomly perturbs this initialization to move closer to the target image while remaining misclassified. In pseudo code, the algorithm is described in Algorithm 1. Key component is the proposal distribution $P$ used to guide the adversarial perturbation in each step. In practice, they use a maximumentropy distribution (e.g. uniform) with a couple of constraints: the perturbed sample is a valid image; the perturbation has a specified relative size, i.e. $\\eta^k\_2 = \delta d(o, \tilde{o}^{k1})$; and the perturbation reduces the distance to the target image $o$: $d(o, \tilde{o}^{k1}) – d(o,\tilde{o}^{k1} + \eta^k)=\epsilon d(o, \tilde{o}^{k1})$. This is approximated by sampling from a standard Gaussian, clipping and rescaling and projecting the perturbation onto the $\epsilon$sphere around the image. In experiments, they show that this attack is competitive to whitebox attacks and can attack realworld systems. https://i.imgur.com/BmzhiFP.png Algorithm 1: Minimal pseudo code version of the boundary attack. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Chen et al. propose a gradientbased blackbox attack to compute adversarial examples. Specifically, they follow the general idea of [1] where the following objective is optimized: $\min_x \x – x_0\_2 + c \max\{\max_{i\neq t}\{z_i\} – z_t,  \kappa\}$. Here, $x$ is the adversarial example based on training sample $x_0$. The second part expresses that $x$ is supposed to be misclassified, i.e. the logit $z_i$ for some $i \neq t$ distinct form the true label $t$ is supposed to be larger that the logit $z_t$ corresponding to the true label. This is optimized subject to the constraint that $x$ is a valid image. The attack proposed in [1] assumes a whitebox setting were we have access to the logits and the gradients (basically requiring access to the full model). Chen et al., in contrast want to design a blackbox attacks. Therefore, they make the following changes:  Instead of using logits $z_i$, the probability distribution $f_i$ (i.e. the actual output of the network) is used.  Gradients are approximated by finite differences. Personally, I find that the first point does violate a strict blackbox setting. As company, for example, I would prefer not to give away the full probability distribution but just the final decision (or the decision plus a confidence score). Then, however, the proposed method is not applicable anymore. Anyway, the changed objective looks as follows: $\min_x \x – x_0\_2 + c \max\{\max_{i\neq t}\{\log f_i\} – \log f_t,  \kappa\}$ where, according to the authors, the logarithm is essential for optimization. One remaining problem is efficient optimization with finite differences. To this end, they propose a randomized/stochastic coordinate descent algorithm. In particular, in each step, a ranodm pixel is chosen and a local update is performed by calculating the gradient on this pixel using finite differences and performing an ADAM step. [1] N. Carlini, D. Wagner. Towards evaluating the robustness of neural networks. IEEE Symposium of Security and Privacy, 2017. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Rozsa et al. describe an adersarial attack against OpenMax [1] by directly targeting the logits. Specifically, they assume a network using OpenMax instead of a SoftMax layer to compute the final class probabilities. OpenMax allows “openset” networks by also allowing to reject input samples. By directly targeting the logits of the trained network, i.e. iteratively pushing the logits in a target direction, it does not matter whether SoftMax or OpenMax layers are used on top, the network can be fooled in both cases. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Engstrom et al. demonstrate that spatial transformations such as translations and rotations can be used to generate adversarial examples. Personally, however, I think that the paper does not address the question where adversarial perturbations “end” and generalization issues “start”. For larger translations and rotations, the problem is clearly a problem of generalization. Small ones could also be interpreted as adversarial perturbations – especially when they are computed under the intention to fool the network. Still, the distinction is not clear ... Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Lu et al. present experiments regarding adversarial examples in the real world, i.e. after printing them. Personally, I find it interesting that researchers are studying how networks can be fooled by physically perturbing images. For me, one of the main conclusions it that it is very hard to evaluate the robustness of networks against physical perturbations. Often it is unclear whether changed lighting conditions, distances or viewpoints to objects might cause the network to fail – which means that the adversarial perturbation did not cause this failure. Also found this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Madry et al. provide an interpretation of training on adversarial examples as sattlepoint (i.e. minmax) problem. Based on this formulation, they conduct several experiments on MNIST and CIFAR10 supporting the following conclusions:  Projected gradient descent might be “strongest” adversary using firstorder information. Here, gradient descent is used to maximize the loss of the classifier directly while always projecting onto the set of “allowed” perturbations (e.g. within an $\epsilon$ball around the samples). This observation is based on a large number of random restarts used for projected gradient descent. Regarding the number of restarts, the authors also note that an adversary should be bounded regarding the computation resources – similar to polynomially bounded adversaries in cryptography.  Network capacity plays an important role in training robust neural networks using the minmax formulation (i.e. using adversarial training). In particular, the authors suggest that increased capacity is needed to fit/learn adversarial examples without overfitting. Additionally, increased capacity (in combination with a strong adversary) decreases transferability of adversarial examples. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Tramèr et al. introduce both a novel adversarial attack as well as a defense mechanism against blackbox attacks termed ensemble adversarial training. I first want to highlight that – in addition to the proposed methods – the paper gives a very good discussion of stateoftheart attacks as well as defenses and how to put them into context. Tramèr et al. consider blackbox attacks, focussing on transferrable adversarial examples. Their main observation is as follows: oneshot attacks (i.e. one evaluation of the model's gradient) on adversarially trained models are likely to overfit to the model's training loss. This observation has two aspects that are experimentally validated in the paper. First, the loss of the adversarially trained model increases sharply when considering adversarial examples crafted on a different model; second, the network learns to fool the attacker by, locally, misleading the gradient – this means that perturbations computed on adversarially trained models are specialized to the local loss. These observations are also illustrated in Figure 1, however, I refer to the paper for a detailed discussion. https://i.imgur.com/dIpRz9P.png Figure 1: Illustration of the discussed observations. On the left, the loss function of an adversarially trained model considering a sample $x = x + \epsilon_1 x' + \epsilon_2 x''$ where $x'$ is a perturbation computed on the adversarially trained model and $x''$ is a perturbation computed on a different model. On the right, zoomed in version where it can be seen that the loss rises sharply in the direction of $\epsilon_1$; i.e. the model gives misleading gradients. Based on the above observations, Tramèr et al. First introduce a new oneshot attack exploiting the fact that the adversarially trained model is trained on overfitted perturbations and second introduce a new countermeasure for training more robust networks. Their attack is quite simple; they consider one FastGradient Sign Method (FSGM) step, but apply a random perturbation first to leave the local vicinity of the sample first: $x' = x + \alpha \text{sign}(\mathcal{N}(0, I))$ $x'' = x' + (\epsilon  \alpha)\text{sign}(\nabla_{x'} J(x', y))$ where $J$ is the loss function and $y$ the label corresponding to sample $x$. In experiments, they show that the attack has higher success rates on adversarially trained models. To counter the proposed attack, they propose ensemble adversarial training. The key idea is to train the model utilizing not only adversarial samples crafted on the model itself but also transferred from pretrained models. On MNIST, for example, they randomly select 64 FGSM samples from 4 different models (including the one in training). Experimentally, they show that ensemble adversarial training improves the defense again all considered attacks, including FGSM, iterative FGSM as well as the proposed attack. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
This paper describes an architecture designed for generating class predictions based on a set of features in situations where you may only have a few examples per class, or, even where you see entirely new classes at test time. Some prior work has approached this problem in ridiculously complex fashion, up to and including training a network to predict the gradient outputs of a metanetwork that it thinks would best optimize loss, given a new class. The method of Prototypical Networks prides itself on being much simpler, and more intuitive, so I hope I’ll be able to convey that in this explanation. In order to think about this problem properly, it makes sense to take a few steps back, and think about some fundamental assumptions that underly machine learning. https://i.imgur.com/Q45w0QT.png One very basic one is that you need some notion of similarity between observations in your training set, and potential new observations in your test set, in order to properly generalize. To put it very simplistically, if a test example is very similar to examples of class A that we saw in training, we might predict it to be of class A at testing. But what does it *mean* for two observations to be similar to one another? If you’re using a method like K Nearest Neighbors, you calculate a point’s class identity based on the closest trainingset observations to it in Euclidean space, and you assume that nearness in that space corresponds to likelihood of two data points having come the same class. This is useful for the use case of having new classes show up after training, since, well, there isn’t really a training period: the strategy for KNN is just carrying your whole training set around, and, whenever a new test point comes along, calculating it’s closest neighbors among those trainingset points. If you see a new class in the wild, all you need to do is add the examples of that class to your group of training set points, and then after a few examples, if your assumptions hold, you’ll be able to predict that class by (hopefully) finding those two or three points as neighbors. But what if some dimensions of your feature space matter much more than others for differentiating between classes? In a simplistic example, you could have twenty features, but, unbeknownst to you, only one is actually useful for separating out your classes, and the other 19 are random. If you use the naive KNN assumption, you wouldn’t expect to perform well here, because you will have distances in these 19 meaningless directions spreading out your points, due to randomness, more than the meaningful dimension spread them out due to belonging to different classes. And what if you want to be able to learn nonlinear relationships between your features, which the composability of multilayer neural networks lends itself well to? In cases like those, the features you were handed may be a woefully suboptimal metric space in which to calculate a kind of similarity that corresponds to differences in class identity, so you’ll just have to strike out for the territories and create a metric space for yourself. That is, at a very high level, what this paper seeks to do: learn a transformation between input features and some vector space, such that distances in that vector space correspond as well as possible to probabilities of belonging to a given output class. You may notice me using “vector space” and “embedding” similarity; they are the same idea: the result of that learned transformation, which represents your input observations as dense vectors in some pdimensional space, where p is a chosen hyperparameter. What are the concrete learning steps this architecture goes through? 1. During each training episode, sample a subset of classes, and then divide those classes into training examples, and query examples 2. Using a set of weights that are being learned by the network, map the input features of each training example into a vector space. 3. Once all training examples are mapped into the space, calculate a “mean vector” for class A by averaging all of the embeddings of training examples that belong to class A. This is the “prototype” for class A, and once we have it, we can forget the values of the embedded examples that were averaged to create it. This is a nice update on the KNN approach, since the number of parameters we need to carry around to evaluate is only (numdimensions) * (numclasses), rather than (numdimensions) * (numtrainingexamples). 4. Then, for each query example, map it into the embedding space, and use a distance metric in that space to create a softmax over possible classes. (You can just think of a softmax as a network’s predicted probability, it’s a set of floats that add up to 1). 5. Then, you can calculate the (crossentropy) error between the true output and that softmax prediction vector in the same way as you would for any classification network 6. Add up the prediction loss for all the query examples, and then backpropogate through the network to update your weights The overall effect of this process is to incentivize your network to learn, not necessarily a good prediction function, but a good metric space. The idea is that, if the metric space is good enough, and the classes are conceptually similar to each other (i.e. car vs chair, as opposed to car vs themeaningoflife), a space that does well at causing similar observed classes to be close to one another will do the same for classes not seen during training. I admit to not being sufficiently familiar with the datasets used for testing to have a sense for how well this method compares to more fully supervised classification schemes; if anyone does, definitely let me know! But the paper claims to get state of the art results compared to other approaches in this domain of fewshot learning (matching networks, and the aforementioned metalearning). One interesting note is that the authors found that squared Euclidean distance, when applied within the embedded space, worked meaningfully better than cosine distance (which is a more standard way of measuring distances between vectors, since it measures only angle, rather than magnitude). They suspect that this is because Euclidean distance, but not cosine distance belongs to a category of divergence/distance metrics (called Bregman Divergences) that have a special set of properties such that the point closest on aggregate to all points in a cluster is the average of all those points. If you want to dive way deep into the minutia on this point, I found this blog post quite good: http://mark.reid.name/blog/meetthebregmandivergences.html
1 Comments

[link]
This paper has an unusual and interesting goal, compared to those I more typically read: it wants to develop a “translation” between the messages produced by a model, and natural language used by a human. More specifically, the paper seeks to do this in the context of an twoplayer game, where one player needs to communicate information to the other. A few examples of this are:  Being shown a color, and needing to communicate to your partner so they can choose that color  Driving, in an environment where you can’t see the other car, but you have to send a coordinating message so that you don’t collide Recently, people have started training multiagent that play games like these, where they send “message” vectors back and forth, in a way fully integrated with the rest of the backpropogation procedure. From just observing the agents’ actions, it’s not necessarily clear which communication strategy they’re using. That’s why this paper poses as an explicit problem: how can we map between the communication vectors produced by the agents and the words that would be produced by a human in a similar environment? Interestingly, the paper highlights two different ways you could think about structuring a translation objective. The first is “pragmatic interpretation,” under which you optimize what you communicate about something according to the operation that needs to be performed afterwards. To make that more clear, take a look at the attached picture. Imagine that player one is shown a shape, and needs to use a phrase from the bottom language (based on how many sides the shape has) to describe it to player two, who then needs to guess the size of the shape (big or small), and is rewarded for guessing correctly. Because “many” corresponds to both a large and a small shape, the strategy that optimizes the action that player two takes, conditional on getting player one’s message, is to lie and describe a hexagon as “few”, since that will lead to correct inference about the size of the shape, which is what’s most salient here. This example shows how, if you optimize a translation mapping by trying to optimize the reward that the posttranslation agent can get, you might get a semantically incorrect translation. That might be good for the task at hand, but, because it leaves you with incorrect beliefs about the true underlying mapping, it will generalize poorly to different tasks. The alternate approach, championed by the paper, is to train a translation such that the utterances in both languages are similar insofar as, conditional on hearing them, and having some value for their own current state, the listening player arrives at similar beliefs about the current state of the player sending the message. This is mathematically framed as by defining a metric q, representing the quality of the translation between two z vectors, as: “taking an expectation over all possible contextual states of (player 1, player 2), what is the difference between the distribution of beliefs about the state of player 1 (the sending player) induced in player 2 by hearing each of the z vectors. Because taking the full expectation over this joint distribution is intractable, the approach is instead done by sampling. These equations require that you have reasonable models of human language, and understanding of human language, in the context of games. To do this, the authors used two types of datasets: 1. Linguistic descriptions of objects of things, like the xkcd color dataset. Here, the player’s hidden state is the color that they are trying to describe using some communication scheme. 2. Mechanical turk game runs playing the aforementioned driver game, where they have to communicate to the other driver. Here, the player’s “hidden state” represents a combination of its current location and intentions. From these datasets, they can train simple emulator models that learn “what terms is a human most likely to use for a given color” [p(zx)], and “what colors will a human guess, conditional on those terms”. The paper closes by providing a proof as to how much rewardbased value is lost by optimizing for the true semantic meaning, rather than the most pragmatically useful translation. They find that there is a bound on the gap, and that, in many empirical cases, the observed gap is quite small. Overall, this paper was limited in scope, but provided an interesting conceptual framework for thinking about how you might structure a translation, and the different implications that structure might have on your results. 
[link]
DeepMind’s recently released paper (one of a boatload coming out in the wake of ICLR, which just finished in Vancouver) addresses the problem of building an algorithm that can perform well on tasks that don’t just stay fixed in their definition, but instead evolve and change, without giving the agent a chance to retrain in the middle. An example of this, is one used at various points in the paper: of an agent trying to run East, that finds two of its legs (a different two each time) slowly less functional. The theoretical framework they use to approach this problem is that of meta learning. Meta Learning is typically formulated as: how can I learn to do well on a new task, given only a small number of examples of that task? That’s why it’s called “meta”: it’s an extra, higherlevel optimization loop applied around the process of learning. Typical learning learns parameters of some task, meta learning learns longerscale parameters that make the shortscale, typical learning work better. Here, the task that evolves and changes over time (i.e. a nonstationary task) is seen as a close variant of the the multitask problem. And, so, the hope is that a model that can quickly adapt to arbitrary new tasks can also be used to learn the ability to adapt to a gradually changing task environment. The meta learning algorithm that got most directly adapted for this paper is MAML: Model Agnostic Meta Learning. This algorithm works by, for a large number of tasks, initializing the model at some parameter set theta, evaluating the loss for a few examples on that task, and moving the gradients from the initialization theta, to a taskspecific parameter set phi. Then, it calculating the “test set” performance of the onestep phi parameters, on the task. But then  the crucial thing here  the meta learning model updates its initialization parameters, theta. So, the meta learning model is learning a set of parameters that provides a good jumping off point for any given task within the distribution of tasks the model is trained on. In order to do this well, the theta parameters need to both 1) learn any general information, shared across all tasks, and 2) position the parameters such that an initial update step moves the model in the profitable direction. They adapted this idea, of training a model that could quickly update to multiple tasks, to the environment of a slowly/continuously changing environment, where certain parameters of the task the agent is facing. In this formulation, our set of tasks is no longer random draws from the distribution of possible tasks, but a smooth, Markovwalk gradient over tasks. The main change that the authors made to the original MAML algorithm was to say that each general task would start at theta, but then, as that task gradually evolved, it would perform multiple updates: theta to phi1, phi1 to phi2, and so on. The original theta parameters would then be updated according to a similar principle as the MAML parameters: so as to make the loss, summed over the full nonstationary task (notionally composed of many little subtasks) is as low as possible. 
[link]
This paper’s approach goes a step further away from the traditional word embedding approach  of training embeddings as the lookuptable first layer of an unsupervised monolingual network  and proposes a more holistic form of transfer learning that involves not just transferring over learned knowledge contained in a set of vectors, but a fully trained model. Transfer learning is the general idea of using part or all of a network trained on one task to perform a different task. The most common kind of transfer learning is in the image domain, where models are first trained on the enormous ImageNet dataset, and then several of the lower layers of the network (where more local, smallpixelrange patterns are detected) are transferred, with their weights fixed in place to a new network. The modeler then attaches a few more layers to the top, connects it to a new target, and then is able to much more quickly learn their new target, because the pretraining has gotten them into a useful region of parameterspace. https://i.imgur.com/wjloHdi.png Within NLP, the most common form of transfer learning is initializing the lookup table of vectors that’s used to convert discrete words in to vectors (also known as an embedding) with embeddings pretrained on huge unsupervised datasets, like GloVe, trained on all of English Wikipedia. Again, this makes your overall task easier to train, because you’ve already converted words from their unuseful binary representation (where the word cat is just as far from Peru as it is from kitten) to a meaningful realvalued representation. The approach suggested in this paper goes beyond simply learning the vector input representation of words. Instead, the authors suggest using as word vectors the sequence of encodings produced by an encoderdecoder bidirectional recurrent model. An encoderdecoder model means that you have one part of the network that maps from input sentence to an “encoded” representation of the sentence, and then another part that maps that encoded representation into the proper tokens in the target language. Historically, this encoding had been a single vector for the whole sentence, which tried to conceptually capture all of the words into one vector. More recently, a different approach has grown popular, where the RNN produces a number of encodings equal to the number of input words. Then, when the decoder is producing words in the target sentence, it uses something called “attention” to select a weighted combination of these encodings at each point in time. Under this scheme, the decoder might pull out information about verbs when its own hidden state suggests it needs a verb, and might pull out information about pronoun referents when its own hidden state asks for that. The upshot of all of this is that you end up with a sequence of encoded vectors equal in length to your number of inputs. Because the RNN is bidirectional, which means the encoding is a concatenation of the forward RNN and backward RNN, that means that each of these encodings captures both information about its corresponding word, and contextual information about the rest of the sentence. The proposal of the authors is to train the encoderdecoder outlined above, and, once it is trained, lop off the decoder, and use the encoded sequence of words as your representation of the input sequence of words. An important note in all this is that recurrent encoderdecoder model was itself trained using a lookup table initialized with learned GloVe vectors, so in a sense they’re not substituting for the unsupervised embeddings so much as learning marginal information on top of them. The authors went on to test this approach on a few problems  question answering, logical entailment, and sentiment classification. They compared their use of the RNN encoded word vectors (which they call Context Vectors, or CoVE) with models initialized just using the fixed GloVE word vectors. One important note here is that, because each word vector is learned fully in context, the same word will have a different vector in each sentence it appears in. That’s why you can’t transfer one single vector per word, but instead have to transfer the recurrent model that can produce the vectors. All in all, the authors found that concatenating CoVe vectors to GloVe vectors, and using the concatenated version as input, produced sizable gains on the problems where it was tried. That said, it’s a pretty heavy lift to integrate someone else’s learned weights into your own model, just in terms of getting all the code to play together nicely. I’m not sure if this is a compelling enough result, a la ImageNet pretraining, for practitioners to want to go to the trouble of tacking a nontraining RNN onto the bottom of all their models. If I ever get a chance, I’d be interested to play with the vectors you get out of this model, and look at how much variance you see in the vectors learned for different words across different sentences. Do you see clusters that correspond to sense disambiguation, (a la state of mind, vs a rogue state)? And, how does this contextual approach to the paper I reviewed yesterday, that also learns embeddings on a machine translation task, but does so in terms of training a lookup table, rather than using trained encodings? All in all, I enjoyed this paper: it was a simple idea, and I’m not sure whether it was a compelling one, but it did leave me with some interesting questions. 
[link]
There are mathematicians, still today, who look at deep learning, and get real salty over the lack of convex optimization. That is to say: convex functions are ones where you have an actual guarantees that gradient descent will converge, and mathematicians of olden times (i.e. 2006) spent reams of paper arguing that this or that function had convex properties, and thus could be guaranteed to converge, under this or that set of arcane conditions. And then, Deep Learning came along, with its huge, nonlinear, very much nonconvex objective functions, that it was nonetheless trying to optimize via gradient descent. From the perspective of an optimization theorist, this had the whiff of heresy, but exceptionally effective heresy. And, so, the field of DL has halfexploded, halfstumbled along, showcasing a portfolio of very impressive achievements, but with theory very much a secondary priority relative to performance. Something else that gradient descent isn’t supposed to be able to do is learn models that include discrete (i.e. noncontinuous) operators. Without continuous gradients, the functions don’t have an obvious way to “push” in a certain direction, to modulate the loss at the end of the network. Discrete nodes mean that the value just jumps from being in one state, to being in the other, with no intermediate values. This has historically posed a problem for algorithms fueled by gradient descent. The authors of this paper came up with a solution that is 60% cleverness, and 40% just guessing that “even if we ignore the theory, things will probably work well enough”. But, first, their overall goal: to create a Variational Auto Encoder where the latent states, the compressed internal representation that is typically an array of continuous values, is instead an array of categorical values. The goal of this was 1) to have a representation type that was a better match for the discrete nature of data types like speech (which has distinct phonemes we might like to discretely capture), and, 2) to have a more compressed latent space that would (of necessity) focus on more global information, and leave local pixellevel information to be learned by the expressive PixelCNN decoder. The way they do this is remarkably simple. First, they learn a typical VAE encoder, mapping from the input pixels to a continuous z space. (An interesting sidenote here is that this paper uses spatially organized z; instead of using one single z vector to represent the whole image, they may have 32x32 spatial locations, each of which has its own z vector, to represent at 128x128 image). Then, for each of the spatial regions, they take the continuous vector produced by the network, and compare it to a fixed set of “embedding” vectors, of the same shape. That spatial location is then lumped into the category of the embedding that it’s closest to, meaning that you end up with a compressed layer of 32x32 (in this case) spatial regions, each of which is represented by a categorical number between 0 and maxnumcategories. Then, the network passes forward the embedding that this input vector was just “snapped” to being, Then, the decoder uses the full spatial location set of embeddings to do its decoding. https://i.imgur.com/P8LQRYJ.png The clever thing here comes when you ask how to train the encoder to produce a different embedding, when there was this discrete “jump” that happened. The authors choose to just avoid the problem, more or less. They do that by just taking the gradient signals that come back from the end of the network to the embedding, and just pass those directly to the vector that was used to nearestneighborslookup the embedding. Basically, they pretend that they passed the vector through the rest of the network, rather than the embedding. The embeddings are then trained in a K Means Clustering kind of way; with the embeddings being iteratively updated to be closer to the points that were assigned to their embedding in each round of training. This is the “Vector Quantization” part of VQVAE Overall, this seems to perform quite well: with the low capacity of the latente space meaning that it is incentivized to handle more global structure, while leaving low level pixel details to the decoder. It is also much easier to fit afterthefact distributions over; once we’ve trained a VQVAE, we can easily learn a global model that represents the location by location dependencies between the categories (i.e. a 1 in this corner means at 5 in this other corner is more probable). This gives us the ability to have an analytically specified distribution, in latent space, that actually represents the structure of how these “concept level categories” relate to each other. By contrast, with most continuous latent spaces, it’s intractable to learn an explicit density function after the fact, and thus if we want to be able to sample we need to specify and enforce a prior distribution over z ahead of time. 
[link]
Over the last five years, artificial creative generation powered by ML has blossomed. We can now imagine buildings based off of a sketch, peer into the dogtiled “dreams” of a convolutional net, and, as of 2017, turn images of horses into ones of zebras. This last problem  typically termed imagetoimage translation is the one that CycleGAN focuses on. The kinds of transformations that can full under this category is pretty conceptually broad: zebras to horses, summer scenes to winter ones, images to Monet paintings. (Note: I switch between using horse/zebra as my explanatory example, and using summer/winter. Both have advantages for explaining different conceptual poinfts) However, the idea is the same: you start with image a, which belongs to set A, and you want to generate a mapping of that image into set B, where the only salient change is that it’s now in set B. As a clarifying example: if you started out with a horse, and your goal was to translate it into a zebra, you would hope that the animal is in the same size, relative position, and pose, and that the only element that changed was changing the quality of “horseness” for the quality of “zebraness”. https://i.imgur.com/NCExS7A.png The real trick of CycleGAN is the fact that, unlike prior attempts to solve this problem, they didn’t use paired data. This is understandable, given the prior example: while it’s possible to take a picture of a scene in both summer and winter, you obviously can’t convert a horse into a zebra so that you can take a “paired” picture of it in both forms. When you have paired data, this is a reasonably welldefined problem: you want to learn some mathematical transformation to turn a specific summer image into a specific winter one, and you can use the ground truth winter image as explicit supervision. Since they lack this perimage crossdomain ground truth, the authors of this paper take what would be one question (“is the winter version of this image that the network generated close to the actual known winter version of this image”) and decompose it into two: Does the winter version of this original summer image looks like it belongs to the set of winter images? This is enforced by a GANstyle discriminator, which takes in outputs of the summer>winter generator, and true images of winter, and tries to tell them apart. This loss component pushes generated winter images to have the quality of “winterness”. This is the “Adversarial Loss” Does the winter version of this image contain enough information about this specific original summer image to accurately reconstruct it with an inverted (winter > summer) generator? This constraint pushes the generator to actually translate aspects of this specific image between summer and winter. Without it, as the authors of the paper showed, the model has no incentive to actually do translation, and instead just generates winter images that have nothing to do with the summer image (and, frequently experience mode collapse: only generating a single winter image over and over again). This is termed the “Cycle Consistency Loss” It’s actually the case that there are two versions of both of the above networks; that’s what puts the “cycle” in CycleGAN. In addition to a loss ensuring you can map summer > winter > summer, there’s another one ensuring the other direction, winter > summer > winter holds as well. And, for both of those directions, we use the adversarial loss on the middle “translated” image, and a cycle consistency loss on the last “reconstructed” image. A key point here is that, because of the inherent structure of this loss function requires mapping networks going in both directions, training a winter>summer generator gets you a summer> winter one for free. (Note: this is a totally different model architecture than most of the “style transfer” applications you likely previously seen, though when applied to photograph > painting translation, it can have similar results) 
[link]
They created a really nice trick to optimize the $ {L}_{0} $ Pseudo Norm  Regularization on the sorted (By magnitude) values of the optimization variable. Their code is available at  [The Trimmed Lasso: Sparsity and Robustness](https://github.com/copenhaver/trimmedlasso). 
[link]
In object detection the boost in speed and accuracy is mostly gained through network architecture changes.This paper takes a different route towards achieving that goal,They introduce a new loss function called focal loss. The authors identify class imbalance as the main obstacle toward one stage detectors achieving results which are as good as two stage detectors. The loss function they introduce is a dynamically scaled cross entropy loss,Where the scaling factor decays to zero as the confidence in the correct class increases. They add a modulating factor as shown in the image below to the cross entropy loss https://i.imgur.com/N7R3M9J.png Which ends up looking like this https://i.imgur.com/kxC8NCB.png in experiments though they add an additional alpha term to it,because it gives them better results. **Retina Net** The network consists of a single unified network which is composed of a backbone network and two task specific subnetworks.The backbone network computes the feature maps for the input images.The first subnetwork helps in object classification of the backbone networks output and the second subnetwork helps in bounding box regression. The backbone network they use is Feature Pyramid Network,Which they build on top of ResNet. 
[link]
The paper combines reinforcement learning with active learning to learn when to request labels to improve prediction accuracy.  The model can either predict the label at time step $t$ or request it in the next time step, in form of a onehot vector output of an LSTM with the previous label (if requested) and the current image as an input.  A reward is issued based on the outcome of requesting labels (0.05), or correctly (+1) or incorrectly(1) predicting the label.  The optimal strategy involves storing class embeddings and their labels in memory and only requesting labels if a unseen class is encountered. The model is evaluated against the *Omniglot* dataset and learns a nonnaive strategy to request fewer labels the more data of a class was encountered, using a learned uncertainty measure. The magnitude of the reward for incorrect labeling decides on the amount of requested labels and can be used to maximize the accuracy during prediction. ## Active Learning Active Learning is a special case of semisupervised learning, which aims to reduce the amount of supervision needed during training. The model typically selects which datapoints to label by applying different metrics like most information content, highest uncertainty or other heuristics. ## Reinforcement Learning Reinforcement learning agents try to learn an optimal policy $\pi^*(s_t)$ for a state $s_t$ at time $t$ that will maximize future rewards issued by the environment, by choosing an action $a_t$. The policy is represented by a function $Q^*(s_t, a_t)$, which can be approximated and learned in form of a neural network. 
[link]
The paper starts with the BNN with latent variable and proposes an entropybased and a variancebased measure of prediction uncertainty. For each uncertainty measure, the authors propose a decomposition of the aleatoric term and epistemic term. A simple regression toy experiment proves this decomposition and its measure of uncertainty. Then the author tries to improve the regression toy experiment performance by using this uncertainty measure into an active learning scheme. For each batch, they would actively sample which data to label. The result shows that using epistemic uncertainty alone outperforms using total certainty, which both outperforms simple gaussian process. The result is understandable since epistemic is directly related to model weight uncertainty, and sampling from high aleatoric uncertain area does help supervised learning. Then the authors talk about how to extend the model based RL by adding a risk term which consider both aleatoric term and epistemic term, and its related to modelbias and noise aversion. The experiments on Industrial Benchmark shows the method is able prevent overfitting the learned model and better transfer to real world, but the method seems to be pretty sensitive to $\beta$ and $\gamma$. 
[link]
 *issue:* RL on real systems > sparse and slow data sampling;  *solution:* pretrain the agent with the EGAN;  *performance:* ~20% improvement of training time in the beginning of learning compared to no pretraining; ~5% improvement and smaller variations compared to GAN pretraining. ## Introduction 5G telecom systems > fufill ultralow latency, high robustness, quick response to changed capacity needs, and dynamic allocation of functionality. *Problems:* 1. exploration has an impact on the service quality in realtime service systems; 2. sparse and slow data sampling > extended training duration. ## Enhanced GAN **Fomulas** the training data for RL tasks: $$x = [x_1, x_2] = [(s_t,a),(s_{t+1},r)]$$ the generated data: $$G(z) = [G_1(z), G_2(z)] = [(s'_t,a'),(s'_{t+1},r')] $$ the value function for GAN: $$V(D,G) = \mathbb{E}_{z \sim p_z(z)}[\log(1D(G(z)))] + \lambda D_{KL}(PQ)$$ where the regularization term $D_{KL}$ has the following form: $$D_{KL}(PQ) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$ **EGAN structure** https://i.imgur.com/FhPxamJ.png **Algorithm** https://i.imgur.com/RzOGmNy.png The enhancer is fed with training data *D\_r(s\_t, a)* and *D\_r(s\_{t+1}, r)*, and trained by supervised learning. After GAN generates synthetic data *D\_t(s\_t, a, s\_{t+1}, r)*, the enhancer could enhance the dependency between *D\_t(s\_t, a)* and *D\_t(s\_{t+1}, r)* and update the weights of GAN. ## Results two lines of experiments on CartPole environment involved with PG agents: 1. one for comparing the learning curves of agents with no pretraining, GAN pretraining and EGAN pretraining. => Result: EGAN > GAN > no pretraining 2. one for comparing the learning curves of agents with EGAN pretraining for various episodes (500, 2000, 5000). => Result: 5000 > 2000 ~= 500 
[link]
## __Background__ RNN language models are composed of: 1. Embedding layer 2. Recurrent layer(s) (RNN/LSTM/GRU/...) 3. Softmax layer (linear transformation + softmax operation) The embedding matrix and the matrix of the linear transformation just before the softmax operation are of the same size (size_of_vocab * recurrent_state_size) . They both contain one representation for each word in the vocabulary. ## __Weight Tying__ This paper shows, that by using the same matrix as both the input embedding and the presoftmax linear transformation (the output embedding), the performance of a wide variety of language models is improved while the number of parameters is massively reduced. In weight tied models each word has just one representation that is used in both the input and output embedding. ## __Why does weight tying work?__ 1. In the paper we show that in untied language models, the output embedding contains much better word representations that the input embedding. We show that when the embedding matrices are tied, the quality of the shared embeddings is comparable to that of the output embedding in the untied model. So in the tied model the quality of the input and output embeddings is superior to the quality of those embeddings in the untied model. 2. In most language modeling tasks because of the small size of the datasets the models tend to overfit. When the number of parameters is reduced in a way that makes sense there is less overfitting because of the reduction in the capacity of the network. ## __Can I tie the input and output embeddings of the decoder of an translation model?__ Yes, we show that this reduces the model's size while not hurting its performance. In addition, we show that if you preprocess your data using BPE, because of the large overlap between the subword vocabularies of the source and target language, __ThreeWay Weight Tying__ can be used. In ThreeWay Weight Tying, we tie the input embedding in the encoder to the input and output embeddings of the decoder (so each word has one representation which is used across three matrices). [This](http://ofir.io/NeuralLanguageModelingFromScratch/) blog post contains more details about the weight tying method. 