## Structured Segment Networks

**Keywords**: action detection in video; reduced computational complexity; structured proposals

**Abstract**: The method uses a temporal actionness grouping (TAG) scheme to generate accurate proposals, a structured temporal pyramid to model the temporal structure of each action instance (tackling the issue that detected actions are often incomplete), two classifiers to determine category and completeness, and a per-category regressor to further refine the temporal bounds. In this paper, Yue Zhao et al. mainly tackle two problems: the high computational cost of video detection, which they reduce by sampling video frames and removing redundant proposals, and the lack of modeling of an action's stages.

**Model**:

1. Generate proposals: find contiguous temporal regions with mostly high actionness, $P = \{ p_i = [s_i, e_i] \}_{i=1}^{N}$.
2. Split each proposal into three stages (starting, course, and ending): first augment the proposal to twice its length, symmetrically about its center; the course stage is the original proposal, while the starting and ending stages are the left and right parts of the difference between the augmented proposal and the original one.
3. Build a temporal pyramid representation for each stage: L snippets are sampled from the augmented proposal, a two-stream feature extractor is applied to each of them, and the features of each stage are pooled (a short sketch of this appears at the end of this summary).
4. Build a global representation for each proposal by concatenating the stage-level representations.
5. The global representation of each proposal is used as input to the classifiers.

* Input: $\{S_t\}_{t=1}^{T}$, a sequence of T snippets representing the video; each snippet consists of the frames plus an optical-flow stack.
* Network: L two-stream feature extractors and several pooling layers, followed by two linear classifiers.
* Output: category, completeness, and a boundary refinement for each proposal.

![](https://i.imgur.com/thM9oWz.png)

**Training**:

* Joint loss for the classifiers: $L_{cls} = -\log\left( P(c_i \mid p_i) \cdot P(b_i \mid c_i, p_i) \right)$
* Loss for location regression: $\lambda \cdot \mathbb{1}(c_i \geq 1, b_i = 1)\, L(u_i, \varphi_i; p_i)$

**Summary**: This paper has three highlights:

1. Parallelism: the network structure lets proposals be processed in parallel, which shortens processing time on GPUs.
2. Temporal structure modeling and regression: giving each proposal an explicit stage structure makes it possible to assess the completeness of proposals.
3. Reduced computational complexity, via two tricks: removing video redundancy by sampling frames, and removing proposal redundancy.
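As a rough illustration of steps 2-4 above, here is a minimal numpy sketch of the stage splitting and pyramid pooling. The stage boundaries, snippet counts, pyramid levels, and feature dimensions are all illustrative choices of mine, not the paper's exact configuration:

```python
import numpy as np

def split_stages(s, e):
    """Augment a proposal [s, e] to twice its length, symmetrically about
    its center, and split it into starting / course / ending stages."""
    d = e - s
    return (s - d / 2, s), (s, e), (e, e + d / 2)

def pyramid_pool(feats, levels=(1, 2)):
    """Temporal pyramid pooling over per-snippet features of shape (T, D):
    average-pool the sequence at each pyramid level and concatenate."""
    parts = []
    for k in levels:
        for chunk in np.array_split(feats, k, axis=0):
            parts.append(chunk.mean(axis=0))
    return np.concatenate(parts)

# Toy usage: 9 snippets with 4-D two-stream features sampled from the
# augmented proposal; the first/last thirds fall in the start/end stages.
snippet_feats = np.random.randn(9, 4)
stages = split_stages(s=3.0, e=6.0)  # e.g. seconds; shown only for clarity
global_rep = np.concatenate([
    pyramid_pool(snippet_feats[:3], levels=(1,)),     # starting stage
    pyramid_pool(snippet_feats[3:6], levels=(1, 2)),  # course stage: 2-level pyramid
    pyramid_pool(snippet_feats[6:], levels=(1,)),     # ending stage
])
print(global_rep.shape)  # (1 + 3 + 1) pooled chunks x 4 dims = (20,)
```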
## Relational Forward Models for Multi-Agent Learning

One of the dominant narratives of the deep learning renaissance has been the value of well-designed inductive bias - structural choices that shape what a model learns. The biggest example of this can be found in convolutional networks, where models achieve a dramatic parameter reduction by having feature maps learn local patterns, which can then be re-used across the whole image. This is based on the prior belief that patterns in images are generally locally contiguous, and so having feature maps that focus only on small (and gradually larger) local areas is a good fit for that prior.

This paper operates in a similar spirit, except its input data isn't in the form of an image, but a graph: the social graph of multiple agents operating within a multi-agent RL setting. In some sense, a graph is just a more general form of a pixel image: where a pixel within an image has a fixed number of neighbors, which have fixed discrete relationships to it (up, down, left, right), a node within a graph can have an arbitrary number of neighbors, with arbitrary numbers and types of attributes attached to each relationship.

The authors of this paper use graph networks as a sort of auxiliary information processing system alongside a more typical policy learning framework, on tasks that require group coordination and knowledge sharing to complete successfully. For example, each agent might be rewarded based on the aggregate reward of all agents together, and, in the stag hunt, it might require collaborative effort by multiple agents to successfully "capture" a reward. Because of this, you might imagine that it would be valuable to be able to predict what other agents within the game are going to do under certain circumstances, so that you can shape your strategy accordingly.

The graph network used in this model represents both agents and objects in the environment as nodes, which have attributes including their position, whether they're available or not (for capture-able objects), and what their last action was. As best I can tell, all agents start out with directed connections going both ways to all other agents, and to all objects in the environment, with the only edge attribute being whether the players are on the same team, for competitive environments.

Given this setup, the graph network works through a sort of "diffusion" of information, analogous to a message passing algorithm. At each iteration (analogous to a layer), the edge features pull in information from their past value and sender and receiver nodes, as well as from a "global feature". Then, all of the nodes pull in information from their edges, and their own past value. Finally, this "global attribute" gets updated based on summations over the newly-updated node and edge information. (If you were predicting attributes that were graph-level attributes, this global attribute might be where you'd do that prediction. However, in this case, we're just interested in predicting agent-level actions.)

![](https://i.imgur.com/luFlhfJ.png)

All of this has the effect of explicitly modeling agents as entities that both have information, and have connections to other entities.
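To make that update order concrete, here is a schematic numpy sketch of one such block in the general graph-network style. The linear `mlp` stand-ins, the shared feature width, and the toy fully connected graph are my own illustrative choices, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                        # shared feature width for nodes, edges, global

def mlp(x):                  # stand-in for a learned update network
    return np.tanh(x @ rng.standard_normal((x.shape[-1], D)))

# Toy fully connected graph: 3 agents, directed edges both ways.
nodes = rng.standard_normal((3, D))
senders, receivers = zip(*[(i, j) for i in range(3) for j in range(3) if i != j])
edges = rng.standard_normal((len(senders), D))
globals_ = rng.standard_normal(D)

# 1) Edge update: each edge sees its past value, sender, receiver, and global.
edges = mlp(np.concatenate(
    [edges, nodes[list(senders)], nodes[list(receivers)],
     np.tile(globals_, (len(edges), 1))], axis=1))

# 2) Node update: each node aggregates its incoming edges, plus its own
#    past value and the global.
incoming = np.zeros((3, D))
for e, r in enumerate(receivers):
    incoming[r] += edges[e]
nodes = mlp(np.concatenate([nodes, incoming, np.tile(globals_, (3, 1))], axis=1))

# 3) Global update: summations over the newly updated nodes and edges.
globals_ = mlp(np.concatenate([nodes.sum(0), edges.sum(0), globals_])[None])[0]
```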
One benefit the authors claim for this structure is interpretability: when they "play out" the values of their graph network, which they call a Relational Forward Model or RFM, they observe edge values between two agents go up if those agents are about to collaborate on an action, and edge values between an agent and an object go up before that object is captured. Because this information is carefully shaped and structured, it is easier for humans to understand, and, in the tests the authors ran, it also appears to help agents do better in collaborative games.

![](https://i.imgur.com/BCKSmIb.png)

While I find graph networks quite interesting, and multi-agent learning quite interesting, I'm a little more uncertain about the inherent "graphiness" of this problem, since there aren't really meaningful inherent edges between agents. One thing I am curious about here is how methods like these would work on sparser graphs, or in settings where a node's connectivity to its neighbors is more distinct from its connectivity to the average other node in the graph. Here, every node is connected to every other node, so the explicit information localization function of graph networks is less pronounced. I might naively think that - to whatever extent the graph is designed in a way that captures information meaningful to the task - explicit graph methods would have an even greater comparative advantage in that setting.
## One Pixel Attack for Fooling Deep Neural Networks

## **Keywords**

One-pixel attack, adversarial examples, differential evolution, targeted and non-targeted attacks

---

## **Summary**

1. **Introduction**
   1. **Basics**
      1. Deep learning methods outperform traditional image-processing techniques in most computer-vision tasks.
      2. "Adversarial examples" are specially modified images with imperceptible perturbations that are classified wrongly by the network.
   2. **Goals of the paper**
      1. Most older techniques make excessive modifications to the image, which may become perceptible to the human eye. The authors propose a method that creates adversarial examples by changing only one, three, or five pixels of the image.
      2. Generating examples under such constrained conditions can help in _getting insights about the decision boundaries_ in the high-dimensional input space.
   3. **Previous work**
      1. Methods to create adversarial examples:
         1. Gradient-based algorithms that use backpropagation to obtain gradient information
         2. The "fast gradient sign" algorithm
         3. Greedy perturbation search methods
         4. Jacobian matrices used to build "adversarial saliency maps"
      2. Understanding and visualizing the decision boundaries of the DNN input space.
      3. The concept of "universal perturbations": a single perturbation that, when added to any natural image, generates adversarial samples with high effectiveness.
   4. **Advantages of the new type of attack**
      1. _Effectiveness_: one-pixel modifications succeed at rates ranging from 60% to 75%.
      2. _Semi-black-box attack_: requires only black-box feedback (probability labels); no gradients or network architecture needed.
      3. _Flexibility_: generalizes across different types of network architectures.
2. **Methodology**
   1. Finding the adversarial example is framed as an optimization problem with constraints.
   2. _Differential evolution_
      1. "Differential evolution" (DE), a general class of evolutionary algorithm, is used to solve this multimodal optimization problem (a minimal sketch appears at the end of this summary).
      2. It does not make use of gradient information.
      3. Advantages of DE for generating adversarial images:
         1. _Higher probability of finding the global optimum_
         2. _Requires less information from the target system_
         3. _Simplicity_: independent of the classifier
3. **Results**
   1. The CIFAR-10 dataset was used with three network architectures: an all-convolution network, Network in Network, and VGG16. 500 random images were selected to create the perturbations and run both _targeted_ and _non-targeted_ attacks.
   2. Adversarial examples were created with only a one-pixel change in some cases, and with 3- and 5-pixel changes in others.
   3. The attack generalized across the different architectures.
   4. Some specific source-target class pairs are more vulnerable to attack than others.
   5. Some classes are very difficult to perturb into other classes, and some cannot be changed at all.
   6. A class's robustness against the attack can be broken by using higher-dimensional perturbations.
4. **Conclusion**
   1. A few pixels are enough to fool different types of networks.
   2. The properties of the targeted perturbation depend on the network's decision boundary.
   3. The common assumption that small additive perturbations across many dimensions accumulate and cause a large change in the output may not be necessary to explain why natural images are sensitive to small perturbations.

---

## **Notes**

* The location of data points near the decision boundaries might affect their robustness against perturbations.
* If the boundary region is wide enough, natural images can lie far from the boundary, making it hard to craft adversarial images from them.
* If the boundary region is mostly long and thin, with natural images close to the border, it is easy to craft adversarial images from those images but hard to craft adversarial images toward them.
* The data points are moved in small steps and the changes in the class probabilities are observed.

## **Open research questions**

1. What is the effect of a larger set of initial candidate solutions (training images) on finding adversarial images?
2. Can better adversarial examples be generated with more iterations of differential evolution?
3. Why do imbalances occur between classes when creating perturbations?
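As referenced in the methodology above, here is a minimal sketch of the non-targeted one-pixel search, using scipy's off-the-shelf differential evolution rather than the authors' own implementation. The `predict_proba` black box, the image value range, and the DE hyperparameters are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import differential_evolution

def one_pixel_attack(image, true_label, predict_proba, maxiter=30, popsize=80):
    """Non-targeted attack: a candidate is (x, y, r, g, b), i.e. one pixel's
    coordinates and RGB value. predict_proba is an assumed black box that maps
    an (H, W, 3) image in [0, 255] to a vector of class probabilities."""
    h, w, _ = image.shape

    def perturb(z):
        x, y, r, g, b = z
        adv = image.astype(np.float64).copy()
        adv[int(x), int(y)] = (r, g, b)
        return adv

    # DE minimizes the confidence of the true class; no gradients needed.
    def objective(z):
        return predict_proba(perturb(z))[true_label]

    bounds = [(0, h - 1), (0, w - 1), (0, 255), (0, 255), (0, 255)]
    result = differential_evolution(objective, bounds, maxiter=maxiter,
                                    popsize=popsize, recombination=1.0,
                                    seed=0, polish=False)
    return perturb(result.x)
```

A targeted variant would instead minimize the negative probability of the target class, and a 3- or 5-pixel attack simply extends the candidate vector and bounds to three or five (x, y, r, g, b) tuples.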
## FaceNet: A Unified Embedding for Face Recognition and Clustering

## Keywords

Triplet loss, face embedding, harmonic embedding

---

## Summary

### Introduction

**Goal of the paper**

A unified system for face verification, recognition, and clustering. Each face is represented by a 128-dimensional float feature vector (embedding) in Euclidean space that is invariant to pose and illumination.

* Face verification: faces of the same person yield feature vectors with a very small L2 distance between them.
* Face recognition: recognition becomes a clustering task in the embedding space.

**Previous work**

* Previous deep learning approaches used a bottleneck layer to represent a face as an embedding with thousands of dimensions.
* Some other techniques use PCA to reduce the dimensionality of the embedding for comparison.

**Method**

* This method uses an Inception-style CNN to compute an embedding of each face.
* The face thumbnails are tight crops of the face area, with only scaling and translation applied.

**Triplet loss**

The triplet loss uses two matching face thumbnails and one non-matching thumbnail. The loss function tries to reduce the distance between the matching pair while increasing the separation of the non-matching pair (a minimal sketch appears at the end of this summary).

**Triplet selection**

* Triplets are selected so that the samples are hard positives or hard negatives.
* The hardest negatives can lead to local minima early in training and, in a few cases, to a collapsed model.
* Using semi-hard negatives improves convergence speed while still approaching the global minimum.

**Deep convolutional network**

* Training is done using SGD (stochastic gradient descent) with backpropagation and AdaGrad.
* Two networks are trained:
  - A Zeiler & Fergus architecture, 22 layers deep with 140 million parameters
  - A GoogLeNet-style Inception model with 6.6 to 7.5 million parameters

**Experiments**

* The following factors are studied:
  - JPEG image quality: the validation rate of the model improves with JPEG quality up to a certain threshold.
  - Embedding dimensionality: accuracy improves as the embedding dimension increases from 64 to 128 and 256, then starts to decrease at 512 dimensions.
  - Number of images in the training data set

**Results (classification accuracy)**:

- LFW (Labeled Faces in the Wild) dataset: 98.87% ± 0.15
- YouTube Faces DB: 95.12% ± 0.39

On clustering tasks the model worked on a wide variety of face images and was invariant to pose, lighting, and also age.

**Conclusion**

* The model can be extended further to improve overall accuracy.
* Networks could be trained to run on smaller systems such as mobile phones.
* There is a need to improve training efficiency.

---

## Notes

* A harmonic embedding is a set of embeddings obtained from different models that are nevertheless compatible with each other. This eases future upgrades and transitions to newer models.
* To make the embeddings of different models compatible, a harmonic triplet loss is used, and the generated triplets must be compatible with each other.

## Open research questions

* Better understanding of the error cases.
* Making the model more compact for embedded and mobile use cases.
* Methods to reduce training times.
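As referenced above, a minimal numpy sketch of the triplet loss and the semi-hard negative condition; the margin value and embedding shapes here are illustrative (the paper's margin is a tuned hyperparameter):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet loss over (N, 128) embeddings: pull the anchor-positive
    squared distance below the anchor-negative one by a margin alpha."""
    pos = np.sum((anchor - positive) ** 2, axis=-1)
    neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(pos - neg + alpha, 0.0).mean()

def is_semi_hard(anchor, positive, negative, alpha=0.2):
    """Semi-hard negative: farther from the anchor than the positive,
    but still inside the margin (so the loss is non-zero)."""
    pos = np.sum((anchor - positive) ** 2)
    neg = np.sum((anchor - negative) ** 2)
    return pos < neg < pos + alpha

# Toy usage with random L2-normalized 128-D embeddings.
e = np.random.randn(3, 128)
e /= np.linalg.norm(e, axis=1, keepdims=True)
print(triplet_loss(e[0:1], e[1:2], e[2:3]), is_semi_hard(e[0], e[1], e[2]))
```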
## Intriguing Properties of Neural Networks

### Keywords

Adversarial examples, perturbations

------

### Summary

##### Introduction

* The paper examines two properties of neural networks that cause them to misclassify images and make it difficult to build a solid understanding of the network:
  1. The interpretability of individual high-level units of a network, and of combinations of these units or layers.
  2. The continuity of the input-output mapping, and the stability of the output with respect to the input.
* Experiments are performed on different networks and architectures:
  1. MNIST dataset - autoencoder, fully connected net
  2. ImageNet - "AlexNet"
  3. 10M YouTube images - "QuocNet"

##### Understanding individual units of the network

* Previous work used individual images to maximize the activation value of each feature unit. The authors perform a similar experiment on the MNIST data set.
* The results are interpreted as follows:
  1. Random direction vectors $v$ give rise to semantic properties that are just as interpretable as those of the coordinate axes.
  2. Each feature unit can generate invariance on a particular subset of the input distribution.

![](https://i.imgur.com/SeyXJeV.png)

##### Blind spots in the neural network

* Output layers are highly non-linear and give a non-linear generalization over the input space.
* The output layers can assign non-trivial probabilities to regions of the input space that contain no training examples in their vicinity, i.e., the network can assign probabilities to viewpoints of an object it was never trained on.
* Deep networks, unlike kernel methods, cannot be assumed to have smooth decision boundaries.
* Using optimization techniques, small changes to the image can lead to very large deviations in the output (a sketch of this search appears at the end of this summary).
* __"Adversarial examples"__ represent pockets or holes in the input space which are difficult to find by simply moving around the input images.

##### Experimental results

* Adversarial examples that are indistinguishable from the actual image can be created for all networks tested.
  1. Cross-model generalization: adversarial images created for one network also affect other networks.
  2. Cross-training-set generalization

![](https://i.imgur.com/drcGvpz.png)

##### Conclusion

* Neural networks have counter-intuitive properties with respect to the workings of their individual units and to discontinuities in the input-output mapping.
* The paper demonstrates the occurrence of adversarial examples and their properties.

-----

### Notes

* Feeding adversarial examples during model training can improve the generalization of the model.
* Adversarial examples generated against the higher layers are more effective than those generated against the input and lower layers.
* Adversarial examples affect models trained with different hyperparameters.
* According to the tests conducted, autoencoders are more resilient to adversarial examples.
* Networks trained with purely supervised learning are unstable to a few particular types of perturbations: small additive perturbations of the input can lead to large perturbations at the output of the last layers.

### Open research questions

[1] Comparing the effects of adversarial examples on lower layers to those on higher layers.

[2] The dependence of adversarial attacks on the training set of the model.

[3] Why adversarial examples generalize across different hyperparameters and training sets.

[4] How often do adversarial examples occur?
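As referenced above, here is a minimal sketch of the paper's box-constrained optimization for finding an adversarial perturbation, run against a toy linear-softmax "network" rather than a real one. The penalty weight `c`, the squared-L2 penalty (the paper penalizes the norm of $r$), and the toy model are all illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
W, b = rng.standard_normal((10, 784)) * 0.01, np.zeros(10)  # toy "network"

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adversarial(x, target, c=0.1):
    """Minimize c*|r|^2 + cross-entropy(f(x + r), target), keeping the
    perturbed image inside [0, 1] via box bounds on r."""
    def objective(r):
        p = softmax(W @ (x + r) + b)
        return c * np.sum(r ** 2) - np.log(p[target] + 1e-12)

    bounds = [(-xi, 1.0 - xi) for xi in x]  # so that 0 <= x + r <= 1
    res = minimize(objective, np.zeros_like(x), method="L-BFGS-B",
                   bounds=bounds, options={"maxiter": 50})
    return x + res.x

x = rng.random(784)                          # a fake flattened 28x28 image
x_adv = adversarial(x, target=3)
print(np.argmax(softmax(W @ x + b)), np.argmax(softmax(W @ x_adv + b)))
```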