[link]
## Summary

The broad goal of this paper is to understand how a neural network learns the underlying distribution of the input data and which properties of the network describe its generalization power. Previous literature uses statistical measures such as Rademacher complexity, uniform stability and VC dimension to explain the generalization error of a model. These measures explain generalization in terms of the number of parameters in the model along with the applied regularization.

The experiments in [Section 2] of the paper show that the learning capacity of a CNN cannot be sufficiently explained by traditional statistical learning theory. Even the effect of different regularization strategies in a CNN is shown to be potentially unrelated to the generalization error, which contradicts the theory behind VC dimension.

The experiments show that the model is able to fit random labels and inputs corrupted with different amounts of Gaussian noise. When the authors gradually increase the noise in the inputs, the generalization error gradually increases while the training error is still able to reach zero. The authors conclude that big networks are able to memorize the complete dataset.

## Personal Thoughts

1) Firstly, we need a new theory to explain why and how a CNN memorizes the inputs and still generalizes to new data. Since the paper shows that regularization does not have much effect on generalization for big networks, maybe the network is actually memorizing the whole input space. But the memorization is very strategic, in the sense that only inputs in which no simple underlying features are found (e.g. noise) are completely memorized, unlike inputs with a stronger signal where patterns can be found. This may explain the discrepancy in the number of training steps between ‘true labels’ and noisy inputs in [Figure 1 a.].

My very general understanding of the Information Bottleneck Hypothesis [4] is that networks compress noisy input data as much as possible while preserving important information. A network takes more time to compress noise than strong signals in images. This may give some intuition about the learning process taking place.

2) A CNN is highly non-linear, has millions of parameters and has a very complex loss landscape. There might be multiple minima, and we need a theory to explain which of these minima gives the best generalization. Unfortunately, the working of SGD is still a black box and is very difficult to characterize. There are many interesting phenomena, like adversarial attacks, the effect of the optimizer on the weights found (Im et al., 2016) and the actual understanding of non-linearity in CNNs (Goodfellow and Vinyals, 2015), that all point to gaps in our overall understanding of very high dimensional manifolds. This requires rigorous experimentation to study, independently, the effect of the network architecture, the optimizer and the actual input to the network (Keskar et al., 2017) on generalization.

## References

1. Im, Daniel Jiwoong et al. “An empirical analysis of the optimization of deep network loss surfaces.” arXiv: Learning (2016).
2. Goodfellow, Ian J. and Oriol Vinyals. “Qualitatively characterizing neural network optimization problems.” CoRR abs/1412.6544 (2015).
3. Keskar, Nitish Shirish et al. “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.” ArXiv abs/1609.04836 (2017).
4. https://www.youtube.com/watch?v=XL07WEc2TRI
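
The randomization test described in the summary can be illustrated with a small sketch. It is not the paper's setup: a 1-nearest-neighbour classifier stands in for a large CNN as an extreme memorizer, and the function names (`corrupt_labels`, `training_error`) and the corruption probability `p` are assumptions made for this illustration. The point is only that a model with enough capacity reaches zero training error even when the labels carry no signal.

```python
import random


def corrupt_labels(labels, num_classes, p, seed=0):
    """Replace each label by a uniformly random class with probability p."""
    rng = random.Random(seed)
    return [rng.randrange(num_classes) if rng.random() < p else y for y in labels]


def nearest_neighbour_predict(train_x, train_y, x):
    """Return the label of the training point closest to x (squared L2 distance)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return train_y[best]


def training_error(train_x, train_y):
    """Fraction of training points that the memorizer gets wrong."""
    wrong = sum(
        nearest_neighbour_predict(train_x, train_y, x) != y
        for x, y in zip(train_x, train_y)
    )
    return wrong / len(train_x)


# Hypothetical toy data: 50 random 5-dimensional inputs with fully random labels.
rng = random.Random(0)
inputs = [[rng.random() for _ in range(5)] for _ in range(50)]
random_labels = corrupt_labels([0] * 50, num_classes=10, p=1.0, seed=2)
print(training_error(inputs, random_labels))
```

Because every training point is its own nearest neighbour, the training error stays at zero for any labelling, which mirrors the memorization behaviour the summary describes without saying anything about test error.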
[link]
# **Introduction**

### **Goal of the paper**

* The goal of this paper is to use an RGB-D image to find the best pose for grasping an object with a parallel plate gripper.
* The algorithm also aims to give an open-loop method for manipulating the object using vision data.

### **Previous Research**

* Even state-of-the-art grasp detection algorithms fail under real-world conditions and cannot run in real time.
* Grasping is performed with a 7D grasp representation, but usually a 5D grasp representation is used and projected back into 7D space.
* Previous methods directly found the 7D pose representation using only the vision data.
* Compared to older computer vision techniques such as sliding-window classifiers, deep learning methods are more robust to occlusion, rotation and scaling.
* Grasp point detection gave high accuracy (> 92%) but was useful only for grasping clothes or towels.

### **Method**

* Grasp detection is generally a computer vision problem.
* The algorithm given by the paper uses computer vision to find the grasp as a 5D representation. The 5D representation is less computationally intensive, so it can be used in real time.
* General grasp planning algorithms can be divided into three distinct sequential phases:
    1. Grasp detection
    1. Trajectory planning
    1. Grasp execution
* One of the major tasks in grasping algorithms is to find the best place to grasp and to map the vision data to coordinates that can be used for manipulation.
* The method makes use of three neural networks:
    1. A 50-layer deep residual network (ResNet-50) to find the features in the RGB image. This network is pretrained on the ImageNet dataset.
    1. Another neural network to find the features in the depth image.
    1. A third network that takes the outputs of the first two networks and gives the final grasp configuration as its output.
* The robot grasping configuration is given as a function of x, y, w, h and theta, where (x, y) is the centre of the grasp rectangle, w and h are its width and height, and theta is its angle.
* Since very deep networks are used (number of layers > 20), residual layers are used; they help to improve the loss surface of the network and reduce the vanishing gradient problem.
* This paper gives two types of networks for grasp detection:
    1. Uni-Modal Grasp Predictor
        * Uses only a 2D RGB image to extract features from the input image and then uses the features to give the best pose.
        * A linear SVM is used as the final classifier to classify the best pose for the object.
    1. Multi-Modal Grasp Predictor
        * This model makes use of both the 2D image and the RGB-D image to extract the grasp.
        * The RGB-D image is decomposed into an RGB image and a depth image.
        * Both images are passed through their networks and the outputs are combined and fed to a shallow CNN.
        * The output of the shallow CNN is the best grasp for the object.

### **Experiments and Results**

* The experiments are done on the Cornell Grasp dataset.
* Almost no preprocessing is done on the images apart from resizing.
* The results of the algorithm are compared to unimodal methods that use only RGB images.
* A predicted grasp is counted as correct if its angle is within 30 degrees of the ground-truth grasp and the Jaccard similarity between the predicted and ground-truth rectangles is more than 25%.

### **Conclusion**

* This paper shows that deep convolutional neural networks can be used to predict the grasping pose for an object.
* Another major observation is that the deep residual layers help to extract better features of the grasp object from the image.
* The new model was able to run at real-time speeds.
* The model gave state-of-the-art results on the Cornell Grasp dataset.

----

### **Open research questions**

* Applying transfer learning to try the model on real robots.
* Trying the model in industrial environments on objects of different sizes and shapes.
* Formulating the grasping problem as a regression problem.
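
The validation rule from the experiments section (angle within 30 degrees, Jaccard similarity above 25%) can be sketched as follows. This is a simplified illustration, not the evaluation code of the paper: the overlap is computed on axis-aligned boxes and ignores the rotation of the rectangles, the angle difference is taken modulo 180 degrees on the assumption that a grasp rectangle is symmetric under a half turn, and the `Grasp` tuple layout `(x, y, w, h, theta)` is a naming choice made here.

```python
from typing import NamedTuple


class Grasp(NamedTuple):
    x: float      # centre x of the grasp rectangle
    y: float      # centre y of the grasp rectangle
    w: float      # width
    h: float      # height
    theta: float  # orientation in degrees


def jaccard_axis_aligned(a: Grasp, b: Grasp) -> float:
    """Intersection over union of the two boxes, ignoring rotation (simplification)."""
    overlap_w = min(a.x + a.w / 2, b.x + b.w / 2) - max(a.x - a.w / 2, b.x - b.w / 2)
    overlap_h = min(a.y + a.h / 2, b.y + b.h / 2) - max(a.y - a.h / 2, b.y - b.h / 2)
    intersection = max(overlap_w, 0.0) * max(overlap_h, 0.0)
    union = a.w * a.h + b.w * b.h - intersection
    return intersection / union if union > 0 else 0.0


def angle_difference(a: Grasp, b: Grasp) -> float:
    """Smallest difference between the two orientations, treating theta and theta + 180 as equal."""
    diff = abs(a.theta - b.theta) % 180.0
    return min(diff, 180.0 - diff)


def is_correct_grasp(pred: Grasp, truth: Grasp) -> bool:
    """Rectangle metric: angle within 30 degrees and Jaccard similarity above 25%."""
    return angle_difference(pred, truth) < 30.0 and jaccard_axis_aligned(pred, truth) > 0.25


# Hypothetical prediction and label.
print(is_correct_grasp(Grasp(1, 0, 2, 2, 10), Grasp(0, 0, 2, 2, 20)))
```

A faithful implementation would intersect the rotated rectangles as polygons; the axis-aligned version is kept here only because it fits in a few lines.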
[link]
## **Keywords**

Progressive GAN , High resolution generator

---

## **Summary**

1. **Introduction**
    1. **Goal of the paper**
        1. Generation of very high quality images by progressively increasing the size of the generator and discriminator.
        1. Improved training and stability of GANs.
        1. A new metric for evaluating GAN results.
        1. A high-quality version of the CelebA dataset (CelebA-HQ).
    1. **Previous Research**
        1. Generative methods help to produce new samples from high-dimensional data distributions such as images.
        1. The common approaches for generative methods are:
            1. Autoregressive models: produce sharp images but are slow to evaluate, e.g. PixelCNN.
            1. Variational Autoencoders: easy to train but produce blurry images.
            1. Generative Adversarial Networks: produce sharp images at small resolutions but are highly unstable.
1. **Method**
    1. **Basic GAN architecture**
        1. A GAN consists of two major parts:
            1. _Generator_: creates a sample image from a latent code that looks very close to the training images.
            1. _Discriminator_: trained to assess how close the sample image looks to the training images.
        1. Many measures are used for the overlap between the training and the generated distributions, such as Jensen-Shannon divergence, least-squares divergence and Wasserstein distance.
        1. Generating at larger resolutions is harder because it becomes easier to tell the generated images apart from the training images, which amplifies the gradient problem. Larger resolutions also require more memory, which forces smaller minibatches and can further destabilize training.
        1. A mechanism is also proposed to stop the generator from participating in the escalation of signal magnitudes between the two networks, which leads to the mode collapse problem.
    1. **Progressive growing of GANs**
        1. The primary idea of the training method is to start from low-resolution images and add extra layers to both networks at each step of the training process.
        1. Training on lower-resolution images is more stable because they contain much less class information; as the resolution increases, smaller details and features are added to the image.
        1. This leads to a smooth increase in image quality instead of the network having to learn a lot of details in one single step.
    1. **Mini-batch separation**
        1. GANs tend to capture only a subset of the variation found in the training data.
        1. "Minibatch discrimination" is used to compute a feature vector for each individual image along with one for the whole mini-batch of images.

        ![alt_text](https://i.imgur.com/dHFl5OV.png "image_tooltip")
1. **Conclusion**
    1. Higher-resolution images can be generated robustly and efficiently.
    1. The quality of the generated images is improved.
    1. Training time is reduced for a comparable output quality and resolution.

---

## **Notes**

* Gradient Problem: At higher resolutions it becomes easier to tell the difference between the training and the generated images [1]. This is referred to as the gradient problem.
* Mode Collapse: The generator is incapable of creating a large variety of samples and gets stuck.

## **Open research questions**

1. Improved methods for truly photorealistic generation of images.
1. Improved semantic sensibility and improved understanding of the dataset.

## **References**

1. [https://blog.acolyer.org/2018/05/10/progressive-growing-of-gans-for-improved-quality-stability-and-variation/](https://blog.acolyer.org/2018/05/10/progressive-growing-of-gans-for-improved-quality-stability-and-variation/)
1. [https://medium.com/@jonathan_hui/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b](https://medium.com/@jonathan_hui/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b)
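
Two of the ideas in the method section can be made concrete with a short sketch. The first function lists the resolutions visited when the networks are grown by doubling; the defaults of 4 and 1024 pixels are assumptions about the schedule and are not stated in the summary above. The second computes a single batch-level statistic, the per-feature standard deviation averaged over all features, which is one simple way to give the discriminator a feature for the whole mini-batch; the function names and the plain-list data layout are hypothetical.

```python
import math


def resolution_schedule(start=4, final=1024):
    """Resolutions visited when the image side is doubled at every growth step."""
    resolutions = []
    res = start
    while res <= final:
        resolutions.append(res)
        res *= 2
    return resolutions


def minibatch_stddev(batch):
    """Average, over all features, of the standard deviation across the batch.

    `batch` is a list of equal-length feature lists, one per image.
    A value near zero means the images in the batch are almost identical,
    which is a symptom of mode collapse.
    """
    n = len(batch)
    dims = len(batch[0])
    stds = []
    for d in range(dims):
        column = [row[d] for row in batch]
        mean = sum(column) / n
        stds.append(math.sqrt(sum((v - mean) ** 2 for v in column) / n))
    return sum(stds) / dims


print(resolution_schedule())
print(minibatch_stddev([[0.1, 0.9], [0.4, 0.2], [0.8, 0.5]]))
```

In a real network the batch statistic would be appended to the discriminator's activations as an extra feature map; here it is returned as a number so that the behaviour is easy to inspect.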
[link]
## **Keywords**

One pixel attack , adversarial examples , differential evolution , targeted and non-targeted attack

---

## **Summary**

1. **Introduction**
    1. **Basics**
        1. Deep learning methods outperform traditional image processing techniques in most computer vision tasks.
        1. "Adversarial examples" are images modified with imperceptible perturbations so that the network classifies them wrongly.
    1. **Goals of the paper**
        1. Most older techniques modify the images excessively, and the changes may become perceptible to the human eye. The authors suggest a method to create adversarial examples by changing only one, three or five pixels of the image.
        1. Generating examples under constrained conditions can help in _getting insights about the decision boundaries_ in the higher dimensional space.
    1. **Previous Work**
        1. Methods to create adversarial examples:
            1. Gradient-based algorithms using backpropagation to obtain gradient information
            1. "Fast gradient sign" algorithm
            1. Greedy perturbation searching method
            1. Jacobian matrix to build an "Adversarial Saliency Map"
        1. Understanding and visualizing the decision boundaries of the DNN input space.
        1. The concept of "universal perturbations": a perturbation that, when added to any natural image, generates adversarial samples with high effectiveness.
    1. **Advantages of the new type of attack**
        1. _Effectiveness_: One-pixel modification with a success rate ranging from 60% to 75%.
        1. _Semi-Black-Box attack_: Requires only black-box feedback (probability labels); no gradient or network architecture information is required.
        1. _Flexibility_: Can generalize between different types of network architectures.
1. **Methodology**
    1. Finding the adversarial example is posed as an optimization problem with constraints.
    1. _Differential evolution_
        1. _"Differential evolution"_ (DE), a general kind of evolutionary algorithm, is used to solve multimodal optimization problems.
        1. It does not make use of gradient information.
        1. Advantages of DE for generating adversarial images:
            1. _Higher probability of finding the global optimum_
            1. _Requires less information from the target system_
            1. _Simplicity_: Independent of the classifier
1. **Results**
    1. The CIFAR-10 dataset was selected with three types of network architectures: all convolution network, Network in Network and VGG16. 500 random images were selected to create the perturbations and run both _targeted_ and _non-targeted_ attacks.
    1. Adversarial examples were created with only a one-pixel change in some cases and with three- and five-pixel changes in other cases.
    1. The attack generalized over different architectures.
    1. Some specific target-pair classes are more vulnerable to attack than others.
    1. Some classes are very difficult to perturb to other classes and some cannot be changed at all.
    1. The robustness of a class against attack can be broken by using higher dimensional perturbations.
1. **Conclusion**
    1. A few pixels are enough to fool different types of networks.
    1. The properties of a targeted perturbation depend on its decision boundary.
    1. The assumption that small additive perturbations on the values of many dimensions accumulate and cause a huge change to the output might not be necessary to explain why natural images are sensitive to small perturbations.

---

## **Notes**

* The location of data points near the decision boundaries might affect the robustness against perturbations.
* If the boundary shape is wide enough, it is possible to have natural images far away from the boundary, such that it is hard to craft adversarial images from them.
* If the boundary shape is mostly long and thin with natural images close to the border, it is easy to craft adversarial images from them but hard to craft adversarial images to them.
* The data points are moved in small steps and the changes in the class probabilities are observed.

## **Open research questions**

1. What is the effect of a larger set of initial candidate solutions (training images) on finding the adversarial image?
1. Can better adversarial examples be generated by running more iterations of differential evolution?
1. Why do imbalances occur when creating perturbations?
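
The methodology can be sketched as a tiny differential evolution loop that searches over a single pixel. Everything concrete here is an assumption made for illustration: the image is a small grayscale grid with values in [0, 1], each candidate is a tuple `(x, y, value)`, the population size, number of generations and scale factor are arbitrary, children replace their parents immediately rather than at the end of a generation, and `toy_true_class_prob` is a hypothetical stand-in for the black-box probability a real classifier would return. Only that probability is queried; no gradient is used.

```python
import random


def toy_true_class_prob(image):
    """Hypothetical black box: confidence in the true class drops with the brightest pixel."""
    return 1.0 - max(max(row) for row in image)


def one_pixel_attack(image, true_class_prob, pop_size=20, generations=30, scale=0.5, seed=0):
    """Minimise the true-class probability by changing a single pixel (non-targeted attack)."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])

    def clip(candidate):
        x, y, value = candidate
        return (
            min(max(x, 0.0), width - 1),
            min(max(y, 0.0), height - 1),
            min(max(value, 0.0), 1.0),
        )

    def apply(candidate):
        x, y, value = candidate
        perturbed = [row[:] for row in image]
        perturbed[int(round(y))][int(round(x))] = value
        return perturbed

    population = [
        (rng.uniform(0, width - 1), rng.uniform(0, height - 1), rng.random())
        for _ in range(pop_size)
    ]
    scores = [true_class_prob(apply(c)) for c in population]

    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct random members of the population.
            r1, r2, r3 = rng.sample(range(pop_size), 3)
            a, b, c = population[r1], population[r2], population[r3]
            child = clip(tuple(a[k] + scale * (b[k] - c[k]) for k in range(3)))
            child_score = true_class_prob(apply(child))
            # Selection: the child replaces its parent only if it fools the model more.
            if child_score < scores[i]:
                population[i], scores[i] = child, child_score

    best = min(range(pop_size), key=scores.__getitem__)
    return apply(population[best]), scores[best]


clean = [[0.0] * 4 for _ in range(4)]
adversarial, prob = one_pixel_attack(clean, toy_true_class_prob)
print(prob)
```

A targeted attack would use the same loop but maximise the probability of the chosen target class instead of minimising that of the true class.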
[link]
## Keywords

Triplet-loss , face embedding , harmonic embedding

---

## Summary

### Introduction

**Goal of the paper**

A unified system is given for face verification, recognition and clustering. Each face is represented by a 128-float embedding in Euclidean space that is invariant to pose and illumination.

* Face verification: Feature vectors of the same person's face have a small L2 distance between them.
* Face recognition: Recognition becomes a k-NN classification problem in the embedding space, and clustering can be done with standard techniques on the same embeddings.

**Previous work**

* Previous deep learning approaches used a bottleneck layer to represent a face as an embedding with thousands of dimensions.
* Some other techniques use PCA to reduce the dimensionality of the embedding for comparison.

**Method**

* This method makes use of an Inception-style CNN to get an embedding of each face.
* The face thumbnails are tight crops of the face area with only scaling and translation applied to them.

**Triplet Loss**

Triplet loss makes use of two matching face thumbnails and one non-matching thumbnail. The loss function tries to reduce the distance between the matching pair while increasing the separation between the non-matching pair of images.

**Triplet Selection**

* Triplets are selected such that the samples are hard positives or hard negatives.
* Selecting the hardest negatives can lead to bad local minima early in training and, in a few cases, to a collapsed model.
* Using semi-hard negatives helps to improve the convergence speed while also getting closer to the global minimum.

**Deep Convolutional Network**

* Training is done using SGD (stochastic gradient descent) with backpropagation and AdaGrad.
* The training is done on two networks:
    - Zeiler&Fergus architecture with a model depth of 22 and 140 million parameters
    - GoogLeNet-style Inception model with 6.6 to 7.5 million parameters

**Experiment**

* The following cases are studied:
    - Quality of the JPEG image: The validation rate of the model improves with the JPEG quality up to a certain threshold.
    - Embedding dimensionality: The validation rate improves as the embedding dimension grows from 64 to 128, stays comparable at 256 and drops at 512 dimensions.
    - Number of images in the training data set

**Results classification accuracy**:

- LFW (Labelled Faces in the Wild) dataset: 98.87% ± 0.15
- YouTube Faces DB: 95.12% ± 0.39

On clustering tasks the model was able to work on a wide variety of face images and is invariant to pose, lighting and also age.

**Conclusion**

* The model can be extended further to improve the overall accuracy.
* Networks could be trained to run on smaller systems like mobile phones.
* There is a need to improve the training efficiency.

---

## Notes

* A harmonic embedding is a set of embeddings that come from different models but are compatible with each other. This eases future upgrades and transitions to a newer model.
* To make the embeddings of different models compatible, the harmonic triplet loss and the generated triplets must be compatible with each other.

## Open research questions

* Better understanding of the error cases.
* Making the model more compact for embedded and mobile use cases.
* Methods to reduce the training times.
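
The triplet loss and the semi-hard selection rule described in the summary can be written down in a few lines. This is a minimal sketch for a single triplet of already computed embeddings: the margin default of 0.2 is an assumption, the embeddings are plain Python lists, and the L2 normalisation that a real face-embedding network would apply is not enforced.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is further from the anchor than the positive by at least the margin."""
    return max(0.0, squared_l2(anchor, positive) - squared_l2(anchor, negative) + margin)


def is_semi_hard(anchor, positive, negative, margin=0.2):
    """Semi-hard negative: further away than the positive, but still inside the margin."""
    d_ap = squared_l2(anchor, positive)
    d_an = squared_l2(anchor, negative)
    return d_ap < d_an < d_ap + margin


# Hypothetical 2-D embeddings.
anchor, positive = [0.0, 0.0], [1.0, 0.0]
print(triplet_loss(anchor, positive, [2.0, 0.0]))
print(is_semi_hard(anchor, positive, [0.0, 1.04]))
```

Semi-hard negatives still produce a non-zero loss, so they provide a training signal, but they avoid the hardest negatives (closer to the anchor than the positive) that the summary links to early collapse.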
[link]
### Keywords

Adversarial example , Perturbations

------

### Summary

##### Introduction

* The paper explains two properties of neural networks that cause them to misclassify images and make it difficult to gain a solid understanding of the network:
    1. The theoretical understanding of an individual high-level unit of a network and of a combination of these units or layers.
    2. The continuity of the input-output mapping and the stability of the output w.r.t. the input.
* A few experiments are performed on different networks and architectures:
    1. MNIST dataset - Autoencoder , Fully Connected net
    2. ImageNet - “AlexNet”
    3. 10M YouTube images - “QuocNet”

##### Understanding individual units of the Network

* Previous work used individual images that maximize the activation value of each feature unit. A similar experiment was done by the authors on the MNIST data set.
* The results are interpreted as follows:
    1. A random direction vector (v) gives rise to similarly interpretable semantic properties.
    2. Each feature unit is able to generate invariance on a particular subset of the input distribution.

https://i.imgur.com/SeyXJeV.png

##### Blind spots in the neural network

* Output layers are highly non-linear and are able to give a non-linear generalization over the input space.
* It is possible for the output layers to assign non-negligible probabilities to regions of the input space that contain no training examples in their vicinity, i.e. the network can assign a probability to different viewpoints of an object without having been trained on them.
* Unlike kernel methods, deep networks cannot be assumed to have smooth decision boundaries.
* Using optimization techniques, small changes to the image can lead to very large deviations in the output.
* __“Adversarial examples”__ represent pockets or holes in the input space which are difficult to find by simply moving around the input images.

##### Experimental Results

* Adversarial examples that are indistinguishable from the actual image can be created for all networks.
    1. Cross-model generalization: Adversarial images created for one network can affect other networks as well.
    2. Cross-training-set generalization

https://i.imgur.com/drcGvpz.png

##### Conclusion

* Neural networks have counter-intuitive properties w.r.t. the working of the individual units and discontinuities.
* The occurrence of adversarial examples and their properties.

-----

### Notes

* Feeding adversarial examples during model training can improve the generalization of the model.
* The adversarial examples on the higher layers are more effective than those on the input and lower layers.
* Adversarial examples affect models trained with different hyperparameters.
* According to the tests conducted, autoencoders are more resilient to adversarial examples.
* Deep learning networks trained with purely supervised training are unstable under a few particular types of perturbations. Small perturbations added to the input lead to large perturbations at the output of the last layers.

### Open research questions

[1] Comparing the effects of adversarial examples on lower layers to those on the higher layers.

[2] Dependence of the adversarial attacks on the training data set of the model.

[3] Why do adversarial examples generalize across different hyperparameters or training sets?

[4] How often do adversarial examples occur?
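
The statement that an optimization procedure can find a small input change that flips the prediction can be illustrated on a toy model. The sketch below swaps in a much simpler setting than the networks studied in the paper: the classifier is a hypothetical two-class linear model with hand-picked weights, and plain fixed-size steps along the weight vector stand in for whatever optimizer is used on a real network. The names `predict` and `find_adversarial` are made up for this example.

```python
import math


def predict(x, weights, bias):
    """Toy linear classifier: class 1 if the score is positive, else class 0."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def find_adversarial(x, weights, bias, step=0.01, max_steps=1000):
    """Move x in small steps against its current class until the prediction flips.

    Returns the perturbed input, or None if no flip happens within max_steps.
    """
    original = predict(x, weights, bias)
    direction = -1.0 if original == 1 else 1.0
    norm = math.sqrt(sum(w * w for w in weights))
    adversarial = list(x)
    for _ in range(max_steps):
        if predict(adversarial, weights, bias) != original:
            return adversarial
        adversarial = [
            a + direction * step * w / norm for a, w in zip(adversarial, weights)
        ]
    return None


# Hypothetical weights and input.
weights, bias = [1.0, 1.0], -1.0
x = [0.6, 0.6]
adv = find_adversarial(x, weights, bias)
perturbation = math.sqrt(sum((a - b) ** 2 for a, b in zip(adv, x)))
print(predict(x, weights, bias), predict(adv, weights, bias), perturbation)
```

For a linear model the required perturbation is simply the distance to the decision boundary; the observation reported in the summary is that deep networks, despite generalizing well, also have such nearby boundaries around most inputs.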