Welcome to ShortScience.org! |
[link]
This paper presents a combination of the inception architecture with residual networks. This is done by adding a shortcut connection to each inception module. This can alternatively be seen as a resnet where the 2 conv layers are replaced by a (slightly modified) inception module. The paper (claims to) provide results against the hypothesis that adding residual connections improves training, rather increasing the model size is what makes the difference. |
[link]
This paper describes how to apply the idea of batch normalization (BN) successfully to recurrent neural networks, specifically to LSTM networks. The technique involves the 3 following ideas: **1) Careful initialization of the BN scaling parameter.** While standard practice is to initialize it to 1 (to have unit variance), they show that this situation creates problems with the gradient flow through time, which vanishes quickly. A value around 0.1 (used in the experiments) preserves gradient flow much better. **2) Separate BN for the "hiddens to hiddens pre-activation and for the "inputs to hiddens" pre-activation.** In other words, 2 separate BN operators are applied on each contributions to the pre-activation, before summing and passing through the tanh and sigmoid non-linearities. **3) Use of largest time-step BN statistics for longer test-time sequences.** Indeed, one issue with applying BN to RNNs is that if the input sequences have varying length, and if one uses per-time-step mean/variance statistics in the BN transformation (which is the natural thing to do), it hasn't been clear how do deal with the last time steps of longer sequences seen at test time, for which BN has no statistics from the training set. The paper shows evidence that the pre-activation statistics tend to gradually converge to stationary values over time steps, which supports the idea of simply using the training set's last time step statistics. Among these ideas, I believe the most impactful idea is 1). The papers mentions towards the end that improper initialization of the BN scaling parameter probably explains previous failed attempts to apply BN to recurrent networks. Experiments on 4 datasets confirms the method's success. **My two cents** This is an excellent development for LSTMs. BN has had an important impact on our success in training deep neural networks, and this approach might very well have a similar impact on the success of LSTMs in practice. |
[link]
This paper presents a feed-forward neural network architecture for processing graphs as inputs, inspired from previous work on Graph Neural Networks. In brief, the architecture of the GG-NN corresponds to $T$ steps of GRU-like (gated recurrent units) updates, where T is a hyper-parameter. At each step, a vector representation is computed for all nodes in the graph, where a node's representation at step t is computed from the representation of nodes at step $t-1$. Specifically, the representation of a node will be updated based on the representation of its neighbors in the graph. Incoming and outgoing edges in the graph are treated differently by the neural network, by using different parameter matrices for each. Moreover, if edges have labels, separate parameters can be learned for the different types of edges (meaning that edge labels determine the configuration of parameter sharing in the model). Finally, GG-NNs can incorporate node-level attributes, by using them in the initialization (time step 0) of the nodes' representations. GG-NNs can be used to perform a variety of tasks on graphs. The per-node representations can be used to make per-node predictions by feeding them to a neural network (shared across nodes). A graph-level predictor can also be obtained using a soft attention architecture, where per-node outputs are used as scores into a softmax in order to pool the representations across the graph, and feed this graph-level representation to a neural network. The attention mechanism can be conditioned on a "question" (e.g. on a task to predict the shortest path in a graph, the question would be the identity of the beginning and end nodes of the path to find), which is fed to the node scorer of the soft attention mechanism. Moreover, the authors describe how to chain GG-NNs to go beyond predicting individual labels and predict sequences. Experiments on several datasets are presented. These include tasks where a single output is required (on a few bAbI tasks) as well as tasks where a sequential output is required, such as outputting the shortest path or the Eulerian circuit of a graph. Moreover, experiments on a much more complex and interesting program verification task are presented. |
[link]
Disclaimer: I am an author # Intro Experience replay (ER) and generative replay (GEN) are two effective continual learning strategies. In the former, samples from a stored memory are replayed to the continual learner to reduce forgetting. In the latter, old data is compressed with a generative model and generated data is replayed to the continual learner. Both of these strategies assume a random sampling of the memories. But learning a new task doesn't cause **equal** interference (forgetting) on the previous tasks! In this work, we propose a controlled sampling of the replays. Specifically, we retrieve the samples which are most interfered, i.e. whose prediction will be most negatively impacted by the foreseen parameters update. The method is called Maximally Interfered Retrieval (MIR). ## Cartoon for explanation https://i.imgur.com/5F3jT36.png Learning about dogs and horses might cause more interference on lions and zebras than on cars and oranges. Thus, replaying lions and zebras would be a more efficient strategy. # Method 1) incoming data: $(X_t,Y_t)$ 2) foreseen parameter update: $\theta^v= \theta-\alpha\nabla\mathcal{L}(f_\theta(X_t),Y_t)$ ### applied to ER (ER-MIR) 3) Search for the top-$k$ values $x$ in the stored memories using the criterion $$s_{MI}(x) = \mathcal{L}(f_{\theta^v}(x),y) -\mathcal{L}(f_{\theta}(x),y)$$ ### or applied to GEN (GEN-MIR) 3) $$ \underset{Z}{\max} \, \mathcal{L}\big(f_{\theta^v}(g_\gamma(Z)),Y^*\big) -\mathcal{L}\big(f_{\theta}(g_\gamma(Z)),Y^*\big) $$ $$ \text{s.t.} \quad ||z_i-z_j||_2^2 > \epsilon \forall z_i,z_j \in Z \,\text{with} \, z_i\neq z_j $$ i.e. search in the latent space of a generative model $g_\gamma$ for samples that are the most forgotten given the foreseen update. 4) Then add theses memories to incoming data $X_t$ and train $f_\theta$ # Results ### qualitative https://i.imgur.com/ZRNTWXe.png Whilst learning 8s and 9s (first row), GEN-MIR mainly retrieves 3s and 4s (bottom two rows) which are similar to 8s and 9s respectively. ### quantitative GEN-MIR was tested on MNIST SPLIT and Permuted MNIST, outperforming the baselines in both cases. ER-MIR was tested on MNIST SPLIT, Permuted MNIST and Split CIFAR-10, outperforming the baselines in all cases. # Other stuff ### (for avid readers) We propose a hybrid method (AE-MIR) in which the generative model is replaced with an autoencoder to facilitate the compression of harder dataset like e.g. CIFAR-10. |
[link]
Keypoint detection is an important step in various tasks such as SLAM, panorama stitching, camera calibration, and more. Efficient keypoint detectors, FAST (Features from Accelerated and Segments Test) for example, would detect keypoints where a relatively high brightness change is observed in relation to surrounding pixels. Most probably, the keypoints would be located on edges, as shown below: https://i.imgur.com/ylC4BM3.jpg Let's consider another image shown below. Here, while the detector is capable of detecting many keypoints, they are mostly located on trees (see subfigure (a) below). This causes redundancy and this paper focuses on solving it by selecting locally strong keypoints that are well distributed all over the image (subfigure (c)). https://i.imgur.com/1MqZhmT.png The algorithm requires input keypoints to be sorted in decreasing order of strength. The keypoints are then processed in that order and points that fall within the suppression range of a current keypoint are removed from the consideration. The process is repeated for the next unsuppressed keypoint in the sorted order. The process continues until no points remain. If the number of filtered points is not close enough to what we require, the suppression range is modified and the process is repeated. The suppression range is modified using binary search. https://i.imgur.com/qryscZP.png Binary search requires lower and upper bounds to operate. Naive initialization would be setting lower bound to 1 pixel, while upper bound - image width or height. On the other hand, this paper proposes better initialization of the suppression range that is dependent on image height $H_I$, width $W_I$, number of input keypoints $n$ as well as the number of output keypoints $m$. * Upper bound: $\frac{H_I + W_I + 2m - \sqrt{\Delta}}{2m - 1}$ (see paper for more details) * Lower bound: $\frac{1}{2}\sqrt{\frac{n}{m}}$ This initializing helps to decrease the number of iterations to convergence by a factor of three: https://i.imgur.com/7NCpgbi.png Homogenous location of keypoints is beneficial for SLAM algorithm and reduce translational and rotational errors compared to naive filtering when evaluated on KITTI: https://i.imgur.com/4wq0kLK.png Overall, this paper proposes a fast, efficient, and effective method to post-process noisy and redundant keypoint detections with a clear benefit to SLAM. The codes are publically available in multiple languages: C++, Python, Java, and Matlab. See https://github.com/BAILOOL/ANMS-Codes |